Outputting difference in two Pandas dataframes side by side - highlighting the difference

I am trying to highlight exactly what changed between two dataframes.

Suppose I have two Python Pandas dataframes:

    "StudentRoster Jan-1":
    id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 He was late to class
    112  Nick   1.11                     False                Graduated
    113  Zoe    4.12                     True       

    "StudentRoster Jan-2":
    id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 He was late to class
    112  Nick   1.21                     False                Graduated
    113  Zoe    4.12                     False                On vacation

My goal is to output an HTML table that:

  1. Identifies rows that have changed (could be int, float, boolean, string)
  2. Outputs rows with same, OLD and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between two dataframes:
     "StudentRoster Difference Jan-1 - Jan-2":  
    id   Name   score                    isEnrolled           Comment
    112  Nick   was 1.11| now 1.21       False                Graduated
    113  Zoe    4.12                     was True | now False was "" | now   "On   vacation"

I suppose I could do a row by row and column by column comparison, but is there an easier way?

The first part is similar to Constantine, you can get the boolean of which rows are empty*:

    In [21]: ne = (df1 != df2).any(1)

    In [22]: ne
    Out[22]:
    0    False
    1     True
    2     True
    dtype: bool

Then we can see which entries have changed:

    In [23]: ne_stacked = (df1 != df2).stack()

    In [24]: changed = ne_stacked[ne_stacked]

    In [25]: changed.index.names = ['id', 'col']

    In [26]: changed
    Out[26]:
    id  col
    1   score         True
    2   isEnrolled    True
        Comment       True
    dtype: bool

Here the first entry is the index and the second the columns which has been changed.

    In [27]: difference_locations = np.where(df1 != df2)

    In [28]: changed_from = df1.values[difference_locations]

    In [29]: changed_to = df2.values[difference_locations]

    In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
    Out[30]:
                   from           to
    id col
    1  score       1.11         1.21
    2  isEnrolled  True        False
       Comment     None  On vacation
  • Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index & df2.index, but I think I'll leave that as an exercise.

From: stackoverflow.com/q/17095101

Back to homepage or read more recommendations: