Apply vs transform on a group object

Consider the following dataframe:

         A      B         C         D
    0  foo    one  0.162003  0.087469
    1  bar    one -1.156319 -1.526272
    2  foo    two  0.833892 -1.666304
    3  bar  three -2.026673 -0.322057
    4  foo    two  0.411452 -0.954371
    5  bar    two  0.765878 -0.095968
    6  foo    one -0.654890  0.678091
    7  foo  three -1.789842 -1.130922

The following commands work:

    > df.groupby('A').apply(lambda x: (x['C'] - x['D']))
    > df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

but none of the following work:

    > df.groupby('A').transform(lambda x: (x['C'] - x['D']))
    ValueError: could not broadcast input array from shape (5) into shape (5,3)

    > df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
    TypeError: cannot concatenate a non-NDFrame object

Why? The example in the documentation seems to suggest that calling transform on a group allows row-wise processing:

    # Note that the following suggests row-wise operation (x.mean is the column mean)
    zscore = lambda x: (x - x.mean()) / x.std()
    transformed = ts.groupby(key).transform(zscore)

In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

    import pandas as pd
    from numpy.random import randn

    df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                              'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three',
                              'two', 'two', 'one', 'three'],
                       'C' : randn(8), 'D' : randn(8)})

I was similarly confused by .transform versus .apply, and I found a few answers that shed some light on the issue. This answer, for example, was very helpful.

My takeaway so far is that .transform works on (or deals with) Series (columns) in isolation from each other. This means that in your last two calls:

    df.groupby('A').transform(lambda x: (x['C'] - x['D']))
    df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

You asked .transform to take values from two columns at once, but it does not 'see' both of them at the same time (so to speak). transform looks at the dataframe columns one by one and returns a series (or group of series) with the same length as the input column. When the function returns a scalar, that scalar is repeated len(input_column) times.

In that case, the scalar that .transform broadcasts is the result of some reduction function applied to an input Series (and to only ONE series/column at a time).
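If the goal is the per-group mean of C - D on every row, one way around the one-column-at-a-time limitation is to build the combined column first and only then group it. This is just a sketch on a small made-up frame (the values and the mean_diff column name are mine, not from the question):

```python
import pandas as pd

# Illustrative data, not the random values from the question.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [0.5, 0.5, 1.0, 2.0],
})

# Combine the two columns first, so transform only has to handle one Series.
diff = df['C'] - df['D']

# Group that single Series by column A and broadcast each group's mean
# back onto every row of the group.
df['mean_diff'] = diff.groupby(df['A']).transform('mean')
print(df)
```

Grouping the helper Series by df['A'] keeps the original row index, so the result can be assigned straight back to the frame.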

Consider this example (on your dataframe):

    zscore = lambda x: (x - x.mean()) / x.std()  # references nothing outside 'x'; for transform, 'x' is one column
    df.groupby('A')[['C', 'D']].transform(zscore)

will yield:

           C      D
    0  0.989  0.128
    1 -0.478  0.489
    2  0.889 -0.589
    3 -0.671 -1.150
    4  0.034 -0.285
    5  1.149  0.662
    6 -1.404 -0.907
    7 -0.509  1.653

This is exactly the same as using it on only one column at a time:

    df.groupby('A')['C'].transform(zscore)

    0    0.989
    1   -0.478
    2    0.889
    3   -0.671
    4    0.034
    5    1.149
    6   -1.404
    7   -0.509
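To check this claim yourself, here is a small self-contained sketch. It assumes a seeded random frame (the seed and the variable names both and only_c are arbitrary choices of mine), so the numbers will not match the output above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': rng.standard_normal(8),
    'D': rng.standard_normal(8),
})

def zscore(x):
    # Only refers to x itself, so it works one column at a time.
    return (x - x.mean()) / x.std()

# Transform both numeric columns, then column C on its own.
both = df.groupby('A')[['C', 'D']].transform(zscore)
only_c = df.groupby('A')['C'].transform(zscore)

# The C column is the same either way.
print(np.allclose(both['C'], only_c))

# Within each group, the transformed values have mean ~0 and std ~1.
print(both.groupby(df['A']).agg(['mean', 'std']))
```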

Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:

    df.groupby('A').apply(zscore)

gives the error:

    ValueError: operands could not be broadcast together with shapes (6,) (2,)

So where else is .transform useful? The simplest case is assigning the results of a reduction function back to the original dataframe.

    df['sum_C'] = df.groupby('A')['C'].transform('sum')
    df.sort_values('A') # to clearly see that the scalar ('sum') applies to the whole column of the group


         A      B      C      D  sum_C
    1  bar    one  1.998  0.593  3.973
    3  bar  three  1.287 -0.639  3.973
    5  bar    two  0.687 -1.027  3.973
    4  foo    two  0.205  1.274  4.373
    2  foo    two  0.128  0.924  4.373
    6  foo    one  2.113 -0.516  4.373
    7  foo  three  0.657 -1.179  4.373
    0  foo    one  1.270  0.201  4.373

Trying the same with .apply would give NaNs in sum_C, because .apply would return a reduced Series, which it does not know how to broadcast back:

    df.groupby('A')['C'].apply(sum)

    bar    3.973
    foo    4.373
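The difference is easiest to see by comparing the index of each result. Below is a rough sketch using the rounded C values from the table above. The column names via_reduced, via_transform and via_map are made up for illustration:

```python
import pandas as pd

# C values rounded as in the table above; the other columns are not needed here.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': [1.270, 1.998, 0.128, 1.287, 0.205, 0.687, 2.113, 0.657],
})

# Reduction: one value per group, indexed by the group labels 'bar' and 'foo'.
reduced = df.groupby('A')['C'].sum()

# Transform: one value per original row, indexed like df.
broadcast = df.groupby('A')['C'].transform('sum')

# Assignment aligns on index labels. 'bar'/'foo' never match 0..7,
# so this column ends up all NaN.
df['via_reduced'] = reduced

# The transform result lines up with the rows directly.
df['via_transform'] = broadcast

# Manual broadcast of the reduced Series through the group key.
df['via_map'] = df['A'].map(reduced)
print(df)
```

The via_map column shows what transform is doing for you here: looking up each row's group total by its key.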

There are also cases when .transform is used to filter the data:

    df[df.groupby(['B'])['D'].transform(sum) < -1]

         A      B      C      D
    3  bar  three  1.287 -0.639
    7  foo  three  0.657 -1.179
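As far as I can tell, the same selection can also be written with groupby(...).filter, which keeps or drops whole groups based on a condition. A minimal sketch with the rounded D values from the table above (by_mask and by_filter are just illustrative names):

```python
import pandas as pd

# D values rounded as in the table above; C is left out for brevity.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'D': [0.201, 0.593, 0.924, -0.639, 1.274, -1.027, -0.516, -1.179],
})

# transform broadcasts each group's sum to its rows, giving a row-level boolean mask.
mask = df.groupby('B')['D'].transform('sum') < -1
by_mask = df[mask]

# filter evaluates the condition once per group and keeps the matching groups.
by_filter = df.groupby('B').filter(lambda g: g['D'].sum() < -1)

print(by_mask)
print(by_mask.equals(by_filter))
```

The transform version is handy when you want to keep the mask around or combine it with other row-level conditions.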

I hope this adds a bit more clarity.

