Pandas groupby: How to get a union of strings

I have a dataframe like this:

       A         B       C
    0  1  0.749065    This
    1  2  0.301084      is
    2  3  0.463468       a
    3  4  0.643961  random
    4  1  0.866521  string
    5  2  0.120737       !

Calling

    In [10]: print(df.groupby("A")["B"].sum())

will return

    A
    1    1.615586
    2    0.421821
    3    0.463468
    4    0.643961

Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.

    A
    1    {This, string}
    2    {is, !}
    3    {a}
    4    {random}

I have been trying to find ways to do this.

Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although

    df.groupby("A")["B"]

is a

    pandas.core.groupby.SeriesGroupBy object

so I was hoping any Series method would work. Any ideas?

    In [4]: df = read_csv(StringIO(data), sep=r'\s+')

    In [5]: df
    Out[5]: 
       A         B       C
    0  1  0.749065    This
    1  2  0.301084      is
    2  3  0.463468       a
    3  4  0.643961  random
    4  1  0.866521  string
    5  2  0.120737       !

    In [6]: df.dtypes
    Out[6]: 
    A      int64
    B    float64
    C     object
    dtype: object

When you apply your own function, non-numeric columns are not automatically excluded. This is slower, though, than applying .sum() directly to the groupby:

    In [8]: df.groupby('A').apply(lambda x: x.sum())
    Out[8]: 
       A         B           C
    A                         
    1  2  1.615586  Thisstring
    2  4  0.421821         is!
    3  3  0.463468           a
    4  4  0.643961      random

On a column of strings, sum concatenates by default:

    In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
    Out[9]: 
    A
    1    Thisstring
    2           is!
    3             a
    4        random
    dtype: object

You can get pretty much the output you want by joining the strings yourself:

    In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
    Out[11]: 
    A
    1    {This, string}
    2           {is, !}
    3               {a}
    4          {random}
    dtype: object
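Note that the join above produces a formatted string per group, not a real collection. If you want actual set or list objects, as the question asks, a minimal sketch along these lines should work. The variable names (`as_sets`, `as_lists`, `as_unique`) are just illustrative, and the `unique()` call assumes a reasonably recent pandas version where `SeriesGroupBy.unique` is available.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# One Python set per group (duplicates dropped, order not guaranteed).
as_sets = df.groupby("A")["C"].apply(set)

# One Python list per group (duplicates kept, original row order).
as_lists = df.groupby("A")["C"].apply(list)

# Assumption: recent pandas exposes unique() on the grouped Series,
# returning an array of distinct values per group.
as_unique = df.groupby("A")["C"].unique()

print(as_sets)
print(as_lists)
print(as_unique)
```

Use the set version when you only care about membership, and the list version when order or duplicates matter.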

To do this for a whole frame, one group at a time, the key is to return a Series:

    from pandas import Series

    def f(x):
        # x is the sub-frame for one group; build one output row from it
        return Series(dict(A=x['A'].sum(),
                           B=x['B'].sum(),
                           C="{%s}" % ', '.join(x['C'])))

    In [14]: df.groupby('A').apply(f)
    Out[14]: 
       A         B               C
    A                             
    1  2  1.615586  {This, string}
    2  4  0.421821         {is, !}
    3  3  0.463468             {a}
    4  4  0.643961        {random}
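As a possible alternative to writing a function that returns a Series, here is a rough sketch using named aggregation. This is not from the original answer: it assumes pandas 0.25 or later, and it only aggregates B and C (the group key A stays as the index rather than being summed).

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# Named aggregation: output_column=(input_column, aggregation).
# B is summed; C is joined into the "{a, b}" string format used above.
result = df.groupby("A").agg(
    B=("B", "sum"),
    C=("C", lambda s: "{%s}" % ", ".join(s)),
)

print(result)
```

This keeps the fast built-in sum for the numeric column and only falls back to a Python function for the string column.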

From: stackoverflow.com/q/17841149
