Strings in a DataFrame, but dtype is object

Why does Pandas tell me that I have objects, although every item in the selected column is a string, even after explicit conversion?

This is my DataFrame:

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 56992 entries, 0 to 56991
    Data columns (total 7 columns):
    id            56992  non-null values
    attr1         56992  non-null values
    attr2         56992  non-null values
    attr3         56992  non-null values
    attr4         56992  non-null values
    attr5         56992  non-null values
    attr6         56992  non-null values
    dtypes: int64(2), object(5)

Five of them are dtype object. I explicitly convert those objects to strings:

    for c in df.columns:
        if df[c].dtype == object:
            print("convert", df[c].name, "to string")
            df[c] = df[c].astype(str)

Then, df["attr2"] still has dtype object, although type(df["attr2"].ix[0]) reveals str, which is correct.

Pandas distinguishes between int64, float64 and object. What is the logic behind this when there is no dtype str? Why is a str covered by object?

The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that is 8 bytes. Strings, however, do not have a fixed length. So instead of storing the bytes of the strings in the ndarray directly, Pandas uses an object ndarray, which stores pointers to objects. That is why the dtype of this kind of ndarray is object.
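To make the fixed-size point concrete, here is a minimal sketch using plain NumPy. The array contents and variable names are made up for illustration, and the pointer size assumes an ordinary 32-bit or 64-bit build:

```python
import numpy as np

# Numeric arrays: every element occupies the same number of bytes.
ints = np.array([1, 2, 3, 4], dtype=np.int64)
floats = np.array([1.0, 2.5], dtype=np.float64)

# Strings of very different lengths stored in an object array.
strings = np.array(["a", "pandas", "a much longer string"], dtype=object)

int_itemsize = ints.itemsize          # 8 bytes per int64
float_itemsize = floats.itemsize      # 8 bytes per float64
pointer_itemsize = strings.itemsize   # size of one pointer, not of a string

# The strings differ in length, but each array slot is still one pointer wide.
lengths = [len(s) for s in strings]

print(ints.dtype, int_itemsize)
print(floats.dtype, float_itemsize)
print(strings.dtype, pointer_itemsize, lengths)
```

The itemsize of the object array stays the same no matter how long the strings are, because the array only holds references to them.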

Here is an example:

  • the int64 array contains 4 int64 values.
  • the object array contains 4 pointers to 3 string objects.
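The same situation can be reproduced in code. This is only a sketch: the values below are hypothetical and not necessarily the ones from the original picture. Two slots of the object array deliberately refer to the same string object:

```python
import numpy as np

# Four int64 values stored directly in the array buffer (4 * 8 = 32 bytes).
int_arr = np.array([1, 2, 3, 4], dtype=np.int64)

# Three string objects; the first one is referenced twice.
s1, s2, s3 = "apple", "banana", "cherry"
obj_arr = np.array([s1, s2, s3, s1], dtype=object)

# Count how many distinct Python objects the four slots point to.
distinct_objects = len({id(x) for x in obj_arr})

# Slots 0 and 3 hold the very same object, not two copies.
same_object = obj_arr[0] is obj_arr[3]

print(int_arr.dtype, int_arr.nbytes)
print(obj_arr.dtype, len(obj_arr), distinct_objects, same_object)
```

So the object array has four slots, but only three string objects exist in memory.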


