Strings in a DataFrame, but dtype is object
Why does Pandas tell me that I have objects, although every item in the selected column is a string — even after explicit conversion.
This is my DataFrame:
<class 'pandas.core.frame.DataFrame'> Int64Index: 56992 entries, 0 to 56991 Data columns (total 7 columns): id 56992 non-null values attr1 56992 non-null values attr2 56992 non-null values attr3 56992 non-null values attr4 56992 non-null values attr5 56992 non-null values attr6 56992 non-null values dtypes: int64(2), object(5)
Five of them are
dtype object. I explicitly convert those objects to strings:
for c in df.columns: if df[c].dtype == object: print "convert ", df[c].name, " to string" df[c] = df[c].astype(str)
df["attr2"] still has
dtype object, although
str, which is correct.
Pandas distinguishes between
object. What is the logic behind it when there is no
dtype str? Why is a
str covered by
The dtype object comes from NumPy, it describes the type of element in a ndarray. Every element in a ndarray must has the same size in byte. For int64 and float64, they are 8 bytes. But for strings, the length of the string is not fixed. So instead of save the bytes of strings in the ndarray directly, Pandas use object ndarray, which save pointers to objects, because of this the dtype of this kind ndarray is object.
Here is an example:
- the int64 array contains 4 int64 value.
- the object array contains 4 pointers to 3 string objects.