What is the difference between NaN and None?
I am reading two columns of a csv file using pandas read_csv() and then assigning the values to a dictionary. The columns contain strings of numbers and letters. Occasionally there are cases where a cell is empty. In my opinion, the value read into that dictionary entry should be None, but instead nan is assigned. Surely None is more descriptive of an empty cell, as it has a null value, whereas nan just says that the value read is not a number.

Is my understanding correct? What IS the difference between None and nan? Why is nan assigned instead of None?
Also, my check for any empty cells in the dictionary has been using numpy.isnan():

    for k, v in my_dict.iteritems():
        if np.isnan(v):

But this gives me an error saying that I cannot use this check for v. I guess that is because an integer or float variable, not a string, is meant to be used. If this is true, how can I check v for an "empty cell"/nan case?
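For reference, a minimal sketch reproducing that error (the dictionary contents here are hypothetical, and items() is used in place of the Python 2 iteritems()):

    import numpy as np

    # hypothetical values, standing in for what read_csv() put into the dictionary
    my_dict = {"a": "1x2", "b": float("nan")}

    for k, v in my_dict.items():
        # np.isnan() only accepts numeric input, so a string value
        # such as "1x2" makes this line raise a TypeError
        if np.isnan(v):
            print(k, "is empty")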
Wes writes in the docs 'choice of NA-representation':
After years of production use [NaN] has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions
isnull and notnull which can be used across the dtypes to detect NA values.
Thus, I have chosen the Pythonic “practicality beats purity” approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.
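As a rough sketch of those functions in practice (assuming the usual pd/np imports), pd.isnull and pd.notnull accept floats, strings and None alike, which also gives a dtype-agnostic way to do the dictionary check from the question:

    import numpy as np
    import pandas as pd

    # pd.isnull / pd.notnull work across dtypes
    print(pd.isnull(np.nan))    # True
    print(pd.isnull(None))      # True
    print(pd.isnull("1x2"))     # False

    # dictionary values are hypothetical; pd.isnull replaces np.isnan here
    my_dict = {"a": "1x2", "b": np.nan}
    empty_keys = [k for k, v in my_dict.items() if pd.isnull(v)]
    print(empty_keys)           # ['b']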
In my opinion, the main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype, rather than the less efficient object dtype; see NA type promotions.
    # without forcing dtype it changes None to NaN!
    s_bad = pd.Series([1, None], dtype=object)
    s_good = pd.Series([1, np.nan])

    In : s_bad.dtype
    Out: dtype('O')

    In : s_good.dtype
    Out: dtype('float64')
Jeff comments (below) on this:
np.nan allows for vectorized operations; it's a float value, while
None, by definition, forces object type, which basically disables all efficiency in numpy.
So repeat 3 times fast: object==bad, float==good
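A small sketch of that difference in practice (reusing the series defined above; the exact error message will depend on your pandas version):

    import numpy as np
    import pandas as pd

    s_good = pd.Series([1, np.nan])              # float64, as above
    s_bad = pd.Series([1, None], dtype=object)   # object, as above

    # float64: vectorized arithmetic, NaN simply propagates as "missing"
    print(s_good * 2)

    # object: element-wise Python operations; None has no arithmetic defined,
    # so this is expected to raise a TypeError
    try:
        s_bad * 2
    except TypeError as exc:
        print("object dtype failed:", exc)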
That said, many operations may still work just as well with None as with NaN (though they are perhaps not officially supported, i.e. they may sometimes give surprising results):
    In : s_bad.sum()
    Out: 1

    In : s_good.sum()
    Out: 1.0