How to store a dataframe using Pandas

Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don't have to spend all that time waiting for the script to run?

The easiest way is to pickle it using to_pickle:

    df.to_pickle(file_name)  # where to save it, usually as a .pkl

Then you can load it back using:

    df = pd.read_pickle(file_name)
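To avoid re-parsing the CSV on every run, the two calls above can be wrapped in a small caching helper. This is only a minimal sketch: `load_cached`, `csv_path` and `cache_path` are hypothetical names chosen for illustration, and it assumes the CSV does not change between runs (delete the pickle to force a refresh).

```python
import os
import tempfile

import pandas as pd


def load_cached(csv_path, cache_path):
    """Return the cached DataFrame if the pickle exists, else parse the CSV and cache it."""
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    df = pd.read_csv(csv_path)
    df.to_pickle(cache_path)
    return df


# Small demo in a temporary directory (stand-in for the real, large CSV).
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "data.csv")
cache_path = os.path.join(workdir, "data.pkl")
pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]}).to_csv(csv_path, index=False)

first = load_cached(csv_path, cache_path)   # parses the CSV and writes the pickle
second = load_cached(csv_path, cache_path)  # reads the pickle, skipping the CSV
```

Only unpickle files you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.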

_Note: before 0.11.1, save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively)._

Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:

    store = pd.HDFStore('store.h5')  # requires the PyTables package

    store['df'] = df  # save it
    df = store['df']  # load it
    store.close()  # release the file handle when done
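A slightly fuller sketch of the HDF5 route, assuming the optional PyTables package (`tables`) may or may not be installed. The file path and the key `"df"` are arbitrary, and the `find_spec` guard is only there so the snippet is safe to run without PyTables.

```python
import importlib.util
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
path = os.path.join(tempfile.mkdtemp(), "store.h5")

loaded = None
if importlib.util.find_spec("tables") is not None:  # PyTables is optional
    # The context manager closes the file even if an error occurs.
    with pd.HDFStore(path) as store:
        store.put("df", df, format="table")  # "table" format supports queries
        loaded = store["df"]                 # read the whole frame back
        subset = store.select("df", where="index >= 1")  # read only some rows
```

The `format="table"` option is slower to write than the default fixed format, but it lets you pull back part of a large frame without loading all of it.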

More advanced strategies are discussed in the cookbook.

Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have Python object/text-heavy data (see this question).

_Note: to_msgpack and read_msgpack were deprecated in 0.25 and removed in 1.0._

