Progress indicator during pandas operations (python)
I regularly perform pandas operations on data frames of more than 15 million rows or so, and I'd love to have access to a progress indicator for particular operations.
Does a text-based progress indicator for pandas split-apply-combine operations exist?
For example, in something like:

    df_users.groupby(['userID', 'requestDate']).apply(feature_rollup)

feature_rollup is a somewhat involved function that takes many DF columns and creates new user columns through various methods. These operations can take a while for large data frames, so I'd like to know if it is possible to have text-based output in an IPython notebook that updates me on the progress.
So far, I've tried canonical loop progress indicators for Python but they don't interact with pandas in any meaningful way.
I'm hoping there's something I've overlooked in the pandas library/documentation that allows one to know the progress of a split-apply-combine. A simple implementation would maybe look at the total number of data frame subsets upon which the apply function is working and report progress as the completed fraction of those subsets.
Is this perhaps something that needs to be added to the library?
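As a rough sketch of the simple implementation described above (not something pandas provides), one could iterate over the groups by hand and print the completed fraction. The helper name apply_with_progress, its every and out parameters, and the toy data frame are all hypothetical, and the combine step is left to the caller:

```python
import sys

import pandas as pd


def apply_with_progress(grouped, func, every=1, out=sys.stderr):
    """Call func on each group and report the completed fraction of groups.

    Hypothetical helper for illustration: returns a dict mapping each group
    key to func's result, leaving the final "combine" step to the caller.
    """
    total = grouped.ngroups  # number of subsets the function will see
    results = {}
    for done, (key, group) in enumerate(grouped, start=1):
        results[key] = func(group)
        # Rewrite the same line every `every` groups, and once at the end.
        if done % every == 0 or done == total:
            out.write(f"\r{done}/{total} groups ({done / total:.0%})")
            out.flush()
    out.write("\n")
    return results


# Toy data standing in for a real, much larger data frame.
df = pd.DataFrame({"user": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})
totals = pd.Series(
    apply_with_progress(df.groupby("user"), lambda g: g["value"].sum())
)
```

Looping in Python like this skips whatever fast paths apply might take, so it is only meant to show the idea of counting finished groups against the total.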
Due to popular demand, tqdm has added support for pandas. Unlike the other answers, this will not noticeably slow pandas down. Here's an example using progress_apply on a groupby:

    import pandas as pd
    import numpy as np
    from tqdm import tqdm

    df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))

    # Create and register a new `tqdm` instance with `pandas`
    # (can use tqdm_gui, optional kwargs, etc.)
    tqdm.pandas()

    # Now you can use `progress_apply` instead of `apply`
    df.groupby(0).progress_apply(lambda x: x**2)

To directly answer the original question, register tqdm with pandas and replace apply with progress_apply:

    from tqdm import tqdm
    tqdm.pandas()
    df_users.groupby(['userID', 'requestDate']).progress_apply(feature_rollup)

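For anyone who wants to try this end to end, here is a minimal self-contained sketch. The question does not show df_users or feature_rollup, so the data frame contents, the bytes column, and the body of feature_rollup below are made-up placeholders; only tqdm.pandas() and progress_apply come from the answer. The desc keyword is an ordinary tqdm option used to label the bar.

```python
import pandas as pd
from tqdm import tqdm

# Hypothetical stand-in data; the real df_users is not shown in the question.
df_users = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3],
    "requestDate": ["2014-01-01", "2014-01-01", "2014-01-02",
                    "2014-01-03", "2014-01-03"],
    "bytes": [10, 20, 30, 40, 50],
})


def feature_rollup(group):
    # Placeholder rollup (an assumption): one row of summary features per group.
    return pd.Series({
        "n_requests": len(group),
        "total_bytes": group["bytes"].sum(),
    })


# Register progress_apply on pandas objects; kwargs are passed on to tqdm.
tqdm.pandas(desc="feature_rollup")

# Same call shape as in the question, with apply swapped for progress_apply.
features = (
    df_users.groupby(["userID", "requestDate"]).progress_apply(feature_rollup)
)
```

The bar advances once per group, so its total is the number of (userID, requestDate) pairs rather than the number of rows.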
Note for tqdm <= v4.8: in these older versions, instead of tqdm.pandas() you had to do:

    from tqdm import tqdm, tqdm_pandas
    tqdm_pandas(tqdm())