Optimizing Pandas Code: The Impact of Operation Sequence

PYTHON PROGRAMMING

Learn how to rearrange your code to achieve significant speed improvements.

Marcin Kozak
Towards Data Science

Pandas offers a fantastic framework to operate on dataframes. In data science, we work with small, big, and sometimes very big dataframes. While analyzing small dataframes can be blazingly fast, even a single operation on a big dataframe can take noticeable time.

In this article, I will show that you can often make this time shorter by something that costs practically nothing: the order of operations on a dataframe.

Imagine the following dataframe:

import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    letter: list(range(n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})

With a million rows and 25 columns, this is a big dataframe. Many operations on it will take noticeable time on current personal computers.
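As a quick sanity check, we can inspect the dataframe's shape and memory footprint; the numbers in the comments below assume the int64 columns created above:

# df as created above: 25 int64 columns, a million rows each
print(df.shape)  # (1000000, 25)

# 25 columns * 1_000_000 rows * 8 bytes of data, roughly 190 MiB
print(df.memory_usage(deep=True).sum() / 1024**2)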

Imagine we want to filter the rows, keeping those that satisfy the condition a < 50_000 and b > 3000, and to select five columns: take_cols = ['a', 'b', 'g', 'n', 'x']. We can do this in the following way:

subdf = df[take_cols]
subdf = subdf[subdf['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]

In this code, we take the required columns first and then filter the rows. We can achieve the same with the operations in the reverse order, first filtering the rows and then selecting the columns:

subdf = df[df['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]
subdf = subdf[take_cols]

We can achieve the very same result by chaining Pandas operations. The corresponding pipes, with the filtering condition written as a query string, are as follows:

query = 'a < 50_000 and b > 3000'

# first take columns, then filter rows
df.filter(take_cols).query(query)

# first filter rows, then take columns
df.query(query).filter(take_cols)
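Before timing anything, it is worth confirming that all four versions return the same dataframe. Here is a minimal sanity check; the variable names are my own, and df, take_cols, and query are assumed to be defined as above:

# square-bracket versions
cols_then_rows = df[take_cols]
cols_then_rows = cols_then_rows[cols_then_rows['a'] < 50_000]
cols_then_rows = cols_then_rows[cols_then_rows['b'] > 3000]

rows_then_cols = df[df['a'] < 50_000]
rows_then_cols = rows_then_cols[rows_then_cols['b'] > 3000]
rows_then_cols = rows_then_cols[take_cols]

# chained versions
chained_cols_first = df.filter(take_cols).query(query)
chained_rows_first = df.query(query).filter(take_cols)

assert cols_then_rows.equals(rows_then_cols)
assert cols_then_rows.equals(chained_cols_first)
assert cols_then_rows.equals(chained_rows_first)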

Since df is big, the four versions will probably differ in performance. Which will be the fastest, and which the slowest?

Let’s benchmark these operations. We will use the timeit module:
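The benchmark can be organized in various ways; below is a minimal sketch using timeit.repeat, in which the setup string, the labels, and the repetition counts are my own choices:

import timeit

setup = """
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    letter: list(range(n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})
take_cols = ['a', 'b', 'g', 'n', 'x']
query = 'a < 50_000 and b > 3000'
"""

statements = {
    "brackets, columns first":
        "subdf = df[take_cols];"
        " subdf = subdf[subdf['a'] < 50_000];"
        " subdf = subdf[subdf['b'] > 3000]",
    "brackets, rows first":
        "subdf = df[df['a'] < 50_000];"
        " subdf = subdf[subdf['b'] > 3000];"
        " subdf = subdf[take_cols]",
    "chained, columns first": "df.filter(take_cols).query(query)",
    "chained, rows first": "df.query(query).filter(take_cols)",
}

for name, stmt in statements.items():
    # best of 5 rounds of 10 runs each, reported per single run
    t = min(timeit.repeat(stmt, setup=setup, repeat=5, number=10))
    print(f"{name}: {t / 10:.4f} s per run")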
