Optimizing Pandas Code: The Impact of Operation Sequence

PYTHON PROGRAMMING

Learn how to rearrange your code to achieve significant speed improvements.

Marcin Kozak
Towards Data Science

Pandas offers a fantastic framework to operate on dataframes. In data science, we work with small, big, and sometimes very big dataframes. While analyzing small dataframes can be blazingly fast, even a single operation on a big dataframe can take noticeable time.

In this article, I will show that you can often make this time shorter by something that costs practically nothing: the order of operations on a dataframe.

Imagine the following dataframe:

import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    letter: list(range(n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})

With a million rows and 25 columns, this is a big dataframe. Many operations on it will take noticeable time on current personal computers.
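As a quick sanity check, we can inspect the dataframe's shape and memory footprint; the numbers in the comments below assume the int64 columns created above:

# df as created above: 25 int64 columns, a million rows each
print(df.shape)  # (1000000, 25)

# 25 columns * 1_000_000 rows * 8 bytes of data, roughly 190 MiB
print(df.memory_usage(deep=True).sum() / 1024**2)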

Imagine we want to filter the rows, keeping those that satisfy the condition a < 50_000 and b > 3000, and to select five columns: take_cols = ['a', 'b', 'g', 'n', 'x']. We can do this in the following way:

subdf = df[take_cols]
subdf = subdf[subdf['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]

In this code, we take the required columns first and then filter the rows. We can achieve the same with the operations in the reverse order, first filtering the rows and then selecting the columns:

subdf = df[df['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]
subdf = subdf[take_cols]

We can achieve the very same result by chaining Pandas operations. The corresponding pipes, with the filtering condition written as a query string, are as follows:

query = 'a < 50_000 and b > 3000'

# first take columns, then filter rows
df.filter(take_cols).query(query)

# first filter rows, then take columns
df.query(query).filter(take_cols)
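Before timing anything, it is worth confirming that all four versions return the same dataframe. Here is a minimal sanity check; the variable names are my own, and df, take_cols, and query are assumed to be defined as above:

# square-bracket versions
cols_then_rows = df[take_cols]
cols_then_rows = cols_then_rows[cols_then_rows['a'] < 50_000]
cols_then_rows = cols_then_rows[cols_then_rows['b'] > 3000]

rows_then_cols = df[df['a'] < 50_000]
rows_then_cols = rows_then_cols[rows_then_cols['b'] > 3000]
rows_then_cols = rows_then_cols[take_cols]

# chained versions
chained_cols_first = df.filter(take_cols).query(query)
chained_rows_first = df.query(query).filter(take_cols)

assert cols_then_rows.equals(rows_then_cols)
assert cols_then_rows.equals(chained_cols_first)
assert cols_then_rows.equals(chained_rows_first)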

Since df is big, the four versions will probably differ in performance. Which will be the fastest, and which the slowest?

Let’s benchmark these operations. We will use the timeit module:
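The benchmark can be organized in various ways; below is a minimal sketch using timeit.repeat, in which the setup string, the labels, and the repetition counts are my own choices:

import timeit

setup = """
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    letter: list(range(n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})
take_cols = ['a', 'b', 'g', 'n', 'x']
query = 'a < 50_000 and b > 3000'
"""

statements = {
    "brackets, columns first":
        "subdf = df[take_cols];"
        " subdf = subdf[subdf['a'] < 50_000];"
        " subdf = subdf[subdf['b'] > 3000]",
    "brackets, rows first":
        "subdf = df[df['a'] < 50_000];"
        " subdf = subdf[subdf['b'] > 3000];"
        " subdf = subdf[take_cols]",
    "chained, columns first": "df.filter(take_cols).query(query)",
    "chained, rows first": "df.query(query).filter(take_cols)",
}

for name, stmt in statements.items():
    # best of 5 rounds of 10 runs each, reported per single run
    t = min(timeit.repeat(stmt, setup=setup, repeat=5, number=10))
    print(f"{name}: {t / 10:.4f} s per run")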
