EDA with Polars: Step-by-Step Guide to Aggregate and Analytic Functions (Part 2) | by Antons Tocilins-Ruberts | Jul, 2023


Interestingly, there’s no overlap between the categories. So even though it might take some time for a music clip to get into Trending, it’s likely to stay there for longer. The same goes for movie trailers and other entertainment content.

So we know that live comedy shows get into Trending the fastest, while music and entertainment videos stay there the longest. But has this always been the case? To answer this question, we need to create some time-window and rolling aggregates. Let’s answer three main questions in this section:

  • What is the total number of trending videos per category per month?
  • What is the number of new videos per category per month?
  • How do the categories compare when it comes to views over time?

Total Number of Monthly Trending Videos per Category

First, let’s look at the total number of videos per category per month. To get this statistic, we use the .groupby_dynamic() method, which allows us to group by a date column (specified as index_column) and any other columns of choice (specified via the by parameter). The grouping frequency is controlled by the every parameter.

# Sort by the date column first - groupby_dynamic expects a sorted index column
trending_monthly_stats = df.sort("trending_date").groupby_dynamic(
    index_column="trending_date",  # date column
    every="1mo",  # can also be 1w, 1d, 1h etc.
    closed="both",  # include both the start and end date of each window
    by="category_id",  # other grouping columns
    include_boundaries=True,  # output the window boundaries
).agg(
    pl.col("video_id").n_unique().alias("videos_number"),
)

print(trending_monthly_stats.sample(3))

Resulting resampled data frame. Screenshot by author.

You can see the resulting DataFrame above. A very nice property of Polars is that we can output the window boundaries to sense-check the results, as shown below; after that, let’s do some plotting to visualise the patterns.
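For example (a small sketch, not from the original post; it assumes the default _lower_boundary and _upper_boundary column names that include_boundaries=True produces):

# Peek at the window boundaries added by include_boundaries=True
# (column names assumed to be the Polars defaults)
print(
    trending_monthly_stats.select(
        ["category_id", "_lower_boundary", "_upper_boundary", "videos_number"]
    ).head(3)
)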

plotting_df = trending_monthly_stats.filter(pl.col("category_id").is_in(top_categories))

sns.lineplot(
    x=plotting_df["trending_date"],
    y=plotting_df["videos_number"],
    hue=plotting_df["category_id"],
    style=plotting_df["category_id"],
    markers=True,
    dashes=False,
    palette="Set2",
)

plt.title("Total Number of Videos in Trending per Category per Month")

Number of videos plot. Generated by author.

From this plot we can see that Music has had the largest share of Trending starting from 2018. This might indicate a strategic shift within YouTube to become the go-to platform for music videos. Entertainment seems to be in gradual decline, together with the People & Blogs and Howto & Style categories.

Number of New Monthly Trending Videos per Category

The query is almost exactly the same, except that now we provide as index_column the first date a video appeared in Trending. It would be nice to wrap this logic into a reusable function; I’ll leave a polished version as an exercise for the curious reader, but a minimal sketch follows the code block below.

trending_monthly_stats_unique = (
    time_to_trending_df.sort("first_day_in_trending")
    .groupby_dynamic(
        index_column="first_day_in_trending",
        every="1mo",
        by="category_id",
        include_boundaries=True,
    )
    .agg(pl.col("video_id").n_unique().alias("videos_number"))
)
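As for that exercise, a minimal sketch of such a helper could look like the following; the function name and parameters are illustrative and not part of the original post:

def monthly_trending_counts(frame, date_col):
    """Count unique trending videos per category per month.

    Assumes `frame` contains `date_col`, "category_id" and "video_id".
    """
    return (
        frame.sort(date_col)  # groupby_dynamic expects a sorted index column
        .groupby_dynamic(
            index_column=date_col,
            every="1mo",
            by="category_id",
            include_boundaries=True,
        )
        .agg(pl.col("video_id").n_unique().alias("videos_number"))
    )

# The two aggregations above would then reduce to:
# trending_monthly_stats = monthly_trending_counts(df, "trending_date")
# trending_monthly_stats_unique = monthly_trending_counts(time_to_trending_df, "first_day_in_trending")

The first query also passed closed="both", which could be exposed as an extra keyword argument.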

plotting_df = trending_monthly_stats_unique.filter(pl.col("category_id").is_in(top_categories))

sns.lineplot(
    x=plotting_df["first_day_in_trending"],
    y=plotting_df["videos_number"],
    hue=plotting_df["category_id"],
    style=plotting_df["category_id"],
    markers=True,
    dashes=False,
    palette="Set2",
)

plt.title(" Number of New Trending Videos per Category per Month")

Number of new videos plot. Generated by author.

Here we get an interesting insight: the number of new videos from Entertainment and Music is roughly equal over time. Since Music videos stay in Trending much longer, they are overrepresented in the raw Trending counts, but once the videos are deduplicated this pattern disappears.

Running Average of Views per Category

As the last step of this analysis, let’s compare the two most popular categories (Music and Entertainment) by their views over time. To perform this analysis, we’re going to use a 7-day running average to visualise the trends. To calculate this rolling statistic, Polars has a handy method called .groupby_rolling(). Before applying it, though, let’s sum up all the views by category_id and trending_date and then sort the DataFrame accordingly. This format is required to calculate the rolling statistics correctly.

views_per_category_date = (
    df.groupby(["category_id", "trending_date"])
    .agg(pl.col("views").sum())
    .sort(["category_id", "trending_date"])
)

Once the DataFrame is ready, we can use the .groupby_rolling() method to create the rolling average by specifying "1w" as the period argument and supplying a mean expression in the .agg() method.

# Calculate rolling average
views_per_category_date_rolling = views_per_category_date.groupby_rolling(
    index_column="trending_date",  # date column
    by="category_id",  # grouping column
    period="1w",  # rolling window length
).agg(
    pl.col("views").mean().alias("rolling_weekly_average")
)
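As a side note, not from the original post: when each category has exactly one row per consecutive date, a roughly equivalent result can be computed at the expression level with rolling_mean and .over(). Keep in mind that window_size=7 counts rows rather than calendar days, so it only matches the time-based "1w" window when dates are contiguous.

# Expression-level alternative (a sketch, not the author's approach):
# a row-based 7-observation mean per category, computed with a window function
views_rolling_alt = views_per_category_date.with_columns(
    pl.col("views")
    .rolling_mean(window_size=7)
    .over("category_id")
    .alias("rolling_weekly_average")
)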

# Plotting
plotting_df = views_per_category_date_rolling.filter(pl.col("category_id").is_in(["Music", "Entertainment"]))
sns.lineplot(
    x=plotting_df["trending_date"],
    y=plotting_df["rolling_weekly_average"],
    hue=plotting_df["category_id"],
    style=plotting_df["category_id"],
    markers=True,
    dashes=False,
    palette="Set2",
)

plt.title("7-day Views Average")

Plot generated by author.

According to the 7-day rolling average of views, Music completely dominates the Trending tab, and starting from February 2018 the gap between the two categories has widened massively.

After finishing this post and following along with the code, you should have a much better understanding of advanced aggregate and analytic functions in Polars. In particular, we’ve covered:

  • Basics of working with pl.datetime
  • .groupby() aggregations with multiple arguments
  • The use of .over() to create aggregates over a specific group
  • The use of .groupby_dynamic() to generate aggregates over time windows
  • The use of .groupby_rolling() to generate rolling aggregates over a period

Armed with this knowledge, you should be able to perform almost any analytical task at lightning speed.

You might have felt that some of this analysis was quite ad hoc, and you would be right. The next part is going to address exactly this topic: how to structure and create data processing pipelines. So stay tuned!

