Topics per Class Using BERTopic


How to understand the differences in texts by categories

Mariya Mansurova
Towards Data Science
Photo by Fas Khan on Unsplash

Nowadays, working in product analytics, we face a lot of free-form texts:

  • Users leave comments in the App Store, Google Play or other services;
  • Clients reach out to our Customer Support and describe their problems using natural language;
  • We launch surveys ourselves to get even more feedback, and in most cases, there are some free-form questions to get a better understanding.

We have hundreds of thousands of texts. It would take years to read them all and get some insights. Luckily, there are a lot of DS tools that could help us automate this process. One such tool is Topic Modelling, which I would like to discuss today.

Basic Topic Modelling can give you an understanding of the main topics in your texts (for example, reviews) and their mixture. But it’s challenging to make decisions based on a single number. For example, suppose 14.2% of reviews mention too many ads in your app. Is that bad or good? Should we look into it? To tell the truth, I have no idea.

But if we try to segment customers, we may learn that this share is 34.8% for Android users and 3.2% for iOS users. Then, it’s apparent that we need to investigate whether we show too many ads on Android or why Android users’ tolerance for ads is lower.

That’s why I would like to share not only how to build a topic model but also how to compare topics across categories. In the end, we will get an insightful graph like this for each topic.

Graph by author

The most common real-life sources of free-form texts are reviews of some kind, so let’s use a dataset of hotel reviews for this example.

I’ve filtered comments related to several hotel chains in London.

Before starting text analysis, it’s worth getting an overview of our data. In total, we have 12 890 reviews on 7 different hotel chains.

Graph by author

Now that we have the data, we can apply our fancy new tool, Topic Modelling, to get insights from it. As I mentioned at the beginning, we will use the powerful and easy-to-use BERTopic package (documentation) for this text analysis.

You might wonder what Topic Modelling is. It is an unsupervised ML technique related to Natural Language Processing. It allows you to find hidden semantic patterns in texts (usually called documents) and assign “topics” to them. You don’t need to have a list of topics beforehand. The algorithm will define them automatically — usually in the form of a bag of the most important words (tokens) or N-grams.

BERTopic is a package for Topic Modelling that uses HuggingFace transformers and class-based TF-IDF. It’s a highly flexible, modular package, so you can tailor it to your needs.

Image from BERTopic docs (source)

If you want to understand how it works better, I advise you to watch this video from the author of the library.

You can find the full code on GitHub.

According to the documentation, we typically don’t need to preprocess data unless there is a lot of noise, for example, HTML tags or other markup that doesn’t add meaning to the documents. It’s a significant advantage of BERTopic because, for many NLP methods, there is a lot of boilerplate needed to preprocess your data. If you are interested in what it could look like, see this guide for Topic Modelling using LDA.

You can use BERTopic with data in multiple languages by specifying BERTopic(language= "multilingual"). However, in my experience, the model works a bit better with texts translated into one language, so I will translate all comments into English.
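If you decide to keep the original languages, the setup could look roughly like this minimal sketch (the multilingual mode uses a multilingual embedding model under the hood).

from bertopic import BERTopic

# a minimal sketch: let BERTopic pick a multilingual embedding model
multilingual_topic_model = BERTopic(language = "multilingual")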

For translation, we will use the deep-translator package (you can install it from PyPI).

Also, it could be interesting to see the distribution of languages; for that, we can use the langdetect package.

import langdetect
from deep_translator import GoogleTranslator

def get_language(text):
    try:
        return langdetect.detect(text)
    except KeyboardInterrupt as e:
        raise(e)
    except:
        return '<-- ERROR -->'

def get_translation(text):
    try:
        return GoogleTranslator(source='auto', target='en')\
            .translate(str(text))
    except KeyboardInterrupt as e:
        raise(e)
    except:
        return '<-- ERROR -->'

df['language'] = df.review.map(get_language)
df['reviews_transl'] = df.review.map(get_translation)

In our case, 95+% of comments are already in English.
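Here is a minimal sketch of how this distribution can be checked, assuming the language column created in the snippet above.

# share of reviews per detected language, %
lang_share = 100. * df.language.value_counts(normalize = True)
print(lang_share.head())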

Graph by author

To understand our data better, let’s look at the distribution of review lengths. It shows that there are a lot of extremely short (and most likely not meaningful) comments: around 5% of reviews are shorter than 20 symbols.
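Here is a quick sketch of how this distribution could be plotted (the review_length column is a helper I introduce here; it is not part of the original dataset).

import plotly.express as px

# length of each translated review in symbols
df['review_length'] = df.reviews_transl.map(lambda x: len(str(x)))

px.histogram(df, x = 'review_length', nbins = 100,
             title = 'Distribution of review lengths',
             labels = {'review_length': 'review length, symbols'})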

Graph by author

We can look at the most common examples to ensure that there’s not much information in such comments.

df.reviews_transl.map(lambda x: x.lower().strip()).value_counts().head(10)

reviews
none 74
<-- error --> 37
great hotel 12
perfect 8
excellent value for money 7
good value for money 7
very good hotel 6
excellent hotel 6
great location 6
very nice hotel 5

So we can filter out all comments shorter than 20 symbols — 556 out of 12 890 reviews (4.3%). Then, we will analyse only longer statements with more context. It’s an arbitrary threshold based on the examples above; you can try a couple of levels and see which texts get filtered out.

It’s worth checking whether this filter disproportionally affects some hotels. The shares of short comments are pretty close across categories, so the data looks OK.
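Here is a sketch of this filtering and the per-hotel check, assuming the dataset has a hotel column with the hotel chain. I keep both names, df and filt_df, because the snippets below refer to both.

# share of short comments per hotel chain, % (to check that no chain is skewed by the filter)
df['is_short'] = df.reviews_transl.map(lambda x: len(str(x)) < 20)
print(100. * df.groupby('hotel').is_short.mean().sort_values(ascending = False))

# keep only long enough reviews and continue working with the filtered frame
filt_df = df[~df.is_short].reset_index(drop = True)
df = filt_df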

Graph by author

Now, it’s time to build our first topic model. Let’s start simple with the most basic one to understand how the library works; then we will improve it.

We can train a topic model in just a few lines of code that could be easily understood by anyone who has used at least one ML package before.

from bertopic import BERTopic
docs = list(df.reviews_transl.values)
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

The default model returned 113 topics. We can look at the top topics.

topic_model.get_topic_info().head(7).set_index('Topic')[
    ['Count', 'Name', 'Representation']]

The biggest group is Topic -1, which corresponds to outliers. By default, BERTopic uses HDBSCAN for clustering, and it doesn’t force all data points to be part of clusters. In our case, 6 356 reviews are outliers (around 49.3% of all reviews). That’s almost half of our data, so we will return to this group later.
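Here is a quick way to double-check this share, assuming the topics list returned by fit_transform above.

# outliers are assigned topic -1 by HDBSCAN
outlier_count = sum(1 for t in topics if t == -1)
print(f'{outlier_count} outlier reviews ({100. * outlier_count / len(topics):.1f}%)')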

A topic representation is usually a set of the most important words specific to this topic and not the others. So, the best way to understand a topic is to look at its main terms (in BERTopic, a class-based TF-IDF score is used to rank the words).

topic_model.visualize_barchart(top_n_topics = 16, n_words = 10)
Graph by author

BERTopic even has a Topics per Class representation that can solve our task of understanding the differences in hotel reviews.

topics_per_class = topic_model.topics_per_class(docs, 
    classes=filt_df.hotel)

topic_model.visualize_topics_per_class(topics_per_class, 
    top_n_topics=10, normalize_frequency = True)

Graph by author

If you are wondering how to interpret this graph, you are not alone — I wasn’t able to figure it out either. However, the author kindly supports this package, and there are a lot of answers on GitHub. From the discussion, I learned that the current normalisation approach doesn’t show the share of different topics per class. So, it hasn’t completely solved our initial task.

However, we did the first iteration in fewer than 10 lines of code. That’s fantastic, but there’s room for improvement.

As we saw earlier, almost 50% of data points are considered outliers. That’s quite a lot, so let’s see what we can do about it.

The documentation provides four different strategies to deal with the outliers:

  • based on topic-document probabilities,
  • based on topic distributions,
  • based on c-TF-IDF representations,
  • based on document and topic embeddings.

You can try different strategies and see which one fits your data best; the sketch below shows the corresponding API.
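For reference, here is roughly what the outlier-reduction API looks like (shown with the c-TF-IDF strategy; we won’t use it in this article because we take the topic-distributions route instead).

# reassign outlier documents to the closest topics using c-TF-IDF similarity
new_topics = topic_model.reduce_outliers(docs, topics, strategy = "c-tf-idf")

# update topic representations after the reassignment
topic_model.update_topics(docs, topics = new_topics)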

Let’s look at examples of outliers. Even though these reviews are relatively short, they have multiple topics.

BERTopic uses clustering to define topics, which means that no more than one topic is assigned to each document. However, in most real-life cases, texts contain a mixture of topics. We may be unable to assign a topic to a document precisely because it covers several of them.

Luckily, there’s a solution for this: Topic Distributions. With this approach, each document is split into tokens. Then, we form subsentences (defined by a sliding window and stride) and assign a topic to each such subsentence.

Let’s try this approach and see whether we can reduce the number of reviews without assigned topics.

However, Topic Distributions are based on the fitted topic model, so let’s enhance it.

First of all, we can use CountVectorizer. It defines how a document will be split into tokens. It can also help us get rid of meaningless words like to, not or the (there were a lot of such words in our first model).

Also, we could improve topics’ representations and even try a couple of different models. I used the KeyBERTInspired model (more details), but you could try other options (for example, LLMs).

from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired, PartOfSpeech, MaximalMarginalRelevance

main_representation_model = KeyBERTInspired()
aspect_representation_model1 = PartOfSpeech("en_core_web_sm")
aspect_representation_model2 = [KeyBERTInspired(top_n_words=30), 
                                MaximalMarginalRelevance(diversity=.5)]

representation_model = {
    "Main": main_representation_model,
    "Aspect1": aspect_representation_model1,
    "Aspect2": aspect_representation_model2
}

vectorizer_model = CountVectorizer(min_df=5, stop_words = 'english')

topic_model = BERTopic(nr_topics = 'auto', 
                       vectorizer_model = vectorizer_model,
                       representation_model = representation_model)

topics, ini_probs = topic_model.fit_transform(docs)

I specified nr_topics = 'auto' to reduce the number of topics. Then, all topics with a similarity over a threshold will be merged automatically. With this feature, we got 99 topics.

I’ve created a function to get the top topics and their shares so that we can analyse them more easily. Let’s look at the new set of topics.

def get_topic_stats(topic_model, extra_cols = []):
    topics_info_df = topic_model.get_topic_info().sort_values('Count', ascending = False)
    topics_info_df['Share'] = 100.*topics_info_df['Count']/topics_info_df['Count'].sum()
    topics_info_df['CumulativeShare'] = 100.*topics_info_df['Count'].cumsum()/topics_info_df['Count'].sum()
    return topics_info_df[['Topic', 'Count', 'Share', 'CumulativeShare', 
                           'Name', 'Representation'] + extra_cols]

get_topic_stats(topic_model, ['Aspect1', 'Aspect2']).head(10)\
    .set_index('Topic')

Graph by author

We can also look at the intertopic distance map to better understand our clusters, for example, which ones are close to each other. You can also use it to define parent topics and subtopics. It’s called Hierarchical Topic Modelling, and you can use other tools for it.

topic_model.visualize_topics()
Graph by author

Another insightful way to better understand your topics is to look at the visualize_documents graph (documentation).

We can see that the number of topics has reduced significantly. Also, there are no meaningless stop words in topics’ representations.

However, we still see similar topics in the results. We can investigate and merge such topics manually.

For this, we can draw a Similarity matrix. I specified n_clusters, and our topics were clustered to visualise them better.

topic_model.visualize_heatmap(n_clusters = 20)
Graph by author

There are some pretty close topics. Let’s calculate the pairwise distances and look at the closest pairs.

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

distance_matrix = cosine_similarity(np.array(topic_model.topic_embeddings_))
dist_df = pd.DataFrame(distance_matrix, columns=topic_model.topic_labels_.values(), 
                       index=topic_model.topic_labels_.values())

tmp = []
for rec in dist_df.reset_index().to_dict('records'):
    t1 = rec['index']
    for t2 in rec:
        if t2 == 'index':
            continue
        tmp.append(
            {
                'topic1': t1, 
                'topic2': t2, 
                'distance': rec[t2]
            }
        )

pair_dist_df = pd.DataFrame(tmp)

pair_dist_df = pair_dist_df[(pair_dist_df.topic1.map(
    lambda x: not x.startswith('-1'))) & 
    (pair_dist_df.topic2.map(lambda x: not x.startswith('-1')))]
pair_dist_df = pair_dist_df[pair_dist_df.topic1 < pair_dist_df.topic2]
pair_dist_df.sort_values('distance', ascending = False).head(20)

I found guidance on how to get the distance matrix from GitHub discussions.

We can now see the top pairs of topics by cosine similarity. There are topics with close meanings that we could merge.

topic_model.merge_topics(docs, [[26, 74], [43, 68, 62], [16, 50, 91]])
df['merged_topic'] = topic_model.topics_

Attention: after merging, all topics’ IDs and representations are recalculated, so it’s worth updating them if you use them downstream.
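For example, it’s worth re-fetching the topic info right after merging (a small sketch using the helper defined above).

# topic IDs and representations have been recalculated after merge_topics
get_topic_stats(topic_model).head(10).set_index('Topic')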

Now, we’ve improved our initial model and are ready to move on.

With real-life tasks, it’s worth spending more time on merging topics and trying different approaches to representation and clustering to get the best results.

The other potential idea is splitting reviews into separate sentences because comments are rather long.
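Here is a hypothetical sketch of such a split using NLTK’s sentence tokenizer (we don’t follow this route in the article).

import nltk

nltk.download('punkt', quiet = True)

# one row per sentence instead of one row per review
sentences_df = (
    df.assign(sentence = df.reviews_transl.map(nltk.sent_tokenize))
      .explode('sentence')
      .reset_index(drop = True)
)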

Let’s calculate the topics’ and tokens’ distributions. I’ve used a window equal to 4 (the author advises using 4–8 tokens) and a stride equal to 1.

topic_distr, topic_token_distr = topic_model.approximate_distribution(
    docs, window = 4, calculate_tokens=True)

For example, this comment will be split into subsentences (sets of four tokens), and the closest existing topic will be assigned to each. Then, these topics will be aggregated to calculate probabilities for the whole review. You can find more details in the documentation.

The example shows how the split works with a basic CountVectorizer, window = 4 and stride = 1

Using this data, we can get the probabilities of different topics for each review.

doc_id = 1  # index of the review to inspect (an arbitrary example)
topic_model.visualize_distribution(topic_distr[doc_id], min_probability=0.05)
Graph by author

We can even see the distribution of terms for each topic and understand why we got this result. For our sentence, best very beautiful was the main term for Topic 74, while location close to defined a bunch of location-related topics.

vis_df = topic_model.visualize_approximate_distribution(docs[doc_id], 
                                                        topic_token_distr[doc_id])
vis_df
Graph by author

This example also shows that we might have spent more time merging topics because there are still pretty similar ones.

Now, we have probabilities for each topic and review. The next task is to select a threshold to filter out topics with too low a probability.

We can do it, as usual, using data. Let’s calculate the distribution of the number of selected topics per review for different threshold levels.

import tqdm
import numpy as np
import pandas as pd
import plotly.express as px

tmp_dfs = []

# iterating through different threshold levels
for thr in tqdm.tqdm(np.arange(0, 0.35, 0.001)):
    # calculating number of topics with probability > threshold for each document
    tmp_df = pd.DataFrame(list(map(lambda x: len(list(filter(lambda y: y >= thr, x))), topic_distr))).rename(
        columns = {0: 'num_topics'}
    )
    tmp_df['num_docs'] = 1

    tmp_df['num_topics_group'] = tmp_df['num_topics']\
        .map(lambda x: str(x) if x < 5 else '5+')

    # aggregating stats
    tmp_df_aggr = tmp_df.groupby('num_topics_group', as_index = False).num_docs.sum()
    tmp_df_aggr['threshold'] = thr

    tmp_dfs.append(tmp_df_aggr)

num_topics_stats_df = pd.concat(tmp_dfs).pivot(index = 'threshold', 
                                               values = 'num_docs',
                                               columns = 'num_topics_group').fillna(0)

num_topics_stats_df = num_topics_stats_df.apply(lambda x: 100.*x/num_topics_stats_df.sum(axis = 1))

# visualisation
colormap = px.colors.sequential.YlGnBu
px.area(num_topics_stats_df, 
        title = 'Distribution of number of topics',
        labels = {'num_topics_group': 'number of topics',
                  'value': 'share of reviews, %'},
        color_discrete_map = {
            '0': colormap[0],
            '1': colormap[3],
            '2': colormap[4],
            '3': colormap[5],
            '4': colormap[6],
            '5+': colormap[7]
        })

Graph by author

threshold = 0.05 looks like a good candidate because, with this level, the share of reviews without any topic is still low enough (less than 6%), while the percentage of comments with 4+ topics is also not so high.

This approach has helped us to reduce the number of outliers from 53.4% to 5.8%. So, assigning multiple topics could be an effective way to handle outliers.
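Here is a quick sanity check of this share, assuming the topic_distr matrix calculated above.

import numpy as np

# share of reviews without any topic above the chosen threshold, %
no_topic_mask = (np.array(topic_distr) >= 0.05).sum(axis = 1) == 0
print(f'{100. * no_topic_mask.mean():.1f}% of reviews have no topic assigned')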

Let’s calculate the topics for each doc with this threshold.

threshold = 0.05

# keep topics with probability >= threshold for each document
df['multiple_topics'] = list(map(
    lambda doc_topic_distr: list(map(
        lambda y: y[0], filter(lambda x: x[1] >= threshold, 
                               (enumerate(doc_topic_distr)))
    )), topic_distr
))

# creating a dataset with one row per (review, topic) pair
tmp_data = []

for rec in df.to_dict('records'):
    if len(rec['multiple_topics']) != 0:
        mult_topics = rec['multiple_topics']
    else:
        mult_topics = [-1]

    for topic in mult_topics:
        tmp_data.append(
            {
                'topic': topic, 
                'id': rec['id'],
                'hotel': rec['hotel'],
                'reviews_transl': rec['reviews_transl']
            }
        )

mult_topics_df = pd.DataFrame(tmp_data)

Now, we have multiple topics mapped to each review, and we can compare the topics’ mixtures for different hotel chains.

Let’s find cases when a topic has a suspiciously high or low share for a particular hotel. For that, for each pair of topic and hotel, we will calculate the share of comments related to the topic for this hotel vs. all the other hotels.

tmp_data = []
for hotel in mult_topics_df.hotel.unique():
    for topic in mult_topics_df.topic.unique():
        tmp_data.append({
            'hotel': hotel,
            'topic_id': topic,
            'total_hotel_reviews': mult_topics_df[mult_topics_df.hotel == hotel].id.nunique(),
            'topic_hotel_reviews': mult_topics_df[(mult_topics_df.hotel == hotel) 
                & (mult_topics_df.topic == topic)].id.nunique(),
            'other_hotels_reviews': mult_topics_df[mult_topics_df.hotel != hotel].id.nunique(),
            'topic_other_hotels_reviews': mult_topics_df[(mult_topics_df.hotel != hotel) 
                & (mult_topics_df.topic == topic)].id.nunique()
        })

mult_topics_stats_df = pd.DataFrame(tmp_data)
mult_topics_stats_df['topic_hotel_share'] = 100*mult_topics_stats_df.topic_hotel_reviews/mult_topics_stats_df.total_hotel_reviews
mult_topics_stats_df['topic_other_hotels_share'] = 100*mult_topics_stats_df.topic_other_hotels_reviews/mult_topics_stats_df.other_hotels_reviews

However, not all differences are significant for us. We can say that a difference in topics’ distribution is worth looking at if it is both:

  • statistically significant — the difference is not just due to chance,
  • practically significant — the difference is bigger than X percentage points (I used 1%).

from statsmodels.stats.proportion import proportions_ztest

mult_topics_stats_df['difference_pval'] = list(map(
    lambda x1, x2, n1, n2: proportions_ztest(
        count = [x1, x2],
        nobs = [n1, n2],
        alternative = 'two-sided'
    )[1],
    mult_topics_stats_df.topic_other_hotels_reviews,
    mult_topics_stats_df.topic_hotel_reviews,
    mult_topics_stats_df.other_hotels_reviews,
    mult_topics_stats_df.total_hotel_reviews
))

mult_topics_stats_df['sign_difference'] = mult_topics_stats_df.difference_pval.map(
    lambda x: 1 if x <= 0.05 else 0
)

def get_significance(d, sign):
    sign_percent = 1
    if sign == 0:
        return 'no diff'
    if (d >= -sign_percent) and (d <= sign_percent):
        return 'no diff'
    if d < -sign_percent:
        return 'lower'
    if d > sign_percent:
        return 'higher'

mult_topics_stats_df['diff_significance_total'] = list(map(
    get_significance,
    mult_topics_stats_df.topic_hotel_share - mult_topics_stats_df.topic_other_hotels_share,
    mult_topics_stats_df.sign_difference
))

We have all the stats for all topics and hotels, and the last step is to create a visualisation comparing topic shares by categories.

import plotly

# define color depending on difference significance
def get_color_sign(rel):
    if rel == 'no diff':
        return plotly.colors.qualitative.Set2[7]
    if rel == 'lower':
        return plotly.colors.qualitative.Set2[1]
    if rel == 'higher':
        return plotly.colors.qualitative.Set2[0]

# return topic representation in a format suitable for graph titles
def get_topic_representation_title(topic_model, topic):
    data = topic_model.get_topic(topic)
    data = list(map(lambda x: x[0], data))

    return ', '.join(data[:5]) + ', <br> ' + ', '.join(data[5:])

def get_graphs_for_topic(t):
    topic_stats_df = mult_topics_stats_df[mult_topics_stats_df.topic_id == t]\
        .sort_values('total_hotel_reviews', ascending = False).set_index('hotel')

    colors = list(map(
        get_color_sign,
        topic_stats_df.diff_significance_total
    ))

    fig = px.bar(topic_stats_df.reset_index(), x = 'hotel', y = 'topic_hotel_share',
                 title = 'Topic: %s' % get_topic_representation_title(topic_model, 
                                                                      topic_stats_df.topic_id.min()),
                 text_auto = '.1f',
                 labels = {'topic_hotel_share': 'share of reviews, %'},
                 hover_data=['topic_id'])
    fig.update_layout(showlegend = False)
    fig.update_traces(marker_color=colors, marker_line_color=colors,
                      marker_line_width=1.5, opacity=0.9)

    # overall share of the topic across all hotels, shown as a dotted benchmark line
    topic_total_share = 100.*((topic_stats_df.topic_hotel_reviews + topic_stats_df.topic_other_hotels_reviews)\
        /(topic_stats_df.total_hotel_reviews + topic_stats_df.other_hotels_reviews)).min()

    fig.add_shape(type="line",
                  xref="paper",
                  x0=0, y0=topic_total_share,
                  x1=1, y1=topic_total_share,
                  line=dict(
                      color=colormap[8],
                      width=3, dash="dot"
                  )
    )

    fig.show()

Then, we can calculate the top topics list and make graphs for them.

top_mult_topics_df = mult_topics_df.groupby('topic', as_index = False).id.nunique()
top_mult_topics_df['share'] = 100.*top_mult_topics_df.id/top_mult_topics_df.id.sum()
top_mult_topics_df['topic_repr'] = top_mult_topics_df.topic.map(
    lambda x: get_topic_representation_title(topic_model, x)
)

for t in top_mult_topics_df.head(32).topic.values:
    get_graphs_for_topic(t)

Here are a couple of examples of resulting charts. Let’s try to make some conclusions based on this data.

We can see that Holiday Inn, Travelodge and Park Inn have better prices and value for money compared to Hilton or Park Plaza.

Graph by author

The other insight is that noise may be a problem at Travelodge.

Graph by author

It’s a bit challenging for me to interpret this result. I’m not sure what this topic is about.

Graph by author

The best practice for such cases is to look at some examples.
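Here is a sketch of how such examples can be pulled from our dataset (the topic ID below is just a placeholder; use the ID of the topic you are investigating).

# sample a few reviews mapped to the topic of interest
sample_topic_id = 16  # hypothetical topic ID
mult_topics_df[mult_topics_df.topic == sample_topic_id]\
    .reviews_transl.sample(5, random_state = 42).values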

  • We stayed in the East tower where the lifts are under renovation, only one works, but there are signs showing the way to service lifts which can be used also.
  • However, the carpet and the furniture could have a refurbishment.
  • It’s built right over Queensway station. Beware that this tube stop will be closed for refurbishing for one year! So you might consider noise levels.

So, this topic covers cases of temporary issues during a hotel stay or furniture that is not in the best condition.

You can find the full code on GitHub.

Today, we’ve done an end-to-end Topic Modelling analysis:

  • Built a basic topic model using the BERTopic library.
  • Handled outliers, so that only 5.8% of our reviews don’t have a topic assigned.
  • Reduced the number of topics both automatically and manually to get a concise list.
  • Learned how to assign multiple topics to each document because, in most cases, your texts will contain a mixture of topics.

Finally, we were able to compare reviews across hotel chains, create insightful graphs and draw some conclusions.

Thank you very much for reading this article. I hope it was insightful. If you have any follow-up questions or comments, please leave them in the comments section.

Ganesan, Kavita and Zhai, ChengXiang. (2011). OpinRank Review Dataset.
UCI Machine Learning Repository.
https://doi.org/10.24432/C5QW4W


