Evaluating Text Generation in Large Language Models | by Mina Ghashami | Jan, 2024

January 20, 2024
by Mina Ghashami

Metrics to measure the gap between neural text and human text

Large language models have recently shown a remarkable ability to generate human-like text. Many metrics measure how similar a text generated by a large language model is to a human-written reference text, and closing the gap between the two is an active area of research.

In this post, we look at two well-known metrics for automatically evaluating machine-generated text.

Suppose you are given a human-written reference text and a machine-generated text produced by an LLM. To measure the semantic similarity between the two, BERTScore computes the pairwise cosine similarity of their token embeddings. See the image below:

Here the reference text is “the weather is cold today” and the machine-generated candidate text is “it is freezing today”. An n-gram similarity measure gives these two texts a low score, even though we know they are semantically very similar. BERTScore instead computes a contextual embedding for each token in both the reference text and the candidate text, and then computes the pairwise cosine similarities between these embedding vectors.
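To see why n-gram overlap undersells this pair, here is a minimal sketch of a clipped n-gram precision. The function name `ngram_overlap_precision` and the whitespace tokenization are assumptions made for illustration; this is not the exact formula of any particular published metric.

```python
from collections import Counter


def ngram_overlap_precision(reference: str, candidate: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that also occur in the reference (clipped counts)."""

    def ngrams(text: str) -> Counter:
        # Naive lowercase whitespace tokenization (an assumption for this sketch).
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it appears in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "the weather is cold today"
candidate = "it is freezing today"
unigram = ngram_overlap_precision(reference, candidate, n=1)
bigram = ngram_overlap_precision(reference, candidate, n=2)
print(unigram, bigram)
```

Only “is” and “today” match at the unigram level, and no bigram matches at all, although the two sentences mean nearly the same thing.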

From the pairwise cosine similarities, we can compute precision, recall and F1 score as follows:

Recall: for every token in the reference text, take its maximum cosine similarity with any token in the candidate text, then average these maxima
Precision: for every token in the candidate text, take its maximum cosine similarity with any token in the reference text, then average these maxima
F1 score: the harmonic mean of precision and recall
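The three definitions above can be sketched in a few lines of plain Python. The toy two-dimensional vectors below are hypothetical stand-ins for contextual token embeddings, which in practice would come from a BERT-like model, and `greedy_match_scores` is a name chosen for this illustration rather than part of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(ref_emb, cand_emb):
    """Precision, recall and F1 from pairwise cosine similarities."""
    # sim[i][j] = similarity of reference token i and candidate token j.
    sim = [[cosine(r, c) for c in cand_emb] for r in ref_emb]
    # Recall: best match for each reference token, averaged.
    recall = sum(max(row) for row in sim) / len(ref_emb)
    # Precision: best match for each candidate token, averaged.
    precision = sum(
        max(sim[i][j] for i in range(len(ref_emb))) for j in range(len(cand_emb))
    ) / len(cand_emb)
    denom = precision + recall
    # Harmonic mean, guarded against a zero denominator.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy embeddings (assumed values): 3 reference tokens, 2 candidate tokens.
ref_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cand_emb = [[1.0, 0.0], [0.0, 1.0]]
precision, recall, f1 = greedy_match_scores(ref_emb, cand_emb)
print(precision, recall, f1)
```

In this toy case every candidate token has an exact match in the reference, so precision is 1, while the third reference token is only partially matched, which pulls recall below 1.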

BERTScore [1] also proposes a modification to the above score called “importance weighting”. Importance weighting takes into account the fact that rare words shared between the two sentences are more…

This post originally appeared on TechToday.
