ROUGE and BLEU
In natural language processing (NLP), evaluation metrics are often less straightforward than in computer vision (CV), because the same meaning can be expressed in many different ways in a language. Below are 2 popular metrics in NLP.
ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is mainly an algorithm for evaluating the quality of automatic summarization. Its output is between 0 and 1, with values closer to 1 representing better summaries. In the original paper, 4 types of ROUGE scores are presented: ROUGE-N (N-gram co-occurrence statistics), ROUGE-L (longest common subsequence), ROUGE-W (weighted longest common subsequence), and ROUGE-S (skip-bigram co-occurrence statistics). In this post only ROUGE-N and ROUGE-L will be discussed, as they are the most popular. ROUGE-L comes in 2 flavors: the sentence-level ROUGE-L and the summary-level ROUGE-Lsum.
ROUGE-N is defined as follows in the original paper:
\[\begin{aligned} \text{ROUGE-N}(candidate, reference) &= \frac{\sum_{r_i \in reference} \sum_{\text{N-gram} \in r_i} Count_{match}(\text{N-gram}, candidate)}{\sum_{r_i \in reference} \sum_{\text{N-gram} \in r_i} Count(\text{N-gram}, r_i)} \end{aligned}\]where $r_i$ are the sentences in the reference document, $Count_{match}(\text{N-gram}, candidate)$ is the number of times an N-gram of $r_i$ also occurs in the candidate (clipped at its count in the candidate), and $Count(\text{N-gram}, r_i)$ is its count in $r_i$. Since the denominator counts the N-grams of the reference, this is a recall-based measure. In the Hugging Face implementation, precision (using the candidate's N-gram count as the denominator) and the F1 score ($2 \cdot \frac{precision \cdot recall}{precision + recall}$) are also computed.
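To make this concrete, here is a minimal sketch (not the library implementation) of ROUGE-N for a single candidate/reference pair; the `rouge_n` function name and the naive whitespace tokenization are choices made purely for illustration:

from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Toy ROUGE-N: clipped n-gram overlap between one candidate and one reference."""
    def ngrams(text, n):
        tokens = text.lower().split()  # naive whitespace tokenization
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())              # n-grams present in both (clipped counts)
    recall = overlap / max(sum(ref.values()), 1)      # denominator: n-grams in the reference
    precision = overlap / max(sum(cand.values()), 1)  # denominator: n-grams in the candidate
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# ROUGE-1 for a short pair: 5 of 6 unigrams overlap on each side
print(rouge_n("the cat sat on the mat", "the cat is on the mat", n=1))  # -> roughly (0.833, 0.833, 0.833)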
ROUGE-L is based on the longest common subsequence (LCS) and is defined as follows:
\[\begin{aligned} \text{ROUGE-L}(c, r) &= \frac{(1+\beta^2)R_{lcs}(c, r)P_{lcs}(c, r)}{R_{lcs}(c, r) + \beta^2P_{lcs}(c, r)} \\ \\ R_{lcs} &= \begin{cases} \frac{LCS(r, c)}{|r|} & \text{if sentence-level (ROUGE-L)} \\ \frac{\sum_{i=1}^{\text{num sentences}}LCS_{\cup}(r_i, c)}{|r|} & \text{if summary-level (ROUGE-Lsum)} \end{cases}\\ \\ P_{lcs} &= \begin{cases} \frac{LCS(r, c)}{|c|} & \text{if sentence-level (ROUGE-L)} \\ \frac{\sum_{i=1}^{\text{num sentences}}LCS_{\cup}(r_i, c)}{|c|} & \text{if summary-level (ROUGE-Lsum)} \end{cases} \end{aligned}\]Here, $c$ is the candidate summary and $r$ the reference summary, $|c|$ is the number of words in $c$, and $|r|$ the number of words in $r$. Since ROUGE is more associated with recall, $\beta$ in this F-score is usually set to a large value so that $R_{lcs}$ is weighted much more heavily (the original paper effectively considers only recall). The longest common subsequence $LCS(r, c)$ is defined as follows:
A sequence $Z = [z_1, z_2, …, z_n]$ is a subsequence of another sequence $X = [x_1, x_2, …, x_m]$ if there exists a strictly increasing sequence $[i_1, i_2, …, i_n]$ of indices of $X$ such that for all $j = 1, 2, …, n$, we have $x_{i_j} = z_j$. Given two sequences $r$ and $c$, the longest common subsequence $LCS(r, c)$ of $r$ and $c$ is a common subsequence with maximum length. $LCS_\cup$ signifies the union of different $LCS$. For example, if $r_i = w_1 w_2 w_3 w_4 w_5$, and $c$ contains two sentences $c_1 = w_1 w_2 w_6 w_7 w_8$ and $c_2 = w_1 w_3 w_8 w_9 w_5$, then the longest common subsequence of $r_i$ and $c_1$ is $w_1 w_2$ and the longest common subsequence of $r_i$ and $c_2$ is $w_1 w_3 w_5$. The union longest common subsequence of $r_i$, $c_1$, and $c_2$ is $w_1 w_2 w_3 w_5$ and $LCS_\cup(r_i, c) = 4$ (Lin, 2004).
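As a sanity check on these definitions, the following sketch (illustrative only, not the rouge_score code; the helper names are hypothetical) computes an LCS by standard dynamic programming and reproduces the union-LCS example above:

def lcs_table(ref, cand):
    """Standard dynamic-programming table for the longest common subsequence."""
    m, n = len(ref), len(cand)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == cand[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table

def lcs_indices(ref, cand):
    """Indices of the reference tokens that belong to one LCS of ref and cand."""
    table = lcs_table(ref, cand)
    i, j, hits = len(ref), len(cand), set()
    while i > 0 and j > 0:
        if ref[i - 1] == cand[j - 1]:
            hits.add(i - 1)
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return hits

# Worked example from Lin (2004); w1..w9 stand for arbitrary word tokens.
r_i = ["w1", "w2", "w3", "w4", "w5"]
c_1 = ["w1", "w2", "w6", "w7", "w8"]
c_2 = ["w1", "w3", "w8", "w9", "w5"]

union = lcs_indices(r_i, c_1) | lcs_indices(r_i, c_2)  # union LCS over the candidate sentences
print(len(union))  # 4 -> LCS_union(r_i, c) = 4, i.e. w1 w2 w3 w5
# Summary-level terms from the union LCS:
#   R_lcs = 4 / |r_i| = 4 / 5 = 0.8,  P_lcs = 4 / (|c_1| + |c_2|) = 4 / 10 = 0.4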
The google-research implementation of ROUGE (also used by Hugging Face) is as follows:
def score(self, target, prediction):
  """Calculates rouge scores between the target and prediction.

  Args:
    target: Text containing the target (ground truth) text.
    prediction: Text containing the predicted text.
  Returns:
    A dict mapping each rouge type to a Score object.
  Raises:
    ValueError: If an invalid rouge type is encountered.
  """
  # Pre-compute target tokens and prediction tokens for use by different
  # types, except if only "rougeLsum" is requested.
  if len(self.rouge_types) == 1 and self.rouge_types[0] == "rougeLsum":
    target_tokens = None
    prediction_tokens = None
  else:
    target_tokens = self._tokenizer.tokenize(target)
    prediction_tokens = self._tokenizer.tokenize(prediction)
  result = {}

  for rouge_type in self.rouge_types:
    if rouge_type == "rougeL":
      # Rouge from longest common subsequences.
      scores = _score_lcs(target_tokens, prediction_tokens)
    elif rouge_type == "rougeLsum":
      # Note: Does not support multi-line text.
      def get_sents(text):
        if self._split_summaries:
          sents = nltk.sent_tokenize(text)
        else:
          # Assume sentences are separated by newline.
          sents = six.ensure_str(text).split("\n")
        sents = [x for x in sents if len(x)]
        return sents

      target_tokens_list = [
          self._tokenizer.tokenize(s) for s in get_sents(target)]
      prediction_tokens_list = [
          self._tokenizer.tokenize(s) for s in get_sents(prediction)]
      scores = _summary_level_lcs(target_tokens_list,
                                  prediction_tokens_list)
    elif re.match(r"rouge[0-9]$", six.ensure_str(rouge_type)):
      # Rouge from n-grams.
      n = int(rouge_type[5:])
      if n <= 0:
        raise ValueError("rougen requires positive n: %s" % rouge_type)
      target_ngrams = _create_ngrams(target_tokens, n)
      prediction_ngrams = _create_ngrams(prediction_tokens, n)
      scores = _score_ngrams(target_ngrams, prediction_ngrams)
    else:
      raise ValueError("Invalid rouge type: %s" % rouge_type)
    result[rouge_type] = scores

  return result
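For reference, this scorer is typically used through the rouge-score package roughly as follows (a sketch assuming the package's public API; the example texts are made up and argument defaults may differ across versions):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True)

# rougeLsum expects sentences separated by newlines (see get_sents above).
scores = scorer.score(
    target="The cat sat on the mat.\nIt was looking for food.",
    prediction="A cat was sitting on the mat.\nIt wanted food.",
)
for rouge_type, score in scores.items():
    # Each value is a Score namedtuple with precision, recall, and fmeasure.
    print(rouge_type, round(score.precision, 3), round(score.recall, 3),
          round(score.fmeasure, 3))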
BLEU
BLEU (Bilingual Evaluation Understudy) is mainly an algorithm for evaluating the quality of machine translations. Its output is between 0 and 1, with values closer to 1 representing more similar texts. In practice, a score of 1 is virtually unattainable, as it would require the machine translation to match the human reference exactly. A score of 0.6 to 0.7 is often the best one can achieve.
The BLEU score can be seen as roughly the weighted geometric mean of the 1-gram through N-gram precision scores, which is then adjusted by a brevity penalty. In mathematical notation, we have:
\[\begin{aligned} \text{BLEU} (N) &= \text{N-gram precision}(N) \cdot \text{brevity penalty}\\ \\ \text{N-gram precision}(N) &= (p_1)^{w_1} \cdot (p_2)^{w_2} \cdots (p_N)^{w_N} \\ &= \prod_{n=1}^N p_n^{w_n} \\ &= e^{\sum_{n=1}^N w_n \log(p_n)} \\ \\ p_n &= \frac{\text{number of n-grams appearing in both translation and reference}}{\text{total number of n-grams in the translation}} \\ \\ \text{brevity penalty} &= \begin{cases} 1 & \text{if } c \ge r \\ e^{(1 - \frac{r}{c})} & \text{if } c \lt r \end{cases} \end{aligned}\]Here, $p_1$, $p_2$, …, $p_N$ signify the 1-gram, 2-gram, …, N-gram precision scores, with respective weights $w_1$, $w_2$, …, $w_N$ in the geometric mean. $c$ is the length of the candidate (translation) corpus and $r$ is the effective reference corpus length. Because precision alone favors shorter sentences, the brevity penalty is introduced to balance out this effect: candidates shorter than the reference are multiplied by a factor less than 1. In practice, $N = 4$ with all weights equal to $\frac{1}{4}$ is most often used.
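The following self-contained sketch (a toy, single-sentence version, not the reference implementation quoted below) walks through exactly these steps, clipped n-gram precisions, geometric mean, and brevity penalty, with N = 4 and uniform weights; toy_bleu and the example sentences are made up for illustration:

import math
from collections import Counter

def toy_bleu(candidate, reference, max_n=4):
    """Toy single-sentence BLEU: clipped n-gram precisions times a brevity penalty."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped matches: an n-gram counts at most as often as it appears in the reference.
        matches = sum((cand & ref).values())
        precisions.append(matches / max(sum(cand.values()), 1))

    if min(precisions) == 0:  # no smoothing: any zero precision makes the score 0
        return 0.0
    geo_mean = math.exp(sum(math.log(p) / max_n for p in precisions))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return geo_mean * brevity_penalty

candidate = "the quick brown fox jumps over the lazy dog".split()
reference = "the quick brown fox jumped over the lazy dog".split()
# Precisions are 8/9, 6/8, 4/7, 2/6 and the lengths match, so BLEU is about 0.597.
print(round(toy_bleu(candidate, reference), 3))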
Below is the TensorFlow implementation of the BLEU score (also used by Hugging Face):
import collections
import math


def compute_bleu(reference_corpus, translation_corpus, max_order=4,
                 smooth=False):
  """Computes BLEU score of translated segments against one or more references.

  Args:
    reference_corpus: list of lists of references for each translation. Each
      reference should be tokenized into a list of tokens.
    translation_corpus: list of translations to score. Each translation
      should be tokenized into a list of tokens.
    max_order: Maximum n-gram order to use when computing BLEU score.
    smooth: Whether or not to apply Lin et al. 2004 smoothing.
  Returns:
    6-tuple with the BLEU score, n-gram precisions, brevity penalty,
    length ratio, translation length, and reference length.
  """
  matches_by_order = [0] * max_order
  possible_matches_by_order = [0] * max_order
  reference_length = 0
  translation_length = 0
  for (references, translation) in zip(reference_corpus,
                                       translation_corpus):
    reference_length += min(len(r) for r in references)
    translation_length += len(translation)

    merged_ref_ngram_counts = collections.Counter()
    for reference in references:
      # Counter union (|=) keeps the max count of each n-gram across references.
      merged_ref_ngram_counts |= _get_ngrams(reference, max_order)
    translation_ngram_counts = _get_ngrams(translation, max_order)
    # Counter intersection (&) keeps the min count, i.e. the clipped matches.
    overlap = translation_ngram_counts & merged_ref_ngram_counts
    for ngram in overlap:
      matches_by_order[len(ngram) - 1] += overlap[ngram]
    for order in range(1, max_order + 1):
      possible_matches = len(translation) - order + 1
      if possible_matches > 0:
        possible_matches_by_order[order - 1] += possible_matches

  precisions = [0] * max_order
  for i in range(0, max_order):
    if smooth:
      precisions[i] = ((matches_by_order[i] + 1.) /
                       (possible_matches_by_order[i] + 1.))
    else:
      if possible_matches_by_order[i] > 0:
        precisions[i] = (float(matches_by_order[i]) /
                         possible_matches_by_order[i])
      else:
        precisions[i] = 0.0

  if min(precisions) > 0:
    p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
    geo_mean = math.exp(p_log_sum)
  else:
    geo_mean = 0

  ratio = float(translation_length) / reference_length
  if ratio > 1.0:
    bp = 1.
  else:
    bp = math.exp(1 - 1. / ratio)

  bleu = geo_mean * bp

  return (bleu, precisions, bp, ratio, translation_length, reference_length)
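To actually run compute_bleu, the _get_ngrams helper it calls (defined elsewhere in the same module) gathers every n-gram up to max_order into a collections.Counter; below is a sketch of that helper along with a toy invocation (the example sentences are made up):

import collections

def _get_ngrams(segment, max_order):
  """Counter of all n-grams (as tuples) of order 1..max_order in a token list."""
  ngram_counts = collections.Counter()
  for order in range(1, max_order + 1):
    for i in range(len(segment) - order + 1):
      ngram_counts[tuple(segment[i:i + order])] += 1
  return ngram_counts

# One translation with one reference; note the nesting: a list of reference
# lists per translation, and each text is pre-tokenized into a list of tokens.
references = [[["the", "cat", "is", "on", "the", "mat"]]]
translations = [["the", "cat", "sat", "on", "the", "mat"]]

bleu, precisions, bp, ratio, trans_len, ref_len = compute_bleu(
    references, translations, max_order=4, smooth=True)
print(round(bleu, 3), [round(p, 3) for p in precisions], bp)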