ROUGE and BLEU
In natural language processing (NLP), evaluation metrics are often less straightforward than in computer vision (CV), because a language offers many different ways to express the same meaning. Below are two popular metrics in NLP.
ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is mainly an algorithm for evaluating the quality of automatically generated summaries. Its output is between 0 and 1, with values closer to 1 indicating better summaries. The original paper presents 4 types of ROUGE scores: ROUGE-N (N-gram co-occurrence statistics), ROUGE-L (longest common subsequence), ROUGE-W (weighted longest common subsequence), and ROUGE-S (skip-bigram co-occurrence statistics). In this post only ROUGE-N and ROUGE-L will be discussed, as they are the most popular. ROUGE-L comes in 2 flavors: the sentence-level ROUGE-L and the summary-level ROUGE-Lsum.
ROUGE-N is defined as follows in the original paper:
\(\begin{aligned} \text{ROUGE-N}(candidate, reference) &= \frac{\sum_{r_i \in reference} \sum_{\text{N-gram} \in r_i} Count_{match}(\text{N-gram}, candidate)}{\sum_{r_i \in reference} \sum_{\text{N-gram} \in r_i} Count(\text{N-gram})} \end{aligned}\) where $r_i$ are the sentences in the reference document and $Count_{match}$ counts how many times each N-gram of the reference also occurs in the candidate. Since the denominator counts the N-grams of the reference, this is a recall-based measure. In the HuggingFace implementation, precision (using the candidate as the denominator) and F1 scores ($2 \cdot \frac{precision \cdot recall}{precision + recall}$) are also computed.
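To make the definition concrete, here is a minimal from-scratch sketch of ROUGE-N. The helper names (`ngrams`, `rouge_n`) and the example sentences are mine for illustration, not from any library:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall, precision, and F1 between two token lists."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    # Each n-gram is matched at most as many times as it appears in either text.
    overlap = sum((cand_counts & ref_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

print(rouge_n("the cat sat on the mat".split(),
              "the cat was sat on the mat".split(), n=2))
# recall = 4/6, precision = 4/5: four bigrams co-occur
```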
ROUGE-L is based on the longest common subsequence and is defined as follows:
\[\begin{aligned} \text{ROUGE-L}(c, r) &= \frac{(1+\beta^2)R_{lcs}(c, r)P_{lcs}(c, r)}{R_{lcs}(c, r) + \beta^2 P_{lcs}(c, r)} \\ \\ R_{lcs} &= \begin{cases} \frac{LCS(r, c)}{|r|} & \text{if sentence-level (ROUGE-L)} \\ \frac{\sum_{i=1}^{\text{num sentences}} LCS_{\cup}(r_i, c)}{|r|} & \text{if summary-level (ROUGE-Lsum)} \end{cases} \\ \\ P_{lcs} &= \begin{cases} \frac{LCS(r, c)}{|c|} & \text{if sentence-level (ROUGE-L)} \\ \frac{\sum_{i=1}^{\text{num sentences}} LCS_{\cup}(r_i, c)}{|c|} & \text{if summary-level (ROUGE-Lsum)} \end{cases} \end{aligned}\]Here, $c$ denotes the candidate summary and $r$ the reference summary; $|c|$ is the number of words in $c$ and $|r|$ the number of words in $r$. Since ROUGE is more associated with recall, the $\beta$ in the F-score is usually set to a higher value such as 2, which weights recall more heavily than precision. The longest common subsequence ($LCS(r, c)$) is defined as follows:
A sequence $Z = [z_1, z_2, …, z_n]$ is a subsequence of another sequence $X = [x_1, x_2, …, x_m]$ if there exists a strictly increasing sequence $[i_1, i_2, …, i_k]$ of indices of $X$ such that for all $j = 1, 2, …, k$ we have $x_{i_j} = z_j$. Given two sequences $r$ and $c$, the longest common subsequence ($LCS(r, c)$) of $r$ and $c$ is a common subsequence of maximum length. $LCS_\cup$ signifies the union of different $LCS$s. For example, if $r_i = w_1 w_2 w_3 w_4 w_5$ and $c$ contains two sentences $c_1 = w_1 w_2 w_6 w_7 w_8$ and $c_2 = w_1 w_3 w_8 w_9 w_5$, then the longest common subsequence of $r_i$ and $c_1$ is $w_1 w_2$ and the longest common subsequence of $r_i$ and $c_2$ is $w_1 w_3 w_5$. The union longest common subsequence of $r_i$, $c_1$, and $c_2$ is $w_1 w_2 w_3 w_5$, and $LCS_\cup(r_i, c) = 4$ (Lin, 2004).
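The LCS length itself is typically computed with the standard dynamic program; below is a minimal sketch (the function name `lcs_length` is mine):

```python
def lcs_length(r, c):
    """Length of the longest common subsequence of token lists r and c,
    via the standard O(len(r) * len(c)) dynamic program."""
    m, n = len(r), len(c)
    # table[i][j] = LCS length of r[:i] and c[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if r[i - 1] == c[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

# Mirrors the example above: LCS of r_i and c_2 is w1 w3 w5, length 3.
print(lcs_length("w1 w2 w3 w4 w5".split(), "w1 w3 w8 w9 w5".split()))  # 3
```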
ROUGE is implemented in Google Research's `rouge_score` package, which the HuggingFace `evaluate` library wraps.
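A minimal usage sketch of the HuggingFace wrapper (the example sentences are invented; the output keys in the comment are the metric's defaults):

```python
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=["the cat sat on the mat"],          # candidate summaries
    references=["the cat was sitting on the mat"],   # reference summaries
)
# By default, aggregated F1 scores are returned, e.g.:
# {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
print(results)
```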
BLEU
BLEU (bilingual evaluation understudy) is mainly an algorithm for evaluating the quality of machine translations. Its output is between 0 and 1, with values closer to 1 indicating more similar texts. In practice, a score of 1 is virtually unattainable, as it would require the machine translation to match the human reference exactly; a score of 0.6 to 0.7 is often the best one can get.
The BLEU score can be seen roughly as the weighted geometric mean of the 1-gram through N-gram precision scores, adjusted by a brevity penalty. In mathematical notation, we have:
\[\begin{aligned} \text{BLEU}(N) &= \text{N-gram precision}(N) \cdot \text{brevity penalty} \\ \\ \text{N-gram precision}(N) &= (p_1)^{w_1} \cdot (p_2)^{w_2} \cdots (p_N)^{w_N} \\ &= \prod_{n=1}^N p_n^{w_n} \\ &= e^{\sum_{n=1}^N w_n \log(p_n)} \\ \\ p_n &= \frac{\text{number of N-grams in the translation that also appear in the reference (clipped)}}{\text{total number of N-grams in the translation}} \\ \\ \text{brevity penalty} &= \begin{cases} 1 & \text{if } c \ge r \\ e^{(1 - \frac{r}{c})} & \text{if } c \lt r \end{cases} \end{aligned}\]Here, $p_1$, $p_2$, …, $p_N$ are the 1-gram, 2-gram, …, N-gram precision scores, with respective weights $w_1$, $w_2$, …, $w_N$ in the geometric mean. Each N-gram's count is clipped to the maximum number of times it appears in the reference, so a candidate cannot inflate its precision by repeating a matching N-gram. $c$ is the length of the candidate corpus and $r$ is the effective reference corpus length. As precision alone favors shorter sentences, the brevity penalty is introduced to balance out this effect: candidates shorter than the reference are multiplied by a factor less than 1. In practice, $N = 4$ with all weights equal to $\frac{1}{4}$ is often used.
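To make the clipped precision and brevity penalty concrete, here is a minimal from-scratch sketch of sentence-level BLEU with a single reference. The names and example are mine; real implementations also handle multiple references, corpus-level aggregation, and smoothing:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and clipped n-gram counts."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        clipped = sum((cand & ref).values())  # counts clipped by the reference
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision drives the geometric mean to 0
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(bleu("the quick brown fox jumps over the lazy dog".split(),
           "the quick brown fox jumped over the lazy dog".split()))  # ≈ 0.60
```

Note that without smoothing, any short candidate with no 4-gram overlap scores exactly 0, which is one reason corpus-level BLEU is preferred over sentence-level BLEU.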
BLEU is also available through the HuggingFace `evaluate` library, whose metric script is based on the Tensorflow NMT implementation.
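A minimal usage sketch (the example sentences are invented; the output keys in the comment are the metric's defaults):

```python
import evaluate  # pip install evaluate

bleu = evaluate.load("bleu")
results = bleu.compute(
    predictions=["the quick brown fox jumps over the lazy dog"],
    # Each prediction may have multiple references, hence the nested list.
    references=[["the quick brown fox jumped over the lazy dog"]],
)
# e.g. {'bleu': ..., 'precisions': [...], 'brevity_penalty': ...,
#       'length_ratio': ..., 'translation_length': ..., 'reference_length': ...}
print(results)
```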