tydi_eval

Official evaluation script for the TyDi QA primary tasks.

The primary tasks are the Passage Selection Task (SelectP) and the Minimal Answer Span Task (AnsSpan). This script is not used for the secondary task, the SQuAD-compatible Gold Passage (GoldP) task.

Note that the R@P metrics are only meaningful if your model populates the score fields of the prediction JSON format (these fields are not required).

Prediction format (written on multiple lines here for clarity, but each prediction should be a single line in your output file):

    {
      "example_id": -2226525965842375672,
      "passage_answer_index": 2,
      "passage_answer_score": 13.5,
      "minimal_answer": {"start_byte_offset": 64206, "end_byte_offset": 64280},
      "minimal_answer_score": 26.4,
      "yes_no_answer": "NONE"
    }

The prediction format mirrors the annotation format in defining each passage or minimal answer span in terms of byte offsets. If start_byte_offset >= 0 and end_byte_offset >= 0, the byte offsets define the span; otherwise no span is defined (a null answer).

The minimal answer metric takes both minimal answer spans and the yes/no answer into account. If the 'minimal_answers' list contains any non-null spans, then 'yes_no_answer' should be set to 'NONE'.
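As an illustration only (not part of the official script), here is a minimal sketch of writing predictions in this one-JSON-object-per-line layout, using the field names shown above; the helper name write_predictions and the output path are hypothetical, and negative offsets mark a null minimal answer:

    import json

    def write_predictions(predictions, path):
        """Write one JSON prediction per line, as the evaluation script expects."""
        with open(path, "w", encoding="utf-8") as f:
            for pred in predictions:
                f.write(json.dumps(pred) + "\n")

    # One prediction with a minimal answer span, one with a null span.
    example_predictions = [
        {
            "example_id": -2226525965842375672,
            "passage_answer_index": 2,
            "passage_answer_score": 13.5,
            "minimal_answer": {"start_byte_offset": 64206, "end_byte_offset": 64280},
            "minimal_answer_score": 26.4,
            "yes_no_answer": "NONE",
        },
        {
            "example_id": 1234567890123456789,  # hypothetical example_id
            "passage_answer_index": 0,
            "passage_answer_score": 4.2,
            "minimal_answer": {"start_byte_offset": -1, "end_byte_offset": -1},  # null span
            "minimal_answer_score": 0.0,
            "yes_no_answer": "NONE",
        },
    ]

    write_predictions(example_predictions, "predictions.jsonl")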

Metrics:

Each prediction should be provided with a passage answer score and a minimal answer score. At evaluation time, the evaluation script will find a score threshold at which F1 is maximized. All predictions with scores below this threshold are ignored (assumed to be null). If no scores are provided, the evaluation script considers all predictions to be valid. The script will also output the maximum recall at precision points of >= 0.5, >= 0.75, and >= 0.9.
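To make the thresholding concrete, the sketch below sweeps a score threshold over (score, is_correct) pairs, reports the F1-maximizing cutoff, and reads off recall at the listed precision targets. This is an illustrative reimplementation under assumed input shapes, not the script's actual code (the script's own compute_pr_curves and related helpers, listed under Functions below, serve this role):

    def sweep_thresholds(scored_predictions, num_gold):
        """scored_predictions: list of (score, is_correct) pairs, one per predicted answer.
        num_gold: number of examples with a gold (non-null) answer.
        Returns the best F1 over all score thresholds and recall at precision targets."""
        # Sort by descending score; lowering the threshold admits predictions one at a time.
        ordered = sorted(scored_predictions, key=lambda x: x[0], reverse=True)
        best_f1 = 0.0
        r_at_p = {0.5: 0.0, 0.75: 0.0, 0.9: 0.0}
        true_pos = 0
        for i, (_, is_correct) in enumerate(ordered, start=1):
            true_pos += int(is_correct)
            precision = true_pos / i
            recall = true_pos / num_gold if num_gold else 0.0
            if precision + recall > 0:
                f1 = 2 * precision * recall / (precision + recall)
                best_f1 = max(best_f1, f1)
            for target in r_at_p:
                if precision >= target:
                    r_at_p[target] = max(r_at_p[target], recall)
        return best_f1, r_at_p

The real script computes these statistics separately for passage answers and minimal answers, and per language; the sketch only illustrates the threshold sweep itself.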

Key methods:

Scoring passage answer candidates: score_passage_answer()

Scoring minimal answer candidates: score_minimal_answer(), eval_utils.compute_partial_match_scores()

Computing language-wise F1: compute_macro_f1()
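The macro-averaging idea behind the language-wise F1 can be sketched roughly as follows. This is an assumption-laden illustration, not the module's actual interface: the helper names precision_recall_f1 and macro_average_f1 and the (has_gold, has_pred, is_correct) answer-stat triples are invented for the sketch.

    def precision_recall_f1(answer_stats):
        """answer_stats: list of (has_gold, has_pred, is_correct) triples, one per example.
        Returns (precision, recall, f1). Illustrative only."""
        gold = sum(1 for has_gold, _, _ in answer_stats if has_gold)
        pred = sum(1 for _, has_pred, _ in answer_stats if has_pred)
        correct = sum(1 for _, _, ok in answer_stats if ok)
        precision = correct / pred if pred else 0.0
        recall = correct / gold if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1

    def macro_average_f1(per_language_stats):
        """per_language_stats: dict mapping language -> answer_stats list.
        Macro-averages F1 across languages (assumed aggregation, for illustration)."""
        f1s = [precision_recall_f1(stats)[2] for stats in per_language_stats.values()]
        return sum(f1s) / len(f1s) if f1s else 0.0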

Functions

byte_slice

compute_final_f1: Computes overall F1 given passage and minimal answers, ignoring scores.

compute_macro_f1: Computes F1, precision, recall for a list of answer scores.

compute_pr_curves: Computes PR curve and returns R@P for specific targets.

get_latex_str

get_metrics_with_answer_stats: Generate metrics dict using passage and minimal answer stats.

pretty_print

print_r_at_p_table: Pretty prints the R@P table for default targets.

score_answers: Scores all answers for all documents.

score_minimal_answer: Scores a minimal answer.

score_passage_answer: Scores a passage answer as correct or not.