tydi_eval
Official evaluation script for the TyDi QA primary tasks.
The primary tasks are the Passage Selection Task (SelectP) and the Minimal Answer Span Task (AnsSpan). This script is not used for the secondary task, the SQuAD-compatible Gold Passage (GoldP) task.
Note that the R@P metrics are only meaningful if your model populates the score fields of the prediction JSON format (which is not required).
Prediction format (written on multiple lines here for clarity, but each prediction should be a single line in your output file):
{
  "example_id": -2226525965842375672,
  "passage_answer_index": 2,
  "passage_answer_score": 13.5,
  "minimal_answer": {"start_byte_offset": 64206, "end_byte_offset": 64280},
  "minimal_answer_score": 26.4,
  "yes_no_answer": "NONE"
}
The prediction format mirrors the annotation format in defining each passage or minimal answer span in terms of byte offsets.
- If start_byte_offset >= 0 and end_byte_offset >= 0, the byte offsets define the span; otherwise no span is defined (a null answer).
The minimal answer metric takes both minimal answer spans and the yes/no answer into account. If the 'minimal_answers' list contains any non-null spans, then 'yes_no_answer' should be set to 'NONE'.
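For concreteness, here is a minimal sketch (not part of the evaluation script) of how a prediction file might be written: one JSON object per line, with negative byte offsets signaling a null minimal answer span and 'yes_no_answer' left as 'NONE' whenever a span is predicted. The make_prediction helper and the predictions.jsonl filename are illustrative, not part of the official tooling.

```python
import json

def make_prediction(example_id, passage_answer_index, passage_answer_score,
                    start_byte_offset, end_byte_offset, minimal_answer_score,
                    yes_no_answer="NONE"):
  """Builds one prediction dict in the format described above.

  Hypothetical helper, not part of tydi_eval. Byte offsets < 0 are used
  here to signal a null minimal answer span.
  """
  return {
      "example_id": example_id,
      "passage_answer_index": passage_answer_index,
      "passage_answer_score": passage_answer_score,
      "minimal_answer": {
          "start_byte_offset": start_byte_offset,
          "end_byte_offset": end_byte_offset,
      },
      "minimal_answer_score": minimal_answer_score,
      "yes_no_answer": yes_no_answer,
  }

predictions = [
    # A span prediction: yes_no_answer stays "NONE" when a span is given.
    make_prediction(-2226525965842375672, 2, 13.5, 64206, 64280, 26.4),
    # A null minimal answer span (offsets < 0), passage answer only.
    make_prediction(1234567890123456789, 0, 7.2, -1, -1, -3.1),
]

with open("predictions.jsonl", "w") as f:
  for p in predictions:
    f.write(json.dumps(p) + "\n")  # one prediction per line
```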
Metrics:
Each prediction should be provided with a passage answer score and a minimal answer score. At evaluation time, the evaluation script will find a score threshold at which F1 is maximized. All predictions with scores below this threshold are ignored (assumed to be null). If the score is not provided, the evaluation script considers all predictions to be valid. The script will also output the maximum recall at precision points of >= 0.5, >= 0.75, and >= 0.9.
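To make the threshold search concrete, the sketch below (an illustration of the idea, not the script's actual implementation) ranks predictions by score, sweeps the threshold over the observed scores, and reports both the best F1 and the highest recall reachable at precision targets of 0.5, 0.75, and 0.9. The sweep_threshold name and its inputs are assumptions for the example.

```python
def sweep_threshold(scored_predictions, num_gold):
  """Finds the F1-maximizing score threshold and recall at precision targets.

  scored_predictions: list of (score, is_correct) pairs, one per predicted
    answer. num_gold: number of examples with a gold (non-null) answer.
  Illustrative sketch only, not the evaluation script's exact code.
  """
  # Sort by descending score so each prefix corresponds to keeping every
  # prediction whose score is >= the score of the last kept prediction.
  ranked = sorted(scored_predictions, key=lambda x: x[0], reverse=True)
  best = (0.0, float("-inf"))  # (best F1, threshold)
  r_at_p = {0.5: 0.0, 0.75: 0.0, 0.9: 0.0}
  true_positives = 0
  for i, (score, is_correct) in enumerate(ranked, start=1):
    true_positives += int(is_correct)
    precision = true_positives / i
    recall = true_positives / num_gold if num_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    if f1 > best[0]:
      best = (f1, score)
    for target in r_at_p:
      if precision >= target:
        r_at_p[target] = max(r_at_p[target], recall)
  return best, r_at_p

# Example: three correct and one incorrect prediction out of five gold answers.
preds = [(13.5, True), (9.1, True), (4.2, False), (2.0, True)]
(best_f1, threshold), recalls = sweep_threshold(preds, num_gold=5)
```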
Key methods:
- Scoring passage answer candidates: score_passage_answer()
- Scoring minimal answer candidates: score_minimal_answer(), eval_utils.compute_partial_match_scores()
- Computing language-wise F1: compute_macro_f1()
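The sketch below illustrates the two ideas behind the last two bullets: byte-overlap precision/recall between a predicted and a gold minimal answer span (in the spirit of eval_utils.compute_partial_match_scores(), not its actual code) and a language-wise macro F1 obtained by averaging per-language F1 values (in the spirit of compute_macro_f1()). The half-open byte offsets and the example numbers are assumptions.

```python
def byte_overlap_scores(pred_start, pred_end, gold_start, gold_end):
  """Precision/recall of a predicted byte span against a gold byte span.

  Illustrative only; offsets are treated as half-open [start, end) positions.
  """
  overlap = max(0, min(pred_end, gold_end) - max(pred_start, gold_start))
  precision = overlap / (pred_end - pred_start) if pred_end > pred_start else 0.0
  recall = overlap / (gold_end - gold_start) if gold_end > gold_start else 0.0
  return precision, recall


def macro_f1(per_language_f1):
  """Averages per-language F1 scores into a single language-wise macro F1."""
  return sum(per_language_f1.values()) / len(per_language_f1)


# Example usage with made-up offsets and per-language scores.
p, r = byte_overlap_scores(64206, 64280, 64210, 64290)
overall = macro_f1({"arabic": 0.71, "bengali": 0.62, "finnish": 0.68})
```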
Functions
- Computes overall F1 given passage and minimal answers, ignoring scores.
- Computes F1, precision, recall for a list of answer scores.
- Computes PR curve and returns R@P for specific targets.
- Generate metrics dict using passage and minimal answer stats.
- Pretty prints the R@P table for default targets.
- Scores all answers for all documents.
- Scores a minimal answer.
- Scores a passage answer as correct or not.
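As a rough illustration of the last two entries, a predicted passage answer can be judged by comparing its passage index against the gold annotations. The sketch below assumes a simplified rule (the prediction is correct if it matches any annotator's non-null passage index); the official script's handling of multi-way annotations and null answers may differ.

```python
def passage_answer_is_correct(predicted_index, gold_passage_indices):
  """True if the predicted passage index matches any non-null gold index.

  gold_passage_indices: one entry per annotator; -1 denotes a null answer.
  Illustrative sketch only, not the evaluation script's actual logic.
  """
  gold = [i for i in gold_passage_indices if i >= 0]
  return bool(gold) and predicted_index in gold

# Example: two of three annotators selected passage 2, one gave a null answer.
correct = passage_answer_is_correct(2, [2, 2, -1])
```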