Tianze Shi, a doctoral candidate in the Department of Computer Science, and his advisor, Lillian Lee, Charles Roy Davis Professor in the departments of Computer Science and Information Science, won the 2021 Shared Task at the 17th International Conference on Parsing Technologies (IWPT) for their work appearing under the acronym "TGIF."
"TGIF" stands for "Tree-Graph Integrated-Format Parser for Enhanced Universal Dependencies with Two-Stage Generic- to Individual-Language Finetuning." Cornell's "TGIF" team, Shi and Lee, had the best overall score in a global competition involving nine teams, and their system achieved the top score on sixteen of the seventeen languages represented. The system description will appear in the conference proceedings; in the meantime, a preprint is available.
At IWPT, Shi and Lee presented their work for the "shared task on parsing into enhanced Universal Dependencies." As the team explains:
Our main system component is a hybrid tree-graph parser that integrates
- (a) predictions of spanning trees for the enhanced graphs with
- (b) additional graph edges not present in the spanning trees.
We also adopt a finetuning strategy where we first train a language-generic parser on the concatenation of data from all available languages, and then, in a second step, finetune on each individual language separately. Additionally, we develop our own complete set of pre-processing modules relevant to the shared task, including tokenization, sentence segmentation, and multiword token expansion, based on pre-trained XLM-R models and our own pre-training of character-level language models. Our submission reaches a macro-average ELAS of 89.24 on the test set. It ranks top among all teams, with a margin of more than 2 absolute ELAS over the next best-performing submission.
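To make the tree-graph idea in the team's description concrete, here is a minimal Python sketch of the integration step only. It is an illustration under stated assumptions, not the team's implementation: the function name `integrate_tree_and_graph`, the edge representation, and the toy sentence are all hypothetical, and the predictions are written by hand rather than produced by a trained parser.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical edge format: (head index, dependent index, label).
# Index 0 stands for an artificial root; words are numbered from 1.
Edge = Tuple[int, int, str]


def integrate_tree_and_graph(
    tree_heads: Dict[int, Tuple[int, str]],
    extra_edges: Set[Edge],
) -> List[Edge]:
    """Combine a spanning tree with additional edges into one graph.

    tree_heads maps each dependent to its single (head, label) pair, so it
    encodes a tree; extra_edges holds edges that are not part of that tree.
    """
    tree_edges = {(head, dep, label) for dep, (head, label) in tree_heads.items()}
    # The enhanced graph is simply the union of both edge sets,
    # sorted by dependent, then head, then label for stable output.
    return sorted(tree_edges | extra_edges, key=lambda e: (e[1], e[0], e[2]))


# Toy example (assumed, not from the shared task data):
# "She(1) sings(2) and(3) dances(4)"
tree = {
    1: (2, "nsubj"),  # She <- sings
    2: (0, "root"),   # sings <- ROOT
    3: (4, "cc"),     # and <- dances
    4: (2, "conj"),   # dances <- sings
}
# Extra edge: "She" is also the subject of "dances".
extra = {(4, 1, "nsubj")}

enhanced = integrate_tree_and_graph(tree, extra)
for head, dep, label in enhanced:
    print(f"{head} -> {dep} ({label})")
```

In this toy graph, word 1 ends up with two heads, which a plain dependency tree cannot express; that is the kind of structure the additional edges are meant to capture.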