Optimizing the impact of data augmentation for low-resource grammatical error correction

Aiman Solyman and Marco Zappatore and Wang Zhenyu and Zeinab Mahmoud and Ali Alfatemi and Ashraf Osman Ibrahim Elsayed and Lubna Abdelkareim Gabralla (2023) Optimizing the impact of data augmentation for low-resource grammatical error correction. Journal of King Saud University – Computer and Information Sciences, 35. pp. 1-15. ISSN 1319-1578

URL: https://www.sciencedirect.com/science/article/pii/...

Abstract

Grammatical Error Correction (GEC) refers to the automatic identification and amendment of grammatical, spelling, punctuation, and word-positioning errors in monolingual texts. Neural Machine Translation (NMT) is currently one of the most valuable techniques for GEC, but it may suffer from scarce training data and domain shift, depending on the language addressed. Current techniques for tackling the data sparsity problem associated with NMT (e.g., tuning pre-trained language models or developing spell-confusion methods without focusing on language diversity) create mismatched data distributions. This paper proposes new aggressive transformation approaches that augment data during training and extend the distribution of authentic data. In particular, it uses augmented data as auxiliary tasks to provide new contexts when the target prefix is not helpful for next-word prediction. This enhances the encoder and steadily increases its contribution by forcing the GEC model to pay more attention to the encoder's text representations during decoding. The impact of these approaches was investigated using a Transformer-based model on a low-resource GEC task, with Arabic GEC as a case study. GEC models trained with our data rely more on source information, are more robust to domain shift, and hallucinate less when training datasets are tiny and the domain shifts. Experimental results showed that the proposed approaches outperformed the baseline, the most common data augmentation methods, and classical synthetic data approaches. In addition, a combination of the three best approaches (Misspelling, Swap, and Reverse) achieved the best score on two benchmarks and outperformed previous Arabic GEC approaches.
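
The abstract names three transformations (Misspelling, Swap, and Reverse) but does not define them. The following is only a minimal sketch of what such transformations might look like, not the authors' implementation. Every function name, the adjacent-character misspelling rule, the `rate` parameter, and the choice to pair each noisy sentence with the unchanged clean sentence as the target are assumptions made here for illustration.

```python
import random


def swap_tokens(tokens, rng):
    """Hypothetical 'Swap': exchange one random pair of adjacent tokens."""
    out = list(tokens)
    if len(out) < 2:
        return out
    i = rng.randrange(len(out) - 1)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def reverse_tokens(tokens):
    """Hypothetical 'Reverse': reverse the token order of the sentence."""
    return list(reversed(tokens))


def misspell_tokens(tokens, rng, rate=0.3):
    """Hypothetical 'Misspelling': with probability `rate`, transpose two
    adjacent characters inside a token (an assumed noise model)."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and rng.random() < rate:
            i = rng.randrange(len(tok) - 1)
            tok = tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
        out.append(tok)
    return out


def augment(target, rng):
    """Build (noisy_source, clean_target) pairs from one clean sentence.

    Assumption: the clean sentence is kept as the target for every pair.
    """
    tokens = target.split()
    noisy = {
        "misspelling": misspell_tokens(tokens, rng),
        "swap": swap_tokens(tokens, rng),
        "reverse": reverse_tokens(tokens),
    }
    return {name: (" ".join(toks), target) for name, toks in noisy.items()}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for name, (src, tgt) in augment("the model corrects the sentence", rng).items():
        print(f"{name}: {src!r} -> {tgt!r}")
```

In a setup like this, the augmented pairs would be mixed with the authentic training pairs, so that the corrupted source gives the decoder little help from the target prefix alone and the model has to draw on the encoder's representation.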

Item Type:	Article
Keywords:	Grammatical error correction, Data augmentation, Synthetic data, Deep learning
Subjects:	P Language and Literature > P Philology. Linguistics > P1-1091 Philology. Linguistics > P98-98.5 Computational linguistics. Natural language processing
	Q Science > QA Mathematics > QA1-939 Mathematics > QA71-90 Instruments and machines > QA75.5-76.95 Electronic computers. Computer science
Department:	FACULTY > Faculty of Computing and Informatics
Depositing User:	SITI AZIZAH BINTI IDRIS
Date Deposited:	04 Sep 2025 15:51
Last Modified:	04 Sep 2025 15:51
URI:	https://eprints.ums.edu.my/id/eprint/45150
