TY - JOUR
T1 - LATS
T2 - Low resource abstractive text summarization
AU - van Yperen, Chris
AU - Frasincar, Flavius
AU - El Kanfoudi, Kamilah
N1 - Publisher Copyright:
© 2025 The Author(s)
PY - 2025/8/15
Y1 - 2025/8/15
AB - Text summarization is an increasingly important focus of Natural Language Processing (NLP), and state-of-the-art models such as PEGASUS have demonstrated remarkable potential for ever more efficient and accurate abstractive summarization. Nonetheless, deep learning models trained on large datasets risk sub-optimal generalization and inefficient training, and can get stuck in local optima due to their high-dimensional, non-convex optimization landscapes. Current research in NLP suggests that leveraging curriculum learning techniques to guide model training, letting the model learn from training data of increasing difficulty, could enhance model performance. In this paper, we investigate the effectiveness of curriculum learning strategies and data augmentation techniques on PEGASUS to increase performance with low-resource training data from the CNN/DM dataset. We introduce a novel text-summary pair complexity scoring algorithm along with two simple baseline difficulty measures. We find that our complexity sorting method consistently outperforms the baseline sorting methods and boosts the performance of PEGASUS. The Baby-Steps curriculum learning strategy with this sorting method yields a performance improvement of 5.65%, raising the combined ROUGE F1-score from 83.28 to 87.99. Combining this strategy with a data augmentation technique, Easy Data Augmentation, increases the improvement to 6.54%. Both figures are relative to a baseline trained without curriculum learning or data augmentation.
UR - http://www.scopus.com/inward/record.url?scp=105005223350&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2025.128078
DO - 10.1016/j.eswa.2025.128078
M3 - Article
AN - SCOPUS:105005223350
SN - 0957-4174
VL - 286
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 128078
ER -