Extractive Text Summarization-Based Framework for Sindhi Language
Keywords: Sindhi Language, Extractive Summarization, Natural Language Processing (NLP), Sentence Selection, TF-IDF, Sentence Embeddings

Abstract
This paper presents an extractive text summarization method specially designed for Sindhi, a culturally rich but low-resource Indo-Aryan language spoken widely in Pakistan. The study focuses on selecting the most relevant sentences from Sindhi texts, employing Natural Language Processing (NLP) techniques to generate concise summaries.
The proposed system incorporates essential preprocessing steps, including text cleaning, tokenization, and stemming/lemmatization. For feature extraction, it utilizes TF-IDF and sentence embeddings. After scoring the sentences, the most significant ones are selected to form the final summary.
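The pipeline described above (tokenize, score sentences by TF-IDF weight, keep the top-ranked sentences) can be sketched roughly as follows. This is a minimal illustration using scikit-learn's default tokenizer, not the paper's implementation; the Sindhi-specific cleaning, stemming, and embedding steps are omitted:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def summarize(sentences, k=2):
    """Return the k highest-scoring sentences, in original document order."""
    # Build a sentence-by-term TF-IDF matrix.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # Score each sentence as the mean TF-IDF weight of its terms.
    scores = np.asarray(tfidf.mean(axis=1)).ravel()
    # Pick the indices of the top-k sentences, then restore document order.
    top = sorted(np.argsort(scores)[-k:])
    return [sentences[i] for i in top]
```

In practice the same scoring step would run on preprocessed Sindhi tokens, and a sentence-embedding score could be combined with the TF-IDF score before ranking.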
To evaluate the system's performance on five test paragraphs, several metrics are used, including precision, recall, F1 score, cosine similarity, normalized Levenshtein distance, and accuracy. The system demonstrates reliable, accurate, and consistent summarization, achieving high precision (1.0), a strong F1 score (0.89–0.92), a low normalized error (0.04), and an overall accuracy of 0.86. Graphical analysis further confirms the model's coherence, semantic retention, and low error rates.
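Several of the reported metrics reduce to short helper functions. The sketch below illustrates precision/recall/F1 and a normalized edit-distance error; treating a selected sentence as correct when it appears in a human reference summary is an assumption here, since the paper's exact matching criterion is not given in the abstract:

```python
def prf1(selected, reference):
    """Precision, recall, and F1 over sentence sets: a selected sentence
    counts as a true positive if it also appears in the reference summary."""
    sel, ref = set(selected), set(reference)
    tp = len(sel & ref)
    p = tp / len(sel) if sel else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def normalized_levenshtein(a, b):
    """Edit distance divided by the longer string's length:
    0.0 means identical texts, 1.0 means maximally different."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n] / max(m, n) if max(m, n) else 0.0
```

Cosine similarity between the TF-IDF or embedding vectors of the generated and reference summaries would complement these surface-level scores with a semantic one.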
By leveraging NLP for text summarization, this study contributes to preserving and promoting the Sindhi language, with potential applications in digital accessibility, education, and content curation. Future research aims to enhance contextual understanding by exploring transformer-based models such as BERT and extending the approach to abstractive summarization.

Copyright (c) 2025 50sea

This work is licensed under a Creative Commons Attribution 4.0 International License.