Sindhi Keyword Extraction from Online Articles for SEO Experts Using Web Scraping and MultiBERT Model

Muhammad Hashir; Zulqarnain Channa; Shamshad Lakho; Atta Muhammad Panhyar; Manzoor Hussain; Muhammad Ibrahim Channa

doi:10.33411/IJIST/1766

Authors

Muhammad Hashir, Department of Information Technology, Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah, Pakistan
Zulqarnain Channa, Department of Computer Science, Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah, Pakistan
Shamshad Lakho, Department of Computer Science, Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah, Pakistan
Atta Muhammad Panhyar, Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah, Pakistan
Manzoor Hussain, Department of Information Technology, Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah, Pakistan
Muhammad Ibrahim Channa, Department of Computer Science, Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah, Pakistan

Keywords:

Sindhi Language, Keyword Extraction, Deep Learning, Natural Language Processing, Multilingual BERT, NER, Web Scraping, Text Normalization, Search Engine Optimization

Abstract

Keyword extraction for search engine optimization (SEO) in Sindhi (سنڌي) is hampered by a lack of computational tools, poor optimization for low-resource languages, and the peculiarities of the Sindhi script. These limitations make Sindhi content hard to index and reduce the visibility of Sindhi web pages in search engine results pages. To mitigate these issues, this paper proposes a deep learning-based approach to Sindhi keyword extraction that combines a multilingual BERT (MultiBERT) model with Named Entity Recognition (NER). Over 6,300 Sindhi news articles were collected by web scraping the Daily Kawish. The scraped data, including URLs, categories, and textual content, was stored in CSV format and then normalized to handle linguistic variation in Sindhi text. A multilingual BERT-based NER model was then fine-tuned on the processed data to identify keywords. The experimental results show that the proposed model achieves an accuracy of 92.5%, precision of 91.8%, recall of 89.6%, and F1-score of 90.7%. It outperforms conventional keyword extraction baselines, including TF-IDF, TextRank, and RAKE, by up to 17% in F1-score, demonstrating its effectiveness for low-resource language processing. The extracted keywords were then visualized to examine their distribution and relevance. The proposed framework offers a working model for improving Sindhi keyword extraction and gives SEO professionals a practical means of increasing content visibility in low-resource languages. It also contributes to the development of natural language processing (NLP) for regional languages and provides a foundation for future work in Sindhi text analytics.
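
As a quick consistency check on the reported metrics, the F1-score can be recomputed from the stated precision and recall, assuming the standard definition of F1 as the harmonic mean of the two. The short Python sketch below is illustrative only: the function and variable names are hypothetical, and it is not taken from the authors' evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, expressed as fractions.
reported_precision = 0.918
reported_recall = 0.896
reported_f1 = 0.907

# Recompute F1 from the reported precision and recall.
computed_f1 = f1_score(reported_precision, reported_recall)
print(f"computed F1 = {computed_f1:.3f}")  # prints "computed F1 = 0.907"
```

Under that assumption, the recomputed value agrees with the reported F1-score of 90.7% to one decimal place.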

