AI-Powered Chatbot for Conversational Understanding in Roman Urdu
Keywords:
Retrieval-Augmented Generation, Roman Urdu, AI chatbot
Abstract
Urdu is spoken by many people, especially in Pakistan and India, but when they write it online they often use Roman Urdu (Urdu written in the Latin script). Because Roman Urdu has no standard orthography, the same word may be spelled several different ways, and most chatbots struggle to understand it. This research aims to develop an intelligent AI chatbot that can understand and respond accurately in Roman Urdu. To achieve this, we will combine Retrieval-Augmented Generation (RAG) with GPT-based models to improve the chatbot's accuracy and the relevance of its responses. The study explains how the chatbot is designed, trained, tested, and refined, helping AI work more effectively with languages that lack fixed writing rules.
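To make the pipeline described above concrete, the following is a minimal, self-contained sketch in Python of the two ideas the abstract combines: normalizing Roman Urdu spelling variants and retrieving relevant context before generation. The variant map, helper names (normalize, retrieve, build_prompt), and toy corpus are illustrative assumptions, not components of the actual system.

# Minimal sketch of a RAG flow for Roman Urdu, assuming a toy in-memory
# corpus and hand-written normalization rules; all names here are
# hypothetical illustrations, not the paper's implementation.
from collections import Counter
import math
import re

# Illustrative spelling-variant map: Roman Urdu has no fixed orthography,
# so "nahi", "nahin", and "nai" can all mean "no/not". A real system would
# learn such clusters (e.g., via lexical normalization).
VARIANTS = {"nahin": "nahi", "nai": "nahi", "kya": "kia", "hay": "hai"}

def normalize(text: str) -> list[str]:
    """Lowercase, tokenize, and map known spelling variants to one form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [VARIANTS.get(t, t) for t in tokens]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the normalized query."""
    q_vec = Counter(normalize(query))
    ranked = sorted(corpus, key=lambda p: cosine(q_vec, Counter(normalize(p))), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble retrieved context plus the user query for a GPT-style model."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nUser (Roman Urdu): {query}\nAnswer:"

if __name__ == "__main__":
    corpus = [
        "Lahore Pakistan ka aik bara shehar hai.",  # "Lahore is a big city of Pakistan."
        "Urdu Pakistan ki qaumi zuban hai.",        # "Urdu is the national language of Pakistan."
    ]
    query = "Pakistan ki qaumi zuban kia hay?"      # "What is the national language of Pakistan?"
    prompt = build_prompt(query, retrieve(query, corpus))
    print(prompt)  # This prompt would then be sent to the GPT-based generator.

In a deployed system, the bag-of-words retriever would be replaced by learned dense embeddings and the assembled prompt would be passed to the GPT-based generator rather than printed.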

License
Copyright (c) 2025 50sea

This work is licensed under a Creative Commons Attribution 4.0 International License.