Exploring Character-Based Stylometry Features Using Machine Learning for Intrinsic Plagiarism Detection in Urdu

Muhammad Faraz Manzoor; Muhammad Shoaib Farooq; Muntazir Mehdi; Adnan Abid

Authors

Muhammad Faraz Manzoor, Department of Computer Science, University of Management and Technology, Lahore, Pakistan
Muhammad Shoaib Farooq, Department of Computer Science, University of Management and Technology, Lahore, Pakistan
Muntazir Mehdi, Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
Adnan Abid, Department of Data Science, Faculty of Computing and Information Technology, University of the Punjab, Pakistan

Keywords

Intrinsic, Plagiarism, Urdu, Stylometry.

Abstract

Plagiarism detection is a natural language processing (NLP) task that is central to maintaining textual integrity across domains, and it is particularly challenging for low-resource languages such as Urdu. This study addresses intrinsic plagiarism detection in Urdu, an area with limited prior research owing to the scarcity of datasets and models. To bridge this gap, we investigate character-based stylometric features in combination with machine learning (ML) and deep learning (DL) models for Urdu text analysis. We conducted a series of experiments to evaluate several classifiers, including Random Forest, AdaBoost, K-Nearest Neighbor (KNN), Decision Tree, Gaussian Naive Bayes, and Long Short-Term Memory (LSTM) networks. KNN and LSTM achieved the highest accuracy (74%), and KNN also obtained the highest F1-score (64.3%), indicating the best balance between precision and recall. AdaBoost followed closely with an accuracy of 73% and a precision of 77.5%, although its F1-score was slightly lower at 63.6%. These findings underline the need for specialized NLP approaches for Urdu and suggest that tailored ML and DL techniques can improve intrinsic plagiarism detection in low-resource languages.
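
The general idea of pairing character-based stylometric features with a nearest-neighbor classifier can be illustrated with a minimal sketch. The abstract does not list the features or hyperparameters used in the study, so everything below is an assumption made for illustration: the five features in `char_features`, the helper `knn_predict`, the value `k=3`, and the label names are hypothetical and are not the authors' implementation.

```python
import math
import unicodedata
from collections import Counter


def char_features(text):
    """Return a small, hypothetical set of character-level style features.

    The ratios use Unicode general categories, so they work for Urdu
    script as well as Latin text: L = letters, N = digits,
    P = punctuation, Z = separators such as spaces.
    """
    n = len(text) or 1  # avoid division by zero on empty input
    cats = Counter(unicodedata.category(ch)[0] for ch in text)
    words = text.split()
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return [
        cats["L"] / n,   # letter ratio
        cats["N"] / n,   # digit ratio
        cats["P"] / n,   # punctuation ratio
        cats["Z"] / n,   # separator (whitespace) ratio
        avg_word_len,    # mean whitespace-delimited token length
    ]


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.

    `train` is a list of (feature_vector, label) pairs; distance is
    Euclidean. This is a plain textbook KNN, not a tuned model.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy usage: feature vector for one short Urdu sentence.
    print(char_features("یہ ایک مثال ہے۔"))
```

In an intrinsic setting, each segment of a document would be mapped to such a vector and classified by how its style compares with labeled segments. A real experiment would need a labeled corpus, feature scaling, and a held-out evaluation split, none of which are shown here.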
