An Identification of Fake Contents Using Text-mining Techniques

Saqlain Sajjad; Hafiz Muhammad Ghazi; Muhammad Asgher Nadeem; Muhammad Irfan Habib; Muhammad Salman Saeed; Syed Ali Hasnain Naqvi; Zeeshan Arfeen; Isheeaq Naeem; Muhammad Irfan

Authors

Saqlain Sajjad Department of Computer Science, University of Management and Technology Sialkot Campus, Pakistan
Hafiz Muhammad Ghazi Department of Information Engineering Technology, National Skills University Islamabad, Islamabad, 44310, Pakistan
Muhammad Asgher Nadeem Thal University Bhakhar Punjab, Pakistan
Muhammad Irfan Habib Department of Electrical Engineering Technology, National Skills University Islamabad, Islamabad, 44310, Pakistan
Muhammad Salman Saeed Multan Electric Power Company (MEPCO), Multan, Pakistan
Syed Ali Hasnain Naqvi Faculty of Social Sciences, Sir Syed University of Engineering and Technology (SSUET), Karachi, Pakistan
Zeeshan Arfeen Department of Electrical Engineering Technology, National Skills University Islamabad, Islamabad, 44310, Pakistan
Isheeaq Naeem University of Management and Technology, Sialkot, Pakistan
Muhammad Irfan Department of Computer Science, University of Management and Technology Sialkot Campus, Pakistan

Keywords:

Fake Content, Text Mining, Identifications, Text Analysis, Techniques

Abstract

In recent years, social media users have become increasingly concerned about sharing content that may be unpleasant or harmful, a concern amplified by the widespread use of platforms such as Facebook and Twitter. The primary objective of our approach is to accelerate and automate the detection of offensive content posted on these platforms, making it easier to take action and filter harmful communications. A benchmark dataset, OLID 2019 (Offensive Language Identification Dataset), is available online for this task, and our study focuses on identifying whether a tweet is offensive. We compared several feature extraction methods and model-building algorithms, and our comparative analysis found decision trees to be the most effective model. Decision trees applied to the normalized dataset resulted in an 84% improvement in the Macro F1 score, which aligns with previous research. A real-time system could be developed across multiple social media platforms to detect and evaluate objectionable posts, enabling timely interventions that promote healthier online behavior and a positive societal impact.
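The abstract reports its results as a Macro F1 score. As a minimal sketch of how that metric is generally computed (the unweighted mean of the per-class F1 scores), the following Python function may help; the function name `macro_f1` and the toy labels are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (OFF = offensive, NOT = not offensive); not real OLID data.
y_true = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "OFF", "NOT", "NOT", "NOT"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Because each class contributes equally regardless of its frequency, Macro F1 penalizes a classifier that performs well only on the majority class, which is why it is commonly used for imbalanced tasks such as offensive-language detection.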
