Using Machine Learning and LLM for Classifying of Benign and Malignant Cells from Breast Cancer Dataset

Aqeel Ahmed Khan; Bushra Shaheen; Masroor Ahmed

Authors

Aqeel Ahmed Khan, Department of Computer Science, Capital University of Science and Technology, Islamabad, Pakistan
Bushra Shaheen, Department of Computer Science, A.Q. Khan Institute of Computer Sciences & Information Technology (KICSIT), Kahuta, Pakistan
Masroor Ahmed, Department of Computer Science, Capital University of Science and Technology, Islamabad, Pakistan

Keywords:

Breast Cancer Classification, Machine Learning, Large Language Models, BioBERT, Dimensionality Reduction, Wisconsin Breast Cancer Dataset, Transfer Learning, Clinical Decision Support

Abstract

Breast cancer is the most frequently diagnosed cancer and the leading cause of cancer death among women worldwide, and early detection significantly improves patient outcomes. This paper presents a detailed comparison of traditional machine learning and large language model approaches to breast cancer classification, and introduces a new system that transforms tabular cytological data into clinically meaningful text prompts for use with BioBERT. Five classical machine learning methods (MLP, SVM, RF, KNN, and DT) and three dimensionality reduction algorithms (PCA, LDA, and FA) were tested on the Wisconsin Breast Cancer dataset. BioBERT, a domain-specific language model, was fine-tuned for binary classification of the transformed text representations. Class imbalance was addressed with SMOTE, which produced a balanced dataset of 888 samples. Among the traditional methods, the Support Vector Machine combined with Factor Analysis achieved the highest accuracy (98.64% ± 0.42% accuracy, 98.92% ± 0.38% precision, and 98.21% ± 0.51% recall under five-fold cross-validation; p < 0.05 compared to the baseline MLP). Factor Analysis was chosen on empirical grounds, as the highest classification accuracy was obtained with the Factor Analysis parameters set to an outlier threshold of 0.3. After a final hyperparameter optimization over five trial configurations, the BioBERT-based method reached 97.75% (± 0.63) accuracy with a strong precision-recall balance of 97.78%. Although the classical machine learning model was slightly more accurate (by 0.89 percentage points), the large language model approach offers several benefits: transfer learning from large-scale biomedical corpora, better semantic representation of clinical concepts, and inherent scalability to multimodal medical data. Both techniques achieved clinically reliable performance above 97% accuracy, indicating high potential for supporting diagnostic decision-making.
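
The tabular-to-text step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual prompt template or feature schema, so everything here is an assumption: the feature names follow the nine 1-10 cytological scores of the original Wisconsin Breast Cancer dataset, and the function name `row_to_prompt` and the sentence wording are hypothetical. The resulting string is the kind of input that could then be tokenized and passed to a fine-tuned BioBERT classifier.

```python
# Hypothetical feature schema: assumed to be the nine 1-10 cytological scores
# of the original Wisconsin Breast Cancer dataset. The paper's real schema and
# prompt wording are not stated in the abstract.
FEATURES = [
    "clump thickness",
    "uniformity of cell size",
    "uniformity of cell shape",
    "marginal adhesion",
    "single epithelial cell size",
    "bare nuclei",
    "bland chromatin",
    "normal nucleoli",
    "mitoses",
]


def row_to_prompt(row):
    """Serialize one tabular record (dict of feature -> score) into one sentence."""
    missing = [name for name in FEATURES if name not in row]
    if missing:
        raise KeyError(f"missing features: {missing}")
    # Keep a fixed feature order so every record yields the same text layout.
    parts = [f"{name} is {row[name]} out of 10" for name in FEATURES]
    return "Cytology findings: " + "; ".join(parts) + "."


# Illustrative record with made-up values, not taken from the paper.
example = {
    "clump thickness": 5,
    "uniformity of cell size": 1,
    "uniformity of cell shape": 1,
    "marginal adhesion": 1,
    "single epithelial cell size": 2,
    "bare nuclei": 1,
    "bland chromatin": 3,
    "normal nucleoli": 1,
    "mitoses": 1,
}

prompt = row_to_prompt(example)
print(prompt)
```

A fixed feature order and a uniform phrase per feature keep the text representation deterministic, which makes the language model's inputs comparable across samples; richer templates (for example, mapping scores to qualitative terms) are possible but would be a further assumption.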
