International Journal of Drug Delivery Technology
Volume 16, Issue 3s

A Novel MCSUT Technique based on FastText Embedding for Improving Multi-URL Classification and Cybersecurity Performance

Zafar Ali 1*, Siti Sophiayati Yuhaniz 2, Wan Noor Hamiza 3, Jawaid Ahmed Siddiqui 4, Noureen 5, Husham M. Ahmed 6

1*Faculty of Artificial Intelligence, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
2Faculty of Artificial Intelligence, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
3Faculty of Artificial Intelligence, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
4Department of Computer Science, Sukkur IBA University, Sukkur 65200, Pakistan
5Department of Applied Computing and Artificial Intelligence, Universiti Teknologi Malaysia, Johor, Malaysia
6College of Engineering, University of Technology, Bahrain, Kingdom of Bahrain

Author information

1*Email: ali.z@graduate.utm.my
2Email: sophia@utm.my
3Email: wannoorhamiza@utm.my
4Email: jawaid@iba-suk.edu.pk
5Email: noureen@graduate.utm.my
6Email: hmahmed@utb.edu.bh


ABSTRACT

The exponential growth of web content requires efficient URL-based classification. Current methodologies rely on public URL classification datasets that fall into two categories: general-purpose datasets, such as DMOZ, Web Proxy Data, and WebKB, and cybersecurity datasets, such as phishing, OpenPhishing, URLNet, Web Spam, and malicious URL datasets. These datasets suffer from class imbalance, noise, and ambiguity, which degrade the performance of URL classification models. To address these limitations, this study proposes a multiple contextual semantic URL tokens (MCSUT) augmentation technique that improves the quality of URL classification datasets by reducing the noise and ambiguity contained in the URLs. The strength of the MCSUT technique lies in its use of contextual and semantic URL tokens that are derived from the original URL tokens using the WordNet lexical database and the Word2Vec and FastText neural word embeddings. These contextually and semantically rich tokens enhance the ability of deep neural networks to interpret URLs. The study reports experiments with the three techniques on two datasets (DMOZ and phishing) and develops data schemes for both datasets using the contextual and semantic tokens. MCSUT based on FastText neural word embeddings outperformed previous studies, achieving an F1 score of 0.8625, higher than the WordNet and Word2Vec variants and the baselines, and achieved an F1 score of 0.99 on the phishing dataset.

Keywords: URL Classification; Cybersecurity; FastText Embedding; Deep Neural Networks; Data Augmentation.

How to cite this article: Ali Z, Yuhaniz SS, Hamiza WN, Siddiqui JA, Noureen, Ahmed HM. A Novel MCSUT Technique based on FastText Embedding for Improving Multi-URL Classification and Cybersecurity Performance. Int J Drug Deliv Technol. 2026;16(3s):612-624. DOI: 10.25258/ijddt.16.3s.78
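
The token augmentation idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`tokenize_url`, `nearest_tokens`, `augment_url`), the stop-token list, and the three-dimensional `TOY_VECTORS` table are all hypothetical, and the toy table only stands in for vectors that a trained embedding model such as FastText would supply. The sketch splits a URL into tokens and appends, for each token, its closest neighbour by cosine similarity.

```python
import math
import re

# Hypothetical toy vectors standing in for a trained embedding model.
TOY_VECTORS = {
    "sports": (0.9, 0.1, 0.0),
    "football": (0.8, 0.2, 0.1),
    "soccer": (0.85, 0.15, 0.05),
    "news": (0.1, 0.9, 0.0),
    "headlines": (0.15, 0.85, 0.1),
    "bank": (0.0, 0.1, 0.9),
    "login": (0.1, 0.0, 0.8),
}

# Assumed list of uninformative URL fragments; a real pipeline may differ.
STOP_TOKENS = {"http", "https", "www", "com", "org", "net", "html"}


def tokenize_url(url):
    """Split a URL into lowercase alphabetic tokens and drop stop tokens."""
    tokens = re.findall(r"[a-z]+", url.lower())
    return [t for t in tokens if t not in STOP_TOKENS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_tokens(token, vectors, k=1):
    """Return the k vocabulary tokens most similar to `token`."""
    if token not in vectors:
        return []  # out-of-vocabulary in this toy table
    scored = [
        (cosine(vectors[token], vec), other)
        for other, vec in vectors.items()
        if other != token
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def augment_url(url, vectors, k=1):
    """Keep the original tokens and add each token's nearest neighbours once."""
    augmented = []
    for token in tokenize_url(url):
        if token not in augmented:
            augmented.append(token)
        for neighbour in nearest_tokens(token, vectors, k):
            if neighbour not in augmented:
                augmented.append(neighbour)
    return augmented


if __name__ == "__main__":
    print(augment_url("http://www.example.com/sports/football.html", TOY_VECTORS))
```

In this sketch the original tokens are always retained and neighbours are added only once, so the augmented sequence stays short; the number of neighbours `k` is an assumed tuning parameter.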