Evaluation Of Linear Interpolation Smoothing On Naive Bayes Spam Classifier
AUTHOR(S)
Adewole A.P., Fakorede O.J., Akwuegbo S.O.N.
KEYWORDS
Naïve Bayes, Smoothing, Linear Interpolation, Spam, Ham, False Positives, False Negatives.
ABSTRACT
The inconvenience associated with spam and the cost of having an important mail misclassified as spam have made every effort at improving spam filtering worthwhile. The Naive Bayes algorithm has been found successful at classifying mail correctly; however, it is not perfect. Recent research has introduced the idea of smoothing into the Naive Bayes algorithm and has shown that it produces better classification. This study applies linear interpolation smoothing to Naive Bayes spam classification. The resulting classifier improved spam classification and also reduced false positives.
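The paper's exact formulation is not given in this abstract, but linear interpolation (Jelinek-Mercer) smoothing is conventionally defined as blending the class-conditional maximum-likelihood word estimate with a background (corpus-wide) estimate via a mixing weight. A minimal sketch, assuming that standard formulation; the function name, toy counts, and the weight `lam=0.7` are illustrative assumptions, not values from the paper:

```python
from collections import Counter

def interpolated_prob(word, class_counts, corpus_counts, lam=0.7):
    """Jelinek-Mercer linear interpolation:
    P(word | class) = lam * P_ML(word | class) + (1 - lam) * P_background(word).
    lam is a hypothetical mixing weight, not the paper's tuned value."""
    class_total = sum(class_counts.values())
    corpus_total = sum(corpus_counts.values())
    # Counter returns 0 for unseen words, so the ML estimate can be zero...
    p_class = class_counts[word] / class_total if class_total else 0.0
    p_corpus = corpus_counts[word] / corpus_total if corpus_total else 0.0
    # ...but the background term keeps the smoothed probability nonzero.
    return lam * p_class + (1 - lam) * p_corpus

# Toy counts: "offer" never appears in ham training mail, yet its
# smoothed ham probability stays above zero instead of collapsing to 0.
ham = Counter({"meeting": 3, "report": 2})
background = Counter({"meeting": 3, "report": 2, "offer": 5})
print(interpolated_prob("offer", ham, background, lam=0.7))
```

Avoiding zero probabilities this way is what lets a Naive Bayes classifier score a mail containing a word unseen in one class without that single word forcing the decision, which is the usual motivation for smoothing in this setting.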
REFERENCES
[1] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification", AAAI/ICML-98 Workshop on Learning for Text Categorization, AAAI Press, pages 41–48, 1998.
[2] C. Zhai, and J. Lafferty, “The Dual Role of Smoothing in the Language Modelling Approach”. In Proceedings of the Workshop on Language Models for Information Retrieval (LMIR) 2001, pages 31–36, 2001.
[3] C. Zhai and J. Lafferty, “A Study of Smoothing Methods for Language Models Applied to ad hoc Information Retrieval”, 2001.
[4] D. Vilar, H. Ney, A. Juan, and E. Vidal, "Effect of Feature Smoothing Methods in Text Classification Tasks", in International Workshop on Pattern Recognition in Information Systems, pages 108–117, Porto, Portugal, 2004.
[5] D. Anderson, "Statistical Spam Filtering", http://www.web.eecs.umich.edu/rthomaso/courses/nlp2006/David_Anderson.pdf, 2006.
[6] D. Metz, "International Business Machines (IBM)", accessed 27th February 2014.
[7] F. Jelinek and R.L. Mercer, "Interpolated estimation of Markov source parameters from sparse data", in Proc. Workshop on Pattern Recognition in Practice, pages 381–397, Amsterdam, 1980.
[8] H.C. Hong (STEVEN), “Statistical Machine Learning for Data Mining and Collaborative Multimedia Retrieval”, The Chinese University of Hong Kong, 2006.
[9] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. Spyropoulos, and P. Stamatopoulos, "Learning to filter spam e-mail: A comparison of a Naive Bayesian and a memory-based approach", Workshop on Machine Learning and Textual Information Access, 4, 2000.
[10] I. Androutsopoulos, G. Paliouras, and E. Michelakis, "Learning to Filter Unsolicited Commercial E-Mail", Athens University of Economics and Business and National Centre for Scientific Research "Demokritos", 2004.
[11] J. Kagstrom, "Improving Naive Bayesian Spam Filtering", Mid Sweden University, Sweden, 2005.
[12] K. Tretyakov, "Machine Learning Techniques in Spam Filtering", Institute of Computer Science, University of Tartu, Estonia, pp. 35, 78, 2004.
[13] Lingspam Corpus [Online], Available: http://csmining.org/index.php/lingspamdatasets.html
[14] N.A. Abdulmutalib, "Language Models and Smoothing Methods for Information Retrieval", Ph.D. dissertation, Department of Computer Science, University of Dortmund, Dortmund, Germany, 2010.
[15] Q. Yuan, G. Cong, and N.M. Thalmann, "Enhancing Naive Bayes with Various Smoothing Methods for Short Text Classification", in Proc. of the 21st International Conference Companion on World Wide Web, WWW '12 Companion, 2012.
[16] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail", 1998.
[17] S.F. Chen and J. Goodman, "An Empirical Study of Smoothing Techniques for Language Modeling", Computer Science Group, Harvard University, Cambridge, Massachusetts, 1998.
[18] S.T. Guzella and W.M. Caminhas, "A review of machine learning approaches to spam filtering", Department of Electrical Engineering, Federal University of Minas Gerais, Brazil, 2009 (www.elsevier.com/locate/eswa).
[19] Wikipedia, "Naïve Bayes Classifier", http://en.wikipedia.org/wiki/Naive_Bayes_classifier, 2014.
[20] X. Zhou, X. Zhang, X. Hu, “Semantic Smoothing for Bayesian Text Classification with Small Training Data”, College of Information Science & Technology, Drexel University, Philadelphia, 2008.
[21] Y. Yang and J.O. Pedersen, "A comparative study on feature selection in text categorization", in Fisher, D.H., ed.: Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, US, Morgan Kaufmann Publishers, San Francisco, pages 412–420, 1997.
