Big data feature selection and projection for gender prediction based on user web behaviour


Gülşen E., GÜNDÜZ H., ÇATALTEPE Z., Serinol L.

2015 23rd Signal Processing and Communications Applications Conference (SIU), Malatya, Turkey, 16 - 19 May 2015

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI Number: 10.1109/siu.2015.7130141
  • City of Publication: Malatya, Turkey
  • Country of Publication: Türkiye
  • Keywords: gender prediction, multimodal classification, feature selection, information gain, chi-square, singular value decomposition, Turkish web mining
  • Affiliated with Kocaeli Üniversitesi: No

Abstract

Predicting a visitor's gender and other demographic information helps present appropriate content to the user. In this paper, we perform gender prediction based on Turkish users' web log data. Our methods use three different sets of features, namely the URLs (Uniform Resource Locators), the textual contents, and the DMOZ (from directory.mozilla.org) hierarchies of the pages visited by each user. Since the input dataset is sparse and high-dimensional, we first apply Information Gain and Chi-square based feature selection. We use a MapReduce-based approach to compute these feature relevance measures. We also apply the stochastic singular value decomposition (SSVD) feature projection method. We present gender classification results, based on these feature selection and projection methods, using the Logistic Regression classifier. The best performance is obtained by applying the Logistic Regression classifier to the selected URL features.
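
As an illustration of the two feature relevance measures named in the abstract, the following is a minimal single-machine sketch, not the authors' MapReduce implementation. It assumes each user is represented as a set of binary features (for example, visited URLs) with one gender label. The function names `count_pairs`, `entropy`, and `score_feature`, and the data layout, are hypothetical and chosen only for this example. The counting step mimics the map and reduce phases: emit one (feature, label) pair per user, then sum the counts.

```python
import math
from collections import Counter


def count_pairs(users):
    """Count (feature, label) co-occurrences and label totals.

    `users` is an iterable of (features, label) tuples (assumed layout).
    This plays the role of a map step (emit pairs) plus a reduce step (sum).
    """
    pair_counts = Counter()
    label_counts = Counter()
    for features, label in users:
        label_counts[label] += 1
        for feature in set(features):  # binary presence, so deduplicate
            pair_counts[(feature, label)] += 1
    return pair_counts, label_counts


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def score_feature(feature, pair_counts, label_counts):
    """Return (information gain, chi-square) for one binary feature."""
    labels = sorted(label_counts)
    n = sum(label_counts.values())
    # Contingency table rows: users with the feature, users without it.
    present = [pair_counts[(feature, y)] for y in labels]
    absent = [label_counts[y] - p for y, p in zip(labels, present)]

    # Information gain: H(label) - H(label | feature present or absent).
    h_label = entropy([label_counts[y] for y in labels])
    h_conditional = sum(sum(row) / n * entropy(row) for row in (present, absent))
    info_gain = h_label - h_conditional

    # Chi-square: sum over cells of (observed - expected)^2 / expected.
    chi2 = 0.0
    for row in (present, absent):
        row_total = sum(row)
        for y, observed in zip(labels, row):
            expected = row_total * label_counts[y] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return info_gain, chi2
```

Under these assumptions, features would be ranked by either score and only the top-scoring ones passed to the classifier. In a distributed setting, only the counting step needs to run over the full log data, since both scores depend solely on the aggregated counts.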