Gender Classification using Twitter Text Data

Pradeep Vashisth, Kevin Meehan

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

30 Citations (Scopus)

Abstract

Content-sharing websites such as social media platforms have become increasingly popular in many countries across the world. Classifying the gender of a person based on the short messages posted there is an interesting research area that could benefit legal investigation, forensics, marketing analysis, advertising and recommendation. This research explores the use of Natural Language Processing (NLP) techniques and tweets in a gender classification system. The investigation compares multiple techniques in this context: Bag of Words (Term Frequency-Inverse Document Frequency), Word Embedding (W2Vec, GloVe) and traditional Machine Learning techniques (Logistic Regression, Support Vector Machine and Naïve Bayes). A new dataset comprising user gender and associated tweets was generated for this study, because no public standard dataset with the required volume was available. The results show that the traditional Bag of Words model did not yield any significant classification results, whereas the word embedding models performed significantly better across multiple machine learning techniques. The word embedding models were therefore the most effective technique for classifying gender based on Twitter text data.
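As an illustration of the Bag of Words (TF-IDF) representation named in the abstract, the following is a minimal Python sketch. It assumes whitespace tokenization and the plain weighting tf × log(N/df); this record does not state the tokenization or weighting variant used in the study, and the helper name `tfidf` and the toy messages are invented for illustration only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper).

    Assumes whitespace tokenization and weight = tf * log(N / df);
    the study's actual preprocessing is not given in this record.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Invented toy messages, not drawn from the study's dataset.
toy_docs = ["great match today", "great new recipe today", "new phone review"]
weights = tfidf(toy_docs)
```

Under this weighting, a term that occurs in only one message (such as "match") receives a higher weight than a term shared by several messages (such as "great"), and a term present in every message receives a weight of zero.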

Original language: English
Title of host publication: 2020 31st Irish Signals and Systems Conference, ISSC 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728194189
Publication status: Published - Jun 2020
Event: 31st Irish Signals and Systems Conference, ISSC 2020 - Letterkenny, Ireland
Duration: 11 Jun 2020 to 12 Jun 2020

Publication series

Name: 2020 31st Irish Signals and Systems Conference, ISSC 2020

Conference

Conference: 31st Irish Signals and Systems Conference, ISSC 2020
Country/Territory: Ireland
City: Letterkenny
Period: 11/06/20 to 12/06/20

Keywords

  • Gender Classification
  • Machine Learning
  • Natural Language Processing
  • Twitter
  • Word Embedding
