TY - GEN
T1 - Gender Classification using Twitter Text Data
AU - Vashisth, Pradeep
AU - Meehan, Kevin
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/6
Y1 - 2020/6
N2 - Content-sharing websites such as social media platforms have become very popular in many countries across the world. Classifying the gender of a person based on the short messages posted on these platforms is an interesting research area that could benefit legal investigation, forensics, marketing analysis, advertising and recommendation. This research explores the use of Natural Language Processing (NLP) techniques and tweets in a gender classification system. The investigation compares multiple techniques, including Bag of Words (Term Frequency-Inverse Document Frequency), Word Embedding (W2Vec, GloVe) and traditional Machine Learning techniques (Logistic Regression, Support Vector Machine and Naïve Bayes), in this context. A new dataset comprising user gender and associated tweets was generated for this study, because no public standard dataset with the required volume was available. The results show that the traditional Bag of Words model did not provide any significant results in classification, whereas the word embedding models performed significantly better across multiple machine learning techniques. The word embedding models were therefore found to be the most effective technique for classifying gender based on Twitter text data.
KW - Gender Classification
KW - Machine Learning
KW - Natural Language Processing
KW - Twitter
KW - Word Embedding
UR - http://www.scopus.com/inward/record.url?scp=85092731728&partnerID=8YFLogxK
U2 - 10.1109/ISSC49989.2020.9180161
DO - 10.1109/ISSC49989.2020.9180161
M3 - Conference contribution
AN - SCOPUS:85092731728
T3 - 2020 31st Irish Signals and Systems Conference, ISSC 2020
BT - 2020 31st Irish Signals and Systems Conference, ISSC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 31st Irish Signals and Systems Conference, ISSC 2020
Y2 - 11 June 2020 through 12 June 2020
ER -