Using Naive Bayes Classification to Identify a Twitter User's Gender
I have become part of a project at school that has been a lot of fun so
far and it just got a little bit more interesting. I have roughly 600,000
tweets in my possession (each contains screen name, geo location, text,
etc.) and my goal is to try to classify each user as either male or
female. Using Twitter4J, I can get the user's full name, number of
friends, retweets, etc., so I was wondering whether combining a lookup of
the user's name with text analysis would be a workable approach. I
was originally thinking I could make this a rule-based classifier
where I could first look at the user's name then analyze their text and
attempt to come to a conclusion of M or F. I'm guessing I would have
trouble using a supervised method such as Naive Bayes, since I don't have
ground-truth gender labels for any of the users?
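
To make the Naive Bayes question concrete, here is a minimal sketch of how it
might still work if the name lookup supplied approximate labels for the users
whose names are unambiguous. Everything in it is an assumption for
illustration: the function names (tokenize, train_naive_bayes, classify), the
whitespace tokenizer, the add-one smoothing, and the toy training texts, which
are made up and not taken from my tweets.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a real tokenizer would handle URLs, mentions, etc.)."""
    return text.lower().split()


def train_naive_bayes(labeled_texts):
    """Count words per label from (label, text) pairs.

    The labels are assumed to come from a name lookup, so they are noisy
    stand-ins for real ground truth.
    """
    doc_counts = Counter()              # number of training texts per label
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for label, text in labeled_texts:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior, using add-one smoothing."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data standing in for tweets from users with unambiguous names.
training = [
    ("M", "late train again this morning"),
    ("M", "train delayed again"),
    ("F", "coffee before class today"),
    ("F", "more coffee before the exam"),
]
model = train_naive_bayes(training)
print(classify(model, "train delayed"))
print(classify(model, "coffee today"))
```

The catch is that any mistakes in the name-based labels carry straight into
the classifier, so a small hand-labeled sample would still be needed to check
how well it does.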
For the names, I would check some kind of name dictionary to decide
whether a name is typically male or female. I know there are cases
where it's hard to tell, but that's why I'd be looking at the tweet text
as well. I also forgot to mention: across these 600,000 tweets, I have at
least two tweets per user available to me.
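
The name check and the rule-based fallback could be sketched roughly as
follows. The NAME_GENDER table is a tiny hypothetical stand-in for a real name
dictionary, and gender_from_name and classify_user are names I made up for
this example. The text_guess argument stands for whatever the text analysis
would produce.

```python
import re

# Hypothetical mini-dictionary; a real one would be loaded from a names dataset.
# Names commonly used for both genders (e.g. "taylor") are deliberately left out.
NAME_GENDER = {
    "james": "M",
    "john": "M",
    "mary": "F",
    "linda": "F",
}


def gender_from_name(full_name):
    """Return "M", "F", or None when the first name is missing or not in the dictionary."""
    # Only ASCII letters are matched here; accented names would need more care.
    tokens = re.findall(r"[a-z]+", full_name.lower())
    if not tokens:
        return None
    return NAME_GENDER.get(tokens[0])


def classify_user(full_name, text_guess=None):
    """Rule-based combination: trust the name if it is listed, else use the text-based guess."""
    return gender_from_name(full_name) or text_guess


print(classify_user("Mary Smith"))         # name is in the dictionary
print(classify_user("Taylor Jones", "M"))  # falls back to the text-based guess
```

Returning None for unknown or ambiguous names keeps those users separate, so
the text analysis only has to decide the cases the dictionary cannot.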
Any ideas or input on classifying a user's gender would be greatly
appreciated! I don't have a ton of experience in this area and I'm looking
to learn anything I can get my hands on.