How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?
See below for 50 tweets about "apple." I have hand-labeled the positive matches about Apple Inc.; they are marked as 1 below.
Here are a couple of lines:
1|“@chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!
0|“@Zach_Paull: When did green skittles change from lime to green apple? #notafan” @Skittles
1|@dtfcdvEric: @MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No.
0|@STFUTimothy have you tried apple pie shine?
1|#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx @SuryaRay
Here is the total data set: http://pastebin.com/eJuEb4eB
I need to build a model that classifies "Apple" (Inc.) from the rest.
I'm not looking for a general overview of machine learning; rather, I'm looking for an actual model in code (Python preferred).
I would do it as follows:
- Split each tweet into words, normalise them, and build a dictionary.
- For each word, store how many times it occurred in tweets about the company and how many times it occurred in tweets about the fruit. These tweets must be labelled by a human.
- When a new tweet comes in, look up every word of the tweet in the dictionary and calculate a weighted score: words used frequently in relation to the company get a high company score, and vice versa; words used rarely, or used with both the company and the fruit, contribute little to the score.
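The steps above could be sketched roughly as follows. This is only a minimal illustration, not a tuned model: the function names (tokenize, train, score, classify) are made up for this example, the weighting is assumed to be a log-ratio of add-one-smoothed word frequencies (one reasonable choice among several), and the training set is just the five sample tweets from the question rather than the full data set.

```python
import math
import re
from collections import Counter

# The five hand-labelled sample tweets from the question (1 = company, 0 = fruit).
SAMPLE = [
    (1, "“@chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!"),
    (0, "“@Zach_Paull: When did green skittles change from lime to green apple? #notafan” @Skittles"),
    (1, "@dtfcdvEric: @MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No."),
    (0, "@STFUTimothy have you tried apple pie shine?"),
    (1, "#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx @SuryaRay"),
]

def tokenize(text):
    """Normalise a tweet: lowercase it, drop URLs, split into word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z0-9']+", text)

def train(labeled_tweets):
    """Build the dictionary: per-class counts of how often each word occurred."""
    counts = {0: Counter(), 1: Counter()}
    for label, text in labeled_tweets:
        counts[label].update(tokenize(text))
    return counts

def score(counts, text):
    """Sum per-word log-ratios; positive leans 'company', negative leans 'fruit'.

    Add-one smoothing (an assumption of this sketch) keeps rare words from
    dominating, and words seen about equally in both classes score near zero.
    """
    vocab = set(counts[0]) | set(counts[1])
    totals = {c: sum(counts[c].values()) + len(vocab) for c in (0, 1)}
    total_score = 0.0
    for word in tokenize(text):
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_company = (counts[1][word] + 1) / totals[1]
        p_fruit = (counts[0][word] + 1) / totals[0]
        total_score += math.log(p_company / p_fruit)
    return total_score

def classify(counts, text):
    """Return 1 for Apple Inc., 0 otherwise."""
    return 1 if score(counts, text) > 0 else 0

model = train(SAMPLE)
print(classify(model, "Apple inc unveils new iOS features"))
print(classify(model, "green apple pie"))
```

With only five training tweets the scores are very noisy, so you would want to train on the full labelled set and hold some tweets back to check accuracy. Note that the word "apple" itself ends up with a score close to zero because it appears in both classes, which is the behaviour described in the last bullet.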