Chapter 4 Missing values
4.1 Twitter Dataset
We analyzed the missing values before forming our final dataset. Of all the features, we first chose 8 to consider. Seven of the 8 features were introduced in the previous section; the only feature that we left out was `geo`. The missing value graph supports that decision. Note also that the added features are not presented here: since they are derived from `text`, they do not contain any missing values.
Here we use the whole raw dataset, which contains tweets from October 28th to November 3rd, for a total of 471,996 tweets. The following is a missing value graph of our raw dataset. We slightly modified the feature names to keep the graph clean.
Feature | Missing count |
---|---|
geo | 471944 |
location | 466755 |
full_text | 277690 |
No.follower | 56 |
tweet_id | 56 |
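The counts in the table can be converted into valid-value counts and missing percentages. The following is a minimal sketch of that arithmetic, using only the numbers reported in this section; the function name `missing_summary` is a hypothetical helper for illustration, not part of the original analysis.

```python
# Missing-value counts copied from the table above; the total is the
# number of raw tweets reported in the text.
TOTAL_TWEETS = 471_996
missing_counts = {
    "geo": 471_944,
    "location": 466_755,
    "full_text": 277_690,
    "No.follower": 56,
    "tweet_id": 56,
}


def missing_summary(counts, total):
    """Return {feature: (missing, valid, percent_missing)} for each feature."""
    summary = {}
    for feature, missing in counts.items():
        valid = total - missing
        summary[feature] = (missing, valid, 100.0 * missing / total)
    return summary


summary = missing_summary(missing_counts, TOTAL_TWEETS)
for feature, (missing, valid, pct) in summary.items():
    print(f"{feature:12s} missing={missing:>7d} valid={valid:>7d} ({pct:.2f}% missing)")
```

For `location`, the valid count comes out to 471,996 - 466,755 = 5,241.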
As shown in the graph, the `geo` feature has the most missing values, close to 100%. According to the documentation, `geo` is the tweet location tagged by the user, so this indicates that Twitter users seldom specify a location for their tweets. Because of the high missing percentage, we chose to remove the feature. Another column with largely missing data is `location`, with only 5,241 valid values. `location` is a location self-identified by the user and has a format like ‘Manhattan, NY’. We think that we will be able to derive some geographical patterns from the limited data we have. The missing values in `full_text` are interesting. `text` can only hold up to 140 characters, and if a tweet exceeds the limit, `full_text` is the feature that captures the complete contents. We see that around half of the users write short texts when posting tweets related to Elon Musk.
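The relationship between the two text fields described above suggests a simple fallback rule when reading a tweet's content: use the complete text when it exists, and the short text otherwise. The sketch below illustrates that rule under stated assumptions. It assumes a missing value is represented as `None` or an empty string, the helper `complete_text` is hypothetical, and the sample records are invented for illustration rather than taken from the dataset.

```python
def complete_text(tweet):
    """Return the tweet content, preferring full_text when it is present."""
    full_text = tweet.get("full_text")
    if full_text:  # None or an empty string is treated as missing
        return full_text
    return tweet.get("text")


# Hypothetical records for illustration only (not taken from the dataset).
tweets = [
    {"text": "Short tweet about Elon Musk", "full_text": None},
    {
        "text": "A long tweet that was cut off...",
        "full_text": "A long tweet that was cut off in text but is stored here in full.",
    },
]

contents = [complete_text(t) for t in tweets]
for content in contents:
    print(content)
```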
Looking at the missing patterns, more than half of the tweets are missing `geo` and `location`, and around 46% of the tweets have an additional missing value in `full_text`. The other three missing patterns are trivial. We do notice one missing pattern that has a missing value in every feature: we found that there were 56 empty rows in our dataset, and we remove those rows in a later section.
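A missing pattern is the set of features that are missing together in a row. The sketch below shows one way such patterns could be counted and how rows that are missing every feature could be dropped. It is not the code used for the graph: the feature subset, the `is_missing` and `missing_pattern` helpers, and the sample rows are all hypothetical, and missing values are assumed to be `None` or empty strings.

```python
from collections import Counter

# A subset of the features, chosen for illustration only.
FEATURES = ["geo", "location", "full_text", "tweet_id"]


def is_missing(value):
    """Treat None and empty strings as missing (an assumption of this sketch)."""
    return value is None or value == ""


def missing_pattern(row, features):
    """Return the tuple of feature names that are missing in this row."""
    return tuple(f for f in features if is_missing(row.get(f)))


# Hypothetical rows for illustration only (not taken from the dataset).
rows = [
    {"geo": None, "location": None, "full_text": "a long tweet", "tweet_id": 1},
    {"geo": None, "location": None, "full_text": None, "tweet_id": 2},
    {"geo": None, "location": "Manhattan, NY", "full_text": None, "tweet_id": 3},
    {"geo": None, "location": None, "full_text": None, "tweet_id": None},  # empty row
    {"geo": None, "location": None, "full_text": None, "tweet_id": 4},
]

# Count how often each combination of missing features occurs.
pattern_counts = Counter(missing_pattern(r, FEATURES) for r in rows)
for pattern, count in pattern_counts.most_common():
    share = 100 * count / len(rows)
    print(f"{share:5.1f}%  missing: {', '.join(pattern) or 'nothing'}")

# Drop rows in which every feature is missing.
cleaned = [r for r in rows if len(missing_pattern(r, FEATURES)) < len(FEATURES)]
```

Counting patterns before dropping the empty rows makes the all-missing pattern visible, which is how such rows can be spotted in the first place.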