Chapter 3 Data Transformation

3.1 Feature Selection

The data set contains 35 features, several of which have nested sub-features. After careful consideration, we selected 7 features to construct our data set for analysis. The following table describes all selected variables.

Table 3.1: Feature Overview
Features Description
created_at creation time of the tweet
id_str tweet ID
text tweet text, truncated to at most 140 characters
user_id user ID
followers_count number of followers of the user
location user-specified location of the tweet
full_text full text of tweets longer than 140 characters

Of the 7 features, user_id and followers_count are sub-features of user. They capture the ID and the number of followers of a user, respectively. location is a sub-feature of place; it is a user-identified location, a free-text field in which users can enter anything. full_text is a sub-feature of extended_tweet and captures tweets that are longer than 140 characters. We did not keep the feature geo because we found that it is missing for almost 100% of the tweets; this will be shown in the missing-value section to support our decision.

After selecting our basic features, we noticed that tweet text is stored in two different features. If a tweet is longer than 140 characters, it is truncated and stored in text, while the original tweet is stored in full_text. To make our analysis more convenient, we add a new variable, original_text, that always holds the complete text of a tweet.
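This merge can be sketched in base R as follows. The data frame and column contents here are toy stand-ins for the real data set; only the column names come from the features above.

```r
# Toy data frame standing in for the real tweet data set.
tweets <- data.frame(
  text      = c("A short tweet", "A truncated tweet..."),
  full_text = c(NA, "A truncated tweet shown in full, over 140 characters"),
  stringsAsFactors = FALSE
)

# Use full_text when the tweet was extended (full_text present),
# otherwise fall back to the untruncated text field.
tweets$original_text <- ifelse(
  is.na(tweets$full_text), tweets$text, tweets$full_text
)
```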

3.2 Tokenization

Next, we conduct tokenization, a process that supports our later natural language processing analyses. Since tweets contain special tokens such as hashtags and usernames that other tokenizers might strip away, we use the dedicated tokenize_tweets() function from the tokenizers library. Below is a demonstration of how it tokenizes one of the tweets.

## [1] "@AGlobalCitizen @AlexanderBruyns @PPathole @elonmusk @SpaceX We would also somehow need to create some sort of artificial magnetic field around the entire planet to make sure that any biomass we take there isn't immediately destroyed by cosmic radiation"
## [[1]]
##  [1] "@AGlobalCitizen"  "@AlexanderBruyns" "@PPathole"        "@elonmusk"       
##  [5] "@SpaceX"          "we"               "would"            "also"            
##  [9] "somehow"          "need"             "to"               "create"          
## [13] "some"             "sort"             "of"               "artificial"      
## [17] "magnetic"         "field"            "around"           "the"             
## [21] "entire"           "planet"           "to"               "make"            
## [25] "sure"             "that"             "any"              "biomass"         
## [29] "we"               "take"             "there"            "isnt"            
## [33] "immediately"      "destroyed"        "by"               "cosmic"          
## [37] "radiation"

We then add word_tokens as a feature to our dataset.
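A minimal sketch of this step, assuming the tokenizers package is installed; the example tweet is a shortened, hypothetical one.

```r
library(tokenizers)  # provides tokenize_tweets()

# tokenize_tweets() keeps @usernames and #hashtags intact,
# whereas a generic word tokenizer would split them apart.
tweet <- "@elonmusk @SpaceX We would need an artificial magnetic field"
tokens <- tokenize_tweets(tweet)
tokens[[1]]  # character vector of tokens for the first (only) tweet
```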

3.3 Sentiment

Tweets carry sentiment, and here we try to classify each tweet as positive, negative, or neutral.

Before extracting sentiments from the tweets, we first need to clean the text so that it does not contain special characters such as hashtags, backslashes, @ mentions, website links, etc. Such characters might affect the accuracy of the sentiment score.

Below is a comparison between the original text and the cleaned text; note that @ symbols, hashtags, and website links are removed. The cleaned tweets are stored in a new feature, cleaned_text, in our dataset.

## [1] "@elonmusk @ElemonGame Great ❤️❤️❤️ @Corsair will help @Elemongame the first blockchain game ever owns hundred millions users. This is far beyond my imagination. To the Moon and Mars. @ElemonGame @CORSAIR #Elemon #Corsair love ❤️❤️❤️🚀🚀🚀 https://t.co/iakL1e8brw"
## [1] "elonmusk ElemonGame Great ❤️❤️❤️ Corsair will help Elemongame the first blockchain game ever owns hundred millions users. This is far beyond my imagination. To the Moon and Mars. ElemonGame CORSAIR Elemon Corsair love ❤️❤️❤️🚀🚀🚀 "
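The cleaning step can be sketched with base-R regular expressions. The exact patterns below are assumptions, since the text does not show them; the idea is to strip URLs, backslashes, and the @ and # symbols while keeping the words themselves, which matches the comparison above.

```r
# Sketch of tweet cleaning; the regex patterns are assumptions.
clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", "", x)  # remove website links
  x <- gsub("[@#]", "", x)           # drop @ and # but keep the word
  x <- gsub("\\\\", "", x)           # remove backslashes
  x
}

clean_tweet("@elonmusk to the Moon #Mars https://t.co/iakL1e8brw")
```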

Now, we determine the sentiment score of each tweet using the syuzhet library, whose default method is based on a custom sentiment dictionary developed in the Nebraska Literary Lab. The sentiment scores are stored in a new feature, sentiment_score.

We then classify each tweet into three categories: positive (score > 0), neutral (score = 0), and negative (score < 0). We add sentiment as a feature to our dataset to capture the sentiment category of each tweet.
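A minimal sketch of the scoring and classification, assuming the syuzhet package with its default "syuzhet" dictionary; the two example sentences are hypothetical stand-ins for cleaned_text.

```r
library(syuzhet)  # provides get_sentiment()

# Hypothetical cleaned tweets standing in for cleaned_text.
cleaned <- c("I love this great launch", "the rocket failed badly")

# Score each tweet with the syuzhet dictionary, then bucket the
# scores into positive (> 0), neutral (= 0), and negative (< 0).
scores <- get_sentiment(cleaned, method = "syuzhet")
sentiment <- ifelse(scores > 0, "positive",
             ifelse(scores < 0, "negative", "neutral"))
```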

3.4 Summary of added features

We added 5 new features to the dataset during processing, bringing the total to 12 features. These new features will help us conduct our analysis and visualizations. Below is a table describing all added features.

Table 3.2: Additional Feature Overview
Features Description
original_text complete text of the tweet
word_tokens list of word tokens
cleaned_text text after removing special characters
sentiment_score sentiment score of the tweet
sentiment sentiment category: positive, neutral, or negative