Chapter 3 Data transformation
3.1 Feature Selection
The raw data set contains 35 top-level features, several of which contain further sub-features. After careful consideration, we chose 7 features to construct our data set for analysis. The following table describes all selected variables.
Features | Description |
---|---|
created_at | creation time of the tweet |
id_str | tweet id |
text | tweet text, truncated to at most 140 characters |
user_id | user id |
followers_count | number of followers of the user |
location | location of the tweet |
full_text | full text of the tweet (for tweets longer than 140 characters) |
Of the 7 features, `user_id` and `followers_count` are sub-features of `user`; they capture the id and the number of followers of a user, respectively. `location` is a sub-feature of `place`; it is a user-identified location, so it can contain arbitrary text. `full_text` is a sub-feature of `extended_tweet` and captures tweets longer than 140 characters. We did not keep the feature `geo` because we found that almost 100% of its values are missing; it will appear in the missing-value section to support our decision.
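For concreteness, the selection step can be sketched as below. The file name and the exact nesting paths (`user$id_str`, `place$full_name`, `extended_tweet$full_text`) are assumptions for illustration, not our exact code.

```r
library(jsonlite)
library(dplyr)

# Read the raw tweets (file name assumed) and keep the seven
# selected features; the nested paths below are assumptions.
raw <- stream_in(file("tweets.json"))

tweets <- tibble(
  created_at      = raw$created_at,
  id_str          = raw$id_str,
  text            = raw$text,
  user_id         = raw$user$id_str,
  followers_count = raw$user$followers_count,
  location        = raw$place$full_name,
  full_text       = raw$extended_tweet$full_text
)
```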
After selecting our basic features, we noticed that tweet text is stored in two different features. If a tweet is longer than 140 characters, a truncated version is stored in `text` and the original tweet is stored in `full_text`; otherwise the complete tweet is already in `text`. To make our analysis more convenient, we add a new variable, `original_text`, to store the full text of every tweet.
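A minimal sketch of this step, assuming the `tweets` data frame and the column names from the table above:

```r
library(dplyr)

# full_text is NA for tweets that were never truncated,
# so fall back to text in that case.
tweets <- tweets %>%
  mutate(original_text = coalesce(full_text, text))
```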
3.2 Tokenization
Next, we conduct tokenization, a process that supports our other natural language processing analyses. Since tweets contain special tokens like hashtags and usernames that other tokenizers might strip away, we use the `tokenize_tweets()` function from the `tokenizers` library, which preserves them. Below is a demonstration of how it tokenizes one of the tweets.
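A call along the following lines produces the output shown below (the row index is illustrative):

```r
library(tokenizers)

# Print one tweet, then tokenize it; tokenize_tweets() keeps
# @mentions and #hashtags as single tokens.
tweets$original_text[1]
tokenize_tweets(tweets$original_text[1])
```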
## [1] "@AGlobalCitizen @AlexanderBruyns @PPathole @elonmusk @SpaceX We would also somehow need to create some sort of artificial magnetic field around the entire planet to make sure that any biomass we take there isn't immediately destroyed by cosmic radiation"
## [[1]]
## [1] "@AGlobalCitizen" "@AlexanderBruyns" "@PPathole" "@elonmusk"
## [5] "@SpaceX" "we" "would" "also"
## [9] "somehow" "need" "to" "create"
## [13] "some" "sort" "of" "artificial"
## [17] "magnetic" "field" "around" "the"
## [21] "entire" "planet" "to" "make"
## [25] "sure" "that" "any" "biomass"
## [29] "we" "take" "there" "isnt"
## [33] "immediately" "destroyed" "by" "cosmic"
## [37] "radiation"
We then add `word_tokens` as a feature to our dataset.
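Under the same assumptions, this step amounts to:

```r
# tokenize_tweets() returns one character vector per tweet,
# stored here as a list-column.
tweets$word_tokens <- tokenize_tweets(tweets$original_text)
```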
3.3 Sentiment
Tweets carry sentiment, and here we try to classify each tweet as positive, negative, or neutral.
Before extracting sentiments from the tweets, we first need to clean the text so that it does not contain special characters such as hashtags, backslashes, `@` symbols, website links, etc. Such characters might affect the accuracy of the sentiment score.
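As a sketch, the cleaning step could look like the following; the regular expressions here are illustrative, not necessarily the exact ones we used.

```r
# Remove website links first, then strip the special characters
# while keeping the surrounding words.
clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", "", x)  # drop links
  x <- gsub("[@#]", "", x)           # drop @ and # symbols
  gsub("\\\\", "", x)                # drop backslashes
}

tweets$cleaned_text <- clean_tweet(tweets$original_text)
```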
Below is a comparison between an original tweet and its cleaned text. Notably, the `@` symbols and website links are removed. After cleaning, all the tweets are stored in a new feature, `cleaned_text`, in our dataset.
## [1] "@elonmusk @ElemonGame Great ❤️❤️❤️ @Corsair will help @Elemongame the first blockchain game ever owns hundred millions users. This is far beyond my imagination. To the Moon and Mars. @ElemonGame @CORSAIR #Elemon #Corsair love ❤️❤️❤️🚀🚀🚀 https://t.co/iakL1e8brw"
## [1] "elonmusk ElemonGame Great ❤️❤️❤️ Corsair will help Elemongame the first blockchain game ever owns hundred millions users. This is far beyond my imagination. To the Moon and Mars. ElemonGame CORSAIR Elemon Corsair love ❤️❤️❤️🚀🚀🚀 "
Now we determine the sentiment score for each tweet using the `syuzhet` library, whose default sentiment dictionary was developed in the Nebraska Literary Lab. The sentiment scores are stored in a new feature, `sentiment_score`.
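The scoring step can be sketched as a single call to `get_sentiment()` with syuzhet's default method:

```r
library(syuzhet)

# method = "syuzhet" (the default) uses the Nebraska Literary Lab
# dictionary; each tweet gets a single numeric score.
tweets$sentiment_score <- get_sentiment(tweets$cleaned_text,
                                        method = "syuzhet")
```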
We then classify each tweet into three categories: positive (score > 0), neutral (score = 0), and negative (score < 0). We add `sentiment` as a feature to our dataset to capture the sentiment category of each tweet.
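A minimal sketch of the classification, assuming the names above:

```r
# Bucket numeric scores into the three sentiment categories.
tweets$sentiment <- ifelse(tweets$sentiment_score > 0, "positive",
                    ifelse(tweets$sentiment_score < 0, "negative",
                           "neutral"))
```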
3.4 Summary of added features
We added 5 new features to the dataset during processing, bringing the total to 12 features. These new features will help us conduct analysis and visualization more effectively. The following table describes all added features.
Features | Description |
---|---|
original_text | full text of the tweet |
word_tokens | list of word tokens of the tweet |
cleaned_text | text after removing special characters |
sentiment_score | numeric sentiment score of the tweet |
sentiment | sentiment of the tweet: positive, neutral, or negative |