Chapter 2 Data sources
Our data was collected from Twitter using Tweepy, a Python library for the Twitter API. Since we are interested in Elon Musk, we chose the keyword set `Elon Musk`, `elonmusk`, `@elonmusk`, `tesla`, and `Tesla`. We collected data on seven consecutive days, from October 28th to November 3rd. While our program was running, we captured every real-time tweet that contained any of the above keywords. Each tweet was collected as a tweet object, which contains not only the text of the tweet but also other valuable information worth exploring.
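
To make the keyword filter concrete, the following is a minimal offline sketch of how a collected tweet object could be checked against the keyword set. It is not the collection program itself: the helper names `tweet_text` and `matches_keywords` are hypothetical, and the case-insensitive substring match is an assumption that only approximates the matching rules applied by Twitter's streaming filter. The sketch also assumes the tweet object is a plain Python dictionary, and that longer tweets carry their full text under `extended_tweet`.

```python
# Keyword set used for collection (taken from the text above).
KEYWORDS = ["Elon Musk", "elonmusk", "@elonmusk", "tesla", "Tesla"]


def tweet_text(tweet):
    """Return the full text of a tweet object stored as a dict.

    Assumption: truncated tweets keep their complete text in
    extended_tweet["full_text"]; otherwise the "text" field is used.
    """
    extended = tweet.get("extended_tweet") or {}
    return extended.get("full_text") or tweet.get("text", "")


def matches_keywords(tweet, keywords=KEYWORDS):
    """Return True if any keyword occurs in the tweet text, ignoring case.

    This substring test is an approximation for illustration only.
    """
    lowered = tweet_text(tweet).lower()
    return any(keyword.lower() in lowered for keyword in keywords)
```

A check like this can be useful after collection, for example to see which tweets match only through their extended text.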
2.1 Dataset overview
After combining our data, we had a total of 471,996 tweets. Each tweet object contains 36 top-level features, which are listed in the table below.
All variables in the raw data | |
---|---|
created_at | extended_tweet |
id | quote_count |
id_str | reply_count |
text | retweet_count |
display_text_range | favorite_count |
source | entities |
truncated | favorited |
in_reply_to_status_id | retweeted |
in_reply_to_status_id_str | filter_level |
in_reply_to_user_id | lang |
in_reply_to_user_id_str | timestamp_ms |
in_reply_to_screen_name | quoted_status_id |
user | quoted_status_id_str |
geo | quoted_status |
coordinates | quoted_status_permalink |
place | possibly_sensitive |
contributors | extended_tweet |
is_quote_status | quote_count |
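
A feature list like the one above could be produced with a sketch along the following lines. It assumes, purely for illustration, that the tweets were saved as one JSON object per line; the storage format and the function names `parse_json_lines` and `top_level_features` are hypothetical and not taken from our collection program.

```python
import json


def parse_json_lines(lines):
    """Parse an iterable of JSON strings (one tweet per line) into dicts.

    Any iterable of strings works, including an open file handle.
    Blank lines are skipped.
    """
    tweets = []
    for line in lines:
        line = line.strip()
        if line:
            tweets.append(json.loads(line))
    return tweets


def top_level_features(tweets):
    """Return the sorted union of top-level keys over all tweet objects."""
    names = set()
    for tweet in tweets:
        names.update(tweet.keys())
    return sorted(names)
```

Taking the union over all tweets matters because not every tweet carries every field; `quoted_status`, for instance, is only relevant to quote tweets.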
One thing worth mentioning is that some of the features are themselves dictionaries. For example, `user` holds 40 more sub-features, including `username`, `location`, and `description`. Counting these sub-features, a tweet object has over 150 features, so we have to choose carefully among them. The selected features and their descriptions are detailed in the data transformation section.
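
One way to enumerate sub-features is to flatten each nested dictionary into dotted names such as `user.location`. The sketch below is a hypothetical helper written for illustration; the dotted naming scheme is an assumption and not part of the tweet object itself.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted key names.

    Non-dict values (including lists) are kept as-is. An empty nested
    dict is kept as a value so that the feature is not silently dropped.
    """
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict) and value:
            # Recurse into non-empty dictionaries, extending the prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat
```

For example, `flatten({"id_str": "1", "user": {"location": "Texas"}})` yields the keys `id_str` and `user.location`, so `len(flatten(tweet))` gives the number of leaf features in a single tweet object.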
2.2 Limitations
Since we did not have access to a cloud server, we could not run the program 24 hours a day during the collection period. Moreover, the start and end times differ from day to day, so we will need to find an overlapping period when conducting time-related analysis. Another limitation concerns a feature called “conversation_id”, which builds up a conversation by capturing replies to a specific tweet. However, it has to be configured during the data collection phase, and our dataset does not have it. We will try other methods to reconstruct conversations.
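
The overlapping period mentioned above can be found by taking, for each day, the earliest and latest capture times, and then intersecting these daily windows. The following is a minimal sketch under several assumptions: times are interpreted in UTC, each day has a single uninterrupted collection run that does not cross midnight, and the input is the `timestamp_ms` value of each tweet. The function names are hypothetical.

```python
from datetime import datetime, timezone


def daily_windows(timestamps_ms):
    """Map each UTC date to the (earliest, latest) time of day observed."""
    windows = {}
    for ts in timestamps_ms:
        # timestamp_ms may be stored as a string, hence the int() call.
        moment = datetime.fromtimestamp(int(ts) / 1000, tz=timezone.utc)
        day, clock = moment.date(), moment.time()
        start, end = windows.get(day, (clock, clock))
        windows[day] = (min(start, clock), max(end, clock))
    return windows


def common_window(windows):
    """Return the time-of-day interval covered on every date, or None."""
    if not windows:
        return None
    start = max(s for s, _ in windows.values())  # latest daily start
    end = min(e for _, e in windows.values())    # earliest daily end
    return (start, end) if start < end else None
```

If collection on some day was interrupted and restarted, the earliest and latest timestamps overstate the coverage for that day, so gaps within a day would need to be checked separately.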