Chapter 2 Data sources

Our data was collected from Twitter using Tweepy, a Python library for the Twitter API. Since we are interested in Elon Musk, we tracked the keywords “Elon Musk”, “elonmusk”, “@elonmusk”, “tesla”, and “Tesla”. We collected data on seven consecutive days, from October 28th to November 3rd. While our program was running, we captured every real-time tweet that contained any of the above keywords. Each tweet was collected as a tweet object, which contains not only the text of the tweet but also other valuable information worth exploring.
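As a sanity check on the collected data, one can test which of the tracked keywords actually appear in a tweet's text. The sketch below is illustrative only and is not the collection code itself. It assumes each tweet has been loaded as a Python dictionary, that matching is case-insensitive, and that a truncated tweet carries its full text under `extended_tweet.full_text`. The helper names `full_text` and `matched_keywords` are hypothetical.

```python
# Keyword set taken from the text above.
KEYWORDS = ["Elon Musk", "elonmusk", "@elonmusk", "tesla", "Tesla"]


def full_text(tweet: dict) -> str:
    """Return the untruncated text when the tweet object provides one."""
    if tweet.get("truncated") and "extended_tweet" in tweet:
        # Assumption: truncated tweets store their full text here.
        return tweet["extended_tweet"].get("full_text", tweet.get("text", ""))
    return tweet.get("text", "")


def matched_keywords(tweet: dict, keywords=KEYWORDS) -> list:
    """List the distinct keywords (lowercased) found in the tweet text."""
    text = full_text(tweet).lower()
    found = []
    for kw in keywords:
        kw_lower = kw.lower()
        # Case-insensitive substring match; skip duplicates such as tesla/Tesla.
        if kw_lower in text and kw_lower not in found:
            found.append(kw_lower)
    return found
```

Because the comparison is case-insensitive in this sketch, “tesla” and “Tesla” collapse into a single entry in the result.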

2.1 Dataset overview

After combining the collected data into a single dataset, we obtained a total of 471,996 tweets. Each tweet object contains 36 features, which are listed in Table 2.1.

Table 2.1: All feature names
All variables of the raw data
created_at extended_tweet
id quote_count
id_str reply_count
text retweet_count
display_text_range favorite_count
source entities
truncated favorited
in_reply_to_status_id retweeted
in_reply_to_status_id_str filter_level
in_reply_to_user_id lang
in_reply_to_user_id_str timestamp_ms
in_reply_to_screen_name quoted_status_id
user quoted_status_id_str
geo quoted_status
coordinates quoted_status_permalink
place possibly_sensitive
contributors extended_tweet
is_quote_status quote_count

One thing worth mentioning is that some of the features are themselves dictionaries. For example, user contains 40 additional sub-features, including username, location, and description. Counting those sub-features, a tweet object has over 150 features, so we have to choose carefully among them. The selected features and their descriptions are detailed in the data transformation section.
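One way to see how nested features add to the total is to flatten a tweet object into dotted keys and count them. The following is a minimal sketch, assuming the tweet has been parsed into a Python dictionary. The `flatten` helper and the sample tweet fragment are made up for illustration and are not part of our pipeline.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into dotted keys, e.g. 'user.location'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            # Recurse into non-empty dictionaries such as 'user'.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Made-up tweet fragment for illustration only.
sample = {
    "id_str": "1",
    "text": "hello",
    "user": {"screen_name": "someone", "location": None, "description": "bio"},
    "place": None,
}

flat = flatten(sample)
print(len(flat), sorted(flat))
```

Applying the same function to a full tweet object would give the count of top-level features plus sub-features, which is useful when deciding which ones to keep.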

2.2 Limitations

Since we did not have access to a cloud server, we were not able to run the program 24 hours a day during the collection period. Moreover, the start and end times differ from day to day, so we will need to find an overlapping period when conducting time-related analysis. Another limitation concerns a feature called “conversation_id”, which builds up a conversation by capturing the replies to a specific tweet. However, it has to be configured during the data collection phase, and our dataset does not have it. We will try other methods to construct conversations.
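The overlapping period could be found along the following lines. This is a sketch under stated assumptions rather than our final procedure: it takes the `timestamp_ms` values of the tweets, treats the first and last tweet of each UTC date as that day's collection window, and assumes a single run per day that does not cross midnight. The function name `daily_overlap` is hypothetical.

```python
from datetime import datetime, timezone


def daily_overlap(timestamps_ms):
    """Return the (start, end) time of day covered on every collection date.

    The common window is the latest daily start and the earliest daily end.
    Returns None if there is no data or the daily windows do not overlap.
    """
    windows = {}
    for ts in timestamps_ms:
        # timestamp_ms is milliseconds since the epoch (often stored as a string).
        dt = datetime.fromtimestamp(int(ts) / 1000, tz=timezone.utc)
        day, tod = dt.date(), dt.time()
        lo, hi = windows.get(day, (tod, tod))
        windows[day] = (min(lo, tod), max(hi, tod))
    if not windows:
        return None
    start = max(lo for lo, _ in windows.values())
    end = min(hi for _, hi in windows.values())
    return (start, end) if start <= end else None
```

Tweets whose time of day falls inside the returned window can then be compared across dates on an equal footing.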