Chapter 2 Data sources

Our data was collected from Twitter using Tweepy, a Python library for the Twitter API. Since we are interested in Elon Musk, we chose the following keyword set: Elon Musk, elonmusk, @elonmusk, tesla, and Tesla. We collected data on seven consecutive days, from October 28th to November 3rd. While our program was running, we captured every real-time tweet that contained any of the above keywords. Each tweet was collected as a tweet object, which contains not only the text of the tweet but also other valuable information worth exploring.
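
As a local sanity check on the collected objects, the keyword filter can be re-applied offline. The sketch below is only an illustration and not the collection program itself. The helper names tweet_text and matches_keywords are hypothetical, and the code assumes that each tweet has been parsed into a Python dictionary in which a truncated tweet carries its complete text under extended_tweet.full_text. The case-insensitive substring test used here only approximates the matching performed by the streaming service.

```python
KEYWORDS = ["Elon Musk", "elonmusk", "@elonmusk", "tesla", "Tesla"]


def tweet_text(tweet):
    """Return the fullest text available in a parsed tweet dictionary."""
    # Assumption: truncated tweets keep their complete text in extended_tweet.
    extended = tweet.get("extended_tweet") or {}
    return extended.get("full_text") or tweet.get("text", "")


def matches_keywords(tweet, keywords=KEYWORDS):
    """Return True if the tweet text contains any keyword, ignoring case."""
    text = tweet_text(tweet).lower()
    return any(keyword.lower() in text for keyword in keywords)
```

Because the comparison ignores case, tesla and Tesla behave identically in this check. A tweet may also have matched on a field other than its text, so a False result does not by itself indicate a collection error.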

2.1 Dataset overview

After combining our data, we obtained a total of 471,996 tweets. Each tweet object contains 36 features, which are listed in Table 2.1.
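
The combining step can be sketched as follows. The storage format is not described here, so this sketch assumes that each collection run was saved with one JSON tweet object per line. The function names parse_tweets and combine are hypothetical, and dropping repeated id_str values is an added assumption rather than a documented part of our pipeline.

```python
import json


def parse_tweets(lines):
    """Parse an iterable of JSON strings, one tweet object per line."""
    tweets = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            tweets.append(json.loads(line))
    return tweets


def combine(batches):
    """Merge batches of tweets, keeping the first copy of each id_str."""
    seen = set()
    combined = []
    for batch in batches:
        for tweet in batch:
            key = tweet["id_str"]
            if key not in seen:
                seen.add(key)
                combined.append(tweet)
    return combined
```

With files on disk, each batch could be produced by passing an open file handle to parse_tweets, and len(combine(batches)) then gives the total number of tweets.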

Table 2.1: All feature names
All variables of the raw data
created_at	extended_tweet
id	quote_count
id_str	reply_count
text	retweet_count
display_text_range	favorite_count
source	entities
truncated	favorited
in_reply_to_status_id	retweeted
in_reply_to_status_id_str	filter_level
in_reply_to_user_id	lang
in_reply_to_user_id_str	timestamp_ms
in_reply_to_screen_name	quoted_status_id
user	quoted_status_id_str
geo	quoted_status
coordinates	quoted_status_permalink
place	possibly_sensitive
contributors	extended_tweet
is_quote_status	quote_count

One thing worth mentioning is that some of the features are actually dictionaries. For example, the user feature contains 40 additional sub-features, including username, location, and description. Counting these sub-features, a tweet object has over 150 features, so we have to choose carefully among them. The selected features and their descriptions are detailed in the data transformation section.
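
To illustrate how nested features add to the feature count, the sketch below flattens a tweet dictionary into dotted names such as user.location. The function name flatten and the dotted naming scheme are assumptions made for this illustration only.

```python
def flatten(obj, prefix=""):
    """Flatten nested dictionaries into one dictionary with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            # Recurse into non-empty dictionaries such as the user feature.
            flat.update(flatten(value, name))
        else:
            # Lists, scalars, None, and empty dictionaries are kept as leaves.
            flat[name] = value
    return flat
```

Applying len(flatten(tweet)) to a parsed tweet object counts its features with all sub-features included.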

2.2 Limitations

Since we did not have access to a cloud server, we were not able to run the program 24 hours a day during the collection period. Moreover, the start and end times differ from day to day, so we will need to find an overlapping period when conducting time-related analyses. Another limitation concerns a feature called “conversation_id”, which builds up a conversation by capturing the replies to a specific tweet. However, it has to be configured during the data collection phase, and our dataset does not have it. We will try other methods to construct conversations.
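
Both workarounds can be sketched briefly. The first function finds the time of day covered by every collection session. It assumes one session per day and that no session crosses midnight. The second function groups captured replies under their parent tweet using the in_reply_to_status_id_str and id_str features. This is one possible approach to approximating conversations, not necessarily the method adopted later. Both function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def common_daily_window(sessions):
    """Return the (start, end) time of day shared by all sessions, or None.

    sessions is a list of (start, end) datetime pairs, one pair per day.
    """
    latest_start = max(start.time() for start, _ in sessions)
    earliest_end = min(end.time() for _, end in sessions)
    if latest_start >= earliest_end:
        return None  # the sessions have no common time of day
    return latest_start, earliest_end


def group_replies(tweets):
    """Map each parent tweet id to the ids of its captured direct replies."""
    replies = defaultdict(list)
    for tweet in tweets:
        parent = tweet.get("in_reply_to_status_id_str")
        if parent:  # None for tweets that are not replies
            replies[parent].append(tweet["id_str"])
    return dict(replies)
```

Only replies that were themselves captured by the keyword filter appear in the grouping, so the reconstructed conversations are likely to be incomplete.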