Good papers are usually the ones with interesting ideas. Although techniques are important, motivations should always come first. In this assignment, we will try to gain insight from tweets of Tesla's CEO, Elon Musk (@elonmusk), for trading his company's stock ($TSLA). We hope you will find this exercise interesting. For simplicity, settings in this exercise are simple and therefore might not yield the best outcome. You are always welcome to consider extensions of this exercise, and design your own trading mechanism.
In this assignment, we will cover skills mentioned in the following notebooks:
When writing your answers, please run all cells including those specified #Nothing is required to do in this cell
.
Scraping data from Twitter is not an easy task. Although you can access to Twitter's official API (after applying for an account here) for scrapping tweets, various limitations would be posted to users. For example, with the standard API, you would only be able to retrieve tweets up to 7 days ago.
Developers has devoted great effort in overcoming these challeges. While python packages like twitterscraper
(details in here) has been developed, their performances are highly unstable. When we first prepared this assignment, the package worked well and now it is not working.
Therefore, in this exercise, we will switch to use a dataset avaliable in Kaggle, called "Elon Musk's Tweets". The dataset is downloaded from https://www.kaggle.com/kulgen/elon-musks-tweets.
Q1.1 Download the data file data_elonmusk.csv
from the course website and put it in the same file with your assignment. Load the data and store it with the name raw_data_tweet
. When loading the data, use the encoding latin1
.
Hint: When you want to read a csv file with the encoding latin1
, you can use pandas.read_csv("yourfile.csv", encoding = "latin1")
.
#Put your answer here
Q1.2 Show the first 30 rows of the raw_data_tweet
.
#Put your answer here
Q1.3 Create another dataframe named df_tweets_formatted
using raw_data_tweet
with the following formatting:
Step 1: Delete all the tweets that are retweeted from others.
(i.e. Keep rows which raw_data_tweet["Retweet from"].isna() == True
). Name the new dataframe as s1
.
Step 2: Keep only the 2 columns: Tweet
and Time
of s1
. Name this new dataframe as s2
.
(Given) Step 3&4: Transform the column Time
into datetime
format and leave only data in 2016.
#Put your answer here
###########################################################################
import datetime as dt
import pandas as pd
#Step 3:
s3 = s2.copy()
s3["Time"] = pd.to_datetime(s3["Time"])
s3["Time"] = s3["Time"].dt.date
#Step 4:
start = dt.date(2016,1,1)
end = dt.date(2016,12,31)
df_tweets_formatted = s3[(s3["Time"]>=start) & (s3["Time"]<= end)]
Besides the above formatting, you might also note that it is common to have hyperlinks in tweets. For example:
#Nothing is required to do in this cell
example_hyperlink = df_tweets_formatted["Tweet"][1085]
example_hyperlink
Since nothing can be inferred from these links, we will want to delete them when doing our analysis. Below we will use the re
package for doing so. Details of the package can be referred in here.
#Nothing is required to do in this cell
import re
re.sub('https?:\/\/\S+', "", example_hyperlink)
where
https?:\/\/
matches any "http://" and "https://" in string\S+
matches strings that do not contain white spacesQ1.4 Define a function del_https
to delete hyperlinks in tweets and another function del_n
to delete the phase \n
in tweets. Apply the 2 functions on the column df_tweets_wohttp["Tweet"]
.
Hint: As in the previous assignment, you can apply a function func
on the whole column col
of the dataframe df
by using the line df["col"].apply(func)
.
df_tweets_wohttp = df_tweets_formatted.copy()
#Put your answer here
To mention someone in twitter, as like in instagram and other social media platforms, you will put @
followed by the user's username. For example, people talk about Elon Musk and Donald Trump in their tweets with @elonmusk
and @realDonaldTrump
.
As we are going to perform sentiment analysis on Elon Musk's tweets for trading $TSLA
, tweets mentioning other users (or companies) might bring inrelevant information and give confusing trading signal. For instance, given NIO is one of the major competitior to Tesla, if Elon Musk publishes a tweet: @NIO is doing really well on their new models!
(This is fake and for demonstration purpose only.), it is an obvious selling signal to $TSLA
but the tweet is highly positive in its sentiment and we might be misguided to buy.
Q1.5 Delete the tweets containing "@" in df_tweets_wohttp
(mentioning others). Reset the index of it and call it df_tweets_final
.
Hint: You may want to use the line df_tweets_wohttp[df_tweets_wohttp["Tweet"].str.contains('@') == False]
.
#Put your answer here
Q1.6 Do the sentiment analysis on df_tweets_final["Tweet"]
with TextBlob
. Save the polarity of tweets into another column named Sentiment
of df_tweets_final
. (i.e. df_tweets_final["Sentiment"]
)
#Put your answer here
Q1.7 Since there could have multiple tweets within the same date, calculate the average across same date
Hint: You may want to use the line df_tweets_final.groupby('Time').mean()
.
#Put your answer here
Q2.1 Download adjusted closing stock price of TSLA from 2016-01-01
to 2016-12-31
by using the package yfinance
. Details of the package can be referred in here. Save the adjusted closing stock price of TSLA from 2016-01-04
(first trading day in 2016) to 2016-12-30
(last trading day in 2016) as stock_data
.
Hint: You may want to use the line yf.download('TSLA',start, end)
#Put your answer here
Q2.2 Plot the time series graph of stock_data
using matplotlib.pyplot
, and export it as ts_Yourname.png
and ts_Yourname.pdf
. Details of the package can be referred in here. You DO NOT need to submit the png and pdf file to me.
Hint: You may want to use the functions plt.plot()
and plt.savefig()
.
#Put your answer here
The simpliest way to evaluate the performance of a trading strategy is to compare it with the buy and hold strategy, that is, to buy at the beginning and to sell at the end. If a specific strategy fail to beat even this simpliest buy and hold strategy, it should never be considered as a good strategy.
Q3.1 Save the adjusted closing price of TSLA on its first trading day and its last trading day in 2016 asfirst_trading_day_price_2016
and last_trading_day_price_2016
respectively. Calculate the percentage gain(or loss) if we buy one share of TSLA on its first trading day in 2016 and sell it on its last trading day in 2016. Save the percentage gain(or loss) calculated as buy_hold_gain
and print it out.
Hint: The percentage gain(or loss) is calculated by: $$ \frac{Total Revenue - Total Cost}{Total Cost} $$
Therefore, in this buy and hold strategy, the percentage gain(or loss) is calculated by: $$ \frac{2016LastTradingDayPrice - 2016FirstTradingDayPrice}{2016FirstTradingDayPrice} $$
#Put your answer here
Now we have to prepare the data needed for our sentiment trading strategy. Since it is possible that some of these tweets were publish after the market trading hours, we would trade TSLA according to the average sentiment of Elon Musk's tweets one day before (t-1, where t is the current date).
#Nothing is required to do in this cell
trad_sentiment_data = pd.concat([stock_data,sentiment_t_df], axis = 1)
#We would trade according to Sentiment_tminus1
trad_sentiment_data["Sentiment_tminus1"] = trad_sentiment_data["Sentiment"].shift(+1)
trad_sentiment_data
Q3.2 Define a function buy_sell_signal
where gives 1
when sentiment>0.5
and gives -1
when sentiment<-0.5
. Apply the function on trad_sentiment_data["Sentiment_tminus1"]
to create the buy and sell signal. Save the signal in another column trad_sentiment_data["BuySell"]
.
In here, we have 1
means buy and -1
means sell. When Elon Musk's tweets on yesterday are highly optimistic, we buy TSLA, and when they are highly pessimistic, we sell the stock.
#Put your answer here
Dropping the NA entries, we can now have a clear picture on the buy sell signals indicated by the sentiments of Elon Musk's tweets.
#Nothing is required to do in this cell
trad_sentiment_data_buysell = trad_sentiment_data[["Adj Close", "Sentiment_tminus1", "BuySell"]]
trad_sentiment_data_buysell = trad_sentiment_data_buysell.dropna()
trad_sentiment_data_buysell
Finally, we want to get the profit of our trading strategy.
Q3.3 By using the provided function percentage_profit_senti()
, calculate the profit of our sentiment trading strategy. Save it as senti_gain
and print it out.
def percentage_profit_senti(data, last_trading_day_price):
revenue = 0
cost = 0
accu_position = 0
for i in range(0, len(data)):
if data["BuySell"][i] == 1: #Buy
cost = cost + data["Adj Close"][i]
accu_position = accu_position + 1
elif data["BuySell"][i] == -1: #Sell
revenue = revenue + data["Adj Close"][i] #Allow short sell without incurring any interests
accu_position = accu_position - 1
close_position = accu_position
if close_position == 0:
total_gain = revenue - cost
elif close_position <0:
cost = cost + last_trading_day_price*abs(close_position) #Close all short position at the end
total_gain = revenue - cost
elif close_position >0:
revenue = revenue + last_trading_day_price*close_position #Close all long position at the end
total_gain = revenue - cost
return total_gain/cost
#Put your answer here
Compare with the buy and hold strategy, we can see that trading with elon musk's tweets can generate way higher percentage in profit!
#Nothing is required to do in this cell
print("The percentage gain of a buy and hold strategy on TSLA in 2016 is",
buy_hold_gain*100, "%")
print("The percentage gain of the sentiment trading strategy on TSLA in 2016 is",
senti_gain*100, "%")