Sports Analytics: NBA Fan Engagement Analysis with R Programming

Rachel Sung
14 min read · Feb 8, 2021

The objective of this project is to track NBA fan engagement on Twitter in order to optimize participation and enhance gameplay. The data source for the study is Game 5 of the 2020 NBA Eastern Conference Finals (9/25/2020), Miami Heat vs. Boston Celtics. The Celtics won 121–108 after being down by as many as 12 points in the first quarter.

For the complete code, please refer to https://github.com/rachelsung/Sports-Analytics-Sentiment-Analysis

Objective

Nowadays, fans can engage in sports in a multitude of ways, such as attending games, watching games through their league-related TV subscriptions, enrolling in loyalty programs, or posting real-time reactions online. With so many different avenues of fan interaction, it’s becoming tougher for teams to track fan engagement and optimize their interaction with fans.

This is especially true with online fan engagement. With an ever-growing number of social media platforms being built on newer technologies, teams not only have to produce content on these platforms, but make sure that the content is engaging, unique, and interactive for the fan. Luckily, the analytical space has grown alongside social media, and it has become attainable to track and optimize online fan engagement.

In this project, our goal is to perform an exploratory analysis of fan engagement via Twitter during Game 5 of the NBA Eastern Conference Finals between the Miami Heat and the Boston Celtics on September 25th, 2020, which the Celtics won 121–108. The Miami Heat were 1 win away from advancing to the NBA Finals but were not able to hold on after the Celtics made their run in the 3rd Quarter. By conducting sentiment analysis and statistical modeling, we can identify trends and provide recommendations to the two teams to help them maximize their fan engagement on Twitter, as well as to the NBA to enhance the overall game experience for the fans.

Data Preparation

To collect data for Game 5 of the Eastern Conference Finals between the Miami Heat and the Boston Celtics on September 25th, we extracted as many relevant, real-time tweets as possible by focusing on specific team-related and NBA-related hashtags. We then cleaned the data in Excel and created new variables that would allow us to gather better insights in R.

A. Data Scraping

The data was collected via web scraping in Python, using code from a previous class that we tailored to fit our needs. We extracted real-time tweets by connecting to Twitter’s API with keys and tokens and specified which variables we wanted to extract. We limited our scope by searching for specific hashtags and exported the data to an Excel file.

In identifying which hashtags to filter on, we researched commonly used hashtags that each team was using as well as hashtags that casual NBA fans were using during the playoffs. We wanted our data to capture Miami Heat fans, Boston Celtics fans, and general NBA fans and came up with the distribution of hashtags for each segment:

In capturing the data, however, the Twitter API did have some limitations in terms of capacity and frequency. We ran into errors when we tried to extract more than 1000 tweets at a time or ran the script too many times in a small window. Therefore, we capped the number of tweets extracted per scrape at 1000 and left enough time between scrapes. We ran this many times throughout the game, including pre-game, Q1 through Q4, and post-game, and ended up collecting 12,145 unique tweets and retweets.

Our data collection method was not perfect. The Twitter API seemed to truncate tweets that included media (photos, videos, or GIFs) and replaced the remaining characters with a hyperlink. Additionally, some hashtags did not come through in the tweet text when an emoticon immediately followed the hashtag, although we were still able to capture them in our Hashtag column. An example is below.

This made it harder to classify each tweet’s allegiance and is something we’d like to solve for in the future.

B. Data Priming

In order to prime the data for analysis in R, we performed 4 steps in Excel.

Step 1: Removed duplicate tweets as the data collection process had some overlaps

Step 2: Removed tweets posted before 9/24 as those were more likely to pertain to Game 4.

Step 3: Tagged tweets with their respective allegiance by analyzing the hashtags used in each tweet. If a fan used only Miami Heat related hashtags, as specified in the table above, we classified the tweet as a Miami Heat tweet. We did the same for the Boston Celtics as well as general NBA fans. However, if any combination of Miami Heat and Boston Celtics hashtags was used within the same tweet, we classified the tweet as general.

Step 4: Lastly, we mapped the tweets directly to the plays of the game. We integrated a play-by-play data set and were able to determine which quarter each tweet occurred in and the score at that moment.
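The four steps above were done in Excel; a rough base R equivalent might look like the sketch below. The column names (`tweet_id`, `created_at`, `hashtags`), the sample rows, the play-by-play times and scores, and the one-entry hashtag lists are all assumptions for illustration. The sketch also takes a strict reading of Step 3: a tweet counts as a team tweet only when every hashtag on it belongs to that team.

```r
# Toy tweets; column names, timestamps, and hashtag strings are hypothetical.
tweets <- data.frame(
  tweet_id   = c(1, 2, 2, 3, 4, 5),
  created_at = as.POSIXct(c("2020-09-23 21:10:00", "2020-09-25 20:35:00",
                            "2020-09-25 20:35:00", "2020-09-25 21:05:00",
                            "2020-09-25 21:40:00", "2020-09-25 22:15:00"),
                          tz = "UTC"),
  hashtags   = c("HeatTwitter", "BleedGreen", "BleedGreen", "HeatTwitter",
                 "BleedGreen HeatTwitter", "NBAPlayoffs"),
  stringsAsFactors = FALSE
)

# Step 1: drop duplicate records picked up by overlapping scrapes.
tweets <- tweets[!duplicated(tweets$tweet_id), ]

# Step 2: drop tweets posted before 9/24.
cutoff <- as.POSIXct("2020-09-24 00:00:00", tz = "UTC")
tweets <- tweets[tweets$created_at >= cutoff, ]

# Step 3: tag allegiance from the hashtags used (lists are placeholders).
heat_tags    <- c("heattwitter")
celtics_tags <- c("bleedgreen")
classify <- function(tag_string) {
  tags <- tolower(strsplit(tag_string, " ", fixed = TRUE)[[1]])
  if (length(tags) > 0 && all(tags %in% heat_tags)) {
    "Miami Heat"
  } else if (length(tags) > 0 && all(tags %in% celtics_tags)) {
    "Boston Celtics"
  } else {
    "General (Mixed)"
  }
}
tweets$category <- vapply(tweets$hashtags, classify, character(1),
                          USE.NAMES = FALSE)

# Step 4: attach the most recent play-by-play row to each tweet.
# Times and scores below are illustrative, not the real play-by-play.
pbp <- data.frame(
  time    = as.POSIXct(c("2020-09-25 20:30:00", "2020-09-25 21:00:00",
                         "2020-09-25 21:30:00", "2020-09-25 22:00:00"),
                       tz = "UTC"),
  quarter = c("Q1", "Q2", "Q3", "Q4"),
  mia     = c(0, 20, 45, 70),
  bos     = c(0, 15, 40, 75),
  stringsAsFactors = FALSE
)
idx <- findInterval(as.numeric(tweets$created_at), as.numeric(pbp$time))
tweets$quarter  <- pbp$quarter[idx]
tweets$mia_diff <- pbp$mia[idx] - pbp$bos[idx]  # positive when Miami leads
tweets$bos_diff <- -tweets$mia_diff

print(tweets)
```

`findInterval()` returns, for each tweet, the index of the last play-by-play row at or before the tweet's timestamp, which is one simple way to line tweets up with the score at that moment. It returns 0 for tweets earlier than the first row, so pre-game tweets would need separate handling.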

C. Variable Creation

We then looked to create more variables, most categorical but some numerical, in order to better interpret the data. The variables added were:

Date — separated from a combined Date & Time field so we could better understand tweets by day

Time — separated from a combined Date & Time field so we could better understand tweets by hour or time within the game

Language — we had the abbreviations for each language a tweet occurred in but did not have the full name. Ex: En = English

Tweet or Retweet — classified each record as a tweet or retweet

Retweet — classified if the record had been retweeted

Category — tagged each tweet and retweet in accordance with its team preference. Ex: Miami Heat, Boston Celtics, or General (Mixed)

Time Frame — at what point did each tweet occur. Ex: Pre-Game, Q1, Q2, Halftime, Q3, Q4, & Post-Game

Score Differential — identified the point differential for each respective team at the time of tweet or retweet. Ex: 7:40 PM PST => MIA = -9, BOS = +9

Timeout vs Instant Replay vs Non-Timeout — identified tweets and retweets that occurred during a timeout or during a replay

Referee Referenced Tweet — identified tweets and retweets that specifically mentioned Referees
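As an illustration, several of these variables could be derived in base R roughly as follows. This is a sketch under assumptions, not the spreadsheet logic actually used: the column names, sample tweets, time-frame boundaries, the "RT @" prefix test for retweets, and the referee keyword pattern are all hypothetical.

```r
# Toy records; every value here is made up for illustration.
records <- data.frame(
  created_at = c("2020-09-25 20:10:00", "2020-09-25 20:45:00",
                 "2020-09-25 21:50:00", "2020-09-25 23:20:00"),
  lang = c("en", "es", "en", "en"),
  text = c("Let's go Heat!", "RT @user: Vamos Celtics",
           "The refs are missing every call", "What a game"),
  stringsAsFactors = FALSE
)

# Date and Time: split the combined Date & Time field.
stamp <- as.POSIXct(records$created_at, tz = "UTC")
records$date <- format(stamp, "%Y-%m-%d")
records$time <- format(stamp, "%H:%M:%S")

# Language: map the abbreviation to a full name via a lookup vector.
lang_names <- c(en = "English", es = "Spanish")
records$language <- unname(lang_names[records$lang])

# Tweet or Retweet: assumes retweets start with "RT @".
records$record_type <- ifelse(startsWith(records$text, "RT @"),
                              "Retweet", "Tweet")

# Time Frame: start time of each frame (hypothetical boundaries).
frame_starts <- as.POSIXct(
  paste("2020-09-25", c("00:00:00", "20:30:00", "21:00:00", "21:30:00",
                        "21:45:00", "22:15:00", "23:00:00")),
  tz = "UTC"
)
frame_labels <- c("Pre-Game", "Q1", "Q2", "Halftime", "Q3", "Q4", "Post-Game")
records$time_frame <- frame_labels[
  findInterval(as.numeric(stamp), as.numeric(frame_starts))
]

# Referee Referenced Tweet: simple keyword match on the tweet text.
records$mentions_ref <- grepl("\\bref(s|eree|erees)?\\b", records$text,
                              ignore.case = TRUE, perl = TRUE)

print(records[, c("date", "time", "language", "record_type",
                  "time_frame", "mentions_ref")])
```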

Description of Data

After we cleaned our data, we ended up with 12,145 records and 22 features. Before we proceeded to the next step, we checked the data quality and looked into each column to understand the distribution of data and perform preliminary analysis.

Tweet Category

Language

Retweet count distribution:

The distribution of tweets that are retweeted at least once and no more than 1,000 times

Seven Time Frames:

Before the game, During Q1/Q2/Q3/Q4, After the game, Halftime

Sentiment Analysis

With the data scraped from Twitter, our group conducted sentiment analysis to better understand fans’ emotions over the course of the game. By identifying trends, we could provide suggestions to improve fan engagement.

For this analysis, we used a package called “sentimentr”, which is designed to take into account valence shifters like amplifiers and negators, which results in a more accurate sentiment score. This library not only categorizes each tweet as positive or negative but also generates a numerical value for each tweet. The higher the sentiment value, the stronger the positive emotion in the post.
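A minimal sketch of scoring tweets with sentimentr is shown below, assuming the package is installed. The three tweets are made up, and the positive/negative/neutral labeling rule (the sign of the averaged score) is an assumption for illustration rather than the exact rule used in the project.

```r
library(sentimentr)  # assumes the package is installed

# Made-up tweets; the second differs from the first only by a negator.
tweets <- c("The Celtics played great defense tonight.",
            "The Celtics did not play great defense tonight.",
            "That was a really terrible call.")

# sentiment_by() returns one averaged score per tweet (ave_sentiment).
scores <- as.data.frame(sentiment_by(get_sentences(tweets)))

# Label each tweet by the sign of its score (illustrative rule).
scores$label <- ifelse(scores$ave_sentiment > 0, "positive",
                ifelse(scores$ave_sentiment < 0, "negative", "neutral"))

print(scores[, c("element_id", "word_count", "ave_sentiment", "label")])
```

Because the package accounts for negators, the second tweet should score lower than the first even though both contain the word "great".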

We separated the data by Category (Celtics, Heat, and Mixed), and conducted sentiment analysis to discover the most common emotions in Tweets for each category. Below are our results:

  • Boston Celtics

From the bar plot above, we can see that the three most common emotions were “Anticipation”, “Trust” and “Joy”, all of which happened to be positive. Additionally, we can see that Boston's positive tweets more than doubled their negative tweets.

  • Miami Heat

For the Heat, the three most common emotions shown were “Anticipation”, “Fear” and “Anger”, which makes sense given that they were one win away from the NBA Finals but blew a significant lead and lost the game. Although Heat fans showed a more positive than negative attitude overall in their tweets, their percentage of negative tweets was significantly higher than that of their Boston counterparts.

  • Mixed

Lastly, for the category of mixed fans, we saw that they most exuded “Trust”, “Anger”, & “Anticipation”. However, it was surprising to see that negative emotions outweighed positive emotions for a seemingly neutral crowd.

Our next set of analyses was to better understand how fans’ emotions may change over the course of a high-stakes playoff game. So for both the Boston Celtics and Miami Heat, we analyzed the change in emotions across the following time frames: Before Game, Quarter 1, Quarter 2, Halftime, Quarter 3, Quarter 4, and Post Game.

  • Boston Celtics

For the Boston Celtics, we found that fans never let the negative emotions outweigh the positive emotions and showed resilience within each time frame. Quarter 2 was the peak of frustration for Celtics fans. But sentiment shifted greatly during halftime, and after the team secured the lead in the 3rd Quarter, Celtics fans remained happy throughout the rest of the game. It’s also interesting to note how far the negative reactions fell from Q4 to Post Game, perhaps indicating that Celtics fans felt more optimistic that the team could overcome a 3–2 deficit and proceed to the NBA Finals.

  • Miami Heat

For the Heat, a drastic shift in sentiment occurred during Quarter 2, when the Celtics put themselves back in the game. Optimism increased during halftime; however, as the game progressed through Quarters 3 and 4, fans became increasingly negative. Interestingly enough, post-game sentiment reverted to pre-game levels, indicating that Heat fans still felt that they had the best chance of advancing to the finals.
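A per-time-frame comparison like the one described above could be computed along these lines once every tweet has a category, a time frame, and a sentiment score. The data frame and its values are made up for illustration, and the column names are assumptions.

```r
# Toy scored tweets; values are invented to show the mechanics only.
scored <- data.frame(
  category   = rep(c("Boston Celtics", "Miami Heat"), each = 4),
  time_frame = factor(rep(c("Q2", "Q2", "Q3", "Q3"), times = 2),
                      levels = c("Pre-Game", "Q1", "Q2", "Halftime",
                                 "Q3", "Q4", "Post-Game")),
  sentiment  = c(-0.2, 0.1, 0.4, 0.6, 0.3, 0.1, -0.5, -0.1)
)

# Average sentiment per team and time frame.
avg_sentiment <- aggregate(sentiment ~ category + time_frame,
                           data = scored, FUN = mean)

# Share of tweets with a negative score per team and time frame.
neg_share <- aggregate(sentiment ~ category + time_frame,
                       data = scored, FUN = function(s) mean(s < 0))
names(neg_share)[3] <- "share_negative"

print(avg_sentiment)
print(neg_share)
```

Storing the time frame as an ordered set of factor levels keeps the output in game order rather than alphabetical order.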

Our third and final sentiment analysis looked at other elements besides sentiment versus time frame. We were curious to see how fan sentiment was affected by timeouts and instant replays. Additionally, we wanted to see if there were significant sentiment shifts when a tweet mentioned Referees.

In comparing sentiment scores during gameplay vs. timeouts, it’s no surprise that fans posted or retweeted more negative tweets during timeouts. Timeouts could serve as a general venting timeframe where fans unload their thoughts and frustrations without having to miss any basketball action. During Instant Replays, however, fans became overwhelmingly negative. This is perhaps due to the nature of the replay, which tremendously slows down the pace of the game. That said, this feature is a newer one and has shown value in moments of crucial gameplay.

In this analysis, we looked at tweets and retweets that specifically mentioned Referees. We initially thought there would be a somewhat negative view towards Referees; however, we did not expect the magnitude of the negativity when referees were mentioned. Although the number of Referee related tweets was low relative to the total number of tweets, when fans did mention the referees, the sentiment was strongly negative.

Statistical Modeling for Retweet Analysis

As social media has become more and more important for businesses, Twitter, in particular, can help the business develop a stronger following because it provides a closer connection with a broader audience. One of the most crucial aspects of Twitter is the retweet. The power of the retweet is extremely influential as when a business gets retweeted, it can indicate that:

  • The retweet shows somebody appreciated its content;
  • The retweet spreads the content and increases the probability of it going viral;
  • The retweet signals the brand to the retweeter’s followers. Therefore, the business is not just promoting the work to its own followers, but also to the retweeter’s followers.
  • The retweet increases the business’s influence.

Because retweets allow content to spread more quickly and to more people, we wanted to analyze which factors affect whether or not a tweet is retweeted. To do so, we built linear regression and logistic regression models to gain a deeper understanding from a statistical perspective.

A. Linear Regression Models

Linear regression is a common statistical data analysis technique. It is used to determine the extent to which there is a linear relationship between a dependent variable (y) and one or more independent variables (x). There are two types of linear regression: simple linear regression and multiple linear regression. In simple linear regression, a single independent variable is used to predict the value of a dependent variable. In multiple linear regression, two or more independent variables are used to predict the value of a dependent variable.

We built multiple regression models to see what variables are statistically significant to retweets. The independent variables are numerical variables, including the counts of hashtags, binary values if the user is verified or not, follower counts of the user, friends counts of the user, sentiment scores indicating if the tweet includes positive/negative emotion, and the difference in team scores. The dependent variable is the counts of retweets. We also built separate models to see how each team (Boston Celtics, Miami Heat, or Mixed) performed.
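To make the setup concrete, here is a minimal sketch of a multiple linear regression with `lm()` on simulated data. The variable names, the simulated relationships, and the coefficients are invented for illustration and are not the project's data or results.

```r
set.seed(42)
n <- 500

# Simulated predictors mirroring the variables described above.
sim <- data.frame(
  hashtag_count = rpois(n, 1.5),
  verified      = rbinom(n, 1, 0.05),         # 1 = verified user
  followers     = round(rlnorm(n, 6, 1.5)),
  friends       = round(rlnorm(n, 5, 1)),
  sentiment     = runif(n, -1, 1),            # negative to positive
  score_diff    = sample(0:15, n, replace = TRUE)
)

# Simulated response: a toy, continuous stand-in for retweet counts.
sim$retweets <- 5 - 1.5 * sim$hashtag_count + 4 * sim$sentiment -
  0.3 * sim$score_diff + rnorm(n, sd = 2)

# Multiple linear regression of retweets on all six predictors.
fit <- lm(retweets ~ hashtag_count + verified + followers + friends +
            sentiment + score_diff, data = sim)

# Coefficient table: estimate, standard error, t value, p-value.
coefs <- summary(fit)$coefficients
print(round(coefs, 4))

# Predictors whose p-value falls below the 0.05 threshold.
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
print(significant)
```

A separate model per team could be fit the same way by passing a `subset` argument to `lm()`, assuming a category column is available.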

Four variables are statistically significant:

B. Logistic Regression Models

Like linear regression, logistic regression is a predictive analysis. Logistic regression estimates the parameters of a logistic model. A binary logistic model has a dependent variable with two possible values, labeled “0” and “1”. It is used to describe data and to explain the relationship between one binary dependent variable and one or more independent variables.

We built logistic models to see what variables are statistically significant to retweets. The independent variables are numerical variables, including the counts of hashtags, binary values if the user is verified or not, follower counts of the user, friends counts of the user, sentiment scores indicating if the tweet includes positive/negative emotion, and the difference in team scores. The dependent variable is the binary value of retweets (that is, 1 if the tweet was retweeted and 0 if it was not). We also built separate models to see how each team (Boston Celtics, Miami Heat, or Mixed) performs.
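The logistic counterpart can be sketched with `glm()` and a binomial family. As before, the data are simulated and the variable names and effect sizes are assumptions chosen only to show the mechanics.

```r
set.seed(7)
n <- 1000

# Simulated predictors (hypothetical names and distributions).
sim_bin <- data.frame(
  hashtag_count = rpois(n, 1.5),
  verified      = rbinom(n, 1, 0.05),
  followers     = round(rlnorm(n, 6, 1.5)),
  friends       = round(rlnorm(n, 5, 1)),
  sentiment     = runif(n, -1, 1),
  score_diff    = sample(0:15, n, replace = TRUE)
)

# Simulated outcome: 1 = retweeted, 0 = not retweeted.
p <- plogis(-0.5 - 0.6 * sim_bin$hashtag_count + 1.5 * sim_bin$sentiment -
              0.05 * sim_bin$score_diff)
sim_bin$retweeted <- rbinom(n, 1, p)

# Logistic regression: models the log-odds of being retweeted.
logit_fit <- glm(retweeted ~ hashtag_count + verified + followers + friends +
                   sentiment + score_diff,
                 family = binomial, data = sim_bin)

print(round(summary(logit_fit)$coefficients, 4))

# Exponentiated coefficients read as odds ratios per one-unit increase.
odds_ratios <- exp(coef(logit_fit))
print(round(odds_ratios, 3))

# Predicted probability of a retweet for each simulated tweet.
pred <- predict(logit_fit, type = "response")
```

An odds ratio above 1 means the predictor is associated with higher odds of a retweet, and below 1 with lower odds, holding the other predictors fixed.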

Four variables are statistically significant:

Similar to the linear regression result, in general, there were 4 variables that were statistically significant to a tweet being retweeted: the counts of hashtags, negative tweets, positive tweets, and score differential. We can make the same inferences when it comes to hashtag usage, the power of the positive tweet, and how close games cause users to retweet more.

C. Other surprising findings

  • Verified status, user follower counts, and user friend counts do not affect retweets
  • For Miami Heat fans: the absolute difference in scores is positively related to retweets

Conclusion

A. Results

In regards to our sentiment analysis, we found:

Celtics Fans:

  • Displayed significantly more positive emotions than negative
  • Showed resilience as negative emotions never outweighed positive emotions, even when the team was falling behind

Heat Fans:

  • Overall emotions were generally positive, but negative emotions were more common than among Celtics fans
  • Negative emotions were predominant during the second half of the game as the Celtics overtook the Heat
  • Post-game optimism levels reverted back to pre-game levels, even after a loss

Mixed Category:

  • For a seemingly neutral crowd, it was surprising to see that negative emotions outweighed the positive
  • Sentiment scores dropped during Timeouts and dropped even further during Instant Replays
  • Tweets and retweets that mentioned Referees elicited more negative emotions

In regards to our retweet analysis we found:

a. For both multiple linear regression models and logistic regression models, there are four variables that are significant:

  • Count of hashtags
  • Negative tweets
  • Positive tweets
  • Score differential

b. More hashtag usage may not lead to higher retweets

c. Negative tweets tend to receive fewer retweets, while positive tweets tend to receive more

Recommendations

There are several recommendations we can make per organization based on our analysis:

Celtics

  • Post more positive tweets that indicate optimism and hope, as they are more likely to be retweeted and to have higher retweet counts
  • Post more tweets when the score difference is low as tweets are more likely to be retweeted
  • Continue to use 1 hashtag, #BleedGreen, but include the hashtag on all tweets as the majority of Celtics tweets during the game did not include any hashtags

Heat

  • Post more positive tweets that indicate optimism and hope, as they are more likely to be retweeted and to have higher retweet counts
  • Post more tweets when the score difference is low as tweets are more likely to be retweeted
  • Consolidate the number of hashtags used from 4 to 1 and focus on the most popular hashtag, #HeatTwitter

NBA

Timeouts

  • Reduce the length of timeouts as far as possible without adversely affecting players’ health
  • Replace traditional ads with exclusive content and show more in-game ads throughout the game to make up for lost revenue
  • Exclusive content can include player interviews, highlight reels, expert analysis, mic’d up players
  • In game ads can occur more frequently during fouls and free throws
  • Offer an ad-free mode in which the user pays a small fee in exchange for no ads
  • In lieu of the ads, fans can have the ability to explore different cameras and angles

Challenges/Instant Replays

The sentiment shift was so severe that the NBA should look to:

  • Cap how long a challenge can be analyzed
  • Couple timeouts with challenges
  • Only allow challenges during the final minutes of the game, or remove challenges altogether

Referees

  • Provide more training so fewer incorrect calls are made
  • Understand which calls cause the most negative sentiment and adjust appropriately in order to enhance the flow of the game

By analyzing just 1 NBA game, we were able to derive a lot of insight into the behavior of the online fan. We can identify certain trends and recommend actionable items that can allow teams to maximize their fan engagement and the league to make the proper adjustments in order to enhance the flow of the game.
