UNLIMITED DATA | BY JAMES KULICH | 6 MIN READ
This year’s football season is over, but it provided a good spark for an interesting project, which could have implications for predicting NCAA football TV ratings.
The author of the project is Michael McComiskey, one of our master's students in data science at Elmhurst University. While Mike currently works in IT, he has spent 25 years in the world of college athletics.
Mike’s overall goal was to create a tool that a television programmer could use to predict whether a particular college football game would be a “good” game from an audience perspective. Mike’s original definition of “good” was any game with a TV audience at least two standard deviations above the median for the time period in question.
However, a quick look at some data and careful consideration of what a network television manager might actually find useful led Mike down a different path.
Finding the value in football viewership figures
First, the data:
The red line in the graph above represents the median number of viewers of the games considered by Mike, and the blue line represents the level of viewership two standard deviations above this median.
There aren’t many high-end games and you don’t need a predictive model to identify them. The networks already offer prime-time billing to marquee games like Ohio State vs. Michigan or Auburn vs. Alabama.
The potential for new value lies in choosing less obvious games that may still draw a strong audience. Mike focused his work on predicting contests that would fall between the red and blue lines: those that drew more than the median number of viewers but fell below the threshold of two standard deviations above it.
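To make the target definition concrete, here is a minimal sketch of how games between the two lines might be labeled. The function name, the viewer counts, and the use of the sample standard deviation are all assumptions made for illustration; they are not taken from Mike's project.

```python
from statistics import median, stdev


def label_target_games(viewers):
    """Flag games whose audience falls strictly between the median
    and the median plus two standard deviations."""
    lower = median(viewers)
    # Assumption: sample standard deviation; the project may have used another variant.
    upper = lower + 2 * stdev(viewers)
    labels = [lower < v < upper for v in viewers]
    return lower, upper, labels


# Hypothetical viewer counts in millions, invented for this example.
sample_viewers = [0.4, 0.6, 0.9, 1.1, 1.5, 2.0, 2.4, 3.1, 4.2, 12.5]
lower, upper, labels = label_target_games(sample_viewers)

for v, is_target in zip(sample_viewers, labels):
    print(f"{v:5.1f}M viewers -> target game: {is_target}")
```

In this made-up sample, the 12.5 million game sits above the upper line and is excluded, which mirrors the idea that marquee games do not need a model.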
Getting the right data and getting the data right
Mike’s knowledge of the field led him to a number of good data sources, such as Sports Media Watch and Sports Reference. As is usually the case, Mike had to do a lot of work, including developing custom code, to bring the data together and put it into a usable form.
The data fields originally available included date, time, broadcast network, home and away teams, conference affiliations, AFP and AP rankings, NCAA football TV ratings and the number of viewers.
The raw data did not lend itself well to model building. Significant feature engineering was required to create new quantities that better represented the story in the data. This is where Mike’s extensive domain experience came in handy.
Mike created new variables to capture important nuances of the matchups involved, such as whether a game had at least one participant from the major conferences (Big Ten, Big 12, Pac-12, SEC) or from the Football Championship Subdivision.
Other new variables captured cases where both participants belonged to a Power Five conference (ACC, Big Ten, Big 12, Pac-12, or SEC), or cases where teams belonged to the Football Bowl Subdivision but not to one of the Power Five conferences.
Other new features were designed to capture ranking information in enough detail to be useful (but not so much as to cause clutter), such as matchups between teams in the top two, top five, top 10, top 15 and top 25.
A third set of new features focused on game time windows, such as early afternoon, late afternoon, and prime time.
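The kinds of engineered features described above can be sketched in a few lines. This is only an illustration: the dictionary keys, the feature names, and the kickoff-hour cutoffs are hypothetical and do not come from Mike's code or data.

```python
POWER_FIVE = {"ACC", "Big Ten", "Big 12", "Pac-12", "SEC"}
RANK_TIERS = (2, 5, 10, 15, 25)


def engineer_features(game):
    """Turn one raw game record into matchup, ranking and time-window features.

    `game` is a dict with hypothetical keys: home_conf, away_conf,
    home_rank, away_rank (None when unranked) and kickoff_hour (0-23).
    """
    confs = (game["home_conf"], game["away_conf"])
    ranks = (game["home_rank"], game["away_rank"])

    features = {
        "has_sec_team": "SEC" in confs,
        "both_power_five": all(c in POWER_FIVE for c in confs),
    }

    # One flag per ranking tier: are both teams ranked inside that tier?
    for tier in RANK_TIERS:
        features[f"both_top_{tier}"] = all(
            r is not None and r <= tier for r in ranks
        )

    # Assumed cutoffs for the time windows; the article does not specify them.
    hour = game["kickoff_hour"]
    if hour < 15:
        features["time_window"] = "early_afternoon"
    elif hour < 19:
        features["time_window"] = "late_afternoon"
    else:
        features["time_window"] = "prime_time"
    return features


# Hypothetical game record, invented for this example.
example = engineer_features({
    "home_conf": "SEC", "away_conf": "Big Ten",
    "home_rank": 4, "away_rank": 12, "kickoff_hour": 20,
})
print(example)
```

Tiered flags like these keep ranking information coarse enough to avoid clutter while still distinguishing a top-five clash from a game between two teams near the bottom of the top 25.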
With these new features in place, models could now be developed to capture the signals in the data. Mike used the semi-automated capabilities of PyCaret and arrived at a collection of candidate models with strong performance characteristics.
One measure of model performance is the ROC curve, which gives a sense of how a model performs overall compared with random guessing. The diagonal line represents random guessing, and the higher a model’s ROC curve rises above it, the better the model separates the games of interest from the rest.
These ROC curves show that Mike’s candidate models consistently performed at high levels of predictive power.
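For readers who want to see what sits behind an ROC curve, here is a minimal sketch that builds the curve's points and the area under it from scratch. The labels and scores are invented for illustration, and the helper names are hypothetical; libraries such as scikit-learn provide equivalent, more robust functions.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points obtained
    by lowering the decision threshold one distinct score at a time."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every game that shares this score so ties move together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points; 0.5 matches random guessing."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = target game) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)
print(round(auc, 3))
```

A curve that hugs the top-left corner yields an area close to 1, while a curve along the diagonal yields about 0.5.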
Going deeper, Mike’s models identified the relative importance of the variables used to generate the predictions. At the top of the list was a matchup between two top-25 schools. SEC games showed more potential to reach the targeted audience level than games from other conferences, although other major conference rivalries also had a positive impact.
Interestingly, game time was not an important variable. Who was playing mattered more than when they were playing.
As a final test, Mike applied his chosen Gradient Boosting classifier model to new data with known responses. Accuracy, recall and precision all remained high, with precision reaching 78%.
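As a reminder of what these three measures mean, here is a small sketch that computes them from actual and predicted labels. The labels below are invented for illustration and have no connection to Mike's holdout data or his reported figures.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = target game)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # correctly flagged games
    fp = sum(1 for t, p in pairs if not t and p)      # flagged, but not target games
    fn = sum(1 for t, p in pairs if t and not p)      # target games that were missed
    tn = sum(1 for t, p in pairs if not t and not p)  # correctly ignored games

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision: of the games the model flagged, how many were real targets?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Recall: of the real target games, how many did the model flag?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical actual and predicted labels.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For a television programmer, precision is arguably the most practical of the three, since it describes how often a game the model recommends actually delivers the audience.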
Refine the model
As always, there are opportunities for improvement. One direction Mike suggests is to flag teams with widely recognized names to measure the impact of reputation on viewership. Other possible inputs Mike suggested include teams’ traditional records of success or some form of love/hate index developed from viewer data.
There is a lot of ongoing discussion about future directions for data science. A point made by Maria Korolov in her CIO blog post, “How to know when AI is the right solution,” is that simpler methods should be used when they are effective, leaving more powerful AI approaches for situations where they can deliver substantially better business value.
This is precisely the approach Mike has taken in his work, focusing on the difficult problem of selecting mid-level games that have strong potential to pique viewers’ interest. Indeed, this focus on delivering projects that matter is how we approach our entire Master of Data Science program at Elmhurst University.
As an instructor and program director, I am especially gratified when students like Mike do a great job.