
Week 5 Scores: A Tough Week to Predict

Lots of surprises during Week 5 of Strictly Series 16!

Not from this week, but still a surprise.

To predict the Week 5 Strictly results, I retrained the same three models I used for Week 4, adding in the Week 4 results as additional model inputs (a sketch of the setup follows the list below). Since this represents a 33% increase in the available data on this series’ celebrities’ performances, I was hopeful the model predictions would improve. However, the week also included two factors that threatened to befuddle the models:

  • Bruno was off for the week, replaced by former Dancing with the Stars champion and Fresh Prince of Bel-Air actor Alfonso Ribeiro (and, I now see from his Wikipedia page, current host of America’s Funniest Home Videos, which has apparently decided to thumb its nose at the internet and trudge on).
  • The new “couple’s choice” category had its debut, with two routines in styles never before performed on the show (contemporary and “street/commercial”).
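
For reference, here’s a minimal sketch of the retraining setup. The weeks_1_to_4 frame and its feature encoding are stand-ins, and the default model settings are assumptions, not necessarily the exact configuration I ran:

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost import XGBRegressor

# weeks_1_to_4 is a stand-in: one row per routine through Week 4, with
# numerically encoded features (celebrity, professional, dance style, week)
# and the judges' total as the target. The real feature engineering isn't shown.
X = weeks_1_to_4.drop(columns='score')
y = weeks_1_to_4['score']

models = {
    'gbr': GradientBoostingRegressor(),
    'xgbr': XGBRegressor(),
    'rfr': RandomForestRegressor(),
}
for name, model in models.items():
    model.fit(X, y)  # refit on all data through Week 4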

Despite these warning signs, I still went ahead and predicted a score for each routine. Because it did the best last week, I used the XGBoost regressor’s predictions as the “official” submission, even though, same as last time, the plain gradient boosting model’s cross-validation performance looked slightly better on average (the comparison is sketched below).
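
The cross-validation comparison was along these lines (a sketch continuing the one above; the fold count and metric are assumptions):

from sklearn.model_selection import cross_val_score

for name, model in models.items():
    # sklearn reports negative MSE by convention; flip the sign, take the root.
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring='neg_mean_squared_error')
    print('{} : RMSE {:.1f}'.format(name, mse.mean() ** 0.5))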

Additionally, an expert Strictly fan was again willing to share their predicted scores as a point of comparison, for which I am thankful!

Aljaz: also thankful.

I’ve tabulated the predictions and actual scores for Week 5 (fan is the expert fan; gbr, xgbr, and rfr are the gradient boosting, XGBoost, and random forest regressors):

partners            dance               fan  gbr  xgbr  rfr  actual
Ashley and Pasha    Rumba                31   30    32   26      36
Charles and Karen   Street/Commercial    28   25    23   25      36
Danny and Amy       Jive                 26   26    29   25      37
Faye and Giovanni   Foxtrot              35   32    31   29      33
Graeme and Oti      Tango                27   24    23   23      29
Joe and Dianne      Waltz                30   28    30   27      29
Kate and Aljaz      Viennese Waltz       31   26    29   28      26
Lauren and AJ       Contemporary         30   23    23   24      24
Ranj and Janette    American Smooth      25   26    27   26      25
Seann and Katya     Quickstep            28   23    22   23      24
Stacey and Kevin    Samba                27   26    27   25      33
Vick and Graziano   Cha cha cha          28   23    26   25      20
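
If you want to follow along, here’s a minimal way to load the table above into the df used in the snippets below (my reconstruction, not the original loading code):

import pandas as pd

# Columns: partnership, dance, expert fan prediction, gradient boosting,
# XGBoost, random forest, and the judges' actual total.
columns = ['partners', 'dance', 'fan', 'gbr', 'xgbr', 'rfr', 'actual']
rows = [
    ('Ashley and Pasha', 'Rumba', 31, 30, 32, 26, 36),
    ('Charles and Karen', 'Street/Commercial', 28, 25, 23, 25, 36),
    ('Danny and Amy', 'Jive', 26, 26, 29, 25, 37),
    ('Faye and Giovanni', 'Foxtrot', 35, 32, 31, 29, 33),
    ('Graeme and Oti', 'Tango', 27, 24, 23, 23, 29),
    ('Joe and Dianne', 'Waltz', 30, 28, 30, 27, 29),
    ('Kate and Aljaz', 'Viennese Waltz', 31, 26, 29, 28, 26),
    ('Lauren and AJ', 'Contemporary', 30, 23, 23, 24, 24),
    ('Ranj and Janette', 'American Smooth', 25, 26, 27, 26, 25),
    ('Seann and Katya', 'Quickstep', 28, 23, 22, 23, 24),
    ('Stacey and Kevin', 'Samba', 27, 26, 27, 25, 33),
    ('Vick and Graziano', 'Cha cha cha', 28, 23, 26, 25, 20),
]
df = pd.DataFrame(rows, columns=columns)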

I made a few charts to visualize the scoring.

First, a scatter plot to see how predictions compared to actual scoring:

import seaborn as sns

# Predicted (expert fan and XGBoost) vs. actual score, colored by partnership.
with sns.plotting_context('talk'):
    sns.pairplot(df, x_vars=['fan', 'xgbr'], y_vars=['actual'],
                 hue='partners', height=4)

Compared to last time, you can see that both the expert prediction and the model had a very difficult time predicting the scores accurately. To look at this more quantitatively, I calculated the same measures as last time:

from sklearn.metrics import mean_squared_error, r2_score

print('number exactly correct:')
for predict in ['fan', 'gbr', 'xgbr', 'rfr']:
    num_right = (df[predict] == df['actual']).sum()
    print('{} : {}'.format(predict, num_right))

print('------')

print('root mean square error:')
for predict in ['fan', 'gbr', 'xgbr', 'rfr']:
    rmse = mean_squared_error(df['actual'], df[predict]) ** 0.5
    print('{} : {:.1f}'.format(predict, rmse))

print('------')

print('r-squared coefficient of determination:')
for predict in ['fan', 'gbr', 'xgbr', 'rfr']:
    r_2 = r2_score(y_true=df['actual'], y_pred=df[predict])
    print('{} : {:.2f}'.format(predict, r_2))

resulting in:

number exactly correct:
fan : 1
gbr : 1
xgbr : 0
rfr : 1
------
root mean square error:
fan : 5.7
gbr : 5.5
xgbr : 5.6
rfr : 6.6
------
r-squared coefficient of determination:
fan : -0.14
gbr : -0.05
xgbr : -0.09
rfr : -0.48

On all counts, the predictions were less accurate than last time. In fact, the r-squared values were all negative, meaning every set of predictions fit the actual scores worse than simply guessing the mean actual score for every routine would have.
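
To see why a negative r-squared means “worse than guessing the average”, here’s a quick check (my addition): a constant prediction of the mean actual score lands at exactly zero:

import numpy as np

# A constant guess of the mean actual score has r-squared of exactly 0;
# negative values therefore mean "worse than guessing the average".
baseline = np.full(len(df), df['actual'].mean())
print(r2_score(df['actual'], baseline))  # 0.0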

Once again, the gradient boosting regressors performed better than the random forest regressor, though this time the plain gradient boosting was ever-so-slightly better than XGBoost. XGBoost’s better performance last week may have just been noise, since the two models’ cross-validation performance wasn’t all that different. The prediction accuracy of those two models was competitive with the expert fan’s.

I also plotted the residuals:

import matplotlib.pyplot as plt

# Residual = predicted - actual, so positive means an overestimate.
points = ['x', '.', '+', '_']
with sns.plotting_context('talk'):
    for point, predict in zip(points, ['fan', 'gbr', 'xgbr', 'rfr']):
        resid = df[predict] - df['actual']
        plt.plot(df['actual'], resid, point, label=predict, alpha=0.9)
    plt.ylim(-14.5, 14.5)
    plt.legend()
    plt.xlabel('actual score')
    plt.ylabel('residual')
    plt.title('Week 5 score residuals')

The residual plot shows us there was a definite trend in how the predictions were wrong. In all cases, the predictions overestimated the low scores and underestimated the high scores.

It’s not all that surprising that the predictions were inaccurate in this particular way. A score of 20 is low at this point in the competition, so it’s more plausible to guess that the partnership that scored 20 would land 5 or so points higher rather than even lower. Similarly, a prediction that misses a score in the high 30s by 5 or 10 points must be an underestimate, since a routine can’t score higher than 40.
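
One way to put a number on that trend (my addition, not part of the original analysis) is to fit a line to residual versus actual score and look at the slope, which should come out clearly negative here:

for predict in ['fan', 'gbr', 'xgbr', 'rfr']:
    resid = df[predict] - df['actual']
    # Slope of the best-fit line of residual vs. actual score; a negative
    # slope means low scores were overpredicted and high scores underpredicted.
    slope = np.polyfit(df['actual'], resid, 1)[0]
    print('{} : slope {:.2f}'.format(predict, slope))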

The magnitude of the underestimates of the high scores tended to be larger, likely because there were some very high scores from somewhat unexpected dances: Charles and Karen’s couple’s choice street dance, and Danny and Amy’s aviation-themed jive.

A histogram (with kernel density estimates added; thanks, seaborn!) illustrates a consistent point: the distribution of actual scores was broader than any of the predicted distributions:

# Overlaid histograms and KDEs for each prediction column plus the actual scores.
with sns.plotting_context('talk'):
    for predict in ['fan', 'gbr', 'xgbr', 'rfr', 'actual']:
        sns.distplot(df[predict], label=predict)
    plt.legend()
    plt.title('Week 5 score distributions')
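
To put a number on the spread (again my addition), the standard deviations tell the same story:

# The actual scores vary more than any of the prediction sets.
print(df[['fan', 'gbr', 'xgbr', 'rfr', 'actual']].std())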

So, overall a tough week to predict! I think it’s clear why, though: many of the scores this week seemed surprising based on how the celebrities had done in the past. It also didn’t help that the machine learning models were thrown situations for which no data existed (a substitute for Bruno, new types of dances).

I’ll have to see whether next week, Halloween Week on Strictly, will go any better. It may be spooky, but hopefully there won’t be too many scary surprises for the models!

Judges: ready.

And remember, keeeeeeeeeeeep data-ing!