Scraping Dance-Off Data from Wikipedia
Now that I have two weeks of predicting Strictly scores under my belt, I’m interested in looking at who gets voted into the dance-off. For example, a 2016 article in The Guardian by Julia Carter and Richard McManus reported that
“after controlling for where the couple have come in the judges’ scoring, an ethnic minority celebrity is statistically significantly more likely to be in the bottom two and therefore to have received a lower public vote.”
Their study found a 71% increase in the likelihood of being in the dance-off for non-white celebrities, and an 83% increase for black women, which are strikingly large effects. A more detailed description of their analysis doesn’t seem to have been published, so I’m interested in whether I can reproduce their results.
It would also be interesting to look at other possible predictive variables: e.g., size of social media following, age, hometown, or type of celebrity.
Because the BBC has never released the public voting results, the only outcome that can be assessed is the identity of the two contestants in the dance-off each week.
The first step is gathering the data. Unfortunately, Ultimate Strictly doesn’t provide a full set of dance-off data. Instead, I scraped the data from the voluminous Strictly series-specific Wikipedia pages.
The Wikipedia pages do provide dance-off data for all series, but the information is only summarized a full series at a time, encoded in the background color of cells in each series’ “Scoring chart” (e.g., for Series 15).
Despite this inconvenience, the structure of the Wikipedia pages is uniform enough that I was able to construct a Python script to scrape the couples in each week’s dance-off. As an overview, the steps were to:
- Use `requests` to download the HTML.
- Use Beautiful Soup 4 (`bs4`) to navigate the parse tree of the HTML and isolate the “Scoring chart” table (a minimal sketch of these first two steps follows this list).
- Use, and slightly modify, a very helpful function I found on StackOverflow that converts an HTML table that uses `rowspan` (i.e., a single cell stretching over multiple rows) into a nested list representing a standard m x n array.
- Write a function, `check_danceoff()`, that determines whether a table cell has the orange (eliminated) or light blue (survived the dance-off) color that denotes being in the dance-off.
- Pipe everything together in a function, `find_danceoffs()`, that returns a tidy DataFrame in which each row corresponds to an observation of a single partnership in the dance-off for a single week.
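A minimal sketch of those first two steps, using the Series 15 page (the choice of bs4’s `html.parser` backend here is mine; any parser Beautiful Soup supports would do):

```python
import bs4
import requests

url = 'https://en.wikipedia.org/wiki/Strictly_Come_Dancing_(series_15)'
soup = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')

# the "Scoring chart" section heading carries an id we can anchor on;
# the chart itself is the next <table> in the document
chart_span = soup.find('span', attrs={'id': 'Scoring_chart'})
score_table = chart_span.findNext(name='table')
```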
I’ll append all the scraping code to the bottom of this post, but first a look at the final collection of the entire dance-off data set:
```python
df_all = pd.DataFrame(columns=['celebrity', 'professional', 'week', 'series'])
for x in range(16):
    series_str = 'series_{}'.format(x + 1)
    scd_url = 'https://en.wikipedia.org/wiki/Strictly_Come_Dancing_({})'.format(series_str)
    df_series = find_danceoffs(scd_url)
    df_series['series'] = x + 1
    df_all = df_all.append(df_series)
```

And then `df_all.tail()` gives (Wikipedia was incredibly fast to update with today’s series 16 results!):
| | celebrity | professional | week | series |
|-----|-----------|--------------|------|--------|
| 309 | Lee | Nadiya | 3 | 16 |
| 310 | Charles | Karen | 4 | 16 |
| 311 | Katie | Gorka | 4 | 16 |
| 312 | Seann | Katya | 5 | 16 |
| 313 | Vick | Graziano | 5 | 16 |
And `df_all.describe()` gives:
| | celebrity | professional | week | series |
|--------|-----------|--------------|------|--------|
| count | 314 | 314 | 314 | 314 |
| unique | 170 | 49 | 13 | 16 |
| top | Mark | Brendan | 5 | 7 |
| freq | 10 | 28 | 32 | 24 |
It turns out Brendan is the professional who’s been in the dance-off more than anyone else.
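`describe()` only surfaces the mode, but the full ranking is one line away with the `df_all` assembled above:

```python
# professionals ordered by number of dance-off appearances
print(df_all['professional'].value_counts().head())
```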
The rest of the code is pasted below. And remember, keeeeeeeeeep data-ing!
```python
import bs4
import numpy as np
import pandas as pd
import requests
from itertools import product  # needed by table_to_2d below


def check_danceoff(ele):
    """Return True if a cell's colour marks a dance-off couple:
    orange = eliminated, light blue = survived the dance-off."""
    if ele.has_attr('style'):
        # remove whitespace and ;
        style = ele.attrs['style'].replace(' ', '').replace(';', '')
        return style in ('background:lightblue', 'background:orange')
    elif ele.has_attr('bgcolor'):
        return ele.attrs['bgcolor'] in ('orange', 'lightblue')
    else:
        return False
```
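A quick spot check of `check_danceoff()` against a synthetic cell (hypothetical markup, but it mirrors the inline styles the Wikipedia charts use):

```python
cell = bs4.BeautifulSoup('<td style="background:orange;">17</td>',
                         'html.parser').td
print(check_danceoff(cell))  # True: orange marks an eliminated couple
```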
```python
# from https://stackoverflow.com/a/48451104/4280216
# use because of rowspans
# this function handles rowspans to return array with
# equal number of entries for each row
#
# I modified so the value returned for each cell is
# cell_func(cell) rather than cell.get_text().
def table_to_2d(table_tag, cell_func):
    rowspans = []  # track pending rowspans
    rows = table_tag.find_all('tr')

    # first scan, see how many columns we need
    colcount = 0
    for r, row in enumerate(rows):
        cells = row.find_all(['td', 'th'], recursive=False)
        # count columns (including spanned).
        # add active rowspans from preceding rows
        # we *ignore* the colspan value on the last cell, to prevent
        # creating 'phantom' columns with no actual cells, only extended
        # colspans. This is achieved by hardcoding the last cell width as 1.
        # a colspan of 0 means "fill until the end" but can really only apply
        # to the last cell; ignore it elsewhere.
        colcount = max(
            colcount,
            sum(int(c.get('colspan', 1)) or 1 for c in cells[:-1]) + len(cells[-1:]) + len(rowspans))
        # update rowspan bookkeeping; 0 is a span to the bottom.
        rowspans += [int(c.get('rowspan', 1)) or len(rows) - r for c in cells]
        rowspans = [s - 1 for s in rowspans if s > 1]
    # it doesn't matter if there are still rowspan numbers 'active'; no extra
    # rows to show in the table means the larger than 1 rowspan numbers in the
    # last table row are ignored.

    # build an empty matrix for all possible cells
    table = [[None] * colcount for row in rows]

    # fill matrix from row data
    rowspans = {}  # track pending rowspans, column number mapping to count
    for row, row_elem in enumerate(rows):
        span_offset = 0  # how many columns are skipped due to row and colspans
        for col, cell in enumerate(row_elem.find_all(['td', 'th'], recursive=False)):
            # adjust for preceding row and colspans
            col += span_offset
            while rowspans.get(col, 0):
                span_offset += 1
                col += 1

            # fill table data
            rowspan = rowspans[col] = int(cell.get('rowspan', 1)) or len(rows) - row
            colspan = int(cell.get('colspan', 1)) or colcount - col
            # next column is offset by the colspan
            span_offset += colspan - 1
            # define actual value to put in output table
            value = cell_func(cell)
            for drow, dcol in product(range(rowspan), range(colspan)):
                try:
                    table[row + drow][col + dcol] = value
                except IndexError:
                    # rowspan or colspan outside the confines of the table
                    pass

        # update rowspan bookkeeping
        rowspans = {c: s - 1 for c, s in rowspans.items() if s > 1}

    return table
```
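To see the rowspan handling in action, here is a toy two-row table whose first cell spans both rows:

```python
html = ('<table>'
        '<tr><td rowspan="2">Kevin</td><td>26</td></tr>'
        '<tr><td>31</td></tr>'
        '</table>')
tag = bs4.BeautifulSoup(html, 'html.parser').table
print(table_to_2d(tag, lambda c: c.get_text()))
# [['Kevin', '26'], ['Kevin', '31']] -- the spanned cell repeats in both rows
```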
```python
def find_danceoffs(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    # initialise the output lists before the try block so they still
    # exist (empty) if parsing a page fails
    episode_list = []
    celeb_list = []
    pro_list = []
    try:
        chart_span = soup.find('span', attrs={'id': 'Scoring_chart'})
        score_table = chart_span.findNext(name='table')
        table_body = score_table.find('tbody')
        text_array = np.array(table_to_2d(table_body, lambda c: c.get_text().rstrip()))
        for_pd = {x[0]: x[1:] for x in text_array.T}
        df = pd.DataFrame(for_pd)
        # return df columns to original order and set 'Couple' as index
        if 'Couple' in df.columns:
            df = df[text_array[0]].set_index('Couple')
        # this handles inconsistent column name in Series 5 page
        elif 'Team' in df.columns:
            df = df[text_array[0]].set_index('Team')
            df.index.name = 'Couple'
        else:
            raise Exception("no 'Couple' or 'Team' column found")
        danceoff_mask = np.array(table_to_2d(table_body, check_danceoff))
        # convert to DataFrame with same columns and index as df
        danceoff_df = pd.DataFrame(danceoff_mask[1:, 1:], columns=df.columns, index=df.index)
        # keep only the cells whose colour marks a dance-off
        df_filter = df[danceoff_df]
        for col in df_filter.columns:
            episode = df_filter[col].dropna().reset_index()
            if len(episode) != 0:
                # clean up episode name for output:
                if col == '1+2':
                    clean_name = 2
                else:
                    only_digits = ''.join(c for c in col if c.isdigit())
                    clean_name = int(only_digits)
                # split list of couples into celeb and pro first names
                celebs = episode.apply(lambda x: x.Couple.split('&')[0].strip(), axis=1)
                pros = episode.apply(lambda x: x.Couple.split('&')[1].strip(), axis=1)
                for celeb, pro in zip(celebs, pros):
                    episode_list.append(clean_name)
                    celeb_list.append(celeb)
                    pro_list.append(pro)
    except Exception as e:
        print('problem with {}: {}'.format(url, e))
    danceoffs_dict = {'week': episode_list, 'celebrity': celeb_list, 'professional': pro_list}
    danceoffs = pd.DataFrame(danceoffs_dict)
    return danceoffs
```
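One last caveat if you re-run the collection loop from the top of the post on a newer pandas: `DataFrame.append` was deprecated and has since been removed, so the equivalent is to gather the per-series frames and concatenate once:

```python
frames = []
for x in range(16):
    scd_url = 'https://en.wikipedia.org/wiki/Strictly_Come_Dancing_(series_{})'.format(x + 1)
    df_series = find_danceoffs(scd_url)
    df_series['series'] = x + 1
    frames.append(df_series)
df_all = pd.concat(frames, ignore_index=True)  # one concat instead of repeated appends
```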