I grabbed the files from http://signstealingscandal.com/ to analyse them online via Datasette. Feel free to ask SQL questions here to do your own analysis of Tony’s work!

# SQL Analysis of the Astros Sign Stealing Scandal

The average across the dataset was 14% of pitches had a trashcan bang prior to it. (How did no one figure this out! That’s a lot!)

Going with that number, I was curious how many runs they got in games where they had more than that average. You could also remove the summary query, and just run the sub-query to look for correlations.

But yes, it definitely looks like there was a benefit overall of around a run per game when they were banging on the trash cans.

Runs with >=14% | Games with >=14% | Runs with <14% | Games with <14% |
---|---|---|---|

5.29 | 28 | 4.17 | 30 |

```
with pitch_counts as (
select game_pk, cast(sum(case when has_bangs='y' then 1 else 0 end) as real) banged_pitches,
cast(count(*) as numeric) total_pitches
from astros_bangs_20200127
group by game_pk)
select
round(sum(case when t.banged_pct >= 0.14 then t.runs else 0 end)/cast(sum(case when t.banged_pct >= 0.14 then 1 else 0 end) as real),2) "Runs with >=14%",
sum(case when t.banged_pct >= 0.14 then 1 else 0 end) "Games with >=14%",
round(sum(case when t.banged_pct < 0.14 then t.runs else 0 end)/cast(sum(case when t.banged_pct < 0.14 then 1 else 0 end) as real),2) "Runs with <14%",
sum(case when t.banged_pct < 0.14 then 1 else 0 end) "Games with <14%"
from (
select
a.game_pk, (case when a.final_home_runs > a.final_away_runs then 'W' else 'L' end) result,
(a.final_home_runs - a.final_away_runs) run_diff,
a.final_home_runs runs,
p.banged_pitches, p.total_pitches,
round( (p.banged_pitches/p.total_pitches), 3) banged_pct
from astros_bangs_20200127 a, pitch_counts p
where a.game_pk = p.game_pk
group by a.game_pk
) t
```