First I found a data set on GitHub covering Summer Olympic medals from 1976 to 2008, and I did some analysis of it: Summer Olympic medals 1976 to 2008. Later I found a more complete data set on Kaggle, Summer Olympics Medalists, which lists every Summer Olympics medalist from 1896 to 2008. (Probably) the same data are also available here and here.
It would be nice to also have more recent data. One option would be to extend this data set with data from
But maybe somebody else has already done this. An updated version up to 2016 (Rio), which also contains the Winter Games, is available here.
Searching Kaggle for Olympics returns several interesting pages. Some provide the missing data
others describe different ideas on how to analyze the Olympics data:
and at GitHub
The data set athletes_events.csv contains data on all (271117) athletes in the Summer and Winter Games up to 2016. It also contains data on age, weight, and height.
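As a rough sketch of how such a file could be reduced to Summer medal rows, the snippet below uses only Python's standard library. The column names `Season` and `Medal`, and the use of `NA` for "no medal", are assumptions about the file layout and should be checked against the actual header. The sample rows are made up purely to show the call.

```python
import csv
import io

def read_medal_rows(handle, season="Summer"):
    """Return the rows for one season in which a medal was awarded.

    Assumes columns named 'Season' and 'Medal', with 'NA' (or an empty
    string) marking rows without a medal.
    """
    reader = csv.DictReader(handle)
    return [
        row for row in reader
        if row["Season"] == season and row["Medal"] not in ("", "NA")
    ]

# Tiny made-up sample in the assumed layout, only to show the call.
sample = io.StringIO(
    "Name,Age,Height,Weight,NOC,Year,Season,Event,Medal\n"
    "Athlete A,24,180,75,AAA,2016,Summer,Event 1,Gold\n"
    "Athlete B,30,NA,NA,BBB,2016,Summer,Event 1,NA\n"
    "Athlete C,27,170,60,CCC,2014,Winter,Event 2,Silver\n"
)
medal_rows = read_medal_rows(sample)
print(len(medal_rows))

# With the real file, the call would look like:
# with open("athletes_events.csv", newline="", encoding="utf-8") as f:
#     medal_rows = read_medal_rows(f)
```

Reading with `csv.DictReader` keeps the example free of third-party packages; for heavier analysis a data-frame library would probably be more convenient.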
I also found the data set All Year Olympic Dataset (with 2020 Tokyo Olympics).csv. It is a reduced extension (some variables are missing) of athletes_events.csv up to the 2020 Tokyo Olympics. I decided to use this file, Summer Olympic medals till 2020, but the results I obtained suggest that something is wrong with the 2020 data. So I turned to the data up to 2016.
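One quick way to check a suspicion like this is to count, for each year, how many distinct events have at least one medal row, and to look for a year that is far out of line with its neighbours. This is only a sketch of a possible sanity check, not the analysis that was actually done here. The column names `Year`, `Event`, and `Medal` are assumptions, and the rows below are invented for illustration.

```python
from collections import defaultdict

def medal_events_per_year(rows):
    """Map each year to the number of distinct events with a medal row.

    Assumes each row is a dict with 'Year', 'Event', and 'Medal' keys,
    where 'NA' or an empty string means no medal was awarded.
    """
    events = defaultdict(set)
    for row in rows:
        if row["Medal"] not in ("", "NA"):
            events[int(row["Year"])].add(row["Event"])
    return {year: len(names) for year, names in sorted(events.items())}

# Invented rows, only to show the shape of the result.
rows = [
    {"Year": "2016", "Event": "Event 1", "Medal": "Gold"},
    {"Year": "2016", "Event": "Event 1", "Medal": "Silver"},
    {"Year": "2016", "Event": "Event 2", "Medal": "Gold"},
    {"Year": "2020", "Event": "Event 1", "Medal": "Gold"},
    {"Year": "2020", "Event": "Event 2", "Medal": "NA"},
]
counts = medal_events_per_year(rows)
print(counts)
```

Counting distinct events rather than medal rows avoids inflating the totals for team events, where every team member may have a row of their own.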