Unveiling Attribution Gaps and License Adherence: A Deep Dive into Kaggle's Dataset Landscape

Charchit Shukla, ChityingSussane Chan, Majd Alslman, Isuri Hitinayake

The objective of this project is to delve into Kaggle, a leading platform for dataset sharing and utilization. Specifically, the aim is to conduct an analysis of Kaggle datasets to determine the extent to which proper credits are attributed to authors and whether the datasets are correctly licensed. This report zeroes in on three primary datasets—diamonds, iris, and mtcars—which are included in the default R and ggplot2 package. Our approach commenced by generating metadata for these selected datasets using the Kaggle API. Subsequent stages of our investigation involve a comprehensive analysis and reporting on the absence of adequate attribution and potential license infringements within datasets sourced from Kaggle. These subsequent steps and considerations are elucidated in Section 2: Methodology and Section 3: License Compliance Assessment. Throughout this report, we underscore the significance of proper attribution and strict adherence to licensing terms concerning the utilization of datasets. This emphasis underscores the critical importance of ethical and legal compliance within the realm of dataset usage.