New preprint from the Modin team: Towards Scalable Dataframe Systems

Towards Scalable Dataframe Systems

Devin Petersohn, William Ma, Doris Lee, Stephen Macke, Doris Xin, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, Aditya Parameswaran

Dataframes are a popular and convenient abstraction to represent, structure, clean, and analyze data during exploratory data analysis. Despite the success of dataframe libraries in R and Python (pandas), dataframes face performance issues even on moderately large datasets. In this vision paper, we take the first steps towards formally defining dataframes, characterizing their properties, and outlining a research agenda towards making dataframes more interactive at scale. We draw on tools and techniques from the database community, and describe ways they may be adapted to serve dataframe systems, as well as the new challenges therein. We also describe our current progress toward a scalable dataframe system, Modin, which is already up to 30× faster than pandas in preliminary case studies, while enabling unmodified pandas code to run as-is. In its first 18 months, Modin is already used by over 60 downstream projects, has over 250 forks, and 3,900 stars on GitHub, indicating the pressing need for pursuing this agenda.