Japjot Singh - Introduction


#1

Hello! My name is Japjot Singh and I am a sophomore studying CS and Data Science (with a concentration in Applied Math) at UC Berkeley. I came across Modin while looking at the various RISE Lab projects for this semester. I am a big fan of optimizing workflows, so I am especially excited to help Modin support more Pandas functionality, particularly for some of the most nontrivial and inefficient Pandas operations.

Looking forward to working with everyone!


#2

Hi @japjot,

Thanks for the interest! It would be great if you could profile the groupby functionality (i.e., create a flame graph showing where the choke points are).
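As a starting point for that kind of profiling, here is a minimal sketch using only the standard library's cProfile and pstats. The helper name profile_call and the pure-Python workload toy_groupby_sum are hypothetical stand-ins so the snippet runs without Modin installed; neither is part of Modin.

```python
import cProfile
import io
import pstats
from collections import defaultdict


def profile_call(fn, *args, top=10, **kwargs):
    """Run fn under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    profile_call is a hypothetical helper, not a Modin API.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def toy_groupby_sum(keys, values):
    """Stand-in workload (assumption): sum values per key in pure Python."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
    return dict(totals)


keys = [i % 5 for i in range(100_000)]
values = list(range(100_000))
result, report = profile_call(toy_groupby_sum, keys, values)
print(report)
```

To profile the real thing, the stand-in could be swapped for something like `lambda: df.groupby("key").sum()`, where df is a Modin DataFrame and "key" is a placeholder column name. cProfile gives a table rather than a flame graph; a sampling profiler such as py-spy can record a flame graph of the same script.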

Here is a rough outline of how a groupby call goes through the stack.

  1. The initial call goes to dataframe.py and creates a DataFrameGroupBy object.
  2. The DataFrameGroupBy object then handles the calls that perform the aggregation (e.g., df.groupby().sum()).
  3. The actual aggregation for the DataFrameGroupBy object is done in the QueryCompiler.
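The three layers in the outline above can be pictured with a toy sketch. This is not Modin's code: the classes ToyDataFrame, ToyGroupBy, and ToyQueryCompiler and the method groupby_sum are hypothetical names, and the data is a plain list of dicts. It only illustrates the delegation pattern, where each layer hands the call down to the next.

```python
from collections import defaultdict


class ToyQueryCompiler:
    """Hypothetical bottom layer: the only place that touches the data."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per row

    def groupby_sum(self, by):
        # Sum every non-key column, grouped by the value in column `by`.
        totals = defaultdict(lambda: defaultdict(int))
        for row in self.rows:
            for column, value in row.items():
                if column != by:
                    totals[row[by]][column] += value
        return {key: dict(cols) for key, cols in totals.items()}


class ToyGroupBy:
    """Hypothetical middle layer: remembers the key, forwards aggregations."""

    def __init__(self, query_compiler, by):
        self._query_compiler = query_compiler
        self._by = by

    def sum(self):
        return self._query_compiler.groupby_sum(self._by)


class ToyDataFrame:
    """Hypothetical top layer: groupby() only builds the group-by object."""

    def __init__(self, rows):
        self._query_compiler = ToyQueryCompiler(rows)

    def groupby(self, by):
        return ToyGroupBy(self._query_compiler, by)


df = ToyDataFrame([{"k": "a", "x": 1}, {"k": "b", "x": 2}, {"k": "a", "x": 3}])
result = df.groupby("k").sum()
print(result)
```

In a layout like this, a profile of df.groupby(...).sum() would show time split between object creation in the upper layers and the aggregation in the bottom layer, which is the split worth looking for in the real stack.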

There are some old benchmarking files for groupby, though I am not sure how relevant they are to this:

https://github.com/modin-project/modin/blob/17b7fccb28cf525bf1abd1a7be979c4cb5b66688/benchmarks/groupby_benchmark.py
https://github.com/modin-project/modin/blob/17b7fccb28cf525bf1abd1a7be979c4cb5b66688/benchmarks/pandas/groupby_benchmark.py

If you have any questions, just let me know!