Hi everyone! I’m Manan Khattar, a sophomore studying CS and Applied Math (concentrating in Data Science) at Cal. I was introduced to the Modin project by William, who said that he could use some help with profiling and implementing pandas functions to improve pandas coverage in Modin.
I want to help make Modin more efficient by identifying bottlenecks in existing implementations and porting pandas functions to Modin where they have not already been implemented. To do this, I plan to profile functions, identify chokepoints, and ultimately bring nontrivial pandas functionality like MultiIndexing or DateTime indices into Modin.
I looked at Modin’s architecture and want to confirm that we don’t have to worry about the Query Compiler or Partition Manager layers.
Accordingly, what will be the differences between the default Pandas method for to_feather and Modin’s to_feather, especially when the default Pandas just uses feather.write_feather? Should I look at how that is implemented and see if that can be parallelized?
The only other to_ function that has already been implemented is to_sql, which uses BaseFactory, specifically the PandasOnRayIO factory. Is that something I should consider using?
to_sql also has two implementations, one in modin/modin/engines/ray/pandas_on_ray/io.py and the other in modin/pandas/dataframe.py. In terms of to_feather, which one would you like me to implement? Do they have similar implementations?
What would be the fastest way to contact you if we need help with something? Is a follow-up on the Introduction page the best approach or are there any other alternatives?
As a personal request, I think group work sessions would be very useful for speeding up the work of everyone on the project. Is this something we can try in the future?
For to_feather and other to_ functions, you likely won’t have to touch QueryCompiler or PartitionManager.
The differences between pandas to_feather and modin to_feather will likely come from the fact that the modin dataframe is partitioned. As a result, you’ll want to find a way to write each partition independently into the same feather file.
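To make the "write each partition independently into one file" idea concrete, here is a rough, format-agnostic sketch. It is not Modin code and it does not write Feather: CSV text stands in for the real format, plain lists of row dicts stand in for partitions, and serialize_partition and write_partitions are made-up names for illustration. The point is only the shape: serialize every partition on its own (possibly in parallel), then append the pieces to a single sink in partition order.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def serialize_partition(rows, columns):
    """Turn one partition (a list of row dicts) into CSV text, without a header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def write_partitions(partitions, columns, sink):
    """Serialize partitions independently, then append them to `sink` in order."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so rows keep their partition order.
        chunks = list(pool.map(lambda p: serialize_partition(p, columns), partitions))
    sink.write(",".join(columns) + "\n")  # the header is written exactly once
    for chunk in chunks:
        sink.write(chunk)


# Two hypothetical row partitions of the same logical table.
partitions = [
    [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}],
    [{"a": 3, "b": "z"}],
]
out = io.StringIO()
write_partitions(partitions, ["a", "b"], out)
result = out.getvalue()
```

For the real to_feather, the open question is what plays the role of the shared sink and whether the format allows pieces to be appended this way, so treat this as a starting point for the investigation rather than a design.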
Every to_sql call first goes to to_sql in modin/pandas/dataframe.py, which calls BaseFactory's to_sql. Inside modin/data_management/factories.py, we determine the correct backend objects and QueryCompiler to use. For this first pass, let’s keep it on the PandasOnRayIO backend since it is the most comprehensive one. However, do not forget to update the modin/pandas/dataframe.py code so that it will call BaseFactory's to_feather instead of defaulting to pandas.
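The call chain described above can be sketched in a few lines. This is a simplified stand-in and not the actual Modin source: the class bodies, the io_cls attribute, the rows field, and the returned message are all assumptions made up for illustration. Only the names DataFrame, BaseFactory, PandasOnRayIO, and to_feather come from the discussion.

```python
class PandasOnRayIO:
    """Stand-in for the backend IO class (the real one lives in Modin)."""

    @classmethod
    def to_feather(cls, frame, path):
        # A real implementation would write the partitions here.
        return f"PandasOnRayIO wrote {len(frame.rows)} rows to {path}"


class BaseFactory:
    """Stand-in for the factory that picks the backend IO class."""

    io_cls = PandasOnRayIO  # hypothetical attribute: the backend chosen for this pass

    @classmethod
    def to_feather(cls, frame, path):
        # Forward the call to whichever backend was selected.
        return cls.io_cls.to_feather(frame, path)


class DataFrame:
    """Stand-in for the user-facing dataframe class."""

    def __init__(self, rows):
        self.rows = rows

    def to_feather(self, path):
        # Dispatch through the factory instead of defaulting to pandas.
        return BaseFactory.to_feather(self, path)


message = DataFrame([1, 2, 3]).to_feather("out.feather")
```

The user-facing method stays a thin forwarder, so switching backends later only means changing what the factory points at.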
Let’s try to keep everything on the discuss page for now so that other and future contributors will feel comfortable and less lonely on the Discourse page.
That is a good idea! Let’s talk more about this next time we meet.
I know it’s a lot of code to digest, so thanks for the hard work! If you have any questions, feel free to let me know, and PM me if it seems like I didn’t notice them.