Thanks @ddutt, great question!
For smaller DataFrames, we could modify the internals so that the dataset isn't partitioned at all and plain pandas is used for those smaller objects. We aren't doing that yet, but it could be done.
There are some other considerations, like the interaction between these unpartitioned objects and partitioned DataFrames in `where` and similar methods. For these cases I imagine we could do something similar to the database world: broadcast smaller objects to each partition, or go ahead and partition the objects if they're a bit larger. The small/large cutoff would need to be determined as well.
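To make the idea concrete, here is a minimal sketch of that decision, assuming a made-up size threshold and function names (this is illustrative pseudologic, not Modin's actual internals):

```python
# Hypothetical planner for an op combining a partitioned frame with another
# object. Threshold and names are assumptions for illustration only.
BROADCAST_THRESHOLD = 1_000_000  # bytes; the real cutoff would need tuning


def plan_binary_op(left_nbytes: int, right_nbytes: int) -> str:
    """Pick a naive execution strategy for a two-frame operation."""
    if right_nbytes <= BROADCAST_THRESHOLD:
        # Small enough: ship a full copy of the right object to every
        # partition of the left frame, like a database broadcast join.
        return "broadcast"
    # Otherwise partition the right object and align partitions pairwise.
    return "co-partition"
```

The interesting open question is exactly where that threshold should sit, since it depends on serialization cost and cluster size.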
We do have a minimum partition size at the moment, but there is overhead to calling remote functions in Ray (IPC and RPC costs) that will end up dominating the runtime for smaller data. I agree that we need to have something here, and I think the biggest consideration should be given to the operations that take multiple DataFrames.
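A back-of-the-envelope cost model shows why per-call overhead dominates small data. The numbers below are made up for illustration; real Ray overheads vary by workload and cluster:

```python
# Crude cost model: each remote task pays a fixed IPC/RPC cost, and compute
# is split evenly across partitions. All constants are illustrative guesses.
def estimated_runtime(nbytes: int, n_partitions: int,
                      per_call_overhead_s: float = 0.001,
                      throughput_bytes_per_s: float = 1e8) -> float:
    compute_s = nbytes / throughput_bytes_per_s / n_partitions
    return n_partitions * per_call_overhead_s + compute_s
```

Under this model, a tiny 10 KB frame runs faster as a single "partition" (i.e. plain pandas) than split across 16 remote tasks, while a multi-GB frame benefits from the split, which is the intuition behind wanting a size-based cutover.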
Related conversation in a GitHub issue: https://github.com/modin-project/modin/issues/626#issuecomment-494596925
Any other thoughts?