Support for Multi-index

Inspired by GitHub issue #2576 in modin-project/modin: "How to use `.loc` in case of multi-indexed dataframes?"

It seems that the current design of the Modin DataFrame "wrapper" (the modin.pandas.DataFrame thingy) is rather rigid when it comes to indexing: it appears to support only one-dimensional indexers along each axis, at least when doing a .loc[] on them.

Example: modin/indexing.py at 3652d19ce6768f9cb739f784d4c96d0d2d55ea06 · modin-project/modin · GitHub

I think we should discuss the steps needed to support multi-dimensional indices, as adding it looks like a nontrivial task (from what I saw, it touches a lot of small places around the code).

Maybe I’m misunderstanding something here?

cc @gshimansky

When I tried to find a solution for bug #2576, I figured out that we currently lack the comprehensive parser of .loc expressions that Pandas has. These expressions can be very complex. We have to either copy their logic or somehow reuse it. I am referring, for example, to the function _get_loc_level, which is only one part of Pandas MultiIndex support.
Is this what you mean?
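
To give a feel for how varied these expressions are, here is a small sketch in plain pandas (not Modin). The frame, level names and column names are made up for illustration; only the `.loc` forms themselves are the point.

```python
import pandas as pd

# Illustrative two-level index; names and values are arbitrary.
index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2, 3]], names=["outer", "inner"]
)
df = pd.DataFrame({"x": range(6), "y": range(10, 16)}, index=index)

# Partial key: all rows under "a"; the matched outer level is dropped.
part = df.loc["a"]

# Full tuple key plus a column label: a single scalar.
scalar = df.loc[("b", 2), "x"]

# Per-level selection: every outer label, inner labels 1 and 3 only.
sliced = df.loc[(slice(None), [1, 3]), :]

# The same kind of selection written with pd.IndexSlice and label ranges.
idx = pd.IndexSlice
ranged = df.loc[idx["a":"b", 2:3], "y"]
```

Each of these takes a different path through the pandas indexing code, which is the kind of logic that would have to be copied or reused.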

Yes, I believe this is true.

This, too. I am currently under the impression that we only support the simplest case of single-index dataframes by design, not only in .loc[] but in other places as well. I would be very grateful if someone could point out where I am wrong.

Do we have any docs or something written down on why we do not store columns+dtypes and indices in the driver process? At least I haven’t seen any indication of that.

All indices and data types are stored in the driver. Types can be lazily fetched if they were not inferred/materialized, but I see that as a good feature rather than a bug.

Where do we store the indices? I could use a pointer…

Also, after looking into the internals of how Pandas deals with multi-indices, I think it might be worth trying to make Pandas' own _LocIndexer / _iLocIndexer work with the Modin Frame somehow, e.g. by inheriting from them instead of implementing everything from scratch.

I agree: since we reuse the Pandas Index classes unchanged, we might as well try to reuse the LocIndexer class too.

mask is a low-level operator that takes a list of positions or names for the rows. We can add the logic there. Currently, I believe a significant portion of .loc is translated to .iloc in the LocIndexer class.
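
As a rough illustration of what translating labels to positions can look like for a MultiIndex, here is a sketch using only public pandas Index methods (`get_loc` and `MultiIndex.get_locs`). The helper name `labels_to_positions` is hypothetical and this is not how Modin's indexer is actually written; it only shows that the pandas index object can already do the label lookup.

```python
import numpy as np
import pandas as pd


def labels_to_positions(index, key):
    """Hypothetical helper: turn one label key into integer row positions."""
    loc = index.get_loc(key)
    # get_loc may return an int, a slice, or a boolean mask.
    if isinstance(loc, slice):
        return np.arange(len(index))[loc]
    if isinstance(loc, np.ndarray) and loc.dtype == bool:
        return np.flatnonzero(loc)
    return np.array([loc])


index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2, 3]], names=["outer", "inner"]
)
df = pd.DataFrame({"x": range(6)}, index=index)

# Partial key on the outer level.
outer_positions = labels_to_positions(index, "a")

# Full tuple key.
tuple_positions = labels_to_positions(index, ("b", 2))

# Per-level key (all outer labels, inner labels 1 and 3).
level_positions = index.get_locs([slice(None), [1, 3]])

# The positions can then be fed to a purely positional selection.
# Note that iloc keeps both index levels, unlike df.loc["a"].
selected = df.iloc[outer_positions]
```

A positional result like this is roughly the shape of input that a low-level operator taking row positions could consume.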

I don't see what you're trying to illustrate by pointing to _LocIndexer, but if I understand the other part correctly, you mean we need to either improve what mask() does or improve what we feed into it (only for the .loc/.iloc parts with a multi-index). Am I right here?

Yes, that is what I mean. Probably we want to improve what we feed into mask, but I can also see us improving how mask itself handles MultiIndex. _LocIndexer.__getitem__ is insanely diverse, accepting a callable, list, hashable, series, dataframe, index object, etc. I don't think we want mask to handle every one of those cases.
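
To make the "improve what we feed into mask" option a bit more concrete, here is a minimal sketch of normalizing a few of those key types into plain integer positions before anything low-level sees them. The function `normalize_row_key` is hypothetical, it runs on plain pandas, and for simplicity it assumes a flat, unique index; a MultiIndex would need more cases than shown here.

```python
import numpy as np
import pandas as pd


def normalize_row_key(df, key):
    """Hypothetical helper: reduce a .loc-style row key to integer positions.

    Assumes a flat index with unique labels.
    """
    # A callable is evaluated against the frame first.
    if callable(key):
        key = key(df)
    # A boolean Series is aligned to the frame's index, then converted.
    if isinstance(key, pd.Series) and key.dtype == bool:
        aligned = key.reindex(df.index, fill_value=False)
        return np.flatnonzero(aligned.to_numpy())
    # List-like collections of labels are looked up in one call.
    if isinstance(key, (list, np.ndarray, pd.Index, pd.Series)):
        positions = df.index.get_indexer(key)
        if (positions == -1).any():
            raise KeyError("some labels are not in the index")
        return positions
    # Anything else is treated as a single hashable label.
    return np.array([df.index.get_loc(key)])


df = pd.DataFrame({"x": [0, 1, 2, 3]}, index=list("abcd"))

from_scalar = normalize_row_key(df, "c")
from_list = normalize_row_key(df, ["d", "a"])
from_bool_series = normalize_row_key(df, df["x"] > 1)
from_callable = normalize_row_key(df, lambda frame: frame["x"] < 2)
```

With a front end like this, the low-level operator only ever has to deal with one input shape, which is the trade-off being discussed.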