Does the backend's masking interface support `slice` usage?

None of the QC or Modin Frame docstrings mention passing slices as indexers (view, getitem_row_array, mask). Nevertheless, front-end code sometimes passes slices to the backend. Moreover, modin_frame.mask has logic for processing slices, although its docstring only allows list-like indexers.

I’m currently working on pull request #3513, which optimizes the __getitem__ flow of the iloc/loc accessors. There it is sometimes desirable to pass a slice to the backend’s view method, but it isn’t clear whether this is allowed or should be avoided.

In the current implementation in modin/master, slices are passed to the backend only if their step is 1. In turn, modin_frame.mask can only handle slices with a step of 1 as well (it doesn’t verify this; it just works incorrectly if the step is not 1). This interface seems easy to break: it isn’t documented anywhere, and any accidental leak of a non-1-step slice may result in undefined modin_frame.mask behavior.
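To make the failure mode concrete, here is a minimal sketch of how an unchecked step can go wrong. `naive_mask` is a hypothetical stand-in written for this illustration, not Modin’s actual modin_frame.mask; it only assumes a mask that reads the slice’s start and stop and never looks at the step.

```python
def naive_mask(rows, indexer):
    """Hypothetical mask that silently assumes unit-step slices."""
    if isinstance(indexer, slice):
        # Only start/stop are consulted; indexer.step is never checked.
        start = indexer.start or 0
        stop = len(rows) if indexer.stop is None else indexer.stop
        return rows[start:stop]
    # List-like indexers are handled position by position.
    return [rows[i] for i in indexer]


rows = list(range(10))
unit = naive_mask(rows, slice(2, 8))        # same as rows[2:8]
strided = naive_mask(rows, slice(2, 8, 2))  # step ignored, not rows[2:8:2]
```

No error is raised for the strided slice; the caller simply gets more rows than it asked for, which is why an undocumented contract here is risky.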

We should get rid of this API inconsistency somehow. I see three possible options:

  1. Allow passing any slice to qc.view and modin_frame.mask and state this explicitly in the methods’ contracts. This will require adding full slice support to modin_frame.mask.
  2. Explicitly allow only 1-step slices in the docstrings and add an assert statement for this to modin_frame.mask.
  3. Drop slice and use pandas.RangeIndex instead. This doesn’t require changing the functions’ contracts (backends are not forced to implement the possibly complex logic of slice indexing), yet it still lets us keep all the benefits of slice indexing if a specific backend’s mask implementation supports it (it can check whether the indexer is range-like and treat it as a slice).
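As a rough sketch of how options 2 and 3 might look on the backend side, the functions below operate on a plain Python list. All names (`mask_unit_step_only`, `mask_list_like`, `range_to_slice`) are hypothetical and exist only for this illustration, and the sketch uses Python’s built-in range as the range-like indexer and assumes all positions are non-negative.

```python
def mask_unit_step_only(rows, indexer):
    """Option 2 sketch: slices are allowed, but only with step 1 (or None)."""
    if isinstance(indexer, slice):
        assert indexer.step in (None, 1), "only unit-step slices are supported"
        return rows[indexer]
    return [rows[i] for i in indexer]


def range_to_slice(r):
    """Turn a range of non-negative positions into an equivalent slice."""
    if len(r) == 0:
        return slice(0, 0)
    # A descending range that reaches position 0 has a negative stop,
    # which a slice would read as "from the end", so use None instead.
    stop = r.stop if r.stop >= 0 else None
    return slice(r.start, stop, r.step)


def mask_list_like(rows, indexer):
    """Option 3 sketch: the contract is list-like only; range is a fast path."""
    if isinstance(indexer, range):
        return rows[range_to_slice(indexer)]
    # Generic path that any backend has to support anyway.
    return [rows[i] for i in indexer]


rows = list(range(10))
strided = mask_list_like(rows, range(0, 10, 3))
reverse = mask_list_like(rows, range(5, -1, -1))
explicit = mask_list_like(rows, [7, 1, 4])
```

With option 3 a backend that ignores the fast path still returns correct results, because a range is already a valid list-like indexer; the slice-style optimization becomes an implementation detail rather than part of the contract.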

cc modin-core: @devin-petersohn @YarShev

Thanks @dchigarev for the great writeup.

I think I prefer the third option, but we probably shouldn’t use pandas.RangeIndex. Can we just use a Python range?

pandas.RangeIndex was just the first range-like thing that came to mind; we can certainly agree on using Python’s range.
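For what it’s worth, the front-end side of this could be quite small. The sketch below assumes a hypothetical helper, `slice_to_range`, that normalizes a user-supplied slice into a Python range before it reaches the backend; it relies only on the built-in `slice.indices`, which resolves `None` and negative bounds against the axis length.

```python
def slice_to_range(indexer, axis_length):
    """Hypothetical front-end helper: convert a slice into an equivalent range."""
    # slice.indices returns a concrete (start, stop, step) triple
    # for a sequence of the given length.
    return range(*indexer.indices(axis_length))


every_other = slice_to_range(slice(None, None, 2), 7)
tail = slice_to_range(slice(-3, None), 10)
backwards = slice_to_range(slice(None, None, -1), 5)
```

The resulting range supports `len`, iteration, and integer indexing, so it already satisfies a list-like contract, while still exposing `start`, `stop`, and `step` for a backend that wants to treat it like a slice.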

I also lean toward the third option. IMO, internal objects like QueryCompiler and ModinDataFrame should have strict and concise API contracts and should not depend on external libraries if possible.