Indexing update/rewrite


Indexing needs to be rewritten and updated. Currently it takes too long to fetch a large slice from a DataFrame, even when the objects are already in memory and the length of each partition is already known. This work was previously blocked on a distributed Series, but now that Series is merged and we have caught and fixed the known regressions, the next undertaking is to update indexing (loc, iloc, __getitem__).

After some deep profiling, I have found that a significant share of the time is spent in modin.engines.base.frame.partition_manager.BaseFrameManager._get_dict_block_of_index. The runtime for df[:-1] is unacceptable; with a rewrite, I was able to speed it up by >10x (7.6 s to 680 ms) on a 1M-row DataFrame. Another case of this was demonstrated in modin-project/modin#601.
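
To illustrate the kind of work that lookup has to do, here is a minimal sketch of mapping a global row slice onto per-partition local slices using only the known partition lengths. This is not the actual rewrite; the function name `slice_to_partition_slices` is hypothetical, and it assumes a step of 1 for simplicity.

```python
from itertools import accumulate


def slice_to_partition_slices(lengths, row_slice):
    """Map a global row slice to {partition index: local slice}.

    Hypothetical helper for illustration only; it uses nothing but the
    partition lengths, so no partition data has to be fetched.
    """
    total = sum(lengths)
    # slice.indices resolves None and negative bounds against the total length.
    start, stop, step = row_slice.indices(total)
    if step != 1:
        raise NotImplementedError("this sketch only handles step == 1")

    # bounds[i] is the global position of the first row of partition i.
    bounds = [0, *accumulate(lengths)]
    result = {}
    for i in range(len(lengths)):
        lo = max(start, bounds[i])
        hi = min(stop, bounds[i + 1])
        if lo < hi:  # the requested range overlaps partition i
            result[i] = slice(lo - bounds[i], hi - bounds[i])
    return result
```

For example, with partition lengths `[4, 4, 2]`, `slice_to_partition_slices([4, 4, 2], slice(None, -1))` returns `{0: slice(0, 4), 1: slice(0, 4), 2: slice(0, 1)}`, which is the per-partition view of df[:-1].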

I still think that is too slow, and some of that time can be avoided by making effective use of masks instead of triggering compute or fetching data when all we’re doing is subsetting.
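
As a rough sketch of what I mean by masks (the class name `LazyRowMask` and its methods are hypothetical, not existing Modin code): successive slices can be composed into a single `range` of selected row positions, and the data is only touched when something actually asks for it.

```python
class LazyRowMask:
    """Record which row positions are selected without touching the data.

    Hypothetical illustration: the selection is stored as a `range`,
    and slicing a `range` yields another `range`, so composing slices
    is cheap and never materializes any rows.
    """

    def __init__(self, rows):
        self._rows = rows  # a range of selected global row positions

    @classmethod
    def full(cls, length):
        """Mask that selects every row of a frame with `length` rows."""
        return cls(range(length))

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        # Compose with the existing selection; still no data access.
        return LazyRowMask(self._rows[key])

    def __len__(self):
        return len(self._rows)

    def materialize(self, data):
        """Apply the mask to an indexable sequence; the only step that reads data."""
        return [data[i] for i in self._rows]


mask = LazyRowMask.full(10)[:-1][2:][::2]
print(len(mask))  # the length is known without fetching any rows
```

The point of the design is that `len(mask)` and further subsetting are answered from the mask alone; only `materialize` reads the underlying rows.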

As part of this rewrite, I would like to separate the metadata from the data itself. This will simplify the code in many places and allow us to create distributed Index objects in the future without having to rewrite code.

I propose a Metadata class that contains the Index, Columns, dtypes, Partition Lengths, and any functionality we need to operate on them.
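
A minimal sketch of what such a class could look like, to make the proposal concrete. All names here (`FrameMetadata`, `row_lengths`, `column_widths`) are placeholders, and plain Python sequences stand in for the pandas Index and dtypes objects that a real implementation would presumably hold.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class FrameMetadata:
    """Hypothetical container for everything known about a frame except its data."""

    index: Sequence            # row labels
    columns: Sequence          # column labels
    dtypes: Sequence           # one dtype per column
    row_lengths: Sequence[int]    # number of rows in each row partition
    column_widths: Sequence[int]  # number of columns in each column partition

    def __post_init__(self):
        # Keep the labels and the partition layout consistent with each other.
        if sum(self.row_lengths) != len(self.index):
            raise ValueError("row_lengths must sum to len(index)")
        if sum(self.column_widths) != len(self.columns):
            raise ValueError("column_widths must sum to len(columns)")
        if len(self.dtypes) != len(self.columns):
            raise ValueError("dtypes must have one entry per column")

    @property
    def shape(self) -> Tuple[int, int]:
        """Frame shape, answered from metadata alone."""
        return (len(self.index), len(self.columns))


meta = FrameMetadata(
    index=range(10),
    columns=["a", "b", "c"],
    dtypes=["int64", "float64", "object"],
    row_lengths=[4, 4, 2],
    column_widths=[2, 1],
)
print(meta.shape)
```

Keeping the object immutable means an indexing operation would produce a new metadata object rather than mutating shared state, which seems like the safer default, though that is a design choice open to discussion.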

cc @williamma12 and @wuisawesome


That is a great idea! I do have a few questions though.

  1. How will the metadata class integrate into the existing framework? Will there be one metadata object per PartitionManager, one per QueryCompiler, or a 2D array of metadata objects, one for each partition within the PartitionManager?

  2. Also, would it contain the true Index and Columns, or the internal index and columns that we create with RangeIndex?