Indexing needs to be rewritten and updated. Currently it takes too long to fetch a large slice from a DataFrame, even when the objects are already in memory and the length of each partition is already known. This work was previously blocked on a Distributed Series, but now that Series is merged and we have caught and fixed the known regressions, the next undertaking is to update indexing.
After some deep profiling, I found that a decent amount of time is spent in `modin.engines.base.frame.partition_manager.BaseFrameManager._get_dict_block_of_index`. The runtime for `df[:-1]` is unacceptable, and with a rewrite I was able to speed it up by more than 10x (7.6 s to 680 ms) on a 1M-row DataFrame. Another case of this was demonstrated in modin-project/modin#601.
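To illustrate why known partition lengths should make this cheap, here is a minimal sketch (not the actual rewrite) of mapping a global row slice to per-partition local slices using lengths alone. The function name `slice_to_partition_slices` is hypothetical, and the sketch assumes row partitions and unit-step slices only.

```python
def slice_to_partition_slices(lengths, key):
    """Map a global row slice to {partition_index: local_slice}.

    Hypothetical helper: only the partition lengths are consulted,
    so no partition data is fetched or computed.
    """
    total = sum(lengths)
    # slice.indices() resolves None and negative bounds against the total length.
    start, stop, step = key.indices(total)
    if step != 1:
        raise NotImplementedError("this sketch handles unit-step slices only")

    result = {}
    offset = 0  # global position of the first row in the current partition
    for i, n in enumerate(lengths):
        lo = max(start, offset) - offset
        hi = min(stop, offset + n) - offset
        if lo < hi:  # the requested range overlaps this partition
            result[i] = slice(lo, hi)
        offset += n
    return result


# df[:-1] on three partitions of 4, 4 and 2 rows
parts = slice_to_partition_slices([4, 4, 2], slice(None, -1))
```

In this example only the last partition is trimmed; the first two are selected whole, which a real implementation could presumably pass through without touching them.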
I still think that is too slow, and some of that time can be saved by making effective use of masks instead of triggering compute or fetching data when all we are doing is subsetting.
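A rough sketch of what I mean by a mask, under my own assumptions: the class `LazyRowMask` and its methods are hypothetical, and a plain list stands in for a partition. Successive slices are composed on a `range` of positions, and data is only touched when `apply` is called.

```python
class LazyRowMask:
    """Hypothetical deferred row selection over a partition of known length."""

    def __init__(self, rows):
        # A range of positions into the underlying partition.
        self.rows = rows

    @classmethod
    def full(cls, length):
        """Mask selecting every row of a partition with `length` rows."""
        return cls(range(length))

    def __len__(self):
        # The resulting length is known without touching any data.
        return len(self.rows)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch supports slices only")
        # Slicing a range yields another range, so composition is O(1).
        return LazyRowMask(self.rows[key])

    def apply(self, data):
        """Materialize the selection; the only step that reads data."""
        return [data[i] for i in self.rows]


mask = LazyRowMask.full(10)[:-1][2:5]    # two subsets, no data access yet
selected = mask.apply(list(range(100, 110)))
```

The point of the design is that `len(mask)` and further subsetting stay metadata-only operations, so chained indexing costs nothing until a result is actually needed.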
As a part of this rewrite, I would like to separate the Metadata from the data itself. This will simplify the code in many places and allow us to create distributed Index objects in the future without having to rewrite code.
I propose a Metadata class that contains the Index, Columns, dtypes, Partition Lengths, and any functionality we need to operate on them.
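As a starting point for discussion, a minimal sketch of the shape such a class could take. Everything beyond the four fields named above is an assumption: plain Python sequences stand in for the pandas Index objects, and the `head` method is a hypothetical example of an operation that only touches metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Metadata:
    """Sketch of a metadata container kept separate from partition data."""

    index: Sequence               # row labels (stand-in for a pandas Index)
    columns: Sequence             # column labels
    dtypes: Dict[str, str]        # column label -> dtype name
    partition_lengths: List[int]  # number of rows in each row partition

    def __post_init__(self):
        if sum(self.partition_lengths) != len(self.index):
            raise ValueError("partition lengths must sum to the index length")

    def __len__(self):
        return len(self.index)

    def head(self, n):
        """Metadata for the first n rows, computed without reading any data."""
        n = max(0, min(n, len(self.index)))
        remaining = n
        new_lengths = []
        for length in self.partition_lengths:
            take = min(length, remaining)
            # Zero-length entries are kept so partition positions stay aligned.
            new_lengths.append(take)
            remaining -= take
        return Metadata(self.index[:n], self.columns, self.dtypes, new_lengths)


meta = Metadata(list(range(10)), ["a", "b"], {"a": "int64", "b": "float64"}, [4, 4, 2])
first_five = meta.head(5)
```

Because the index sits behind this one class, swapping it for a distributed Index later should only require changes here rather than throughout the indexing code.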