Implementation for parallel `read_excel`

@devin-petersohn
hi devin-petersohn:
I’m contributing to implementing parallel for read_excel, as your mentioned https://github.com/modin-project/modin/issues/467 read_excel is similar with read_hdf, which should be reading a subset of the columns and distributed the reading that way. but I think maybe it’s hard to do that since the original xlrdReader didn’t support read a sub-set of excel columns.
I wonder to know if we can read a subset of rows and distributed the reading that ways, cause I’m not sure if that can be implemented, any idea or suggests for me.

A big thanks

  • Jay
2 Likes

Thanks Jay! I will try to clarify here.

We are internally using pandas.read_x within the partitions. I will link provide links/explanations for how things are done with I/O currently.

We are trying to get this implemented for the Ray backend at first, we can worry about the other execution engines later. The definition for read_hdf currently is implemented in the generic I/O class here: https://github.com/modin-project/modin/blob/551e4c48f799b7e13c51a00a36f3d2cf462b2244/modin/engines/ray/generic/io.py#L721-L785

The purpose of that functionality is to determine the metadata and get the columns that we will read from each worker so that the functionality can be passed off to the workers. For read_excel a lot of this code will be very similar (and we should eventually abstract away the common functionality but at first we can just reuse a lot of the logic).

Within the partitions we will use pandas.read_excel: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html. This is because we are only reading parts of the file at the same time. The columns parameter for pandas.read_excel is usecols, and the computation of the columns is done in the generic I/O class (see HDF computation here: https://github.com/modin-project/modin/blob/551e4c48f799b7e13c51a00a36f3d2cf462b2244/modin/engines/ray/generic/io.py#L756-L764).

The remote task will be defined in pandas_on_ray I/O class (HDF here: https://github.com/modin-project/modin/blob/551e4c48f799b7e13c51a00a36f3d2cf462b2244/modin/engines/ray/pandas_on_ray/io.py#L105-L125). So we don’t have to parse or use xlrdReader ourselves because we will let pandas do the parsing functionality for us. It will be simpler this way I believe.

Let me know if you have any questions!

hi devin-petersohn:

thanks for your reply,

I checked pandas read_excel use_cols parameter underlying implementation that the way is read wholly excel File firstly and return sub-DataFrame base on use_cols secondly, so I think it is not a good idea to use usecols parameter cause that pandas TextParser do much-duplicated works.

I try to check it out to see is there a more useful way to implement excel reader or writing a unique excel reader for modin.

Any suggestions or advice? Thanks .

here is pandas source code:

     def TextParser(*args, **kwds):
        .....

        try:
            parser = TextParser(data,
                                names=names,
                                header=header,
                                index_col=index_col,
                                has_index_names=has_index_names,
                                squeeze=squeeze,
                                dtype=dtype,
                                true_values=true_values,
                                false_values=false_values,
                                skiprows=skiprows,
                                nrows=nrows,
                                na_values=na_values,
                                parse_dates=parse_dates,
                                date_parser=date_parser,
                                thousands=thousands,
                                comment=comment,
                                skipfooter=skipfooter,
                                usecols=usecols,
                                mangle_dupe_cols=mangle_dupe_cols,
                                **kwds)
    
            output[asheetname] = parser.read(nrows=nrows)
    
            if not squeeze or isinstance(output[asheetname], DataFrame):
                if header_names:
                    output[asheetname].columns = output[
                        asheetname].columns.set_names(header_names)
                elif compat.PY2:
                    output[asheetname].columns = _maybe_convert_to_string(
                        output[asheetname].columns)
    
        except EmptyDataError:
            # No Data, return an empty DataFrame
            output[asheetname] = DataFrame()
    
    if ret_dict:
        return output
    else:
        return output[asheetname]

Thanks @sunnyjiechao, sorry for the delay in response.

To be comprehensive, I ran some timing tests, and it does take exactly the same amount of time as you say.

To read the Excel file, we have to unzip it first, then do some reading on the correct XML file from the correct spreadsheet(s). If we implement a custom version, we should make sure that we can read the XML file(s) in parallel. They look like they are minified, so we can’t use newlines as the delimiter for parsing.

@williamma12 Do you have any suggestions?

It might be useful to take a look at https://github.com/python-excel/xlrd. It is not being actively developed but still maintained. It may let us read excel files column- or row-wise but it should definitely help with reading in the files.

Either way, this looks promising! I am excited to see read_excel finally get some attention @sunnyjiechao!

What about Openpyxl? https://openpyxl.readthedocs.io/en/stable/pandas.html

That might be better but it seems like you have to load the entire workbook at once. However, if you can read portions of it at a time, it be better to use openpyxl since that one is being actively developed.