Modin reading files from Azure Blob Store

Hello,

Does Modin support reading files from Azure Blob Storage, as it does with S3? If so, is this provided out of the box, or do additional modules need to be installed?

Thank you,
Cherif

Hi @Cherif_Jazra, Modin supports Azure Blob storage to the same degree as pandas. Here is a tutorial showing how Microsoft recommends using Azure Blob storage with pandas: "Explore data in Azure Blob storage with pandas" (Azure Architecture Center, Microsoft Docs).

Thanks @devin-petersohn. Following up on this, I was able to read from an Azure storage account after installing the adlfs module, like this:

First, set up the environment variables:

import os

os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "someaccount"
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "somekey"

Then call:

modin_pd.read_csv('abfs://container@blob.core.windows.net/filename.csv')

works!
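As an alternative to environment variables, the credentials can probably be passed explicitly. The following is a minimal sketch, assuming Modin forwards `storage_options` to fsspec/adlfs the way pandas does and that adlfs accepts the `account_name` and `account_key` keys. The helper names `azure_storage_options`, `blob_url`, and `read_blob_csv` are hypothetical, and the `abfs://container/path` URL form is an assumption about how adlfs resolves paths:

```python
import os


def azure_storage_options(env=None):
    """Build an adlfs-style credentials dict from the environment variables above."""
    env = os.environ if env is None else env
    return {
        "account_name": env["AZURE_STORAGE_ACCOUNT_NAME"],
        "account_key": env["AZURE_STORAGE_ACCOUNT_KEY"],
    }


def blob_url(container, blob_path):
    """Assumed URL form: abfs://<container>/<path inside the container>."""
    return f"abfs://{container}/{blob_path.lstrip('/')}"


def read_blob_csv(container, blob_path, env=None):
    """Read one CSV blob with Modin, passing credentials explicitly."""
    # Imported lazily so the helpers above work without Modin installed.
    import modin.pandas as modin_pd

    return modin_pd.read_csv(
        blob_url(container, blob_path),
        storage_options=azure_storage_options(env),
    )
```

With this, `read_blob_csv("container", "filename.csv")` would stand in for the call above; it still requires adlfs to be installed.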

However, this did not work with read_csv_glob:

modin_experimental_pd.read_csv_glob('abfs://container@blob.core.windows.net/filename.csv')
File /python3.9/site-packages/modin/experimental/pandas/io.py:183, in _make_parser_func.<locals>.parser_func(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    180     f_locals["sep"] = "\t"
    182 kwargs = {k: v for k, v in f_locals.items() if k in _pd_read_csv_signature}
--> 183 return _read(**kwargs)

File python3.9/site-packages/modin/experimental/pandas/io.py:208, in _read(**kwargs)
    205 Engine.subscribe(_update_engine)
    207 try:
--> 208     pd_obj = FactoryDispatcher.read_csv_glob(**kwargs)
    209 except AttributeError:
    210     raise AttributeError("read_csv_glob() is only implemented for pandas on Ray.")

File /python3.9/site-packages/modin/core/execution/dispatching/factories/dispatcher.py:185, in FactoryDispatcher.read_csv_glob(cls, **kwargs)
    182 @classmethod
    183 @_inherit_docstrings(factories.ExperimentalPandasOnRayFactory._read_csv_glob)
    184 def read_csv_glob(cls, **kwargs):
--> 185     return cls.__factory._read_csv_glob(**kwargs)

File /python3.9/site-packages/modin/core/execution/dispatching/factories/factories.py:513, in ExperimentalPandasOnRayFactory._read_csv_glob(cls, **kwargs)
    506 @classmethod
    507 @doc(
    508     _doc_io_method_raw_template,
   (...)
    511 )
    512 def _read_csv_glob(cls, **kwargs):
--> 513     return cls.io_cls.read_csv_glob(**kwargs)

File python3.9/site-packages/modin/core/io/text/csv_glob_dispatcher.py:62, in CSVGlobDispatcher._read(cls, filepath_or_buffer, **kwargs)
     60 if isinstance(filepath_or_buffer, str):
     61     if not cls.file_exists(filepath_or_buffer):
---> 62         return cls.single_worker_read(filepath_or_buffer, **kwargs)
     63     filepath_or_buffer = cls.get_path(filepath_or_buffer)
     64 elif not cls.pathlib_or_pypath(filepath_or_buffer):

File python3.9/site-packages/modin/core/storage_formats/pandas/parsers.py:269, in PandasParser.single_worker_read(cls, fname, **kwargs)
    267 ErrorMessage.default_to_pandas("Parameters provided")
    268 # Use default args for everything
--> 269 pandas_frame = cls.parse(fname, **kwargs)
    270 if isinstance(pandas_frame, pandas.io.parsers.TextFileReader):
    271     pd_read = pandas_frame.read

File python3.9/site-packages/modin/core/storage_formats/pandas/parsers.py:312, in PandasCSVGlobParser.parse(chunks, **kwargs)
    309 index_col = kwargs.get("index_col", None)
    311 pandas_dfs = []
--> 312 for fname, start, end in chunks:
    313     if start is not None and end is not None:
    314         # pop "compression" from kwargs because bio is uncompressed
    315         with OpenFile(fname, "rb", kwargs.pop("compression", "infer")) as bio:

ValueError: not enough values to unpack (expected 3, got 1)

Is this a bug, or is the feature actually unsupported?
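Reading the traceback, one plausible explanation is that `file_exists` returns False for the abfs:// path, so the code falls back to `single_worker_read`, which hands the plain path string to a parser that expects an iterable of `(fname, start, end)` chunks. Iterating over a string yields single characters, which would produce exactly this error. The sketch below reproduces that mechanism in isolation; `unpack_chunks` is a hypothetical stand-in, not Modin code:

```python
def unpack_chunks(chunks):
    """Mimic the loop in the traceback: each item must be a 3-tuple."""
    names = []
    for fname, start, end in chunks:
        names.append(fname)
    return names


# A proper list of (fname, start, end) chunks unpacks without trouble.
print(unpack_chunks([("a.csv", 0, 100), ("b.csv", 0, 50)]))

# A bare path string is iterated character by character, so the first
# item is "a" (length 1) and tuple unpacking fails.
try:
    unpack_chunks("abfs://container@blob.core.windows.net/filename.csv")
except ValueError as exc:
    print(exc)
```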

Hi @Cherif_Jazra!

Does the same issue occur if you use a directory with read_csv_glob instead of a direct filename? I think the reader may assume a directory/globbable path.

Thanks for following up. I get the same error if I try something like the following:

modin_experimental_pd.read_csv_glob('abfs://container@blob.core.windows.net/*')
modin_experimental_pd.read_csv_glob('abfs://container@blob.core.windows.net/*.csv')

@Cherif_Jazra, could you file an issue about this problem in the Modin repo?
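In the meantime, a possible workaround is to expand the glob yourself and read the files one at a time with the regular read_csv, which was reported to work above. This is only a sketch: `expand_glob` and `read_csv_files` are hypothetical helpers, and it assumes the fsspec filesystem's `glob` returns paths of the form `container/file.csv` without the protocol prefix.

```python
def expand_glob(fs, pattern, protocol="abfs"):
    """Return sorted, fully qualified URLs for every path matching the pattern."""
    urls = []
    for path in sorted(fs.glob(pattern)):
        # Add the protocol back only if the filesystem stripped it (assumption).
        urls.append(path if "://" in path else f"{protocol}://{path}")
    return urls


def read_csv_files(urls, read_csv, concat, **read_kwargs):
    """Read each URL with the given reader and concatenate the results."""
    if not urls:
        raise FileNotFoundError("no files matched the glob pattern")
    frames = [read_csv(url, **read_kwargs) for url in urls]
    return concat(frames, ignore_index=True)


# Hypothetical usage (needs fsspec, adlfs, and modin installed; not run here):
#   import fsspec
#   import modin.pandas as modin_pd
#   opts = {"account_name": "someaccount", "account_key": "somekey"}
#   fs = fsspec.filesystem("abfs", **opts)
#   urls = expand_glob(fs, "container/*.csv")
#   df = read_csv_files(urls, modin_pd.read_csv, modin_pd.concat,
#                       storage_options=opts)
```

The reader and concat functions are passed in as arguments so the same helpers work with either pandas or Modin. Each file is read by a separate read_csv call, so this will not match whatever performance read_csv_glob is designed to provide.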