Read_csv_glob doesn't work on S3 prefixes

Hello,

I’m running into an issue with read_csv_glob when pointing it at an S3 prefix that results in multiple files being returned. I’m using the URL s3://nyc-tlc/trip data/yellow_tripdata_2020-, which returns several files. Can you confirm whether I’m running this incorrectly or whether this is not properly supported? Note that this works when I point to an S3 URL for a single .csv file.

The test was done on Ray 1.8.0 and Modin 0.12.0.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_13805/755823039.py in <module>
      2 
      3 file_path = 's3://nyc-tlc/trip data/yellow_tripdata_2020-'
----> 4 modin_df = pd.read_csv_glob(file_path, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)
      5 
      6 

~/venv/lib/python3.8/site-packages/modin/experimental/pandas/io.py in parser_func(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    181 
    182         kwargs = {k: v for k, v in f_locals.items() if k in _pd_read_csv_signature}
--> 183         return _read(**kwargs)
    184 
    185     parser_func.__doc__ = _read.__doc__

~/venv/lib/python3.8/site-packages/modin/experimental/pandas/io.py in _read(**kwargs)
    206 
    207     try:
--> 208         pd_obj = FactoryDispatcher.read_csv_glob(**kwargs)
    209     except AttributeError:
    210         raise AttributeError("read_csv_glob() is only implemented for pandas on Ray.")

~/venv/lib/python3.8/site-packages/modin/core/execution/dispatching/factories/dispatcher.py in read_csv_glob(cls, **kwargs)
    183     @_inherit_docstrings(factories.ExperimentalPandasOnRayFactory._read_csv_glob)
    184     def read_csv_glob(cls, **kwargs):
--> 185         return cls.__factory._read_csv_glob(**kwargs)
    186 
    187     @classmethod

~/venv/lib/python3.8/site-packages/modin/core/execution/dispatching/factories/factories.py in _read_csv_glob(cls, **kwargs)
    511     )
    512     def _read_csv_glob(cls, **kwargs):
--> 513         return cls.io_cls.read_csv_glob(**kwargs)
    514 
    515     @classmethod

~/venv/lib/python3.8/site-packages/modin/core/io/text/csv_glob_dispatcher.py in _read(cls, filepath_or_buffer, **kwargs)
     60         if isinstance(filepath_or_buffer, str):
     61             if not cls.file_exists(filepath_or_buffer):
---> 62                 return cls.single_worker_read(filepath_or_buffer, **kwargs)
     63             filepath_or_buffer = cls.get_path(filepath_or_buffer)
     64         elif not cls.pathlib_or_pypath(filepath_or_buffer):

~/venv/lib/python3.8/site-packages/modin/core/storage_formats/pandas/parsers.py in single_worker_read(cls, fname, **kwargs)
    267         ErrorMessage.default_to_pandas("Parameters provided")
    268         # Use default args for everything
--> 269         pandas_frame = cls.parse(fname, **kwargs)
    270         if isinstance(pandas_frame, pandas.io.parsers.TextFileReader):
    271             pd_read = pandas_frame.read

~/venv/lib/python3.8/site-packages/modin/core/storage_formats/pandas/parsers.py in parse(chunks, **kwargs)
    310 
    311         pandas_dfs = []
--> 312         for fname, start, end in chunks:
    313             if start is not None and end is not None:
    314                 # pop "compression" from kwargs because bio is uncompressed

ValueError: not enough values to unpack (expected 3, got 1)

@Cherif_Jazra Thanks for the note. It seems like it’s not properly supported. Looking at the code, it assumes a local filesystem, but luckily we recently added a file opener interface, so it should be easy enough to port this to that.

Would you open an issue on GitHub for this? Thanks!

Thank you for the quick follow-up, @devin-petersohn! Yes, I will open a GitHub issue.

Created issue #3766 in modin-project/modin on GitHub: "Read_csv_glob doesn’t work on S3 prefixes".

In case others land here: read_csv_glob does work with S3. The issue is with the documentation and the error message, not with S3 support itself :slight_smile: .

Thanks @Cherif_Jazra for pointing it out!
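
For anyone who wants to see why a bare prefix can behave differently from a glob pattern, here is a minimal sketch. It assumes (this thread does not confirm it) that the fix is to end the path with a wildcard such as `*`. The key names and the `match_keys` helper are hypothetical stand-ins for a bucket listing, not Modin internals, and nothing here touches S3.

```python
from fnmatch import fnmatchcase

# Hypothetical object keys standing in for a bucket listing (not real S3 output).
keys = [
    "trip data/yellow_tripdata_2019-12.csv",
    "trip data/yellow_tripdata_2020-01.csv",
    "trip data/yellow_tripdata_2020-02.csv",
    "trip data/green_tripdata_2020-01.csv",
]


def match_keys(pattern, keys):
    """Return the keys matching a shell-style glob pattern, sorted."""
    return sorted(k for k in keys if fnmatchcase(k, pattern))


# A bare prefix contains no wildcard, so it only matches a key with exactly that name.
prefix_only = match_keys("trip data/yellow_tripdata_2020-", keys)

# Adding "*" turns the prefix into a glob pattern that matches every 2020 file.
with_wildcard = match_keys("trip data/yellow_tripdata_2020-*", keys)

print(prefix_only)
print(with_wildcard)

# Assumed Modin usage (not run here; requires Modin on Ray and S3 access):
# import modin.experimental.pandas as pd
# df = pd.read_csv_glob("s3://nyc-tlc/trip data/yellow_tripdata_2020-*")
```

If that assumption holds, the traceback above is consistent with it: the bare prefix fails the `file_exists` check, so the call falls through to `single_worker_read`, which then fails inside `parse`.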