Read multiple CSV files

Hi,
I have a bunch of CSV files (as a result of an export of a BigQuery table) and I am trying to use modin to work with them in memory.

Prior to this, I used dask to read the multiple CSV files, which is easy to do. Suppose the CSV files are named data-001.csv, data-002.csv, …; we can then load them all using:

import dask.dataframe as dd
ddf = dd.read_csv(
    'data-*.csv'
)
df = ddf.compute()

This loads the data as a pandas DataFrame. Is there a way to load the data using the modin library instead? And is it possible to get a modin.pandas.DataFrame rather than a pandas.DataFrame?

Thanks!

Hi @lucasrodes, thanks for the question!

At the moment, there isn’t a syntax for doing this in the modin.pandas API. This is because the pandas API itself does not support this style of access. As it stands, you would need to loop over the files and pd.concat them all together.
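The loop-and-concat approach described above could look roughly like this. It is only a sketch: the helper name read_csv_files is made up for illustration, the file pattern is the one from the question, and the fallback to plain pandas is an assumption added so the snippet still runs where Modin is not installed.

```python
import glob

try:
    import modin.pandas as pd  # Modin's drop-in replacement for pandas
except ImportError:
    import pandas as pd  # assumption: fall back to pandas if Modin is missing


def read_csv_files(pattern):
    """Read every CSV matching `pattern` and concatenate them into one DataFrame.

    `read_csv_files` is a hypothetical helper, not part of Modin or pandas.
    """
    paths = sorted(glob.glob(pattern))  # sorted so the row order is deterministic
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True renumbers rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)
```

Calling read_csv_files('data-*.csv') should then return a modin.pandas.DataFrame when Modin is importable, and a pandas.DataFrame otherwise.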

We would definitely consider an extension to the API that supports this style of access. Do you want to request the feature on GitHub? https://github.com/modin-project/modin/issues/new/choose

Thanks for your response, @devin-petersohn, I understand! Will be posting an issue then.

It looks like this was solved, at least for CSV files, here:

Does anyone know whether it should also work for multiple JSON files?

Thanks @EvanZ, it was solved for CSV files! It doesn’t currently work for multiple JSON files (JSON lines?). Is that something needed in your workflow?

We could probably extend it to support JSON lines, since the two share similar abstractions and logic. Generic JSON files are a bit tougher: CSV and JSON lines both use newline delimiters between records, while generic JSON does not.
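To illustrate the newline-delimiter point, here is a small standard-library sketch (not Modin code; the sample records and the helper name parses_line_by_line are invented for the example). Each line of a JSON lines document is a complete record, so the text can be cut at newlines, whereas a pretty-printed generic JSON document cannot.

```python
import json

# JSON lines: one complete JSON value per line (sample data, made up here)
jsonl_text = '{"id": 1, "tags": ["a", "b"]}\n{"id": 2, "tags": []}\n'

# The same records as one generic, pretty-printed JSON array
generic_text = json.dumps(
    [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}], indent=2
)


def parses_line_by_line(text):
    """Return True if every non-empty line is valid JSON on its own."""
    try:
        for line in text.splitlines():
            if line.strip():
                json.loads(line)
    except json.JSONDecodeError:
        return False
    return True


# Because each line stands alone, records can be parsed independently
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```

Here parses_line_by_line(jsonl_text) is True and parses_line_by_line(generic_text) is False, which is why a reader that splits work at newlines fits JSON lines but not arbitrary JSON.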

We use GitHub issues to track feature requests like this. Would you be able to open one here: Issues · modin-project/modin · GitHub?

Ah, shoot. Yeah, I was hoping to use it for multiple JSON files (yes, JSON lines) spit out by Spark. I can’t use CSV in this case because some of the fields are arrays. I can open an issue, sure.

The readers are pretty modular, so I don’t think it would be too much effort to reuse a bunch of the logic from the CSV side.

Thanks, Devin. I created a feature request here:

I linked back to this thread. It sounds like it’s pretty self-explanatory.

Heya,

Noob here: how exactly do I read multiple CSVs? I tried to work it out from the commit but couldn’t.

@wmiguel does the documentation of read_csv_glob on the Modin readthedocs help?
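For reference, a minimal sketch of how read_csv_glob is typically called, assuming it lives in the modin.experimental.pandas namespace and is supported by the installed Modin version and engine (please check the docs for your release). The wrapper name read_many_csvs and the plain-pandas fallback are assumptions added for illustration only.

```python
import glob


def read_many_csvs(pattern):
    """Load all CSVs matching `pattern` (hypothetical wrapper, not a Modin API)."""
    try:
        # Experimental namespace: may change or be engine-specific
        import modin.experimental.pandas as mpd
    except ImportError:
        # Assumption: without Modin, fall back to a pandas loop-and-concat
        import pandas as pd

        paths = sorted(glob.glob(pattern))
        if not paths:
            raise FileNotFoundError(f"no files match {pattern!r}")
        return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
    # read_csv_glob takes a glob pattern where read_csv takes a single path
    return mpd.read_csv_glob(pattern)
```

With the files from the original question, read_many_csvs('data-*.csv') would be the call to try.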
