Glob reading csv files in parallel

Will Modin read in parallel and concatenate the data from a set of csv files defined by a glob, with identical columns, into a single modin dataframe?

I’m running through this tutorial that uses SNP data:
and want to convert it from Dask to Modin.
In pure pandas you would read the files serially, appending each one as you go, or read them all in first and then concat:
df = pd.concat([pd.read_csv(f) for f in glob.glob('data*.csv')], ignore_index=True)

Dask supports reading a glob of CSV files directly.

I’d like Modin to read the files in parallel, though I’m not sure whether that can be expressed with only the pandas API …

Thanks @Justin_Paschall, you bring up a good point that I would like to reiterate.

Modin is not trying to reproduce or recreate Dask. Dask is a good tool with good support for the things that it does. Sometimes, the API is not enough or people find it difficult to use and prefer the pandas API. That is the purpose that Modin serves: to scale the pandas API whenever possible.

If the metadata is aligned (identical columns) concatenation should be trivial. I have played around with the NYC Taxi Dataset and concatenated a full year’s worth of data together with no problem, so please let me know if you have this issue yourself.

If you’d like to contribute this functionality to our “experimental” API, that would be fine. Functionality there is allowed to deviate from pandas.

When you worked with this NYC taxi dataset, which I assume is many CSV files, did you do the concat as a preprocessing step with ‘cat’ at the command line, writing to a new file?

I’ll look at the experimental API later, as you suggest. Since Modin supports parallel reading so well, it feels like a shame not to support globs in an extended API.

They were not preprocessed, rather we did something like this:

import modin.pandas as pd

filenames = [...]
df = pd.concat([pd.read_csv(f, ...) for f in filenames])
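
For anyone who wants the glob behaviour from the original question, the pattern above can be wrapped in a small helper. This is only a rough sketch, not part of Modin: the name concat_csv_glob is made up for illustration, and the fallback to plain pandas is an assumption added so the snippet still runs where Modin is not installed. It relies only on read_csv and concat, which both libraries expose under the same names.

```python
import glob

try:
    import modin.pandas as pd  # pandas-compatible API
except ImportError:
    import pandas as pd  # assumption: fall back to plain pandas if Modin is missing


def concat_csv_glob(pattern, **read_csv_kwargs):
    """Hypothetical helper: read every CSV matching `pattern` into one dataframe.

    Assumes all matching files share identical columns.
    """
    # glob.glob returns paths in arbitrary order, so sort for a stable row order.
    filenames = sorted(glob.glob(pattern))
    if not filenames:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # Read the files one after another, then concatenate once at the end.
    frames = [pd.read_csv(f, **read_csv_kwargs) for f in filenames]
    # ignore_index=True gives the result a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

Usage would then look like df = concat_csv_glob('data*.csv'). The files are still opened one at a time in the list comprehension, so any parallelism comes from how Modin handles each individual read_csv call and the final concat, not from reading several files at once.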