Ingesting large size datasets into Modin/Ray


What is Modin on Ray expectations regarding ingesting large datasets in the dozens of terabytes and possibly more? Is there any presentation/paper discussing benchmarks with such dataset size?

In the Modin technical report, Dataframe Systems: Theory, Architecture, and Implementation | EECS at UC Berkeley, benchmarks are described in chapter 6, I see the 2 set ups:

  1. A single EC2 x1.32xlarge 128 cores and 1,952 GB RAM with a dataset up to 250GB (1.6B rows)
  2. A single x1e.32xlarge instance with 128 CPUs and 4TB of memory but the dataset itself is only 23GB (150M rows).

Can you provide benchmarks for larger datasets in the dozen of terrabytes size?

Specifically, Iā€™m interested in understanding if you have any recommended approach for loading several files into a single Modin DataFrame specifically in the two cases

  1. A small amount of very large files
  2. A large number of small files

Iā€™m also interested in understanding the major limitations for these dataset sizes across the different IO supported in pd.read_<file> and I/O APIs ā€” Modin 0+untagged.50.g9498b92.dirty documentation

  • read_csv
  • read_parquet
  • read_sql
  • ā€¦
1 Like