What are Modin-on-Ray's expectations regarding ingesting large datasets in the dozens of terabytes and possibly more? Is there any presentation or paper discussing benchmarks at that dataset size?
In the Modin technical report (Dataframe Systems: Theory, Architecture, and Implementation, EECS at UC Berkeley), the benchmarks described in chapter 6 use two setups:
- A single EC2 x1.32xlarge instance (128 cores, 1,952 GB RAM) with a dataset of up to 250 GB (1.6B rows)
- A single x1e.32xlarge instance (128 CPUs, 4 TB RAM), but the dataset itself is only
Can you provide benchmarks for larger datasets, in the dozens-of-terabytes range?
Specifically, I’m interested in whether you have a recommended approach for loading several files into a single Modin DataFrame in two cases:
- A small number of very large files
- A large number of small files
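For context, a naive read-then-concatenate approach is sketched below (this is my assumption of how one would do it today, not an official Modin recipe; the file names and temp directory are just stand-ins for real data). Modin advertises a drop-in pandas API, so the only Modin-specific change would be the import.

```python
# Hedged sketch: load many CSV shards into one DataFrame by reading each
# shard and concatenating. To run on Modin/Ray, swap the pandas import for
# `import modin.pandas as pd` (Modin's drop-in pandas API).
import glob
import os
import tempfile

import pandas as pd  # or: import modin.pandas as pd

# Create a few small demo shards in a temp directory (stand-in for real data).
tmpdir = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame({"id": range(i * 10, (i + 1) * 10)}).to_csv(
        os.path.join(tmpdir, f"shard_{i}.csv"), index=False
    )

# Read every shard and concatenate into a single DataFrame.
files = sorted(glob.glob(os.path.join(tmpdir, "shard_*.csv")))
frames = [pd.read_csv(f) for f in files]
df = pd.concat(frames, ignore_index=True)
print(df.shape)  # (30, 1)
```

My open question is whether this per-file read plus `pd.concat` pattern is the intended path at terabyte scale, or whether Modin has a more direct multi-file ingestion mechanism.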
I’m also interested in understanding the major limitations at these dataset sizes across the different readers supported in `pd.read_<file>` (see the I/O APIs page of the Modin documentation).