Processing bigger than memory data


I’m looking for information about Modin’s handling of bigger than memory data. All the information I find about Modin clustering revolves around speed, but I couldn’t find if it also avoids to run out of memory by splitting large computations. Does it?

Dask seems to have that ability, so I suppose Modin could do it as well, with my contribution if needed.

Hi @bbp, thanks for posting!

Here is our page on Out of Core, currently it only works on Ray: It is not a very elegant solution internally, but it seems to be similar to what Dask does internally. We effectively allocate a large memory-mapped file and read/write objects from/to it.

The Dask engine is implemented on Dask Futures, and the code difference between Dask and Ray engines in Modin is <600 LOC. The two engines are quite similar in many ways, but I am less familiar with Dask.

You mentioned splitting large computations, are you talking about query optimization and query planning?

Hi Devin, and thanks for your answer and for the work you put in modin, I find it very promising!

I’m specifically talking about a use case where the data is fetched from a database, and is in total larger than would fit in the memory of any single machine. So what I’d like to do is to split the work across multiple workers running on different machines, so that they all process a subset of the data, without running out of memory, and then combine the results (which are much, much smaller than the original data). So I think that in some way, yes, it has to do with query optimization and planning, but more under then angle of sharing and reducing memory footprint than under the usual angle of accelerating computation. Here I’m rather talking about making the computation possible at all.