Processing bigger-than-memory data

Hi,

I’m looking for information about Modin’s handling of bigger-than-memory data. All the information I can find about Modin clustering revolves around speed, but I couldn’t tell whether it also avoids running out of memory by splitting large computations. Does it?

Dask seems to have that ability, so I suppose Modin could do it as well, with my contribution if needed.

Hi @bbp, thanks for posting!

Here is our page on Out of Core; currently it only works on Ray: https://modin.readthedocs.io/en/latest/out_of_core.html. It is not a very elegant solution internally, but it seems to be similar to what Dask does: we effectively allocate a large memory-mapped file and read/write objects from/to it.
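
Concretely, enabling it looks roughly like this. The `MODIN_OUT_OF_CORE` and `MODIN_ENGINE` settings come from the linked docs page rather than anything in this thread, so treat this as a sketch:

```python
# Minimal sketch, assuming the MODIN_OUT_OF_CORE flag and the MODIN_ENGINE
# variable from the linked docs page; set both before importing modin.pandas.
import os

os.environ["MODIN_ENGINE"] = "ray"          # out-of-core currently works on Ray only
os.environ["MODIN_OUT_OF_CORE"] = "true"    # spill objects to the memory-mapped store on disk

import modin.pandas as pd

# Work with a dataset larger than RAM; partitions are paged to disk as needed.
df = pd.read_csv("big_dataset.csv")         # hypothetical file path
print(df.groupby("key").sum())
```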

The Dask engine is implemented on top of Dask Futures, and the code difference between the Dask and Ray engines in Modin is under 600 LOC. The two engines are quite similar in many ways, but I am less familiar with Dask.
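
For reference, switching between the two engines is just a configuration change. A quick sketch using the documented `MODIN_ENGINE` variable (out-of-core on Dask is not claimed here; the docs page above says Ray only for now):

```python
# Hedged sketch: selecting the Dask engine instead of Ray via MODIN_ENGINE.
import os
os.environ["MODIN_ENGINE"] = "dask"

import modin.pandas as pd
df = pd.read_csv("big_dataset.csv")  # same pandas-style API, different executor
```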

You mentioned splitting large computations; are you talking about query optimization and query planning?

Hi Devin, and thanks for your answer and for the work you have put into Modin; I find it very promising!

I’m specifically talking about a use case where the data is fetched from a database and is, in total, larger than would fit in the memory of any single machine. What I’d like to do is split the work across multiple workers running on different machines, so that each processes a subset of the data without running out of memory, and then combine the results (which are much, much smaller than the original data). So in some way, yes, it has to do with query optimization and planning, but more from the angle of sharing and reducing the memory footprint than from the usual angle of accelerating computation. Here I’m really talking about making the computation possible at all.
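
To illustrate the shape of the computation I have in mind (all names here are hypothetical, not any particular framework’s API):

```python
# Purely illustrative sketch of the pattern I mean: each worker reduces its own
# slice of the data to a small summary, and only those small summaries are
# combined at the end. fetch_slice / reduce_slice / combine are made-up names.
from concurrent.futures import ProcessPoolExecutor

def fetch_slice(slice_id):
    # Placeholder for pulling one slice of timeseries rows from the database.
    return [float(i) for i in range(slice_id * 1_000, (slice_id + 1) * 1_000)]

def reduce_slice(slice_id):
    # Per-worker step: load one slice and reduce it to a small summary.
    rows = fetch_slice(slice_id)
    return sum(rows), len(rows)

def combine(partials):
    # Final step over the small per-slice summaries (a global mean here).
    total, count = map(sum, zip(*partials))
    return total / count

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(combine(list(pool.map(reduce_slice, range(8)))))
```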

Best,

It seems like a lot of the filtering you might do in this workload could be pushed into the database. How are you currently handling this use case?
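
For example, something along these lines, where the selection happens in SQL and only the reduced rows reach Modin (the table, columns, and connection string are made up for illustration):

```python
# Hypothetical sketch of pushing the filter into the database: only the rows
# that are actually needed leave the DB.
import modin.pandas as pd

query = """
    SELECT ts, series_id, value
    FROM timeseries
    WHERE series_id IN ('a', 'b', 'c')     -- selection done by the database
      AND ts >= '2020-01-01'
"""
df = pd.read_sql(query, "postgresql://user:password@host/db")

# The combined result stays small even if the selected rows are not.
result = df.groupby("series_id")["value"].mean()
```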

Hi Devin, I’m not handling anything for now, but I want to know what to expect before committing to a particular architecture. I’m at the design stage and still making technology choices at this point. Unfortunately, I can’t push everything to the DB, because that would require lazy evaluation: to slice and dice the work, one needs the full picture of what data really needs to be processed.


That is true, and Modin currently doesn’t defer execution. I have toyed with adding it to Modin; it can add a meaningful performance benefit in batch-job environments. Is this the environment you have?

It’s an important thing to support; the bottleneck so far has been the engineering time needed to add these big features. I’m happy to discuss more.

Hi Devin,

This isn’t a batch environment; my use case is primarily user-facing. I have a lot of timeseries, and I want users to write expressions that process some of those timeseries (a small subset of them, but sometimes that subset can still produce enough data to make a single machine run out of memory). I also want them to be able to focus on writing the expressions they need in the simplest possible form, even if that form is suboptimal, hence my interest in lazy evaluation, which would allow the expression to be optimized and sliced as much as possible before any data is actually processed.
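
Roughly, the kind of lazy evaluation I mean is this (an entirely hypothetical sketch, not a proposal for Modin’s API):

```python
# Entirely hypothetical sketch, only to show the idea: user code records the
# expression, and nothing is read or computed until .evaluate() is called, so
# the engine gets the full picture before touching any data.
class LazySeries:
    def __init__(self, name, ops=()):
        self.name, self.ops = name, tuple(ops)

    def map(self, fn):
        # Record the operation instead of applying it immediately.
        return LazySeries(self.name, self.ops + (fn,))

    def evaluate(self, loader):
        # Only here is the data actually fetched and the recorded ops applied;
        # a real engine could inspect self.ops to optimize and split the work.
        data = loader(self.name)
        for fn in self.ops:
            data = [fn(x) for x in data]
        return data

# The user writes the expression in the simplest possible form...
expr = LazySeries("sensor_1").map(lambda x: x * 9 / 5 + 32).map(round)
# ...and nothing touches the database until evaluation time.
print(expr.evaluate(lambda name: [20.0, 21.5, 19.3]))
```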

Sorry for the delayed response; I have a paper deadline next week and I’m still behind on communications.

I have an ongoing effort to add query planning to Modin. I have designed a mechanism that allows us to do this without lazy evaluation, described in Section 6.1.1 of this paper. At worst the mechanism performs the same as lazy evaluation, and at best it is orders of magnitude faster. It is research, so it may be several months before it makes it into a release.