Modin - Fault tolerance

I am using modin to parallelize apply() function of panda-dataframe. I am using Ray as backend for modin. I am planning to use Ray on cluster through Google-compute-platform (GCP). As worker nodes, I will be using the preemptible instances, which can be brought down by GCP anytime. Thus, I am interested to know whether modin will automatically take care of the fault-tolerance of parallel operations. If yes, any documentation link or specific Github link where I can read how modin is exploiting the fault-tolerant features of the backend (e.g https://docs.ray.io/en/master/fault-tolerance.html) would be helpful.

Hi @Shaswata_Jash

I have not written any documentation on this yet, but it will be good to put here and make a documentation page from these notes in the future :slight_smile: . These comments only apply to Ray.

From your comment, I see that you are only using preemtible instances for workers, this is good. You should not do this for the driver because it will kill the cluster.

When a worker is removed, the data on that machine is lost. Ray keeps a lineage for each object, which is very powerful because we can lose any object and reconstruct it as long as the data was not originally on the driver. The origin of the data is important here.

I assume your data is coming from S3 or Google cloud storage. This is necessary because workers want equal access to the data and for reconstruction it will be important. We don’t have Google cloud storage reads natively implemented yet, if it is necessary for your application please open an issue. We have proper abstractions to add it but have been prioritizing requests first. Requesting does make a difference on what we implement :slight_smile: .

Depending on the complexity of the apply, it may take some time to reconstruct the data. Data shuffling can often be a big part of apply and if multiple nodes are removed, we may have to start the entire computation over from the beginning. This would of course be the case with other fault tolerant systems, but with Ray we have finer grained fault tolerance so we don’t always have to start over.

Does this make sense? Do you have any other questions?