df.merge fails with an ObjectLostError in Ray

Hi all,

I’m trying to use Modin to run a distributed merge on random data (10M rows, 2 columns, int). The cluster has 2 nodes with 16 cores each, and I’m using Ray 1.4.1. I believe there’s enough memory in the cluster for this operation, but it fails with “ray.exceptions.ObjectLostError: Object is lost due to node failure”.
The largest merge I was able to complete was 1M rows.
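
For reference, here is a minimal sketch of the kind of merge described above. It is not the exact script from the GitHub issue: the column names (`key`, `val`), the seed, the inner join, and the helper name `merge_random` are all assumptions made for illustration. The helper takes the DataFrame module as an argument so the same code can be run with plain pandas on a small input before trying Modin on the cluster.

```python
import numpy as np


def merge_random(pd_module, n_rows, seed=0):
    """Build two random int frames with n_rows each and inner-merge them on 'key'.

    pd_module is either `pandas` or `modin.pandas` (both expose DataFrame/merge).
    Column names and the join type are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    left = pd_module.DataFrame({
        "key": rng.integers(0, n_rows, size=n_rows),
        "val": rng.integers(0, n_rows, size=n_rows),
    })
    right = pd_module.DataFrame({
        "key": rng.integers(0, n_rows, size=n_rows),
        "val": rng.integers(0, n_rows, size=n_rows),
    })
    # Both frames have a 'val' column, so suffixes keep the output columns distinct.
    return left.merge(right, on="key", how="inner", suffixes=("_left", "_right"))


if __name__ == "__main__":
    import ray
    import modin.pandas as mpd

    # Assumes a Ray cluster is already running and reachable from this node.
    ray.init(address="auto")
    merged = merge_random(mpd, 10_000_000)
    print(merged.shape)
```

Running `merge_random(pandas, 1_000_000)` first on a single machine is a cheap way to check that the script itself is fine before scaling up the row count on the cluster.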

Given the resources available on the nodes, I believe the cluster should be able to handle more than this.

Please let me know what I’m missing here.

I have opened a GitHub issue for this, with more information on the cluster and the error logs.

Hi @nirandaperera, thanks for the report!

As @YarShev mentioned in the GitHub issue, I think the best thing to do first is to upgrade to Ray 1.7, since a lot of node stability issues have been fixed since then.
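
To confirm which Ray version is actually installed on each node before and after upgrading, something like the following could be run on every machine. This is only a sketch using the standard library; the helper names `major_minor` and `needs_upgrade` are made up for this example, and it assumes the version string starts with numeric `major.minor` components.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum Ray version suggested in this thread.
MIN_RAY = (1, 7)


def major_minor(version_string):
    """Return (major, minor) as ints, e.g. '1.4.1' -> (1, 4)."""
    parts = version_string.split(".")
    return int(parts[0]), int(parts[1])


def needs_upgrade(version_string, minimum=MIN_RAY):
    """True if the given version is older than the minimum (compared numerically)."""
    return major_minor(version_string) < minimum


if __name__ == "__main__":
    try:
        installed = version("ray")
    except PackageNotFoundError:
        print("ray is not installed in this environment")
    else:
        if needs_upgrade(installed):
            # For example: pip install -U "ray>=1.7"
            print(f"ray {installed} is older than 1.7; consider upgrading")
        else:
            print(f"ray {installed} is recent enough")
```

Comparing integer tuples rather than raw strings matters here, since a string comparison would treat "1.10" as older than "1.7". All nodes in the cluster should report the same version.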