Using Modin with Ray Tune

I am working on a project built around Ray Tune and Pandas. I have known about Modin for a while, but only recently realized it could help with data processing in my project by replacing Pandas. However, I quickly ran into issues running Modin dataframes inside Ray Tune Trainables.

The first issue was that Modin dataframes were not serializable and thus could not be shared by Ray. Installing Modin from the master branch fixed this.

However, a second issue has arisen: when cross-validating a sklearn model with a Modin dataframe inside a Ray Tune Trainable, nothing happens. It’s as if no computations are being done, or they are being done very slowly, much slower than with a Pandas dataframe. I thought this was a resource allocation issue, so I reduced the number of CPUs used by Ray Tune, but that did not change anything. Then I tried giving more CPUs to each Ray Tune Trainable, but that did not work either. I have narrowed the issue down to sklearn’s cv.split, which is where the Modin dataframe hangs. (Note: I have not checked whether this happens outside of Ray Tune. If it does, is this a known issue?)

I imagine the problem is nested Ray usage: a Modin dataframe, which itself runs on Ray, running inside another Ray actor. I was under the impression that Ray would know how to handle such situations, but perhaps I am doing something wrong. I have tried both passing the Modin dataframe to the trainables through the Ray object store, and passing a pandas dataframe and converting it back to Modin inside the trainable.

Has anyone tried using Modin with Ray Tune? Or is the issue perhaps with cv.split? Any pointers would be appreciated.

@Yard1 I think it might be the case that running Modin inside of several Ray Actors is causing the problem. A more natural approach might be to pass the individual partitions within Modin into the Tune actors.

I’m reaching out to the Ray team to see if there are any thoughts on their side.

Hey @Yard1, I don’t know of anyone who has gotten this to work before. I think the first thing to try is isolating the issue (as you’ve done with cv.split).

Maybe you could try doing something with basic Scikit-learn (and setting n_jobs=1) to see if you can create a reproducible script.
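Something along these lines might work as a starting point. It is only a sketch under my own assumptions: the synthetic data, the KFold splitter, the LogisticRegression model, and the helper names (make_data, time_split, time_cross_val) are all placeholders I made up, not anything from your project. It times cv.split and a cross_val_score run with n_jobs=1 on plain pandas first, then repeats the split with a Modin frame, outside of Ray Tune, so the two can be compared.

```python
import time

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score


def make_data(n_rows=1000, n_cols=5, seed=0):
    """Build a small synthetic classification dataset (placeholder data)."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame(
        rng.normal(size=(n_rows, n_cols)),
        columns=[f"f{i}" for i in range(n_cols)],
    )
    y = (X["f0"] + X["f1"] > 0).astype(int)
    return X, y


def time_split(X, y, n_splits=5):
    """Time only cv.split, the call reported to hang with a Modin frame."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    start = time.perf_counter()
    folds = list(cv.split(X, y))  # force the generator to run
    return len(folds), time.perf_counter() - start


def time_cross_val(X, y, n_splits=5):
    """Time a full cross-validation with n_jobs=1 (no joblib workers)."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    model = LogisticRegression(max_iter=200)
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=cv, n_jobs=1)
    return scores, time.perf_counter() - start


if __name__ == "__main__":
    X, y = make_data()
    print("pandas split:", time_split(X, y))
    print("pandas cross_val seconds:", time_cross_val(X, y)[1])

    # Modin is imported only here so the pandas baseline runs without it.
    import modin.pandas as mpd

    mX, my = mpd.DataFrame(X), mpd.Series(y)
    print("modin split:", time_split(mX, my))
```

If the Modin split is already slow here, the problem is probably not specific to Tune. If it is fast, the next step would be calling the same functions from inside a Trainable.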

I’ll have to put this on the backburner for now and revisit it some time later. Thanks.