I am working on a project based around Ray Tune and Pandas. I have known about Modin for a while, but only recently realized it could help with data processing in my project by replacing Pandas. However, I quickly ran into issues with running Modin dataframes inside Ray Tune Trainables.
First, the issue was that Modin dataframes were not serializable, and thus could not be shared by Ray. This was fixed by installing from the master branch.
However, a second issue then arose: when cross-validating a sklearn model with a Modin dataframe inside a Ray Tune Trainable, nothing happens. It’s as if no computations are being done, or they are being done very slowly, much slower than with a Pandas dataframe. I thought this was a resource allocation issue, so I reduced the number of CPUs used by Ray Tune, but that did not change anything. Then I tried giving more CPUs to each Ray Tune Trainable, but that did not help either. I have narrowed the issue down to sklearn’s cv.split; that’s where the Modin dataframe hangs. (Note: I have not checked whether this also happens outside of Ray Tune. If it does, is this a known issue?)
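To make the cv.split step concrete, here is a minimal sketch of how the split could be isolated from the dataframe itself. It assumes a splitter such as KFold, which only depends on the number of rows, so it can be given an array of row positions instead of the dataframe. The helper name `split_positions` is hypothetical, and a plain pandas dataframe stands in for the Modin one so the snippet runs without Ray. Whether this avoids the hang with Modin is untested.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def split_positions(df, n_splits=5):
    """Compute K-fold (train, test) index pairs from row positions only.

    Hypothetical helper: only len(df) is used, so the splitter never
    touches the dataframe's contents.
    """
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    positions = np.arange(len(df))  # plain NumPy array of row positions
    return list(cv.split(positions))


# Stand-in data; in the real project this would be the Modin dataframe.
df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) % 2})

for train_idx, test_idx in split_positions(df, n_splits=5):
    # Positional indexing with the precomputed indices.
    train_part = df.iloc[train_idx]
    test_part = df.iloc[test_idx]
```

Timing this same call outside of Ray Tune, once with a pandas dataframe and once with a Modin one, would also show whether the slowdown is specific to running inside a Trainable.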
I imagine the problem is nested Ray usage: a Modin dataframe, which itself runs on Ray, being used inside another Ray actor. I was under the impression that Ray would know how to handle such situations, but perhaps I am doing something wrong. I have tried both passing the Modin dataframe to the trainables through the Ray object store, and passing a pandas dataframe and then converting it back to Modin inside the trainable.
Has anyone tried using Modin with Ray Tune? Or is the issue perhaps with cv.split? Any pointers would be appreciated.