Modin engine: Dask or Ray?

Hi all,

Thanks for the amazing project: all efforts which aim at making Pandas faster are to be commended, and this one looks really good.
I see that Modin can use either Dask or Ray as an engine. Looking at https://github.com/modin-project/modin#api-coverage it also seems that the API coverage is similar. However, by looking at a few issues:

I got the impression that Modin on Ray is more mature (there are also some threads here, but as a new user I cannot add more than 2 links to a post). Am I correct? I’ll be running Modin in Docker, so I can install either Dask or Ray as an engine. Since I have some (limited) experience with Dask (and 0 w/ Ray), and I already have Dask as a dependence in some projects, I’d rather use it than introduce another dependence. However, if Modin on Ray is more stable, I’ll bite the bullet :slightly_smiling_face: and just add the Ray dependency.

Hi @AndreaPi, welcome to the Discourse!

Modin on Ray is more mature. We started with the Ray backend proof of concept almost 2 years ago. The Dask backend has been under development for much less time and the issue you are referencing is something that I’ve been uncovering since I’ve been adding more functionality for Dask. There’s a warning that gets thrown when Dask is used that says it is still experimental/under development because it’s not as mature. I’m working to fix this lately and some of my early work has been paying off. Dask handles some of the functionality slightly differently than Ray, so some extra work has been required to add the support.

I encourage you to test them both for yourself. I believe that the issue you linked is probably linked to my MacOS operating system as we test this functionality on Ubuntu and Windows as a part of the test infrastructure we set up. I am still investigating that issue to see where it is originating.

Thanks for posting the question!

1 Like