Modin on a Dask cluster


It looks like Modin using Dask as a back end is supported but it is not clear if it is working on a cluster mode.

I have dask on AWS fargate running but I am not sure how to enable modin to run across the cluster.

I have not tested this on Dask myself, so I cannot speak to how well it will work.

You should be able to start a Dask client before you import modin and Modin will connect to it. If you try this, please let me know how it works!

Trying this now with pip install modin[Dask] on a fargate cluster. Dask and modin are in a container.

So you’re saying without any other changes modin will be the default pandas on Dask workers?

Suppose I do a something.compute() with a Dask Dataframe, how does it tell the cluster to use modin.pandas instead of pandas by default?

What does pip install dask[modin] do? Automatically use modin calls for all pandas calls?

@w601sxs Modin and Dask Dataframe objects are not interchangeable or compatible. We use the Dask multiprocessing engine via the Client API to schedule computation. If you wanted to use Modin, you would not also be able to use dask.dataframe.

How it works is effectively start your Dask cluster and have the Client on your driver connect to that cluster. Then, you just run import modin.pandas as pd and Modin should detect that existing Client object and be able to use the cluster you set up.

For some context about Dask DataFrame vs Modin, here is a link to a more detailed explanation of the differences: