Dividing a Modin DataFrame into partitions on a cluster

Hi guys,

Why do we divide a Modin DataFrame into partitions based on the num_cpus of the head node when we run Modin on a cluster? Shouldn’t we consider the num_cpus of all the nodes in the cluster?

Thanks in advance!

num_cpus is only calculated from the local machine when a local cluster is initialized by Modin. Otherwise, we should be getting that number from the execution engine (Ray or Dask).

Have you found that the number of CPUs is incorrect in the cluster? Are you sure all of the workers were connected when you checked?
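If it helps to check, here is a quick sanity-check sketch comparing what the engine sees against the head node, assuming the Ray engine (ray.cluster_resources() and multiprocessing.cpu_count() are the standard APIs; the script itself is just a diagnostic, not Modin code):

```python
import multiprocessing

import ray

# Connect to the already-running cluster rather than starting a local one
ray.init(address="auto")

# CPUs Ray reports across all connected nodes
cluster_cpus = int(ray.cluster_resources().get("CPU", 0))

# CPUs visible to the head node alone
head_node_cpus = multiprocessing.cpu_count()

print(f"cluster CPUs: {cluster_cpus}, head-node CPUs: {head_node_cpus}")
```

If the two numbers differ once all workers are connected, Modin is partitioning with the smaller, head-node-only count.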

If we have two nodes with num_cpus=16 each, for instance, I would expect a Modin DataFrame to be divided so that row_part_len = df.shape[0] / (num_cpus * 2) and col_part_len = df.shape[1] / (num_cpus * 2). However, since we use multiprocessing.cpu_count() on the head node to set NPartitions by default, we get a different division: row_part_len = df.shape[0] / num_cpus and col_part_len = df.shape[1] / num_cpus. We never get the number of CPUs available in the cluster from the execution engine. I think we should set NPartitions in the respective engine-initialization functions, or somewhere similar.
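In the meantime, I can work around it in my own script by overriding the default from the engine before creating any DataFrames. This is only a sketch, assuming the Ray engine: modin.config.CpuCount and modin.config.NPartitions are real Modin config options, and using one partition per cluster CPU simply mirrors Modin’s current default:

```python
import ray
import modin.config as cfg

ray.init(address="auto")

# Total CPUs across every node, as reported by the engine
cluster_cpus = int(ray.cluster_resources()["CPU"])

# Make Modin split along each axis using the cluster-wide count
cfg.CpuCount.put(cluster_cpus)
cfg.NPartitions.put(cluster_cpus)

# Import after configuring so the settings take effect
import modin.pandas as pd
```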

You are correct. Previously, the number of partitions was computed based on the num_cpus provided by the engines, but it looks like that was removed.

It should probably go back here: modin/utils.py at 954f84a205597e2f198024b4fd83d1ff75940016 · modin-project/modin · GitHub. Will you open an issue on this? It needs to be fixed as soon as possible.

Sure, the issue is here. We should also fix this for Dask.
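For Dask, the analogous count could come from the scheduler. A minimal sketch, where distributed’s client.ncores() is a real API, the scheduler address is a placeholder, and wiring the value into Modin’s config this way is my assumption:

```python
from distributed import Client

import modin.config as cfg

# Placeholder address for an existing Dask scheduler
client = Client("tcp://scheduler:8786")

# Total worker threads across the whole Dask cluster
cluster_cpus = sum(client.ncores().values())

cfg.CpuCount.put(cluster_cpus)
cfg.NPartitions.put(cluster_cpus)
```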
