Modin really slow with ray client

I’m trying to get moding working with Ray client, and the performance is really bad. From the ray dashboard, it looks like only one worker is getting utilised, despite setting MODIN_CPUS explicitly. I am using master branch for modin and nightly for ray. The dataset is from the higgs XGBoost example here - xgboost_ray/higgs.py at 7f0384b1d74935f52aae9ab6b25e74d96bd7aef8 · ray-project/xgboost_ray · GitHub

Here’s repro code -

import ray
import ray.util
ray.util.connect("<service_ip>:50051")
import os
os.environ["MODIN_CPUS"] = "16"
import modin.pandas as pd

colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]
df = pd.read_csv("s3://<s3_bucket>/HIGGS.csv", names=colnames)

for i, row in (df.iterrows()):
    df.at[i,"sum"] = row['feature-01'] + row['feature-02'] + row['feature-03']

Hi @Bhavya_Agarwal , thanks for the report! Can you print the shape of your DataFrame, please? After reading data from your csv file you need to do the following: print("shape =", df._query_compiler._modin_frame._partitions.shape). If the shape is really equal to (1, 1), please, try setting the NPartitions for your DataFrame explicitly as follows:

from modin.config import NPartitions
NPartitions.put(<your_number>)

@YarShev The shape is (56,1), so I think it has partitions already which is same as total number of cores we have. I tried setting it up explicitly too, but it yielded the same result.

@Bhavya_Agarwal , hm, you want to say that pd.read_csv and the rest operations are being executed using one core only?

@YarShev That is the impression I got from ray dashboard where I saw only one worker busy. I think it is happening for iterrows as I did not get a results back even after waiting for an hour.

@Bhavya_Agarwal Looping with iterrows is not as parallelizable as something like apply. We can parallelize pandas APIs, but iterrows does not scale. Can you share a little more about your workflow? Is it possible to avoid using iterrows?

Also, do you think a warning about not being able to parallelize the Python for loop over iterrows would be helpful? I have thought about adding a warning, but I’m not sure if people find that it’s too much.

That makes sense, thanks for clarifying that @devin-petersohn. I can avoid iterrows if apply is supported. I think I assumed that the tag Y in this page - pd.DataFrame supported APIs — Modin 0.9.1+17.g15eed05 documentation means that it will be parallelized when called via modin. I think adding a note to the supported API table on that page would be really helpful.

@Bhavya_Agarwal I see what you mean. In that table, Y means we will not convert to a single node implementation, I agree there should at least be some note there. Thanks!