Modin on Ray gets stuck in AWS

I’m using the Modin API on a Ray cluster launched in AWS. However, it hangs at df = df.mode() and never gives any output. Based on my observation, the job always gets stuck wherever data shuffling is required among nodes, such as in mode(), mean(), and groupby().
I have one head node and 3 worker nodes, and on the dashboard I can see that all 4 nodes are alive.

Here is the code I’m running:

import ray
import modin.pandas as pd
import numpy as np

ray.init()

input_path = '/home/ubuntu/test_data.snappy.parquet' 

df = pd.read_parquet(input_path)
df = df.select_dtypes(np.number)
df = df.mode()
print(df)

ray.shutdown()

Has anyone encountered this issue before? Or has anyone successfully run Modin on Ray in AWS? Any suggestions would be appreciated.

Hi @j_j12!
Thank you so much for opening this issue! If you’re running in cluster mode, I believe one potential problem may be that you are specifying a local file path to read_parquet, which may not be available on all nodes of the cluster! Could you please try printing the dataframe directly after the read_parquet call to confirm that the data is ingested correctly?

Hi @rdurrani, thank you for the reply! I tried both reading from local disk (with the data available on all nodes) and reading from S3. In both cases the data is ingested correctly, and after I apply select_dtypes(np.number) it also gives output; it only gets stuck at the line df.mode().

Finally I found out it was because the ports between the worker nodes were not open. After opening the ports, I was able to run the code.
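For anyone hitting the same symptom: a quick way to check whether a given port on another node is reachable is a plain TCP connect. This is only an illustrative sketch (port_open is a hypothetical helper, not part of Ray); run it from one worker against another worker's private IP and the ports your Ray version reports (the IP and port values below are placeholders).

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    # Attempt a plain TCP connection; True means the port is reachable.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (placeholder IP and ports): from worker A, probe worker B.
# for port in (6379, 8076, 8077):
#     print(port, port_open("10.0.0.12", port))
```

On AWS specifically, the usual fix is a security group rule that allows traffic between the cluster's nodes, for example an inbound rule whose source is the cluster's own security group.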

@j_j12 I’m happy to hear that you figured that out! How did you diagnose and fix the problem? Other people viewing this post might find that information helpful.

Hi @mahesh, ok sure.
So firstly I ran the following sample code:

import socket
import time

import ray


@ray.remote
def f():
    time.sleep(0.001)
    # Return the IP address of the node this task ran on.
    return socket.gethostbyname(socket.gethostname())


object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

and I was able to get the result, which indicated that all the worker nodes and the head node were doing some work, and that the communication between the head node and the worker nodes was set up properly. If I called an API that only does a 1-to-1 mapping (so no shuffling between nodes is required), I could also get the result. So I suspected that the issue was caused by the communication between the worker nodes.

@j_j12 thank you. Do you think someone could update the Ray documentation or implementation to prevent other people from running into the same problem? If so, I recommend opening an issue on the Ray GitHub.