Dask Engine Performance Issues and Experimental Ray

Hi, I came upon this project recently and was impressed by the work and initiative, so kudos for all you have done so far.

I was looking at applying Modin to some of my work to see if I could get performance improvements and ran into some trouble. For context, I am running Python 3.8.5 in a Jupyter Notebook on a Windows 2012 Server. I am running Modin with the Dask engine based on the recommendations in the docs (although I noticed Ray now has experimental Windows support, so is it available for Modin as well?).

from distributed import Client
client = Client(processes=False)  # 1 worker / 6 cores / 68GB memory
import modin.pandas as pd

# 25GB parquet file
model_df = pd.read_parquet(path=r'E:\Example.parquet', engine='pyarrow')
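In case it matters, I believe the engine can also be pinned explicitly before the import (MODIN_ENGINE is what I saw in the docs, though I have not verified it is needed on top of creating the Dask client first):

import os
os.environ["MODIN_ENGINE"] = "dask"  # must be set before importing modin.pandas
import modin.pandas as pd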

The runtime is ~23 minutes, while normal pandas takes closer to ~6 minutes. Is this expected, or am I missing something here? For the record, I was initially getting "OSError: [Errno 28] No space left on device" before clearing 10GB on my C drive, which allowed it to run (I did not think it would have to write to disk). I could look at switching the Dask directory to another disk if additional space would help. Thanks for the help in advance.
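Something like this is what I had in mind for moving the Dask scratch space, if that is the right lever (E:\dask-tmp is just a placeholder path, and I have not confirmed temporary-directory is the correct config key):

import dask

# Point Dask's spill/scratch directory at the larger drive before creating the Client.
# r'E:\dask-tmp' is a placeholder; the key name is my best guess from the Dask config docs.
dask.config.set({"temporary-directory": r"E:\dask-tmp"})

from distributed import Client
client = Client(processes=False)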

Hi @ThomasMorland, thanks for the question!

Does the same thing occur if you use processes=True? We have not been testing with the multithreaded Dask Client. This should be documented, but it is not.
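Concretely, something along these lines is what I mean (the worker/thread counts and memory_limit below are only illustrative for a 6-core / 68GB box, so adjust to taste):

from distributed import Client

# Process-based workers instead of one multithreaded worker.
# n_workers / threads_per_worker / memory_limit are illustrative values, not a recommendation.
client = Client(processes=True, n_workers=6, threads_per_worker=1, memory_limit="10GB")

import modin.pandas as pd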

Ray does have experimental Windows support; I have been using Ray on Windows as my primary development machine for a little over a month. It has occasional hiccups, but generally I am happy with it. If you try it, let me know how it goes!
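If you do give it a shot, switching Modin over is just the engine setting before the modin.pandas import (a minimal sketch; Ray should initialize itself on first use, or you can call ray.init yourself beforehand):

import os
os.environ["MODIN_ENGINE"] = "ray"  # tell Modin to use Ray instead of Dask
import modin.pandas as pd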

Thanks for the prompt reply, Devin.

I had a feeling that the multithreading was an issue, so I had since tested with the default process-based parameters:

from distributed import Client
client = Client(processes=True)  # 3 workers / 6 cores / 68GB memory
import modin.pandas as pd

# 25GB parquet file
model_df = pd.read_parquet(path=r'E:\Example.parquet', engine='pyarrow')

This leads to several warnings (see below) before ultimately killing the worker. I have been working through the related issues on the Dask GitHub tonight and can't quite figure it out. It seems strange to be running into memory issues considering the headroom I should theoretically have. Any ideas?
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting

KilledWorker
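One thing I was about to experiment with is loosening the worker memory thresholds before creating the Client, roughly like this (the key names are my reading of the distributed config docs and the values are guesses, so treat it as a sketch):

import dask

# Relax the distributed worker memory thresholds (fractions of the memory limit).
# Key names/values are my best guess from the docs, not a verified fix.
dask.config.set({
    "distributed.worker.memory.target": 0.80,      # start spilling to disk at 80%
    "distributed.worker.memory.spill": 0.90,       # spill more aggressively at 90%
    "distributed.worker.memory.pause": 0.95,       # pause accepting work at 95%
    "distributed.worker.memory.terminate": False,  # do not kill the worker outright
})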

The news about Ray is exciting; I will most certainly give it a go this week and report back to you. Hopefully I will be a good test case to see if it is robust enough for a novice-to-intermediate user to figure out.

Thanks @ThomasMorland, I am looking into the implementation to see if there’s something there that would cause this.

We do a lot of performance testing on Ray, but the Dask engine is a bit newer so it’s possible something slipped through the cracks with I/O performance. It will take a little time, but I will report back when I know more.

I appreciate you looking into it @devin-petersohn, although I am having good success with Ray after a few early bumps. It brought that parquet file I/O read down to about 3.5 minutes, compared to the original 6 minutes I was getting with default pandas. I noticed that Ray defaults to using only 25% of RAM. Is the best way to explicitly set the memory in Ray (for Modin) something like this:

import ray
ray.init(object_store_memory=32 * (10**9))  # 32GB (50% of RAM, not sure what I should limit this to)
import modin.pandas as pd
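To sanity check what Ray actually registered, I have just been printing the cluster resources after init (I am not sure which units each value is reported in, so I only eyeball the numbers):

import ray
print(ray.cluster_resources())  # shows the CPU count and object_store_memory Ray registered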

It seems like, as Ray develops on Windows, you won't have to spend as much of your development time on multiple engines.

50% is perfectly fine; normally I allocate 60-70%. Your machine has a lot of memory, so there shouldn't be memory problems given your initial dataset size.
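If you would rather not hard-code the number, a rough sketch is to derive it from total RAM (psutil is an extra dependency here, and the 0.6 fraction is just my personal habit, not a Modin requirement):

import psutil
import ray

# Reserve roughly 60% of total RAM for Ray's object store (rule of thumb, not a hard requirement).
object_store_bytes = int(psutil.virtual_memory().total * 0.6)
ray.init(object_store_memory=object_store_bytes)

import modin.pandas as pd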