Hi, I came across this project recently and was impressed by the work and initiative, so kudos for all you have done so far.
I was looking at applying Modin to some of my work to see if I could get performance improvements, and I ran into some trouble. For context, I am running Python 3.8.5 in a Jupyter Notebook on Windows Server 2012. I am using Modin with the Dask engine, based on the recommendations in the docs (although I noticed Ray now has experimental support for Windows, so is it available for Modin as well?).
    from distributed import Client
    client = Client(processes=False)  # 1 worker/6 cores/68GB memory
    import modin.pandas as pd
    # 25GB parquet file
    model_df = pd.read_parquet(path=r'E:\Example.parquet', engine='pyarrow')
The runtime is ~23 mins with Modin, while plain pandas takes closer to 6 mins. Is this expected, or am I missing something here? For the record, I was initially getting “OSError: [Errno 28] No space left on device” until I cleared 10 GB on my C drive, which allowed it to run (I did not think it would have to write to disk). I could look at switching the Dask directory to another disk if additional space would help. Thanks in advance for the help.
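In case it helps the discussion, here is a rough sketch of how I would try moving the Dask scratch directory to a disk with more room. The helper `pick_scratch_dir`, the `E:\dask-scratch` folder, and the 50 GB threshold are all made up for illustration. The only Dask-specific part is the `temporary-directory` config key, which as far as I understand has to be set before the `Client` is created. I have not confirmed that this changes the runtime.

```python
import os
import shutil
import tempfile


def pick_scratch_dir(candidates, min_free_gb=0.0):
    """Return the first existing directory with at least min_free_gb free, else None.

    Hypothetical helper for illustration; not part of Modin or Dask.
    """
    for path in candidates:
        if not os.path.isdir(path):
            continue  # skip folders that do not exist on this machine
        free_gb = shutil.disk_usage(path).free / 1024 ** 3
        if free_gb >= min_free_gb:
            return path
    return None


# Assumed layout: prefer a folder on the E: drive, fall back to the system temp dir.
scratch = pick_scratch_dir([r"E:\dask-scratch", tempfile.gettempdir()], min_free_gb=50)

try:
    import dask

    if scratch is not None:
        # Point Dask's temporary/spill files at the chosen folder.
        # This should run before Client(...) is constructed.
        dask.config.set({"temporary-directory": scratch})
except ImportError:
    # Dask is not installed in this environment; nothing to configure.
    pass
```

If `scratch` comes back as `None`, neither location had the requested free space and the Dask default is left alone.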