How to process multiple parquet files effectively using Modin?


I am a newbie here, exploring alternatives to Spark for transforming my huge (approx. 1 TB) numerical dataset.

I have tens of thousands of parquet files. Most of my operations involve transforming and grouping by parquet file. Hence, I think (correct me if I am wrong) that if each parquet file is treated as a Modin partition, processing should be quick and efficient. So I tried the following:

import ray

@ray.remote
def selva_read_parquet(fp):
    return pd.read_parquet(fp)

# Read every file as a separate Ray task, then stitch the results together
futures = [selva_read_parquet.remote(fp) for fp in cleaned_files]
dfs = ray.get(futures)
df = pd.concat(dfs)

When I run this, I get the exception below:

RayTaskError(_DeadlockError): ray::selva_read_parquet()
_frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('modin.core.execution.ray.common.utils')

Am I doing something wrong here?


What is `pd` in your `selva_read_parquet()`? Am I right in assuming it's `modin.pandas`?
If that is true, then you should remove the `@ray.remote` decorator, the `.remote()` call, and `ray.get()` - Modin handles all of this internally, and doing it manually makes work that should happen on the "driver process" (i.e. the main process running your code) run inside the workers instead, which is what triggers the import deadlock.

Are your files by any chance laid out as a parquet directory? Modin can read a partitioned dataset in parallel by itself, without any manual parallelization or concatenation on the user's side.