I am a newbie here, and I am exploring Spark alternatives for transforming my huge (approx. 1 TB) numerical dataset.
I have tens of thousands of parquet files. Most of my operations involve transforming and grouping by parquet file. Hence, I think (correct me if I am wrong) that if each parquet file is treated as one Modin partition, it should be quick and effective. So I tried the following:
```python
import ray
import modin.pandas as pd

@ray.remote
def selva_read_parquet(fp):
    # Read one parquet file per Ray task
    return pd.read_parquet(fp)

futures = [selva_read_parquet.remote(fp) for fp in cleaned_files]
dfs = ray.get(futures)
df = pd.concat(dfs)
```
When I do this, I get the exception below:
```
RayTaskError(_DeadlockError): ray::selva_read_parquet()
_frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('modin.core.execution.ray.common.utils')
```
Am I doing something wrong here?
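In case it helps frame the question: since my goal is literally "one parquet file per partition," I also wondered whether I should instead read each file with plain pandas inside the Ray task and then assemble a Modin frame from the object refs. A rough sketch of what I mean is below; I haven't verified this is a supported pattern, and `from_partitions` accepting Ray `ObjectRef`s directly (plus the `read_one` helper name) is my assumption:

```python
import ray
import pandas  # plain pandas inside the task, so Modin is not imported in workers
from modin.distributed.dataframe.pandas import from_partitions

@ray.remote
def read_one(fp):
    # Hypothetical helper: return a plain pandas DataFrame per file
    return pandas.read_parquet(fp)

refs = [read_one.remote(fp) for fp in cleaned_files]
# Assemble a Modin DataFrame with one row partition per file
# (axis=0 means the partitions are row partitions)
df = from_partitions(refs, axis=0)
```

Is something like this the recommended route, or should the `ray.remote` + `modin.pandas` pattern above work as written?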