Hi,
I am trying to use Modin to parallelize a few pandas operations. However, when I run them and watch CPU usage with htop in a terminal, they do not seem to use all 8 cores on my machine.
Here is what I am doing. I have a large dataframe df (553257 rows), which is a subset of a much larger dataset, and I run:
df1 = df.groupby(['Id', 'Title']).agg({'Text': ' '.join}).reset_index()
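To make the operation concrete, here is a minimal sketch of the same groupby on a toy frame in plain pandas. The column values are made up for illustration; only the column names match my real data.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are invented for illustration).
df = pd.DataFrame({
    "Id": [1, 1, 2],
    "Title": ["a", "a", "b"],
    "Text": ["hello", "world", "foo"],
})

# Join all Text values within each (Id, Title) group with a single space,
# then turn the group keys back into ordinary columns.
df1 = df.groupby(["Id", "Title"]).agg({"Text": " ".join}).reset_index()
print(df1)
```

Since ' '.join is an arbitrary Python callable rather than a built-in aggregation such as sum, I suspect it may be harder for any engine to parallelize, but that is a guess on my part.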
In the Modin docs (https://modin.readthedocs.io/en/latest/UsingPandasonRay/dataframe_supported.html), I noticed that groupby is listed as "not yet optimized" and agg as "partially implemented". If the docs are up to date, could this be the cause?
I also tried calling import ray; ray.init(num_cpus=8) before running the operation, but that did not help. I have both Dask and Ray installed, and I have not explicitly set a compute engine, hoping that Modin will choose one automagically.
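For reference, this is a sketch of how I understand the engine could be pinned explicitly instead of left to auto-detection. It assumes Modin reads the MODIN_ENGINE and MODIN_CPUS environment variables and that they must be set before modin.pandas is imported. I have not confirmed that this changes the core usage for my groupby.

```python
import os

# Assumption: Modin picks these up from the environment, so they are set
# before the first `import modin.pandas`.
os.environ["MODIN_ENGINE"] = "ray"  # "dask" would select Dask instead
os.environ["MODIN_CPUS"] = "8"      # number of cores Modin should use

try:
    import modin.pandas as pd
except ImportError:
    # Fallback so the sketch still runs where Modin is not installed.
    import pandas as pd

# Shows which library actually backs `pd` in this session.
print(pd.__name__)
```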
If you have any other suggestions for making sure all cores are utilized, that would be great.
Thanks!
Sri.