Modin provides a speedup but cores are not utilised

I tried using Modin to read a CSV file (about 5 GB). Below is the code I used:

import modin.pandas as pd_modin

for run in range(0, 1):
    df = pd_modin.read_csv("DM_ALUNO.CSV")

I get a good speedup compared to pandas on an 80-core Intel Skylake CPU. What confused me is that when I profiled my code with the Intel VTune Profiler (one of the most standard tools for profiling CPU usage), the CPU usage histogram was almost the same for pandas and Modin. The attached image is the CPU histogram collected for Modin (the average core utilisation is far too low for an 80-core CPU).

Could you shed some light on why the core usage has not increased with Modin? And how does Modin provide the speedup if not by using more cores?

Hi @Arun_Jose, thanks for the question.

I think there is a known issue with VTune and Python where it does not show the utilization correctly. Others on the team will have more to say about this, but I believe all CPUs are being utilized.

Please check that you are profiling the whole process tree with VTune. Modin spawns child processes that do the data processing, while the parent process only coordinates them. To see the whole picture, you need to profile the entire tree of processes. VTune should also take the whole process tree into account when calculating CPU utilization; I am not sure whether it does so or counts only the parent process.
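
To illustrate why measuring only the parent can be misleading, here is a minimal sketch that uses only the Python standard library, not Modin or VTune. It assumes a Unix-like system (the `resource` module and the `fork` start method). The names `burn_cpu` and `measure_cpu_split` are hypothetical, and `burn_cpu` is just a stand-in for the work a worker process would do on one chunk of data. The sketch compares the CPU time charged to the coordinating process with the CPU time charged to its worker processes.

```python
import multiprocessing as mp
import resource


def burn_cpu(n):
    """CPU-bound stand-in for the work one worker does on one chunk of data."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def cpu_seconds(who):
    """User + system CPU seconds for RUSAGE_SELF or RUSAGE_CHILDREN."""
    usage = resource.getrusage(who)
    return usage.ru_utime + usage.ru_stime


def measure_cpu_split(num_workers=4, n=2_000_000):
    """Run burn_cpu in worker processes and report where the CPU time went."""
    # Assumption: "fork" is used so the workers are direct children of this process.
    ctx = mp.get_context("fork")
    parent_before = cpu_seconds(resource.RUSAGE_SELF)
    children_before = cpu_seconds(resource.RUSAGE_CHILDREN)

    pool = ctx.Pool(num_workers)
    try:
        results = pool.map(burn_cpu, [n] * num_workers)
    finally:
        pool.close()
        pool.join()  # children count toward RUSAGE_CHILDREN once they are waited for

    return {
        "results": results,
        "parent_cpu_s": cpu_seconds(resource.RUSAGE_SELF) - parent_before,
        "children_cpu_s": cpu_seconds(resource.RUSAGE_CHILDREN) - children_before,
    }


if __name__ == "__main__":
    report = measure_cpu_split()
    print("parent CPU seconds:  ", round(report["parent_cpu_s"], 3))
    print("children CPU seconds:", round(report["children_cpu_s"], 3))
```

If the children's CPU time dominates in a run like this, a profile restricted to the parent process would report low utilization even though the workers were busy. Whether that is what happened in the VTune run above depends on how the collection was configured.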
