I have a large JSON file (~160 GB) that needs to be processed. Currently, I split this file into chunks, process the chunks with a regular multiprocessing pool on 96 cores, merge all the results, and then do more processing on the merged data. It's a little messy, but it works.
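For context, my current approach looks roughly like this (the chunk file names, `process_chunk`, and `do_something` are simplified stand-ins for my actual pipeline):

```python
import json
from multiprocessing import Pool

def do_something(record):
    # Stand-in for my real per-record processing.
    return record

def process_chunk(path):
    # Each worker loads and processes one pre-split chunk of the big file,
    # assuming each chunk is newline-delimited JSON.
    with open(path) as f:
        return [do_something(json.loads(line)) for line in f]

if __name__ == "__main__":
    # chunk_0000.json ... chunk_0095.json come from splitting the 160 GB file.
    chunk_paths = [f"chunk_{i:04d}.json" for i in range(96)]
    with Pool(processes=96) as pool:
        partial_results = pool.map(process_chunk, chunk_paths)
    merged = [row for part in partial_results for row in part]
    # ...more processing on the merged data...
```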
I am looking for a way to have all my processes read this one JSON file, whether it sits on disk or somewhere else, without loading a separate copy of the large file into each process.
Will Modin help me here? I read that the Ray object store provides this capability by letting all the processes access a single object store through shared memory. If so, is there a minimal example I can look at? It would be very useful to me.
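Based on my reading of the Ray docs, this is the kind of pattern I am imagining: `ray.put` stores an object once in the shared-memory object store, and every task then reads from that single copy. The `load_data` function and the partitioning are just placeholders here (and in reality the 160 GB file could not be loaded in one shot like this):

```python
import ray

ray.init()  # starts a local Ray runtime with a shared-memory object store

def load_data():
    # Hypothetical loader standing in for reading the real JSON file.
    return [{"id": i, "value": i * 2} for i in range(1_000_000)]

data_ref = ray.put(load_data())  # store the data once in the object store

@ray.remote
def process_partition(data, start, stop):
    # Ray resolves the object ref and hands each task the shared object;
    # numpy-backed data is read zero-copy, other objects are deserialized
    # from the single stored copy rather than loaded per process.
    return sum(row["value"] for row in data[start:stop])

# Fan out 96 tasks over the same stored object instead of 96 copies
# (ignoring the remainder rows for brevity).
step = 1_000_000 // 96
futures = [process_partition.remote(data_ref, i * step, (i + 1) * step)
           for i in range(96)]
print(sum(ray.get(futures)))
```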
I understand that Modin has an API identical to that of pandas. I am looking for a minimal Modin example where one stores a large JSON file and all the other processes access it in parallel; I do not know how to do that.
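What I have in mind is something as simple as the sketch below, assuming Modin partitions the DataFrame across Ray workers under the hood. The path and column name are placeholders, and I assume the file would need to be newline-delimited JSON to be readable in parallel:

```python
import os
os.environ["MODIN_ENGINE"] = "ray"  # use Ray as the execution backend

import modin.pandas as pd  # drop-in replacement for the pandas API

# Placeholder path; "my_file.jsonl" stands in for the real 160 GB file,
# stored as newline-delimited JSON.
df = pd.read_json("my_file.jsonl", lines=True)

# Operations like this should be dispatched to Ray workers, which share
# the partitions through the Ray object store instead of copying the frame.
result = df.groupby("some_column").agg("sum")
print(result)
```

Is this the right way to use Modin for this, or does it still end up materializing the whole file in one process?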
NOTE: I have not tried Dask yet. I am currently reading up on both Dask and Modin.
Thanks and much appreciated!