Read a very large JSON file and process in parallel

I have a large JSON file (~160 GB) that needs to be processed. Currently, I cut this file up, use regular multiprocessing (Pool) on 96 cores to do some processing, then merge all the results and do more processing on the merged data. It is a little messy, but it works.
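My current pipeline looks roughly like this (a simplified sketch; the chunk file names and the per-chunk work are placeholders for what I actually do):

```python
from multiprocessing import Pool

def process_chunk(path):
    # Placeholder for my real per-chunk processing:
    # here it just counts non-empty lines.
    with open(path) as f:
        return sum(1 for line in f if line.strip())

# chunk_000.json ... chunk_095.json are pieces cut from the 160 GB file
chunk_paths = [f"chunk_{i:03d}.json" for i in range(96)]

if __name__ == "__main__":
    with Pool(processes=96) as pool:
        partial_results = pool.map(process_chunk, chunk_paths)
    # Merge the partial results, then do more processing on the merged data.
    merged = sum(partial_results)
    print(merged)
```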

I am looking for a way to have all my processes read this one JSON file sitting on disk (or elsewhere). I do not want to load the same large file into every process.

Will Modin help me here? I read that the Ray object store provides this capability by letting all the processes access one object store through shared memory. If so, is there a minimal example I can look at? It would be very useful to me.

I understand that Modin has an API identical to that of pandas. I am looking for a minimal Modin example where one can store a large JSON file and have all the other processes access it in parallel. I do not know how to do that.

NOTE: I have not tried Dask yet. I am currently reading up on both Dask and Modin.

Thanks and much appreciated!
Srini.

Hi @srini, thanks for the question! Modin supports JSON files, provided the rows are delimited by newlines: pd.read_json("file.json", lines=True).

There is one more caveat with our current JSON implementation: it cannot be sparse (yet); every line must contain labels and values for each object (a dense representation). This is because there is a communication overhead associated with the schema of the partitions, and we haven't had the opportunity to implement that communication yet.
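To make that concrete, here is a hypothetical example of a dense, newline-delimited file that works today, next to a sparse one that does not yet:

```python
# dense.json -- supported: every line carries every key
# {"id": 1, "name": "a", "value": 10}
# {"id": 2, "name": "b", "value": 20}
#
# sparse.json -- not yet supported: lines omit some keys
# {"id": 1, "value": 10}
# {"id": 2, "name": "b"}

import modin.pandas as pd

df = pd.read_json("dense.json", lines=True)
```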

> I am looking for a minimal Modin example where one can store a large JSON file and have all the other processes access it in parallel

Modin handles all of the data distribution. It is simply a drop-in replacement, so no additional code to handle distributed data is necessary. Just use pd.read_json(...) and then operate on the object as if it were a pandas DataFrame, as in the sketch below. You do not need to know anything about distributed computing or parallel processing.
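A minimal sketch (the file name and the groupby column are just illustrative):

```python
# The only change relative to a pandas script is this import.
import modin.pandas as pd

# Modin partitions the file and reads the partitions in parallel;
# on the Ray backend the data lives in Ray's shared-memory object store,
# so worker processes do not each load their own copy.
df = pd.read_json("big_file.json", lines=True)

# From here on, use it exactly like a pandas DataFrame.
result = df.groupby("some_column").sum()
print(result.head())
```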

We have an experimental Dask backend that uses Dask instead of Ray; I am in the process of adding some features to it. Dask also has its own DataFrame, which I am sure is what you are referring to. I have an outline of the high-level differences here: https://github.com/modin-project/modin/issues/515
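If you would like to try the Dask backend, the engine can be selected with the MODIN_ENGINE environment variable before Modin is first imported (a minimal sketch; remember this backend is still experimental):

```python
import os

# Choose the execution engine before importing Modin ("ray" is the default).
os.environ["MODIN_ENGINE"] = "dask"

import modin.pandas as pd

df = pd.read_json("file.json", lines=True)
```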

If you want a short crash course in Modin and its design, I recently gave a talk that you can find here: https://www.youtube.com/watch?v=_0eVVLXrtfY

Let me know if you have any other questions!

Thanks @devin-petersohn. I will play around with it and watch your presentation. Thanks for the quick response.