Sort taking extra memory

I am measuring the memory usage of different Modin operations, using Modin on Ray. My question is about sort. My test dataframe is 100MB. If I pass inplace=True to sort and measure the plasma object store memory before and after the call, the sort still takes 11MB, even though no new dataframe is created. After calling sort, I call the Python garbage collector and wait 20 seconds to make sure it finishes. Here is the code snippet:

import ray

# Available object store memory (bytes) before the sort
memory_before = int(ray.available_resources()["object_store_memory"])
df.sort_values(by=df.columns[1], inplace=True)
repr(df)   # Materialize the data
# Available object store memory (bytes) after the sort
memory_after = int(ray.available_resources()["object_store_memory"])
display('Memory usage for sort: ' + str(memory_before - memory_after))

Is there any explanation for that extra memory usage, even after calling the garbage collector?
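
For reference, the measurement described above (read the available memory, run the operation, collect garbage, wait, read again) could be wrapped in a small helper. This is only a sketch: `measure_store_delta` and its parameters are hypothetical names, and the function that reads the available memory is passed in, so the helper makes no assumptions about Ray beyond the `ray.available_resources()` call already used in the snippet.

```python
import gc
import time


def measure_store_delta(operation, read_available, settle_seconds=0.0):
    """Return how many bytes of available memory `operation` consumed.

    `read_available` is any zero-argument callable that returns the
    currently available object store memory in bytes.
    """
    before = int(read_available())
    operation()
    gc.collect()                 # drop unreachable Python-side references
    time.sleep(settle_seconds)   # give the object store time to release objects
    after = int(read_available())
    return before - after


# Hypothetical usage with the dataframe from the question (assumes `ray`
# and `df` already exist in the session):
#
# def sort_in_place():
#     df.sort_values(by=df.columns[1], inplace=True)
#     repr(df)  # materialize the data
#
# delta = measure_store_delta(
#     sort_in_place,
#     lambda: ray.available_resources()["object_store_memory"],
#     settle_seconds=20,
# )
```

A positive result means less memory was available after the operation than before it.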

Modin version: tested on both 0.12.0 and 0.14.1

Hi @curious, thanks for posting!

We’ve made several improvements to memory and shuffling since Modin 0.12.0. Are you able to upgrade and see if that helps?

pip install -U modin

Hi @devin-petersohn, thanks for your response.

I just upgraded to Modin 0.14.1. It did not change the extra memory usage.

@curious is the expected behavior that the memory footprint stays the same before and after the sort?

To my knowledge, gc.collect doesn’t trigger a garbage collection in the Ray object store. The stale objects in the object store will be garbage collected when the system determines that more space is needed for new objects (or if the system determines it is a good time to garbage collect).

Are you operating in an environment that has low memory?

@devin-petersohn I am benchmarking different pandas / Modin operations, studying both performance and memory usage. I asked about triggering the Ray garbage collector on the Ray forums, and someone from the Ray team told me that:

Ray uses distributed reference counting to manage objects in the object store, and the reference counting is tied to Python GC. So to answer your question, to delete objects in the Ray object store, you can just call del on the object reference from Python.

If the memory before and after an operation that does not return a new object is not the same, even after calling the garbage collector, I want to determine whether that is a memory leak. If it is, then based on the quote above, some object used in the sort probably needs to be deleted. I am not sure about that, though.
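
One rough way to separate a leak from a one-time overhead might be to repeat the operation and check whether the available memory keeps shrinking on every run. The sketch below is a heuristic, not a Modin or Ray diagnostic, and `usage_keeps_growing` and its parameters are hypothetical names. Since the object store may free stale objects lazily, a steady drop would be a hint rather than proof.

```python
import gc


def usage_keeps_growing(operation, read_available, repeats=5, tolerance=0):
    """Run `operation` several times and compare available-memory readings.

    Returns (growing, drops), where `drops` lists the bytes lost on each
    run and `growing` is True only if every run lost more than `tolerance`.
    """
    readings = [int(read_available())]
    for _ in range(repeats):
        operation()
        gc.collect()  # release unreachable Python-side references
        readings.append(int(read_available()))
    # Difference between consecutive readings: positive means memory was lost
    drops = [prev - cur for prev, cur in zip(readings, readings[1:])]
    return all(drop > tolerance for drop in drops), drops
```

With this heuristic, a drop that appears on the first run only would suggest a fixed overhead, while a drop on every run would be worth reporting as a possible leak.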

@curious it might be worth moving this conversation to a GitHub issue. We have tested and checked for memory leaks (I am not aware of any), but it is probably worth having the conversation there so all developers can weigh in.

Thanks for the suggestion @devin-petersohn ! Sure, I will post an issue on the repo.
