Force Python to release memory
Let me start by clarifying that there are far better ways to manage Python’s memory than the one I’ll showcase here. The solution/workaround I will present was appropriate for the problem I was facing, which involved processing multiple really large pickle objects. Please note that loading several of these objects into memory at once was not even an option, as they could take up to 20 GB of memory.
The issue started when loading the second pickle object: for some reason, Python was not deallocating the memory from the first one, which exhausted all the available memory.
I started by deleting the object with del and forcing the garbage collector to run right away with gc.collect() after processing each object, to make sure there was room available for the next one. However, due to memory optimizations, the memory that a Python application has used might not be released back to the operating system right away. Once the memory is allocated, even if all references to it are freed, the space backing some objects (e.g. lists) might be kept available for reuse, avoiding having to allocate it again later on. While this is useful in most situations, especially when we are not dealing with large objects, it can also be a nightmare when we actually want to discard the allocated memory after processing a particular pickle file.
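My first attempt looked roughly like the sketch below (process_pickles is a hypothetical name and the processing step is a placeholder for my actual code):
import gc
import pickle

def process_pickles(pickle_filenames):
    for pickle_filename in pickle_filenames:
        with open(pickle_filename, "rb") as f:
            data = pickle.load(f)
        # ... process the data here ...
        del data      # drop the only reference to the loaded object
        gc.collect()  # force a collection right away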
After a while, I found out that one common solution is to create large objects within a child process, as the memory will be released once the process finishes. It turned out to be the perfect solution: each pickle object is loaded and processed in its own process, without impacting the pickle objects that come next.
import multiprocessing

# load_pickle loads and processes a single pickle file
process = multiprocessing.Process(
    target=load_pickle,
    args=(pickle_filename,),
)
process.start()
process.join()  # all memory used by the child is released once it exits
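Putting it together, a self-contained sketch might look roughly like the following (the filenames and the processing step inside load_pickle are placeholders, not my actual code):
import multiprocessing
import pickle

def load_pickle(pickle_filename):
    with open(pickle_filename, "rb") as f:
        data = pickle.load(f)
    # ... process data here; everything allocated in this child process
    # is released back to the operating system when the process exits ...

if __name__ == "__main__":
    # one child process per pickle file, so each file's memory is freed
    # before the next one is loaded
    for pickle_filename in ["first.pkl", "second.pkl"]:
        process = multiprocessing.Process(
            target=load_pickle,
            args=(pickle_filename,),
        )
        process.start()
        process.join()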
Bonus part
As processes don’t share memory, I took advantage of the multiprocessing.Manager class to create a proxy list. The goal was to build a dictionary with some of the data contained in the different processed pickle objects. Given the large amount of data that had to be processed, using a proxy dictionary would require millions of locked operations, slowing down the application.
A better solution was to use a proxy list: each process creates its own dictionary and, once all the pickle objects are processed, the dictionaries in the list can be merged into a single one. Only one operation per process needs to be locked with this alternative.
manager = multiprocessing.Manager()
shared_list_of_dictionaries = manager.list()  # proxy list shared with the child processes
process = multiprocessing.Process(
    target=load_pickle,
    args=(pickle_filename, shared_list_of_dictionaries),
)
process.start()
process.join()
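For completeness, here is a rough, self-contained sketch of that approach; the filenames, the dictionary contents and the body of load_pickle are hypothetical, but they illustrate the single locked append per process and the final merge:
import multiprocessing
import pickle

def load_pickle(pickle_filename, shared_list_of_dictionaries):
    with open(pickle_filename, "rb") as f:
        data = pickle.load(f)
    local_dictionary = {}  # built locally, without any locking
    # ... fill local_dictionary from data ...
    shared_list_of_dictionaries.append(local_dictionary)  # single locked operation

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    shared_list_of_dictionaries = manager.list()
    for pickle_filename in ["first.pkl", "second.pkl"]:
        process = multiprocessing.Process(
            target=load_pickle,
            args=(pickle_filename, shared_list_of_dictionaries),
        )
        process.start()
        process.join()
    # merge the per-process dictionaries into a single one
    merged = {}
    for dictionary in shared_list_of_dictionaries:
        merged.update(dictionary)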