I’m a newbie with HDF5, less so with PyTorch, yet I found it hard to find guidelines on good practices for loading data from HDF5 files.
So here’s my take on the issue, inspired by torchmeta.
First Attempt - TypeError: h5py objects cannot be pickled
Here’s my use case: I have .h5 files containing samples as datasets (no groups/subgroups, but that wouldn’t change much, I expect).
Initially I thought “well, let’s just open the files and dynamically load the datasets”. So I wrote something like:
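For reference, a minimal sketch of what that first attempt might look like (the class name and the one-dataset-per-sample layout are assumptions, not the exact original code): the file is opened eagerly in `__init__` and indexed later.

```python
import h5py
from torch.utils.data import Dataset


class NaiveH5Dataset(Dataset):
    """Hypothetical naive version: keeps an open h5py.File handle as an attribute."""

    def __init__(self, path):
        self.file = h5py.File(path, "r")    # opened eagerly -> handle is not pickleable
        self.keys = list(self.file.keys())  # one top-level dataset per sample

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        # h5py's [()] reads the whole dataset into a NumPy array
        return self.file[self.keys[index]][()]
```

This works fine with `num_workers=0`, but as soon as the DataLoader tries to send the dataset to worker processes, pickling the stored `h5py.File` handle fails with: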
TypeError: h5py objects cannot be pickled
So that’s bad news. The issue is that with num_workers > 0 the Dataset is created in the main process and then passed to the DataLoader’s worker processes, which requires everything sent to be pickleable… unlike h5py.File objects.
One could shift the file opening into __getitem__, but that means opening and reading the file once for every sample throughout training, which creates overhead and filesystem pressure.
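That pickle-safe but wasteful variant could look like this (again a sketch with hypothetical names; the dataset keys are passed in so nothing needs to be opened at construction time):

```python
import h5py
from torch.utils.data import Dataset


class PerCallH5Dataset(Dataset):
    """Hypothetical variant: reopens the file on every single sample access."""

    def __init__(self, path, keys):
        self.path = path  # only plain attributes -> the dataset pickles cleanly
        self.keys = keys

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        # Open-read-close for each sample: correct, but pays the open() cost
        # once per sample for the whole training run.
        with h5py.File(self.path, "r") as f:
            return f[self.keys[index]][()]
```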
The solution is to lazy-load the files: open each file the first time it is needed and keep the handle around for subsequent calls:
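A minimal sketch of that lazy-loading pattern (class and parameter names are mine, not from a specific library): the handle starts as None, so the dataset pickles cleanly when the DataLoader spawns workers, and each worker process then opens its own handle on first access.

```python
import h5py
from torch.utils.data import Dataset


class LazyH5Dataset(Dataset):
    """Sketch of the lazy-open pattern: one h5py.File handle per worker process."""

    def __init__(self, path, keys):
        self.path = path
        self.keys = keys
        self._file = None  # not opened yet -> the dataset is still pickleable

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        if self._file is None:
            # First call in this process: open once and cache the handle.
            self._file = h5py.File(self.path, "r")
        return self._file[self.keys[index]][()]
```

Since pickling happens before any sample is requested, the None placeholder is what gets sent to the workers; after the first __getitem__ call in a worker, all subsequent reads reuse that worker's open handle.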