Option to specify relative sample occurrence in batch generator
In an active-learning scenario where a human validates model outputs and identifies certain samples as particularly important, it would be useful to have a mechanism for over-sampling individual samples during training. One way to achieve this could be to assign to each sample an integer indicating how many times that sample should appear in a training epoch. By default, the value would be 1, i.e., each sample appears once per epoch. This integer (which we could call `multiplicity`) could be saved to a column in the hdf5 table. We could then point the BatchGenerator to this column using an argument like `mult_field` or something like that.
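For illustration, here is a minimal sketch of how such a column could be turned into an epoch's index list. The `mult_field` argument, the column name, and the `expand_indices` helper are all hypothetical, not part of the current API:

```python
import numpy as np

def expand_indices(table, mult_field="multiplicity"):
    """Expand sample indices according to a multiplicity column.

    `table` is assumed to be a PyTables Table with an integer column
    named `mult_field`. A sample with multiplicity k appears k times
    in the returned index list for the epoch.
    """
    mult = table.col(mult_field)                    # multiplicity per sample
    indices = np.repeat(np.arange(table.nrows), mult)
    np.random.shuffle(indices)                      # spread the repeats out
    return indices                                  # consumed by the batch generator
```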
Discussion
Fabio
I like the idea of making certain samples more important than others. Oversampling is one way to achieve this. What I've been doing recently is setting up multiple batch generators and then combining them with the JointBatchGen class. This gives me some control over the batch composition. For example, I can get batches composed of 1/3 "less important" samples and 2/3 "more important" samples, or have the important samples appear multiple times over an epoch while the less important ones appear only once. I think what you suggested would make it easier to set up similar arrangements, but if we are attaching an extra attribute to each sample anyway, it might be worth using weighted training, rather than oversampling, as the strategy for making some samples more important than others. I can have a look at implementing that option in the training methods of the neural network interfaces.
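As a rough illustration of the weighted-training alternative, here is a generic Keras sketch (not the ketos interface; the data, weights, and model are made up). The weight array plays the role of the proposed per-sample attribute: a sample with weight 5.0 contributes to the loss as if it appeared five times, without actually being duplicated in the epoch:

```python
import numpy as np
import tensorflow as tf

# Toy data: 100 samples, the first 10 treated as "important".
x = np.random.rand(100, 8).astype("float32")
y = np.random.randint(0, 2, size=100)
weights = np.ones(100, dtype="float32")
weights[:10] = 5.0  # up-weight the important samples

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Keras scales each sample's loss contribution by sample_weight.
model.fit(x, y, sample_weight=weights, epochs=3, batch_size=32)
```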
Oliver
The solution with joining multiple batch generators works well when all sample categories have comparable sizes (e.g., a comparable number of less important and more important samples). However, as far as I can tell, it doesn't work as well when the categories have very different sizes. Say you have 10,000 less important samples and only 10 important samples, and assume you are working with a batch size of 32. Then you are forced to over-sample the important samples by a large factor: even if you include only 1 important sample in every batch, the ratio the model is trained on (1/31 ≈ 3%) is still far larger than the relative size of the two sample categories (10/10,000 = 0.1%). In this example, you are oversampling the important samples by a factor of about 30, which may be more than you want. I like the idea of implementing weighted training, especially if we can set it up so that the weights can be loaded from the hdf5 table.
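Loading the weights could be as simple as reading an extra column, along the lines of the sketch below. The file path, node path, and `weight` column name are just examples; no such column exists in the current table format:

```python
import tables

# Read per-sample weights from a (hypothetical) column of the hdf5 table.
with tables.open_file("db.h5", mode="r") as h5:
    table = h5.get_node("/train/data")   # example node path
    weights = table.col("weight")        # one float per sample

# `weights` could then be handed to the training method, e.g. as the
# sample_weight argument in the sketch above.
```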