Confusing mandatory fields and reserved names in ketos data formats
This is a fairly rare issue, but it can be very confusing when it happens. It concerns how we handle mandatory fields and reserved names in ketos.
The mandatory fields change depending on which stage of the processing pipeline we are in. For instance, the ketos annotation format requires start, end, label, filename, and an annotation id (annot_id). When we move to the selection stage, our dataframe has sel_id instead of annot_id. I think this is fine in itself, but the documentation on reserved names is extremely poor; most of it must be inferred from the docs here or from tutorials and examples. In my opinion, we should have a clear list of reserved names and mandatory fields, so that it is obvious what is needed and what is reserved.
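To make the idea of a per-stage list concrete, here is a minimal sketch in plain Python. The `MANDATORY_FIELDS` mapping and the `missing_fields` helper are hypothetical names, not part of ketos, and the field sets are only what is described in this issue (the selection-stage set assumes everything except the id column stays the same).

```python
# Mandatory fields per pipeline stage, as described in this issue.
# NOTE: illustrative only; this mapping is not taken from the ketos source code.
MANDATORY_FIELDS = {
    "annotation": {"filename", "start", "end", "label", "annot_id"},
    # Assumption: the selection stage keeps the same fields,
    # with sel_id replacing annot_id.
    "selection": {"filename", "start", "end", "label", "sel_id"},
}


def missing_fields(columns, stage):
    """Return the mandatory fields of `stage` that are absent from `columns`."""
    return sorted(MANDATORY_FIELDS[stage] - set(columns))


annot_columns = ["filename", "start", "end", "label", "annot_id"]
print(missing_fields(annot_columns, "annotation"))  # prints []
print(missing_fields(annot_columns, "selection"))   # prints ['sel_id']
```

A table like this, published in the docs, would already answer most of the "which columns do I need at this stage" questions.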
However, the issue arises when we convert the selections into the actual database. A basic HDF5 table created with just the default configuration has the following fields:
- "data"
- "filename"
- "id"
- "label"
- "offset"
Suddenly, there is an offset field appearing out of nowhere (and a data field). I understand why those fields are there and why they are needed, but this is not documented anywhere. Worst of all, the offset field is actually a hidden "reserved" name: if we have an extra column called offset that we wish to add to our dataset, it will be overwritten by an internal algorithm in ketos that calculates the offset value we see here, leaving the user wondering what happened. To top it off, this internal "offset" value is calculated from an internal "duration" value (inferred from the length of the selections, I think in the selection generator) that is doubly hidden, since it does not appear anywhere in the table. This means that an extra column called duration will simply be gone from the dataset, or the user will be confused as to why the duration they chose was not the one that was used.
This exact situation happened with @fsfrazao
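Until the reserved names are documented, a user-side guard against the overwrite described above could look like the sketch below. `HIDDEN_RESERVED` and `rename_reserved` are hypothetical names, not ketos API, and the set only contains the two names mentioned in this issue; there may be others.

```python
# Names that are overwritten or dropped during database creation, per this issue.
# Assumption: this set may be incomplete; it only lists the names mentioned here.
HIDDEN_RESERVED = {"offset", "duration"}


def rename_reserved(columns, prefix="user_"):
    """Return an {old: new} mapping for columns that collide with reserved names."""
    return {c: prefix + c for c in columns if c in HIDDEN_RESERVED}


columns = ["filename", "start", "end", "label", "sel_id", "offset", "snr"]
renames = rename_reserved(columns)
print(renames)  # prints {'offset': 'user_offset'}
```

If the selection table is a pandas DataFrame, the mapping can be applied with `df.rename(columns=renames)` before the database is created, so the user's own column survives under a different name.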
I don't think we necessarily need to change the pipeline to avoid using those names, but we need to make sure that every reserved name is properly documented and clearly stated so that there are no surprises.
One final aspect I will mention concerns the duration keyword again. As I mentioned, any specified duration is not used during dataset creation, because it is inferred from the selection length. However, most of our audio representation config files include a duration field, and that field is not used when creating the database. Is it even necessary? We could simplify the config files by removing it, or make it clear that it is an optional field that serves simply as metadata.
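If the field is kept as optional metadata, one option would be to at least warn when it disagrees with the selections. Below is a minimal sketch of such a check, assuming selections are rows with `sel_id`, `start`, and `end`; the function name and the example config values are hypothetical, not ketos API.

```python
import math


def duration_mismatches(config, selections, tol=1e-6):
    """Return the ids of selections whose length differs from config['duration']."""
    if "duration" not in config:
        return []  # duration is treated as optional metadata
    return [
        sel["sel_id"]
        for sel in selections
        if not math.isclose(sel["end"] - sel["start"], config["duration"], abs_tol=tol)
    ]


config = {"rate": 1000, "duration": 3.0}  # hypothetical config values
selections = [
    {"sel_id": 0, "start": 0.0, "end": 3.0},
    {"sel_id": 1, "start": 5.0, "end": 7.0},
]
print(duration_mismatches(config, selections))  # prints [1]
```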