There is a type system in the GALE Multi-Modal Example in the Sandbox that has been used for processing audio. We created an AudioSpan type whose begin and end features are seconds (float) from the start of a block of audio referenced via the SofaDataUri. Speech recognizers annotated words on AudioSpans in the audio view, and the words were later combined into a text string in another view for further textual processing.
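The idea can be sketched in plain Java, outside of UIMA. This is a hypothetical illustration, not the actual GALE type system: `AudioSpan` here is a simple record standing in for the annotation type, with begin/end in seconds, and `toText` stands in for the step that combines the recognized words into a text string for the other view.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class AudioSpanDemo {

    // Hypothetical stand-in for the AudioSpan annotation type:
    // begin/end are seconds (float) from the start of the audio block.
    record AudioSpan(float begin, float end, String word) {}

    // Combine word annotations, ordered by start time, into one
    // text string, as would be placed in the text view.
    static String toText(List<AudioSpan> spans) {
        return spans.stream()
                .sorted(Comparator.comparingDouble(AudioSpan::begin))
                .map(AudioSpan::word)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<AudioSpan> spans = List.of(
                new AudioSpan(0.5f, 1.1f, "world"),
                new AudioSpan(0.0f, 0.4f, "hello"));
        System.out.println(toText(spans)); // prints "hello world"
    }
}
```

In the real example the spans live on a CAS view whose Sofa points at the audio via a URI, and the combined string becomes the Sofa data of a second view, so downstream text annotators can run unchanged.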
~Burn
