Commons-CSV team,

We recently integrated Commons-CSV into Apache Tika. For now, we rely strictly on the filename for CSV detection, and on our AutoDetectReader to identify the charset. It would be really useful for us to be able to detect:
1) A csv/tsv file vs. a regular .txt file, by content heuristics
2) The parameters: delimiter, escape and quote characters

We realize that no detection will be perfect, but we have two questions:

1) Do you have any pointers for this kind of thing?
2) If we develop it, would you want to put it in commons-csv, or should we leave it in Tika?

I'm not sure yet whether there'd be a clean/useful way to integrate this without using a charset detector...but we can hold off on that for now.

Thank you for all of your fantastic work!

Cheers,

Tim