[csv] csv format detector/sniffer?

Tim Allison Mon, 25 Feb 2019 07:23:56 -0800

Commons-CSV team,

  We recently integrated Commons-CSV into Apache Tika.  For now, we’re
relying strictly on the filename for CSV detection and on our
AutoDetectReader to identify the charset.  It would be really useful
for us to be able to detect:


1) Whether a file is CSV/TSV rather than regular .txt, by content heuristics
2) The format parameters: the delimiter, escape, and quote characters
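
To make the second item concrete, here is a rough sketch of one possible
delimiter heuristic. It is not an existing Commons-CSV or Tika API: the
class name DelimiterSniffer, the method guessDelimiter, and the candidate
set are all hypothetical. It assumes the charset has already been decoded,
that the quote character is a double quote, and that quoted fields do not
span lines. A candidate wins if it appears the same non-zero number of
times, outside quotes, on every non-empty line of the sample.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Commons-CSV or Tika.
final class DelimiterSniffer {

    // Candidate delimiters to try; this set is an assumption.
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    /**
     * Returns the candidate that occurs the same non-zero number of times
     * (outside double quotes) on every non-empty line of the sample, or
     * null if no candidate does, i.e. the text does not look delimited.
     */
    static Character guessDelimiter(String sample) {
        List<String> lines = new ArrayList<>();
        for (String line : sample.split("\r?\n")) {
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        if (lines.size() < 2) {
            return null; // too little evidence to decide
        }
        Character best = null;
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int expected = countOutsideQuotes(lines.get(0), candidate);
            if (expected == 0) {
                continue; // candidate absent from the first line
            }
            boolean consistent = true;
            for (String line : lines) {
                if (countOutsideQuotes(line, candidate) != expected) {
                    consistent = false;
                    break;
                }
            }
            // Among consistent candidates, prefer the one yielding more columns.
            if (consistent && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return best;
    }

    // Counts occurrences of the delimiter that are not inside double quotes.
    private static int countOutsideQuotes(String line, char delimiter) {
        boolean inQuotes = false;
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == delimiter && !inQuotes) {
                count++;
            }
        }
        return count;
    }
}
```

A null result would be the signal to treat the file as regular text. A real
detector would also need to tolerate ragged rows, other quote and escape
characters, and quoted fields containing line breaks.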

  We realize that no detection will be perfect, but we have two questions:

1) Do you have any pointers for this kind of thing?
2) If we develop it, would you want to put it in Commons-CSV, or should
we leave it in Tika?  I'm not sure yet whether there'd be a clean,
useful way to integrate this without using a charset detector, but we
can hold off on that for now.

  Thank you for all of your fantastic work!

           Cheers,

                           Tim

