Hi Gary,

Our charset detection is a combination of HTML meta-header detection, juniversalchardet, and a small portion of icu4j that we copied in. We could add that to commons-io, but I don't think you'd want to take on juniversalchardet as a dependency, or would you? Happy to discuss.

My main question to commons-csv was meant to focus on:

1) text vs csv detection (aside from filename glob)
2) detection of the most likely:
   a) delimiter
   b) quote character
   c) escape character

More like:

    org.apache.commons.csv.CSVParser.parse(path, charset);

or ideally:

    CSVFormat format = CSVDetector.detect(path)

where format includes the charset and one possible value is "probably straight text, not likely a csv".

On Mon, Feb 25, 2019 at 10:39 AM Gary Gregory <garydgreg...@gmail.com> wrote:
>
> Hi,
>
> A Charset detector sounds like something generally useful that belongs in
> Commons IO.
>
> Path path = Path.get(...);
> Charset cs = org.apache.commons.io.CharsetDetector.detect(path);
> org.apache.commons.csv.CSVParser.parse(path, charset, csvFormat);
>
> Thoughts?
>
> Gary
>
>
> On Mon, Feb 25, 2019 at 10:23 AM Tim Allison <talli...@apache.org> wrote:
>
> > Commons-CSV team,
> >
> > We recently integrated Commons-CSV into Apache Tika. For now, we’re
> > relying strictly on the filename for csv detection, and we’re relying
> > on our AutodetectReader to identify the charset. It would be really
> > useful for us to be able to detect:
> >
> > 1) A csv/tsv file vs a regular .txt file by content heuristics
> > 2) The parameters: delimiter, escape and quote characters
> >
> > We realize that no detection will be perfect, but we have two questions:
> >
> > 1) Do you have any pointers for this kind of thing?
> > 2) If we develop it, would you want to put it in commons-csv or should
> > we leave it in Tika? I'm not sure, yet, if there'd be a clean/useful
> > way to integrate this without using a charset detector...but we can
> > hold off on that for now.
> >
> > Thank you for all of your fantastic work!
> >
> > Cheers,
> >
> > Tim
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > For additional commands, e-mail: user-h...@commons.apache.org
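
To make the delimiter-detection question above a bit more concrete, here is a rough sketch of one possible heuristic. It is only an illustration: the class name DelimiterSniffer, the list of candidate delimiters, and the "same non-zero count on every sample line" rule are all assumptions of mine, not anything that exists in commons-csv or Tika. It also assumes the sample lines have already been decoded with the right charset, and it does not try to detect quote or escape characters.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of commons-csv or Tika. */
final class DelimiterSniffer {

    // Candidate delimiters to try; this particular list is an assumption.
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    private DelimiterSniffer() {}

    /** Counts occurrences of the delimiter in a line, skipping text inside double quotes. */
    static int countOutsideQuotes(String line, char delimiter) {
        int count = 0;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // a doubled quote ("") toggles twice, so state is preserved
            } else if (c == delimiter && !inQuotes) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns the candidate that occurs the same non-zero number of times on every
     * non-blank sample line, preferring the one with the highest count. Returns
     * empty when no candidate qualifies, i.e. "probably straight text, not a csv".
     */
    static Optional<Character> sniff(List<String> sampleLines) {
        Character best = null;
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int expected = -1;
            boolean consistent = true;
            for (String line : sampleLines) {
                if (line.trim().isEmpty()) {
                    continue; // ignore blank lines
                }
                int n = countOutsideQuotes(line, candidate);
                if (expected < 0) {
                    expected = n;
                } else if (n != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

A caller would read the first handful of lines from the file, pass them to DelimiterSniffer.sniff(...), and treat an empty result as "plain text". Real files with ragged rows or quoted fields that span lines would defeat this rule, so a usable detector would need a softer score than strict equality.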