On Mon, 25 Feb 2019 at 18:38, Tim Allison <talli...@apache.org> wrote:
>
> Hi Gary,
>
> Our charset detector stuff is a combo of html-metaheader detection,
> juniversalchardet and a cut and paste of a small portion of icu4j...we
> could add that to commons-io, but I don't think you'd want to add
> juniversalchardet as a dependency or would you? Happy to discuss...

I think the HTML stuff is out of scope for IO; not sure about the other bits.

> My main question to commons-csv was intended rather to focus on:
>
> 1) text vs csv detection (aside from filename glob)
> 2) detection of most likely: a) delimiter, b) quote character, c)
> escape character

That seems reasonable for CSV.
But it should probably be in its own package, as it is somewhat
outside the rest of CSV.

> More like:
>
> org.apache.commons.csv.CSVParser.parse(path, charset);
>
> or ideally:
>
> CSVFormat format = CSVDetector.detect(path);
>
> where format includes charset and one value is "probably straight
> text, not likely a csv"
>
> On Mon, Feb 25, 2019 at 10:39 AM Gary Gregory <garydgreg...@gmail.com> wrote:
> >
> > Hi,
> >
> > A Charset detector sounds like something generally useful that belongs in
> > Commons IO.
> >
> > Path path = Paths.get(...);
> > Charset cs = org.apache.commons.io.CharsetDetector.detect(path);
> > org.apache.commons.csv.CSVParser.parse(path, cs, csvFormat);
> >
> > Thoughts?
> >
> > Gary
> >
> >
> > On Mon, Feb 25, 2019 at 10:23 AM Tim Allison <talli...@apache.org> wrote:
> >
> > > Commons-CSV team,
> > >
> > > We recently integrated Commons-CSV into Apache Tika. For now, we’re
> > > relying strictly on the filename for csv detection, and we’re relying
> > > on our AutodetectReader to identify the charset. It would be really
> > > useful for us to be able to detect:
> > >
> > > 1) A csv/tsv file vs a regular .txt file by content heuristics
> > > 2) The parameters: delimiter, escape and quote characters
> > >
> > > We realize that no detection will be perfect, but we have two questions:
> > >
> > > 1) Do you have any pointers for this kind of thing?
> > > 2) If we develop it, would you want to put it in commons-csv or should
> > > we leave it in Tika? I'm not sure, yet, if there'd be a clean/useful
> > > way to integrate this without using a charset detector...but we can
> > > hold off on that for now.
> > >
> > > Thank you for all of your fantastic work!
> > >
> > > Cheers,
> > >
> > > Tim
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > > For additional commands, e-mail: user-h...@commons.apache.org
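
P.S. For concreteness, here is a rough sketch of the kind of content heuristic discussed above for point 2a (delimiter detection), which also gives a crude answer to "CSV vs plain text". It is only an illustration under simple assumptions: the class and method names (DelimiterSketch, guessDelimiter, countOutsideQuotes) are hypothetical and exist in neither Commons CSV nor Tika, the candidate list is arbitrary, and the quote character is assumed to be a double quote. A candidate is accepted only if it occurs the same non-zero number of times, outside quotes, on every non-empty sample line.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Commons CSV or Tika. */
final class DelimiterSketch {

    // Assumed candidate delimiters; earlier entries win ties.
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    private DelimiterSketch() {
    }

    /** Counts occurrences of the delimiter that are not inside double quotes. */
    static int countOutsideQuotes(String line, char delimiter) {
        int count = 0;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles twice, leaving the state unchanged.
                inQuotes = !inQuotes;
            } else if (c == delimiter && !inQuotes) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns the candidate that appears the same non-zero number of times on
     * every non-empty line, preferring the one with the most columns.
     * An empty result means "probably straight text, not likely a csv".
     */
    static Optional<Character> guessDelimiter(List<String> sampleLines) {
        char best = 0;
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int expected = -1;          // per-line count seen on the first non-empty line
            boolean consistent = true;
            for (String line : sampleLines) {
                if (line.isEmpty()) {
                    continue;
                }
                int n = countOutsideQuotes(line, candidate);
                if (expected < 0) {
                    expected = n;
                } else if (n != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return bestCount > 0 ? Optional.of(best) : Optional.empty();
    }
}
```

A caller would read the first few lines of the file (with whatever charset was detected) and pass them in. A non-empty result could then be handed to CSVFormat.DEFAULT.withDelimiter(...). This sketch does not handle quoted fields that span several lines, ragged rows, or escape characters, so a real detector would need a more tolerant scoring scheme.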