Tika Users,

First off, Tika has been super helpful. The new REST server especially so.

My problem is that newlines are not escaped in the CSV Metadata from the Tika Rest Server. Would it make sense to escape the newlines? Any opinions either way?

I realize that there is no single CSV standard. That said, if you go for simplicity, being able to split on a newline first makes the parsing easier. For now, I'm going to write a custom CSV parser.

Here is an example from http://www.sbaer.uca.edu/research/icsb -- note the newline in the title.

"Author","Nicola Costantino and Guido Sivo"
"dcterms:created","2004-02-02T16:59:38Z"
"date","2004-02-02T16:59:38Z"
"creator","Nicola Costantino and Guido Sivo"
"Creation-Date","2004-02-02T16:59:38Z"
"title","PROFESSIONAL SKILLS AND INFORMATION TECHNOLOGY IN COMPLEX BUILDING REFURBISHMENT PROJECTS: EMERGING INTER-ORGANIZATIONS"

Here is the Tika Server Startup log:

Oct 26, 2012 3:48:26 PM org.apache.tika.server.TikaServerCli main
INFO: Starting Tikaserver ${project.version}
Oct 26, 2012 3:48:26 PM org.apache.tika.server.TikaServerCli main
INFO: Starting Tika Server Apache Tika 1.2
Oct 26, 2012 3:48:27 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
2012-10-26 15:48:27.151:INFO:oejs.Server:jetty-7.x.y-SNAPSHOT
2012-10-26 15:48:27.184:INFO:oejs.AbstractConnector:Started SelectChannelConnector@localhost:9998 STARTING
2012-10-26 15:48:27.204:INFO:oejsh.ContextHandler:started o.e.j.s.h.ContextHandler{,null}
Oct 26, 2012 3:48:27 PM org.apache.tika.server.TikaServerCli main
INFO: Started

-David
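
P.S. In case it helps anyone hitting the same thing, here is a rough sketch of the kind of client-side workaround I have in mind. It assumes the server output is two-column CSV with every field double-quoted, as in the example above, and it leans on Python's standard csv module instead of splitting on newlines first. The function name parse_metadata_csv and the sample values are made up for illustration; they are not part of Tika.

```python
import csv
import io


def parse_metadata_csv(text):
    """Parse two-column, quoted CSV text into a dict.

    Newlines inside quoted fields are kept as part of the value
    instead of being treated as record separators.
    """
    # newline="" stops StringIO from translating line endings, so the
    # csv reader sees the raw text and decides where records end.
    reader = csv.reader(io.StringIO(text, newline=""))
    metadata = {}
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        # Assumption: one value per key; a repeated key overwrites the earlier one.
        metadata[row[0]] = row[1]
    return metadata


# Hypothetical sample: the line break inside "title" is invented for the demo.
sample = '"Author","Jane Doe"\n"title","FIRST LINE\nSECOND LINE"\n'

if __name__ == "__main__":
    for key, value in parse_metadata_csv(sample).items():
        print(repr(key), repr(value))
```

The point is that a quote-aware reader has to decide where a record ends, which is exactly the step that a plain split on newlines gets wrong when a value such as the title spans two lines.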
