Tika Users,

First off, Tika has been super helpful. The new REST server especially so.

My problem is that newlines are not escaped in the CSV metadata
returned by the Tika REST server. Would it make sense to escape them?
Any opinions either way? I realize that there is no single CSV standard.
That said, if you go for simplicity, being able to split on newlines
first makes the parsing easier. For now, I'm going to write a custom
CSV parser.

Here is an example from http://www.sbaer.uca.edu/research/icsb -- note
the newline in the title.

"Author","Nicola Costantino and Guido Sivo"
"dcterms:created","2004-02-02T16:59:38Z"
"date","2004-02-02T16:59:38Z"
"creator","Nicola Costantino and Guido Sivo"
"Creation-Date","2004-02-02T16:59:38Z"
"title","PROFESSIONAL SKILLS AND INFORMATION TECHNOLOGY IN
COMPLEX BUILDING REFURBISHMENT PROJECTS:
EMERGING INTER-ORGANIZATIONS"
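
As a possible stopgap on the client side, a quote-aware CSV reader can
keep the embedded newlines together instead of splitting on them. The
following is only a minimal sketch in Python using the standard csv
module. The parse_metadata helper and the SAMPLE string are hypothetical
names for illustration, and the sketch assumes each row is a quoted key
followed by one or more quoted values, as in the output above.

```python
import csv
import io

# A few rows copied from the example above; the title value spans three lines.
SAMPLE = (
    '"Author","Nicola Costantino and Guido Sivo"\n'
    '"dcterms:created","2004-02-02T16:59:38Z"\n'
    '"title","PROFESSIONAL SKILLS AND INFORMATION TECHNOLOGY IN\n'
    'COMPLEX BUILDING REFURBISHMENT PROJECTS:\n'
    'EMERGING INTER-ORGANIZATIONS"\n'
)


def parse_metadata(text):
    """Map each metadata key to its list of values (hypothetical helper).

    Assumes every row is a key followed by one or more values.
    """
    # newline="" disables newline translation so the csv module
    # handles line breaks inside quoted fields itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    return {row[0]: row[1:] for row in reader if len(row) >= 2}


metadata = parse_metadata(SAMPLE)

# Collapse the embedded newlines for consumers that expect one line per value.
flat_title = " ".join(metadata["title"][0].splitlines())
print(flat_title)
```

This treats a line break inside double quotes as part of the value, so
the title comes back as a single field. It does not help tools that
split on newlines before parsing, which is the case escaping on the
server side would address.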

Here is the Tika Server Startup log:

Oct 26, 2012 3:48:26 PM org.apache.tika.server.TikaServerCli main
INFO: Starting Tikaserver ${project.version}
Oct 26, 2012 3:48:26 PM org.apache.tika.server.TikaServerCli main
INFO: Starting Tika Server Apache Tika 1.2
Oct 26, 2012 3:48:27 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
2012-10-26 15:48:27.151:INFO:oejs.Server:jetty-7.x.y-SNAPSHOT
2012-10-26 15:48:27.184:INFO:oejs.AbstractConnector:Started SelectChannelConnector@localhost:9998 STARTING
2012-10-26 15:48:27.204:INFO:oejsh.ContextHandler:started o.e.j.s.h.ContextHandler{,null}
Oct 26, 2012 3:48:27 PM org.apache.tika.server.TikaServerCli main
INFO: Started

-David
