Greetings --

I am new to Tika and trying to get a handle on its structure and semantics. Forgive me if what I ask is either obviously correct or blatantly wrong...

I am coming from the following environment and experience: I did system software development on Unix until the mid-to-late 90s, then switched to Windows, where I have not done much programming except to answer questions from students learning C/C++. Java is new to me, but learning it is not difficult after C++.

Now I am working with a group of programmers who are developing a file transcription tool. You can read about it on www.brailleblaster.com. I have volunteered to get it working on Windows, and also to see what could be done with Tika, which we are considering for use in the architecture.

I have been trying to find my way around the Tika documentation. My assumption is that the parsers you mention are chosen by Tika based on mime type, file extension, or some other information, and that the parsers read the input and build an internal representation.

Your mail suggests that the .doc file parser is hardcoded. Is that true of other formats as well? How hard is it to add or replace a parser?

Second, and more importantly, what is the procedure for outputting the data in a given format? Specifically, what if I wanted to output DAISY?

TIA for any comments.

--laura e
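
P.S. To make my assumption about parser selection concrete, here is a small sketch in plain Java of the kind of dispatch I have in mind. It does not use Tika's API at all: ParserRegistrySketch, SketchParser, register, and lookup are names made up for illustration, and the "last registered parser wins" rule is only a guess at how an override could work, not a description of what Tika actually does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParserRegistrySketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's parser interface. */
    public interface SketchParser {
        String name();
    }

    // For each mime type, the candidate parsers in registration order.
    private final Map<String, List<SketchParser>> byMimeType = new HashMap<>();

    /** Adds a parser as a candidate for the given mime type. */
    public void register(String mimeType, SketchParser parser) {
        byMimeType.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(parser);
    }

    /**
     * Assumed override rule: the most recently registered parser wins.
     * Returns null when no parser is registered for the mime type.
     */
    public SketchParser lookup(String mimeType) {
        List<SketchParser> candidates = byMimeType.get(mimeType);
        if (candidates == null) {
            return null;
        }
        return candidates.get(candidates.size() - 1);
    }

    public static void main(String[] args) {
        ParserRegistrySketch registry = new ParserRegistrySketch();
        registry.register("application/msword", () -> "built-in .doc parser");
        registry.register("application/msword", () -> "external .doc parser");
        // The later registration overrides the earlier one.
        System.out.println(registry.lookup("application/msword").name());
    }
}
```

If the real mechanism looks anything like this, then replacing a parser would come down to controlling the order or priority of registration, which is what the question quoted below seems to be asking about.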
----- Original Message -----
From: "Jan Høydahl / Cominvent" <[email protected]>
To: <[email protected]>
Sent: Wednesday, October 06, 2010 6:51 AM
Subject: Plugging in your own parser to override an existing

Hi,

What if I want to provide an external parser for .DOC files? I want it to override the POI-provided .DOC parser. How can we support such a use case through configuration (without recompiling), now that the plugin itself decides which mime types to support? Is there somewhere we can configure the priority of which parser to prefer for a certain mime type?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
