Greetings --
I am new to Tika and trying to get a handle on its structure and semantics. 
Forgive me if what I ask is either obviously correct or blatantly wrong... 
I am coming from the following environment and experience: I did system 
software development on Unix until the mid-to-late 90s, then switched 
to Windows, where I have not done much programming except to answer 
questions from students learning C/C++.  Java is new to me, but learning it 
is not difficult after C++.
Now I am working with a group of programmers who are developing a file 
transcription tool.  You can read about it on www.brailleblaster.com.  I 
have volunteered to get it working on Windows, and also to see what could be 
done with Tika, which we are considering for use in the architecture.
I have been trying to find my way around the documentation for Tika.  My 
assumption is that the parsers you mention are chosen by Tika based on 
MIME type, file extension, or some other information, and that the parsers 
read the input and build an internal representation.
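
To make that assumption concrete, here is a minimal sketch in plain Java of 
the kind of dispatch I have in mind: guess a MIME type from the file name, 
then hand the input to whichever parser is registered for that type. 
Everything in it (ParserDispatchSketch, SimpleParser, the extension table) 
is hypothetical and is not Tika's actual API; it only illustrates my mental 
model, which may well be wrong.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of MIME-type-based parser selection.
// None of these names come from Tika.
class ParserDispatchSketch {

    /** Stand-in for a format-specific parser: turns raw bytes into extracted text. */
    interface SimpleParser {
        String parse(byte[] input);
    }

    private final Map<String, String> mimeByExtension = new HashMap<>();
    private final Map<String, SimpleParser> parserByMime = new HashMap<>();

    void registerExtension(String extension, String mimeType) {
        mimeByExtension.put(extension.toLowerCase(Locale.ROOT), mimeType);
    }

    void registerParser(String mimeType, SimpleParser parser) {
        parserByMime.put(mimeType, parser);
    }

    /** Guesses the MIME type from the file extension; unknown extensions map to a generic binary type. */
    String detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return mimeByExtension.getOrDefault(extension, "application/octet-stream");
    }

    /** Looks up the parser registered for the detected type and runs it. */
    String parse(String fileName, byte[] input) {
        String mimeType = detect(fileName);
        SimpleParser parser = parserByMime.get(mimeType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + mimeType);
        }
        return parser.parse(input);
    }

    public static void main(String[] args) {
        ParserDispatchSketch dispatcher = new ParserDispatchSketch();
        dispatcher.registerExtension("txt", "text/plain");
        dispatcher.registerParser("text/plain",
                bytes -> new String(bytes, StandardCharsets.UTF_8));
        System.out.println(dispatcher.parse("notes.TXT",
                "hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

In this sketch, replacing a parser would just mean registering a different 
one under the same MIME type; whether Tika allows anything that simple is 
part of what I am asking.
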
Your mail suggests that the .doc file parser is hardcoded.  Is that true of 
other formats as well? How hard is it to add/replace a parser?
Second and more importantly, what is the procedure for outputting the data 
in a given format? Specifically, if I wanted to output in DAISY?
TIA for any comments.
--laura e

----- Original Message ----- 
From: "Jan Høydahl / Cominvent" <[email protected]>
To: <[email protected]>
Sent: Wednesday, October 06, 2010 6:51 AM
Subject: Plugging in your own parser to override an existing


Hi,

What if I want to provide an external parser for .DOC files?
I want it to override the POI-provided .DOC parser.
How can we support such a use case through configuration (without recompiling) 
now that the plugin itself decides which MIME types to support? Is there 
somewhere we can configure the priority of which parser to prefer for a 
certain MIME type?
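
To illustrate the kind of configurable priority I mean, here is a minimal 
sketch in plain Java. The config format ("mimeType = parserName : priority"), 
the class ParserPrioritySketch, and the parser names are all hypothetical; 
this is not an existing Tika feature, only a picture of the behaviour I am 
asking about.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: choose a parser per MIME type by configured priority.
// Neither the config format nor the names below come from Tika.
class ParserPrioritySketch {

    /** One configured candidate parser for a MIME type. */
    static final class Candidate {
        final String mimeType;
        final String parserName;
        final int priority;

        Candidate(String mimeType, String parserName, int priority) {
            this.mimeType = mimeType;
            this.parserName = parserName;
            this.priority = priority;
        }
    }

    private final List<Candidate> candidates = new ArrayList<>();

    /**
     * Reads lines of the form "mimeType = parserName : priority".
     * Blank lines and lines starting with '#' are skipped.
     * Assumes every other line is well-formed (no error handling here).
     */
    static ParserPrioritySketch fromConfig(String config) {
        ParserPrioritySketch registry = new ParserPrioritySketch();
        for (String line : config.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] keyValue = trimmed.split("=", 2);
            String[] nameAndPriority = keyValue[1].split(":", 2);
            registry.candidates.add(new Candidate(
                    keyValue[0].trim(),
                    nameAndPriority[0].trim(),
                    Integer.parseInt(nameAndPriority[1].trim())));
        }
        return registry;
    }

    /** Returns the name of the highest-priority parser for the type, if any is configured. */
    Optional<String> preferredParser(String mimeType) {
        return candidates.stream()
                .filter(c -> c.mimeType.equals(mimeType))
                .max(Comparator.comparingInt(c -> c.priority))
                .map(c -> c.parserName);
    }

    public static void main(String[] args) {
        String config = String.join("\n",
                "# hypothetical parser names",
                "application/msword = poi-doc-parser : 1",
                "application/msword = external-doc-parser : 10");
        ParserPrioritySketch registry = fromConfig(config);
        System.out.println(registry.preferredParser("application/msword").orElse("none"));
    }
}
```

With a file like this, the external parser would win for .DOC without any 
recompile, simply because it is given the higher number.
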

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
