Hi, I encourage you to read this article/tutorial to get a better grip on what Tika is, and then come back to the mailing list with further questions: http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/
Tika is really modular and it's easy to add parsers. My question was about a very specific use case, which is easy to handle with a small source code modification but perhaps harder to handle with configuration only.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 7 Oct 2010, at 12:31, qubit wrote:

> Greetings --
> I am new to tika and trying to get a handle on its structure and semantics.
> Forgive me if what I ask is either obviously correct or blatantly wrong...
> I am coming from the following environment and experience: I did system
> software development on unix back until the mid-to-late 90s, then switched
> to windows where I have not done much with programming except to answer
> questions from students learning C/C++. Java is new to me, but learning it
> is not difficult after C++.
> Now I am working with a group of programmers who are developing a file
> transcription tool. You can read about it on www.brailleblaster.com. I
> have volunteered to get it working on windows, and also to see what could be
> done with tika, which we are considering for use in the architecture.
> I have been trying to find my way around the documentation for tika. My
> assumption is that the parsers you mention are chosen by tika based on
> mime type or file extension or some other information, and that the parsers
> read the input and build an internal representation.
> Your mail suggests that the .doc file parser is hardcoded. Is that true of
> other formats as well? How hard is it to add/replace a parser?
> Second and more importantly, what is the procedure for outputting the data
> in a given format? Specifically, if I wanted to output in DAISY?
> TIA for any comments.
> --laura e
>
> ----- Original Message -----
> From: "Jan Høydahl / Cominvent" <[email protected]>
> To: <[email protected]>
> Sent: Wednesday, October 06, 2010 6:51 AM
> Subject: Plugging in your own parser to override an existing
>
>
> Hi,
>
> What if I want to provide an external parser for .DOC files?
> I want it to override the POI-provided .DOC parser.
> How can we support such a use case through configuration (without recompile)
> now that the plugin itself decides what mime types to support? Is there
> somewhere we can configure priority of which parser to prefer for a certain
> mimetype?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
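
To make the "priority per mimetype" idea from the original question concrete, here is a minimal sketch of what such a lookup could look like. It does not use Tika's API at all: `ParserRegistry`, `register`, `lookup` and the parser names are hypothetical, invented purely for illustration, and parsers are represented by plain strings to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry (not part of Tika): picks one parser per MIME type by priority. */
final class ParserRegistry {

    /** One registered candidate: a parser name plus its priority (higher wins). */
    private static final class Entry {
        final String parserName;
        final int priority;

        Entry(String parserName, int priority) {
            this.parserName = parserName;
            this.priority = priority;
        }
    }

    private final Map<String, Entry> byMimeType = new HashMap<>();

    /** Registers a parser; it replaces the current choice only if its priority is strictly higher. */
    void register(String mimeType, String parserName, int priority) {
        Entry current = byMimeType.get(mimeType);
        if (current == null || priority > current.priority) {
            byMimeType.put(mimeType, new Entry(parserName, priority));
        }
    }

    /** Returns the preferred parser name for the MIME type, or null if none is registered. */
    String lookup(String mimeType) {
        Entry entry = byMimeType.get(mimeType);
        return entry == null ? null : entry.parserName;
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        // The bundled parser registers itself with a default priority (assumed to be 0 here).
        registry.register("application/msword", "bundled-poi-parser", 0);
        // An external parser, configured with a higher priority, takes over the same MIME type.
        registry.register("application/msword", "external-doc-parser", 10);
        System.out.println(registry.lookup("application/msword")); // prints "external-doc-parser"
    }
}
```

With a scheme like this, the priorities could come from a configuration file instead of code, which is the "without recompile" part of the question. Whether and how Tika itself exposes such a setting is exactly what the thread is asking.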
