Hi,

I encourage you to read this article/tutorial to get a better grip on what Tika 
is, and then come back to the mailing list with further questions:
http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/

Tika is really modular and it's easy to add parsers. My question was about a very 
specific use case that is easy to handle with a small source code modification but 
perhaps harder to handle with configuration only.
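
To make the override use case concrete, here is a minimal sketch in plain Java. 
It does not use Tika's actual classes: SketchParser, FixedParser and 
ParserRegistry are hypothetical names invented for illustration, and the 
"last registration wins" rule is only an assumption about how a priority 
could work, not a description of how Tika resolves parsers.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser; this is NOT Tika's Parser interface.
interface SketchParser {
    String name();
    List<String> supportedTypes();
}

// Trivial parser that only reports its name and the MIME types it claims.
final class FixedParser implements SketchParser {
    private final String name;
    private final List<String> types;

    FixedParser(String name, List<String> types) {
        this.name = name;
        this.types = types;
    }

    public String name() { return name; }
    public List<String> supportedTypes() { return types; }
}

// Registry in which a parser registered later replaces an earlier one
// for every MIME type both of them claim (assumed priority rule).
final class ParserRegistry {
    private final Map<String, SketchParser> byType = new HashMap<>();

    void register(SketchParser parser) {
        for (String type : parser.supportedTypes()) {
            byType.put(type, parser);
        }
    }

    // Returns the parser currently responsible for the type, or null.
    SketchParser lookup(String mimeType) {
        return byType.get(mimeType);
    }
}
```

Registering the bundled .doc parser first and an external one afterwards 
would then route application/msword to the external parser, while types 
claimed only by the first parser stay untouched.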

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 7. okt. 2010, at 12.31, qubit wrote:

> Greetings --
> I am new to tika and trying to get a handle on its structure and semantics. 
> Forgive me if what I ask is either obviously correct or blatantly wrong... 
> I am coming from the following environment and experience: I did system 
> software development on unix back until the mid-to-late 90s, then switched 
> to windows where I have not done much with programming except to answer 
> questions from students learning C/C++.  Java is new to me, but learning it 
> is not difficult after C++.
> Now I am working with a group of programmers who are developing a file 
> transcription tool.  You can read about it on www.brailleblaster.com.  I 
> have volunteered to get it working on windows, and also to see what could be 
> done with tika, which we are considering for use in the architecture.
> I have been trying to find my way around the documentation for tika.  My 
> assumption is that the parsers you mention are chosen by tika based on 
> mime type or file extension or some other information, and that the parsers 
> read the input and build an internal representation.
> Your mail suggests that the .doc file parser is hardcoded.  Is that true of 
> other formats as well? How hard is it to add/replace a parser?
> Second and more importantly, what is the procedure for outputting the data 
> in a given format? Specifically, if I wanted to output in DAISY?
> TIA for any comments.
> --laura e
> 
> ----- Original Message ----- 
> From: "Jan Høydahl / Cominvent" <[email protected]>
> To: <[email protected]>
> Sent: Wednesday, October 06, 2010 6:51 AM
> Subject: Plugging in your own parser to override an existing
> 
> 
> Hi,
> 
> What if I want to provide an external parser for .DOC files?
> I want it to override the POI-provided .DOC parser.
> How can we support such a use case through configuration (without recompiling) 
> now that the plugin itself decides what mime types to support? Is there 
> somewhere we can configure the priority of which parser to prefer for a certain 
> mime type?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
