On Fri, 25 Nov 2011, chethan wrote:
As I am new to Tika, I want to know the following things.

I think most of your questions are Hadoop ones, so you may have more luck learning more about Hadoop first and then asking your questions on the Hadoop user list.

1. How to integrate Tika with Hadoop, so that Tika will use MapReduce
to do the parsing.

Probably something like this: have Hadoop treat each file as a map input, and then have the mapper step pass that file through Tika. You're then using Hadoop as a way to manage running Tika once against each file.
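
As a rough sketch of that idea, here is what a Hadoop Streaming style mapper could look like. It is only an illustration under several assumptions that are not from this thread: each input line is a path to a file the worker can read locally, the Tika command-line app is available as a jar called tika-app.jar (a placeholder name), and plain text is requested with its --text option. The names tika_map_line and build_tika_command are made up for the example.

```python
import subprocess
import sys

# Assumption: placeholder location of the Tika command-line jar on each worker.
TIKA_JAR = "tika-app.jar"


def build_tika_command(path, jar=TIKA_JAR):
    """Build the command line that asks the Tika app for plain text."""
    return ["java", "-jar", jar, "--text", path]


def tika_map_line(line, run=subprocess.run):
    """Turn one input line (a file path) into a 'path<TAB>text' record.

    Returns None for blank lines so the caller can skip them.
    """
    path = line.strip()
    if not path:
        return None
    result = run(build_tika_command(path), capture_output=True, text=True)
    if result.returncode != 0:
        return path + "\tERROR"
    # Collapse whitespace so each record stays on a single output line.
    flat_text = " ".join(result.stdout.split())
    return path + "\t" + flat_text


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read file paths from stdin and write one record per file to stdout."""
    for line in stdin:
        record = tika_map_line(line)
        if record is not None:
            stdout.write(record + "\n")
```

A real mapper script would end with a call to main() and be passed as the -mapper of a streaming job, with a text file listing the paths as the job input. Writing the mapper in Java against the Tika API directly would avoid starting a JVM per file, so treat this as a way to try the approach rather than a tuned setup.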

2. We want Tika to parse OCR files too, but since Tika does not support OCR parsing and recommends using Tesseract instead, I want to
know how to call Tesseract (a command-line operation) through Tika
(which in turn uses MapReduce to parse the OCR files).

You might be better off just running Tesseract directly on each file from the mapper, rather than trying to send it via Tika. The external parser support in Tika should help if you do want to wrap it in Tika; otherwise, Hadoop has support for calling external programs.
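
For the "run it directly from the mapper" option, a minimal sketch might look like the following. It assumes the tesseract binary is on the PATH of every worker, that each input line is a locally readable image path, and that the usual "tesseract <image> <outputbase>" form is used, which writes the recognised text to <outputbase>.txt. The function names ocr_file and ocr_map_line are hypothetical.

```python
import os
import subprocess
import tempfile


def ocr_file(path, run=subprocess.run):
    """Run tesseract on one image and return the recognised text.

    Returns None if tesseract exits with a non-zero status.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        out_base = os.path.join(tmp_dir, "ocr")
        # tesseract appends ".txt" to the output base name itself.
        result = run(["tesseract", path, out_base],
                     capture_output=True, text=True)
        if result.returncode != 0:
            return None
        with open(out_base + ".txt", encoding="utf-8") as handle:
            return handle.read()


def ocr_map_line(line, run=subprocess.run):
    """Turn one input line (an image path) into a 'path<TAB>text' record.

    Returns None for blank lines so the caller can skip them.
    """
    path = line.strip()
    if not path:
        return None
    text = ocr_file(path, run=run)
    if text is None:
        return path + "\tERROR"
    # Collapse whitespace so each record stays on a single output line.
    flat_text = " ".join(text.split())
    return path + "\t" + flat_text
```

A mapper built on this would loop over sys.stdin, call ocr_map_line on each line, and print the non-None results. The run parameter exists only so the subprocess call can be swapped out when trying the logic without Tesseract installed.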

Nick
