Re: integrating tika into hadoop and tika with tesseract.

Nick Burch Fri, 25 Nov 2011 06:05:00 -0800

On Fri, 25 Nov 2011, chethan wrote:

As I am new to Tika, I want to know the following things.

I think most of your questions are hadoop ones, so you may have more luck learning more about hadoop and then asking your queries on the hadoop user list.

1. How to integrate Tika within Hadoop, so that Tika will use MapReduce
to do the parsing.

Probably something like having Hadoop treat each file as a map input, and then having the mapper step pass that file through Tika. You're then using Hadoop as a way to manage running Tika once against each file.
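
To make that a bit more concrete, here is a minimal sketch of such a mapper, written in Python for Hadoop Streaming. It is only an illustration under some assumptions: each input line is the path of a file the worker node can read locally, the Tika command-line app is available as tika-app.jar (that location is hypothetical), and the function names are made up for this example.

```python
#!/usr/bin/env python3
"""Hadoop Streaming mapper sketch: one input line = one file path."""
import subprocess
import sys

# Assumption: where the tika-app jar lives on each worker node.
TIKA_JAR = "tika-app.jar"


def tika_command(path, jar=TIKA_JAR):
    # tika-app's --text option prints the extracted plain text to stdout.
    return ["java", "-jar", jar, "--text", path]


def extract_text(path):
    # Run Tika once for this file and return whatever text it extracted.
    result = subprocess.run(
        tika_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout


def map_paths(lines, extract=extract_text):
    """Yield one 'path<TAB>text' record per non-empty input line."""
    for line in lines:
        path = line.strip()
        if not path:
            continue
        try:
            text = extract(path)
        except (subprocess.CalledProcessError, OSError) as err:
            # Report the failure and carry on with the next file.
            sys.stderr.write(f"skipping {path}: {err}\n")
            continue
        # Collapse whitespace so each record stays on a single line.
        yield f"{path}\t{' '.join(text.split())}\n"


# Only act as a mapper when input is piped in, as Hadoop Streaming does.
if __name__ == "__main__" and not sys.stdin.isatty():
    sys.stdout.writelines(map_paths(sys.stdin))
```

The job input would then be a text file listing one document path per line, so each map call handles exactly one document. If the documents live in HDFS rather than on a shared filesystem, the mapper would first have to fetch each one, which this sketch leaves out.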

2. We want Tika to parse OCR files too, but as Tika does not support
OCR parsing and recommends using tesseract instead, I want to
know how to call tesseract (a command line operation) through Tika
(which in turn uses MapReduce to parse the OCR files).

You might be better off just running tesseract directly on each file from the mapper, rather than trying to send it via Tika. The external parser support in Tika should help you if you do want to wrap it in Tika; otherwise, hadoop has support for calling external programs.
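
As a rough sketch of the "run tesseract directly from the mapper" option, again in Python for Hadoop Streaming: it assumes the tesseract binary is on the PATH of every worker node and that each input line is a locally readable image path. The function names are hypothetical. It relies on the basic `tesseract <image> <outputbase>` form, which writes the recognised text to `<outputbase>.txt`.

```python
#!/usr/bin/env python3
"""Hadoop Streaming mapper sketch: OCR each listed image with tesseract."""
import os
import subprocess
import sys
import tempfile


def tesseract_command(image_path, output_base):
    # tesseract writes its result to "<output_base>.txt".
    return ["tesseract", image_path, output_base]


def ocr_file(image_path, run=subprocess.run):
    # Use a throwaway directory so parallel mappers never share output files.
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "ocr")
        run(tesseract_command(image_path, base), check=True, capture_output=True)
        with open(base + ".txt", encoding="utf-8") as handle:
            return handle.read()


def map_images(lines, ocr=ocr_file):
    """Yield one 'path<TAB>recognised text' record per non-empty input line."""
    for line in lines:
        path = line.strip()
        if not path:
            continue
        try:
            text = ocr(path)
        except (subprocess.CalledProcessError, OSError) as err:
            # Report the failure and carry on with the next image.
            sys.stderr.write(f"skipping {path}: {err}\n")
            continue
        # Collapse whitespace so each record stays on a single line.
        yield f"{path}\t{' '.join(text.split())}\n"


# Only act as a mapper when input is piped in, as Hadoop Streaming does.
if __name__ == "__main__" and not sys.stdin.isatty():
    sys.stdout.writelines(map_images(sys.stdin))
```

Keeping the OCR call in its own function means the same mapper loop could be pointed at a Tika wrapper later without changing the job setup.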


Nick
