Hi, Tim,

Here is my Tika with Hadoop project: http://freeeed.org/. It has been
tested on the Enron data set and works quite well.

Mark

On Mon, Jul 20, 2015 at 6:20 PM, Ken Krugler <[email protected]>
wrote:

> Hi Tim,
>
> When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it
> with a TikaCallable (
> https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java
> )
>
> This lets us orphan the parsing thread if it times out (
> https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187
> )
>
> It also provides a bit of protection against things like NoSuchMethodErrors,
> which Tika can throw if the MIME-type detection code tries to use a parser
> that we have excluded in order to keep the Hadoop job jar size reasonable.
>
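
For the wiki, the pattern Ken describes above could be sketched roughly as
follows. This is not Bixo's actual TikaCallable, just a minimal illustration
using only java.util.concurrent. The names TimedParse, Result and
parseWithTimeout are hypothetical, and the Callable<String> is a stand-in for
whatever code actually invokes the Tika parser.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedParse {

    /** Outcome of one parse attempt: extracted text, or the reason it failed. */
    public static final class Result {
        public final String text;     // null on failure
        public final String failure;  // null on success

        Result(String text, String failure) {
            this.text = text;
            this.failure = failure;
        }

        public boolean ok() {
            return failure == null;
        }
    }

    // Daemon threads, so an orphaned (hung) parser thread cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs parseTask on a worker thread and waits at most timeoutMs for it.
     * parseTask is a placeholder for the real parser call (an assumption here).
     */
    public static Result parseWithTimeout(Callable<String> parseTask, long timeoutMs) {
        Future<String> future = POOL.submit(parseTask);
        try {
            return new Result(future.get(timeoutMs, TimeUnit.MILLISECONDS), null);
        } catch (TimeoutException e) {
            // Ask the worker to stop; if it ignores the interrupt it is simply orphaned.
            future.cancel(true);
            return new Result(null, "timeout");
        } catch (ExecutionException e) {
            // The cause can be an Error (e.g. NoSuchMethodError), not only an Exception.
            return new Result(null, "failed: " + e.getCause());
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return new Result(null, "interrupted");
        }
    }
}
```

A mapper would call something like parseWithTimeout(() -> extractText(input),
30000), where extractText is a hypothetical method wrapping the Tika call, and
then skip or log the record when ok() is false. Since the worker thread shares
the task's JVM heap, this guards against hangs and thrown Errors but does not
by itself isolate memory use, so OOM needs a separate measure.
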
> -- Ken
>
> ------------------------------
>
> *From:* Allison, Timothy B.
>
> *Sent:* July 15, 2015 4:38:56am PDT
>
> *To:* [email protected]
>
> *Subject:* robust Tika and Hadoop
>
> All,
>
>   I’d like to fill out our Wiki a bit more on using Tika robustly within
> Hadoop.  I’m aware of Behemoth [0], Nanite [1] and Morphlines [2].  I
> haven’t looked carefully into these packages yet.
>
>   Does anyone have any recommendations for specific configurations/design
> patterns that will defend against OOM errors and permanent hangs within Hadoop?
>
>   Thank you!
>
>         Best,
>
>                   Tim
>
>
> [0] https://github.com/DigitalPebble/behemoth
> [1]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> [2]
> http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>


-- 
Mark Kerzner, President & CEO, SHMsoft <http://shmsoft.com/>,
To schedule a meeting with me: http://www.meetme.so/markkerzner

Mobile: 713-724-2534
Skype: mark.kerzner1
Office: One Riverway Suite 1700
Houston, TX 77056

*Privileged and Confidential *
<http://shmsoft.com/>
