Awesome work, Mark and Ken.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Mark Kerzner <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, July 20, 2015 at 4:22 PM
To: Tika User <[email protected]>
Subject: Re: robust Tika and Hadoop

>Hi, Tim,
>
>
>Here is my Tika-with-Hadoop project, http://freeeed.org/. It has been
>tested on the Enron data and works quite well.
>
>
>Mark
>
>
>On Mon, Jul 20, 2015 at 6:20 PM, Ken Krugler
><[email protected]> wrote:
>
>Hi Tim,
>
>
>When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it
>with a TikaCallable:
>https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java
>
>
>This lets us orphan the parsing thread if it times out:
>https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187
>
>
>It also provides a bit of protection against errors such as
>NoSuchMethodError, which Tika can throw if the mime-type detection code
>tries to use a parser that we exclude in order to keep the Hadoop job
>jar size reasonable.
>
>
>-- Ken
>
>
>________________________________________
>From: Allison, Timothy B.
>Sent: July 15, 2015 4:38:56am PDT
>To: [email protected]
>Subject: robust Tika and Hadoop
>
>
>All,
> 
>  I’d like to fill out our Wiki a bit more on using Tika robustly within
>Hadoop.  I’m aware of Behemoth [0], Nanite [1] and Morphlines [2].  I
>haven’t looked carefully into these packages yet.
> 
>  Does anyone have any recommendations for specific configurations/design
>patterns that will defend against OOM errors and permanent hangs within
>Hadoop?
> 
>  Thank you!
> 
>        Best,
> 
>                  Tim
> 
> 
>[0] https://github.com/DigitalPebble/behemoth
>[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
>[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
>
>--------------------------
>Ken Krugler
>+1 530-210-6378 <tel:%2B1%20530-210-6378>
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
>
>-- 
>Mark Kerzner, President & CEO, SHMsoft <http://shmsoft.com/>,
>To schedule a meeting with me: http://www.meetme.so/markkerzner
>
>Mobile: 713-724-2534
>Skype: mark.kerzner1
>Office: One Riverway Suite 1700
>Houston, TX 77056
>
>Privileged and Confidential
> <http://shmsoft.com/>
>
>
>
>
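For anyone reading this thread later, the pattern Ken describes (run
the parse in a callable, give up on it after a timeout, and treat
linkage errors as an ordinary failure) can be sketched roughly as
below. This is only a minimal illustration using java.util.concurrent,
not Bixo's actual code. The class name TimedParse, the Result type and
the parseWithTimeout method are hypothetical, and the Callable<String>
stands in for whatever Tika parse call you would really make.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a parse task on a worker thread with a timeout.
public class TimedParse {
    public enum Status { OK, TIMED_OUT, FAILED }

    public static final class Result {
        public final Status status;
        public final String text;      // extracted text, null unless status is OK
        public final Throwable error;  // cause of the problem, null when status is OK

        Result(Status status, String text, Throwable error) {
            this.status = status;
            this.text = text;
            this.error = error;
        }
    }

    // Daemon threads, so an orphaned (hung) parse cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true);
        return t;
    });

    // parseTask is a stand-in (assumption) for a call into a real parser.
    public static Result parseWithTimeout(Callable<String> parseTask, long timeoutMillis) {
        Future<String> future = POOL.submit(parseTask);
        try {
            String text = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return new Result(Status.OK, text, null);
        } catch (TimeoutException e) {
            // Ask the worker to stop. If the parser ignores the interrupt,
            // the thread is simply left behind (orphaned) and we move on.
            future.cancel(true);
            return new Result(Status.TIMED_OUT, null, e);
        } catch (ExecutionException e) {
            // The cause may be an exception or an Error such as
            // NoSuchMethodError thrown inside the task.
            return new Result(Status.FAILED, null, e.getCause());
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
            return new Result(Status.FAILED, null, e);
        }
    }

    public static void main(String[] args) {
        Result fast = parseWithTimeout(() -> "some extracted text", 1000);
        Result hung = parseWithTimeout(() -> { Thread.sleep(5000); return "late"; }, 100);
        Result bad = parseWithTimeout(() -> { throw new NoSuchMethodError("excluded parser"); }, 1000);
        System.out.println(fast.status + " " + hung.status + " " + bad.status);
    }
}
```

Note that a timeout only protects the calling task. A parse thread that
never responds to the interrupt still holds its memory, so this does
not by itself defend against OOM; in a long-running Hadoop task the
orphaned threads can also pile up, which is worth keeping in mind when
choosing the pool type and the timeout.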
