Re: TikaIO concerns

2017-09-23 Thread Sergey Beryozkin
..@beam.apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns Hi Tim, From what you're saying it sounds like the Tika library has a big problem with crashes and freezes, and when applying it at scale (e.g. in the context of Beam) requires explicitly addressing this problem, e.g.

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
hare/fileshare processing for our low volume users 4) We're trying to get the message out. Thank you for working with us!!! -Original Message- From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID] Sent: Friday, September 22, 2017 12:48 PM To: d...@beam.apache.org Cc: dev@tika.apac

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns Hi Tim, From what you're saying it sounds like the Tika library has a big problem with crashes and freezes, and when applying it at scale (e.g. in the context of Beam) requires explicitly addressing this problem, e.g. acceptin

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Great. Thank you! -Original Message- From: Chris Mattmann [mailto:mattm...@apache.org] Sent: Friday, September 22, 2017 1:46 PM To: dev@tika.apache.org Subject: Re: TikaIO concerns [dropping Beam on this] Tim, another thing is that you can finally download the TREC-DD Polar data

Re: TikaIO concerns

2017-09-22 Thread Chris Mattmann
[dropping Beam on this] Tim, another thing is that you can finally download the TREC-DD Polar data either from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as described here: http://github.com/chrismattmann/trec-dd-polar/ In case we want to use as part of our regression.

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
>>1) We've gathered a TB of data from CommonCrawl and we run regression tests >>against this TB (thank you, Rackspace for hosting our vm!) to try to identify >>these problems. And if anyone with connections at a big company doing open source + cloud would be interested in floating us some

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Nice! Thank you! -Original Message- From: Ben Chambers [mailto:bchamb...@apache.org] Sent: Friday, September 22, 2017 1:24 PM To: d...@beam.apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns BigQueryIO allows a side-output for elements that failed to be inserted when

Re: TikaIO concerns

2017-09-22 Thread Ben Chambers
Sent: Friday, September 22, 2017 12:50 PM > To: d...@beam.apache.org > Cc: dev@tika.apache.org > Subject: Re: TikaIO concerns > > Regarding specifically elements that are failing -- I believe some other > IO has used the concept of a "Dead Letter" side-output, wher

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Do tell... Interesting. Any pointers? -Original Message- From: Ben Chambers [mailto:bchamb...@google.com.INVALID] Sent: Friday, September 22, 2017 12:50 PM To: d...@beam.apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns Regarding specifically elements that are failing

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
e.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns Hi Tim, From what you're saying it sounds like the Tika library has a big problem with crashes and freezes, and when applying it at scale (e.g. in the context of Beam) requires explicitly addressing this problem, e.g. accepting the fact that in m

Re: TikaIO concerns

2017-09-22 Thread Ben Chambers
Regarding specifically elements that are failing -- I believe some other IO has used the concept of a "Dead Letter" side-output, where documents that failed to process are side-output so the user can handle them appropriately. On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Reuven, Thank you! This suggests to me that it is a good idea to integrate Tika with Beam so that people don't have to 1) (re)discover the need to make their wrappers robust and then 2) have to reinvent these wheels for robustness. For kicks, see William Palmer's post on his toe-stubbing

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
>> How will it work now, with new Metadata() passed to the AutoDetect parser, >> will this Metadata have a Metadata value per every attachment, possibly >> keyed by a name ? An example of how to call the RecursiveParserWrapper:

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns Hi, @Sergey: - I already marked TikaIO @Experimental, so we can make changes. - Yes, the String in KV<String, ParseResult> is the filename. I guess we could alter

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
@Eugene: What's the best way to have Beam help us with these issues, or do these come for free with the Beam framework? 1) a process-level timeout (because you can't actually kill a thread in Java) 2) a process-level restart on OOM 3) avoid trying to reprocess a badly behaving document

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
ut it… *From:* Eugene Kirpichov [mailto:kirpic...@google.com] *Sent:* Thursday, September 21, 2017 4:41 PM *To:* Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org *Cc:* dev@tika.apache.org *Subject:* Re: TikaIO concerns Thanks all for the discussion. It seems we have consensus t

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
c: dev@tika.apache.org Subject: Re: TikaIO concerns Hi, @Sergey: - I already marked TikaIO @Experimental, so we can make changes. - Yes, the String in KV<String, ParseResult> is the filename. I guess we could alternatively put it into ParseResult - don't have a strong opinion. @Chris: unorderedness

RE: TikaIO concerns

2017-09-21 Thread Allison, Timothy B.
are a problem, no doubt about it… From: Eugene Kirpichov [mailto:kirpic...@google.com] Sent: Thursday, September 21, 2017 4:41 PM To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns Thanks all for the discussion. It see

Re: TikaIO concerns

2017-09-21 Thread Chris Mattmann
MHO, very few devs realize this because >> Tika works well lots of the time...which is why it is critical for us to >> make it easy for people to get it right all of the time. >>> >>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch >

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Thursday, September 21, 2017 9:07 AM To: d...@beam.apache.org Cc: Allison, Timothy B. <talli...@mitre.org> Subject: Re: TikaIO concerns Hi All Please welcome Tim, one of Apache Tika leads and practitioners. Tim, th

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
sberyoz...@gmail.com] Sent: Thursday, September 21, 2017 9:07 AM To: d...@beam.apache.org Cc: Allison, Timothy B. <talli...@mitre.org> Subject: Re: TikaIO concerns Hi All Please welcome Tim, one of Apache Tika leads and practitioners. Tim, thanks for joining in :-). If you have some great Apa

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
jor thanks for your input :-) Cheers, Sergey Cheers, Tim -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Thursday, September 21, 2017 9:07 AM To: d...@beam.apache.org Cc: Allison, Timothy B. <talli...@mitre.org> Subject: Re: TikaIO concerns

RE: TikaIO concerns

2017-09-21 Thread Allison, Timothy B.
> Subject: Re: TikaIO concerns Hi All Please welcome Tim, one of Apache Tika leads and practitioners. Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving the cases where it did not really matter the ordering in which Tika-produced