..@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns
Hi Tim,
From what you're saying, it sounds like the Tika library has a big problem with
crashes and freezes, and that applying it at scale (e.g. in the context of Beam)
requires explicitly addressing this problem, e.g.
hare/fileshare
processing for our low volume users
4) We're trying to get the message out. Thank you for working with us!!!
-----Original Message-----
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:48 PM
To: d...@beam.apache.org
Cc: dev@tika.apac
apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns
Hi Tim,
From what you're saying, it sounds like the Tika library has a big problem with
crashes and freezes, and that applying it at scale (e.g. in the context of Beam)
requires explicitly addressing this problem, e.g. acceptin
Great. Thank you!
-----Original Message-----
From: Chris Mattmann [mailto:mattm...@apache.org]
Sent: Friday, September 22, 2017 1:46 PM
To: dev@tika.apache.org
Subject: Re: TikaIO concerns
[dropping Beam on this]
Tim, another thing is that you can finally download the TREC-DD Polar data
either from the NSF Arctic Data Center (70GB zip) or from Amazon S3, as described
here:
http://github.com/chrismattmann/trec-dd-polar/
in case we want to use it as part of our regression tests.
>>1) We've gathered a TB of data from CommonCrawl and we run regression tests
>>against this TB (thank you, Rackspace, for hosting our VM!) to try to identify
>>these problems.
And if anyone with connections at a big company doing open source + cloud would
be interested in floating us some
Nice! Thank you!
-----Original Message-----
From: Ben Chambers [mailto:bchamb...@apache.org]
Sent: Friday, September 22, 2017 1:24 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns
BigQueryIO allows a side-output for elements that failed to be inserted when
Sent: Friday, September 22, 2017 12:50 PM
> To: d...@beam.apache.org
> Cc: dev@tika.apache.org
> Subject: Re: TikaIO concerns
>
> Regarding specifically elements that are failing -- I believe some other
> IO has used the concept of a "Dead Letter" side-output, wher
Do tell...
Interesting. Any pointers?
-----Original Message-----
From: Ben Chambers [mailto:bchamb...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:50 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns
Regarding specifically elements that are failing
e.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns
Hi Tim,
From what you're saying, it sounds like the Tika library has a big problem with
crashes and freezes, and that applying it at scale (e.g. in the context of Beam)
requires explicitly addressing this problem, e.g. accepting the fact that in
m
Regarding specifically elements that are failing -- I believe some other IO
has used the concept of a "Dead Letter" side-output, where documents that
failed to process are side-output so the user can handle them appropriately.
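
The "Dead Letter" idea above can be illustrated with a minimal sketch in plain Java. This is not Beam's API and not what any existing IO does internally: the class `DeadLetterSketch`, the holder `Outputs`, and the method `process` are hypothetical names chosen for illustration. The only point is that a failing element is routed to a second output instead of aborting the whole run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Plain-Java illustration of a "dead letter" split; hypothetical, not Beam's API. */
class DeadLetterSketch {

    /** Holds successfully processed results and the inputs that failed. */
    static final class Outputs<I, O> {
        final List<O> main = new ArrayList<>();
        final List<I> deadLetters = new ArrayList<>();
    }

    /** Applies fn to every input; an input whose processing throws is side-output. */
    static <I, O> Outputs<I, O> process(List<I> inputs, Function<I, O> fn) {
        Outputs<I, O> out = new Outputs<>();
        for (I input : inputs) {
            try {
                out.main.add(fn.apply(input));
            } catch (RuntimeException e) {
                // The failure does not stop the run; the caller decides what to do later.
                out.deadLetters.add(input);
            }
        }
        return out;
    }
}
```

A caller can then inspect `deadLetters` to log, retry, or quarantine the failed documents. Note that a try/catch like this only covers exceptions; it does nothing for hangs or crashes of the whole JVM.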
On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
Reuven,
Thank you! This suggests to me that it is a good idea to integrate Tika with
Beam so that people don't have to 1) (re)discover the need to make their
wrappers robust and then 2) reinvent these wheels for robustness.
For kicks, see William Palmer's post on his toe-stubbing
>> How will it work now, with new Metadata() passed to the AutoDetect parser:
>> will this Metadata have a Metadata value for every attachment, possibly
>> keyed by a name?
An example of how to call the RecursiveParserWrapper:
To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns
Hi,
@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.
- Yes, the String in KV<String, ParseResult> is the filename. I guess we could
alter
@Eugene: What's the best way to have Beam help us with these issues, or do
these come for free with the Beam framework?
1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document
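
One way to picture the three items above is a small supervisor that runs each parse in a child process. This is a sketch under stated assumptions, not something Beam or Tika provides: the class `ParseSupervisor` and its methods are hypothetical, and the command that actually performs the parse is left to the caller. It relies only on the JDK's `ProcessBuilder` and `Process` classes (Java 8 or later).

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;

/** Hypothetical supervisor: one child process per document, with a hard timeout. */
class ParseSupervisor {
    private final Set<String> knownBad = new HashSet<>();
    private final long timeoutMillis;

    ParseSupervisor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void markBad(String docId) {
        knownBad.add(docId);
    }

    boolean isKnownBad(String docId) {
        return knownBad.contains(docId);
    }

    /** Returns true only if the child exits with status 0 within the timeout. */
    boolean parse(String docId, List<String> command) {
        if (isKnownBad(docId)) {
            return false; // item 3: never retry a document that already misbehaved
        }
        Process child;
        try {
            child = new ProcessBuilder(command).inheritIO().start();
        } catch (IOException e) {
            return false; // could not launch: an environment problem, so the document is not blamed
        }
        try {
            if (!child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                child.destroyForcibly(); // item 1: a process can be killed, unlike a thread
                markBad(docId);
                return false;
            }
        } catch (InterruptedException e) {
            child.destroyForcibly();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        if (child.exitValue() != 0) {
            markBad(docId); // item 2: a crash (e.g. OOM) takes down only the child
            return false;
        }
        return true;
    }
}
```

Because every document gets a fresh child, a "restart on OOM" amounts to launching the next child. The cost is process startup per document; batching several documents per child would reduce that, at the price of having to work out which document in the batch caused the failure.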
ut it…
From: Eugene Kirpichov [mailto:kirpic...@google.com]
Sent: Thursday, September 21, 2017 4:41 PM
To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns
Thanks all for the discussion. It seems we have consensus t
c: dev@tika.apache.org
Subject: Re: TikaIO concerns
Hi,
@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.
- Yes, the String in KV<String, ParseResult> is the filename. I guess we could
alternatively put it into ParseResult - don't have a strong opinion.
@Chris: unorderedness
are a problem, no doubt about it…
From: Eugene Kirpichov [mailto:kirpic...@google.com]
Sent: Thursday, September 21, 2017 4:41 PM
To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns
Thanks all for the discussion. It see
MHO, very few devs realize this because
>> Tika works well lots of the time...which is why it is critical for us to
>> make it easy for people to get it right all of the time.
>>>
>>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
>
Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 21, 2017 9:07 AM
To: d...@beam.apache.org
Cc: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: TikaIO concerns
Hi All
Please welcome Tim, one of Apache Tika leads and practitioners.
Tim, th
jor thanks for your input :-)
Cheers, Sergey
Cheers,
Tim
-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 21, 2017 9:07 AM
To: d...@beam.apache.org
Cc: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: TikaIO concerns
Hi All
Please welcome Tim, one of Apache Tika leads and practitioners.
Tim, thanks for joining in :-). If you have some great Apache Tika stories to
share (preferably involving the cases where it did not really matter the
ordering in which Tika-produced