Re: TikaIO concerns

2017-09-22 Thread Eugene Kirpichov
On Fri, Sep 22, 2017 at 2:20 PM Sergey Beryozkin wrote: > Hi, > On 22/09/17 22:02, Eugene Kirpichov wrote: > > Sure - with hundreds of different file formats and the abundance of weird / malformed / malicious files in the wild, it's quite expected that sometimes

Re: [VOTE RESULT] Release 2.1.1, release candidate #1

2017-09-22 Thread Robert Bradshaw
Correction, Chamikara Jayalath is a committer, not a member of the PMC. This does not change the results; the voting still stands unanimous at 4 PMC votes + a significant committer vote in the affirmative. On Fri, Sep 22, 2017 at 2:16 PM, Robert Bradshaw wrote: > I'm happy

Re: [Proposal] Beam Newsletter

2017-09-22 Thread Reza Rokni
+1 On Sep 22, 2017 2:32 AM, "Griselda Cuevas" wrote: > Hi Beam Community, > I have a proposal to start sending *monthly newsletters* to our dev and user mailing lists. The idea is to summarize what's happening in the project and keep everyone informed of what's happening,

Re: [Proposal] Beam Newsletter

2017-09-22 Thread Steve Anderson
+1 would love this. Our team (maestro.io, we do livestreams for brands like the Grammys and Playstation and use dataflow to collect massive amounts of data) checks the beam blog and the google bigdata blog daily for new features and bug fixes. Would love to see something more regular! Some of us

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
Hi, On 22/09/17 22:02, Eugene Kirpichov wrote: Sure - with hundreds of different file formats and the abundance of weird / malformed / malicious files in the wild, it's quite expected that sometimes the library will crash. Some kinds of issues are easier to address than others. We can catch

[VOTE RESULT] Release 2.1.1, release candidate #1

2017-09-22 Thread Robert Bradshaw
I'm happy to announce that we have unanimously approved this bugfix release. There are 5 approving PMC member votes: - Chamikara Jayalath - Kenneth Knowles - Daniel Halperin - Jean-Baptiste Onofré - Aljoscha Krettek There are no disapproving votes. Thanks everyone! I will be

Re: TikaIO concerns

2017-09-22 Thread Eugene Kirpichov
Sure - with hundreds of different file formats and the abundance of weird / malformed / malicious files in the wild, it's quite expected that sometimes the library will crash. Some kinds of issues are easier to address than others. We can catch exceptions and return a ParseResult representing a

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
Hi Tim, All On 22/09/17 18:17, Allison, Timothy B. wrote: Y, I think you have it right. Tika library has a big problem with crashes and freezes I wouldn't want to overstate it. Crashes and freezes are exceedingly rare, but when you are processing millions/billions of files in the wild [1],

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
>> 1) We've gathered a TB of data from CommonCrawl and we run regression tests against this TB (thank you, Rackspace, for hosting our vm!) to try to identify these problems. And if anyone with connections at a big company doing open source + cloud would be interested in floating us some

Re: TikaIO concerns

2017-09-22 Thread Reuven Lax
This is similar to what I suggested. This will not work well to handle crashes and freezes however. On Fri, Sep 22, 2017 at 10:24 AM, Ben Chambers wrote: > BigQueryIO allows a side-output for elements that failed to be inserted when using the Streaming BigQuery sink:

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Nice! Thank you! -Original Message- From: Ben Chambers [mailto:bchamb...@apache.org] Sent: Friday, September 22, 2017 1:24 PM To: dev@beam.apache.org Cc: d...@tika.apache.org Subject: Re: TikaIO concerns BigQueryIO allows a side-output for elements that failed to be inserted when

Re: TikaIO concerns

2017-09-22 Thread Ben Chambers
BigQueryIO allows a side-output for elements that failed to be inserted when using the Streaming BigQuery sink: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92 This follows the pattern

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Do tell... Interesting. Any pointers? -Original Message- From: Ben Chambers [mailto:bchamb...@google.com.INVALID] Sent: Friday, September 22, 2017 12:50 PM To: dev@beam.apache.org Cc: d...@tika.apache.org Subject: Re: TikaIO concerns Regarding specifically elements that are failing --

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Y, I think you have it right. > Tika library has a big problem with crashes and freezes I wouldn't want to overstate it. Crashes and freezes are exceedingly rare, but when you are processing millions/billions of files in the wild [1], they will happen. We fix the problems or try to get our

Re: TikaIO concerns

2017-09-22 Thread Ben Chambers
Regarding specifically elements that are failing -- I believe some other IO has used the concept of a "Dead Letter" side-output, where documents that failed to process are side-output so the user can handle them appropriately. On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
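
A minimal sketch of the "Dead Letter" idea described above, written in plain Python rather than against the Beam API. The names `process_with_dead_letter` and `toy_parse` are hypothetical and are not part of Beam or Tika; the point is only that a document which raises is routed to a second output instead of failing the whole batch.

```python
from typing import Callable, Iterable, List, Tuple


def process_with_dead_letter(
    docs: Iterable[str],
    parse: Callable[[str], str],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Hypothetical helper: return (main output, dead-letter output)."""
    parsed: List[str] = []
    dead_letters: List[Tuple[str, str]] = []
    for doc in docs:
        try:
            parsed.append(parse(doc))
        except Exception as exc:
            # Any parse failure goes to the side output, paired with the reason.
            dead_letters.append((doc, repr(exc)))
    return parsed, dead_letters


def toy_parse(doc: str) -> str:
    # Stand-in for a real parser; rejects one marker value to simulate a bad file.
    if doc == "corrupt":
        raise ValueError("cannot parse")
    return doc.upper()


ok, failed = process_with_dead_letter(["a", "corrupt", "b"], toy_parse)
```

Here `ok` holds the successfully parsed documents and `failed` holds the rejected ones for the user to inspect. Catching exceptions this way only covers failures that surface as exceptions; it does not help with crashes or freezes, as noted elsewhere in this thread.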

Re: TikaIO concerns

2017-09-22 Thread Eugene Kirpichov
Hi Tim, From what you're saying it sounds like the Tika library has a big problem with crashes and freezes, and applying it at scale (e.g. in the context of Beam) requires explicitly addressing this problem, e.g. accepting the fact that in many realistic applications some documents will just

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Reuven, Thank you! This suggests to me that it is a good idea to integrate Tika with Beam so that people don't have to 1) (re)discover the need to make their wrappers robust and then 2) have to reinvent these wheels for robustness. For kicks, see William Palmer's post on his toe-stubbing

Re: Jenkins build is still unstable: beam_Release_NightlySnapshot #540

2017-09-22 Thread Kenneth Knowles
Filed https://issues.apache.org/jira/browse/BEAM-2981 for the failure here. It is a release blocker - at HEAD it is likely that ProtoCoder cannot be used with Dataflow. On Fri, Sep 22, 2017 at 1:36 AM, Apache Jenkins Server < jenk...@builds.apache.org> wrote: > See

Re: TikaIO concerns

2017-09-22 Thread Reuven Lax
The answer will be different for the different Beam runners, and even then probably different in batch and streaming runners. On Fri, Sep 22, 2017 at 5:01 AM, Allison, Timothy B. wrote: > @Eugene: What's the best way to have Beam help us with these issues, or do > these come

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
>> How will it work now, with new Metadata() passed to the AutoDetect parser, will this Metadata have a Metadata value per every attachment, possibly keyed by a name? An example of how to call the RecursiveParserWrapper:

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
Hi Tim Sorry for getting into the RecursiveParserWrapper discussion first, I was certain the time zone difference was on my side :-) How will it work now, with new Metadata() passed to the AutoDetect parser, will this Metadata have a Metadata value per every attachment, possibly keyed by a

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
@Eugene: What's the best way to have Beam help us with these issues, or do these come for free with the Beam framework? 1) a process-level timeout (because you can't actually kill a thread in Java) 2) a process-level restart on OOM 3) avoid trying to reprocess a badly behaving document
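
A rough sketch of what the three points listed above could look like outside of Beam, assuming each document is parsed by a separate child process. The helpers `run_isolated` and `process_once` are hypothetical names, and a sleeping Python child stands in for a parser that hangs. Because every document gets a fresh process, a child that dies (for example from running out of memory) shows up as a non-zero exit rather than taking the parent down.

```python
import subprocess
import sys
from typing import List, Set


def run_isolated(cmd: List[str], timeout_s: float) -> str:
    """Run a command in a child process; return 'ok', 'failed', or 'timeout'."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing is left hanging.
        return "timeout"
    return "ok" if completed.returncode == 0 else "failed"


def process_once(doc_id: str, cmd: List[str], timeout_s: float, bad_docs: Set[str]) -> str:
    """Skip documents already known to misbehave; record new offenders."""
    if doc_id in bad_docs:
        return "skipped"
    status = run_isolated(cmd, timeout_s)
    if status != "ok":
        bad_docs.add(doc_id)
    return status


bad: Set[str] = set()
# Stand-in for a parser that freezes on a particular document.
hang = [sys.executable, "-c", "import time; time.sleep(30)"]
first = process_once("doc-1", hang, 0.5, bad)
second = process_once("doc-1", hang, 0.5, bad)
```

The first attempt times out and marks the document as bad; the second attempt is skipped without starting a process. How a real runner would persist the set of bad documents across workers and retries is left open here.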

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
@Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish? Not at the moment, we’d have to do some coding on our end or within Beam. The format is a list of maps/dicts for each file. Each
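
For illustration only, a "list of maps/dicts for each file" could look roughly like the following in Python. The assumption is that there is one dict for the container file followed by one per embedded attachment; the key names `resource_name` and `content` are made up for this sketch and are not the keys Tika actually emits.

```python
from typing import Dict, List

# Hypothetical shape: the container document first, then one dict per
# embedded attachment. Key names are illustrative, not Tika's real keys.
parsed_file: List[Dict[str, str]] = [
    {"resource_name": "report.docx", "content": "Quarterly summary"},
    {"resource_name": "chart.png", "content": ""},
    {"resource_name": "data.xlsx", "content": "Q1 Q2 Q3 Q4"},
]


def contents_by_name(entries: List[Dict[str, str]]) -> Dict[str, str]:
    """Index the extracted text of each entry by its (hypothetical) name key."""
    return {entry["resource_name"]: entry["content"] for entry in entries}


by_name = contents_by_name(parsed_file)
```

A flat list like this keeps attachment metadata separate from the container's metadata instead of merging everything into a single map.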

Jenkins build is still unstable: beam_Release_NightlySnapshot #540

2017-09-22 Thread Apache Jenkins Server
See