Re: [Proposal] Beam Newsletter

2017-09-21 Thread Jean-Baptiste Onofré
Hi, It's a great idea. I think the user mailing list is enough and makes more sense. So, the "Call for updates" with the doc should be sent on the dev mailing list and the result (not the Google doc but a copy-paste in the mail body) on the user mailing list. Regards JB On 09/22/2017 03:31 AM,

Re: [VOTE] Release 2.1.1, release candidate #1

2017-09-21 Thread Jean-Baptiste Onofré
+1 (binding) Tested & checked: - build - ASF header/rat - examples - samples Thanks Regards JB On 09/21/2017 10:01 AM, Robert Bradshaw wrote: Hi everyone, As discussed earlier in this list [1] we'd like to get a bugfix release out for beam 2.1. Please review and vote on the release candidate

[Proposal] Beam Newsletter

2017-09-21 Thread Griselda Cuevas
Hi Beam Community, I have a proposal to start sending *monthly newsletters* to our dev and user mailing lists. The idea is to summarize what's happening in the project and keep everyone informed, especially new members, people interested in specific initiatives/efforts and help

Re: TikaIO concerns

2017-09-21 Thread Eugene Kirpichov
Hi, @Sergey: - I already marked TikaIO @Experimental, so we can make changes. - Yes, the String in KV is the filename. I guess we could alternatively put it into ParseResult - don't have a strong opinion. @Chris: unorderedness of Metadata would have helped if we extracted

Re: TikaIO concerns

2017-09-21 Thread Chris Mattmann
Hi all, One other thing is that Tika extracts metadata and language information for which order doesn’t matter (keys can be out of order). Would this be useful? Cheers, Chris On 9/21/17, 2:10 PM, "Sergey Beryozkin" wrote: Hi Eugene Thank you, very

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
Hi Eugene Thank you, very helpful, let me read it a few times before I get what exactly I need to clarify :-), two questions so far: On 21/09/17 21:40, Eugene Kirpichov wrote: Thanks all for the discussion. It seems we have consensus that both within-document order and association with the

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
Hi all, Please also welcome Chris to this thread, Chris, thanks for joining in :-), FYI, the main concern that was raised is that it was not obvious when to use TikaIO in the current form, given that Beam+TikaIO will have a totally unordered sequence of data (originally extracted by Tika in

Re: TikaIO concerns

2017-09-21 Thread Eugene Kirpichov
Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO. *Association with original file:* Sergey - Beam does not *automatically* provide a way to associate an element with

[Event] Strata Data Conference - New York 2017

2017-09-21 Thread Griselda Cuevas
Hi Beam Community, Apache Beam will be featured at Strata Data Conference New York next week [1]. Scheduled events: ** Realizing the promise of portability with Apache Beam* Speaker: Reuven Lax 11:20am–12:00pm Thursday, September 28, 2017 ** Foundations of streaming SQL; or, How I learned to

Re: [VOTE] Release 2.1.1, release candidate #1

2017-09-21 Thread Kenneth Knowles
+1 On Thu, Sep 21, 2017 at 10:15 AM, Chamikara Jayalath wrote: > +1. > > Ran wordcount and verified checksums and signature. > > Thanks, > Cham > > On Thu, Sep 21, 2017 at 1:02 AM Robert Bradshaw > > wrote: > > > Hi everyone, > > > > As

Re: [VOTE] Release 2.1.1, release candidate #1

2017-09-21 Thread Chamikara Jayalath
+1. Ran wordcount and verified checksums and signature. Thanks, Cham On Thu, Sep 21, 2017 at 1:02 AM Robert Bradshaw wrote: > Hi everyone, > > As discussed earlier in this list [1] we'd like to get a bugfix > release out for beam 2.1. Please review and vote on the

Re: Proposal: Unbreak Beam Python 2.1.0 with 2.1.1 bugfix release

2017-09-21 Thread Thomas Groh
+1 on cutting a release to fix this. As an aside, if we later determine that we require a release that includes Java, that release will be 2.1.2 (or equivalent) - the reason we aren't releasing Java artifacts is a matter of convenience (they have the same contents as the 2.1.0 release), not

RE: TikaIO concerns

2017-09-21 Thread Allison, Timothy B.
Thank you, Sergey. My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet. From my perspective, if I understand this thread (and I may not!), getting unordered

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
Hi All Please welcome Tim, one of Apache Tika leads and practitioners. Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving the cases where it did not really matter the ordering in which Tika-produced data were dealt with by the consumers)

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
Thanks for the comments, On 20/09/17 22:46, Robert Bradshaw wrote: On Wed, Sep 20, 2017 at 2:17 PM, Sergey Beryozkin wrote: Hi, thanks for the explanations, On 20/09/17 16:41, Eugene Kirpichov wrote: Hi! TextIO returns an unordered soup of lines contained in all

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
I noticed that the PDF and ODT parsers actually split by lines, not individual words, and I'm nearly 100% sure I saw Tika reporting individual lines when it was parsing the text files. The 'min text length' feature can help with reporting several lines at a time, etc... I'm working with this PDF

Jenkins build is unstable: beam_Release_NightlySnapshot #539

2017-09-21 Thread Apache Jenkins Server
See

[VOTE] Release 2.1.1, release candidate #1

2017-09-21 Thread Robert Bradshaw
Hi everyone, As discussed earlier in this list [1] we'd like to get a bugfix release out for beam 2.1. Please review and vote on the release candidate #1 for the version 2.1.1, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments)