Re: TextIO and .withWindowedWrites() - filenamepolicy

2017-05-11 Thread Kenneth Knowles
Windows are a user-defined type, though there aren't too many occasions for wildly interesting ones. In other places where we need to key by them (state namespaces) the solution is not readable, as it goes through the coder and then injects that result into the allowed keyspace (e.g. via base64 enc

Re: First stable release: Acceptance criteria

2017-05-11 Thread Jason Kuster
Just validated a decently-sized wordcount on a YARN cluster successfully. On Thu, May 11, 2017 at 3:51 PM, Kenneth Knowles wrote: > I gave the archetype-based quickstart a try on as many runners and > configurations as I could manage today, mostly embedded and YARN. > > There are some issues (fi

Re: First stable release: Acceptance criteria

2017-05-11 Thread Jason Kuster
on Spark, doh On Thu, May 11, 2017 at 5:45 PM, Jason Kuster wrote: > Just validated a decently-sized wordcount on a YARN cluster successfully. > > On Thu, May 11, 2017 at 3:51 PM, Kenneth Knowles > wrote: > >> I gave the archetype-based quickstart a try on as many runners and >> configurations

Re: [PROPOSAL] Apache Hive connector

2017-05-11 Thread Seshadri Raghunathan
Thanks Eugene, that makes sense. This solution borrows heavily from HadoopInputFormatIO with a tweak for HCatalog (and related parameters). I will try to re-use HadoopInputFormatIO rather than the current approach. On Thu, May 11, 2017 at 4:44 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wr

Re: [PROPOSAL] Apache Hive connector

2017-05-11 Thread Eugene Kirpichov
Thanks Seshadri! This seems to have a great deal of copy-paste from HadoopInputFormatIO. Is it possible to instead implement this connector as a wrapper around it, rather than copy-paste? On Thu, May 11, 2017 at 4:41 PM Seshadri Raghunathan wrote: > Hi all, > > Here is a draft implementation of

Re: [PROPOSAL] Apache Hive connector

2017-05-11 Thread Seshadri Raghunathan
Hi all, Here is a draft implementation of this proposal - https://github.com/seshadri-cr/beam/commit/78cdf8772f2cd5bb9cd018b1c99c3ad0854157c1 Many thanks to Ismaël Mejía who helped in a high level review & follow-up of this design / approach. Looking forward to further review/comments from wide

Re: TextIO and .withWindowedWrites() - filenamepolicy

2017-05-11 Thread Robert Bradshaw
I like the idea of WWW and PPP, assuming there is a standard enough stringification of windows and panes. However, we may want to elide adjacent tokens if the window is global or the pane is the only possible (or first?) one to avoid writing things like --of-0005---. On Thu, May 11, 2017 at 8:4
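
The eliding concern above can be illustrated with a minimal sketch. This is not Beam code: the class name `ElideEmptyTokens`, the method `fill`, the literal `WWW`/`PPP` tokens, and the `-` separator are all assumptions made for illustration. It substitutes an empty string for the global window or the only pane, then removes the separators left dangling.

```java
public class ElideEmptyTokens {
    // Hypothetical helper: fills WWW and PPP, then cleans up separators.
    // Assumes '-' is the only separator and that the window and pane
    // strings themselves contain no "--" and no WWW/PPP text.
    static String fill(String template, String window, String pane) {
        String s = template.replace("WWW", window).replace("PPP", pane);
        // Collapse runs of separators produced by empty tokens.
        s = s.replaceAll("-{2,}", "-");
        // Drop a separator left at the very start or end.
        s = s.replaceAll("^-|-$", "");
        return s;
    }

    public static void main(String[] args) {
        // Global window, single pane: both tokens are elided.
        System.out.println(fill("WWW-PPP-00001-of-00005", "", ""));
        // Named window and pane: tokens are kept.
        System.out.println(fill("WWW-PPP-00001-of-00005", "w1", "0"));
    }
}
```

Cleaning up after substitution keeps the template syntax unchanged; a real policy would more likely decide per token whether to emit its adjacent separator.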

Re: First stable release: Acceptance criteria

2017-05-11 Thread Kenneth Knowles
I gave the archetype-based quickstart a try on as many runners and configurations as I could manage today, mostly embedded and YARN. There are some issues (filed and added to the doc) that may have to do with my setup, but may not. I'd prefer the runner maintainers / system experts try these on th

Re: First stable release: Acceptance criteria

2017-05-11 Thread Thomas Groh
I'm making sure the direct runner plays nice in a variety of scenarios (primarily the game examples, at the moment. Been a couple of hours and still going strong in streaming) On Thu, May 11, 2017 at 3:09 PM, Dan Halperin wrote: > I'm focusing on: > > * user reported bugs (Avro, TextIO, MongoDb)

Re: First stable release: Acceptance criteria

2017-05-11 Thread Dan Halperin
I'm focusing on: * user reported bugs (Avro, TextIO, MongoDb) * the actual Apache Release criteria (licensing, dependencies, etc.) On Thu, May 11, 2017 at 3:04 PM, Lukasz Cwik wrote: > I have been trying out various Python scenarios on Windows. > > On Thu, May 11, 2017 at 3:01 PM, Jason Kuster

Re: First stable release: Acceptance criteria

2017-05-11 Thread Lukasz Cwik
I have been trying out various Python scenarios on Windows. On Thu, May 11, 2017 at 3:01 PM, Jason Kuster < jasonkus...@google.com.invalid> wrote: > I'll try to get wordcount running against a Spark cluster. > > On Wed, May 10, 2017 at 10:32 PM, Davor Bonaci wrote: > > > Just a quick reminder t

Re: First stable release: Acceptance criteria

2017-05-11 Thread Jason Kuster
I'll try to get wordcount running against a Spark cluster. On Wed, May 10, 2017 at 10:32 PM, Davor Bonaci wrote: > Just a quick reminder to consider contributing here. > > We are now at 6 criteria -- thanks! > > On Tue, May 9, 2017 at 2:29 AM, Aljoscha Krettek > wrote: > > > Thanks

Apache Beam Meetup in Bay Area

2017-05-11 Thread Davor Bonaci
I'm happy to announce that we'll have our very first Bay Area meetup [1] on Wednesday, May 24, hosted by the fine folks at Hortonworks. Abstract: The world of big data involves an ever changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims

Re: What is the use case and the expected behavior for a Flatten transform with no input?

2017-05-11 Thread Shen Li
Thanks! Shen On Thu, May 11, 2017 at 2:29 PM, Reuven Lax wrote: > We sometimes see this when users generate flattens with a for loop (e.g. > over input sources for a side input). Sometimes there is no input, and the > flatten is empty. > > > On Thu, May 11, 2017 at 11:24 AM, Shen Li wrote: > >

Re: What is the use case and the expected behavior for a Flatten transform with no input?

2017-05-11 Thread Reuven Lax
We sometimes see this when users generate flattens with a for loop (e.g. over input sources for a side input). Sometimes there is no input, and the flatten is empty. On Thu, May 11, 2017 at 11:24 AM, Shen Li wrote: > Hi, > > The FlattenTest enforces that Flatten transform can handle empty input

Re: What is the use case and the expected behavior for a Flatten transform with no input?

2017-05-11 Thread Kenneth Knowles
Hi Shen, The most obvious use case to me would be a pipeline that is generated by a program that loops out the input PCollections. Of course, it is quite easy to work around this by adding an empty PCollection to the flatten, so I don't think it is important. Kenn On Thu, May 11, 2017 at 11:24 A

What is the use case and the expected behavior for a Flatten transform with no input?

2017-05-11 Thread Shen Li
Hi, The FlattenTest enforces that Flatten transform can handle empty input. Is there any use case for this feature? https://github.com/apache/beam/blob/release-2.0.0/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/FlattenTest.java#L128 How should a runner translate an empty Flatten?

Re: TextIO and .withWindowedWrites() - filenamepolicy

2017-05-11 Thread Reuven Lax
Another idea - we can extend the existing pattern that DefaultFileNamePolicy understands to include windows. Today it replaces SSS with the shard, and NNN with the number of shards (so many templates contain -SSS-of-NNN). We could also have it recognize WWW and PPP, for the window and the pane res
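
The SSS/NNN substitution described above can be sketched in plain Java. This is an illustration only, not the DefaultFileNamePolicy implementation: the class name `ShardTemplateSketch` and the method `expand` are hypothetical, and the sketch assumes each run of S or N is replaced by the zero-padded shard index or shard count, with the run length giving the minimum width.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShardTemplateSketch {
    // A run of 'S' stands for the shard index, a run of 'N' for the shard count.
    private static final Pattern TOKEN = Pattern.compile("S+|N+");

    // Hypothetical helper: expands every S/N run in the template.
    // Note that any literal 'S' or 'N' in the template is also treated
    // as a token, so this sketch expects the template to hold tokens only.
    static String expand(String template, int shard, int numShards) {
        Matcher m = TOKEN.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the literal text between the previous token and this one.
            out.append(template, last, m.start());
            int value = m.group().charAt(0) == 'S' ? shard : numShards;
            // Zero-pad to the length of the token run.
            out.append(String.format("%0" + m.group().length() + "d", value));
            last = m.end();
        }
        out.append(template, last, template.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("-SSS-of-NNN", 5, 12)); // prints "-005-of-012"
    }
}
```

Recognizing WWW and PPP as proposed would amount to adding two more token kinds to the pattern, each mapped to a string form of the window or the pane.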

Re: TextIO and .withWindowedWrites() - filenamepolicy

2017-05-11 Thread Dan Halperin
(we should probably throw an exception at construction time in the various FileBasedSinks if you use WindowedWrites and the default filename policy though, that's a no-brainer and it's backwards-compatible.) On Thu, May 11, 2017 at 8:41 AM, Dan Halperin wrote: > +Eugene, Reuven who reviewed and

Re: TextIO and .withWindowedWrites() - filenamepolicy

2017-05-11 Thread Dan Halperin
+Eugene, Reuven who reviewed and implemented this code. They may have opinions. Note that changing the default filename policy would be backwards-incompatible, so this would either need to go into 2.0.0 (and a new RC3) or it would not go in. On Thu, May 11, 2017 at 8:36 AM, Borisa Zivkovic wrote

Re: TextIO and .withWindowedWrites() - filenamepolicy

2017-05-11 Thread Borisa Zivkovic
great JB, thanks I do not mind working on this - let's see if anyone else has additional input. cheers On Thu, 11 May 2017 at 16:28 Jean-Baptiste Onofré wrote: > Got it. > > Yes, agree, I think the PerWindowFilesPolicy could be the default and let > the > user provides its own policy if he wan

Re: TextIO and .withWindowedWrites() - filenamepolicy

2017-05-11 Thread Jean-Baptiste Onofré
Got it. Yes, agree, I think the PerWindowFilesPolicy could be the default and let the user provide their own policy if they want to. Regards JB On 05/11/2017 05:23 PM, Borisa Zivkovic wrote: Hi JB, yes I saw that thread - I also copied your code but did not want to pollute it with my proposa

Re: TextIO and .withWindowedWrites() - filenamepolicy

2017-05-11 Thread Borisa Zivkovic
Hi JB, yes I saw that thread - I also copied your code but did not want to pollute it with my proposal :) Well ok maybe default FilePerWindow policy for windowedWrites in TextIO does not make sense - not sure TBH... But would it make sense to promote a version of PerWindowFiles from https://git

Re: TextIO and .withWindowedWrites() - filenamepolicy

2017-05-11 Thread Jean-Baptiste Onofré
Hi Borisa, You can take a look at the other thread ("Direct runner doesn't seem to finalize checkpoint "quickly""). It's basically the same point ;) The default trigger (event-time) doesn't fire any data. I'm investigating the element timestamp and watermark. I'm also playing with that,

TextIO and .withWindowedWrites() - filenamepolicy

2017-05-11 Thread Borisa Zivkovic
Hi guys, just playing with reading data from PubSub and writing using TextIO. First thing is that it is very hard to get any output - a lot of temp files are written, but the final files are not always created. So, I am playing with triggers etc... If I do the following PCollection streamData = p.appl

Re: Direct runner doesn't seem to finalize checkpoint "quickly"

2017-05-11 Thread Jean-Baptiste Onofré
Hi, I moved forward a bit on this. Actually, the IOs seem to work fine from a reading perspective. My issue is more on window/trigger in the pipeline. First, I used FixedWindows with the default trigger (event-time based). With these windows, it seems the trigger is never executed (I have to c

[DISCUSS] Source Watermark Metrics

2017-05-11 Thread JingsongLee
Hi everyone, The source watermark metrics show the consumer latency of Source. It allows the user to know the health of the job, or it can be used for monitoring and alerting. We should have the runner report the watermark metrics rather than having the source report it using metrics. This addresses th

Re: [PROPOSAL] Running Splittable DoFn via Source API

2017-05-11 Thread Aljoscha Krettek
Yes, that additional piece of work was basically my concern. It’s a very mild concern, though, and I’m in favour of implementing SDF as a source. Best, Aljoscha > On 11. May 2017, at 01:00, Eugene Kirpichov > wrote: > > Hi, > > Aljoscha - can you clarify your concern? > > Basically, previou

Jenkins build is back to stable : beam_Release_NightlySnapshot #412

2017-05-11 Thread Apache Jenkins Server
See