Re: Let's make Beam transforms comply with PTransform Style Guide

2017-01-31 Thread Eugene Kirpichov
On Mon, Jan 30, 2017 at 7:56 PM Dan Halperin <dhalp...@google.com.invalid> wrote: > On Mon, Jan 30, 2017 at 5:42 PM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > Hello, > > > > The PTransform Style Guide is live > > https://beam

Re: TextIO binary file

2017-01-31 Thread Eugene Kirpichov
should be removed. > > Is there merit for Beam to supply an IO which does allow writing objects to > a file using Beam coders and Beam FS (To write these files to > GS/Hadoop/Local)? > > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov > <kirpic...@google.com.invalid> wrote

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Eugene Kirpichov
Hi Etienne, Could you post some snippets of how your transform is to be used in a pipeline? I think that would make it easier to discuss on this thread and could save a lot of churn if the discussion ends up leading to a different API. On Thu, Jan 26, 2017 at 8:29 AM Etienne Chauchot

Re: TextIO binary file

2017-01-30 Thread Eugene Kirpichov
The use of Coder in TextIO is a long-standing design issue, because coders are not intended for general-purpose conversion of things to and from bytes; their only proper use is letting the runner materialize and restore objects if the runner thinks it's necessary. IMO it should have been

Re: TextIO binary file

2017-01-30 Thread Eugene Kirpichov
to text source first). Probably coder > parameter should not be configurable for text source/sink and they should > be updated to only read/write UTF-8 encoded strings. > > - Cham > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov > <kirpic...@google.com.invalid> w

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Eugene Kirpichov
{ > ... > } > })); > > If batchSize (overridden by user) returns a positive long, then DoFn can > batch with this size. > > Regards > JB > > On 01/26/2017 05:38 PM, Eugene Kirpichov wrote: > > Hi Etienne, > > > > Could you post some snippets of

Re: Consistent Placement

2017-01-27 Thread Eugene Kirpichov
Hi Jesse, +1 to Dan - I think it makes sense to return the specific type corresponding to the given transform (e.g. returning Combine.Globally from Combine.globally()), because it very often serves as a builder for adding more parameters. You mentioned users extending transform classes. I

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-01-26 Thread Eugene Kirpichov
can happen entirely within a bundle -- if you get to the end of the bundle > and only have 5 elements, even if 5 < N, process that as a batch (rather > than shifting it somewhere else). > > On Thu, Jan 26, 2017 at 3:01 PM Robert Bradshaw > <rober...@google.com.invalid> >
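
The per-bundle batching described in this snippet can be sketched in plain Java. This is only an illustration under stated assumptions: `BundleBatcher`, `add`, `finishBundle` and `batchBundle` are hypothetical names and are not part of the Beam API. The sketch emits a batch whenever N elements have accumulated and, at the end of the bundle, emits whatever is left even if it is smaller than N.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper (not a Beam class): buffers elements and hands them to
// a sink in batches of at most maxBatchSize.
class BundleBatcher<T> {
    private final int maxBatchSize;
    private final Consumer<List<T>> sink;
    private List<T> buffer = new ArrayList<>();

    BundleBatcher(int maxBatchSize, Consumer<List<T>> sink) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
    }

    // Called once per element; flushes as soon as a full batch is available.
    void add(T element) {
        buffer.add(element);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    // Called at the end of the bundle; emits the leftover partial batch
    // instead of carrying it over to another bundle.
    void finishBundle() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        sink.accept(buffer);
        buffer = new ArrayList<>();
    }

    // Convenience for the example: treats the given list as one bundle.
    static <T> List<List<T>> batchBundle(List<T> bundle, int maxBatchSize) {
        List<List<T>> batches = new ArrayList<>();
        BundleBatcher<T> batcher = new BundleBatcher<>(maxBatchSize, batches::add);
        for (T element : bundle) {
            batcher.add(element);
        }
        batcher.finishBundle();
        return batches;
    }

    public static void main(String[] args) {
        // Seven elements with N = 3: two full batches and one partial batch.
        System.out.println(batchBundle(Arrays.asList(1, 2, 3, 4, 5, 6, 7), 3));
    }
}
```

In an actual DoFn the two hooks would presumably map onto the per-element and end-of-bundle callbacks; that mapping is left out here to keep the sketch free of Beam dependencies.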

Re: Conceptually, what are bundles?

2017-01-25 Thread Eugene Kirpichov
One more thing. I think ideally, bundles should not leak into the model at all - e.g. ideally, startBundle/finishBundle methods in DoFn should not exist. They interact poorly with windowing. The proper way to address what is commonly done in these methods is either Setup/Teardown methods, or a

Pipeline termination in the unified Beam model

2017-03-01 Thread Eugene Kirpichov
Raising this onto the mailing list from https://issues.apache.org/jira/browse/BEAM-849 The issue came up: what does it mean for a pipeline to finish, in the Beam model? Note that I am deliberately not talking about "batch" and "streaming" pipelines, because this distinction does not exist in the

Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-03-27 Thread Eugene Kirpichov
it's worth discussing anything there over a videocall? Apex: Thomas - how about same time next Monday? (9:30am PST) Who else would like to join? On Mon, Mar 20, 2017 at 9:59 AM Eugene Kirpichov <kirpic...@google.com> wrote: > Meeting notes: > Me and Thomas had a video call and we pretty

Re: [DISCUSS] Change "RunnableOnService" To A More Intuitive Name

2017-03-27 Thread Eugene Kirpichov
Kenn - can you also remind everybody what the difference is between @NeedsRunner and @ValidatesRunner, and when one should use one or the other? I always find myself confused about this, especially in code reviews. On Mon, Mar 27, 2017 at 11:32 AM Kenneth Knowles

Re: [PROPOSAL] @OnWindowExpiration

2017-03-28 Thread Eugene Kirpichov
Kenn, can you quote some use cases for this, to make it more clear what are the consequences of having this API in this form? I recall that one of the main use cases was batching DoFn, right? On Tue, Mar 28, 2017 at 1:37 PM Kenneth Knowles wrote: > On Tue, Mar 28, 2017

Re: Style: how much testing for transform builder classes?

2017-03-27 Thread Eugene Kirpichov
> > > validation happens in the validate method rather than at > > construction. > > > I > > > > > understand that the reasoning here is that we want to support > > > > constructing > > > > > them with options in any order and using Builder pattern can

Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-03-17 Thread Eugene Kirpichov
s the Apex changes? > > Thanks > > On Tue, Mar 14, 2017 at 7:27 PM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > Hi! Please feel free to join this call, but I think we'd be mostly > > discussing how to do it in the Spark runner in particular; s

Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-03-17 Thread Eugene Kirpichov
ther videotalk in 2 weeks to check on progress/issues. Thanks all! On Fri, Mar 17, 2017 at 8:29 AM Eugene Kirpichov <kirpic...@google.com> wrote: > Yes, Monday morning works! How about also 8am PST, same Hangout link - > does that work for you? > > On Fri, Mar 17, 2017 at 7:50 AM T

Re: why Source#validate() is not declared to throw any exception

2017-03-21 Thread Eugene Kirpichov
I think it would make sense to allow the validate method to throw Exception. On Mon, Mar 20, 2017, 11:21 PM Jean-Baptiste Onofré wrote: > Hi Ted, > > validate() is supposed to throw runtime exception (IllegalStateException, > RuntimeException, ...) to "traverse" the executor.

Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-03-15 Thread Eugene Kirpichov
/2017 07:45 PM, Amit Sela wrote: > > I have dinner at 9am.. which doesn't sound like a real thing if you > forget > > about timezones :) > > How about 8am ? or something later like 12pm mid-day ? > > Apex can take the 9am time slot ;-) > > > > On Wed, Mar 15, 201

Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-03-20 Thread Eugene Kirpichov
call and Ismaël also provided to me some updates. > > I will sync with Amit on Spark runner and start to experiment and test SDF > on > the JMS IO. > > Thanks ! > Regards > JB > > On 03/17/2017 04:36 PM, Eugene Kirpichov wrote: > > Meeting notes from today's ca

Re: Style: how much testing for transform builder classes?

2017-03-14 Thread Eugene Kirpichov
provide, as long as it is not a burden for adding > > new tests. > > > > > > > > On 11 March 2017 at 14:16, Jean-Baptiste Onofré <j...@nanthrax.net> wrote: > > > > > +1 > > > > > > Testing is always hard, especially to have con

Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-03-14 Thread Eugene Kirpichov
ork for me also. Please let me know if you want to keep the > Apex related discussion separate or want me to join this call. > > Thanks, > Thomas > > > On Tue, Mar 14, 2017 at 1:56 PM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > Sure

Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-04-01 Thread Eugene Kirpichov
> > > > Hi, > > > > sorry for being so slow but I’m currently traveling. > > > > > > > > The Flink code works but I think it could benefit from some > refactoring > > > > to make the code nice and maintainable. > > > > > > >

Re: Should you always have a separate PTransform class for a new transform?

2017-04-20 Thread Eugene Kirpichov
(returning a Combine.Globally), then afterwards this change *would* be incompatible and would have to be done before stable release. So I'd suggest to hold off the PR until we reach consensus here. On Wed, Feb 8, 2017 at 2:09 PM Eugene Kirpichov <kirpic...@google.com> wrote: > I think

Re: Let's make Beam transforms comply with PTransform Style Guide

2017-04-20 Thread Eugene Kirpichov
> > On Thu, 20 Apr 2017 at 11:18 AM, Jean-Baptiste Onofré <j...@nanthrax.net> > > wrote: > > > >> Gonna take a look on the pending IOs. > >> > >> Thanks ! > >> Regards > >> JB > >> > >> On 04/19/2017 10:05 PM, Eugene Ki

Re: Let's make Beam transforms comply with PTransform Style Guide

2017-04-19 Thread Eugene Kirpichov
, 2017 at 10:47 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote: > Hi Eugene, > > thanks for the update. I'm volunteer to tackle some those IOs (and make > them > conform with PTransform style guide). I'm pretty sure other people will > jump on ;) > > Regards > JB >

Re: Naming of Combine.Globally

2017-04-18 Thread Eugene Kirpichov
...Curiously enough, ReduceFn is by far the closest of all these to a sequential fold. It is also internal (runner-facing rather than user-facing). On Tue, Apr 18, 2017 at 8:27 AM Dan Halperin wrote: > Great discussion! As Aljoscha says, Fold, Reduce, and Combine

Re: Let's make Beam transforms comply with PTransform Style Guide

2017-03-03 Thread Eugene Kirpichov
> know each other and get a lot done. Would there be any interest in this? > > On Wed, Mar 1, 2017 at 11:30 AM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > Hey all, > > > > First couple rounds of fixes are in. Thanks Aviem Zur for contributin

Style: how much testing for transform builder classes?

2017-03-10 Thread Eugene Kirpichov
Hello, I've seen a pattern in a couple of different transforms (IOs) where we, I think, spend an excessive amount of code unit-testing the trivial builder methods. E.g. a significant part of

Re: [VOTE] Release 0.6.0, release candidate #2

2017-03-13 Thread Eugene Kirpichov
+Stas Levin <stasle...@apache.org> +Thomas Groh <tg...@google.com> On Mon, Mar 13, 2017 at 5:30 PM Eugene Kirpichov <kirpic...@google.com> wrote: > https://issues.apache.org/jira/browse/BEAM-1712 might be a release > blocker. > > On Mon, Mar 13, 2

Re: [VOTE] Release 0.6.0, release candidate #2

2017-03-13 Thread Eugene Kirpichov
https://issues.apache.org/jira/browse/BEAM-1712 might be a release blocker. On Mon, Mar 13, 2017 at 4:53 PM Ahmet Altay wrote: > Thank you for all the comment so far. > > On Mon, Mar 13, 2017 at 4:23 PM, Ted Yu wrote: > > > bq. I would prefer

Re: [VOTE] Release 0.6.0, release candidate #2

2017-03-13 Thread Eugene Kirpichov
+Aljoscha Krettek <aljos...@data-artisans.com> On Mon, Mar 13, 2017 at 5:30 PM Eugene Kirpichov <kirpic...@google.com> wrote: > +Stas Levin <stasle...@apache.org> +Thomas Groh <tg...@google.com> > > On Mon, Mar 13, 2017 at 5:30 PM Eugene Kirpichov <kirp

Re: [VOTE] Release 0.6.0, release candidate #2

2017-03-13 Thread Eugene Kirpichov
Conclusion (see JIRA): Not a release blocker (but still a bug in TestPipeline). On Mon, Mar 13, 2017 at 5:40 PM Eugene Kirpichov <kirpic...@google.com> wrote: > +Aljoscha Krettek <aljos...@data-artisans.com> > > On Mon, Mar 13, 2017 at 5:30 PM Eugene Kirpichov <kirpi

Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-03-13 Thread Eugene Kirpichov
straight forward for the > > Spark runner after the work on read from UnboundedSource and after > > GroupAlsoByWindow, but from my experience such a call could move us > > forward > > fast enough. > > > > On Mon, Mar 13, 2017, 20:37 Eugene Kirpichov <kirpi

Re: Interest in a (virtual) contributor meeting?

2017-03-06 Thread Eugene Kirpichov
I'd like to join too, depending on the time. Thanks! On Mon, Mar 6, 2017 at 8:46 AM Etienne Chauchot wrote: > Very good idea! Thanks for the initiative > > Definitely +1 ! > > I'll be there > > Etienne > > > > Le 22/02/2017 à 04:18, Davor Bonaci a écrit : > > In the early

Re: Pipeline termination in the unified Beam model

2017-03-02 Thread Eugene Kirpichov
asonable > for users to want to understand and handle these cases. > > +1 > > Dan > > On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré <j...@nanthrax.net> > wrote: > > > +1 > > > > Good idea !! > > > > Regards > > JB > > >

Re: Proposed Splittable DoFn API changes

2017-04-07 Thread Eugene Kirpichov
laim(i); ++i) { c.output(KV.of(c.element(), i)); } } @GetInitialRestriction public OffsetRange getInitialRange(String element) { return new OffsetRange(0, 100); } } On Thu, Apr 6, 2017 at 3:16 PM Eugene Kirpichov <kirpic...@google.com> wrote: > FWIW,
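
The loop in this snippet follows a claim-before-process pattern over the range returned by getInitialRange. A minimal plain-Java sketch of that pattern follows, assuming a tracker with a `tryClaim` method; the snippet is truncated, so that method name, the `Tracker` class and `shrinkTo` are assumptions made for illustration and are not the Beam classes. The point is that each offset is processed only after it has been claimed, so the range can be shortened from outside while the loop is running.

```java
import java.util.ArrayList;
import java.util.List;

class ClaimLoopDemo {
    // Hypothetical stand-in for a tracker over the half-open range [from, to).
    // This is NOT the Beam tracker; it only illustrates the claim loop.
    static class Tracker {
        private final long from;
        private long to;
        private long lastClaimed;

        Tracker(long from, long to) {
            this.from = from;
            this.to = to;
            this.lastClaimed = from - 1;
        }

        long from() {
            return from;
        }

        // Returns true if offset i is still inside the range and may be processed.
        boolean tryClaim(long i) {
            if (i >= to) {
                return false;
            }
            lastClaimed = i;
            return true;
        }

        // Simulates a split: keeps [from, newTo) and returns the old end, so the
        // caller could hand [newTo, oldTo) to someone else.
        long shrinkTo(long newTo) {
            if (newTo <= lastClaimed || newTo > to) {
                throw new IllegalArgumentException("cannot shrink to " + newTo);
            }
            long oldTo = to;
            to = newTo;
            return oldTo;
        }
    }

    // Emits one output per successfully claimed offset, mirroring the shape of
    // the loop in the snippet (element paired with index).
    static List<String> process(String element, Tracker tracker) {
        List<String> out = new ArrayList<>();
        for (long i = tracker.from(); tracker.tryClaim(i); ++i) {
            out.add(element + ":" + i);
        }
        return out;
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker(0, 100);
        tracker.shrinkTo(10); // the range is shortened before processing starts
        System.out.println(process("a", tracker).size()); // prints 10
    }
}
```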

Re: Proposed API for a Whole File IO

2017-08-01 Thread Eugene Kirpichov
Hi, As mentioned on the PR - I support the creation of such an IO (both read and write) with the caveats that Reuven mentioned; we can refine the naming during code review. Note that you won't be able to create a PCollection because elements of a PCollection must have a coder and it's not possible

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-03 Thread Eugene Kirpichov
both. The suggestion is new methods > > > > >> > > > > .getDeterministicCoder(TypeDescriptor) and > > > > >> > > > > .getLexicographicCoder(TypeDescriptor). > > > > >> > > > > >

Re: Adding back PipelineRunner#apply method

2017-08-15 Thread Eugene Kirpichov
she...@gmail.com> wrote: > Hi Eugene, > > Thanks for sharing the info. That PAssertionSite tracks where an assertion > error occurred. Do you know if it is possible to get the class name and > line number where a PTransform was added? > > Thanks, > Shen > > On M

Re: Adding back PipelineRunner#apply method

2017-08-15 Thread Eugene Kirpichov
... And remember it and make available inside PCollection (which application produced this collection). On Tue, Aug 15, 2017, 8:39 AM Eugene Kirpichov <kirpic...@google.com> wrote: > In general, no - but the implementation of PAssertionSite exemplifies the > approach. I guess it cou

Re: [VOTE] Release 2.1.0, release candidate #3

2017-08-15 Thread Eugene Kirpichov
Hey all, Seems like we're missing one more affirmative vote from a PMC member (so far we have JB and Ahmet) to proceed with the release. On Mon, Aug 14, 2017 at 9:30 AM Ahmet Altay wrote: > On Mon, Aug 14, 2017 at 6:32 AM, Ismaël Mejía wrote: > > >

Re: [VOTE] Release 2.1.0, release candidate #3

2017-08-16 Thread Eugene Kirpichov
ication run successfully so we can say that spark runner passed > > the sanity tests needed. > > > > Still there is an open ticket > > https://issues.apache.org/jira/browse/BEAM-2671 which Stas is working on > > and its implications should be taken into consideration re

Re: Adding back PipelineRunner#apply method

2017-08-14 Thread Eugene Kirpichov
Hi Shen, Responding just to one part of your message - "remember the line at which the PTransform was added": take a look at https://github.com/apache/beam/pull/2247 which does this for PAssert. On Mon, Aug 14, 2017 at 7:32 PM Shen Li wrote: > In 0.5.0 or earlier releases,

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-10 Thread Eugene Kirpichov
wrote: > On Thu, Aug 3, 2017 at 6:08 PM, Eugene Kirpichov > <kirpic...@google.com.invalid> wrote: > > https://github.com/apache/beam/pull/3649 has landed. The main > contribution > > of this PR is deprecating PTransform.getDefaultOutputCoder(). > > > > Next

Re: Style of messages for checkArgument/checkNotNull in IOs

2017-08-10 Thread Eugene Kirpichov
> > > > > NPE is uninformative and this feeds into the prior two bullets: If I > see > > > "NPE on line XYZ of file ABC" I am _always_ going to file a bug against > > the > > > author of file ABC because they dereferenced null. Their fix might be >

New blog post: Splittable DoFn

2017-08-16 Thread Eugene Kirpichov
Hi all, The blog post Powerful and modular IO connectors with Splittable DoFn in Apache Beam just went live - take a look! *One of the most important parts of the Apache Beam ecosystem is its quickly growing set of connectors that

Re: Proposal and plan: new TextIO features based on SDF

2017-07-11 Thread Eugene Kirpichov
Eugene Kirpichov <kirpic...@google.com> wrote: > Thanks all. The first PR is out for review: > https://github.com/apache/beam/pull/3443 > Next work (watching for new files) is in progress, based on > https://github.com/apache/beam/pull/3360 > > On Tue, Jun 27, 2017 at 11:

Re: Proposal and plan: new TextIO features based on SDF

2017-07-11 Thread Eugene Kirpichov
aming pipeline can run "forever." So someone > watching a GCS bucket "forever" will eventually crash due to the value > getting too large. Is there any reasonable way to garbage collect this > state? > > On Tue, Jul 11, 2017 at 9:08 PM, Eugene Kirpichov < >

Re: Proposal and plan: new TextIO features based on SDF

2017-07-12 Thread Eugene Kirpichov
<re...@google.com.invalid> wrote: > As a thought experiment: could this be done by expanding the set into a > PCollection and running it through a Distinct (in the global window, > trigger every element) transform? > > On Tue, Jul 11, 2017 at 9:48 PM, Eugene Kirpichov < > kirpic..

Re: Pull requests for review

2017-07-16 Thread Eugene Kirpichov
You will not be able to merge them: an Apache Beam committer has to do it for you. The people you specified as reviewers on the first two PRs (Kenn and JB) are definitely committers so they can do it. On Sun, Jul 16, 2017, 10:30 AM Apache Enthu wrote: > Hi, > > Am i fine

Re: [PROPOSAL] Connectors for memcache and Couchbase

2017-07-10 Thread Eugene Kirpichov
I think Madhusudan's proposal does not involve reading the whole contents of the memcached cluster - it's applied to a PCollection of keys. So I'd suggest to call it MemcachedIO.lookup() rather than MemcachedIO.read(). And it will not involve the questions of splitting - however, it *will*

New template PR description spams BEAM-1234

2017-07-18 Thread Eugene Kirpichov
Hi, Was there a recent change to the default PR description on Beam? https://issues.apache.org/jira/browse/BEAM-1234 just got a couple of notifications from unrelated PRs because of the following part of the template PR description: - Format the pull request title like [BEAM-1234] Fixes bug

Re: [PROPOSAL] Connectors for memcache and Couchbase

2017-07-20 Thread Eugene Kirpichov
n for this, > I suppose we should restrict it to idempotent operations (so not > incr/decr), and eventually make users pass the expiry time in date format > so it does not get ‘overwritten’ > if a worker fails and the operation is re-executed. And about this point > it is probably a g

Re: Proposal and plan: new TextIO features based on SDF

2017-07-20 Thread Eugene Kirpichov
ted. > > On Wed, Jul 12, 2017 at 7:50 AM Reuven Lax <re...@google.com.invalid> > wrote: > > > Yes, you still need SDF to do the root expansion. However it means that > the > > state storage is now distributed. > > > > Garbage collection might be trickier with Dis

Requiring PTransform to set a coder on its resulting collections

2017-07-25 Thread Eugene Kirpichov
Hello, I've worked on a few different things recently and ran repeatedly into the same issue: that we do not have clear guidance on who should set the Coder on a PCollection: is it the responsibility of the PTransform that outputs it, or is it the responsibility of the user, or is it sometimes one and

Re: [S]FTP support as Pipeline I/O

2017-07-24 Thread Eugene Kirpichov
, Jul 24, 2017, 12:31 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote: > I guess TextIO ? ;) > > Regards > JB > > On Jul 24, 2017, 21:27, at 21:27, Eugene Kirpichov > <kirpic...@google.com.INVALID> wrote: > >What is StringIO? > > > >On Mon, Jul 2

Re: [S]FTP support as Pipeline I/O

2017-07-24 Thread Eugene Kirpichov
What is StringIO? On Mon, Jul 24, 2017 at 1:47 AM Tolsa, Camille wrote: > Not necessary with StringIO > > On 24 July 2017 at 09:47, Reuven Lax wrote: > > > This would require writing data to local files in order to upload it to > the > >

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Eugene Kirpichov
z Cwik <lc...@google.com.invalid> wrote: > I'm split between our current one pass model of pipeline construction and a > two pass model where all information is gathered and then PTransform > expansions are performed. > > > On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov <

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Eugene Kirpichov
er should > have > > the knowledge), I would prefer a strict way so users won't forget to call > > withSomethingCoder(), like > > - a Coder is required to new the PTransform; > > - or an interface 'getOutputCoder' to be implemented; > > > > On Wed, Jul 26, 2017

Re: Requiring PTransform to set a coder on its resulting collections

2017-07-26 Thread Eugene Kirpichov
, removing at least some of the mess? On Wed, Jul 26, 2017 at 8:16 PM Kenneth Knowles <k...@google.com.invalid> wrote: > +1 but maybe go ever further > > On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > Hello, >

How to test a transform against an inaccessible ValueProvider?

2017-07-19 Thread Eugene Kirpichov
Hi, Just filed JIRA https://issues.apache.org/jira/browse/BEAM-2644 Many transforms that take ValueProvider's have different codepaths for when the provider is accessible or not. However, as far as I can tell, there is no good way to construct a pipeline with PipelineOptions containing an

Re: Proposal: Watch: a transform for watching growth of sets

2017-07-19 Thread Eugene Kirpichov
()/readAll().watchForNewFiles(), which will be the first use case of SDF for a previously impossible but much wanted IO connector! I'm going to implement these tomorrow. On Thu, Jun 29, 2017 at 4:43 PM Eugene Kirpichov <kirpic...@google.com> wrote: > Hi all, > > Please take a look at this

Proposal: Watch: a transform for watching growth of sets

2017-06-29 Thread Eugene Kirpichov
Hi all, Please take a look at this short proposal that came out of implementing http://s.apache.org/textio-sdf. I think it's a nice generalization. I would welcome comments on the proposed API, or corner cases of semantics that I haven't thought about, or more generalizations, etc.

Re: Proposal and plan: new TextIO features based on SDF

2017-06-27 Thread Eugene Kirpichov
ptiste Onofré <j...@nanthrax.net> > wrote: > > > > > > Fair enough ;) > > > > > > Let me review the different Jira and provide some feedback. > > > > > > Regards > > > JB > > > > > > On Jun 24, 2017, 20:54, at 2

Re: [VOTE] Release 2.1.0, release candidate #2

2017-08-04 Thread Eugene Kirpichov
would be great to fix BEAM-2671 in the coming 36 hours. I > would like > > to submit RC3 to vote tomorrow or the day after (my time). > > > > Thanks ! > > Regards > > JB > > > > On 08/02/2017 08:24 PM, Eugene Kirpichov wrote: > >> We're down to 2 issue

Re: Proposal: Watch: a transform for watching growth of sets

2017-07-30 Thread Eugene Kirpichov
This PR has been submitted and Watch is available. Next PR https://github.com/apache/beam/pull/3607 is in review for adding TextIO.read().watchForNewFiles()! (will extend it to Avro and provide utility transforms for authors of similar IOs) On Wed, Jul 19, 2017 at 8:37 PM Eugene Kirpichov

Style of messages for checkArgument/checkNotNull in IOs

2017-07-28 Thread Eugene Kirpichov
Hey all, I think this has been discussed before on a JIRA issue but I can't find it, so raising again on the mailing list. Various IO (and non-IO) transforms validate their builder parameters using Preconditions.checkArgument/checkNotNull, and use different styles for error messages. There are 2

Re: Style of messages for checkArgument/checkNotNull in IOs

2017-07-28 Thread Eugene Kirpichov
to > >the type of exception thrown, so I'd usually prefer using that as the > >`Preconditions` method. Beyond that, +1 > > > >On Fri, Jul 28, 2017 at 11:17 AM, Eugene Kirpichov < > >kirpic...@google.com.invalid> wrote: > > > >> Hey all, > >

Re: Style of messages for checkArgument/checkNotNull in IOs

2017-07-28 Thread Eugene Kirpichov
(I don't think changing messages in existing IOs is a high priority, btw - just a nice-to-have, as long as we have guidance for future IOs) On Fri, Jul 28, 2017 at 11:37 AM Eugene Kirpichov <kirpic...@google.com> wrote: > Okay, so then let's change guidance to: > - Always use c

beam-site issues with Jenkins and MergeBot

2017-08-09 Thread Eugene Kirpichov
Hello, I've been trying to merge a PR https://github.com/apache/beam-site/pull/278 and ran into the following issues: 1) When I do "git fetch --all" on beam-site, I get an error "fatal: repository 'https://git-wip-us.apache.org/repos/asf/beam-site.git/' not found". Has the git address of the

Re: [VOTE] Release 2.1.0, release candidate #2

2017-08-07 Thread Eugene Kirpichov
> Regards > JB > > On 08/05/2017 02:37 AM, Eugene Kirpichov wrote: > > I did some more investigation on that JIRA > > https://issues.apache.org/jira/browse/BEAM-2671 and my conclusion is: > > > > We need to postpone that JIRA to 2.2.0 and finalize release 2.1.0

Re: Should Pipeline wait till all processing time timers fire before exit?

2017-07-25 Thread Eugene Kirpichov
Yes, and I think in this case the pipeline should never transition to DONE. On Tue, Jul 25, 2017 at 3:42 PM Robert Bradshaw wrote: > I generally agree, but it's unclear what to do with timers that are > scheduled during the execution of existing timers. (For

Re: Style: how much testing for transform builder classes?

2017-08-18 Thread Eugene Kirpichov
() method generally has effect (is not ignored), using TestPipeline and PAssert Does this sound reasonable? On Mon, Mar 27, 2017 at 2:28 PM Eugene Kirpichov <kirpic...@google.com> wrote: > From Ismael's and Dan's comments, I think I agree that there's a class of > easy-to-make errors that

Re: [VOTE] Release 2.1.0, release candidate #3

2017-08-18 Thread Eugene Kirpichov
>> > >>> Still there is an open ticket > >>> https://issues.apache.org/jira/browse/BEAM-2671 which Stas is working > on > >>> and its implications should be taken into consideration regarding the > >>> release. > >>> > >>>

Re: beam-site issues with Jenkins and MergeBot

2017-08-17 Thread Eugene Kirpichov
I don't know how to test or deploy it. On Thu, Aug 10, 2017 at 3:32 PM Jason Kuster <jasonkus...@google.com> wrote: > Investigating mergebot outage currently. Apologies for the downtime. > > On Wed, Aug 9, 2017 at 9:55 PM, Eugene Kirpichov <kirpic...@google.com> > wrot

Re: Proposal: adding a built-in I/O source for VCF files

2017-08-17 Thread Eugene Kirpichov
I filed JIRA https://issues.apache.org/jira/browse/BEAM-2776 for general-purpose support of reading headers in TextIO. On Thu, Aug 17, 2017 at 4:19 PM Jean-Philippe Martin wrote: > ... and SAM. I'm working on a SAM parser that uses TextIO. For now I just > add a

Re: Proposal: adding a built-in I/O source for VCF files

2017-08-16 Thread Eugene Kirpichov
+Chamikara Jayalath Also you may find useful the recent discussion on WholeFileIO https://lists.apache.org/thread.html/6ea193b7178f8ab44de5562bfdd94dc3fe740bc440e8a05e533e40cf@%3Cdev.beam.apache.org%3E https://github.com/apache/beam/pull/3543 (I think bulk of discussion

Proposal: file-based IOs should support readAllMatches()

2017-08-18 Thread Eugene Kirpichov
Hi all, I've been adding new features to TextIO and AvroIO recently, see e.g. https://github.com/apache/beam/pull/3725. The features are: - withHintMatchesManyFiles() - readAll() that reads a PCollection of filepatterns - configurable treatment of filepatterns that match no files -

Re: [PROPOSAL] Apache Hive connector

2017-05-11 Thread Eugene Kirpichov
Thanks Seshadri! This seems to have a great deal of copy-paste from HadoopInputFormatIO. Is it possible to instead implement this connector as a wrapper around it, rather than copy-paste? On Thu, May 11, 2017 at 4:41 PM Seshadri Raghunathan wrote: > Hi all, > > Here is a

Re: [PROPOSAL] Running Splittable DoFn via Source API

2017-05-10 Thread Eugene Kirpichov
> I just have a little question: are we blocked to move forward for the > support in > the runners or it's just a question of focus ? > > I think we could focus on this after the first stable release. > > Thought ? > > Regards > JB > > On 05/01/2017 07:22 AM, Eu

Re: [PROPOSAL] Apache Hive connector

2017-05-12 Thread Eugene Kirpichov
s solution heavily borrows on > HadoopInputFormatIO > > with a tweak for HCatalog (and related parameters). I will try to > re-use HadoopInputFormatIO > > rather than the current approach. > > > > On Thu, May 11, 2017 at 4:44 PM, Eugene Kirpichov < > > kirpic

Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-06-20 Thread Eugene Kirpichov
homas > > > On Sat, Apr 1, 2017 at 1:17 AM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > Hey all, > > > > The Flink PR has been merged, and thus - Flink becomes the first > > distributed runner to support Splittable DoFn!!! > > Th

Bundling multiple TestPipeline tests into one pipeline

2017-06-22 Thread Eugene Kirpichov
Hi folks and especially runner developers, https://issues.apache.org/jira/browse/BEAM-2506 - quoting from there: Currently ValidatesRunner test suites run 1 pipeline per unit test. That's a lot of small pipelines, and consumes a lot of resources especially in case of a pretty heavyweight runner

Re: Bundling multiple TestPipeline tests into one pipeline

2017-06-22 Thread Eugene Kirpichov
th Knowles <k...@apache.org> wrote: > This is a great idea! Your suggestion to do it via a JUnit test runner > makes it very concrete. > > Kenn > > On Thu, Jun 22, 2017 at 3:27 PM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > Hi f

Re: Proposal and plan: new TextIO features based on SDF

2017-06-24 Thread Eugene Kirpichov
> Thanks Eugene > > I will pick up some. > > Regards > JB > > On Jun 24, 2017, 20:00, at 20:00, Eugene Kirpichov > <kirpic...@google.com.INVALID> wrote: > >Filed JIRAs for the proposed features and linked with the doc: > >https://issues.apache.org/jir

Re: [New Proposal] Hive connector using native api

2017-05-23 Thread Eugene Kirpichov
From the point of view of general source/sink development, this code looks reasonable, except for a few violations of https://beam.apache.org/contribute/ptransform-style-guide/ (mainly around https://beam.apache.org/contribute/ptransform-style-guide/#runtime-errors-and-data-consistency) and other

Re: Beam Example 2.0 Update

2017-05-18 Thread Eugene Kirpichov
Looks great, thanks Jesse! On Thu, May 18, 2017 at 4:59 PM Jesse Anderson wrote: > Could I get a pair of eyeballs (or more) to look over the 2.0 updates I > made to the Beam example? This is the commit > < >

How can I disable running Python SDK tests when testing my Java change?

2017-05-18 Thread Eugene Kirpichov
I've noticed that when I run "mvn verify", most of the time when I look at the screen it's running Python tests. Indeed, the Reactor Summary says: ... [INFO] Apache Beam :: SDKs :: Python .. SUCCESS [11:56 min] ... [INFO] Total time: 12:03 min (Wall Clock) i.e. it's clearly

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-10 Thread Eugene Kirpichov
To elaborate a bit on what JB said: Suppose the table has 1,000,000 rows, and suppose you split it into 1000 bundles, 1000 rows per bundle. Does Aurora provide an API that allows efficiently reading the bundle containing rows 999,000-1,000,000, that does not involve reading and throwing away the
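
The arithmetic behind this question can be sketched as follows. `RowRanges` and `split` are hypothetical names used only for illustration. The sketch computes the half-open row ranges for the bundles; it says nothing about whether the database can seek to the start of a range without scanning the rows before it, which is exactly the question being asked.

```java
import java.util.ArrayList;
import java.util.List;

class RowRanges {
    // Splits [0, totalRows) into numBundles contiguous half-open ranges
    // {start, end}. Any remainder is spread over the first few bundles.
    static List<long[]> split(long totalRows, int numBundles) {
        if (totalRows < 0 || numBundles <= 0) {
            throw new IllegalArgumentException("invalid totalRows or numBundles");
        }
        List<long[]> ranges = new ArrayList<>();
        long base = totalRows / numBundles;
        long remainder = totalRows % numBundles;
        long start = 0;
        for (int i = 0; i < numBundles; i++) {
            long size = base + (i < remainder ? 1 : 0);
            ranges.add(new long[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<long[]> ranges = split(1_000_000L, 1000);
        long[] last = ranges.get(ranges.size() - 1);
        // The last bundle covers the rows the message asks about.
        System.out.println(last[0] + "-" + last[1]); // prints 999000-1000000
    }
}
```

If reading the last range costs as much as reading the whole table, 1000 bundles amount to roughly 1000 partial scans, so the split only pays off when the source supports an efficient seek.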

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Eugene Kirpichov
20", because both of these queries are in essence asking to give you 10 arbitrary rows from the table. On Tue, Jun 13, 2017 at 4:38 PM Eugene Kirpichov <kirpic...@google.com> wrote: > Thanks Madhusudan. Please note that in your case, likely, the time was > dominated by shipp

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Eugene Kirpichov
r benchmarks account for this too. On Tue, Jun 13, 2017 at 4:40 PM Eugene Kirpichov <kirpic...@google.com> wrote: > Most likely the identical performance you observed for "limit" clause is > because you are not sorting the rows. Without sorting, a "limit" query is > meaningl

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Eugene Kirpichov
_schema.tables. It doesn't have to even > access the actual table. > Please, give us more time to provide more on bench marking. > > > Madhu Borkar > > On Sat, Jun 10, 2017 at 10:51 PM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > To elab

Re: What's the easiest way for an application to convert an Iterable to an UnboundedSource

2017-04-29 Thread Eugene Kirpichov
Clarification: likely you meant TestStream ? On Sat, Apr 29, 2017 at 10:03 PM Jean-Baptiste Onofré wrote: > Hi, > > In addition of Eugene and Jesse's answers,

Re: Dynamic file-based sinks

2017-05-24 Thread Eugene Kirpichov
Hmm, on one hand this looks syntactically very appealing, on the other hand, it's icky to have a function return a PTransform at runtime, only to have some information be immediately extracted from that transform. Moreover, not all TextIO.Write transforms will be legal to return - e.g. most likely

Re: Using side inputs in any user code via thread-local side input accessor

2017-09-13 Thread Eugene Kirpichov
a composite transform rather than to individual DoFn's). > > On Wed, Sep 6, 2017 at 11:10 AM Eugene Kirpichov > <kirpic...@google.com.invalid> wrote: > > > Hi, > > > > On Wed, Sep 6, 2017 at 10:55 AM Kenneth Knowles <k...@google.com.invalid>

Re: PBegin, PDone

2017-09-14 Thread Eugene Kirpichov
For doing something before starting the pipeline, can you do it in the main program? The only disadvantage I can see is that it wouldn't be amenable to using templates (ValueProvider's) - is that the blocker? For doing something after a transform finishes processing a window of a PCollection - we

Re: multiple PCollections

2017-09-16 Thread Eugene Kirpichov
com> wrote: > i am using batch, since streaming cannot be done with partitions with > old data more than 30 days. > the question is how can i catch the exception in the pipeline so that > other collections do not fail > > On Fri, Sep 15, 2017 at 7:37 PM, Eugene Kirpichov > <

Re: TikaIO concerns

2017-09-19 Thread Eugene Kirpichov
On Tue, Sep 19, 2017 at 5:13 AM Jean-Baptiste Onofré wrote: > Hi Sergey, > > as discussed together during the review, I fully understand the choices > you did. > > Your plan sounds reasonable. Thanks ! > > Generally speaking, in order to give visibility and encourage >

Re: TikaIO concerns

2017-09-19 Thread Eugene Kirpichov
Hi, Replies inline. On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin wrote: > Hi All > > This is my first post to the dev list, I work for Talend, I'm a Beam > novice, Apache Tika fan, and thought it would be really great to try and > link both projects together, which

Re: TikaIO concerns

2017-09-21 Thread Eugene Kirpichov
t; >> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf > >> > >> try it too if you get a chance. > >> > >> (and I can imagine not all PDFs/etc representing the 'story' but can > >> be for ex a log-like content too) > >> > >> That said, I do
