Re: [PROPOSAL] MultiLineIO

2016-03-14 Thread Eugene Kirpichov
Hi Peter, Looking forward to your PR. Please note that source classes are relatively tricky to develop, so would you mind briefly explaining what your source will do here over email, so that we hash out some possible issues early rather than in PR comments? Also note that now recommend to package

Re: [PROPOSAL] MultiLineIO

2016-03-14 Thread Eugene Kirpichov
uot; (as did in the > >PubSubIO). I'm now refactoring it to use the "new style". > > > >Regards > >JB > > > >On 03/14/2016 06:47 PM, Eugene Kirpichov wrote: > >> Hi Peter, > >> Looking forward to your PR. Please note that source classes

Re: Suggestion for Writing Sink Implementation

2016-07-27 Thread Eugene Kirpichov
Hi Sumit, All reusable parts of a pipeline, including connectors to storage systems, should be packaged as PTransform's. Sink is an advanced API that you can use under the hood to implement the transform, if this particular connector benefits from this API - but you don't have to, and many

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-11 Thread Eugene Kirpichov
/_/google.com/splittabledofn - I confirmed that it can be joined without being logged into a Google account) Who'd be interested in attending, and does this time/date work for people? On Fri, Aug 5, 2016 at 10:48 AM Eugene Kirpichov <kirpic...@google.com> wrote: > Hi JB, thanks fo

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-11 Thread Eugene Kirpichov
r my understanding. > > Thanks > Regards > JB > > > > On Aug 11, 2016, 19:46, at 19:46, Eugene Kirpichov > <kirpic...@google.com.INVALID> wrote: > >Hi JB, > > > >What are your thoughts on this? > > > >I'm also thinking of having a virtual m

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-12 Thread Eugene Kirpichov
, Aparup Banerjee (apbanerj) < > apban...@cisco.com> wrote: > > > + 1, me2 > > > > > > > > > > On 8/12/16, 9:27 AM, "Amit Sela" <amitsel...@gmail.com> wrote: > > > > >+1 as in I'll join ;-) > > > > > >On

[PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-04 Thread Eugene Kirpichov
Hello Beam community, We (myself, Daniel Mills and Robert Bradshaw) would like to propose "Splittable DoFn" - a major generalization of DoFn, which allows processing of a single element to be non-monolithic, i.e. checkpointable and parallelizable, as well as doing an unbounded amount of work per

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-18 Thread Eugene Kirpichov
t; apban...@cisco.com > > > > > wrote: > > > > > + 1, me2 > > > > > > > > > > > > > > > On 8/12/16, 9:27 AM, "Amit Sela" <amitsel...@gmail.com <javascript:;>> > > > wrote: > > > > > > >+1 as in I'll

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-21 Thread Eugene Kirpichov
Onofré <j...@nanthrax.net> wrote: > Anyway, from a runner perspective, we will have kind of API (part of the > Runner API) to "orchestrate" the SDF as we discussed during the call, > right ? > > Regards > JB > > On 08/21/2016 07:24 PM, Eugene Kirpichov wrote:

Re: Should UnboundedSource provide a split identifier ?

2016-09-06 Thread Eugene Kirpichov
/io/kafka/KafkaIO.java#L636 > > On Wed, Sep 7, 2016 at 1:24 AM Eugene Kirpichov > <kirpic...@google.com.invalid> wrote: > > > Hi Amit, > > Could you explain more about why you're saying the order of splits > matters? > > AFAIK the semantics of Read.Unbounded is &q

Re: Should UnboundedSource provide a split identifier ?

2016-09-06 Thread Eugene Kirpichov
Hi Amit, Could you explain more about why you're saying the order of splits matters? AFAIK the semantics of Read.Unbounded is "read from all of the splits in parallel, checkpointing each of them independently", so their order shouldn't matter. On Tue, Sep 6, 2016 at 3:17 PM Amit Sela

Re: [REMINDER] Technical discussion on the mailing list

2016-10-06 Thread Eugene Kirpichov
Hi Daniel, Thanks for raising this. I think I was a major contributor to your frustration with the process by suggesting big changes to your IO PR. As others have said, ideally that process should have gone differently: 1) we should have had documentation on the best practices in developing IOs,

Re: Beam IO: suggestions and new features

2016-09-20 Thread Eugene Kirpichov
schema-using-reflection . But this should be probably provided as a collection of utilities, rather than as a core feature of the programming model. > > Regards > JB > > On 09/20/2016 03:59 AM, Eugene Kirpichov wrote: > > Hi! > > > > Thank you for raising these issues.

Re: Remove legacy import-order?

2016-08-23 Thread Eugene Kirpichov
Two cents: While we're at it, we could consider enforcing formatting as well (https://github.com/google/google-java-format). That's a bigger change though, and I don't think it has checkstyle integration or anything like that. On Tue, Aug 23, 2016 at 4:54 PM Dan Halperin

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-29 Thread Eugene Kirpichov
eDoFns be asked for a watermark, does this mean that watermarks > can then be "generated" at any operation? > > Cheers, > Aljoscha > > On Sun, 21 Aug 2016 at 22:34 Eugene Kirpichov <kirpic...@google.com.invalid > > > wrote: > > > Hi JB, > > >

Re: Introducing a Redistribute transform

2016-10-10 Thread Eugene Kirpichov
o letting the user specify a > > > > number here; the data is spread out among as many machines as are > > > > handling the shuffling (for N elements, there are ~N unique keys, > > > > which gets partitioned by the system to the M workers). > > > > >

Re: Placement of temporary files by FileBasedSink

2016-10-27 Thread Eugene Kirpichov
ks, > Cham > > On Thu, Oct 27, 2016 at 1:45 PM Chamikara Jayalath <chamik...@apache.org> > wrote: > > > On Thu, Oct 27, 2016 at 1:27 PM Eugene Kirpichov > > <kirpic...@google.com.invalid> wrote: > > > > Getting back to this. I noticed that the origi

Placement of temporary files by FileBasedSink

2016-10-19 Thread Eugene Kirpichov
Hello, This is a continuation of the discussion on PR https://github.com/apache/incubator-beam/pull/1050 which turned out more complex than expected. Short summary: Currently FileBasedSink, when writing to /path/to/foo (in practice, /path/to/foo-x-of-y where y is the total number of

Re: [DISCUSS] Using Verbs for Transforms

2016-10-24 Thread Eugene Kirpichov
$0.02: Deduplicate? (lends to extensions like Deduplicate.by(some key extractor function)) On Mon, Oct 24, 2016 at 1:22 PM Dan Halperin wrote: > I find "MakeDistinct" more confusing. My votes in decreasing preference: > > 1. Keep `RemoveDuplicates` name, ensure that

Re: Placement of temporary files by FileBasedSink

2016-10-20 Thread Eugene Kirpichov
> > > Siblings fundamentally will not work. Consider the following > > perfectly-valid output path: s3://bucket/file-SSS-NNN.txt . A sibling > would > > be a new bucket, so not guaranteed to exist. > > > > On Thu, Oct 20, 2016 at 1:57 AM, Chamik

Re: Flink runner. Wrapper for DoFn

2016-11-18 Thread Eugene Kirpichov
Hi Alexey, In general, things like establishing connections and initializing caches are better done in @Setup and @TearDown methods, rather than @StartBundle and @FinishBundle, because DoFn's can be reused between bundles and this way you get more benefit from reuse. Bundles can be pretty small,

Re: Specifying type arguments for generic PTransform builders

2016-10-13 Thread Eugene Kirpichov
t; methods pointless, too :-) > > One of the better examples of the pattern of "ready-to-go" builders - > though not a transform - is WindowingStrategy (props to Ben), where there > are intelligent defaults and you can override them, and it tracks whether > or not eac

Re: Introducing a Redistribute transform

2016-10-12 Thread Eugene Kirpichov
robably wouldn't provide #1). It is also potentially non-optimal: there can exist better implementations of each of these 3 semantics not involving a global group-by-key. It's not clear yet what to do about all this. Thoughts welcome. On Tue, Oct 11, 2016 at 11:35 AM Kenneth Knowles <k...@google.com

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-10-12 Thread Eugene Kirpichov
t; > On Mon, 29 Aug 2016 at 18:00 Eugene Kirpichov <kirpic...@google.com.invalid > > > wrote: > > > Hi Aljoscha, > > > > The watermark reporting is done via > > ProcessContinuation.futureOutputWatermark, at the granularity of > returning > > f

Re: New DoFn and WindowedValue/WinowingInternals

2016-12-11 Thread Eugene Kirpichov
Hi Amit, I'll comment in more detail later, but meanwhile please take a look at https://github.com/apache/incubator-beam/pull/1565 There is a small amount of relevant changes to spark runner. Take a look at implementation of SplittableParDo (already committed) in particular ProcessFn and it's

Re: New DoFn and WindowedValue/WinowingInternals

2016-12-11 Thread Eugene Kirpichov
ParDo*, *Window.Bound*, and > > *GroupAlsoByWindow* requires a custom implementation by per runner as > they > > are not handled by DoFn anymore, right ? > > > > On Sun, Dec 11, 2016 at 3:42 PM Eugene Kirpichov > > <kirpic...@google.com.invalid> wrote: &g

Re: [VOTE] Release 0.4.0-incubating, release candidate #1

2016-12-15 Thread Eugene Kirpichov
There is one more data-loss type error, a fix for which should go into the release. https://github.com/apache/incubator-beam/pull/1620 On Thu, Dec 15, 2016 at 10:42 AM Davor Bonaci wrote: > I think we should build another RC. > > Two issues: > * Metrics issue that JB pointed

Re: [DISCUSS] ExecIO

2016-12-07 Thread Eugene Kirpichov
lection of one or more commands, yielding > > their outputs). Especially for the case of a Read, which in this case > > is not splittable (initially or dynamically) and always produces a > > single element--feels much more like a Map to me. > > > > On Tue, Dec 6, 201

Re: PCollection to PCollection Conversion

2016-11-29 Thread Eugene Kirpichov
Hi JB, Depending on the scope of what you want to ultimately accomplish with this extension, I think it may make sense to write a proposal document and discuss it. If it's just a collection of utility DoFn's for various well-defined source/target format pairs, then that's probably not needed, but

Re: [DISCUSS] ExecIO

2016-12-05 Thread Eugene Kirpichov
Hi JB, Thanks for bringing this to the mailing list. I also think that this is useful in general (and that use cases for Beam are more than just classic bigdata), and that there are interesting questions here at different levels about how to do it right. I suggest to start with the highest-level

Re: Questions about coders

2016-11-30 Thread Eugene Kirpichov
Thanks Kenn! This is very helpful. On Wed, Nov 30, 2016 at 4:13 PM Kenneth Knowles <k...@google.com.invalid> wrote: > On Wed, Nov 30, 2016 at 3:52 PM, Eugene Kirpichov < > kirpic...@google.com.invalid> wrote: > > > Hello, > > > > Do we have anywhere a se

Re: [DISCUSS] ExecIO

2016-12-06 Thread Eugene Kirpichov
ies matching > those requirements. > Then, the end user is not more responsible: the runner will try to > define if the pipeline can be executed, and where a DoFn has to be run > (on which worker). > > For me, it's two different levels where 2 is smarter but 1 can also make > s

Re: Naming and API for executing shell commands

2016-12-07 Thread Eugene Kirpichov
My question is where do we put such ShellCommands extension ? As a > module under IO ? As a new extensions module ? > > Regards > JB > > On 12/07/2016 06:24 PM, Eugene Kirpichov wrote: > > Branched off into a separate thread. > > > > How about ShellCommands.execute().wi