Re: (java) stream & beam?

2018-03-17 Thread Robert Bradshaw
Yes. The very original Python API didn't have GBK, just a
lambda-parameterized groupBy.

On Sat, Mar 17, 2018, 12:21 AM Romain Manni-Bucau wrote:

> Gbk can be fluent if you pass a key extractor lambda ;)

Re: (java) stream & beam?

2018-03-17 Thread Romain Manni-Bucau
Gbk can be fluent if you pass a key extractor lambda ;)
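
For illustration, a minimal sketch of what such a fluent groupBy could look
like (hypothetical: FluentCollection and groupBy are not SDK API; only
SerializableFunction and KV are real Beam types):

import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.KV;

// Hypothetical fluent surface: any element type T can be grouped by
// handing over a serializable key extractor, instead of requiring the
// KV<K, V> element type that the raw GroupByKey transform demands.
interface FluentCollection<T> {
  <K> FluentCollection<KV<K, Iterable<T>>> groupBy(
      SerializableFunction<T, K> keyExtractor);
}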


Re: (java) stream & beam?

2018-03-16 Thread Jean-Baptiste Onofré
Big +1

Regards
JB

On 16 Mar 2018 at 15:59, Reuven Lax wrote:
>BTW while it's true that raw GBK can't be fluent (due to constraints on
>element type), once we have schema support we can introduce groupByField,
>and that can be fluent.

Re: (java) stream & beam?

2018-03-16 Thread Reuven Lax
BTW while it's true that raw GBK can't be fluent (due to constraints on
element type), once we have schema support we can introduce groupByField,
and that can be fluent.
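
A sketch of that idea (hypothetical: schema support had not shipped at this
point, and groupByField and SchemaCollection are assumptions, not SDK API):

// With schema support, grouping can name a field instead of constraining
// the element type to KV<K, V>, so the call chains fluently, e.g.
// purchases.groupByField("country").
interface SchemaCollection {
  SchemaCollection groupByField(String fieldName);
}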



Re: (java) stream & beam?

2018-03-15 Thread Romain Manni-Bucau
On 15 Mar 2018 at 07:50, "Robert Bradshaw" wrote:

On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau wrote:

> On 15 Mar 2018 at 06:52, "Robert Bradshaw" wrote:

>> The stream API was looked at way back when we were designing the API;
one of the primary reasons it was not further pursued at the time was the
demand for Java 7 compatibility. It is also much more natural with lambdas,
but unfortunately the Java compiler discards types in this case, making
coder inference impossible. It is still interesting to explore, and I've been
toying with using this wrapping method for other applications
(specifically, giving a Pandas Dataframe API to PCollections in Python).

>> There's a higher level question lingering here about making things more
fluent by putting methods on PCollections in our primary API. It was
somewhat of an experiment to go the very pure approach of *everything*
being expressed as a PTransform, and this is not without its disadvantages,
and (gasp) may be worth revisiting. In particular, some things that have
changed in the meantime are

>> * The Java SDK is no longer *the* definition of the model. The model has
been (mostly) formalized in the portability work, and the general Beam
concepts and notion of PTransform are much more widely fleshed out and
understood.

> This is wrong for all Java users, who are still the mainstream. It is
important to keep that in mind, and even if I know the portable API is
something important for you,

I think you misunderstood me. My point is that it is now much easier
disentangle the essence of the Beam model (reified in part in the portable
API) from the Java API itself (which may evolve more independently, whereas
formerly syntactic sugar here would be conflated with core concepts).


Oh ok. Agree.


> it is something which should stay on top of runners and their API, which
means Java for all but one.

> All that to say that the most common default is java.

I don't think it'll be that way for long; scala alone might give Java a run
for its money.


Scala will probably need its own API but also generally goes with the
best-of-breed approach, which is the opposite of Beam by design (vendor
portability gives much more important guarantees but is not always the
best), so let's see how it goes :).


> However I agree each language should have its natural API and should
absolutely not just port over the same API. Goal being indeed to respect
its own philosophy.

> Conclusion: Java needs a more expressive, stream-like API.

> There is another way to see it: catching up on API debt compared to
competing APIs.


>> * Java 8's lambdas, etc. allow for much more succinct
operations, which makes the relative ratio of boilerplate of using apply
that much higher. This is one of the struggles we had with the Python API,
pcoll.apply(Map(lambda ...)) made the "apply" feel *very* redundant. pcoll
| Map(...) is at least closer to pcoll.map(...).
>> * With over two years of experience with the 100% pure approach, we
still haven't "gotten used to it" enough that adding such methods isn't
appealing. (Note that by design adding such methods later is always easier
than taking them away, which was one justification for starting at the
extreme point).

>> Even if we go this route, there's no need to remove apply, and

>> pcoll
>>  .map(...)
>>  .apply(...)
>>  .flatMap(...)

>> flows fairly well (with map/flatMap being syntactic sugar to apply).

>> Agree, but the issue with that is you lose the natural approach and it
is harder to rework it, whereas having an API on top of "apply" lets you
keep both concerns split.

Having multiple APIs is undesirable; best to have one unless there are hard
constraints that prevent it (e.g. if the two would be jarringly
inconsistent, or one is forced by an interface, etc.)

>> Also pcollection api is what is complex (coders, sides, ...) and what I
hope we can hide behind another API.

I'd like to simplify things as well.

>> I think we would also have to still use apply for parameterless
operations like gbk that place constraints on the element types. I don't
see how to do combinePerKey either (though, asymmetrically, globalCombine
is fine).

>> The largest fear I have is feature creep. There would have to be a very
clear line of what's in and what's not, likely with what's in being a very
short list (which is probably OK and would give the biggest gain, but not
much discoverability). The criteria can't be primitives (gbk is
problematic, and the most natural map isn't really the full ParDo
primitive--in fact the full ParDo might be "advanced" enough to merit
requiring apply).

> Is the previous proposal an issue (jet api)?

On first glance, StreamStage doesn't sound to me like a PCollection (mixes
the notion of operations and values), and methods like flatMapUsingContext
and hashJoin2 seem far down the slippery slope. But I haven't spent that
much time looking at 

Re: (java) stream & beam?

2018-03-14 Thread Robert Bradshaw
The stream API was looked at way back when we were designing the API; one
of the primary reasons it was not further pursued at the time was the
demand for Java 7 compatibility. It is also much more natural with lambdas,
but unfortunately the Java compiler discards types in this case, making
coder inference impossible. It is still interesting to explore, and I've been
toying with using this wrapping method for other applications
(specifically, giving a Pandas Dataframe API to PCollections in Python).
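
A concrete illustration of that workaround in today's Java SDK (a minimal
runnable sketch; MapElements.into/.via is the real API, the sample data is
made up):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class LambdaCoderExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    PCollection<String> words = p.apply(Create.of("hi", "there"));
    // The lambda's output type is erased by javac, so Beam cannot infer
    // a Coder from it; the .into(...) TypeDescriptor hands back what the
    // compiler discarded.
    PCollection<Integer> lengths = words.apply(
        MapElements.into(TypeDescriptors.integers())
            .via((String word) -> word.length()));
    p.run().waitUntilFinish();
  }
}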

There's a higher level question lingering here about making things more
fluent by putting methods on PCollections in our primary API. It was
somewhat of an experiment to go the very pure approach of *everything*
being expressed as a PTransform, and this is not without its disadvantages,
and (gasp) may be worth revisiting. In particular, some things that have
changed in the meantime are

* The Java SDK is no longer *the* definition of the model. The model has
been (mostly) formalized in the portability work, and the general Beam
concepts and notion of PTransform are much more widely fleshed out and
understood.
* Java 8's lambdas, etc. allow for much more succinct representation of
operations, which makes the relative ratio of boilerplate of using apply
that much higher. This is one of the struggles we had with the Python API,
pcoll.apply(Map(lambda ...)) made the "apply" feel *very* redundant. pcoll
| Map(...) is at least closer to pcoll.map(...).
* With over two years of experience with the 100% pure approach, we still
haven't "gotten used to it" enough that adding such methods isn't
appealing. (Note that by design adding such methods later is always easier
than taking them away, which was one justification for starting at the
extreme point).

Even if we go this route, there's no need to remove apply, and

pcoll
    .map(...)
    .apply(...)
    .flatMap(...)

flows fairly well (with map/flatMap being syntactic sugar to apply).
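
A minimal sketch of that sugar (hypothetical: Fluent is not SDK API; it only
delegates to apply, which stays available):

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.POutput;
import org.apache.beam.sdk.values.TypeDescriptor;

final class Fluent<T> {
  private final PCollection<T> pcoll;

  Fluent(PCollection<T> pcoll) { this.pcoll = pcoll; }

  // map is pure sugar: it expands to apply(MapElements...); the explicit
  // TypeDescriptor is still needed because lambdas erase the output type.
  <O> Fluent<O> map(TypeDescriptor<O> outType, SerializableFunction<T, O> fn) {
    return new Fluent<>(pcoll.apply(MapElements.into(outType).via(fn)));
  }

  // apply stays as the escape hatch, so nothing is taken away.
  <O extends POutput> O apply(PTransform<PCollection<T>, O> transform) {
    return pcoll.apply(transform);
  }
}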

I think we would also have to still use apply for parameterless operations
like gbk that place constraints on the element types. I don't see how to do
combinePerKey either (though, asymmetrically, globalCombine is fine).
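
To make the gbk constraint concrete (a sketch; GroupByKey.create() is the
real factory): Java cannot declare an instance method on PCollection<T>
that exists only when T = KV<K, V>, but a static helper can express that
bound:

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class GbkHelper {
  // The KV<K, V> bound lives in the parameter type, which is exactly
  // what a fluent method on a generic PCollection<T> cannot state.
  static <K, V> PCollection<KV<K, Iterable<V>>> groupByKey(
      PCollection<KV<K, V>> input) {
    return input.apply(GroupByKey.create());
  }
}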

The largest fear I have is feature creep. There would have to be a very
clear line of what's in and what's not, likely with what's in being a very
short list (which is probably OK and would give the biggest gain, but not
much discoverability). The criteria can't be primitives (gbk is
problematic, and the most natural map isn't really the full ParDo
primitive--in fact the full ParDo might be "advanced" enough to merit
requiring apply).

Who knows, though I still think we made the right decision to attempt
apply-only at the time, maybe I'll have to flesh this out into a new blog
post that is a rebuttal to my original one :).

- Robert





Re: (java) stream & beam?

2018-03-14 Thread Romain Manni-Bucau
Hi Jan,

The wrapping is almost exactly what I had in mind (I would pass the
expected Class to support a bit more, like in most JRE or javax APIs, but
that's a detail), but I would really try to align it with the Java Stream
API just to keep developers comfortable:
https://github.com/hazelcast/hazelcast-jet/blob/9c4ea86a59ae3b899498f389b5459d67c2b4cdcd/hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/StreamStage.java


Romain Manni-Bucau
@rmannibucau | Blog | Old Blog | Github | LinkedIn | Book


Re: (java) stream & beam?

2018-03-14 Thread Jan Lukavský

Hi all,

there are actually some steps taken in this direction - a few emails 
already went to this channel about donation of Euphoria API 
(https://github.com/seznam/euphoria) to Apache Beam. SGA has already 
been signed, currently there is work in progress for porting all 
Euphoria's features to Beam. The idea is that Euphoria could be this 
"user friendly" layer on top of Beam. In our proof-of-concept this works 
like this:

    // create input
    String raw = "hi there hi hi sue bob hi sue ZOW bob";
    List<String> words = Arrays.asList(raw.split(" "));

    Pipeline pipeline = Pipeline.create(options());

    // create input PCollection
    PCollection<String> input = pipeline.apply(
        Create.of(words)).setTypeDescriptor(TypeDescriptor.of(String.class));

    // holder of mapping between Euphoria and Beam
    BeamFlow flow = BeamFlow.create(pipeline);

    // lift this PCollection to Euphoria API
    Dataset<String> dataset = flow.wrapped(input);

    // do something with the data
    Dataset<Pair<String, Long>> output = CountByKey.of(dataset)
        .keyBy(e -> e)
        .output();

    // convert Euphoria API back to Beam
    PCollection<Pair<String, Long>> beamOut = flow.unwrapped(output);

    // do whatever with the resulting PCollection
    PAssert.that(beamOut)
        .containsInAnyOrder(
            Pair.of("hi", 4L),
            Pair.of("there", 1L),
            Pair.of("sue", 2L),
            Pair.of("ZOW", 1L),
            Pair.of("bob", 2L));

    // run, forrest, run
    pipeline.run();

I'm aware that this is not the "stream" API this thread was about, but
Euphoria also has a "fluent" package -
https://github.com/seznam/euphoria/tree/master/euphoria-fluent. This is
by no means a complete or production-ready API, but it could help solve
this dichotomy between whether to keep the Beam API as is, or introduce some
more user-friendly API. As I said, there is work in progress on this, so
if anyone could spare some time and give us a helping hand with this
porting, it would be just awesome. :-)


Jan


Re: (java) stream & beam?

2018-03-13 Thread Romain Manni-Bucau
Yep, while we can pass lambdas I guess it is fine, or we have to use proxies
to hide the mutation, but I don't think we need to be that purist to move to
a more expressive DSL.



Re: (java) stream & beam?

2018-03-13 Thread Ben Chambers
The CombineFn API has three type parameters (input, accumulator, and
output) and methods that approximately correspond to those parts of the
collector:

CombineFn.createAccumulator = supplier
CombineFn.addInput = accumulator
CombineFn.mergeAccumulators = combiner
CombineFn.extractOutput = finisher

That said, the Collector API has some minimal, cosmetic differences, such
as: CombineFn.addInput may either mutate the accumulator or return it,
while the Collector accumulator method is a BiConsumer, meaning it must
modify in place.
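
To make the mapping concrete, the same integer sum written both ways (a
sketch; Collector.of and Combine.CombineFn are real APIs, the int[]
accumulator is a choice that lets the Collector's BiConsumer mutate in
place, and a real pipeline would still need a coder for the accumulator):

import java.util.stream.Collector;
import org.apache.beam.sdk.transforms.Combine;

class SumBothWays {
  static final Collector<Integer, int[], Integer> SUM_COLLECTOR =
      Collector.of(
          () -> new int[1],                      // supplier    ~ createAccumulator
          (acc, x) -> acc[0] += x,               // accumulator ~ addInput
          (a, b) -> { a[0] += b[0]; return a; }, // combiner    ~ mergeAccumulators
          acc -> acc[0]);                        // finisher    ~ extractOutput

  static class SumFn extends Combine.CombineFn<Integer, int[], Integer> {
    @Override public int[] createAccumulator() { return new int[1]; }
    @Override public int[] addInput(int[] acc, Integer x) {
      acc[0] += x;  // may mutate and return, unlike the BiConsumer
      return acc;
    }
    @Override public int[] mergeAccumulators(Iterable<int[]> accs) {
      int[] merged = createAccumulator();
      for (int[] acc : accs) { merged[0] += acc[0]; }
      return merged;
    }
    @Override public Integer extractOutput(int[] acc) { return acc[0]; }
  }
}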


Re: (java) stream & beam?

2018-03-13 Thread Romain Manni-Bucau
It misses the collect split in 3 (supplier, accumulator, combiner) but
globally I agree.

I'd just take java Stream, remove the "client" methods or make them big-data
capable if possible, ensure all hooks are serializable to avoid hacks, and
add an unwrap to be able to access the pipeline in case we need a very
custom thing, and we are done for me.
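
A tiny sketch of the "serializable hooks" point (SerFunction is hypothetical;
only java.util.function and java.io are assumed):

import java.io.Serializable;
import java.util.function.Function;

// A Stream-like hook that is Serializable by declaration: lambdas assigned
// to it serialize without the usual (Function<T, R> & Serializable) cast.
interface SerFunction<T, R> extends Function<T, R>, Serializable {}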


Re: (java) stream & beam?

2018-03-13 Thread Ben Chambers
I think the existing rationale (not introducing lots of special fluent
methods) makes sense. However, if we look at the Java Stream API, we
probably wouldn't need to introduce *a lot* of fluent builders to get most
of the functionality. Specifically, if we focus on map, flatMap, and
collect from the Stream API, and a few extensions, we get something like:

* collection.map(DoFn) for applying a ParDo
* collection.map(SerializableFn) for Java8 lambda shorthand
* collection.flatMap(SerializableFn) for Java8 lambda shorthand
* collection.collect(CombineFn) for applying a CombineFn
* collection.apply(PTransform) for applying a composite transform. note
that PTransforms could also use serializable lambdas for definition.

(Note that GroupByKey doesn't even show up here -- it could, but that could
also be a way of wrapping a collector, as in the Java8 Collectors.groupingBy
[1].)

With this, we could write code like:

collection
  .map(myDoFn)
  .map((s) -> s.toString())
  .collect(new IntegerCombineFn())
  .apply(GroupByKey.of());
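
For comparison, the same chain desugared into today's apply-only API might
read as follows (a sketch reusing Ben's placeholders myDoFn and
IntegerCombineFn; like the fluent original it glosses over GroupByKey's
KV element-type requirement, and GroupByKey.create() is the actual factory):

collection
  .apply(ParDo.of(myDoFn))
  .apply(MapElements.into(TypeDescriptors.strings())
      .via((s) -> s.toString()))
  .apply(Combine.globally(new IntegerCombineFn()))
  .apply(GroupByKey.create());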

That said, my two concerns are:
(1) having two similar but different Java APIs. If we have a more idiomatic
way of writing pipelines in Java, we should make that the standard.
Otherwise, users will be confused by seeing "Beam" examples written in
multiple, incompatible syntaxes.
(2) making sure the above is truly idiomatic Java and that it doesn't create
any conflicts with the cross-language Beam programming model. I don't think
it does. We have (I believe) chosen to make the Python and Go SDKs idiomatic
for those languages where possible.

If this work is focused on making the Java SDK more idiomatic (and thus
easier for Java users to learn), it seems like a good thing. We should just
make sure it doesn't scope-creep into defining an entirely new DSL or SDK.

[1]
https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-


Re: (java) stream & beam?

2018-03-13 Thread Romain Manni-Bucau
Yep

I know the rationale and it makes sense, but it also raises the barrier to
entry for users and is not that smooth in IDEs, in particular for custom
code.

So I really think it makes sense to build a user-friendly API on top of the
Beam core developer API.


On 13 Mar 2018 at 18:35, "Aljoscha Krettek" wrote:

> https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
>
> On 11. Mar 2018, at 22:21, Romain Manni-Bucau 
> wrote:
>
>
>
> On 12 Mar 2018 at 00:16, "Reuven Lax" wrote:
>
> I think it would be interesting to see what a Java stream-based API would
> look like. As I mentioned elsewhere, we are not limited to having only one
> API for Beam.
>
> If I remember correctly, a Java stream API was considered for Dataflow
> back at the very beginning. I don't completely remember why it was
> rejected, but I suspect at least part of the reason might have been that
> Java streams were considered too new and untested back then.
>
>
> Coders are broken - type variables don't have bounds except Object - and
> reducers are not trivial to implement in general, I guess.
>
> However, staying close to this API can help a lot, so +1 to trying a Java
> DSL on top of the current API. It would also be neat to integrate it with
> CompletionStage :).
>
>
>
> Reuven
>
>
> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau 
> wrote:
>
>>
>>
>> On 11 Mar 2018 at 21:18, "Jean-Baptiste Onofré" wrote:
>>
>> Hi Romain,
>>
>> I remember we discussed the way to express pipelines a while ago.
>>
>> I was a fan of a "DSL" like the one we have in Camel: instead of using
>> apply(), use a dedicated form (.map(), .reduce(), etc.; AFAIR that's the
>> approach in Flume).
>> However, we agreed that the apply() syntax gives a more flexible approach.
>>
>> Using Java Stream is interesting, but I'm afraid we would hit the same
>> issue we identified when discussing a "fluent Java SDK". However, we can
>> have a Stream API DSL on top of the SDK, IMHO.
>>
>>
>> Agreed, and a Beam stream interface (copying the JDK API but making the
>> lambdas serializable to avoid the need for a cast).
>>
>> On my side I think it enables users to discover the API. If you check my
>> POC impl you quickly see the steps needed to do simple things like a map,
>> which is a first-class citizen.
>>
>> I'm also curious whether we could implement reduce with the pipeline
>> result, i.e. get an output of a batch back in the runner (client) JVM. I
>> see how to do it for longs - with metrics - but not for collect().
>>
>>
>> Regards
>> JB
>>
>>
>> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>>
>>> Hi guys,
>>>
>>> I don't know if you have already experienced using the Java Stream API
>>> as a replacement for the pipeline API, but I did some tests:
>>> https://github.com/rmannibucau/jbeam
>>>
>>> It is far from complete but already shows where it fails (Beam doesn't
>>> have a way to reduce on the caller machine, for instance; coder handling
>>> is not that trivial; lambdas don't work well with the default Stream API,
>>> etc.).
>>>
>>> However, it is interesting to see that having such an API is pretty
>>> natural compared to the pipeline API, so I wonder if Beam should work on
>>> its own Stream API (surely with another name, for obvious reasons ;)).
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau | Blog <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau>
>>> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>
>>
>>
>
>


Re: (java) stream & beam?

2018-03-13 Thread Aljoscha Krettek
https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html

> On 11. Mar 2018, at 22:21, Romain Manni-Bucau  wrote:
> 
> 
> 
> On 12 Mar 2018 at 00:16, "Reuven Lax" wrote:
> I think it would be interesting to see what a Java stream-based API would 
> look like. As I mentioned elsewhere, we are not limited to having only one 
> API for Beam.
> 
> If I remember correctly, a Java stream API was considered for Dataflow back 
> at the very beginning. I don't completely remember why it was rejected, but I 
> suspect at least part of the reason might have been that Java streams were 
> considered too new and untested back then.
> 
> Coders are broken - type variables don't have bounds except Object - and
> reducers are not trivial to implement in general, I guess.
>
> However, staying close to this API can help a lot, so +1 to trying a Java
> DSL on top of the current API. It would also be neat to integrate it with
> CompletionStage :).
> 
> 
> 
> Reuven
> 
> 
> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau wrote:
> 
> 
> On 11 Mar 2018 at 21:18, "Jean-Baptiste Onofré" wrote:
> Hi Romain,
> 
> I remember we discussed the way to express pipelines a while ago.
>
> I was a fan of a "DSL" like the one we have in Camel: instead of using
> apply(), use a dedicated form (.map(), .reduce(), etc.; AFAIR that's the
> approach in Flume).
> However, we agreed that the apply() syntax gives a more flexible approach.
>
> Using Java Stream is interesting, but I'm afraid we would hit the same
> issue we identified when discussing a "fluent Java SDK". However, we can
> have a Stream API DSL on top of the SDK, IMHO.
> 
> Agreed, and a Beam stream interface (copying the JDK API but making the
> lambdas serializable to avoid the need for a cast).
>
> On my side I think it enables users to discover the API. If you check my
> POC impl you quickly see the steps needed to do simple things like a map,
> which is a first-class citizen.
>
> I'm also curious whether we could implement reduce with the pipeline
> result, i.e. get an output of a batch back in the runner (client) JVM. I
> see how to do it for longs - with metrics - but not for collect().
> 
> 
> Regards
> JB
> 
> 
> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
> Hi guys,
> 
> I don't know if you have already experienced using the Java Stream API as a
> replacement for the pipeline API, but I did some tests:
> https://github.com/rmannibucau/jbeam
>
> It is far from complete but already shows where it fails (Beam doesn't have
> a way to reduce on the caller machine, for instance; coder handling is not
> that trivial; lambdas don't work well with the default Stream API, etc.).
>
> However, it is interesting to see that having such an API is pretty natural
> compared to the pipeline API, so I wonder if Beam should work on its own
> Stream API (surely with another name, for obvious reasons ;)).
> 
> Romain Manni-Bucau
> @rmannibucau | Blog <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau>
> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> 
> 



Re: (java) stream & beam?

2018-03-11 Thread Romain Manni-Bucau
On 12 Mar 2018 at 00:16, "Reuven Lax" wrote:

I think it would be interesting to see what a Java stream-based API would
look like. As I mentioned elsewhere, we are not limited to having only one
API for Beam.

If I remember correctly, a Java stream API was considered for Dataflow back
at the very beginning. I don't completely remember why it was rejected, but
I suspect at least part of the reason might have been that Java streams
were considered too new and untested back then.


Coders are broken - type variables don't have bounds except Object - and
reducers are not trivial to implement in general, I guess.

However, staying close to this API can help a lot, so +1 to trying a Java
DSL on top of the current API. It would also be neat to integrate it with
CompletionStage :).
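
To make the coder problem concrete, here is a minimal sketch against the
current Java SDK (the pipeline content is made up): a bare lambda erases its
type parameters at compile time, so the SDK cannot infer an output coder
from it alone and the element type has to be supplied explicitly.

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.TypeDescriptors;

  public class LambdaCoderExample {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
      PCollection<String> lines = p.apply(Create.of("a", "bb", "ccc"));

      // Without .into(...), coder inference fails for the lambda because
      // its String -> Integer signature is erased by the compiler.
      PCollection<Integer> lengths = lines.apply(
          MapElements.into(TypeDescriptors.integers())
              .via((String s) -> s.length()));

      p.run().waitUntilFinish();
    }
  }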



Reuven


On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau 
wrote:

>
>
> On 11 Mar 2018 at 21:18, "Jean-Baptiste Onofré" wrote:
>
> Hi Romain,
>
> I remember we discussed the way to express pipelines a while ago.
>
> I was a fan of a "DSL" like the one we have in Camel: instead of using
> apply(), use a dedicated form (.map(), .reduce(), etc.; AFAIR that's the
> approach in Flume).
> However, we agreed that the apply() syntax gives a more flexible approach.
>
> Using Java Stream is interesting, but I'm afraid we would hit the same
> issue we identified when discussing a "fluent Java SDK". However, we can
> have a Stream API DSL on top of the SDK, IMHO.
>
>
> Agreed, and a Beam stream interface (copying the JDK API but making the
> lambdas serializable to avoid the need for a cast).
>
> On my side I think it enables users to discover the API. If you check my
> POC impl you quickly see the steps needed to do simple things like a map,
> which is a first-class citizen.
>
> I'm also curious whether we could implement reduce with the pipeline
> result, i.e. get an output of a batch back in the runner (client) JVM. I
> see how to do it for longs - with metrics - but not for collect().
>
>
> Regards
> JB
>
>
> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>
>> Hi guys,
>>
>> I don't know if you have already experienced using the Java Stream API
>> as a replacement for the pipeline API, but I did some tests:
>> https://github.com/rmannibucau/jbeam
>>
>> It is far from complete but already shows where it fails (Beam doesn't
>> have a way to reduce on the caller machine, for instance; coder handling
>> is not that trivial; lambdas don't work well with the default Stream API,
>> etc.).
>>
>> However, it is interesting to see that having such an API is pretty
>> natural compared to the pipeline API, so I wonder if Beam should work on
>> its own Stream API (surely with another name, for obvious reasons ;)).
>>
>> Romain Manni-Bucau
>> @rmannibucau | Blog <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau>
>> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>
>


Re: (java) stream & beam?

2018-03-11 Thread Reuven Lax
I think it would be interesting to see what a Java stream-based API would
look like. As I mentioned elsewhere, we are not limited to having only one
API for Beam.

If I remember correctly, a Java stream API was considered for Dataflow back
at the very beginning. I don't completely remember why it was rejected, but
I suspect at least part of the reason might have been that Java streams
were considered too new and untested back then.

Reuven


On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau 
wrote:

>
>
> On 11 Mar 2018 at 21:18, "Jean-Baptiste Onofré" wrote:
>
> Hi Romain,
>
> I remember we discussed the way to express pipelines a while ago.
>
> I was a fan of a "DSL" like the one we have in Camel: instead of using
> apply(), use a dedicated form (.map(), .reduce(), etc.; AFAIR that's the
> approach in Flume).
> However, we agreed that the apply() syntax gives a more flexible approach.
>
> Using Java Stream is interesting, but I'm afraid we would hit the same
> issue we identified when discussing a "fluent Java SDK". However, we can
> have a Stream API DSL on top of the SDK, IMHO.
>
>
> Agreed, and a Beam stream interface (copying the JDK API but making the
> lambdas serializable to avoid the need for a cast).
>
> On my side I think it enables users to discover the API. If you check my
> POC impl you quickly see the steps needed to do simple things like a map,
> which is a first-class citizen.
>
> I'm also curious whether we could implement reduce with the pipeline
> result, i.e. get an output of a batch back in the runner (client) JVM. I
> see how to do it for longs - with metrics - but not for collect().
>
>
> Regards
> JB
>
>
> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>
>> Hi guys,
>>
>> I don't know if you have already experienced using the Java Stream API
>> as a replacement for the pipeline API, but I did some tests:
>> https://github.com/rmannibucau/jbeam
>>
>> It is far from complete but already shows where it fails (Beam doesn't
>> have a way to reduce on the caller machine, for instance; coder handling
>> is not that trivial; lambdas don't work well with the default Stream API,
>> etc.).
>>
>> However, it is interesting to see that having such an API is pretty
>> natural compared to the pipeline API, so I wonder if Beam should work on
>> its own Stream API (surely with another name, for obvious reasons ;)).
>>
>> Romain Manni-Bucau
>> @rmannibucau | Blog <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau>
>> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>
>


Re: (java) stream & beam?

2018-03-11 Thread Reuven Lax
A "fluent" API isn't completely incompatible with our current apply-based
API. We could easily add fluent member functions to PCollections which are
syntactic sugar (i.e. delegate to apply). We would need to be disciplined
though, as there will be a tendency for everyone to ask for their transform
to be added as well (this would be a lot saner in a language that supported
mixin methods). This does have some advantages in cleaner user code and
more discoverable transform (i.e. IDE autocomplete and dropdowns work).

One potential concern would be losing some type safety. e.g. today if I
have a PCollection, I can't apply GroupByKey to it - the Java type
system will only allow me to do this if I have a Pollection. If however
groupByKey was a method on PCollection, then we can't stop it from being
called.
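
For illustration only, a sketch of what such sugar could look like; the
Fluent wrapper and its method names are hypothetical, and every fluent
method is just a delegation to apply():

  import org.apache.beam.sdk.transforms.GroupByKey;
  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.transforms.SimpleFunction;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;

  // Hypothetical fluent wrapper: pure syntactic sugar over apply().
  final class Fluent<T> {
    private final PCollection<T> pcoll;

    private Fluent(PCollection<T> pcoll) {
      this.pcoll = pcoll;
    }

    static <T> Fluent<T> of(PCollection<T> pcoll) {
      return new Fluent<>(pcoll);
    }

    // map() delegates to apply(MapElements.via(...)); a SimpleFunction
    // keeps enough type information for coder inference.
    <O> Fluent<O> map(SimpleFunction<T, O> fn) {
      return new Fluent<>(pcoll.apply(MapElements.via(fn)));
    }

    // The type-safety concern above: an instance method groupByKey() could
    // not be limited to KV elements, but a static helper can require them.
    static <K, V> Fluent<KV<K, Iterable<V>>> groupByKey(Fluent<KV<K, V>> in) {
      return new Fluent<>(in.pcoll.apply(GroupByKey.<K, V>create()));
    }

    PCollection<T> get() {
      return pcoll;
    }
  }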


On Sun, Mar 11, 2018 at 1:18 PM Jean-Baptiste Onofré 
wrote:

> Hi Romain,
>
> I remember we discussed the way to express pipelines a while ago.
>
> I was a fan of a "DSL" like the one we have in Camel: instead of using
> apply(), use a dedicated form (.map(), .reduce(), etc.; AFAIR that's the
> approach in Flume).
> However, we agreed that the apply() syntax gives a more flexible approach.
>
> Using Java Stream is interesting, but I'm afraid we would hit the same
> issue we identified when discussing a "fluent Java SDK". However, we can
> have a Stream API DSL on top of the SDK, IMHO.
>
> Regards
> JB
>
> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
> > Hi guys,
> >
> > I don't know if you have already experienced using the Java Stream API
> > as a replacement for the pipeline API, but I did some tests:
> > https://github.com/rmannibucau/jbeam
> >
> > It is far from complete but already shows where it fails (Beam doesn't
> > have a way to reduce on the caller machine, for instance; coder handling
> > is not that trivial; lambdas don't work well with the default Stream API,
> > etc.).
> >
> > However, it is interesting to see that having such an API is pretty
> > natural compared to the pipeline API, so I wonder if Beam should work on
> > its own Stream API (surely with another name, for obvious reasons ;)).
> >
> > Romain Manni-Bucau
> > @rmannibucau | Blog <https://rmannibucau.metawerx.net/> | Old Blog
> > <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau>
> > | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
> > <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>


Re: (java) stream & beam?

2018-03-11 Thread Romain Manni-Bucau
On 11 Mar 2018 at 21:18, "Jean-Baptiste Onofré" wrote:

Hi Romain,

I remember we discussed the way to express pipelines a while ago.

I was a fan of a "DSL" like the one we have in Camel: instead of using
apply(), use a dedicated form (.map(), .reduce(), etc.; AFAIR that's the
approach in Flume).
However, we agreed that the apply() syntax gives a more flexible approach.

Using Java Stream is interesting, but I'm afraid we would hit the same
issue we identified when discussing a "fluent Java SDK". However, we can
have a Stream API DSL on top of the SDK, IMHO.


Agreed, and a Beam stream interface (copying the JDK API but making the
lambdas serializable to avoid the need for a cast), as sketched below.
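
A minimal sketch of that idea (the SerFunction name is made up): a
functional interface extending both Function and Serializable lets a plain
lambda be serialized with no cast at the call site.

  import java.io.Serializable;
  import java.util.function.Function;

  @FunctionalInterface
  interface SerFunction<T, R> extends Function<T, R>, Serializable {}

  class SerFunctionDemo {
    // Any lambda passed here is serializable by construction.
    static <T, R> void accept(SerFunction<T, R> fn) {}

    public static void main(String[] args) {
      // No (Function<String, Integer> & Serializable) cast needed.
      accept((String s) -> s.length());
    }
  }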

On my side I think it enables users to discover the API. If you check my POC
impl you quickly see the steps needed to do simple things like a map, which
is a first-class citizen.

I'm also curious whether we could implement reduce with the pipeline result,
i.e. get an output of a batch back in the runner (client) JVM. I see how to
do it for longs - with metrics - but not for collect().
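
A sketch of the longs-via-metrics route (the pipeline and names are made
up; it assumes a runner that reports attempted metrics back to the client):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.PipelineResult;
  import org.apache.beam.sdk.metrics.Counter;
  import org.apache.beam.sdk.metrics.MetricNameFilter;
  import org.apache.beam.sdk.metrics.MetricQueryResults;
  import org.apache.beam.sdk.metrics.Metrics;
  import org.apache.beam.sdk.metrics.MetricsFilter;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.ParDo;

  public class ClientSideSum {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
      p.apply(Create.of(1, 2, 3))
          .apply(ParDo.of(new DoFn<Integer, Integer>() {
            private final Counter sum = Metrics.counter(ClientSideSum.class, "sum");

            @ProcessElement
            public void process(ProcessContext c) {
              sum.inc(c.element()); // a long-valued "reduce" the client can read
              c.output(c.element());
            }
          }));

      PipelineResult result = p.run();
      result.waitUntilFinish();

      // Counters come back to the client JVM; there is no equivalent
      // channel for arbitrary objects, hence no collect().
      MetricQueryResults metrics = result.metrics().queryMetrics(
          MetricsFilter.builder()
              .addNameFilter(MetricNameFilter.named(ClientSideSum.class, "sum"))
              .build());
      metrics.getCounters().forEach(m -> System.out.println(m.getAttempted()));
    }
  }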


Regards
JB


On 11/03/2018 19:46, Romain Manni-Bucau wrote:

> Hi guys,
>
> I don't know if you have already experienced using the Java Stream API as
> a replacement for the pipeline API, but I did some tests:
> https://github.com/rmannibucau/jbeam
>
> It is far from complete but already shows where it fails (Beam doesn't
> have a way to reduce on the caller machine, for instance; coder handling
> is not that trivial; lambdas don't work well with the default Stream API,
> etc.).
>
> However, it is interesting to see that having such an API is pretty
> natural compared to the pipeline API, so I wonder if Beam should work on
> its own Stream API (surely with another name, for obvious reasons ;)).
>
> Romain Manni-Bucau
> @rmannibucau | Blog <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau>
> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>


Re: (java) stream & beam?

2018-03-11 Thread Jean-Baptiste Onofré

Hi Romain,

I remember we discussed the way to express pipelines a while ago.

I was a fan of a "DSL" like the one we have in Camel: instead of using
apply(), use a dedicated form (.map(), .reduce(), etc.; AFAIR that's the
approach in Flume).

However, we agreed that the apply() syntax gives a more flexible approach.

Using Java Stream is interesting, but I'm afraid we would hit the same
issue we identified when discussing a "fluent Java SDK". However, we can
have a Stream API DSL on top of the SDK, IMHO.


Regards
JB

On 11/03/2018 19:46, Romain Manni-Bucau wrote:

Hi guys,

I don't know if you have already experienced using the Java Stream API as a
replacement for the pipeline API, but I did some tests:
https://github.com/rmannibucau/jbeam

It is far from complete but already shows where it fails (Beam doesn't have
a way to reduce on the caller machine, for instance; coder handling is not
that trivial; lambdas don't work well with the default Stream API, etc.).

However, it is interesting to see that having such an API is pretty natural
compared to the pipeline API, so I wonder if Beam should work on its own
Stream API (surely with another name, for obvious reasons ;)).


Romain Manni-Bucau
@rmannibucau | Blog <https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau>
| LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>



(java) stream & beam?

2018-03-11 Thread Romain Manni-Bucau
Hi guys,

I don't know if you have already experienced using the Java Stream API as a
replacement for the pipeline API, but I did some tests:
https://github.com/rmannibucau/jbeam

It is far from complete but already shows where it fails (Beam doesn't have
a way to reduce on the caller machine, for instance; coder handling is not
that trivial; lambdas don't work well with the default Stream API, etc.).

However, it is interesting to see that having such an API is pretty natural
compared to the pipeline API, so I wonder if Beam should work on its own
Stream API (surely with another name, for obvious reasons ;)).

Romain Manni-Bucau
@rmannibucau | Blog <https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau>
| LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>
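
For contrast, a minimal sketch of the JDK Stream shape the jbeam experiment
mimics (the example data is made up): collect() reduces on the caller's JVM,
which is exactly the step Beam has no direct equivalent for.

  import java.util.Arrays;
  import java.util.List;
  import java.util.stream.Collectors;

  public class StreamStyle {
    public static void main(String[] args) {
      List<Integer> lengths = Arrays.asList("a", "bb", "ccc").stream()
          .map(String::length)           // fluent, lambda-friendly
          .collect(Collectors.toList()); // reduced locally, in this JVM
      System.out.println(lengths);       // [1, 2, 3]
    }
  }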