No Translator Found issue

2018-12-03 Thread Vinay Patil
Hi,

I am using Beam version 2.8.0. When I submit a pipeline to a Flink 1.5.2
cluster, I am getting the following exception:

Caused by: java.lang.IllegalStateException: No translator known for
org.apache.beam.sdk.io.Read$Bounded

Can you please let me know what could be the problem?

Regards,
Vinay Patil
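
For context, a Beam pipeline normally reaches a Flink cluster through the
FlinkRunner, which is the component that translates transforms such as
Read.Bounded into Flink operators. A minimal sketch of that submission path
(the master address is an assumption, not taken from this thread):

  import org.apache.beam.runners.flink.FlinkPipelineOptions;
  import org.apache.beam.runners.flink.FlinkRunner;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class FlinkSubmitSketch {
    public static void main(String[] args) {
      FlinkPipelineOptions options =
          PipelineOptionsFactory.as(FlinkPipelineOptions.class);
      options.setRunner(FlinkRunner.class);       // the runner owns transform translation
      options.setFlinkMaster("flink-host:6123");  // assumed cluster address
      Pipeline p = Pipeline.create(options);
      // ... build the pipeline, then:
      p.run().waitUntilFinish();
    }
  }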


2019 Beam Events

2018-12-03 Thread Griselda Cuevas
Hi Beam Community,

I started curating a list of industry conferences, meetups, and events that
are relevant for Beam; this is the initial list I came up with.
*I'd love your help adding others that I might have overlooked.* Once we're
satisfied with the list, let's re-share it so we can coordinate proposal
submissions, attendance, and community meetups there.


Cheers,

G


Re: How to kick off a Beam pipeline with PortableRunner in Java?

2018-12-03 Thread Ankur Goenka
Thanks Ruoyun!

For Flink, we use a different job server, which you can start with
"./gradlew beam-runners-flink_2.11-job-server:runShadow".
The host:port for this job server is localhost:8099.
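
For anyone following along, a sketch of the Java side once that job server is
up. This assumes the portable-runner artifacts are on the classpath; the
plain Beam 2.8.0 Java SDK does not register a PortableRunner, which is why
the quoted message below lists only the Direct/Flink/Spark runners:

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class PortableSubmitSketch {
    public static void main(String[] args) {
      // --jobEndpoint matches the runShadow default mentioned above.
      PipelineOptions options = PipelineOptionsFactory.fromArgs(
          "--runner=PortableRunner",
          "--jobEndpoint=localhost:8099").create();
      Pipeline p = Pipeline.create(options);
      // ... build and run the pipeline ...
      p.run();
    }
  }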

On Mon, Dec 3, 2018 at 2:24 PM Ruoyun Huang  wrote:

> Maybe this helps:
> https://cwiki.apache.org/confluence/display/BEAM/Usage+Guide
>
> On Mon, Dec 3, 2018 at 2:10 PM sai.inamp...@gmail.com <
> sai.inamp...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> Can someone point me to how to kick off a Beam pipeline using the
>> PortableRunner (w/Flink) in Java? I saw some examples in Python but I
>> haven't been able to find any for Java.
>>
>> I tried to modify the runner option to use PortableRunner but I get the
>> following error below:
>> java.lang.IllegalArgumentException: Unknown 'runner' specified
>> 'PortableRunner', supported pipeline runners [DirectRunner, FlinkRunner,
>> SparkRunner, TestFlinkRunner, TestSparkRunner]
>>
>> For reference, I am on Beam 2.8.0 and the reason I want to try to use the
>> PortableRunner is so I can confirm a comment[1] in BEAM-593 that states
>> that pipeline.run() is no longer a blocking call in the PortableRunner with
>> Flink.
>>
>> [1]
>> https://issues.apache.org/jira/browse/BEAM-593?focusedCommentId=16618916&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16618916
>>
>
>
> --
> 
> Ruoyun  Huang
>
>


Re: How to kick off a Beam pipeline with PortableRunner in Java?

2018-12-03 Thread Ruoyun Huang
Maybe this helps:
https://cwiki.apache.org/confluence/display/BEAM/Usage+Guide

On Mon, Dec 3, 2018 at 2:10 PM sai.inamp...@gmail.com <
sai.inamp...@gmail.com> wrote:

> Hi everyone,
>
> Can someone point me to how to kick off a Beam pipeline using the
> PortableRunner (w/Flink) in Java? I saw some examples in Python but I
> haven't been able to find any for Java.
>
> I tried to modify the runner option to use PortableRunner but I get the
> following error below:
> java.lang.IllegalArgumentException: Unknown 'runner' specified
> 'PortableRunner', supported pipeline runners [DirectRunner, FlinkRunner,
> SparkRunner, TestFlinkRunner, TestSparkRunner]
>
> For reference, I am on Beam 2.8.0 and the reason I want to try to use the
> PortableRunner is so I can confirm a comment[1] in BEAM-593 that states
> that pipeline.run() is no longer a blocking call in the PortableRunner with
> Flink.
>
> [1]
> https://issues.apache.org/jira/browse/BEAM-593?focusedCommentId=16618916&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16618916
>


-- 

Ruoyun  Huang


How to kick off a Beam pipeline with PortableRunner in Java?

2018-12-03 Thread sai . inampudi
Hi everyone,

Can someone point me to how to kick off a Beam pipeline using the 
PortableRunner (w/Flink) in Java? I saw some examples in Python but I haven't 
been able to find any for Java.

I tried to modify the runner option to use PortableRunner, but I get the
following error:
java.lang.IllegalArgumentException: Unknown 'runner' specified 
'PortableRunner', supported pipeline runners [DirectRunner, FlinkRunner, 
SparkRunner, TestFlinkRunner, TestSparkRunner]

For reference, I am on Beam 2.8.0 and the reason I want to try to use the 
PortableRunner is so I can confirm a comment[1] in BEAM-593 that states that 
pipeline.run() is no longer a blocking call in the PortableRunner with Flink.

[1] 
https://issues.apache.org/jira/browse/BEAM-593?focusedCommentId=16618916&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16618916


Join PCollection Data with HBase Large Data - Suggestion Requested

2018-12-03 Thread Chandan Biswas
Hello All,
I have a use case where I have PCollection<KV<K, V>> data coming from a
Kafka source. When processing each record (KV<K, V>) I need all of the old
values for that key stored in an HBase table. The naive approach is to do
an HBase lookup in DoFn.processElement. I considered a side input, but it's
not going to work because of the large dataset. Any suggestions?

Thanks,
Chandan
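
To make the naive approach concrete, here is a minimal sketch of an HBase
lookup inside DoFn.processElement, with one connection per DoFn instance.
The table name, key encoding, and String element types are illustrative
assumptions, not taken from the thread:

  import java.io.IOException;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  class LookupOldValuesFn extends DoFn<KV<String, String>, KV<String, String>> {
    private transient Connection connection;
    private transient Table table;

    @Setup
    public void setup() throws IOException {
      // One connection per DoFn instance, reused across bundles.
      connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
      table = connection.getTable(TableName.valueOf("history"));  // assumed table
    }

    @ProcessElement
    public void processElement(@Element KV<String, String> record,
                               OutputReceiver<KV<String, String>> out) throws IOException {
      // One HBase round trip per element: the row holds all old values for the key.
      Result oldValues = table.get(new Get(Bytes.toBytes(record.getKey())));
      // ... combine record.getValue() with oldValues, then emit ...
      out.output(record);
    }

    @Teardown
    public void teardown() throws IOException {
      table.close();
      connection.close();
    }
  }

The per-element round trip is the cost of this approach; caching hot keys in
the DoFn is a common mitigation.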


Dynamic Naming of file using KV in IO

2018-12-03 Thread Vinay Patil
Hi,

I need help with dynamic file naming for XML output from a KV PCollection.

PCollection<KV<String, XmlDTO>> xmlCollection = ...

I am not able to use XmlIO for this PCollection.
XmlDTO is the marshalled DTO and String is the key.

I tried using KV, but XmlIO needs a Class type and KV.getClass does not work… I
need the key for distribution, and XmlDTO does not carry it.

Any suggestions?

Regards,
Vinay Patil
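
One possible direction, sketched under the assumption that XmlDTO is a
JAXB-annotated class and that FileIO.writeDynamic (available in 2.8.0) fits
the use case; the root element, output path, and file-name prefix are
placeholders:

  xmlCollection.apply(FileIO.<String, KV<String, XmlDTO>>writeDynamic()
      .by(KV::getKey)                       // destination chosen per key
      .via(Contextful.fn(KV::getValue),     // drop the key before writing
           XmlIO.sink(XmlDTO.class).withRootElement("records"))
      .withDestinationCoder(StringUtf8Coder.of())
      .to("/output/path")                   // placeholder path
      .withNaming(key -> FileIO.Write.defaultNaming("xml-" + key, ".xml")));

This writes each key's elements to its own XML file while XmlIO.sink handles
the marshalling, which sidesteps the KV.getClass problem.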


Re: Graceful shutdown of long-running Beam pipeline on Flink

2018-12-03 Thread Wayne Collins
Excellent proposals!
They go beyond our requirements but would provide a great foundation for
runner-agnostic life cycle management of pipelines.

Will jump into discussion on the other side...
Thanks!
Wayne


On 2018-12-03 11:53, Lukasz Cwik wrote:
> There are proposals for pipeline drain[1] and also for snapshot and
> update[2] for Apache Beam. We would love contributions in this space.
>
> 1: 
> https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8
> 2: 
> https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY
>
> On Mon, Dec 3, 2018 at 7:05 AM Wayne Collins wrote:
>
> Hi JC,
>
> Thanks for the quick response!
> I had hoped for an in-pipeline solution for runner portability but
> it is nice to know we're not the only ones stepping outside to
> interact with runner management. :-)
>
> Wayne
>
>
> On 2018-12-03 01:23, Juan Carlos Garcia wrote:
>> Hi Wayne, 
>>
>> We have the same setup and we do daily updates to our pipeline.
>>
>> The way we do it is using the flink CLI tool via Jenkins.
>>
>> Basically our deployment job does the following:
>>
>> 1. Detect whether the pipeline is running (matched via job name).
>>
>> 2. If found, do a flink cancel with a savepoint (we use HDFS for
>> checkpoints / savepoints) under a given directory.
>>
>> 3. Use the flink run command for the new job, specifying the
>> savepoint from step 2.
>>
>> I don't think there is any support to achieve the same from
>> within the pipeline. You need to do this externally as explained
>> above. 
>>
>> Best regards, 
>> JC
>>
>>
>> On Mon, Dec 3, 2018 at 00:46, Wayne Collins wrote:
>> Hi all,
>> We have a number of Beam pipelines processing unbounded
>> streams sourced from Kafka on the Flink runner and are very
>> happy with both the platform and performance!
>>
>> The problem is with shutting down the pipelines...for version
>> upgrades, system maintenance, load management, etc. it would
>> be nice to be able to gracefully shut these down under
>> software control but haven't been able to find a way to do
>> so. We're in good shape on checkpointing and then cleanly
>> recovering but shutdowns are all destructive to Flink or the
>> Flink TaskManager.
>>
>> Methods tried:
>>
>> 1) Calling cancel on FlinkRunnerResult returned from
>> pipeline.run()
>> This would be our preferred method but p.run() doesn't return
>> until termination and even if it did, the runner code simply
>> throws:
>> "throw new UnsupportedOperationException("FlinkRunnerResult
>> does not support cancel.");"
>> so this doesn't appear to be a near-term option.
>>
>> 2) Inject a "termination" message into the pipeline via Kafka
>> This does get through, but calling exit() from a stage in the
>> pipeline also terminates the Flink TaskManager.
>>
>> 3) Inject a "sleep" message, then manually restart the cluster
>> This is our current method: we pause the data at the source,
>> flood all branches of the pipeline with a "we're going down"
>> msg so the stages can do a bit of housekeeping, then
>> hard-stop the entire environment and re-launch with the new
>> version.
>>
>> Is there a "Best Practice" method for gracefully terminating
>> an unbounded pipeline from within the pipeline or from the
>> mainline that launches it?
>>
>> Thanks!
>> Wayne
>>
>> -- 
>> Wayne Collins
>> dades.ca  Inc.
>> mailto:wayn...@dades.ca
>> cell:416-898-5137
>>
>
> -- 
> Wayne Collins
> dades.ca  Inc.
> mailto:wayn...@dades.ca
> cell:416-898-5137
>

-- 
Wayne Collins
dades.ca Inc.
mailto:wayn...@dades.ca
cell:416-898-5137



Re: Generic Type PTransform

2018-12-03 Thread Lukasz Cwik
Apache Beam attempts to propagate coders by looking at any typing
information available, but Java performs a lot of type erasure, so there
are many scenarios where coders can't be propagated forward from a
previous transform.

Take the following two examples (note that there are many subtle variations
that can give different results):
List<String> erasedType = new ArrayList<String>();  // type is lost
List<String> keptType = new ArrayList<String>() {};  // type is kept because an
anonymous inner class is being declared
In the first the type is erased and in the second the type information is
available.

In your case we can't infer what K and V are because the types are erased
once the code compiles, hence the error message. To fix the problem
immediately, you'll want to set the coder on the PCollection created
after you apply the "MapToKV" transform (you might need to do it on the
"MapToSimpleImmutableEntry" transform as well).

If you want to get into the details, take a look at the CoderRegistry[1],
as it contains the type inference / propagation code.

Finally, this is not an uncommon problem for users, and it seems the
error message that is given doesn't make sense, so feel free to propose
changes to the error messages to help others such as yourself.

1:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/CoderRegistry.java

On Sun, Dec 2, 2018 at 10:54 PM Matt Casters  wrote:

> There are probably smarter people than me on this list, but I have
> recently been through a similar thought exercise...
>
> For the generic use in Kettle I have a PCollection<KettleRow> going
> through the pipeline.
> KettleRow is just an Object[] wrapper for which I can implement a Coder.
>
> The "group by" that I implemented does the following: split the
> PCollection<KettleRow> into PCollection<KV<KettleRow, KettleRow>>.
> Then it applies the standard GroupByKey.create(), giving us
> PCollection<KV<KettleRow, Iterable<KettleRow>>>.
> This means that we can simply aggregate all the elements in each
> Iterable<KettleRow> to aggregate a group.
>
> Well, at least that works for me. The code is open so look at it over here:
>
> https://github.com/mattcasters/kettle-beam-core/blob/master/src/main/java/org/kettle/beam/core/transform/GroupByTransform.java
>
> Like you I had trouble with the Coder for my KettleRows so I hacked up
> this to make it work:
>
> https://github.com/mattcasters/kettle-beam-core/blob/master/src/main/java/org/kettle/beam/core/coder/KettleRowCoder.java
>
> It's set on the pipeline:
> pipeline.getCoderRegistry().registerCoderForClass( KettleRow.class, new
> KettleRowCoder());
>
> Good luck!
> Matt
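
For anyone reading along, the minimal shape of such a coder (a sketch only;
the real KettleRowCoder is at the link above, and MyRow is a stand-in type):

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;
  import org.apache.beam.sdk.coders.CustomCoder;

  public class MyRowCoder extends CustomCoder<MyRow> {
    @Override
    public void encode(MyRow value, OutputStream outStream) throws IOException {
      // Serialize the wrapped Object[] fields to outStream.
    }

    @Override
    public MyRow decode(InputStream inStream) throws IOException {
      // Read the fields back and rebuild the row.
      return new MyRow();
    }
  }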
>
> On Sun, Dec 2, 2018 at 20:57, Eran Twili wrote:
>
>> Hi,
>>
>>
>>
>> We are considering using Beam in our software.
>>
>> We wish to create a service that will operate Beam on behalf of a user,
>> and obviously the user code doesn't have visibility of the Beam API.
>>
>> For that we need to generify some of the Beam API.
>>
>> So the user supplies functions and we embed them in a generic *PTransform*
>> and run them in a Beam pipeline.
>>
>> We are having some difficulty understanding how we can provide the user
>> with the option to perform a *GroupByKey* operation.
>>
>> The problem is that *GroupByKey* takes *KV*, while our *PCollections* hold
>> only user datatypes, which should not be Beam datatypes.
>>
>> So we thought about having this *PTransform*:
>>
>> public class PlatformGroupByKey<K, V> extends
>> PTransform<PCollection<CustomType<SimpleImmutableEntry<K, V>>>,
>> PCollection<CustomType<SimpleImmutableEntry<K, Iterable<V>>>>> {
>> @Override
>> public PCollection<CustomType<SimpleImmutableEntry<K, Iterable<V>>>>
>> expand(PCollection<CustomType<SimpleImmutableEntry<K, V>>> input) {
>>
>> return input
>> .apply("MapToKV",
>> MapElements.via(
>> new SimpleFunction<CustomType<SimpleImmutableEntry<K, V>>, KV<K, V>>() {
>> @Override
>> public KV<K, V> apply
>> (CustomType<SimpleImmutableEntry<K, V>> kv) {
>> return KV.of(kv.field.getKey(), kv.field.getValue()); }}))
>> .apply("GroupByKey",
>> GroupByKey.create())
>> .apply("MapToSimpleImmutableEntry",
>> MapElements.via(
>> new SimpleFunction<KV<K, Iterable<V>>,
>> CustomType<SimpleImmutableEntry<K, Iterable<V>>>>() {
>> @Override
>> public CustomType<SimpleImmutableEntry<K, Iterable<V>>> apply(
>> KV<K, Iterable<V>> kv) {
>> return new CustomType<>(new
>> SimpleImmutableEntry<>(kv.getKey(), kv.getValue())); }}));
>> }
>> }
>>
>> In this transform we take a *PCollection* of our key-value type (Java's
>> *SimpleImmutableEntry*),
>>
>> convert it to *KV*,
>>
>> perform the *GroupByKey*,
>>
>> and re-convert it back to *SimpleImmutableEntry*.
>>
>>
>>
>> But we get this error in runtime:
>>
>>
>>
>> java.lang.IllegalStateException: Unable to return a default Coder for
>> GroupByKey/MapToKV/Map/ParMultiDo(Anonymous).output [PCollection]. Correct
>> one of the following root causes:
>>
>>   No Coder has been manually specified;  you may do so using .setCoder().
>>
>>   Inferring a 

Re: Beam Metrics questions

2018-12-03 Thread Etienne Chauchot
Hi Phil,
Thanks for the update. I was checking the code and could not understand how
the filtering could fail.

Etienne
On Friday, November 30, 2018 at 10:53 -0600, Phil Franklin wrote:
> Etienne, I’ve just discovered that the code I used for my tests overrides the 
> command-line arguments, and while I thought I was testing with the 
> SparkRunner and FlinkRunner, in fact every test used DirectRunner, which 
> explains why I was seeing the committed values.  So there’s no need for a 
> ticket concerning committed values from the FlinkRunner.  Sorry for the 
> confusion.
> 
> -Phil


Re: bug in context.timestamp() after GroupByKey transform?

2018-12-03 Thread Robert Bradshaw
After a GroupByKey, a (single) timestamp needs to be assigned to the
full KV<K, Iterable<V>> element. By default the timestamp chosen is
the end of the window, which in the case of the global window is a
timestamp as far into the future as can be represented. (Python prints
these as MAX_TIMESTAMP rather than an exact time; perhaps we should do
something similar for Java.)
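
A quick way to confirm that the "far future" value in the output below is
the global window's end rather than garbage (a sketch; joda-time Instant):

  Instant endOfGlobalWindow = GlobalWindow.INSTANCE.maxTimestamp();
  System.out.println(endOfGlobalWindow);  // 294247-01-09T04:00:54.775Z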

You can use the withTimestampCombiner [1] method to adjust this
behavior. The currently supported options are end of window (the
default), earliest (meaning the timestamp of the grouped result is the
timestamp of the earliest element in that group), or latest (analogous
to earliest).

pcoll.apply(Window.into(...).withTimestampCombiner(TimestampCombiner.LATEST));

Hopefully that helps,
Robert


[1] 
https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/transforms/windowing/Window.html#withTimestampCombiner-org.apache.beam.sdk.transforms.windowing.TimestampCombiner-
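
Applied to the windowing in the quoted pipeline below, that would look
roughly like this (EARLIEST chosen for illustration; LATEST works the same
way):

  .apply(Window.<KV<String, String>>into(new GlobalWindows())
      .triggering(AfterPane.elementCountAtLeast(1))
      .withTimestampCombiner(TimestampCombiner.EARLIEST)
      .discardingFiredPanes())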
On Sat, Dec 1, 2018 at 10:49 PM  wrote:
>
> Hi,
>
> I was trying out apache Beam, and got unexpected output for the following 
> program:
>
> 
>
> StreamingOptions options = 
> PipelineOptionsFactory.create().as(StreamingOptions.class);
> options.setStreaming(true);
> Pipeline p = Pipeline.create(options);
>
> PCollection<KV<String, String>> kvpCollection = p
> .apply(Utils.readKafka(
> "inputs",
> StringDeserializer.class,
> StringDeserializer.class,
> "group-id"))
> .apply(Window.<KV<String, String>>into(
> new GlobalWindows())
> .triggering(AfterPane.elementCountAtLeast(1))
> .discardingFiredPanes());
>
> kvpCollection.apply(ParDo.of(new DoFn<KV<String, String>, KV<String, Instant>>() {
> @ProcessElement
> public void processElement(@Element KV<String, String> input,
> ProcessContext context,
> OutputReceiver<KV<String, Instant>> out) {
> Instant timestamp = context.timestamp();
> out.output(KV.of(input.getValue(), timestamp));
> }
> }))
> .apply(GroupByKey.create())
> .apply(ParDo.of(new DoFn<KV<String, Iterable<Instant>>, Void>() {
> @ProcessElement
> public void processElement(@Element KV<String, Iterable<Instant>> input,
> ProcessContext context,
> OutputReceiver<Void> out) {
> log.info("processElement input: {}, timestamp: {}", input,
> context.timestamp());
> }
> }))
> ;
>
> p.run().waitUntilFinish();
>
> ===
>
> The output is as follows:
>
> processElement input: KV{some-input, [2018-12-01T21:13:55.621Z]}, timestamp: 
> 294247-01-09T04:00:54.775Z
>
> (debugging gets the same results)
>
> So it looks like context.timestamp() is returning garbage after the 
> GroupByKey transform.
> I expected the timestamp from before the GroupByKey to be identical to the 
> one after the GroupByKey.
>
> Is this expected? Am I doing something wrong (related to the continuously 
> triggering global window perhaps)?
> Does it indicate that watermarks (and corresponding timers) will not work 
> properly after grouping by key?
>
> I used the Java Beam direct runner, version 2.8.0.
>
> Thanks in advance, kind regards,
>
> Jan