Re: Need help running beam word count example on apex/hdfs

2018-01-09 Thread Shashank Prabhakara
For anyone who faces the same issue: I was not able to make it work with
"mvn compile exec:java ...". Instead, I ran it with the "hadoop jar ..."
command, which fixed the problem. My best guess is that Maven was picking up
an incompatible version of commons-io from the wrong side of the dependency
tree.
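
For reference, this is roughly the sequence that worked for me (the bundled
jar name is illustrative, so check what actually lands under target/, and
the dependency:tree call is just a way to confirm where commons-io comes
from):

  # build the examples with the Apex runner profile
  mvn clean package -Papex-runner -DskipTests

  # see which commons-io versions Maven resolves
  mvn dependency:tree -Dincludes=commons-io

  # submit through the Hadoop launcher instead of exec:java
  hadoop jar target/<examples-bundled-jar>.jar \
      org.apache.beam.examples.WordCount \
      --inputFile=/tmp/input/pom.xml --output=/tmp/output/counts \
      --runner=ApexRunner --embeddedExecution=false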

Regards,
Shashank

On Mon, Jan 8, 2018 at 4:07 PM, Shashank Prabhakara 
wrote:

> Forgot to mention:
>
> Execution works in embedded mode and counts are created on the local fs. I
> need this to run on hdfs/yarn with --embeddedExecution=false.
>
> Regards,
> Shashank
>
> On Mon, Jan 8, 2018 at 3:06 PM, Shashank Prabhakara wrote:
>
>> Hi All,
>>
>> I want to test Beam on Apex using the WordCount example provided in the
>> Beam repository, but I'm facing some difficulties executing it as
>> described in the documentation.
>>
>> I'm running Hadoop version 2.8.2 on Debian in a multi-node environment.
>> I cloned the Beam GitHub repository (master branch) and executed:
>>
>> cd examples/java
>> mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
>> -Dexec.args="--inputFile=/tmp/input/pom.xml --output=/tmp/output/counts
>> --runner=ApexRunner --embeddedExecution=false" -Papex-runner
>>
>> However, the driver hangs (I waited for more than an hour) after
>> printing the classpath on the console. I have attached the stdout and the
>> stack trace to this message (please let me know if they are not visible
>> on the mailing list).
>>
>> Thanks in advance for any help.
>>
>> Regards,
>> Shashank
>>
>
>


Re: Triggers based on size

2018-01-09 Thread Kenneth Knowles
Often, when you need or want more control than triggers provide, such as
input-type-specific logic like yours, you can use state and timers in ParDo
to control when to output. You lose any potential optimizations of Combine
based on associativity/commutativity and assume the burden of making sure
your output is sensible, but dropping to low-level stateful computation may
be your best bet.
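
Roughly, something along these lines (untested sketch, not a drop-in
solution; the class name, the String payloads, and the byte-size heuristic
are all illustrative):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.beam.sdk.coders.StringUtf8Coder;
  import org.apache.beam.sdk.coders.VarLongCoder;
  import org.apache.beam.sdk.state.BagState;
  import org.apache.beam.sdk.state.StateSpec;
  import org.apache.beam.sdk.state.StateSpecs;
  import org.apache.beam.sdk.state.ValueState;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;

  /** Buffers values per key and emits one batch once ~maxBytes accumulate. */
  class SizeLimitedBatchFn
      extends DoFn<KV<String, String>, KV<String, Iterable<String>>> {

    private final long maxBytes;

    SizeLimitedBatchFn(long maxBytes) {
      this.maxBytes = maxBytes;
    }

    @StateId("buffer")
    private final StateSpec<BagState<String>> bufferSpec =
        StateSpecs.bag(StringUtf8Coder.of());

    @StateId("bytes")
    private final StateSpec<ValueState<Long>> bytesSpec =
        StateSpecs.value(VarLongCoder.of());

    @ProcessElement
    public void process(
        ProcessContext c,
        @StateId("buffer") BagState<String> buffer,
        @StateId("bytes") ValueState<Long> bytes) {
      String value = c.element().getValue();
      buffer.add(value);
      Long stored = bytes.read();
      // Approximate the buffered size by character count.
      long total = (stored == null ? 0L : stored) + value.length();
      if (total >= maxBytes) {
        // Threshold reached: snapshot the buffer, emit it, and reset state.
        List<String> batch = new ArrayList<>();
        for (String buffered : buffer.read()) {
          batch.add(buffered);
        }
        c.output(KV.of(c.element().getKey(), batch));
        buffer.clear();
        bytes.write(0L);
      } else {
        bytes.write(total);
      }
    }
  }

In practice you would also set an event-time timer to flush whatever is
still buffered when the window expires, so slow keys don't hold on to data
indefinitely.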

Kenn

On Tue, Jan 9, 2018 at 11:59 AM, Robert Bradshaw 
wrote:

> We've tossed around the idea of "metadata-driven" triggers, which would
> essentially let you provide a mapping element -> metadata and a
> monotonic CombineFn metadata* -> bool; that combination would allow for
> this. (AfterCount is a special case, with the mapping fn being _ -> 1
> and the CombineFn being sum(...) >= N; for size, one would provide a
> (perhaps approximate) sizing mapping fn.)
>
> Note, however, that there's no guarantee that the trigger will fire as
> soon as possible; due to runtime characteristics, a significant amount
> of data may be buffered (or come in at once) before the trigger is
> queried. One possibility would be to follow your triggering with a
> DoFn that breaks up large value streams into multiple manageably sized
> ones as needed.
>
> On Tue, Jan 9, 2018 at 11:43 AM, Carlos Alonso 
> wrote:
> > Hi everyone!!
> >
> > I was wondering if there is an option to trigger window panes based on
> > the size of the pane itself (rather than the number of elements).
> >
> > To provide a little bit more context: we're backing up a PubSub topic
> > into GCS, with the "special" feature that, depending on the "type" of
> > the message, the GCS destination is one or another.
> >
> > The 'shape' of the messages published there is quite random: some are
> > very frequent and small, others very big but sparse... We have around
> > 150 messages per second (in total), we're firing every 15 minutes, and
> > we're experiencing OOM errors. We've considered firing based on the
> > number of items as well, but given the randomness of the input, I don't
> > think that would be a final solution either.
> >
> > Having a trigger based on size would be great; another option would be
> > to have a dynamic number of shards for the PTransform that actually
> > writes the files.
> >
> > What is your recommendation for this use case?
> >
> > Thanks!!
>


Re: Triggers based on size

2018-01-09 Thread Robert Bradshaw
We've tossed around the idea of "metadata-driven" triggers, which would
essentially let you provide a mapping element -> metadata and a
monotonic CombineFn metadata* -> bool; that combination would allow for
this. (AfterCount is a special case, with the mapping fn being _ -> 1
and the CombineFn being sum(...) >= N; for size, one would provide a
(perhaps approximate) sizing mapping fn.)

Note, however, that there's no guarantee that the trigger will fire as
soon as possible; due to runtime characteristics, a significant amount
of data may be buffered (or come in at once) before the trigger is
queried. One possibility would be to follow your triggering with a
DoFn that breaks up large value streams into multiple manageably sized
ones as needed, as in the sketch below.
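
Something roughly like this, as an untested sketch (the class name and the
per-chunk element count are illustrative; one could just as well chunk on
accumulated byte size):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;

  /** Re-emits each (key, values) pane as chunks of at most maxPerChunk elements. */
  class RebatchFn<K, V> extends DoFn<KV<K, Iterable<V>>, KV<K, Iterable<V>>> {

    private final int maxPerChunk;

    RebatchFn(int maxPerChunk) {
      this.maxPerChunk = maxPerChunk;
    }

    @ProcessElement
    public void process(ProcessContext c) {
      List<V> chunk = new ArrayList<>();
      for (V value : c.element().getValue()) {
        chunk.add(value);
        if (chunk.size() >= maxPerChunk) {
          // Emit a bounded chunk so no single downstream write sees an
          // arbitrarily large iterable.
          c.output(KV.<K, Iterable<V>>of(c.element().getKey(), chunk));
          chunk = new ArrayList<>();
        }
      }
      if (!chunk.isEmpty()) {
        c.output(KV.<K, Iterable<V>>of(c.element().getKey(), chunk));
      }
    }
  }

Applied after the triggered grouping, this keeps the values any single
downstream write has to handle bounded, without changing the triggering
itself.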

On Tue, Jan 9, 2018 at 11:43 AM, Carlos Alonso  wrote:
> Hi everyone!!
>
> I was wondering if there is an option to trigger window panes based on the
> size of the pane itself (rather than the number of elements).
>
> To provide a little bit more context: we're backing up a PubSub topic into
> GCS, with the "special" feature that, depending on the "type" of the
> message, the GCS destination is one or another.
>
> The 'shape' of the messages published there is quite random: some are very
> frequent and small, others very big but sparse... We have around 150
> messages per second (in total), we're firing every 15 minutes, and we're
> experiencing OOM errors. We've considered firing based on the number of
> items as well, but given the randomness of the input, I don't think that
> would be a final solution either.
>
> Having a trigger based on size would be great; another option would be to
> have a dynamic number of shards for the PTransform that actually writes
> the files.
>
> What is your recommendation for this use case?
>
> Thanks!!


Triggers based on size

2018-01-09 Thread Carlos Alonso
Hi everyone!!

I was wondering if there is an option to trigger window panes based on the
size of the pane itself (rather than the number of elements).

To provide a little bit more context: we're backing up a PubSub topic
into GCS, with the "special" feature that, depending on the "type" of
the message, the GCS destination is one or another.

The 'shape' of the messages published there is quite random: some are
very frequent and small, others very big but sparse... We have around 150
messages per second (in total), we're firing every 15 minutes, and we're
experiencing OOM errors. We've considered firing based on the number of
items as well, but given the randomness of the input, I don't think that
would be a final solution either.

Having a trigger based on size would be great; another option would be
to have a dynamic number of shards for the PTransform that actually
writes the files.

What is your recommendation for this use case?

Thanks!!


Re: London Apache Beam meetup 2: 11/01

2018-01-09 Thread Carlos Alonso
Cool, thanks!!

On Mon, Jan 8, 2018 at 1:38 PM Matthias Baetens <
matthias.baet...@datatonic.com> wrote:

> Yes, we put everything in place to record this time and hope to share the
> recordings soon after the meetup. Stay tuned!
>
> On 8 Jan 2018 10:32, "Carlos Alonso"  wrote:
>
> Will it be recorded?
>
> On Fri, Jan 5, 2018 at 5:11 PM Matthias Baetens <
> matthias.baet...@datatonic.com> wrote:
>
>> Hi all,
>>
>> Excited to announce the second Beam meetup, located in the *Qubit
>> offices*, next *Thursday 11/01*.
>>
>> We are very excited to have JB flying in to talk about IO and Splittable
>> DoFns, and Vadim Sobolevski to share how FutureFlow uses Beam in a
>> finance use case.
>>
>> More info and RSVP here. We are looking forward
>> to welcoming you all!
>>
>> Best regards,
>> Matthias
>>
>
>