Re: Strange errors running on DataFlow

2017-08-03 Thread Ben Chambers
These errors are often seen at the end of a pipeline -- they indicate that,
because of the failure, the backend has been torn down and subsequent
attempts to report the current status have failed. If you look in the "Stack
Traces" tab in the UI [1] or earlier in the Stackdriver logs, you should
(hopefully) be able to find the errors that caused the failure.

In general, the best place to ask future questions is probably the
google-cloud-dataflow Stack Overflow tag.

1:
https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf#error-reporting


On Thu, Aug 3, 2017 at 2:29 PM Randal Moore  wrote:

> I have a batch pipeline that runs well with small inputs but fails with a
> larger dataset.
> Looking at Stackdriver I find a fair number of the following:
>
> Request failed with code 400, will NOT retry:
> https://dataflow.googleapis.com/v1b3/projects/cgs-nonprod/locations/us-central1/jobs/2017-08-03_13_06_11-1588537374036956973/workItems:reportStatus
>
> How do I investigate to learn more about the cause?
> Am I reading this correctly that it is the reason the pipeline failed?
> Is this perhaps the result of memory pressure?
> How would I monitor the running job to determine its memory needs?
> Is there a better place to ask what is likely a Dataflow-centric
> question?
>
> Thanks in advance!
> rdm
>
>


Strange errors running on DataFlow

2017-08-03 Thread Randal Moore
I have a batch pipeline that runs well with small inputs but fails with a
larger dataset.
Looking at Stackdriver I find a fair number of the following:

Request failed with code 400, will NOT retry:
https://dataflow.googleapis.com/v1b3/projects/cgs-nonprod/locations/us-central1/jobs/2017-08-03_13_06_11-1588537374036956973/workItems:reportStatus

How do I investigate to learn more about the cause?
Am I reading this correctly that it is the reason the pipeline failed?
Is this perhaps the result of memory pressure?
How would I monitor the running job to determine its memory needs?
Is there a better place to ask what is likely a Dataflow-centric
question?

Thanks in advance!
rdm


Re: Ordering in PCollection

2017-08-03 Thread Eric Fang
Thanks.

On Thu, Aug 3, 2017 at 2:11 PM Lukasz Cwik  wrote:

> But the windows can still be processed out of order.
>
> On Thu, Aug 3, 2017 at 2:10 PM, Lukasz Cwik  wrote:
>
>> Yes.
>>
>> On Thu, Aug 3, 2017 at 2:09 PM, Eric Fang 
>> wrote:
>>
>>> Thanks Lukasz. If the stream is infinite, I am assuming you mean to
>>> window the stream into panes and sort the bundle for each trigger into
>>> order?
>>>
>>> Eric
>>>
>>> On Thu, Aug 3, 2017 at 12:49 PM Lukasz Cwik  wrote:
>>>
 There is currently no strict ordering supported within Apache Beam
 (timestamp or not), and any ordering you may observe is just a side effect
 and not guaranteed in any way.

 Since the smallest unit of work is a bundle containing 1 element, the only
 way to get ordering is to make one giant element containing all your data
 that needs to be ordered and perform the ordering yourself (e.g. a
 GroupByKey with a single dummy key).

 On Thu, Aug 3, 2017 at 12:41 PM, Eric Fang 
 wrote:

> Hi all,
>
> We have a stream of data that's ordered by a timestamp, and our use
> case requires us to process the data in order with respect to the previous
> element. For example, we have a stream of true/false ingested from PubSub,
> and we want to make sure that, for each key, a true is always followed by
> a false.
>
> I know that with PubSub the order is not guaranteed, but for the same
> Dataflow job, does ProcessContext.output guarantee order when
> processElement is called, based on event time or processing time? From my
> experiment, this assumption seems to hold up, but I wonder if this is an
> actual assumption of the system.
>
> In addition, if I key the stream with another key, does the assumption
> still hold? If not, is there any way with Beam to ensure that
> processElement is called in order of some timestamp?
>
> Thanks
> Eric
>
>
> --
>
> Eric Fang
>
> Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014
>
>
>

 --
>>>
>>> Eric Fang
>>>
>>> Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014
>>>
>>>
>>>
>>
>>
> --

Eric Fang

Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014




Re: PubSubIO withTimestampAttribute - what are the implications?

2017-08-03 Thread Lukasz Cwik
To my knowledge, autoscaling is dependent on how many messages are
backlogged within Pubsub and independent of the second subscription (the
second subscription really exists to compute a better watermark).

On Thu, Aug 3, 2017 at 1:34 PM,  wrote:

> Thanks Lukasz, that's good to know! It sounds like we can halve our PubSub
> costs then!
>
> Just to clarify, if I were to remove withTimestampAttribute from a job,
> this would cause the watermark to always be up to date (processing time)
> even if the job starts to lag behind (a build-up of unacknowledged PubSub
> messages). In this case, would Dataflow's autoscaling still scale up? I
> thought the reason the autoscaler scales up is that the watermark lags
> behind, but is it also aware of the backlog of unacknowledged PubSub
> messages?
>
> On 3 Aug 2017, at 18:58, Lukasz Cwik  wrote:
>
> Your understanding is correct - the data watermark will only matter for
> windowing. It will not affect autoscaling. If the pipeline is not doing
> any windowing, triggering, etc., then there is no need to pay for the cost
> of the second subscription.
>
> On Thu, Aug 3, 2017 at 8:17 AM, Josh  wrote:
>
>> Hi all,
>>
>> We've been running a few streaming Beam jobs on Dataflow, where each job
>> is consuming from PubSub via PubSubIO. Each job does something like this:
>>
>> PubsubIO.readMessagesWithAttributes()
>> .withIdAttribute("unique_id")
>> .withTimestampAttribute("timestamp");
>>
>> My understanding of `withTimestampAttribute` is that it means we use the
>> timestamp on the PubSub message as Beam's concept of time (the watermark) -
>> so that any windowing we do in the job uses "event time" rather than
>> "processing time".
>>
>> My question is: is my understanding correct, and does using
>> `withTimestampAttribute` have any effect in a job that doesn't do any
>> windowing? I have a feeling it may also have an effect on Dataflow's
>> autoscaling, since I think Dataflow scales up when the watermark timestamp
>> lags behind, but I'm not sure about this.
>>
>> The reason I'm concerned about this is because we've been using it in all
>> our Dataflow jobs, and have now realised that whenever
>> `withTimestampAttribute` is used, Dataflow creates an additional PubSub
>> subscription (suffixed with `__streaming_dataflow_internal`), which
>> appears to be doubling PubSub costs (since we pay per subscription)! So I
>> want to remove `withTimestampAttribute` from jobs where possible, but want
>> to first understand the implications.
>>
>> Thanks for any advice,
>> Josh
>>
>
>


Re: Ordering in PCollection

2017-08-03 Thread Lukasz Cwik
But the windows can still be processed out of order.

On Thu, Aug 3, 2017 at 2:10 PM, Lukasz Cwik  wrote:

> Yes.
>
> On Thu, Aug 3, 2017 at 2:09 PM, Eric Fang  wrote:
>
>> Thanks Lukasz. If the stream is infinite, I am assuming you mean to
>> window the stream into panes and sort the bundle for each trigger into
>> order?
>>
>> Eric
>>
>> On Thu, Aug 3, 2017 at 12:49 PM Lukasz Cwik  wrote:
>>
>>> There is currently no strict ordering supported within Apache Beam
>>> (timestamp or not), and any ordering you may observe is just a side
>>> effect and not guaranteed in any way.
>>>
>>> Since the smallest unit of work is a bundle containing 1 element, the
>>> only way to get ordering is to make one giant element containing all your
>>> data that needs to be ordered and perform the ordering yourself (e.g. a
>>> GroupByKey with a single dummy key).
>>>
>>> On Thu, Aug 3, 2017 at 12:41 PM, Eric Fang 
>>> wrote:
>>>
 Hi all,

 We have a stream of data that's ordered by a timestamp, and our use case
 requires us to process the data in order with respect to the previous
 element. For example, we have a stream of true/false ingested from PubSub,
 and we want to make sure that, for each key, a true is always followed by a
 false.

 I know that with PubSub the order is not guaranteed, but for the same
 Dataflow job, does ProcessContext.output guarantee order when
 processElement is called, based on event time or processing time? From my
 experiment, this assumption seems to hold up, but I wonder if this is an
 actual assumption of the system.

 In addition, if I key the stream with another key, does the assumption
 still hold? If not, is there any way with Beam to ensure that
 processElement is called in order of some timestamp?

 Thanks
 Eric


 --

 Eric Fang

 Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014



>>>
>>> --
>>
>> Eric Fang
>>
>> Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014
>>
>>
>>
>
>


Re: Ordering in PCollection

2017-08-03 Thread Lukasz Cwik
Yes.

On Thu, Aug 3, 2017 at 2:09 PM, Eric Fang  wrote:

> Thanks Lukasz. If the stream is infinite, I am assuming you mean to window
> the stream into panes and sort the bundle for each trigger into order?
>
> Eric
>
> On Thu, Aug 3, 2017 at 12:49 PM Lukasz Cwik  wrote:
>
>> There is currently no strict ordering supported within Apache Beam
>> (timestamp or not), and any ordering you may observe is just a side
>> effect and not guaranteed in any way.
>>
>> Since the smallest unit of work is a bundle containing 1 element, the
>> only way to get ordering is to make one giant element containing all your
>> data that needs to be ordered and perform the ordering yourself (e.g. a
>> GroupByKey with a single dummy key).
>>
>> On Thu, Aug 3, 2017 at 12:41 PM, Eric Fang 
>> wrote:
>>
>>> Hi all,
>>>
>>> We have a stream of data that's ordered by a timestamp, and our use case
>>> requires us to process the data in order with respect to the previous
>>> element. For example, we have a stream of true/false ingested from PubSub,
>>> and we want to make sure that, for each key, a true is always followed by
>>> a false.
>>>
>>> I know that with PubSub the order is not guaranteed, but for the same
>>> Dataflow job, does ProcessContext.output guarantee order when
>>> processElement is called, based on event time or processing time? From my
>>> experiment, this assumption seems to hold up, but I wonder if this is an
>>> actual assumption of the system.
>>>
>>> In addition, if I key the stream with another key, does the assumption
>>> still hold? If not, is there any way with Beam to ensure that
>>> processElement is called in order of some timestamp?
>>>
>>> Thanks
>>> Eric
>>>
>>>
>>> --
>>>
>>> Eric Fang
>>>
>>> Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014
>>>
>>>
>>>
>>
>> --
>
> Eric Fang
>
> Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014
>
>
>


Re: Ordering in PCollection

2017-08-03 Thread Eric Fang
Thanks Lukasz. If the stream is infinite, I am assuming you mean to window
the stream into panes and sort the bundle for each trigger into order?

Eric

On Thu, Aug 3, 2017 at 12:49 PM Lukasz Cwik  wrote:

> There is currently no strict ordering supported within Apache Beam
> (timestamp or not), and any ordering you may observe is just a side effect
> and not guaranteed in any way.
>
> Since the smallest unit of work is a bundle containing 1 element, the only
> way to get ordering is to make one giant element containing all your data
> that needs to be ordered and perform the ordering yourself (e.g. a
> GroupByKey with a single dummy key).
>
> On Thu, Aug 3, 2017 at 12:41 PM, Eric Fang 
> wrote:
>
>> Hi all,
>>
>> We have a stream of data that's ordered by a timestamp, and our use case
>> requires us to process the data in order with respect to the previous
>> element. For example, we have a stream of true/false ingested from PubSub,
>> and we want to make sure that, for each key, a true is always followed by
>> a false.
>>
>> I know that with PubSub the order is not guaranteed, but for the same
>> Dataflow job, does ProcessContext.output guarantee order when
>> processElement is called, based on event time or processing time? From my
>> experiment, this assumption seems to hold up, but I wonder if this is an
>> actual assumption of the system.
>>
>> In addition, if I key the stream with another key, does the assumption
>> still hold? If not, is there any way with Beam to ensure that
>> processElement is called in order of some timestamp?
>>
>> Thanks
>> Eric
>>
>>
>> --
>>
>> Eric Fang
>>
>> Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014
>>
>>
>>
>
> --

Eric Fang

Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014




Re: PubSubIO withTimestampAttribute - what are the implications?

2017-08-03 Thread jofo90
Thanks Lukasz, that's good to know! It sounds like we can halve our PubSub
costs then!

Just to clarify, if I were to remove withTimestampAttribute from a job, this
would cause the watermark to always be up to date (processing time) even if
the job starts to lag behind (a build-up of unacknowledged PubSub messages).
In this case, would Dataflow's autoscaling still scale up? I thought the
reason the autoscaler scales up is that the watermark lags behind, but is it
also aware of the backlog of unacknowledged PubSub messages?

> On 3 Aug 2017, at 18:58, Lukasz Cwik  wrote:
> 
> Your understanding is correct - the data watermark will only matter for
> windowing. It will not affect autoscaling. If the pipeline is not doing any
> windowing, triggering, etc., then there is no need to pay for the cost of
> the second subscription.
> 
>> On Thu, Aug 3, 2017 at 8:17 AM, Josh  wrote:
>> Hi all,
>> 
>> We've been running a few streaming Beam jobs on Dataflow, where each job is 
>> consuming from PubSub via PubSubIO. Each job does something like this:
>> 
>> PubsubIO.readMessagesWithAttributes()
>> .withIdAttribute("unique_id")
>> .withTimestampAttribute("timestamp");
>> 
>> My understanding of `withTimestampAttribute` is that it means we use the 
>> timestamp on the PubSub message as Beam's concept of time (the watermark) - 
>> so that any windowing we do in the job uses "event time" rather than 
>> "processing time".
>> 
>> My question is: is my understanding correct, and does using 
>> `withTimestampAttribute` have any effect in a job that doesn't do any 
>> windowing? I have a feeling it may also have an effect on Dataflow's 
>> autoscaling, since I think Dataflow scales up when the watermark timestamp 
>> lags behind, but I'm not sure about this.
>> 
>> The reason I'm concerned about this is because we've been using it in all 
>> our Dataflow jobs, and have now realised that whenever 
>> `withTimestampAttribute` is used, Dataflow creates an additional PubSub 
>> subscription (suffixed with `__streaming_dataflow_internal`), which appears 
>> to be doubling PubSub costs (since we pay per subscription)! So I want to 
>> remove `withTimestampAttribute` from jobs where possible, but want to first 
>> understand the implications.
>> 
>> Thanks for any advice,
>> Josh
> 


Re: Ordering in PCollection

2017-08-03 Thread Lukasz Cwik
There is currently no strict ordering supported within Apache Beam
(timestamp or not), and any ordering you may observe is just a side effect
and not guaranteed in any way.

Since the smallest unit of work is a bundle containing 1 element, the only
way to get ordering is to make one giant element containing all your data
that needs to be ordered and perform the ordering yourself (e.g. a
GroupByKey with a single dummy key).
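
For concreteness, here is a rough sketch of that dummy-key pattern with the
Java SDK (untested; MyEvent and its getTimestamp() are hypothetical stand-ins
for your own element type, and the usual Beam and java.util imports are
assumed):

  PCollection<MyEvent> events = ...;

  events
      // Give every element the same key so GroupByKey produces one group.
      .apply("DummyKey", WithKeys.<String, MyEvent>of("all"))
      // All data now lands in a single KV<String, Iterable<MyEvent>>.
      .apply("GroupAll", GroupByKey.<String, MyEvent>create())
      .apply("SortAndEmit", ParDo.of(
          new DoFn<KV<String, Iterable<MyEvent>>, MyEvent>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              // Beam makes no ordering promises, so materialize the group
              // and sort it yourself before re-emitting.
              List<MyEvent> sorted = new ArrayList<>();
              for (MyEvent e : c.element().getValue()) {
                sorted.add(e);
              }
              sorted.sort(Comparator.comparingLong(MyEvent::getTimestamp));
              for (MyEvent e : sorted) {
                c.output(e);  // ordered within this one giant element
              }
            }
          }));

The trade-off is that a single key removes all parallelism for that step, and
the whole group has to fit in memory on one worker.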

On Thu, Aug 3, 2017 at 12:41 PM, Eric Fang  wrote:

> Hi all,
>
> We have a stream of data that's ordered by a timestamp, and our use case
> requires us to process the data in order with respect to the previous
> element. For example, we have a stream of true/false ingested from PubSub,
> and we want to make sure that, for each key, a true is always followed by
> a false.
>
> I know that with PubSub the order is not guaranteed, but for the same
> Dataflow job, does ProcessContext.output guarantee order when
> processElement is called, based on event time or processing time? From my
> experiment, this assumption seems to hold up, but I wonder if this is an
> actual assumption of the system.
>
> In addition, if I key the stream with another key, does the assumption
> still hold? If not, is there any way with Beam to ensure that
> processElement is called in order of some timestamp?
>
> Thanks
> Eric
>
>
> --
>
> Eric Fang
>
> Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014
>
>
>


Ordering in PCollection

2017-08-03 Thread Eric Fang
Hi all,

We have a stream of data that's ordered by a timestamp, and our use case
requires us to process the data in order with respect to the previous
element. For example, we have a stream of true/false ingested from PubSub,
and we want to make sure that, for each key, a true is always followed by a
false.

I know that with PubSub the order is not guaranteed, but for the same
Dataflow job, does ProcessContext.output guarantee order when processElement
is called, based on event time or processing time? From my experiment, this
assumption seems to hold up, but I wonder if this is an actual assumption of
the system.

In addition, if I key the stream with another key, does the assumption
still hold? If not, is there any way with Beam to ensure that
processElement is called in order of some timestamp?

Thanks
Eric


-- 

Eric Fang

Stack Labs  |  10054 Pasadena Ave, Cupertino, CA 95014




PubSubIO withTimestampAttribute - what are the implications?

2017-08-03 Thread Josh
Hi all,

We've been running a few streaming Beam jobs on Dataflow, where each job is
consuming from PubSub via PubSubIO. Each job does something like this:

PubsubIO.readMessagesWithAttributes()
.withIdAttribute("unique_id")
.withTimestampAttribute("timestamp");

My understanding of `withTimestampAttribute` is that it means we use the
timestamp on the PubSub message as Beam's concept of time (the watermark) -
so that any windowing we do in the job uses "event time" rather than
"processing time".

My question is: is my understanding correct, and does using
`withTimestampAttribute` have any effect in a job that doesn't do any
windowing? I have a feeling it may also have an effect on Dataflow's
autoscaling, since I think Dataflow scales up when the watermark timestamp
lags behind, but I'm not sure about this.

The reason I'm concerned about this is because we've been using it in all
our Dataflow jobs, and have now realised that whenever
`withTimestampAttribute` is used, Dataflow creates an additional PubSub
subscription (suffixed with `__streaming_dataflow_internal`), which appears
to be doubling PubSub costs (since we pay per subscription)! So I want to
remove `withTimestampAttribute` from jobs where possible, but want to first
understand the implications.

Thanks for any advice,
Josh
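
For reference, a minimal sketch (the topic name and the downstream
aggregation are hypothetical) of the kind of pipeline where
withTimestampAttribute actually matters -- an event-time window applied
after the read:

  pipeline
      .apply(PubsubIO.readMessagesWithAttributes()
          .fromTopic("projects/my-project/topics/my-topic")
          .withIdAttribute("unique_id")
          // The watermark now tracks this attribute (event time) rather
          // than the PubSub publish time.
          .withTimestampAttribute("timestamp"))
      // Count elements per one-minute event-time window.
      .apply(Window.<PubsubMessage>into(
          FixedWindows.of(Duration.standardMinutes(1))))
      .apply(Combine.globally(Count.<PubsubMessage>combineFn())
          .withoutDefaults());

Without a windowing/triggering step like this, the attribute-driven
watermark has nothing to act on, which is the case where the second
(watermark-tracking) subscription buys you nothing.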


Re: Spark job hangs up at Evaluating ParMultiDo(ParseInput)

2017-08-03 Thread Sathish Jayaraman
Hi,

Thanks for trying it out.

I was running the job in a local single-node setup. I also spawned an
HDInsight cluster on the Azure platform just to test the WordCount program.
It's the same result there too: stuck at the Evaluating ParMultiDo step. It
runs fine with mvn compile exec, but when bundled into a jar and submitted
via spark-submit there is no result. If there is no support in Beam for
running on top of Spark, then I have to write native Spark code, which is
what I am doing currently.

Regards,
Sathish. J


On 03-Aug-2017, at 2:34 PM, Jean-Baptiste Onofré wrote:




Re: Spark job hangs up at Evaluating ParMultiDo(ParseInput)

2017-08-03 Thread Sathish Jayaraman
Hi,

Was anyone able to run a Beam application on Spark at all?

I tried all possible options and still no luck. No executors are getting
assigned for the job I submit with the command below, even though they are
explicitly specified:

$ ~/spark/bin/spark-submit --class org.apache.beam.examples.WordCount --master
yarn --executor-memory 2G --num-executors 2
target/word-count-beam-0.1-shaded.jar --runner=SparkRunner --inputFile=pom.xml
--output=counts

Could someone please help or point me to the right forum?

Attached is the job log from YARN. The job is stuck at 'INFO
spark.SparkRunner$Evaluator: Evaluating ParMultiDo(ExtractWords)'.

Regards,
Sathish. J



On 01-Aug-2017, at 4:16 PM, Sathish Jayaraman wrote:

Hi JB,

Not when I submit using the option --master local. But when I submit with
the Spark master's port, the job executes and hangs at the step 'Registering
block manager 192.168.0.2:58956 with 366.3 MB RAM, BlockManagerId(0,
192.168.0.2, 58956, None)'. Attached are screenshots of the History UI &
Dashboard. Also raised a question on Stack Overflow too.

Command:
~/spark/bin/spark-submit --class org.apache.beam.examples.WordCount --master 
spark://Sathish-MacBook-Pro.local:7077 target/word-count-beam-0.1.jar 
--runner=SparkRunner --inputFile=pom.xml --output=counts






Regards,
Sathish. J

On 01-Aug-2017, at 4:02 PM, Jean-Baptiste Onofré wrote:

Hi Sathish,

Do you see the tasks submitted on the history server ?

Regards
JB

On 08/01/2017 11:51 AM, Sathish Jayaraman wrote:
Hi,
I am trying to execute a Beam example in a local Spark setup. When I try to
submit the sample WordCount jar via spark-submit, the job just hangs at 'INFO
SparkRunner$Evaluator: Evaluating ParMultiDo(ExtractWords)'. But it runs fine
when executed directly. Below is the command I used to submit the job in
local Spark mode:
$ ~/spark/bin/spark-submit --class "org.apache.beam.examples.WordCount"
--master local[4] target/word-count-beam-0.1.jar --inputFile=./pom.xml
--output=csvout --runner=SparkRunner
Have attached the log file for reference. Can anyone please help me find out
what's going on?
Regards,
Sathish. J

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


17/08/03 13:00:33 INFO client.RMProxy: Connecting to ResourceManager at 
/0.0.0.0:8030
17/08/03 13:00:33 INFO yarn.YarnRMClient: Registering the ApplicationMaster
17/08/03 13:00:34 INFO yarn.YarnAllocator: Will request 2 executor 
container(s), each with 1 core(s) and 2432 MB memory (including 384 MB of 
overhead)
17/08/03 13:00:34 INFO yarn.YarnAllocator: Submitted 2 unlocalized container 
requests.
17/08/03 13:00:34 INFO yarn.ApplicationMaster: Started progress reporter thread 
with (heartbeat : 3000, initial allocation : 200) intervals
17/08/03 13:00:35 INFO impl.AMRMClientImpl: Received new token for : 
192.168.0.7:50173
17/08/03 13:00:35 INFO yarn.YarnAllocator: Launching container 
container_1501744514957_0003_01_02 on host 192.168.0.7
17/08/03 13:00:35 INFO yarn.YarnAllocator: Received 1 containers from YARN, 
launching executors on 1 of them.
17/08/03 13:00:35 INFO impl.ContainerManagementProtocolProxy: 
yarn.client.max-cached-nodemanagers-proxies : 0
17/08/03 13:00:35 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
192.168.0.7:50173
17/08/03 13:00:37 INFO yarn.YarnAllocator: Launching container 
container_1501744514957_0003_01_03 on host 192.168.0.7
17/08/03 13:00:37 INFO yarn.YarnAllocator: Received 1 containers from YARN, 
launching executors on 1 of them.
17/08/03 13:00:37 INFO impl.ContainerManagementProtocolProxy: 
yarn.client.max-cached-nodemanagers-proxies : 0
17/08/03 13:00:37 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
192.168.0.7:50173

Re: Un-parallelized TextIO to HDFS on Beam+Flink+YARN

2017-08-03 Thread Aljoscha Krettek
Hi,

I think this might not be a problem. The reason we have this
DataSink(DiscardingOutputFormat) at the "end" of Flink Batch pipelines is that
Flink Batch will not execute a chain of operations when they're not terminated
by a sink. In Beam, it's fine to just have a DoFn and no sink after that,
because the DoFn can also write data to an external system. In fact, the
TextIO.write() operation was implemented as a combination of several DoFns,
last time I checked. I would assume that this "terminator" sink is just
waiting until the DoFns that do the actual work are done writing.

Could you maybe post the execution plan that you see in the Flink Dashboard? 
Either here or in private to me, if you don't want to share that on the ML.

Best,
Aljoscha

> On 2. Aug 2017, at 17:02, Chris Hebert 
>  wrote:
> 
> Hi,
> 
> I ran a Beam+Flink+YARN job with many containers and a high 
> "parallelism.default" parameter in my conf/flink-conf.yaml file.
> 
> That all worked perfectly with all the containers parallelizing all the parts 
> of the job up until the very end at a TextIO.write().
> 
> The last Task running was a "DataSink 
> (org.apache.flink.api.java.io.DiscardingOutputFormat)" during which I could 
> finally actually observe output being written to HDFS. (The pipeline was in 
> batch-mode reading from the file system, so I'm not entirely shocked writing 
> output was saved until the end).
> 
> The problem is: This Task used only one TaskManager/Container and ran all by 
> itself for the last 15 minutes of the pipeline while all the other 
> TaskManagers/Containers sat idle.
> 
> How can I make sure that this gets parallelized the same as all the other 
> Tasks?
> 
> Should I address this through the Beam API or through some Flink 
> configuration parameter I haven't found yet?
> 
> Is it even possible to have multiple TaskManagers writing the TextIO output 
> to HDFS at the same time?
> 
> Thank you for your help,
> Chris
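
On the question of whether multiple TaskManagers can write the TextIO output
at the same time: one knob worth trying (a sketch under assumptions, not a
confirmed fix -- the path and shard count are made up) is to shard the write
explicitly, since each shard becomes its own output file and separate shards
can be written in parallel:

  // "output" is the PCollection<String> feeding the existing TextIO.write().
  output.apply(TextIO.write()
      .to("hdfs://namenode/output/part")  // hypothetical HDFS prefix
      .withSuffix(".txt")
      // Ask for 16 shard files; with enough task slots available, the
      // runner can write them concurrently.
      .withNumShards(16));

Leaving numShards at its default (runner-chosen) is usually preferable, but
pinning it can help when the runner serializes the write into a single task.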