Re: Beam courses

2019-01-13 Thread Davor Bonaci
I'll introduce you to folks who can do this for you off-list.

On Sun, Jan 13, 2019 at 12:28 PM Vikram Tiwari 
wrote:

> Hey! I think he mentioned to me once that they do training for Beam,
> etc. Might wanna talk to him.
> https://www.linkedin.com/in/dbonaci
>
>
> On Sun, Jan 13, 2019, 12:08 PM Alex Van Boxel  wrote:
>
>> Hey all,
>>
>> Our team had the luxury of growing with Beam; we were Dataflow users
>> before it was GA. But now our team has grown, due to a merger.
>>
>> As we will continue using Beam, but now across different sites, I'm
>> thinking about training. The question is: should I create the training
>> myself, or do people specialise in Beam training? I'm not talking about
>> a simple getting-started course... I want deep training.
>>
>> Any suggestions on how people in this group handle training?
>>
>


Re: Beam Cookbook?

2018-06-09 Thread Davor Bonaci
Hi Austin --
It would be great to see this materialize.

I've been pursued by publishers a lot in the last year, so I might be able
to facilitate introductions if you need them. You'll probably want to have
a few sample draft chapters, however.

I'm aware of 3 different groups creating related content in this space, but
I think nobody is doing it from this angle. They are lurking on the mailing
list, so they may have reached out already.

All the best, and I hope you finish it with the same enthusiasm with which you
started!

Davor

On Thu, Jun 7, 2018 at 11:44 AM, Austin Bennett  wrote:

> I'm looking at assembling a physical book along the lines of "Apache Beam
> Cookbook", though I might take a different approach to the topic (if I
> realize there is a better hole to fill, or something that needs more
> attention first).
>
> I believe many could benefit from more substantive write-ups and
> explanations on use-cases, and specific bits in code (ex: to accomplish x
> you might want to use recipe Y, pay special attention to this function,
> with associated paragraphs of text and noting specific lines in the code,
> etc etc).  While this can be done on a website and GitHub, I do believe the
> more concrete nature of a book (esp. with reputable publisher) gives
> additional signaling to others that the subject is sufficiently mature.  My
> aim will be for this book to be freely available, at least in an e-book
> version; one example of that is
> https://www.confluent.io/resources/kafka-the-definitive-guide/ and surely
> you've come across other examples.
>
> I see that many cookbook-style code examples already exist, but the
> associated write-up could be useful to others, as could the overall
> presentation/bundling that makes it even easier to find and use.
>
> I'm wondering what the group thinks, and whether there are others with a
> strong interest in collaborating on such an undertaking.
>
>
>


Re: Hiring Data Engineers - $100/hr - LA

2018-05-03 Thread Davor Bonaci
>
> @moderator, please be careful when you accept messages.


I don't think we have any active moderators. Also, moderation doesn't
catch "content", just messages from non-subscribers. Michael is probably
a subscriber, so the message went through directly, without any moderation.


Re: DataWorks Summit Sydney 2017

2017-09-20 Thread Davor Bonaci
A quick reminder that Beam sessions are later today. And, of course,
stickers will make an appearance!

If you'd like to chat about all-things-Beam, please stay in the room after
any of these sessions, or stop me if you see me around.

I hope to see many of you there!

On Fri, Sep 15, 2017 at 11:04 AM, Davor Bonaci <da...@apache.org> wrote:

> Apache Beam will be featured at the DataWorks Summit / Hadoop Summit next
> week in Sydney, Australia [1].
>
> Scheduled events:
> * Realizing the promise of portable data processing with Apache Beam [2]
>   Time: Thursday, 9/21, 3:10 pm
>   Speaker: Davor Bonaci
>
> * Birds-of-a-feather: IoT, Streaming and Data Flow [3]
>   Time: Thursday, 9/21, 5:00 pm
>   Panel: Davor Bonaci, Priyank Shah, Andy Lopresto, Hugo da Cruz Louro
>
> Everybody is welcome -- users and contributors alike! Feel free to use
> code ASF25 to get 25% off the standard All-Access Pass.
>
> I hope to see many of you there!
>
> Davor
>
> [1] https://dataworkssummit.com/sydney-2017/
> [2] https://dataworkssummit.com/sydney-2017/sessions/realizing-the-promise-of-portable-data-processing-with-apache-beam/
> [3] https://dataworkssummit.com/sydney-2017/birds-of-a-feather/iot-streaming-data-flow/
>


YOW! Data Conference Sydney 2017

2017-09-15 Thread Davor Bonaci
Apache Beam will be featured at the YOW! Data Conference next week in
Sydney, Australia [1].

Scheduled events:
* Processing data of any size with Apache Beam [2]
  Time: Tuesday, 9/19, 9:00 am
  Speaker: Jesse Anderson

* Realizing the promise of portable data processing with Apache Beam [3]
  Time: Tuesday, 9/19, 10:50 am
  Speaker: Davor Bonaci

* Workshop: Data Engineering with Apache Beam [4]
  Time: Wednesday, 9/20, 9:00 am - 5:00 pm
  Speaker: Jesse Anderson (with Davor Bonaci)

Everybody is welcome -- users and contributors alike!

I hope to see many of you there!

Davor

[1] http://data.yowconference.com.au/
[2] http://data.yowconference.com.au/proposal/?id=4668
[3] http://data.yowconference.com.au/proposal/?id=4474
[4] http://data.yowconference.com.au/proposal/?id=4972


Re: DataWorks Summit San Jose 2017

2017-06-13 Thread Davor Bonaci
A quick reminder that Beam talks are tomorrow. And, of course, stickers
will make an appearance!

If you'd like to chat about all-things-Beam, please stay in the room after
any of these sessions, or stop me if you see me around.

I hope to see many of you there!

On Mon, Jun 5, 2017 at 4:22 PM, Davor Bonaci <da...@apache.org> wrote:

> Apache Beam will be featured at the DataWorks Summit / Hadoop Summit next
> week in San Jose, CA [1].
>
> Scheduled events:
> * Realizing the promise of portable data processing with Apache Beam [2]
>   Time: Wednesday, 6/14, 11:30 am
>   Speaker: Davor Bonaci
>
> * Stateful processing of massive out-of-order streams with Apache Beam [3]
>   Time: Wednesday, 6/14, 3:00 pm
>   Speaker: Kenneth Knowles
>
> * Birds-of-a-feather: IoT, Streaming and Data Flow [4]
>   Time: Thursday, 6/15, 5:00 pm
>   Panel: Yolanda Davis, Davor Bonaci, P. Taylor Goetz, Sriharsha
> Chintalapani, Joseph Niemiec
>
> Everybody is welcome -- users and contributors alike! Feel free to use
> code ASF25 to get 25% off the standard All-Access Pass.
>
> I hope to see many of you there!
>
> Davor
>
> [1] https://dataworkssummit.com/
> [2] https://dataworkssummit.com/san-jose-2017/sessions/realizing-the-promise-of-portable-data-processing-with-apache-beam/
> [3] https://dataworkssummit.com/san-jose-2017/sessions/stateful-processing-of-massive-out-of-order-streams-with-apache-beam/
> [4] https://dataworkssummit.com/san-jose-2017/birds-of-a-feather/iot-streaming-and-data-flow/
>


Re: Apache Beam Meetup in Bay Area

2017-05-22 Thread Davor Bonaci
Just a reminder that this is taking place on Wednesday in the Bay Area.
Hope to see many of you there!

On Thu, May 11, 2017 at 12:39 PM, Davor Bonaci <da...@apache.org> wrote:

> I'm happy to announce that we'll have our very first Bay Area meetup [1]
> on Wednesday, May 24, hosted by the fine folks at Hortonworks.
>
> Abstract:
> The world of big data involves an ever changing field of players. Much as
> SQL stands as a lingua franca for declarative data analysis, Apache Beam
> aims to provide a portable standard for expressing robust, massive scale,
> out-of-order data processing pipelines in a variety of languages across a
> variety of platforms. Come see how this vision has been realized in the
> first stable release of Apache Beam and discuss the challenges which lie
> ahead.
>
> Location:
> Hortonworks HQ
> 5470 Great America Parkway, Santa Clara, CA, USA
>
> Agenda:
> 6:00 pm: Doors open
> 6:00 pm - 6:15 pm: Networking, Pizza and Drinks
> 6:15 pm - 6:30 pm: Introduction & Apache Beam release 2.0.0 - Rafael Coss
> and Davor Bonaci
> 6:30 pm - 7:15 pm: Using Apache Beam for Batch, Streaming, and Everything
> in Between - Frances Perry
> 7:15 pm - 8:00 pm: Realizing the promise of portable data processing with
> Apache Beam - Davor Bonaci
> 8:00 pm - 8:45 pm: use case talk - to be confirmed
>
> If you are in the area, please join us -- RSVP now [1].
>
> Davor
>
> [1] https://www.meetup.com/futureofdata-siliconvalley/events/239369704/
>


Apache Beam publishes the first stable release

2017-05-17 Thread Davor Bonaci
The Apache Beam community is pleased to announce the availability of
version 2.0.0. This is the first stable release of Apache Beam, signifying
a statement from the community that it intends to maintain API stability
with all releases for the foreseeable future, and making Beam suitable for
enterprise deployment.

Apache Software Foundation press release:
http://globenewswire.com/news-release/2017/05/17/986839/0/en/The-Apache-Software-Foundation-Announces-Apache-Beam-v2-0-0.html

Beam blog:
https://beam.apache.org/blog/2017/05/17/beam-first-stable-release.html

We’d like to invite everyone to try out Apache Beam today and consider
joining our vibrant community. We welcome feedback, contribution and
participation through our mailing lists, issue tracker, pull requests, and
events.

Davor


Re: BigQuery table backed by Google Sheet

2017-05-04 Thread Davor Bonaci
This sounds like a Dataflow-specific question, so it might be better
addressed by a StackOverflow question tagged with google-cloud-dataflow [1].

I'd suggest checking the Dataflow documentation for more information about
its security model [2]. You'd likely have to use a service account, as
described there for a few scenarios.

Hope this helps, but please don't hesitate to post on StackOverflow for any
Dataflow-specific questions.

Davor

[1] http://stackoverflow.com/questions/tagged/google-cloud-dataflow
[2] https://cloud.google.com/dataflow/security-and-permissions
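
For illustration, one way to attach re-scoped credentials to the pipeline is
sketched below. This is only a sketch, not something confirmed by this thread:
it assumes application-default credentials, and the exact package names
(GcpOptions moved between packages across Beam versions) and scope URLs should
be checked against the SDK version in use.

  import com.google.auth.oauth2.GoogleCredentials;
  import java.util.Arrays;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class DriveScopedPipeline {
    public static void main(String[] args) throws Exception {
      GcpOptions options = PipelineOptionsFactory.fromArgs(args).as(GcpOptions.class);

      // Re-scope the application-default credentials to include Drive read
      // access in addition to BigQuery, so a Sheets-backed table can be read.
      GoogleCredentials credentials =
          GoogleCredentials.getApplicationDefault()
              .createScoped(Arrays.asList(
                  "https://www.googleapis.com/auth/bigquery",
                  "https://www.googleapis.com/auth/drive.readonly"));
      options.setGcpCredential(credentials);

      Pipeline p = Pipeline.create(options);
      // ... apply BigQueryIO.read(...) against the Sheets-backed table ...
      p.run();
    }
  }

The service account running the job also needs to be granted read access to the
underlying Sheet itself, as noted below.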

On Wed, May 3, 2017 at 11:08 PM, Prabeesh K.  wrote:

> Hi Anil,
>
> Thank you for your attempt to answer the question, but this is a rather
> broad answer.
>
> The problem is that we need to add an additional scope to Beam to read the
> Google Sheet, and grant the service account permission to read it.
>
> Regards,
> Prabeesh K.
>
>
> On 4 May 2017 at 08:58, Anil Srinivas 
> wrote:
>
>> Hi,
>> The problem, as I see it, is that you don't have the permissions
>> to access the data in the BigQuery table. Make sure you log in with the
>> account that has permissions for reading/writing data in the table.
>>
>>
>>
>>
>> Regards,
>> Anil
>>
>>
>> On Wed, May 3, 2017 at 6:21 PM, Prabeesh K.  wrote:
>>
>>> Hi,
>>>
>>> How can we read a BigQuery table that is backed by a Google Sheet?
>>>
>>> For me, I am getting the following error.
>>>
>>> "error": {
>>>   "errors": [
>>>{
>>> "domain": "global",
>>> "reason": "accessDenied",
>>> "message": "Access Denied: BigQuery BigQuery: Permission denied
>>> while globbing file pattern.",
>>> "locationType": "other",
>>> "location": "/gdrive/id/--"
>>>}
>>>   ],
>>>   "code": 403,
>>>   "message": "Access Denied: BigQuery BigQuery: Permission denied while
>>> globbing file pattern."
>>>  }
>>> }
>>>
>>>
>>> Any help fixing this issue would be appreciated.
>>>
>>>
>>> Regards,
>>> Prabeesh K.
>>>
>>>
>>>
>>
>


Re: AvroCoder + KafkaIO + Flink problem

2017-04-30 Thread Davor Bonaci
And, thanks to Kenn, a likely workaround is in a pending pull request [1].
This should be fixed shortly at HEAD and be a part of the next release.

Davor

[1] https://github.com/apache/beam/pull/2783
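
In the meantime, the SerializableCoder workaround mentioned further down this
thread looks roughly like the sketch below (MyRecord is an illustrative
Serializable POJO, not a class from this thread):

  import java.io.Serializable;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.coders.SerializableCoder;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.values.PCollection;

  public class SerializableCoderWorkaround {

    /** Illustrative POJO; any Serializable class works the same way. */
    public static class MyRecord implements Serializable {
      public String key;
      public long value;
    }

    public static void main(String[] args) {
      Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

      MyRecord r = new MyRecord();
      r.key = "a";
      r.value = 1L;

      // Explicitly set SerializableCoder instead of relying on
      // @DefaultCoder(AvroCoder.class), sidestepping Avro's class cache
      // when the job runs under Flink's per-job classloaders.
      PCollection<MyRecord> records =
          p.apply(Create.of(r).withCoder(SerializableCoder.of(MyRecord.class)));

      p.run().waitUntilFinish();
    }
  }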


On Sat, Apr 29, 2017 at 1:13 PM, Aljoscha Krettek 
wrote:

> There were some newer messages on the issue as well (
> https://issues.apache.org/jira/browse/BEAM-1970). The problem occurs if
> you reuse a Flink cluster for running the same job (or some other job that
> uses the same classes) again. The workaround would be to not reuse a
> cluster for several jobs. If the jar of a job has to be in the lib folder,
> this also means that a cluster cannot be reused for a new jar, since the
> cluster has to be restarted whenever there is a new job/jar; so this
> workaround would have roughly the same “complexity”.
>
> I think we’ll finally get rid of that once we move to a cluster-per-job
> model. :-)
>
> Best,
> Aljoscha
>
> On 29. Apr 2017, at 14:45, Stephan Ewen  wrote:
>
> Avro tries to do some magic caching which does not work with multi
> classloader setups.
>
> A common workaround for these types of problems is to drop the relevant
> classes into Flink's "lib" folder and not have them in the job's jar.
>
> We wrote up some help on that: https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/debugging_classloading.html
>
> Some background: Flink has two different class loading modes:
>
> (1) All in the system class loader, which is used for the one-job-Yarn
> deployments (in the future also for Mesos / Docker / etc)
>
> (2) Dynamic class loading, with one classloader per deployment. Used in
> all setups that follow the pattern "start Flink first, launch job later".
>
>
> On Sat, Apr 29, 2017 at 8:02 AM, Aljoscha Krettek 
> wrote:
>
>> Yep, KafkaCheckpointMark also uses AvroCoder, so I’m guessing it’s
>> exactly the same problem.
>>
>> Best,
>> Aljoscha
>>
>>
>> On 28. Apr 2017, at 22:29, Jins George  wrote:
>>
>> I also faced a similar issue when re-starting a flink job from a save
>> point on an existing cluster . ClassCastException was with
>> KafkaCheckpointMark class . It was due to the different class loaders.  The
>> workaround for me was to run one job per Yarn session.  For restart from
>> savepoint, start a new yarn session and submit.
>>
>> Thanks,
>> Jins George
>>
>> On 04/28/2017 09:34 AM, Frances Perry wrote:
>>
>> I have the same problem and am working around it with SerializableCoder.
>> +1 to a real solution.
>>
>> On Fri, Apr 28, 2017 at 8:46 AM, Aljoscha Krettek < 
>> aljos...@apache.org> wrote:
>>
>>> I think you could. But we should also try finding a solution for this
>>> problem.
>>>
>>> On 28. Apr 2017, at 17:31, Borisa Zivkovic 
>>> wrote:
>>>
>>> Hi Aljoscha,
>>>
>>> this is probably the same problem I am facing.
>>>
>>> I execute multiple pipelines on the same Flink cluster - all launched at
>>> the same time...
>>>
>>> I guess I can try to switch to SerializableCoder and see how that works?
>>>
>>> thanks
>>>
>>>
>>>
>>> On Fri, 28 Apr 2017 at 16:20 Aljoscha Krettek < 
>>> aljos...@apache.org> wrote:
>>>
 Hi,
 There is this open issue:
 
 https://issues.apache.org/jira/browse/BEAM-1970. Could this also be
 what is affecting you? Are you running several pipelines on the same Flink
 cluster, either one after another or at the same time?

 Best,
 Aljoscha

 On 28. Apr 2017, at 12:45, Borisa Zivkovic <
 borisha.zivko...@gmail.com> wrote:

 Hi,

 I have this small pipeline that is writing data to Kafka (using
 AvroCoder) and then another job is reading the same data from Kafka, doing
 few transformations and then writing data back to different Kafka topic
 (AvroCoder again).

 First pipeline is very simple, read data from a text file, create POJO,
 use AvroCoder to write POJO to Kafka.

 Second pipeline is also simple, read POJO from Kafka, do few
 transformations, create new POJO and write data to Kafka using AvroCoder
 again.

 When I use direct runner everything is ok.

 When I switch to flink runner (small remote flink cluster) I get this
 exception in the second pipeline

 Caused by: java.lang.ClassCastException: test.MyClass cannot be cast to
 test.MyClass

 This happens in the first MapFunction immediately after reading
 data from Kafka.

 I read about this problem in Flink and how they resolve it, but I'm not
 sure how to fix this when using Beam.

 
 https://issues.apache.org/jira/browse/FLINK-1390

 test.MyClass has annotation @DefaultCoder(AvroCoder.class) and is very
 simple POJO.

 Not sure how to 

Re: Apache Beam Slack channel

2017-04-27 Thread Davor Bonaci
(These were already done by someone.)

On Thu, Apr 27, 2017 at 1:53 PM, Tony Moulton  wrote:

> Please include me as well during the next batch of Slack additions.
> Thanks!
>
> —
> Tony
>
>
>
> On Apr 27, 2017, at 4:51 PM,  <
> oscar.b.rodrig...@accenture.com> wrote:
>
> Hi there,
>
> Can you please add me to the Apache Beam Slack channel?
>
> Thanks
> -Oscar
>
> Oscar Rodriguez
> Solution Architect
> Google CoE | Accenture Cloud
> M +1 718-440-0881 | W +1 917-452-3923
> email: oscar.b.rodrig...@accenture.com
>
>
> --
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy.
> 
> __
>
> www.accenture.com
>
>
>


Apache: Big Data North America 2017 conference

2017-04-24 Thread Davor Bonaci
Apache Beam will be prominently featured at the upcoming Apache: Big Data
North America 2017 conference in Miami, FL [1].

Scheduled talks:
* Using Apache Beam for Batch, Streaming, and Everything in Between
  Speakers: Frances Perry, and Dan Halperin

* Apache Beam: Integrating the Big Data Ecosystem Up, Down, and Sideways
  Speakers: Davor Bonaci, and Jean-Baptiste Onofré

* Concrete Big Data Use Cases Implemented with Apache Beam
  Speaker: Jean-Baptiste Onofré

* Nexmark, a Unified Framework to Evaluate Big Data Processing Systems with
Apache Beam
  Speakers: Ismael Mejia, and Etienne Chauchot

In addition to talks, we plan to organize a birds-of-a-feather session and
social activities. Details TBD.

Everybody is welcome -- users and contributors alike! Feel free to use code
ABDSP20 to get 20% off the registration fee, and/or your committer code (if
applicable).

I hope to see many of you there!

Davor

[1] http://events.linuxfoundation.org/events/apache-big-data-north-america/


Re: how to enforce dependency among transforms in a piepline

2017-04-20 Thread Davor Bonaci
Unfortunately, adding a dependency from TextIO.Write to TextIO.Read is not
something that can be done via Beam APIs today. Stay tuned, however -- this
is something we are interested in improving, and a lot of the groundwork
has already been done.

As a workaround, you can pipe the input of TextIO.Write directly into the
step currently consuming the output of TextIO.Read.

Also, the pipeline construction exception you are seeing can be worked
around by disabling validation on the TextIO.Read step; that, however,
would make the pipeline start executing, but wouldn't guarantee correct
results.

On Wed, Apr 19, 2017 at 3:54 PM, Seshadri Raghunathan 
wrote:

> Hi,
>
> I am trying to do something like below  :
>
> PCollection<String> pc1 = ...;
>
> pc1.apply(TextIO.Write.to("gs://BucketA/fileA.txt"))
>
> PCollection<String> pc2 = TextIO.Read.from("gs://BucketA/fileA.txt");
>
> When I execute the above in a pipeline, I get an error when I try to read
> from the GS bucket-
>
> java.lang.IllegalStateException: Unable to find any files matching
> StaticValueProvider{value=gs://beamoutput/test105A.txt}
>
> Is there a mechanism to ensure that the read above does not happen before
> the write ?
>
> Thanks,
> Seshadri
>
>


Re: beam + scala + streamline

2017-04-13 Thread Davor Bonaci
Hi Georg --
Great to see you are evaluating Beam for your scenario.


> > someone told me that e.g. the flink runner for beam seems to be slower
>> than a
>> > native flink job. Is this true? Did you observe such characteristics
>> for several
>> > runners?
>>
>
This should not be true in a general sense -- the performance should be
~equivalent. The Flink runner in Beam constructs a "native" Flink pipeline;
the overhead of invoking user-defined functions is often to set a few
fields and invoke a function, which is negligible. The actual performance
of a pipeline tends to depend on other factors -- stragglers, how fast the
system can adapt to changing load, etc.

(If there's a gap somewhere, it is likely a bug -- and we'd like to know
about it and fix it.)

> in case I want to use some low level functionality (specific to a runner)
>> like
>> > ML, graph processing or sql-tables api, is it possible to just drop
>> from the
>> > beam API one level deeper to the actual runner and sort of mesh beam
>> with runner
>> > native code to integrate these features?
>
>
The Beam API, in a general sense, doesn't provide such hooks, as that would
break portability.

I wouldn't advise this, but technically, it wouldn't be hard -- you'd
create a PTransform in Beam, and modify the runner to replace it with its
own specific implementation. Instead, I'd suggest using Beam's abstractions
and, in the case of a missing pattern or a feature, to work with us to
augment the Beam model accordingly.
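
For illustration only, that pattern would look roughly like the sketch below: a
composite transform with a portable default expansion that a modified runner
could recognize and swap for a native implementation (the transform and its
expansion are made up for this example):

  import org.apache.beam.sdk.transforms.Count;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;

  public class MyGraphMetric
      extends PTransform<PCollection<KV<String, String>>, PCollection<KV<String, Long>>> {

    @Override
    public PCollection<KV<String, Long>> expand(PCollection<KV<String, String>> edges) {
      // Default, runner-agnostic expansion (here just a per-key edge count);
      // a modified runner could replace the whole MyGraphMetric transform
      // with, say, a native Flink Gelly implementation.
      return edges.apply(Count.perKey());
    }
  }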

Hope this helps -- and that you find Beam fitting for your case. Please let
us know if we can assist any further -- thanks!

Davor


Re: Unhelpful ExceptionInChainedStubException errors with Flink runner

2017-04-12 Thread Davor Bonaci
Aljoscha, any ideas perhaps?

On Wed, Apr 5, 2017 at 12:52 PM, peay  wrote:

> Hello,
>
> I've been having some trouble with debugging exceptions in user code when
> using the Flink runner. Here's an example from a window/DoFn/GroupByKey
> pipeline.
>
> ERROR o.a.f.runtime.operators.BatchTask - Error in task code:  CHAIN
> MapPartition (MapPartition at ParDo(MyDoFn)) -> FlatMap
> (Transform/Windowing/Window.Assign.out) -> Map (Key Extractor) ->
> GroupCombine (GroupCombine at GroupCombine: GroupByKey) -> Map (Key
> Extractor) (1/8)
> org.apache.beam.sdk.util.UserCodeException: org.apache.flink.runtime.
> operators.chaining.ExceptionInChainedStubException
>at 
> org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)
> ~[beam-sdks-java-core-0.6.0.jar:0.6.0]
>at 
> org.org.my.pipelines.MyDoFn$auxiliary$s09rfuPj.invokeProcessElement(Unknown
> Source) ~[na:na]
>at org.apache.beam.runners.core.SimpleDoFnRunner.
> invokeProcessElement(SimpleDoFnRunner.java:198)
> ~[beam-runners-core-java-0.6.0.jar:0.6.0]
>at 
> org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:156)
> ~[beam-runners-core-java-0.6.0.jar:0.6.0]
>at org.apache.beam.runners.flink.translation.functions.
> FlinkDoFnFunction.mapPartition(FlinkDoFnFunction.java:109)
> ~[beam-runners-flink_2.10-0.6.0.jar:0.6.0]
>at org.apache.flink.runtime.operators.MapPartitionDriver.
> run(MapPartitionDriver.java:103) ~[flink-runtime_2.10-1.2.0.jar:1.2.0]
>at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:490)
> [flink-runtime_2.10-1.2.0.jar:1.2.0]
>at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
> [flink-runtime_2.10-1.2.0.jar:1.2.0]
>at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
> [flink-runtime_2.10-1.2.0.jar:1.2.0]
>at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> org.apache.flink.runtime.operators.chaining.ExceptionInChainedStubException:
> null
>at org.apache.flink.runtime.operators.chaining.
> ChainedFlatMapDriver.collect(ChainedFlatMapDriver.java:82)
> ~[flink-runtime_2.10-1.2.0.jar:1.2.0]
>at org.apache.flink.runtime.operators.util.metrics.
> CountingCollector.collect(CountingCollector.java:35)
> ~[flink-runtime_2.10-1.2.0.jar:1.2.0]
>at org.apache.beam.runners.flink.translation.functions.
> FlinkDoFnFunction$DoFnOutputManager.output(FlinkDoFnFunction.java:138)
> ~[beam-runners-flink_2.10-0.6.0.jar:0.6.0]
>at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnContext.
> outputWindowedValue(SimpleDoFnRunner.java:351)
> ~[beam-runners-core-java-0.6.0.jar:0.6.0]
>at org.apache.beam.runners.core.SimpleDoFnRunner$
> DoFnProcessContext.output(SimpleDoFnRunner.java:545)
> ~[beam-runners-core-java-0.6.0.jar:0.6.0]
>at org.my.pipelines.MyDoFn.processElement(MyDoFn.java:49)
> ~[pipelines-0.1.jar:na]
>
> The top stacktrace references some kind of anonymous
> `invokeProcessElement(Unknown Source)` which is not really informative. The
> bottom stacktrace references my call to `context.output()`, which is even
> more confusing. I've gone through fixing a couple of issues by manually
> try/catching and logging directly from within `processElement`, but this is
> far from ideal. Any advice on how to interpret those and possibly set
> things up in order to get more helpful error messages would be really
> helpful.
>
> Running Beam 0.6, Flink 1.2.
>
> Thanks!
>
>
>
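
For reference, the kind of manual try/catch described above looks roughly like
the sketch below (the DoFn body and logger are illustrative, not code from this
thread); it at least preserves the failing element and the real stack trace in
the worker logs before Flink wraps the error:

  import org.apache.beam.sdk.transforms.DoFn;
  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  public class MyDoFn extends DoFn<String, String> {
    private static final Logger LOG = LoggerFactory.getLogger(MyDoFn.class);

    @ProcessElement
    public void processElement(ProcessContext c) {
      try {
        // ... actual element processing ...
        c.output(c.element());
      } catch (RuntimeException e) {
        // Log the failing element and original exception before rethrowing,
        // so the cause survives ExceptionInChainedStubException wrapping.
        LOG.error("Failed to process element: {}", c.element(), e);
        throw e;
      }
    }
  }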


Re: Slack Channel Request

2017-03-27 Thread Davor Bonaci
Invite sent.

On Sat, Mar 25, 2017 at 2:48 AM, Prabeesh K.  wrote:

> Hi Jean,
>
> Thank you for your reply. I am eagerly waiting for the other options.
>
> Regards,
> Prabeesh K.
>
> On 25 March 2017 at 10:08, Jean-Baptiste Onofré  wrote:
>
>> Unfortunately we reached the max number of people on Slack (90).
>>
>> Let me see what we can do.
>>
>> Regards
>> JB
>>
>>
>> On 03/24/2017 09:49 PM, Prabeesh K. wrote:
>>
>>> Hi,
>>>
>>> Can someone please add me to the Apache Beam slack channel?
>>>
>>> Regards,
>>>
>>> Prabeesh K.
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


Re: Recommendations on reading/writing avro to hdfs

2017-03-06 Thread Davor Bonaci
Hi Michael,
Sorry about the inconvenience here; AvroWrapperCoder is indeed removed
recently from Hadoop/HDFS IO.

I think the best approach would be to use HDFSFileSource; this is the only
approach I can recommend today.

Going forward, we are working on being able to read Avro files via AvroIO,
regardless of which file system the files may be stored on. So, you'd do
something like AvroIO.Read.from("hdfs://..."), just as you can today do
AvroIO.Read.from("gs://...").

Hope this helps!

Davor

On Tue, Feb 28, 2017 at 4:24 PM, Michael Luckey  wrote:

> Hi all,
>
> we are currently using beam over spark, reading and writing avro files to
> hdfs.
>
> Until now we use HDFSFileSource for reading and HadoopIO for writing,
> essentially reading and writing PCollection
>
> With the changes introduced by https://issues.apache.org/jira/browse/BEAM-1497
> this seems to no longer be directly supported by Beam, as the required
> AvroWrapperCoder was deleted.
>
> So as we have to change our code anyway, we are wondering, what would be
> the recommended approach to read/write avro files from/to hdfs with beam on
> spark.
>
> - use the new implementation of HDFSFileSource/HDFSFileSink
> - use the Spark-provided HadoopIO (and probably reimplement AvroWrapperCoder
> ourselves?)
>
> What are the trade-offs here, possibly also considering already planned
> changes to IO? Do we gain anything by using the Spark HadoopIO, as our
> underlying engine is currently Spark, or will this eventually be deprecated
> and exist only for ‘historical’ reasons?
>
> Any thoughts and advice here?
>
> Regards,
>
> michel
>


Re: Beam Spark/Flink runner with DC/OS

2017-01-30 Thread Davor Bonaci
Sorry for the delay here.

Am I correct in summarizing that "gs://bucket/file" doesn't work on a Spark
cluster, but does with Spark runner locally? Beam file systems utilize
AutoService functionality and, generally speaking, all filesystems should
be available and automatically registered on all runners. This is probably
just a simple matter of staging the right classes on the cluster.

Pei, any additional thoughts here?

On Mon, Jan 23, 2017 at 1:58 PM, Chaoran Yu 
wrote:

> Sorry for the spam. But to clarify, I didn’t write the code. I’m using the
> code described here https://beam.apache.org/get-started/wordcount-example/
> So the file already exists in GS.
>
> On Jan 23, 2017, at 4:55 PM, Chaoran Yu  wrote:
>
> I didn’t upload the file. But since the identical Beam code, when running
> in Spark local mode, was able to fetch the file and process it, the file
> does exist.
> It’s just that somehow Spark standalone mode can’t find the file.
>
>
> On Jan 23, 2017, at 4:50 PM, Amit Sela  wrote:
>
> I think "external" is the key here, you're cluster is running all it's
> components on your local machine so you're good.
>
> As for GS, it's like Amazon's S3, a sort of cloud-hosted HDFS offered
> by Google. You need to upload your file to GS. Have you?
>
> On Mon, Jan 23, 2017 at 11:47 PM Chaoran Yu 
> wrote:
>
>> Well, my file is not in my local filesystem. It’s in GS.
>> This is the line of code that reads the input file: p.apply(TextIO.Read.
>> from("gs://apache-beam-samples/shakespeare/*"))
>>
>> And this page https://beam.apache.org/get-started/quickstart/ says the
>> following:
>> "you can’t access a local file if you are running the pipeline on an
>> external cluster”.
>> I’m indeed trying to run a pipeline on a standalone Spark cluster running
>> on my local machine. So local files are not an option.
>>
>>
>> On Jan 23, 2017, at 4:41 PM, Amit Sela  wrote:
>>
>> Why not try file:// instead ? it doesn't seem like you're using Google
>> Storage, right ? I mean the input file is on your local FS.
>>
>> On Mon, Jan 23, 2017 at 11:34 PM Chaoran Yu 
>> wrote:
>>
>> No I’m not using Dataproc.
>> I’m simply running on my local machine. I started a local Spark cluster
>> with sbin/start-master.sh and sbin/start-slave.sh. Then I submitted my Beam
>> job to that cluster.
>> The gs file is the kinglear.txt from Beam’s example code and it should be
>> public.
>>
>> My full stack trace is attached.
>>
>> Thanks,
>> Chaoran
>>
>>
>>
>> On Jan 23, 2017, at 4:23 PM, Amit Sela  wrote:
>>
>> Maybe, are you running on Dataproc ? are you using YARN/Mesos ? do the
>> machines hosting the executor processes have access to GS ? could you paste
>> the entire stack trace ?
>>
>> On Mon, Jan 23, 2017 at 11:21 PM Chaoran Yu 
>> wrote:
>>
>> Thank you Amit for the reply,
>>
>> I just tried two more runners and below is a summary:
>>
>> DirectRunner: works
>> FlinkRunner: works in local mode. I got an error “Communication with
>> JobManager failed: lost connection to the JobManager” when running in
>> cluster mode,
>> SparkRunner: works in local mode (mvn exec command) but fails in cluster
>> mode (spark-submit) with the error I pasted in the previous email.
>>
>> In SparkRunner’s case, can it be that the Spark executor can’t access the
>> gs file in Google Storage?
>>
>> Thank you,
>>
>>
>>
>> On Jan 23, 2017, at 3:28 PM, Amit Sela  wrote:
>>
>> Is this working for you with other runners ? judging by the stack trace,
>> it seems like IOChannelUtils fails to find a handler so it doesn't seem
>> like it is a Spark specific problem.
>>
>> On Mon, Jan 23, 2017 at 8:50 PM Chaoran Yu 
>> wrote:
>>
>> Thank you Amit and JB!
>>
>> This is not related to DC/OS itself, but I ran into a problem when
>> launching a Spark job on a cluster with spark-submit. My Spark job written
>> in Beam can’t read the specified gs file. I got the following error:
>>
>> Caused by: java.io.IOException: Unable to find handler for
>> gs://beam-samples/sample.txt
>> at org.apache.beam.sdk.util.IOChannelUtils.getFactory(
>> IOChannelUtils.java:307)
>> at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(
>> FileBasedSource.java:528)
>> at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(
>> OffsetBasedSource.java:271)
>> at org.apache.beam.runners.spark.io.SourceRDD$Bounded$1.
>> hasNext(SourceRDD.java:125)
>>
>> Then I thought about switching to reading from another source, but I saw
>> in Beam’s documentation that TextIO can only read from files in Google
>> Cloud Storage (prefixed with gs://) when running in cluster mode. How do
>> you guys do file IO in Beam when using the SparkRunner?
>>
>>
>> Thank you,
>> Chaoran
>>
>>
>> On Jan 22, 2017, at 4:32 AM, Amit Sela  wrote:

Re: Regarding Beam Slack Channel

2017-01-23 Thread Davor Bonaci
I wouldn't be in favor of such "solutions" on top of Slack, unless
specifically endorsed by Slack itself. It sounds like something intended to
override specific design decisions, and we may not be acting in the best
faith if we were to deploy it.

If Slack doesn't support features we think we need, we can always consider
other solutions as well.

On Mon, Jan 23, 2017 at 10:41 AM, Jason Kuster 
wrote:

> I've set up https://github.com/thesandlord/SlackEngine for a different
> Slack group I'm part of; would there be interest in that? It's an
> automated slack invite generator -- we could just put up a link to it on
> the Beam website and then anyone could generate themselves an invite.
>
> On Mon, Jan 23, 2017 at 10:39 AM, Dan Halperin 
> wrote:
>
>> +1 to please make it an open slack channel if possible. And if not,
>> please enlarge the set of approvers as quickly as possible.
>>
>> On Sun, Jan 22, 2017 at 11:10 PM, Ritesh Kasat 
>> wrote:
>>
>>> Thanks Jean
>>>
>>> —Ritesh
>>>
>>> On 22-Jan-2017, at 10:36 PM, Jean-Baptiste Onofré 
>>> wrote:
>>>
>>> I will.
>>>
>>> By the way, not sure it's possible (I gonna check) but maybe it would be
>>> great to have a open slack channel (like IRC) without invite needed.
>>>
>>> Regards
>>> JB
>>> On Jan 23, 2017, at 07:32, Ritesh Kasat  wrote:

 Hello,

 Can someone add me to the Beam slack channel.

 Thanks
 Ritesh


>>>
>>
>
>
> --
> ---
> Jason Kuster
> Apache Beam (Incubating) / Google Cloud Dataflow
>


Re: Some questions and kudos to everyone contributed to this project

2017-01-11 Thread Davor Bonaci
Welcome Ufuk!

I’m using it on GCP Dataflow and so far so good. However, do you have any
> problems that you foresee I may bump into? My data is safely stored, so in
> case of hard failures I can recover easily.
>

Beam community as a whole, including all of its runners, always focuses on
data consistency and semantic guarantees as a top priority. We have an
extensive testing infrastructure to try to catch bugs as early as possible.
Bugs may occasionally happen in any system -- but we'll certainly strive to
avoid them and/or mitigate any impact as best we can.

- I didn’t wait for the Dataflow 2.0 beta release and jumped into Beam 0.4. Do
> you think I should go and use 2.0 now instead of 0.4, as I will only be
> using Google’s managed Dataflow service?
>

You are welcome to use either. The Beam community endorses Beam releases,
and 0.4.0 is the newest and the recommended release.

With my Google hat on -- Any vendor's distribution comes with additional
vetting and support for that specific runner/scenario, and Dataflow is not
an exception. If you want to use Dataflow service, using the Dataflow
distribution makes sense. And, of course, you can always easily change your
mind, or mix-and-match with Beam releases, without any modification to your
code.

- How can I try ‘TemplatingDataflowPipelineRunner’ on 0.4 or 2.0? I think
> this will be renamed to ‘TemplatingPipelineRunner’? Do you have any
> guidance?
>

Please see --templateLocation pipeline option and BEAM-551 [1] in our JIRA
issue tracker.
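
For illustration, a sketch of creating a template with the Beam 2.x Dataflow
runner follows (bucket paths are made up; the option can equally be passed as
--templateLocation on the command line):

  import org.apache.beam.runners.dataflow.DataflowRunner;
  import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class TemplateExample {
    public static void main(String[] args) {
      DataflowPipelineOptions options =
          PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
      options.setRunner(DataflowRunner.class);
      // With templateLocation set, run() stages the job as a template at this
      // path instead of executing it immediately.
      options.setTemplateLocation("gs://my-bucket/templates/my-template");

      Pipeline p = Pipeline.create(options);
      // ... build the pipeline, typically reading parameters via ValueProvider ...
      p.run();
    }
  }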

These are my questions so far. I will also add some feature requests for
> the Google Dataflow part; this may not be the correct place to post those,
> so please ignore them if so:
>

These are great ideas, but they pertain to the Dataflow Service, so they
would be best addressed by Dataflow support [2].

Once again, welcome! It is great to have you join the Beam user community.

Davor

[1] https://issues.apache.org/jira/browse/BEAM-551
[2] https://cloud.google.com/dataflow/support


Re: Extending Beam to Write/Read to Apache Accumulo.

2017-01-11 Thread Davor Bonaci
Hi Wyatt -- welcome!

If you'd like to write a PCollection to Apache Accumulo's key/value
store, writing a new IO connector would be the best path forward. Accumulo
has concepts somewhat similar to Bigtable, so you can use the existing
BigtableIO as an inspiration.

You are thinking it exactly right -- a connector written in Beam would be
runner-independent, and thus can run anywhere.

I'm not aware that anybody has started on this one yet -- feel free to file
a JIRA to have a place to coordinate if someone else is interested. And, if
you get stuck or need help in any way, there are plenty of people on the
Beam mailing lists happy to help!

Once again, welcome!

Davor
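
For concreteness, a very rough sketch of the shape such a connector could take
is below. It is not an existing Beam API: the class, its hard-coded connection
settings, and the Beam 2.x-style expand() signature are all illustrative, and a
real connector would add configuration, batching, serialization (a coder for
Mutation), and error handling.

  import org.apache.accumulo.core.client.BatchWriter;
  import org.apache.accumulo.core.client.BatchWriterConfig;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.ZooKeeperInstance;
  import org.apache.accumulo.core.client.security.tokens.PasswordToken;
  import org.apache.accumulo.core.data.Mutation;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.PTransform;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.PDone;

  public class AccumuloIO {

    /** Runner-independent write: PCollection of Mutations into an Accumulo table. */
    public static class Write extends PTransform<PCollection<Mutation>, PDone> {
      private final String table;

      private Write(String table) {
        this.table = table;
      }

      public static Write to(String table) {
        return new Write(table);
      }

      @Override
      public PDone expand(PCollection<Mutation> input) {
        input.apply(ParDo.of(new WriteFn(table)));
        return PDone.in(input.getPipeline());
      }
    }

    private static class WriteFn extends DoFn<Mutation, Void> {
      private final String table;
      private transient BatchWriter writer;

      WriteFn(String table) {
        this.table = table;
      }

      @Setup
      public void setup() throws Exception {
        // Connection parameters are hard-coded only for illustration; a real
        // IO would expose them on the Write builder.
        Connector connector = new ZooKeeperInstance("instance", "zk1:2181")
            .getConnector("user", new PasswordToken("secret"));
        writer = connector.createBatchWriter(table, new BatchWriterConfig());
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        writer.addMutation(c.element());
      }

      @Teardown
      public void teardown() throws Exception {
        if (writer != null) {
          writer.close();
        }
      }
    }
  }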

On Wed, Jan 11, 2017 at 6:04 PM, Wyatt Frelot  wrote:

> All,
>
> Being new to Apache Beam...I want to ensure that I approach things the
> "right way".
>
> My goal:
>
> I want to be able to write a PCollection to Apache Accumulo. Something
> like this:
>   PCollection.apply(AccumuloIO.Write.to("AccumuloTable"));
>
> While I am sure I can create a custom class to do so, it has me thinking
> about identifying the best way forward.
>
> I want to use the Apex Runner to run my applications. Apex has Malhar
> libraries that are already written that would be really useful. But, I
> don't think that is the point. The goal is to develop IO Connectors that
> are able to be applied to any runner.  Am I thinking about this correctly?
>
> Is there any work being done to develop an IO Connector for Apache
> Accumulo?
>
> Wyatt
>
>
>