Re: Beam + Dataflow + Go article

Andrew Brampton Sun, 06 Jan 2019 13:31:54 -0800

Sorry I got busy over the holidays, and never wrote up the areas I
struggled with. Here is a quick bulleted list. I've created JIRA issues (or
found existing ones) where it seems appropriate:

* Coder Registry <https://issues.apache.org/jira/browse/BEAM-3306> - As
soon as I wanted to send my own custom types, I hit problems. For example,
my struct contained a color.Color (a interface) which currently can't be
encoded. However, a simple coder would have most likely fixed my problem.

* Direct doesn't marshal my data
<https://issues.apache.org/jira/browse/BEAM-6372> - I would test my
pipeline using direct, and it would happily run on a sample. When I ran it
on dataflow, it'll run for a hour, then get to a stage that would crash
like so:

java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error
received from SDK harness for instruction -224: execute failed: panic:
reflect: Call using main.HistogramResult as type struct { Key string
"json:\"key\""; Files []string "json:\"files\""; Histogram
palette.ColorHistogram "json:\"histogram,omitempty\""; Palette []struct { R
uint8; G uint8; B uint8; A uint8 } "json:\"palette\"" } goroutine 70
[running]:

It is not obvious what that means, but its because I forgot to register my
HistogramResult type. I had similar other errors, that could have easily
been spotted by the direct pipeline, if it had just tried to marshal and
unmarshal my types at init time. I'm sure there are additional checks the
direct runner could do.

* CSV support <https://issues.apache.org/jira/browse/BEAM-6371> - I wanted
to read my data from here csv file. I see the Java API doesn't seem to
support this, but it was very easy to implement
<https://github.com/bramp/morebeam/tree/master/csvio>, and would be a easy
thing to support.

* Hard to understand why it failed - Multiple times my dataflow pipeline
would fail, with a error like:

"A work item was attempted 4 times without success. Each time the worker
eventually lost contact with the service."

Spelunking the logs didn't find much, and I spent a lot of time adding
additional logging, adding a pprof handler, and testing. I came to the
conclusion my pipeline would run out of memory, and a worker would crash /
OOM.

* Memory leaks - I was using the standard sized GCE instances
(n1-standard-1), but I found my CoGBK would end up using a huge amount of
RAM. I determined this by using pprof, and inspecting the job as it ran.
Without reading the code, it seems that CoGroupByKey would staged the
groups in RAM before iterating. My groups weren't too big (no more than 100
items each), but it seemed it would stage multiple groups at the same time,
it would run out of RAM.  Moving to n1-standard-2 fixed that problem.

* Auto-scaling - It never seemed to work well, never going above a dozen or
so workers. When I looking at the CPU usage most of the workers would be at
100%, but a couple would be at 0%. As if they failed to start, or stalled,
or had no work to do. I didn't know how to debug that, other than noting
none of my log lines were printed in the worker's logs. I had inserted a
reshuffle early in my pipeline, so I thought the work should have been
partioned across 1000s of workers.  I started to force the number of workers
<https://issues.apache.org/jira/browse/BEAM-6144> to get the performance I
needed.

* Auto-scaling with two CPUs
<https://issues.apache.org/jira/browse/BEAM-6373>- Once I moved
to n1-standard-2, two v-cpus, auto-scaling would never scale beyond one
worker.  I guess the Go Beam API isn't multi-threaded, because even when I
forced the number of workers, each worker would never exceed 50% CPU.
<https://pasteboard.co/HVf7puP.png> I suspect the auto-scaling is looking
at CPU to decide when more workers are needed.

* GCS - The version of the GCS library was old, and it didn't support
context/timeouts. This caused my pipeline to stall and ever complete.
Specifically, a connection to GCS should have timed out, but was hanging
for some reason, causing my pipeline to make no progress for hours. I filed
a couple <https://issues.apache.org/jira/browse/BEAM-6155> of issues
<https://issues.apache.org/jira/browse/BEAM-6164> around this, and already
fixed them. But anywhere we call an external service, should follow best
practices around timeouts, retries, backoffs, etc.

* Reshuffle / AddKey / others - I'm glad with SplittableDoFn that reshuffle
won't be needed, but until then that might be useful. I also added a AddKey
<https://godoc.org/bramp.net/morebeam#AddKey> function, which allowed the
user to pass a simple function that took a value, and returned a KV<key,
value>. Kind of the opposite of DropKey
<https://godoc.org/github.com/apache/beam/sdks/go/pkg/beam#DropKey>. Feel
free to use my implementation
<https://github.com/bramp/morebeam/blob/master/addkey.go#L41>. For example:

// Return a new PCollection<KV<string, Painting>> where the key is the
artist.
paintingsByArtist := morebeam.AddKey(s, func(painting Painting) string {
     return painting.Artist
}, paintings)

* Elements added always blank
<https://issues.apache.org/jira/browse/BEAM-6374> - As seen here
<https://pasteboard.co/HVf80BU.png>, the "elements added" for input and
output collections was always empty.

* Wall time always very small
<https://issues.apache.org/jira/browse/BEAM-6375>- Similar to above, my
pipeline would run for hours, but the walltime would be seconds
<https://pasteboard.co/HVf8PCp.png>. That can't be right.

I think that's it.

Even though I found lots of rough edges, I was able to get what I wanted to
do done, and it worked well in the end. I'd like to thank everyone for
their hard work! This also was the first Flume/Dataflow/Pipeline I've ever
written, so perhaps there are best practices I was missing.

Thanks
Andrew

On Mon, 17 Dec 2018 at 08:39, Robert Burke <[email protected]> wrote:

> Thanks for the excellent article!
>
> It's great to see what the experience is like from an outside perspective,
> and it's comforting that it mirrors my own. It means I'm not missing much.
>
> It's been on my to-do list to make the Go SDK direct runner more robust,
> so transitioning to other runners wouldn't be such a burden. I'd love for
> it to have better error messages, and be more useful for testing.
>
> Daniel Oliveira recently updated the Universal Runner guide to include how
> to run Go SDK jobs against it. It should also catch the same issues, and
> provides a free way to check that pipelines are correct. It has the same
> single machine limitation though.
> https://cwiki.apache.org/confluence/display/BEAM/Usage+Guide
>
> User Defined Coders and Pointer Elements (and their semantics) is
> something I've been thinking about as well and will be working on within
> the next month. JSON is ok for debugging but less so for performance at
> scale. Let me know if you have any opinions on that! I intend to post to
> the list about it this week.
>
> As for reshuffle, other IOs, and scalability, as mentioned in the roadmap (
> https://beam.apache.org/roadmap/go-sdk/) we're mostly blocked on
> SplittableDoFn support. With it, we wouldn't need to reshuffle, and would
> gain more natural scaling of IOs. Once the Go SDK havs these it will br
> well on it's way to not being experimental. :)
>
> Finally, I'm obligated to mention that while the SDK works on Dataflow
> it's not yet officially supported by the service.
>
> Cheers
> Robert B
>
>
> On Mon, Dec 17, 2018, 8:07 AM Kenneth Knowles <[email protected]> wrote:
>
>> Nice!
>>
>> It reads really well. For the benefit of this list, would you be willing
>> to summarize the rough edges (and maybe the "couple of other things" you
>> had to implement) in a few bullet points? and/or file Jira issues if they
>> are clear feature requests or bugs.
>>
>> Kenn
>>
>> On Mon, Dec 17, 2018 at 10:40 AM Andrew Brampton <[email protected]>
>> wrote:
>>
>>> Hey all,
>>>
>>> I've recently been playing with the Go Beam SDK running on Dataflow. I
>>> wrote up a tutorial for today's Go Advent blog.
>>>
>>> Feel free to check it out:
>>> https://blog.gopheracademy.com/advent-2018/apache-beam/
>>>
>>> Feedback is welcomed. I know the Go SDK is experimental, but I hit a few
>>> rough edges. I also had to implement my own csvio, reshuffle, and a couple
>>> of other things. I will be sharing my feedback on using go and dataflow
>>> shortly.
>>>
>>> Thanks
>>> Andrew
>>>
>>

Re: Beam + Dataflow + Go article

Reply via email to