Re: Pattern for sharing a resource across all worker threads in Beam Java SDK

2020-04-30 Thread Luke Cwik
I looked at your example and the custom logic for the singleton is basically: static transient T value; public static synchronized T getOrCreate(...) { if (value == null) { ... instantiate value ... } return value; } which is only a few lines. You could use one of Java's injection frameworks
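To make that concrete, here is a minimal, self-contained sketch of that holder, using a ConcurrentHashMap as a hypothetical stand-in for the shared thread-safe resource (the class and field names are illustrative, not from the original message):

    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch only: the ConcurrentHashMap stands in for whatever
    // thread-safe structure is being shared across all threads on a worker.
    public final class SharedMapSingleton {
      private static ConcurrentHashMap<String, String> value;  // guarded by the class lock

      private SharedMapSingleton() {}

      // Every thread calls this; only the first caller pays the construction cost.
      public static synchronized ConcurrentHashMap<String, String> getOrCreate() {
        if (value == null) {
          value = new ConcurrentHashMap<>();  // created exactly once per JVM
        }
        return value;
      }
    }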

Re: Pattern for sharing a resource across all worker threads in Beam Java SDK

2020-04-30 Thread Cameron Bateman
I'm interested in this area too. One limitation, I guess, is that this assumes your runner is going to be a single JVM if you need your singletons to be globally unique. I'm mostly using DirectRunner (I'm still new to all this), for which this holds. I suppose for more distributed runners this would

Re: Beam + Flink + Docker - Write to host system

2020-04-30 Thread Robbe Sneyders
Yes, the task manager has one task slot per CPU core available, and the dashboard shows that the work is parallelized across multiple subtasks. However, when using parallelism, the pipeline stalls, the Task Manager starts throwing 'Output channel stalled' warnings, and high back pressure builds up

Re: Beam + Flink + Docker - Write to host system

2020-04-30 Thread Kyle Weaver
If you are using only a single task manager but want to get parallelism > 1, you will need to increase taskmanager.numberOfTaskSlots in your flink-conf.yaml. https://ci.apache.org/projects/flink/flink-docs-stable/internals/job_scheduling.html#scheduling
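As a rough illustration (the slot count and parallelism values below are arbitrary, and this assumes the Beam Java SDK with the classic Flink runner): the slots are configured on the Flink side, while the pipeline's parallelism can be requested through FlinkPipelineOptions:

    import org.apache.beam.runners.flink.FlinkPipelineOptions;
    import org.apache.beam.runners.flink.FlinkRunner;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class ParallelismExample {
      public static void main(String[] args) {
        // Flink side: give each task manager enough slots, e.g. in flink-conf.yaml:
        //   taskmanager.numberOfTaskSlots: 4
        FlinkPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
        options.setRunner(FlinkRunner.class);
        options.setParallelism(4);  // Beam side: arbitrary example value
        // ... build and run the pipeline with these options ...
      }
    }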

Pattern for sharing a resource across all worker threads in Beam Java SDK

2020-04-30 Thread Jeff Klukas
Beam Java users, I've run into a few cases where I want to present a single thread-safe data structure to all threads on a worker, and I end up writing a good bit of custom code each time involving a synchronized method that handles creating the resource exactly once, and then each thread has its
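A hedged sketch of how that usually gets wired into a DoFn: each DoFn instance looks up the same JVM-wide object in @Setup. This reuses the hypothetical SharedMapSingleton holder sketched earlier in the digest; it is not code from the original message:

    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.beam.sdk.transforms.DoFn;

    // Illustrative only: every instance of this DoFn, across all threads on the
    // worker, ends up holding a reference to the same shared map.
    public class UsesSharedResourceFn extends DoFn<String, String> {
      private transient ConcurrentHashMap<String, String> shared;

      @Setup
      public void setup() {
        shared = SharedMapSingleton.getOrCreate();  // same instance for every thread on this JVM
      }

      @ProcessElement
      public void processElement(@Element String element, OutputReceiver<String> out) {
        shared.putIfAbsent(element, element);
        out.output(element);
      }
    }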

RE: HCatalogIO - Trying to read table metadata (columns names and indexes)

2020-04-30 Thread Gershi, Noam
Thank you Rahul. Quoting rahul patwari (Tuesday, April 28, 2020 4:21 PM): Hi Noam, Currently, Beam doesn't support conversion of HCatRecords to Rows or, in your case, creating a Beam Schema
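For context, a rough sketch of what reading through HCatalogIO looks like today: the pipeline gets HCatRecords back and has to address fields by position, since the column names live in the Hive metastore rather than on the record. The metastore URI, database, and table names below are placeholders:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.hcatalog.HCatalogIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hive.hcatalog.data.HCatRecord;

    public class HCatalogReadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        Map<String, String> configProps = new HashMap<>();
        configProps.put("hive.metastore.uris", "thrift://metastore-host:9083");  // placeholder

        PCollection<HCatRecord> records =
            p.apply(HCatalogIO.read()
                .withConfigProperties(configProps)
                .withDatabase("my_db")      // placeholder
                .withTable("my_table"));    // placeholder

        records.apply(ParDo.of(new DoFn<HCatRecord, String>() {
          @ProcessElement
          public void processElement(@Element HCatRecord record, OutputReceiver<String> out) {
            // Fields are addressed by position; column names are not carried on the record.
            out.output(String.valueOf(record.get(0)));
          }
        }));

        p.run();
      }
    }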

Re: Beam + Flink + Docker - Write to host system

2020-04-30 Thread Robbe Sneyders
Hi Kyle, Thanks for the quick response. The problem was that the pipeline could not access the input file. The Task Manager errors seem unrelated indeed. I'm now able to run the pipeline completely, but I'm running into problems when using parallelism. The pipeline can be summarized as: read file

Re: Apache beam job on Flink checkpoint size growing over time

2020-04-30 Thread Maximilian Michels
Hey Steve, The checkpoint buffer is cleared "lazily" when a new bundle is started. Theoretically, if no more data were to arrive after a checkpoint, then this buffer would not be cleared until more data arrived. If this repeats every time new data comes in, then some data would always remain buffered
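As a toy illustration of the lazy-clearing idea (this is not the Beam Flink runner's actual code, just a sketch of the behaviour described above): the buffer is only marked for clearing at checkpoint time and actually emptied when the next bundle starts, so without new input the data lingers:

    import java.util.ArrayList;
    import java.util.List;

    // Toy illustration (not Beam/Flink runner code) of a buffer that is only
    // cleared lazily when the next bundle starts after a checkpoint.
    public class LazyClearingBuffer {
      private final List<String> buffered = new ArrayList<>();
      private boolean clearOnNextBundle = false;

      void onCheckpoint() {
        // Mark the buffer for clearing, but don't clear it yet.
        clearOnNextBundle = true;
      }

      void onBundleStart() {
        // Clearing happens here, which only runs when new data arrives.
        if (clearOnNextBundle) {
          buffered.clear();
          clearOnNextBundle = false;
        }
      }

      void onElement(String element) {
        buffered.add(element);
      }

      int bufferedCount() {
        return buffered.size();  // stays non-zero until a new bundle begins
      }
    }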