Re: [EXTERNAL] - Re: Create PCollection<ABC> from List<ABC> in DoFn<>

2018-06-07 Thread Marián Dvorský
If you have a function which, given an X, returns a List<Y>, you can use the
FlatMapElements transform on a PCollection<X> to get a PCollection<Y>.
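
As a concrete illustration, here is a minimal, runnable sketch of that pattern. It assumes beam-sdks-java-core and the direct runner are on the classpath; the class name, the example strings, and the line-splitting function are illustrative, not from the thread:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class FlatMapExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // The function here is "given a line, return a List of its words",
    // so FlatMapElements turns a PCollection<String> of lines into a
    // flattened PCollection<String> of words.
    PCollection<String> lines = p.apply(Create.of("a b", "c"));
    PCollection<String> words =
        lines.apply(
            FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split(" "))));

    // Verify the flattened contents on the direct runner.
    PAssert.that(words).containsInAnyOrder("a", "b", "c");
    p.run().waitUntilFinish();
  }
}
```

The same shape works for any element type, as long as a coder can be inferred (or is set explicitly) for the output type.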

On Thu, Jun 7, 2018 at 8:16 AM S. Sahayaraj  wrote:

> If we could return List<ABC> from DoFn<>, then we could use the
> code suggested in section 3.1.2 and mentioned by you below, but the
> return type of DoFn<> is always PCollection<>, so I cannot obtain the
> list of ABC objects which will further be fed as input for parallel
> computation. Is there any possibility to convert List<ABC> to
> PCollection<ABC> in DoFn<> itself? OR can DoFn<> return List<ABC> objects?
>
>
>
>
>
> Cheers,
>
> S. Sahayaraj
>
>
>
> *From:* Robert Bradshaw [mailto:rober...@google.com]
> *Sent:* Wednesday, June 6, 2018 9:40 PM
> *To:* user@beam.apache.org
> *Subject:* [EXTERNAL] - Re: Create PCollection<ABC> from List<ABC> in
> DoFn<>
>
>
>
> You can use the Create transform to do this, e.g.
>
>
>
>   Pipeline p = ...
>
>   List<ABC> inMemoryObjects = ...
>
>   PCollection<ABC> pcollectionOfObject = p.apply(Create.of(inMemoryObjects));
>
>   result = pcollectionOfObject.apply(ParDo.of(SomeDoFn...));
>
>
>
> See section 3.1.2 at
> https://beam.apache.org/documentation/programming-guide/#pcollections
>
>
>
> On Wed, Jun 6, 2018 at 8:34 AM S. Sahayaraj  wrote:
>
> Hello,
>
> I have created a Java class which extends DoFn<>. There
> is a list of objects of type ABC (List<ABC>) populated in
> processElement(ProcessContext c) at runtime, and I would like to generate
> the respective PCollection<ABC> from that List<ABC> so that the subsequent
> transform can do parallel execution on each ABC object in the
> PCollection<ABC>. How do we create a PCollection<ABC> from in-memory objects
> created in a DoFn<>? OR how do we get the pipeline object while in a DoFn<>? OR is
> there any SDK guideline to refer to?
>
>
>
>
>
> Thanks,
>
> S. Sahayaraj
>
>
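
For completeness, the usual Beam answer to "can DoFn<> return a List?" is that a DoFn emits elements by calling c.output() once per item inside processElement; each emitted item becomes one element of the output PCollection, which downstream transforms can then process in parallel. A minimal sketch, assuming the direct runner and using Integer as a stand-in for the ABC type from the thread:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class EmitListExample {
  // Instead of returning a List, the DoFn outputs each list element
  // individually; the runner fans them out as separate elements.
  static class ExplodeFn extends DoFn<Integer, Integer> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      // Hypothetical stand-in for a List<ABC> built at runtime:
      List<Integer> built = Arrays.asList(c.element(), c.element() + 1);
      for (Integer x : built) {
        c.output(x); // one call per element of the in-memory list
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    PCollection<Integer> out =
        p.apply(Create.of(10, 20)).apply(ParDo.of(new ExplodeFn()));

    // Each input element produced two outputs.
    PAssert.that(out).containsInAnyOrder(10, 11, 20, 21);
    p.run().waitUntilFinish();
  }
}
```

This avoids needing the Pipeline object inside the DoFn at all; Create.of (as Robert shows above) is for lists that exist before the pipeline runs.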


Re: Sum over many keys, over TB of parquet, from HDFS (S3)

2018-03-13 Thread Marián Dvorský
Hi Guillaume,

You may want to avoid the final join by using CombineFns.compose() instead.

Marian
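
A minimal sketch of that suggestion, computing sum and count together in one Combine.perKey pass. It assumes beam-sdks-java-core plus the direct runner on the classpath; the class name, tag names, and sample data are illustrative:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.CombineFns;
import org.apache.beam.sdk.transforms.CombineFns.CoCombineResult;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SumAndCountExample {

  // Identity extractor: both component combiners consume the Double value.
  static class Identity extends SimpleFunction<Double, Double> {
    @Override
    public Double apply(Double x) { return x; }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    PCollection<KV<String, Double>> input =
        p.apply(Create.of(KV.of("a", 1.0), KV.of("a", 2.0), KV.of("b", 5.0)));

    final TupleTag<Double> sumTag = new TupleTag<>();
    final TupleTag<Long> countTag = new TupleTag<>();

    // One composed combiner computes both aggregates in a single
    // Combine.perKey pass, so no join of sum and count is needed.
    PCollection<KV<String, CoCombineResult>> combined =
        input.apply(
            Combine.perKey(
                CombineFns.compose()
                    .with(new Identity(), Sum.ofDoubles(), sumTag)
                    .with(new Identity(), Count.<Double>combineFn(), countTag)));

    // Pull each aggregate out of the CoCombineResult by its tag.
    PCollection<String> formatted =
        combined.apply(
            MapElements.into(TypeDescriptors.strings())
                .via((KV<String, CoCombineResult> kv) ->
                    kv.getKey() + ":" + kv.getValue().get(sumTag)
                        + ":" + kv.getValue().get(countTag)));

    PAssert.that(formatted).containsInAnyOrder("a:3.0:2", "b:5.0:1");
    p.run().waitUntilFinish();
  }
}
```

Because both aggregates ride through one shuffle, this replaces the "combine twice, then join" shape with a single grouped pass per key.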

On Tue, Mar 13, 2018 at 9:07 PM Guillaume Balaine  wrote:

> Hello Beamers,
>
> I have been a Beam advocate for a while now, and am trying to use it for
> batch jobs as well as streaming jobs.
> I am trying to prove that it can be as fast as Spark for simple use cases.
> Currently, I have a Spark job that processes a sum + count over a TB of
> parquet files that runs in roughly 90 min.
> Using the same resources (on EMR or on Mesos) I can't even come close to
> that.
> My job with Scio/Beam on Flink computes the sums over roughly 20 billion rows of
> parquet in 3h with 144 parallelism and 1TB of RAM (although I suspect many
> operators are idle, so I should probably use less parallelism with the same
> number of cores).
> I also implemented an identical version in pure Java, because I am unsure
> whether the Kryo-encoded tuples are properly managed by the Flink
> memory optimizations, and I am also testing it on Spark and Apex.
>
> My point is: is there any way to optimize this simple process:
> HadoopFileIO (using parquet and specific avro coders to improve perf over
> Generic) ->
> Map to KV of (field1 str, field2 str, field3 str) (value double, 1)
> ordered by most discriminating to least -> Combine.perKey(Sum)
> Or value and then join Sum and Count with a TupledPCollection
> -> AvroIO.Write
>
> The equivalent Spark Job does a group by key, and then a sum.
>
> Are there some tricks I am missing here?
>
> Thanks in advance for your help.
>


Re: [Scio/Scala API] Casting PCollection to SCollection.

2018-02-22 Thread Marián Dvorský
Try the wrap method on ScioContext.


On Thu, Feb 22, 2018 at 2:27 AM Mungeol Heo  wrote:

> Hello,
>
> I am trying to cast a PCollection to an SCollection.
> Is there a way to do it, like SCollection.internal?
> Any help will be great.
>
> Thank you.
>


Re: Troubleshooting live Java process in Dataflow

2017-12-08 Thread Marián Dvorský
The worker exposes an HTTP server with a handler for "/threadz":

$ curl localhost:8081/threadz

gives the stack traces for all Java threads.

On Fri, Dec 8, 2017 at 2:00 AM, Jacob Marble  wrote:

> In this post, Reuven says:
> "we had to ssh to the VMs to get actual thread profiles from workers.
> After a bit of digging, we found threads were often stuck in the following
> stack trace"
>
> Can someone describe what tools you use to do this?
>
> I logged into a Dataflow runner, found that the GC is thrashing. Neat! Now
> I'm trying to get a thread dump. Looks like the Dataflow runner is actually
> running in a container within the VM, and the host VM doesn't have jmap or
> any j* utility installed.
>
> Tried kill -3, didn't seem to trigger a thread dump.
>
> Also found an open JMX port, but only hangs VisualVM and JConsole.
>
> Jacob
>