Re: Beam courses

2019-01-14 Thread Austin Bennett
Hi Alex,

I'm certainly interested in helping more people use Beam (and beyond the
beginner level).  I believe there are people who can help, as has already
been mentioned in this thread, and I am also happy to help create training
materials as we identify areas of need.  I have discussed a cookbook (and
started drafting what would be needed for a 'Beam: Up and Running' tome),
but what you mention might need to be done by (hopefully not).

We have also discussed having some hands-on training at Beam Summits, so
perhaps your need can help motivate getting that kicked off here in SF, and
then more substantially at the EU summit, locally for your needs.  I know
that helped with my Spark usage (attending trainings tied to various Spark
Summits).

Cheers,
Austin


On Mon, Jan 14, 2019, 7:31 AM Maximilian Michels wrote:
> Hi Alex,
>
> I know of
> http://www.bigdatainstitute.io/courses/data-engineering-with-apache-beam/
>
> There are also some public materials by Jesse (in CC):
> https://github.com/eljefe6a/beamexample
> This training uses the above exercises:
>
> https://docs.google.com/presentation/d/1ln5KndBTiskEOGa1QmYSCq16YWO9Dtmj7ZwzjU7SsW4
>
> Overall, this is more for beginners, and some of the newer features, like
> user state and timers, are missing. I think it could be interesting to
> collaborate on an updated in-depth training.
>
> Best,
> Max
>
> On 14.01.19 01:47, Davor Bonaci wrote:
> > I'll introduce you to folks who can do this for you off-list.
> >
> > On Sun, Jan 13, 2019 at 12:28 PM Vikram Tiwari wrote:
> >
> > Hey! I think he mentioned it to me once that they do trainings for
> > Beam etc. Might wanna talk to him.
> > https://www.linkedin.com/in/dbonaci
> >
> >
> > On Sun, Jan 13, 2019, 12:08 PM Alex Van Boxel wrote:
> >
> > Hey all,
> >
> > Our team had the luxury of growing with Beam; we were Dataflow users
> > before it was GA. But now our team has grown, due to a merger.
> >
> > As we will continue using Beam, but now across different sites, I'm
> > thinking about training. The question is... Should I create trainings
> > myself? Or do people specialise in Beam training? I'm not talking
> > about some simple getting-started training... I want deep training.
> >
> > Any suggestions on how people in this group do trainings?
> >
>


Re: ParquetIO write of CSV document data

2019-01-14 Thread Alexey Romanenko
Hi Sri,

AFAIK, you have to create a PCollection of GenericRecords and define your
Avro schema manually to write your data into Parquet files.
In this case, you will need to create a ParDo for this translation. Also, I
expect that your schema is the same for all CSV files.

A basic example of using the Parquet sink with the Java SDK can be found here [1].

[1] https://git.io/fhcfV 
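
To make this concrete, below is a minimal sketch of such a translation,
assuming string and integer columns named "name" and "age" (the schema,
column names, and paths are illustrative, not taken from your data; real
code would also need to skip the header line and handle quoting):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.GenericRecordBuilder;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.parquet.ParquetIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class CsvToParquet {
      // Hand-written Avro schema matching the (assumed) CSV columns.
      private static final Schema SCHEMA = SchemaBuilder.record("Row")
          .fields()
          .requiredString("name")
          .requiredInt("age")
          .endRecord();

      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply(TextIO.read().from("input/*.csv"))
            // Translate each CSV line into a GenericRecord.
            .apply(MapElements.into(TypeDescriptor.of(GenericRecord.class))
                .via((String line) -> {
                  String[] parts = line.split(",");
                  return new GenericRecordBuilder(SCHEMA)
                      .set("name", parts[0])
                      .set("age", Integer.parseInt(parts[1]))
                      .build();
                }))
            .setCoder(AvroCoder.of(GenericRecord.class, SCHEMA))
            // ParquetIO.sink writes GenericRecords using the Avro schema.
            .apply(FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(SCHEMA))
                .to("output/"));
        p.run().waitUntilFinish();
      }
    }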


> On 14 Jan 2019, at 02:00, Sridevi Nookala wrote:
> 
> Hi,
> 
> I have a bunch of CSV data files that I need to store in Parquet format. I
> did look at the basic documentation on ParquetIO, and ParquetIO.sink() can
> be used to achieve this. However, there is a dependency on the Avro schema.
> How do I infer/generate an Avro schema from CSV document data?
> Does Beam have any API for this?
> I tried using the Kite SDK's CSVUtil / JsonUtil but had no luck generating
> an Avro schema. My CSV data files have headers, and quite a few of the
> header fields are hyphenated, which Kite's CSVUtil does not like.
> 
> I think it would be a redundant effort to convert the CSV documents to JSON
> documents.
> Any suggestions on how to infer an Avro schema from CSV data or a JSON
> schema would be helpful.
> 
> thanks
> Sri
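
As for inferring the schema from the CSV header itself, below is a rough
sketch (my own, not a Beam or Kite API): it treats every column as an
optional string and rewrites hyphenated headers into valid Avro names,
which is what Kite's CSVUtil chokes on. Class and method names are
illustrative.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class CsvSchemaInference {
      // Build a string-typed Avro schema from a CSV header line.
      public static Schema fromHeader(String headerLine) {
        SchemaBuilder.FieldAssembler<Schema> fields =
            SchemaBuilder.record("CsvRow").fields();
        for (String column : headerLine.split(",")) {
          // Avro names must match [A-Za-z_][A-Za-z0-9_]*, so map
          // "first-name" to "first_name" and guard the first character.
          String name = column.trim().replaceAll("[^A-Za-z0-9_]", "_");
          if (name.isEmpty() || Character.isDigit(name.charAt(0))) {
            name = "_" + name;
          }
          fields = fields.optionalString(name);
        }
        return fields.endRecord();
      }
    }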



Re: [flink-runner] taskmanager.network.memory.max ignored for local Flink runner

2019-01-14 Thread Maximilian Michels

> Why the older version does not hit the limit and the newer version does is
> not quite clear, but it could just be expected resource usage differences
> between versions.


There were some changes to how the network buffers are assigned. But my best
guess is that it's because we changed the default parallelism from 1 to the
number of available cores. That consumes more network buffers, which exceeds
the default limit.
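
As a quick way to test that guess (a sketch of mine, not something from this
thread; the class name is illustrative), you could pin the parallelism back
to the old default of 1 via FlinkPipelineOptions and see whether the buffer
error goes away:

    import org.apache.beam.runners.flink.FlinkPipelineOptions;
    import org.apache.beam.runners.flink.FlinkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class LocalParallelismCheck {
      public static void main(String[] args) {
        FlinkPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
        options.setRunner(FlinkRunner.class);
        // Pin parallelism to 1 (the pre-2.9 local default) so the job
        // requests fewer network buffers.
        options.setParallelism(1);
        Pipeline pipeline = Pipeline.create(options);
        // ... build and run the pipeline as usual ...
      }
    }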



> I can see how differentiating between the two kinds of configs is
> important. If FLINK_CONF_DIR is a de facto standard for Flink, that does
> seem like the right solution, but I think this could be better documented
> on the Flink runner page.


It should be transparent to the user because the Flink scripts set the
environment variable. The FlinkPipelineOptions are for job-scoped settings.
Local execution is an exception because it brings up a new cluster.


Thinking about it more, we might actually add a configuration option which
can solely be used for local execution. That would unblock the problem you
are seeing and also allow users to test their production config locally.


Thanks,
Max

On 13.01.19 05:57, Mike Pedersen wrote:

Hi Max.

Ah, that explains it. Great to see it already has been fixed.

Currently we are using an older version of Beam which does not run out of
network buffers. Why the older version does not hit the limit and the newer
version does is not quite clear, but it could just be expected resource usage
differences between versions. We can use that until the 2.10.0 release.


I can see how differentiating between the two kinds of configs is important.
If FLINK_CONF_DIR is a de facto standard for Flink, that does seem like the
right solution, but I think this could be better documented on the Flink
runner page.


Thanks a lot for the response,
Mike

On Sun, Jan 13, 2019 at 02:02, Maximilian Michels wrote:


Hi Mike,

Thank you for your message. What you have done is correct, but you have run
into a bug which was present for local execution in 2.9.0. It has since been
fixed for the upcoming 2.10.0 release.

If you look at the 2.9.0 branch, you will see that the configuration is not
passed to the local cluster:

https://github.com/apache/beam/blob/release-2.9.0/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkExecutionEnvironments.java#L63

It is confusing that the config is indeed loaded but then never passed to
the local cluster.

Regarding the environment variable you mentioned, this is how Flink picks up
the configuration file. If you use the Flink CLI or scripts this will work
fine. But keep in mind that the memory settings are only read upon cluster
startup, so changing this value for a Beam job won't do anything to existing
non-local clusters.

We could add an option to FlinkPipelineOptions to allow arbitrary Flink
options to be passed. The main reason why we hesitated doing that was to
avoid confusion about the different types of configuration settings and
their scope.

Please let us know if you have further questions.

Best,
Max


On January 9, 2019 8:27:36 AM EST, Mike Pedersen <m...@mikepedersen.dk> wrote:

So I have a Beam job that I want to run with Flink locally. The problem is,
I get the following error:

 > java.io.IOException: Insufficient number of network buffers: required
 > 32, but only 24 available. The total number of network buffers is
 > currently set to 32768 of 32768 bytes each. You can increase this number
 > by setting the configuration keys 'taskmanager.network.memory.fraction',
 > 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.

So I create a config file with taskmanager.network.memory.max set to 5gb
and taskmanager.network.memory.fraction set to 0.2. I also set
FLINK_CONF_DIR to the dir with the config file (an undocumented
feature) and set --flinkMaster to "[local]", as it seems like
the default "[auto]" ignores the config file:

https://github.com/apache/beam/blob/1e41220977d6c45d293b86f2e581daec3513c66e/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkExecutionEnvironments.java#L76-L82.
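
For reference, this is roughly what that setup looks like (the conf
directory path and pipeline jar name are placeholders):

    # flink-conf.yaml
    taskmanager.network.memory.fraction: 0.2
    taskmanager.network.memory.max: 5gb

    # shell: point Flink at the config dir, then run the Beam job locally
    export FLINK_CONF_DIR=/path/to/conf-dir
    java -jar my-beam-pipeline.jar --runner=FlinkRunner --flinkMaster=[local]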

Now, it seems like configs are loaded ok. I get the following log
message during start:

 > Jan 09, 2019 10:54:43 AM org.apache.flink.configuration.GlobalConfiguration loadYAMLResource
 > INFO: Loading configuration property: taskmanager.network.memory.max, 5gb

But the error at the top of the post still appears. 32768 * 32768 bytes
= 1gb, which is the default value of taskmanager.network.memory.max, so
it seems like the config is ignored.

Any ideas what might cause this problem? Am I adjusting the wrong
parameter or something?



Re: Beam courses

2019-01-14 Thread Maximilian Michels

Hi Alex,

I know of 
http://www.bigdatainstitute.io/courses/data-engineering-with-apache-beam/

There are also some public materials by Jesse (in CC): 
https://github.com/eljefe6a/beamexample
This training uses the above exercises: 
https://docs.google.com/presentation/d/1ln5KndBTiskEOGa1QmYSCq16YWO9Dtmj7ZwzjU7SsW4


Overall, this is more for beginners, and some of the newer features, like
user state and timers, are missing. I think it could be interesting to
collaborate on an updated in-depth training.


Best,
Max

On 14.01.19 01:47, Davor Bonaci wrote:

I'll introduce you to folks who can do this for you off-list.

On Sun, Jan 13, 2019 at 12:28 PM Vikram Tiwari wrote:


Hey! I think he mentioned it to me once that they do trainings for Beam etc.
Might wanna talk to him.
https://www.linkedin.com/in/dbonaci


On Sun, Jan 13, 2019, 12:08 PM Alex Van Boxel <a...@vanboxel.be> wrote:

Hey all,

Our team had the luxury of growing with Beam; we were Dataflow users
before it was GA. But now our team has grown, due to a merger.

As we will continue using Beam, but now across different sites, I'm
thinking about training. The question is... Should I create trainings
myself? Or do people specialise in Beam training? I'm not talking about
some simple getting-started training... I want deep training.

Any suggestions on how people in this group do trainings?