Re: Source job parallelism

2015-08-25 Thread Matthias J. Sax
Hi Arnaud,

did you try:

 Env.setSource(mySource).setParallelism(1).map(mymapper).setParallelism(10)

If this does not work, it might be that Flink chains the mapper to the
source, which implies that both use the same parallelism (the producer
dictates this degree of parallelism).

Using a rebalance() in between should break the chaining:

 Env.setSource(mySource).setParallelism(1).rebalance().map(mymapper).setParallelism(10)
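
For reference, here is a minimal self-contained sketch of that layout (NumberSource
and SlowMapper are just stand-ins for your classes, and addSource() is used in place
of the Env.setSource shorthand above):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

public class MixedParallelismJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new NumberSource())   // stand-in for the RichParallelSourceFunction
           .setParallelism(1)               // only the source is forced to parallelism 1
           .rebalance()                     // breaks chaining, redistributes round-robin
           .map(new SlowMapper())           // stand-in for the long-running mapper
           .setParallelism(10)              // the mapper runs with 10 parallel tasks
           .print();

        env.execute("mixed-parallelism");
    }

    /** Trivial stand-in source: emits a fixed range of numbers and stops. */
    public static class NumberSource extends RichParallelSourceFunction<Long> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            for (long i = 0; i < 1000 && running; i++) {
                ctx.collect(i);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    /** Trivial stand-in for the expensive mapper. */
    public static class SlowMapper implements MapFunction<Long, String> {
        @Override
        public String map(Long value) {
            return "processed " + value;
        }
    }
}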



-Matthias

On 08/25/2015 07:08 PM, LINZ, Arnaud wrote:
 Hi,
 
  
 
 I have a streaming source that extends RichParallelSourceFunction, but
 for some reason I don’t want parallelism at the source level, so I use:
 
 Env.setSource(mySource).setParallelism(1).map(mymapper)
 
  
 
 I do want parallelism at the mapper level, because it’s a long task, and
 I would like the source to dispatch data to several mappers.
 
  
 
 It seems that I don’t get parallelism on the mapper; the setParallelism(1)
 call does not seem to apply only to the source.
 
 Is that right? If yes, how can I mix my parallelism levels?
 
  
 
 Best regards,
 
 Arnaud
 
 
 
 
 
 The integrity of this message cannot be guaranteed on the Internet. The
 company that sent this message cannot therefore be held liable for its
 content nor attachments. Any unauthorized use or dissemination is
 prohibited. If you are not the intended recipient of this message, then
 please delete it and notify the sender.





Re: Application-specific loggers configuration

2015-08-25 Thread Aljoscha Krettek
Hi Gwenhaël,
are you using the one-yarn-cluster-per-job mode of Flink? I.e., you are
starting your Flink job with (from the doc):

flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096
./examples/flink-java-examples-0.10-SNAPSHOT-WordCount.jar

If you are, then this is almost possible on the current version of Flink.
What you have to do is copy the conf directory of Flink to a separate
directory that is specific to your job. There you make your modifications
to the log configuration etc. Then, when you start your job you do this
instead:

export FLINK_CONF_DIR=/path/to/my/conf
flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096
./examples/flink-java-examples-0.10-SNAPSHOT-WordCount.jar

You can easily put this into your startup script, of course.

I said almost possible because this requires a small fix in bin/flink.
Around line 130 this line:
FLINK_CONF_DIR=$FLINK_ROOT_DIR_MANGLED/conf
needs to be replaced by:
if [ -z "$FLINK_CONF_DIR" ]; then
FLINK_CONF_DIR=$FLINK_ROOT_DIR_MANGLED/conf; fi

(We will fix this in the upcoming version and the 0.9.1 bugfix release.)

Does this help? Let us know if you are not using the
one-yarn-cluster-per-job mode, then we'll have to try to find another
solution.

Best,
Aljoscha



On Tue, 25 Aug 2015 at 16:22 Gwenhael Pasquiers 
gwenhael.pasqui...@ericsson.com wrote:

 Hi,



 We’re developing the first of (we hope) many Flink streaming apps.



 We’d like to package the logging configuration (log4j) together with the
 jar. Meaning, different applications will probably have different logging
 configurations (e.g., different logstash ports) …



 Is there a way to “override” the many log4j properties files that are in
 flink/conf/*.properties?



 In our environment, the Flink binaries would be on the PATH, and our apps
 would be:

 -  Jar file

 -  App configuration files

 -  Log configuration files

 -  Startup script



 B.R.



 Gwenhaël PASQUIERS



[ANNOUNCE] Flink Forward 2015 program is online

2015-08-25 Thread Kostas Tzoumas
Hi everyone,

Just a shoutout that we have posted the program of Flink Forward 2015 here:
http://flink-forward.org/?post_type=day

You can expect a few changes here and there, but the majority of the talks
are in.

Thanks again to the speakers and the reviewers!

If you have not registered yet, now is the time to do it :-) (here:
http://flink-forward.org/?page_id=96)

Kostas


Re: Broadcasting sets in Flink Streaming

2015-08-25 Thread Tamara Mendt
Ok, I'll try that. Thanks a lot!

On Tue, Aug 25, 2015 at 4:19 PM, Stephan Ewen se...@apache.org wrote:

 You can do something very similar to broadcast sets, like this:

 Use a Co-Map function and connect your main data set regularly (forward
 partitioning) to one input and your broadcast set via broadcast to the
 other input. You can then retrieve the data in the two map functions
 separately.

 This approach misses the guarantee that the broadcast data arrives fully
 before the non-broadcast data (you may receive events from the main data
 set before all broadcast data has been received), but maybe you can work around
 that...

 On Tue, Aug 25, 2015 at 2:45 PM, Till Rohrmann trohrm...@apache.org
 wrote:

 Hi Tamara,

 I think this is not officially supported by Flink yet. However, I think
 that Gyula had once an example where he did something comparable. Maybe he
 can chime in here.

 Cheers,
 Till

 On Tue, Aug 25, 2015 at 11:15 AM, Tamara Mendt tammyme...@gmail.com
 wrote:

 Hello,

 I have been trying to use the function withBroadcastSet on a
 SingleOutputStreamOperator (map) the same way I would on a MapOperator for
 a DataSet. From what I see, this cannot be done. I wonder if there is some
 way to broadcast a DataSet to the tasks that are performing transformations
 on a DataStream?

 I am basically pre-calculating some things with Flink which I later need
 for the transformations on the incoming data from the stream. So I want to
 broadcast the resulting datasets from the pre-calculations.

 Any ideas on how to best approach this?

 Thanks, cheers

 Tamara.






-- 
Tamara Mendt


Application-specific loggers configuration

2015-08-25 Thread Gwenhael Pasquiers
Hi,

We're developing the first of (we hope) many Flink streaming apps.

We'd like to package the logging configuration (log4j) together with the jar. 
Meaning, different applications will probably have different logging 
configurations (e.g., different logstash ports) ...

Is there a way to override the many log4j properties files that are in 
flink/conf/*.properties?

In our environment, the Flink binaries would be on the PATH, and our apps would 
be:

-  Jar file

-  App configuration files

-  Log configuration files

-  Startup script

B.R.

Gwenhaël PASQUIERS


Re: Broadcasting sets in Flink Streaming

2015-08-25 Thread Stephan Ewen
You can do something very similar to broadcast sets, like this:

Use a Co-Map function and connect your main data set regularly (forward
partitioning) to one input and your broadcast set via broadcast to the
other input. You can then retrieve the data in the two map functions
separately.

This approach misses the guarantee that the broadcast data arrives fully before
the non-broadcast data (you may receive events from the main data set
before all broadcast data has been received), but maybe you can work around
that...
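
For illustration, here is a minimal sketch of that pattern with a Co-FlatMap
(the two inputs and the lookup logic are made up; as said above, main-stream
elements may show up before the broadcast side is complete):

import java.util.HashSet;
import java.util.Set;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class BroadcastViaCoMap {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> mainStream = env.fromElements("a", "b", "c", "d");
        DataStream<String> broadcastSet = env.fromElements("b", "d");

        mainStream
            .connect(broadcastSet.broadcast())   // every parallel task receives the full second input
            .flatMap(new CoFlatMapFunction<String, String, String>() {
                private final Set<String> lookup = new HashSet<>();

                @Override
                public void flatMap1(String value, Collector<String> out) {
                    // Main input: filter against whatever broadcast data has arrived so far.
                    if (lookup.contains(value)) {
                        out.collect(value);
                    }
                }

                @Override
                public void flatMap2(String value, Collector<String> out) {
                    // Broadcast input: build up the lookup set.
                    lookup.add(value);
                }
            })
            .print();

        env.execute("broadcast-via-co-map");
    }
}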

On Tue, Aug 25, 2015 at 2:45 PM, Till Rohrmann trohrm...@apache.org wrote:

 Hi Tamara,

 I think this is not officially supported by Flink yet. However, I think
 that Gyula had once an example where he did something comparable. Maybe he
 can chime in here.

 Cheers,
 Till

 On Tue, Aug 25, 2015 at 11:15 AM, Tamara Mendt tammyme...@gmail.com
 wrote:

 Hello,

 I have been trying to use the function withBroadcastSet on a
 SingleOutputStreamOperator (map) the same way I would on a MapOperator for
 a DataSet. From what I see, this cannot be done. I wonder if there is some
 way to broadcast a DataSet to the tasks that are performing transformations
 on a DataStream?

 I am basically pre-calculating some things with Flink which I later need
 for the transformations on the incoming data from the stream. So I want to
 broadcast the resulting datasets from the pre-calculations.

 Any ideas on how to best approach this?

 Thanks, cheers

 Tamara.





Re: Flink to ingest from Kafka to HDFS?

2015-08-25 Thread Rico Bergmann
Hi!

Sorry, I won't be able to implement this soon. I just shared my ideas on this. 

Greets. Rico. 



 On 25.08.2015 at 17:52, Stephan Ewen se...@apache.org wrote:
 
 Hi Rico!
 
  Can you give us an update on your status here? We actually need something 
  like this as well (and pretty urgently), so we would jump in
  and implement this, unless you have something already.
 
 Stephan
 
 
 On Thu, Aug 20, 2015 at 12:13 PM, Stephan Ewen se...@apache.org wrote:
 BTW: This is becoming a dev discussion, maybe should move to that list...
 
 On Thu, Aug 20, 2015 at 12:12 PM, Stephan Ewen se...@apache.org wrote:
 Yes, one needs exactly a mechanism to seek the output stream back to the 
 last checkpointed position, in order to overwrite duplicates.
 
  I think HDFS is not going to make this easy; there is basically no seek for 
  write. Not sure how to solve this, other than writing to tmp files and 
  copying upon success.
 
  Apache Flume must have solved this issue in some way; it may be worth 
  looking into how they solved it.
 
 On Thu, Aug 20, 2015 at 11:58 AM, Rico Bergmann i...@ricobergmann.de 
 wrote:
 My ideas for checkpointing:
 
 I think writing to the destination should not depend on the checkpoint 
 mechanism (otherwise the output would never be written to the destination 
  if checkpointing is disabled). Instead I would keep the offsets of written 
  and checkpointed records. When recovering, you would then somehow delete or 
 overwrite the records after that offset. (But I don't really know whether 
 this is as simple as I wrote it ;-) ). 
 
 Regarding the rolling files I would suggest making the values of the 
 user-defined partitioning function part of the path or file name. Writing 
 records is then basically:
 Extract the partition to write to, then add the record to a queue for this 
 partition. Each queue has an output format assigned to it. On flushing the 
 output file is opened, the content of the queue is written to it, and then 
 closed.
 
 Does this sound reasonable?
 
 
 
  On 20.08.2015 at 10:40, Aljoscha Krettek aljos...@apache.org wrote:
 
  Yes, this seems like a good approach. We should probably not reuse the 
  KeySelector for this, but rather a more use-case-specific type of function 
  that can create a desired filename from an input object.
 
 This is only the first part, though. The hard bit would be implementing 
 rolling files and also integrating it with Flink's checkpointing 
 mechanism. For integration with checkpointing you could maybe use 
 staging-files: all elements are put into a staging file. And then, when 
  the notification about a completed checkpoint is received, the contents of 
  this file would be moved (or appended) to the actual destination.
 
 Do you have any Ideas about the rolling files/checkpointing?
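
  As a rough illustration of that staging-file idea, here is a hedged sketch: a
  sink that appends records to a staging file and promotes it to the final
  directory once a checkpoint completes. The class name and paths are hypothetical
  (each parallel subtask would need its own staging path), it leans on Hadoop's
  FileSystem API and on the checkpoint-completion callback (named CheckpointNotifier
  in this era of Flink, CheckpointListener in later releases; the latter form is
  assumed below), and it glosses over records written between the checkpoint
  barrier and the notification.

import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch only: one staging file, promoted to the final directory on checkpoint completion. */
public class StagedHdfsSink extends RichSinkFunction<String> implements CheckpointListener {

    private final String stagingPath;
    private final String finalDir;

    private transient FileSystem fs;
    private transient FSDataOutputStream staging;

    public StagedHdfsSink(String stagingPath, String finalDir) {
        this.stagingPath = stagingPath;
        this.finalDir = finalDir;
    }

    @Override
    public void open(org.apache.flink.configuration.Configuration parameters) throws Exception {
        fs = FileSystem.get(new org.apache.hadoop.conf.Configuration());
        staging = fs.create(new Path(stagingPath), true);
    }

    @Override
    public void invoke(String value) throws Exception {
        // Between checkpoints, records only ever go to the staging file.
        staging.write((value + "\n").getBytes("UTF-8"));
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        // Promote the staging file to the final directory and start a fresh one.
        // A real implementation must also deal with records written after the
        // barrier but before this notification, and with failures in between.
        staging.close();
        fs.rename(new Path(stagingPath), new Path(finalDir, "part-" + checkpointId));
        staging = fs.create(new Path(stagingPath), true);
    }

    @Override
    public void close() throws Exception {
        if (staging != null) {
            staging.close();
        }
    }
}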
 
 On Thu, 20 Aug 2015 at 09:44 Rico Bergmann i...@ricobergmann.de wrote:
 I'm thinking about implementing this. 
 
  After looking into the Flink code I would basically subclass 
  FileOutputFormat into, let's say, a KeyedFileOutputFormat that gets an 
  additional KeySelector object. The string returned by the KeySelector is 
  then appended to the path in the file system. 
 
  Do you think this is a good approach?
 
 Greets. Rico. 
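
  For concreteness, here is a stripped-down sketch of that KeyedFileOutputFormat
  idea, written as a streaming sink instead of an output format: the target file
  is derived from each record via a selector function. The class names are
  hypothetical, it uses Hadoop's FileSystem API directly, and it has none of the
  rolling-file or checkpointing logic discussed above.

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch only: routes each record to a file derived from the record itself. */
public class KeyedPathSink<T> extends RichSinkFunction<T> {

    /** User function mapping a record to a sub-path, e.g. "2015-08-25/clicks". */
    public interface PathSelector<IN> extends Serializable {
        String selectPath(IN value);
    }

    private final String basePath;
    private final PathSelector<T> selector;

    private transient FileSystem fs;
    private transient Map<String, FSDataOutputStream> openStreams;

    public KeyedPathSink(String basePath, PathSelector<T> selector) {
        this.basePath = basePath;
        this.selector = selector;
    }

    @Override
    public void open(org.apache.flink.configuration.Configuration parameters) throws Exception {
        fs = FileSystem.get(new org.apache.hadoop.conf.Configuration());
        openStreams = new HashMap<>();
    }

    @Override
    public void invoke(T value) throws Exception {
        String subPath = selector.selectPath(value);
        FSDataOutputStream out = openStreams.get(subPath);
        if (out == null) {
            // One open file per derived sub-path (and per parallel sink task).
            out = fs.create(new Path(basePath, subPath), true);
            openStreams.put(subPath, out);
        }
        out.write((value.toString() + "\n").getBytes("UTF-8"));
    }

    @Override
    public void close() throws Exception {
        if (openStreams != null) {
            for (FSDataOutputStream out : openStreams.values()) {
                out.close();
            }
        }
    }
}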
 
 
 
  On 16.08.2015 at 19:56, Stephan Ewen se...@apache.org wrote:
 
 If you are up for it, this would be a very nice addition to Flink, a 
 great contribution :-)
 
 On Sun, Aug 16, 2015 at 7:56 PM, Stephan Ewen se...@apache.org wrote:
 Hi!
 
 This should definitely be possible in Flink. Pretty much exactly like 
 you describe it.
 
  You need a custom version of the HDFS sink with some logic for when to 
  roll over to a new file.
 
 You can also make the sink exactly once by integrating it with the 
 checkpointing. For that, you would probably need to keep the current 
 path and output stream offsets as of the last checkpoint, so you can 
 resume from that offset and overwrite records to avoid duplicates. If 
 that is not possible, you would probably buffer records between 
 checkpoints and only write on checkpoints.
 
 Greetings,
 Stephan
 
 
 
 On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn hpz...@gmail.com 
 wrote:
 Hi,
 
 Did anybody think of (mis-) using Flink streaming as an alternative 
 to Apache Flume just for ingesting data from Kafka (or other 
 streaming sources) to HDFS? Knowing that Flink can read from Kafka 
  and write to HDFS I assume it should be possible, but is this a good 
 idea to do? 
 
 Flume basically is about consuming data from somewhere, peeking into 
 each record and then directing it to a specific directory/file in 
 HDFS reliably. I've seen there is a FlumeSink, but would it be 
 possible to get the same functionality with
 Flink alone?
 
 I've skimmed through the documentation and found the option to split 
 the output by key and the possibility to add multiple sinks. As I 
 understand, Flink programs are generally static, so it would not be 
 possible to add/remove sinks at runtime?
 So you would need to implement a custom 

Broadcasting sets in Flink Streaming

2015-08-25 Thread Tamara Mendt
Hello,

I have been trying to use the function withBroadcastSet on a
SingleOutputStreamOperator (map) the same way I would on a MapOperator for
a DataSet. From what I see, this cannot be done. I wonder if there is some
way to broadcast a DataSet to the tasks that are performing transformations
on a DataStream?

I am basically pre-calculating some things with Flink which I later need
for the transformations on the incoming data from the stream. So I want to
broadcast the resulting datasets from the pre-calculations.

Any ideas on how to best approach this?

Thanks, cheers

Tamara.


Re: Broadcasting sets in Flink Streaming

2015-08-25 Thread Till Rohrmann
Hi Tamara,

I think this is not officially supported by Flink yet. However, I think
that Gyula had once an example where he did something comparable. Maybe he
can chime in here.

Cheers,
Till

On Tue, Aug 25, 2015 at 11:15 AM, Tamara Mendt tammyme...@gmail.com wrote:

 Hello,

 I have been trying to use the function withBroadcastSet on a
 SingleOutputStreamOperator (map) the same way I would on a MapOperator for
 a DataSet. From what I see, this cannot be done. I wonder if there is some
 way to broadcast a DataSet to the tasks that are performing transformations
 on a DataStream?

 I am basically pre-calculating some things with Flink which I later need
 for the transformations on the incoming data from the stream. So I want to
 broadcast the resulting datasets from the pre-calculations.

 Any ideas on how to best approach this?

 Thanks, cheers

 Tamara.