Re: Source job parallelism
Hi Arnaud, did you try: env.addSource(mySource).setParallelism(1).map(myMapper).setParallelism(10)

If this does not work, it might be that Flink chains the mapper to the source, which forces both to use the same parallelism (the producer dictates the degree of parallelism). Using a rebalance() in between should break the chaining: env.addSource(mySource).setParallelism(1).rebalance().map(myMapper).setParallelism(10)

-Matthias

On 08/25/2015 07:08 PM, LINZ, Arnaud wrote: Hi, I have a streaming source that extends RichParallelSourceFunction, but for some reason I don't want parallelism at the source level, so I use: env.addSource(mySource).setParallelism(1).map(myMapper)

I do want parallelism at the mapper level, because it is a long-running task, and I would like the source to dispatch data to several mappers. It seems that I don't get parallelism on the mapper; it looks as if setParallelism(1) does not apply only to the source. Is that right? If so, how can I mix parallelism levels? Best regards, Arnaud
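For illustration, a minimal, self-contained sketch of the suggested pipeline (not from the original thread; NumberSource and SlowMapper are made-up placeholders for Arnaud's source and mapper):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

    public class SourceParallelismExample {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.addSource(new NumberSource())
               .setParallelism(1)     // only one source instance
               .rebalance()           // breaks chaining; distributes records round-robin to the mappers
               .map(new SlowMapper())
               .setParallelism(10)    // ten parallel mapper instances
               .print();

            env.execute("source-parallelism-example");
        }

        /** Simple source emitting a sequence of longs. */
        public static class NumberSource extends RichParallelSourceFunction<Long> {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<Long> ctx) throws Exception {
                long i = 0;
                while (running) {
                    ctx.collect(i++);
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }

        /** Placeholder for a long-running per-record computation. */
        public static class SlowMapper implements MapFunction<Long, Long> {
            @Override
            public Long map(Long value) throws Exception {
                Thread.sleep(100); // simulate expensive work
                return value * 2;
            }
        }
    }

Without the rebalance() the source and the map would be chained into one operator and run with a single parallelism; with it, the single source subtask fans records out to all ten mapper subtasks.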
Re: Application-specific loggers configuration
Hi Gwenhaël, are you using the one-yarn-cluster-per-job mode of Flink? I.e., you are starting your Flink job with (from the doc): flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096 ./examples/flink-java-examples-0.10-SNAPSHOT-WordCount.jar

If you are, then this is almost possible on the current version of Flink. What you have to do is copy the conf directory of Flink to a separate directory that is specific to your job. There you make your modifications to the log configuration etc. Then, when you start your job, you do this instead: export FLINK_CONF_DIR=/path/to/my/conf flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096 ./examples/flink-java-examples-0.10-SNAPSHOT-WordCount.jar

You can easily put this into your startup script, of course. I said "almost possible" because this requires a small fix in bin/flink. Around line 130, this line: FLINK_CONF_DIR=$FLINK_ROOT_DIR_MANGLED/conf needs to be replaced by this line: if [ -z $FLINK_CONF_DIR ]; then FLINK_CONF_DIR=$FLINK_ROOT_DIR_MANGLED/conf; fi (We will fix this in the upcoming version and the 0.9.1 bugfix release.)

Does this help? Let us know if you are not using the one-yarn-cluster-per-job mode; then we'll have to try to find another solution. Best, Aljoscha

On Tue, 25 Aug 2015 at 16:22 Gwenhael Pasquiers gwenhael.pasqui...@ericsson.com wrote: Hi, we're developing the first of (we hope) many Flink streaming apps. We'd like to package the logging configuration (log4j) together with the jar. Meaning, different applications will probably have different logging configurations (e.g. different logstash ports)... Is there a way to "override" the many log4j properties files that are in flink/conf/*.properties? In our environment, the Flink binaries would be on the PATH, and our apps would consist of:
- Jar file
- App configuration files
- Log configuration files
- Startup script
B.R. Gwenhaël PASQUIERS
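Putting the pieces above together, a hypothetical per-application launcher script could look like the sketch below (the paths, jar name, and directory layout are made up, and it assumes the bin/flink fix described above is applied):

    #!/usr/bin/env bash
    # Per-application launcher: points Flink at an app-specific conf directory
    # (containing its own log4j.properties) instead of the global flink/conf.

    APP_HOME=/opt/myapp                        # hypothetical application directory
    export FLINK_CONF_DIR="$APP_HOME/conf"     # copy of flink/conf with modified log config

    flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096 \
        "$APP_HOME/lib/my-streaming-app.jar"

Each application ships its own conf copy, so log destinations (e.g. logstash ports) can differ per app without touching the shared Flink installation.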
[ANNOUNCE] Flink Forward 2015 program is online
Hi everyone, just a shout-out that we have posted the program of Flink Forward 2015 here: http://flink-forward.org/?post_type=day You can expect a few changes here and there, but the majority of the talks are in. Thanks again to the speakers and the reviewers! If you have not registered yet, now is the time to do it :-) (here: http://flink-forward.org/?page_id=96) Kostas
Re: Broadcasting sets in Flink Streaming
Ok, I'll try that. Thanks a lot!

On Tue, Aug 25, 2015 at 4:19 PM, Stephan Ewen se...@apache.org wrote: You can do something very similar to broadcast sets like this: use a Co-Map function and connect your main data stream regularly (forward partitioning) to one input and your broadcast set via broadcast() to the other input. You can then retrieve the data in the two map functions separately. This approach misses the logic that ensures the broadcast data arrives fully before the non-broadcast data (you may receive events from the main data stream before all broadcast data was received), but maybe you can work around that...

On Tue, Aug 25, 2015 at 2:45 PM, Till Rohrmann trohrm...@apache.org wrote: Hi Tamara, I think this is not officially supported by Flink yet. However, I think that Gyula once had an example where he did something comparable. Maybe he can chime in here. Cheers, Till

On Tue, Aug 25, 2015 at 11:15 AM, Tamara Mendt tammyme...@gmail.com wrote: Hello, I have been trying to use the function withBroadcastSet on a SingleOutputStreamOperator (map) the same way I would on a MapOperator for a DataSet. From what I see, this cannot be done. I wonder if there is some way to broadcast a DataSet to the tasks that are performing transformations on a DataStream? I am basically pre-calculating some things with Flink which I later need for the transformations on the incoming data from the stream. So I want to broadcast the resulting datasets from the pre-calculations. Any ideas on how best to approach this? Thanks, cheers Tamara. -- Tamara Mendt
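A minimal sketch of Stephan's connect-plus-broadcast suggestion, using the plain DataStream API (the class names, sample data, and enrichment logic are made up for illustration):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.co.CoMapFunction;

    import java.util.HashSet;
    import java.util.Set;

    public class BroadcastCoMapExample {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Main stream of events (placeholder data).
            DataStream<String> events = env.fromElements("a", "b", "c", "d");

            // The "broadcast set": a small stream of reference values, e.g. the
            // result of a pre-computation re-read as a stream.
            DataStream<String> reference = env.fromElements("a", "c");

            events
                .connect(reference.broadcast())   // every parallel task receives all reference records
                .map(new EnrichingCoMap())
                .print();

            env.execute("broadcast-co-map-example");
        }

        /** Keeps the broadcast records in memory and uses them to tag main-stream records. */
        public static class EnrichingCoMap implements CoMapFunction<String, String, String> {
            private final Set<String> referenceData = new HashSet<>();

            @Override
            public String map1(String event) {
                // NOTE: as Stephan points out, there is no guarantee that all
                // broadcast records have arrived before the first event.
                return referenceData.contains(event) ? event + " (known)" : event + " (unknown)";
            }

            @Override
            public String map2(String refRecord) {
                referenceData.add(refRecord);
                return refRecord; // could be filtered out downstream
            }
        }
    }

map2 receives the broadcast input on every parallel subtask, while map1 processes the main stream; any real use would need to handle the ordering caveat mentioned above.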
Application-specific loggers configuration
Hi, we're developing the first of (we hope) many Flink streaming apps. We'd like to package the logging configuration (log4j) together with the jar. Meaning, different applications will probably have different logging configurations (e.g. different logstash ports)... Is there a way to override the many log4j properties files that are in flink/conf/*.properties? In our environment, the Flink binaries would be on the PATH, and our apps would consist of:
- Jar file
- App configuration files
- Log configuration files
- Startup script
B.R. Gwenhaël PASQUIERS
Re: Broadcasting sets in Flink Streaming
You can do something very similar to broadcast sets like this: use a Co-Map function and connect your main data stream regularly (forward partitioning) to one input and your broadcast set via broadcast() to the other input. You can then retrieve the data in the two map functions separately. This approach misses the logic that ensures the broadcast data arrives fully before the non-broadcast data (you may receive events from the main data stream before all broadcast data was received), but maybe you can work around that...

On Tue, Aug 25, 2015 at 2:45 PM, Till Rohrmann trohrm...@apache.org wrote: Hi Tamara, I think this is not officially supported by Flink yet. However, I think that Gyula once had an example where he did something comparable. Maybe he can chime in here. Cheers, Till

On Tue, Aug 25, 2015 at 11:15 AM, Tamara Mendt tammyme...@gmail.com wrote: Hello, I have been trying to use the function withBroadcastSet on a SingleOutputStreamOperator (map) the same way I would on a MapOperator for a DataSet. From what I see, this cannot be done. I wonder if there is some way to broadcast a DataSet to the tasks that are performing transformations on a DataStream? I am basically pre-calculating some things with Flink which I later need for the transformations on the incoming data from the stream. So I want to broadcast the resulting datasets from the pre-calculations. Any ideas on how best to approach this? Thanks, cheers Tamara.
Re: Flink to ingest from Kafka to HDFS?
Hi! Sorry, I won't be able to implement this soon. I just shared my ideas on this. Greets, Rico.

Am 25.08.2015 um 17:52 schrieb Stephan Ewen se...@apache.org: Hi Rico! Can you give us an update on your status here? We actually need something like this as well (and pretty urgently), so we would jump in and implement this, unless you have something already. Stephan

On Thu, Aug 20, 2015 at 12:13 PM, Stephan Ewen se...@apache.org wrote: BTW: This is becoming a dev discussion, maybe we should move to that list...

On Thu, Aug 20, 2015 at 12:12 PM, Stephan Ewen se...@apache.org wrote: Yes, one needs exactly a mechanism to seek the output stream back to the last checkpointed position, in order to overwrite duplicates. I think HDFS is not going to make this easy; there is basically no seek-for-write. Not sure how to solve this, other than writing to tmp files and copying upon success. Apache Flume must have solved this issue in some way; it may be worth looking into how they solved it.

On Thu, Aug 20, 2015 at 11:58 AM, Rico Bergmann i...@ricobergmann.de wrote: My ideas for checkpointing: I think writing to the destination should not depend on the checkpoint mechanism (otherwise the output would never be written to the destination if checkpointing is disabled). Instead I would keep the offsets of written and checkpointed records. When recovering you would then somehow delete or overwrite the records after that offset. (But I don't really know whether this is as simple as I wrote it ;-) ). Regarding the rolling files, I would suggest making the values of the user-defined partitioning function part of the path or file name. Writing records is then basically: extract the partition to write to, then add the record to a queue for this partition. Each queue has an output format assigned to it. On flushing, the output file is opened, the content of the queue is written to it, and then the file is closed. Does this sound reasonable?

Am 20.08.2015 um 10:40 schrieb Aljoscha Krettek aljos...@apache.org: Yes, this seems like a good approach. We should probably not reuse the KeySelector for this, but rather a more use-case-specific type of function that can create a desired filename from an input object. This is only the first part, though. The hard bit would be implementing rolling files and also integrating it with Flink's checkpointing mechanism. For integration with checkpointing you could maybe use staging files: all elements are put into a staging file, and then, when the notification about a completed checkpoint is received, the contents of this file would be moved (or appended) to the actual destination. Do you have any ideas about the rolling files/checkpointing?

On Thu, 20 Aug 2015 at 09:44 Rico Bergmann i...@ricobergmann.de wrote: I'm thinking about implementing this. After looking into the Flink code, I would basically subclass FileOutputFormat in, let's say, KeyedFileOutputFormat, that gets an additional KeySelector object. The path in the file system is then extended with the string the KeySelector returns. Do you think this is a good approach? Greets, Rico.

Am 16.08.2015 um 19:56 schrieb Stephan Ewen se...@apache.org: If you are up for it, this would be a very nice addition to Flink, a great contribution :-)

On Sun, Aug 16, 2015 at 7:56 PM, Stephan Ewen se...@apache.org wrote: Hi! This should definitely be possible in Flink, pretty much exactly like you describe it. You need a custom version of the HDFS sink with some logic for when to roll over to a new file.
You can also make the sink exactly-once by integrating it with the checkpointing. For that, you would probably need to keep the current path and output stream offsets as of the last checkpoint, so you can resume from that offset and overwrite records to avoid duplicates. If that is not possible, you would probably buffer records between checkpoints and only write on checkpoints. Greetings, Stephan

On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn hpz...@gmail.com wrote: Hi, did anybody think of (mis-)using Flink streaming as an alternative to Apache Flume just for ingesting data from Kafka (or other streaming sources) to HDFS? Knowing that Flink can read from Kafka and write to HDFS, I assume it should be possible, but is this a good idea to do? Flume basically is about consuming data from somewhere, peeking into each record and then directing it to a specific directory/file in HDFS reliably. I've seen there is a FlumeSink, but would it be possible to get the same functionality with Flink alone? I've skimmed through the documentation and found the option to split the output by key and the possibility to add multiple sinks. As I understand, Flink programs are generally static, so it would not be possible to add/remove sinks at runtime? So you would need to implement a custom
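To make the idea discussed in this thread concrete, here is a rough, hypothetical sketch of a sink that routes each record to an HDFS file derived from the record itself (the class name, the bucketing logic, and the paths are all invented; rolling files and the exactly-once/checkpoint integration discussed above are deliberately left out, so this is at-least-once at best):

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Writes each record to a file whose path depends on a "bucket" derived from
     * the record (playing the role of the KeySelector discussed in the thread).
     */
    public class BucketingHdfsSink extends RichSinkFunction<String> {

        private final String basePath;                        // e.g. "hdfs:///data/ingest"
        private transient FileSystem fs;
        private transient Map<String, FSDataOutputStream> openStreams;

        public BucketingHdfsSink(String basePath) {
            this.basePath = basePath;
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            fs = FileSystem.get(new org.apache.hadoop.conf.Configuration());
            openStreams = new HashMap<>();
        }

        /** Hypothetical bucketing logic: first character of the record picks the directory. */
        private String bucketFor(String record) {
            return record.isEmpty() ? "empty" : record.substring(0, 1);
        }

        @Override
        public void invoke(String record) throws Exception {
            String bucket = bucketFor(record);
            FSDataOutputStream out = openStreams.get(bucket);
            if (out == null) {
                Path file = new Path(basePath + "/" + bucket + "/part-"
                        + getRuntimeContext().getIndexOfThisSubtask());
                out = fs.create(file, true);
                openStreams.put(bucket, out);
            }
            // A real implementation would also flush/roll files and cooperate
            // with checkpoints to avoid duplicates on recovery.
            out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws Exception {
            for (FSDataOutputStream out : openStreams.values()) {
                out.close();
            }
        }
    }

The hard parts raised in the thread (rolling over files and overwriting or staging output on recovery) sit on top of something like this and are exactly what the discussed contribution would need to add.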
Broadcasting sets in Flink Streaming
Hello, I have been trying to use the function withBroadcastSet on a SingleOutputStreamOperator (map) the same way I would on a MapOperator for a DataSet. From what I see, this cannot be done. I wonder if there is some way to broadcast a DataSet to the tasks that are performing transformations on a DataStream? I am basically pre-calculating some things with Flink which I later need for the transformations on the incoming data from the stream. So I want to broadcast the resulting datasets from the pre-calculations. Any ideas on how to best approach this? Thanks, cheers Tamara.
Re: Broadcasting sets in Flink Streaming
Hi Tamara, I think this is not officially supported by Flink yet. However, I think that Gyula once had an example where he did something comparable. Maybe he can chime in here. Cheers, Till

On Tue, Aug 25, 2015 at 11:15 AM, Tamara Mendt tammyme...@gmail.com wrote: Hello, I have been trying to use the function withBroadcastSet on a SingleOutputStreamOperator (map) the same way I would on a MapOperator for a DataSet. From what I see, this cannot be done. I wonder if there is some way to broadcast a DataSet to the tasks that are performing transformations on a DataStream? I am basically pre-calculating some things with Flink which I later need for the transformations on the incoming data from the stream. So I want to broadcast the resulting datasets from the pre-calculations. Any ideas on how best to approach this? Thanks, cheers Tamara.