Re: Flink to ingest from Kafka to HDFS?
My ideas for checkpointing: I think writing to the destination should not depend on the checkpoint mechanism (otherwise the output would never be written to the destination if checkpointing is disabled). Instead, I would keep the offsets of written and checkpointed records. When recovering, you would then somehow delete or overwrite the records after that offset. (But I don't really know whether this is as simple as I wrote it ;-) )

Regarding the rolling files, I would suggest making the values of the user-defined partitioning function part of the path or file name. Writing records is then basically: extract the partition to write to, then add the record to a queue for this partition. Each queue has an output format assigned to it. On flushing, the output file is opened, the content of the queue is written to it, and then it is closed. Does this sound reasonable?

On 20.08.2015 at 10:40, Aljoscha Krettek aljos...@apache.org wrote:
Yes, this seems like a good approach. We should probably not reuse the KeySelector for this, but maybe a more use-case-specific type of function that can create a desired filename from an input object. This is only the first part, though. The hard bit would be implementing rolling files and also integrating it with Flink's checkpointing mechanism. For the integration with checkpointing you could maybe use staging files: all elements are put into a staging file. And then, when the notification about a completed checkpoint is received, the contents of this file would be moved (or appended) to the actual destination. Do you have any ideas about the rolling files/checkpointing?

On Thu, 20 Aug 2015 at 09:44, Rico Bergmann i...@ricobergmann.de wrote:
I'm thinking about implementing this. After looking into the Flink code, I would basically subclass FileOutputFormat in, let's say, KeyedFileOutputFormat, which gets an additional KeySelector object. The path in the file system is then appended with the string the KeySelector returns. Do you think this is a good approach? Greets, Rico.

On 16.08.2015 at 19:56, Stephan Ewen se...@apache.org wrote:
If you are up for it, this would be a very nice addition to Flink, a great contribution :-)

On Sun, Aug 16, 2015 at 7:56 PM, Stephan Ewen se...@apache.org wrote:
Hi! This should definitely be possible in Flink, pretty much exactly like you describe it. You need a custom version of the HDFS sink with some logic for when to roll over to a new file. You can also make the sink exactly-once by integrating it with the checkpointing. For that, you would probably need to keep the current path and output stream offsets as of the last checkpoint, so you can resume from that offset and overwrite records to avoid duplicates. If that is not possible, you would probably buffer records between checkpoints and only write on checkpoints. Greetings, Stephan

On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn hpz...@gmail.com wrote:
Hi, did anybody think of (mis-)using Flink streaming as an alternative to Apache Flume, just for ingesting data from Kafka (or other streaming sources) into HDFS? Knowing that Flink can read from Kafka and write to HDFS, I assume it should be possible, but is this a good idea? Flume is basically about consuming data from somewhere, peeking into each record, and then directing it to a specific directory/file in HDFS reliably. I've seen there is a FlumeSink, but would it be possible to get the same functionality with Flink alone?

I've skimmed through the documentation and found the option to split the output by key and the possibility to add multiple sinks. As I understand it, Flink programs are generally static, so it would not be possible to add/remove sinks at runtime? So you would need to implement a custom sink directing the records to different files based on a key (e.g. date)? Would it be difficult to implement things like rolling outputs etc., or is it better to just use Flume? Best, Hans-Peter
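[Editorial note: below is a minimal sketch of the partition-queue design Rico describes above. The class name, the partition function, and the flush threshold are all illustrative; local files stand in for HDFS, the 2015-era SinkFunction interface with invoke(IN) is assumed, and the checkpoint integration is deliberately omitted.]

import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Records are routed to per-partition queues by a user-defined partition
// function whose value becomes part of the path; each queue is flushed to
// its own file once it reaches a threshold (open, write queue, close).
public class PartitionedFileSink extends RichSinkFunction<String> {

    private static final int FLUSH_THRESHOLD = 100;
    private final String basePath;
    private final Map<String, List<String>> queues = new HashMap<>();

    public PartitionedFileSink(String basePath) {
        this.basePath = basePath;
    }

    // The user-defined partitioning function; purely illustrative here.
    private String partition(String record) {
        return record.isEmpty() ? "empty" : record.substring(0, 1);
    }

    @Override
    public void invoke(String record) throws Exception {
        String part = partition(record);
        List<String> queue = queues.computeIfAbsent(part, k -> new ArrayList<>());
        queue.add(record);
        if (queue.size() >= FLUSH_THRESHOLD) {
            flush(part, queue);
        }
    }

    private void flush(String partition, List<String> queue) throws Exception {
        Files.createDirectories(Paths.get(basePath, partition));
        // Roll to a new file per flush: open, write the queue content, close.
        String file = basePath + "/" + partition + "/part-" + System.nanoTime();
        try (BufferedWriter out = new BufferedWriter(new FileWriter(file))) {
            for (String r : queue) {
                out.write(r);
                out.newLine();
            }
        }
        queue.clear();
    }

    @Override
    public void close() throws Exception {
        // Flush any remaining buffered records on shutdown.
        for (Map.Entry<String, List<String>> e : queues.entrySet()) {
            if (!e.getValue().isEmpty()) {
                flush(e.getKey(), e.getValue());
            }
        }
    }
}

Aljoscha's staging-file idea would slot in at the flush step: write to a staging file first and move (or append) it to the final destination only once the checkpoint-complete notification arrives.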
Re: when use broadcast variable and run on bigdata display this error please help
Why is it not good to use a broadcast variable with big data?
Re: when use broadcast variable and run on bigdata display this error please help
As you can see from the exceptions, your broadcast variable is too large to fit into main memory. I think storing that amount of data in a broadcast variable is not the best approach; I would suggest using a DataSet for this instead.

On 20.08.2015 at 11:56, hagersaleh loveallah1...@yahoo.com wrote:
please help
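[Editorial note: a minimal sketch of the suggested alternative, with hypothetical data and field layouts. Instead of materializing a large DataSet in the memory of every task via withBroadcastSet, join against it, so it stays partitioned across the cluster.]

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class JoinInsteadOfBroadcast {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical inputs: a large reference set and the main data,
        // both keyed by the String in field 0.
        DataSet<Tuple2<String, Long>> largeReference = env.fromElements(
                new Tuple2<>("a", 1L), new Tuple2<>("b", 2L));
        DataSet<Tuple2<String, String>> mainData = env.fromElements(
                new Tuple2<>("a", "x"), new Tuple2<>("c", "y"));

        // A join keeps both inputs partitioned by key across the cluster, so
        // no single task has to hold the whole reference set -- unlike a
        // broadcast variable, which is fully materialized on every task.
        DataSet<Tuple2<String, Long>> enriched = mainData
                .join(largeReference)
                .where(0)
                .equalTo(0)
                .with(new JoinFunction<Tuple2<String, String>, Tuple2<String, Long>, Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> join(Tuple2<String, String> left,
                                                     Tuple2<String, Long> right) {
                        return new Tuple2<>(left.f1, right.f1);
                    }
                });

        enriched.print();
    }
}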
Re: Keep Model in Operator instance up to date
Hi, I don't think I fully understand your question, could you please try to be a little more specific? I assume by caching you mean that you keep the current model as operator state. Why would you need to add new streams in this case? I might be slow to answer as I am currently on vacation without a stable internet connection. Cheers, Gyula

On Thu, Aug 20, 2015 at 5:36 AM, Welly Tambunan if05...@gmail.com wrote:
Hi Gyula, I have another question. So if I cache something on the operator, will I always need to add and connect another stream of changes to the operator to keep it up to date? Is this right for every case? Cheers

On Wed, Aug 19, 2015 at 3:21 PM, Welly Tambunan if05...@gmail.com wrote:
Hi Gyula, that's really helpful. The docs have improved so much since the last time (0.9). Thanks a lot! Cheers

On Wed, Aug 19, 2015 at 3:07 PM, Gyula Fóra gyula.f...@gmail.com wrote:
Hey, if it is always better to check the events against a more up-to-date model (even if the events we are checking arrived before the update), then it is fine to keep the model outside of the system. In this case we need to make sure that we can push the updates to the external system consistently. If you are using the PersistentKafkaSource, for instance, it can happen that some messages are replayed in case of failure. In this case you need to make sure that you remove duplicate updates or have idempotent updates. You can read about the checkpoint mechanism on the Flink website: https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html Cheers, Gyula

On Wed, Aug 19, 2015 at 9:56 AM, Welly Tambunan if05...@gmail.com wrote:
Thanks Gyula. Another question I have: "... while external model updates would be *tricky* to keep consistent." Is that still the case if the operator treats the external model as read-only? We create another stream that will update the external model separately. Could you please elaborate more on this one? Cheers

On Wed, Aug 19, 2015 at 2:52 PM, Gyula Fóra gyula.f...@gmail.com wrote:
In that case I would apply a map to wrap them in some common type, like an Either<T1, T2>, before the union. And then in the co-flatmap you can unwrap it.

On Wed, Aug 19, 2015 at 9:50 AM, Welly Tambunan if05...@gmail.com wrote:
Hi Gyula, thanks. However, update1 and update2 have different types; based on my understanding, I don't think we can use union. How can we handle this one? We like to create our events strongly typed to capture the domain language. Cheers

On Wed, Aug 19, 2015 at 2:37 PM, Gyula Fóra gyula.f...@gmail.com wrote:
Hey, one input of your co-flatmap would be model updates and the other input would be events to check against the model, if I understand correctly. This means that if your model updates come from more than one stream, you need to union them into a single stream before connecting them with the event stream and applying the co-flatmap:

DataStream updates1 = ...
DataStream updates2 = ...
DataStream events = ...
events.connect(updates1.union(updates2).broadcast()).flatMap(...)

Does this answer your question? Gyula

On Wednesday, August 19, 2015, Welly Tambunan if05...@gmail.com wrote:
Hi Gyula, thanks for your response. However, the model can receive multiple events for updates. How can we do that with a co-flatmap, as I can see the connect API only receives a single DataStream? "... while external model updates would be tricky to keep consistent." Is that still the case if the operator treats the external model as read-only? We create another stream that will update the external model separately. Cheers

On Wed, Aug 19, 2015 at 2:05 PM, Gyula Fóra gyf...@apache.org wrote:
Hey! I think it is safe to say that the best approach in this case is creating a co-flatmap that will receive updates on one input. The events should probably be broadcast in this case so you can check in parallel. This approach can be used effectively with Flink's checkpoint mechanism, while external model updates would be tricky to keep consistent. Cheers, Gyula

On Wed, Aug 19, 2015 at 8:44 AM, Welly Tambunan if05...@gmail.com wrote:
Hi all, we have a streaming computation that is required to validate the data stream against the model provided by the user. Right now what I have done is to load the model into a Flink operator and then validate against it. However, the model can be updated and changed frequently. Fortunately, we always publish these events to RabbitMQ. I think we can:
1. Create a RabbitMQ listener for model-changed events inside the operator, then update the model when an event arrives. But I think this will create race conditions if not handled correctly, and it seems odd to keep this.
2. We can move the model into an external in-memory cache storage and keep the model up to date using Flink. So the operator
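[Editorial note: to make the connect/union/broadcast pattern from this thread concrete, here is a minimal runnable sketch. All types and data are invented for illustration, and a tagged String stands in for a real Either type: the update streams are unioned behind a common representation, broadcast so every parallel instance sees every update, connected with the event stream, and the model is kept as plain operator state in the CoFlatMapFunction.]

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

import java.util.HashSet;
import java.util.Set;

public class ModelCoFlatMapSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Two hypothetical update streams of different origin, wrapped into a
        // common tagged representation before the union (Gyula's Either idea).
        DataStream<String> updates1 = env.fromElements("a").map(u -> "t1|" + u);
        DataStream<String> updates2 = env.fromElements("b").map(u -> "t2|" + u);
        DataStream<String> events = env.fromElements("a", "b", "c");

        events
                // Broadcast the unioned updates so every parallel instance
                // maintains a full copy of the model.
                .connect(updates1.union(updates2).broadcast())
                .flatMap(new CoFlatMapFunction<String, String, String>() {
                    // The "model": here simply a set of valid keys, kept as
                    // operator state (checkpointing of this state is omitted).
                    private final Set<String> model = new HashSet<>();

                    @Override
                    public void flatMap1(String event, Collector<String> out) {
                        // Validate each event against the current model; note
                        // there is no ordering guarantee between the two inputs.
                        out.collect(event + (model.contains(event) ? " OK" : " UNKNOWN"));
                    }

                    @Override
                    public void flatMap2(String update, Collector<String> out) {
                        // Unwrap the tagged update and apply it to the model.
                        model.add(update.substring(update.indexOf('|') + 1));
                    }
                })
                .print();

        env.execute("model update sketch");
    }
}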
Re: Gelly ran out of memory
Hi Stephan, this looks like a bug to me. Shouldn't the memory manager switch to the out-of-managed area if it runs out of memory space? - Henry

On Thu, Aug 20, 2015 at 3:09 AM, Stephan Ewen se...@apache.org wrote:
Actually, you ran out of Flink managed memory, not user memory. User memory shortage manifests itself as a Java OutOfMemoryError. At this point, the delta iterations cannot spill. They additionally suffer a bit from memory fragmentation. A possible workaround is to use the option setSolutionSetUnmanaged(true) on the iteration. That will eliminate the fragmentation issue, at least. Stephan

On Thu, Aug 20, 2015 at 12:06 PM, Andra Lungu lungu.an...@gmail.com wrote:
Hi Flavio, these kinds of exceptions generally arise from the fact that you ran out of `user` memory. You can try to increase that a bit. In your flink-conf.yaml, try adding:

# the fraction of memory allocated to the system (managed memory) vs. the user
taskmanager.memory.fraction: 0.4

This will give 0.6 of the memory to the user and 0.4 to the system. Tell me if that helped. Andra

On Thu, Aug 20, 2015 at 12:02 PM, Flavio Pompermaier pomperma...@okkam.it wrote:
Hi to all, I tried to run my Gelly job on Flink 0.9-SNAPSHOT and I was getting an EOFException, so I tried 0.10-SNAPSHOT and now I have the following error:

Caused by: java.lang.RuntimeException: Memory ran out. Compaction failed. numPartitions: 32 minPartition: 73 maxPartition: 80 number of overflow segments: 0 bucketSize: 570 Overall memory: 102367232 Partition memory: 81100800 Message: null
    at org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:465)
    at org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
    at org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
    at org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:211)
    at org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:272)
    at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:354)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581)
    at java.lang.Thread.run(Thread.java:745)

Probably I'm doing something wrong, but I can't understand how to estimate the required memory for my Gelly job. Best, Flavio
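[Editorial note: for reference, a minimal sketch of Stephan's workaround on a plain DataSet delta iteration. The data and the step function are placeholders just to make it self-contained; whether your Gelly version exposes the same switch through its iteration configuration needs checking.]

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;

public class UnmanagedSolutionSetSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder vertex-id/value pairs standing in for the real graph state.
        DataSet<Tuple2<Long, Long>> initial = env.fromElements(
                new Tuple2<>(1L, 10L), new Tuple2<>(2L, 20L));

        // Delta iteration keyed on field 0, at most 10 supersteps.
        DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
                initial.iterateDelta(initial, 10, 0);

        // The workaround: keep the solution set in a plain Java hash map on the
        // heap instead of the managed-memory CompactingHashTable, avoiding the
        // fragmentation-induced "Memory ran out. Compaction failed." error.
        iteration.setSolutionSetUnmanaged(true);

        // Trivial step function, only here to make the sketch complete.
        DataSet<Tuple2<Long, Long>> delta = iteration.getWorkset()
                .map(new MapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
                    @Override
                    public Tuple2<Long, Long> map(Tuple2<Long, Long> value) {
                        return new Tuple2<>(value.f0, value.f1 + 1);
                    }
                });

        iteration.closeWith(delta, delta).print();
    }
}

The trade-off is that the solution set then lives on the JVM heap, so a shortage there will surface as an OutOfMemoryError instead.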