Re: Flink to ingest from Kafka to HDFS?

2015-08-20 Thread Rico Bergmann
My ideas for checkpointing:

I think writing to the destination should not depend on the checkpoint 
mechanism (otherwise the output would never be written to the destination if 
checkpointing is disabled). Instead I would keep the offsets of written and 
checkpointed records. When recovering you would then somehow delete or 
overwrite the records after that offset. (But I don't really know whether this 
is as simple as I wrote it ;-) ). 
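A very rough sketch of what I mean, using Flink's Checkpointed interface (class and field names are made up, opening of the stream is omitted, and the truncation on recovery is only indicated, not implemented):

import org.apache.flink.streaming.api.checkpoint.Checkpointed;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.fs.FSDataOutputStream;

// Sketch only: remember the byte offset of the last checkpointed record and
// truncate (or rewrite) the file back to it on recovery.
public class OffsetTrackingHdfsSink extends RichSinkFunction<String>
        implements Checkpointed<Long> {

    private transient FSDataOutputStream out;  // stream into the current HDFS file
    private long checkpointedOffset;           // offset as of the last checkpoint

    @Override
    public void invoke(String record) throws Exception {
        out.write((record + "\n").getBytes("UTF-8"));
    }

    @Override
    public Long snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
        out.hflush();                          // make the written data durable
        checkpointedOffset = out.getPos();     // remember how far we got
        return checkpointedOffset;
    }

    @Override
    public void restoreState(Long state) {
        checkpointedOffset = state;
        // on recovery: truncate or rewrite the file so that everything after
        // 'checkpointedOffset' (written but not checkpointed) is discarded
    }
}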

Regarding the rolling files, I would suggest making the values of the 
user-defined partitioning function part of the path or file name. Writing 
records is then basically:
Extract the partition to write to, then add the record to a queue for this 
partition. Each queue has an output format assigned to it. On flushing, the 
output file is opened, the content of the queue is written to it, and the file 
is then closed.
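As a standalone sketch (all class and method names below are made up for illustration, this is not an existing Flink class):

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the per-partition queue idea: records are buffered per partition
// value and written out in one go when the writer is flushed.
public class PartitionedWriter<T> {

    private final Map<String, Queue<T>> queues = new HashMap<>();
    private final Function<T, String> partitioner;   // user-defined partitioning function
    private final Function<String, Path> pathFor;    // partition value -> output file

    public PartitionedWriter(Function<T, String> partitioner, Function<String, Path> pathFor) {
        this.partitioner = partitioner;
        this.pathFor = pathFor;
    }

    public void write(T record) {
        String partition = partitioner.apply(record);   // extract the partition to write to
        queues.computeIfAbsent(partition, p -> new ArrayDeque<>()).add(record);
    }

    public void flush(FileSystem fs) throws IOException {
        for (Map.Entry<String, Queue<T>> e : queues.entrySet()) {
            // open the file for this partition, write the queued records, close it again
            // (append support depends on the file system; otherwise roll to a new file)
            try (FSDataOutputStream out = fs.append(pathFor.apply(e.getKey()))) {
                for (T record : e.getValue()) {
                    out.write((record.toString() + "\n").getBytes("UTF-8"));
                }
            }
            e.getValue().clear();
        }
    }
}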

Does this sound reasonable?



 On 20.08.2015, at 10:40, Aljoscha Krettek aljos...@apache.org wrote:
 
 Yes, this seems like a good approach. We should probably not reuse the 
 KeySelector for this, but rather a more use-case-specific type of function that 
 can create a desired filename from an input object.
 
 This is only the first part, though. The hard bit would be implementing 
 rolling files and also integrating it with Flink's checkpointing mechanism. 
 For integration with checkpointing you could maybe use staging-files: all 
 elements are put into a staging file. And then, when the notification about a 
 completed checkpoint is received, the contents of this file would be moved (or 
 appended) to the actual destination.
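 A rough sketch of that staging idea (just to illustrate; the notification hook is assumed to look like Flink's CheckpointNotifier, and all names are placeholders):
 
 import org.apache.flink.streaming.api.checkpoint.CheckpointNotifier;
 import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
 import org.apache.hadoop.fs.FSDataOutputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.FileUtil;
 import org.apache.hadoop.fs.Path;
 
 // Sketch: all records go to a staging file; once a checkpoint is confirmed
 // complete, the staged data is moved/appended to the final destination.
 public class StagingHdfsSink extends RichSinkFunction<String>
         implements CheckpointNotifier {
 
     private transient FileSystem fs;
     private transient Path stagingFile;   // e.g. <out>/.staging/part-<subtask>
     private transient Path finalFile;     // e.g. <out>/part-<subtask>
     private transient FSDataOutputStream staging;
 
     @Override
     public void invoke(String record) throws Exception {
         staging.write((record + "\n").getBytes("UTF-8"));   // everything goes to staging first
     }
 
     @Override
     public void notifyCheckpointComplete(long checkpointId) throws Exception {
         staging.hflush();
         // move the staged contents to the actual destination; a real
         // implementation would also have to handle a failure mid-move
         FileUtil.copy(fs, stagingFile, fs, finalFile, true, fs.getConf());
         staging = fs.create(stagingFile, true);              // start a fresh staging file
     }
 }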
 
 Do you have any ideas about the rolling files/checkpointing?
 
 On Thu, 20 Aug 2015 at 09:44 Rico Bergmann i...@ricobergmann.de wrote:
 I'm thinking about implementing this. 
 
 After looking into the Flink code I would basically subclass 
 FileOutputFormat into, let's say, a KeyedFileOutputFormat that gets an additional 
 KeySelector object. The path in the file system is then appended with the 
 string the KeySelector returns. 
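 Roughly what I have in mind (only a sketch; the real FileOutputFormat opens a single stream per task, so one would have to manage one stream per key, and error handling is left out):
 
 import java.io.IOException;
 import java.util.HashMap;
 import java.util.Map;
 
 import org.apache.flink.api.common.io.FileOutputFormat;
 import org.apache.flink.api.java.functions.KeySelector;
 import org.apache.flink.core.fs.FSDataOutputStream;
 import org.apache.flink.core.fs.Path;
 
 public class KeyedFileOutputFormat<T> extends FileOutputFormat<T> {
 
     private final KeySelector<T, String> keySelector;
     private transient Map<String, FSDataOutputStream> streams;
 
     public KeyedFileOutputFormat(Path basePath, KeySelector<T, String> keySelector) {
         super(basePath);
         this.keySelector = keySelector;
     }
 
     @Override
     public void open(int taskNumber, int numTasks) throws IOException {
         streams = new HashMap<>();   // one stream per key value
     }
 
     @Override
     public void writeRecord(T record) throws IOException {
         String key;
         try {
             key = keySelector.getKey(record);   // e.g. the date of the record
         } catch (Exception e) {
             throw new IOException(e);
         }
         FSDataOutputStream out = streams.get(key);
         if (out == null) {
             // append the key to the configured base path: <base>/<key>/part
             Path p = new Path(getOutputFilePath(), key + "/part");
             out = p.getFileSystem().create(p, true);
             streams.put(key, out);
         }
         out.write((record.toString() + "\n").getBytes("UTF-8"));
     }
 
     @Override
     public void close() throws IOException {
         for (FSDataOutputStream s : streams.values()) {
             s.close();
         }
     }
 }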
 
 Do you think this is a good approach?
 
 Greets. Rico. 
 
 
 
 On 16.08.2015, at 19:56, Stephan Ewen se...@apache.org wrote:
 
 If you are up for it, this would be a very nice addition to Flink, a great 
 contribution :-)
 
 On Sun, Aug 16, 2015 at 7:56 PM, Stephan Ewen se...@apache.org wrote:
 Hi!
 
 This should definitely be possible in Flink. Pretty much exactly like you 
 describe it.
 
 You need a custom version of the HDFS sink with some logic when to roll 
 over to a new file.
 
 You can also make the sink exactly once by integrating it with the 
 checkpointing. For that, you would probably need to keep the current path 
 and output stream offsets as of the last checkpoint, so you can resume 
 from that offset and overwrite records to avoid duplicates. If that is not 
 possible, you would probably buffer records between checkpoints and only 
 write on checkpoints.
 
 Greetings,
 Stephan
 
 
 
 On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn hpz...@gmail.com wrote:
 Hi,
 
 Did anybody think of (mis-) using Flink streaming as an alternative to 
 Apache Flume just for ingesting data from Kafka (or other streaming 
 sources) to HDFS? Knowing that Flink can read from Kafka and write to 
 HDFS, I assume it should be possible, but is this a good idea? 
 
 Flume basically is about consuming data from somewhere, peeking into each 
 record and then directing it to a specific directory/file in HDFS 
 reliably. I've seen there is a FlumeSink, but would it be possible to get 
 the same functionality with
 Flink alone?
 
 I've skimmed through the documentation and found the option to split the 
 output by key and the possibility to add multiple sinks. As I understand, 
 Flink programs are generally static, so it would not be possible to 
 add/remove sinks at runtime?
 So you would need to implement a custom sink directing the records to 
 different files based on a key (e.g. date)? Would it be difficult to 
 implement things like rolling outputs etc? Or better just use Flume?
 
 Best, 
 Hans-Peter


Re: Gelly ran out of memory

2015-08-20 Thread Andra Lungu
Hi Flavio,

These kinds of exceptions generally arise from the fact that you ran out of
`user` memory. You can try to increase that a bit.
In your flink-conf.yaml try adding
# The memory fraction allocated to the system (the rest goes to user code)
taskmanager.memory.fraction: 0.4

This will give 0.6 of the memory to the user and 0.4 to the system.

Tell me if that helped.
Andra

On Thu, Aug 20, 2015 at 12:02 PM, Flavio Pompermaier pomperma...@okkam.it
wrote:

 Hi to all,

 I tried to run my gelly job on Flink 0.9-SNAPSHOT and I was having an
 EOFException, so I tried on 0.10-SNAPSHOT and now I have the following
 error:

 Caused by: java.lang.RuntimeException: Memory ran out. Compaction failed.
 numPartitions: 32 minPartition: 73 maxPartition: 80 number of overflow
 segments: 0 bucketSize: 570 Overall memory: 102367232 Partition memory:
 81100800 Message: null
 at
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:465)
 at
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
 at
 org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
 at
 org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:211)
 at
 org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:272)
 at
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:354)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581)
 at java.lang.Thread.run(Thread.java:745)

 Probably I'm doing something wrong but I can't understand how to estimate
 the required memory for my Gelly job..

 Best,
 Flavio



Re: when use broadcast variable and run on bigdata display this error please help

2015-08-20 Thread hagersaleh
Why is it not good to use a broadcast variable with big data?





Re: when use broadcast variable and run on bigdata display this error please help

2015-08-20 Thread Rico Bergmann
As you can see from the exceptions, your broadcast variable is too large to fit 
into main memory. 

I think storing that amount of data in a broadcast variable is not the best 
approach. I would suggest using a DataSet for this instead. 
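For example (just a sketch; I'm assuming you used the broadcast variable to filter a big data set against a lookup set, and the field positions are placeholders), a join lets Flink spill to disk instead of holding everything in one task's memory:

// Instead of: bigInput.filter(...).withBroadcastSet(lookup, "lookup")
// join the two data sets on the key field:
DataSet<Tuple2<Long, String>> bigInput = ...;   // the large input
DataSet<Tuple1<Long>> lookupKeys = ...;         // what used to be the broadcast variable

DataSet<Tuple2<Long, String>> matched = bigInput
        .join(lookupKeys)
        .where(0)      // key field in bigInput
        .equalTo(0)    // key field in lookupKeys
        .with(new JoinFunction<Tuple2<Long, String>, Tuple1<Long>, Tuple2<Long, String>>() {
            @Override
            public Tuple2<Long, String> join(Tuple2<Long, String> left, Tuple1<Long> right) {
                return left;   // keep only records whose key appears in the lookup set
            }
        });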



 On 20.08.2015, at 11:56, hagersaleh loveallah1...@yahoo.com wrote:
 
 please help
 
 
 


Re: Keep Model in Operator instance up to date

2015-08-20 Thread Gyula Fóra
Hi,

I don't think I fully understand your question, could you please try to be
a little more specific?

I assume by caching you mean that you keep the current model as an operator
state. Why would you need to add new streams in this case?

I might be slow to answer as I am currently on vacation without a stable
internet connection.

Cheers,
Gyula
On Thu, Aug 20, 2015 at 5:36 AM Welly Tambunan if05...@gmail.com wrote:

 Hi Gyula,

 I have another question. So if I cache something on the operator, to keep
 it up to date, will I always need to add and connect another stream of
 changes to the operator?

 Is this right for every case?

 Cheers

 On Wed, Aug 19, 2015 at 3:21 PM, Welly Tambunan if05...@gmail.com wrote:

 Hi Gyula,

 That's really helpful. The docs have improved so much since the last time
 (0.9).

 Thanks a lot !

 Cheers

 On Wed, Aug 19, 2015 at 3:07 PM, Gyula Fóra gyula.f...@gmail.com wrote:

 Hey,

 If it is always better to check the events against a more up-to-date
 model (even if the events we are checking arrived before the update) then
 it is fine to keep the model outside of the system.

 In this case we need to make sure that we can push the updates to the
 external system consistently. If you are using the PersistentKafkaSource,
 for instance, it can happen that some messages are replayed in case of
 failure. In that case you need to make sure that you remove duplicate
 updates or have idempotent updates.
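 For example, if every model update carries an id and a version, an upsert keyed by the id makes replays harmless (sketch with a made-up key-value client):
 
 // Sketch: idempotent update of the external model store. 'KvClient',
 // 'ModelUpdate' and their methods are placeholders for whatever you use.
 void applyUpdate(KvClient store, ModelUpdate update) {
     ModelUpdate current = store.get(update.getModelId());
     // only apply if the stored version is not newer than this update
     if (current == null || current.getVersion() <= update.getVersion()) {
         store.put(update.getModelId(), update);   // replaying the same update has no extra effect
     }
 }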

 You can read about the checkpoint mechanism in the Flink website:
 https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html

 Cheers,
 Gyula

 On Wed, Aug 19, 2015 at 9:56 AM Welly Tambunan if05...@gmail.com
 wrote:

 Thanks Gyula,

 Another question I have..

  ... while external model updates would be *tricky* to keep
 consistent.
 Is that still the case if the operator treats the external model as
 read-only? We create another stream that will update the external model
 separately.

 Could you please elaborate more about this one?

 Cheers

 On Wed, Aug 19, 2015 at 2:52 PM, Gyula Fóra gyula.f...@gmail.com
 wrote:

 In that case I would apply a map to wrap them in some common type, like an
 Either<T1, T2>, before the union.

 And then in the co-flatmap you can unwrap it.
 On Wed, Aug 19, 2015 at 9:50 AM Welly Tambunan if05...@gmail.com
 wrote:

 Hi Gyula,

 Thanks.

 However, update1 and update2 have different types. Based on my
 understanding, I don't think we can use union. How can we handle this
 one?

 We like to create our events strongly typed to get the domain language
 captured.


 Cheers

 On Wed, Aug 19, 2015 at 2:37 PM, Gyula Fóra gyula.f...@gmail.com
 wrote:

 Hey,

 One input of your co-flatmap would be model updates and the other
 input would be events to check against the model if I understand 
 correctly.

 This means that if your model updates come from more than one stream
 you need to union them into a single stream before connecting them with the
 event stream and applying the co-flatmap.

 DataStream updates1 = 
 DataStream updates2 = 
 DataStream events = 

 events.connect(updates1.union(updates2).broadcast()).flatMap(...)
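 A rough sketch of the co-flatmap itself (ModelUpdate, Event, Alert and Model are placeholders for your own types):
 
 DataStream<Alert> alerts = events
         .connect(updates1.union(updates2).broadcast())
         .flatMap(new CoFlatMapFunction<Event, ModelUpdate, Alert>() {
 
             private Model model = Model.empty();   // the cached model, kept in the operator
 
             @Override
             public void flatMap1(Event event, Collector<Alert> out) {
                 if (!model.accepts(event)) {       // check the event against the current model
                     out.collect(new Alert(event));
                 }
             }
 
             @Override
             public void flatMap2(ModelUpdate update, Collector<Alert> out) {
                 model = model.apply(update);       // broadcasted updates refresh the model
             }
         });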

 Does this answer your question?

 Gyula


 On Wednesday, August 19, 2015, Welly Tambunan if05...@gmail.com
 wrote:

 Hi Gyula,

 Thanks for your response.

 However the model can receive multiple events for updates. How can
 we do that with a co-flatmap, as I can see the connect API only
 receives a single DataStream?


  ... while external model updates would be tricky to keep
 consistent.
 Is that still the case if the operator treats the external model as
 read-only? We create another stream that will update the external
 model separately.

 Cheers

 On Wed, Aug 19, 2015 at 2:05 PM, Gyula Fóra gyf...@apache.org
 wrote:

 Hey!

 I think it is safe to say that the best approach in this case is
 creating a co-flatmap that will receive updates on one input. The 
 events
 should probably be broadcasted in this case so you can check in 
 parallel.

 This approach can be used effectively with Flink's checkpoint
 mechanism, while external model updates would be tricky to keep 
 consistent.

 Cheers,
 Gyula




 On Wed, Aug 19, 2015 at 8:44 AM Welly Tambunan if05...@gmail.com
 wrote:

 Hi All,

 We have a streaming computation that is required to validate the
 data stream against the model provided by the user.

 Right now what I have done is to load the model into the Flink
 operator and then validate against it. However, the model can be
 updated and changed frequently. Fortunately we always publish this
 event to RabbitMQ.

 I think we can


1. Create a RabbitMQ listener for the model-changed event from
inside the operator, then update the model when an event arrives.

But I think this will create a race condition if not handled
correctly, and it seems odd to keep this.

2. We can move the model into an external in-memory cache
storage and keep the model up to date using Flink. So the
operator
  

Re: Gelly ran out of memory

2015-08-20 Thread Henry Saputra
Hi Stephan, this looks like a bug to me. Shouldn't the memory manager
switch to memory outside the managed area if it runs out of space?

- Henry

On Thu, Aug 20, 2015 at 3:09 AM, Stephan Ewen se...@apache.org wrote:
 Actually, you ran out of Flink Managed Memory, not user memory. User
 memory shortage manifests itself as a Java OutOfMemoryError.

 At this point, the Delta iterations cannot spill. They additionally suffer a
 bit from memory fragmentation.
 A possible workaround is to use the option setSolutionSetUnmanaged(true)
 on the iteration. That will eliminate the fragmentation issue, at least.
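 For a Gelly vertex-centric iteration that would look roughly like this (method names from memory, please double-check them against your version; the update and messaging functions are placeholders):
 
 // Sketch: keep the solution set of the iteration outside Flink's managed
 // memory to avoid the compaction/fragmentation problem.
 VertexCentricConfiguration parameters = new VertexCentricConfiguration();
 parameters.setSolutionSetUnmanagedMemory(true);
 
 Graph<Long, Double, Double> result = graph.runVertexCentricIteration(
         new MyVertexUpdateFunction(),   // placeholder vertex update function
         new MyMessagingFunction(),      // placeholder messaging function
         maxIterations,
         parameters);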

 Stephan


 On Thu, Aug 20, 2015 at 12:06 PM, Andra Lungu lungu.an...@gmail.com wrote:

 Hi Flavio,

 These kinds of exceptions generally arise from the fact that you ran out
 of `user` memory. You can try to increase that a bit.
 In your flink-conf.yaml try adding
 # The memory fraction allocated to the system (the rest goes to user code)
 taskmanager.memory.fraction: 0.4

 This will give 0.6 of the memory to the user and 0.4 to the
 system.

 Tell me if that helped.
 Andra

 On Thu, Aug 20, 2015 at 12:02 PM, Flavio Pompermaier
 pomperma...@okkam.it wrote:

 Hi to all,

 I tried to run my gelly job on Flink 0.9-SNAPSHOT and I was having an
 EOFException, so I tried on 0.10-SNAPSHOT and now I have the following
 error:

 Caused by: java.lang.RuntimeException: Memory ran out. Compaction failed.
 numPartitions: 32 minPartition: 73 maxPartition: 80 number of overflow
 segments: 0 bucketSize: 570 Overall memory: 102367232 Partition memory:
 81100800 Message: null
 at
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:465)
 at
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
 at
 org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
 at
 org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:211)
 at
 org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:272)
 at
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:354)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581)
 at java.lang.Thread.run(Thread.java:745)

 Probably I'm doing something wrong but I can't understand how to estimate
 the required memory for my Gelly job..

 Best,
 Flavio