RE: Kyro Intermittent Exception for Large Data

2016-02-18 Thread Ken Krugler
I've seen this type of error when using Kryo with a Cascading scheme I'd created. In my case it happened when serializing a large object graph, where some of the classes didn't have no-arg constructors. The general fix was to set an instantiator strategy for Kryo - see:

Re: Flink HA

2016-02-18 Thread Ufuk Celebi
On Thu, Feb 18, 2016 at 6:59 PM, Thomas Lamirault wrote:
> We are trying flink in HA mode.
Great to hear!
> We set in the flink yaml :
>
> state.backend: filesystem
>
> recovery.mode: zookeeper
> recovery.zookeeper.quorum:
>
> recovery.zookeeper.path.root:
>
>

Flink HA

2016-02-18 Thread Thomas Lamirault
Hi! We are trying Flink in HA mode. Our application is a streaming application with a windowing mechanism. We set the following in the Flink YAML:
state.backend: filesystem
recovery.mode: zookeeper
recovery.zookeeper.quorum:
recovery.zookeeper.path.root:
recovery.zookeeper.storageDir:

Re: Finding the average temperature

2016-02-18 Thread Nirmalya Sengupta
Hello Aljoscha, you mentioned: '.. Yes, this is right if your temperatures don’t have any other field on which you could partition them. '. What I am failing to understand is that if temperatures are partitioned on some other field (in my use-case, I have one such: the

Re: streaming hdfs sub folders

2016-02-18 Thread Martin Neumann
I tried to implement your idea, but I'm getting NullPointerExceptions from the AvroInputFormat. Any idea what I'm doing wrong? See the code below:
public static void main(String[] args) throws Exception {
    // set up the execution environment
    final StreamExecutionEnvironment env =

Re: streaming hdfs sub folders

2016-02-18 Thread Martin Neumann
I guess I need to set the parallelism for the FlatMap to 1 to make sure I read one file at a time. The downside I see with this is that I will not be able to read in parallel from HDFS (and the files are huge). I'll give it a try and see how much performance I lose. cheers Martin On Thu, Feb 18,

Re: Read every file in a directory at once

2016-02-18 Thread Flavio Pompermaier
My current solution is:
List paths = new ArrayList();
File dir = new File(BASE_DIR);
for (File f : dir.listFiles()) {
    paths.add(f.getName());
}
DataSet mail = env.fromCollection(paths).map(new FileToString(BASE_DIR)).
The FileToString does basically a map that return
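The client-side part of the approach described above (list the file names in a directory, then turn each name into the file's content) can be sketched in plain Java without Flink. This is only an illustration under assumptions: the class name DirectoryFileReader and both method names are hypothetical, the files are assumed to be UTF-8 text on the local filesystem, and the actual FileToString from the message is not shown in the archive, so fileToString here is a guess at its role.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of Flink.
final class DirectoryFileReader {

    // Collect the names of the regular files directly inside baseDir (no recursion).
    static List<String> listFileNames(Path baseDir) throws IOException {
        List<String> names = new ArrayList<>();
        try (Stream<Path> entries = Files.list(baseDir)) {
            entries.filter(Files::isRegularFile)
                   .forEach(p -> names.add(p.getFileName().toString()));
        }
        Collections.sort(names); // deterministic order, independent of the filesystem
        return names;
    }

    // Assumed role of the map step: resolve a name against baseDir and read the whole file.
    static String fileToString(Path baseDir, String name) throws IOException {
        return new String(Files.readAllBytes(baseDir.resolve(name)), StandardCharsets.UTF_8);
    }
}
```

In a Flink job, the list returned by listFileNames would be what is handed to env.fromCollection, and the body of fileToString would live inside the map function. Reading a whole file into one String only works while each file fits in memory.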

Re: How to iterate over DataSet elements without converting it to List

2016-02-18 Thread Judit Fehér
Hi, if you want to iterate through a DataSet you can simply use the map function on the DataSet instead of for loops. In your example you have nested loops; instead, you can join the two DataSets and then apply the map function. It looks like you may want to implement a k-means

Read every file in a directory at once

2016-02-18 Thread Flavio Pompermaier
Hi to all, I want to apply a map function to every file in a folder. Is there an easy way (or an already existing InputFormat) to do that? Best, Flavio

Re: Changing parallelism

2016-02-18 Thread Zach Cox
Hi Ufuk - thanks for the 2016 roadmap - glad to see changing parallelism is the first bullet :) Mesos support also sounds great, we're currently running job and task managers on Mesos statically via Marathon. Hi Stephan - thanks, that trick sounds pretty clever, I will try wrapping my head

Re: Availability for the ElasticSearch 2 streaming connector

2016-02-18 Thread Zach Cox
Awesome, thanks Suneel. :D I made the changes to support our use case, which needed flatMap behavior (index 2 docs, or zero docs, per incoming element) instead of map, and we also need to make either IndexRequest or UpdateRequest depending on the element. -Zach On Thu, Feb 18, 2016 at 2:06 AM

Re: streaming hdfs sub folders

2016-02-18 Thread Stephan Ewen
Martin, I think you can approximate this in an easy way like this:
- On the client, you traverse your directories to collect all files that you need, collect all file paths in a list.
- Then you have a source "env.fromElements(paths)".
- Then you flatMap and in the FlatMap, run the Avro
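The first step of this suggestion, traversing the directories on the client and collecting every file path into a list, could look roughly like the sketch below. It is an assumption-laden illustration: it walks the local filesystem with java.nio rather than HDFS (an HDFS version would need the Hadoop FileSystem API instead), and the class name FilePathCollector and method name collectFiles are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Flink; local filesystem only.
final class FilePathCollector {

    // Recursively visit root and all sub folders, keeping only regular files.
    static List<String> collectFiles(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .map(Path::toString)
                       .sorted() // stable order so repeated runs see the same list
                       .collect(Collectors.toList());
        }
    }
}
```

The resulting list is what would be passed to the source in the second step; the directories themselves are filtered out so that the downstream FlatMap only ever receives readable files.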

Re: Changing parallelism

2016-02-18 Thread Stephan Ewen
Hi Zach! Yes, changing parallelism is pretty high up the priority list. The good news is that "scaling in" is the simpler part of changing the parallelism and we are pushing to get that in soon. Until then, there is only a pretty ugly trick that you can do right now to "rescale" the state:

Re: Very old dependencies and solutions

2016-02-18 Thread Stephan Ewen
Hi! A lot of those dependencies are pulled in by Hadoop (for example the configuration / HTTP components). In 1.0-SNAPSHOT, the HTTP components dependency has been shaded away in Hadoop, so it should not bother you any more. One solution you can always do is to "shade" your dependencies in your

Re: Kafka partition alignment for event time

2016-02-18 Thread Stephan Ewen
You are right, the checkpoints should contain all offsets. I created a Ticket for this: https://issues.apache.org/jira/browse/FLINK-3440
On Thu, Feb 18, 2016 at 10:15 AM, agaoglu wrote:
> Hi,
>
> On a related and a more exaggerated setup, our kafka-producer (flume)

Re: Finding the average temperature

2016-02-18 Thread Stephan Ewen
Combiners in streaming are a bit tricky, from their semantics:
1) Combiners always hold data back, through the preaggregation. That adds latency and also means the values are not in the actual windows immediately, where a trigger may expect them.
2) In batch, a combiner combines as long as there

Re: Problem with Kafka 0.9 Client

2016-02-18 Thread Robert Metzger
Hi Javier, sorry for the late response. In the Error Mapping of Kafka, it says that code 15 means: ConsumerCoordinatorNotAvailableCode. https://github.com/apache/kafka/blob/0.9.0/core/src/main/scala/kafka/common/ErrorMapping.scala How many brokers did you put into the list of bootstrap servers?

Re: Finding the average temperature

2016-02-18 Thread Aljoscha Krettek
They would be awesome, but it’s not yet possible in Flink Streaming, I’m afraid.
> On 18 Feb 2016, at 10:59, Stefano Baghino wrote:
>
> I think combiners are pretty awesome for certain cases to minimize network
> usage (the average use case seems to fit

Re: Finding the average temperature

2016-02-18 Thread Stefano Baghino
I think combiners are pretty awesome for certain cases to minimize network usage (the average use case seems to fit perfectly). Maybe it would be worthwhile adding a detailed description of the approach to the docs? On Thu, Feb 18, 2016 at 10:47 AM, Aljoscha Krettek wrote:

Re: Finding the average temperature

2016-02-18 Thread Aljoscha Krettek
@Nirmalya: Yes, this is right if your temperatures don’t have any other field on which you could partition them. @Stefano: Under some circumstances it would be possible to use a combiner (I’m using the name here as Hadoop MapReduce would use it). When the assignment of elements to windows
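The combiner idea discussed in this thread, in the Hadoop MapReduce sense, can be illustrated for the average-temperature case with a small plain-Java sketch. It does not use any Flink API, and the names AverageCombiner, Partial, preAggregate and average are hypothetical. The point it shows is that each parallel instance forwards a (sum, count) pair instead of raw readings, and the final step merges those pairs.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of combiner-style pre-aggregation; no Flink API involved.
final class AverageCombiner {

    // Partial aggregate: sum and count together are enough to merge exactly.
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }
    }

    // "Combine" step: runs locally on one partition's readings before anything is shipped.
    static Partial preAggregate(double... temperatures) {
        double sum = 0.0;
        for (double t : temperatures) {
            sum += t;
        }
        return new Partial(sum, temperatures.length);
    }

    // Final step: merge the partials from all partitions, then divide once.
    static double average(List<Partial> partials) {
        Partial total = new Partial(0.0, 0L);
        for (Partial p : partials) {
            total = total.merge(p);
        }
        if (total.count == 0) {
            throw new IllegalArgumentException("no temperatures to average");
        }
        return total.sum / total.count;
    }

    public static void main(String[] args) {
        List<Partial> partials = Arrays.asList(preAggregate(10.0, 20.0), preAggregate(30.0));
        System.out.println(average(partials));
    }
}
```

Carrying the count alongside the sum is the design choice that matters: averaging the per-partition averages directly would weight a partition with one reading the same as a partition with many, and give a wrong result whenever partitions differ in size.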

Re: Kafka partition alignment for event time

2016-02-18 Thread agaoglu
Hi, On a related and more exaggerated setup, our kafka-producer (flume) seems to send data to a single partition at a time and switches it every few minutes. So when I run my Flink DataStream program for the first time, it starts on the *largest* offsets and shows something like this: .

Re: Changing parallelism

2016-02-18 Thread Ufuk Celebi
Hey Zach! Sounds like a great use case.
On Wed, Feb 17, 2016 at 3:16 PM, Zach Cox wrote:
> However, the savepoint docs state that the job parallelism cannot be changed
> over time [1]. Does this mean we need to use the same, fixed parallelism=n
> during reprocessing and going

Re: Availability for the ElasticSearch 2 streaming connector

2016-02-18 Thread Suneel Marthi
Thanks Zach, I have a few minor changes locally too; I'll push a PR out tomorrow that includes your changes as well.
On Wed, Feb 17, 2016 at 5:13 PM, Zach Cox wrote:
> I recently did exactly what Robert described: I copied the code from this
> (closed) PR