Re: How to force the parallelism on small streams?

2015-09-03 Thread Fabian Hueske
In case of rebalance(), all sources start the round-robin partitioning at index 0. Since each source emits only very few elements, only the first 15 mappers receive any input. It would be better to let each source start the round-robin partitioning at a different index, something like startIdx =
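
Fabian's proposed fix can be sketched in plain Java (class and method names here are hypothetical, not Flink's actual channel selector code): each source subtask seeds its round-robin counter with its own subtask index instead of 0, so even a handful of elements spreads across different receivers.

```java
// Hypothetical sketch of the idea above: seed the round-robin counter
// with the source's subtask index instead of always starting at 0.
public class OffsetRoundRobin {
    private final int numChannels;
    private int next; // the next channel this source will send to

    public OffsetRoundRobin(int subtaskIndex, int numChannels) {
        this.numChannels = numChannels;
        this.next = subtaskIndex % numChannels; // startIdx differs per source
    }

    public int nextChannel() {
        int channel = next;
        next = (next + 1) % numChannels;
        return channel;
    }
}
```

With two sources emitting two elements each to four receivers, source 0 now hits channels 0 and 1 while source 1 hits channels 1 and 2, instead of both sources piling their few elements onto channels 0 and 1.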

Re: How to force the parallelism on small streams?

2015-09-03 Thread Fabian Hueske
The purpose of rebalance() should be to rebalance the partitions of a data stream as evenly as possible, right? If all senders start sending data to the same receiver and there is less data in each partition than there are receivers, partitions are not evenly rebalanced. That is exactly the problem Arnaud

Re: Multiple restarts of Local Cluster

2015-09-03 Thread Stephan Ewen
Have a look at the classes IOManager and IOManagerAsync; they are a good example of how we use these hooks for cleanup. The constructor usually installs them, and the shutdown logic removes them. On Thu, Sep 3, 2015 at 9:19 PM, Stephan Ewen wrote: > Stopping the JVM process clean
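
The install-in-constructor, remove-on-shutdown pattern can be sketched roughly like this (class and method names are hypothetical, not the actual IOManager code): the hook guarantees temp-file cleanup if the JVM dies, and an orderly shutdown cleans up itself and removes the hook again.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch of the shutdown-hook pattern described above.
public class TempFileCleaner {
    private final File tempFile;
    private final Thread hook;

    public TempFileCleaner() {
        try {
            tempFile = File.createTempFile("spill", ".tmp");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        // Constructor installs the hook: if the JVM exits abruptly,
        // the temp file is still deleted.
        hook = new Thread(tempFile::delete);
        Runtime.getRuntime().addShutdownHook(hook);
    }

    public boolean isCleanedUp() {
        return !tempFile.exists();
    }

    public void shutdown() {
        tempFile.delete();                             // orderly cleanup
        Runtime.getRuntime().removeShutdownHook(hook); // shutdown logic removes the hook
    }
}
```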

Question on flink and hdfs

2015-09-03 Thread Jerry Peng
Hello, Does flink require hdfs to run? I know you can use hdfs to checkpoint and process files in a distributed fashion. So can flink run standalone without hdfs?

Re: Multiple restarts of Local Cluster

2015-09-03 Thread Stephan Ewen
Stopping the JVM process cleans up all resources, except temp files. Everything that creates temp files uses a shutdown hook to remove these: IOManager, BlobManager, LibraryCache, ... On Wed, Sep 2, 2015 at 7:40 PM, Sachin Goel wrote: > I'm not sure what you mean by

Re: when use broadcast variable and run on bigdata display this error please help

2015-09-03 Thread hagersaleh
Hi Chiwan Park, I do not understand this solution. Please explain more.

Re: Question on flink and hdfs

2015-09-03 Thread Stephan Ewen
Hi! Yes, you can run Flink completely without HDFS. Also the checkpointing can put state into any file system, like S3, or a Unix file system (like a NAS or Amazon EBS), or even Tachyon. Greetings, Stephan On Thu, Sep 3, 2015 at 8:57 PM, Jerry Peng wrote: >

Re: Usage of Hadoop 2.2.0

2015-09-03 Thread Robert Metzger
I think most cloud providers moved beyond Hadoop 2.2.0. Google's Click-To-Deploy is on 2.4.1 AWS EMR is on 2.6.0 The situation for the distributions seems to be the following: MapR 4 uses Hadoop 2.4.0 (current is MapR 5) CDH 5.0 uses 2.3.0 (the current CDH release is 5.4) HDP 2.0 (October 2013)

Re: what different between join and coGroup in flink

2015-09-03 Thread Fabian Hueske
CoGroup is more generic than Join. You can perform a Join with CoGroup but not do a CoGroup with a Join. However, Join can be executed more efficiently than CoGroup. 2015-09-03 22:28 GMT+02:00 hagersaleh : > what different between join and coGroup in flink > > > > > -- >
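
Fabian's point can be illustrated with a plain-Java sketch (no Flink dependency; all names are hypothetical): a coGroup-style function sees both value groups for a key, even empty ones, so an inner join falls out as the cross product of the two groups; a join alone never surfaces keys that are missing on one side and therefore cannot emulate coGroup.

```java
import java.util.*;

// Hypothetical sketch: building an inner join out of coGroup-style
// per-key group access. A real CoGroupFunction could emit anything
// here (outer joins, counts, anti-joins), which is why CoGroup is
// the more general primitive.
public class CoGroupVsJoin {

    static List<String> innerJoinViaCoGroup(Map<String, List<Integer>> left,
                                            Map<String, List<Integer>> right) {
        List<String> out = new ArrayList<>();
        Set<String> keys = new TreeSet<>(left.keySet()); // sorted for stable output
        keys.addAll(right.keySet());
        for (String k : keys) {
            // coGroup hands over BOTH groups, even if one is empty:
            List<Integer> l = left.getOrDefault(k, Collections.emptyList());
            List<Integer> r = right.getOrDefault(k, Collections.emptyList());
            // An inner join is simply the cross product of the groups;
            // a key missing on one side contributes nothing.
            for (int lv : l)
                for (int rv : r)
                    out.add(k + ":" + lv + "," + rv);
        }
        return out;
    }
}
```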

Re: Usage of Hadoop 2.2.0

2015-09-03 Thread Ufuk Celebi
+1 to what Robert said. On Thursday, September 3, 2015, Robert Metzger wrote: > I think most cloud providers moved beyond Hadoop 2.2.0. > Google's Click-To-Deploy is on 2.4.1 > AWS EMR is on 2.6.0 > > The situation for the distributions seems to be the following: > MapR 4

Re: Bug broadcasting objects (serialization issue)

2015-09-03 Thread Andres R. Masegosa
Hi, I get a new, similar bug when broadcasting a list of integers if this list is made unmodifiable: elements = Collections.unmodifiableList(elements); I include this code to reproduce the result: public class WordCountExample { public static void main(String[] args) throws
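
As a side note, a minimal plain-Java look at what the wrapping changes (this is not the Flink bug itself, just the mechanics): Collections.unmodifiableList returns a private JDK wrapper class rather than the original ArrayList, and serializers that instantiate classes reflectively can fail on such wrapper classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Demonstrates that the unmodifiable wrapper has a different concrete
// class than the list it wraps.
public class UnmodifiableWrapperDemo {
    public static String wrappedClassName() {
        List<Integer> elements = new ArrayList<>(Arrays.asList(1, 2, 3));
        List<Integer> wrapped = Collections.unmodifiableList(elements);
        // The concrete class is a private java.util.Collections$Unmodifiable...
        // wrapper, not ArrayList.
        return wrapped.getClass().getName();
    }
}
```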

Re: Bug broadcasting objects (serialization issue)

2015-09-03 Thread Maximilian Michels
Thanks for clarifying the "eager serialization". By serializing and deserializing explicitly (eagerly) we can raise better Exceptions to notify the user of non-serializable classes. > BTW: There is an opportunity to fix two problems with one patch: The > framesize overflow for the input format,

Re: nosuchmethoderror

2015-09-03 Thread Robert Metzger
I'm sorry that we changed the method name between minor versions. We'll soon bring some infrastructure in place to a) mark the audience of classes and b) ensure that public APIs are stable. On Wed, Sep 2, 2015 at 9:04 PM, Ferenc Turi wrote: > Ok. As I see only the method name

Re: question on flink-storm-examples

2015-09-03 Thread Matthias J. Sax
One more remark that just came to my mind. There is a storm-hdfs module available: https://github.com/apache/storm/tree/master/external/storm-hdfs Maybe you can use it. It would be great if you could give feedback if this works for you. -Matthias On 09/02/2015 10:52 AM, Matthias J. Sax wrote: >

Re: Flink batch runs OK but Yarn container fails in batch mode with -m yarn-cluster

2015-09-03 Thread Robert Metzger
Hi Arnaud, I think that's a bug ;) I'll file a JIRA to fix it for the next release. On Thu, Sep 3, 2015 at 10:26 AM, LINZ, Arnaud wrote: > Hi, > > > > I am wondering why, despite the fact that my java main() methods runs OK > and exit with 0 code value, the Yarn

Re: Travis updates on Github

2015-09-03 Thread Robert Metzger
I've filed a JIRA at INFRA: https://issues.apache.org/jira/browse/INFRA-10239 On Wed, Sep 2, 2015 at 11:18 AM, Robert Metzger wrote: > Hi Sachin, > > I also noticed that the GitHub integration is not working properly. I'll > ask the Apache Infra team. > > On Wed, Sep 2,

Re: max-fan

2015-09-03 Thread Stephan Ewen
Hi Greg! That number should control the merge fan in, yes. Maybe a bug was introduced a while back that prevents this parameter from being properly passed through the system. Have you modified the config value in the cluster, on the client, or are you starting the job via the command line, in

Splitting Streams

2015-09-03 Thread Martin Neumann
Hi, I have a stream of JSON objects of several different types. I want to split this stream into several streams, each of them dealing with one type (so it's not partitioning). The only way I found so far is writing a bunch of filters and connecting them to the source directly. This way I will have

Re: Splitting Streams

2015-09-03 Thread Aljoscha Krettek
Hi Martin, maybe this is what you are looking for: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#output-splitting Regards, Aljoscha On Thu, 3 Sep 2015 at 12:02 Till Rohrmann wrote: > Hi Martin, > > could grouping be a solution to your
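
The output-splitting mechanism from the linked docs can be sketched in plain Java (names hypothetical, no Flink dependency): a selector function tags each record with one or more output names in a single pass over the stream, and consumers then pick the named outputs they want, instead of running one filter per type against the source.

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical sketch of the split()/select() idea: one pass over the
// stream, with a selector assigning each record to named outputs.
public class OutputSplitting {

    static Map<String, List<String>> split(List<String> stream,
                                           Function<String, List<String>> selector) {
        Map<String, List<String>> outputs = new HashMap<>();
        for (String record : stream) {
            // A record may be routed to several named outputs at once.
            for (String name : selector.apply(record)) {
                outputs.computeIfAbsent(name, n -> new ArrayList<>()).add(record);
            }
        }
        return outputs;
    }
}
```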

Re: Duplicates in Flink

2015-09-03 Thread Rico Bergmann
Hi! Testing it with the current 0.10 snapshot is not easily possible atm. But I deactivated checkpointing in my program and still get duplicates in my output. So it seems not to come only from the checkpointing feature, right? Maybe the KafkaSink is responsible for this? (Just my guess) Cheers

Re: Duplicates in Flink

2015-09-03 Thread Stephan Ewen
Do you mean the KafkaSource? Which KafkaSource are you using? The 0.9.1 FlinkKafkaConsumer082 or the KafkaSource? On Thu, Sep 3, 2015 at 1:10 PM, Rico Bergmann wrote: > Hi! > > Testing it with the current 0.10 snapshot is not easily possible atm > > But I deactivated

Re: Duplicates in Flink

2015-09-03 Thread Stephan Ewen
Can you tell us where the KafkaSink comes into play? At what point do the duplicates come up? On Thu, Sep 3, 2015 at 2:09 PM, Rico Bergmann wrote: > No. I mean the KafkaSink. > > A bit more insight to my program: I read from a Kafka topic with > flinkKafkaConsumer082, then

Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

2015-09-03 Thread Stephan Ewen
In a set of benchmarks a while back, we found that the chaining mechanism has some overhead right now, because of its abstraction. The abstraction creates iterators for each element and makes it hard for the JIT to specialize on the operators in the chain. For purely local chains at full speed,
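
The trade-off Stephan describes can be sketched in plain Java (hypothetical names, nothing Flink-specific): a filter followed by a map runs every element through two separately dispatched user functions, while a hand-fused flatMap-style function does the same work in one body that the JIT can specialize; both produce identical results.

```java
import java.util.*;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: chained filter+map vs. a hand-fused equivalent.
public class FuseFilterMap {

    // Chained: two separate functional-interface calls per element,
    // standing in for the per-operator abstraction of a chain.
    static List<Integer> chained(List<Integer> in, Predicate<Integer> filter,
                                 Function<Integer, Integer> map) {
        List<Integer> out = new ArrayList<>();
        for (Integer x : in) {
            if (filter.test(x)) {
                out.add(map.apply(x));
            }
        }
        return out;
    }

    // Fused: filter and map inlined into one tight loop (here: keep
    // even numbers, multiply by 10), the flatMap-style alternative.
    static List<Integer> fused(List<Integer> in) {
        List<Integer> out = new ArrayList<>();
        for (int x : in) {
            if (x % 2 == 0) {
                out.add(x * 10);
            }
        }
        return out;
    }
}
```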

Re: Hardware requirements and learning resources

2015-09-03 Thread Stefan Winterstein
> Answering to myself, I have found some nice training material at > http://dataartisans.github.io/flink-training. Excellent resources! Somehow, I managed not to stumble over them by myself - either I was blind, or they are well hidden... :) Best, -Stefan

Re: Duplicates in Flink

2015-09-03 Thread Rico Bergmann
No. I mean the KafkaSink. A bit more insight to my program: I read from a Kafka topic with flinkKafkaConsumer082, then hashpartition the data, then I do a deduplication (does not eliminate all duplicates though). Then some computation, afterwards again deduplication (group by message in a

Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

2015-09-03 Thread Welly Tambunan
Hi Stephan, That's good information to know. We will hit that throughput easily. Our computation graph has a lot of chaining like this right now. I think it's safe to minimize the chaining right now. Thanks a lot for this, Stephan. Cheers On Thu, Sep 3, 2015 at 7:20 PM, Stephan Ewen

Re: Usage of Hadoop 2.2.0

2015-09-03 Thread Chiwan Park
+1 for dropping Hadoop 2.2.0 Regards, Chiwan Park > On Sep 4, 2015, at 5:58 AM, Ufuk Celebi wrote: > > +1 to what Robert said. > > On Thursday, September 3, 2015, Robert Metzger wrote: > I think most cloud providers moved beyond Hadoop 2.2.0. > Google's

Re: How to force the parallelism on small streams?

2015-09-03 Thread Fabian Hueske
Btw, it is working with a parallelism-1 source because only a single source partitions the data (round-robin or random), so there are no multiple sources assigning work to the same few mappers. 2015-09-03 15:22 GMT+02:00 Matthias J. Sax : > If it would be only 14 elements, you are

Re: Duplicates in Flink

2015-09-03 Thread Rico Bergmann
The KafkaSink is the last step in my program after the 2nd deduplication. I could not yet track down where duplicates show up. That's a bit difficult to find out... But I'm trying to find it... > Am 03.09.2015 um 14:14 schrieb Stephan Ewen : > > Can you tell us where the