In case of rebalance(), all sources start the round-robin partitioning at
index 0. Since each source emits only very few elements, only the first 15
mappers receive any input.
It would be better to let each source start the round-robin partitioning at
a different index, something like startIdx =
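Purely as an illustration of that idea (the class name and the channel-selection logic below are assumptions, not the actual rebalance() implementation):

// Hypothetical sketch: each source starts its round-robin at an offset derived
// from its own subtask index, so parallel sources do not all hit channel 0 first.
public class OffsetRoundRobinSelector {

    private int nextChannel = -1;

    /**
     * @param subtaskIndex index of the sending source instance
     * @param numChannels  number of receiving channels (mappers)
     * @return the channel to send the next record to
     */
    public int selectChannel(int subtaskIndex, int numChannels) {
        if (nextChannel < 0) {
            nextChannel = subtaskIndex % numChannels;  // per-source startIdx
        } else {
            nextChannel = (nextChannel + 1) % numChannels;
        }
        return nextChannel;
    }
}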
The purpose of rebalance() should be to rebalance the partitions of a data
stream as evenly as possible, right?
If all senders start sending data to the same receiver and each partition
contains fewer elements than there are receivers, the partitions are not rebalanced evenly.
That is exactly the problem Arnaud
Have a look at the class IOManager and IOManagerAsync, it is a good example
of how we use these hooks for cleanup.
The constructor usually installs them, and the shutdown logic removes them.
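For readers unfamiliar with the pattern, a generic sketch of it; the class and method names here are illustrative, not the actual IOManager code:

import java.io.File;

// Illustrative only: install a shutdown hook in the constructor, remove it on clean shutdown.
public class TempFileOwner {

    private final File tempDir =
            new File(System.getProperty("java.io.tmpdir"), "my-component-temp");
    private final Thread shutdownHook = new Thread(this::deleteTempFiles);

    public TempFileOwner() {
        tempDir.mkdirs();
        // The hook covers the case where the JVM exits without a clean shutdown.
        Runtime.getRuntime().addShutdownHook(shutdownHook);
    }

    public void shutdown() {
        deleteTempFiles();
        // Clean shutdown: remove the hook so the cleanup does not run a second time.
        Runtime.getRuntime().removeShutdownHook(shutdownHook);
    }

    private void deleteTempFiles() {
        File[] files = tempDir.listFiles();
        if (files != null) {
            for (File f : files) {
                f.delete();
            }
        }
        tempDir.delete();
    }
}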
On Thu, Sep 3, 2015 at 9:19 PM, Stephan Ewen wrote:
> Stopping the JVM process clean
Hello,
Does Flink require HDFS to run? I know you can use HDFS for checkpointing and
for processing files in a distributed fashion. So can Flink run standalone
without HDFS?
Stopping the JVM process cleans up all resources except temp files.
Everything that creates temp files uses a shutdown hook to remove these:
IOManager, BlobManager, LibraryCache, ...
On Wed, Sep 2, 2015 at 7:40 PM, Sachin Goel
wrote:
> I'm not sure what you mean by
Hi Chiwan Park,
I do not understand this solution. Please explain it in more detail.
Hi!
Yes, you can run Flink completely without HDFS.
Also the checkpointing can put state into any file system, like S3, or a
Unix file system (like a NAS or Amazon EBS), or even Tachyon.
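As an illustration (API names as in later Flink releases; the bucket and path are placeholders), pointing the checkpoint state backend at a non-HDFS file system looks roughly like this:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointWithoutHdfs {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000); // checkpoint every 5 seconds

        // Any supported file system URI works: s3://..., file:///mnt/nas/..., HDFS is not required.
        env.setStateBackend(new FsStateBackend("s3://my-bucket/flink-checkpoints"));

        env.fromElements(1, 2, 3).print();
        env.execute("checkpointing-without-hdfs");
    }
}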
Greetings,
Stephan
On Thu, Sep 3, 2015 at 8:57 PM, Jerry Peng
wrote:
>
I think most cloud providers moved beyond Hadoop 2.2.0.
Google's Click-To-Deploy is on 2.4.1
AWS EMR is on 2.6.0
The situation for the distributions seems to be the following:
MapR 4 uses Hadoop 2.4.0 (current is MapR 5)
CDH 5.0 uses 2.3.0 (the current CDH release is 5.4)
HDP 2.0 (October 2013)
CoGroup is more generic than Join. You can express a Join with CoGroup, but
you cannot express a CoGroup with a Join.
However, Join can be executed more efficiently than CoGroup.
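For illustration, a small sketch of an inner join expressed with coGroup; the data and names are made up, while the where/equalTo/with calls are the standard DataSet API pattern:

import java.util.List;

import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class JoinViaCoGroup {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> users = env.fromElements(
                new Tuple2<>(1, "alice"), new Tuple2<>(2, "bob"));
        DataSet<Tuple2<Integer, String>> orders = env.fromElements(
                new Tuple2<>(1, "book"), new Tuple2<>(1, "pen"), new Tuple2<>(3, "lamp"));

        // An inner join via coGroup: emit the cross product of the two groups that
        // share a key; keys that appear on only one side produce no output.
        List<String> joined = users.coGroup(orders)
                .where(0).equalTo(0)
                .with(new CoGroupFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, String>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<Integer, String>> left,
                                        Iterable<Tuple2<Integer, String>> right,
                                        Collector<String> out) {
                        for (Tuple2<Integer, String> l : left) {
                            for (Tuple2<Integer, String> r : right) {
                                out.collect(l.f1 + " ordered " + r.f1);
                            }
                        }
                    }
                })
                .collect();

        System.out.println(joined);
    }
}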
2015-09-03 22:28 GMT+02:00 hagersaleh :
> What is the difference between Join and CoGroup in Flink?
+1 to what Robert said.
On Thursday, September 3, 2015, Robert Metzger wrote:
> I think most cloud providers moved beyond Hadoop 2.2.0.
> Google's Click-To-Deploy is on 2.4.1
> AWS EMR is on 2.6.0
>
> The situation for the distributions seems to be the following:
> MapR 4
Hi,
I get a similar new bug when broadcasting a list of integers if the
list is made unmodifiable:
elements = Collections.unmodifiableList(elements);
I include this code to reproduce the problem:
public class WordCountExample {
public static void main(String[] args) throws
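For context, a hypothetical minimal example of the pattern being described (broadcasting a list that has been wrapped as unmodifiable); the class name, data, and broadcast name below are made up, not the poster's original code:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class UnmodifiableBroadcastExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        List<Integer> elements = Arrays.asList(1, 2, 3);
        // The reported trigger: the collection is wrapped as unmodifiable before broadcasting.
        elements = Collections.unmodifiableList(elements);

        DataSet<Integer> broadcastSet = env.fromCollection(elements);

        List<String> result = env.fromElements("to be", "or not to be")
                .map(new RichMapFunction<String, String>() {
                    @Override
                    public String map(String line) throws Exception {
                        List<Integer> bc = getRuntimeContext().getBroadcastVariable("elements");
                        return line + " (broadcast size: " + bc.size() + ")";
                    }
                })
                .withBroadcastSet(broadcastSet, "elements")
                .collect();

        System.out.println(result);
    }
}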
Thanks for clarifying the "eager serialization". By serializing and
deserializing explicitly (eagerly) we can raise better Exceptions to
notify the user of non-serializable classes.
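As a generic illustration of that idea (this is not Flink's actual utility, just the plain java.io mechanism):

import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

public class SerializationCheck {

    /** Serializes the object up front so that a clear error names the offending class. */
    public static byte[] serializeEagerly(Object userFunction) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(userFunction);
            out.flush();
            return bytes.toByteArray();
        } catch (NotSerializableException e) {
            throw new IllegalArgumentException(
                    "Object of class " + userFunction.getClass().getName()
                            + " is not serializable: " + e.getMessage(), e);
        } catch (Exception e) {
            throw new RuntimeException("Eager serialization failed", e);
        }
    }
}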
> BTW: There is an opportunity to fix two problems with one patch: The
> framesize overflow for the input format,
I'm sorry that we changed the method name between minor versions.
We'll soon put some infrastructure in place to a) mark the audience of
classes and b) ensure that public APIs are stable.
On Wed, Sep 2, 2015 at 9:04 PM, Ferenc Turi wrote:
> Ok. As I see only the method name
One more remark that just came to my mind. There is a storm-hdfs module
available: https://github.com/apache/storm/tree/master/external/storm-hdfs
Maybe you can use it. It would be great if you could give feedback on
whether it works for you.
-Matthias
On 09/02/2015 10:52 AM, Matthias J. Sax wrote:
>
Hi Arnaud,
I think that's a bug ;)
I'll file a JIRA to fix it for the next release.
On Thu, Sep 3, 2015 at 10:26 AM, LINZ, Arnaud
wrote:
> Hi,
>
>
>
> I am wondering why, despite the fact that my Java main() method runs OK
> and exits with code 0, the Yarn
I've filed a JIRA at INFRA:
https://issues.apache.org/jira/browse/INFRA-10239
On Wed, Sep 2, 2015 at 11:18 AM, Robert Metzger wrote:
> Hi Sachin,
>
> I also noticed that the GitHub integration is not working properly. I'll
> ask the Apache Infra team.
>
> On Wed, Sep 2,
Hi Greg!
That number should control the merge fan in, yes. Maybe a bug was
introduced a while back that prevents this parameter from being properly
passed through the system. Have you modified the config value in the
cluster, on the client, or are you starting the job via the command line,
in
Hej,
I have a stream of JSON objects of several different types. I want to split
this stream into several streams, each of them dealing with one type (so
it's not partitioning).
The only way I have found so far is writing a bunch of filters and connecting them
to the source directly. This way I will have
Hi Martin,
maybe this is what you are looking for:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#output-splitting
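A small sketch of the output-splitting approach from that page (class and package names as in later Flink releases; routing by a "type" prefix is just an assumption about the data):

import java.util.Collections;

import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SplitStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SplitByTypeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Toy input standing in for the JSON stream; each record carries its type as a prefix.
        DataStream<String> events = env.fromElements("click:1", "order:2", "click:3");

        SplitStream<String> split = events.split(new OutputSelector<String>() {
            @Override
            public Iterable<String> select(String value) {
                // Route every record to the named output matching its type.
                return Collections.singletonList(value.split(":")[0]);
            }
        });

        DataStream<String> clicks = split.select("click");
        DataStream<String> orders = split.select("order");

        clicks.print();
        orders.print();
        env.execute("split-by-type");
    }
}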
Regards,
Aljoscha
On Thu, 3 Sep 2015 at 12:02 Till Rohrmann wrote:
> Hi Martin,
>
> could grouping be a solution to your
Hi!
Testing it with the current 0.10 snapshot is not easily possible atm
But I deactivated checkpointing in my program and still get duplicates in my
output. So it does not seem to come only from the checkpointing feature, does it?
Maybe the KafkaSink is responsible for this? (Just my guess.)
Cheers
Do you mean the KafkaSource?
Which KafkaSource are you using? The 0.9.1 FlinkKafkaConsumer082 or the
KafkaSource?
On Thu, Sep 3, 2015 at 1:10 PM, Rico Bergmann wrote:
> Hi!
>
> Testing it with the current 0.10 snapshot is not easily possible atm
>
> But I deactivated
Can you tell us where the KafkaSink comes into play? At what point do the
duplicates come up?
On Thu, Sep 3, 2015 at 2:09 PM, Rico Bergmann wrote:
> No. I mean the KafkaSink.
>
> A bit more insight to my program: I read from a Kafka topic with
> flinkKafkaConsumer082, then
In a set of benchmarks a while back, we found that the chaining mechanism
has some overhead right now, because of its abstraction. The abstraction
creates iterators for each element and makes it hard for the JIT to
specialize on the operators in the chain.
For purely local chains at full speed,
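For reference, chaining can also be controlled per operator or switched off globally; a rough sketch (DataStream API names as in later Flink releases):

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingControlExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // env.disableOperatorChaining();  // global switch: never chain operators

        env.fromElements("a", "b", "c")
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();
               }
           })
           .startNewChain()       // begin a fresh chain at this operator
           .filter(new FilterFunction<String>() {
               @Override
               public boolean filter(String value) {
                   return !value.isEmpty();
               }
           })
           .disableChaining()     // keep this operator out of any chain
           .print();

        env.execute("chaining-control");
    }
}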
> Answering to myself, I have found some nice training material at
> http://dataartisans.github.io/flink-training.
Excellent resources! Somehow I managed not to stumble over them myself;
either I was blind, or they are well hidden... :)
Best,
-Stefan
No. I mean the KafkaSink.
A bit more insight into my program: I read from a Kafka topic with
flinkKafkaConsumer082, then hash-partition the data, then I do a deduplication
(which does not eliminate all duplicates, though). Then some computation, and afterwards
again a deduplication (group by message in a
Hi Stephan,
That's good information to know. We will hit that throughput easily. Our
computation graph has a lot of chaining like this right now.
I think it's safe to minimize the chaining right now.
Thanks a lot for this Stephan.
Cheers
On Thu, Sep 3, 2015 at 7:20 PM, Stephan Ewen
+1 for dropping Hadoop 2.2.0
Regards,
Chiwan Park
> On Sep 4, 2015, at 5:58 AM, Ufuk Celebi wrote:
>
> +1 to what Robert said.
>
> On Thursday, September 3, 2015, Robert Metzger wrote:
> I think most cloud providers moved beyond Hadoop 2.2.0.
> Google's
Btw, it is working with a parallelism-1 source, because then only a single
source partitions (round-robin or randomly) the data, so there are not
several sources all assigning their work to the same few mappers.
2015-09-03 15:22 GMT+02:00 Matthias J. Sax :
> If it would be only 14 elements, you are
The KafkaSink is the last step in my program after the 2nd deduplication.
I could not yet track down where the duplicates show up. That's a bit difficult to
find out... but I'm trying.
> Am 03.09.2015 um 14:14 schrieb Stephan Ewen :
>
> Can you tell us where the