Re: Flink's WordCount at scale of 1BLN of unique words

2016-05-31 Thread Xtra Coder
Thanks, things are clear so far.


Re: Flink's WordCount at scale of 1BLN of unique words

2016-05-23 Thread Matthias J. Sax
Are you talking about a streaming or a batch job?

You mention a "text stream" but also say you want to stream 100TB --
indicating that you have a finite data set, which would be processed with
the DataSet API.
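
For the finite case, the batch version of the job would look roughly like
this (a minimal sketch; the HDFS input path is a placeholder):

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // finite input -> DataSet API
    DataSet<String> text = env.readTextFile("hdfs:///input/corpus");

    DataSet<Tuple2<String, Integer>> counts = text
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                // emit one (word, 1) pair per token
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        out.collect(new Tuple2<>(word, 1));
                    }
                }
            }
        })
        .groupBy(0)  // group by the word
        .sum(1);     // count occurrences per word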

-Matthias


Flink's WordCount at scale of 1BLN of unique words

2016-05-22 Thread Xtra Coder
Hello,

A newbie question about how Flink's WordCount will actually work at
scale.

I've read and watched quite a few high-level presentations but have not
found clear answers to the following:

Use-case:
--
There is a huge text stream with a highly variable set of words -- let's
say 1BLN of unique words. Storing them as raw text alone, without
supplementary data, would take roughly 16TB of RAM. How does Flink
approach this internally?
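
For keyed state that outgrows main memory, one option in Flink 1.x is the
RocksDB state backend, which keeps the working state on each TaskManager's
local disk with only hot data in memory, while snapshots go to a durable
store such as HDFS. A minimal configuration sketch (the checkpoint URI is
a placeholder):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // keyed state lives in embedded RocksDB instances on local disk,
    // so it can grow far beyond the JVM heap; snapshots go to HDFS
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

With this backend the per-key counters do not all have to fit in RAM, only
the hot part of the working set does.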

Here is what I'm most interested in:
1. How are individual words spread across a cluster of Flink nodes?
Will each word land on exactly one node and be counted there, or ... I'm
not sure about the variants. (See the sketch after question 2.)

2. As far as I understand, while a job is running all of its intermediate
aggregation results are stored in memory across the cluster nodes (and may
be partially spilled to local disk).
Wild guess: what size of cluster is required to run the above task
efficiently?
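
On question 1, in the DataStream API keyBy() hash-partitions records by
key, so every occurrence of a given word is routed to the same parallel
task instance and counted there; each instance holds state only for its
share of the key space. A minimal sketch, assuming a stream 'words' of
(word, 1) pairs tokenized as in the batch sketch in the reply above:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;

    // 'words' is assumed: a DataStream<Tuple2<String, Integer>> of (word, 1) pairs
    DataStream<Tuple2<String, Integer>> counts = words
        // hash-partition on the word: all occurrences of the same word
        // land on the same parallel instance, which keeps its counter as state
        .keyBy(0)
        // maintain a running count per word as keyed state
        .sum(1);

So with parallelism p, the 1BLN distinct words are split into p disjoint
partitions, one per parallel instance.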

And two functional questions on top of this ...

1. Since intermediate results are in memory, I guess it should be possible
to get the "current" counter for any word being processed.
Is this possible?
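
At the time of this thread there is no public API for point lookups into
running state, but Flink 1.2 later added "queryable state", which exposes
keyed state to an external client. A minimal job-side sketch, reusing the
'counts' stream from the sketch above (the state name is arbitrary):

    // expose the latest count per word for external point queries
    counts
        .keyBy(0)                          // key by the word again
        .asQueryableState("word-counts");  // name used by the client for lookups

An external process can then fetch a single word's current counter via
QueryableStateClient without stopping or rescanning the job.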

2. After I've streamed 100TB of text, what will be the right way to save
the result to HDFS? For example, I want to save the list of words ordered
by key, in portions of 10mln per file, compressed with bzip2.
Which APIs should I use?
Since Flink uses intermediate snapshots for fault-tolerance, is it possible
to save the whole "current" state without stopping the stream?
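
For the finite 100TB case, one way to get sorted, bzip2-compressed output
is the DataSet API combined with Hadoop's TextOutputFormat, which already
ships a BZip2 codec. A sketch under those assumptions ('wordCounts' is
assumed to be a DataSet<Tuple2<Text, LongWritable>>, i.e. counts already
converted to Hadoop writables; paths are placeholders, and the
10mln-records-per-file split would in practice be approximated via
parallelism and range partitioning rather than an exact row count):

    import org.apache.flink.api.common.operators.Order;
    import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    Job job = Job.getInstance();
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///results/wordcount"));
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

    wordCounts
        // range-partition + sort so the output files are globally ordered by word
        .partitionByRange(0)
        .sortPartition(0, Order.ASCENDING)
        // one compressed file per parallel task instance
        .output(new HadoopOutputFormat<Text, LongWritable>(
            new TextOutputFormat<Text, LongWritable>(), job));

    // 'env' is the surrounding ExecutionEnvironment; output() is lazy
    env.execute("write sorted bzip2 word counts");

On the snapshot question: a savepoint (bin/flink savepoint <jobID>)
captures a consistent copy of all operator state without stopping the job,
but it is written in Flink's internal format, not as a readable word list;
producing the final HDFS dataset is more naturally a batch job over the
finite input, as Matthias suggests above.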

Thanks.