Are you talking about a streaming or a batch job?

You mention a "text stream" but also say you want to stream 100TB
-- which suggests you have a finite data set, i.e., the DataSet API.

-Matthias

On 05/22/2016 09:50 PM, Xtra Coder wrote:
> Hello, 
> 
> Question from newbie about how Flink's WordCount will actually work at
> scale. 
> 
> I've read and watched quite a few high-level presentations but have
> not found clear answers to the following …
> 
> Use-case: 
> --------------
> there is a huge text stream with a highly variable set of words – let's
> say 1 billion unique words. Storing them just as raw text, without
> supplementary data, will take roughly 16TB of RAM. How does Flink
> approach this internally? 
> 
> Here I'm more interested in following:
> 1.  How are individual words spread across a cluster of Flink nodes? 
> Will each word land on exactly one node and be counted there, or
> ... I'm not sure about the variants.
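In a keyed word count, each word is hash-partitioned on its key, so every
occurrence of the same word is routed to the same parallel instance and
counted there. A minimal plain-Python sketch of that routing idea (the
instance count and hash function here are illustrative assumptions, not
Flink's actual internals):

```python
import zlib
from collections import Counter

PARALLELISM = 4  # hypothetical number of parallel counter instances

def partition(word):
    # A deterministic hash routes every occurrence of a word to the
    # same instance -- the keyBy-style idea, not Flink's real hash.
    return zlib.crc32(word.encode("utf-8")) % PARALLELISM

def word_count(stream):
    # One local counter per instance; a given word only ever updates
    # the counter on its own partition.
    counters = [Counter() for _ in range(PARALLELISM)]
    for word in stream:
        counters[partition(word)][word] += 1
    return counters

counters = word_count("to be or not to be".split())
owner = partition("to")
# only counters[owner] holds the count for "to"
```

The consequence for the question above: yes, under this scheme a given
word's count lives on exactly one node.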
> 
> 2.  As far as I understand, while the job is running, all of its
> intermediate aggregation results are stored in memory across the
> cluster nodes (and may be partially spilled to local disk). 
> Wild guess – what size of cluster would be required to run the
> above-mentioned task efficiently?
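For a rough feel only, here is a back-of-envelope calculation using the
question's own 16TB figure; the usable RAM per node is a hypothetical
number chosen for illustration, not something from this thread:

```python
# Back-of-envelope: nodes needed to hold the keyed word state in RAM.
state_gb = 16 * 1024          # ~16TB of word state, per the question
usable_ram_gb_per_node = 64   # hypothetical usable memory per node
nodes = -(-state_gb // usable_ram_gb_per_node)  # ceiling division
print(nodes)  # → 256
```

With more (or less) usable memory per node the node count scales down
(or up) proportionally; spilling to disk relaxes this further at the
cost of throughput.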
> 
> And two functional questions on top of this ...
> 
> 1. Since intermediate results are held in memory, I guess it should
> be possible to get the “current” counter for any word being
> processed. 
> Is this possible?
> 
> 2. After I've streamed 100TB of text, what is the right way to save
> the result to HDFS? For example, I want to save the list of words
> ordered by key, in portions of 10 million words per file, compressed
> with bzip2. 
> Which APIs should I use? 
> Since Flink uses intermediate snapshots for fault tolerance, is it
> possible to save the whole "current" state without stopping the stream?
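Independent of which sink API you end up using, the desired file layout
itself is simple to describe: sort by key, split into fixed-size chunks,
and compress each chunk. A plain-Python sketch of just that layout (the
chunk size is scaled down here; this says nothing about Flink's HDFS
output formats):

```python
import bz2
import os
import tempfile

def write_sorted_chunks(counts, out_dir, words_per_file):
    # Write "word\tcount" lines ordered by key, split into files of
    # at most words_per_file entries, each bzip2-compressed.
    items = sorted(counts.items())
    paths = []
    for start in range(0, len(items), words_per_file):
        chunk = items[start:start + words_per_file]
        path = os.path.join(out_dir,
                            "part-%05d.bz2" % (start // words_per_file))
        with bz2.open(path, "wt", encoding="utf-8") as f:
            for word, count in chunk:
                f.write("%s\t%d\n" % (word, count))
        paths.append(path)
    return paths

out_dir = tempfile.mkdtemp()
paths = write_sorted_chunks({"be": 2, "to": 2, "or": 1, "not": 1},
                            out_dir, 2)
# two files: the first holds "be" and "not", the second "or" and "to"
```

At the 10-million-words-per-file scale from the question, the same
chunking would simply use words_per_file=10_000_000.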
> 
> Thanks.
