Are you talking about a streaming or a batch job?
You mention a "text stream", but you also say you want to stream 100TB
-- indicating you have a finite data set, which would use the DataSet API.
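Either way, words are hash-partitioned by key, so each unique word lands on (and is counted by) exactly one parallel instance. Here is a minimal plain-Java sketch of that idea -- a simplified model, not Flink's actual key-group mechanics, and `numSubtasks` is just a hypothetical parallelism value:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashPartitionSketch {
    // Simplified stand-in for keyBy(): every occurrence of the same word
    // hashes to the same subtask, so its counter lives on a single node.
    static int subtaskFor(String word, int numSubtasks) {
        return Math.floorMod(word.hashCode(), numSubtasks);
    }

    public static void main(String[] args) {
        int numSubtasks = 4; // hypothetical parallelism
        String[] stream = {"flink", "word", "count", "flink", "word", "flink"};

        // One counter map per subtask, mimicking per-node local state.
        List<Map<String, Integer>> state = new ArrayList<>();
        for (int i = 0; i < numSubtasks; i++) state.add(new HashMap<>());

        for (String w : stream) {
            state.get(subtaskFor(w, numSubtasks)).merge(w, 1, Integer::sum);
        }

        // All three occurrences of "flink" were counted on one subtask.
        System.out.println(state.get(subtaskFor("flink", numSubtasks)).get("flink")); // prints 3
    }
}
```

Because the mapping is deterministic, no word is ever counted on two nodes, which is why per-word state scales out with the number of unique keys rather than the input size.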
-Matthias
On 05/22/2016 09:50 PM, Xtra Coder wrote:
> Hello,
>
> A question from a newbie about how Flink's WordCount will actually work
> at scale.
>
> I've read/seen quite a few high-level presentations and still don't see
> a more-or-less clear answer to the following …
>
> Use-case:
> --
> there is a huge text stream with a very variable set of words – let's
> say 1BLN of unique words. Storing them just as raw text, without
> supplementary data, would take roughly 16TB of RAM. How does Flink
> approach this internally?
>
> Here I'm more interested in following:
> 1. How are individual words spread across a cluster of Flink nodes?
> Will each word appear on exactly one node and be counted there, or
> ... I'm not sure about the variants.
>
> 2. As far as I understand, while a job is running all its intermediate
> aggregation results are stored in memory across the cluster nodes
> (and may be partially spilled to local disk).
> Wild guess – what size of cluster is required to run the
> above-mentioned task efficiently?
>
> And two functional questions on top of this ...
>
> 1. Since intermediate results are in memory, I guess it should be
> possible to get the "current" counter for any word being processed.
> Is this possible?
>
> 2. After I've streamed 100TB of text, what will be the right way to
> save the result to HDFS? For example, I want to save the list of words
> ordered by key, in portions of 10 million per file, compressed with
> bzip2. Which APIs should I use?
> Since Flink uses intermediate snapshots for fault-tolerance, is it
> possible to save the whole "current" state without stopping the stream?
>
> Thanks.