Are you talking about a streaming or a batch job? You mention a "text stream", but you also say you want to stream 100 TB – which indicates you have a finite data set and would use the DataSet API.
-Matthias

On 05/22/2016 09:50 PM, Xtra Coder wrote:
> Hello,
>
> A newbie question about how Flink's WordCount will actually work at
> scale. I've read/seen quite a few high-level presentations and still
> don't see more-or-less clear answers to the following …
>
> Use-case:
> --------------
> There is a huge text stream with a very variable set of words – let's
> say 1 billion unique words. Storing them just as raw text, without any
> supplementary data, would take roughly 16 TB of RAM. How does Flink
> approach this internally?
>
> Here I'm mostly interested in the following:
>
> 1. How are individual words spread across a cluster of Flink nodes?
> Will each word appear on exactly one node and be counted there, or
> … I'm not sure about the variants.
>
> 2. As far as I understand, while a job is running all of its
> intermediate aggregation results are stored in memory across the
> cluster nodes (and may be partially spilled to local disk).
> Wild guess – what size of cluster is required to run the above task
> efficiently?
>
> And two functional questions on top of this …
>
> 1. Since intermediate results are in memory, I guess it should be
> possible to get the "current" counter for any word being processed.
> Is this possible?
>
> 2. After I've streamed 100 TB of text, what is the right way to save
> the result to HDFS? For example, I want to save the list of words
> ordered by key, in portions of 10 million per file, compressed with
> bzip2. Which APIs should I use?
> Since Flink uses intermediate snapshots for fault tolerance, is it
> possible to save the whole "current" state without stopping the stream?
>
> Thanks.
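Regarding question 1 in the quoted mail: in both APIs a word count is keyed by the word, and Flink routes every record with the same key to exactly one parallel operator instance by hashing the key, so each unique word is counted in one place. The snippet below is a simplified, Flink-free sketch of that routing idea; note it is an assumption for illustration only (Flink's real implementation hashes keys into key groups with murmur hash rather than a plain modulo), and the class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;

public class WordPartitionSketch {

    // Simplified stand-in for Flink's key -> parallel-instance routing:
    // every occurrence of the same word maps to the same instance index.
    // (Flink actually murmur-hashes keys into key groups; this plain
    // modulo is an illustrative assumption.)
    static int instanceFor(String word, int parallelism) {
        return Math.floorMod(word.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;

        // One counter map per simulated parallel instance.
        Map<String, Long>[] counters = new HashMap[parallelism];
        for (int i = 0; i < parallelism; i++) {
            counters[i] = new HashMap<>();
        }

        // Route each word to "its" instance and count it there.
        for (String word : "to be or not to be".split(" ")) {
            int instance = instanceFor(word, parallelism);
            counters[instance].merge(word, 1L, Long::sum);
        }

        // Both occurrences of "to" landed on the same instance.
        int inst = instanceFor("to", parallelism);
        System.out.println("'to' counted on instance " + inst
                + " with count " + counters[inst].get("to"));
    }
}
```

Because the routing is deterministic, the per-word counters never need to be merged across nodes; that is also why the total state can be sharded over however many nodes hold the ~16 TB of keys.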