Hi,

here is a short explanation of the different grouping patterns you can
use. Let's say you have 10 tuples, each having two attributes, for
example: (a,1),(b,1),(c,1),(a,2),(b,2),(c,2),(a,3),(b,3),(c,3),(a,4)
And let's assume you have two bolt tasks (i.e., a single bolt with
dop=2), b1 and b2.
* If you use shuffle-grouping, each tuple is sent to a single,
randomly chosen bolt task; tuples are NOT replicated. For example:
b1: (a,1),(c,1),(b,2),(a,3),(c,3)
b2: (b,1),(a,2),(c,2),(b,3),(a,4)
* If you use field-grouping, the data is partitioned according to the
specified key. Let's assume we use the first attribute as the key.
Thus, we have three partitions:
(a,1),(a,2),(a,3),(a,4)
(b,1),(b,2),(b,3)
(c,1),(c,2),(c,3)
Storm ensures that all tuples of a single partition are processed by a
single task. Because we have only two bolt tasks in our example, one
bolt task will process two partitions and the other one partition.
Tuples are not replicated in this case either:
b1: (a,1),(a,2),(a,3),(a,4)
b2: (b,1),(c,1),(b,2),(c,2),(b,3),(c,3)
Note that b2 receives the tuples of both partitions interleaved.
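The field-grouping behavior can be sketched in plain Java without any
Storm dependency (a simplified model: key hashed modulo the number of
tasks; Storm's real assignment logic differs in detail, and the class
and method names here are mine):

```java
import java.util.*;

// Model of field-grouping: the grouping key is hashed and taken modulo
// the task count, so all tuples with the same key land on the same task.
public class FieldsGroupingDemo {
    static int taskFor(String key, int numTasks) {
        return Math.abs(key.hashCode()) % numTasks;
    }

    public static void main(String[] args) {
        String[][] tuples = {
            {"a","1"},{"b","1"},{"c","1"},{"a","2"},{"b","2"},
            {"c","2"},{"a","3"},{"b","3"},{"c","3"},{"a","4"}
        };
        int numTasks = 2;
        Map<Integer, List<String>> byTask = new TreeMap<>();
        for (String[] t : tuples)
            byTask.computeIfAbsent(taskFor(t[0], numTasks), k -> new ArrayList<>())
                  .add("(" + t[0] + "," + t[1] + ")");
        // Every key maps to exactly one task; a task may hold several keys,
        // and it sees the tuples of its keys interleaved in arrival order.
        byTask.forEach((task, ts) -> System.out.println("b" + (task + 1) + ": " + ts));
    }
}
```

Running this shows each key consistently assigned to one task, which is
the guarantee field-grouping gives you.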
* If you use all-grouping, each tuple is replicated to each bolt task:
b1: (a,1),(b,1),(c,1),(a,2),(b,2),(c,2),(a,3),(b,3),(c,3),(a,4)
b2: (a,1),(b,1),(c,1),(a,2),(b,2),(c,2),(a,3),(b,3),(c,3),(a,4)
* If you use global-grouping, all tuples are sent to a single bolt task:
b1: (a,1),(b,1),(c,1),(a,2),(b,2),(c,2),(a,3),(b,3),(c,3),(a,4)
b2:
For this case, a dop greater than one is not useful.
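The replication rules above can be summarized in a small sketch (plain
Java, no Storm dependency; the class and method names are mine). It also
shows why "10 tuples, 5 tasks, 50 received" only holds for all-grouping:

```java
public class DeliveryCounts {
    // Total tuple deliveries when n tuples are emitted to `tasks` bolt tasks.
    static int deliveries(String grouping, int n, int tasks) {
        switch (grouping) {
            case "shuffle":
            case "fields":
            case "global":
                return n;          // each tuple goes to exactly one task
            case "all":
                return n * tasks;  // each tuple is replicated to every task
            default:
                throw new IllegalArgumentException(grouping);
        }
    }

    public static void main(String[] args) {
        for (String g : new String[]{"shuffle", "fields", "all", "global"})
            System.out.println(g + ": " + deliveries(g, 10, 5) + " deliveries");
    }
}
```

So with 10 emitted tuples and 5 bolt tasks, only all-grouping yields 50
deliveries; shuffle-, field-, and global-grouping deliver 10 in total.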
About question 4:
If you read from multiple files and use multiple spout tasks, you need
to make sure that each task reads from a different file. If I
understood correctly, in your case each task is reading every file,
and thus you get each tuple multiple times.
-> To ensure that each spout task reads a different file, you can
exploit the task-id of the spout tasks to assign different files to
different tasks.
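A minimal sketch of that idea (plain Java; in a real spout the task
index would come from TopologyContext.getThisTaskIndex() inside open(),
while here it is passed as a parameter, and the class and file names
are mine):

```java
import java.util.*;

// Partition a list of files across spout tasks so that no two tasks
// read the same file: task i takes every file whose index is
// congruent to i modulo the number of tasks (round-robin).
public class FileAssignment {
    static List<String> filesForTask(List<String> allFiles, int taskIndex, int numTasks) {
        List<String> mine = new ArrayList<>();
        for (int i = 0; i < allFiles.size(); i++)
            if (i % numTasks == taskIndex)
                mine.add(allFiles.get(i));
        return mine;
    }

    public static void main(String[] args) {
        List<String> files =
            Arrays.asList("f0.txt", "f1.txt", "f2.txt", "f3.txt", "f4.txt");
        for (int task = 0; task < 2; task++)
            System.out.println("task " + task + " reads "
                               + filesForTask(files, task, 2));
    }
}
```

Every file is read by exactly one task, so no tuple is emitted twice,
as long as all tasks see the file list in the same order.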
Unfortunately, I cannot answer your other questions.
-Matthias
On 04/23/2015 11:20 AM, prasad ch wrote:
> Hi,
>
>
> I read the complete Storm documentation, but I do not understand the
> following things.
>
> Please help me!
>
>
> 1) We have the concept of stream grouping, but I cannot see any
> practical difference between the groupings. The documentation says
> shuffle grouping means tuples are equally distributed to all tasks,
> i.e., if the spout emits 10 tuples and my bolt has 5 tasks, then the
> bolt finally receives 50 tuples.
>
> Field grouping means all tuples with the same key will go to the same
> task; I am unable to verify this practically with any of the groupings.
>
> Please help me with stream grouping.
>
> 2) To process tuples in the bolt faster, I used 10 executors with
> field grouping. The problem is that if the spout emits 10 tuples, I do
> not receive the same tuples in the bolt; duplicates are coming. When I
> use global grouping it is fine, but slow in cluster mode.
>
> 3) What are the minimum basic properties for a topology to run fast?
>
> 4) I implemented reading a list of files in a spout using Java. When I
> use multiple executors for faster reading, already-processed records
> come again; I handled that scenario in Java. Is it possible in Storm?
> I am unable to successfully perform aggregations over 10000 records
> (across 10 files) while reading from the files and processing them.
>
> 5) In a Trident topology I performed aggregations with 10 files (each
> file 1000 records) fine, but when I use 10 files with 100000 records
> total (each file 10000 records), my application keeps processing and
> nothing gets done; control never reaches the corresponding filter or
> function. It takes a lot of time to emit and never reaches the filter
> or function. (Here I used IRichSpout.)
>
> Example code (main app):
>
> Config con = new Config();
> con.setDebug(true);
> con.put("fileLocation", args[0]);
> con.put("ext", args[1]);
> con.setNumAckers(10);
> file = args[2];
> //con.setNumWorkers(Integer.parseInt(args[3]));
> System.out.println("application start time :" + new Date());
> TridentTopology topology = new TridentTopology();
> Stream s = topology.newStream("spout1",
>         new ReadingSpout1(9080000)).parallelismHint(10);
> s.groupBy(new Fields("m"))
>  .aggregate(new Fields("v"), new Sum(), new Fields("r"))
>  .each(new Fields("m", "r"), new MyFun1(file), new Fields("o"))
>  .parallelismHint(40);
>
> LocalCluster cluster = new LocalCluster();
> cluster.submitTopology("TD", con, topology.build());
>
>
>
> 6) Without Trident state, are tuples that fail never replayed?
>
>
> 7) When I run Trident with a specified number of workers, it is still
> not running. Please help me; did I miss some configuration?
>
>
> 8) I have a requirement to perform streaming by reading a list of
> files in a specified location (a large number of files). I need to
> aggregate using sliding-window operations and write the result to
> files or some other destination. Which way is better, a normal
> topology or Trident? Please help me!
