So I need to test something like count (*)? To use count, I need the following query:
Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk; Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number), catalog _returns BY (cr_item_sk, cr_order_number); b = foreach Bad_OrderRes generate cr_returned_date_sk, cr_returned_time_sk, cr_item_sk, cr_refunded_customer_sk, cs_sold_date_sk, cs_sold_time_sk, cs_it em_sk, cs_order_number, inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quan tity_on_hand; b_group=GROUP b ALL; b_count=FOREACH b_group GENERATE COUNT(b); Dump b_count; or use ones = FOREACH b GENERATE 1 AS one:int; counter_group = GROUP ones ALL; log_count = FOREACH counter_group GENERATE COUNT(ones); dump log_count I think the second method can save more time. On Wed, Dec 7, 2016 at 9:49 AM, Rohini Palaniswamy <rohini.adi...@gmail.com> wrote: > Limit 4 would make processing of join stop after 4 records. It is not a > good idea to add it if you are testing performance of join. > > On Tue, Dec 6, 2016 at 8:13 PM mingda li <limingda1...@gmail.com> wrote: > > > Thanks for your quick reply. If so, I can use the limit operator to > compare > > > > good and bad join plan. It takes time to dump all. > > > > > > > > Bests, > > > > Mingda > > > > > > > > > > > > On Tue, Dec 6, 2016 at 5:23 PM, Zhang, Liyun <liyun.zh...@intel.com> > > wrote: > > > > > > > > > Hi: > > > > > I think the query time about multiple join part is not related with > > the > > > > > number of limit operator(in your case the number is 4). When the query > is > > > > > executed, limit_data is executed after Bad_OrderRes, after join > > > > > (Bad_OrderRes) is finished, limit(limit_data) starts. > > > > > If I have missed something, please tell me. > > > > > > > > > > > > > > > Best Regards > > > > > Kelly Zhang/Zhang,Liyun > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: mingda li [mailto:limingda1...@gmail.com] > > > > > Sent: Wednesday, December 7, 2016 8:18 AM > > > > > To: d...@pig.apache.org; user@pig.apache.org > > > > > Subject: How to test the efficiency of multiple join > > > > > > > > > > Dear all, > > > > > > > > > > I want to test the different multiple join orders' efficiency. However, > > > > > since the pig query is executed lazily, I need to use dump or store to > > let > > > > > the query be executed. > > > > > > > > > > Now, I use the following query to test the efficiency. > > > > > > > > > > *Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY > > > > > cs_item_sk;* > > > > > *Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number), > > > > > catalog_returns BY (cr_item_sk, cr_order_number);* *limit_data = LIMIT > > > > > Bad_OrderRes 4; * *Dump limit_data;* > > > > > > > > > > Do you think this is OK to just show 4 of results? Could this query > > > > > execution time represent the efficiency of multilpe join? I am not sure > > if > > > > > it will just get 4 items and stop without executing other items. > > > > > > > > > > Bests, > > > > > Mingda > > > > > > > > > >