I also need to adjust the threshold to a bigger value. We set the JVM -Xmx to 32G, but that memory was not being fully used.
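For reference, a minimal sketch of the threshold derivation that 赵天烁 describes in the message quoted below: the kylin.query.mem.budget memory budget (default roughly 3G) divided by an estimated size of a single row. The class name, method name, and the sample row size are assumptions for illustration, not the actual Kylin code.

public class ScanThresholdSketch {

    /** threshold = memory budget / estimated bytes per cuboid row */
    public static long deriveThreshold(long memBudgetBytes, long estimatedRowSizeBytes) {
        return memBudgetBytes / estimatedRowSizeBytes;
    }

    public static void main(String[] args) {
        long memBudget = 3L * 1024 * 1024 * 1024;  // kylin.query.mem.budget default, roughly 3G
        long rowSize = 64L * 1024;                 // hypothetical estimate for a wide row, e.g. one carrying a count-distinct measure
        // With these sample numbers the threshold lands near the 49121 seen later in the thread;
        // raising kylin.query.mem.budget raises the threshold proportionally.
        System.out.println("scan row threshold = " + deriveThreshold(memBudget, rowSize));
    }
}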
赵天烁 <[email protected]>于2016年8月4日周四 上午11:54写道: > I dig into the code, found that this threshold is determine at > CubeStorageQuery--setThreshold method, there already has a prop call > kylin.query.mem.budget,which > define the memory size of current jvm obtain,default to 3G then divide this > by a estimate size of a single row.the result of that is the threshold.so I > think by change kylin.query.mem.budget to a higher val will do. > > btw my cube is growing like a hell.....the biggest one is almost > 1.7TB(only keep on month of these data),source record is 200 milion level,9 > dim,two of then cardinality are bigger than 100milion,already have > performance issue,maybe you kylin expert could provide some optimize sugg? > waiting...... > > ------------------------------ > > 赵天烁 > > Kevin Zhao > > *[email protected] <[email protected]>* > > > > 珠海市魅族科技有限公司 > > MEIZU Technology Co., Ltd. > > 广东省珠海市科技创新海岸魅族科技楼 > > MEIZU Tech Bldg., Technology & Innovation Coast > > Zhuhai, 519085, Guangdong, China > > meizu.com > > > *From:* hongbin ma <[email protected]> > *Date:* 2016-08-03 12:50 > *To:* user.kylin <[email protected]> > > *Subject:* Re: Re: Scan row count exceeded threshold > > Hi Tiansheng and Tianshuo > > Kylin is basically limiting the footprint on storage visit, the size of > each row is inversely proportional to the number of rows to be read. The > cuboid row size grows when there's a distinct count measure, so you're > observing threshold being 49121, this is normal. > > The unnormal part is kylins behavior when there's a limit clause, > especially for cases like Tianshuo's case, where query being: > > select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by > FCRASH_ANALYSIS_ID,FCRASHTIME limit N > > The query does not have any filters, so we should be able to read the > first N rows from cuboid (FCRASH_ANALYSIS_ID,FCRASHTIME) and return the > result to users. Yang Li tried to fix the issue in > https://issues.apache.org/jira/browse/KYLIN-1787, however the approach > was still a little bit too conservative to me. The patch in KYLIN-1787 > would not enable the storage read limit as long as the cube has a partition > time column (and meanwhile the query is not grouping by the partition time > column), because we'll need to further aggregate rows from different > segments. This is why 1.5.3 does not behave as Tianshuo expect. > > However there's still room for improvement even if further aggregation is > required across multiple segments. For tianshuo's case, we can ask for N > cuboid row from each segment, and merge them at query server side. Since > the cuboid rows are respectively sorted in each segment, it > is guaranteed that the result is correct > > However it's a different story if the query contains filters, like in > Tiansheng's case. Filter on dimensions may prevent limit clause put down, > especially when the dimension is not the first dimension in row key. Below > is Tiansheng's case: > > Error while executing SQL "select > "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as > clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as > cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where > COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by > "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10 > offset 0": Scan row count exceeded threshold: 49121, please add filter > condition to narrow down backend scan range, like where clause. 
> However, it is a different story if the query contains filters, as in Tiansheng's case. A filter on a dimension may prevent the limit clause from being pushed down, especially when the dimension is not the first dimension in the row key. Below is Tiansheng's case:
>
> Error while executing SQL "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10 offset 0": Scan row count exceeded threshold: 49121, please add filter condition to narrow down backend scan range, like where clause.
>
> I have opened a JIRA to fix this.
>
> On Wed, Aug 3, 2016 at 10:25 AM, 张天生 <[email protected]> wrote:
>
>> Hi ShaoFeng:
>>
>> My SQL is "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10 offset 0". I analyzed it and the critical cause is "COUNT(DISTINCT CLK_DEVICEID) as clk_uv": when I delete this measure, the query runs fine even over a long time span.
>>
>> 赵天烁 <[email protected]> wrote on Wednesday, 3 Aug 2016 at 10:16:
>>
>>> OK, I'll try to make it configurable in kylin.properties.
>>>
>>> 赵天烁
>>> Kevin Zhao
>>> [email protected]
>>> MEIZU Technology Co., Ltd.
>>>
>>> *From:* ShaoFeng Shi <[email protected]>
>>> *Date:* 2016-08-02 22:37
>>> *To:* user <[email protected]>
>>> *Subject:* Re: Re: Scan row count exceeded threshold
>>>
>>> This is a new change introduced in 1.5.3; the JIRA is https://issues.apache.org/jira/browse/KYLIN-1787
>>>
>>> In short, it tries to avoid scanning too many rows (1 million) in one query, but the threshold wasn't externalized as a parameter. @Tianshuo, would you mind contributing a patch?
>>>
>>> Besides, the query has "limit 1" but Kylin wasn't smart enough to take advantage of that.
>>>
>>> @Tiansheng, what does the SQL query look like in your case?
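A minimal sketch of what externalizing the threshold into kylin.properties might look like, in the spirit of the patch discussed above. The property name kylin.query.scan.threshold and the helper class are assumptions for illustration, not the actual change; only the 10,000,000 default comes from the error message quoted later in this thread.

import java.util.Properties;

public class ScanThresholdConfig {

    private static final int DEFAULT_SCAN_THRESHOLD = 10_000_000; // the hard-coded value seen in the error message

    /** Read the scan-row threshold from kylin.properties, falling back to the built-in default. */
    public static int getScanThreshold(Properties kylinProperties) {
        String value = kylinProperties.getProperty("kylin.query.scan.threshold");
        return value == null ? DEFAULT_SCAN_THRESHOLD : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("kylin.query.scan.threshold", "20000000"); // e.g. raise the limit to 20 million rows
        System.out.println(getScanThreshold(props));
    }
}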
>>> 2016-08-02 21:14 GMT+08:00 张天生 <[email protected]>:
>>>
>>>> I met the same problem. With version 1.5.2.1 it worked fine, but after switching to 1.5.3 it often reports this error.
>>>>
>>>> 赵天烁 <[email protected]> wrote on Tuesday, 2 Aug 2016 at 11:57:
>>>>
>>>>> Yes, it is 1.5.3. After limiting the date range to a week it still takes about 7 seconds to respond. Is there any way to get the query execution details, just like the diagnosis log?
>>>>>
>>>>> 赵天烁
>>>>> Kevin Zhao
>>>>> [email protected]
>>>>> MEIZU Technology Co., Ltd.
>>>>>
>>>>> *From:* ShaoFeng Shi <[email protected]>
>>>>> *Date:* 2016-08-01 20:20
>>>>> *To:* user <[email protected]>
>>>>> *Subject:* Re: Scan row count exceeded threshold
>>>>>
>>>>> Are you on 1.5.3? If not, I suggest upgrading and then retrying. Besides, you can add some "where" conditions to reduce the records to scan, as the error message mentions.
>>>>>
>>>>> 2016-08-01 15:37 GMT+08:00 赵天烁 <[email protected]>:
>>>>>
>>>>>> Recently I got the following error while executing a query on a cube that is not that big (about 400MB, 20 million records):
>>>>>>
>>>>>> ==================
>>>>>> Error while executing SQL "select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by FCRASH_ANALYSIS_ID,FCRASHTIME limit 1": Scan row count exceeded threshold: 10000000, please add filter condition to narrow down backend scan range, like where clause.
>>>>>>
>>>>>> I guess what it scanned was the intermediate result, but the query has no "order by", and the result count is limited to just 1, so it should only need to scan until it finds any one record with those two dimensions, and voila. Waiting for a Kylin expert to give me some detail on that.
>>>>>>
>>>>>> By the way, how do I configure that threshold? Is there a property for it in kylin.properties?
>>>>>>
>>>>>> 赵天烁
>>>>>> Kevin Zhao
>>>>>> [email protected]
>>>>>> MEIZU Technology Co., Ltd.
>>>>>> MEIZU Tech Bldg., Technology & Innovation Coast, Zhuhai, 519085, Guangdong, China
>>>>>> meizu.com
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Shaofeng Shi
>>>
>>> --
>>> Best regards,
>>> Shaofeng Shi
>
> --
> Regards,
> *Bin Mahone | 马洪宾*
