I also need to adjust the threshold to a bigger value. We set the JVM -Xmx to 32G, but that memory was not being fully used.
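For reference, a minimal sketch of the threshold derivation that 赵天烁 describes in the message quoted below: the kylin.query.mem.budget memory budget (default roughly 3G) divided by an estimated size of a single row. The class name, method name, and the sample row size are assumptions for illustration, not the actual Kylin code.

public class ScanThresholdSketch {

    /** threshold = memory budget / estimated bytes per cuboid row */
    public static long deriveThreshold(long memBudgetBytes, long estimatedRowSizeBytes) {
        return memBudgetBytes / estimatedRowSizeBytes;
    }

    public static void main(String[] args) {
        long memBudget = 3L * 1024 * 1024 * 1024;  // kylin.query.mem.budget default, roughly 3G
        long rowSize = 64L * 1024;                 // hypothetical estimate for a wide row, e.g. one carrying a count-distinct measure
        // With these sample numbers the threshold lands near the 49121 seen later in the thread;
        // raising kylin.query.mem.budget raises the threshold proportionally.
        System.out.println("scan row threshold = " + deriveThreshold(memBudget, rowSize));
    }
}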
赵天烁 <[email protected]>于2016年8月4日周四 上午11:54写道: > I dig into the code, found that this threshold is determine at > CubeStorageQuery--setThreshold method, there already has a prop call > kylin.query.mem.budget,which > define the memory size of current jvm obtain,default to 3G then divide this > by a estimate size of a single row.the result of that is the threshold.so I > think by change kylin.query.mem.budget to a higher val will do. > > btw my cube is growing like a hell.....the biggest one is almost > 1.7TB(only keep on month of these data),source record is 200 milion level,9 > dim,two of then cardinality are bigger than 100milion,already have > performance issue,maybe you kylin expert could provide some optimize sugg? > waiting...... > > ------------------------------ > > 赵天烁 > > Kevin Zhao > > *[email protected] <[email protected]>* > > > > 珠海市魅族科技有限公司 > > MEIZU Technology Co., Ltd. > > 广东省珠海市科技创新海岸魅族科技楼 > > MEIZU Tech Bldg., Technology & Innovation Coast > > Zhuhai, 519085, Guangdong, China > > meizu.com > > > *From:* hongbin ma <[email protected]> > *Date:* 2016-08-03 12:50 > *To:* user.kylin <[email protected]> > > *Subject:* Re: Re: Scan row count exceeded threshold > > Hi Tiansheng and Tianshuo > > Kylin is basically limiting the footprint on storage visit, the size of > each row is inversely proportional to the number of rows to be read. The > cuboid row size grows when there's a distinct count measure, so you're > observing threshold being 49121, this is normal. > > The unnormal part is kylins behavior when there's a limit clause, > especially for cases like Tianshuo's case, where query being: > > select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by > FCRASH_ANALYSIS_ID,FCRASHTIME limit N > > The query does not have any filters, so we should be able to read the > first N rows from cuboid (FCRASH_ANALYSIS_ID,FCRASHTIME) and return the > result to users. Yang Li tried to fix the issue in > https://issues.apache.org/jira/browse/KYLIN-1787, however the approach > was still a little bit too conservative to me. The patch in KYLIN-1787 > would not enable the storage read limit as long as the cube has a partition > time column (and meanwhile the query is not grouping by the partition time > column), because we'll need to further aggregate rows from different > segments. This is why 1.5.3 does not behave as Tianshuo expect. > > However there's still room for improvement even if further aggregation is > required across multiple segments. For tianshuo's case, we can ask for N > cuboid row from each segment, and merge them at query server side. Since > the cuboid rows are respectively sorted in each segment, it > is guaranteed that the result is correct > > However it's a different story if the query contains filters, like in > Tiansheng's case. Filter on dimensions may prevent limit clause put down, > especially when the dimension is not the first dimension in row key. Below > is Tiansheng's case: > > Error while executing SQL "select > "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as > clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as > cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where > COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by > "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10 > offset 0": Scan row count exceeded threshold: 49121, please add filter > condition to narrow down backend scan range, like where clause. 
> However, it is a different story if the query contains filters, as in Tiansheng's case. A filter on a dimension may prevent the limit clause from being pushed down, especially when the dimension is not the first dimension in the row key. Below is Tiansheng's case:
>
> Error while executing SQL "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10 offset 0": Scan row count exceeded threshold: 49121, please add filter condition to narrow down backend scan range, like where clause.
>
> I have opened a JIRA to fix this.
>
> On Wed, Aug 3, 2016 at 10:25 AM, 张天生 <[email protected]> wrote:
>
>> Hi ShaoFeng:
>>
>> My SQL is "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10 offset 0". I analyzed it and the critical cause is "COUNT(DISTINCT CLK_DEVICEID) as clk_uv": when I delete this measure, the query runs fine even over a long time span.
>>
>> 赵天烁 <[email protected]> wrote on Wednesday, 3 Aug 2016 at 10:16:
>>
>>> OK, I'll try to make it configurable in kylin.properties.
>>>
>>> 赵天烁
>>> Kevin Zhao
>>> [email protected]
>>> MEIZU Technology Co., Ltd.
>>>
>>> *From:* ShaoFeng Shi <[email protected]>
>>> *Date:* 2016-08-02 22:37
>>> *To:* user <[email protected]>
>>> *Subject:* Re: Re: Scan row count exceeded threshold
>>>
>>> This is a new change introduced in 1.5.3; the JIRA is https://issues.apache.org/jira/browse/KYLIN-1787
>>>
>>> In short, it tries to avoid scanning too many rows (1 million) in one query, but the threshold wasn't externalized as a parameter. @Tianshuo, would you mind contributing a patch?
>>>
>>> Besides, the query has "limit 1" but Kylin wasn't smart enough to take advantage of that.
>>>
>>> @Tiansheng, what does the SQL query look like in your case?
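A minimal sketch of what externalizing the threshold into kylin.properties might look like, in the spirit of the patch discussed above. The property name kylin.query.scan.threshold and the helper class are assumptions for illustration, not the actual change; only the 10,000,000 default comes from the error message quoted later in this thread.

import java.util.Properties;

public class ScanThresholdConfig {

    private static final int DEFAULT_SCAN_THRESHOLD = 10_000_000; // the hard-coded value seen in the error message

    /** Read the scan-row threshold from kylin.properties, falling back to the built-in default. */
    public static int getScanThreshold(Properties kylinProperties) {
        String value = kylinProperties.getProperty("kylin.query.scan.threshold");
        return value == null ? DEFAULT_SCAN_THRESHOLD : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("kylin.query.scan.threshold", "20000000"); // e.g. raise the limit to 20 million rows
        System.out.println(getScanThreshold(props));
    }
}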
>>> 2016-08-02 21:14 GMT+08:00 张天生 <[email protected]>:
>>>
>>>> I met the same problem. With version 1.5.2.1 it worked fine, but after switching to 1.5.3 it often reports this error.
>>>>
>>>> 赵天烁 <[email protected]> wrote on Tuesday, 2 Aug 2016 at 11:57:
>>>>
>>>>> Yes, it is 1.5.3. After limiting the date range to a week it still takes about 7 seconds to respond. Is there any way to get the query execution details, just like the diagnosis log?
>>>>>
>>>>> 赵天烁
>>>>> Kevin Zhao
>>>>> [email protected]
>>>>> MEIZU Technology Co., Ltd.
>>>>>
>>>>> *From:* ShaoFeng Shi <[email protected]>
>>>>> *Date:* 2016-08-01 20:20
>>>>> *To:* user <[email protected]>
>>>>> *Subject:* Re: Scan row count exceeded threshold
>>>>>
>>>>> Are you on 1.5.3? If not, I suggest upgrading and then retrying. Besides, you can add some "where" conditions to reduce the records to scan, as the error message mentions.
>>>>>
>>>>> 2016-08-01 15:37 GMT+08:00 赵天烁 <[email protected]>:
>>>>>
>>>>>> Recently I got the following error while executing a query on a cube that is not that big (about 400MB, 20 million records):
>>>>>>
>>>>>> ==================
>>>>>> Error while executing SQL "select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by FCRASH_ANALYSIS_ID,FCRASHTIME limit 1": Scan row count exceeded threshold: 10000000, please add filter condition to narrow down backend scan range, like where clause.
>>>>>>
>>>>>> I guess what it scanned was the intermediate result, but the query has no "order by", and the result count is limited to just 1, so it should only need to scan until it finds any one record with those two dimensions, and voila. Waiting for a Kylin expert to give me some detail on that.
>>>>>>
>>>>>> By the way, how do I configure that threshold? Is there a property for it in kylin.properties?
>>>>>>
>>>>>> 赵天烁
>>>>>> Kevin Zhao
>>>>>> [email protected]
>>>>>> MEIZU Technology Co., Ltd.
>>>>>> MEIZU Tech Bldg., Technology & Innovation Coast, Zhuhai, 519085, Guangdong, China
>>>>>> meizu.com
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Shaofeng Shi
>>>
>>> --
>>> Best regards,
>>> Shaofeng Shi
>
> --
> Regards,
> *Bin Mahone | 马洪宾*
