Thanks. We checked and confirmed that the user ids are all going to one reducer 
and causing the skew. If we increase the reducer count, does this mean the 
dsid count step will also use multiple reducers?

Thanks for the help. We will try this anyway.
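
For illustration, here is a minimal Python sketch of why a single reducer per 
column becomes a bottleneck for the dsid column, and how spreading only that 
column over several reducers changes the load. This is not Kylin's code: the 
function names, the crc32-based partitioning, and the sample data are all 
assumptions made for the example. The only thing taken from this thread is the 
idea that normal columns get one reducer each and that 
kylin.engine.mr.uhc-reducer-count raises the count for ultra-high-cardinality 
columns.

```python
import zlib
from collections import Counter


def pick_reducer(column, value, uhc_columns, uhc_reducer_count):
    """Hypothetical partitioner: one reducer per normal column,
    uhc_reducer_count reducers for each ultra-high-cardinality column."""
    if column in uhc_columns and uhc_reducer_count > 1:
        # Assumed hashing scheme, chosen only for the illustration.
        shard = zlib.crc32(value.encode("utf-8")) % uhc_reducer_count
    else:
        shard = 0
    return (column, shard)


def reducer_load(rows, uhc_columns, uhc_reducer_count):
    """Count how many distinct values each (column, shard) reducer receives."""
    distinct = {(col, val) for row in rows for col, val in row.items()}
    load = Counter()
    for col, val in distinct:
        load[pick_reducer(col, val, uhc_columns, uhc_reducer_count)] += 1
    return load


# Made-up sample: a low-cardinality column and a high-cardinality dsid column.
rows = [{"country": f"c{i % 5}", "dsid": f"user-{i}"} for i in range(10_000)]

before = reducer_load(rows, uhc_columns={"dsid"}, uhc_reducer_count=1)
after = reducer_load(rows, uhc_columns={"dsid"}, uhc_reducer_count=10)

print("busiest reducer with 1 UHC reducer:  ", max(before.values()))
print("busiest reducer with 10 UHC reducers:", max(after.values()))
```

In this sketch the country column still lands on a single reducer in both 
runs; only the dsid values are split, which is the behaviour we are hoping to 
see from the setting.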

> On Jun 25, 2019, at 2:32 AM, ShaoFeng Shi <[email protected]> wrote:
> 
> Hi Cinto,
> 
> By default, Kylin uses one reducer for one column to remove the duplicated 
> values (for building dimension dictionaries). This is okay for the normal 
> case. 
> 
> In your case, the user id (dsids) is an ultra-high-cardinality column, so one 
> reducer is not enough to process it; Kylin needs to start more reducers for 
> this column. As you have already observed, this reducer is very slow. You can 
> adjust the configuration to increase the parallelism, e.g.:
> 
> kylin.engine.mr.uhc-reducer-count=10
> 
> For this to take effect, you need to restart Kylin, discard the current job, 
> and re-submit the build job.
> 
> Best regards,
> 
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Email: [email protected]
> 
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: [email protected]
> Join Kylin dev mail group: [email protected]
> 
> 
> 
> 
> hit-lacus <[email protected]> wrote on Sun, Jun 23, 2019 at 1:16 PM:
>> Hi,
>>    It looks like this is caused by data skew, which often happens in big 
>> data scenarios. As far as I know, you should check the high-cardinality 
>> column and use it as a "Shard By" column (in "Advanced Setting" of the cube 
>> design stage). You may check "Redistribute intermediate table" in 
>> http://kylin.apache.org/docs20/howto/howto_optimize_build.html for more 
>> information.
>>    If you find anything wrong or I have misunderstood anything, please let 
>> me know. Thank you.
>> 
>> 
>> 
>> -----------------
>> -----------------
>> Best wishes to you ! 
>> From :Xiaoxiang Yu
>> 
>> At 2019-06-22 02:33:56, "Cinto Sunny" <[email protected]> wrote:
>> Thanks. We actually have 12 reducers. The problem is that one reducer is 
>> getting stuck with a huge amount of data. The rest complete. We have 1.8 
>> billion dsids and are not sure if that is the problem. If yes, how do we 
>> distribute the data?
>> 
>> - Cinto
>> 
>> 
>>> On Fri, Jun 21, 2019 at 12:03 AM Chao Long <[email protected]> wrote:
>>> Hi Cinto Sunny,
>>>    You can try setting "kylin.engine.mr.uhc-reducer-count" to a larger 
>>> value; the default is 1.
>>> 
>>>> On Fri, Jun 21, 2019 at 2:44 PM Cinto Sunny <[email protected]> 
>>>> wrote:
>>>> Hi All,
>>>> 
>>>> I am building a cube with 10 dimensions and two measures. The total input 
>>>> size is 100 GB. 
>>>> I am trying to build using Roaring BitMap. One of the fact table columns 
>>>> is user, and it has ~1.8B userids. 
>>>> 
>>>> The build is getting stuck at the stage "Extract Fact Table Distinct 
>>>> Columns". One executor is stuck and is processing over 800M lines.
>>>> 
>>>> I am using version - 2.6.
>>>> 
>>>> Any pointers would be appreciated. Let me know if any further information 
>>>> is required.
>>>> 
>>>> - Cinto
