Hi Cinto,

By default, Kylin uses one reducer per column to remove duplicated values
(for building dimension dictionaries). This is fine for the normal case.

In your case, the user id (dsids) is an ultra-high-cardinality (UHC)
column, so a single reducer is not enough to process it; Kylin needs to
start more reducers for that column. Since you have already observed that
this reducer is very slow, you can adjust the configuration to increase
the parallelism, e.g.:

kylin.engine.mr.uhc-reducer-count=10

For this to take effect, you need to restart Kylin, discard the current
job, and re-submit the build job.
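
As a minimal sketch (assuming a standard installation under $KYLIN_HOME;
adjust the path to your deployment):

# in $KYLIN_HOME/conf/kylin.properties (path assumes a standard install)
kylin.engine.mr.uhc-reducer-count=10

# restart Kylin so the new value is picked up
$KYLIN_HOME/bin/kylin.sh stop
$KYLIN_HOME/bin/kylin.sh start

Alternatively, the same property can usually be set for just this cube
under "Configuration Overrides" in the cube designer, if you prefer not
to change it cluster-wide.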

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: [email protected]

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]




hit-lacus <[email protected]> wrote on Sun, Jun 23, 2019 at 1:16 PM:

> Hi,
>    It looks like it is caused by data skew, which often happens in many
> big-data scenarios. As far as I know, you should check the high-cardinality
> column and use it as a "Shard By" column (in the "Advanced Setting" page of
> the cube design). You may check "Redistribute intermediate table" in
> http://kylin.apache.org/docs20/howto/howto_optimize_build.html for
> more information.
>    If you find anything wrong or if I have misunderstood anything, please
> let me know. Thank you.
>
>
>
> -----------------
> Best wishes to you!
> From: Xiaoxiang Yu
>
> At 2019-06-22 02:33:56, "Cinto Sunny" <[email protected]> wrote:
>
> Thanks. We actually have 12 reducers. The problem is that one reducer is
> getting stuck with a huge amount of data while the rest complete. We have
> ~1.8 billion dsids and are not sure if that is the problem. If yes, how do
> we distribute the data?
>
> - Cinto
>
>
> On Fri, Jun 21, 2019 at 12:03 AM Chao Long <[email protected]>
> wrote:
>
>> Hi Cinto Sunny,
>>    You can try setting "kylin.engine.mr.uhc-reducer-count" to a bigger
>> value; the default is 1.
>>
>> On Fri, Jun 21, 2019 at 2:44 PM Cinto Sunny <[email protected]>
>> wrote:
>>
>>> Hi All,
>>>
>>> I am building a cube with 10 dimensions and two measures. The total
>>> input size is 100 GB.
>>> I am trying to build using a Roaring Bitmap measure. One of the fact
>>> table columns is the user id, which has ~1.8B distinct user ids.
>>>
>>> The build is getting stuck at the "Extract Fact Table Distinct Columns"
>>> step. One executor is stuck and is processing over 800M lines.
>>>
>>> I am using version - 2.6.
>>>
>>> Any pointers would be appreciated. Let me know if any
>>> further information is required.
>>>
>>> - Cinto
>>>
>>
