Re: Issues with Kylin with EMR and S3

ShaoFeng Shi Thu, 09 Nov 2017 05:57:29 -0800

Thanks Roberto;

I will also try that tomorrow or this weekend. I had planned to draft a
document for EMR; now is the time to do it.


2017-11-09 19:54 GMT+08:00 Roberto Tardío <[email protected]>:

> Hi,
>
> With Kylin 2.1, the YARN RM shows that one job for Step 1 finished
> successfully, but there is no job when Step 2 gets stuck. When we use HDFS
> as the working dir, this step works fine and launches a Tez job on the YARN
> RM that finishes successfully (as does the rest of the sample cube build
> process).
>
> With Kylin 2.2, the YARN RM does not show any MR job when Step 1 gets stuck.
>
> However, we are going to run the test again; maybe when changing the Kylin
> version from 2.1 to 2.2 we forgot to clean some metadata, coprocessor, ...
>
> On 09/11/2017 at 11:10, ShaoFeng Shi wrote:
>
> Hi Roberto,
>
> No need to set *kylin.storage.hbase.cluster-fs* to the same bucket again.
>
> For the stuck job, did you check YARN RM to see whether there is any
> indicator?
>
>
> 2017-11-09 17:38 GMT+08:00 Roberto Tardío <[email protected]>:
>
>> Hi,
>>
>> EMR version is 5.7 and Kylin version is 2.1. We have changed
>> kylin.env.hdfs-working-dir to s3://your-bucket/kylin but *we have not
>> changed kylin.storage.hbase.cluster-fs to the same S3 bucket*. Could
>> it be because we did not change this *kylin.storage.hbase.cluster-fs*
>> parameter to S3?
>>
>> We have also tried with the latest version of Kylin (2.2). In this case,
>> when the build job starts, the first step gets stuck with no errors or
>> warnings in the log files. Maybe we are doing something wrong. Tomorrow we
>> are going to try setting *kylin.storage.hbase.cluster-fs* to S3.
>>
>> Other details about our architecture are:
>>
>>    - Kylin 2.1 (also tried with 2.2) on a separate EC2 machine, with
>>    Hadoop CLI for EMR and access to HDFS (EMR ephemeral) and S3.
>>    - EMR 5.7 cluster (1 master and 4 core nodes)
>>    - HBase on S3
>>    - Hive warehouse on S3 and metastore configured on MySQL in the
>>    EC2 machine (the same one where Kylin runs)
>>    - HDFS
>>    - S3 with EMRFS
>>    - Zookeeper.
>>
>> I will give you feedback on tomorrow's new tests.
>>
>> Many thanks ShaoFeng!
>>
>> On 09/11/2017 at 1:12, ShaoFeng Shi wrote:
>>
>> Hi Roberto,
>>
>> What's your EMR version? I know that in the 4.x versions, EMR's Hive has a
>> problem with "insert overwrite" over S3, which is exactly what Kylin needs
>> in the "redistribute flat hive table" step. You can skip the
>> "redistribute" step by setting
>> "kylin.source.hive.redistribute-flat-table=false" in kylin.properties.
>> (On EMR 5.7, there is no such issue.)
>>
>> The second option is to set "kylin.env.hdfs-working-dir" to local HDFS
>> and "kylin.storage.hbase.cluster-fs" to an S3 bucket (with HBase data also
>> on S3). Kylin will build the cube on HDFS, then output the HFiles to S3,
>> and finally load them into HBase on S3. This gives better build
>> performance and also keeps the cube data in S3 for high availability and
>> durability. But if you stop EMR, the intermediate cuboid files will be
>> lost, so segments can no longer be merged.
>>
>> The third option is to use a newer version such as EMR 5.7 and use S3 as
>> the working dir (with HBase also on S3).
>>
>> For all the scenarios, please use Kylin v2.2, which includes the fix for
>> KYLIN-2788.
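>>
>> As a rough sketch, the three options above might look like this in
>> kylin.properties. The bucket name "your-bucket" and the HDFS path /kylin
>> are placeholders reused from this thread, not verified values, and only
>> one of the options should be active at a time:
>>
>> ```properties
>> # Option 1 (workaround for EMR 4.x): skip the "redistribute flat hive
>> # table" step that relies on "insert overwrite" over S3.
>> kylin.source.hive.redistribute-flat-table=false
>>
>> # Option 2: build on local (ephemeral) HDFS, keep HBase data on S3.
>> # Assumed values; intermediate cuboid files are lost if EMR is stopped.
>> #kylin.env.hdfs-working-dir=/kylin
>> #kylin.storage.hbase.cluster-fs=s3://your-bucket
>>
>> # Option 3 (EMR 5.7 or newer): use S3 as the working dir as well.
>> #kylin.env.hdfs-working-dir=s3://your-bucket/kylin
>> ```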
>>
>>
>>
>>
>>
>> 2017-11-09 3:45 GMT+08:00 Roberto Tardío <[email protected]>:
>>
>>> Hi,
>>>
>>> We have deployed Kylin on an EC2 machine using an EMR cluster. After
>>> adding the "hbase.zookeeper.quorum" property to kylin_job_conf.xml, we
>>> successfully built the sample cube. However, Kylin data is stored on the
>>> HDFS path /kylin. Because HDFS is ephemeral storage on EMR and is erased
>>> if you terminate the cluster (e.g. to save costs, to change the instance
>>> types, ...), we have to store the data on S3.
>>>
>>> To this end we changed the 'kylin.env.hdfs-working-dir' property to S3,
>>> like s3://your-bucket/kylin. But after this change, if we try to build
>>> the sample cube, the build job starts but gets stuck in step 2,
>>> "Redistribute Flat Hive Table". We have checked that this step never
>>> starts, and the Kylin logs do not show any error or warning.
>>>
>>> Do you have any idea how to solve this and make Kylin work with S3?
>>>
>>> So far the only solution we have found is to copy the HDFS folder to S3
>>> before terminating the EMR cluster and copy it back from S3 to HDFS when
>>> the cluster is started again. However, this is only half a solution,
>>> since the HDFS storage of EMR is ephemeral and we do not have as much
>>> space available as in S3. Which data does Kylin store in the kylin path?
>>> Are the HBase tables stored in this folder?
>>>
>>> We would appreciate your help,
>>>
>>> Roberto
>>> --
>>>
>>> *Roberto Tardío Olmos*
>>> *Senior Big Data & Business Intelligence Consultant*
>>> Avenida de Brasil, 17
>>> <https://maps.google.com/?q=Avenida+de+Brasil,+17&entry=gmail&source=g>,
>>> Planta 16.28020 Madrid
>>> Fijo: 91.788.34.10
>>>
>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>>
>>
>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>
>



-- 
Best regards,

Shaofeng Shi 史少锋
