So that's good: it handles reads using a load balancer in front of the Kylin query nodes and lets you scale nodes up/down with ECS. But what if your EMR (single master) goes down? Are those clustered across different AZs as well?
On Tue, Aug 7, 2018 at 6:54 PM Chase Zhang <[email protected]> wrote:

> Hi Sonny,
>
> I think *reload config*, instead of reload metadata, will have the same effect of wiping the cache of cubes. Please have a try. (You have to do this on each query node.)
>
> Our Kylin instances and EMR are started separately. The EMR was started first; then we use Docker (ECS) to start Kylin. To customize the properties without building a new Docker image, we've written our own preload script (a sketch of such a script appears at the end of this thread). We make templates of configs like kylin.properties with some fields filled with placeholders. Once the container is started, the script first replaces those placeholders with values from environment variables; the IP address of the EMR master is set there.
>
> As for the Kylin-to-EMR mapping: we have only one Kylin master node per EMR cluster, but the query nodes are deployed with auto scaling, which means their number changes according to the situation.
>
> I'm afraid we don't have a video (even if there were one, it would be in Chinese, which I don't think would be helpful). Our Dockerfile hasn't been open sourced yet. I will follow the progress and notify you if there is any news.
>
> On Aug 7, 2018, 11:12 PM +0800, Sonny Heer <[email protected]>, wrote:
>
> Thanks Chase. I'm assuming the wipe-cache is the same as "Reload Metadata" under the "System" tab in the Kylin UI. We did try a reload metadata via the UI, but that didn't seem to update the query node.
>
> The other key problem is how your team coordinated Kylin and EMR; where to connect is also a hardcoded property in kylin.properties. Did you bring up Kylin and EMR at the same time, so that the Kylin bootstrap has the EMR master node IPs? Is there a 1:1 mapping of Kylin node to EMR cluster?
>
> Is there a video of that slide deck? I'd also be curious to look at your Docker image if it's available. Thanks.
>
> On Mon, Aug 6, 2018 at 8:37 PM Chase Zhang <[email protected]> wrote:
>
>> Hi Sonny,
>>
>> I'm Chase from Strikingly. As Shaofeng has mentioned our solution, I'd like to give a brief introduction to it in case it is helpful to you.
>>
>> To my understanding, your key problem is how to coordinate the Kylin master node and its query nodes.
>>
>> Currently, the Kylin master side must have hard-coded target URLs for all query nodes, and once a cube is built, the master node notifies the query nodes to update their metadata. This is because Kylin keeps a cache of the related configs; even though HBase has the latest values, the cache might be out of date.
>>
>> Luckily, Kylin provides a RESTful API for updating the cache (see http://kylin.apache.org/docs23/howto/howto_use_restapi.html#wipe-cache).
>>
>> In theory, you can trigger this API manually to bring a query node's metadata cache up to date. But if you have multiple query instances, this becomes troublesome.
>>
>> Unlike other big data solutions, Kylin's architecture is simple. It does not depend on a service discovery component like ZooKeeper. This makes Kylin easy to deploy and use, but if you have more advanced demands, such as auto scaling, hard-coded query node IP addresses and ports might not be good enough.
>>
>> To mitigate this problem, we have developed a tool set. The basic ideas are as follows (a rough sketch of steps 3 and 4 appears right after this message):
>>
>> 1. Deploy Kylin with Docker containers.
>> 2. Make a separate scheduler that triggers builds and monitors their status through the RESTful API on the master nodes.
>> 3. Use AWS's Target Group as a service discovery solution. As the query nodes run inside a target group, we can use AWS's API to get every instance's IP address and serving port.
>> 4. Knowing that a cube build has finished, as well as the entry point of each query node, the scheduler can make RESTful API calls to the query nodes one by one to update their caches.
>>
>> Furthermore, we now have some more advanced cache management logic (for example, avoiding cache invalidation when a build fails and waiting for the next build to recover). We embedded all of this logic in our own scheduler.
>>
>> Hope this reply will help you.
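A minimal sketch, in Python, of the discovery-plus-refresh step Chase outlines in points 3 and 4 above. It assumes boto3 and requests are available, an IP-type target group for the query nodes, and the wipe-cache endpoint from the REST API page linked above; the target group ARN, port, and credentials are placeholders, not values from this thread.

```python
import boto3
import requests

TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:..."  # placeholder
KYLIN_AUTH = ("ADMIN", "KYLIN")  # Kylin's default basic-auth credentials


def query_node_addresses(target_group_arn):
    """Return (ip, port) pairs for healthy targets in the target group.

    Assumes an IP-type target group, so Target['Id'] is a private IP;
    for instance-type groups you would resolve instance IDs via EC2 first.
    """
    elbv2 = boto3.client("elbv2")
    health = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    return [
        (d["Target"]["Id"], d["Target"]["Port"])
        for d in health["TargetHealthDescriptions"]
        if d["TargetHealth"]["State"] == "healthy"
    ]


def refresh_cache(ip, port, entity="all", cache_key="all", event="update"):
    """Ask one query node to refresh its metadata cache.

    The path follows the wipe-cache doc linked in the thread
    (PUT /kylin/api/cache/{type}/{name}/{action}); verify it against
    your Kylin version.
    """
    url = f"http://{ip}:{port}/kylin/api/cache/{entity}/{cache_key}/{event}"
    resp = requests.put(url, auth=KYLIN_AUTH, timeout=30)
    resp.raise_for_status()


if __name__ == "__main__":
    for ip, port in query_node_addresses(TARGET_GROUP_ARN):
        refresh_cache(ip, port)
        print(f"refreshed metadata cache on {ip}:{port}")
```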
>> On Aug 7, 2018, 3:28 AM +0800, Sonny Heer <[email protected]>, wrote:
>>
>> [image: Screen Shot 2018-08-06 at 10.27.35 AM.png]
>>
>> In this diagram (from the slide deck), is each HBase a different EMR cluster? If so, how is Kylin configured to connect to both? Notice that the Kylin query node shows a line connecting to both clusters. Thanks for the input...
>>
>> On Mon, Aug 6, 2018 at 10:56 AM Sonny Heer <[email protected]> wrote:
>>
>>> ShaoFeng,
>>>
>>> Is Strikingly open to sharing their work? It appears our use case is similar, and I would love to see how much of their work matches ours.
>>>
>>> On Mon, Aug 6, 2018 at 7:01 AM Sonny Heer <[email protected]> wrote:
>>>
>>>> Does that require an HA cluster and Kylin installed on its own instance? EMR doesn't spin up services as HA on its master node. I'd be curious to see what Strikingly has done and whether they have it deployed on AWS.
>>>>
>>>> On Sun, Aug 5, 2018 at 10:57 PM ShaoFeng Shi <[email protected]> wrote:
>>>>
>>>>> Hi Sonny,
>>>>>
>>>>> You can configure an R/W-separated deployment with two EMRs: one is Hadoop only and the other is the HBase cluster. On the EC2 instance that runs Kylin, install both the Hadoop and HBase clients/configuration, and then tell Kylin that you have Hadoop and HBase in two clusters (refer to the blog). Kylin will run jobs in the W cluster and bulk-load HFiles to the R cluster.
>>>>>
>>>>> https://kylin.apache.org/blog/2016/06/10/standalone-hbase-cluster/
>>>>>
>>>>> Many Kylin users run this R/W-separated architecture. I once tried it on Azure with two clusters and it worked well. It's not tested with EMR, but I think they are similar.
>>>>>
>>>>> 2018-08-06 10:55 GMT+08:00 Sonny Heer <[email protected]>:
>>>>>
>>>>>> Yeah, it would be great if Kylin could have a centralized metastore in RDS.
>>>>>>
>>>>>> The big problem for us now is this:
>>>>>>
>>>>>> Two EMR clusters, each running Kylin on the master node. Both share the HBase S3 root dir.
>>>>>>
>>>>>> Cluster A creates a cube and does a build. Cluster B can see the cube as it builds in "Monitor", but once the cube is finished, it is "READY" only in cluster A (where the job was launched).
>>>>>>
>>>>>> We need somewhat isolated Kylin nodes that can still share the same backend. This is a big win, since each cluster can then scale read/write independently in EMR; that is our goal. Having read and write in the same cluster doesn't work for various reasons...
>>>>>>
>>>>>> It seems Kylin is really close, since monitoring of the cube is in sync when sharing the same HBase backend.
>>>>>>
>>>>>> Using a read replica did not work: when we tried to log in from the replica, Kylin wasn't able to work.
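The symptom Sonny describes (the cube showing READY only on the cluster that ran the build) is what the wipe-cache call discussed above is meant to address. Here is a rough sketch of how a small external scheduler could tie the two clusters together: trigger the build on the write cluster's Kylin, wait for the job, then refresh the read cluster's cache. The hostnames, cube name, and credentials are placeholders, and the rebuild/jobs/cache endpoints are taken from the Kylin REST API how-to; verify them against your Kylin version.

```python
import time
import requests

WRITE_KYLIN = "http://write-cluster-master:7070"  # placeholder hostname
READ_KYLIN = "http://read-cluster-master:7070"    # placeholder hostname
AUTH = ("ADMIN", "KYLIN")                         # Kylin's default credentials


def trigger_build(cube_name, start_ms, end_ms):
    """Kick off a build on the write cluster's Kylin.

    PUT /kylin/api/cubes/{cube}/rebuild is from the Kylin REST API how-to;
    it returns a job instance whose 'uuid' identifies the build job.
    """
    resp = requests.put(
        f"{WRITE_KYLIN}/kylin/api/cubes/{cube_name}/rebuild",
        json={"startTime": start_ms, "endTime": end_ms, "buildType": "BUILD"},
        auth=AUTH, timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["uuid"]


def wait_for_job(job_id, poll_seconds=60):
    """Poll GET /kylin/api/jobs/{id} until the job leaves its running states."""
    while True:
        resp = requests.get(f"{WRITE_KYLIN}/kylin/api/jobs/{job_id}",
                            auth=AUTH, timeout=60)
        resp.raise_for_status()
        status = resp.json().get("job_status")
        if status not in ("NEW", "PENDING", "RUNNING"):
            return status
        time.sleep(poll_seconds)


def refresh_read_side_cache():
    """Ask the read cluster's Kylin to refresh its metadata cache
    (same wipe-cache endpoint as in the earlier sketch)."""
    resp = requests.put(f"{READ_KYLIN}/kylin/api/cache/all/all/update",
                        auth=AUTH, timeout=60)
    resp.raise_for_status()


if __name__ == "__main__":
    job_id = trigger_build("my_cube", 0, int(time.time() * 1000))  # hypothetical cube
    status = wait_for_job(job_id)
    if status == "FINISHED":
        refresh_read_side_cache()
    else:
        # Mirrors Chase's point about not invalidating caches on failed builds.
        print(f"build ended with status {status}; caches left untouched")
```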
>>>>>> On Sun, Aug 5, 2018 at 7:01 PM ShaoFeng Shi <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Sonny,
>>>>>>>
>>>>>>> EMR HBase read replica is a great feature, but we didn't try it. Are you going to use this feature, or do you just want to deploy Kylin as a cluster?
>>>>>>>
>>>>>>> Would putting the Kylin metadata in RDS make it easier for you?
>>>>>>>
>>>>>>> 2018-08-04 0:05 GMT+08:00 Sonny Heer <[email protected]>:
>>>>>>>
>>>>>>>> We'd like to use EMR HBase read replicas if possible. We had some issues using this strategy, since Kylin requires write capability from all nodes (on login, for example).
>>>>>>>>
>>>>>>>> The idea is to cluster Kylin using multiple EMRs, with Kylin on each master node. If this isn't possible, we may go with the separate-instance approach, but that is prone to errors as the EMR libs have to be copied around...
>>>>>>>>
>>>>>>>> ref:
>>>>>>>> https://aws.amazon.com/blogs/big-data/setting-up-read-replica-clusters-with-hbase-on-amazon-s3/
>>>>>>>>
>>>>>>>> Does anyone else have experience with, or can share their use case on, EMR?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> On Thu, Aug 2, 2018 at 2:32 PM Sonny Heer <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Is it possible in the new version of Kylin to have multiple EMR clusters, each with Kylin installed on the master node, but talking to the same S3 location?
>>>>>>>>>
>>>>>>>>> e.g. one Write EMR cluster and one Read EMR cluster?
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Shaofeng Shi 史少锋
>>>>>
>>>>> --
>>>>> Best regards,
>>>>>
>>>>> Shaofeng Shi 史少锋
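For completeness, here is a minimal sketch of the kind of preload script Chase mentions near the top of the thread: it renders kylin.properties from a template by substituting ${VAR} placeholders with environment variables at container start-up. The paths, variable names, and placeholder syntax are illustrative, not Strikingly's actual tooling.

```python
import os
import string

TEMPLATE_PATH = "/opt/kylin/conf/kylin.properties.template"  # hypothetical path
OUTPUT_PATH = "/opt/kylin/conf/kylin.properties"


def render(template_path, output_path):
    """Substitute ${NAME} placeholders in the template with environment variables."""
    with open(template_path) as f:
        template = string.Template(f.read())
    # substitute() raises KeyError if a placeholder has no matching variable,
    # which is usually what you want at container start-up.
    rendered = template.substitute(os.environ)
    with open(output_path, "w") as f:
        f.write(rendered)


if __name__ == "__main__":
    # A template line such as
    #   kylin.server.cluster-servers=${QUERY_NODE_LIST}
    # would get QUERY_NODE_LIST (and, say, the EMR master address) injected
    # by the ECS task definition's environment before Kylin starts.
    render(TEMPLATE_PATH, OUTPUT_PATH)
```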
