Hi Sonny, I think "reload config", rather than "reload metadata", will have the same effect of wiping the cube cache. Please give it a try. (You have to do this on each query node.)
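In case a script is easier than clicking through the UI on every node, here is a minimal sketch of driving the cache wipe through the REST API (the wipe-cache endpoint from the Kylin docs linked further down in this thread; the node URLs and credentials below are placeholders):

    import requests

    # Placeholder query-node endpoints and the default dev credentials;
    # substitute your own.
    query_nodes = ["http://query-node-1:7070", "http://query-node-2:7070"]
    auth = ("ADMIN", "KYLIN")

    for node in query_nodes:
        # Wipe-cache endpoint: PUT /kylin/api/cache/{type}/{name}/{action};
        # "all"/"all" with action "update" asks the node to refresh everything.
        resp = requests.put(node + "/kylin/api/cache/all/all/update", auth=auth)
        resp.raise_for_status()
        print("cache refreshed on", node)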
Our Kylin instances and EMR clusters are started separately: the EMR cluster is started first, and then we use Docker (on ECS) to start Kylin. To customize the properties without building a new Docker image, we've written our own preload script. We keep templates of the configs, such as kylin.properties, with some fields filled with placeholders. Once the container is started, the script first replaces those placeholders with values from environment variables; the IP address of the EMR cluster is set this way.

As for the Kylin-to-EMR mapping: we have only one Kylin master node per EMR cluster, but the query nodes are deployed with auto scaling, which means their number changes with the situation.

I'm afraid we don't have a video (and even if there were one, it would be in Chinese, which I don't think would be helpful). Our Dockerfile hasn't been open sourced yet. I will follow the progress and notify you if there is any news.
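Roughly, the preload step looks like the following (a simplified sketch, not our actual script; the template path, the ${VAR} placeholder syntax, and the EMR_MASTER_IP variable name are made up for the example):

    import os
    import re

    TEMPLATE = "/opt/kylin/conf/kylin.properties.template"  # hypothetical path
    TARGET = "/opt/kylin/conf/kylin.properties"

    def render(text):
        # Swap each ${VAR} placeholder for the matching environment variable.
        return re.sub(r"\$\{(\w+)\}", lambda m: os.environ[m.group(1)], text)

    with open(TEMPLATE) as src, open(TARGET, "w") as dst:
        # A template line holding the EMR master address, for example, gets
        # filled from an EMR_MASTER_IP variable exported at container start.
        dst.write(render(src.read()))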
On Aug 7, 2018, 11:12 PM +0800, Sonny Heer <[email protected]>, wrote:
> Thanks Chase. I'm assuming the wipe-cache is the same as "Reload Metadata"
> under the "System" tab in the Kylin UI. We did try doing a reload metadata
> via the UI, but that didn't seem to update the query node.
>
> The other key problem is how your team coordinated between Kylin and EMR;
> where to connect is also a hardcoded property in kylin.properties. Did you
> bring up Kylin and EMR at the same time, so that Kylin's bootstrap has the
> EMR master node IPs? Is there a 1:1 mapping of Kylin node to EMR cluster?
>
> Is there a video of that slide deck? I'd also be curious to look at your
> Docker image if available. Thanks.
>
> On Mon, Aug 6, 2018 at 8:37 PM Chase Zhang <[email protected]> wrote:
> > Hi Sonny,
> >
> > I'm Chase from Strikingly. As Shaofeng has mentioned our solution, I'd
> > like to give a brief introduction to it in case it is helpful to you.
> >
> > To my understanding, your key problem is how to coordinate the master
> > node of Kylin and its query nodes.
> >
> > Currently, Kylin must have hard-coded target URLs on the master side
> > for all query nodes, and once a cube is built, the master node of Kylin
> > notifies the query nodes to update their metadata. This is because
> > Kylin keeps a cache of the related configs: although HBase holds the
> > latest values, the cache might be out of date.
> >
> > Luckily, Kylin provides a RESTful API for updating the cache (see
> > http://kylin.apache.org/docs23/howto/howto_use_restapi.html#wipe-cache).
> >
> > In theory, you can trigger this API manually to bring a query node's
> > metadata cache up to date. But if you are running multiple query
> > instances, this becomes troublesome.
> >
> > Unlike other Big Data solutions, Kylin's architecture is simple: it
> > does not depend on a service discovery component like Zookeeper. This
> > makes Kylin easy to deploy and use, but if you have more advanced
> > demands, such as auto scaling, hard-coded query-node IP addresses and
> > ports might not be good enough.
> >
> > To mitigate this problem, we have developed a tool set. The basic
> > ideas are:
> >
> > 1. Deploy Kylin in Docker containers.
> > 2. Run a separate scheduler that triggers builds and monitors their
> > status through the RESTful API on the master nodes.
> > 3. Use AWS's Target Group as a service discovery solution. As the query
> > nodes run inside a target group, we can use AWS's API to get every
> > instance's IP address and serving port.
> > 4. Knowing that a cube build has finished, as well as the entry point of
> > each query node, the scheduler can make RESTful API calls to the query
> > nodes one by one to update their caches (see the sketch after this
> > message).
> >
> > Furthermore, we now have some more advanced cache management logic
> > (such as not invalidating the cache when a build fails, and waiting for
> > the next build to recover). We embedded all of this logic in our own
> > scheduler.
> >
> > Hope this reply helps you.
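The sketch referenced in step 4 above, assuming boto3 with configured AWS credentials and an instance-type target group (the ARN, port handling, and Kylin credentials are placeholders, not Strikingly's actual tool):

    import boto3
    import requests

    # Placeholder ARN of the target group the query nodes register with.
    TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/kylin-query/abcdef"

    def discover_query_nodes():
        # Service discovery: ask ELBv2 which targets are currently healthy,
        # then resolve each instance's private IP address.
        elbv2 = boto3.client("elbv2")
        ec2 = boto3.resource("ec2")
        nodes = []
        health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
        for desc in health["TargetHealthDescriptions"]:
            if desc["TargetHealth"]["State"] != "healthy":
                continue
            instance = ec2.Instance(desc["Target"]["Id"])  # instance-type targets
            nodes.append("http://%s:%s" % (instance.private_ip_address, desc["Target"]["Port"]))
        return nodes

    # After the scheduler sees a build job succeed, refresh every node's
    # metadata cache with the same wipe-cache call as in the first sketch:
    for node in discover_query_nodes():
        requests.put(node + "/kylin/api/cache/all/all/update", auth=("ADMIN", "KYLIN")).raise_for_status()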
> > On Aug 7, 2018, 3:28 AM +0800, Sonny Heer <[email protected]>, wrote:
> > > In this diagram (from the slide deck), is each HBase a different EMR
> > > cluster? If so, how is Kylin configured to connect to both? Notice that
> > > the Kylin query node shows a line connecting to both clusters. Thanks
> > > for the input...
> > >
> > > On Mon, Aug 6, 2018 at 10:56 AM Sonny Heer <[email protected]> wrote:
> > > > ShaoFeng,
> > > >
> > > > Is Strikingly open to sharing their work? It appears our use case is
> > > > similar, and I would love to see how much of their work matches ours.
> > > >
> > > > On Mon, Aug 6, 2018 at 7:01 AM Sonny Heer <[email protected]> wrote:
> > > > > Does that require an HA cluster and Kylin installed on its own
> > > > > instance? EMR doesn't spin up services as HA on its master node.
> > > > > I'd be curious to see what Strikingly has done and whether they
> > > > > have it deployed on AWS.
> > > > >
> > > > > On Sun, Aug 5, 2018 at 10:57 PM ShaoFeng Shi <[email protected]> wrote:
> > > > > > Hi Sonny,
> > > > > >
> > > > > > You can configure an R/W separated deployment with two EMRs: one
> > > > > > is Hadoop only, and the other is the HBase cluster. On the EC2
> > > > > > instance that runs Kylin, install both the Hadoop and HBase
> > > > > > client/configuration, and then tell Kylin you have Hadoop and
> > > > > > HBase in two clusters (refer to the blog). Kylin will run jobs in
> > > > > > the W cluster and bulk load the HFiles to the R cluster.
> > > > > >
> > > > > > https://kylin.apache.org/blog/2016/06/10/standalone-hbase-cluster/
> > > > > >
> > > > > > Many Kylin users run this R/W separated architecture. I once
> > > > > > tried it on Azure with two clusters, and it worked well. I
> > > > > > haven't tested it with EMR, but I think they are similar.
> > > > > >
> > > > > > 2018-08-06 10:55 GMT+08:00 Sonny Heer <[email protected]>:
> > > > > > > Yeah, it would be great if Kylin could have a centralized
> > > > > > > metastore in RDS.
> > > > > > >
> > > > > > > The big problem for us now is this:
> > > > > > >
> > > > > > > Two EMR clusters, each running Kylin on the master node. Both
> > > > > > > share the HBase S3 root dir.
> > > > > > >
> > > > > > > Cluster A creates a cube and does a build. Cluster B can see
> > > > > > > the cube as it builds in "Monitor", but once the cube is
> > > > > > > finished, it is "ready" only in cluster A (the cluster the job
> > > > > > > was launched from).
> > > > > > >
> > > > > > > We need somewhat isolated Kylin nodes that can still share the
> > > > > > > same backend. This is a big win, since then each cluster can
> > > > > > > scale reads and writes independently in EMR; this is our goal.
> > > > > > > Having read and write in the same cluster doesn't work for
> > > > > > > various reasons...
> > > > > > >
> > > > > > > It seems Kylin is really close, since the monitoring of the
> > > > > > > cube stays in sync when sharing the same HBase backend.
> > > > > > >
> > > > > > > Using a read replica did not work: when we tried to log in from
> > > > > > > the replica, Kylin wasn't able to work.
> > > > > > >
> > > > > > > On Sun, Aug 5, 2018 at 7:01 PM ShaoFeng Shi <[email protected]> wrote:
> > > > > > > > Hi Sonny,
> > > > > > > >
> > > > > > > > EMR HBase read replica is a great feature, but we haven't
> > > > > > > > tried it. Are you going to use this feature, or do you just
> > > > > > > > want to deploy Kylin as a cluster?
> > > > > > > >
> > > > > > > > If you put the Kylin metadata in RDS, would that be easier
> > > > > > > > for you?
> > > > > > > >
> > > > > > > > 2018-08-04 0:05 GMT+08:00 Sonny Heer <[email protected]>:
> > > > > > > > > We'd like to use EMR HBase read replicas if possible. We
> > > > > > > > > had some issues using this strategy, since Kylin requires
> > > > > > > > > write capability from all nodes (on login, for example).
> > > > > > > > >
> > > > > > > > > The idea is to cluster Kylin using the master nodes of
> > > > > > > > > multiple EMRs. If this isn't possible, we may go with the
> > > > > > > > > separate-instance approach, but that is prone to errors, as
> > > > > > > > > the EMR libs have to be copied around...
> > > > > > > > >
> > > > > > > > > ref:
> > > > > > > > > https://aws.amazon.com/blogs/big-data/setting-up-read-replica-clusters-with-hbase-on-amazon-s3/
> > > > > > > > >
> > > > > > > > > Does anyone else have experience, or can anyone share their
> > > > > > > > > use case on EMR?
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > On Thu, Aug 2, 2018 at 2:32 PM Sonny Heer <[email protected]> wrote:
> > > > > > > > > > Is it possible in the new version of Kylin to have
> > > > > > > > > > multiple EMR clusters with Kylin installed on the master
> > > > > > > > > > node but talking to the same S3 location?
> > > > > > > > > >
> > > > > > > > > > E.g., one Write EMR cluster and one Read EMR cluster?
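For the R/W separated deployment Shaofeng describes, the linked blog post amounts to running jobs against the Hadoop (write) cluster while pointing Kylin's HBase storage at the read cluster. A sketch of the relevant kylin.properties entry (the hostname is a placeholder, and the property name differs across Kylin versions):

    # Hadoop client configs on the Kylin node point at the write (job) cluster;
    # HBase client configs (hbase-site.xml) point at the read cluster.
    # Tell Kylin which HDFS the HBase cluster uses, so HFiles are bulk-loaded there:
    kylin.hbase.cluster.fs=hdfs://read-cluster-master:8020
    # In Kylin 2.x the equivalent property is kylin.storage.hbase.cluster-fs.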
