Hello Andras,

From 3.0, Kylin starts to persist some real-time metadata to ZooKeeper; I think it didn't consider such a case (on AWS). We need to provide a guideline on how to back up/restore that part. Thank you for the feedback. Stay tuned.
Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: [email protected]
Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]

Andras Nagy <[email protected]> wrote on Thu, Jun 27, 2019 at 9:24 PM:

> Hi Xiaoxiang,
>
> >In fact, we currently have no way to back up or restore the streaming
> >metadata related to replica sets/assignments etc.
> >I think this metadata is volatile; for example, the hostname of each
> >worker may be different in two clusters.
>
> Exactly, I agree it makes no sense to persist these. It would make more
> sense to rebuild them on the new cluster, based on the specifics of the
> new cluster.
>
> What I'm looking for is how to ensure that the runtime environments (both
> the Kylin processes and the EMR cluster that is hosting HBase, Spark and
> MapReduce) become stateless, so that if they fail or are destroyed, this
> does not affect the cube data, which is persistent (in this case, in S3),
> and the runtime infrastructure can fail over to new instances.
>
> By hosting the HBase data on S3 this seemed to be possible, as the data
> (cubes) built in a previous HBase environment (EMR) are now available in a
> new HBase (new EMR cluster).
> Still, even though I have the cube data, I can't query it from Kylin,
> because the query layer also relies on this volatile streaming metadata.
> Is this understanding correct?
> If it is, how far do you think Kylin is from being able to support this
> scenario?
>
> Many thanks,
> Andras
>
> On Thu, Jun 27, 2019 at 12:50 PM Xiaoxiang Yu <[email protected]> wrote:
>
>> Hi Andras,
>> In fact, we currently have no way to back up or restore the streaming
>> metadata related to replica sets/assignments etc.
>> I think this metadata is volatile; for example, the hostname of each
>> worker may be different in two clusters. But if you find backup/restore
>> is really useful for streaming metadata, please submit a JIRA.
>>
>> *-----------------*
>> *-----------------*
>> *Best wishes to you!*
>> *From: Xiaoxiang Yu*
>>
>> At 2019-06-27 17:54:08, "Andras Nagy" <[email protected]> wrote:
>>
>> OK, this worked, so I could proceed one step. I disabled all HBase
>> tables, manually altered them so the coprocessor locations point to the
>> new HDFS cluster, and re-enabled them. After this, there are no errors in
>> the RegionServer's logs, and Kylin starts up, so this seems fine.
>> (Interestingly, the DeployCoprocessorCLI did assemble the correct HDFS
>> URL, but could not alter the table definitions, so after running
>> DeployCoprocessorCLI, the table definitions had not changed. This is on
>> HBase version 1.4.9.)
>>
>> However, when I try to query the existing cubes, I get a failure with a
>> NullPointerException at
>> org.apache.kylin.stream.coordinator.assign.AssignmentsCache.getReplicaSetsByCube(AssignmentsCache.java:61).
>> Just quickly looking at it, it seems like these cube assignments come
>> from ZooKeeper, and I'm missing them. Since I'm now running on a
>> completely new EMR cluster (with new ZooKeeper), I wonder if there is
>> some persistent state in ZooKeeper that should also be backed up and
>> restored.
>>
>> (This deployment used hdfs-working-dir on HDFS, so before terminating
>> the old cluster I backed up the hdfs-working-dir and restored it in the
>> new cluster; but nothing from ZooKeeper.)
>>
>> Thanks in advance for any pointers about this.
>>
>> On Thu, Jun 27, 2019 at 10:30 AM Andras Nagy <[email protected]> wrote:
>>
>>> I checked the table definition in HBase, and that's what explicitly
>>> references the coprocessor location on the old cluster. I'll update
>>> that and let you know.
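The manual repair described above (disable each Kylin table, repoint its coprocessor attribute at the new cluster, re-enable it) can be sketched in the HBase shell as follows. This is a hypothetical sketch: the table name, namenode host, jar path and priority are placeholders; copy the actual class name and priority from the table's current `describe` output before altering anything.

```shell
# Placeholder table/host/path values -- run describe 'KYLIN_EXAMPLE1' first
# and reuse the class name and priority from its existing coprocessor entry.
hbase shell <<'EOF'
disable 'KYLIN_EXAMPLE1'
alter 'KYLIN_EXAMPLE1', METHOD => 'table_att', 'coprocessor' => 'hdfs://new-namenode:8020/kylin/coprocessor/kylin-coprocessor.jar|org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService|1001|'
enable 'KYLIN_EXAMPLE1'
EOF
```

The `'coprocessor'` value follows HBase's `path|class|priority|arguments` format; only the path segment needs to change when moving to a new cluster.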
>>>
>>> On Thu, Jun 27, 2019 at 10:26 AM Andras Nagy <[email protected]> wrote:
>>>
>>>> Actually, as I noticed, it's not the coprocessor that's failing, but
>>>> HBase when trying to load the coprocessor itself from HDFS (from a
>>>> reference somewhere that still points to the old HDFS namenode).
>>>>
>>>> On Thu, Jun 27, 2019 at 10:19 AM Andras Nagy <[email protected]> wrote:
>>>>
>>>>> Hi ShaoFeng,
>>>>>
>>>>> After disabling the "KYLIN_*" tables (but not 'kylin_metadata'), the
>>>>> RegionServers could indeed start up and the coprocessor refresh
>>>>> succeeded.
>>>>>
>>>>> But after re-enabling those tables, the issue continues, and again
>>>>> the RegionServers fail by trying to connect to the old master node.
>>>>> What I noticed now from the stack trace is that the coprocessor is
>>>>> actually trying to connect to the old HDFS namenode on port 8020 (and
>>>>> not to the HBase master).
>>>>>
>>>>> Best regards,
>>>>> Andras
>>>>>
>>>>> On Thu, Jun 27, 2019 at 4:21 AM ShaoFeng Shi <[email protected]> wrote:
>>>>>
>>>>>> I see; can you try this: disable all "KYLIN_*" tables in the HBase
>>>>>> console, and then see whether the region servers can start.
>>>>>>
>>>>>> If they can start, then run the above command to refresh the
>>>>>> coprocessor.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Shaofeng Shi 史少锋
>>>>>> Apache Kylin PMC
>>>>>> Email: [email protected]
>>>>>>
>>>>>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>>>>>> Join Kylin user mail group: [email protected]
>>>>>> Join Kylin dev mail group: [email protected]
>>>>>>
>>>>>> Andras Nagy <[email protected]> wrote on Wed, Jun 26, 2019 at 10:57 PM:
>>>>>>
>>>>>>> Hi ShaoFeng,
>>>>>>> Yes, but it fails as well. Actually it fails because the
>>>>>>> RegionServers are not running (as they fail when starting up).
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Andras
>>>>>>>
>>>>>>> On Wed, Jun 26, 2019 at 4:42 PM ShaoFeng Shi <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Andras,
>>>>>>>>
>>>>>>>> Did you try this?
>>>>>>>> https://kylin.apache.org/docs/howto/howto_update_coprocessor.html
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>> Apache Kylin PMC
>>>>>>>> Email: [email protected]
>>>>>>>>
>>>>>>>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>>>>>>>> Join Kylin user mail group: [email protected]
>>>>>>>> Join Kylin dev mail group: [email protected]
>>>>>>>>
>>>>>>>> Andras Nagy <[email protected]> wrote on Wed, Jun 26, 2019 at 10:05 PM:
>>>>>>>>
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> I'm testing a setup where HBase is running on AWS EMR and the
>>>>>>>>> HBase data is stored on S3. It's working fine so far, but when I
>>>>>>>>> terminate the EMR cluster and recreate it with the same S3
>>>>>>>>> location for HBase, HBase won't start up properly. Before
>>>>>>>>> shutting down, I did execute the disable_all_tables.sh script to
>>>>>>>>> flush HBase state to S3.
>>>>>>>>>
>>>>>>>>> Actually, the issue is that the RegionServers don't start up.
>>>>>>>>> Maybe I'm missing something in the EMR setup and not in the Kylin
>>>>>>>>> setup, but the exceptions I get in the RegionServer's log point
>>>>>>>>> at Kylin's CubeVisitService coprocessor, which is still trying to
>>>>>>>>> connect to the old HBase master on the old EMR cluster's master
>>>>>>>>> node and fails with:
>>>>>>>>> "coprocessor.CoprocessorHost: The coprocessor
>>>>>>>>> org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService
>>>>>>>>> threw java.net.NoRouteToHostException: No Route to Host from
>>>>>>>>> ip-172-35-5-11/172.35.5.11 to
>>>>>>>>> ip-172-35-7-125.us-west-2.compute.internal:8020 failed on socket
>>>>>>>>> timeout exception: java.net.NoRouteToHostException: No route to
>>>>>>>>> host;"
>>>>>>>>>
>>>>>>>>> (Here, ip-172-35-7-125 was the old cluster's master node.)
>>>>>>>>>
>>>>>>>>> Does anyone have any idea what I'm doing wrong here?
>>>>>>>>> The HBase master node's address seems to be cached somewhere, and
>>>>>>>>> when starting up HBase on the new cluster with the same S3
>>>>>>>>> location for HFiles, this old address is still used.
>>>>>>>>> Is there anything specific I have missed to get this scenario to
>>>>>>>>> work properly?
>>>>>>>>>
>>>>>>>>> This is the full stack trace:
>>>>>>>>>
>>>>>>>>> 2019-06-26 12:33:53,352 ERROR [RS_OPEN_REGION-ip-172-35-5-11:16020-1] coprocessor.CoprocessorHost: The coprocessor org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService threw java.net.NoRouteToHostException: No Route to Host from ip-172-35-5-11/172.35.5.11 to ip-172-35-7-125.us-west-2.compute.internal:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
>>>>>>>>> java.net.NoRouteToHostException: No Route to Host from ip-172-35-5-11/172.35.5.11 to ip-172-35-7-125.us-west-2.compute.internal:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
>>>>>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>>>>>>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>>>>>>> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
>>>>>>>>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:758)
>>>>>>>>> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1493)
>>>>>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1435)
>>>>>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1345)
>>>>>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
>>>>>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>>>>>>>>> at com.sun.proxy.$Proxy36.getFileInfo(Unknown Source)
>>>>>>>>> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:796)
>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
>>>>>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
>>>>>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
>>>>>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>>>>>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
>>>>>>>>> at com.sun.proxy.$Proxy37.getFileInfo(Unknown Source)
>>>>>>>>> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1649)
>>>>>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1440)
>>>>>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1437)
>>>>>>>>> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1452)
>>>>>>>>> at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1466)
>>>>>>>>> at org.apache.hadoop.hbase.util.CoprocessorClassLoader.getClassLoader(CoprocessorClassLoader.java:264)
>>>>>>>>> at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.load(CoprocessorHost.java:214)
>>>>>>>>> at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.load(CoprocessorHost.java:188)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.loadTableCoprocessors(RegionCoprocessorHost.java:376)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.<init>(RegionCoprocessorHost.java:238)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:802)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:710)
>>>>>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>>>>>>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:6716)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7020)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6992)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6948)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6899)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:364)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:131)
>>>>>>>>> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>> at java.lang.Thread.run(Thread.java:748)
>>>>>>>>> Caused by: java.net.NoRouteToHostException: No route to host
>>>>>>>>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>>>>>> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>>>>>>>> at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>>>>>>>> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
>>>>>>>>> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
>>>>>>>>> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)
>>>>>>>>> at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410)
>>>>>>>>> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550)
>>>>>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1381)
>>>>>>>>> ... 43 more
>>>>>>>>>
>>>>>>>>> Many thanks,
>>>>>>>>> Andras
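As the thread concludes, the streaming replica-set/assignment state lives in ZooKeeper and is lost when the old cluster is torn down. Until an official backup guideline exists, the state can at least be inspected (and manually dumped) before terminating the old EMR cluster. A minimal sketch, assuming ZooKeeper's stock zkCli.sh client and Kylin's default base znode of /kylin; the host and exact child znodes are deployment-specific placeholders:

```shell
# Host and znode paths are assumptions -- check kylin.env.zookeeper-base-path
# in kylin.properties for the actual root used by your deployment.
zkCli.sh -server old-emr-master:2181 <<'EOF'
ls /kylin
get /kylin
EOF
```

Note that, as Xiaoxiang points out, this metadata references hostnames from the old cluster, so even a raw restore into the new ZooKeeper would still require the assignments to be rebuilt for the new workers.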
