Hi Gang,

On the cloud, after the cluster re-creation the RT nodes' addresses have changed and the assignments in ZooKeeper are also out of date, so there is no need to back up that data in ZK, is that right?
Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: [email protected]

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]


On Fri, Jun 28, 2019 at 5:42 PM, Ma Gang <[email protected]> wrote:

> Hi Andras,
>
> Yes, the Kylin real-time assignments are currently stored in ZooKeeper; you would need to back up the streaming metadata and restore it to the new ZooKeeper. If you don't care about the previous assignments and the real-time data, you can just disable the cube and then enable it again to restart stream consumption.
>
> At 2019-06-27 21:24:14, "Andras Nagy" <[email protected]> wrote:
>
> Hi Xiaoxiang,
>
> >In fact, we currently have no way to backup or restore the streaming metadata which related to replica set/assignment etc.
> >I think these metadata are volatile, such as hostname of each worker may be different in two cluster
>
> Exactly, I agree it makes no sense to persist these. It would make more sense to rebuild them on the new cluster, based on the specifics of the new cluster.
>
> What I'm looking for is how to ensure that the runtime environments (both the Kylin processes and the EMR cluster that hosts HBase, Spark and MapReduce) become stateless, so that if they fail or are destroyed, this does not affect the cube data, which is persistent (in this case, in S3), and the runtime infrastructure can fail over to new instances.
>
> By hosting the HBase data on S3 this seemed possible, as the data (cubes) built in a previous HBase environment (EMR) are now available in a new HBase (new EMR cluster).
> Still, even though I have the cube data, I can't query it from Kylin, because the query layer also relies on this volatile streaming metadata. Is this understanding correct?
> If it is, how far do you think Kylin is from being able to support this scenario?
>
> Many thanks,
> Andras
>
> On Thu, Jun 27, 2019 at 12:50 PM Xiaoxiang Yu <[email protected]> wrote:
>
>> Hi Andras,
>> In fact, we currently have no way to back up or restore the streaming metadata related to replica sets/assignments etc.
>> I think this metadata is volatile; for example, the hostname of each worker may be different in the two clusters. But if you find that backup/restore would really be useful for streaming metadata, please submit a JIRA.
>>
>> -----------------
>> -----------------
>> Best wishes to you!
>> From: Xiaoxiang Yu
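For reference, the "disable, then enable" workaround Ma Gang describes above can be driven through Kylin's cube REST API. A minimal sketch, assuming a Kylin server on localhost:7070 with the default ADMIN/KYLIN credentials and a hypothetical streaming cube named user_events_cube:

    # Disable the cube; per the note above, the previous assignments and
    # real-time data are abandoned
    curl -X PUT -u ADMIN:KYLIN -H "Content-Type: application/json" \
        http://localhost:7070/kylin/api/cubes/user_events_cube/disable

    # Enable it again; stream consumption restarts and fresh assignments are
    # written to the ZooKeeper of the new cluster
    curl -X PUT -u ADMIN:KYLIN -H "Content-Type: application/json" \
        http://localhost:7070/kylin/api/cubes/user_events_cube/enable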
>> At 2019-06-27 17:54:08, "Andras Nagy" <[email protected]> wrote:
>>
>> OK, this worked, so I could proceed one step. I disabled all HBase tables, manually altered them so that the coprocessor locations point to the new HDFS cluster, and re-enabled them. After this there are no errors in the RegionServers' logs and Kylin starts up, so this seems fine. (Interestingly, the DeployCoprocessorCLI did assemble the correct HDFS URL but could not alter the table definitions, so after running DeployCoprocessorCLI the table definitions had not changed. This is on HBase version 1.4.9.)
>>
>> However, when I try to query the existing cubes, I get a failure with a NullPointerException at org.apache.kylin.stream.coordinator.assign.AssignmentsCache.getReplicaSetsByCube(AssignmentsCache.java:61). Just quickly looking at it, it seems like these cube assignments come from Zookeeper, and I'm missing them. Since I'm now running on a completely new EMR cluster (with a new Zookeeper), I wonder if there is some persistent state in Zookeeper that should also be backed up and restored.
>>
>> (This deployment used an hdfs-working-dir on HDFS, so before terminating the old cluster I backed up the hdfs-working-dir and have restored it on the new cluster; but nothing from Zookeeper.)
>>
>> Thanks in advance for any pointers about this.
>>
>> On Thu, Jun 27, 2019 at 10:30 AM Andras Nagy <[email protected]> wrote:
>>
>>> Checked the table definition in HBase, and that's what explicitly references the coprocessor location on the old cluster. I'll update that and let you know.
>>>
>>> On Thu, Jun 27, 2019 at 10:26 AM Andras Nagy <[email protected]> wrote:
>>>
>>>> Actually, as I noticed, it's not the coprocessor that's failing, but HBase when trying to load the coprocessor itself from HDFS (from a reference somewhere that still points to the old HDFS namenode).
>>>>
>>>> On Thu, Jun 27, 2019 at 10:19 AM Andras Nagy <[email protected]> wrote:
>>>>
>>>>> Hi ShaoFeng,
>>>>>
>>>>> After disabling the "KYLIN_*" tables (but not 'kylin_metadata'), the RegionServers could indeed start up and the coprocessor refresh succeeded.
>>>>>
>>>>> But after re-enabling those tables, the issue continues, and again the RegionServers fail by trying to connect to the old master node. What I noticed now from the stack trace is that the coprocessor is actually trying to connect to the old HDFS namenode on port 8020 (and not to the HBase master).
>>>>>
>>>>> Best regards,
>>>>> Andras
>>>>>
>>>>> On Thu, Jun 27, 2019 at 4:21 AM ShaoFeng Shi <[email protected]> wrote:
>>>>>
>>>>>> I see; can you try this: disable all "KYLIN_*" tables in the HBase console, and then see whether the RegionServers can start.
>>>>>>
>>>>>> If they can start, then run the above command to refresh the coprocessor.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Shaofeng Shi 史少锋
>>>>>> Apache Kylin PMC
>>>>>> Email: [email protected]
>>>>>>
>>>>>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>>>>>> Join Kylin user mail group: [email protected]
>>>>>> Join Kylin dev mail group: [email protected]
>>>>>>
>>>>>> On Wed, Jun 26, 2019 at 10:57 PM, Andras Nagy <[email protected]> wrote:
>>>>>>
>>>>>>> Hi ShaoFeng,
>>>>>>> Yes, but it fails as well. Actually, it fails because the RegionServers are not running (as they fail when starting up).
>>>>>>> Best regards,
>>>>>>> Andras
>>>>>>>
>>>>>>> On Wed, Jun 26, 2019 at 4:42 PM ShaoFeng Shi <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Andras,
>>>>>>>>
>>>>>>>> Did you try this?
>>>>>>>> https://kylin.apache.org/docs/howto/howto_update_coprocessor.html
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Shaofeng Shi 史少锋
>>>>>>>> Apache Kylin PMC
>>>>>>>> Email: [email protected]
>>>>>>>>
>>>>>>>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>>>>>>>> Join Kylin user mail group: [email protected]
>>>>>>>> Join Kylin dev mail group: [email protected]
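For reference, a minimal sketch of the two steps discussed above (disable the "KYLIN_*" tables so the RegionServers can start, then refresh the coprocessor as in the linked how-to). The 'KYLIN_.*' pattern is the default Kylin HTable prefix; paths and confirmation behaviour may differ per installation:

    # Disable the Kylin cube HTables so the RegionServers start without the
    # stale CubeVisitService coprocessor; disable_all asks for confirmation,
    # hence the piped "y"
    printf "disable_all 'KYLIN_.*'\ny\n" | hbase shell

    # Refresh the coprocessor on the Kylin HTables -- the tool behind the
    # "update coprocessor" how-to linked above
    $KYLIN_HOME/bin/kylin.sh org.apache.kylin.storage.hbase.util.DeployCoprocessorCLI \
        $KYLIN_HOME/lib/kylin-coprocessor-*.jar all

    # Re-enable the tables once the descriptors point at the new cluster
    printf "enable_all 'KYLIN_.*'\ny\n" | hbase shell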
>>>>>>>> On Wed, Jun 26, 2019 at 10:05 PM, Andras Nagy <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> I'm testing a setup where HBase is running on AWS EMR and the HBase data is stored on S3. It's working fine so far, but when I terminate the EMR cluster and recreate it with the same S3 location for HBase, HBase won't start up properly. Before shutting down, I did execute the disable_all_tables.sh script to flush the HBase state to S3.
>>>>>>>>>
>>>>>>>>> Actually, the issue is that the RegionServers don't start up. Maybe I'm missing something in the EMR setup and not in the Kylin setup, but the exceptions I get in the RegionServer's log point at Kylin's CubeVisitService coprocessor, which is still trying to connect to the old HBase master on the old EMR cluster's master node and fails with: "coprocessor.CoprocessorHost: The coprocessor org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService threw java.net.NoRouteToHostException: No Route to Host from ip-172-35-5-11/172.35.5.11 to ip-172-35-7-125.us-west-2.compute.internal:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host;"
>>>>>>>>>
>>>>>>>>> (Here, ip-172-35-7-125 was the old cluster's master node.)
>>>>>>>>>
>>>>>>>>> Does anyone have any idea what I'm doing wrong here?
>>>>>>>>> The HBase master node's address seems to be cached somewhere, and when starting up HBase on the new cluster with the same S3 location for HFiles, this old address is still used.
>>>>>>>>> Is there anything specific I have missed to get this scenario to work properly?
>>>>>>>>>
>>>>>>>>> This is the full stack trace:
>>>>>>>>>
>>>>>>>>> 2019-06-26 12:33:53,352 ERROR [RS_OPEN_REGION-ip-172-35-5-11:16020-1] coprocessor.CoprocessorHost: The coprocessor org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService threw java.net.NoRouteToHostException: No Route to Host from ip-172-35-5-11/172.35.5.11 to ip-172-35-7-125.us-west-2.compute.internal:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
>>>>>>>>> java.net.NoRouteToHostException: No Route to Host from ip-172-35-5-11/172.35.5.11 to ip-172-35-7-125.us-west-2.compute.internal:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
>>>>>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>>>>>>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>>>>>>> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
>>>>>>>>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:758)
>>>>>>>>> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1493)
>>>>>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1435)
>>>>>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1345)
>>>>>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
>>>>>>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>>>>>>>>> at com.sun.proxy.$Proxy36.getFileInfo(Unknown Source)
>>>>>>>>> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:796)
>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
>>>>>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
>>>>>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
>>>>>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>>>>>>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
>>>>>>>>> at com.sun.proxy.$Proxy37.getFileInfo(Unknown Source)
>>>>>>>>> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1649)
>>>>>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1440)
>>>>>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1437)
>>>>>>>>> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>>>>>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1452)
>>>>>>>>> at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1466)
>>>>>>>>> at org.apache.hadoop.hbase.util.CoprocessorClassLoader.getClassLoader(CoprocessorClassLoader.java:264)
>>>>>>>>> at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.load(CoprocessorHost.java:214)
>>>>>>>>> at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.load(CoprocessorHost.java:188)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.loadTableCoprocessors(RegionCoprocessorHost.java:376)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.<init>(RegionCoprocessorHost.java:238)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:802)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:710)
>>>>>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>>>>>>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:6716)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7020)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6992)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6948)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6899)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:364)
>>>>>>>>> at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:131)
>>>>>>>>> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>> at java.lang.Thread.run(Thread.java:748)
>>>>>>>>> Caused by: java.net.NoRouteToHostException: No route to host
>>>>>>>>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>>>>>> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>>>>>>>> at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>>>>>>>> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
>>>>>>>>> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
>>>>>>>>> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)
>>>>>>>>> at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410)
>>>>>>>>> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550)
>>>>>>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1381)
>>>>>>>>> ... 43 more
>>>>>>>>>
>>>>>>>>> Many thanks,
>>>>>>>>> Andras
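For anyone hitting the same trace: the "cached" address in the stack above is the HDFS path of the Kylin coprocessor jar stored in each cube table's descriptor, which still names the old namenode. A minimal sketch of inspecting it and repointing it by hand; the table name, jar path, namenode host and the trailing part of the coprocessor value are hypothetical placeholders, so copy the real value from the describe output and change only the namenode authority:

    # Show the table descriptor; the coprocessor attribute contains something like
    # 'hdfs://<old-namenode>:8020/.../kylin-coprocessor-<ver>.jar|org.apache.kylin.
    #  storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService|...'
    echo "describe 'KYLIN_ABC123XYZ'" | hbase shell

    # Repoint the jar location at the new namenode (disable first, as done above),
    # then re-enable the table
    echo "disable 'KYLIN_ABC123XYZ'" | hbase shell
    echo "alter 'KYLIN_ABC123XYZ', METHOD => 'table_att', 'coprocessor' => 'hdfs://<new-namenode>:8020/kylin/coprocessor/kylin-coprocessor-2.6.2.jar|org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService|1001'" | hbase shell
    echo "enable 'KYLIN_ABC123XYZ'" | hbase shell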
