Re: Can't recover - HDFS
On 7/3/2018 6:55 AM, Joe Obernberger wrote:
> I think the root issue is related to some weirdness with HDFS. Log
> file is here:
> http://lovehorsepower.com/solr.log.4
> Config is here:
> http://lovehorsepower.com/solrconfig.xml
> I don't see anything set to 20 seconds.
>
> I believe the root exception is:
>
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /solr7.1.0/UNCLASS_30DAYS/core_node-1684300827/data/tlog/tlog.0008930
> could only be replicated to 0 nodes instead of minReplication (=1).
> There are 41 datanode(s) running and no node(s) are excluded in this
> operation.

That does look like what's causing all the errors. This is purely a Hadoop/HDFS exception; there are no Solr classes in the "Caused by" part of the exception. If you have any HDFS experts in-house, you should talk to them. If not, you may need to find a Hadoop mailing list.

Looking up the exception, I've seen a couple of answers saying that when this happens you have to format your datanode and lose all your data. It could also be a configuration problem, a permission problem, or a disk space problem. Perhaps if I knew anything about HDFS, I could make sense of the Google search results. The logs on your Hadoop servers might have more information, but I do not know how to interpret them.

Thanks,
Shawn
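P.S. One way to take Solr out of the picture entirely: try writing a file to the cluster with the plain Hadoop client. This is just an untested sketch, not from the original thread -- the test path and replication factor are placeholders, and it assumes the standard org.apache.hadoop.fs client API with your cluster's config files on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteTest {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath, so run it
        // with the same Hadoop configuration that Solr is using.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical test location -- point it at a directory you can write to.
        Path testPath = new Path("/tmp/hdfs-write-test.tmp");

        // Request a single replica, matching the minReplication (=1) in the error.
        try (FSDataOutputStream out = fs.create(testPath, (short) 1)) {
            out.writeBytes("hdfs write test\n");
        }
        System.out.println("Wrote " + fs.getFileStatus(testPath).getLen()
                + " bytes to " + testPath);
        fs.delete(testPath, false);
    }
}

If that fails with the same "could only be replicated to 0 nodes" message, the problem is entirely on the HDFS side and no Solr setting will fix it.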
Re: Can't recover - HDFS
Thank you Shawn - I think the root issue is related to some weirdness with HDFS. Log file is here:
http://lovehorsepower.com/solr.log.4
Config is here:
http://lovehorsepower.com/solrconfig.xml
I don't see anything set to 20 seconds.

I believe the root exception is:

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /solr7.1.0/UNCLASS_30DAYS/core_node-1684300827/data/tlog/tlog.0008930 could only be replicated to 0 nodes instead of minReplication (=1). There are 41 datanode(s) running and no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1724)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3449)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:692)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:217)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:506)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
        at org.apache.hadoop.ipc.Client.call(Client.java:1504)
        at org.apache.hadoop.ipc.Client.call(Client.java:1441)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy11.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:423)
        at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
        at com.sun.proxy.$Proxy12.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1860)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1656)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:790)

2018-07-02 14:50:24.949 ERROR (indexFetcher-41-thread-1) [c:UNCLASS_30DAYS s:shard37 r:core_node-1684300827 x:UNCLASS_30DAYS_shard37_replica_t-1246382645] o.a.s.h.ReplicationHandler Exception in fetching index
org.apache.solr.common.SolrException: Error logging add
        at org.apache.solr.update.TransactionLog.write(TransactionLog.java:420)
        at org.apache.solr.update.UpdateLog.add(UpdateLog.java:535)
        at org.apache.solr.update.UpdateLog.add(UpdateLog.java:519)
        at org.apache.solr.update.UpdateLog.copyOverOldUpdates(UpdateLog.java:1213)
        at org.apache.solr.update.UpdateLog.copyAndSwitchToNewTlog(UpdateLog.java:1168)
        at org.apache.solr.update.UpdateLog.copyOverOldUpdates(UpdateLog.java:1155)
        at org.apache.solr.cloud.ReplicateFromLeader.lambda$startReplication$0(ReplicateFromLeader.java:100)
        at org.apache.solr.handler.ReplicationHandler.lambda$setupPolling$12(ReplicationHandler.java:1160)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Thank you very much for the help!

-Joe
Re: Can't recover - HDFS
On 7/2/2018 1:40 PM, Joe Obernberger wrote:
> Hi All - having this same problem again with a large index in HDFS. A
> replica needs to recover, and it just spins retrying over and over
> again. Any ideas? Is there an adjustable timeout?
>
> Screenshot:
> http://lovehorsepower.com/images/SolrShot1.jpg

There is considerably more log detail available than can be seen in the screenshot. Can you please make your solr.log file from this server available so we can see the full error and warning log messages, and let us know the exact Solr version that wrote the log? You'll probably need to use a file sharing site, and make sure the file stays available until after the problem has been examined. Attachments sent to the mailing list are almost always stripped.

Based on the timestamps in the screenshot, it is taking about 22 to 24 seconds to transfer 1750073344 bytes. That works out to right around the 75 MB per second rate that you were configuring in your last email thread. For that single large file to transfer successfully, you're going to need a timeout of at least 40 seconds. Based on what I see, it sounds like the timeout has been set to 20 seconds. The default client socket timeout on replication should be about two minutes, which would be plenty of time for a file of that size to transfer.

This might be a timeout issue, but without seeing the full log and knowing the exact version of Solr that created it, it is difficult to know for sure where the problem is or what can be done to fix it. We will need that logfile. If there are multiple servers involved, we may need logfiles from both ends of the replication.

Do you have any config in solrconfig.xml for the /replication handler other than the maxWriteMBPerSec config you showed last time? Have you configured anything (particularly a socket timeout or sotimeout setting) to a value near 20 or 2?

Thanks,
Shawn
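P.S. For reference, here is the back-of-the-envelope math behind those numbers as a runnable snippet (just an illustration; the 75 MB/s figure is the throttle from your last thread, and the file size is the one visible in the screenshot):

public class TransferTime {
    public static void main(String[] args) {
        long bytes = 1750073344L;               // file size from the screenshot
        double mb = bytes / (1024.0 * 1024.0);  // exactly 1669 MB
        double seconds = mb / 75.0;             // at the 75 MB/s throttle
        System.out.printf("%.0f MB / 75 MB/s = %.1f seconds%n", mb, seconds);
        // Prints about 22.3 seconds, matching the 22-24 second transfers in
        // the screenshot -- already past a 20 second timeout, which is why a
        // larger value (40+ seconds for headroom) would be needed.
    }
}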
Can't recover - HDFS
Hi All - having this same problem again with a large index in HDFS. A replica needs to recover, and it just spins retrying over and over again. Any ideas? Is there an adjustable timeout?

Screenshot:
http://lovehorsepower.com/images/SolrShot1.jpg

Thank you!

-Joe Obernberger