Re: Can't recover - HDFS

2018-07-03 Thread Shawn Heisey
On 7/3/2018 6:55 AM, Joe Obernberger wrote:
> I think the root issue is related to some weirdness with HDFS. Log
> file is here:
> http://lovehorsepower.com/solr.log.4
> Config is here:
> http://lovehorsepower.com/solrconfig.xml
> I don't see anything set to 20 seconds.
>
> I believe the root exception is:
>
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /solr7.1.0/UNCLASS_30DAYS/core_node-1684300827/data/tlog/tlog.0008930
> could only be replicated to 0 nodes instead of minReplication (=1). 
> There are 41 datanode(s) running and no node(s) are excluded in this
> operation.

That does look like what's causing all the errors.  This is purely a
Hadoop/HDFS exception; there are no Solr classes in the "Caused by"
part of the exception.  If you have any HDFS experts in-house, you
should talk to them.  If not, you may need to find a Hadoop mailing
list.

Looking up the exception, I've seen a couple of answers saying that
when this happens you have to format your datanode and lose all your
data.  Others point to a configuration problem, a permission problem,
or a disk space problem.  Perhaps if I knew anything about HDFS, I
could make sense of the Google search results.
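
If it helps, one way to check whether HDFS itself can still place
blocks, independent of Solr, is a tiny write through the plain HDFS
client API.  The sketch below is only an illustration, not something
from this thread; the namenode URI, class name, and target path are
placeholders you would have to adjust:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; use the same fs.defaultFS that Solr points at.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // Ask for a replication factor of 1, the same minReplication the error mentions.
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out =
                 fs.create(new Path("/tmp/replication-check"), (short) 1)) {
            out.writeBytes("block placement test");
        }
        // Reaching this line means the namenode could place a block; if the same
        // "could only be replicated to 0 nodes" RemoteException is thrown instead,
        // that would point at the HDFS side rather than at Solr.
        System.out.println("write succeeded");
    }
}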

The logs on your Hadoop servers might have more information, but I do
not know how to interpret them.

Thanks,
Shawn



Re: Can't recover - HDFS

2018-07-03 Thread Joe Obernberger

Thank you Shawn -

I think the root issue is related to some weirdness with HDFS. Log file 
is here:

http://lovehorsepower.com/solr.log.4
Config is here:
http://lovehorsepower.com/solrconfig.xml
I don't see anything set to 20 seconds.

I believe the root exception is:

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/solr7.1.0/UNCLASS_30DAYS/core_node-1684300827/data/tlog/tlog.0008930 
could only be replicated to 0 nodes instead of minReplication (=1).  
There are 41 datanode(s) running and no node(s) are excluded in this 
operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1724)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3449)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:692)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:217)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:506)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)

    at org.apache.hadoop.ipc.Client.call(Client.java:1504)
    at org.apache.hadoop.ipc.Client.call(Client.java:1441)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy11.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:423)
    at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy12.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1860)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1656)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:790)
2018-07-02 14:50:24.949 ERROR (indexFetcher-41-thread-1) [c:UNCLASS_30DAYS s:shard37 r:core_node-1684300827 x:UNCLASS_30DAYS_shard37_replica_t-1246382645] o.a.s.h.ReplicationHandler Exception in fetching index

org.apache.solr.common.SolrException: Error logging add
    at org.apache.solr.update.TransactionLog.write(TransactionLog.java:420)
    at org.apache.solr.update.UpdateLog.add(UpdateLog.java:535)
    at org.apache.solr.update.UpdateLog.add(UpdateLog.java:519)
    at org.apache.solr.update.UpdateLog.copyOverOldUpdates(UpdateLog.java:1213)
    at org.apache.solr.update.UpdateLog.copyAndSwitchToNewTlog(UpdateLog.java:1168)
    at org.apache.solr.update.UpdateLog.copyOverOldUpdates(UpdateLog.java:1155)
    at org.apache.solr.cloud.ReplicateFromLeader.lambda$startReplication$0(ReplicateFromLeader.java:100)
    at org.apache.solr.handler.ReplicationHandler.lambda$setupPolling$12(ReplicationHandler.java:1160)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Thank you very much for the help!

-Joe


Re: Can't recover - HDFS

2018-07-02 Thread Shawn Heisey
On 7/2/2018 1:40 PM, Joe Obernberger wrote:
> Hi All - having this same problem again with a large index in HDFS.  A
> replica needs to recover, and it just spins retrying over and over
> again.  Any ideas?  Is there an adjustable timeout?
>
> Screenshot:
> http://lovehorsepower.com/images/SolrShot1.jpg

There is considerably more log detail available than can be seen in the
screenshot.  Can you please make your solr.log file from this server
available so we can see full error and warning log messages, and let us
know the exact Solr version that wrote the log?  You'll probably need to
use a file sharing site, and make sure the file is available until after
the problem has been examined.  Attachments sent to the mailing list are
almost always stripped.

Based on the timestamps in the screenshot, it is taking about 22 to 24
seconds to transfer 1750073344 bytes, which works out to right around
the 75 MB per second rate that you were configuring in your last email
thread.  In order for that single large file to transfer successfully,
you're going to need a timeout of at least 40 seconds.  Based on what I
see, it sounds like the timeout has been set to 20 seconds.  The
default client socket timeout on replication should be about two
minutes, which would be plenty for a file of that size to transfer.
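
For reference, a quick back-of-the-envelope check of those figures (a
sketch only; it simply repeats the arithmetic above in Java):

public class TransferRate {
    public static void main(String[] args) {
        long bytes = 1750073344L;   // size of the file seen in the screenshot
        double seconds = 22.0;      // low end of the observed 22-24 second range
        double mib = bytes / (1024.0 * 1024.0);
        // Prints roughly: 1669 MiB in 22 s = 75.9 MiB/s
        System.out.printf("%.0f MiB in %.0f s = %.1f MiB/s%n",
                          mib, seconds, mib / seconds);
    }
}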

This might be a timeout issue, but without seeing the full log and
knowing the exact version of Solr that created it, it is difficult to
know for sure where the problem might be or what can be done to fix it. 
We will need that logfile.  If there are multiple servers involved, we
may need logfiles from both ends of the replication.

Do you have any config in solrconfig.xml for the /replication handler
other than the maxWriteMBPerSec config you showed last time?

Have you configured anything (particularly a socket timeout or sotimeout
setting) to a value near 20 or 2?

Thanks,
Shawn



Can't recover - HDFS

2018-07-02 Thread Joe Obernberger
Hi All - having this same problem again with a large index in HDFS.  A 
replica needs to recover, and it just spins retrying over and over 
again.  Any ideas?  Is there an adjustable timeout?


Screenshot:
http://lovehorsepower.com/images/SolrShot1.jpg

Thank you!

-Joe Obernberger