Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection

John Vines Thu, 16 Aug 2012 12:25:12 -0700

You're work ont he clone table failed because you took it offline. The
table needs to be online in order to do a job against it.


John

On Thu, Aug 16, 2012 at 2:36 PM, Arjumand Bonhomme <[email protected]> wrote:

> Jim, William, Adam: Thanks for your help with this, I'm running out of
> ideas to google search for answers.
>
> I'm going to try to answer your questions:
>
> 1) I had already adjusted the maxclientcxns in zoo.cfg to be 100.  I
> didn't see anything in log indicating that it was shutting down connections
> due to reaching a connection limit.  However, throughout the zookeeper
> logs, even before the hadoop job was run.  I did see lots of lines like
> these.  Up til this point, I've assumed it's innocuous and unrelated to my
> issue.
> INFO  [Thread-371:NIOServerCnxn@1435] - Closed socket connection for
> client /127.0.0.1:59158 (no session established for client)
>
> 2) I had also already adjusted the dfs.datanode.max.xcievers property to
> 4096 in hdfs-site.xml.  An investigation of the log shows that this limit
> is not being reached.  It had been being reached at some point before I
> increased the value from the default.  I had increased the value while
> troubleshooting, but prior to posting to this list; it didn't appear to
> have a noticeable affect on the behavior of the hadoop job.
>
> 3) I'm writing out to a sequence file, so accumulo is only being used for
> input.  As a side note, at one point during my troubleshooting, I
> compacted, cloned, and then took the cloned table offline and tried to use
> that instead.  That failed immediately, without processing any records.
>  From the stacktrace, it appeared as though the iterator was trying to use
> one of the files for the original table (from what I understand about
> cloning, this is normal b/c no changes had been made to the original table)
> but said it did not exist.  I was, however, able to find the file on hdfs.
>  So I just gave up on that.  Also, under the normal case, ie using the
> original table online, nothing is writing to the table while the hadoop job
> is running.
>
> 4) My original open file limit was the os default of 256.  So I upped it
> to 1024, and performed another attempt.  The behavior was the same as
> before.
>
> I'm including a snippet from the tserver debug log.  It looks like an
> expired session might be the root of the problem, but I'm not sure what
> would cause that:
> 16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:4971945 
> in 0.006s, at=[0 0 0.00 1] ft=0.004s(pt=0.001s lt=0.002s ct=0.001s)
> 16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1]
> 16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1]
> 16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc ParNew=0.07(+0.01)
> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=111,234,160(+14,477,728)
> totalmem=132,055,040
> 16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
> 16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
> 16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:50977 !0 0 entries in 0.01 secs, nbTimes = [2 2 2.00 1]
> 16 13:47:51,477 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:50970 !0 26 entries in 0.01 secs, nbTimes = [11 11 11.00 1]
> 16 13:47:51,494 [tabletserver.LargestFirstMemoryManager] DEBUG: IDLE minor
> compaction chosen
> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: COMPACTING
> !0<;~  total = 120,840 ingestMemory = 120,840
> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: chosenMem
> = 2,416 chosenIT = 300.23 load 3,044
> 16 13:47:51,498 [tabletserver.Tablet] DEBUG: MinC initiate lock 0.00 secs
> 16 13:47:51,502 [tabletserver.MinorCompactor] DEBUG: Begin minor
> compaction /accumulo/tables/!0/default_tablet/F000051i.rf_tmp !0<;~
> 16 13:47:51,525 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
> 16 13:47:51,532 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
> 16 13:47:51,538 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE
> compactionThreshold = 0.605 maxObserved = 120,840
> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER
> compactionThreshold = 0.666
> 16 13:47:51,942 [tabletserver.Compactor] DEBUG: Compaction !0<;~ 24 read |
> 4 written |  4,800 entries/sec |  0.005 secs
> 16 13:47:51,950 [tabletserver.Tablet] DEBUG: Logs for memory compacted:
> !0<;~ 127.0.0.1:11224/770e1b91-351a-45ac-8992-c0fb602ac51c
>  16 13:47:51,956 [log.TabletServerLogger] DEBUG:  wrote MinC finish  35:
> writeTime:1ms
> 16 13:47:51,956 [tabletserver.Tablet] TABLET_HIST: !0<;~ MinC [memory] ->
> /default_tablet/F000051i.rf
> 16 13:47:51,957 [tabletserver.Tablet] DEBUG: MinC finish lock 0.00 secs
> !0<;~
> 16 13:47:52,650 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:5148043 
> in 0.009s, at=[0 0 0.00 1] ft=0.005s(pt=0.001s lt=0.003s ct=0.001s)
> 16 13:47:53,233 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
> Total=0.0769043MB (80640), Free=19.923096MB (20890880), Max=20.0MB
> (20971520), Counts: Blocks=33, Access=43, Hit=10, Miss=33, Evictions=0,
> Evicted=0, Ratios: Hit Ratio=23.255814611911774%, Miss
> Ratio=76.74418687820435%, Evicted/Run=NaN, Duplicate Reads=0
> 16 13:47:53,254 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
> Total=0.0128479MB (13472), Free=6.987152MB (7326560), Max=7.0MB (7340032),
> Counts: Blocks=20, Access=127, Hit=106, Miss=21, Evictions=0, Evicted=0,
> Ratios: Hit Ratio=83.4645688533783%, Miss Ratio=16.535432636737823%,
> Evicted/Run=NaN, Duplicate Reads=1
> 16 13:47:53,491 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 secs,
> wait 0.00 secs
> 16 13:47:53,493 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 secs,
> wait 0.00 secs
> 16 13:47:53,494 [tabletserver.Tablet] DEBUG: Starting MajC !0;~;!0<
> (NORMAL) [/table_info/A000051d.rf, /table_info/F000051h.rf] -->
> /table_info/A000051j.rf_tmp
> 16 13:47:53,495 [tabletserver.Tablet] DEBUG: Starting MajC !0;!0<<
> (NORMAL) [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] -->
> /root_tablet/A000051k.rf_tmp
> 16 13:47:53,683 [tabletserver.Compactor] DEBUG: Compaction !0;!0<< 39 read
> | 21 written |    276 entries/sec |  0.141 secs
> 16 13:47:53,701 [tabletserver.Compactor] DEBUG: Compaction !0;~;!0< 56
> read | 51 written |    323 entries/sec |  0.173 secs
> 16 13:47:53,708 [tabletserver.Tablet] DEBUG: MajC finish lock 0.02 secs
> 16 13:47:53,708 [tabletserver.Tablet] TABLET_HIST: !0;!0<< MajC
> [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] -->
> /root_tablet/A000051k.rf
> 16 13:47:53,712 [tabletserver.TabletServer] INFO : Adding 1 logs for
> extent !0<;~ as alias 2
> 16 13:47:53,726 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00 secs
> 16 13:47:53,726 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< MajC
> [/table_info/A000051d.rf, /table_info/F000051h.rf] -->
> /table_info/A000051j.rf
> 16 13:47:54,497 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.02)
> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=97,183,880(-1,627,536)
> totalmem=132,055,040
> 16 13:47:54,694 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:51480146 
> in 0.034s, at=[0 0 0.00 1] ft=0.013s(pt=0.001s lt=0.008s ct=0.004s)
> 16 13:47:56,527 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:53005 !0 0 entries in 0.02 secs, nbTimes = [15 15 15.00 1]
> 16 13:48:01,503 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.01)
> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=98,415,928(+12,106,904)
> totalmem=132,055,040
> 16 13:48:01,576 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:55688 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess
> 127.0.0.1:65352 2 entries in 0.03 secs (lookup_time:0.03 secs tablets:1
> ranges:1)
> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess
> 127.0.0.1:65358 4 entries in 0.03 secs (lookup_time:0.03 secs tablets:1
> ranges:1)
> 16 13:48:06,620 [tabletserver.TabletServer] DEBUG: ScanSess tid
> 127.0.0.1:58247 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
> 16 13:48:44,595 [zookeeper.ZooLock] DEBUG: event null None Disconnected
> 16 13:48:44,927 [zookeeper.ZooLock] DEBUG: event null None Expired
> 16 13:48:44,943 [tabletserver.TabletServer] FATAL: Lost tablet server lock
> (reason = SESSION_EXPIRED), exiting.
>
>
> I'm also including a snippet from the zookeeper log as well.  Around the
> same the tserver logs show the loss of the tablet server lock due to
> expired session, the zookeeper logs mentions the expiration of several
> sessions due to timeout.
> 2012-08-16 13:48:07,368 - INFO  [Thread-370:NIOServerCnxn$StatCommand@1153]
> - Stat command output
> 2012-08-16 13:48:07,368 - INFO  [Thread-370:NIOServerCnxn@1435] - Closed
> socket connection for client /127.0.0.1:58633 (no session established for
> client)
> 2012-08-16 13:48:08,371 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
> connection from /127.0.0.1:59158
> 2012-08-16 13:48:08,372 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from /
> 127.0.0.1:59158
> 2012-08-16 13:48:08,378 - INFO  [Thread-371:NIOServerCnxn$StatCommand@1153]
> - Stat command output
> 2012-08-16 13:48:08,379 - INFO  [Thread-371:NIOServerCnxn@1435] - Closed
> socket connection for client /127.0.0.1:59158 (no session established for
> client)
> 2012-08-16 13:48:09,395 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
> connection from /127.0.0.1:59635
> 2012-08-16 13:48:09,395 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from /
> 127.0.0.1:59635
> 2012-08-16 13:48:09,399 - INFO  [Thread-372:NIOServerCnxn$StatCommand@1153]
> - Stat command output
> 2012-08-16 13:48:09,400 - INFO  [Thread-372:NIOServerCnxn@1435] - Closed
> socket connection for client /127.0.0.1:59635 (no session established for
> client)
> 2012-08-16 13:48:32,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
> Expiring session 0x1393086dfb80005, timeout of 30000ms exceeded
> 2012-08-16 13:48:32,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
> Expiring session 0x1393086dfb80004, timeout of 30000ms exceeded
> 2012-08-16 13:48:32,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
> Expiring session 0x1393086dfb80001, timeout of 30000ms exceeded
> 2012-08-16 13:48:32,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
> - Processed session termination for sessionid: 0x1393086dfb80005
> 2012-08-16 13:48:32,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
> - Processed session termination for sessionid: 0x1393086dfb80004
> 2012-08-16 13:48:32,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
> - Processed session termination for sessionid: 0x1393086dfb80001
> 2012-08-16 13:48:32,003 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
> Closed socket connection for client /127.0.0.1:65339 which had sessionid
> 0x1393086dfb80005
> 2012-08-16 13:48:34,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
> Expiring session 0x1393086dfb80002, timeout of 30000ms exceeded
> 2012-08-16 13:48:36,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
> Expiring session 0x1393086dfb80007, timeout of 30000ms exceeded
> 2012-08-16 13:48:38,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
> Expiring session 0x1393086dfb8004d, timeout of 30000ms exceeded
> 2012-08-16 13:48:38,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
> Expiring session 0x1393086dfb80006, timeout of 30000ms exceeded
> 2012-08-16 13:48:40,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
> Expiring session 0x1393086dfb80003, timeout of 30000ms exceeded
> 2012-08-16 13:48:44,456 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65332 which had
> sessionid 0x1393086dfb80004
> 2012-08-16 13:48:44,458 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
> Closed socket connection for client /127.0.0.1:65316 which had sessionid
> 0x1393086dfb80001
> 2012-08-16 13:48:44,461 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
> - Processed session termination for sessionid: 0x1393086dfb80002
> 2012-08-16 13:48:44,482 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
> - Processed session termination for sessionid: 0x1393086dfb80007
> 2012-08-16 13:48:44,484 - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@639] - Exception causing close of
> session 0x1393086dfb80003 due to java.io.IOException: Connection reset by
> peer
> 2012-08-16 13:48:44,486 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65317 which had
> sessionid 0x1393086dfb80002
> 2012-08-16 13:48:44,491 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
> - Processed session termination for sessionid: 0x1393086dfb8004d
> 2012-08-16 13:48:44,493 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
> - Processed session termination for sessionid: 0x1393086dfb80006
> 2012-08-16 13:48:44,495 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
> client /fe80:0:0:0:0:0:0:1%1:65330 which had sessionid 0x1393086dfb80003
> 2012-08-16 13:48:44,495 - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to
> read additional data from client sessionid 0x1393086dfb80007, likely client
> has closed socket
> 2012-08-16 13:48:44,496 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
> client /0:0:0:0:0:0:0:1%0:65343 which had sessionid 0x1393086dfb80007
> 2012-08-16 13:48:44,496 - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to
> read additional data from client sessionid 0x1393086dfb8004d, likely client
> has closed socket
> 2012-08-16 13:48:44,497 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
> client /127.0.0.1:49770 which had sessionid 0x1393086dfb8004d
> 2012-08-16 13:48:44,498 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
> - Processed session termination for sessionid: 0x1393086dfb80003
> 2012-08-16 13:48:44,500 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65341 which had
> sessionid 0x1393086dfb80006
> 2012-08-16 13:48:44,613 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
> connection from /0:0:0:0:0:0:0:1%0:59789
> 2012-08-16 13:48:44,615 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew
> session 0x1393086dfb80001 at /0:0:0:0:0:0:0:1%0:59789
> 2012-08-16 13:48:44,616 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1573] - Invalid session
> 0x1393086dfb80001 for client /0:0:0:0:0:0:0:1%0:59789, probably expired
> 2012-08-16 13:48:44,616 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
> client /0:0:0:0:0:0:0:1%0:59789 which had sessionid 0x1393086dfb80001
>
>
> Sorry for the huge post of log drivel, but if you'd like to see more I can
> post (or via pastebin) the full logs.  I made sure to shut down everything,
> wipe out the logs, then start everything back up so the logs would be as
> clean as possible.
>
>
> I looked for a way to increase the session timeout on the
> AccumuloInputFormat, but there doesn't seem to be one.  I'm sure that's not
> the right fix anyway, but figured I'd mention it.
>
>
> Do you guys have any other suggestions.  I'm sure this is due to some
> configuration error on my part, but I'm running out of idea.
>
>
> Thanks a ton,
> -Arjumand
>
>
>
> On Thu, Aug 16, 2012 at 7:24 AM, William Slacum <
> [email protected]> wrote:
>
>> What does your TServer debug log say? Also, are you writing back out to
>> Accumulo?
>>
>> To follow up what Jim said, you can check the zookeeper log to see if max
>> connections is being hit. You may also want to check and see what your max
>> xceivers is set to for HDFS and check your Accumulo and HDFS logs to see if
>> it is mentioned.
>>
>>
>> On Thu, Aug 16, 2012 at 3:59 AM, Arjumand Bonhomme <[email protected]>wrote:
>>
>>> Hello,
>>>
>>> I'm fairly new to both Accumulo and Hadoop, so I think my problem may be
>>> due to poor configuration on my part, but I'm running out of ideas.
>>>
>>> I'm running this on a mac laptop, with hadoop (hadoop-0.20.2 from
>>> cdh3u4) in pseudo-distributed mode.
>>> zookeeper version zookeeper-3.3.5 from cdh3u4
>>> I'm using the 1.4.1 release of accumulo with a configuration copied from
>>> "conf/examples/512MB/standalone"
>>>
>>> I've got a Map task that is using an accumulo table as the input.
>>> I'm fetching all rows, but just a single column family, that has
>>> hundreds or even thousands of different column qualifiers.
>>> The table has a SummingCombiner installed for the given the column
>>> family.
>>>
>>> The task runs fine at first, but after ~9-15K records (I print the
>>> record count to the console every 1K records), it hangs and the following
>>> messages are printed to the console where I'm running the job:
>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional
>>> data from server sessionid 0x1392cc35b460d1c, likely server has closed
>>> socket, closing socket connection and attempting reconnect
>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server localhost/fe80:0:0:0:0:0:0:1%1:2181
>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection
>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>> ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket
>>> connection
>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down
>>> 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client
>>> connection, connectString=localhost sessionTimeout=30000
>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c
>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server localhost/0:0:0:0:0:0:0:1:2181
>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection
>>> established to localhost/0:0:0:0:0:0:0:1:2181, initiating session
>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment
>>> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid =
>>> 0x1392cc35b460d25, negotiated timeout = 30000
>>> 12/08/16 02:57:11 INFO mapred.LocalJobRunner:
>>> 12/08/16 02:57:14 INFO mapred.LocalJobRunner:
>>> 12/08/16 02:57:17 INFO mapred.LocalJobRunner:
>>>
>>> Sometimes the messages contain a stacktrace like this below:
>>> 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40
>>> for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing
>>> socket connection and attempting reconnect
>>> java.io.IOException: Connection reset by peer
>>>  at sun.nio.ch.FileDispatcher.read0(Native Method)
>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>>>  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
>>> at sun.nio.ch.IOUtil.read(IOUtil.java:166)
>>>  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
>>> at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856)
>>>  at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154)
>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server localhost/127.0.0.1:2181
>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection
>>> established to localhost/127.0.0.1:2181, initiating session
>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>> ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket
>>> connection
>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down
>>> 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client
>>> connection, connectString=localhost sessionTimeout=30000
>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8
>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server localhost/fe80:0:0:0:0:0:0:1%1:2181
>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection
>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment
>>> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid =
>>> 0x1392cc35b460b46, negotiated timeout = 30000
>>>
>>>
>>> I've poked through the logs in accumulo, and I've noticed that when it
>>> hangs, the following is written to the "logger_HOSTNAME.debug.log" file:
>>> 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected
>>> 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of
>>> current session : Expired
>>> 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired
>>> 16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper
>>> registration at null
>>> 16 03:29:47,452 [logger.LogService] INFO : Logger shutting down
>>> 16 03:29:47,453 [logger.LogWriter] INFO : Shutting down
>>>
>>>
>>> I've noticed that if I make the map task print out the record count more
>>> frequently (ie every 10 records), it seems to be able get through more
>>> records than when I only print every 1K records. My assumption was that
>>> this had something to do with more time being spent in the map task, and
>>> not fetching data from accumulo.  There was at least one occasion where I
>>> printed to the console for every record, and in that situation it managed
>>> to process 47K records, although I have been unable to repeat that behavior.
>>>
>>> I've also noticed that if I stop and start accumulo, the map-reduce job
>>> will pickup where it left off, but seems to fail quicker.
>>>
>>>
>>>
>>> Could someone make some suggestions as to what my problem might be? It
>>> would be greatly appreciated.  If you need any additional information from
>>> me, just let me know.  I'd paste my config files, driver setup, and example
>>> data into this post, but I think it's probably long enough already.
>>>
>>>
>>> Thanks in advance,
>>> -Arjumand
>>>
>>>
>>
>

Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection

Reply via email to