Jim, William, Adam: Thanks for your help with this, I'm running out of ideas to google search for answers.
I'm going to try to answer your questions: 1) I had already adjusted the maxclientcxns in zoo.cfg to be 100. I didn't see anything in log indicating that it was shutting down connections due to reaching a connection limit. However, throughout the zookeeper logs, even before the hadoop job was run. I did see lots of lines like these. Up til this point, I've assumed it's innocuous and unrelated to my issue. INFO [Thread-371:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:59158 (no session established for client) 2) I had also already adjusted the dfs.datanode.max.xcievers property to 4096 in hdfs-site.xml. An investigation of the log shows that this limit is not being reached. It had been being reached at some point before I increased the value from the default. I had increased the value while troubleshooting, but prior to posting to this list; it didn't appear to have a noticeable affect on the behavior of the hadoop job. 3) I'm writing out to a sequence file, so accumulo is only being used for input. As a side note, at one point during my troubleshooting, I compacted, cloned, and then took the cloned table offline and tried to use that instead. That failed immediately, without processing any records. From the stacktrace, it appeared as though the iterator was trying to use one of the files for the original table (from what I understand about cloning, this is normal b/c no changes had been made to the original table) but said it did not exist. I was, however, able to find the file on hdfs. So I just gave up on that. Also, under the normal case, ie using the original table online, nothing is writing to the table while the hadoop job is running. 4) My original open file limit was the os default of 256. So I upped it to 1024, and performed another attempt. The behavior was the same as before. I'm including a snippet from the tserver debug log. It looks like an expired session might be the root of the problem, but I'm not sure what would cause that: 16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] 16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:4971945 in 0.006s, at=[0 0 0.00 1] ft=0.004s(pt=0.001s lt=0.002s ct=0.001s) 16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] 16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] 16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] 16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1] 16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1] 16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] 16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc ParNew=0.07(+0.01) secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=111,234,160(+14,477,728) totalmem=132,055,040 16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] 16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1] 16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50977 !0 0 entries in 0.01 secs, nbTimes = [2 2 2.00 1] 16 13:47:51,477 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 26 entries in 0.01 secs, nbTimes = [11 11 11.00 1] 16 13:47:51,494 [tabletserver.LargestFirstMemoryManager] DEBUG: IDLE minor compaction chosen 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: COMPACTING !0<;~ total = 120,840 ingestMemory = 120,840 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: chosenMem = 2,416 chosenIT = 300.23 load 3,044 16 13:47:51,498 [tabletserver.Tablet] DEBUG: MinC initiate lock 0.00 secs 16 13:47:51,502 [tabletserver.MinorCompactor] DEBUG: Begin minor compaction /accumulo/tables/!0/default_tablet/F000051i.rf_tmp !0<;~ 16 13:47:51,525 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [2 2 2.00 1] 16 13:47:51,532 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [3 3 3.00 1] 16 13:47:51,538 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE compactionThreshold = 0.605 maxObserved = 120,840 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER compactionThreshold = 0.666 16 13:47:51,942 [tabletserver.Compactor] DEBUG: Compaction !0<;~ 24 read | 4 written | 4,800 entries/sec | 0.005 secs 16 13:47:51,950 [tabletserver.Tablet] DEBUG: Logs for memory compacted: !0<;~ 127.0.0.1:11224/770e1b91-351a-45ac-8992-c0fb602ac51c 16 13:47:51,956 [log.TabletServerLogger] DEBUG: wrote MinC finish 35: writeTime:1ms 16 13:47:51,956 [tabletserver.Tablet] TABLET_HIST: !0<;~ MinC [memory] -> /default_tablet/F000051i.rf 16 13:47:51,957 [tabletserver.Tablet] DEBUG: MinC finish lock 0.00 secs !0<;~ 16 13:47:52,650 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:5148043 in 0.009s, at=[0 0 0.00 1] ft=0.005s(pt=0.001s lt=0.003s ct=0.001s) 16 13:47:53,233 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: Total=0.0769043MB (80640), Free=19.923096MB (20890880), Max=20.0MB (20971520), Counts: Blocks=33, Access=43, Hit=10, Miss=33, Evictions=0, Evicted=0, Ratios: Hit Ratio=23.255814611911774%, Miss Ratio=76.74418687820435%, Evicted/Run=NaN, Duplicate Reads=0 16 13:47:53,254 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: Total=0.0128479MB (13472), Free=6.987152MB (7326560), Max=7.0MB (7340032), Counts: Blocks=20, Access=127, Hit=106, Miss=21, Evictions=0, Evicted=0, Ratios: Hit Ratio=83.4645688533783%, Miss Ratio=16.535432636737823%, Evicted/Run=NaN, Duplicate Reads=1 16 13:47:53,491 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 secs, wait 0.00 secs 16 13:47:53,493 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 secs, wait 0.00 secs 16 13:47:53,494 [tabletserver.Tablet] DEBUG: Starting MajC !0;~;!0< (NORMAL) [/table_info/A000051d.rf, /table_info/F000051h.rf] --> /table_info/A000051j.rf_tmp 16 13:47:53,495 [tabletserver.Tablet] DEBUG: Starting MajC !0;!0<< (NORMAL) [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] --> /root_tablet/A000051k.rf_tmp 16 13:47:53,683 [tabletserver.Compactor] DEBUG: Compaction !0;!0<< 39 read | 21 written | 276 entries/sec | 0.141 secs 16 13:47:53,701 [tabletserver.Compactor] DEBUG: Compaction !0;~;!0< 56 read | 51 written | 323 entries/sec | 0.173 secs 16 13:47:53,708 [tabletserver.Tablet] DEBUG: MajC finish lock 0.02 secs 16 13:47:53,708 [tabletserver.Tablet] TABLET_HIST: !0;!0<< MajC [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] --> /root_tablet/A000051k.rf 16 13:47:53,712 [tabletserver.TabletServer] INFO : Adding 1 logs for extent !0<;~ as alias 2 16 13:47:53,726 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00 secs 16 13:47:53,726 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< MajC [/table_info/A000051d.rf, /table_info/F000051h.rf] --> /table_info/A000051j.rf 16 13:47:54,497 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.02) secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=97,183,880(-1,627,536) totalmem=132,055,040 16 13:47:54,694 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:51480146 in 0.034s, at=[0 0 0.00 1] ft=0.013s(pt=0.001s lt=0.008s ct=0.004s) 16 13:47:56,527 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:53005 !0 0 entries in 0.02 secs, nbTimes = [15 15 15.00 1] 16 13:48:01,503 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.01) secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=98,415,928(+12,106,904) totalmem=132,055,040 16 13:48:01,576 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:55688 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:65352 2 entries in 0.03 secs (lookup_time:0.03 secs tablets:1 ranges:1) 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:65358 4 entries in 0.03 secs (lookup_time:0.03 secs tablets:1 ranges:1) 16 13:48:06,620 [tabletserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:58247 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] 16 13:48:44,595 [zookeeper.ZooLock] DEBUG: event null None Disconnected 16 13:48:44,927 [zookeeper.ZooLock] DEBUG: event null None Expired 16 13:48:44,943 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = SESSION_EXPIRED), exiting. I'm also including a snippet from the zookeeper log as well. Around the same the tserver logs show the loss of the tablet server lock due to expired session, the zookeeper logs mentions the expiration of several sessions due to timeout. 2012-08-16 13:48:07,368 - INFO [Thread-370:NIOServerCnxn$StatCommand@1153] - Stat command output 2012-08-16 13:48:07,368 - INFO [Thread-370:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:58633 (no session established for client) 2012-08-16 13:48:08,371 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /127.0.0.1:59158 2012-08-16 13:48:08,372 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from / 127.0.0.1:59158 2012-08-16 13:48:08,378 - INFO [Thread-371:NIOServerCnxn$StatCommand@1153] - Stat command output 2012-08-16 13:48:08,379 - INFO [Thread-371:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:59158 (no session established for client) 2012-08-16 13:48:09,395 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /127.0.0.1:59635 2012-08-16 13:48:09,395 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from / 127.0.0.1:59635 2012-08-16 13:48:09,399 - INFO [Thread-372:NIOServerCnxn$StatCommand@1153] - Stat command output 2012-08-16 13:48:09,400 - INFO [Thread-372:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:59635 (no session established for client) 2012-08-16 13:48:32,000 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80005, timeout of 30000ms exceeded 2012-08-16 13:48:32,000 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80004, timeout of 30000ms exceeded 2012-08-16 13:48:32,001 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80001, timeout of 30000ms exceeded 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80005 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80004 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80001 2012-08-16 13:48:32,003 - INFO [SyncThread:0:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:65339 which had sessionid 0x1393086dfb80005 2012-08-16 13:48:34,001 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80002, timeout of 30000ms exceeded 2012-08-16 13:48:36,001 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80007, timeout of 30000ms exceeded 2012-08-16 13:48:38,000 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb8004d, timeout of 30000ms exceeded 2012-08-16 13:48:38,001 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80006, timeout of 30000ms exceeded 2012-08-16 13:48:40,001 - INFO [SessionTracker:ZooKeeperServer@316] - Expiring session 0x1393086dfb80003, timeout of 30000ms exceeded 2012-08-16 13:48:44,456 - INFO [SyncThread:0:NIOServerCnxn@1435] - Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65332 which had sessionid 0x1393086dfb80004 2012-08-16 13:48:44,458 - INFO [SyncThread:0:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:65316 which had sessionid 0x1393086dfb80001 2012-08-16 13:48:44,461 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80002 2012-08-16 13:48:44,482 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80007 2012-08-16 13:48:44,484 - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@639] - Exception causing close of session 0x1393086dfb80003 due to java.io.IOException: Connection reset by peer 2012-08-16 13:48:44,486 - INFO [SyncThread:0:NIOServerCnxn@1435] - Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65317 which had sessionid 0x1393086dfb80002 2012-08-16 13:48:44,491 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb8004d 2012-08-16 13:48:44,493 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80006 2012-08-16 13:48:44,495 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65330 which had sessionid 0x1393086dfb80003 2012-08-16 13:48:44,495 - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x1393086dfb80007, likely client has closed socket 2012-08-16 13:48:44,496 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /0:0:0:0:0:0:0:1%0:65343 which had sessionid 0x1393086dfb80007 2012-08-16 13:48:44,496 - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x1393086dfb8004d, likely client has closed socket 2012-08-16 13:48:44,497 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /127.0.0.1:49770 which had sessionid 0x1393086dfb8004d 2012-08-16 13:48:44,498 - INFO [ProcessThread:-1:PrepRequestProcessor@399] - Processed session termination for sessionid: 0x1393086dfb80003 2012-08-16 13:48:44,500 - INFO [SyncThread:0:NIOServerCnxn@1435] - Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65341 which had sessionid 0x1393086dfb80006 2012-08-16 13:48:44,613 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /0:0:0:0:0:0:0:1%0:59789 2012-08-16 13:48:44,615 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew session 0x1393086dfb80001 at /0:0:0:0:0:0:0:1%0:59789 2012-08-16 13:48:44,616 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1573] - Invalid session 0x1393086dfb80001 for client /0:0:0:0:0:0:0:1%0:59789, probably expired 2012-08-16 13:48:44,616 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /0:0:0:0:0:0:0:1%0:59789 which had sessionid 0x1393086dfb80001 Sorry for the huge post of log drivel, but if you'd like to see more I can post (or via pastebin) the full logs. I made sure to shut down everything, wipe out the logs, then start everything back up so the logs would be as clean as possible. I looked for a way to increase the session timeout on the AccumuloInputFormat, but there doesn't seem to be one. I'm sure that's not the right fix anyway, but figured I'd mention it. Do you guys have any other suggestions. I'm sure this is due to some configuration error on my part, but I'm running out of idea. Thanks a ton, -Arjumand On Thu, Aug 16, 2012 at 7:24 AM, William Slacum < [email protected]> wrote: > What does your TServer debug log say? Also, are you writing back out to > Accumulo? > > To follow up what Jim said, you can check the zookeeper log to see if max > connections is being hit. You may also want to check and see what your max > xceivers is set to for HDFS and check your Accumulo and HDFS logs to see if > it is mentioned. > > > On Thu, Aug 16, 2012 at 3:59 AM, Arjumand Bonhomme <[email protected]>wrote: > >> Hello, >> >> I'm fairly new to both Accumulo and Hadoop, so I think my problem may be >> due to poor configuration on my part, but I'm running out of ideas. >> >> I'm running this on a mac laptop, with hadoop (hadoop-0.20.2 from cdh3u4) >> in pseudo-distributed mode. >> zookeeper version zookeeper-3.3.5 from cdh3u4 >> I'm using the 1.4.1 release of accumulo with a configuration copied from >> "conf/examples/512MB/standalone" >> >> I've got a Map task that is using an accumulo table as the input. >> I'm fetching all rows, but just a single column family, that has hundreds >> or even thousands of different column qualifiers. >> The table has a SummingCombiner installed for the given the column family. >> >> The task runs fine at first, but after ~9-15K records (I print the record >> count to the console every 1K records), it hangs and the following messages >> are printed to the console where I'm running the job: >> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional >> data from server sessionid 0x1392cc35b460d1c, likely server has closed >> socket, closing socket connection and attempting reconnect >> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection to >> server localhost/fe80:0:0:0:0:0:0:1%1:2181 >> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection >> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session >> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to >> ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket >> connection >> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down >> 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client connection, >> connectString=localhost sessionTimeout=30000 >> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c >> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection to >> server localhost/0:0:0:0:0:0:0:1:2181 >> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection >> established to localhost/0:0:0:0:0:0:0:1:2181, initiating session >> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment >> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = >> 0x1392cc35b460d25, negotiated timeout = 30000 >> 12/08/16 02:57:11 INFO mapred.LocalJobRunner: >> 12/08/16 02:57:14 INFO mapred.LocalJobRunner: >> 12/08/16 02:57:17 INFO mapred.LocalJobRunner: >> >> Sometimes the messages contain a stacktrace like this below: >> 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40 >> for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing >> socket connection and attempting reconnect >> java.io.IOException: Connection reset by peer >> at sun.nio.ch.FileDispatcher.read0(Native Method) >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198) >> at sun.nio.ch.IOUtil.read(IOUtil.java:166) >> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245) >> at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856) >> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154) >> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection to >> server localhost/127.0.0.1:2181 >> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection >> established to localhost/127.0.0.1:2181, initiating session >> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to >> ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket >> connection >> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down >> 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client connection, >> connectString=localhost sessionTimeout=30000 >> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8 >> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection to >> server localhost/fe80:0:0:0:0:0:0:1%1:2181 >> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection >> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session >> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment >> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid = >> 0x1392cc35b460b46, negotiated timeout = 30000 >> >> >> I've poked through the logs in accumulo, and I've noticed that when it >> hangs, the following is written to the "logger_HOSTNAME.debug.log" file: >> 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected >> 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of >> current session : Expired >> 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired >> 16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper >> registration at null >> 16 03:29:47,452 [logger.LogService] INFO : Logger shutting down >> 16 03:29:47,453 [logger.LogWriter] INFO : Shutting down >> >> >> I've noticed that if I make the map task print out the record count more >> frequently (ie every 10 records), it seems to be able get through more >> records than when I only print every 1K records. My assumption was that >> this had something to do with more time being spent in the map task, and >> not fetching data from accumulo. There was at least one occasion where I >> printed to the console for every record, and in that situation it managed >> to process 47K records, although I have been unable to repeat that behavior. >> >> I've also noticed that if I stop and start accumulo, the map-reduce job >> will pickup where it left off, but seems to fail quicker. >> >> >> >> Could someone make some suggestions as to what my problem might be? It >> would be greatly appreciated. If you need any additional information from >> me, just let me know. I'd paste my config files, driver setup, and example >> data into this post, but I think it's probably long enough already. >> >> >> Thanks in advance, >> -Arjumand >> >> >
