You're work ont he clone table failed because you took it offline. The table needs to be online in order to do a job against it.
John On Thu, Aug 16, 2012 at 2:36 PM, Arjumand Bonhomme <[email protected]> wrote: > Jim, William, Adam: Thanks for your help with this, I'm running out of > ideas to google search for answers. > > I'm going to try to answer your questions: > > 1) I had already adjusted the maxclientcxns in zoo.cfg to be 100. I > didn't see anything in log indicating that it was shutting down connections > due to reaching a connection limit. However, throughout the zookeeper > logs, even before the hadoop job was run. I did see lots of lines like > these. Up til this point, I've assumed it's innocuous and unrelated to my > issue. > INFO [Thread-371:NIOServerCnxn@1435] - Closed socket connection for > client /127.0.0.1:59158 (no session established for client) > > 2) I had also already adjusted the dfs.datanode.max.xcievers property to > 4096 in hdfs-site.xml. An investigation of the log shows that this limit > is not being reached. It had been being reached at some point before I > increased the value from the default. I had increased the value while > troubleshooting, but prior to posting to this list; it didn't appear to > have a noticeable affect on the behavior of the hadoop job. > > 3) I'm writing out to a sequence file, so accumulo is only being used for > input. As a side note, at one point during my troubleshooting, I > compacted, cloned, and then took the cloned table offline and tried to use > that instead. That failed immediately, without processing any records. > From the stacktrace, it appeared as though the iterator was trying to use > one of the files for the original table (from what I understand about > cloning, this is normal b/c no changes had been made to the original table) > but said it did not exist. I was, however, able to find the file on hdfs. > So I just gave up on that. Also, under the normal case, ie using the > original table online, nothing is writing to the table while the hadoop job > is running. > > 4) My original open file limit was the os default of 256. So I upped it > to 1024, and performed another attempt. The behavior was the same as > before. > > I'm including a snippet from the tserver debug log. It looks like an > expired session might be the root of the problem, but I'm not sure what > would cause that: > 16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] > 16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:4971945 > in 0.006s, at=[0 0 0.00 1] ft=0.004s(pt=0.001s lt=0.002s ct=0.001s) > 16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] > 16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] > 16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] > 16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1] > 16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1] > 16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] > 16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc ParNew=0.07(+0.01) > secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=111,234,160(+14,477,728) > totalmem=132,055,040 > 16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] > 16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1] > 16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:50977 !0 0 entries in 0.01 secs, nbTimes = [2 2 2.00 1] > 16 13:47:51,477 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:50970 !0 26 entries in 0.01 secs, nbTimes = [11 11 11.00 1] > 16 13:47:51,494 [tabletserver.LargestFirstMemoryManager] DEBUG: IDLE minor > compaction chosen > 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: COMPACTING > !0<;~ total = 120,840 ingestMemory = 120,840 > 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: chosenMem > = 2,416 chosenIT = 300.23 load 3,044 > 16 13:47:51,498 [tabletserver.Tablet] DEBUG: MinC initiate lock 0.00 secs > 16 13:47:51,502 [tabletserver.MinorCompactor] DEBUG: Begin minor > compaction /accumulo/tables/!0/default_tablet/F000051i.rf_tmp !0<;~ > 16 13:47:51,525 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [2 2 2.00 1] > 16 13:47:51,532 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [3 3 3.00 1] > 16 13:47:51,538 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] > 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE > compactionThreshold = 0.605 maxObserved = 120,840 > 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER > compactionThreshold = 0.666 > 16 13:47:51,942 [tabletserver.Compactor] DEBUG: Compaction !0<;~ 24 read | > 4 written | 4,800 entries/sec | 0.005 secs > 16 13:47:51,950 [tabletserver.Tablet] DEBUG: Logs for memory compacted: > !0<;~ 127.0.0.1:11224/770e1b91-351a-45ac-8992-c0fb602ac51c > 16 13:47:51,956 [log.TabletServerLogger] DEBUG: wrote MinC finish 35: > writeTime:1ms > 16 13:47:51,956 [tabletserver.Tablet] TABLET_HIST: !0<;~ MinC [memory] -> > /default_tablet/F000051i.rf > 16 13:47:51,957 [tabletserver.Tablet] DEBUG: MinC finish lock 0.00 secs > !0<;~ > 16 13:47:52,650 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:5148043 > in 0.009s, at=[0 0 0.00 1] ft=0.005s(pt=0.001s lt=0.003s ct=0.001s) > 16 13:47:53,233 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: > Total=0.0769043MB (80640), Free=19.923096MB (20890880), Max=20.0MB > (20971520), Counts: Blocks=33, Access=43, Hit=10, Miss=33, Evictions=0, > Evicted=0, Ratios: Hit Ratio=23.255814611911774%, Miss > Ratio=76.74418687820435%, Evicted/Run=NaN, Duplicate Reads=0 > 16 13:47:53,254 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: > Total=0.0128479MB (13472), Free=6.987152MB (7326560), Max=7.0MB (7340032), > Counts: Blocks=20, Access=127, Hit=106, Miss=21, Evictions=0, Evicted=0, > Ratios: Hit Ratio=83.4645688533783%, Miss Ratio=16.535432636737823%, > Evicted/Run=NaN, Duplicate Reads=1 > 16 13:47:53,491 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 secs, > wait 0.00 secs > 16 13:47:53,493 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 secs, > wait 0.00 secs > 16 13:47:53,494 [tabletserver.Tablet] DEBUG: Starting MajC !0;~;!0< > (NORMAL) [/table_info/A000051d.rf, /table_info/F000051h.rf] --> > /table_info/A000051j.rf_tmp > 16 13:47:53,495 [tabletserver.Tablet] DEBUG: Starting MajC !0;!0<< > (NORMAL) [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] --> > /root_tablet/A000051k.rf_tmp > 16 13:47:53,683 [tabletserver.Compactor] DEBUG: Compaction !0;!0<< 39 read > | 21 written | 276 entries/sec | 0.141 secs > 16 13:47:53,701 [tabletserver.Compactor] DEBUG: Compaction !0;~;!0< 56 > read | 51 written | 323 entries/sec | 0.173 secs > 16 13:47:53,708 [tabletserver.Tablet] DEBUG: MajC finish lock 0.02 secs > 16 13:47:53,708 [tabletserver.Tablet] TABLET_HIST: !0;!0<< MajC > [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] --> > /root_tablet/A000051k.rf > 16 13:47:53,712 [tabletserver.TabletServer] INFO : Adding 1 logs for > extent !0<;~ as alias 2 > 16 13:47:53,726 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00 secs > 16 13:47:53,726 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< MajC > [/table_info/A000051d.rf, /table_info/F000051h.rf] --> > /table_info/A000051j.rf > 16 13:47:54,497 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.02) > secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=97,183,880(-1,627,536) > totalmem=132,055,040 > 16 13:47:54,694 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:51480146 > in 0.034s, at=[0 0 0.00 1] ft=0.013s(pt=0.001s lt=0.008s ct=0.004s) > 16 13:47:56,527 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:53005 !0 0 entries in 0.02 secs, nbTimes = [15 15 15.00 1] > 16 13:48:01,503 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.01) > secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=98,415,928(+12,106,904) > totalmem=132,055,040 > 16 13:48:01,576 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:55688 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] > 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess > 127.0.0.1:65352 2 entries in 0.03 secs (lookup_time:0.03 secs tablets:1 > ranges:1) > 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess > 127.0.0.1:65358 4 entries in 0.03 secs (lookup_time:0.03 secs tablets:1 > ranges:1) > 16 13:48:06,620 [tabletserver.TabletServer] DEBUG: ScanSess tid > 127.0.0.1:58247 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] > 16 13:48:44,595 [zookeeper.ZooLock] DEBUG: event null None Disconnected > 16 13:48:44,927 [zookeeper.ZooLock] DEBUG: event null None Expired > 16 13:48:44,943 [tabletserver.TabletServer] FATAL: Lost tablet server lock > (reason = SESSION_EXPIRED), exiting. > > > I'm also including a snippet from the zookeeper log as well. Around the > same the tserver logs show the loss of the tablet server lock due to > expired session, the zookeeper logs mentions the expiration of several > sessions due to timeout. > 2012-08-16 13:48:07,368 - INFO [Thread-370:NIOServerCnxn$StatCommand@1153] > - Stat command output > 2012-08-16 13:48:07,368 - INFO [Thread-370:NIOServerCnxn@1435] - Closed > socket connection for client /127.0.0.1:58633 (no session established for > client) > 2012-08-16 13:48:08,371 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket > connection from /127.0.0.1:59158 > 2012-08-16 13:48:08,372 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from / > 127.0.0.1:59158 > 2012-08-16 13:48:08,378 - INFO [Thread-371:NIOServerCnxn$StatCommand@1153] > - Stat command output > 2012-08-16 13:48:08,379 - INFO [Thread-371:NIOServerCnxn@1435] - Closed > socket connection for client /127.0.0.1:59158 (no session established for > client) > 2012-08-16 13:48:09,395 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket > connection from /127.0.0.1:59635 > 2012-08-16 13:48:09,395 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from / > 127.0.0.1:59635 > 2012-08-16 13:48:09,399 - INFO [Thread-372:NIOServerCnxn$StatCommand@1153] > - Stat command output > 2012-08-16 13:48:09,400 - INFO [Thread-372:NIOServerCnxn@1435] - Closed > socket connection for client /127.0.0.1:59635 (no session established for > client) > 2012-08-16 13:48:32,000 - INFO [SessionTracker:ZooKeeperServer@316] - > Expiring session 0x1393086dfb80005, timeout of 30000ms exceeded > 2012-08-16 13:48:32,000 - INFO [SessionTracker:ZooKeeperServer@316] - > Expiring session 0x1393086dfb80004, timeout of 30000ms exceeded > 2012-08-16 13:48:32,001 - INFO [SessionTracker:ZooKeeperServer@316] - > Expiring session 0x1393086dfb80001, timeout of 30000ms exceeded > 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] > - Processed session termination for sessionid: 0x1393086dfb80005 > 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] > - Processed session termination for sessionid: 0x1393086dfb80004 > 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] > - Processed session termination for sessionid: 0x1393086dfb80001 > 2012-08-16 13:48:32,003 - INFO [SyncThread:0:NIOServerCnxn@1435] - > Closed socket connection for client /127.0.0.1:65339 which had sessionid > 0x1393086dfb80005 > 2012-08-16 13:48:34,001 - INFO [SessionTracker:ZooKeeperServer@316] - > Expiring session 0x1393086dfb80002, timeout of 30000ms exceeded > 2012-08-16 13:48:36,001 - INFO [SessionTracker:ZooKeeperServer@316] - > Expiring session 0x1393086dfb80007, timeout of 30000ms exceeded > 2012-08-16 13:48:38,000 - INFO [SessionTracker:ZooKeeperServer@316] - > Expiring session 0x1393086dfb8004d, timeout of 30000ms exceeded > 2012-08-16 13:48:38,001 - INFO [SessionTracker:ZooKeeperServer@316] - > Expiring session 0x1393086dfb80006, timeout of 30000ms exceeded > 2012-08-16 13:48:40,001 - INFO [SessionTracker:ZooKeeperServer@316] - > Expiring session 0x1393086dfb80003, timeout of 30000ms exceeded > 2012-08-16 13:48:44,456 - INFO [SyncThread:0:NIOServerCnxn@1435] - > Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65332 which had > sessionid 0x1393086dfb80004 > 2012-08-16 13:48:44,458 - INFO [SyncThread:0:NIOServerCnxn@1435] - > Closed socket connection for client /127.0.0.1:65316 which had sessionid > 0x1393086dfb80001 > 2012-08-16 13:48:44,461 - INFO [ProcessThread:-1:PrepRequestProcessor@399] > - Processed session termination for sessionid: 0x1393086dfb80002 > 2012-08-16 13:48:44,482 - INFO [ProcessThread:-1:PrepRequestProcessor@399] > - Processed session termination for sessionid: 0x1393086dfb80007 > 2012-08-16 13:48:44,484 - WARN [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@639] - Exception causing close of > session 0x1393086dfb80003 due to java.io.IOException: Connection reset by > peer > 2012-08-16 13:48:44,486 - INFO [SyncThread:0:NIOServerCnxn@1435] - > Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65317 which had > sessionid 0x1393086dfb80002 > 2012-08-16 13:48:44,491 - INFO [ProcessThread:-1:PrepRequestProcessor@399] > - Processed session termination for sessionid: 0x1393086dfb8004d > 2012-08-16 13:48:44,493 - INFO [ProcessThread:-1:PrepRequestProcessor@399] > - Processed session termination for sessionid: 0x1393086dfb80006 > 2012-08-16 13:48:44,495 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for > client /fe80:0:0:0:0:0:0:1%1:65330 which had sessionid 0x1393086dfb80003 > 2012-08-16 13:48:44,495 - WARN [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to > read additional data from client sessionid 0x1393086dfb80007, likely client > has closed socket > 2012-08-16 13:48:44,496 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for > client /0:0:0:0:0:0:0:1%0:65343 which had sessionid 0x1393086dfb80007 > 2012-08-16 13:48:44,496 - WARN [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to > read additional data from client sessionid 0x1393086dfb8004d, likely client > has closed socket > 2012-08-16 13:48:44,497 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for > client /127.0.0.1:49770 which had sessionid 0x1393086dfb8004d > 2012-08-16 13:48:44,498 - INFO [ProcessThread:-1:PrepRequestProcessor@399] > - Processed session termination for sessionid: 0x1393086dfb80003 > 2012-08-16 13:48:44,500 - INFO [SyncThread:0:NIOServerCnxn@1435] - > Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65341 which had > sessionid 0x1393086dfb80006 > 2012-08-16 13:48:44,613 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket > connection from /0:0:0:0:0:0:0:1%0:59789 > 2012-08-16 13:48:44,615 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew > session 0x1393086dfb80001 at /0:0:0:0:0:0:0:1%0:59789 > 2012-08-16 13:48:44,616 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1573] - Invalid session > 0x1393086dfb80001 for client /0:0:0:0:0:0:0:1%0:59789, probably expired > 2012-08-16 13:48:44,616 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for > client /0:0:0:0:0:0:0:1%0:59789 which had sessionid 0x1393086dfb80001 > > > Sorry for the huge post of log drivel, but if you'd like to see more I can > post (or via pastebin) the full logs. I made sure to shut down everything, > wipe out the logs, then start everything back up so the logs would be as > clean as possible. > > > I looked for a way to increase the session timeout on the > AccumuloInputFormat, but there doesn't seem to be one. I'm sure that's not > the right fix anyway, but figured I'd mention it. > > > Do you guys have any other suggestions. I'm sure this is due to some > configuration error on my part, but I'm running out of idea. > > > Thanks a ton, > -Arjumand > > > > On Thu, Aug 16, 2012 at 7:24 AM, William Slacum < > [email protected]> wrote: > >> What does your TServer debug log say? Also, are you writing back out to >> Accumulo? >> >> To follow up what Jim said, you can check the zookeeper log to see if max >> connections is being hit. You may also want to check and see what your max >> xceivers is set to for HDFS and check your Accumulo and HDFS logs to see if >> it is mentioned. >> >> >> On Thu, Aug 16, 2012 at 3:59 AM, Arjumand Bonhomme <[email protected]>wrote: >> >>> Hello, >>> >>> I'm fairly new to both Accumulo and Hadoop, so I think my problem may be >>> due to poor configuration on my part, but I'm running out of ideas. >>> >>> I'm running this on a mac laptop, with hadoop (hadoop-0.20.2 from >>> cdh3u4) in pseudo-distributed mode. >>> zookeeper version zookeeper-3.3.5 from cdh3u4 >>> I'm using the 1.4.1 release of accumulo with a configuration copied from >>> "conf/examples/512MB/standalone" >>> >>> I've got a Map task that is using an accumulo table as the input. >>> I'm fetching all rows, but just a single column family, that has >>> hundreds or even thousands of different column qualifiers. >>> The table has a SummingCombiner installed for the given the column >>> family. >>> >>> The task runs fine at first, but after ~9-15K records (I print the >>> record count to the console every 1K records), it hangs and the following >>> messages are printed to the console where I'm running the job: >>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional >>> data from server sessionid 0x1392cc35b460d1c, likely server has closed >>> socket, closing socket connection and attempting reconnect >>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection >>> to server localhost/fe80:0:0:0:0:0:0:1%1:2181 >>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection >>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session >>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to >>> ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket >>> connection >>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down >>> 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client >>> connection, connectString=localhost sessionTimeout=30000 >>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c >>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection >>> to server localhost/0:0:0:0:0:0:0:1:2181 >>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection >>> established to localhost/0:0:0:0:0:0:0:1:2181, initiating session >>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment >>> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = >>> 0x1392cc35b460d25, negotiated timeout = 30000 >>> 12/08/16 02:57:11 INFO mapred.LocalJobRunner: >>> 12/08/16 02:57:14 INFO mapred.LocalJobRunner: >>> 12/08/16 02:57:17 INFO mapred.LocalJobRunner: >>> >>> Sometimes the messages contain a stacktrace like this below: >>> 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40 >>> for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing >>> socket connection and attempting reconnect >>> java.io.IOException: Connection reset by peer >>> at sun.nio.ch.FileDispatcher.read0(Native Method) >>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) >>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198) >>> at sun.nio.ch.IOUtil.read(IOUtil.java:166) >>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245) >>> at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856) >>> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154) >>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection >>> to server localhost/127.0.0.1:2181 >>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection >>> established to localhost/127.0.0.1:2181, initiating session >>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to >>> ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket >>> connection >>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down >>> 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client >>> connection, connectString=localhost sessionTimeout=30000 >>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8 >>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection >>> to server localhost/fe80:0:0:0:0:0:0:1%1:2181 >>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection >>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session >>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment >>> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid = >>> 0x1392cc35b460b46, negotiated timeout = 30000 >>> >>> >>> I've poked through the logs in accumulo, and I've noticed that when it >>> hangs, the following is written to the "logger_HOSTNAME.debug.log" file: >>> 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected >>> 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of >>> current session : Expired >>> 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired >>> 16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper >>> registration at null >>> 16 03:29:47,452 [logger.LogService] INFO : Logger shutting down >>> 16 03:29:47,453 [logger.LogWriter] INFO : Shutting down >>> >>> >>> I've noticed that if I make the map task print out the record count more >>> frequently (ie every 10 records), it seems to be able get through more >>> records than when I only print every 1K records. My assumption was that >>> this had something to do with more time being spent in the map task, and >>> not fetching data from accumulo. There was at least one occasion where I >>> printed to the console for every record, and in that situation it managed >>> to process 47K records, although I have been unable to repeat that behavior. >>> >>> I've also noticed that if I stop and start accumulo, the map-reduce job >>> will pickup where it left off, but seems to fail quicker. >>> >>> >>> >>> Could someone make some suggestions as to what my problem might be? It >>> would be greatly appreciated. If you need any additional information from >>> me, just let me know. I'd paste my config files, driver setup, and example >>> data into this post, but I think it's probably long enough already. >>> >>> >>> Thanks in advance, >>> -Arjumand >>> >>> >> >
