While this isn't my original concern/problem, so it's not terribly important, but I'm not sure I understand and I'd like to learn as much as possible.
Why wouldn't it work with the cloned table offline? I followed the example laid out in org.apache.accumulo.examples.simple.mapreduce.UniqueColumns Using the "AccumuloInputFormat.setScanOffline(job.getConfiguration(), true);" setup. Can you give me some detail as to why it definitely would not have worked offline? Is this method no longer supported. Thanks! -Arjumand On Thu, Aug 16, 2012 at 3:24 PM, John Vines <[email protected]> wrote: > You're work ont he clone table failed because you took it offline. The > table needs to be online in order to do a job against it. > > John > > > On Thu, Aug 16, 2012 at 2:36 PM, Arjumand Bonhomme <[email protected]>wrote: > >> Jim, William, Adam: Thanks for your help with this, I'm running out of >> ideas to google search for answers. >> >> I'm going to try to answer your questions: >> >> 1) I had already adjusted the maxclientcxns in zoo.cfg to be 100. I >> didn't see anything in log indicating that it was shutting down connections >> due to reaching a connection limit. However, throughout the zookeeper >> logs, even before the hadoop job was run. I did see lots of lines like >> these. Up til this point, I've assumed it's innocuous and unrelated to my >> issue. >> INFO [Thread-371:NIOServerCnxn@1435] - Closed socket connection for >> client /127.0.0.1:59158 (no session established for client) >> >> 2) I had also already adjusted the dfs.datanode.max.xcievers property to >> 4096 in hdfs-site.xml. An investigation of the log shows that this limit >> is not being reached. It had been being reached at some point before I >> increased the value from the default. I had increased the value while >> troubleshooting, but prior to posting to this list; it didn't appear to >> have a noticeable affect on the behavior of the hadoop job. >> >> 3) I'm writing out to a sequence file, so accumulo is only being used for >> input. As a side note, at one point during my troubleshooting, I >> compacted, cloned, and then took the cloned table offline and tried to use >> that instead. That failed immediately, without processing any records. >> From the stacktrace, it appeared as though the iterator was trying to use >> one of the files for the original table (from what I understand about >> cloning, this is normal b/c no changes had been made to the original table) >> but said it did not exist. I was, however, able to find the file on hdfs. >> So I just gave up on that. Also, under the normal case, ie using the >> original table online, nothing is writing to the table while the hadoop job >> is running. >> >> 4) My original open file limit was the os default of 256. So I upped it >> to 1024, and performed another attempt. The behavior was the same as >> before. >> >> I'm including a snippet from the tserver debug log. It looks like an >> expired session might be the root of the problem, but I'm not sure what >> would cause that: >> 16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >> 16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:4971945 >> in 0.006s, at=[0 0 0.00 1] ft=0.004s(pt=0.001s lt=0.002s ct=0.001s) >> 16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >> 16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >> 16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >> 16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1] >> 16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1] >> 16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >> 16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc ParNew=0.07(+0.01) >> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=111,234,160(+14,477,728) >> totalmem=132,055,040 >> 16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >> 16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1] >> 16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:50977 !0 0 entries in 0.01 secs, nbTimes = [2 2 2.00 1] >> 16 13:47:51,477 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:50970 !0 26 entries in 0.01 secs, nbTimes = [11 11 11.00 1] >> 16 13:47:51,494 [tabletserver.LargestFirstMemoryManager] DEBUG: IDLE >> minor compaction chosen >> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: >> COMPACTING !0<;~ total = 120,840 ingestMemory = 120,840 >> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: chosenMem >> = 2,416 chosenIT = 300.23 load 3,044 >> 16 13:47:51,498 [tabletserver.Tablet] DEBUG: MinC initiate lock 0.00 secs >> 16 13:47:51,502 [tabletserver.MinorCompactor] DEBUG: Begin minor >> compaction /accumulo/tables/!0/default_tablet/F000051i.rf_tmp !0<;~ >> 16 13:47:51,525 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >> 16 13:47:51,532 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [3 3 3.00 1] >> 16 13:47:51,538 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE >> compactionThreshold = 0.605 maxObserved = 120,840 >> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER >> compactionThreshold = 0.666 >> 16 13:47:51,942 [tabletserver.Compactor] DEBUG: Compaction !0<;~ 24 read >> | 4 written | 4,800 entries/sec | 0.005 secs >> 16 13:47:51,950 [tabletserver.Tablet] DEBUG: Logs for memory compacted: >> !0<;~ 127.0.0.1:11224/770e1b91-351a-45ac-8992-c0fb602ac51c >> 16 13:47:51,956 [log.TabletServerLogger] DEBUG: wrote MinC finish 35: >> writeTime:1ms >> 16 13:47:51,956 [tabletserver.Tablet] TABLET_HIST: !0<;~ MinC [memory] -> >> /default_tablet/F000051i.rf >> 16 13:47:51,957 [tabletserver.Tablet] DEBUG: MinC finish lock 0.00 secs >> !0<;~ >> 16 13:47:52,650 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:5148043 >> in 0.009s, at=[0 0 0.00 1] ft=0.005s(pt=0.001s lt=0.003s ct=0.001s) >> 16 13:47:53,233 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: >> Total=0.0769043MB (80640), Free=19.923096MB (20890880), Max=20.0MB >> (20971520), Counts: Blocks=33, Access=43, Hit=10, Miss=33, Evictions=0, >> Evicted=0, Ratios: Hit Ratio=23.255814611911774%, Miss >> Ratio=76.74418687820435%, Evicted/Run=NaN, Duplicate Reads=0 >> 16 13:47:53,254 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: >> Total=0.0128479MB (13472), Free=6.987152MB (7326560), Max=7.0MB (7340032), >> Counts: Blocks=20, Access=127, Hit=106, Miss=21, Evictions=0, Evicted=0, >> Ratios: Hit Ratio=83.4645688533783%, Miss Ratio=16.535432636737823%, >> Evicted/Run=NaN, Duplicate Reads=1 >> 16 13:47:53,491 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 >> secs, wait 0.00 secs >> 16 13:47:53,493 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 >> secs, wait 0.00 secs >> 16 13:47:53,494 [tabletserver.Tablet] DEBUG: Starting MajC !0;~;!0< >> (NORMAL) [/table_info/A000051d.rf, /table_info/F000051h.rf] --> >> /table_info/A000051j.rf_tmp >> 16 13:47:53,495 [tabletserver.Tablet] DEBUG: Starting MajC !0;!0<< >> (NORMAL) [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] --> >> /root_tablet/A000051k.rf_tmp >> 16 13:47:53,683 [tabletserver.Compactor] DEBUG: Compaction !0;!0<< 39 >> read | 21 written | 276 entries/sec | 0.141 secs >> 16 13:47:53,701 [tabletserver.Compactor] DEBUG: Compaction !0;~;!0< 56 >> read | 51 written | 323 entries/sec | 0.173 secs >> 16 13:47:53,708 [tabletserver.Tablet] DEBUG: MajC finish lock 0.02 secs >> 16 13:47:53,708 [tabletserver.Tablet] TABLET_HIST: !0;!0<< MajC >> [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] --> >> /root_tablet/A000051k.rf >> 16 13:47:53,712 [tabletserver.TabletServer] INFO : Adding 1 logs for >> extent !0<;~ as alias 2 >> 16 13:47:53,726 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00 secs >> 16 13:47:53,726 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< MajC >> [/table_info/A000051d.rf, /table_info/F000051h.rf] --> >> /table_info/A000051j.rf >> 16 13:47:54,497 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.02) >> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=97,183,880(-1,627,536) >> totalmem=132,055,040 >> 16 13:47:54,694 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:51480146 >> in 0.034s, at=[0 0 0.00 1] ft=0.013s(pt=0.001s lt=0.008s ct=0.004s) >> 16 13:47:56,527 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:53005 !0 0 entries in 0.02 secs, nbTimes = [15 15 15.00 1] >> 16 13:48:01,503 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.01) >> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=98,415,928(+12,106,904) >> totalmem=132,055,040 >> 16 13:48:01,576 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:55688 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1] >> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess >> 127.0.0.1:65352 2 entries in 0.03 secs (lookup_time:0.03 secs tablets:1 >> ranges:1) >> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess >> 127.0.0.1:65358 4 entries in 0.03 secs (lookup_time:0.03 secs tablets:1 >> ranges:1) >> 16 13:48:06,620 [tabletserver.TabletServer] DEBUG: ScanSess tid >> 127.0.0.1:58247 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1] >> 16 13:48:44,595 [zookeeper.ZooLock] DEBUG: event null None Disconnected >> 16 13:48:44,927 [zookeeper.ZooLock] DEBUG: event null None Expired >> 16 13:48:44,943 [tabletserver.TabletServer] FATAL: Lost tablet server >> lock (reason = SESSION_EXPIRED), exiting. >> >> >> I'm also including a snippet from the zookeeper log as well. Around the >> same the tserver logs show the loss of the tablet server lock due to >> expired session, the zookeeper logs mentions the expiration of several >> sessions due to timeout. >> 2012-08-16 13:48:07,368 - INFO [Thread-370:NIOServerCnxn$StatCommand@1153] >> - Stat command output >> 2012-08-16 13:48:07,368 - INFO [Thread-370:NIOServerCnxn@1435] - Closed >> socket connection for client /127.0.0.1:58633 (no session established >> for client) >> 2012-08-16 13:48:08,371 - INFO [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket >> connection from /127.0.0.1:59158 >> 2012-08-16 13:48:08,372 - INFO [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from / >> 127.0.0.1:59158 >> 2012-08-16 13:48:08,378 - INFO [Thread-371:NIOServerCnxn$StatCommand@1153] >> - Stat command output >> 2012-08-16 13:48:08,379 - INFO [Thread-371:NIOServerCnxn@1435] - Closed >> socket connection for client /127.0.0.1:59158 (no session established >> for client) >> 2012-08-16 13:48:09,395 - INFO [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket >> connection from /127.0.0.1:59635 >> 2012-08-16 13:48:09,395 - INFO [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from / >> 127.0.0.1:59635 >> 2012-08-16 13:48:09,399 - INFO [Thread-372:NIOServerCnxn$StatCommand@1153] >> - Stat command output >> 2012-08-16 13:48:09,400 - INFO [Thread-372:NIOServerCnxn@1435] - Closed >> socket connection for client /127.0.0.1:59635 (no session established >> for client) >> 2012-08-16 13:48:32,000 - INFO [SessionTracker:ZooKeeperServer@316] - >> Expiring session 0x1393086dfb80005, timeout of 30000ms exceeded >> 2012-08-16 13:48:32,000 - INFO [SessionTracker:ZooKeeperServer@316] - >> Expiring session 0x1393086dfb80004, timeout of 30000ms exceeded >> 2012-08-16 13:48:32,001 - INFO [SessionTracker:ZooKeeperServer@316] - >> Expiring session 0x1393086dfb80001, timeout of 30000ms exceeded >> 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] >> - Processed session termination for sessionid: 0x1393086dfb80005 >> 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] >> - Processed session termination for sessionid: 0x1393086dfb80004 >> 2012-08-16 13:48:32,001 - INFO [ProcessThread:-1:PrepRequestProcessor@399] >> - Processed session termination for sessionid: 0x1393086dfb80001 >> 2012-08-16 13:48:32,003 - INFO [SyncThread:0:NIOServerCnxn@1435] - >> Closed socket connection for client /127.0.0.1:65339 which had sessionid >> 0x1393086dfb80005 >> 2012-08-16 13:48:34,001 - INFO [SessionTracker:ZooKeeperServer@316] - >> Expiring session 0x1393086dfb80002, timeout of 30000ms exceeded >> 2012-08-16 13:48:36,001 - INFO [SessionTracker:ZooKeeperServer@316] - >> Expiring session 0x1393086dfb80007, timeout of 30000ms exceeded >> 2012-08-16 13:48:38,000 - INFO [SessionTracker:ZooKeeperServer@316] - >> Expiring session 0x1393086dfb8004d, timeout of 30000ms exceeded >> 2012-08-16 13:48:38,001 - INFO [SessionTracker:ZooKeeperServer@316] - >> Expiring session 0x1393086dfb80006, timeout of 30000ms exceeded >> 2012-08-16 13:48:40,001 - INFO [SessionTracker:ZooKeeperServer@316] - >> Expiring session 0x1393086dfb80003, timeout of 30000ms exceeded >> 2012-08-16 13:48:44,456 - INFO [SyncThread:0:NIOServerCnxn@1435] - >> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65332 which had >> sessionid 0x1393086dfb80004 >> 2012-08-16 13:48:44,458 - INFO [SyncThread:0:NIOServerCnxn@1435] - >> Closed socket connection for client /127.0.0.1:65316 which had sessionid >> 0x1393086dfb80001 >> 2012-08-16 13:48:44,461 - INFO [ProcessThread:-1:PrepRequestProcessor@399] >> - Processed session termination for sessionid: 0x1393086dfb80002 >> 2012-08-16 13:48:44,482 - INFO [ProcessThread:-1:PrepRequestProcessor@399] >> - Processed session termination for sessionid: 0x1393086dfb80007 >> 2012-08-16 13:48:44,484 - WARN [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@639] - Exception causing close of >> session 0x1393086dfb80003 due to java.io.IOException: Connection reset by >> peer >> 2012-08-16 13:48:44,486 - INFO [SyncThread:0:NIOServerCnxn@1435] - >> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65317 which had >> sessionid 0x1393086dfb80002 >> 2012-08-16 13:48:44,491 - INFO [ProcessThread:-1:PrepRequestProcessor@399] >> - Processed session termination for sessionid: 0x1393086dfb8004d >> 2012-08-16 13:48:44,493 - INFO [ProcessThread:-1:PrepRequestProcessor@399] >> - Processed session termination for sessionid: 0x1393086dfb80006 >> 2012-08-16 13:48:44,495 - INFO [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for >> client /fe80:0:0:0:0:0:0:1%1:65330 which had sessionid 0x1393086dfb80003 >> 2012-08-16 13:48:44,495 - WARN [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable >> to read additional data from client sessionid 0x1393086dfb80007, likely >> client has closed socket >> 2012-08-16 13:48:44,496 - INFO [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for >> client /0:0:0:0:0:0:0:1%0:65343 which had sessionid 0x1393086dfb80007 >> 2012-08-16 13:48:44,496 - WARN [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable >> to read additional data from client sessionid 0x1393086dfb8004d, likely >> client has closed socket >> 2012-08-16 13:48:44,497 - INFO [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for >> client /127.0.0.1:49770 which had sessionid 0x1393086dfb8004d >> 2012-08-16 13:48:44,498 - INFO [ProcessThread:-1:PrepRequestProcessor@399] >> - Processed session termination for sessionid: 0x1393086dfb80003 >> 2012-08-16 13:48:44,500 - INFO [SyncThread:0:NIOServerCnxn@1435] - >> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65341 which had >> sessionid 0x1393086dfb80006 >> 2012-08-16 13:48:44,613 - INFO [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket >> connection from /0:0:0:0:0:0:0:1%0:59789 >> 2012-08-16 13:48:44,615 - INFO [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew >> session 0x1393086dfb80001 at /0:0:0:0:0:0:0:1%0:59789 >> 2012-08-16 13:48:44,616 - INFO [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1573] - Invalid session >> 0x1393086dfb80001 for client /0:0:0:0:0:0:0:1%0:59789, probably expired >> 2012-08-16 13:48:44,616 - INFO [NIOServerCxn.Factory: >> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for >> client /0:0:0:0:0:0:0:1%0:59789 which had sessionid 0x1393086dfb80001 >> >> >> Sorry for the huge post of log drivel, but if you'd like to see more I >> can post (or via pastebin) the full logs. I made sure to shut down >> everything, wipe out the logs, then start everything back up so the logs >> would be as clean as possible. >> >> >> I looked for a way to increase the session timeout on the >> AccumuloInputFormat, but there doesn't seem to be one. I'm sure that's not >> the right fix anyway, but figured I'd mention it. >> >> >> Do you guys have any other suggestions. I'm sure this is due to some >> configuration error on my part, but I'm running out of idea. >> >> >> Thanks a ton, >> -Arjumand >> >> >> >> On Thu, Aug 16, 2012 at 7:24 AM, William Slacum < >> [email protected]> wrote: >> >>> What does your TServer debug log say? Also, are you writing back out to >>> Accumulo? >>> >>> To follow up what Jim said, you can check the zookeeper log to see if >>> max connections is being hit. You may also want to check and see what your >>> max xceivers is set to for HDFS and check your Accumulo and HDFS logs to >>> see if it is mentioned. >>> >>> >>> On Thu, Aug 16, 2012 at 3:59 AM, Arjumand Bonhomme <[email protected]>wrote: >>> >>>> Hello, >>>> >>>> I'm fairly new to both Accumulo and Hadoop, so I think my problem may >>>> be due to poor configuration on my part, but I'm running out of ideas. >>>> >>>> I'm running this on a mac laptop, with hadoop (hadoop-0.20.2 from >>>> cdh3u4) in pseudo-distributed mode. >>>> zookeeper version zookeeper-3.3.5 from cdh3u4 >>>> I'm using the 1.4.1 release of accumulo with a configuration copied >>>> from "conf/examples/512MB/standalone" >>>> >>>> I've got a Map task that is using an accumulo table as the input. >>>> I'm fetching all rows, but just a single column family, that has >>>> hundreds or even thousands of different column qualifiers. >>>> The table has a SummingCombiner installed for the given the column >>>> family. >>>> >>>> The task runs fine at first, but after ~9-15K records (I print the >>>> record count to the console every 1K records), it hangs and the following >>>> messages are printed to the console where I'm running the job: >>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional >>>> data from server sessionid 0x1392cc35b460d1c, likely server has closed >>>> socket, closing socket connection and attempting reconnect >>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection >>>> to server localhost/fe80:0:0:0:0:0:0:1%1:2181 >>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection >>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session >>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to >>>> ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket >>>> connection >>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down >>>> 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client >>>> connection, connectString=localhost sessionTimeout=30000 >>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c >>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection >>>> to server localhost/0:0:0:0:0:0:0:1:2181 >>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection >>>> established to localhost/0:0:0:0:0:0:0:1:2181, initiating session >>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment >>>> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = >>>> 0x1392cc35b460d25, negotiated timeout = 30000 >>>> 12/08/16 02:57:11 INFO mapred.LocalJobRunner: >>>> 12/08/16 02:57:14 INFO mapred.LocalJobRunner: >>>> 12/08/16 02:57:17 INFO mapred.LocalJobRunner: >>>> >>>> Sometimes the messages contain a stacktrace like this below: >>>> 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40 >>>> for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing >>>> socket connection and attempting reconnect >>>> java.io.IOException: Connection reset by peer >>>> at sun.nio.ch.FileDispatcher.read0(Native Method) >>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) >>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198) >>>> at sun.nio.ch.IOUtil.read(IOUtil.java:166) >>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245) >>>> at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856) >>>> at >>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154) >>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection >>>> to server localhost/127.0.0.1:2181 >>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection >>>> established to localhost/127.0.0.1:2181, initiating session >>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to >>>> ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket >>>> connection >>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down >>>> 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client >>>> connection, connectString=localhost sessionTimeout=30000 >>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8 >>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection >>>> to server localhost/fe80:0:0:0:0:0:0:1%1:2181 >>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection >>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session >>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment >>>> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid = >>>> 0x1392cc35b460b46, negotiated timeout = 30000 >>>> >>>> >>>> I've poked through the logs in accumulo, and I've noticed that when it >>>> hangs, the following is written to the "logger_HOSTNAME.debug.log" file: >>>> 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected >>>> 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of >>>> current session : Expired >>>> 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired >>>> 16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper >>>> registration at null >>>> 16 03:29:47,452 [logger.LogService] INFO : Logger shutting down >>>> 16 03:29:47,453 [logger.LogWriter] INFO : Shutting down >>>> >>>> >>>> I've noticed that if I make the map task print out the record count >>>> more frequently (ie every 10 records), it seems to be able get through more >>>> records than when I only print every 1K records. My assumption was that >>>> this had something to do with more time being spent in the map task, and >>>> not fetching data from accumulo. There was at least one occasion where I >>>> printed to the console for every record, and in that situation it managed >>>> to process 47K records, although I have been unable to repeat that >>>> behavior. >>>> >>>> I've also noticed that if I stop and start accumulo, the map-reduce job >>>> will pickup where it left off, but seems to fail quicker. >>>> >>>> >>>> >>>> Could someone make some suggestions as to what my problem might be? It >>>> would be greatly appreciated. If you need any additional information from >>>> me, just let me know. I'd paste my config files, driver setup, and example >>>> data into this post, but I think it's probably long enough already. >>>> >>>> >>>> Thanks in advance, >>>> -Arjumand >>>> >>>> >>> >> >
