Re: WARN add_table: Missing .regioninfo:.. No server address.. what to do?
On Wed, Aug 25, 2010 at 11:22 AM, Stuart Smith stu24m...@yahoo.com wrote: Just curious, though, (if it happens again) - assume the regions were invalid - I don't know, maybe it was halfway through splitting something and died - but say they're invalid. (See if a failed MR task is associated with the bad region. You could also tgz the bad region and we can take a look at it for you.) Would the best thing to do in that case be a manual deletion of the hdfs directories containing the invalid regions? Would hbase handle that OK? If it's a 'bad' region, it should be fine. There'd be no holes in the loaded table. But if it's not... And a side question that ties a lot of my issues together - I finally have a (somewhat) clean interface that moves the occasional too-big file into hdfs, and stores everything else in hbase - I built this up as a layer in java with a metadata/filestore split in hbase (all file metadata is in hbase, files are directed to hbase/hdfs based on size). Is there another project that does this? It seems too handy for this to be the first time someone did it... Or does something like this always end up needing domain-specific tweaks/interfaces? I haven't heard of a project like this (though as you say, you can't be the first... maybe you are though?) Because once you have huge cells in hbase, it really seems to be unhappy. Especially when a good chunk of your tasks are done as M/R tasks or some layer on top of M/R. Yeah, I'd imagine so. At least the default configuration is set for cells in the 0-50k or so size range. I'd imagine they'd need to be pulled around some if cells are MBs. Or would this be a good project to open-source? Or pointless to do so? Do it on github as Ted suggests. It'll either flourish and then you'll have to figure out how to support it, or it'll wither when you move on (add it to the supporting projects on the wiki so it's easier for folks to find?). I guess in the long run hbase could absorb these requirements with some tweaks of the file format, but I thought it could be nice to do this with a little library layer on top. You are a good man Stu, St.Ack --- On Mon, 8/23/10, Stack st...@duboce.net wrote: From: Stack st...@duboce.net Subject: Re: WARN add_table: Missing .regioninfo:.. No server address.. what to do? To: user@hbase.apache.org Date: Monday, August 23, 2010, 6:08 PM On Mon, Aug 23, 2010 at 1:35 PM, Stuart Smith stu24m...@yahoo.com wrote: Hmm... AFAICT, if the .regioninfo file is gone from a region directory (and I looked on hdfs, and it is gone), the region is hosed. Is it a legit region? Is it wholesome-looking, with hfiles that make sense (non-zero)? My guess is that the regions are incomplete and loadtable is not smart enough to recognize them as such. If you grep your master log for the region's encoded name, do you find anything? Maybe this way you can figure its provenance? St.Ack
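A sketch of the investigation and cleanup being discussed, assuming an 0.20-era layout under /hbase; the log path, table name, and the region's encoded name below are illustrative placeholders, not values from this thread:

    # Find mentions of the region's encoded name in the master log.
    grep deadbeef0123456789abcdef01234567 /var/log/hbase/hbase-hadoop-master-node1.log

    # Inspect the region directory; a legit region has a .regioninfo file
    # plus non-zero hfiles under its column family directories.
    hadoop fs -lsr /hbase/mytable/deadbeef0123456789abcdef01234567

    # Only once you've concluded the region is an incomplete leftover (e.g.
    # from a half-finished split), remove its directory and re-run add_table.
    hadoop fs -rmr /hbase/mytable/deadbeef0123456789abcdef01234567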
Re: WARN add_table: Missing .regioninfo:.. No server address.. what to do?
Hey, Awesome. Well, this is a research project for work, so I have to ask the powers that be if it's OK to publish the plumbing parts. It's really just plumbing though, so from the techie perspective it's not the interesting part. So hopefully I can sell it as such (selling my work to the boss as not interesting.. hmm... ;) ). We'll see. I'm not an expert Java coder either, but hopefully I can get it up and stimulate something... Take care, -stu
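For readers wondering what the layer Stuart describes looks like, here is a minimal, hypothetical sketch of the size-based routing, using the 0.20-era client API; the class, table, family, and path names are invented for illustration: metadata always lands in hbase, and the payload goes to an hbase cell or an hdfs file depending on size.

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SizeRoutedFileStore {
      // Hypothetical cutoff; the thread suggests hbase is happiest with
      // cells well under the MB range.
      private static final int MAX_HBASE_CELL = 10 * 1024 * 1024;

      public void store(String key, byte[] data) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        HTable files = new HTable(conf, "files"); // metadata table (invented name)
        Put put = new Put(Bytes.toBytes(key));
        // All file metadata lives in hbase, whatever the payload's home.
        put.add(Bytes.toBytes("meta"), Bytes.toBytes("size"), Bytes.toBytes(data.length));
        if (data.length <= MAX_HBASE_CELL) {
          // Small file: the payload itself lives in an hbase cell.
          put.add(Bytes.toBytes("content"), Bytes.toBytes("data"), data);
        } else {
          // Too-big file: payload goes to hdfs, hbase keeps a pointer.
          FileSystem fs = FileSystem.get(conf);
          Path path = new Path("/filestore/" + key);
          FSDataOutputStream out = fs.create(path);
          out.write(data);
          out.close();
          put.add(Bytes.toBytes("content"), Bytes.toBytes("hdfspath"),
              Bytes.toBytes(path.toString()));
        }
        files.put(put);
      }
    }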
Out Of Memory on region servers upon bulk import
Hi, I'm doing an experiment on an 8 node cluster, each of which has 6GB of RAM allocated to the hbase region server. Basically, I'm doing a bulk import processing large files, but some imports require gets and scans as well. In the master UI I see that the heap used gets very close to the 6GB limit, but I know hbase is eager for memory and will use the heap as much as possible. I use block caching. Looking at similar posts I see that modifying the handler count and memstore upper/lower limits may be key to solving this issue. Nevertheless I wanted to ask if there is a way to estimate the extra memory used by hbase that makes it crash, and if there are other configuration settings I should be looking into to prevent OOME. The job runs correctly for some time but region servers eventually crash. More information about the cluster:
- All nodes have 16GB total memory.
- 7 nodes running region server (6GB) + datanode (1GB) + task tracker (1GB heap). Map reduce jobs running w/ 756MB tops each.
- 1 node running hbase master (2GB heap), namenode (4GB), secondary namenode (4GB) and jobtracker (4GB).
- 3 of the nodes have zookeeper running with 512MB heap.
Many thanks, Martin

2010-08-26 07:19:14,859 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.hbase.io.hfile.HFile$BlockIndex.readIndex(HFile.java:1538)
    at org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:806)
    at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:273)
    at org.apache.hadoop.hbase.regionserver.StoreFile.init(StoreFile.java:129)
    at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:410)
    at org.apache.hadoop.hbase.regionserver.Store.init(Store.java:221)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1636)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:321)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1571)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1538)
    at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1458)
    at java.lang.Thread.run(Thread.java:619)
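One rough way to answer the "estimate the extra memory" question is to add up the big fixed consumers using the 0.20-era defaults; the figures below are an illustrative back-of-envelope for a 6GB heap (assuming 25 handlers and the stock 2MB client write buffer), not measurements from this cluster:

    memstores:    0.40 x 6GB = ~2.4GB  (hbase.regionserver.global.memstore.upperLimit, default 0.4)
    block cache:  0.20 x 6GB = ~1.2GB  (hfile.block.cache.size, default 0.2)
    rpc handlers: 25 x ~2MB  = ~50MB   (hbase.regionserver.handler.count x client write buffer)

    That leaves roughly 2.3GB for hfile block indexes (the readIndex frame in
    the stack trace above), in-flight scans and compactions, and GC headroom.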
Re: Out Of Memory on region servers upon bulk import
On Thu, Aug 26, 2010 at 8:07 AM, Martin Arnandze marnan...@gmail.com wrote: Hi, I'm doing an experiment on an 8 node cluster, each of which has 6GB of RAM allocated to the hbase region server. Basically, doing a bulk import processing large files, How large? Unless very large, it should not be OOMEing. but some imports require gets and scans as well. In the master UI I see that the heap used gets very close to the 6GB limit, but I know hbase is eager for memory and will use the heap as much as possible. I use block caching. Looking at similar posts I see that modifying the handler count and memstore upper/lower limits may be key to solving this issue. Nevertheless I wanted to ask if there is a way to estimate the extra memory used by hbase that makes it crash, and if there are other configuration settings I should be looking into to prevent OOME. The job runs correctly for some time but region servers eventually crash. More information about the cluster: - All nodes have 16GB total memory. - 7 nodes running region server (6GB) + datanode (1GB) + task tracker (1GB heap). Map reduce jobs running w/ 756MB tops each. Good. How many MR child tasks can run on each node concurrently? - 1 node running hbase master (2GB heap), namenode (4GB), secondary namenode (4GB) and jobtracker (4GB). - 3 of the nodes have zookeeper running with 512MB heap. Many thanks, Martin Can we see the lines before the below is thrown? Also, do a listing (fs -lsr) on this region in hdfs and let's see if anything pops out about file sizes, etc. You'll need to manually map the below region name to its encoded name to figure the region, but the encoded name should be earlier in the log. You'll do something like: bin/hadoop fs -lsr /hbase/table_import/REGION_ENCODED_NAME Thanks, St.Ack

2010-08-26 07:19:14,859 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.hbase.io.hfile.HFile$BlockIndex.readIndex(HFile.java:1538)
    at org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:806)
    at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:273)
    at org.apache.hadoop.hbase.regionserver.StoreFile.init(StoreFile.java:129)
    at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:410)
    at org.apache.hadoop.hbase.regionserver.Store.init(Store.java:221)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1636)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:321)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1571)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1538)
    at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1458)
    at java.lang.Thread.run(Thread.java:619)
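The name-to-encoded-name mapping Stack describes can be done with a grep, since the region open DEBUG lines print both forms together; the log path below is an illustrative placeholder:

    # 'Creating region <full region name>, encoded=NNNN' ties the long region
    # name to the encoded directory name used under /hbase in hdfs.
    grep 'Creating region table_import,8ded1642' /var/log/hbase/hbase-hadoop-regionserver-node1.log

    # Then list that region's directory using the encoded name it printed:
    hadoop fs -lsr /hbase/table_import/REGION_ENCODED_NAME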
Re: RegionServer can't recover after a failure
On Thu, Aug 26, 2010 at 8:16 AM, Andrey Timerbaev atimerb...@gmx.net wrote: Dear experts, Could you kindly suggest how to help the RegionServer complete initialization in the following situation: After a failure of one of the RegionServers, which runs on a dedicated node in an HBase/Hadoop cluster (HBase v.0.20.3), the RegionServer can't initialize the available tables. The region server's log contains this exception: You are running transactional hbase? This is intentional I take it. After a look into the HBase source code I found out that the "Table not created. Call createTable()" message appears if the HBaseBackedTransactionLogger is unable to find the __GLOBAL_TRX_LOG__ table. But I've got no idea where the table should be, whether it is critical, and what I should do in this situation. Me neither. Let me poke the transactional fellows and see if they can offer help. Thanks, St.Ack
Re: Out Of Memory on region servers upon bulk import
I provide the answers below. Thanks! Martin On Aug 26, 2010, at 11:45 AM, Stack wrote: On Thu, Aug 26, 2010 at 8:07 AM, Martin Arnandze marnan...@gmail.com wrote: Hi, I'm doing an experiment on an 8 node cluster, each of which has 6GB of RAM allocated to the hbase region server. Basically, doing a bulk import processing large files, How large? About 10 million records, each a few KB. Unless very large, it should not be OOMEing. but some imports require gets and scans as well. In the master UI I see that the heap used gets very close to the 6GB limit, but I know hbase is eager for memory and will use the heap as much as possible. I use block caching. Looking at similar posts I see that modifying the handler count and memstore upper/lower limits may be key to solving this issue. Nevertheless I wanted to ask if there is a way to estimate the extra memory used by hbase that makes it crash, and if there are other configuration settings I should be looking into to prevent OOME. The job runs correctly for some time but region servers eventually crash. More information about the cluster: - All nodes have 16GB total memory. - 7 nodes running region server (6GB) + datanode (1GB) + task tracker (1GB heap). Map reduce jobs running w/ 756MB tops each. Good. How many MR child tasks can run on each node concurrently? Three mappers and two reducers. - 1 node running hbase master (2GB heap), namenode (4GB), secondary namenode (4GB) and jobtracker (4GB). - 3 of the nodes have zookeeper running with 512MB heap. Many thanks, Martin Can we see the lines before the below is thrown? Also, do a listing (fs -lsr) on this region in hdfs and let's see if anything pops out about file sizes, etc. You'll need to manually map the below region name to its encoded name to figure the region, but the encoded name should be earlier in the log.
You'll do something like: bin/hadoop fs -lsr /hbase/table_import/REGION_ENCODED_NAME

/usr/lib/hadoop-0.20/bin/hadoop fs -lsr /hbase/table_import/1698505444
-rw-r--r--   3 hadoop supergroup        1450 2010-08-25 23:17 /hbase/table_import/1698505444/.regioninfo
drwxr-xr-x   - hadoop supergroup           0 2010-08-26 11:07 /hbase/table_import/1698505444/fam
-rw-r--r--   3 hadoop supergroup     4244491 2010-08-26 11:07 /hbase/table_import/1698505444/fam/5785049964186428982
-rw-r--r--   3 hadoop supergroup   180147216 2010-08-26 06:09 /hbase/table_import/1698505444/fam/705757673046090229

Previous log:
2010-08-26 07:18:14,691 INFO org.apache.hadoop.hbase.regionserver.HRegion: region table_import,f1cbb42c-b6ae-404d-800c-043da5409441-9223370754623831807WkmpwnRDmveKYzWEfw/tb4GpP9yHDl+/G7OCaZWEgrmGcW+XEF131YDTQwDqZsO93tDicdPcOdRq\x0AU7zDBqoxpA==,1282790086498/1451783432 available; sequence id is 1518500874
2010-08-26 07:18:14,691 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,128278399
2010-08-26 07:18:14,691 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Creating region table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,128278399, encoded=1510556231
2010-08-26 07:18:21,085 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=958.38617MB (1004940736), Free=238.2MB (249864000), Max=1196.675MB (1254804736), Counts: Blocks=115717, Access=51364517, Hit=231796, Miss=51132721, Evictions=15, Evicted=218920, Ratios: Hit Ratio=0.45127649791538715%, Miss Ratio=99.54872131347656%, Evicted/Run=14594.6669921875
2010-08-26 07:18:27,659 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/table_import/1510556231/fam/2639910770219077750, isReference=false, sequence id=1518500860, length=200014693, majorCompaction=false
2010-08-26 07:18:35,188 INFO org.apache.hadoop.hbase.regionserver.HRegion: region table_import,d1e50232ac85a0a965e48647de5dc6ce-92233707546248658079F073MJ/gGEEs6mwkLsY/lLH+QvHGVBhBavAz0HSPEEKY+NrjTTzHUJdPtuJ0lXqz2i2Qs2DmFkz\x0A5P2broA7Gg==,128278399/1510556231 available; sequence id is 1518500861
2010-08-26 07:18:35,188 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254
2010-08-26 07:18:35,189 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Creating region table_import,8ded1642-1c52-444a-bfdc-43521b220714-9223370754627999807UbbWwFDcGatAe8OniLMUXoaVeEdOvSkqiwXfJgUxNlt0aosKXsWevrlra8QDbEvTZelj/jLyux8y\x0AcCBiLeHbqg==,1282792675254, encoded=1698505444
2010-08-26 07:19:14,859 ERROR
hbase
I'm going through the overview-summary instructions for setting up and running hbase. Right now I'm running hbase in pseudo-distributed mode, and looking to go fully-distributed on 25 nodes. Every time I restart hbase, I get: Couldn't start ZK at requested address of n, instead got: n+1. Aborting. Why? Because clients (eg shell) won't be able to find this ZK quorum. If I change hbase.zookeeper.property.clientPort to the n+1 from the message it starts right up. Which file do I need to modify to keep this on one port, and what do I need to put into it? Is this something that should be added to the overview-summary page? Thanks, TimW
Re: hbase
Use netstat to see who is occupying port n. Maybe HQuorumPeer wasn't stopped from a previous run?
RE: hbase
Thanks! Netstat revealed I was running zookeeper twice. I stopped manually starting it, and things are working as expected. TimW
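For the record, the property Tim mentions belongs in hbase-site.xml; a sketch pinning the quorum to the stock port (2181 is the usual zookeeper default) might look like:

    <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2181</value>
      <!-- Must be free on the quorum hosts; if another process (e.g. a
           manually started zookeeper) already owns it, HQuorumPeer binds
           elsewhere and clients can't find the quorum. -->
    </property>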
Region splits in 0.89...
My hbase table issued a mass split after I loaded regions with sizes greater than maxfilesize.. (my bad..) Now, when I try accessing the master through the web interface, it just hangs... And, if I scan the META, I get the parent regions set to offline.. And the child regions have random byte keys (that was part of HBASE-2515).. But one thing that troubles me is that even the split child regions don't seem to have been assigned to any server according to META: I am not sure if this is expected..

ROW: DocDB,03617973,1282331439883.6a66354e2c58779278942a8e5cf152ca.
column=info:regioninfo, timestamp=1282810230593, value=REGION => {NAME => 'DocDB,03617973,1282331439883.6a66354e2c58779278942a8e5cf152ca.', STARTKEY => '03617973', ENDKEY => '03751972', ENCODED => 6a66354e2c58779278942a8e5cf152ca, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'DocDB', MAX_FILESIZE => '4294967296', FAMILIES => [{NAME => 'bigColumn', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '1048576', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
column=info:server, timestamp=1282810230593, value=
column=info:serverstartcode, timestamp=1282810230593, value=
column=info:splitA, timestamp=1282810230593, value=\x00\x1003684963\x00\x00\x00\x01*\xADr\xB4\x82FDocDB,03617973,1282810229890.57f2eed2d43cece270244a39233f5a5a.\x00\x1003617973\x00\x00\x00\x05\x05DocDB\x00\x00\x00\x00\x00\x03\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x0CMAX_FILESIZE\x00\x00\x00\x0A4294967296\x00\x00\x00\x01\x08\x09bigColumn\x00\x00\x00\x08\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCOPE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x011\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x071048576\x00\x00\x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x05false0\xF1\xD4\xBB
column=info:splitA_checked, timestamp=1282815236171, value=\x01
column=info:splitB, timestamp=1282810230593, value=\x00\x1003751972\x00\x00\x00\x01*\xADr\xB4\x82FDocDB,03684963,1282810229890.5fe1f7c4de1640642fb8454a1b8c623c.\x00\x1003684963\x00\x00\x00\x05\x05DocDB\x00\x00\x00\x00\x00\x03\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x0CMAX_FILESIZE\x00\x00\x00\x0A4294967296\x00\x00\x00\x01\x08\x09bigColumn\x00\x00\x00\x08\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCOPE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x011\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x071048576\x00\x00\x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x05false\xB54N\x1E
column=info:splitB_checked, timestamp=1282815236219, value=\x01
Re: region servers crashing
Without gc logs you cannot diagnose what you suspect are gc issues... make sure you are logging and then check them out. If you are running a recent JVM you can use -XX:+PrintGCDateStamps and get better log entries. Also, you cannot swap at all; even 1 page of swapping in a java process can be a killer. Combined with the hypervisor stealing your CPU, you can have a lot of elapsed wall time with not very many cpu slices being executed. Consider vmstat and top to diagnose that one issue. On the GC issue, the one setting you are using, the initiating occupancy fraction, is set kind of low. This means you will kick in the GC once you hit 50% of your memory usage. You might consider testing with that set to a medium level, say 75% or so. -ryan

On Thu, Aug 26, 2010 at 12:17 PM, Dmitry Chechik dmi...@tellapart.com wrote: Hi all, We're still seeing these crashes pretty frequently. Attached is the error from the regionserver logs as well as a GC dump of the last hour of the regionserver:
2010-08-26 13:34:10,855 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 157041ms, ten times longer than scheduled: 1
2010-08-26 13:34:10,925 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 148602ms, ten times longer than scheduled: 1000
2010-08-26 13:34:10,925 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for 148602 milliseconds - retrying
Since our workload is mostly scans in mapreduce, we've turned off block caching as per https://issues.apache.org/jira/browse/HBASE-2252 in case that had anything to do with it. We've also decreased NewSize and MaxNewSize and decreased CMSInitiatingOccupancyFraction, so our GC settings now are: -Xmx2000m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=50 -XX:NewSize=32m -XX:MaxNewSize=32m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps We're running with 2G of RAM. Is the solution here only to move to machines with more RAM, or are there other GC settings we should look at? Thanks, - Dmitry

On Wed, Jul 14, 2010 at 4:39 PM, Dmitry Chechik dmi...@tellapart.com wrote: We're running with 1GB of heap space. Thanks all - we'll look into GC tuning some more.

On Wed, Jul 14, 2010 at 3:47 PM, Jonathan Gray jg...@facebook.com wrote: This doesn't look like a clock skew issue. @Dmitry, while you should be running CMS, this is still a garbage collector and is still vulnerable to GC pauses. There are additional configuration parameters to tune even more. How much heap are you running with on your RSs? If you are hitting your servers with lots of load you should run with 4GB or more. Also, having ZK on the same servers as RS/DN is going to create problems if you're already hitting your IO limits. JG

-Original Message- From: Arun Ramakrishnan [mailto:aramakrish...@languageweaver.com] Sent: Wednesday, July 14, 2010 3:33 PM To: user@hbase.apache.org Subject: RE: region servers crashing Had a problem that caused issues that looked like this. 2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 86246ms, ten times longer than scheduled: 1000 Our problem was with clock skew. We just had to make sure ntp was running on all machines and also the timezones detected on all the machines were the same.
-Original Message- From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-Daniel Cryans Sent: Wednesday, July 14, 2010 3:11 PM To: user@hbase.apache.org Subject: Re: region servers crashing Dmitry, Your log shows this: 2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 86246ms, ten times longer than scheduled: 1000 This is a pause that lasted more than a minute, the process was in that state (GC, swapping, mix of all of them) for some reason and it was long enough to expire the ZooKeeper session (since from its point of view the region server stopped responding). The NPE is just a side-effect, it is caused by the huge pause. It's well worth upgrading, but it won't solve your pausing issues. I can only recommend closer monitoring, setting swappiness to 0 and giving more memory to HBase (if available). J-D On Wed, Jul 14, 2010 at 3:03 PM, Dmitry Chechik dmi...@tellapart.com wrote: Hi all, We've been having issues for a few days with HBase region servers crashing when under load from mapreduce jobs. There are a few different errors in the region server logs - I've attached a sample log of 4 different region servers crashing within an hour of each other. Some details: - This happens when a full table scan from a mapreduce is in progress. - We are running HBase 0.20.3, with a 16-slave cluster, on EC2. - Some of the region server errors are NPEs which look a lot like https://issues.apache.org/jira/browse/HBASE-2077. I'm not
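Pulling Ryan's suggestions into one place, a revised hbase-env.sh fragment might look like the following; the 75% occupancy fraction and date-stamped GC logging are his suggestions, the remaining flags carry over Dmitry's existing settings, and the gc log path is an illustrative placeholder:

    export HBASE_OPTS="-Xmx2000m -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=75 \
      -XX:NewSize=32m -XX:MaxNewSize=32m \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
      -Xloggc:/tmp/hbase-regionserver-gc.log"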
Re: region servers crashing
Ok, I didn't see your logs earlier - normally attachments are filtered out and we use pastebin for logs. I am not seeing any large pauses in your gc logs. Not sure if the log is complete enough or what... -ryan
Getting data from Hbase from client/remote computer
Hi All, I am new to the hbase client API and want to know how to get data from hbase from a client/remote machine. The target is to develop a java program which should connect to the hbase server and then get results from it. Does anyone have an example? Thanks -- Regards Shuja-ur-Rehman Baig http://pk.linkedin.com/in/shujamughal Cell: +92 3214207445
Re: Region splits in 0.89...
Hey Vidhya, Anything interesting in the logs on the master? It's possible that the master is slowly assigning out all the daughter regions, and it's just taking a really long time since you loaded so many. -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Region splits in 0.89...
It's possible that the master is slowly assigning out all the daughter regions, and it's just taking a really long time since you loaded so many. I forgot to add it the last time.. But I think that's what is happening.. Master getting choked up because of a flash crowd of splits.. (Anyways, I just wanted to verify that the META table scan was a possible output)... I will reload the data with some altered configs.. Thank you Vidhya
Re: Getting data from Hbase from client/remote computer
Check the documentation: http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/client/package-summary.html#overview J-D
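As a concrete starting point alongside that documentation, a minimal remote client could look like the sketch below; the quorum host, table, row, family, and qualifier names are placeholders, and the 0.20-era client API is assumed:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RemoteGet {
      public static void main(String[] args) throws Exception {
        // Point the client at the cluster's zookeeper quorum; alternatively,
        // drop the cluster's hbase-site.xml on the client classpath.
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.set("hbase.zookeeper.quorum", "zkhost.example.com");
        HTable table = new HTable(conf, "mytable");
        Get get = new Get(Bytes.toBytes("myrow"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("myfamily"), Bytes.toBytes("myqualifier"));
        System.out.println(value == null ? "(no value)" : Bytes.toString(value));
      }
    }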
jobtracker.jsp
I'm running map/reduce jobs from a java app (table mapper reducer) in true distributed mode..I don't see anything in the jobtracker page..The map/reduce job runs fine..Am I missing some config? thanks venkatesh
Re: jobtracker.jsp
So what's the log on your client side? -- Best Regards Jeff Zhang
RE: Getting data from Hbase from client/remote computer
Another way: via REST. http://wiki.apache.org/hadoop/Hbase/Stargate Xiujin Yang
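A sketch of the REST route, assuming a Stargate instance is already running on its default port 8080; table, row, and column names are placeholders:

    # Fetch a single cell; Stargate chooses the representation from the
    # Accept header (xml, json, protobuf, or raw octet-stream).
    curl -H "Accept: application/json" \
      http://stargate-host:8080/mytable/myrow/myfamily:myqualifier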
RE: jobtracker.jsp
Hi, When I run the Hbase performance test, I met the same problem. When jobs run locally, they don't show up on the job list. Best, Xiujin Yang.

To: user@hbase.apache.org Subject: Re: jobtracker.jsp Date: Thu, 26 Aug 2010 22:30:09 -0400 From: vramanatha...@aol.com

yeah..log says it's running Locally..i've to figure out why..
2010-08-26 08:49:01,491 INFO Thread-16 org.apache.hadoop.mapred.MapTask - Starting flush of map output
2010-08-26 08:49:01,578 INFO Thread-16 org.apache.hadoop.mapred.TaskRunner - Task:attempt_local_0001_m_00_0 is done. And is in the process of commiting
2010-08-26 08:49:01,586 INFO Thread-16 org.apache.hadoop.mapred.LocalJobRunner -
2010-08-26 08:49:01,587 INFO Thread-16 org.apache.hadoop.mapred.TaskRunner - Task 'attempt_local_0001_m_00_0' done.
2010-08-26 08:49:01,613 INFO Thread-16 org.apache.hadoop.mapred.LocalJobRunner -
2010-08-26 08:49:01,630 INFO Thread-16 org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
2010-08-26 08:49:01,640 INFO Thread-16 org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 0 segments left of total size: 0 bytes
2010-08-26 08:49:01,640 INFO Thread-16 org.apache.hadoop.mapred.LocalJobRunner -
2010-08-26 08:49:01,658 INFO Thread-16 org.apache.hadoop.mapred.TaskRunner - Task:attempt_local_0001_r_00_0 is done. And is in the process of commiting
2010-08-26 08:49:01,659 INFO Thread-16 org.apache.hadoop.mapred.LocalJobRunner - reduce reduce
2010-08-26 08:49:01,660 INFO Thread-16 org.apache.hadoop.mapred.TaskRunner - Task 'attempt_local_0001_r_00_0' done.
Re: Out Of Memory on region servers upon bulk import
Hi Martin, Can you paste your conf? Have you by any chance upped your handler count a lot? Each handler takes up an amount of RAM equal to the largest Puts you do. With normal write buffer sizes, you're looking at around 2MB per handler, so while it sounds nice to bump the handler count up to a really high number, you can get OOMEs like you're seeing. Thanks -Todd
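To make Todd's arithmetic concrete: the default client write buffer (hbase.client.write.buffer) is 2MB, so a server configured with, say, 100 handlers can transiently pin on the order of 100 x 2MB = 200MB just in in-flight Puts, on top of memstores and block cache. A hypothetical hbase-site.xml entry reining that in (the value shown is illustrative):

    <property>
      <name>hbase.regionserver.handler.count</name>
      <value>25</value>
      <!-- Worst-case in-flight RPC payload is roughly
           handler.count x hbase.client.write.buffer (2MB by default). -->
    </property>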
Re: region servers crashing
Looking in the RS log I see this:

2010-08-26 13:34:11,056 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 36 on 60020 took 148265ms appending an edit to hlog; editcount=2990
2010-08-26 13:34:11,056 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 86 on 60020 took 148265ms appending an edit to hlog; editcount=2991

That it took near 3 minutes appending to the log is probably because we were under a stop-the-world GC pause. Regarding CMSInitiatingOccupancyFraction, we used to have it set to 88% and we thought that by setting it lower we'd kick off more frequent (but smaller) GC collections, and reduce the chance of any one of them pausing. Isn't 88% almost the default? Come down more I'd say. I see earlier you ran with 50% CMSInitiatingOccupancyFraction? That didn't work? Starting earlier, you'll pay in CPU but it should help. Set UseCMSInitiatingOccupancyOnly too so the GC will consider your CMSInitiatingOccupancyFraction setting only before it starts the CMS full GC [1]. How many cores do you have? If <= 4, you should add -XX:+CMSIncrementalMode. Update your JVM. u21 has some fixes to help with heap fragmentation (-XX:+DoEscapeAnalysis probably won't work on u21, IIRC). I just noticed you are running on EC2. Use bigger nodes? You are also running old hbase which had a particular way of scanning in a manner that was RAM expensive. Update to 0.20.6 hbase. Doing this will have no effect on the zk lease, the thing that is responsible for servers going down: We tried increasing hbase.regionserver.lease.period to 2 minutes but that didn't seem to make a difference here. Up the ticktime, this setting:

<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>3000</value>
  <description>Property from ZooKeeper's config zoo.cfg. The number of
  milliseconds of each tick. See zookeeper.session.timeout description.
  </description>
</property>

Also up the below (see description for relation between below and the above ticktime).

<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
  <description>ZooKeeper session timeout. HBase passes this to the zk quorum
  as suggested maximum time for a session. See
  http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
  "The client sends a requested timeout, the server responds with the timeout
  that it can give the client. The current implementation requires that the
  timeout be a minimum of 2 times the tickTime (as set in the server
  configuration) and a maximum of 20 times the tickTime." Set the zk ticktime
  with hbase.zookeeper.property.tickTime. In milliseconds.
  </description>
</property>

St.Ack

1. http://markmail.org/thread/e43gybkrcecg5rxo

So, if this isn't a GC issue, is there anything else it could be, based on the logs? Thanks, - Dmitry
Re: jobtracker.jsp
Thanks J-D. I figured out I didn't have mapred-site.xml in my WEB-INF/classes directory (classpath). I copied that from the cluster..that fixed part of it..Now I'm missing zookeeper in hadoop-env.sh:HADOOP_CLASSPATH. I distinctly looked at this link a while ago.. it didn't have zookeeper listed.. (I have everything else, i.e. hbase-*..) perhaps I had an old link. Can all the config in mapred-site.xml be added to hbase-site.xml?..It kind of works with them being separate.. just wondering.. Have one more question..I also have trouble stopping namenode/datanode/jobtracker..to make this classpath effective. Is there a force shutdown option (other than kill -9)..? venkatesh

-Original Message- From: Jean-Daniel Cryans jdcry...@apache.org To: user@hbase.apache.org Sent: Fri, Aug 27, 2010 12:10 am Subject: Re: jobtracker.jsp

HBase needs to know about the job tracker, it could be on the same machine or distant, and that's taken care of by giving HBase mapred's configurations. Here's the relevant documentation: http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath J-D
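The "running locally" symptom earlier in this thread is the LocalJobRunner: if no mapred-site.xml is on the client classpath, mapred.job.tracker defaults to "local" and the job never reaches the real JobTracker. A minimal client-side mapred-site.xml might look like this (host and port are placeholders for your JobTracker address):

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker.example.com:9001</value>
      </property>
    </configuration>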