Re: hbase schema design and retrieving values through REST interface
@ Jean-Daniel, As I said, each row key contains thousands of column family values (maybe I am wrong with the schema design). I started REST and tried to cURL http://localhost/tablename/rowname. It seems to work only with a limited amount of data (maybe I can limit the cURL output), so how can I limit the column values returned for a particular row? Suppose I have two thousand urls under a keyword and I need to fetch the urls but limit the result to five hundred. How is that possible? @ tsuna, It seems http://www.elasticsearch.org/ is using CouchDB, right? On Tue, Mar 15, 2011 at 11:32 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: Can you tell why it's not able to get the bigger rows? Why would you try another schema if you don't even know what's going on right now? If you have the same issue with the new schema, you're back to square one, right? Looking at the logs should give you some hints. J-D On Tue, Mar 15, 2011 at 10:19 AM, sreejith P. K. sreejit...@nesote.com wrote: Hello experts, I have a scenario as follows: I need to maintain a huge table for a 'web crawler' project in HBASE. Basically it contains thousands of keywords, and for each keyword I need to maintain a list of urls (which again will count in thousands). Corresponding to each url, I need to store a number, which in turn represents the priority value the keyword holds. Let me explain a bit. Suppose I have a keyword 'united states'; I need to store about ten thousand urls corresponding to that keyword. Each keyword will hold a priority value, which is an integer. Again, I have thousands of keywords like that. The unusual thing about this is that I need to do the project in PHP. I have configured a hadoop-hbase cluster consisting of three machines. My plan was to design the schema by taking the keyword as the 'row key'. The urls I keep as a column family. The schema looked fine at first. I have done a lot of research on how to retrieve the url list if I know the keyword. Anyway, I managed a way out by preg-matching the xml output of the url http://localhost:8080/tablename/rowkey (I used the REST interface). It also works fine if the url list has a limited number of urls. When it comes in thousands, it seems I cannot fetch the xml data at all! Now I am in a do-or-die situation. Please correct me if my schema design needs any changes (I do believe it should change!) and please help me retrieve the column family values (urls) corresponding to each row key in an efficient way. Please also guide me on how I can do the same using the PHP-REST interface. Thanks in advance. Sreejith -- Sreejith PK Nesote Technologies (P) Ltd
Hash keys
Hi, To help avoid hotspots, I'm planning to use hashed keys in some tables. 1. I wonder if this strategy is advisable for range-query (from/to key) use cases, because the rows will be randomly distributed in different regions. Will it cause some performance loss? 2. Is it possible to query from the hbase shell with something like get 't1', @hash('r1'), to let the shell compute the hash for you from the readable key? 3. There are MD5 and Jenkins classes in the hbase.util package. Which would you advise? What about SHA1? Tks, - Eric PS: I searched the archive but didn't find the answers.
Re: Hash keys
(For 2) I think the hash function should work in the shell if it returns a string type (like what '' defines in-place). On Wed, Mar 16, 2011 at 2:22 PM, Eric Charles eric.char...@u-mangate.com wrote: Hi, To help avoid hotspots, I'm planning to use hashed keys in some tables. 1. I wonder if this strategy is adviced for range queries (from/to key) use case, because the rows will be randomly distributed in different regions. Will it cause some performance loose? 2. Is it possible to query from hbase shell with something like get 't1', @hash('r1'), to let the shell compute the hash for you from the readable key. 3. There are MD5 and Jenkins classes in hbase.util package. What would you advice? what about SHA1? Tks, - Eric PS: I searched the archive but didn't find the answers. -- Harsh J http://harshj.com
Re: hbase schema design and retrieving values through REST interface
With this schema, if i can limit the column family over a particular range, I can manage everything else. (like Select first n columns of a column family) Sreejith On Wed, Mar 16, 2011 at 12:33 PM, sreejith P. K. sreejit...@nesote.comwrote: @ Jean-Daniel, As i told, each row key contains thousands of column family values (may be i am wrong with the schema design). I started REST and tried to cURL http:/localhost/tablename/rowname. It seems it will work only with limited amount of data (may be i can limit the cURL output), and how i can limit the column values for a particular row? Suppose i have two thousand urls under a keyword and i need to fetch the urls and should limit the result to five hundred. How it is possible?? @ tsuna, It seems http://www.elasticsearch.org/ using CouchDB right? On Tue, Mar 15, 2011 at 11:32 PM, Jean-Daniel Cryans jdcry...@apache.orgwrote: Can you tell why it's not able to get the bigger rows? Why would you try another schema if you don't even know what's going on right now? If you have the same issue with the new schema, you're back to square one right? Looking at the logs should give you some hints. J-D On Tue, Mar 15, 2011 at 10:19 AM, sreejith P. K. sreejit...@nesote.com wrote: Hello experts, I have a scenario as follows, I need to maintain a huge table for a 'web crawler' project in HBASE. Basically it contains thousands of keywords and for each keyword i need to maintain a list of urls (it again will count in thousands). Corresponding to each url, i need to store a number, which will in turn resemble the priority value the keyword holds. Let me explain you a bit, Suppose i have a keyword 'united states', i need to store about ten thousand urls corresponding to that keyword. Each keyword will be holding a priority value which is an integer. Again i have thousands of keywords like that. The rare thing about this is i need to do the project in PHP. I have configured a hadoop-hbase cluster consists of three machines. My plan was to design the schema by taking the keyword as 'row key'. The urls i will keep as column family. The schema looked fine at first. I have done a lot of research on how to retrieve the url list if i know the keyword. Any ways i managed a way out by preg-matching the xml data out put using the url http://localhost:8080/tablename/rowkey (REST interface i used). It also works fine if the url list has a limited number of urls. When it comes in thousands, it seems i cannot fetch the xml data itself! Now I am in a do or die situation. Please correct me if my schema design needs any changes (I do believe it should change!) and please help me up to retrieve the column family values (urls) corresponding to each row-key in an efficient way. Please guide me how i can do the same using PHP-REST interface. Thanks in advance. Sreejith -- Sreejith PK Nesote Technologies (P) Ltd -- Sreejith PK Nesote Technologies (P) Ltd
Re: Hash keys
Hi Eric, Mozilla Socorro uses an approach where they bucket ranges using leading hashes to distribute them across servers. When you want to do scans you need to create N scans, where N is the number of hashes, and then do a next() on each scanner, putting all KVs into one sorted list (use the KeyComparator for example) while stripping the prefix hash first. You can then access the rows in sorted order, where the first element in the list is the one with the first key to read. Once you take off the first element (being the lowest KV key) you next the underlying scanner and reinsert it into the list, reordering it. You keep taking from the top and therefore always see the entire range, even if the same scanner would return the next logical rows to read. The shell is written in JRuby, so any function you can use there would make sense to use in the prefix, then you could compute it on the fly. This will not help with merging the bucketed key ranges, you need to do this with the above approach in code. Though since this is JRuby you could write that code in Ruby and add it to your local shell, giving you what you need. Lars On Wed, Mar 16, 2011 at 9:01 AM, Eric Charles eric.char...@u-mangate.com wrote: Oops, forget my first question about range query (if keys are hashed, they can not be queried based on a range...) Still curious to have info on the hash function in the shell (2.) and advice on md5/jenkins/sha1 (3.) Tks, Eric On 16/03/2011 09:52, Eric Charles wrote: Hi, To help avoid hotspots, I'm planning to use hashed keys in some tables. 1. I wonder if this strategy is adviced for range queries (from/to key) use case, because the rows will be randomly distributed in different regions. Will it cause some performance loose? 2. Is it possible to query from hbase shell with something like get 't1', @hash('r1'), to let the shell compute the hash for you from the readable key. 3. There are MD5 and Jenkins classes in hbase.util package. What would you advice? what about SHA1? Tks, - Eric PS: I searched the archive but didn't find the answers.
Re: Hash keys
Hi, I understand from your answer that it's possible but not available out of the box. Has anyone already implemented such functionality? If not, where should I begin to look (hirb.rb, any tutorial, ...)? - I know nothing about jruby. Tks, - Eric On 16/03/2011 10:39, Harsh J wrote: (For 2) I think the hash function should work in the shell if it returns a string type (like what '' defines in-place). On Wed, Mar 16, 2011 at 2:22 PM, Eric Charles eric.char...@u-mangate.com wrote: Hi, To help avoid hotspots, I'm planning to use hashed keys in some tables. 1. I wonder if this strategy is adviced for range queries (from/to key) use case, because the rows will be randomly distributed in different regions. Will it cause some performance loose? 2. Is it possible to query from hbase shell with something like get 't1', @hash('r1'), to let the shell compute the hash for you from the readable key. 3. There are MD5 and Jenkins classes in hbase.util package. What would you advice? what about SHA1? Tks, - Eric PS: I searched the archive but didn't find the answers. -- Harsh J http://harshj.com
Re: Hash keys
Hi Lars, Are you talking about http://code.google.com/p/socorro/ ? I can find python scripts, but no jruby one... Aside the hash function I could reuse, are you saying that range queries are possible even with hashed keys (randomly distributed)? (If possible with the script, it will also be possible from the hbase java client). Even with your explanation, I can't figure out how compound keys (hasedkey+key) can be range-queried. Tks, - Eric On 16/03/2011 11:38, Lars George wrote: Hi Eric, Mozilla Socorro uses an approach where they bucket ranges using leading hashes to distribute them across servers. When you want to do scans you need to create N scans, where N is the number of hashes and then do a next() on each scanner, putting all KVs into one sorted list (use the KeyComparator for example) while stripping the prefix hash first. You can then access the rows in sorted order where the first element in the list is the one with the first key to read. Once you took of the first element (being the lowest KV key) you next the underlying scanner and reinsert it into the list, reordering it. You keep taking from the top and therefore always see the entire range, even if the same scanner would return the next logical rows to read. The shell is written in JRuby, so any function you can use there would make sense to use in the prefix, then you could compute it on the fly. This will not help with merging the bucketed key ranges, you need to do this with the above approach in code. Though since this is JRuby you could write that code in Ruby and add it to you local shell giving you what you need. Lars On Wed, Mar 16, 2011 at 9:01 AM, Eric Charles eric.char...@u-mangate.com wrote: Oops, forget my first question about range query (if keys are hashed, they can not be queried based on a range...) Still curious to have info on hash function in shell shell (2.) and advice on md5/jenkins/sha1 (3.) Tks, Eric On 16/03/2011 09:52, Eric Charles wrote: Hi, To help avoid hotspots, I'm planning to use hashed keys in some tables. 1. I wonder if this strategy is adviced for range queries (from/to key) use case, because the rows will be randomly distributed in different regions. Will it cause some performance loose? 2. Is it possible to query from hbase shell with something like get 't1', @hash('r1'), to let the shell compute the hash for you from the readable key. 3. There are MD5 and Jenkins classes in hbase.util package. What would you advice? what about SHA1? Tks, - Eric PS: I searched the archive but didn't find the answers.
Re: One of the regionserver aborted, then the master shut down itself
Hi J-D, Thanks for your reply. You said, == Just as an example, every value that you insert first has to be copied from the socket before it can be inserted into the MemStore. If you are using a big write buffer, that means that every insert currently in flight in a region server takes double that amount of space. == How can I control the size of write buffer? I find a property 'hbase.client.write.buffer' in hbase-default.xml, do you mean this one? We use RESTful api to put our cells, hopefully, this would not make any difference. As for the memroy usage of the master, I did a further investigation today. What I was doing was keeping putting cells as before. As I said yesterday, the Java heap kept increasing accordingly, and eventually OOME happened as I expected. I set -Xmx to 1GB to speed up OOME. Then I used Eclipse Memory Analyzer to analyze the hprof file. It tells that most of the java heap is occupied by an instance of Class AssignmentManager (For ease of reading, I think you can copy the result part to what ever editor you like, at least it works for me.) Class Name | Shallow Heap | Retained Heap --- org.apache.hadoop.hbase.master.AssignmentManager @ 0x7f01050d4c98 | 112 | 974,967,592 |- class class org.apache.hadoop.hbase.master.AssignmentManager @ 0x7f013c21ebd0 |8 | 8 |- master org.apache.hadoop.hbase.master.HMaster @ 0x7f01050521e0 master-cloud135:6 Busy Monitor, Thread | 328 | 3,000 |- regionsInTransition java.util.concurrent.ConcurrentSkipListMap @ 0x7f01050c1000 | 88 | 296 |- watcher org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher @ 0x7f01051cce68 | 136 | 1,720 |- timeoutMonitor org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor @ 0x7f01052505a8 cloud135:6.timeoutMonitor Thread| 208 | 592 |- zkTable org.apache.hadoop.hbase.zookeeper.ZKTable @ 0x7f01052c0318 | 32 | 400 |- catalogTracker org.apache.hadoop.hbase.catalog.CatalogTracker @ 0x7f01052c5fd0 | 72 | 376 |- serverManager org.apache.hadoop.hbase.master.ServerManager @ 0x7f01052f0138 | 80 | 932,000 |- regionPlans java.util.TreeMap @ 0x7f01052f01d8 | 80 | 104 |- servers java.util.TreeMap @ 0x7f01052f0228 | 80 |75,128 |- regions java.util.TreeMap @ 0x7f01052f0278 | 80 | 950,435,488 | |- class class java.util.TreeMap @ 0x7f013be45c30 System Class | 16 |16 | |- root java.util.TreeMap$Entry @ 0x7f010542b790 | 64 | 950,435,408 | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class |0 | 0 | | |- left java.util.TreeMap$Entry @ 0x7f01053d34b0 | 64 | 579,650,616 | | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class |0 | 0 | | | |- right java.util.TreeMap$Entry @ 0x7f01053d34f0 | 64 | 270,674,784 | | | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class |0 | 0 | | | | |- left java.util.TreeMap$Entry @ 0x7f01053c7568 | 64 | 162,321,936 | | | | |- parent java.util.TreeMap$Entry @ 0x7f01053d34b0 | 64 | 579,650,616 | | | | |- right java.util.TreeMap$Entry @ 0x7f01054cbbe8 | 64 | 107,828,656 | | | | |- value org.apache.hadoop.hbase.HServerInfo @ 0x7f010f6866c0 | 72 | 154,328 | | | | | |- class class org.apache.hadoop.hbase.HServerInfo @ 0x7f013c61e3e0 |8 | 8 | | | | | |- load org.apache.hadoop.hbase.HServerLoad @ 0x7f010540a548 | 40 | 153,776 | | | | | |- serverName java.lang.String @ 0x7f010540a9a8 cloud138,60020,1300161207678 | 40 | 120 | | | | | |- hostname java.lang.String @ 0x7f010540ab60 cloud138 | 40 |80 | | | | | |- serverAddress org.apache.hadoop.hbase.HServerAddress @ 0x7f01054c3020 | 32 | 280 | | | | | '- Total: 5 entries | | | | | | |- key 
org.apache.hadoop.hbase.HRegionInfo @ 0x7f010f77bd68 | 88 | 3,200 | | | | '- Total: 6 entries | | | | | |- parent java.util.TreeMap$Entry @ 0x7f010542b790 | 64 | 950,435,408 | | | |- left java.util.TreeMap$Entry @ 0x7f0105432b70 | 64 | 307,135,480 | | | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class |0 | 0 | | | | |- parent java.util.TreeMap$Entry @ 0x7f01053d34b0 |
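The client-side write buffer J-D refers to is controlled by hbase.client.write.buffer (2 MB by default) and only takes effect when autoFlush is switched off on the Java client; whether and how it applies when writing through the REST gateway is a separate question. A minimal sketch of how it would be set from the Java API follows; the table, family and value names are placeholders, not taken from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteBufferExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Client default; 2097152 (2 MB) unless overridden.
        conf.setLong("hbase.client.write.buffer", 1024 * 1024);

        HTable table = new HTable(conf, "mytable");   // placeholder table name
        table.setAutoFlush(false);                    // buffer puts on the client side
        table.setWriteBufferSize(1024 * 1024);        // or set it per table instance

        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        table.put(put);        // queued in the client-side buffer
        table.flushCommits();  // pushes the buffered puts to the region servers
        table.close();
    }
}

A smaller buffer means more round trips but less data held in memory per flush on both the client and the receiving region server.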
Re: One of the regionserver aborted, then the master shut down itself
Regarding AssignmentManager, it looks like only hold regions in transition. We can see lots of region split and unsignment in the master log. I guess it was due to our large cells and the endless insertion. Does this make sense? I have not dig into the code, I do belive it removes the regions from the AssignmentManager.regions once the transition completes, right? Mao Xu-Feng On Wed, Mar 16, 2011 at 7:09 PM, 茅旭峰 m9s...@gmail.com wrote: Hi J-D, Thanks for your reply. You said, == Just as an example, every value that you insert first has to be copied from the socket before it can be inserted into the MemStore. If you are using a big write buffer, that means that every insert currently in flight in a region server takes double that amount of space. == How can I control the size of write buffer? I find a property 'hbase.client.write.buffer' in hbase-default.xml, do you mean this one? We use RESTful api to put our cells, hopefully, this would not make any difference. As for the memroy usage of the master, I did a further investigation today. What I was doing was keeping putting cells as before. As I said yesterday, the Java heap kept increasing accordingly, and eventually OOME happened as I expected. I set -Xmx to 1GB to speed up OOME. Then I used Eclipse Memory Analyzer to analyze the hprof file. It tells that most of the java heap is occupied by an instance of Class AssignmentManager (For ease of reading, I think you can copy the result part to what ever editor you like, at least it works for me.) Class Name | Shallow Heap | Retained Heap --- org.apache.hadoop.hbase.master.AssignmentManager @ 0x7f01050d4c98 | 112 | 974,967,592 |- class class org.apache.hadoop.hbase.master.AssignmentManager @ 0x7f013c21ebd0 |8 | 8 |- master org.apache.hadoop.hbase.master.HMaster @ 0x7f01050521e0 master-cloud135:6 Busy Monitor, Thread | 328 | 3,000 |- regionsInTransition java.util.concurrent.ConcurrentSkipListMap @ 0x7f01050c1000 | 88 | 296 |- watcher org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher @ 0x7f01051cce68 | 136 | 1,720 |- timeoutMonitor org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor @ 0x7f01052505a8 cloud135:6.timeoutMonitor Thread| 208 | 592 |- zkTable org.apache.hadoop.hbase.zookeeper.ZKTable @ 0x7f01052c0318 | 32 | 400 |- catalogTracker org.apache.hadoop.hbase.catalog.CatalogTracker @ 0x7f01052c5fd0 | 72 | 376 |- serverManager org.apache.hadoop.hbase.master.ServerManager @ 0x7f01052f0138 | 80 | 932,000 |- regionPlans java.util.TreeMap @ 0x7f01052f01d8 | 80 | 104 |- servers java.util.TreeMap @ 0x7f01052f0228 | 80 |75,128 |- regions java.util.TreeMap @ 0x7f01052f0278 | 80 | 950,435,488 | |- class class java.util.TreeMap @ 0x7f013be45c30 System Class | 16 |16 | |- root java.util.TreeMap$Entry @ 0x7f010542b790 | 64 | 950,435,408 | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class |0 | 0 | | |- left java.util.TreeMap$Entry @ 0x7f01053d34b0 | 64 | 579,650,616 | | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class |0 | 0 | | | |- right java.util.TreeMap$Entry @ 0x7f01053d34f0 | 64 | 270,674,784 | | | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class |0 | 0 | | | | |- left java.util.TreeMap$Entry @ 0x7f01053c7568 | 64 | 162,321,936 | | | | |- parent java.util.TreeMap$Entry @ 0x7f01053d34b0 | 64 | 579,650,616 | | | | |- right java.util.TreeMap$Entry @ 0x7f01054cbbe8 | 64 | 107,828,656 | | | | |- value org.apache.hadoop.hbase.HServerInfo @ 0x7f010f6866c0 | 72 | 154,328 | | | | | |- class class 
org.apache.hadoop.hbase.HServerInfo @ 0x7f013c61e3e0 |8 | 8 | | | | | |- load org.apache.hadoop.hbase.HServerLoad @ 0x7f010540a548 | 40 | 153,776 | | | | | |- serverName java.lang.String @ 0x7f010540a9a8 cloud138,60020,1300161207678 | 40 | 120 | | | | | |- hostname java.lang.String @ 0x7f010540ab60 cloud138 | 40 |80 | | | | | |- serverAddress org.apache.hadoop.hbase.HServerAddress @ 0x7f01054c3020 | 32 | 280 | | | | | '- Total: 5 entries | | | | | | |- key org.apache.hadoop.hbase.HRegionInfo @
Re: Hash keys
Hi Eric, Socorro is Java and Python, I was just mentioning it as a possible source of inspiration :) You can learn Ruby and implement it (I hear it is easy... *cough*) or write that same in a small Java app and use it from the command line or so. And yes, you can range scan using a prefix. We were discussing this recently and there is this notion of design for reads, or design for writes. DFR is usually sequential keys and DFW is random keys. It is tough to find common grounds as both designs are on the far end of the same spectrum. Finding a middle ground is the bucketed (or salted) approach, which gives you distribution but still being able to scan... but not without some client side support. One typical class of data is timeseries based keys. As for scanning them, you need N client side scanners. Imagine this example: row 1 ... 1000 - Prefix h1_ row 1001 ... 2000 - Prefix h2_ row 2001 ... 3000 - Prefix h3_ row 3001 ... 4000 - Prefix h4_ row 4001 ... 5000 - Prefix h5_ row 5001 ... 6000 - Prefix h6_ row 6001 ... 7000 - Prefix h7_ So you have divided the entire range into 7 buckets. The prefixes (also sometimes called salt) are used to distribute them row keys to region servers. To scan the entire range as one large key space you need to create 7 scanners: 1. scanner: start row: h1_, end row h2_ 2. scanner: start row: h2_, end row h3_ 3. scanner: start row: h3_, end row h4_ 4. scanner: start row: h4_, end row h5_ 5. scanner: start row: h5_, end row h6_ 6. scanner: start row: h6_, end row h7_ 7. scanner: start row: h7_, end row Now each of them gives you the first row that matches the start and end row keys they are configure for. So you then take that first KV they offer and add it to a list, sorted by ky.getRow() while removing the hash prefix. For example, scanner 1 may have row h1_1 to offer, then split and drop the prefix h1_ to get 1. The list then would hold something like: 1. row 1 - kv from scanner 1 2. row 1010 - kv from scanner 2 3. row 2001 - kv from scanner 3 4. row 3033 - kv from scanner 4 5. row 4001 - kv from scanner 5 6. row 5002 - kv from scanner 6 7. row 6000 - kv from scanner 7 (assuming that the keys are not contiguous but have gaps) You then pop element #1 and do a scanner1.next() to get its next KV offering. Then insert that into the list and you get 1. row 3 - kv from scanner 1 2. row 1010 - kv from scanner 2 3. row 2001 - kv from scanner 3 4. row 3033 - kv from scanner 4 5. row 4001 - kv from scanner 5 6. row 5002 - kv from scanner 6 7. row 6000 - kv from scanner 7 Notice how you always only have a list with N elements on the client side, each representing the next value the scanners offer. Since the list is sorted you always access item #1 and therefore the next in the entire key space. Once scanner 1 runs out you can close and remove it, the list will then give you values from scanner 2 as the first elements in it. And so on. Makes more sense? Lars On Wed, Mar 16, 2011 at 12:09 PM, Eric Charles eric.char...@u-mangate.com wrote: Hi Lars, Are you talking about http://code.google.com/p/socorro/ ? I can find python scripts, but no jruby one... Aside the hash function I could reuse, are you saying that range queries are possible even with hashed keys (randomly distributed)? (If possible with the script, it will also be possible from the hbase java client). Even with your explanation, I can't figure out how compound keys (hasedkey+key) can be range-queried. 
Tks, - Eric On 16/03/2011 11:38, Lars George wrote: Hi Eric, Mozilla Socorro uses an approach where they bucket ranges using leading hashes to distribute them across servers. When you want to do scans you need to create N scans, where N is the number of hashes and then do a next() on each scanner, putting all KVs into one sorted list (use the KeyComparator for example) while stripping the prefix hash first. You can then access the rows in sorted order where the first element in the list is the one with the first key to read. Once you took of the first element (being the lowest KV key) you next the underlying scanner and reinsert it into the list, reordering it. You keep taking from the top and therefore always see the entire range, even if the same scanner would return the next logical rows to read. The shell is written in JRuby, so any function you can use there would make sense to use in the prefix, then you could compute it on the fly. This will not help with merging the bucketed key ranges, you need to do this with the above approach in code. Though since this is JRuby you could write that code in Ruby and add it to you local shell giving you what you need. Lars On Wed, Mar 16, 2011 at 9:01 AM, Eric Charles eric.char...@u-mangate.com wrote: Oops, forget my first question about range query (if keys are hashed, they can not be queried based on a range...) Still curious to have info on hash function in shell shell (2.) and advice on md5/jenkins/sha1 (3.) Tks,
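A minimal Java sketch of the N-scanner merge Lars describes, assuming the seven buckets h1_ ... h7_ from his example and that the readable key follows the first underscore; the table name 'mytable' is a placeholder. It keeps one Result per open scanner in a priority queue ordered by the de-salted key, so the client always emits the globally next row:

import java.util.PriorityQueue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedScanMerge {

    // One open scanner plus its current row, ordered by the key with the salt stripped.
    static class Bucket implements Comparable<Bucket> {
        final ResultScanner scanner;
        Result current;
        Bucket(ResultScanner scanner, Result current) { this.scanner = scanner; this.current = current; }
        String logicalKey() {
            String row = Bytes.toString(current.getRow());
            return row.substring(row.indexOf('_') + 1);   // drop the "hN_" prefix
        }
        public int compareTo(Bucket other) { return logicalKey().compareTo(other.logicalKey()); }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");        // placeholder table name
        PriorityQueue<Bucket> heap = new PriorityQueue<Bucket>();

        // One scanner per salt bucket h1_ ... h7_ (the stop row is exclusive).
        for (int i = 1; i <= 7; i++) {
            Scan scan = new Scan(Bytes.toBytes("h" + i + "_"), Bytes.toBytes("h" + (i + 1) + "_"));
            ResultScanner rs = table.getScanner(scan);
            Result first = rs.next();
            if (first != null) heap.add(new Bucket(rs, first)); else rs.close();
        }

        // Always emit the smallest de-salted key, then advance that scanner and re-insert it.
        while (!heap.isEmpty()) {
            Bucket b = heap.poll();
            System.out.println(b.logicalKey());
            Result next = b.scanner.next();
            if (next != null) { b.current = next; heap.add(b); } else b.scanner.close();
        }
        table.close();
    }
}

A sorted list or the KeyValue comparator mentioned above would work just as well as the ordering structure; the point is only that at most N entries live on the client at any time.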
Re: CopyTable MR job hangs
Double thanks (one for each reply) J-D, I'll use distcp as you suggest. -eran On Tue, Mar 15, 2011 at 19:10, Jean-Daniel Cryans jdcry...@apache.orgwrote: Strangely enough I did answer that question the day you sent it but it doesn't show up on the mailing list aggregators even tho gmail marks it as sent... anyways here's what I said: It won't work because those versions aren't wire-compatible. What you can do instead is doing an Export, distcp the files, then do an Import. If the hadoop versions are different, use the hftp interface like the distcp documentation recommends. J-D On Tue, Mar 15, 2011 at 1:11 AM, Eran Kutner e...@gigya.com wrote: No idea anyone? -eran On Wed, Mar 2, 2011 at 16:40, Eran Kutner e...@gigya.com wrote: Hi, I'm trying to copy data from an older cluster using 0.89 (CDH3b3) to a new one using 0.91 (CDH3b4) using the CopyTable MR job but it always hangs on map 0% reduce 0% until eventually the job is killed by Hadoop for not responding after 600 seconds. I verified that it works fine when copying from one table to another on the same cluster and I verified that the servers in the source cluster have network access to those in the destination cluster. Any idea what could be causing it? -eran
Hbase without hadoop
Hi, I am new to hbase and trying to write my first POJO to access an hbase table. Please bear with my query; it seems very simple, but I am not able to find the answer myself. I am using the sample code from the API docs: Configuration config = HBaseConfiguration.create(); HTable table = new HTable(config, "myLittleHBaseTable"); I have the hbase-0.90.1 jar on the class path, but the strange thing I found is that the Configuration [org.apache.hadoop.conf.Configuration] class is not in that jar. So is it the case that I need to add the hadoop jar to the class path as well? I understand very well that hbase and hadoop go hand in hand; is that the reason the Configuration class is not included in the hbase 0.90.1 distribution?
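org.apache.hadoop.conf.Configuration is a Hadoop class, not an HBase one, so the Hadoop core jar (HBase 0.90.1 ships one under hbase/lib, along with its other dependencies) needs to be on the classpath in addition to hbase-0.90.1.jar. A minimal, self-contained version of the snippet above; the row key and output handling are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MyLittleHBaseClient {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml / hbase-default.xml from the classpath.
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "myLittleHBaseTable");
        Result r = table.get(new Get(Bytes.toBytes("myLittleRow")));   // illustrative row key
        System.out.println(r.isEmpty() ? "row not found" : Bytes.toString(r.getRow()));
        table.close();
    }
}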
Re: Hash keys
Hi Lars, Many tks for your explanations! About the DFR (sequential-keys) vs DFW (random-keys) distinction, I imagine different cases (just rephrasing what you said to be sure I get it): - Keys are really random (GUID or whatever): you have the distribution for free; you can't do range-queries, but probably don't need them. - If keys are monotonically increasing (timestamp, autoincremented, ...), there are two cases: 1) Sometimes you don't need range-queries and can store the key as a real hash (md5, ...) to get distribution. 2) For time-based series, for example, you may need to do some range queries, and adding a salt can be an answer that combines the best of both worlds. I understand the salt approach as recreating artificial key spaces on the client side. I was first confused reading "row 1...1000 - prefix h1_". To really make the distribution random, I would have expected the prefix/salt to be attributed randomly to a key, leading for example to an h1 keyspace such as: h1_key2032, h1_key0023, h1_key1014343, ... Maybe you meant the intermediate approach where time keys of hour 1 go to the h1 keyspace, keys of hour 2 go to the h2 keyspace, ... In that case, if you look for keys in hour 1, you would only need one scanner because you know that they reside in h1_, and you could query with scan(h1_time1, h1_time2). But at any time, as you describe, you may need to scan different buckets with different scanners and use an ordered list to contain the result. - What about performance in that case? For a very large dataset, a range query will take much time. I can imagine the async client coming to the rescue. Maybe mapreduce jobs could also help, because they will benefit from data locality. - Also, the client application must manage the salts: it's a bit like reinventing a salt layer on top of the hbase region servers, letting the client carry this layer. The client will have to store (in hbase :)) the mapping between key ranges and their salt prefixes. It's a bit like exporting some core(?) functionality to the client. Strange, I feel I missed your point :) Tks, - Eric Sidenote: ...and yes, it seems I will have to learn some ruby stuff (I should get used to it, since I just learned another scripting language running on the JVM for another project...) On 16/03/2011 13:00, Lars George wrote: Hi Eric, Socorro is Java and Python, I was just mentioning it as a possible source of inspiration :) You can learn Ruby and implement it (I hear it is easy... *cough*) or write that same in a small Java app and use it from the command line or so. And yes, you can range scan using a prefix. We were discussing this recently and there is this notion of design for reads, or design for writes. DFR is usually sequential keys and DFW is random keys. It is tough to find common grounds as both designs are on the far end of the same spectrum. Finding a middle ground is the bucketed (or salted) approach, which gives you distribution but still being able to scan... but not without some client side support. One typical class of data is timeseries based keys. As for scanning them, you need N client side scanners. Imagine this example: row 1 ... 1000 - Prefix h1_ row 1001 ... 2000 - Prefix h2_ row 2001 ... 3000 - Prefix h3_ row 3001 ... 4000 - Prefix h4_ row 4001 ... 5000 - Prefix h5_ row 5001 ... 6000 - Prefix h6_ row 6001 ... 7000 - Prefix h7_ So you have divided the entire range into 7 buckets. The prefixes (also sometimes called salt) are used to distribute them row keys to region servers. To scan the entire range as one large key space you need to create 7 scanners: 1.
scanner: start row: h1_, end row h2_ 2. scanner: start row: h2_, end row h3_ 3. scanner: start row: h3_, end row h4_ 4. scanner: start row: h4_, end row h5_ 5. scanner: start row: h5_, end row h6_ 6. scanner: start row: h6_, end row h7_ 7. scanner: start row: h7_, end row Now each of them gives you the first row that matches the start and end row keys they are configure for. So you then take that first KV they offer and add it to a list, sorted by ky.getRow() while removing the hash prefix. For example, scanner 1 may have row h1_1 to offer, then split and drop the prefix h1_ to get 1. The list then would hold something like: 1. row 1 - kv from scanner 1 2. row 1010 - kv from scanner 2 3. row 2001 - kv from scanner 3 4. row 3033 - kv from scanner 4 5. row 4001 - kv from scanner 5 6. row 5002 - kv from scanner 6 7. row 6000 - kv from scanner 7 (assuming that the keys are not contiguous but have gaps) You then pop element #1 and do a scanner1.next() to get its next KV offering. Then insert that into the list and you get 1. row 3 - kv from scanner 1 2. row 1010 - kv from scanner 2 3. row 2001 - kv from scanner 3 4. row 3033 - kv from scanner 4 5. row 4001 - kv from scanner 5 6. row 5002 - kv from scanner 6 7. row 6000 - kv from scanner 7 Notice how you always only have a list with N elements on the client side, each representing the next value the
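Regarding the last point, one common variant avoids storing any mapping at all: derive the salt deterministically from the readable key, e.g. prefix = hash(key) mod N. Any client can then recompute the bucket for a point get, while range scans still need the N-scanner merge sketched earlier. A small illustrative sketch; the bucket count and key format are assumptions, not something prescribed in this thread:

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public class Salt {
    static final int BUCKETS = 7;   // assumed bucket count

    // Derive the salted row key from the readable key itself, so the client
    // never has to store a mapping between key ranges and prefixes.
    static byte[] saltedKey(String key) {
        String md5 = MD5Hash.getMD5AsHex(Bytes.toBytes(key));
        int bucket = (md5.hashCode() & 0x7fffffff) % BUCKETS;
        return Bytes.toBytes("h" + (bucket + 1) + "_" + key);
    }

    public static void main(String[] args) {
        // Prints something like "h4_2011-03-16T12:00"; the bucket depends on the hash.
        System.out.println(Bytes.toString(saltedKey("2011-03-16T12:00")));
    }
}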
Re: which hadoop and zookeeper version should I use with hbase 0.90.1
On Mon, Feb 28, 2011 at 8:11 PM, Stack st...@duboce.net wrote: On Sun, Feb 27, 2011 at 1:31 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , sorry for asking the same question couple of times , but I still have no clear understanding which hadoop version I have to install for hbase 0.90.1. Any information will be really appreciated Yeah, our version story is a little messy at the moment. Would appreciate any input that would help us make it more clear. More below... 1) From http://hbase.apache.org/notsoquick.html#hadoop I understand that hadoop-0.20-append is an official version for hbase. I case I am going to compile it : Do I have checkout main branch or there is a recomended tag? If someone already compiled this version and had an issues please share it. So, the documentation says No official releases have been made from this branch up to now so you will have to build your own Hadoop from the tip of this branch., so yes, you'll have to build it. The branch-0.20-append link in the documentation is to the branch in SVN that you'd need to checkout and build. This was not obvious to you so I need to reword this paragraph to be more clear. How about if I insert after the above sentence Checkout this branch [with a link to the branch in svn] and then compile it by... Would that be better? 2) I found cloudera maven repository and I see there only hadoop-0.20.2 version. Does this version supports durability and suitable for hbase 0.90.1? or I need to copy jars from hadoop-0.20-append to hadoop-0.20.2 cloudera version? I looked for CDH3 and CDH4 but didn't find hadoop-0.20-append version. Again, the documentation must be insufficiently clear here. We link to the CDH3 page. We also state it beta. What would you suggest? Question: does cloudera hadoop version (0.20.2) is suitable for hbase 0.90.1? CDH3b2,CDH3b3, or CDH3b4 are all suitable (each is an hadoop 0.20.2++). In case I am going to use cloudera do I need to install all parts (hadoop, hbase ,zookeper ...) from cloudera or it is possible to take only hadoop installation and other products (hbase , zookeper) I can install from standard distributions? Any of above combinations should work. If you use CDH3b4, you can take all from CDH since it includes 0.90.1. Otherwise, you could use CDH hadoop and use your hbase build for the rest. St.Ack It took some time , but we succeeded to compile hadoop version. We decided to take an official version for hbase. I am only concern about version which we get after compilation. The version is *0.20.3-SNAPSHOT, r1057313. * * Does this version is a suitable version for hbase?* * * Thanks in advance , Oleg. * * * *
Re: Hash keys
Using Java classes itself is possible from within HBase shell (since it is JRuby), but yes some Ruby knowledge should be helpful too! For instance, I can use java.lang.String by simply importing it: hbase(main):004:0 import java.lang.String = Java::JavaLang::String hbase(main):004:0 get String.new('test'), String.new('row1') COLUMN CELL f:a timestamp=1300170063837, value=val4 1 row(s) in 0.0420 seconds On Wed, Mar 16, 2011 at 4:26 PM, Eric Charles eric.char...@u-mangate.com wrote: Hi, I understand from your answer that it's possible but not available. Did anyone already implemented such a functionality? If not, where should I begin to look at (hirb.rb, any tutorial,... ?) - I know nothing about jruby. Tks, - Eric On 16/03/2011 10:39, Harsh J wrote: (For 2) I think the hash function should work in the shell if it returns a string type (like what '' defines in-place). On Wed, Mar 16, 2011 at 2:22 PM, Eric Charles eric.char...@u-mangate.com wrote: Hi, To help avoid hotspots, I'm planning to use hashed keys in some tables. 1. I wonder if this strategy is adviced for range queries (from/to key) use case, because the rows will be randomly distributed in different regions. Will it cause some performance loose? 2. Is it possible to query from hbase shell with something like get 't1', @hash('r1'), to let the shell compute the hash for you from the readable key. 3. There are MD5 and Jenkins classes in hbase.util package. What would you advice? what about SHA1? Tks, - Eric PS: I searched the archive but didn't find the answers. -- Harsh J http://harshj.com
Re: One of the regionserver aborted, then the master shut down itself
Thanks for your analysis. Once a region is offline, it is removed from regions BTW your cluster needs more machines. 7600 regions over 4 nodes place too much load on the servers. On Wed, Mar 16, 2011 at 4:28 AM, 茅旭峰 m9s...@gmail.com wrote: Regarding AssignmentManager, it looks like only hold regions in transition. We can see lots of region split and unsignment in the master log. I guess it was due to our large cells and the endless insertion. Does this make sense? I have not dig into the code, I do belive it removes the regions from the AssignmentManager.regions once the transition completes, right? Mao Xu-Feng On Wed, Mar 16, 2011 at 7:09 PM, 茅旭峰 m9s...@gmail.com wrote: Hi J-D, Thanks for your reply. You said, == Just as an example, every value that you insert first has to be copied from the socket before it can be inserted into the MemStore. If you are using a big write buffer, that means that every insert currently in flight in a region server takes double that amount of space. == How can I control the size of write buffer? I find a property 'hbase.client.write.buffer' in hbase-default.xml, do you mean this one? We use RESTful api to put our cells, hopefully, this would not make any difference. As for the memroy usage of the master, I did a further investigation today. What I was doing was keeping putting cells as before. As I said yesterday, the Java heap kept increasing accordingly, and eventually OOME happened as I expected. I set -Xmx to 1GB to speed up OOME. Then I used Eclipse Memory Analyzer to analyze the hprof file. It tells that most of the java heap is occupied by an instance of Class AssignmentManager (For ease of reading, I think you can copy the result part to what ever editor you like, at least it works for me.) Class Name | Shallow Heap | Retained Heap --- org.apache.hadoop.hbase.master.AssignmentManager @ 0x7f01050d4c98 | 112 | 974,967,592 |- class class org.apache.hadoop.hbase.master.AssignmentManager @ 0x7f013c21ebd0 |8 | 8 |- master org.apache.hadoop.hbase.master.HMaster @ 0x7f01050521e0 master-cloud135:6 Busy Monitor, Thread | 328 | 3,000 |- regionsInTransition java.util.concurrent.ConcurrentSkipListMap @ 0x7f01050c1000 | 88 | 296 |- watcher org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher @ 0x7f01051cce68 | 136 | 1,720 |- timeoutMonitor org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor @ 0x7f01052505a8 cloud135:6.timeoutMonitor Thread| 208 | 592 |- zkTable org.apache.hadoop.hbase.zookeeper.ZKTable @ 0x7f01052c0318 | 32 | 400 |- catalogTracker org.apache.hadoop.hbase.catalog.CatalogTracker @ 0x7f01052c5fd0 | 72 | 376 |- serverManager org.apache.hadoop.hbase.master.ServerManager @ 0x7f01052f0138 | 80 | 932,000 |- regionPlans java.util.TreeMap @ 0x7f01052f01d8 | 80 | 104 |- servers java.util.TreeMap @ 0x7f01052f0228 | 80 |75,128 |- regions java.util.TreeMap @ 0x7f01052f0278 | 80 | 950,435,488 | |- class class java.util.TreeMap @ 0x7f013be45c30 System Class | 16 |16 | |- root java.util.TreeMap$Entry @ 0x7f010542b790 | 64 | 950,435,408 | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class |0 | 0 | | |- left java.util.TreeMap$Entry @ 0x7f01053d34b0 | 64 | 579,650,616 | | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class | 0 | 0 | | | |- right java.util.TreeMap$Entry @ 0x7f01053d34f0 | 64 | 270,674,784 | | | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class |0 | 0 | | | | |- left java.util.TreeMap$Entry @ 0x7f01053c7568 | 64 | 162,321,936 | | | | |- parent java.util.TreeMap$Entry @ 0x7f01053d34b0 
| 64 | 579,650,616 | | | | |- right java.util.TreeMap$Entry @ 0x7f01054cbbe8 | 64 | 107,828,656 | | | | |- value org.apache.hadoop.hbase.HServerInfo @ 0x7f010f6866c0 | 72 | 154,328 | | | | | |- class class org.apache.hadoop.hbase.HServerInfo @ 0x7f013c61e3e0 |8 | 8 | | | | | |- load org.apache.hadoop.hbase.HServerLoad @ 0x7f010540a548 | 40 | 153,776 | | | | | |- serverName java.lang.String @ 0x7f010540a9a8 cloud138,60020,1300161207678 | 40 | 120 | | | | | |-
java.io.FileNotFoundException:
Does anyone know how to get around this? Trying to run a mapreduce job in a cluster. The one change was that hbase was upgraded to 0.90.1 (from 0.20.6); no code change. java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103151601_0363/libjars/zookeeper-3.2.2.jar does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447) at com.aol.mail.antispam.Profiler.UserProfileJob.run(UserProfileJob.java:1916) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java
Re: Hash keys
Cool. Everything is already available. I simply have to import MD5Hash and use the to_java_bytes ruby function. hbase(main):001:0 import org.apache.hadoop.hbase.util.MD5Hash = Java::OrgApacheHadoopHbaseUtil::MD5Hash hbase(main):002:0 put 'test', MD5Hash.getMD5AsHex('row1'.to_java_bytes), 'cf:a', 'value1' 0 row(s) in 0.5880 seconds hbase(main):004:0 get 'test', 'row1' COLUMN CELL 0 row(s) in 0.0170 seconds hbase(main):003:0 get 'test', MD5Hash.getMD5AsHex('row1'.to_java_bytes) COLUMN CELL cf:a timestamp=1300287899911, value=value1 1 row(s) in 0.0840 seconds Many tks, Eric On 16/03/2011 15:44, Harsh J wrote: Using Java classes itself is possible from within HBase shell (since it is JRuby), but yes some Ruby knowledge should be helpful too! For instance, I can use java.lang.String by simply importing it: hbase(main):004:0 import java.lang.String = Java::JavaLang::String hbase(main):004:0 get String.new('test'), String.new('row1') COLUMN CELL f:a timestamp=1300170063837, value=val4 1 row(s) in 0.0420 seconds On Wed, Mar 16, 2011 at 4:26 PM, Eric Charles eric.char...@u-mangate.com wrote: Hi, I understand from your answer that it's possible but not available. Did anyone already implemented such a functionality? If not, where should I begin to look at (hirb.rb, any tutorial,... ?) - I know nothing about jruby. Tks, - Eric On 16/03/2011 10:39, Harsh J wrote: (For 2) I think the hash function should work in the shell if it returns a string type (like what '' defines in-place). On Wed, Mar 16, 2011 at 2:22 PM, Eric Charles eric.char...@u-mangate.comwrote: Hi, To help avoid hotspots, I'm planning to use hashed keys in some tables. 1. I wonder if this strategy is adviced for range queries (from/to key) use case, because the rows will be randomly distributed in different regions. Will it cause some performance loose? 2. Is it possible to query from hbase shell with something like get 't1', @hash('r1'), to let the shell compute the hash for you from the readable key. 3. There are MD5 and Jenkins classes in hbase.util package. What would you advice? what about SHA1? Tks, - Eric PS: I searched the archive but didn't find the answers.
Re: One of the regionserver aborted, then the master shut down itself
Thanks Ted! === Once a region is offline, it is removed from regions === By 'offline' here, do you mean unassigned, having already been split into smaller regions? I think we have too many regions because we're using large cells, and normally a region is hundreds of megabytes in size. BTW, is there a property that sets the size of a region? Do you think setting a larger region size could help in our scenario? If AssignmentManager.regions holds all the online regions, the size of regions is (number of online regions) X (number of online regions) / (number of region servers), right? So to cut the size of regions, we can either increase the region size or add more region servers, right? Just out of curiosity, why should we keep the per-region load in each HServerLoad in AssignmentManager.regions? I guess it keeps changing dynamically. Thanks and regards, Mao Xu-Feng On Wed, Mar 16, 2011 at 11:03 PM, Ted Yu yuzhih...@gmail.com wrote: Thanks for your analysis. Once a region is offline, it is removed from regions BTW your cluster needs more machines. 7600 regions over 4 nodes place too much load on the servers. On Wed, Mar 16, 2011 at 4:28 AM, 茅旭峰 m9s...@gmail.com wrote: Regarding AssignmentManager, it looks like only hold regions in transition. We can see lots of region split and unsignment in the master log. I guess it was due to our large cells and the endless insertion. Does this make sense? I have not dig into the code, I do belive it removes the regions from the AssignmentManager.regions once the transition completes, right? Mao Xu-Feng On Wed, Mar 16, 2011 at 7:09 PM, 茅旭峰 m9s...@gmail.com wrote: Hi J-D, Thanks for your reply. You said, == Just as an example, every value that you insert first has to be copied from the socket before it can be inserted into the MemStore. If you are using a big write buffer, that means that every insert currently in flight in a region server takes double that amount of space. == How can I control the size of write buffer? I find a property 'hbase.client.write.buffer' in hbase-default.xml, do you mean this one? We use RESTful api to put our cells, hopefully, this would not make any difference. As for the memroy usage of the master, I did a further investigation today. What I was doing was keeping putting cells as before. As I said yesterday, the Java heap kept increasing accordingly, and eventually OOME happened as I expected. I set -Xmx to 1GB to speed up OOME. Then I used Eclipse Memory Analyzer to analyze the hprof file. It tells that most of the java heap is occupied by an instance of Class AssignmentManager (For ease of reading, I think you can copy the result part to what ever editor you like, at least it works for me.)
Class Name | Shallow Heap | Retained Heap --- org.apache.hadoop.hbase.master.AssignmentManager @ 0x7f01050d4c98 | 112 | 974,967,592 |- class class org.apache.hadoop.hbase.master.AssignmentManager @ 0x7f013c21ebd0 |8 | 8 |- master org.apache.hadoop.hbase.master.HMaster @ 0x7f01050521e0 master-cloud135:6 Busy Monitor, Thread | 328 | 3,000 |- regionsInTransition java.util.concurrent.ConcurrentSkipListMap @ 0x7f01050c1000 | 88 | 296 |- watcher org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher @ 0x7f01051cce68 | 136 | 1,720 |- timeoutMonitor org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor @ 0x7f01052505a8 cloud135:6.timeoutMonitor Thread| 208 | 592 |- zkTable org.apache.hadoop.hbase.zookeeper.ZKTable @ 0x7f01052c0318 | 32 | 400 |- catalogTracker org.apache.hadoop.hbase.catalog.CatalogTracker @ 0x7f01052c5fd0 | 72 | 376 |- serverManager org.apache.hadoop.hbase.master.ServerManager @ 0x7f01052f0138 | 80 | 932,000 |- regionPlans java.util.TreeMap @ 0x7f01052f01d8 | 80 | 104 |- servers java.util.TreeMap @ 0x7f01052f0228 | 80 |75,128 |- regions java.util.TreeMap @ 0x7f01052f0278 | 80 | 950,435,488 | |- class class java.util.TreeMap @ 0x7f013be45c30 System Class | 16 |16 | |- root java.util.TreeMap$Entry @ 0x7f010542b790 | 64 | 950,435,408 | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class |0 | 0 | | |- left java.util.TreeMap$Entry @ 0x7f01053d34b0 | 64 | 579,650,616 | | | |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class
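On the region-size question: the split threshold is the cluster-wide property hbase.hregion.max.filesize (256 MB by default in 0.90.x), and it can also be overridden per table via HTableDescriptor.setMaxFileSize. A sketch of both follows; the table name and the 1 GB limit are only illustrative. Fewer, larger regions would shrink the per-region bookkeeping discussed above at the cost of coarser splits:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class BigRegionTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Cluster-wide split threshold (hbase-site.xml would be the usual place for this):
        // conf.setLong("hbase.hregion.max.filesize", 1024L * 1024 * 1024);

        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("bigcells");   // illustrative table name
        desc.setMaxFileSize(1024L * 1024 * 1024);                   // 1 GB per-table override
        desc.addFamily(new HColumnDescriptor("cf"));
        admin.createTable(desc);
    }
}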
after upgrade, fatal error in regionserver compactor, LzoCompressor, AbstractMethodError
We upgraded to Hadoop 0.20.1 and Hbase 0.90.1 (both CDH3B4). We are using 64-bit machines. Startup goes fine, but right after the first compaction we get this error: Uncaught exception in service thread regionserver60020.compactor java.lang.AbstractMethodError: com.hadoop.compression.lzo.LzoCompressor.reinit(Lorg/apache/hadoop/conf/Configuration;)V at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:105) at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:112) at org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.getCompressor(Compression.java:200) at org.apache.hadoop.hbase.io.hfile.HFile$Writer.getCompressingStream(HFile.java:397) at org.apache.hadoop.hbase.io.hfile.HFile$Writer.newBlock(HFile.java:383) at org.apache.hadoop.hbase.io.hfile.HFile$Writer.checkBlockBoundary(HFile.java:354) at org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:536) at org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:501) at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:836) at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:935) at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733) at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769) at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714) at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81) LZO worked fine. This is how I believe we used it: # LZO compression in Hbase will pass through three layers: # 1) hadoop-gpl-compression-*.jar in the hbase/lib directory; the entry point # 2) libgplcompression.* in the hbase native lib directory; the native connectors # 3) liblzo2.so.2 in the hbase native lib directory; the base native library Anyway, it would be great if somebody could help us out.
Re: java.io.FileNotFoundException:
0.90.1 ships with zookeeper-3.3.2, not with 3.2.2. St.Ack On Wed, Mar 16, 2011 at 8:05 AM, Venkatesh vramanatha...@aol.com wrote: Does anyone how to get around this? Trying to run a mapreduce job in a cluster..The one change was hbase upgraded to 0.90.1 (from 0.20.6)..No code change java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103151601_0363/libjars/zookeeper-3.2.2.jar does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447) at com.aol.mail.antispam.Profiler.UserProfileJob.run(UserProfileJob.java:1916) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java
Re: which hadoop and zookeeper version should I use with hbase 0.90.1
From where did you get the src? Thanks, St.Ack On Wed, Mar 16, 2011 at 7:12 AM, Oleg Ruchovets oruchov...@gmail.com wrote: On Mon, Feb 28, 2011 at 8:11 PM, Stack st...@duboce.net wrote: On Sun, Feb 27, 2011 at 1:31 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , sorry for asking the same question couple of times , but I still have no clear understanding which hadoop version I have to install for hbase 0.90.1. Any information will be really appreciated Yeah, our version story is a little messy at the moment. Would appreciate any input that would help us make it more clear. More below... 1) From http://hbase.apache.org/notsoquick.html#hadoop I understand that hadoop-0.20-append is an official version for hbase. I case I am going to compile it : Do I have checkout main branch or there is a recomended tag? If someone already compiled this version and had an issues please share it. So, the documentation says No official releases have been made from this branch up to now so you will have to build your own Hadoop from the tip of this branch., so yes, you'll have to build it. The branch-0.20-append link in the documentation is to the branch in SVN that you'd need to checkout and build. This was not obvious to you so I need to reword this paragraph to be more clear. How about if I insert after the above sentence Checkout this branch [with a link to the branch in svn] and then compile it by... Would that be better? 2) I found cloudera maven repository and I see there only hadoop-0.20.2 version. Does this version supports durability and suitable for hbase 0.90.1? or I need to copy jars from hadoop-0.20-append to hadoop-0.20.2 cloudera version? I looked for CDH3 and CDH4 but didn't find hadoop-0.20-append version. Again, the documentation must be insufficiently clear here. We link to the CDH3 page. We also state it beta. What would you suggest? Question: does cloudera hadoop version (0.20.2) is suitable for hbase 0.90.1? CDH3b2,CDH3b3, or CDH3b4 are all suitable (each is an hadoop 0.20.2++). In case I am going to use cloudera do I need to install all parts (hadoop, hbase ,zookeper ...) from cloudera or it is possible to take only hadoop installation and other products (hbase , zookeper) I can install from standard distributions? Any of above combinations should work. If you use CDH3b4, you can take all from CDH since it includes 0.90.1. Otherwise, you could use CDH hadoop and use your hbase build for the rest. St.Ack It took some time , but we succeeded to compile hadoop version. We decided to take an official version for hbase. I am only concern about version which we get after compilation. The version is *0.20.3-SNAPSHOT, r1057313. * * Does this version is a suitable version for hbase?* * * Thanks in advance , Oleg. * * * *
Re: hbase schema design and retrieving values through REST interface
You can limit what is returned when scanning from the java api; see http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int) This facility is not exposed in the REST API at the moment (not that I know of -- please someone correct me if I'm wrong). So, yes, wide rows with thousands of elements of some size, since they need to be composed all in RAM, could bring on an OOME if the composed size exceeds the available heap. St.Ack On Wed, Mar 16, 2011 at 2:41 AM, sreejith P. K. sreejit...@nesote.com wrote: With this schema, if i can limit the column family over a particular range, I can manage everything else. (like Select first n columns of a column family) Sreejith On Wed, Mar 16, 2011 at 12:33 PM, sreejith P. K. sreejit...@nesote.comwrote: @ Jean-Daniel, As i told, each row key contains thousands of column family values (may be i am wrong with the schema design). I started REST and tried to cURL http:/localhost/tablename/rowname. It seems it will work only with limited amount of data (may be i can limit the cURL output), and how i can limit the column values for a particular row? Suppose i have two thousand urls under a keyword and i need to fetch the urls and should limit the result to five hundred. How it is possible?? @ tsuna, It seems http://www.elasticsearch.org/ using CouchDB right? On Tue, Mar 15, 2011 at 11:32 PM, Jean-Daniel Cryans jdcry...@apache.orgwrote: Can you tell why it's not able to get the bigger rows? Why would you try another schema if you don't even know what's going on right now? If you have the same issue with the new schema, you're back to square one right? Looking at the logs should give you some hints. J-D On Tue, Mar 15, 2011 at 10:19 AM, sreejith P. K. sreejit...@nesote.com wrote: Hello experts, I have a scenario as follows, I need to maintain a huge table for a 'web crawler' project in HBASE. Basically it contains thousands of keywords and for each keyword i need to maintain a list of urls (it again will count in thousands). Corresponding to each url, i need to store a number, which will in turn resemble the priority value the keyword holds. Let me explain you a bit, Suppose i have a keyword 'united states', i need to store about ten thousand urls corresponding to that keyword. Each keyword will be holding a priority value which is an integer. Again i have thousands of keywords like that. The rare thing about this is i need to do the project in PHP. I have configured a hadoop-hbase cluster consists of three machines. My plan was to design the schema by taking the keyword as 'row key'. The urls i will keep as column family. The schema looked fine at first. I have done a lot of research on how to retrieve the url list if i know the keyword. Any ways i managed a way out by preg-matching the xml data out put using the url http://localhost:8080/tablename/rowkey (REST interface i used). It also works fine if the url list has a limited number of urls. When it comes in thousands, it seems i cannot fetch the xml data itself! Now I am in a do or die situation. Please correct me if my schema design needs any changes (I do believe it should change!) and please help me up to retrieve the column family values (urls) corresponding to each row-key in an efficient way. Please guide me how i can do the same using PHP-REST interface. Thanks in advance. Sreejith -- Sreejith PK Nesote Technologies (P) Ltd -- Sreejith PK Nesote Technologies (P) Ltd
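A sketch of the setBatch approach from the Java client for the "first five hundred URLs of a keyword" case discussed earlier; the table name 'keywords' and family name 'urls' are placeholders. With a batch size of 500, the first call to next() returns at most the first 500 columns of the row, so a very wide row never has to be materialized in full:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class FirstNColumns {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "keywords");                 // placeholder table name

        byte[] row = Bytes.toBytes("united states");
        Scan scan = new Scan(row, Bytes.add(row, new byte[] { 0 })); // bounds the scan to this one row
        scan.addFamily(Bytes.toBytes("urls"));                       // placeholder family name
        scan.setBatch(500);                                          // at most 500 columns per Result

        ResultScanner scanner = table.getScanner(scan);
        Result first = scanner.next();                               // first 500 columns of the row
        if (first != null) {
            for (KeyValue kv : first.raw()) {
                System.out.println(Bytes.toString(kv.getQualifier()));
            }
        }
        scanner.close();
        table.close();
    }
}

Calling next() again would return the following 500 columns of the same row, so the same scanner can also page through the rest of the URL list.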
Re: Hash keys
On Wed, Mar 16, 2011 at 8:36 PM, Eric Charles eric.char...@u-mangate.com wrote: Cool. Everything is already available. Great! 1 row(s) in 0.0840 seconds 1 row(s) in 0.0420 seconds Interesting how your test's get time is exactly double that of my test ;-) -- Harsh J http://harshj.com
Re: Hash keys
A new laptop is definitely on my investment plan :) Tks, Eric On 16/03/2011 18:56, Harsh J wrote: On Wed, Mar 16, 2011 at 8:36 PM, Eric Charles eric.char...@u-mangate.com wrote: Cool. Everything is already available. Great! 1 row(s) in 0.0840 seconds 1 row(s) in 0.0420 seconds Interesting, how your test's get time is exactly the double of my test ;-)
Re: Hash keys
...and probably the additional hashing doesn't help the performance. Eric On 16/03/2011 19:17, Eric Charles wrote: A new laptop is definitively on my invest plan :) Tks, Eric On 16/03/2011 18:56, Harsh J wrote: On Wed, Mar 16, 2011 at 8:36 PM, Eric Charles eric.char...@u-mangate.com wrote: Cool. Everything is already available. Great! 1 row(s) in 0.0840 seconds 1 row(s) in 0.0420 seconds Interesting, how your test's get time is exactly the double of my test ;-)
Re: which hadoop and zookeeper version should I use with hbase 0.90.1
I get the src from here. http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/ On Wed, Mar 16, 2011 at 7:40 PM, Stack st...@duboce.net wrote: From where did you get the src? Thanks, St.Ack
Re: which hadoop and zookeeper version should I use with hbase 0.90.1
That's the correct branch, so you should be good! On Wed, Mar 16, 2011 at 1:17 PM, Oleg Ruchovets oruchov...@gmail.com wrote: I get the src from here. http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/
Re: which hadoop and zookeeper version should I use with hbase 0.90.1
Got it, thank you St.Ack. On Wed, Mar 16, 2011 at 10:23 PM, Stack st...@duboce.net wrote: You should be good then. Make sure you put the hadoop you built under hbase/lib (removing the old hadoop). The hadoop-0.20.X-SNAPSHOT.x.x. is just how it's named on that branch. See the build.xml. St.Ack
Row Counters
1. How do I count rows fast in hbase? First I tried count 'test', which takes ages. Saw that I could use RowCounter, but it looks like it is deprecated. When I try to use it, I get java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details. at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98) If this is deprecated, is there any other way of finding the counts? I just need to verify the total counts. Is it possible to see this somewhere in the web interface or ganglia or by any other means? Viv
Re: Row Counters
On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna vivekris...@gmail.com wrote: 1. How do I count rows fast in hbase? First I tired count 'test' , takes ages. Saw that I could use RowCounter, but looks like it is deprecated. It is not. Make sure you are using the one from mapreduce package as opposed to mapred package. I just need to verify the total counts. Is it possible to see somewhere in the web interface or ganglia or by any other means? We don't keep a current count on a table. Too expensive. Run the rowcounter MR job. This page may be of help: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description Good luck, St.Ack
Re: Row Counters
$ ./bin/hadoop jar hbase*.jar rowcounter Search for the related discussion on search-hadoop. On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna vivekris...@gmail.com wrote: 1. How do I count rows fast in hbase? First I tired count 'test' , takes ages. Saw that I could use RowCounter, but looks like it is deprecated. When I try to use it, I get java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details. at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98) If this is deprecated, is there any other way of finding the counts? I just need to verify the total counts. Is it possible to see somewhere in the web interface or ganglia or by any other means? Viv
Re: Row Counters
Just a random thought. What about keeping a per region row count? Then if you needed to get a row count for a table you'd just have to query each region once and sum. Seems like it wouldn't be too expensive because you'd just have a row counter variable. It may be more complicated than I'm making it out to be though... ~Jeff On 3/16/2011 2:40 PM, Stack wrote: On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna vivekris...@gmail.com wrote: 1. How do I count rows fast in hbase? First I tired count 'test' , takes ages. Saw that I could use RowCounter, but looks like it is deprecated. It is not. Make sure you are using the one from mapreduce package as opposed to mapred package. I just need to verify the total counts. Is it possible to see somewhere in the web interface or ganglia or by any other means? We don't keep a current count on a table. Too expensive. Run the rowcounter MR job. This page may be of help: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description Good luck, St.Ack -- Jeff Whiting Qualtrics Senior Software Engineer je...@qualtrics.com
RE: Row Counters
When I needed to know a row count for a table, I kept a separate table just for that purpose and would update/query that table. Low tech but it worked. -Pete -Original Message- From: Jeff Whiting [mailto:je...@qualtrics.com] Sent: Wednesday, March 16, 2011 1:46 PM To: user@hbase.apache.org Cc: Stack Subject: Re: Row Counters Just a random thought. What about keeping a per region row count? Then if you needed to get a row count for a table you'd just have to query each region once and sum. Seems like it wouldn't be too expensive because you'd just have a row counter variable. It maybe more complicated than I'm making it out to be though... ~Jeff
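As a rough sketch of Pete's separate counter table (not code from the thread; the table, family and qualifier names are invented), the atomic incrementColumnValue call lets concurrent writers bump the count safely. The count stays approximate, because, as the next reply explains, HBase cannot tell at write time whether a put overwrote an existing row:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterTableExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical "counters" table with one family "c"; one row per counted table.
    HTable counters = new HTable(HBaseConfiguration.create(), "counters");
    // After a successful insert into the data table, bump its row count by one.
    // incrementColumnValue is atomic on the region server, so concurrent clients are safe.
    long newCount = counters.incrementColumnValue(
        Bytes.toBytes("test"), Bytes.toBytes("c"), Bytes.toBytes("rows"), 1L);
    System.out.println("approximate row count for 'test': " + newCount);
    counters.close();
  }
}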
Re: Row Counters
Jeff, The problem is that when hbase receives a put or delete, it doesn't know if the put is overwriting an existing row or inserting a new one, and it doesn't know whether the requested row was there to delete. This isn't known until read or compaction time. So to keep the counter up to date on every insert, it would have to check all of the region's storefiles, which would slow down your inserts a lot. Matt On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu yuzhih...@gmail.com wrote: Since we have lived so long without this information, I guess we can hold for longer :-) Another issue I am working on is to reduce memory footprint. See the following discussion thread: One of the regionserver aborted, then the master shut down itself We have to bear in mind that there would be around 10K regions or more in production. Cheers
Re: Row Counters
I guess it is using the mapred class 11/03/16 20:58:27 INFO mapred.JobClient: Task Id : attempt_201103161245_0005_m_04_0, Status : FAILED java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details. at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322) at org.apache.hadoop.mapred.Child$4.run(Child.java:240) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.mapred.Child.main(Child.java:234) How do I use the mapreduce class? Viv On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu yuzhih...@gmail.com wrote: Since we have lived so long without this information, I guess we can hold for longer :-)
Re: java.io.FileNotFoundException:
Thanks St.Ack.. I'm blind.. Got past that. Now I get it for hadoop-0.20.2-core.jar. I've removed *append*.jar all over the place and replaced it with hadoop-0.20.2-core.jar. 0.90.1 will work with hadoop-0.20.2-core, right? Regular gets/puts work.. but not the mapreduce job java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103161652_0004/libjars/hadoop-0.20.2-core.jar does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:633) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447) -Original Message- From: Stack st...@duboce.net To: user@hbase.apache.org Sent: Wed, Mar 16, 2011 1:39 pm Subject: Re: java.io.FileNotFoundException: 0.90.1 ships with zookeeper-3.3.2, not with 3.2.2. St.Ack On Wed, Mar 16, 2011 at 8:05 AM, Venkatesh vramanatha...@aol.com wrote: Does anyone know how to get around this? Trying to run a mapreduce job in a cluster.. The one change was hbase upgraded to 0.90.1 (from 0.20.6).. No code change java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103151601_0363/libjars/zookeeper-3.2.2.jar does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447) at com.aol.mail.antispam.Profiler.UserProfileJob.run(UserProfileJob.java:1916) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java
Re: java.io.FileNotFoundException:
0.90.1 ships with a hadoop-0.20-append jar (not vanilla hadoop 0.20.2). Look up its name in the lib/ directory of the distribution (comes with a rev #) :) On Thu, Mar 17, 2011 at 2:33 AM, Venkatesh vramanatha...@aol.com wrote: Thanks St.Ack..I'm blind..Got past that.. Now I get for hadoop-0.20.2-core.jar I've removed *append*.jar all over the place replace with hadoop-0.20.2-core.jar 0.90.1 will work with hadoop-0.20.2-core right? Regular gets/puts work..but not the mapreduce job java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103161652_0004/libjars/hadoop-0.20.2-core.jar does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:633) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447) -Original Message- From: Stack st...@duboce.net To: user@hbase.apache.org Sent: Wed, Mar 16, 2011 1:39 pm Subject: Re: java.io.FileNotFoundException: 0.90.1 ships with zookeeper-3.3.2, not with 3.2.2. St.Ack On Wed, Mar 16, 2011 at 8:05 AM, Venkatesh vramanatha...@aol.com wrote: Does anyone how to get around this? Trying to run a mapreduce job in a cluster..The one change was hbase upgraded to 0.90.1 (from 0.20.6)..No code change java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103151601_0363/libjars/zookeeper-3.2.2.jar does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447) at com.aol.mail.antispam.Profiler.UserProfileJob.run(UserProfileJob.java:1916) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java -- Harsh J http://harshj.com
Re: Row Counters
In the future, describe your environment a bit. The way I approach this is: find the correct commandline from src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java Then I issue: [hadoop@us01-ciqps1-name01 hbase]$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.1.jar rowcounter packageindex Then I check the map/reduce task on job tracker URL On Wed, Mar 16, 2011 at 1:59 PM, Vivek Krishna vivekris...@gmail.comwrote: I guess it is using the mapred class 11/03/16 20:58:27 INFO mapred.JobClient: Task Id : attempt_201103161245_0005_m_04_0, Status : FAILED java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details. at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322) at org.apache.hadoop.mapred.Child$4.run(Child.java:240) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.mapred.Child.main(Child.java:234) How do I use mapreduce class? Viv On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu yuzhih...@gmail.com wrote: Since we have lived so long without this information, I guess we can hold for longer :-) Another issue I am working on is to reduce memory footprint. See the following discussion thread: One of the regionserver aborted, then the master shut down itself We have to bear in mind that there would be around 10K regions or more in production. Cheers On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting je...@qualtrics.com wrote: Just a random thought. What about keeping a per region row count? Then if you needed to get a row count for a table you'd just have to query each region once and sum. Seems like it wouldn't be too expensive because you'd just have a row counter variable. It maybe more complicated than I'm making it out to be though... ~Jeff On 3/16/2011 2:40 PM, Stack wrote: On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishnavivekris...@gmail.com wrote: 1. How do I count rows fast in hbase? First I tired count 'test' , takes ages. Saw that I could use RowCounter, but looks like it is deprecated. It is not. Make sure you are using the one from mapreduce package as opposed to mapred package. I just need to verify the total counts. Is it possible to see somewhere in the web interface or ganglia or by any other means? We don't keep a current count on a table. Too expensive. Run the rowcounter MR job. This page may be of help: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description Good luck, St.Ack -- Jeff Whiting Qualtrics Senior Software Engineer je...@qualtrics.com
Re: Hash keys
Hi Eric, Oops, you are right, my example was not clear and actually confusing the keys with sequential ones. The hash should map every Nth row key to the same bucket, so that you would for example see an interleaved distribution of row keys to regions. Region 1 holds 1, 8, 15,... while region 2 holds 2, 9, 16,... and so on. I do not think performance is a big issue. And yes, this is currently all client side driven :( Lars On Wed, Mar 16, 2011 at 2:57 PM, Eric Charles eric.char...@u-mangate.com wrote: Hi Lars, Many tks for your explanations! About DFR (sequential-keys) vs DFW (random-keys) distinction, I imagine different cases (just rephrasing what you said to be sure I get it): - Keys are really random (GUID or whatever): you have the distribution for free, still can't do, and probably don't need, range-queries. - If keys are monotonically increasing (timestamp, autoincremented,...), there are two cases: 1) sometimes, you don't need to do some range-queries and can store the key as a real hash (md5,...) to have distribution. 2) For timebased series for example, you may need to do some range queries, and adding a salt can be an answer to combine best-of-world. I understand the salt approach as recreating on the client side artifical key spaces. I was first confused reading row 1...1000 - prefix h1_. To really make the distribution random, I would have seen prefix/salt attributed randomly for a key leading to for example a h1 keyspace as such: h1_key2032, h1_key0023, h1_key1014343, ... Maybe you meant the intermediate approach where time keys of hour 1 going to h1 keyspace, keys of hour 2 going to h2 keyspace,... In that case, if you look for keys in hour 1, you would only need one scanner cause you know that they reside in h1_, and you could query with scan(h1_time1, h1_time2). But at at time, as you describe, you may need to scan different buckets with different scanners and use an ordered list to contain the result. - What about performance in that case? for very large dataset, a range query will take much time. I can imagine async client at the rescue. Maybe also mapreduce jobs could help cause if will benefit from data locality. - Also, the client application must manage the salts: it's a bit like reinventing a salt layer on top of the hbase region servers, letting client carry on this layer. The client will have to store (in hbase :)) the mapping between key ranges and their salt prefixes. It's a bit like exporting some core? functionality to the client. Strange, I fell I missed your point :) Tks, - Eric Sidenote: ...and yes, it seems I will have to learn some ruby stuff (should get used to, cause I just learned another scripting language running on jvm for another project...) On 16/03/2011 13:00, Lars George wrote: Hi Eric, Socorro is Java and Python, I was just mentioning it as a possible source of inspiration :) You can learn Ruby and implement it (I hear it is easy... *cough*) or write that same in a small Java app and use it from the command line or so. And yes, you can range scan using a prefix. We were discussing this recently and there is this notion of design for reads, or design for writes. DFR is usually sequential keys and DFW is random keys. It is tough to find common grounds as both designs are on the far end of the same spectrum. Finding a middle ground is the bucketed (or salted) approach, which gives you distribution but still being able to scan... but not without some client side support. One typical class of data is timeseries based keys. 
As for scanning them, you need N client side scanners. Imagine this example:

row 1 ... 1000 - Prefix h1_
row 1001 ... 2000 - Prefix h2_
row 2001 ... 3000 - Prefix h3_
row 3001 ... 4000 - Prefix h4_
row 4001 ... 5000 - Prefix h5_
row 5001 ... 6000 - Prefix h6_
row 6001 ... 7000 - Prefix h7_

So you have divided the entire range into 7 buckets. The prefixes (also sometimes called salt) are used to distribute the row keys to region servers. To scan the entire range as one large key space you need to create 7 scanners:

1. scanner: start row: h1_, end row h2_
2. scanner: start row: h2_, end row h3_
3. scanner: start row: h3_, end row h4_
4. scanner: start row: h4_, end row h5_
5. scanner: start row: h5_, end row h6_
6. scanner: start row: h6_, end row h7_
7. scanner: start row: h7_, end row

Now each of them gives you the first row that matches the start and end row keys they are configured for. So you then take that first KV they offer and add it to a list, sorted by kv.getRow() while removing the hash prefix. For example, scanner 1 may have row h1_1 to offer, then split and drop the prefix h1_ to get 1. The list then would hold something like:

1. row 1 - kv from scanner 1
2. row 1010 - kv from scanner 2
3. row 2001 - kv from scanner 3
4. row 3033 - kv from scanner 4
5. row 4001 - kv from scanner 5
6. row 5002 - kv from scanner
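To make the bucketing concrete, here is a hedged client-side sketch of the same idea; the bucket count, salt format and the eager merge are illustrative choices, not code from the thread. The salt is derived from the key itself so a given key always lands in the same bucket, and a logical range scan fans out into one scanner per bucket whose results are merged after the prefix is stripped:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeys {
  static final int BUCKETS = 7;

  // Write side: prefix the real key with a bucket id derived from the key.
  static byte[] salted(String key) {
    int bucket = (key.hashCode() & 0x7fffffff) % BUCKETS + 1;
    return Bytes.toBytes("h" + bucket + "_" + key);
  }

  // Read side: one scanner per bucket over the same logical range; strip the
  // salt and sort to restore key order (a real client would merge lazily).
  static List<String> scanRange(HTable table, String from, String to) throws Exception {
    List<String> rows = new ArrayList<String>();
    for (int b = 1; b <= BUCKETS; b++) {
      String prefix = "h" + b + "_";
      Scan scan = new Scan(Bytes.toBytes(prefix + from), Bytes.toBytes(prefix + to));
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          rows.add(Bytes.toString(r.getRow()).substring(prefix.length()));
        }
      } finally {
        scanner.close();
      }
    }
    Collections.sort(rows);
    return rows;
  }

  public static void main(String[] args) {
    System.out.println(Bytes.toString(salted("row 42"))); // e.g. h3_row 42
  }
}

Bucketing this way keeps each bucket's keys in their natural order, so the per-bucket range scans stay cheap; the price is N scanners per logical scan, which matches the trade-off Lars describes.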
Re: habse schema design and retrieving values through REST interface
This facility is not exposed in the REST API at the moment (not that I know of -- please someone correct me if I'm wrong). Wrong. :-) See ScannerModel in the rest package: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/rest/model/ScannerModel.html ScannerModel#setBatch - Andy --- On Wed, 3/16/11, Stack st...@duboce.net wrote: From: Stack st...@duboce.net Subject: Re: habse schema design and retrieving values through REST interface To: user@hbase.apache.org Date: Wednesday, March 16, 2011, 10:47 AM You can limit the return when scanning from the java api; see http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int) This facility is not exposed in the REST API at the moment (not that I know of -- please someone correct me if I'm wrong). So, yes, wide rows, if thousands of elements of some size, since they need to be composed all in RAM, could bring on an OOME if the composed size available heap. St.Ack
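For the original PHP/REST question, the stateful scanner resource is the piece that carries this batch limit. setBatch(int) is the method Andy links; the other setters used below and the POST-to-/tablename/scanner flow described in the comments are assumptions drawn from the REST package docs rather than from the thread, so treat this as a sketch:

import org.apache.hadoop.hbase.rest.model.ScannerModel;
import org.apache.hadoop.hbase.util.Bytes;

public class RestScannerSketch {
  public static void main(String[] args) throws Exception {
    ScannerModel model = new ScannerModel();
    // Assumed to mirror the client-side Scan: bound the row range to one keyword.
    model.setStartRow(Bytes.toBytes("united states"));
    model.setEndRow(Bytes.toBytes("united states\0"));
    model.setBatch(500); // return at most 500 columns per fetch
    // The serialized model (XML or protobuf) is POSTed to
    // http://localhost:8080/tablename/scanner; each GET on the returned
    // Location then pages through the wide row 500 cells at a time,
    // which a PHP client can drive with plain cURL.
  }
}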
Re: Row Counters
Oops. sorry about the environment. I am using hadoop-0.20.2-CDH3B4, and hbase-0.90.1-CDH3B4 and zookeeper-3.3.2-CDH3B4. I was able to configure jars and run the command, hadoop jar /usr/lib/hbase/hbase-0.90.1-CDH3B4.jar rowcounter test, but I get java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details. at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322) at org.apache.hadoop.mapred.Child$4.run(Child.java:240) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.mapred.Child.main(Child.java:234) The previous error in the task's full log is .. 2011-03-16 21:41:03,367 ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:988) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:301) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.init(HConnectionManager.java:292) at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:155) at org.apache.hadoop.hbase.client.HTable.init(HTable.java:167) at org.apache.hadoop.hbase.client.HTable.init(HTable.java:145) at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322) at org.apache.hadoop.mapred.Child$4.run(Child.java:240) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.mapred.Child.main(Child.java:234) Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:147) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:986) ... 15 more Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637) at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:133) ... 
16 more I am pretty sure the zookeeper master is running on the same machine at port 2181. Not sure why the connection loss occurs. Do I need HBASE-3578 (https://issues.apache.org/jira/browse/HBASE-3578) by any chance? Viv On Wed, Mar 16, 2011 at 5:36 PM, Ted Yu yuzhih...@gmail.com wrote: In the future, describe your environment a bit. The way I approach this is: find the correct commandline from src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java Then I issue: [hadoop@us01-ciqps1-name01 hbase]$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.1.jar rowcounter packageindex Then I check the map/reduce task on job tracker URL
Re: Row Counters
The connection loss was due to the inability to find the zookeeper quorum. Use the commandline in my previous email. On Wed, Mar 16, 2011 at 3:18 PM, Vivek Krishna vivekris...@gmail.com wrote: Oops. sorry about the environment. I am using hadoop-0.20.2-CDH3B4, and hbase-0.90.1-CDH3B4 and zookeeper-3.3.2-CDH3B4.
OT - Hash Code Creation
Hi, This is a little off topic but this group seems pretty swift so I thought I would ask. I am aggregating a day's worth of log data, which means I have a Map of over 24 million elements. What would be a good algorithm to use for generating Hash Codes for these elements that cut down on collisions? My application starts out reading in a log (144 logs in all) in about 20 seconds and by the time I reach the last log it is taking around 120 seconds. The extra 100 seconds have to do with Hash Table Collisions. I've played around with different Hashing algorithms and cut the original time from over 300 seconds to 120 but I know I can do better. The key I am using for the Map is an alpha-numeric string that is approximately 16 characters long, with the last 4 or 5 characters being the most unique. Any ideas? Thanks -Pete
Re: OT - Hash Code Creation
Try a hash table with double hashing. Something like this http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm 2011/3/17 Peter Haidinyak phaidin...@local.com Hi, This is a little off topic but this group seems pretty swift so I thought I would ask. I am aggregating a day's worth of log data which means I have a Map of over 24 million elements. What would be a good algorithm to use for generating Hash Codes for these elements that cut down on collisions?
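Another angle on the hashCode question, sketched under the assumption that the bottleneck really is bucket collisions in a java.util.HashMap: wrap the key in a class whose hashCode mixes every character with a multiplicative scheme such as FNV-1a, which spreads keys that differ only in their last few characters. The wrapper below is invented for illustration; presizing the map for the ~24 million entries also avoids repeated rehashing, which can look a lot like collision cost.

import java.util.HashMap;
import java.util.Map;

public final class LogKey {
  private final String key;

  public LogKey(String key) { this.key = key; }

  // FNV-1a over the characters; the trailing, most-unique characters are
  // mixed into every bit of the result instead of only the low-order ones.
  @Override
  public int hashCode() {
    int h = 0x811c9dc5;          // FNV offset basis
    for (int i = 0; i < key.length(); i++) {
      h ^= key.charAt(i);
      h *= 0x01000193;           // FNV prime
    }
    return h;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof LogKey && key.equals(((LogKey) o).key);
  }

  public static void main(String[] args) {
    // In the real run, presize (e.g. new HashMap<LogKey, Long>(32 * 1024 * 1024))
    // so the table never rehashes while loading 24 million entries.
    Map<LogKey, Long> counts = new HashMap<LogKey, Long>();
    LogKey k = new LogKey("abc123def456ghi7");
    Long prev = counts.get(k);
    counts.put(k, prev == null ? 1L : prev + 1L);
    System.out.println(counts.get(k));
  }
}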
Re: java.io.FileNotFoundException:
The below is a pretty basic error. Reference the jar that is actually present on your cluster. St.Ack On Wed, Mar 16, 2011 at 3:50 PM, Venkatesh vramanatha...@aol.com wrote: yeah..i was aware of that..I removed that and tried with hadoop-0.20.2-core.jar as I wasn't ready to upgrade hadoop.. I tried this time with the *append*.jar ..now it's complaining FileNotFound for append File /data/servers/datastore/mapred/mapred/system/job_201103161750_0030/libjars/hadoop-core-0.20-append-r1056497.jar does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:633) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
Re: habse schema design and retrieving values through REST interface
Thank you Andrew. St.Ack On Wed, Mar 16, 2011 at 3:12 PM, Andrew Purtell apurt...@apache.org wrote: This facility is not exposed in the REST API at the moment (not that I know of -- please someone correct me if I'm wrong). Wrong. :-) See ScannerModel in the rest package: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/rest/model/ScannerModel.html ScannerModel#setBatch - Andy
Re: Does HBase use spaces of HDFS?
Thanks for the info. That link you referred me to was great!! Thanks again :) Ed 2011/3/8 Suraj Varma svarma...@gmail.com In the standalone mode, HBase uses the local file system as its storage. In pseudo-distributed and fully-distributed modes, HBase uses HDFS as the storage. See http://hbase.apache.org/notsoquick.html for more details on the different modes. For details on storage, see http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html --Suraj On Mon, Mar 7, 2011 at 9:42 PM, edward choi mp2...@gmail.com wrote: Sorry for this totally newbie question. I'm just wondering if HBase uses HDFS space. I read in the reference book that HBase table size increases automatically as the table entry count increases. So I am guessing that HBase manages a separate storage other than HDFS. (But then why does HBase operate on top of HDFS? Truly confusing...) If HBase doesn't use HDFS space, I can designate only a single machine to be an HDFS slave, and assign a bunch of other machines to be HBase slaves. But if HBase does use HDFS space, I'd have to balance the ratio of HDFS and HBase within my machines. Could anyone give me a clear heads up? Ed
Re: Row Counters
Back to the issue of keeping a count, I've often wondered if this would be easy to do without much cost at compaction time? It of course wouldn't be a true real-time total but something like a compactedRowCount. It could be a useful metric to expose via JMX to get a feel for growth over time. On Wed, Mar 16, 2011 at 3:40 PM, Vivek Krishna vivekris...@gmail.com wrote: Works. Thanks. Viv On Wed, Mar 16, 2011 at 6:21 PM, Ted Yu yuzhih...@gmail.com wrote: The connection loss was due to inability of finding zookeeper quorum Use the commandline in my previous email. On Wed, Mar 16, 2011 at 3:18 PM, Vivek Krishna vivekris...@gmail.comwrote: Oops. sorry about the environment. I am using hadoop-0.20.2-CDH3B4, and hbase-0.90.1-CDH3B4 and zookeeper-3.3.2-CDH3B4. I was able to configure jars and run the command, hadoop jar /usr/lib/hbase/hbase-0.90.1-CDH3B4.jar rowcounter test, but I get java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details. at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322) at org.apache.hadoop.mapred.Child$4.run(Child.java:240) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.mapred.Child.main(Child.java:234) The previous error in the task's full log is .. 2011-03-16 21:41:03,367 ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:988) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:301) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.init(HConnectionManager.java:292) at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:155) at org.apache.hadoop.hbase.client.HTable.init(HTable.java:167) at org.apache.hadoop.hbase.client.HTable.init(HTable.java:145) at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322) at org.apache.hadoop.mapred.Child$4.run(Child.java:240) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.mapred.Child.main(Child.java:234) Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:147) at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:986) ... 15 more Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637) at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:133) ... 16 more find I am pretty sure zookeeper master is running in the same machine at port 2181. Not sure why the connection loss occurs. Do I need HBASE-3578 https://issues.apache.org/jira/browse/HBASE-3578 by any chance? Viv On Wed, Mar 16, 2011 at 5:36 PM, Ted Yu yuzhih...@gmail.com wrote: In the future, describe your environment a bit. The way I approach this is: find the correct commandline from src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java Then I issue: [hadoop@us01-ciqps1-name01 hbase]$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath`
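For reference, the usual full invocation (along the lines of what the hbase mapreduce package docs describe) is roughly the following, using the jar and table names from this thread. Prefixing the command with `hbase classpath` is what puts hbase-site.xml, and hence the zookeeper quorum, on the job's classpath instead of letting it fall back to defaults:

    # Sketch: run the bundled RowCounter with HBase's full classpath (conf dir + jars)
    HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
      ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.1-CDH3B4.jar rowcounter test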
Re: java.io.FileNotFoundException:
yeah..that's why i feel very stupid..I'm pretty sure it exists on my cluster..but i still get the err.. I'll try on a fresh day -Original Message- From: Stack st...@duboce.net To: user@hbase.apache.org Sent: Wed, Mar 16, 2011 7:44 pm Subject: Re: java.io.FileNotFoundException: The below is a pretty basic error. Reference the jar that is actually present on your cluster. St.Ack On Wed, Mar 16, 2011 at 3:50 PM, Venkatesh vramanatha...@aol.com wrote: yeah..i was aware of that..I removed that and tried with hadoop-0.20.2-core.jar as I wasn't ready to upgrade hadoop.. I tried this time with the *append*.jar ..now it's complaining FileNotFound for append File /data/servers/datastore/mapred/mapred/system/job_201103161750_0030/libjars/hadoop-core-0.20-append-r1056497.jar does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:633) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448) -Original Message- From: Harsh J qwertyman...@gmail.com To: user@hbase.apache.org Sent: Wed, Mar 16, 2011 5:32 pm Subject: Re: java.io.FileNotFoundException: 0.90.1 ships with a hadoop-0.20-append jar (not vanilla hadoop 0.20.2). Look up its name in the lib/ directory of the distribution (comes with a rev #) :) On Thu, Mar 17, 2011 at 2:33 AM, Venkatesh vramanatha...@aol.com wrote: Thanks St.Ack..I'm blind..Got past that.. Now I get it for hadoop-0.20.2-core.jar. I've removed *append*.jar all over the place and replaced it with hadoop-0.20.2-core.jar. 0.90.1 will work with hadoop-0.20.2-core right? Regular gets/puts work..but not the mapreduce job java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103161652_0004/libjars/hadoop-0.20.2-core.jar does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:633) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447) -Original Message- From: Stack st...@duboce.net To: user@hbase.apache.org Sent: Wed, Mar 16, 2011 1:39 pm Subject: Re: java.io.FileNotFoundException: 0.90.1 ships with zookeeper-3.3.2, not with 3.2.2. St.Ack On Wed, Mar 16, 2011 at 8:05 AM, Venkatesh vramanatha...@aol.com wrote: Does anyone know how to get around this? Trying to run a mapreduce job in a cluster..The one change was hbase upgraded to 0.90.1 (from 0.20.6)..No code change java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103151601_0363/libjars/zookeeper-3.2.2.jar does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447) at com.aol.mail.antispam.Profiler.UserProfileJob.run(UserProfileJob.java:1916) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java -- Harsh J http://harshj.com
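One way to avoid chasing jar names at all is to let the job ship HBase's dependencies from the submitting client's classpath, rather than pointing at jar paths in hadoop-env.sh. A rough sketch against the 0.90-era API (table and class names here are made up, and helper availability can vary between releases):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ShipJarsJob {
      // Trivial pass-through mapper, only here to keep the sketch self-contained.
      static class PassThroughMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
          context.write(row, value);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml on the client
        Job job = new Job(conf, "ship-jars-example");
        job.setJarByClass(ShipJarsJob.class);
        TableMapReduceUtil.initTableMapperJob("mytable", new Scan(),
            PassThroughMapper.class, ImmutableBytesWritable.class, Result.class, job);
        // Copies the hbase/zookeeper jars found on the client classpath into the job's
        // distributed cache, so the paths referenced at task time actually exist.
        TableMapReduceUtil.addDependencyJars(job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);    // map-only sketch, no output path needed
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }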
Is there any influence to the performance of hbase if we use TTL to clean data?
I'm doing performance testing on hbase, and found that performance gets lower as the data grows. I set TTL to 86400 (one day). Is there any influence on performance when hbase does a major compaction to clean outdated data? Thanks a lot. Zhou Shuaifeng(Frank)
Re: Is there any influence to the performance of hbase if we use TTL to clean data?
So, yes, a major compaction is disk-IO intensive and can influence performance. Here's a thread on this: http://search-hadoop.com/m/PI1dl1pXgEg2 And here's a more recent one: http://search-hadoop.com/m/BNxKZeI8z --Suraj On Wed, Mar 16, 2011 at 7:49 PM, Zhou Shuaifeng zhoushuaif...@huawei.com wrote: I'm doing performance testing on hbase, and found that performance gets lower as the data grows. I set TTL to 86400 (one day). Is there any influence on performance when hbase does a major compaction to clean outdated data? Thanks a lot. Zhou Shuaifeng(Frank)
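For context, TTL is a per-column-family setting, and expired cells are only physically removed when compactions rewrite the store files, which is where the extra disk I/O comes from. A minimal sketch with the Java client, using the thread's 86400-second TTL (table and family names here are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTtlTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor table = new HTableDescriptor("metrics");  // hypothetical table name
        HColumnDescriptor family = new HColumnDescriptor("d");     // hypothetical family name
        family.setTimeToLive(86400);  // seconds; cells older than one day become eligible for removal
        table.addFamily(family);
        admin.createTable(table);
        // Expired cells are filtered from reads immediately, but are only dropped from disk
        // when a compaction (a major compaction guarantees it) rewrites the store files.
      }
    }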
Re: hbase 0.90.1 upgrade issue - mapreduce job
Does this help?: http://search-hadoop.com/m/JI3ro1EKY0u --Suraj On Tue, Mar 15, 2011 at 7:39 PM, Venkatesh vramanatha...@aol.com wrote: Hi, When I upgraded to 0.90.1, the mapreduce job fails with an exception: system/job_201103151601_0121/libjars/hbase-0.90.1.jar does not exist. I have the jar file in the classpath (hadoop-env.sh). Any ideas? Thanks
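A quick sanity check along the same lines (a sketch; the job jar and driver class are hypothetical): confirm which jars the HBase release actually bundles, and launch with HBase's own classpath rather than hand-listing jars in hadoop-env.sh:

    # See the exact hadoop/zookeeper jar names this HBase release ships with
    ls $HBASE_HOME/lib | grep -E 'hadoop|zookeeper'
    # Submit the job with HBase's full classpath (conf dir + jars) on the client side
    HADOOP_CLASSPATH=`$HBASE_HOME/bin/hbase classpath` \
      hadoop jar my-mr-job.jar com.example.MyJobDriver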
Re: habse schema design and retrieving values through REST interface
Hi Andrew, I am new to hbase. Can you just elaborate on that and help me with the schema design? http://stackoverflow.com/questions/5325616/hbase-schema-design# I have a scenario as follows, I need to maintain a huge table for a 'web crawler' project in HBASE. Basically it contains thousands of keywords and for each keyword i need to maintain a list of urls (it again will count in thousands). Corresponding to each url, I need to store a number, which will in turn resemble the priority value the keyword holds. Let me explain you a bit, Suppose i have a keyword 'united states', i need to store about ten thousand urls corresponding to that keyword. Each keyword will be holding a priority value which is an integer. Again i have thousands of keywords like that. The rare thing about this is i need to do the project in PHP. I have configured a hadoop-hbase cluster consists of three machines. My plan was to design the schema by taking the keyword as 'row key'. The urls I will keep as column family. The schema looked fine at first. I have done a lot of research on how to retrieve the url list if i know the keyword. Any ways i managed a way out by preg-matching the xml data out put using the url http://localhost:8080/tablename/rowkey (REST interface i used). It also works fine if the url list has a limited number of urls. When it comes in thousands, it seems i cannot fetch the xml data itself! Now I am in a do or die situation. Please correct me if my schema design needs any changes (I do believe it should change!) and please help me up to retrieve the column family values (urls) corresponding to each row-key in an efficient way. Please guide me how i can do the same using PHP-REST interface. If I am wrong with the schema, please help me set up a new one. From the table I should be able to list all URLs corresponding to any keyword given (ordered by descending priority value). I may need to limit the results (like giving a condition on priority, e.g. 'where priority > 30') Thanks in advance Sreejith PK Nesote Technologies (P) Ltd On Thu, Mar 17, 2011 at 5:14 AM, Stack st...@duboce.net wrote: Thank you Andrew. St.Ack On Wed, Mar 16, 2011 at 3:12 PM, Andrew Purtell apurt...@apache.org wrote: This facility is not exposed in the REST API at the moment (not that I know of -- please someone correct me if I'm wrong). Wrong. :-) See ScannerModel in the rest package: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/rest/model/ScannerModel.html ScannerModel#setBatch - Andy --- On Wed, 3/16/11, Stack st...@duboce.net wrote: From: Stack st...@duboce.net Subject: Re: habse schema design and retrieving values through REST interface To: user@hbase.apache.org Date: Wednesday, March 16, 2011, 10:47 AM You can limit the return when scanning from the java api; see http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int) This facility is not exposed in the REST API at the moment (not that I know of -- please someone correct me if I'm wrong). So, yes, wide rows, if thousands of elements of some size, since they need to be composed all in RAM, could bring on an OOME if the composed size > available heap. St.Ack On Wed, Mar 16, 2011 at 2:41 AM, sreejith P. K. sreejit...@nesote.com wrote: With this schema, if i can limit the column family over a particular range, I can manage everything else. (like Select first n columns of a column family) Sreejith On Wed, Mar 16, 2011 at 12:33 PM, sreejith P. K.
sreejit...@nesote.comwrote: @ Jean-Daniel, As i told, each row key contains thousands of column family values (may be i am wrong with the schema design). I started REST and tried to cURL http:/localhost/tablename/rowname. It seems it will work only with limited amount of data (may be i can limit the cURL output), and how i can limit the column values for a particular row? Suppose i have two thousand urls under a keyword and i need to fetch the urls and should limit the result to five hundred. How it is possible?? @ tsuna, It seems http://www.elasticsearch.org/ using CouchDB right? On Tue, Mar 15, 2011 at 11:32 PM, Jean-Daniel Cryans jdcry...@apache.orgwrote: Can you tell why it's not able to get the bigger rows? Why would you try another schema if you don't even know what's going on right now? If you have the same issue with the new schema, you're back to square one right? Looking at the logs should give you some hints. J-D On Tue, Mar 15, 2011 at 10:19 AM, sreejith P. K. sreejit...@nesote.com wrote: Hello experts, I have a scenario as follows, I need to maintain a huge table for a 'web crawler' project in HBASE. Basically it contains thousands of keywords and for each keyword i need to maintain a list of urls (it again will count in
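To make the ScannerModel#setBatch pointer concrete for the PHP/REST case: the REST gateway exposes stateful scanners whose batch size caps how many cells come back per fetch, so a wide row can be read 500 cells at a time instead of as one huge XML document. A sketch with curl (the table name matches the thread's placeholder; the scanner id is server-generated, and the base64-encoded row key shown is 'united states'; check the ScannerModel docs for the exact attribute set):

    # 1. Create a scanner limited to 500 cells per fetch, scoped to a single row.
    #    startRow/endRow are base64-encoded; "dW5pdGVkIHN0YXRlcw==" is "united states",
    #    and the endRow below is the same key with a trailing zero byte (the stop row is exclusive).
    curl -v -X POST -H "Content-Type: text/xml" \
      -d '<Scanner batch="500" startRow="dW5pdGVkIHN0YXRlcw==" endRow="dW5pdGVkIHN0YXRlcwA="/>' \
      http://localhost:8080/tablename/scanner

    # 2. The 201 response's Location header names the scanner resource; GET it repeatedly,
    #    up to 500 cells at a time, until the server returns 204 No Content.
    curl -H "Accept: text/xml" http://localhost:8080/tablename/scanner/SCANNER_ID

    # 3. Clean up when done.
    curl -X DELETE http://localhost:8080/tablename/scanner/SCANNER_ID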