Re: hbase schema design and retrieving values through REST interface

2011-03-16 Thread sreejith P. K.
@ Jean-Daniel,

As I told, each row key contains thousands of column family values (maybe I
am wrong with the schema design). I started REST and tried to cURL
http://localhost/tablename/rowname. It seems it will work only with a limited
amount of data (maybe I can limit the cURL output), so how can I limit the
column values for a particular row?
Suppose I have two thousand urls under a keyword and I need to fetch the
urls but should limit the result to five hundred. How is it possible?

@ tsuna,

 It seems http://www.elasticsearch.org/ is using CouchDB, right?

On Tue, Mar 15, 2011 at 11:32 PM, Jean-Daniel Cryans jdcry...@apache.org wrote:

 Can you tell why it's not able to get the bigger rows? Why would you
 try another schema if you don't even know what's going on right now?
 If you have the same issue with the new schema, you're back to square
 one right?

 Looking at the logs should give you some hints.

 J-D

 On Tue, Mar 15, 2011 at 10:19 AM, sreejith P. K. sreejit...@nesote.com
 wrote:
  Hello experts,
 
  I have a scenario as follows:
  I need to maintain a huge table for a 'web crawler' project in HBASE.
  Basically it contains thousands of keywords, and for each keyword I need to
  maintain a list of urls (which again will count in the thousands). Corresponding
  to each url, I need to store a number, which will in turn represent the priority
  value the keyword holds.
  Let me explain a bit. Suppose I have a keyword 'united states'; I need to store
  about ten thousand urls corresponding to that keyword. Each keyword will be
  holding a priority value, which is an integer. Again, I have thousands of
  keywords like that. The odd thing about this is that I need to do the project
  in PHP.

  I have configured a hadoop-hbase cluster consisting of three machines. My plan
  was to design the schema by taking the keyword as 'row key'. The urls I will
  keep as a column family. The schema looked fine at first. I have done a lot of
  research on how to retrieve the url list if I know the keyword. Anyway, I
  managed a way out by preg-matching the xml data output using the url
  http://localhost:8080/tablename/rowkey (REST interface I used). It also
  works fine if the url list has a limited number of urls. When it comes in the
  thousands, it seems I cannot fetch the xml data itself!
  Now I am in a do or die situation. Please correct me if my schema design
  needs any changes (I do believe it should change!) and please help me
  retrieve the column family values (urls) corresponding to each row-key in an
  efficient way. Please guide me how I can do the same using the PHP-REST
  interface.
  Thanks in advance.
 
  Sreejith
 




-- 
Sreejith PK
Nesote Technologies (P) Ltd


Hash keys

2011-03-16 Thread Eric Charles

Hi,

To help avoid hotspots, I'm planning to use hashed keys in some tables.

1. I wonder if this strategy is advised for range-query (from/to key)
use cases, because the rows will be randomly distributed in different
regions. Will it cause some performance loss?
2. Is it possible to query from the hbase shell with something like get
't1', @hash('r1'), to let the shell compute the hash for you from the
readable key?
3. There are MD5 and Jenkins classes in the hbase.util package. What would
you advise? What about SHA1?
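
For reference, a minimal Java sketch of what I mean by a hashed key, using
the MD5Hash class from point 3 (the prefix + readable-key layout is just one
option I'm considering, and the class/method names here are only for
illustration):

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public class HashedKey {
  // Prefix the readable key with its MD5 hex digest; keeping the readable
  // part after the hash keeps the row self-describing when scanning.
  public static byte[] hashedKey(String readableKey) {
    String prefix = MD5Hash.getMD5AsHex(Bytes.toBytes(readableKey));
    return Bytes.toBytes(prefix + "-" + readableKey);
  }

  public static void main(String[] args) {
    // prints the 32-char MD5 hex of 'r1' followed by "-r1"
    System.out.println(Bytes.toString(hashedKey("r1")));
  }
}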


Tks,
- Eric

PS: I searched the archive but didn't find the answers.



Re: Hash keys

2011-03-16 Thread Harsh J
(For 2) I think the hash function should work in the shell if it
returns a string type (like what '' defines in-place).

On Wed, Mar 16, 2011 at 2:22 PM, Eric Charles
eric.char...@u-mangate.com wrote:
 Hi,

 To help avoid hotspots, I'm planning to use hashed keys in some tables.

 1. I wonder if this strategy is adviced for range queries (from/to key) use
 case, because the rows will be randomly distributed in different regions.
 Will it cause some performance loose?
 2. Is it possible to query from hbase shell with something like get 't1',
 @hash('r1'), to let the shell compute the hash for you from the readable
 key.
 3. There are MD5 and Jenkins classes in hbase.util package. What would you
 advice? what about SHA1?

 Tks,
 - Eric

 PS: I searched the archive but didn't find the answers.





-- 
Harsh J
http://harshj.com


Re: hbase schema design and retrieving values through REST interface

2011-03-16 Thread sreejith P. K.
With this schema, if I can limit the column family over a particular range,
I can manage everything else (like selecting the first n columns of a column
family).

Sreejith


On Wed, Mar 16, 2011 at 12:33 PM, sreejith P. K. sreejit...@nesote.com wrote:

 @ Jean-Daniel,

 As i told, each row key contains thousands of column family values (may be
 i am wrong with the schema design). I started REST and tried to cURL
 http:/localhost/tablename/rowname. It seems it will work only with limited
 amount of data (may be i can limit the cURL output), and how i can limit the
 column values for a particular row?
 Suppose i have two thousand urls under a keyword and i need to fetch the
 urls and should limit the result to five hundred. How it is possible??

 @ tsuna,

  It seems http://www.elasticsearch.org/ using CouchDB right?


 On Tue, Mar 15, 2011 at 11:32 PM, Jean-Daniel Cryans 
 jdcry...@apache.orgwrote:

 Can you tell why it's not able to get the bigger rows? Why would you
 try another schema if you don't even know what's going on right now?
 If you have the same issue with the new schema, you're back to square
 one right?

 Looking at the logs should give you some hints.

 J-D

 On Tue, Mar 15, 2011 at 10:19 AM, sreejith P. K. sreejit...@nesote.com
 wrote:
  Hello experts,
 
  I have a scenario as follows,
  I need to maintain a huge table for a 'web crawler' project in HBASE.
  Basically it contains thousands of keywords and for each keyword i need
 to
  maintain a list of urls (it again will count in thousands).
 Corresponding to
  each url, i need to store a number, which will in turn resemble the
 priority
  value the keyword holds.
  Let me explain you a bit, Suppose i have a keyword 'united states', i
 need
  to store about ten thousand urls corresponding to that keyword. Each
 keyword
  will be holding a priority value which is an integer. Again i have
 thousands
  of keywords like that. The rare thing about this is i need to do the
 project
  in PHP.
 
  I have configured a hadoop-hbase cluster consists of three machines. My
 plan
  was to design the schema by taking the keyword as 'row key'. The urls i
 will
  keep as column family. The schema looked fine at first. I have done a
 lot of
  research on how to retrieve the url list if i know the keyword. Any ways
 i
  managed a way out by preg-matching the xml data out put using the url
  http://localhost:8080/tablename/rowkey (REST interface i used). It also
  works fine if the url list has a limited number of urls. When it comes
 in
  thousands, it seems i cannot fetch the xml data itself!
  Now I am in a do or die situation. Please correct me if my schema design
  needs any changes (I do believe it should change!) and please help me up
 to
  retrieve the column family values (urls)
   corresponding to each row-key in an efficient way. Please guide me how
 i
  can do the same using PHP-REST interface.
  Thanks in advance.
 
  Sreejith
 




 --
 Sreejith PK
 Nesote Technologies (P) Ltd





-- 
Sreejith PK
Nesote Technologies (P) Ltd


Re: Hash keys

2011-03-16 Thread Lars George
Hi Eric,

Mozilla Socorro uses an approach where they bucket ranges using
leading hashes to distribute them across servers. When you want to do
scans you need to create N scans, where N is the number of hashes and
then do a next() on each scanner, putting all KVs into one sorted list
(use the KeyComparator for example) while stripping the prefix hash
first. You can then access the rows in sorted order where the first
element in the list is the one with the first key to read. Once you have
taken off the first element (being the lowest KV key) you next the
underlying scanner and reinsert its result into the list, reordering it. You
keep taking from the top and therefore always see the entire range,
even if the same scanner would return the next logical rows to read.

The shell is written in JRuby, so any function you can use there would
make sense to use in the prefix, then you could compute it on the fly.
This will not help with merging the bucketed key ranges, you need to
do this with the above approach in code. Though since this is JRuby,
you could write that code in Ruby and add it to your local shell, giving
you what you need.

Lars

On Wed, Mar 16, 2011 at 9:01 AM, Eric Charles
eric.char...@u-mangate.com wrote:
 Oops, forget my first question about range query (if keys are hashed, they
 can not be queried based on a range...)
 Still curious to have info on hash function in shell shell (2.) and advice
 on md5/jenkins/sha1 (3.)
 Tks,
 Eric

 On 16/03/2011 09:52, Eric Charles wrote:

 Hi,

 To help avoid hotspots, I'm planning to use hashed keys in some tables.

 1. I wonder if this strategy is adviced for range queries (from/to key)
 use case, because the rows will be randomly distributed in different
 regions. Will it cause some performance loose?
 2. Is it possible to query from hbase shell with something like get 't1',
 @hash('r1'), to let the shell compute the hash for you from the readable
 key.
 3. There are MD5 and Jenkins classes in hbase.util package. What would you
 advice? what about SHA1?

 Tks,
 - Eric

 PS: I searched the archive but didn't find the answers.





Re: Hash keys

2011-03-16 Thread Eric Charles

Hi,
I understand from your answer that it's possible but not available out of the box.
Has anyone already implemented such functionality?
If not, where should I begin to look (hirb.rb, any tutorial, ... ?) -
I know nothing about jruby.

Tks,
- Eric

On 16/03/2011 10:39, Harsh J wrote:

(For 2) I think the hash function should work in the shell if it
returns a string type (like what '' defines in-place).

On Wed, Mar 16, 2011 at 2:22 PM, Eric Charles
eric.char...@u-mangate.com  wrote:

Hi,

To help avoid hotspots, I'm planning to use hashed keys in some tables.

1. I wonder if this strategy is adviced for range queries (from/to key) use
case, because the rows will be randomly distributed in different regions.
Will it cause some performance loose?
2. Is it possible to query from hbase shell with something like get 't1',
@hash('r1'), to let the shell compute the hash for you from the readable
key.
3. There are MD5 and Jenkins classes in hbase.util package. What would you
advice? what about SHA1?

Tks,
- Eric

PS: I searched the archive but didn't find the answers.









Re: Hash keys

2011-03-16 Thread Eric Charles

Hi Lars,
Are you talking about http://code.google.com/p/socorro/ ?
I can find python scripts, but no jruby one...

Aside from the hash function I could reuse, are you saying that range queries
are possible even with hashed keys (randomly distributed)?
(If it is possible with the script, it will also be possible from the hbase
java client.)
Even with your explanation, I can't figure out how compound keys
(hashedkey+key) can be range-queried.


Tks,
- Eric

On 16/03/2011 11:38, Lars George wrote:

Hi Eric,

Mozilla Socorro uses an approach where they bucket ranges using
leading hashes to distribute them across servers. When you want to do
scans you need to create N scans, where N is the number of hashes and
then do a next() on each scanner, putting all KVs into one sorted list
(use the KeyComparator for example) while stripping the prefix hash
first. You can then access the rows in sorted order where the first
element in the list is the one with the first key to read. Once you
took of the first element (being the lowest KV key) you next the
underlying scanner and reinsert it into the list, reordering it. You
keep taking from the top and therefore always see the entire range,
even if the same scanner would return the next logical rows to read.

The shell is written in JRuby, so any function you can use there would
make sense to use in the prefix, then you could compute it on the fly.
This will not help with merging the bucketed key ranges, you need to
do this with the above approach in code. Though since this is JRuby
you could write that code in Ruby and add it to you local shell giving
you what you need.

Lars

On Wed, Mar 16, 2011 at 9:01 AM, Eric Charles
eric.char...@u-mangate.com  wrote:

Oops, forget my first question about range query (if keys are hashed, they
can not be queried based on a range...)
Still curious to have info on hash function in shell shell (2.) and advice
on md5/jenkins/sha1 (3.)
Tks,
Eric

On 16/03/2011 09:52, Eric Charles wrote:

Hi,

To help avoid hotspots, I'm planning to use hashed keys in some tables.

1. I wonder if this strategy is adviced for range queries (from/to key)
use case, because the rows will be randomly distributed in different
regions. Will it cause some performance loose?
2. Is it possible to query from hbase shell with something like get 't1',
@hash('r1'), to let the shell compute the hash for you from the readable
key.
3. There are MD5 and Jenkins classes in hbase.util package. What would you
advice? what about SHA1?

Tks,
- Eric

PS: I searched the archive but didn't find the answers.







Re: One of the regionserver aborted, then the master shut down itself

2011-03-16 Thread 茅旭峰
Hi J-D,

Thanks for your reply.

You said,
==
Just as an example, every value that
you insert first has to be copied from the socket before it can be
inserted into the MemStore.  If you are using a big write buffer, that
means that every insert currently in flight in a region server takes
double that amount of space.
==

How can I control the size of the write buffer? I found a property
'hbase.client.write.buffer' in hbase-default.xml, do you mean this one?
We use the RESTful api to put our cells; hopefully this does not make
any difference.

As for the memory usage of the master, I did a further investigation today.
What I was doing was to keep putting cells as before. As I said yesterday,
the Java heap kept increasing accordingly, and eventually an OOME happened,
as I expected. I set -Xmx to 1GB to speed up the OOME.

Then I used Eclipse Memory Analyzer to analyze the hprof file. It tells me that
most of the java heap is occupied by an instance of the class AssignmentManager.

(For ease of reading, I think you can copy the result part to whatever
editor you like; at least it works for me.)

Class Name                                                              | Shallow Heap | Retained Heap
-------------------------------------------------------------------------------------------------------
org.apache.hadoop.hbase.master.AssignmentManager @ 0x7f01050d4c98 | 112 | 974,967,592
|- class class org.apache.hadoop.hbase.master.AssignmentManager @ 0x7f013c21ebd0 | 8 | 8
|- master org.apache.hadoop.hbase.master.HMaster @ 0x7f01050521e0 master-cloud135:6 Busy Monitor, Thread | 328 | 3,000
|- regionsInTransition java.util.concurrent.ConcurrentSkipListMap @ 0x7f01050c1000 | 88 | 296
|- watcher org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher @ 0x7f01051cce68 | 136 | 1,720
|- timeoutMonitor org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor @ 0x7f01052505a8 cloud135:6.timeoutMonitor Thread | 208 | 592
|- zkTable org.apache.hadoop.hbase.zookeeper.ZKTable @ 0x7f01052c0318 | 32 | 400
|- catalogTracker org.apache.hadoop.hbase.catalog.CatalogTracker @ 0x7f01052c5fd0 | 72 | 376
|- serverManager org.apache.hadoop.hbase.master.ServerManager @ 0x7f01052f0138 | 80 | 932,000
|- regionPlans java.util.TreeMap @ 0x7f01052f01d8 | 80 | 104
|- servers java.util.TreeMap @ 0x7f01052f0228 | 80 | 75,128
|- regions java.util.TreeMap @ 0x7f01052f0278 | 80 | 950,435,488
|  |- class class java.util.TreeMap @ 0x7f013be45c30 System Class | 16 | 16
|  |- root java.util.TreeMap$Entry @ 0x7f010542b790 | 64 | 950,435,408
|  |  |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class | 0 | 0
|  |  |- left java.util.TreeMap$Entry @ 0x7f01053d34b0 | 64 | 579,650,616
|  |  |  |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class | 0 | 0
|  |  |  |- right java.util.TreeMap$Entry @ 0x7f01053d34f0 | 64 | 270,674,784
|  |  |  |  |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class | 0 | 0
|  |  |  |  |- left java.util.TreeMap$Entry @ 0x7f01053c7568 | 64 | 162,321,936
|  |  |  |  |- parent java.util.TreeMap$Entry @ 0x7f01053d34b0 | 64 | 579,650,616
|  |  |  |  |- right java.util.TreeMap$Entry @ 0x7f01054cbbe8 | 64 | 107,828,656
|  |  |  |  |- value org.apache.hadoop.hbase.HServerInfo @ 0x7f010f6866c0 | 72 | 154,328
|  |  |  |  |  |- class class org.apache.hadoop.hbase.HServerInfo @ 0x7f013c61e3e0 | 8 | 8
|  |  |  |  |  |- load org.apache.hadoop.hbase.HServerLoad @ 0x7f010540a548 | 40 | 153,776
|  |  |  |  |  |- serverName java.lang.String @ 0x7f010540a9a8 cloud138,60020,1300161207678 | 40 | 120
|  |  |  |  |  |- hostname java.lang.String @ 0x7f010540ab60 cloud138 | 40 | 80
|  |  |  |  |  |- serverAddress org.apache.hadoop.hbase.HServerAddress @ 0x7f01054c3020 | 32 | 280
|  |  |  |  |  '- Total: 5 entries
|  |  |  |  |- key org.apache.hadoop.hbase.HRegionInfo @ 0x7f010f77bd68 | 88 | 3,200
|  |  |  |  '- Total: 6 entries
|  |  |  |- parent java.util.TreeMap$Entry @ 0x7f010542b790 | 64 | 950,435,408
|  |  |  |- left java.util.TreeMap$Entry @ 0x7f0105432b70 | 64 | 307,135,480
|  |  |  |  |- class class java.util.TreeMap$Entry @ 0x7f013bef1e08 System Class | 0 | 0
|  |  |  |  |- parent java.util.TreeMap$Entry @ 0x7f01053d34b0 | ... (truncated)

Re: One of the regionserver aborted, then the master shut down itself

2011-03-16 Thread 茅旭峰
Regarding AssignmentManager, it looks like it only holds regions in transition.
We can see lots of region splits and unassignments in the master log. I guess
this was due to our large cells and the endless insertion. Does this make
sense?
I have not dug into the code, but I do believe it removes the regions from
AssignmentManager.regions once the transition completes, right?

Mao Xu-Feng

On Wed, Mar 16, 2011 at 7:09 PM, 茅旭峰 m9s...@gmail.com wrote:

 Hi J-D,

 Thanks for your reply.

 You said,
 ==

 Just as an example, every value that
 you insert first has to be copied from the socket before it can be
 inserted into the MemStore.  If you are using a big write buffer, that
 means that every insert currently in flight in a region server takes
 double that amount of space.
 ==

 How can I control the size of write buffer? I find a property
 'hbase.client.write.buffer' in hbase-default.xml, do you mean this one?
 We use RESTful api to put our cells, hopefully, this would not make
 any difference.

 As for the memroy usage of the master, I did a further investigation today.
 What I was doing was keeping putting cells as before. As I said yesterday,
 the Java heap kept increasing accordingly, and eventually OOME happened
 as I expected. I set -Xmx to 1GB to speed up OOME.

 Then I used Eclipse Memory Analyzer to analyze the hprof file. It tells
 that
 most of the java heap is occupied by an instance of Class AssignmentManager

 (For ease of reading, I think you can copy the result part to what ever
 editor you like, at least it works for me.)

 [Eclipse MAT heap dump quoted from the previous message snipped]

Re: Hash keys

2011-03-16 Thread Lars George
Hi Eric,

Socorro is Java and Python, I was just mentioning it as a possible
source of inspiration :) You can learn Ruby and implement it (I hear
it is easy... *cough*) or write the same in a small Java app and use
it from the command line or so.

And yes, you can range scan using a prefix. We were discussing this
recently and there is this notion of design for reads, or design for
writes. DFR is usually sequential keys and DFW is random keys. It is
tough to find common ground as both designs are at far ends of the
same spectrum. Finding a middle ground is the bucketed (or salted)
approach, which gives you distribution while still being able to scan...
but not without some client side support. One typical class of data is
timeseries based keys. As for scanning them, you need N client side
scanners. Imagine this example:

row   1 ... 1000 - Prefix h1_
row 1001 ... 2000 - Prefix h2_
row 2001 ... 3000 - Prefix h3_
row 3001 ... 4000 - Prefix h4_
row 4001 ... 5000 - Prefix h5_
row 5001 ... 6000 - Prefix h6_
row 6001 ... 7000 - Prefix h7_

So you have divided the entire range into 7 buckets. The prefixes
(also sometimes called salt) are used to distribute the row keys across
region servers. To scan the entire range as one large key space you
need to create 7 scanners:

1. scanner: start row: h1_, end row h2_
2. scanner: start row: h2_, end row h3_
3. scanner: start row: h3_, end row h4_
4. scanner: start row: h4_, end row h5_
5. scanner: start row: h5_, end row h6_
6. scanner: start row: h6_, end row h7_
7. scanner: start row: h7_, end row 

Now each of them gives you the first row that matches the start and
end row keys they are configured for. So you then take that first KV
they offer and add it to a list, sorted by kv.getRow() while removing
the hash prefix. For example, scanner 1 may have row h1_1 to offer;
you then split off and drop the prefix h1_ to get 1. The list then would
hold something like:

1. row 1 - kv from scanner 1
2. row 1010 - kv from scanner 2
3. row 2001 - kv from scanner 3
4. row 3033 - kv from scanner 4
5. row 4001 - kv from scanner 5
6. row 5002 - kv from scanner 6
7. row 6000 - kv from scanner 7

(assuming that the keys are not contiguous but have gaps)

You then pop element #1 and do a scanner1.next() to get its next KV
offering. Then insert that into the list and you get

1. row 3 - kv from scanner 1
2. row 1010 - kv from scanner 2
3. row 2001 - kv from scanner 3
4. row 3033 - kv from scanner 4
5. row 4001 - kv from scanner 5
6. row 5002 - kv from scanner 6
7. row 6000 - kv from scanner 7

Notice how you always only have a list with N elements on the client
side, each representing the next value the scanners offer. Since the
list is sorted you always access item #1 and therefore the next in the
entire key space.

Once scanner 1 runs out you can close and remove it, the list will
then give you values from scanner 2 as the first elements in it. And
so on.
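
A rough Java sketch of that loop, just to illustrate (the table name, the
seven h1_..h7_ prefixes and the helper class are made up for this example;
a PriorityQueue plays the role of the sorted list):

import java.io.IOException;
import java.util.PriorityQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedScanMerge {

  // One bucket = one scanner plus the last Result it returned.
  static class Bucket implements Comparable<Bucket> {
    final ResultScanner scanner;
    Result current;                       // next row this bucket offers
    Bucket(ResultScanner s) { this.scanner = s; }

    // Strip the "hN_" salt so buckets compare on the logical key only.
    byte[] logicalRow() {
      byte[] row = current.getRow();
      int cut = Bytes.toString(row).indexOf('_') + 1;
      return Bytes.tail(row, row.length - cut);
    }
    public int compareTo(Bucket other) {
      return Bytes.compareTo(logicalRow(), other.logicalRow());
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    // One scanner per salt prefix, e.g. [h1_, h2_), ..., [h7_, end of table).
    PriorityQueue<Bucket> queue = new PriorityQueue<Bucket>();
    for (int i = 1; i <= 7; i++) {
      Scan scan = (i < 7)
          ? new Scan(Bytes.toBytes("h" + i + "_"), Bytes.toBytes("h" + (i + 1) + "_"))
          : new Scan(Bytes.toBytes("h" + i + "_"));
      Bucket b = new Bucket(table.getScanner(scan));
      b.current = b.scanner.next();
      if (b.current != null) queue.add(b);
    }

    // Always take the bucket offering the lowest logical key, then refill it.
    while (!queue.isEmpty()) {
      Bucket b = queue.poll();
      System.out.println(Bytes.toString(b.logicalRow()));
      b.current = b.scanner.next();
      if (b.current != null) queue.add(b); else b.scanner.close();
    }
    table.close();
  }
}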

Makes more sense?

Lars

On Wed, Mar 16, 2011 at 12:09 PM, Eric Charles
eric.char...@u-mangate.com wrote:
 Hi Lars,
 Are you talking about http://code.google.com/p/socorro/ ?
 I can find python scripts, but no jruby one...

 Aside the hash function I could reuse, are you saying that range queries are
 possible even with hashed keys (randomly distributed)?
 (If possible with the script, it will also be possible from the hbase java
 client).
 Even with your explanation, I can't figure out how compound keys
 (hasedkey+key) can be range-queried.

 Tks,
 - Eric

 On 16/03/2011 11:38, Lars George wrote:

 Hi Eric,

 Mozilla Socorro uses an approach where they bucket ranges using
 leading hashes to distribute them across servers. When you want to do
 scans you need to create N scans, where N is the number of hashes and
 then do a next() on each scanner, putting all KVs into one sorted list
 (use the KeyComparator for example) while stripping the prefix hash
 first. You can then access the rows in sorted order where the first
 element in the list is the one with the first key to read. Once you
 took of the first element (being the lowest KV key) you next the
 underlying scanner and reinsert it into the list, reordering it. You
 keep taking from the top and therefore always see the entire range,
 even if the same scanner would return the next logical rows to read.

 The shell is written in JRuby, so any function you can use there would
 make sense to use in the prefix, then you could compute it on the fly.
 This will not help with merging the bucketed key ranges, you need to
 do this with the above approach in code. Though since this is JRuby
 you could write that code in Ruby and add it to you local shell giving
 you what you need.

 Lars

 On Wed, Mar 16, 2011 at 9:01 AM, Eric Charles
 eric.char...@u-mangate.com  wrote:

 Oops, forget my first question about range query (if keys are hashed,
 they
 can not be queried based on a range...)
 Still curious to have info on hash function in shell shell (2.) and
 advice
 on md5/jenkins/sha1 (3.)
 Tks,
 

Re: CopyTable MR job hangs

2011-03-16 Thread Eran Kutner
Double thanks (one for each reply) J-D, I'll use distcp as you suggest.

-eran



On Tue, Mar 15, 2011 at 19:10, Jean-Daniel Cryans jdcry...@apache.org wrote:

 Strangely enough I did answer that question the day you sent it but it
 doesn't show up on the mailing list aggregators even tho gmail marks
 it as sent... anyways here's what I said:

 It won't work because those versions aren't wire-compatible.

 What you can do instead is doing an Export, distcp the files, then do
 an Import. If the hadoop versions are different, use the hftp
 interface like the distcp documentation recommends.

 J-D

 On Tue, Mar 15, 2011 at 1:11 AM, Eran Kutner e...@gigya.com wrote:
  No idea anyone?
 
  -eran
 
 
 
  On Wed, Mar 2, 2011 at 16:40, Eran Kutner e...@gigya.com wrote:
 
  Hi,
  I'm trying to copy data from an older cluster using 0.89 (CDH3b3) to a
 new
  one using 0.91 (CDH3b4) using the CopyTable MR job but it always hangs
 on
  map 0% reduce 0% until eventually the job is killed by Hadoop for not
  responding after 600 seconds.
  I verified that it works fine when copying from one table to another on
 the
  same cluster and I verified that the servers in the source cluster have
  network access to those in the destination cluster.
 
  Any idea what could be causing it?
 
  -eran
 
 
 



Hbase without hadoop

2011-03-16 Thread Sumeet M Nikam

Hi,

I am new to hbase and trying to write my first POJO to access an hbase table.
Please bear with my query; it seems to be very simple but I am not able to
find the answer myself. I am using the sample code from the api docs.

Configuration config = HBaseConfiguration.create();
HTable table = new HTable(config, "myLittleHBaseTable");
...

I used the hbase-0.90.1 jar in the class path. The strange thing I found is that
the Configuration class (org.apache.hadoop.conf.Configuration) is not in that jar.
So is it the case that I need to add the hadoop jar to the class path as well? I
very well understand that hbase and hadoop go hand in hand; is that the reason the
Configuration class is not included in the hbase 0.90.1 distribution?
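
For what it's worth, this is the minimal self-contained version I am trying to
compile; the comments note which jar I believe each import comes from (hbase
0.90.1 ships a hadoop-core jar in its lib/ directory, or you can use your
cluster's own hadoop jar):

// Assumed class path: hbase-0.90.1.jar plus a hadoop-core jar (Configuration
// lives in hadoop-core, not in the hbase jar).
import org.apache.hadoop.conf.Configuration;        // hadoop-core-*.jar
import org.apache.hadoop.hbase.HBaseConfiguration;  // hbase-0.90.1.jar
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MyLittleHBaseClient {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml / hbase-default.xml from the class path.
    Configuration config = HBaseConfiguration.create();
    HTable table = new HTable(config, "myLittleHBaseTable");
    Result r = table.get(new Get(Bytes.toBytes("myRow")));   // row key is illustrative
    System.out.println(r);
    table.close();
  }
}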




Re: Hash keys

2011-03-16 Thread Eric Charles

Hi Lars,
Many tks for your explanations!

About the DFR (sequential keys) vs DFW (random keys) distinction, I imagine
different cases (just rephrasing what you said to be sure I get it):


- Keys are really random (GUID or whatever): you have the distribution 
for free, still can't do, and probably don't need, range-queries.


- If keys are monotonically increasing (timestamp, autoincremented, ...),
there are two cases:
1) Sometimes you don't need to do range-queries and can store the
key as a real hash (md5, ...) to get distribution.
2) For time-based series, for example, you may need to do some range
queries, and adding a salt can be an answer to combine the best of both worlds.


I understand the salt approach as recreating artificial key spaces
on the client side.


I was first confused reading "row 1...1000 - prefix h1_".
To really make the distribution random, I would have expected the prefix/salt
to be attributed randomly to a key, leading for example to an h1 keyspace
such as: h1_key2032, h1_key0023, h1_key1014343, ...
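
A deterministic, hash-based salt is the kind of thing I would expect, so that
the prefix looks random but can be recomputed from the key itself (the bucket
count and prefix format below are only an illustration):

import org.apache.hadoop.hbase.util.Bytes;

public class Salter {
  private static final int BUCKETS = 7;   // illustrative bucket count

  // Deterministic salt: the same logical key always maps to the same bucket,
  // so a point lookup only needs to recompute the prefix, not try all buckets.
  public static byte[] saltedKey(String logicalKey) {
    int bucket = (logicalKey.hashCode() & 0x7fffffff) % BUCKETS + 1;
    return Bytes.toBytes("h" + bucket + "_" + logicalKey);
  }

  public static void main(String[] args) {
    // prints something like h3_key2032 (which bucket depends on the hash)
    System.out.println(Bytes.toString(saltedKey("key2032")));
  }
}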


Maybe you meant the intermediate approach where the time keys of hour 1
go to the h1 keyspace, the keys of hour 2 go to the h2 keyspace, ...
In that case, if you look for keys in hour 1, you would only need one
scanner because you know that they reside in h1_, and you could query
with scan(h1_time1, h1_time2).


But at times, as you describe, you may need to scan different buckets
with different scanners and use an ordered list to contain the result.
- What about performance in that case? For a very large dataset, such a range
query will take much time. I can imagine the async client coming to the rescue.
Maybe mapreduce jobs could also help, because they would benefit from data
locality.
- Also, the client application must manage the salts: it's a bit like
reinventing a salt layer on top of the hbase region servers and letting the
client carry this layer. The client will have to store (in hbase :)) the
mapping between key ranges and their salt prefixes. It's a bit like
exporting some core functionality to the client.


Strange, I feel I missed your point :)
Tks,

- Eric

Sidenote: ...and yes, it seems I will have to learn some ruby stuff
(I should get used to it, since I just learned another scripting language
running on the jvm for another project...)



On 16/03/2011 13:00, Lars George wrote:

Hi Eric,

Socorro is Java and Python, I was just mentioning it as a possible
source of inspiration :) You can learn Ruby and implement it (I hear
it is easy... *cough*) or write that same in a small Java app and use
it from the command line or so.

And yes, you can range scan using a prefix. We were discussing this
recently and there is this notion of design for reads, or design for
writes. DFR is usually sequential keys and DFW is random keys. It is
tough to find common grounds as both designs are on the far end of the
same spectrum. Finding a middle ground is the bucketed (or salted)
approach, which gives you distribution but still being able to scan...
but not without some client side support. One typical class of data is
timeseries based keys. As for scanning them, you need N client side
scanners. Imagine this example:

row   1 ... 1000 -  Prefix h1_
row 1001 ... 2000 -  Prefix h2_
row 2001 ... 3000 -  Prefix h3_
row 3001 ... 4000 -  Prefix h4_
row 4001 ... 5000 -  Prefix h5_
row 5001 ... 6000 -  Prefix h6_
row 6001 ... 7000 -  Prefix h7_

So you have divided the entire range into 7 buckets. The prefixes
(also sometimes called salt) are used to distribute them row keys to
region servers. To scan the entire range as one large key space you
need to create 7 scanners:

1. scanner: start row: h1_, end row h2_
2. scanner: start row: h2_, end row h3_
3. scanner: start row: h3_, end row h4_
4. scanner: start row: h4_, end row h5_
5. scanner: start row: h5_, end row h6_
6. scanner: start row: h6_, end row h7_
7. scanner: start row: h7_, end row 

Now each of them gives you the first row that matches the start and
end row keys they are configure for. So you then take that first KV
they offer and add it to a list, sorted by ky.getRow() while removing
the hash prefix. For example, scanner 1 may have row h1_1 to offer,
then split and drop the prefix h1_ to get 1. The list then would
hold something like:

1. row 1 -  kv from scanner 1
2. row 1010 -  kv from scanner 2
3. row 2001 -  kv from scanner 3
4. row 3033 -  kv from scanner 4
5. row 4001 -  kv from scanner 5
6. row 5002 -  kv from scanner 6
7. row 6000 -  kv from scanner 7

(assuming that the keys are not contiguous but have gaps)

You then pop element #1 and do a scanner1.next() to get its next KV
offering. Then insert that into the list and you get

1. row 3 -  kv from scanner 1
2. row 1010 -  kv from scanner 2
3. row 2001 -  kv from scanner 3
4. row 3033 -  kv from scanner 4
5. row 4001 -  kv from scanner 5
6. row 5002 -  kv from scanner 6
7. row 6000 -  kv from scanner 7

Notice how you always only have a list with N elements on the client
side, each representing the next value the 

Re: which hadoop and zookeeper version should I use with hbase 0.90.1

2011-03-16 Thread Oleg Ruchovets
On Mon, Feb 28, 2011 at 8:11 PM, Stack st...@duboce.net wrote:

 On Sun, Feb 27, 2011 at 1:31 PM, Oleg Ruchovets oruchov...@gmail.com
 wrote:
  Hi ,
sorry for asking the same question couple of times , but I still have
 no
  clear understanding which hadoop version I have to install for hbase
 0.90.1.
  Any  information will be really appreciated
 

 Yeah, our version story is a little messy at the moment.  Would
 appreciate any input that would help us make it more clear.  More
 below...


  1)  From http://hbase.apache.org/notsoquick.html#hadoop I understand
 that
  hadoop-0.20-append is an official version for hbase. I case I am going to
  compile it : Do I have checkout main branch or there is a recomended tag?
  If someone already compiled this version and had an issues please share
 it.
 

 So, the documentation says  No official releases have been made from
 this branch up to now so you will have to build your own Hadoop from
 the tip of this branch., so yes, you'll have to build it.  The
 branch-0.20-append link in the documentation is to the branch in SVN
 that you'd need to checkout and build.  This was not obvious to you so
 I need to reword this paragraph to be more clear.   How about if I
 insert after the above sentence Checkout this branch [with a link to
 the branch in svn] and then compile it by...  Would that be better?


 
  2) I found cloudera maven repository and I see there only hadoop-0.20.2
  version. Does this version supports durability and suitable for hbase
  0.90.1? or I need to copy jars from hadoop-0.20-append to hadoop-0.20.2
  cloudera version? I looked for CDH3 and CDH4 but didn't find
  hadoop-0.20-append version.


 Again, the documentation must be insufficiently clear here.  We link
 to the CDH3 page.  We also state it beta.  What would you suggest?


  Question: does cloudera hadoop version (0.20.2) is suitable for hbase
  0.90.1?


 CDH3b2,CDH3b3, or CDH3b4 are all suitable (each is an hadoop 0.20.2++).


  In case I am going to use cloudera do I need to install all parts
 (hadoop,
  hbase ,zookeper ...) from cloudera or it is possible to take only hadoop
  installation and other products  (hbase , zookeper) I can install from
  standard distributions?
 
 Any of above combinations should work.  If you use CDH3b4, you can
 take all from CDH since it includes 0.90.1.  Otherwise, you could use
 CDH hadoop and use your hbase build for the rest.

 St.Ack



It took some time, but we succeeded in compiling the hadoop version. We decided
to take the official version for hbase.
I am only concerned about the version which we get after compilation.
The version is 0.20.3-SNAPSHOT, r1057313.
Is this version a suitable one for hbase?

Thanks in advance, Oleg.


Re: Hash keys

2011-03-16 Thread Harsh J
Using Java classes directly is possible from within the HBase shell (since
it is JRuby), but yes, some Ruby knowledge should be helpful too!

For instance, I can use java.lang.String by simply importing it:

hbase(main):004:0> import java.lang.String
=> Java::JavaLang::String
hbase(main):004:0> get String.new('test'), String.new('row1')
COLUMN                 CELL
 f:a                   timestamp=1300170063837, value=val4
1 row(s) in 0.0420 seconds

On Wed, Mar 16, 2011 at 4:26 PM, Eric Charles
eric.char...@u-mangate.com wrote:
 Hi,
 I understand from your answer that it's possible but not available.
 Did anyone already implemented such a functionality?
 If not, where should I begin to look at (hirb.rb, any tutorial,... ?) - I
 know nothing about jruby.
 Tks,
 - Eric

 On 16/03/2011 10:39, Harsh J wrote:

 (For 2) I think the hash function should work in the shell if it
 returns a string type (like what '' defines in-place).

 On Wed, Mar 16, 2011 at 2:22 PM, Eric Charles
 eric.char...@u-mangate.com  wrote:

 Hi,

 To help avoid hotspots, I'm planning to use hashed keys in some tables.

 1. I wonder if this strategy is adviced for range queries (from/to key)
 use
 case, because the rows will be randomly distributed in different regions.
 Will it cause some performance loose?
 2. Is it possible to query from hbase shell with something like get
 't1',
 @hash('r1'), to let the shell compute the hash for you from the readable
 key.
 3. There are MD5 and Jenkins classes in hbase.util package. What would
 you
 advice? what about SHA1?

 Tks,
 - Eric

 PS: I searched the archive but didn't find the answers.









-- 
Harsh J
http://harshj.com


Re: One of the regionserver aborted, then the master shut down itself

2011-03-16 Thread Ted Yu
Thanks for your analysis.
Once a region is offline, it is removed from regions

BTW your cluster needs more machines. 7600 regions over 4 nodes place too
much load on the servers.

On Wed, Mar 16, 2011 at 4:28 AM, 茅旭峰 m9s...@gmail.com wrote:

 Regarding AssignmentManager, it looks like only hold regions in transition.
 We can see lots of region split and unsignment in the master log. I guess
 it was due to our large cells and the endless insertion. Does this make
 sense?
 I have not dig into the code, I do belive it removes the regions from the
 AssignmentManager.regions once the transition completes, right?

 Mao Xu-Feng

 On Wed, Mar 16, 2011 at 7:09 PM, 茅旭峰 m9s...@gmail.com wrote:

  Hi J-D,
 
  Thanks for your reply.
 
  You said,
  ==
 
  Just as an example, every value that
  you insert first has to be copied from the socket before it can be
  inserted into the MemStore.  If you are using a big write buffer, that
  means that every insert currently in flight in a region server takes
  double that amount of space.
  ==
 
  How can I control the size of write buffer? I find a property
  'hbase.client.write.buffer' in hbase-default.xml, do you mean this one?
  We use RESTful api to put our cells, hopefully, this would not make
  any difference.
 
  As for the memroy usage of the master, I did a further investigation
 today.
  What I was doing was keeping putting cells as before. As I said
 yesterday,
  the Java heap kept increasing accordingly, and eventually OOME happened
  as I expected. I set -Xmx to 1GB to speed up OOME.
 
  Then I used Eclipse Memory Analyzer to analyze the hprof file. It tells
  that
  most of the java heap is occupied by an instance of Class
 AssignmentManager
 
  (For ease of reading, I think you can copy the result part to what ever
  editor you like, at least it works for me.)
 
  [Eclipse MAT heap dump quoted from the earlier message snipped]

java.io.FileNotFoundException:

2011-03-16 Thread Venkatesh

 Does anyone know how to get around this? Trying to run a mapreduce job in a
cluster. The one change was hbase upgraded to 0.90.1 (from 0.20.6). No code
change.


 java.io.FileNotFoundException: File 
/data/servers/datastore/mapred/mapred/system/job_201103151601_0363/libjars/zookeeper-3.2.2.jar
 does not exist.
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at 
org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
at 
org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at 
com.aol.mail.antispam.Profiler.UserProfileJob.run(UserProfileJob.java:1916)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java




Re: Hash keys

2011-03-16 Thread Eric Charles

Cool.

Everything is already available.
I simply have to import MD5Hash and use the to_java_bytes ruby function.

hbase(main):001:0> import org.apache.hadoop.hbase.util.MD5Hash
=> Java::OrgApacheHadoopHbaseUtil::MD5Hash
hbase(main):002:0> put 'test', MD5Hash.getMD5AsHex('row1'.to_java_bytes), 'cf:a', 'value1'
0 row(s) in 0.5880 seconds
hbase(main):004:0> get 'test', 'row1'
COLUMN   CELL
0 row(s) in 0.0170 seconds
hbase(main):003:0> get 'test', MD5Hash.getMD5AsHex('row1'.to_java_bytes)
COLUMN   CELL
 cf:a    timestamp=1300287899911, value=value1
1 row(s) in 0.0840 seconds

Many tks,

Eric


On 16/03/2011 15:44, Harsh J wrote:

Using Java classes itself is possible from within HBase shell (since
it is JRuby), but yes some Ruby knowledge should be helpful too!

For instance, I can use java.lang.String by simply importing it:

hbase(main):004:0  import java.lang.String
=  Java::JavaLang::String
hbase(main):004:0  get String.new('test'), String.new('row1')
COLUMN CELL

  f:a   timestamp=1300170063837, value=val4

1 row(s) in 0.0420 seconds

On Wed, Mar 16, 2011 at 4:26 PM, Eric Charles
eric.char...@u-mangate.com  wrote:

Hi,
I understand from your answer that it's possible but not available.
Did anyone already implemented such a functionality?
If not, where should I begin to look at (hirb.rb, any tutorial,... ?) - I
know nothing about jruby.
Tks,
- Eric

On 16/03/2011 10:39, Harsh J wrote:

(For 2) I think the hash function should work in the shell if it
returns a string type (like what '' defines in-place).

On Wed, Mar 16, 2011 at 2:22 PM, Eric Charles
eric.char...@u-mangate.comwrote:

Hi,

To help avoid hotspots, I'm planning to use hashed keys in some tables.

1. I wonder if this strategy is adviced for range queries (from/to key)
use
case, because the rows will be randomly distributed in different regions.
Will it cause some performance loose?
2. Is it possible to query from hbase shell with something like get
't1',
@hash('r1'), to let the shell compute the hash for you from the readable
key.
3. There are MD5 and Jenkins classes in hbase.util package. What would
you
advice? what about SHA1?

Tks,
- Eric

PS: I searched the archive but didn't find the answers.













Re: One of the regionserver aborted, then the master shut down itself

2011-03-16 Thread 茅旭峰
Thanks Ted!

===
Once a region is offline, it is removed from regions
===
By 'offline' here do you mean unassigned, and already split into smaller regions?
I think we have too many regions because we're using large cells, and normally a
region is hundreds of megabytes in size. BTW, is there any property that can set
the size of a region?
Do you think setting a larger region size could be helpful for our scenario? If
AssignmentManager.regions holds all the online regions, the size of regions is
(number of online regions) X (number of online regions) / (number of region
servers), right?
So to cut the size of regions, either we can increase the region size, or add
more region servers, right?

Just out of curiosity, why should we keep the per-region load in each HServerLoad
in AssignmentManager.regions? I guess it keeps changing dynamically.

Thanks and regards,

Mao Xu-Feng

On Wed, Mar 16, 2011 at 11:03 PM, Ted Yu yuzhih...@gmail.com wrote:

 Thanks for your analysis.
 Once a region is offline, it is removed from regions

 BTW your cluster needs more machines. 7600 regions over 4 nodes place too
 much load on the servers.

 On Wed, Mar 16, 2011 at 4:28 AM, 茅旭峰 m9s...@gmail.com wrote:

  Regarding AssignmentManager, it looks like only hold regions in
 transition.
  We can see lots of region split and unsignment in the master log. I guess
  it was due to our large cells and the endless insertion. Does this make
  sense?
  I have not dig into the code, I do belive it removes the regions from the
  AssignmentManager.regions once the transition completes, right?
 
  Mao Xu-Feng
 
  On Wed, Mar 16, 2011 at 7:09 PM, 茅旭峰 m9s...@gmail.com wrote:
 
   Hi J-D,
  
   Thanks for your reply.
  
   You said,
   ==
  
   Just as an example, every value that
   you insert first has to be copied from the socket before it can be
   inserted into the MemStore.  If you are using a big write buffer, that
   means that every insert currently in flight in a region server takes
   double that amount of space.
   ==
  
   How can I control the size of write buffer? I find a property
   'hbase.client.write.buffer' in hbase-default.xml, do you mean this one?
   We use RESTful api to put our cells, hopefully, this would not make
   any difference.
  
   As for the memroy usage of the master, I did a further investigation
  today.
   What I was doing was keeping putting cells as before. As I said
  yesterday,
   the Java heap kept increasing accordingly, and eventually OOME happened
   as I expected. I set -Xmx to 1GB to speed up OOME.
  
   Then I used Eclipse Memory Analyzer to analyze the hprof file. It tells
   that
   most of the java heap is occupied by an instance of Class
  AssignmentManager
  
   (For ease of reading, I think you can copy the result part to what ever
   editor you like, at least it works for me.)
  
   [Eclipse MAT heap dump quoted from the earlier message snipped]
after upgrade, fatal error in regionserver compacter, LzoCompressor, AbstractMethodError

2011-03-16 Thread Ferdy Galema
We upgraded to Hadoop 0.20.1 and HBase 0.90.1 (both CDH3B4). We are
using 64-bit machines.


Startup goes great, but right after the first compaction we get this
error:

Uncaught exception in service thread regionserver60020.compactor
java.lang.AbstractMethodError: 
com.hadoop.compression.lzo.LzoCompressor.reinit(Lorg/apache/hadoop/conf/Configuration;)V
at 
org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:105)
at 
org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:112)
at 
org.apache.hadoop.hbase.io.hfile.Compression$Algorithm.getCompressor(Compression.java:200)
at 
org.apache.hadoop.hbase.io.hfile.HFile$Writer.getCompressingStream(HFile.java:397)
at 
org.apache.hadoop.hbase.io.hfile.HFile$Writer.newBlock(HFile.java:383)
at 
org.apache.hadoop.hbase.io.hfile.HFile$Writer.checkBlockBoundary(HFile.java:354)
at 
org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:536)
at 
org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:501)
at 
org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:836)
at 
org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:935)
at 
org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
at 
org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
at 
org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
at 
org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)


LZO worked fine. This is how I believe we used it:
# LZO compression in HBase will pass through three layers:
# 1) hadoop-gpl-compression-*.jar in the hbase/lib directory; the entry point
# 2) libgplcompression.* in the hbase native lib directory; the native connectors
# 3) liblzo2.so.2 in the hbase native lib directory; the base native library

Anyway, it would be great if somebody could help us out.


Re: java.io.FileNotFoundException:

2011-03-16 Thread Stack
0.90.1 ships with zookeeper-3.3.2, not with 3.2.2.
St.Ack

On Wed, Mar 16, 2011 at 8:05 AM, Venkatesh vramanatha...@aol.com wrote:

  Does anyone how to get around this? Trying to run a mapreduce job in a 
 cluster..The one change was hbase upgraded to 0.90.1 (from 0.20.6)..No code 
 change


  java.io.FileNotFoundException: File 
 /data/servers/datastore/mapred/mapred/system/job_201103151601_0363/libjars/zookeeper-3.2.2.jar
  does not exist.
        at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at 
 org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
        at 
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629)
        at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at 
 com.aol.mail.antispam.Profiler.UserProfileJob.run(UserProfileJob.java:1916)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java





Re: which hadoop and zookeeper version should I use with hbase 0.90.1

2011-03-16 Thread Stack
From where did you get the src?
Thanks,
St.Ack

On Wed, Mar 16, 2011 at 7:12 AM, Oleg Ruchovets oruchov...@gmail.com wrote:
 On Mon, Feb 28, 2011 at 8:11 PM, Stack st...@duboce.net wrote:

 On Sun, Feb 27, 2011 at 1:31 PM, Oleg Ruchovets oruchov...@gmail.com
 wrote:
  Hi ,
    sorry for asking the same question couple of times , but I still have
 no
  clear understanding which hadoop version I have to install for hbase
 0.90.1.
  Any  information will be really appreciated
 

 Yeah, our version story is a little messy at the moment.  Would
 appreciate any input that would help us make it more clear.  More
 below...


  1)  From http://hbase.apache.org/notsoquick.html#hadoop I understand
 that
  hadoop-0.20-append is an official version for hbase. I case I am going to
  compile it : Do I have checkout main branch or there is a recomended tag?
  If someone already compiled this version and had an issues please share
 it.
 

 So, the documentation says  No official releases have been made from
 this branch up to now so you will have to build your own Hadoop from
 the tip of this branch., so yes, you'll have to build it.  The
 branch-0.20-append link in the documentation is to the branch in SVN
 that you'd need to checkout and build.  This was not obvious to you so
 I need to reword this paragraph to be more clear.   How about if I
 insert after the above sentence Checkout this branch [with a link to
 the branch in svn] and then compile it by...  Would that be better?


 
  2) I found cloudera maven repository and I see there only hadoop-0.20.2
  version. Does this version supports durability and suitable for hbase
  0.90.1? or I need to copy jars from hadoop-0.20-append to hadoop-0.20.2
  cloudera version? I looked for CDH3 and CDH4 but didn't find
  hadoop-0.20-append version.


 Again, the documentation must be insufficiently clear here.  We link
 to the CDH3 page.  We also state it beta.  What would you suggest?


  Question: does cloudera hadoop version (0.20.2) is suitable for hbase
  0.90.1?


 CDH3b2,CDH3b3, or CDH3b4 are all suitable (each is an hadoop 0.20.2++).


  In case I am going to use cloudera do I need to install all parts
 (hadoop,
  hbase ,zookeper ...) from cloudera or it is possible to take only hadoop
  installation and other products  (hbase , zookeper) I can install from
  standard distributions?
 
 Any of above combinations should work.  If you use CDH3b4, you can
 take all from CDH since it includes 0.90.1.  Otherwise, you could use
 CDH hadoop and use your hbase build for the rest.

 St.Ack



 It took some time, but we succeeded in compiling the hadoop version. We decided
 to take the official version for hbase.
    I am only concerned about the version we get after compilation.
 The version is 0.20.3-SNAPSHOT, r1057313.
    Is this version suitable for hbase?

 Thanks in advance, Oleg.



Re: habse schema design and retrieving values through REST interface

2011-03-16 Thread Stack
You can limit the return when scanning from the java api; see
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
 This facility is not exposed in the REST API at the moment (not that
I know of -- please someone correct me if I'm wrong).   So, yes, wide
rows, if thousands of elements of some size, since they need to be
composed all in RAM, could bring on an OOME if the composed size >
available heap.

St.Ack
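
For readers hitting the same wide-row problem, a minimal sketch of the setBatch() usage Stack points at, written against the 0.90.x Java client; the table name ("tablename") and column family ("urls") are placeholders taken from the question, not anything the thread confirms.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "tablename");
    Scan scan = new Scan(Bytes.toBytes("united states")); // start at the keyword row
    scan.addFamily(Bytes.toBytes("urls"));
    scan.setBatch(500); // at most 500 columns per Result; a 10k-column row comes
                        // back as a series of partial Results for the same row key
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result partial : scanner) {
        System.out.println(partial.size() + " cells of row "
            + Bytes.toString(partial.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

With setBatch in place the client never has to materialize the whole row at once, which is what brings on the OOME described above.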


On Wed, Mar 16, 2011 at 2:41 AM, sreejith P. K. sreejit...@nesote.com wrote:
 With this schema, if i can limit the column family over a particular range,
 I can manage everything else. (like Select first n columns of a column
 family)

 Sreejith


 On Wed, Mar 16, 2011 at 12:33 PM, sreejith P. K. sreejit...@nesote.comwrote:

 @ Jean-Daniel,

 As i told, each row key contains thousands of column family values (may be
 i am wrong with the schema design). I started REST and tried to cURL
 http:/localhost/tablename/rowname. It seems it will work only with limited
 amount of data (may be i can limit the cURL output), and how i can limit the
 column values for a particular row?
 Suppose i have two thousand urls under a keyword and i need to fetch the
 urls and should limit the result to five hundred. How it is possible??

 @ tsuna,

  It seems http://www.elasticsearch.org/ using CouchDB right?


 On Tue, Mar 15, 2011 at 11:32 PM, Jean-Daniel Cryans 
 jdcry...@apache.orgwrote:

 Can you tell why it's not able to get the bigger rows? Why would you
 try another schema if you don't even know what's going on right now?
 If you have the same issue with the new schema, you're back to square
 one right?

 Looking at the logs should give you some hints.

 J-D

 On Tue, Mar 15, 2011 at 10:19 AM, sreejith P. K. sreejit...@nesote.com
 wrote:
  Hello experts,
 
  I have a scenario as follows,
  I need to maintain a huge table for a 'web crawler' project in HBASE.
  Basically it contains thousands of keywords and for each keyword i need
 to
  maintain a list of urls (it again will count in thousands).
 Corresponding to
  each url, i need to store a number, which will in turn resemble the
 priority
  value the keyword holds.
  Let me explain you a bit, Suppose i have a keyword 'united states', i
 need
  to store about ten thousand urls corresponding to that keyword. Each
 keyword
  will be holding a priority value which is an integer. Again i have
 thousands
  of keywords like that. The rare thing about this is i need to do the
 project
  in PHP.
 
  I have configured a hadoop-hbase cluster consists of three machines. My
 plan
  was to design the schema by taking the keyword as 'row key'. The urls i
 will
  keep as column family. The schema looked fine at first. I have done a
 lot of
  research on how to retrieve the url list if i know the keyword. Any ways
 i
  managed a way out by preg-matching the xml data out put using the url
  http://localhost:8080/tablename/rowkey (REST interface i used). It also
  works fine if the url list has a limited number of urls. When it comes
 in
  thousands, it seems i cannot fetch the xml data itself!
  Now I am in a do or die situation. Please correct me if my schema design
  needs any changes (I do believe it should change!) and please help me up
 to
  retrieve the column family values (urls)
   corresponding to each row-key in an efficient way. Please guide me how
 i
  can do the same using PHP-REST interface.
  Thanks in advance.
 
  Sreejith
 




 --
 Sreejith PK
 Nesote Technologies (P) Ltd





 --
 Sreejith PK
 Nesote Technologies (P) Ltd



Re: Hash keys

2011-03-16 Thread Harsh J
On Wed, Mar 16, 2011 at 8:36 PM, Eric Charles
eric.char...@u-mangate.com wrote:
 Cool.
 Everything is already available.

Great!

 1 row(s) in 0.0840 seconds
 1 row(s) in 0.0420 seconds

Interesting, how your test's get time is exactly the double of my test ;-)

-- 
Harsh J
http://harshj.com


Re: Hash keys

2011-03-16 Thread Eric Charles

A new laptop is definitively on my invest plan :)
Tks,
Eric

On 16/03/2011 18:56, Harsh J wrote:

On Wed, Mar 16, 2011 at 8:36 PM, Eric Charles
eric.char...@u-mangate.com  wrote:

Cool.
Everything is already available.

Great!


1 row(s) in 0.0840 seconds

1 row(s) in 0.0420 seconds

Interesting, how your test's get time is exactly the double of my test ;-)





Re: Hash keys

2011-03-16 Thread Eric Charles

...and probably the additional hashing doesn't help the performance.
Eric


On 16/03/2011 19:17, Eric Charles wrote:

A new laptop is definitively on my invest plan :)
Tks,
Eric

On 16/03/2011 18:56, Harsh J wrote:

On Wed, Mar 16, 2011 at 8:36 PM, Eric Charles
eric.char...@u-mangate.com  wrote:

Cool.
Everything is already available.

Great!


1 row(s) in 0.0840 seconds

1 row(s) in 0.0420 seconds
Interesting, how your test's get time is exactly the double of my 
test ;-)








Re: which hadoop and zookeeper version should I use with hbase 0.90.1

2011-03-16 Thread Oleg Ruchovets
I get the src from here.

http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/



On Wed, Mar 16, 2011 at 7:40 PM, Stack st...@duboce.net wrote:

 From where did you get the src?
 Thanks,
 St.Ack

 On Wed, Mar 16, 2011 at 7:12 AM, Oleg Ruchovets oruchov...@gmail.com
 wrote:
  On Mon, Feb 28, 2011 at 8:11 PM, Stack st...@duboce.net wrote:
 
  On Sun, Feb 27, 2011 at 1:31 PM, Oleg Ruchovets oruchov...@gmail.com
  wrote:
   Hi ,
 sorry for asking the same question couple of times , but I still
 have
  no
   clear understanding which hadoop version I have to install for hbase
  0.90.1.
   Any  information will be really appreciated
  
 
  Yeah, our version story is a little messy at the moment.  Would
  appreciate any input that would help us make it more clear.  More
  below...
 
 
   1)  From http://hbase.apache.org/notsoquick.html#hadoop I understand
  that
   hadoop-0.20-append is an official version for hbase. I case I am going
 to
   compile it : Do I have checkout main branch or there is a recomended
 tag?
   If someone already compiled this version and had an issues please
 share
  it.
  
 
  So, the documentation says  No official releases have been made from
  this branch up to now so you will have to build your own Hadoop from
  the tip of this branch., so yes, you'll have to build it.  The
  branch-0.20-append link in the documentation is to the branch in SVN
  that you'd need to checkout and build.  This was not obvious to you so
  I need to reword this paragraph to be more clear.   How about if I
  insert after the above sentence Checkout this branch [with a link to
  the branch in svn] and then compile it by...  Would that be better?
 
 
  
   2) I found cloudera maven repository and I see there only
 hadoop-0.20.2
   version. Does this version supports durability and suitable for hbase
   0.90.1? or I need to copy jars from hadoop-0.20-append to
 hadoop-0.20.2
   cloudera version? I looked for CDH3 and CDH4 but didn't find
   hadoop-0.20-append version.
 
 
  Again, the documentation must be insufficiently clear here.  We link
  to the CDH3 page.  We also state it beta.  What would you suggest?
 
 
   Question: does cloudera hadoop version (0.20.2) is suitable for hbase
   0.90.1?
 
 
  CDH3b2,CDH3b3, or CDH3b4 are all suitable (each is an hadoop 0.20.2++).
 
 
   In case I am going to use cloudera do I need to install all parts
  (hadoop,
   hbase ,zookeper ...) from cloudera or it is possible to take only
 hadoop
   installation and other products  (hbase , zookeper) I can install from
   standard distributions?
  
  Any of above combinations should work.  If you use CDH3b4, you can
  take all from CDH since it includes 0.90.1.  Otherwise, you could use
  CDH hadoop and use your hbase build for the rest.
 
  St.Ack
 
 
 
  It took some time , but we succeeded to compile hadoop version. We
 decided
  to take an official version for hbase.
 I am only concern  about version which we get after compilation.
  The version is *0.20.3-SNAPSHOT, r1057313. *
  *   Does this version is a suitable version for hbase?*
  *
  *
  Thanks in advance , Oleg.
  *
  *
  *
  *
 



Re: which hadoop and zookeeper version should I use with hbase 0.90.1

2011-03-16 Thread Ryan Rawson
That's the correct branch, so you should be good!

On Wed, Mar 16, 2011 at 1:17 PM, Oleg Ruchovets oruchov...@gmail.com wrote:
 I get the src from here.

 http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/



 On Wed, Mar 16, 2011 at 7:40 PM, Stack st...@duboce.net wrote:

 From where did you get the src?
 Thanks,
 St.Ack

 On Wed, Mar 16, 2011 at 7:12 AM, Oleg Ruchovets oruchov...@gmail.com
 wrote:
  On Mon, Feb 28, 2011 at 8:11 PM, Stack st...@duboce.net wrote:
 
  On Sun, Feb 27, 2011 at 1:31 PM, Oleg Ruchovets oruchov...@gmail.com
  wrote:
   Hi ,
     sorry for asking the same question couple of times , but I still
 have
  no
   clear understanding which hadoop version I have to install for hbase
  0.90.1.
   Any  information will be really appreciated
  
 
  Yeah, our version story is a little messy at the moment.  Would
  appreciate any input that would help us make it more clear.  More
  below...
 
 
   1)  From http://hbase.apache.org/notsoquick.html#hadoop I understand
  that
   hadoop-0.20-append is an official version for hbase. I case I am going
 to
   compile it : Do I have checkout main branch or there is a recomended
 tag?
   If someone already compiled this version and had an issues please
 share
  it.
  
 
  So, the documentation says  No official releases have been made from
  this branch up to now so you will have to build your own Hadoop from
  the tip of this branch., so yes, you'll have to build it.  The
  branch-0.20-append link in the documentation is to the branch in SVN
  that you'd need to checkout and build.  This was not obvious to you so
  I need to reword this paragraph to be more clear.   How about if I
  insert after the above sentence Checkout this branch [with a link to
  the branch in svn] and then compile it by...  Would that be better?
 
 
  
   2) I found cloudera maven repository and I see there only
 hadoop-0.20.2
   version. Does this version supports durability and suitable for hbase
   0.90.1? or I need to copy jars from hadoop-0.20-append to
 hadoop-0.20.2
   cloudera version? I looked for CDH3 and CDH4 but didn't find
   hadoop-0.20-append version.
 
 
  Again, the documentation must be insufficiently clear here.  We link
  to the CDH3 page.  We also state it beta.  What would you suggest?
 
 
   Question: does cloudera hadoop version (0.20.2) is suitable for hbase
   0.90.1?
 
 
  CDH3b2,CDH3b3, or CDH3b4 are all suitable (each is an hadoop 0.20.2++).
 
 
   In case I am going to use cloudera do I need to install all parts
  (hadoop,
   hbase ,zookeper ...) from cloudera or it is possible to take only
 hadoop
   installation and other products  (hbase , zookeper) I can install from
   standard distributions?
  
  Any of above combinations should work.  If you use CDH3b4, you can
  take all from CDH since it includes 0.90.1.  Otherwise, you could use
  CDH hadoop and use your hbase build for the rest.
 
  St.Ack
 
 
 
  It took some time , but we succeeded to compile hadoop version. We
 decided
  to take an official version for hbase.
     I am only concern  about version which we get after compilation.
  The version is *0.20.3-SNAPSHOT, r1057313. *
  *   Does this version is a suitable version for hbase?*
  *
  *
  Thanks in advance , Oleg.
  *
  *
  *
  *
 




Re: which hadoop and zookeeper version should I use with hbase 0.90.1

2011-03-16 Thread Oleg Ruchovets
Got it ,
thank you St.Ack.

On Wed, Mar 16, 2011 at 10:23 PM, Stack st...@duboce.net wrote:

 You should be good then.  Make sure you put the hadoop you built under
 hbase/lib (removing the old hadoop).  The hadoop-0.20.X-SNAPSHOT.x.x.
 is just how its named on that branch.  See the build.xml.
 St.Ack

 On Wed, Mar 16, 2011 at 1:17 PM, Oleg Ruchovets oruchov...@gmail.com
 wrote:
  I get the src from here.
 
 
 http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/
 
 
 
  On Wed, Mar 16, 2011 at 7:40 PM, Stack st...@duboce.net wrote:
 
  From where did you get the src?
  Thanks,
  St.Ack
 
  On Wed, Mar 16, 2011 at 7:12 AM, Oleg Ruchovets oruchov...@gmail.com
  wrote:
   On Mon, Feb 28, 2011 at 8:11 PM, Stack st...@duboce.net wrote:
  
   On Sun, Feb 27, 2011 at 1:31 PM, Oleg Ruchovets 
 oruchov...@gmail.com
   wrote:
Hi ,
  sorry for asking the same question couple of times , but I still
  have
   no
clear understanding which hadoop version I have to install for
 hbase
   0.90.1.
Any  information will be really appreciated
   
  
   Yeah, our version story is a little messy at the moment.  Would
   appreciate any input that would help us make it more clear.  More
   below...
  
  
1)  From http://hbase.apache.org/notsoquick.html#hadoop I
 understand
   that
hadoop-0.20-append is an official version for hbase. I case I am
 going
  to
compile it : Do I have checkout main branch or there is a
 recomended
  tag?
If someone already compiled this version and had an issues please
  share
   it.
   
  
   So, the documentation says  No official releases have been made from
   this branch up to now so you will have to build your own Hadoop from
   the tip of this branch., so yes, you'll have to build it.  The
   branch-0.20-append link in the documentation is to the branch in SVN
   that you'd need to checkout and build.  This was not obvious to you
 so
   I need to reword this paragraph to be more clear.   How about if I
   insert after the above sentence Checkout this branch [with a link to
   the branch in svn] and then compile it by...  Would that be better?
  
  
   
2) I found cloudera maven repository and I see there only
  hadoop-0.20.2
version. Does this version supports durability and suitable for
 hbase
0.90.1? or I need to copy jars from hadoop-0.20-append to
  hadoop-0.20.2
cloudera version? I looked for CDH3 and CDH4 but didn't find
hadoop-0.20-append version.
  
  
   Again, the documentation must be insufficiently clear here.  We link
   to the CDH3 page.  We also state it beta.  What would you suggest?
  
  
Question: does cloudera hadoop version (0.20.2) is suitable for
 hbase
0.90.1?
  
  
   CDH3b2,CDH3b3, or CDH3b4 are all suitable (each is an hadoop
 0.20.2++).
  
  
In case I am going to use cloudera do I need to install all parts
   (hadoop,
hbase ,zookeper ...) from cloudera or it is possible to take only
  hadoop
installation and other products  (hbase , zookeper) I can install
 from
standard distributions?
   
   Any of above combinations should work.  If you use CDH3b4, you can
   take all from CDH since it includes 0.90.1.  Otherwise, you could use
   CDH hadoop and use your hbase build for the rest.
  
   St.Ack
  
  
  
   It took some time , but we succeeded to compile hadoop version. We
  decided
   to take an official version for hbase.
  I am only concern  about version which we get after compilation.
   The version is *0.20.3-SNAPSHOT, r1057313. *
   *   Does this version is a suitable version for hbase?*
   *
   *
   Thanks in advance , Oleg.
   *
   *
   *
   *
  
 
 



Row Counters

2011-03-16 Thread Vivek Krishna
1.  How do I count rows fast in hbase?

First I tried count 'test'; it takes ages.

Saw that I could use RowCounter, but looks like it is deprecated.  When I
try to use it, I get

java.io.IOException: Cannot create a record reader because of a previous
error. Please look at the previous logs lines from the task's full log for
more details.
at
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)

If this is deprecated, is there any other way of finding the counts?

I just need to verify the total counts.  Is it possible to see somewhere in
the web interface or ganglia or by any other means?

Viv


Re: Row Counters

2011-03-16 Thread Stack
On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna vivekris...@gmail.com wrote:
 1.  How do I count rows fast in hbase?

 First I tired count 'test'  , takes ages.

 Saw that I could use RowCounter, but looks like it is deprecated.

It is not.  Make sure you are using the one from mapreduce package as
opposed to mapred package.


 I just need to verify the total counts.  Is it possible to see somewhere in
 the web interface or ganglia or by any other means?


We don't keep a current count on a table.  Too expensive.  Run the
rowcounter MR job.  This page may be of help:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description

Good luck,
St.Ack
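
In case it helps the "how do I use the mapreduce class" question that follows: a hedged sketch of driving org.apache.hadoop.hbase.mapreduce.RowCounter from a small Java main, assuming the 0.90.x createSubmittableJob helper; "test" stands in for the table name.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.RowCounter;
import org.apache.hadoop.mapreduce.Job;

public class CountRows {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Same arguments as the command-line tool: the table name, optionally
    // followed by specific columns to restrict the count to.
    Job job = RowCounter.createSubmittableJob(conf, new String[] { "test" });
    boolean ok = job.waitForCompletion(true);
    // The total ends up in the job's counters, visible on the job tracker UI
    // or via job.getCounters().
    System.exit(ok ? 0 : 1);
  }
}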


Re: Row Counters

2011-03-16 Thread Ted Yu
$ ./bin/hadoop jar hbase*.jar rowcounter

Search for related discussion on search-hadoop

On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna vivekris...@gmail.comwrote:

 1.  How do I count rows fast in hbase?

 First I tired count 'test'  , takes ages.

 Saw that I could use RowCounter, but looks like it is deprecated.  When I
 try to use it, I get

 java.io.IOException: Cannot create a record reader because of a previous
 error. Please look at the previous logs lines from the task's full log for
 more details.
 at

 org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)

 If this is deprecated, is there any other way of finding the counts?

 I just need to verify the total counts.  Is it possible to see somewhere in
 the web interface or ganglia or by any other means?

 Viv



Re: Row Counters

2011-03-16 Thread Jeff Whiting
Just a random thought.  What about keeping a per region row count?  Then if you needed to get a row 
count for a table you'd just have to query each region once and sum.  Seems like it wouldn't be too 
expensive because you'd just have a row counter variable.  It may be more complicated than I'm making 
it out to be though...


~Jeff

On 3/16/2011 2:40 PM, Stack wrote:

On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishnavivekris...@gmail.com  wrote:

1.  How do I count rows fast in hbase?

First I tired count 'test'  , takes ages.

Saw that I could use RowCounter, but looks like it is deprecated.

It is not.  Make sure you are using the one from mapreduce package as
opposed to mapred package.



I just need to verify the total counts.  Is it possible to see somewhere in
the web interface or ganglia or by any other means?


We don't keep a current count on a table.  Too expensive.  Run the
rowcounter MR job.  This page may be of help:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description

Good luck,
St.Ack


--
Jeff Whiting
Qualtrics Senior Software Engineer
je...@qualtrics.com



RE: Row Counters

2011-03-16 Thread Peter Haidinyak
When I needed to know a row count for a table, I kept a separate table just for that 
purpose and would update/query that table. Low tech, but it worked.

-Pete
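
A rough sketch of that pattern using atomic increments, so concurrent writers do not clobber each other; the table name "row_counts", family "c" and qualifier "rows" are invented for the example, and it assumes the application knows when a put is a brand new row rather than an update.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TableRowCounter {
  private static final byte[] FAM = Bytes.toBytes("c");
  private static final byte[] QUAL = Bytes.toBytes("rows");

  // Call once for every row the application knows it is inserting, not updating.
  static void bump(HTable counters, String dataTable) throws Exception {
    counters.incrementColumnValue(Bytes.toBytes(dataTable), FAM, QUAL, 1L);
  }

  static long read(HTable counters, String dataTable) throws Exception {
    Result r = counters.get(new Get(Bytes.toBytes(dataTable)));
    byte[] v = r.getValue(FAM, QUAL);
    return v == null ? 0L : Bytes.toLong(v);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable counters = new HTable(conf, "row_counts");
    bump(counters, "test");
    System.out.println("test holds roughly " + read(counters, "test") + " rows");
    counters.close();
  }
}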

-Original Message-
From: Jeff Whiting [mailto:je...@qualtrics.com] 
Sent: Wednesday, March 16, 2011 1:46 PM
To: user@hbase.apache.org
Cc: Stack
Subject: Re: Row Counters

Just a random thought.  What about keeping a per region row count?  Then if you 
needed to get a row 
count for a table you'd just have to query each region once and sum.  Seems 
like it wouldn't be too 
expensive because you'd just have a row counter variable.  It maybe more 
complicated than I'm making 
it out to be though...

~Jeff

On 3/16/2011 2:40 PM, Stack wrote:
 On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishnavivekris...@gmail.com  wrote:
 1.  How do I count rows fast in hbase?

 First I tired count 'test'  , takes ages.

 Saw that I could use RowCounter, but looks like it is deprecated.
 It is not.  Make sure you are using the one from mapreduce package as
 opposed to mapred package.


 I just need to verify the total counts.  Is it possible to see somewhere in
 the web interface or ganglia or by any other means?

 We don't keep a current count on a table.  Too expensive.  Run the
 rowcounter MR job.  This page may be of help:
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description

 Good luck,
 St.Ack

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
je...@qualtrics.com



Re: Row Counters

2011-03-16 Thread Matt Corgan
Jeff,

The problem is that when hbase receives a put or delete, it doesn't know if
the put is overwriting an existing row or inserting a new one, and it
doesn't know if whether the requested row was there to delete.  This isn't
known until read or compaction time.

So to keep the counter up to date on every insert, it would have to check
all of the region's storefiles which would slow down your inserts a lot.

Matt


On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu yuzhih...@gmail.com wrote:

 Since we have lived so long without this information, I guess we can hold
 for longer :-)
 Another issue I am working on is to reduce memory footprint. See the
 following discussion thread:
 One of the regionserver aborted, then the master shut down itself

 We have to bear in mind that there would be around 10K regions or more in
 production.

 Cheers

 On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting je...@qualtrics.com wrote:

  Just a random thought.  What about keeping a per region row count?  Then
 if
  you needed to get a row count for a table you'd just have to query each
  region once and sum.  Seems like it wouldn't be too expensive because
 you'd
  just have a row counter variable.  It maybe more complicated than I'm
 making
  it out to be though...
 
  ~Jeff
 
 
  On 3/16/2011 2:40 PM, Stack wrote:
 
  On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishnavivekris...@gmail.com
   wrote:
 
  1.  How do I count rows fast in hbase?
 
  First I tired count 'test'  , takes ages.
 
  Saw that I could use RowCounter, but looks like it is deprecated.
 
  It is not.  Make sure you are using the one from mapreduce package as
  opposed to mapred package.
 
 
   I just need to verify the total counts.  Is it possible to see
 somewhere
  in
  the web interface or ganglia or by any other means?
 
   We don't keep a current count on a table.  Too expensive.  Run the
  rowcounter MR job.  This page may be of help:
 
 
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
 
  Good luck,
  St.Ack
 
 
  --
  Jeff Whiting
  Qualtrics Senior Software Engineer
  je...@qualtrics.com
 
 



Re: Row Counters

2011-03-16 Thread Vivek Krishna
I guess it is using the mapred class

11/03/16 20:58:27 INFO mapred.JobClient: Task Id :
attempt_201103161245_0005_m_04_0, Status : FAILED
java.io.IOException: Cannot create a record reader because of a previous
error. Please look at the previous logs lines from the task's full log for
more details.
 at
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
 at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
 at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.mapred.Child.main(Child.java:234)

How do I use mapreduce class?
Viv



On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu yuzhih...@gmail.com wrote:

 Since we have lived so long without this information, I guess we can hold
 for longer :-)
 Another issue I am working on is to reduce memory footprint. See the
 following discussion thread:
 One of the regionserver aborted, then the master shut down itself

 We have to bear in mind that there would be around 10K regions or more in
 production.

 Cheers

 On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting je...@qualtrics.com wrote:

  Just a random thought.  What about keeping a per region row count?  Then
 if
  you needed to get a row count for a table you'd just have to query each
  region once and sum.  Seems like it wouldn't be too expensive because
 you'd
  just have a row counter variable.  It maybe more complicated than I'm
 making
  it out to be though...
 
  ~Jeff
 
 
  On 3/16/2011 2:40 PM, Stack wrote:
 
  On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishnavivekris...@gmail.com
   wrote:
 
  1.  How do I count rows fast in hbase?
 
  First I tired count 'test'  , takes ages.
 
  Saw that I could use RowCounter, but looks like it is deprecated.
 
  It is not.  Make sure you are using the one from mapreduce package as
  opposed to mapred package.
 
 
   I just need to verify the total counts.  Is it possible to see
 somewhere
  in
  the web interface or ganglia or by any other means?
 
   We don't keep a current count on a table.  Too expensive.  Run the
  rowcounter MR job.  This page may be of help:
 
 
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
 
  Good luck,
  St.Ack
 
 
  --
  Jeff Whiting
  Qualtrics Senior Software Engineer
  je...@qualtrics.com
 
 



Re: java.io.FileNotFoundException:

2011-03-16 Thread Venkatesh
Thanks St.Ack..I'm blind..Got past that..
Now I get the same error for hadoop-0.20.2-core.jar.

I've removed *append*.jar all over the place and replaced it with 
hadoop-0.20.2-core.jar.
0.90.1 will work with hadoop-0.20.2-core, right? Regular gets/puts work, but not 
the mapreduce job:

java.io.FileNotFoundException: File 
/data/servers/datastore/mapred/mapred/system/job_201103161652_0004/libjars/hadoop-0.20.2-core.jar
 does not exist.
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at 
org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
at 
org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:633)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)


 


 

 

-Original Message-
From: Stack st...@duboce.net
To: user@hbase.apache.org
Sent: Wed, Mar 16, 2011 1:39 pm
Subject: Re: java.io.FileNotFoundException:


0.90.1 ships with zookeeper-3.3.2, not with 3.2.2.

St.Ack



On Wed, Mar 16, 2011 at 8:05 AM, Venkatesh vramanatha...@aol.com wrote:



  Does anyone how to get around this? Trying to run a mapreduce job in a 

cluster..The one change was hbase upgraded to 0.90.1 (from 0.20.6)..No code 

change





  java.io.FileNotFoundException: File 
 /data/servers/datastore/mapred/mapred/system/job_201103151601_0363/libjars/zookeeper-3.2.2.jar
  

does not exist.

at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)

at 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)

at 
 org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)

at 
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629)

at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)

at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)

at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)

at 
 com.aol.mail.antispam.Profiler.UserProfileJob.run(UserProfileJob.java:1916)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java








 


Re: java.io.FileNotFoundException:

2011-03-16 Thread Harsh J
0.90.1 ships with a hadoop-0.20-append jar (not vanilla hadoop
0.20.2). Look up its name in the lib/ directory of the distribution
(comes with a rev #) :)

On Thu, Mar 17, 2011 at 2:33 AM, Venkatesh vramanatha...@aol.com wrote:
 Thanks St.Ack..I'm blind..Got past that..
 Now I get for hadoop-0.20.2-core.jar

 I've removed *append*.jar all over the place  replace with 
 hadoop-0.20.2-core.jar
 0.90.1 will work with hadoop-0.20.2-core right? Regular gets/puts work..but 
 not the mapreduce job

 java.io.FileNotFoundException: File 
 /data/servers/datastore/mapred/mapred/system/job_201103161652_0004/libjars/hadoop-0.20.2-core.jar
  does not exist.
        at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at 
 org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
        at 
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:633)
        at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)









 -Original Message-
 From: Stack st...@duboce.net
 To: user@hbase.apache.org
 Sent: Wed, Mar 16, 2011 1:39 pm
 Subject: Re: java.io.FileNotFoundException:


 0.90.1 ships with zookeeper-3.3.2, not with 3.2.2.

 St.Ack



 On Wed, Mar 16, 2011 at 8:05 AM, Venkatesh vramanatha...@aol.com wrote:



  Does anyone how to get around this? Trying to run a mapreduce job in a

 cluster..The one change was hbase upgraded to 0.90.1 (from 0.20.6)..No code

 change





  java.io.FileNotFoundException: File 
 /data/servers/datastore/mapred/mapred/system/job_201103151601_0363/libjars/zookeeper-3.2.2.jar

 does not exist.

        at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)

        at 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)

        at 
 org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)

        at 
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629)

        at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)

        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)

        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)

        at 
 com.aol.mail.antispam.Profiler.UserProfileJob.run(UserProfileJob.java:1916)

        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java













-- 
Harsh J
http://harshj.com


Re: Row Counters

2011-03-16 Thread Ted Yu
In the future, describe your environment a bit.

The way I approach this is:
find the correct commandline from
src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java

Then I issue:
[hadoop@us01-ciqps1-name01 hbase]$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase
classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.1.jar
rowcounter packageindex

Then I check the map/reduce task on job tracker URL

On Wed, Mar 16, 2011 at 1:59 PM, Vivek Krishna vivekris...@gmail.comwrote:

 I guess it is using the mapred class

 11/03/16 20:58:27 INFO mapred.JobClient: Task Id :
 attempt_201103161245_0005_m_04_0, Status : FAILED
 java.io.IOException: Cannot create a record reader because of a previous
 error. Please look at the previous logs lines from the task's full log for
 more details.
  at

 org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
  at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
  at

 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
 at org.apache.hadoop.mapred.Child.main(Child.java:234)

 How do I use mapreduce class?
 Viv



 On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu yuzhih...@gmail.com wrote:

  Since we have lived so long without this information, I guess we can hold
  for longer :-)
  Another issue I am working on is to reduce memory footprint. See the
  following discussion thread:
  One of the regionserver aborted, then the master shut down itself
 
  We have to bear in mind that there would be around 10K regions or more in
  production.
 
  Cheers
 
  On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting je...@qualtrics.com
 wrote:
 
   Just a random thought.  What about keeping a per region row count?
  Then
  if
   you needed to get a row count for a table you'd just have to query each
   region once and sum.  Seems like it wouldn't be too expensive because
  you'd
   just have a row counter variable.  It maybe more complicated than I'm
  making
   it out to be though...
  
   ~Jeff
  
  
   On 3/16/2011 2:40 PM, Stack wrote:
  
   On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishnavivekris...@gmail.com
wrote:
  
   1.  How do I count rows fast in hbase?
  
   First I tired count 'test'  , takes ages.
  
   Saw that I could use RowCounter, but looks like it is deprecated.
  
   It is not.  Make sure you are using the one from mapreduce package as
   opposed to mapred package.
  
  
I just need to verify the total counts.  Is it possible to see
  somewhere
   in
   the web interface or ganglia or by any other means?
  
We don't keep a current count on a table.  Too expensive.  Run the
   rowcounter MR job.  This page may be of help:
  
  
 
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
  
   Good luck,
   St.Ack
  
  
   --
   Jeff Whiting
   Qualtrics Senior Software Engineer
   je...@qualtrics.com
  
  
 



Re: Hash keys

2011-03-16 Thread Lars George
Hi Eric,

Oops, you are right, my example was not clear and actually confusing
the keys with sequential ones. The hash should map every Nth row key
to the same bucket, so that you would for example see an interleaved
distribution of row keys to regions. Region 1 holds 1, 8, 15,... while
region 2 holds 2, 9, 16,... and so on. I do not think performance is a
big issue. And yes, this is currently all client side driven :(

Lars
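
To make the bucketing concrete, a minimal sketch of the salt-on-write / fan-out-on-read idea discussed below; this is an illustration of the approach, not code from the thread, and the bucket count, separator and hash choice are arbitrary.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeys {
  static final int BUCKETS = 7;

  // Writes: prefix the real key with a deterministic bucket, e.g. "h3_20110316...".
  static byte[] saltedKey(String key) {
    int bucket = (key.hashCode() & Integer.MAX_VALUE) % BUCKETS;
    return Bytes.toBytes("h" + bucket + "_" + key);
  }

  // Reads: one Scan per bucket for the logical range [startKey, stopKey); the
  // caller runs the scanners in parallel and merges results on the stripped key.
  static List<Scan> bucketScans(String startKey, String stopKey) {
    List<Scan> scans = new ArrayList<Scan>();
    for (int b = 0; b < BUCKETS; b++) {
      scans.add(new Scan(Bytes.toBytes("h" + b + "_" + startKey),
                         Bytes.toBytes("h" + b + "_" + stopKey)));
    }
    return scans;
  }
}

Because each key lands in exactly one bucket and the suffix ordering is preserved inside a bucket, the per-bucket range scans together cover the whole logical range.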

On Wed, Mar 16, 2011 at 2:57 PM, Eric Charles
eric.char...@u-mangate.com wrote:
 Hi Lars,
 Many tks for your explanations!

 About DFR (sequential-keys) vs DFW (random-keys) distinction, I imagine
 different cases (just rephrasing what you said to be sure I get it):

 - Keys are really random (GUID or whatever): you have the distribution for
 free, still can't do, and probably don't need, range-queries.

 - If keys are monotonically increasing (timestamp, autoincremented,...),
 there are two cases:
 1) sometimes, you don't need to do some range-queries and can store the key
 as a real hash (md5,...) to have distribution.
 2) For time-based series, for example, you may need to do some range queries,
 and adding a salt can be an answer to combine the best of both worlds.

 I understand the salt approach as recreating artificial key spaces on the
 client side.

 I was first confused reading row 1...1000 - prefix h1_.
 To really make the distribution random, I would have seen prefix/salt
 attributed randomly for a key leading to for example a h1 keyspace as such:
 h1_key2032, h1_key0023, h1_key1014343, ...

 Maybe you meant the intermediate approach where time keys of hour 1 going
 to h1 keyspace, keys of hour 2 going to h2 keyspace,...
 In that case, if you look for keys in hour 1, you would only need one
 scanner cause you know that they reside in h1_, and you could query with
 scan(h1_time1, h1_time2).

 But at at time, as you describe, you may need to scan different buckets with
 different scanners and use an ordered list to contain the result.
 - What about performance in that case? for very large dataset, a range query
 will take much time. I can imagine async client at the rescue. Maybe also
 mapreduce jobs could help cause if will benefit from data locality.
 - Also, the client application must manage the salts: it's a bit like
 reinventing a salt layer on top of the hbase region servers, letting
 client carry on this layer. The client will have to store (in hbase :)) the
 mapping between key ranges and their salt prefixes. It's a bit like
 exporting some core? functionality to the client.

 Strange, I feel I missed your point :)
 Tks,

 - Eric

 Sidenote: ...and yes, it seems I will have to learn some ruby stuff (should
 get used to, cause I just learned another scripting language running on jvm
 for another project...)


 On 16/03/2011 13:00, Lars George wrote:

 Hi Eric,

 Socorro is Java and Python, I was just mentioning it as a possible
 source of inspiration :) You can learn Ruby and implement it (I hear
 it is easy... *cough*) or write that same in a small Java app and use
 it from the command line or so.

 And yes, you can range scan using a prefix. We were discussing this
 recently and there is this notion of design for reads, or design for
 writes. DFR is usually sequential keys and DFW is random keys. It is
 tough to find common grounds as both designs are on the far end of the
 same spectrum. Finding a middle ground is the bucketed (or salted)
 approach, which gives you distribution but still being able to scan...
 but not without some client side support. One typical class of data is
 timeseries based keys. As for scanning them, you need N client side
 scanners. Imagine this example:

 row       1 ... 1000 -> Prefix h1_
 row 1001 ... 2000 -> Prefix h2_
 row 2001 ... 3000 -> Prefix h3_
 row 3001 ... 4000 -> Prefix h4_
 row 4001 ... 5000 -> Prefix h5_
 row 5001 ... 6000 -> Prefix h6_
 row 6001 ... 7000 -> Prefix h7_

 So you have divided the entire range into 7 buckets. The prefixes
 (also sometimes called salt) are used to distribute them row keys to
 region servers. To scan the entire range as one large key space you
 need to create 7 scanners:

 1. scanner: start row: h1_, end row h2_
 2. scanner: start row: h2_, end row h3_
 3. scanner: start row: h3_, end row h4_
 4. scanner: start row: h4_, end row h5_
 5. scanner: start row: h5_, end row h6_
 6. scanner: start row: h6_, end row h7_
 7. scanner: start row: h7_, end row 

 Now each of them gives you the first row that matches the start and
 end row keys they are configured for. So you then take that first KV
 they offer and add it to a list, sorted by kv.getRow() while removing
 the hash prefix. For example, scanner 1 may have row h1_1 to offer,
 then split and drop the prefix h1_ to get 1. The list then would
 hold something like:

 1. row 1 -> kv from scanner 1
 2. row 1010 -> kv from scanner 2
 3. row 2001 -> kv from scanner 3
 4. row 3033 -> kv from scanner 4
 5. row 4001 -> kv from scanner 5
 6. row 5002 -> kv from scanner 

Re: habse schema design and retrieving values through REST interface

2011-03-16 Thread Andrew Purtell
  This facility is not exposed in the REST API at the moment
 (not that I know of -- please someone correct me if I'm
 wrong).

Wrong. :-)

See ScannerModel in the rest package: 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/rest/model/ScannerModel.html

ScannerModel#setBatch

   - Andy
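
For the PHP/cURL use case that started the thread, roughly how the batch setting is exercised over HTTP: create a stateful scanner with a batch limit, then page through it with GETs. The XML attribute name and the /tablename/scanner endpoint are assumptions based on ScannerModel and should be checked against your 0.90.x install; host and table are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestBatchedScan {
  public static void main(String[] args) throws Exception {
    // 1. Create a scanner that hands back at most 500 cells per fetch.
    URL create = new URL("http://localhost:8080/tablename/scanner");
    HttpURLConnection put = (HttpURLConnection) create.openConnection();
    put.setDoOutput(true);
    put.setRequestMethod("PUT");
    put.setRequestProperty("Content-Type", "text/xml");
    OutputStream out = put.getOutputStream();
    out.write("<Scanner batch=\"500\"/>".getBytes("UTF-8"));
    out.close();
    String scannerUrl = put.getHeaderField("Location"); // URL of the new scanner
    put.disconnect();

    // 2. Each GET on that URL returns the next chunk of cells as XML;
    //    an empty response means the scanner is exhausted.
    HttpURLConnection get = (HttpURLConnection) new URL(scannerUrl).openConnection();
    get.setRequestProperty("Accept", "text/xml");
    if (get.getResponseCode() == 200) {
      BufferedReader in =
          new BufferedReader(new InputStreamReader(get.getInputStream(), "UTF-8"));
      for (String line; (line = in.readLine()) != null;) {
        System.out.println(line);
      }
      in.close();
    }
    get.disconnect();
  }
}

The same two requests translate directly to a PUT and a GET with curl from PHP.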



--- On Wed, 3/16/11, Stack st...@duboce.net wrote:

 From: Stack st...@duboce.net
 Subject: Re: habse schema design and retrieving values through REST interface
 To: user@hbase.apache.org
 Date: Wednesday, March 16, 2011, 10:47 AM
 You can limit the return when
 scanning from the java api; see
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
  This facility is not exposed in the REST API at the moment
 (not that
 I know of -- please someone correct me if I'm
 wrong).   So, yes, wide
 rows, if thousands of elements of some size, since they
 need to be
 composed all in RAM, could bring on an OOME if the composed
 size 
 available heap.
 
 St.Ack
 
 
 On Wed, Mar 16, 2011 at 2:41 AM, sreejith P. K. sreejit...@nesote.com
 wrote:
  With this schema, if i can limit the column family
 over a particular range,
  I can manage everything else. (like Select first n
 columns of a column
  family)
 
  Sreejith
 
 
  On Wed, Mar 16, 2011 at 12:33 PM, sreejith P. K.
 sreejit...@nesote.comwrote:
 
  @ Jean-Daniel,
 
  As i told, each row key contains thousands of
 column family values (may be
  i am wrong with the schema design). I started REST
 and tried to cURL
  http:/localhost/tablename/rowname. It seems it
 will work only with limited
  amount of data (may be i can limit the cURL
 output), and how i can limit the
  column values for a particular row?
  Suppose i have two thousand urls under a keyword
 and i need to fetch the
  urls and should limit the result to five hundred.
 How it is possible??
 
  @ tsuna,
 
   It seems http://www.elasticsearch.org/ using
 CouchDB right?
 
 
  On Tue, Mar 15, 2011 at 11:32 PM, Jean-Daniel
 Cryans jdcry...@apache.orgwrote:
 
  Can you tell why it's not able to get the
 bigger rows? Why would you
  try another schema if you don't even know
 what's going on right now?
  If you have the same issue with the new
 schema, you're back to square
  one right?
 
  Looking at the logs should give you some
 hints.
 
  J-D
 
  On Tue, Mar 15, 2011 at 10:19 AM, sreejith P.
 K. sreejit...@nesote.com
  wrote:
   Hello experts,
  
   I have a scenario as follows,
   I need to maintain a huge table for a
 'web crawler' project in HBASE.
   Basically it contains thousands of
 keywords and for each keyword i need
  to
   maintain a list of urls (it again will
 count in thousands).
  Corresponding to
   each url, i need to store a number, which
 will in turn resemble the
  priority
   value the keyword holds.
   Let me explain you a bit, Suppose i have
 a keyword 'united states', i
  need
   to store about ten thousand urls
 corresponding to that keyword. Each
  keyword
   will be holding a priority value which is
 an integer. Again i have
  thousands
   of keywords like that. The rare thing
 about this is i need to do the
  project
   in PHP.
  
   I have configured a hadoop-hbase cluster
 consists of three machines. My
  plan
   was to design the schema by taking the
 keyword as 'row key'. The urls i
  will
   keep as column family. The schema looked
 fine at first. I have done a
  lot of
   research on how to retrieve the url list
 if i know the keyword. Any ways
  i
   managed a way out by preg-matching the
 xml data out put using the url
   http://localhost:8080/tablename/rowkey (REST interface
 i used). It also
   works fine if the url list has a limited
 number of urls. When it comes
  in
   thousands, it seems i cannot fetch the
 xml data itself!
   Now I am in a do or die situation. Please
 correct me if my schema design
   needs any changes (I do believe it should
 change!) and please help me up
  to
   retrieve the column family values (urls)
    corresponding to each row-key in an
 efficient way. Please guide me how
  i
   can do the same using PHP-REST
 interface.
   Thanks in advance.
  
   Sreejith
  
 
 
 
 
  --
  Sreejith PK
  Nesote Technologies (P) Ltd
 
 
 
 
 
  --
  Sreejith PK
  Nesote Technologies (P) Ltd
 
 





Re: Row Counters

2011-03-16 Thread Vivek Krishna
Oops. sorry about the environment.

I am using hadoop-0.20.2-CDH3B4, and hbase-0.90.1-CDH3B4
and zookeeper-3.3.2-CDH3B4.

I was able to configure jars and run the command,

hadoop jar /usr/lib/hbase/hbase-0.90.1-CDH3B4.jar rowcounter test,

but I get

java.io.IOException: Cannot create a record reader because of a
previous error. Please look at the previous logs lines from the task's
full log for more details.
at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.mapred.Child.main(Child.java:234)


The previous error in the task's full log is ..

2011-03-16 21:41:03,367 ERROR
org.apache.hadoop.hbase.mapreduce.TableInputFormat:
org.apache.hadoop.hbase.ZooKeeperConnectionException:
org.apache.hadoop.hbase.ZooKeeperConnectionException:
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:988)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:301)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.init(HConnectionManager.java:292)
at 
org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:155)
at org.apache.hadoop.hbase.client.HTable.init(HTable.java:167)
at org.apache.hadoop.hbase.client.HTable.init(HTable.java:145)
at 
org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.mapred.Child.main(Child.java:234)
Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException:
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase
at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:147)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:986)
... 15 more
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:133)
... 16 more


I am pretty sure the zookeeper master is running on the same machine at
port 2181.  Not sure why the connection loss occurs.  Do I need
HBASE-3578 (https://issues.apache.org/jira/browse/HBASE-3578) by any
chance?

Viv



On Wed, Mar 16, 2011 at 5:36 PM, Ted Yu yuzhih...@gmail.com wrote:

 In the future, describe your environment a bit.

 The way I approach this is:
 find the correct commandline from
 src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java

 Then I issue:
 [hadoop@us01-ciqps1-name01 hbase]$
 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase
 classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.1.jar
 rowcounter packageindex

 Then I check the map/reduce task on job tracker URL

 On Wed, Mar 16, 2011 at 1:59 PM, Vivek Krishna vivekris...@gmail.com
 wrote:

  I guess it is using the mapred class
 
  11/03/16 20:58:27 INFO mapred.JobClient: Task Id :
  attempt_201103161245_0005_m_04_0, Status : FAILED
  java.io.IOException: Cannot create a record reader because of a previous
  error. Please look at the previous logs lines from the task's full log
 for
  more details.
   at
 
 
 org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
  at 

Re: Row Counters

2011-03-16 Thread Ted Yu
The connection loss was due to the client being unable to find the zookeeper quorum.

Use the command line in my previous email.
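
For anyone who cannot put hbase-site.xml on the task classpath the way that command line does, the equivalent is to spell the quorum out in the job configuration before submitting; a small sketch, with the host names obviously placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class QuorumConfig {
  public static Configuration jobConf() {
    Configuration conf = HBaseConfiguration.create();
    // Same values hbase-site.xml would provide; without them the client created
    // inside each map task falls back to the default quorum (localhost), which
    // is what typically produces the ConnectionLoss above.
    conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    return conf;
  }
}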

On Wed, Mar 16, 2011 at 3:18 PM, Vivek Krishna vivekris...@gmail.comwrote:

 Oops. sorry about the environment.

 I am using hadoop-0.20.2-CDH3B4, and hbase-0.90.1-CDH3B4
 and zookeeper-3.3.2-CDH3B4.

 I was able to configure jars and run the command,

 hadoop jar /usr/lib/hbase/hbase-0.90.1-CDH3B4.jar rowcounter test,

 but I get

 java.io.IOException: Cannot create a record reader because of a previous 
 error. Please look at the previous logs lines from the task's full log for 
 more details.
   at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
   at org.apache.hadoop.mapred.Child.main(Child.java:234)


 The previous error in the task's full log is ..


 2011-03-16 21:41:03,367 ERROR 
 org.apache.hadoop.hbase.mapreduce.TableInputFormat: 
 org.apache.hadoop.hbase.ZooKeeperConnectionException: 
 org.apache.hadoop.hbase.ZooKeeperConnectionException: 
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase
   at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:988)
   at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:301)
   at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.init(HConnectionManager.java:292)
   at 
 org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:155)
   at org.apache.hadoop.hbase.client.HTable.init(HTable.java:167)
   at org.apache.hadoop.hbase.client.HTable.init(HTable.java:145)
   at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91)
   at 
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
   at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
   at org.apache.hadoop.mapred.Child.main(Child.java:234)
 Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: 
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase
   at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:147)
   at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:986)
   ... 15 more
 Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
 KeeperErrorCode = ConnectionLoss for /hbase
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
   at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
   at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:133)
   ... 16 more


 find I am pretty sure zookeeper master is running in the same machine at
 port 2181.  Not sure why the connection loss occurs.  Do I need 
 HBASE-3578https://issues.apache.org/jira/browse/HBASE-3578by any chance?

 Viv




 On Wed, Mar 16, 2011 at 5:36 PM, Ted Yu yuzhih...@gmail.com wrote:

 In the future, describe your environment a bit.

 The way I approach this is:
 find the correct commandline from
 src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java

 Then I issue:
 [hadoop@us01-ciqps1-name01 hbase]$
 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase
 classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.1.jar
 rowcounter packageindex

 Then I check the map/reduce task on job tracker URL

 On Wed, Mar 16, 2011 at 1:59 PM, Vivek Krishna vivekris...@gmail.com
 wrote:

  I guess it is using the mapred class
 
  11/03/16 20:58:27 INFO mapred.JobClient: Task Id :
  attempt_201103161245_0005_m_04_0, Status : FAILED
  java.io.IOException: Cannot create a record reader because of a previous
  

OT - Hash Code Creation

2011-03-16 Thread Peter Haidinyak
Hi,
	This is a little off topic, but this group seems pretty swift so I 
thought I would ask. I am aggregating a day's worth of log data, which means I 
have a Map of over 24 million elements. What would be a good algorithm to use 
for generating hash codes for these elements that cuts down on collisions? My 
application starts out reading a log (144 logs in all) in about 20 seconds, 
and by the time I reach the last log it is taking around 120 seconds. The extra 
100 seconds have to do with hash table collisions. I've played around with 
different hashing algorithms and cut the original time from over 300 seconds to 
120, but I know I can do better.
The key I am using for the Map is an alphanumeric string that is approximately 
16 characters long, with the last 4 or 5 characters being the most unique.
  
Any ideas? 

Thanks

-Pete
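
Two cheap mitigations often help in exactly this situation, sketched below on the assumption that a java.util.HashMap is in play: pre-size the map so 24 million entries never trigger a rehash, and give the key a hashCode() that mixes every character (FNV-1a here) rather than leaning on the last few. The key string and table size are made up for the example.

import java.util.HashMap;
import java.util.Map;

public class LogKey {
  private final String key;
  private final int hash;

  LogKey(String key) {
    this.key = key;
    int h = 0x811c9dc5;            // FNV-1a 32-bit offset basis
    for (int i = 0; i < key.length(); i++) {
      h ^= key.charAt(i);          // mix every character of the key
      h *= 0x01000193;             // FNV-1a prime
    }
    this.hash = h;
  }

  @Override public int hashCode() { return hash; }

  @Override public boolean equals(Object o) {
    return o instanceof LogKey && key.equals(((LogKey) o).key);
  }

  public static void main(String[] args) {
    // Sized so ~24M entries stay under the 0.75 load-factor threshold and the
    // map never rehashes while the logs are being read.
    Map<LogKey, Long> counts = new HashMap<LogKey, Long>(64 * 1024 * 1024);
    LogKey k = new LogKey("abc123def456XY01");
    counts.put(k, 1L);
    System.out.println(counts.get(new LogKey("abc123def456XY01")));
  }
}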


Re: OT - Hash Code Creation

2011-03-16 Thread Andrey Stepachev
Try a hash table with double hashing.
Something like this:
http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm

2011/3/17 Peter Haidinyak phaidin...@local.com

 Hi,
This is a little off topic but this group seems pretty swift so I
 thought I would ask. I am aggregating a day's worth of log data which means
 I have a Map of over 24 million elements. What would be a good algorithm to
 use for generating Hash Codes for these elements that cut down on
 collisions? I application starts out reading in a log (144 logs in all) in
 about 20 seconds and by the time I reach the last log it is taking around
 120 seconds. The extra 100 seconds have to do with Hash Table Collisions.
 I've played around with different Hashing algorithms and cut the original
 time from over 300 seconds to 120 but I know I can do better.
 The key I am using for the Map is an alpha-numeric string that is
 approximately 16 character long with the last 4 or 5 character being the
 most unique.

 Any ideas?

 Thanks

 -Pete



Re: java.io.FileNotFoundException:

2011-03-16 Thread Stack
The below is a pretty basic error.  Reference the jar that is actually
present on your cluster.
St.Ack
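
A related tactic, if the goal is to stop shuffling jars by hand: let the job ship HBase's own dependencies through the distributed cache. The addDependencyJars helper is assumed to be present in the 0.90.x TableMapReduceUtil (check your version before relying on it), and the table and mapper names are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ProfileJob {
  static class NoopMapper extends TableMapper<ImmutableBytesWritable, IntWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
      // real per-row work would go here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "profile");
    job.setJarByClass(ProfileJob.class);
    TableMapReduceUtil.initTableMapperJob("tablename", new Scan(), NoopMapper.class,
        ImmutableBytesWritable.class, IntWritable.class, job);
    // Adds the hbase and zookeeper jars from the submitter's classpath to the
    // job, so the task nodes do not depend on hand-copied jars.
    TableMapReduceUtil.addDependencyJars(job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.waitForCompletion(true);
  }
}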

On Wed, Mar 16, 2011 at 3:50 PM, Venkatesh vramanatha...@aol.com wrote:
 yeah..i was aware of that..I removed that and tried with hadoop-0.20.2-core.jar 
 as I wasn't ready to upgrade hadoop..

 I tried this time with the *append*.jar ..now it's complaining FileNotFound 
 for the append jar:






  File 
 /data/servers/datastore/mapred/mapred/system/job_201103161750_0030/libjars/hadoop-core-0.20-append-r1056497.jar
  does not exist.
        at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at 
 org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
        at 
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:633)
        at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)




 -Original Message-
 From: Harsh J qwertyman...@gmail.com
 To: user@hbase.apache.org
 Sent: Wed, Mar 16, 2011 5:32 pm
 Subject: Re: java.io.FileNotFoundException:


 0.90.1 ships with a hadoop-0.20-append jar (not vanilla hadoop

 0.20.2). Look up its name in the lib/ directory of the distribution

 (comes with a rev #) :)



 On Thu, Mar 17, 2011 at 2:33 AM, Venkatesh vramanatha...@aol.com wrote:

 Thanks St.Ack..I'm blind..Got past that..

 Now I get for hadoop-0.20.2-core.jar

 I've removed *append*.jar all over the place and replaced with
 hadoop-0.20.2-core.jar

 0.90.1 will work with hadoop-0.20.2-core right? Regular gets/puts work..but
 not the mapreduce job

 java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103161652_0004/libjars/hadoop-0.20.2-core.jar does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:633)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)

 -Original Message-
 From: Stack st...@duboce.net
 To: user@hbase.apache.org
 Sent: Wed, Mar 16, 2011 1:39 pm
 Subject: Re: java.io.FileNotFoundException:

 0.90.1 ships with zookeeper-3.3.2, not with 3.2.2.

 St.Ack

 On Wed, Mar 16, 2011 at 8:05 AM, Venkatesh vramanatha...@aol.com wrote:

  Does anyone know how to get around this? Trying to run a mapreduce job in a
 cluster..The one change was hbase upgraded to 0.90.1 (from 0.20.6)..No code
 change

  java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103151601_0363/libjars/zookeeper-3.2.2.jar does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at com.aol.mail.antispam.Profiler.UserProfileJob.run(UserProfileJob.java:1916)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java

 --
 Harsh J
 http://harshj.com


Re: habse schema design and retrieving values through REST interface

2011-03-16 Thread Stack
Thank you Andrew.
St.Ack
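
For reference, a minimal Java sketch of the batching discussed in the quoted
messages below: Scan#setBatch caps how many columns of a wide row come back per
Result, so a row with thousands of url columns is returned in slices instead of
all at once. The row key, family name and batch size follow the poster's
example; treat the rest as an assumption-laden sketch rather than a drop-in
program.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedWideRowScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "tablename");

            // Start row inclusive, stop row exclusive, so this scans exactly
            // the single wide row whose key is the keyword.
            Scan scan = new Scan(Bytes.toBytes("united states"),
                                 Bytes.toBytes("united states\0"));
            scan.addFamily(Bytes.toBytes("urls"));
            scan.setBatch(500);   // at most 500 columns (urls) per Result

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    // One wide row arrives as several Results, each carrying
                    // up to 500 KeyValues, so nothing huge is built in RAM.
                    System.out.println(r.size() + " columns in this batch");
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }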

On Wed, Mar 16, 2011 at 3:12 PM, Andrew Purtell apurt...@apache.org wrote:
  This facility is not exposed in the REST API at the moment
 (not that I know of -- please someone correct me if I'm
 wrong).

 Wrong. :-)

 See ScannerModel in the rest package: 
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/rest/model/ScannerModel.html

 ScannerModel#setBatch

   - Andy



 --- On Wed, 3/16/11, Stack st...@duboce.net wrote:

 From: Stack st...@duboce.net
 Subject: Re: habse schema design and retrieving values through REST interface
 To: user@hbase.apache.org
 Date: Wednesday, March 16, 2011, 10:47 AM
 You can limit the return when
 scanning from the java api; see
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
  This facility is not exposed in the REST API at the moment
 (not that
 I know of -- please someone correct me if I'm
 wrong).   So, yes, wide
 rows, if thousands of elements of some size, since they
 need to be
 composed all in RAM, could bring on an OOME if the composed
 size exceeds the available heap.

 St.Ack


 On Wed, Mar 16, 2011 at 2:41 AM, sreejith P. K. sreejit...@nesote.com
 wrote:
  With this schema, if i can limit the column family
 over a particular range,
  I can manage everything else. (like Select first n
 columns of a column
  family)
 
  Sreejith
 
 
  On Wed, Mar 16, 2011 at 12:33 PM, sreejith P. K.
 sreejit...@nesote.comwrote:
 
  @ Jean-Daniel,
 
  As i told, each row key contains thousands of
 column family values (may be
  i am wrong with the schema design). I started REST
 and tried to cURL
  http:/localhost/tablename/rowname. It seems it
 will work only with limited
  amount of data (may be i can limit the cURL
 output), and how i can limit the
  column values for a particular row?
  Suppose i have two thousand urls under a keyword
 and i need to fetch the
  urls and should limit the result to five hundred.
 How it is possible??
 
  @ tsuna,
 
   It seems http://www.elasticsearch.org/ using
 CouchDB right?
 
 
  On Tue, Mar 15, 2011 at 11:32 PM, Jean-Daniel
 Cryans jdcry...@apache.orgwrote:
 
  Can you tell why it's not able to get the
 bigger rows? Why would you
  try another schema if you don't even know
 what's going on right now?
  If you have the same issue with the new
 schema, you're back to square
  one right?
 
  Looking at the logs should give you some
 hints.
 
  J-D
 
  On Tue, Mar 15, 2011 at 10:19 AM, sreejith P.
 K. sreejit...@nesote.com
  wrote:
   Hello experts,
  
   I have a scenario as follows,
   I need to maintain a huge table for a
 'web crawler' project in HBASE.
   Basically it contains thousands of
 keywords and for each keyword i need
  to
   maintain a list of urls (it again will
 count in thousands).
  Corresponding to
   each url, i need to store a number, which
 will in turn resemble the
  priority
   value the keyword holds.
   Let me explain you a bit, Suppose i have
 a keyword 'united states', i
  need
   to store about ten thousand urls
 corresponding to that keyword. Each
  keyword
   will be holding a priority value which is
 an integer. Again i have
  thousands
   of keywords like that. The rare thing
 about this is i need to do the
  project
   in PHP.
  
   I have configured a hadoop-hbase cluster
 consists of three machines. My
  plan
   was to design the schema by taking the
 keyword as 'row key'. The urls i
  will
   keep as column family. The schema looked
 fine at first. I have done a
  lot of
   research on how to retrieve the url list
 if i know the keyword. Any ways
  i
   managed a way out by preg-matching the
 xml data out put using the url
   http://localhost:8080/tablename/rowkey (REST interface
 i used). It also
   works fine if the url list has a limited
 number of urls. When it comes
  in
   thousands, it seems i cannot fetch the
 xml data itself!
   Now I am in a do or die situation. Please
 correct me if my schema design
   needs any changes (I do believe it should
 change!) and please help me up
  to
   retrieve the column family values (urls)
    corresponding to each row-key in an
 efficient way. Please guide me how
  i
   can do the same using PHP-REST
 interface.
   Thanks in advance.
  
   Sreejith
  
 
 
 
 
  --
  Sreejith PK
  Nesote Technologies (P) Ltd
 
 
 
 
 
  --
  Sreejith PK
  Nesote Technologies (P) Ltd
 







Re: Does HBase use spaces of HDFS?

2011-03-16 Thread edward choi
Thanks for the info.
That link you referred me to was great!!
Thanks again :)

Ed

2011/3/8 Suraj Varma svarma...@gmail.com

 In the standalone mode, HBase uses the local file system as its storage. In
 pseudo-distributed and fully-distributed modes, HBase uses HDFS as the
 storage.
 See http://hbase.apache.org/notsoquick.html for more details on the
 different modes.

 For details on storage, see
 http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
 --Suraj
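
A quick way to see this on a given install (a small sketch; it simply prints
whatever the hbase-site.xml on the classpath says): in standalone mode
hbase.rootdir is typically a file:/// path, while in pseudo- or
fully-distributed mode it is an hdfs:// path, which is why the tables really do
consume HDFS space.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class WhereDoesHBaseStoreData {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // file:///... => local filesystem (standalone)
            // hdfs://...  => data lives in HDFS, spread over the datanodes
            System.out.println(conf.get("hbase.rootdir"));
        }
    }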

 On Mon, Mar 7, 2011 at 9:42 PM, edward choi mp2...@gmail.com wrote:

  Sorry for this totally newbie question.
 
  I'm just wondering if HBase uses HDFS space.
 
  I read it in the reference book that HBase table size increases
  automatically as the table entry increases.
  So I am guessing that HBase manages a separate storage other than HDFS.
  (But
  then why does HBase operate on top of HDFS? Truly confusing...)
 
  If HBase doesn't use HDFS space, I can designate only a single machine to
  be
  a HDFS slave, and assign bunch of other machines to be HBase slaves.
  But if HBase does use HDFS space, I'd have to balance the ratio of HDFS and
  HBase within my machines.
 
  Could anyone give me a clear heads up?
 
  Ed
 



Re: Row Counters

2011-03-16 Thread Bill Graham
Back to the issue of keeping a count, I've often wondered whether this
would be easy to do without much cost at compaction time. It of course
wouldn't be a true real-time total but something like a
compactedRowCount. It could be a useful metric to expose via JMX to
get a feel for growth over time.
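
On the ZooKeeperConnectionException / ConnectionLoss for /hbase that comes up in
the quoted exchange below: that error usually means the client side (here, the
map tasks) never saw the real quorum address and fell back to localhost,
typically because hbase-site.xml was not on the classpath; Ted's
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` approach fixes exactly that
for the job. A minimal sketch of verifying a configuration from plain Java (the
host names are placeholders, not from the thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class QuorumCheck {
        public static void main(String[] args) throws Exception {
            // Picks up hbase-site.xml only if it is actually on the classpath.
            Configuration conf = HBaseConfiguration.create();

            // Placeholder hosts: without these settings (or an hbase-site.xml
            // providing them) the client defaults to localhost and fails with
            // ConnectionLoss for /hbase.
            conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com");
            conf.set("hbase.zookeeper.property.clientPort", "2181");

            // Throws if ZooKeeper/HBase cannot be reached with this config.
            HBaseAdmin.checkHBaseAvailable(conf);
            System.out.println("HBase is reachable with this configuration");
        }
    }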


On Wed, Mar 16, 2011 at 3:40 PM, Vivek Krishna vivekris...@gmail.com wrote:
 Works. Thanks.
 Viv



 On Wed, Mar 16, 2011 at 6:21 PM, Ted Yu yuzhih...@gmail.com wrote:

 The connection loss was due to the inability to find the zookeeper quorum

 Use the commandline in my previous email.


 On Wed, Mar 16, 2011 at 3:18 PM, Vivek Krishna vivekris...@gmail.comwrote:

 Oops. sorry about the environment.

 I am using hadoop-0.20.2-CDH3B4, and hbase-0.90.1-CDH3B4
 and zookeeper-3.3.2-CDH3B4.

 I was able to configure jars and run the command,

 hadoop jar /usr/lib/hbase/hbase-0.90.1-CDH3B4.jar rowcounter test,

 but I get

 java.io.IOException: Cannot create a record reader because of a previous 
 error. Please look at the previous logs lines from the task's full log for 
 more details.
      at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
      at org.apache.hadoop.mapred.Child.main(Child.java:234)


 The previous error in the task's full log is ..


 2011-03-16 21:41:03,367 ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:988)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:301)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:292)
      at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:155)
      at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:167)
      at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:145)
      at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91)
      at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
      at org.apache.hadoop.mapred.Child.main(Child.java:234)
 Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
      at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:147)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:986)
      ... 15 more
 Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
      at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
      at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
      at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
      ... 16 more


 I am pretty sure the zookeeper master is running on the same machine at
 port 2181.  Not sure why the connection loss occurs.  Do I need
 HBASE-3578 https://issues.apache.org/jira/browse/HBASE-3578 by any
 chance?

 Viv




 On Wed, Mar 16, 2011 at 5:36 PM, Ted Yu yuzhih...@gmail.com wrote:

 In the future, describe your environment a bit.

 The way I approach this is:
 find the correct commandline from
 src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java

 Then I issue:
 [hadoop@us01-ciqps1-name01 hbase]$
 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase
 classpath` 

Re: java.io.FileNotFoundException:

2011-03-16 Thread Venkatesh
yeah..that's why i feel very stupid..I'm pretty sure it exists on my 
cluster..but i still get the err..
I'll try on a fresh day

-Original Message-
From: Stack st...@duboce.net
To: user@hbase.apache.org
Sent: Wed, Mar 16, 2011 7:44 pm
Subject: Re: java.io.FileNotFoundException:


The below is pretty basic error.  Reference the jar that is actually
present on your cluster.

St.Ack

On Wed, Mar 16, 2011 at 3:50 PM, Venkatesh vramanatha...@aol.com wrote:

 yeah..i was aware of that..I removed that and tried with hadoop-0.20.2-core.jar 
 as I wasn't ready to upgrade hadoop..

 I tried this time with the *append*.jar ..now it's complaining FileNotFound 
 for append

  File /data/servers/datastore/mapred/mapred/system/job_201103161750_0030/libjars/hadoop-core-0.20-append-r1056497.jar does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:633)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)

 -Original Message-
 From: Harsh J qwertyman...@gmail.com
 To: user@hbase.apache.org
 Sent: Wed, Mar 16, 2011 5:32 pm
 Subject: Re: java.io.FileNotFoundException:

 0.90.1 ships with a hadoop-0.20-append jar (not vanilla hadoop
 0.20.2). Look up its name in the lib/ directory of the distribution
 (comes with a rev #) :)

 On Thu, Mar 17, 2011 at 2:33 AM, Venkatesh vramanatha...@aol.com wrote:

 Thanks St.Ack..I'm blind..Got past that..

 Now I get for hadoop-0.20.2-core.jar

 I've removed *append*.jar all over the place and replaced with
 hadoop-0.20.2-core.jar

 0.90.1 will work with hadoop-0.20.2-core right? Regular gets/puts work..but
 not the mapreduce job

 java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103161652_0004/libjars/hadoop-0.20.2-core.jar does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:633)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)

  -Original Message-
  From: Stack st...@duboce.net
  To: user@hbase.apache.org
  Sent: Wed, Mar 16, 2011 1:39 pm
  Subject: Re: java.io.FileNotFoundException:

  0.90.1 ships with zookeeper-3.3.2, not with 3.2.2.

  St.Ack

  On Wed, Mar 16, 2011 at 8:05 AM, Venkatesh vramanatha...@aol.com wrote:

   Does anyone know how to get around this? Trying to run a mapreduce job in a
  cluster..The one change was hbase upgraded to 0.90.1 (from 0.20.6)..No code
  change

   java.io.FileNotFoundException: File /data/servers/datastore/mapred/mapred/system/job_201103151601_0363/libjars/zookeeper-3.2.2.jar does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:629)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at com.aol.mail.antispam.Profiler.UserProfileJob.run(UserProfileJob.java:1916)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java

  --
  Harsh J
  http://harshj.com


Is there any influence to the performance of hbase if we use TTL to clean data?

2011-03-16 Thread Zhou Shuaifeng
I'm doing a performance test on hbase, and found that performance gets lower as
the data grows. I set TTL to be 86400 (one day). Is there any influence on
the performance when hbase does a major compact to clean outdated data?

Thanks a lot.

 

Zhou Shuaifeng(Frank)



Re: Is there any influence to the performance of hbase if we use TTL to clean data?

2011-03-16 Thread Suraj Varma
So, yes, a major compaction is disk io intensive and can influence performance.

Here's a thread on this http://search-hadoop.com/m/PI1dl1pXgEg2
And here's a more recent one: http://search-hadoop.com/m/BNxKZeI8z

--Suraj
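
For reference, a hedged sketch of the two knobs involved (table and family
names are placeholders): the TTL is set per column family, and the physical
cleanup of expired cells happens at major compaction, which is the
disk-io-heavy step and can be requested explicitly during an off-peak window.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class TtlSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // Cells in family f1 expire after one day (TTL is in seconds);
            // they stop being returned once expired, but the space is only
            // reclaimed when a (major) compaction rewrites the store files.
            HTableDescriptor desc = new HTableDescriptor("perftest");
            HColumnDescriptor family = new HColumnDescriptor("f1");
            family.setTimeToLive(86400);
            desc.addFamily(family);
            admin.createTable(desc);

            // Request the heavy cleanup step at a time of your choosing.
            admin.majorCompact("perftest");
        }
    }

Whether and when to trigger major compactions manually depends on the workload;
the threads linked above discuss the trade-offs.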

On Wed, Mar 16, 2011 at 7:49 PM, Zhou Shuaifeng
zhoushuaif...@huawei.com wrote:
 I'm doing a performance test on hbase, and found that performance gets lower as
 the data grows. I set TTL to be 86400 (one day). Is there any influence on
 the performance when hbase does a major compact to clean outdated data?

 Thanks a lot.



 Zhou Shuaifeng(Frank)


Re: hbase 0.90.1 upgrade issue - mapreduce job

2011-03-16 Thread Suraj Varma
Does this help?: http://search-hadoop.com/m/JI3ro1EKY0u
--Suraj
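
Not a diagnosis of this particular failure, just a hedged sketch of the usual
driver wiring for an HBase 0.90-style MapReduce job: the submitting JVM and the
tasks both need the HBase and ZooKeeper jars plus hbase-site.xml, either via
HADOOP_CLASSPATH / the job jar's lib directory or via the distributed cache.
Class, table and mapper choices below are placeholders, and whether
TableMapReduceUtil.addDependencyJars is present depends on the exact HBase
build.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ScanJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "scan-job");
            // Ship the jar containing this driver to the tasktrackers.
            job.setJarByClass(ScanJobDriver.class);

            Scan scan = new Scan();
            scan.setCaching(500);
            scan.setCacheBlocks(false);   // don't pollute the block cache from MR

            TableMapReduceUtil.initTableMapperJob(
                "mytable", scan, IdentityTableMapper.class,
                ImmutableBytesWritable.class, Result.class, job);

            // If available in this HBase build, this pushes the HBase/ZooKeeper
            // jars into the job's distributed cache so the tasks can find them.
            TableMapReduceUtil.addDependencyJars(job);

            job.setNumReduceTasks(0);
            job.setOutputFormatClass(NullOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }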

On Tue, Mar 15, 2011 at 7:39 PM, Venkatesh vramanatha...@aol.com wrote:



  Hi
 When I upgraded to 0.90.1, mapreduce fails with exception..
 system/job_201103151601_0121/libjars/hbase-0.90.1.jar does not exist.

 I have the jar file in classpath (hadoop-env.sh)

 any ideas?
 thanks





Re: habse schema design and retrieving values through REST interface

2011-03-16 Thread sreejith P. K.
Hi Andrew,
I am new to hbase. Can you elaborate on that, and can you help me with
the schema design?


http://stackoverflow.com/questions/5325616/hbase-schema-design#

I have a scenario as follows, I need to maintain a huge table for a 'web
crawler' project in HBASE. Basically it contains thousands of keywords and
for each keyword i need to maintain a list of urls (it again will count in
thousands). Corresponding to each url, I need to store a number, which will
in turn resemble the priority value the keyword holds. Let me explain you a
bit, Suppose i have a keyword 'united states', i need to store about ten
thousand urls corresponding to that keyword. Each keyword will be holding a
priority value which is an integer. Again i have thousands of keywords like
that. The rare thing about this is i need to do the project in PHP.

I have configured a hadoop-hbase cluster consists of three machines. My plan
was to design the schema by taking the keyword as 'row key'. The urls I will
keep as column family. The schema looked fine at first. I have done a lot of
research on how to retrieve the url list if i know the keyword. Anyway, i
managed a way out by preg-matching the xml data output using the url
http://localhost:8080/tablename/rowkey (REST interface i used). It also
works fine if the url list has a limited number of urls. When it comes in
thousands, it seems i cannot fetch the xml data itself! Now I am in a do or
die situation. Please correct me if my schema design needs any changes (I do
believe it should change!) and please help me up to retrieve the column
family values (urls) corresponding to each row-key in an efficient way.
Please guide me how i can do the same using PHP-REST interface.

If I am wrong with the schema, please help me set up a new one. From the
table I should be able to list all URLs corresponding to any keyword
given (ordered by descending priority value). I may need to limit the
results (like giving a condition on priority - 'where priority > 30').
Thanks in advance

Sreejith PK
Nesote Technologies (P) Ltd
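
As a rough illustration of the batched REST scanner mentioned earlier in this
thread (ScannerModel#setBatch), here is a hedged Java sketch of the HTTP calls;
the same PUT/GET/DELETE sequence can be issued from PHP with cURL. The host,
port and table name follow the poster's example, but the exact XML and URL
layout should be double-checked against the REST (Stargate) documentation for
the HBase version in use.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RestScannerSketch {
        public static void main(String[] args) throws Exception {
            // 1. Create a scanner that returns at most 500 columns per batch
            //    (expects a 201 Created with the scanner URI in Location).
            URL create = new URL("http://localhost:8080/tablename/scanner");
            HttpURLConnection c = (HttpURLConnection) create.openConnection();
            c.setDoOutput(true);
            c.setRequestMethod("PUT");
            c.setRequestProperty("Content-Type", "text/xml");
            OutputStream out = c.getOutputStream();
            out.write("<Scanner batch=\"500\"/>".getBytes("UTF-8"));
            out.close();
            String scannerUrl = c.getHeaderField("Location");
            c.disconnect();

            // 2. Read batches until the server reports no more cells (HTTP 204).
            while (true) {
                HttpURLConnection get = (HttpURLConnection) new URL(scannerUrl).openConnection();
                get.setRequestProperty("Accept", "text/xml");
                if (get.getResponseCode() == 204) {
                    get.disconnect();
                    break;
                }
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(get.getInputStream(), "UTF-8"));
                for (String line; (line = in.readLine()) != null; ) {
                    System.out.println(line);  // CellSet XML, values base64-encoded
                }
                in.close();
                get.disconnect();
            }

            // 3. Release the scanner on the server side.
            HttpURLConnection del = (HttpURLConnection) new URL(scannerUrl).openConnection();
            del.setRequestMethod("DELETE");
            del.getResponseCode();
            del.disconnect();
        }
    }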







On Thu, Mar 17, 2011 at 5:14 AM, Stack st...@duboce.net wrote:

 Thank you Andrew.
 St.Ack

 On Wed, Mar 16, 2011 at 3:12 PM, Andrew Purtell apurt...@apache.org
 wrote:
   This facility is not exposed in the REST API at the moment
  (not that I know of -- please someone correct me if I'm
  wrong).
 
  Wrong. :-)
 
  See ScannerModel in the rest package:
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/rest/model/ScannerModel.html
 
  ScannerModel#setBatch
 
- Andy
 
 
 
  --- On Wed, 3/16/11, Stack st...@duboce.net wrote:
 
  From: Stack st...@duboce.net
  Subject: Re: habse schema design and retrieving values through REST
 interface
  To: user@hbase.apache.org
  Date: Wednesday, March 16, 2011, 10:47 AM
  You can limit the return when
  scanning from the java api; see
 
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
   This facility is not exposed in the REST API at the moment
  (not that
  I know of -- please someone correct me if I'm
  wrong).   So, yes, wide
  rows, if thousands of elements of some size, since they
  need to be
  composed all in RAM, could bring on an OOME if the composed
  size exceeds the available heap.
 
  St.Ack
 
 
  On Wed, Mar 16, 2011 at 2:41 AM, sreejith P. K. sreejit...@nesote.com
  wrote:
   With this schema, if i can limit the column family
  over a particular range,
   I can manage everything else. (like Select first n
  columns of a column
   family)
  
   Sreejith
  
  
   On Wed, Mar 16, 2011 at 12:33 PM, sreejith P. K.
  sreejit...@nesote.comwrote:
  
   @ Jean-Daniel,
  
   As i told, each row key contains thousands of
  column family values (may be
   i am wrong with the schema design). I started REST
  and tried to cURL
   http:/localhost/tablename/rowname. It seems it
  will work only with limited
   amount of data (may be i can limit the cURL
  output), and how i can limit the
   column values for a particular row?
   Suppose i have two thousand urls under a keyword
  and i need to fetch the
   urls and should limit the result to five hundred.
  How it is possible??
  
   @ tsuna,
  
It seems http://www.elasticsearch.org/ using
  CouchDB right?
  
  
   On Tue, Mar 15, 2011 at 11:32 PM, Jean-Daniel
  Cryans jdcry...@apache.orgwrote:
  
   Can you tell why it's not able to get the
  bigger rows? Why would you
   try another schema if you don't even know
  what's going on right now?
   If you have the same issue with the new
  schema, you're back to square
   one right?
  
   Looking at the logs should give you some
  hints.
  
   J-D
  
   On Tue, Mar 15, 2011 at 10:19 AM, sreejith P.
  K. sreejit...@nesote.com
   wrote:
Hello experts,
   
I have a scenario as follows,
I need to maintain a huge table for a
  'web crawler' project in HBASE.
Basically it contains thousands of
  keywords and for each keyword i need
   to
maintain a list of urls (it again will
  count in