Re: ERROR master.HMaster: Can not start master

2008-09-24 Thread Samuel Guo
Have you checked whether your port 6 is already in use? On Wed, Sep 24, 2008 at 11:00 AM, Conor Harty <[EMAIL PROTECTED]> wrote: > The problem we have is, if we stop HBase and Hadoop and then start Hadoop > again and try to start HBase, the HBase master doesn't start up. Here is the > error we get: > > > ERROR master.H

Scalability of HBase

2008-09-24 Thread Alex Newman
Where are the scalability limitations with HBase? The number of tablets? The size of the columns? I am thinking about 100k rows and 24 columns, but with on the order of 300M entries per column, each something like (timestamp, <32 byte string>). Would something like this scale?

Hbase schema for many-to-many association

2008-09-24 Thread Michael Dagaev
Hi All, How would you design an HBase table for a many-to-many association between two entities, for example Student and Course? I would define two tables: Student: student id, student data (name, address, ...), courses (use course ids as column qualifiers here); Course: course id

Re: Scalability of HBase

2008-09-24 Thread Jean-Daniel Cryans
Alex, The number of rows and families seems OK, though that number of families may eat up a lot of memory when inserting the data, because each family in each region on each region server will have a 64MB memcache and the store's index in memory. The main problem will be the 300M entries / family. Ins

RE: Hbase schema for many-to-many association

2008-09-24 Thread Jonathan Gray
Michael, Your design does make sense. As you said, you'd probably have two column families in each of the Student and Course tables: one for the data, another with a column per student or course. For example, a student row might look like: Student: id/row/key = 1001, data:name = Student Name
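
For anyone sketching this in code, here is a minimal, hypothetical example of enrolling one student in one course. It is written against the modern HBase Java client (Put/Table), which postdates the 0.18-era API used in this thread, and the table, family, and qualifier names simply follow the layout described above.

  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class EnrollExample {
    public static void main(String[] args) throws Exception {
      try (Connection conn = ConnectionFactory.createConnection();
           Table students = conn.getTable(TableName.valueOf("Student"));
           Table courses = conn.getTable(TableName.valueOf("Course"))) {
        // Student side: static data plus one column per enrolled course.
        Put s = new Put(Bytes.toBytes("1001"));
        s.addColumn(Bytes.toBytes("data"), Bytes.toBytes("name"), Bytes.toBytes("Student Name"));
        s.addColumn(Bytes.toBytes("courses"), Bytes.toBytes("CS101"), Bytes.toBytes("enrolled"));
        students.put(s);
        // Course side: the mirror image, one column per enrolled student.
        Put c = new Put(Bytes.toBytes("CS101"));
        c.addColumn(Bytes.toBytes("students"), Bytes.toBytes("1001"), Bytes.toBytes("enrolled"));
        courses.put(c);
      }
    }
  }

The price of the mirrored design is that every enrollment is two writes, but either direction of the association can then be read from a single row.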

Re: Hbase schema for many-to-many association

2008-09-24 Thread Fuad Efendi
I agree with your design... Without traditional RDBMS indexes from the row-oriented world (which are indeed column-oriented structures, very similar to Hadoop), we should have tables like: STUDENT_COURSE: student, course; COURSE_STUDENT: course, student. I feel we can think about Hadoop tables as o

Re: Hbase schema for many-to-many association

2008-09-24 Thread Fuad Efendi
Sorry for the typo: HBase (not Hadoop). Quoting Fuad Efendi <[EMAIL PROTECTED]>: I agree with your design... Without traditional RDBMS indexes from the row-oriented world (which are indeed column-oriented structures very similar to Hadoop) we should have tables like: STUDENT_COURSE: student, course CO

Re: Hbase schema for many-to-many association

2008-09-24 Thread Michael Dagaev
Fuad, Thank you for the answer. I think I understand the pattern. Best Regards, Michael On Wed, Sep 24, 2008 at 5:27 PM, Fuad Efendi <[EMAIL PROTECTED]> wrote: > I agree with your design... > > Without traditional RDBMS indexes from the row-oriented world (which are indeed > column-oriented struc

RE: Scalability of HBase

2008-09-24 Thread Jonathan Gray
Alex, So each row = 24 column families(?) * 300,000,000 entries/family * ~40 bytes/entry = about 270 GB/row? And that * 100,000 rows = about 27 petabytes of data? Is my math right here? :) With a big enough cluster, you might be able to get that amount of data into Hadoop. I'm not sure anyone h
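
A quick sanity check of that arithmetic (a throwaway sketch using the ~40 bytes/entry assumption from the thread):

  public class BackOfEnvelope {
    public static void main(String[] args) {
      long families = 24;
      long entriesPerFamily = 300_000_000L;
      long bytesPerEntry = 40;      // rough size of (timestamp, <32 byte string>)
      long rows = 100_000;
      long bytesPerRow = families * entriesPerFamily * bytesPerEntry;  // 288,000,000,000
      double totalBytes = (double) bytesPerRow * rows;
      System.out.printf("per row: ~%.0f GiB%n", bytesPerRow / Math.pow(2, 30));  // ~268 GiB
      System.out.printf("total:   ~%.1f PiB%n", totalBytes / Math.pow(2, 50));   // ~25.6 PiB
    }
  }

So the figures above are in the right ballpark: a bit under 270 GiB per row and somewhere in the mid-to-high twenties of petabytes overall.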

Re: Hbase schema for many-to-many association

2008-09-24 Thread Michael Dagaev
Jonathan, Thank you for the answer. I am just a little bit concerned about the performance implications of calling HBase twice if I want to retrieve data about all classes for a student. Best Regards, Michael On Wed, Sep 24, 2008 at 5:20 PM, Jonathan Gray <[EMAIL PROTECTED]> wrote: > Michael, >
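
To make the read path concrete, here is a hypothetical sketch (again using the modern Java client rather than the 0.18-era API) of the first of those two calls: all course ids for a student come back from a single Get on the Student row, and the second trip to the Course table is only needed when the full course details are wanted.

  import java.util.Map;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ListCoursesExample {
    public static void main(String[] args) throws Exception {
      try (Connection conn = ConnectionFactory.createConnection();
           Table students = conn.getTable(TableName.valueOf("Student"))) {
        // One round trip: the course ids are the column qualifiers of the "courses" family.
        Get get = new Get(Bytes.toBytes("1001"));
        get.addFamily(Bytes.toBytes("courses"));
        Result result = students.get(get);
        for (Map.Entry<byte[], byte[]> e
             : result.getFamilyMap(Bytes.toBytes("courses")).entrySet()) {
          System.out.println("course id: " + Bytes.toString(e.getKey()));
        }
      }
    }
  }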

Identical rows with different timestamps

2008-09-24 Thread ROL
Hi list! Can I do this, for example?
table1:
ROW     COLUMN    TIMESTAMP   VALUE
'row'   'url:1'   t1          'val1'
'row'   'url:1'   t2          'val1'
'row'   'url:1'   t3          'val1'

RE: Identical rows with different timestamps

2008-09-24 Thread Jim Kellerman
yes --- Jim Kellerman, Powerset (Live Search, Microsoft Corporation) > -Original Message- > From: ROL [mailto:[EMAIL PROTECTED] > Sent: Wednesday, September 24, 2008 2:51 AM > To: hbase-user@hadoop.apache.org > Subject: Identical rows with different timestamps > > Hi list! > > Can i do t
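
In code, the question above looks roughly like the sketch below (my assumptions: the modern Java client rather than the 0.18-era API, and a 'url' column family configured to keep at least three versions):

  import org.apache.hadoop.hbase.Cell;
  import org.apache.hadoop.hbase.CellUtil;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class VersionsExample {
    public static void main(String[] args) throws Exception {
      try (Connection conn = ConnectionFactory.createConnection();
           Table table1 = conn.getTable(TableName.valueOf("table1"))) {
        byte[] row = Bytes.toBytes("row");
        byte[] fam = Bytes.toBytes("url");
        byte[] qual = Bytes.toBytes("1");
        // Write the same value into the same cell at three explicit timestamps.
        for (long ts : new long[] {1L, 2L, 3L}) {
          Put p = new Put(row);
          p.addColumn(fam, qual, ts, Bytes.toBytes("val1"));
          table1.put(p);
        }
        // Read back every stored version of that one cell.
        Get g = new Get(row);
        g.readVersions(3);   // setMaxVersions(3) on older clients
        Result r = table1.get(g);
        for (Cell cell : r.getColumnCells(fam, qual)) {
          System.out.println(cell.getTimestamp() + " -> "
              + Bytes.toString(CellUtil.cloneValue(cell)));
        }
      }
    }
  }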

Re: Scalability of HBase

2008-09-24 Thread Alex Newman
Due to business concerns I need to be abstract about my application for now. I assume maximizing the number of rows is ideal, simply because that's how data is distributed. I was hoping, though, that when a row got large enough it would be distributed onto multiple machines. Is that not the case? >A

RE: Scalability of HBase

2008-09-24 Thread Jim Kellerman
Alex, Data is distributed at the region level, which is a row range. A region contains at least one row and all the column values for that row. The default region size is 256MB (it is configurable, but few have had a single row larger than about 1GB). Since a region contains at least one entire

RE: Scalability of HBase

2008-09-24 Thread Alex Newman
Jonathan, That actually answers my questions. To answer your 27 petabyte question, I would have to say maybe... Cheers, > Alex, > So each row = 24 column-families(?) * 300,000,000 entries/family * ~40 > bytes/entry = about 270GB/row ? > And that * 100,000 rows = about 27 petabytes of data? > I

Re: Identical rows with different timestamps

2008-09-24 Thread stack
Be careful though; if you insert in a non-chronological order and you're doing deletes, you could get unexpected results: See https://issues.apache.org/jira/browse/HBASE-29 if you want to learn more. St.Ack Jim Kellerman wrote: yes --- Jim Kellerman, Powerset (Live Search, Microsoft Corporat

tracking progress for TableMap

2008-09-24 Thread Allen Day
How can I get an indicator of what % of rows a TableMap has processed so far? I want to be able to monitor job progress through e.g. the Hadoop web monitoring system. I've tried using reporter.setStatus() and reporter.progress(), but they don't seem to be what I need. Same question for TableReduce.
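
One interim workaround is to publish your own counter and status string from inside the map. The sketch below is hypothetical and uses a plain old-mapred Mapper rather than the exact TableMap signature of your HBase version, but the Reporter calls should be the same ones available inside a TableMap's map().

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class ProgressReportingMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private long rowsSeen = 0;

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      rowsSeen++;
      // A named counter and a status string both show up in the Hadoop web UI,
      // even when the framework cannot compute a true percentage for the input.
      reporter.incrCounter("hbase", "rows-processed", 1);
      if (rowsSeen % 10000 == 0) {
        reporter.setStatus("processed " + rowsSeen + " rows");
      }
      output.collect(value, value);
    }
  }

It reports an absolute row count rather than a percentage; see the reply below for the open issue on true percentage-complete support.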

"Powered by" page is empty, we need your help!

2008-09-24 Thread Jean-Daniel Cryans
Hi community, The "powered by" page in the wiki is currently empty. This gives the wrong impression that nobody uses HBase, which I'm sure is not the case. We would all be very grateful to have the following info (according to how much you are legal

Re: tracking progress for TableMap

2008-09-24 Thread stack
Yeah. It's an old issue. See HBASE-635 and in particular Billy Pearson's suggestion for how we might implement percentage complete. St.Ack Allen Day wrote: How can I get an indicator of what % of rows a TableMap has processed so far? I want to be able to monitor job progress through e.g. the

Question about recommended heap sizes

2008-09-24 Thread Daniel Ploeg
Hi all, I was running a test on our local HBase cluster (1 master node, 4 region servers) and I ran into some OutOfMemory exceptions. Basically, one of the region servers went down first, then the master node followed (ouch!) as I was inserting the data for the test. I was still using the default

Re: Question about recommended heap sizes

2008-09-24 Thread stack
Daniel Ploeg wrote: Hi all, I was running a test on our local hbase cluster (1 master node, 4 region servers) and I ran into some OutOfMemory exceptions. Basically, one of the region servers went down first, then the master node followed (ouch!) as I was inserting the data for the test. What

RE: Question about recommended heap sizes

2008-09-24 Thread Jonathan Gray
Daniel, I have seen similar issues during large scale imports. For now, we have gotten around the issue by increasing the regionserver heap size to 2GB. My slave machines also have 4GB of memory. How many total regions did you have when you received the OOME? Jonathan Gray -Original Mess
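
For reference, a heap-size change like that is normally made in conf/hbase-env.sh on each region server. A sketch only; double-check the variable and units against your HBase version (the value has historically been in megabytes):

  # conf/hbase-env.sh -- raise the maximum JVM heap for the HBase daemons
  export HBASE_HEAPSIZE=2000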

Re: Question about recommended heap sizes

2008-09-24 Thread Daniel Ploeg
Hi, Thanks for your quick responses! I'm using HBase 0.18.0. I restarted the HBase cluster and it's telling me on the master's web page that I have a total of 39 regions. I was using 100 threads to push data into HBase, so I might try reducing that to, say, 20 on the next run. I'll also try wit

RE: Question about recommended heap sizes

2008-09-24 Thread Jonathan Gray
One thing to be aware of... Currently the HBase client serializes RPC calls for a process, so you are not getting true insert parallelism if all inserts are coming from a single Java process, despite the threading. Since you are also experiencing this, there must be something going on here. In 0.

Re: Question about recommended heap sizes

2008-09-24 Thread Daniel Ploeg
The changes to the heap size and thread count have so far been successful. I managed to get 10K rows into HBase (loading the 100K now, it may take a little longer). The query results for the 10K rows came back at an average of 360 ms. However, at the start of the run (probably about the first quar