Adarsh,

HBase doesn't have the concept of a globally unique auto-incrementing "ID" 
column; that would require that all PUTs to any region of a table first go 
through some central ID authority to get a unique ID, and that sort of goes 
against the general HBase approach (in which operations on regions are totally 
independent of each other, for unbounded horizontal scalability).

That said, there are a few ways to achieve what it seems like you want:
 - You could create a natural compound row key composed of (for example) a hash 
of the URL plus a timestamp. That way, you would be guaranteed that two crawls 
of the same URL appear as different rows (assuming they can't happen at the 
same millisecond).
 - You could alternately use a UUID of some sort as the row key, but the 
advantage of using URL_hash + timestamp is that you can find all the rows for a 
particular URL just by knowing the URL; you don't need any external index.
 - You could also "roll your own" global ID creation counter in HBase using a 
table with a counter in it, and use the atomic increment function to get unique 
values. That would still serialize ID allocation for every PUT through that one 
counter, but it would be done in your code (not automatically in HBase).
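
To make the first option concrete, here is a minimal sketch of building a URL_hash + timestamp row key. It is plain Python with no HBase client involved; the function names, the choice of a truncated SHA-256 digest, and the 8-byte big-endian timestamp are all assumptions for illustration, not anything HBase prescribes.

```python
import hashlib
import struct

HASH_LEN = 16  # assumed fixed prefix width, so every key has the same layout


def url_prefix(url: str) -> bytes:
    """Fixed-width hash of the URL (first 16 bytes of SHA-256, an arbitrary choice)."""
    return hashlib.sha256(url.encode("utf-8")).digest()[:HASH_LEN]


def row_key(url: str, ts_millis: int) -> bytes:
    """Compound key: URL hash followed by a big-endian 8-byte timestamp.

    Big-endian packing makes byte-wise ordering match numeric ordering,
    so crawls of one URL sort by time.
    """
    return url_prefix(url) + struct.pack(">Q", ts_millis)


def split_row_key(key: bytes):
    """Recover (url_hash, ts_millis) from a compound key."""
    return key[:HASH_LEN], struct.unpack(">Q", key[HASH_LEN:])[0]
```

Because every crawl of a URL shares the same 16-byte prefix, all rows for that URL are adjacent in sort order, and a scan restricted to that prefix would return them oldest first.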

Remember that HBase doesn't have any secondary indexes, like the 3 you've 
defined in the table below. If you want to be able to access the data in HBase 
by these fields, you must either make the field the leading part of the row 
key (rows are stored sorted by row key), or else manually write the 
information, denormalized, into "index-like" tables that you maintain yourself 
(noting that there's no transactional protection on this operation like in a 
relational database, so you must account for more failure scenarios). These 
are reminders that unless your data size is so massive that a relational 
database simply can't accommodate it, you're likely giving yourself more 
problems by using HBase rather than an RDBMS.
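
As a rough illustration of the "index-like" table idea, the sketch below uses plain Python dicts as stand-ins for a data table and a heading index table. Nothing here is an HBase API; the function names and the zero-byte separator are assumptions made only to show the shape of the keys and the two separate writes.

```python
def index_key(field_value: str, primary_key: bytes) -> bytes:
    """Index row key: indexed value, a zero-byte separator, then the data row key."""
    return field_value.encode("utf-8") + b"\x00" + primary_key


def put_with_index(data_table: dict, heading_index: dict,
                   primary_key: bytes, row: dict) -> None:
    """Write the data row, then the index row.

    These are two independent writes; a failure between them leaves the
    index and the data out of step.
    """
    data_table[primary_key] = row
    heading_index[index_key(row["heading"], primary_key)] = b""


def lookup_by_heading(data_table: dict, heading_index: dict, heading: str) -> list:
    """Walk index keys with the heading prefix and fetch the matching data rows."""
    prefix = heading.encode("utf-8") + b"\x00"
    rows = []
    for key in sorted(heading_index):
        if key.startswith(prefix):
            primary_key = key[len(prefix):]
            # Skip index entries whose data row is missing (a dangling entry).
            if primary_key in data_table:
                rows.append(data_table[primary_key])
    return rows
```

The lookup skips index entries that point at no data row, which is one of the extra failure scenarios your own code would have to handle.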

Also: you might look at the O'Reilly book, "HBase: The Definitive Guide" by the 
esteemed Mr. Lars George; in it, he uses a running example of a URL shortener 
application that might give you some ideas about your use case.

Ian

On Feb 21, 2012, at 11:33 PM, Adarsh Sharma wrote:

Dear all,

After some R&D on schema design in HBase, I am confused about how to design
the HBase schema corresponding to a table in MySQL.

CREATE TABLE `page_content` (
 `crawled_page_id` bigint(20) NOT NULL DEFAULT '0' COMMENT 'unique value for each url',
 `link_level` tinyint(4) DEFAULT NULL,
 `isprocessable` tinyint(2) NOT NULL DEFAULT '1',
 `isvalid` tinyint(4) NOT NULL DEFAULT '1',
 `isanalyzed` tinyint(4) NOT NULL DEFAULT '0',
 `islocked` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'set 1 when the records are in analyzing phase',
 `content_language` varchar(10) DEFAULT NULL,
 `url_id` bigint(20) NOT NULL,
 `publishing_date` varchar(40) DEFAULT NULL,
 `heading` varchar(150) DEFAULT NULL,
 `category` varchar(150) DEFAULT NULL,
 `crawled_page_url` varchar(500) NOT NULL,
 `keywords` varchar(500) DEFAULT NULL,
 `dt_stamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
 `content` mediumtext,
 PRIMARY KEY (`crawled_page_id`),
 KEY `idx_url` (`crawled_page_url`),
 KEY `idx_head` (`heading`),
 KEY `idx_dtstamp` (`dt_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

In all the examples, I find that the reversed URL is the row key in HBase, but
in MySQL I create an auto-increment column that uniquely locates a document.
Can anyone suggest what the corresponding table in HBase would be?

Thanks
