thank you. it's great.
On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <[email protected]> wrote:
> Hi LiLi,
> Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
> skin on top of HBase. You can model your schema and issue your queries just
> like you would with MySQL. Something like this:
>
> // Create table that optimizes for your most common query
> // (i.e. the PRIMARY KEY constraint should be ordered as you'd want your
> // rows ordered)
> CREATE TABLE url_db (
>     status TINYINT,
>     priority INTEGER NOT NULL,
>     added_time DATE,
>     url VARCHAR NOT NULL
>     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>
> int lastStatus = 0;
> int lastPriority = 0;
> Date lastAddedTime = new Date(0);
> String lastUrl = "";
>
> while (true) {
>     // Use row value constructor to page through results in batches of 1000
>     String query = "
>         SELECT * FROM url_db
>         WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?)
>         ORDER BY status, priority, added_time, url
>         LIMIT 1000"
>     PreparedStatement stmt = connection.prepareStatement(query);
>
>     // Bind parameters
>     stmt.setInt(1, lastStatus);
>     stmt.setInt(2, lastPriority);
>     stmt.setDate(3, lastAddedTime);
>     stmt.setString(4, lastUrl);
>     ResultSet resultSet = stmt.executeQuery();
>
>     while (resultSet.next()) {
>         // Remember last row processed so that you can start after that for
>         // next batch
>         lastStatus = resultSet.getInt(1);
>         lastPriority = resultSet.getInt(2);
>         lastAddedTime = resultSet.getDate(3);
>         lastUrl = resultSet.getString(4);
>
>         doSomethingWithUrls();
>
>         UPSERT INTO url_db(status, priority, added_time, url)
>         VALUES (1, ?, CURRENT_DATE(), ?);
>     }
> }
>
> If you need to efficiently query on url, add a secondary index like this:
>
> CREATE INDEX url_index ON url_db (url);
>
> Please let me know if you have questions.
>
> Thanks,
> James
>
> On Thu, Jan 2, 2014 at 10:22 PM, Li Li <[email protected]> wrote:
>
>> thank you. But I can't use nutch. could you tell me how hbase is used
>> in nutch? or is hbase only used to store webpages?
>>
>> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
>> <[email protected]> wrote:
>> > Hi,
>> >
>> > Have a look at http://nutch.apache.org . Version 2.x uses HBase under
>> > the hood.
>> >
>> > Otis
>> > --
>> > Performance Monitoring * Log Analytics * Search Analytics
>> > Solr & Elasticsearch Support * http://sematext.com/
>> >
>> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <[email protected]> wrote:
>> >
>> >> hi all,
>> >> I want to use hbase to store all urls (crawled or not crawled).
>> >> And each url will have a column named priority which represents the
>> >> priority of the url. I want to get the top N urls ordered by priority (if
>> >> priority is the same then the url whose timestamp is earlier is preferred).
>> >> in using something like mysql, my client application may look like:
>> >> while true:
>> >>     select url from url_db order by priority,addedTime limit 1000
>> >>         where status='not_crawled';
>> >>     do something with these urls;
>> >>     extract more urls and insert them into url_db;
>> >> How should I design the hbase schema for this application? Is hbase
>> >> suitable for me?
>> >> I found in this article
>> >> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/ ,
>> >> they use redis to store urls. I think hbase originated from
>> >> bigtable and google uses bigtable to store webpages, so for a huge number
>> >> of urls, I prefer a distributed system like hbase.
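
For reference, the paging loop in James's message mixes SQL and Java, so here is a minimal sketch of the same idea as compilable Java. It assumes a Phoenix JDBC `Connection` is opened elsewhere and that the `url_db` table from the message exists. The class name `UrlPager`, its method names, and the choice to stop when a batch comes back empty are hypothetical and only for illustration; they are not from the original reply.

```java
import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper class; not part of Phoenix or of the original message.
class UrlPager {
    static final int BATCH_SIZE = 1000;

    // Row value constructor: resume strictly after the last primary key seen.
    static final String PAGE_QUERY =
        "SELECT status, priority, added_time, url FROM url_db "
        + "WHERE status = 0 AND (status, priority, added_time, url) > (?, ?, ?, ?) "
        + "ORDER BY status, priority, added_time, url "
        + "LIMIT " + BATCH_SIZE;

    // Same UPSERT as in the message: writes a row with status 1 for the url.
    static final String MARK_CRAWLED =
        "UPSERT INTO url_db(status, priority, added_time, url) "
        + "VALUES (1, ?, CURRENT_DATE(), ?)";

    // Last primary-key values processed; the next batch starts after them.
    int lastStatus = 0;
    int lastPriority = 0;
    Date lastAddedTime = new Date(0);
    String lastUrl = "";

    /** Fetches and processes one batch; returns the number of rows handled. */
    int processNextBatch(Connection connection) throws SQLException {
        int rows = 0;
        try (PreparedStatement page = connection.prepareStatement(PAGE_QUERY);
             PreparedStatement mark = connection.prepareStatement(MARK_CRAWLED)) {
            page.setInt(1, lastStatus);
            page.setInt(2, lastPriority);
            page.setDate(3, lastAddedTime);
            page.setString(4, lastUrl);
            try (ResultSet rs = page.executeQuery()) {
                while (rs.next()) {
                    // Remember the last row so the next batch starts after it.
                    lastStatus = rs.getInt(1);
                    lastPriority = rs.getInt(2);
                    lastAddedTime = rs.getDate(3);
                    lastUrl = rs.getString(4);

                    doSomethingWithUrl(lastUrl);

                    mark.setInt(1, lastPriority);
                    mark.setString(2, lastUrl);
                    mark.executeUpdate();
                    rows++;
                }
            }
        }
        // Only commit explicitly when the connection is not auto-committing.
        if (!connection.getAutoCommit()) {
            connection.commit();
        }
        return rows;
    }

    /** Placeholder for the crawl step (hypothetical). */
    void doSomethingWithUrl(String url) {
        // fetch the page, extract links, and so on
    }

    /** Pages through all status=0 rows until a batch comes back empty. */
    void run(Connection connection) throws SQLException {
        while (processNextBatch(connection) > 0) {
            // keep paging
        }
    }
}
```

One thing to keep in mind with this schema: `status` is part of the primary key, so the UPSERT adds a new row with status 1 instead of changing the existing status 0 row. The paging cursor moves past the old row either way, but removing it would probably need a separate DELETE.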
