Re: use hbase as distributed crawl's scheduler

Li Li Thu, 02 Jan 2014 23:38:26 -0800

Thank you. It's great.


On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <[email protected]> wrote:
> Hi LiLi,
> Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
> skin on top of HBase. You can model your schema and issue your queries just
> like you would with MySQL. Something like this:
>
> // Create a table that optimizes for your most common query
> // (i.e. the PRIMARY KEY constraint should be ordered as you'd want your
> // rows ordered)
> CREATE TABLE url_db (
>     status TINYINT,
>     priority INTEGER NOT NULL,
>     added_time DATE,
>     url VARCHAR NOT NULL
>     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>
> int lastStatus = 0;
> int lastPriority = 0;
> Date lastAddedTime = new Date(0);
> String lastUrl = "";
>
> while (true) {
>     // Use row value constructor to page through results in batches of 1000
>     String query =
>         "SELECT * FROM url_db " +
>         "WHERE status = 0 AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
>         "ORDER BY status, priority, added_time, url " +
>         "LIMIT 1000";
>     PreparedStatement stmt = connection.prepareStatement(query);
>
>     // Bind parameters
>     stmt.setInt(1, lastStatus);
>     stmt.setInt(2, lastPriority);
>     stmt.setDate(3, lastAddedTime);
>     stmt.setString(4, lastUrl);
>     ResultSet resultSet = stmt.executeQuery();
>
>     while (resultSet.next()) {
>         // Remember the last row processed so that the next batch can
>         // start after it
>         lastStatus = resultSet.getInt(1);
>         lastPriority = resultSet.getInt(2);
>         lastAddedTime = resultSet.getDate(3);
>         lastUrl = resultSet.getString(4);
>
>         doSomethingWithUrls();
>
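>         // Record the URL as crawled (status = 1)
>         PreparedStatement upsert = connection.prepareStatement(
>             "UPSERT INTO url_db(status, priority, added_time, url) " +
>             "VALUES (1, ?, CURRENT_DATE(), ?)");
>         upsert.setInt(1, lastPriority);
>         upsert.setString(2, lastUrl);
>         upsert.executeUpdate();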
>
>     }
> }
>
> If you need to efficiently query on url, add a secondary index like this:
>
> CREATE INDEX url_index ON url_db (url);
>
> Please let me know if you have questions.
>
> Thanks,
> James
>
>
>
>
> On Thu, Jan 2, 2014 at 10:22 PM, Li Li <[email protected]> wrote:
>
>> Thank you, but I can't use Nutch. Could you tell me how HBase is used
>> in Nutch? Or is HBase only used to store web pages?
>>
>> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
>> <[email protected]> wrote:
>> > Hi,
>> >
>> > Have a look at http://nutch.apache.org . Version 2.x uses HBase under
>> > the hood.
>> >
>> > Otis
>> > --
>> > Performance Monitoring * Log Analytics * Search Analytics
>> > Solr & Elasticsearch Support * http://sematext.com/
>> >
>> >
>> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <[email protected]> wrote:
>> >
>> >> hi all,
>> >>      I want to use HBase to store all URLs (crawled or not crawled).
>> >> Each URL will have a column named priority, which represents the
>> >> priority of the URL. I want to get the top N URLs ordered by priority (if
>> >> the priority is the same, the URL with the earlier timestamp is preferred).
>> >>      Using something like MySQL, my client application might look like:
>> >>      while true:
>> >>          select url from url_db where status='not_crawled'
>> >>              order by priority, addedTime limit 1000;
>> >>          do something with these urls;
>> >>          extract more urls and insert them into url_db;
>> >>      How should I design the HBase schema for this application? Is HBase
>> >> suitable for me?
>> >>      I found that in this article,
>> >> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/ ,
>> >> they use Redis to store URLs. I think HBase originated from
>> >> Bigtable, and Google uses Bigtable to store web pages, so for a huge number
>> >> of URLs I prefer a distributed system like HBase.
>> >>
>>
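
A note on why the column order in the PRIMARY KEY constraint above matters:
HBase keeps rows sorted by the raw bytes of the row key, compared
lexicographically as unsigned bytes, so a composite key only gives the
desired scan order if its fields are laid out in that order. The following
is a minimal sketch of that idea, not Phoenix's implementation. The names
RowKeySketch, rowKey, and compare are hypothetical, and the byte layout
(sign bit flipped, big-endian, URL last) is an assumed encoding chosen for
illustration; it is not necessarily what Phoenix writes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RowKeySketch {
    // Fixed-width prefix: 1 byte status + 4 bytes priority + 8 bytes timestamp.
    static final int PREFIX_LEN = 1 + 4 + 8;

    // Encode (status, priority, addedTimeMillis, url) so that unsigned
    // byte-wise comparison matches field-by-field ordering.
    // Flipping the sign bit makes signed numbers sort correctly as
    // unsigned big-endian bytes. (Assumed layout, for illustration only.)
    public static byte[] rowKey(byte status, int priority, long addedTimeMillis, String url) {
        byte[] urlBytes = url.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(PREFIX_LEN + urlBytes.length); // big-endian by default
        buf.put((byte) (status ^ 0x80));
        buf.putInt(priority ^ Integer.MIN_VALUE);
        buf.putLong(addedTimeMillis ^ Long.MIN_VALUE);
        buf.put(urlBytes);
        return buf.array();
    }

    // Unsigned lexicographic comparison, the ordering HBase applies to row keys.
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Recover the URL, which occupies everything after the fixed-width prefix.
    public static String urlOf(byte[] key) {
        return new String(key, PREFIX_LEN, key.length - PREFIX_LEN, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<byte[]> keys = new ArrayList<>();
        keys.add(rowKey((byte) 0, 5, 2000L, "http://example.com/b"));
        keys.add(rowKey((byte) 1, 1, 1000L, "http://example.com/c"));
        keys.add(rowKey((byte) 0, 5, 1000L, "http://example.com/a"));
        keys.add(rowKey((byte) 0, -3, 3000L, "http://example.com/d"));

        // Sorting by raw bytes yields (status, priority, added_time, url) order.
        keys.sort(RowKeySketch::compare);
        for (byte[] key : keys) {
            System.out.println(urlOf(key));
        }
    }
}
```

With keys laid out this way, all rows with status 0 are contiguous and
already ordered by priority and then by time added, so "the next 1000
uncrawled URLs" becomes a short forward scan from the last key seen rather
than a sort over the whole table.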
