Re: use hbase as distributed crawl's scheduler

James Taylor Sat, 04 Jan 2014 00:14:09 -0800

Please take a look at our Apache incubator proposal, as I think that may
answer your questions: https://wiki.apache.org/incubator/PhoenixProposal



On Fri, Jan 3, 2014 at 11:47 PM, Li Li <[email protected]> wrote:

> so what's the relationship of Phoenix and HBase? something like hadoop and
> hive?
>
>
> On Sat, Jan 4, 2014 at 3:43 PM, James Taylor <[email protected]>
> wrote:
> > Hi LiLi,
> > Phoenix isn't an experimental project. We're on our 2.2 release, and many
> > companies (including the company for which I'm employed, Salesforce.com)
> > use it in production today.
> > Thanks,
> > James
> >
> >
> > On Fri, Jan 3, 2014 at 11:39 PM, Li Li <[email protected]> wrote:
> >
> >> hi James,
> >>     phoenix seems great but it's now only an experimental project. I
> >> want to use only hbase. could you tell me the difference of Phoenix
> >> and hbase? If I use hbase only, how should I design the schema and
> >> some extra things for my goal? thank you
> >>
> >> On Sat, Jan 4, 2014 at 3:41 AM, James Taylor <[email protected]>
> >> wrote:
> >> > On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika <[email protected]> wrote:
> >> >
> >> >> Couple of notes:
> >> >> 1. When updating the status you essentially add a new rowkey into HBase;
> >> >> I would give it up altogether. The essential requirement seems to
> >> >> point at retrieving a list of urls in a certain order.
> >> >>
> >> > Not sure on this, but seemed to me that setting the status field is
> >> > forcing the urls that have been processed to be at the end of the sort
> >> > order.
> >> >
> >> >> 2. Wouldn't salting ruin the sort order required? Priority, date added?
> >> >>
> >> > No, as Phoenix maintains returning rows in row key order even when they're
> >> > salted. We do parallel scans for each bucket and do a merge sort on the
> >> > client, so the cost is pretty low for this (we also provide a way of
> >> > turning this off if your use case doesn't need it).
> >> >
> >> > Two years, JM? Now you're really going to have to start using Phoenix :-)
> >> >
> >> >
> >> >> On Friday, January 3, 2014, James Taylor wrote:
> >> >>
> >> >> > Sure, no problem. One addition: depending on the cardinality of your
> >> >> > priority column, you may want to salt your table to prevent hotspotting,
> >> >> > since you'll have a monotonically increasing date in the key. To do that,
> >> >> > just add " SALT_BUCKETS=<n>" on to your CREATE TABLE statement, where <n>
> >> >> > is the number of machines in your cluster. You can read more about
> >> >> > salting here:
> >> >> > http://phoenix.incubator.apache.org/salted.html
> >> >> >
> >> >> >
> >> >> > On Thu, Jan 2, 2014 at 11:36 PM, Li Li <[email protected]> wrote:
> >> >> >
> >> >> > > thank you. it's great.
> >> >> > >
> >> >> > > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <[email protected]>
> >> >> > > wrote:
> >> >> > > > Hi LiLi,
> >> >> > > > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a
> >> >> > > > SQL skin on top of HBase. You can model your schema and issue your
> >> >> > > > queries just like you would with MySQL. Something like this:
> >> >> > > >
> >> >> > > > // Create table that optimizes for your most common query
> >> >> > > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
> >> >> > > > // your rows ordered)
> >> >> > > > CREATE TABLE url_db (
> >> >> > > >     status TINYINT,
> >> >> > > >     priority INTEGER NOT NULL,
> >> >> > > >     added_time DATE,
> >> >> > > >     url VARCHAR NOT NULL
> >> >> > > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
> >> >> > > >
> >> >> > > > int lastStatus = 0;
> >> >> > > > int lastPriority = 0;
> >> >> > > > Date lastAddedTime = new Date(0);
> >> >> > > > String lastUrl = "";
> >> >> > > >
> >> >> > > > while (true) {
> >> >> > > >     // Use row value constructor to page through results in batches
> >> >> > > >     // of 1000
> >> >> > > >     String query =
> >> >> > > >         "SELECT * FROM url_db " +
> >> >> > > >         "WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
> >> >> > > >         "ORDER BY status, priority, added_time, url " +
> >> >> > > >         "LIMIT 1000";
> >> >> > > >     PreparedStatement stmt = connection.prepareStatement(query);
> >> >> > > >
> >> >> > > >     // Bind parameters
> >> >> > > >     stmt.setInt(1, lastStatus);
> >> >> > > >     stmt.setInt(2, lastPriority);
> >> >> > > >     stmt.setDate(3, lastAddedTime);
> >> >> > > >     stmt.setString(4, lastUrl);
> >> >> > > >     ResultSet resultSet = stmt.executeQuery();
> >> >> > > >
> >> >> > > >     while (resultSet.next()) {
> >> >> > > >         // Remember last row processed so that you can start after
> >> >> > > >         // that for next batch
> >> >> > > >         lastStatus = resultSet.getInt(1);
> >> >> > > >         lastPriority = resultSet.getInt(2);
> >> >> > > >         lastAddedTime = resultSet.getDate(3);
> >> >> > > >         lastUrl = resultSet.getString(4);
> >> >> > > >
> >> >> > > >         doSomethingWithUrls();
> >> >> > > >
> >> >> > > >         UPSERT INTO url_db(status, priority, added_time, url)
> >> >> > > >         VALUES (1, ?, CURRENT_DATE(), ?);
> >> >> > > >
> >> >> > > >     }
> >> >> > > > }
> >> >> > > >
> >> >> > > > If you need to efficiently query on url, add a secondary index like
> >> >> > > > this:
> >> >> > > >
> >> >> > > > CREATE INDEX url_index ON url_db (url);
> >> >> > > >
> >> >> > > > Please let me know if you have questions.
> >> >> > > >
> >> >> > > > Thanks,
> >> >> > > > James
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > > On Thu, Jan 2, 2014 at 10:22 PM, Li Li <[email protected]> wrote:
> >> >> > > >
> >> >> > > >> thank you. But I can't use nutch. could you tell me how hbase is used
> >> >> > > >> in nutch? or hbase is only used to store webpage.
> >> >> > > >>
> >> >> > > >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
> >> >> > > >> <[email protected]> wrote:
> >> >> > > >> > Hi,
> >> >> > > >> >
> >> >> > > >> > Have a look at http://nutch.apache.org . Version 2.x uses HBase
> >> >> > > >> > under the hood.
> >> >> > > >> >
> >> >> > > >> > Otis
> >> >> > > >> > --
> >> >> > > >> > Performance Monitoring * Log Analytics * Search Analytics
> >> >> > > >> > Solr & Elasticsearch Support * http://sematext.com/
> >> >> > > >> >
> >> >> > > >> >
> >> >> > > >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <
> >> >>
> >>
>
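
The client-side merge mentioned in the thread for salted tables (parallel
scans for each bucket, then a merge sort on the client) can be illustrated
with a small k-way merge. This is only a sketch under stated assumptions, not
Phoenix's actual implementation: the class SaltedScanMerge, its mergeSorted
method, and the use of plain strings in place of row keys are hypothetical,
and each bucket's scan result is assumed to arrive already sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper (not Phoenix code): merges per-bucket scan results,
// each already sorted, into one globally sorted list.
final class SaltedScanMerge {

    // One heap entry: the current row of a bucket plus the rest of that bucket.
    private static final class Cursor implements Comparable<Cursor> {
        final String row;
        final Iterator<String> rest;

        Cursor(String row, Iterator<String> rest) {
            this.row = row;
            this.rest = rest;
        }

        @Override
        public int compareTo(Cursor other) {
            return row.compareTo(other.row);
        }
    }

    static List<String> mergeSorted(List<List<String>> buckets) {
        // Seed the heap with the first row of every non-empty bucket.
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> bucket : buckets) {
            Iterator<String> it = bucket.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it.next(), it));
            }
        }

        // Repeatedly emit the smallest current row, then advance its bucket.
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.row);
            if (smallest.rest.hasNext()) {
                heap.add(new Cursor(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Strings stand in for row keys with the salt byte already stripped.
        List<List<String>> buckets = Arrays.asList(
            Arrays.asList("a", "d"),
            Arrays.asList("b", "e"),
            Arrays.asList("c"));
        System.out.println(mergeSorted(buckets)); // prints [a, b, c, d, e]
    }
}
```

The heap holds at most one entry per bucket, so each emitted row costs a
comparison count logarithmic in the number of buckets, which is consistent
with the "pretty low" cost described in the thread.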
