Please take a look at our Apache incubator proposal, as I think that may answer your questions: https://wiki.apache.org/incubator/PhoenixProposal
On Fri, Jan 3, 2014 at 11:47 PM, Li Li <[email protected]> wrote:

> So what's the relationship between Phoenix and HBase? Something like Hadoop
> and Hive?
>
> On Sat, Jan 4, 2014 at 3:43 PM, James Taylor <[email protected]> wrote:
>
> > Hi LiLi,
> > Phoenix isn't an experimental project. We're on our 2.2 release, and many
> > companies (including the company for which I'm employed, Salesforce.com)
> > use it in production today.
> > Thanks,
> > James
> >
> > On Fri, Jan 3, 2014 at 11:39 PM, Li Li <[email protected]> wrote:
> >
> >> Hi James,
> >> Phoenix seems great, but it's currently only an experimental project. I
> >> want to use only HBase. Could you tell me the difference between Phoenix
> >> and HBase? If I use HBase only, how should I design the schema, and what
> >> else would I need for my goal? Thank you.
> >>
> >> On Sat, Jan 4, 2014 at 3:41 AM, James Taylor <[email protected]> wrote:
> >>
> >> > On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika <[email protected]> wrote:
> >> >
> >> >> Couple of notes:
> >> >> 1. When updating the status you essentially add a new rowkey into
> >> >> HBase; I would give it up altogether. The essential requirement seems
> >> >> to point at retrieving a list of urls in a certain order.
> >> >
> >> > Not sure on this, but it seemed to me that setting the status field is
> >> > forcing the urls that have been processed to be at the end of the sort
> >> > order.
> >> >
> >> >> 2. Wouldn't salting ruin the sort order required? Priority, date added?
> >> >
> >> > No, Phoenix still returns rows in row key order even when they're
> >> > salted. We do parallel scans for each bucket and do a merge sort on the
> >> > client, so the cost is pretty low for this (we also provide a way of
> >> > turning this off if your use case doesn't need it).
> >> >
> >> > Two years, JM? Now you're really going to have to start using Phoenix :-)
> >> >
> >> >> On Friday, January 3, 2014, James Taylor wrote:
> >> >>
> >> >> > Sure, no problem. One addition: depending on the cardinality of your
> >> >> > priority column, you may want to salt your table to prevent
> >> >> > hotspotting, since you'll have a monotonically increasing date in the
> >> >> > key. To do that, just add " SALT_BUCKETS=<n>" to the end of your
> >> >> > CREATE TABLE statement, where <n> is the number of machines in your
> >> >> > cluster. You can read more about salting here:
> >> >> > http://phoenix.incubator.apache.org/salted.html
> >> >> >
> >> >> > On Thu, Jan 2, 2014 at 11:36 PM, Li Li <[email protected]> wrote:
> >> >> >
> >> >> > > thank you. it's great.
> >> >> > >
> >> >> > > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <[email protected]> wrote:
> >> >> > >
> >> >> > > > Hi LiLi,
> >> >> > > > Have a look at Phoenix (http://phoenix.incubator.apache.org/).
> >> >> > > > It's a SQL skin on top of HBase. You can model your schema and
> >> >> > > > issue your queries just like you would with MySQL. Something
> >> >> > > > like this:
> >> >> > > >
> >> >> > > > -- Create table that optimizes for your most common query
> >> >> > > > -- (i.e. the PRIMARY KEY constraint should be ordered as you'd
> >> >> > > > -- want your rows ordered)
> >> >> > > > CREATE TABLE url_db (
> >> >> > > >     status TINYINT,
> >> >> > > >     priority INTEGER NOT NULL,
> >> >> > > >     added_time DATE,
> >> >> > > >     url VARCHAR NOT NULL
> >> >> > > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
> >> >> > > >
> >> >> > > > int lastStatus = 0;
> >> >> > > > int lastPriority = 0;
> >> >> > > > Date lastAddedTime = new Date(0);  // java.sql.Date
> >> >> > > > String lastUrl = "";
> >> >> > > >
> >> >> > > > // Use row value constructor to page through results in batches
> >> >> > > > // of 1000
> >> >> > > > String query =
> >> >> > > >     "SELECT * FROM url_db " +
> >> >> > > >     "WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
> >> >> > > >     "ORDER BY status, priority, added_time, url " +
> >> >> > > >     "LIMIT 1000";
> >> >> > > > String upsert =
> >> >> > > >     "UPSERT INTO url_db(status, priority, added_time, url) " +
> >> >> > > >     "VALUES (1, ?, CURRENT_DATE(), ?)";
> >> >> > > > PreparedStatement stmt = connection.prepareStatement(query);
> >> >> > > > PreparedStatement upsertStmt = connection.prepareStatement(upsert);
> >> >> > > >
> >> >> > > > while (true) {
> >> >> > > >   // Bind parameters
> >> >> > > >   stmt.setInt(1, lastStatus);
> >> >> > > >   stmt.setInt(2, lastPriority);
> >> >> > > >   stmt.setDate(3, lastAddedTime);
> >> >> > > >   stmt.setString(4, lastUrl);
> >> >> > > >   ResultSet resultSet = stmt.executeQuery();
> >> >> > > >
> >> >> > > >   boolean sawRow = false;
> >> >> > > >   while (resultSet.next()) {
> >> >> > > >     sawRow = true;
> >> >> > > >     // Remember last row processed so that you can start after
> >> >> > > >     // that for next batch
> >> >> > > >     lastStatus = resultSet.getInt(1);
> >> >> > > >     lastPriority = resultSet.getInt(2);
> >> >> > > >     lastAddedTime = resultSet.getDate(3);
> >> >> > > >     lastUrl = resultSet.getString(4);
> >> >> > > >
> >> >> > > >     doSomethingWithUrls();
> >> >> > > >
> >> >> > > >     // Record the url as processed (status 1)
> >> >> > > >     upsertStmt.setInt(1, lastPriority);
> >> >> > > >     upsertStmt.setString(2, lastUrl);
> >> >> > > >     upsertStmt.executeUpdate();
> >> >> > > >   }
> >> >> > > >   connection.commit();  // needed unless auto-commit is enabled
> >> >> > > >   if (!sawRow) {
> >> >> > > >     break;  // no unprocessed urls left
> >> >> > > >   }
> >> >> > > > }
> >> >> > > >
> >> >> > > > If you need to efficiently query on url, add a secondary index
> >> >> > > > like this:
> >> >> > > >
> >> >> > > > CREATE INDEX url_index ON url_db (url);
> >> >> > > >
> >> >> > > > Please let me know if you have questions.
> >> >> > > >
> >> >> > > > Thanks,
> >> >> > > > James
> >> >> > > >
> >> >> > > > On Thu, Jan 2, 2014 at 10:22 PM, Li Li <[email protected]> wrote:
> >> >> > > >
> >> >> > > >> Thank you. But I can't use Nutch. Could you tell me how HBase
> >> >> > > >> is used in Nutch? Or is HBase only used to store webpages?
> >> >> > > >>
> >> >> > > >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
> >> >> > > >> <[email protected]> wrote:
> >> >> > > >>
> >> >> > > >> > Hi,
> >> >> > > >> >
> >> >> > > >> > Have a look at http://nutch.apache.org . Version 2.x uses
> >> >> > > >> > HBase under the hood.
> >> >> > > >> >
> >> >> > > >> > Otis
> >> >> > > >> > --
> >> >> > > >> > Performance Monitoring * Log Analytics * Search Analytics
> >> >> > > >> > Solr & Elasticsearch Support * http://sematext.com/
> >> >> > > >> >
> >> >> > > >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <
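
To illustrate the salting point made earlier in the thread (parallel scans per bucket, then a merge sort on the client), here is a minimal sketch in plain Java. It is not Phoenix's actual implementation. The class name SaltedMergeSketch, the method mergeBuckets, and the string row keys are all hypothetical, and each inner list simply stands in for the already-sorted result of one bucket scan.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SaltedMergeSketch {

    // One cursor per bucket: the current head row key plus the rest of that bucket's rows.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    // Merge per-bucket results, each already sorted by row key, into one sorted list.
    public static List<String> mergeBuckets(List<List<String>> buckets) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> bucket : buckets) {
            Iterator<String> it = bucket.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it)); // skip empty buckets
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                // Advance the cursor while it is outside the heap, then re-insert it.
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical row keys (status|priority|url), spread over three buckets.
        List<List<String>> buckets = List.of(
            List.of("0|1|a.example", "0|3|d.example"),
            List.of("0|1|b.example", "0|2|c.example"),
            List.<String>of());
        System.out.println(mergeBuckets(buckets));
    }
}
```

Each row costs one heap operation, so merging N rows from B buckets takes O(N log B) comparisons, which fits the "pretty low" cost described above.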
