Hi,

Yes, I'm sure that would be a welcome addition. Topic for [email protected]...
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Fri, Jan 3, 2014 at 1:23 AM, James Taylor <[email protected]> wrote:

> Otis,
> I didn't realize Nutch uses HBase underneath. Might be interesting if you
> serialized data in a Phoenix-compliant manner, as you could run SQL queries
> directly on top of it.
>
> Thanks,
> James
>
> On Thu, Jan 2, 2014 at 10:17 PM, Otis Gospodnetic <[email protected]> wrote:
>
>> Hi,
>>
>> Have a look at http://nutch.apache.org . Version 2.x uses HBase under the
>> hood.
>>
>> Otis
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>> On Fri, Jan 3, 2014 at 1:12 AM, Li Li <[email protected]> wrote:
>>
>>> hi all,
>>> I want to use HBase to store all URLs (crawled or not crawled).
>>> Each URL will have a column named priority, which represents the
>>> priority of the URL. I want to get the top N URLs ordered by priority
>>> (if the priority is the same, then the URL whose timestamp is earlier
>>> is preferred). Using something like MySQL, my client application might
>>> look like:
>>>
>>> while true:
>>>     select url from url_db where status='not_crawled'
>>>         order by priority, addedTime limit 1000;
>>>     do something with these urls;
>>>     extract more urls and insert them into url_db;
>>>
>>> How should I design an HBase schema for this application? Is HBase
>>> suitable for me?
>>> I found in this article
>>> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
>>> that they use Redis to store URLs. I think HBase originated from
>>> Bigtable, and Google uses Bigtable to store web pages, so for a huge
>>> number of URLs I prefer a distributed system like HBase.
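[Editor's note: the thread never shows a concrete schema, so here is one hedged sketch of the usual answer to Li Li's top-N question. HBase stores rows sorted lexicographically by row key, so if the priority and addedTime are encoded as fixed-width big-endian bytes at the front of the key, a plain scan from the start of the table returns URLs in exactly the order of the MySQL query above. The function name `make_row_key` and the example URLs are my own, hypothetical; only the byte-encoding idea is shown, not the HBase client calls.]

```python
import hashlib
import struct

def make_row_key(priority: int, added_ms: int, url: str) -> bytes:
    """Build a row key that sorts by (priority asc, addedTime asc).

    Fixed-width big-endian encodings compare bytewise in the same
    order as the underlying integers, which is what makes an HBase
    scan equivalent to ORDER BY priority, addedTime. An MD5 digest
    of the URL keeps keys unique without variable-length suffixes.
    """
    return (
        struct.pack(">I", priority)           # 4-byte big-endian priority
        + struct.pack(">Q", added_ms)         # 8-byte big-endian timestamp
        + hashlib.md5(url.encode()).digest()  # 16-byte URL fingerprint
    )

# Sorting the raw keys bytewise reproduces the desired crawl order:
keys = sorted(
    make_row_key(p, t, u)
    for p, t, u in [
        (2, 1000, "http://a.example/"),
        (1, 2000, "http://b.example/"),
        (1, 1000, "http://c.example/"),
    ]
)
# keys[0] corresponds to (priority=1, addedTime=1000), i.e. http://c.example/
```

With keys shaped like this, the "top 1000 uncrawled URLs" query becomes a scan of the first 1000 rows (status can live in a column and be filtered, or crawled rows can be deleted/moved), which avoids the secondary-index problem HBase would otherwise have with a priority column.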
