Re: Experience with indexing billions of documents?

2010-04-14 Thread Thomas Koch
Bradford Stephens:
 Hey there,
 
 We've actually been tackling this problem at Drawn to Scale. We'd really
 like to get our hands on LuceHBase to see how it scales. Our faceting still
 needs to be done in-memory, which is kinda tricky, but it's worth
 exploring.
Hi Bradford,

thank you for your interest. Just yesterday I found out that somebody else
has apparently done exactly the same thing I did and ported Lucandra to HBase:

http://github.com/akkumar/hbasene

I'll have a look at this project and most likely abandon luceHBase in favor of 
the other, since it's more advanced.

Best regards,

Thomas Koch, http://www.koch.ro


Re: Experience with indexing billions of documents?

2010-04-14 Thread Jason Rutherglen
Tom,

Yes, we've (Biz360) indexed 3 billion documents and upwards... If
indexing is the issue (or rather re-indexing), we used SOLR-1301
with Hadoop to re-index efficiently (i.e., in a timely manner).
For querying we're currently using the out-of-the-box Solr
distributed shards query mechanism, which is hard (read: near
impossible) to customize. I've been writing SOLR-1724, which
deploys cores out of HDFS. SOLR-1724 works in conjunction with
Solr Cloud, which should allow for more efficient failover. Katta
has a nice model for replicating cores across multiple servers
for redundancy. The issue with this is that it could feasibly
require twice as many servers for two-times replication.
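
For reference, the stock distributed query is just a request to one core
with a shards parameter listing the cores to fan out to. A minimal SolrJ
sketch (host and core names below are placeholders, not our actual topology):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedQuerySketch {
  public static void main(String[] args) throws Exception {
    // Any one core can act as the aggregator for the distributed request.
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://search1:8983/solr/core0");

    SolrQuery q = new SolrQuery("title:lucene");
    // Comma-separated list of shards whose results get merged.
    q.set("shards",
        "search1:8983/solr/core0,search2:8983/solr/core0,search3:8983/solr/core0");
    q.setRows(10);

    QueryResponse rsp = solr.query(q);
    System.out.println("numFound across all shards: "
        + rsp.getResults().getNumFound());
  }
}

The aggregating core merges per-shard results on the unique key field, and
that merge step is the part that's hard to customize.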

If you have more questions feel free to ping me or whatever.

Cheers,

Jason

On Fri, Apr 2, 2010 at 8:57 AM, Burton-West, Tom tburt...@umich.edu wrote:
 We are currently indexing 5 million books in Solr, scaling up over the next 
 few years to 20 million.  However we are using the entire book as a Solr 
 document.  We are evaluating the possibility of indexing individual pages as 
 there are some use cases where users want the most relevant pages regardless 
 of what book they occur in.  However, we estimate that we are talking about 
 somewhere between 1 and 6 billion pages and have concerns over whether Solr 
 will scale to this level.

 Does anyone have experience using Solr with 1-6 billion Solr documents?

 The lucene file format document 
 (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)  mentions 
 a limit of about 2 billion document ids.   I assume this is the lucene 
 internal document id and would therefore be a per index/per shard limit.  Is 
 this correct?


 Tom Burton-West.






Re: Experience with indexing billions of documents?

2010-04-13 Thread Bradford Stephens
Hey there,

We've actually been tackling this problem at Drawn to Scale. We'd really
like to get our hands on LuceHBase to see how it scales. Our faceting still
needs to be done in-memory, which is kinda tricky, but it's worth
exploring.

On Mon, Apr 12, 2010 at 7:27 AM, Thomas Koch tho...@koch.ro wrote:

 Hi,

 could I interest you in this project?
 http://github.com/thkoch2001/lucehbase

 The aim is to store the index directly in HBase, a database system modelled
 after Google's Bigtable and designed to store data in the range of terabytes to petabytes.

 Best regards, Thomas Koch

 Lance Norskog:
  The 2B limitation is within one shard, due to using a signed 32-bit
  integer. There is no limit in that regard in sharding- Distributed
  Search uses the stored unique document id rather than the internal
  docid.
 
  On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens richcari...@gmail.com wrote:
   A colleague of mine is using native Lucene + some home-grown
   patches/optimizations to index over 13B small documents in a 32-shard
   environment, which is around 406M docs per shard.

   If there's a 2B doc id limitation in Lucene then I assume he's patched it
   himself.

   On Fri, Apr 2, 2010 at 1:17 PM, dar...@ontrenet.com wrote:
   My guess is that you will need to take advantage of Solr 1.5's upcoming
   cloud/cluster renovations and use multiple indexes to comfortably
   achieve those numbers. Hypothetically, in that case, you won't be limited
   by single index docid limitations of Lucene.

    We are currently indexing 5 million books in Solr, scaling up over the
    next few years to 20 million.  However we are using the entire book as
    a Solr document.  We are evaluating the possibility of indexing
    individual pages as there are some use cases where users want the most
    relevant pages regardless of what book they occur in.  However, we
    estimate that we are talking about somewhere between 1 and 6 billion
    pages and have concerns over whether Solr will scale to this level.

    Does anyone have experience using Solr with 1-6 billion Solr
    documents?

    The lucene file format document
    (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
    mentions a limit of about 2 billion document ids.   I assume this is
    the lucene internal document id and would therefore be a per index/per
    shard limit.  Is this correct?


    Tom Burton-West.
 

 Thomas Koch, http://www.koch.ro




-- 
Bradford Stephens,
Founder, Drawn to Scale
drawntoscalehq.com
727.697.7528

http://www.drawntoscalehq.com --  The intuitive, cloud-scale data solution.
Process, store, query, search, and serve all your data.

http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
and Computer Science


Re: Experience with indexing billions of documents?

2010-04-12 Thread Thomas Koch
Hi,

could I interest you in this project?
http://github.com/thkoch2001/lucehbase

The aim is to store the index directly in HBase, a database system modelled
after Google's Bigtable and designed to store data in the range of terabytes to petabytes.
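
To give a rough idea, here's a sketch of the kind of Lucandra-style row
layout such an index could use: one HBase row per term, one column per
document containing that term. The table, column family and qualifier names
are invented for illustration and are not lucehbase's actual schema:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TermPostingSketch {
  public static void main(String[] args) throws IOException {
    // Table name and schema below are purely illustrative.
    HTable index = new HTable(new HBaseConfiguration(), "term_index");

    // Row key = field:term, one column per document, value = term frequency.
    Put posting = new Put(Bytes.toBytes("body:lucene"));
    posting.add(Bytes.toBytes("postings"),  // column family
                Bytes.toBytes("doc42"),     // qualifier: document id
                Bytes.toBytes(3));          // value: term frequency in that doc
    index.put(posting);
    index.flushCommits();
  }
}

Reading the postings for a term is then a single row fetch, which is what
makes the Bigtable model attractive for an inverted index.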

Best regards, Thomas Koch

Lance Norskog:
 The 2B limitation is within one shard, due to using a signed 32-bit
 integer. There is no limit in that regard in sharding- Distributed
 Search uses the stored unique document id rather than the internal
 docid.
 
 On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens richcari...@gmail.com wrote:
  A colleague of mine is using native Lucene + some home-grown
  patches/optimizations to index over 13B small documents in a 32-shard
  environment, which is around 406M docs per shard.
 
  If there's a 2B doc id limitation in Lucene then I assume he's patched it
  himself.
 
  On Fri, Apr 2, 2010 at 1:17 PM, dar...@ontrenet.com wrote:
  My guess is that you will need to take advantage of Solr 1.5's upcoming
  cloud/cluster renovations and use multiple indexes to comfortably
  achieve those numbers. Hypothetically, in that case, you won't be limited
  by single index docid limitations of Lucene.
 
   We are currently indexing 5 million books in Solr, scaling up over the
   next few years to 20 million.  However we are using the entire book as
   a Solr document.  We are evaluating the possibility of indexing
   individual pages as there are some use cases where users want the most
   relevant pages
   regardless of what book they occur in.  However, we estimate that we
   are talking about somewhere between 1 and 6 billion pages and have
   concerns over whether Solr will scale to this level.
  
   Does anyone have experience using Solr with 1-6 billion Solr
   documents?
  
   The lucene file format document
   (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
   mentions a limit of about 2 billion document ids.   I assume this is
   the lucene internal document id and would therefore be a per index/per
   shard limit.  Is this correct?
  
  
   Tom Burton-West.
 

Thomas Koch, http://www.koch.ro


Re: Experience with indexing billions of documents?

2010-04-05 Thread Lance Norskog
The 2B limitation is within one shard, due to using a signed 32-bit
integer. There is no limit in that regard in sharding- Distributed
Search uses the stored unique document id rather than the internal
docid.
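
To put a number on it (this is only about the docid ceiling, not about what
a single box can realistically serve):

public class DocIdCeiling {
  public static void main(String[] args) {
    long perShardCeiling = Integer.MAX_VALUE;  // 2,147,483,647 docids per index
    long pages = 6000000000L;                  // upper end of the 1-6 billion estimate
    long minShards = (pages + perShardCeiling - 1) / perShardCeiling;
    System.out.println("per-shard docid ceiling: " + perShardCeiling);
    System.out.println("minimum shards just to stay under it: " + minShards);  // 3
  }
}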

On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens richcari...@gmail.com wrote:
 A colleague of mine is using native Lucene + some home-grown
 patches/optimizations to index over 13B small documents in a 32-shard
 environment, which is around 406M docs per shard.

 If there's a 2B doc id limitation in Lucene then I assume he's patched it
 himself.

 On Fri, Apr 2, 2010 at 1:17 PM, dar...@ontrenet.com wrote:

 My guess is that you will need to take advantage of Solr 1.5's upcoming
 cloud/cluster renovations and use multiple indexes to comfortably achieve
 those numbers. Hypothetically, in that case, you won't be limited by single
 index docid limitations of Lucene.

  We are currently indexing 5 million books in Solr, scaling up over the
  next few years to 20 million.  However we are using the entire book as a
  Solr document.  We are evaluating the possibility of indexing individual
  pages as there are some use cases where users want the most relevant pages
  regardless of what book they occur in.  However, we estimate that we are
  talking about somewhere between 1 and 6 billion pages and have concerns
  over whether Solr will scale to this level.
 
  Does anyone have experience using Solr with 1-6 billion Solr documents?
 
  The lucene file format document
  (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
  mentions a limit of about 2 billion document ids.   I assume this is the
  lucene internal document id and would therefore be a per index/per shard
  limit.  Is this correct?
 
 
  Tom Burton-West.
 
 
 
 






-- 
Lance Norskog
goks...@gmail.com


Experience with indexing billions of documents?

2010-04-02 Thread Burton-West, Tom
We are currently indexing 5 million books in Solr, scaling up over the next few 
years to 20 million.  However we are using the entire book as a Solr document.  
We are evaluating the possibility of indexing individual pages as there are 
some use cases where users want the most relevant pages regardless of what book 
they occur in.  However, we estimate that we are talking about somewhere 
between 1 and 6 billion pages and have concerns over whether Solr will scale to 
this level.

Does anyone have experience using Solr with 1-6 billion Solr documents?

The lucene file format document 
(http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)  mentions a 
limit of about 2 billion document ids.   I assume this is the lucene internal 
document id and would therefore be a per index/per shard limit.  Is this 
correct?


Tom Burton-West.





Re: Experience with indexing billions of documents?

2010-04-02 Thread darren
My guess is that you will need to take advantage of Solr 1.5's upcoming
cloud/cluster renovations and use multiple indexes to comfortably achieve
those numbers. Hypothetically, in that case, you won't be limited by single
index docid limitations of Lucene.

 We are currently indexing 5 million books in Solr, scaling up over the
 next few years to 20 million.  However we are using the entire book as a
 Solr document.  We are evaluating the possibility of indexing individual
 pages as there are some use cases where users want the most relevant pages
 regardless of what book they occur in.  However, we estimate that we are
 talking about somewhere between 1 and 6 billion pages and have concerns
 over whether Solr will scale to this level.

 Does anyone have experience using Solr with 1-6 billion Solr documents?

 The lucene file format document
 (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
 mentions a limit of about 2 billion document ids.   I assume this is the
 lucene internal document id and would therefore be a per index/per shard
 limit.  Is this correct?


 Tom Burton-West.







Re: Experience with indexing billions of documents?

2010-04-02 Thread Peter Sturge
You can do this today with multiple indexes, replication and distributed
searching. SolrCloud/clustering will certainly make life easier when it
comes to managing these, but with distributed searches over multiple
indexes, you're limited only by how much hardware you can throw at it.
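
As a back-of-the-envelope sketch (the docs-per-shard figure is borrowed from
numbers mentioned elsewhere in this thread and the replication factor is
just an example, not a recommendation):

public class ClusterSizingSketch {
  public static void main(String[] args) {
    long totalDocs = 6000000000L;    // upper end of the 1-6 billion page estimate
    long docsPerShard = 400000000L;  // illustrative: roughly the per-shard size reported in this thread
    int replicas = 2;                // every shard kept on two servers for redundancy

    long shards = (totalDocs + docsPerShard - 1) / docsPerShard;  // 15
    long servers = shards * replicas;                             // 30
    System.out.println(shards + " shards x " + replicas + " copies = "
        + servers + " cores to host");
  }
}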


On Fri, Apr 2, 2010 at 6:17 PM, dar...@ontrenet.com wrote:

 My guess is that you will need to take advantage of Solr 1.5's upcoming
 cloud/cluster renovations and use multiple indexes to comfortably achieve
 those numbers. Hypothetically, in that case, you won't be limited by single
 index docid limitations of Lucene.

  We are currently indexing 5 million books in Solr, scaling up over the
  next few years to 20 million.  However we are using the entire book as a
  Solr document.  We are evaluating the possibility of indexing individual
  pages as there are some use cases where users want the most relevant pages
  regardless of what book they occur in.  However, we estimate that we are
  talking about somewhere between 1 and 6 billion pages and have concerns
  over whether Solr will scale to this level.
 
  Does anyone have experience using Solr with 1-6 billion Solr documents?
 
  The lucene file format document
  (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
  mentions a limit of about 2 billion document ids.   I assume this is the
  lucene internal document id and would therefore be a per index/per shard
  limit.  Is this correct?
 
 
  Tom Burton-West.
 
 
 
 




Re: Experience with indexing billions of documents?

2010-04-02 Thread Rich Cariens
A colleague of mine is using native Lucene + some home-grown
patches/optimizations to index over 13B small documents in a 32-shard
environment, which is around 406M docs per shard.

If there's a 2B doc id limitation in Lucene then I assume he's patched it
himself.

On Fri, Apr 2, 2010 at 1:17 PM, dar...@ontrenet.com wrote:

 My guess is that you will need to take advantage of Solr 1.5's upcoming
 cloud/cluster renovations and use multiple indexes to comfortably achieve
 those numbers. Hypothetically, in that case, you won't be limited by single
 index docid limitations of Lucene.

  We are currently indexing 5 million books in Solr, scaling up over the
  next few years to 20 million.  However we are using the entire book as a
  Solr document.  We are evaluating the possibility of indexing individual
  pages as there are some use cases where users want the most relevant pages
  regardless of what book they occur in.  However, we estimate that we are
  talking about somewhere between 1 and 6 billion pages and have concerns
  over whether Solr will scale to this level.
 
  Does anyone have experience using Solr with 1-6 billion Solr documents?
 
  The lucene file format document
  (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
  mentions a limit of about 2 billion document ids.   I assume this is the
  lucene internal document id and would therefore be a per index/per shard
  limit.  Is this correct?
 
 
  Tom Burton-West.