Re: Solr feasibility with terabyte-scale data

2008-05-11 Thread Marcus Herou
Original Message From: marcusherou [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 2:37:19 AM Subject: Re: Solr feasibility with terabyte-scale data Hi. I will as well head into a path like yours within some months

Re: Solr feasibility with terabyte-scale data

2008-05-10 Thread Marcus Herou
Thanks Ken. I will take a look, be sure of that :) Kindly //Marcus On Fri, May 9, 2008 at 10:26 PM, Ken Krugler [EMAIL PROTECTED] wrote: Hi Marcus, It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: it

Re: Solr feasibility with terabyte-scale data

2008-05-10 Thread Marcus Herou
Hi. I will as well head into a path like yours within some months from now. Currently I have an index of ~10M docs and only store ids in the index for performance and distribution reasons

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread marcusherou
Hi. I will as well head into a path like yours within some months from now. Currently I have an index of ~10M docs and only store ids in the index for performance and distribution reasons. When we enter a new market I'm assuming we will soon hit 100M and, quite soon after that, 1G documents. Each
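
What Marcus describes (keeping only ids in the index and resolving the full documents elsewhere) might look roughly like the sketch below. The IdSearcher and DocumentStore interfaces are purely illustrative assumptions, not his actual code; the point is that Solr only has to return ids, and hydration happens against an external store.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hedged sketch of the "store only ids in the index" pattern described above.
    interface IdSearcher {
        List<String> searchIds(String query, int rows); // e.g. a Solr query returning only the id field
    }

    interface DocumentStore {
        Map<String, String> fetch(String id);           // the full document lives outside the index
    }

    public class IdOnlySearch {
        private final IdSearcher searcher;
        private final DocumentStore store;

        public IdOnlySearch(IdSearcher searcher, DocumentStore store) {
            this.searcher = searcher;
            this.store = store;
        }

        public List<Map<String, String>> search(String query, int rows) {
            List<Map<String, String>> hits = new ArrayList<Map<String, String>>();
            for (String id : searcher.searchIds(query, rows)) {
                hits.add(store.fetch(id));              // hydrate each hit outside Solr
            }
            return hits;
        }
    }

The trade-off is an extra lookup per hit, but the index stays small enough to replicate and shard cheaply.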

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread James Brady
Hi, we have an index of ~300GB, which is at least approaching the ballpark you're in. Luckily for us, we have (to coin a phrase) an 'embarrassingly partitionable' index, so we can just scale out horizontally across commodity hardware with no problems at all. We're also using the multicore

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Marcus Herou
Cool. Since you must certainly already have a good partitioning scheme, could you elaborate at a high level on how you set this up? I'm certain that I will shoot myself in the foot more than once before getting it right, but this is what I'm good at: to never stop trying :) However it is nice to

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread James Brady
So our problem is made easier by having complete index partitionability by a user_id field. That means at one end of the spectrum we could have one monolithic index for everyone, while at the other end of the spectrum we could have individual cores for each user_id. At the moment, we've gone
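
A minimal sketch of what per-user routing could look like, assuming a fixed number of cores and a hash on user_id; the core-naming scheme and shard count are illustrative, not details from James' deployment:

    public class CoreRouter {
        private final int numCores;

        public CoreRouter(int numCores) {
            this.numCores = numCores;
        }

        // Deterministic routing: a user's documents and queries always hit the
        // same core, so each core can be indexed and searched in isolation.
        public String coreFor(String userId) {
            int bucket = (userId.hashCode() & 0x7fffffff) % numCores;
            return "core_" + bucket;
        }
    }

Because the index is partitionable by user_id, a query only ever needs to touch the single core the router picks, which is what makes the horizontal scaling 'embarrassing'.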

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Ken Krugler
Hi Marcus, It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: it might prevent a lot of headaches and they've already solved a lot of the tricky problems. There are a number of ridiculously sized projects using it

RE: Solr feasibility with terabyte-scale data

2008-05-09 Thread Lance Norskog
A useful schema trick: MD5 or SHA-1 ids. We generate our unique ID with the MD5 cryptographic hash algorithm. This takes input of any length and produces a 128-bit value that is effectively random. In practice there are no reports of two different real-world datasets accidentally producing the same checksum, so collisions are not a concern.
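
A minimal sketch of deriving such an id in Java, assuming the document's raw bytes (or its URL) are the input; the class and method names are illustrative:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class DocIds {
        // Returns the 128-bit MD5 digest of the input as a 32-character hex string,
        // suitable for use as a unique document key.
        public static String md5Id(byte[] data) {
            try {
                MessageDigest md = MessageDigest.getInstance("MD5");
                byte[] digest = md.digest(data);            // 16 bytes = 128 bits
                StringBuilder hex = new StringBuilder(32);
                for (byte b : digest) {
                    hex.append(String.format("%02x", b));   // two hex characters per byte
                }
                return hex.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException("MD5 missing", e); // MD5 ships with every JDK
            }
        }
    }

For example, md5Id(url.getBytes("UTF-8")) yields the same id every time the same URL is indexed, so re-crawled documents overwrite themselves instead of duplicating.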

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Otis Gospodnetic
http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Ken Krugler [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 4:26:19 PM Subject: Re: Solr feasibility with terabyte-scale data Hi Marcus, It seems a lot of what you're describing is really similar

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Ken Krugler
) functionality, then it sucks. Not sure what the outcome will be. -- Ken - Original Message From: Ken Krugler [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 4:26:19 PM Subject: Re: Solr feasibility with terabyte-scale data Hi Marcus, It seems a lot of what

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Otis Gospodnetic
happens there! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Ken Krugler [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 5:37:19 PM Subject: Re: Solr feasibility with terabyte-scale data Hi Otis, You

Re: Solr feasibility with terabyte-scale data

2008-01-23 Thread Phillip Farber
For sure this is a problem. We have considered some strategies. One might be to use a dictionary to clean up the OCR, but that gets hard for proper names and technical jargon. Another is to use stop words (which has the unfortunate side effect of making phrase searches like "to be or not to be"
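
A hedged sketch of the dictionary approach, which also shows where it breaks down: any proper name or jargon term missing from the dictionary is silently dropped unless it is added to a whitelist. The class is illustrative, not code from the project.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    public class OcrCleaner {
        private final Set<String> dictionary; // wordlist plus any whitelisted names/jargon

        public OcrCleaner(Set<String> dictionary) {
            this.dictionary = dictionary;
        }

        // Keep only tokens the dictionary recognizes; everything else is treated
        // as OCR noise and discarded before indexing.
        public List<String> clean(List<String> ocrTokens) {
            List<String> kept = new ArrayList<String>();
            for (String token : ocrTokens) {
                if (dictionary.contains(token.toLowerCase())) {
                    kept.add(token);
                }
            }
            return kept;
        }
    }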

RE: Solr feasibility with terabyte-scale data

2008-01-23 Thread Lance Norskog
[EMAIL PROTECTED] Sent: Wednesday, January 23, 2008 8:15 AM To: solr-user@lucene.apache.org Subject: Re: Solr feasibility with terabyte-scale data For sure this is a problem. We have considered some strategies. One might be to use a dictionary to clean up the OCR but that gets hard for proper names

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Phillip Farber
Ryan McKinley wrote: We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id (stored and indexed) and a large

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Phillip Farber
Otis Gospodnetic wrote: Hi, Some quick notes, since it's late here. - You'll need to wait for SOLR-303 - there is no way even a big machine will be able to search such a large index in a reasonable amount of time, plus you may simply not have enough RAM for such a large index. Are you

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Erick Erickson
Just to add another wrinkle, how clean is your OCR? I've seen it range from very nice (i.e. 99.9% of the words are actually words) to horrible (60%+ of the words are nonsense). I saw one attempt to OCR a family tree. As in a stylized tree with the data hand-written along the various branches in
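
One hedged way to put a number on "how clean is your OCR" is the fraction of tokens that appear in a dictionary, along the lines of the 99.9% vs. 60% figures Erick mentions; the method below is illustrative only.

    import java.util.Set;

    public class OcrQuality {
        // Fraction of tokens found in the dictionary: ~0.999 for clean OCR,
        // far lower for pages that scanned badly.
        public static double dictionaryHitRate(String[] tokens, Set<String> dictionary) {
            if (tokens.length == 0) {
                return 0.0;
            }
            int hits = 0;
            for (String token : tokens) {
                if (dictionary.contains(token.toLowerCase())) {
                    hits++;
                }
            }
            return (double) hits / tokens.length;
        }
    }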

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Mike Klaas
On 22-Jan-08, at 4:20 PM, Phillip Farber wrote: We would need all 7M ids scored so we could push them through a filter query to reduce them to a much smaller number on the order of 100-10,000 representing just those that correspond to items in a collection. You could pass the filter to
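
Mike is referring to Solr's filter-query parameter. A minimal sketch of such a request, with the field names ocr_text and collection_id assumed purely for illustration:

    import java.net.URLEncoder;

    public class FilterQueryExample {
        public static void main(String[] args) throws Exception {
            String q  = URLEncoder.encode("ocr_text:(steam engine)", "UTF-8");
            String fq = URLEncoder.encode("collection_id:42", "UTF-8");
            String url = "http://localhost:8983/solr/select"
                       + "?q=" + q        // the scored full-text query
                       + "&fq=" + fq      // filter query: restricts hits to one collection without affecting scores
                       + "&fl=id,score"   // return only the stored id plus the score
                       + "&rows=10000";
            System.out.println(url);
        }
    }

Because fq results are cached independently of the main query, repeating the same collection restriction across many queries is cheap.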

Re: Solr feasibility with terabyte-scale data

2008-01-20 Thread Otis Gospodnetic
Hi, Some quick notes, since it's late here. - You'll need to wait for SOLR-303 - there is no way even a big machine will be able to search such a large index in a reasonable amount of time, plus you may simply not have enough RAM for such a large index. - I'd suggest you wait for Solr 1.3 (or

Re: Solr feasibility with terabyte-scale data

2008-01-19 Thread Ryan McKinley
We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id (stored and indexed) and a large text field, indexed not
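
The schema described here could be expressed with two field definitions along these lines; the field and type names are illustrative, and the exact fieldTypes depend on the schema.xml in use:

    <!-- the id: stored so it can be returned in results, indexed so it can be looked up -->
    <field name="id" type="string" indexed="true" stored="true"/>
    <!-- the large OCR text: indexed for search but not stored, to keep the index small -->
    <field name="ocr_text" type="text" indexed="true" stored="false"/>

Not storing the OCR text is what keeps a terabyte-scale dataset from turning into a terabyte-scale index, at the cost of fetching the original text from outside Solr when it is needed for display.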