- Original Message
From: marcusherou [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Friday, May 9, 2008 2:37:19 AM
Subject: Re: Solr feasibility with terabyte-scale data
Hi.
I will as well head into a path like yours within some months from now.
Currently I have an index of ~10M docs and only store id's in the index
for performance and distribution reasons.
Thanks Ken.
I will take a look, be sure of that :)
Kindly
//Marcus
On Fri, May 9, 2008 at 10:26 PM, Ken Krugler [EMAIL PROTECTED]
wrote:
Hi Marcus,
It seems a lot of what you're describing is really similar to MapReduce,
so I think Otis' suggestion to look at Hadoop is a good one: it might
prevent a lot of headaches and they've already solved a lot of the
tricky problems.
--
View this message in context:
http://www.nabble.com/Solr-feasibility-with-terabyte-scale-data-tp14963703p17142176.html
Sent from the Solr - User mailing list archive at Nabble.com.
at this kind of scale?
Given these parameters is it realistic to think that Solr could handle
the task?
Any advice/wisdom greatly appreciated,
Phil
--
View this message in context:
http://www.nabble.com/Solr-feasibility-with-terabyte-scale-data-tp14963703p17142176.html
Sent from the Solr - User mailing list archive at Nabble.com.
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
Hi Marcus,
It seems a lot of what you're describing is really similar to
MapReduce, so I think Otis' suggestion to look at Hadoop is a good
one: it might prevent a lot of headaches and they've already solved
a lot of the tricky problems. There are a number of ridiculously sized
projects using it.
A useful schema trick: MD5 or SHA-1 ids. We generate our unique ID with
the MD5 cryptographic checksumming algorithm. This takes X bytes of data
and creates a 128-bit value that is effectively 128 random bits. The
chance of two different documents producing the same checksum by accident
is vanishingly small.
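That ID scheme can be sketched in a few lines (a minimal illustration using Python's standard hashlib; the function name and sample input are made up):

```python
import hashlib

def doc_id(content: bytes) -> str:
    """Derive a stable unique ID from document content.

    MD5 maps input of any length to 128 bits; rendered as hex it gives
    a fixed 32-character key suitable for a Solr uniqueKey field.
    """
    return hashlib.md5(content).hexdigest()

# The same bytes always map to the same ID, so re-indexing a document
# overwrites the old copy instead of creating a duplicate.
a = doc_id(b"some OCR text ...")
b = doc_id(b"some OCR text ...")
assert a == b and len(a) == 32
```

SHA-1 works the same way via `hashlib.sha1`, trading a longer (160-bit) digest for an even smaller collision chance.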
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Ken Krugler [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Friday, May 9, 2008 4:26:19 PM
Subject: Re: Solr feasibility with terabyte-scale data
Hi Marcus,
It seems a lot of what you're describing is really similar
)
functionality, then it sucks. Not sure what the outcome will be.
-- Ken
happens there! :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Ken Krugler [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Friday, May 9, 2008 5:37:19 PM
Subject: Re: Solr feasibility with terabyte-scale data
Hi Otis,
You
For sure this is a problem. We have considered some strategies. One
might be to use a dictionary to clean up the OCR but that gets hard for
proper names and technical jargon. Another is to use stop words (which
has the unfortunate side effect of making phrase searches like "to be or
not to be" impossible).
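The dictionary strategy might be sketched like this (a toy illustration only: the wordlist, the capitalization heuristic for proper names, and the alphabetic-ratio threshold are all invented for the example, and they show exactly why names and jargon are the hard part):

```python
# Drop OCR tokens that look like nonsense: not in the dictionary and
# not plausibly a word.
DICTIONARY = {"the", "quick", "brown", "fox"}  # stand-in for a real wordlist

def clean_ocr(tokens):
    kept = []
    for tok in tokens:
        low = tok.lower()
        # Fraction of characters that are letters; garbage OCR tokens
        # tend to mix digits and punctuation into words.
        alpha_ratio = sum(c.isalpha() for c in tok) / max(len(tok), 1)
        # Keep dictionary words, and give capitalized tokens the benefit
        # of the doubt (proper names are rarely in a wordlist).
        if low in DICTIONARY or (tok[:1].isupper() and alpha_ratio > 0.8):
            kept.append(tok)
    return kept

print(clean_ocr(["the", "qu1ck#", "Farber", "xj7q"]))  # → ['the', 'Farber']
```

Note the trade-off: the heuristic rescues "Farber" but would also rescue capitalized garbage, and lowercase jargon gets dropped; that is the failure mode mentioned above.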
From: [EMAIL PROTECTED]
Sent: Wednesday, January 23, 2008 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr feasibility with terabyte-scale data
Ryan McKinley wrote:
We are considering Solr 1.2 to index and search a terabyte-scale
dataset of OCR. Initially our requirements are simple: basic
tokenizing, score sorting only, no faceting. The schema is simple
too. A document consists of a numeric id, stored and indexed, and a
large text field, indexed, not stored.
From: Phillip Farber [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Friday, January 18, 2008 5:26:21 PM
Subject: Solr feasibility with terabyte-scale data
Hello everyone,
We are considering Solr 1.2 to index and search a terabyte-scale
dataset of OCR. Initially our requirements are simple: basic
tokenizing, score sorting only, no faceting.
Just to add another wrinkle, how clean is your OCR? I've seen it
range from very nice (i.e. 99.9% of the words are actually words) to
horrible (60%+ of the words are nonsense). I saw one attempt
to OCR a family tree. As in a stylized tree with the data
hand-written along the various branches in
On 22-Jan-08, at 4:20 PM, Phillip Farber wrote:
We would need all 7M ids scored so we could push them through a
filter query to reduce them to a much smaller number on the order
of 100-10,000 representing just those that correspond to items in a
collection.
You could pass the filter to
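The score-then-filter idea above can be pictured as a standard Solr select request with an fq (filter query) parameter, which restricts the result set without affecting scores. Below is a sketch that only builds the request URL, using Python's stdlib; the host, port, and field names are assumptions:

```python
from urllib.parse import urlencode

def collection_query(text_query, collection_id):
    """Build a Solr select URL: score the full-text query, then cut the
    scored set down to one collection's items with a filter query."""
    params = {
        "q": text_query,                         # scored full-text query
        "fq": f"collection_id:{collection_id}",  # non-scoring filter
        "fl": "id,score",                        # return only id and score
        "rows": 10000,                           # upper bound from the thread
    }
    return "http://localhost:8983/solr/select?" + urlencode(params)

print(collection_query("digital libraries", 42))
```

Because Solr caches filter queries separately from the main query, repeating the same collection filter against different text queries is cheap even when the unfiltered match set is in the millions.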
- Original Message
From: Phillip Farber [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Friday, January 18, 2008 5:26:21 PM
Subject: Solr feasibility with terabyte-scale data
Hello everyone,
We are considering Solr 1.2 to index and search a terabyte-scale dataset
of OCR. Initially our requirements are simple: basic tokenizing, score
sorting only, no faceting. The schema is simple too. A document
consists of a numeric id, stored and indexed, and a large text field,
indexed, not stored.
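A schema along those lines might look like this in schema.xml (a sketch only, assuming Solr 1.2-era built-in types; the field names and analyzer choices are illustrative, not taken from the thread):

```xml
<schema name="ocr" version="1.1">
  <types>
    <fieldType name="long" class="solr.LongField"/>
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <!-- numeric id: stored and indexed -->
    <field name="id" type="long" indexed="true" stored="true" required="true"/>
    <!-- large OCR text: indexed, not stored -->
    <field name="ocr" type="text" indexed="true" stored="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
```

Not storing the OCR field keeps the index itself far smaller than the raw terabyte of text, at the cost of fetching the page text from outside Solr for display.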