Re: Solr feasibility with terabyte-scale data

2008-05-11 Thread Marcus Herou
Quick reply to Otis and Ken.

Otis: After a night's sleep I think you are absolutely right that some
HPC grid like Hadoop, or perhaps GlusterHPC, should be used, regardless of my
last comment.

Ken: I looked at the architecture of Katta and it looks really nice. I really
believe Katta could live as a subproject of Lucene, since it fills a gap that
is not filled by Nutch. Nutch surely does similar things, but Katta could
actually be something that Nutch uses as a component to distribute the crawled
index. I will try to join the Stefan & friends team.

Kindly

//Marcus


On Sat, May 10, 2008 at 6:03 PM, Marcus Herou [EMAIL PROTECTED]
wrote:

 Hi Otis.

 Thanks for the insights. Nice to get feedback from a Technorati guy. Nice
 to see that your snippet is almost a copy of mine; that gives me the right
 gut feeling about this :)

 I'm quite familiar with Hadoop as you can see if you check out the code of
 my OS project AbstractCache-
 http://dev.tailsweep.com/projects/abstractcache/. AbstractCache is a
 project which aims to create storage solutions based on the Map and
 SortedMap interface. I use it everywhere in Tailsweep.com and used it as
 well at my former employer Eniro.se (largest yellow pages site in Sweden).
 It has been in constant development for five years.

 Since I'm a cluster freak of nature I love a project named GlusterFS, where
 they have managed to create a system without master/slave[s] and a NameNode.
 The advantage of this is that it is a lot more scalable; the drawback is
 that you can get into split-brain situations, which people on the mailing
 list are complaining about. Anyway, I tend to solve this with JGroups
 membership, where the coordinator can be any machine in the cluster, but in
 the group-joining process the first machine to join gets the privilege of
 becoming coordinator. But even with JGroups you can run into trouble with
 race conditions of all kinds (distributed locks, for example).
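
 (For reference, a minimal sketch of that membership pattern with the JGroups
 API; the cluster name is a placeholder, and depending on the JGroups version
 the local-address accessor is getAddress() or getLocalAddress(). JGroups
 installs a new View on every membership change, and by convention the first
 member of the view acts as coordinator.)

import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

// Sketch only: join a JGroups cluster and track whether this node is the
// coordinator (the first member of the current view, i.e. the oldest node).
public class MembershipAware extends ReceiverAdapter
{
    private final JChannel channel;
    private volatile boolean coordinator;

    public MembershipAware(String clusterName) throws Exception
    {
        channel = new JChannel();        // default protocol stack
        channel.setReceiver(this);
        channel.connect(clusterName);
    }

    public void viewAccepted(View view)
    {
        // Called whenever a member joins or leaves the group.
        Address first = view.getMembers().get(0);
        coordinator = first.equals(channel.getAddress());
    }

    public boolean isCoordinator()
    {
        return coordinator;
    }
}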

 I've created an alternative to the Hadoop file system (mostly for fun) where
 you just add an object to the cluster and, based on which algorithm you
 choose, it is raided or striped across the cluster.
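
 (A purely illustrative sketch of that placement choice; none of these names
 correspond to an existing API. Either replicate the whole object to a few
 nodes, RAID1-style, or stripe its chunks round-robin across the cluster.)

import java.util.ArrayList;
import java.util.List;

// Illustrative only: pick target nodes for an object, either replicating the
// whole thing to several nodes or striping fixed-size chunks across the cluster.
public class Placement
{
    // Raided: the same object goes, in full, to 'replicas' distinct nodes.
    public static List<Integer> raided(String objectId, int clusterSize, int replicas)
    {
        List<Integer> nodes = new ArrayList<Integer>();
        int start = (objectId.hashCode() & 0x7fffffff) % clusterSize;
        for (int i = 0; i < replicas; i++)
        {
            nodes.add((start + i) % clusterSize);
        }
        return nodes;
    }

    // Striped: chunk n of the object lands on node (start + n) % clusterSize.
    public static int striped(String objectId, int chunkNumber, int clusterSize)
    {
        int start = (objectId.hashCode() & 0x7fffffff) % clusterSize;
        return (start + chunkNumber) % clusterSize;
    }
}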

 Anyway this was off topic but I think my experience in building membership
 aware clusters will help me in this particular case.

 Kindly

 //Marcus




 On Fri, May 9, 2008 at 6:54 PM, Otis Gospodnetic 
 [EMAIL PROTECTED] wrote:

  Marcus,
 
  You are headed in the right direction.
 
  We've built a system like this at Technorati (Lucene, not Solr) and had
  components like the namenode or controller that you mention.  If you
  look at the Hadoop project, you will see something similar in concept
  (NameNode), though it deals with raw data blocks, their placement in the
  cluster, etc.  As a matter of fact, I am currently running its re-balancer
  in order to move some of the blocks around in the cluster.  That matches
  what you are describing for moving documents from one shard to the other.
   Of course, you can simplify things and just have this central piece be
  aware of any new servers and simply get it to place any new docs on the new
  servers and create a new shard there.  Or you can get fancy and take into
  consideration the hardware resources - the CPU, the disk space, the memory,
  and use that to figure out how much each machine in your cluster can handle
  and maximize its use based on this knowledge. :)
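
  (A hedged sketch of what that "get fancy" placement could look like: score
  each machine by its free resources and give the new shard to the
  highest-scoring one. The MachineStats fields and the weights are invented
  for illustration; nothing here exists in Solr or Hadoop.)

import java.util.List;

// Illustrative only: a capacity score per machine, used by a central
// controller to decide which box should host the next shard. Weights are arbitrary.
public class ShardPlacer
{
    public static class MachineStats
    {
        public double cpuIdleFraction;   // 0.0 - 1.0
        public long freeDiskBytes;
        public long freeHeapBytes;
    }

    // Returns the index of the machine that should receive the new shard.
    public static int pickMachine(List<MachineStats> machines)
    {
        int best = 0;
        double bestScore = -1.0;
        for (int i = 0; i < machines.size(); i++)
        {
            MachineStats m = machines.get(i);
            double score = 0.5 * m.cpuIdleFraction
                         + 0.3 * (m.freeDiskBytes / (double) (1L << 40))  // free space in TB
                         + 0.2 * (m.freeHeapBytes / (double) (1L << 30)); // free heap in GB
            if (score > bestScore)
            {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }
}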
 
  I think Solr and Nutch are in desperate need of this central component
  (must not be SPOF!) for shard management.
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
   From: marcusherou [EMAIL PROTECTED]
   To: solr-user@lucene.apache.org
   Sent: Friday, May 9, 2008 2:37:19 AM
   Subject: Re: Solr feasibility with terabyte-scale data
  
  
   Hi.
  
   I will as well head into a path like yours within some months from
  now.
   Currently I have an index of ~10M docs and only store id's in the
  index for
   performance and distribution reasons. When we enter a new market I'm
   assuming we will soon hit 100M and quite soon after that 1G documents.
  Each
   document have in average about 3-5k data.
  
   We will use a GlusterFS installation with RAID1 (or RAID10) SATA
  enclosures
   as shared storage (think of it as a SAN or shared storage at least,
  one
   mount point). Hope this will be the right choice, only future can
  tell.
  
   Since we are developing a search engine I frankly don't think even
  having
   100's of SOLR instances serving the index will cut it performance wise
  if we
   have one big index. I totally agree with the others claiming that you
  most
   definitely will go OOE or hit some other constraints of SOLR if you
  must
   have the whole result in memory sort it and create a xml response. I
  did hit
   such constraints when I couldn't afford the instances to have enough
  memory
   and I had only 1M of docs back then. And think of it... Optimizing a
  TB
   index

Re: Solr feasibility with terabyte-scale data

2008-05-10 Thread Marcus Herou
Thanks Ken.

I will take a look be sure of that :)

Kindly

//Marcus

On Fri, May 9, 2008 at 10:26 PM, Ken Krugler [EMAIL PROTECTED]
wrote:

 Hi Marcus,

  It seems a lot of what you're describing is really similar to MapReduce,
 so I think Otis' suggestion to look at Hadoop is a good one: it might
 prevent a lot of headaches and they've already solved a lot of the tricky
  problems. There are a number of ridiculously sized projects using it to solve
 their scale problems, not least Yahoo...


 You should also look at a new project called Katta:

 http://katta.wiki.sourceforge.net/

 First code check-in should be happening this weekend, so I'd wait until
 Monday to take a look :)

 -- Ken


  On 9 May 2008, at 01:17, Marcus Herou wrote:

  Cool.

 Since you must certainly already have a good partitioning scheme, could
 you
 elaborate on high level how you set this up ?

 I'm certain that I will shoot myself in the foot both once and twice
 before
 getting it right but this is what I'm good at; to never stop trying :)
 However it is nice to start playing at least on the right side of the
 football field so a little push in the back would be really helpful.

 Kindly

 //Marcus



 On Fri, May 9, 2008 at 9:36 AM, James Brady [EMAIL PROTECTED]
 
 wrote:

  Hi, we have an index of ~300GB, which is at least approaching the
 ballpark
 you're in.

 Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
 index so we can just scale out horizontally across commodity hardware
 with
 no problems at all. We're also using the multicore features available in
 development Solr version to reduce granularity of core size by an order
 of
 magnitude: this makes for lots of small commits, rather than few long
 ones.

 There was mention somewhere in the thread of document collections: if
 you're going to be filtering by collection, I'd strongly recommend
 partitioning too. It makes scaling so much less painful!

 James


 On 8 May 2008, at 23:37, marcusherou wrote:

  Hi.

 I will as well head into a path like yours within some months from now.
 Currently I have an index of ~10M docs and only store id's in the index
 for
 performance and distribution reasons. When we enter a new market I'm
 assuming we will soon hit 100M and quite soon after that 1G documents.
 Each
 document have in average about 3-5k data.

 We will use a GlusterFS installation with RAID1 (or RAID10) SATA
 enclosures
 as shared storage (think of it as a SAN or shared storage at least, one
 mount point). Hope this will be the right choice, only future can tell.

 Since we are developing a search engine I frankly don't think even
 having
 100's of SOLR instances serving the index will cut it performance wise
 if
 we
 have one big index. I totally agree with the others claiming that you
 most
 definitely will go OOE or hit some other constraints of SOLR if you
 must
 have the whole result in memory sort it and create a xml response. I
 did
 hit
 such constraints when I couldn't afford the instances to have enough
 memory
 and I had only 1M of docs back then. And think of it... Optimizing a TB
 index will take a long long time and you really want to have an
 optimized
 index if you want to reduce search time.

 I am thinking of a sharding solution where I fragment the index over
 the
 disk(s) and let each SOLR instance only have little piece of the total
 index. This will require a master database or namenode (or simpler just
 a
 properties file in each index dir) of some sort to know what docs is
 located
 on which machine or at least how many docs each shard have. This is to
 ensure that whenever you introduce a new SOLR instance with a new shard
 the
 master indexer will know what shard to prioritize. This is probably not
 enough either since all new docs will go to the new shard until it is
 filled
 (have the same size as the others) only then will all shards receive
 docs
 in
 a loadbalanced fashion. So whenever you want to add a new indexer you
 probably need to initiate a stealing process where it steals docs
 from
 the
 others until it reaches some sort of threshold (10 servers = each shard
 should have 1/10 of the docs or such).

 I think this will cut it and enabling us to grow with the data. I think
 doing a distributed reindexing will as well be a good thing when it
 comes
 to
 cutting both indexing and optimizing speed. Probably each indexer
 should
 buffer it's shard locally on RAID1 SCSI disks, optimize it and then
 just
 copy it to the main index to minimize the burden of the shared storage.

 Let's say the indexing part will be all fancy and working i TB scale
 now
 we
 come to searching. I personally believe after talking to other guys
 which
 have built big search engines that you need to introduce a controller
 like
 searcher on the client side which itself searches in all of the shards
 and
 merges the response. Perhaps Distributed Solr solves this and will love
 to
 test it whenever my new installation of servers and enclosures 

Re: Solr feasibility with terabyte-scale data

2008-05-10 Thread Marcus Herou
Hi Otis.

Thanks for the insights. Nice to get feedback from a Technorati guy. Nice to
see that your snippet is almost a copy of mine; that gives me the right gut
feeling about this :)

I'm quite familiar with Hadoop as you can see if you check out the code of
my OS project AbstractCache-
http://dev.tailsweep.com/projects/abstractcache/. AbstractCache is a project
which aims to create storage solutions based on the Map and SortedMap
interface. I use it everywhere in Tailsweep.com and used it as well at my
former employer Eniro.se (largest yellow pages site in Sweden). It has been
in constant development for five years.

Since I'm a cluster freak of nature I love a project named GlusterFS, where
they have managed to create a system without master/slave[s] and a NameNode.
The advantage of this is that it is a lot more scalable; the drawback is that
you can get into split-brain situations, which people on the mailing list are
complaining about. Anyway, I tend to solve this with JGroups membership, where
the coordinator can be any machine in the cluster, but in the group-joining
process the first machine to join gets the privilege of becoming coordinator.
But even with JGroups you can run into trouble with race conditions of all
kinds (distributed locks, for example).

I've created an alternative to the Hadoop file system (mostly for fun) where
you just add an object to the cluster and based on what algorithm you choose
it is Raided or striped across the cluster.

Anyway this was off topic but I think my experience in building membership
aware clusters will help me in this particular case.

Kindly

//Marcus



On Fri, May 9, 2008 at 6:54 PM, Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Marcus,

 You are headed in the right direction.

 We've built a system like this at Technorati (Lucene, not Solr) and had
 components like the namenode or controller that you mention.  If you
 look at the Hadoop project, you will see something similar in concept
 (NameNode), though it deals with raw data blocks, their placement in the
 cluster, etc.  As a matter of fact, I am currently running its re-balancer
 in order to move some of the blocks around in the cluster.  That matches
 what you are describing for moving documents from one shard to the other.
  Of course, you can simplify things and just have this central piece be
 aware of any new servers and simply get it to place any new docs on the new
 servers and create a new shard there.  Or you can get fancy and take into
 consideration the hardware resources - the CPU, the disk space, the memory,
 and use that to figure out how much each machine in your cluster can handle
 and maximize its use based on this knowledge. :)

 I think Solr and Nutch are in desperate need of this central component
 (must not be SPOF!) for shard management.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
  From: marcusherou [EMAIL PROTECTED]
  To: solr-user@lucene.apache.org
  Sent: Friday, May 9, 2008 2:37:19 AM
  Subject: Re: Solr feasibility with terabyte-scale data
 
 
  Hi.
 
  I will as well head into a path like yours within some months from now.
  Currently I have an index of ~10M docs and only store id's in the index
 for
  performance and distribution reasons. When we enter a new market I'm
  assuming we will soon hit 100M and quite soon after that 1G documents.
 Each
  document have in average about 3-5k data.
 
  We will use a GlusterFS installation with RAID1 (or RAID10) SATA
 enclosures
  as shared storage (think of it as a SAN or shared storage at least, one
  mount point). Hope this will be the right choice, only future can tell.
 
  Since we are developing a search engine I frankly don't think even having
  100's of SOLR instances serving the index will cut it performance wise if
 we
  have one big index. I totally agree with the others claiming that you
 most
  definitely will go OOE or hit some other constraints of SOLR if you must
  have the whole result in memory sort it and create a xml response. I did
 hit
  such constraints when I couldn't afford the instances to have enough
 memory
  and I had only 1M of docs back then. And think of it... Optimizing a TB
  index will take a long long time and you really want to have an optimized
  index if you want to reduce search time.
 
  I am thinking of a sharding solution where I fragment the index over the
  disk(s) and let each SOLR instance only have little piece of the total
  index. This will require a master database or namenode (or simpler just a
  properties file in each index dir) of some sort to know what docs is
 located
  on which machine or at least how many docs each shard have. This is to
  ensure that whenever you introduce a new SOLR instance with a new shard
 the
  master indexer will know what shard to prioritize. This is probably not
  enough either since all new docs will go to the new shard until it is
 filled
  (have the same size as the others) only

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread marcusherou

Hi.

I too will head down a path like yours within some months from now.
Currently I have an index of ~10M docs and only store ids in the index, for
performance and distribution reasons. When we enter a new market I'm assuming
we will soon hit 100M and quite soon after that 1G documents. Each document
has on average about 3-5 kB of data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures
as shared storage (think of it as a SAN, or shared storage at least: one
mount point). Hope this will be the right choice; only the future can tell.

Since we are developing a search engine, I frankly don't think even having
hundreds of Solr instances serving the index will cut it performance-wise if
we have one big index. I totally agree with the others claiming that you will
most definitely go OOM or hit some other constraint of Solr if you must hold
the whole result set in memory, sort it and create an XML response. I hit
such constraints when I couldn't afford to give the instances enough memory,
and I had only 1M docs back then. And think of it... optimizing a TB-sized
index will take a long, long time, and you really want an optimized index if
you want to reduce search time.

I am thinking of a sharding solution where I fragment the index over the
disk(s) and let each Solr instance hold only a little piece of the total
index. This will require a master database or namenode of some sort (or,
simpler, just a properties file in each index dir) that knows which docs are
located on which machine, or at least how many docs each shard has. This is
to ensure that whenever you introduce a new Solr instance with a new shard,
the master indexer will know which shard to prioritize. This is probably not
enough either, since all new docs will go to the new shard until it is filled
(has the same size as the others); only then will all shards receive docs in
a load-balanced fashion. So whenever you want to add a new indexer you
probably need to initiate a stealing process where it steals docs from the
others until it reaches some sort of threshold (10 servers = each shard
should have 1/10 of the docs, or such), roughly as sketched below.
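
(A minimal sketch of that placement/stealing bookkeeping; ShardBalancer and
ShardInfo are hypothetical and not part of Solr.)

import java.util.List;

// Illustrative only: route new docs to the emptiest shard and compute how far
// each shard is from its fair share, so a rebalancing/stealing job knows what to move.
public class ShardBalancer
{
    public static class ShardInfo
    {
        public String name;
        public long docCount;
    }

    // New documents go to the shard that currently holds the fewest docs,
    // so a freshly added (empty) shard is prioritized automatically.
    public static ShardInfo pickShardForNewDoc(List<ShardInfo> shards)
    {
        ShardInfo target = shards.get(0);
        for (ShardInfo shard : shards)
        {
            if (shard.docCount < target.docCount)
            {
                target = shard;
            }
        }
        return target;
    }

    // How far a shard is from its fair share (10 servers => 1/10 of all docs each).
    // Positive: it should give docs away; negative: it should "steal" that many.
    public static long surplus(ShardInfo shard, long totalDocs, int shardCount)
    {
        return shard.docCount - totalDocs / shardCount;
    }
}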

I think this will cut it, enabling us to grow with the data. I also think
distributed reindexing will be a good thing when it comes to cutting both
indexing and optimizing time. Probably each indexer should buffer its shard
locally on RAID1 SCSI disks, optimize it and then just copy it to the main
index to minimize the burden on the shared storage.

Let's say the indexing part is all fancy and working at TB scale; now we come
to searching. After talking to other people who have built big search engines,
I personally believe you need to introduce a controller-like searcher on the
client side which itself searches all of the shards and merges the responses.
Perhaps Distributed Solr solves this; I will love to test it whenever my new
installation of servers and enclosures is finished.

Currently my idea is something like this.
public Page<Document> search(SearchDocumentCommand sdc)
{
    Set<Integer> ids = documentIndexers.keySet();
    int nrOfSearchers = ids.size();
    int totalItems = 0;
    Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
    for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
    {
        Integer id = iterator.next();
        // Each shard id maps to a list of replicas; pick one at random.
        List<DocumentIndexer> indexers = documentIndexers.get(id);
        DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
        // Scale the requested page down to a per-shard page before querying.
        SearchDocumentCommand sdc2 = copy(sdc);
        sdc2.setPage(sdc.getPage() / nrOfSearchers);
        Page<Document> res = indexer.search(sdc2);
        totalItems += res.getTotalItems();
        docs.addAll(res);
    }

    // Re-sort the merged results if the caller asked for a particular order.
    if (sdc.getComparator() != null)
    {
        Collections.sort(docs, sdc.getComparator());
    }

    docs.setTotalItems(totalItems);

    return docs;
}

This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers. I
switch back and forth between Solr and raw Lucene for benchmarking and
comparison, so I have two implementations of DocumentIndexer
(SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy.

I think this approach is quite OK, but the paging logic is broken, I think.
And since the shards are queried one after the other, the search time will at
best grow in proportion to the number of searchers, probably a lot worse. To
get more speed, each document indexer should be queried in a separate thread
with something like EDU.oswego.cs.dl.util.concurrent.FutureResult in
conjunction with a thread pool. The future result times out after, let's say,
750 msec and the client ignores all searchers that are slower. Probably some
performance metrics should be gathered about each searcher so the client
knows which indexers to prefer over the others.
But of course if you have 50 searchers, having each client thread spawn yet
another 50 threads isn't a good thing either. So perhaps 
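
For illustration, a rough sketch of that fan-out-with-timeout idea using
java.util.concurrent (the successor of the EDU.oswego package);
DocumentIndexer, Page, Document and SearchDocumentCommand are the classes
from the snippet above, and the per-shard time budget is just the 750 msec
example figure:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only: query every shard in parallel and drop shards that fail to
// answer within the per-shard time budget.
public class ParallelSearcher
{
    private final ExecutorService pool = Executors.newFixedThreadPool(20);

    public List<Page<Document>> searchAll(List<DocumentIndexer> indexers,
                                          final SearchDocumentCommand sdc,
                                          long timeoutMillis)
    {
        List<Future<Page<Document>>> futures = new ArrayList<Future<Page<Document>>>();
        for (final DocumentIndexer indexer : indexers)
        {
            futures.add(pool.submit(new Callable<Page<Document>>()
            {
                public Page<Document> call()
                {
                    return indexer.search(sdc);
                }
            }));
        }

        List<Page<Document>> results = new ArrayList<Page<Document>>();
        for (Future<Page<Document>> future : futures)
        {
            try
            {
                // e.g. 750 msec; slower shards are simply left out of this response.
                results.add(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
            }
            catch (TimeoutException e)
            {
                future.cancel(true);
            }
            catch (Exception e)
            {
                // Interrupted or failed shard: skip it as well.
            }
        }
        return results;
    }
}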

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread James Brady
Hi, we have an index of ~300GB, which is at least approaching the  
ballpark you're in.


Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable'
index, so we can just scale out horizontally across commodity hardware with
no problems at all. We're also using the multicore features available in the
development Solr version to reduce the granularity of core size by an order
of magnitude: this makes for lots of small commits, rather than a few long
ones.


There was mention somewhere in the thread of document collections: if  
you're going to be filtering by collection, I'd strongly recommend  
partitioning too. It makes scaling so much less painful!


James

On 8 May 2008, at 23:37, marcusherou wrote:



Hi.

I will as well head into a path like yours within some months from  
now.
Currently I have an index of ~10M docs and only store id's in the  
index for

performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G  
documents. Each

document have in average about 3-5k data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA  
enclosures
as shared storage (think of it as a SAN or shared storage at least,  
one
mount point). Hope this will be the right choice, only future can  
tell.


Since we are developing a search engine I frankly don't think even  
having
100's of SOLR instances serving the index will cut it performance  
wise if we
have one big index. I totally agree with the others claiming that  
you most
definitely will go OOE or hit some other constraints of SOLR if you  
must
have the whole result in memory sort it and create a xml response. I  
did hit
such constraints when I couldn't afford the instances to have enough  
memory
and I had only 1M of docs back then. And think of it... Optimizing a  
TB
index will take a long long time and you really want to have an  
optimized

index if you want to reduce search time.

I am thinking of a sharding solution where I fragment the index over  
the

disk(s) and let each SOLR instance only have little piece of the total
index. This will require a master database or namenode (or simpler  
just a
properties file in each index dir) of some sort to know what docs is  
located

on which machine or at least how many docs each shard have. This is to
ensure that whenever you introduce a new SOLR instance with a new  
shard the
master indexer will know what shard to prioritize. This is probably  
not
enough either since all new docs will go to the new shard until it  
is filled
(have the same size as the others) only then will all shards receive  
docs in

a loadbalanced fashion. So whenever you want to add a new indexer you
probably need to initiate a stealing process where it steals docs  
from the
others until it reaches some sort of threshold (10 servers = each  
shard

should have 1/10 of the docs or such).

I think this will cut it and enabling us to grow with the data. I  
think
doing a distributed reindexing will as well be a good thing when it  
comes to
cutting both indexing and optimizing speed. Probably each indexer  
should
buffer it's shard locally on RAID1 SCSI disks, optimize it and then  
just
copy it to the main index to minimize the burden of the shared  
storage.


Let's say the indexing part will be all fancy and working i TB scale  
now we
come to searching. I personally believe after talking to other guys  
which
have built big search engines that you need to introduce a  
controller like
searcher on the client side which itself searches in all of the  
shards and
merges the response. Perhaps Distributed Solr solves this and will  
love to
test it whenever my new installation of servers and enclosures is  
finished.


Currently my idea is something like this.
public Page<Document> search(SearchDocumentCommand sdc)
    {
    Set<Integer> ids = documentIndexers.keySet();
    int nrOfSearchers = ids.size();
    int totalItems = 0;
    Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
    for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
    {
    Integer id = iterator.next();
    List<DocumentIndexer> indexers = documentIndexers.get(id);
    DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
    SearchDocumentCommand sdc2 = copy(sdc);
    sdc2.setPage(sdc.getPage()/nrOfSearchers);
    Page<Document> res = indexer.search(sdc2);
    totalItems += res.getTotalItems();
    docs.addAll(res);
    }

    if(sdc.getComparator() != null)
    {
    Collections.sort(docs, sdc.getComparator());
    }

    docs.setTotalItems(totalItems);

    return docs;
    }

This is my RaidedDocumentIndexer which wraps a set of  
DocumentIndexers. I
switch from Solr to raw Lucene back and forth benchmarking and  
comparing
stuff so I have two implementations of DocumentIndexer  
(SolrDocumentIndexer

and 

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Marcus Herou
Cool.

Since you must certainly already have a good partitioning scheme, could you
elaborate, at a high level, on how you set it up?

I'm certain that I will shoot myself in the foot both once and twice before
getting it right, but this is what I'm good at: never stopping trying :)
However, it is nice to start playing at least on the right side of the
football field, so a little push in the back would be really helpful.

Kindly

//Marcus



On Fri, May 9, 2008 at 9:36 AM, James Brady [EMAIL PROTECTED]
wrote:

 Hi, we have an index of ~300GB, which is at least approaching the ballpark
 you're in.

 Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
 index so we can just scale out horizontally across commodity hardware with
 no problems at all. We're also using the multicore features available in
 development Solr version to reduce granularity of core size by an order of
 magnitude: this makes for lots of small commits, rather than few long ones.

 There was mention somewhere in the thread of document collections: if
 you're going to be filtering by collection, I'd strongly recommend
 partitioning too. It makes scaling so much less painful!

 James


 On 8 May 2008, at 23:37, marcusherou wrote:


 Hi.

 I will as well head into a path like yours within some months from now.
 Currently I have an index of ~10M docs and only store id's in the index
 for
 performance and distribution reasons. When we enter a new market I'm
 assuming we will soon hit 100M and quite soon after that 1G documents.
 Each
 document have in average about 3-5k data.

 We will use a GlusterFS installation with RAID1 (or RAID10) SATA
 enclosures
 as shared storage (think of it as a SAN or shared storage at least, one
 mount point). Hope this will be the right choice, only future can tell.

 Since we are developing a search engine I frankly don't think even having
 100's of SOLR instances serving the index will cut it performance wise if
 we
 have one big index. I totally agree with the others claiming that you most
 definitely will go OOE or hit some other constraints of SOLR if you must
 have the whole result in memory sort it and create a xml response. I did
 hit
 such constraints when I couldn't afford the instances to have enough
 memory
 and I had only 1M of docs back then. And think of it... Optimizing a TB
 index will take a long long time and you really want to have an optimized
 index if you want to reduce search time.

 I am thinking of a sharding solution where I fragment the index over the
 disk(s) and let each SOLR instance only have little piece of the total
 index. This will require a master database or namenode (or simpler just a
 properties file in each index dir) of some sort to know what docs is
 located
 on which machine or at least how many docs each shard have. This is to
 ensure that whenever you introduce a new SOLR instance with a new shard
 the
 master indexer will know what shard to prioritize. This is probably not
 enough either since all new docs will go to the new shard until it is
 filled
 (have the same size as the others) only then will all shards receive docs
 in
 a loadbalanced fashion. So whenever you want to add a new indexer you
 probably need to initiate a stealing process where it steals docs from
 the
 others until it reaches some sort of threshold (10 servers = each shard
 should have 1/10 of the docs or such).

 I think this will cut it and enabling us to grow with the data. I think
 doing a distributed reindexing will as well be a good thing when it comes
 to
 cutting both indexing and optimizing speed. Probably each indexer should
 buffer it's shard locally on RAID1 SCSI disks, optimize it and then just
 copy it to the main index to minimize the burden of the shared storage.

 Let's say the indexing part will be all fancy and working i TB scale now
 we
 come to searching. I personally believe after talking to other guys which
 have built big search engines that you need to introduce a controller like
 searcher on the client side which itself searches in all of the shards and
 merges the response. Perhaps Distributed Solr solves this and will love to
 test it whenever my new installation of servers and enclosures is
 finished.

 Currently my idea is something like this.
 public Page<Document> search(SearchDocumentCommand sdc)
    {
    Set<Integer> ids = documentIndexers.keySet();
    int nrOfSearchers = ids.size();
    int totalItems = 0;
    Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
    for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
    {
    Integer id = iterator.next();
    List<DocumentIndexer> indexers = documentIndexers.get(id);
    DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
    SearchDocumentCommand sdc2 = copy(sdc);
    sdc2.setPage(sdc.getPage()/nrOfSearchers);
    Page<Document> res = indexer.search(sdc2);
    totalItems += 

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread James Brady
So our problem is made easier by having complete index partitionability by a
user_id field. That means at one end of the spectrum we could have one
monolithic index for everyone, while at the other end of the spectrum we
could have individual cores for each user_id.


At the moment, we've gone for a halfway house somewhere in the middle:  
I've got several large EC2 instances (currently 3), each running a  
single master/slave pair of Solr servers. The servers have several  
cores (currently 10 - a guesstimated good number). As new users  
register, I automatically distribute them across cores. I would like  
to do something with clustering users based on geo-location so that  
cores will get 'time off' for maintenance and optimization for that  
user cluster's nighttime. I'd also like to move in the 1 core per user  
direction as dynamic core creation becomes available.
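
(A small sketch of what such a user-to-core mapping might look like; the host
list, port and core naming below are invented for illustration and are not
how this particular setup is actually wired.)

// Illustrative only: route a user to one of several Solr cores spread over a
// few servers, so that a given user's documents always land in the same core.
public class UserCoreRouter
{
    private final String[] hosts;      // e.g. the EC2 instances
    private final int coresPerHost;    // e.g. 10 cores per Solr server

    public UserCoreRouter(String[] hosts, int coresPerHost)
    {
        this.hosts = hosts;
        this.coresPerHost = coresPerHost;
    }

    // A stable function of the user id decides both the host and the core name.
    public String coreUrlFor(long userId)
    {
        int totalCores = hosts.length * coresPerHost;
        int slot = (int) Math.abs(userId % totalCores);
        String host = hosts[slot / coresPerHost];
        int core = slot % coresPerHost;
        return "http://" + host + ":8983/solr/core" + core;
    }
}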


It seems a lot of what you're describing is really similar to  
MapReduce, so I think Otis' suggestion to look at Hadoop is a good  
one: it might prevent a lot of headaches and they've already solved a  
lot of the tricky problems. There are a number of ridiculously sized
projects using it to solve their scale problems, not least Yahoo...


James

On 9 May 2008, at 01:17, Marcus Herou wrote:


Cool.

Since you must certainly already have a good partitioning scheme,  
could you

elaborate on high level how you set this up ?

I'm certain that I will shoot myself in the foot both once and twice  
before

getting it right but this is what I'm good at; to never stop trying :)
However it is nice to start playing at least on the right side of the
football field so a little push in the back would be really helpful.

Kindly

//Marcus



On Fri, May 9, 2008 at 9:36 AM, James Brady [EMAIL PROTECTED] 


wrote:

Hi, we have an index of ~300GB, which is at least approaching the  
ballpark

you're in.

Lucky for us, to coin a phrase we have an 'embarassingly  
partitionable'
index so we can just scale out horizontally across commodity  
hardware with
no problems at all. We're also using the multicore features  
available in
development Solr version to reduce granularity of core size by an  
order of
magnitude: this makes for lots of small commits, rather than few  
long ones.


There was mention somewhere in the thread of document collections: if
you're going to be filtering by collection, I'd strongly recommend
partitioning too. It makes scaling so much less painful!

James


On 8 May 2008, at 23:37, marcusherou wrote:



Hi.

I will as well head into a path like yours within some months from  
now.
Currently I have an index of ~10M docs and only store id's in the  
index

for
performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G  
documents.

Each
document have in average about 3-5k data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA
enclosures
as shared storage (think of it as a SAN or shared storage at  
least, one
mount point). Hope this will be the right choice, only future can  
tell.


Since we are developing a search engine I frankly don't think even  
having
100's of SOLR instances serving the index will cut it performance  
wise if

we
have one big index. I totally agree with the others claiming that  
you most
definitely will go OOE or hit some other constraints of SOLR if  
you must
have the whole result in memory sort it and create a xml response.  
I did

hit
such constraints when I couldn't afford the instances to have enough
memory
and I had only 1M of docs back then. And think of it... Optimizing  
a TB
index will take a long long time and you really want to have an  
optimized

index if you want to reduce search time.

I am thinking of a sharding solution where I fragment the index  
over the
disk(s) and let each SOLR instance only have little piece of the  
total
index. This will require a master database or namenode (or simpler  
just a

properties file in each index dir) of some sort to know what docs is
located
on which machine or at least how many docs each shard have. This  
is to
ensure that whenever you introduce a new SOLR instance with a new  
shard

the
master indexer will know what shard to prioritize. This is  
probably not
enough either since all new docs will go to the new shard until it  
is

filled
(have the same size as the others) only then will all shards  
receive docs

in
a loadbalanced fashion. So whenever you want to add a new indexer  
you
probably need to initiate a stealing process where it steals  
docs from

the
others until it reaches some sort of threshold (10 servers = each  
shard

should have 1/10 of the docs or such).

I think this will cut it and enabling us to grow with the data. I  
think
doing a distributed reindexing will as well be a good thing when  
it comes

to
cutting both indexing and optimizing speed. Probably each indexer  
should
buffer it's shard locally on RAID1 SCSI disks, optimize it 

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Ken Krugler

Hi Marcus,

It seems a lot of what you're describing is really similar to 
MapReduce, so I think Otis' suggestion to look at Hadoop is a good 
one: it might prevent a lot of headaches and they've already solved 
a lot of the tricky problems. There are a number of ridiculously sized 
projects using it to solve their scale problems, not least Yahoo...


You should also look at a new project called Katta:

http://katta.wiki.sourceforge.net/

First code check-in should be happening this weekend, so I'd wait 
until Monday to take a look :)


-- Ken


On 9 May 2008, at 01:17, Marcus Herou wrote:


Cool.

Since you must certainly already have a good partitioning scheme, could you
elaborate on high level how you set this up ?

I'm certain that I will shoot myself in the foot both once and twice before
getting it right but this is what I'm good at; to never stop trying :)
However it is nice to start playing at least on the right side of the
football field so a little push in the back would be really helpful.

Kindly

//Marcus



On Fri, May 9, 2008 at 9:36 AM, James Brady [EMAIL PROTECTED]
wrote:


Hi, we have an index of ~300GB, which is at least approaching the ballpark
you're in.

Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
index so we can just scale out horizontally across commodity hardware with
no problems at all. We're also using the multicore features available in
development Solr version to reduce granularity of core size by an order of
magnitude: this makes for lots of small commits, rather than few long ones.

There was mention somewhere in the thread of document collections: if
you're going to be filtering by collection, I'd strongly recommend
partitioning too. It makes scaling so much less painful!

James


On 8 May 2008, at 23:37, marcusherou wrote:


Hi.

I will as well head into a path like yours within some months from now.
Currently I have an index of ~10M docs and only store id's in the index
for
performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G documents.
Each
document have in average about 3-5k data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA
enclosures
as shared storage (think of it as a SAN or shared storage at least, one
mount point). Hope this will be the right choice, only future can tell.

Since we are developing a search engine I frankly don't think even having
100's of SOLR instances serving the index will cut it performance wise if
we
have one big index. I totally agree with the others claiming that you most
definitely will go OOE or hit some other constraints of SOLR if you must
have the whole result in memory sort it and create a xml response. I did
hit
such constraints when I couldn't afford the instances to have enough
memory
and I had only 1M of docs back then. And think of it... Optimizing a TB
index will take a long long time and you really want to have an optimized
index if you want to reduce search time.

I am thinking of a sharding solution where I fragment the index over the
disk(s) and let each SOLR instance only have little piece of the total
index. This will require a master database or namenode (or simpler just a
properties file in each index dir) of some sort to know what docs is
located
on which machine or at least how many docs each shard have. This is to
ensure that whenever you introduce a new SOLR instance with a new shard
the
master indexer will know what shard to prioritize. This is probably not
enough either since all new docs will go to the new shard until it is
filled
(have the same size as the others) only then will all shards receive docs
in
a loadbalanced fashion. So whenever you want to add a new indexer you
probably need to initiate a stealing process where it steals docs from
the
others until it reaches some sort of threshold (10 servers = each shard
should have 1/10 of the docs or such).

I think this will cut it and enabling us to grow with the data. I think
doing a distributed reindexing will as well be a good thing when it comes
to
cutting both indexing and optimizing speed. Probably each indexer should
buffer it's shard locally on RAID1 SCSI disks, optimize it and then just
copy it to the main index to minimize the burden of the shared storage.

Let's say the indexing part will be all fancy and working i TB scale now
we
come to searching. I personally believe after talking to other guys which
have built big search engines that you need to introduce a controller like
searcher on the client side which itself searches in all of the shards and
merges the response. Perhaps Distributed Solr solves this and will love to
test it whenever my new installation of servers and enclosures is
finished.

Currently my idea is something like this.
public Page<Document> search(SearchDocumentCommand sdc)
  {
  Set<Integer> ids = documentIndexers.keySet();
  int nrOfSearchers = ids.size();
  int totalItems = 0;
  Page<Document> docs = new 

RE: Solr feasibility with terabyte-scale data

2008-05-09 Thread Lance Norskog
A useful schema trick: MD5 or SHA-1 ids. We generate our unique ID with the
MD5 cryptographic checksumming algorithm. This takes any number of bytes of
data and creates a 128-bit value that is effectively random. To date there
are no reports of two different real-world datasets accidentally producing
the same checksum (deliberately constructed MD5 collisions exist, but they
don't matter for this use).

This gives some handy things: 
a) a fixed-size unique ID field, giving fixed space requirements,
The standard representation of this is 32 hex characters (16 bytes), i.e.
'deadbeefdeadbeefdeadbeefdeadbeef'. You could make a special 128-bit Lucene
data type for this.

b) the ability to change your mind about the uniqueness formula for your
data,

c) a handy primary key for cross-correlating in other databases,
Think external DBs which supply data for some records. The primary
key is the MD5 signature.

d) the ability to randomly pick subsets of your data.
The record 'id:deadbeefdeadbeefdeadbeefdeadbeef' will match the
wildcard string 'deadbeef*'. And 'd*'.
'd*' selects an essentially random subset of your data, 1/16 of the
total size; a two-character prefix such as 'de*' gives 1/256.
This works because MD5 output is, for all practical purposes, uniformly
distributed.
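
A minimal sketch of generating such an id with the JDK's MessageDigest
(standard Java API, nothing Solr-specific):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Turn arbitrary input bytes into a 32-character hex MD5 id, suitable for a
// uniqueKey field.
public class Md5Id
{
    public static String md5Hex(byte[] data) throws NoSuchAlgorithmException
    {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(data);            // 16 bytes = 128 bits
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest)
        {
            hex.append(String.format("%02x", b));   // each byte as two hex chars
        }
        return hex.toString();
    }
}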

This should go on a wiki page 'SchemaDesignTips'.

Cheers,

Lance Norskog





Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Otis Gospodnetic
You can't believe how much it pains me to see such a nice piece of work live
so separately. But I also think I know why it happened :(. Do you know if
Stefan & Co. have the intention to bring it under some contrib/ around here?
Would that not make sense?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
 From: Ken Krugler [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Friday, May 9, 2008 4:26:19 PM
 Subject: Re: Solr feasibility with terabyte-scale data
 
 Hi Marcus,
 
 It seems a lot of what you're describing is really similar to 
 MapReduce, so I think Otis' suggestion to look at Hadoop is a good 
 one: it might prevent a lot of headaches and they've already solved 
 a lot of the tricky problems. There a number of ridiculously sized 
 projects using it to solve their scale problems, not least Yahoo...
 
 You should also look at a new project called Katta:
 
 http://katta.wiki.sourceforge.net/
 
 First code check-in should be happening this weekend, so I'd wait 
 until Monday to take a look :)
 
 -- Ken
 
 On 9 May 2008, at 01:17, Marcus Herou wrote:
 
 Cool.
 
 Since you must certainly already have a good partitioning scheme, could you
 elaborate on high level how you set this up ?
 
 I'm certain that I will shoot myself in the foot both once and twice before
 getting it right but this is what I'm good at; to never stop trying :)
 However it is nice to start playing at least on the right side of the
 football field so a little push in the back would be really helpful.
 
 Kindly
 
 //Marcus
 
 
 
 On Fri, May 9, 2008 at 9:36 AM, James Brady 
 wrote:
 
 Hi, we have an index of ~300GB, which is at least approaching the ballpark
 you're in.
 
 Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
 index so we can just scale out horizontally across commodity hardware with
 no problems at all. We're also using the multicore features available in
 development Solr version to reduce granularity of core size by an order of
 magnitude: this makes for lots of small commits, rather than few long ones.
 
 There was mention somewhere in the thread of document collections: if
 you're going to be filtering by collection, I'd strongly recommend
 partitioning too. It makes scaling so much less painful!
 
 James
 
 
 On 8 May 2008, at 23:37, marcusherou wrote:
 
 Hi.
 
 I will as well head into a path like yours within some months from now.
 Currently I have an index of ~10M docs and only store id's in the index
 for
 performance and distribution reasons. When we enter a new market I'm
 assuming we will soon hit 100M and quite soon after that 1G documents.
 Each
 document have in average about 3-5k data.
 
 We will use a GlusterFS installation with RAID1 (or RAID10) SATA
 enclosures
 as shared storage (think of it as a SAN or shared storage at least, one
 mount point). Hope this will be the right choice, only future can tell.
 
 Since we are developing a search engine I frankly don't think even having
 100's of SOLR instances serving the index will cut it performance wise if
 we
 have one big index. I totally agree with the others claiming that you most
 definitely will go OOE or hit some other constraints of SOLR if you must
 have the whole result in memory sort it and create a xml response. I did
 hit
 such constraints when I couldn't afford the instances to have enough
 memory
 and I had only 1M of docs back then. And think of it... Optimizing a TB
 index will take a long long time and you really want to have an optimized
 index if you want to reduce search time.
 
 I am thinking of a sharding solution where I fragment the index over the
 disk(s) and let each SOLR instance only have little piece of the total
 index. This will require a master database or namenode (or simpler just a
 properties file in each index dir) of some sort to know what docs is
 located
 on which machine or at least how many docs each shard have. This is to
 ensure that whenever you introduce a new SOLR instance with a new shard
 the
 master indexer will know what shard to prioritize. This is probably not
 enough either since all new docs will go to the new shard until it is
 filled
 (have the same size as the others) only then will all shards receive docs
 in
 a loadbalanced fashion. So whenever you want to add a new indexer you
 probably need to initiate a stealing process where it steals docs from
 the
 others until it reaches some sort of threshold (10 servers = each shard
 should have 1/10 of the docs or such).
 
 I think this will cut it and enabling us to grow with the data. I think
 doing a distributed reindexing will as well be a good thing when it comes
 to
 cutting both indexing and optimizing speed. Probably each indexer should
 buffer it's shard locally on RAID1 SCSI disks, optimize it and then just
 copy it to the main index to minimize the burden of the shared storage.
 
 Let's say the indexing part will be all fancy and working i TB scale now
 we
 come

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Ken Krugler

Hi Otis,

You can't believe how much it pains me to see such nice piece of 
work live so separately.  But I also think I know why it happened 
:(.  Do you know if Stefan & Co. have the intention to bring it 
under some contrib/ around here?  Would that not make sense?


I'm not working on the project, so I can't speak for Stefan &
friends...but my guess is that it's going to live separately as 
something independent of Solr/Nutch. If you view it as search 
plumbing that's usable in multiple environments, then that makes 
sense. If you view it as replicating core Solr (or Nutch) 
functionality, then it sucks. Not sure what the outcome will be.


-- Ken




- Original Message 

 From: Ken Krugler [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Friday, May 9, 2008 4:26:19 PM
 Subject: Re: Solr feasibility with terabyte-scale data

 Hi Marcus,

 It seems a lot of what you're describing is really similar to
 MapReduce, so I think Otis' suggestion to look at Hadoop is a good
 one: it might prevent a lot of headaches and they've already solved
 a lot of the tricky problems. There a number of ridiculously sized
 projects using it to solve their scale problems, not least Yahoo...

 You should also look at a new project called Katta:

 http://katta.wiki.sourceforge.net/

 First code check-in should be happening this weekend, so I'd wait
 until Monday to take a look :)

 -- Ken

 On 9 May 2008, at 01:17, Marcus Herou wrote:
 
 Cool.
 
 Since you must certainly already have a good partitioning 
scheme, could you

 elaborate on high level how you set this up ?
 
 I'm certain that I will shoot myself in the foot both once and 
twice before

 getting it right but this is what I'm good at; to never stop trying :)
 However it is nice to start playing at least on the right side of the
 football field so a little push in the back would be really helpful.
 
 Kindly
 
 //Marcus
 
 
 
 On Fri, May 9, 2008 at 9:36 AM, James Brady
 wrote:
 
 Hi, we have an index of ~300GB, which is at least approaching 
the ballpark

 you're in.
 
 Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
 index so we can just scale out horizontally across commodity 
hardware with

 no problems at all. We're also using the multicore features available in
 development Solr version to reduce granularity of core size by 
an order of
 magnitude: this makes for lots of small commits, rather than 
few long ones.

 
 There was mention somewhere in the thread of document collections: if
 you're going to be filtering by collection, I'd strongly recommend
 partitioning too. It makes scaling so much less painful!
 
 James
 
 
 On 8 May 2008, at 23:37, marcusherou wrote:
 
 Hi.
 
 I will as well head into a path like yours within some months from now.
 Currently I have an index of ~10M docs and only store id's in the index
 for
 performance and distribution reasons. When we enter a new market I'm
 assuming we will soon hit 100M and quite soon after that 1G documents.
 Each
 document have in average about 3-5k data.
 
 We will use a GlusterFS installation with RAID1 (or RAID10) SATA

  enclosures

 as shared storage (think of it as a SAN or shared storage at least, one
 mount point). Hope this will be the right choice, only future can tell.
 
 Since we are developing a search engine I frankly don't think 
even having
 100's of SOLR instances serving the index will cut it 
performance wise if

 we
 have one big index. I totally agree with the others claiming 
that you most

 definitely will go OOE or hit some other constraints of SOLR if you must
 have the whole result in memory sort it and create a xml response. I did
 hit
 such constraints when I couldn't afford the instances to have enough

  memory

 and I had only 1M of docs back then. And think of it... Optimizing a TB
 index will take a long long time and you really want to have 
an optimized

 index if you want to reduce search time.
 
 I am thinking of a sharding solution where I fragment the index over the
 disk(s) and let each SOLR instance only have little piece of the total
 index. This will require a master database or namenode (or 
simpler just a

 properties file in each index dir) of some sort to know what docs is
 located
 on which machine or at least how many docs each shard have. This is to
 ensure that whenever you introduce a new SOLR instance with a new shard
 the
 master indexer will know what shard to prioritize. This is probably not
 enough either since all new docs will go to the new shard until it is
 filled
 (have the same size as the others) only then will all shards 
receive docs

 in
 a loadbalanced fashion. So whenever you want to add a new indexer you
 probably need to initiate a stealing process where it steals docs from
 the
 others until it reaches some sort of threshold (10 servers = each shard
 should have 1/10 of the docs or such).
 
 I think this will cut it and enabling us to grow with the data. I think
 doing a distributed reindexing

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Otis Gospodnetic
From what I can tell from the overview on http://katta.wiki.sourceforge.net/, 
it's a partial replication of Solr/Nutch functionality, plus some goodies.  It 
might have been better to work those goodies into some friendly contrib/ be it 
Solr, Nutch, Hadoop, or Lucene.  Anyhow, let's see what happens there! :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
 From: Ken Krugler [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Friday, May 9, 2008 5:37:19 PM
 Subject: Re: Solr feasibility with terabyte-scale data
 
 Hi Otis,
 
 You can't believe how much it pains me to see such nice piece of 
 work live so separately.  But I also think I know why it happened 
 :(.  Do you know if Stefan & Co. have the intention to bring it 
 under some contrib/ around here?  Would that not make sense?
 
 I'm not working on the project, so I can't speak for Stefan &
 friends...but my guess is that it's going to live separately as 
 something independent of Solr/Nutch. If you view it as search 
 plumbing that's usable in multiple environments, then that makes 
 sense. If you view it as replicating core Solr (or Nutch) 
 functionality, then it sucks. Not sure what the outcome will be.
 
 -- Ken
 
 
 
 - Original Message 
   From: Ken Krugler 
   To: solr-user@lucene.apache.org
   Sent: Friday, May 9, 2008 4:26:19 PM
   Subject: Re: Solr feasibility with terabyte-scale data
 
   Hi Marcus,
 
   It seems a lot of what you're describing is really similar to
   MapReduce, so I think Otis' suggestion to look at Hadoop is a good
   one: it might prevent a lot of headaches and they've already solved
   a lot of the tricky problems. There a number of ridiculously sized
   projects using it to solve their scale problems, not least Yahoo...
 
   You should also look at a new project called Katta:
 
   http://katta.wiki.sourceforge.net/
 
   First code check-in should be happening this weekend, so I'd wait
   until Monday to take a look :)
 
   -- Ken
 
   On 9 May 2008, at 01:17, Marcus Herou wrote:
   
   Cool.
   
   Since you must certainly already have a good partitioning 
 scheme, could you
   elaborate on high level how you set this up ?
   
   I'm certain that I will shoot myself in the foot both once and 
 twice before
   getting it right but this is what I'm good at; to never stop trying :)
   However it is nice to start playing at least on the right side of the
   football field so a little push in the back would be really helpful.
   
   Kindly
   
   //Marcus
   
   
   
   On Fri, May 9, 2008 at 9:36 AM, James Brady
   wrote:
   
   Hi, we have an index of ~300GB, which is at least approaching 
 the ballpark
   you're in.
   
   Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
   index so we can just scale out horizontally across commodity 
 hardware with
   no problems at all. We're also using the multicore features available 
  in
   development Solr version to reduce granularity of core size by 
 an order of
   magnitude: this makes for lots of small commits, rather than 
 few long ones.
   
   There was mention somewhere in the thread of document collections: if
   you're going to be filtering by collection, I'd strongly recommend
   partitioning too. It makes scaling so much less painful!
   
   James
   
   
   On 8 May 2008, at 23:37, marcusherou wrote:
   
   Hi.
   
   I will as well head into a path like yours within some months from 
  now.
   Currently I have an index of ~10M docs and only store id's in the 
  index
   for
   performance and distribution reasons. When we enter a new market I'm
   assuming we will soon hit 100M and quite soon after that 1G documents.
   Each
   document have in average about 3-5k data.
   
   We will use a GlusterFS installation with RAID1 (or RAID10) SATA
enclosures
   as shared storage (think of it as a SAN or shared storage at least, 
  one
   mount point). Hope this will be the right choice, only future can 
  tell.
   
   Since we are developing a search engine I frankly don't think 
 even having
   100's of SOLR instances serving the index will cut it 
 performance wise if
   we
   have one big index. I totally agree with the others claiming 
 that you most
   definitely will go OOE or hit some other constraints of SOLR if you 
  must
   have the whole result in memory sort it and create a xml response. I 
  did
   hit
   such constraints when I couldn't afford the instances to have enough
memory
   and I had only 1M of docs back then. And think of it... Optimizing a 
  TB
   index will take a long long time and you really want to have 
 an optimized
   index if you want to reduce search time.
   
   I am thinking of a sharding solution where I fragment the index over 
  the
   disk(s) and let each SOLR instance only have little piece of the total
   index. This will require a master database or namenode (or 
 simpler just a
   properties file in each index dir) of some sort

Re: Solr feasibility with terabyte-scale data

2008-01-23 Thread Phillip Farber
For sure this is a problem. We have considered some strategies. One
might be to use a dictionary to clean up the OCR, but that gets hard for
proper names and technical jargon. Another is to use stop words (which
has the unfortunate side effect of making phrase searches like "to be or
not to be" impossible). I've heard you can't make a silk purse out of a
sow's ear ...


Phil



Erick Erickson wrote:

Just to add another wrinkle, how clean is your OCR? I've seen it
range from very nice (i.e. 99.9% of the words are actually words) to
horrible (60%+ of the words are nonsense). I saw one attempt
to OCR a family tree. As in a stylized tree with the data
hand-written along the various branches in every orientation. Not a
recognizable word in the bunch <G>

Best
Erick

On Jan 22, 2008 2:05 PM, Phillip Farber [EMAIL PROTECTED] wrote:



Ryan McKinley wrote:

We are considering Solr 1.2 to index and search a terabyte-scale
dataset of OCR.  Initially our requirements are simple: basic
tokenizing, score sorting only, no faceting.   The schema is simple
too.  A document consists of a numeric id, stored and indexed and a
large text field, indexed not stored, containing the OCR typically
~1.4Mb.  Some limited faceting or additional metadata fields may be
added later.

I have not done anything on this scale...  but with:
https://issues.apache.org/jira/browse/SOLR-303 it will be possible to
split a large index into many smaller indices and return the union of
all results.  This may or may not be necessary depending on what the
data actually looks like (if your text just uses 100 words, your index
may not be that big)

How many documents are you talking about?


Currently 1M docs @ ~1.4M/doc.  Scaling to 7M docs.  This is OCR so we
are talking perhaps 50K words total to index so as you point out the
index might not be too big.  It's the *data* that is big, not the
*index*, right?  So I don't think SOLR-303 (distributed search) is
required here.
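
For reference, should the corpus grow to where distributed search is needed
after all, here is a sketch of what a SOLR-303-style query looks like once the
feature is available: it is exposed as a shards request parameter, one request
hits any instance and Solr merges the per-shard results. Host names below are
placeholders.

  // ShardedQuery.java - builds a distributed-search URL; hosts are hypothetical
  import java.net.URLEncoder;

  public class ShardedQuery {
      public static void main(String[] args) throws Exception {
          String q = URLEncoder.encode("silk purse", "UTF-8");
          String shards = "solr1:8983/solr,solr2:8983/solr,solr3:8983/solr";
          String url = "http://solr1:8983/solr/select?q=" + q
                     + "&fl=id,score&rows=10"
                     + "&shards=" + shards;   // union of results across shards
          System.out.println(url);
      }
  }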

 Obviously as the number of documents increases the index size must
increase to some degree -- I think linearly?  But what index size will
result for 7M documents over 50K words where we're talking just 2 fields
per doc: 1 id field and one OCR field of ~1.4M?  Ballpark?

Regarding single word queries, do you think, say, 0.5 sec/query to
return 7M score-ranked IDs is possible/reasonable in this scenario?



Should we expect Solr indexing time to slow significantly as we scale
up?  What kind of query performance could we expect?  Is it totally
naive even to consider Solr at this kind of scale?


You may want to check out the lucene benchmark stuff
http://lucene.apache.org/java/docs/benchmarks.html



http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html



ryan






RE: Solr feasibility with terabyte-scale data

2008-01-23 Thread Lance Norskog
 We use two indexed copies of the same text, one with stemming and stopwords
and the other with neither.  We do phrase search on the second.
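
A minimal Lucene-flavoured sketch of that two-field setup (not Lance's actual
configuration): the same OCR text indexed once through an analyzer that
lowercases and drops stopwords, and once through a whitespace analyzer that
keeps every token for phrase queries. A stemming analyzer from Lucene's contrib
(e.g. Snowball) could replace the default one; field names are made up.

  // DualFieldIndexer.java - illustrative sketch only
  import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class DualFieldIndexer {
      public static void main(String[] args) throws Exception {
          // Default analysis (lowercase + English stopwords) for the "ocr" field;
          // the "ocr_exact" field keeps every token so phrases like
          // "to be or not to be" stay searchable.
          PerFieldAnalyzerWrapper analyzer =
              new PerFieldAnalyzerWrapper(new StandardAnalyzer());
          analyzer.addAnalyzer("ocr_exact", new WhitespaceAnalyzer());

          IndexWriter writer = new IndexWriter("/tmp/ocr-index", analyzer, true);

          String ocr = "to be or not to be";
          Document doc = new Document();
          doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
          doc.add(new Field("ocr", ocr, Field.Store.NO, Field.Index.TOKENIZED));
          doc.add(new Field("ocr_exact", ocr, Field.Store.NO, Field.Index.TOKENIZED));
          writer.addDocument(doc);
          writer.close();
      }
  }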

You might use two different OCR implementations and cross-correlate the
output.  

Lance

-Original Message-
From: Phillip Farber [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 23, 2008 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr feasibility with terabyte-scale data

For sure this is a problem.  We have considered some strategies.  One might
be to use a dictionary to clean up the OCR, but that gets hard for proper
names and technical jargon. Another is to use stop words (which has the
unfortunate side effect of making phrase searches like "to be or not to be"
impossible).  I've heard you can't make a silk purse out of a sow's ear ...

Phil





Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Phillip Farber



Ryan McKinley wrote:


We are considering Solr 1.2 to index and search a terabyte-scale 
dataset of OCR.  Initially our requirements are simple: basic 
tokenizing, score sorting only, no faceting.   The schema is simple 
too.  A document consists of a numeric id, stored and indexed and a 
large text field, indexed not stored, containing the OCR typically 
~1.4Mb.  Some limited faceting or additional metadata fields may be 
added later.


I have not done anything on this scale...  but with:
https://issues.apache.org/jira/browse/SOLR-303 it will be possible to 
split a large index into many smaller indices and return the union of 
all results.  This may or may not be necessary depending on what the 
data actually looks like (if your text just uses 100 words, your index
may not be that big)


How many documents are you talking about?



Currently 1M docs @ ~1.4M/doc.  Scaling to 7M docs.  This is OCR so we 
are talking perhaps 50K words total to index so as you point out the 
 index might not be too big.  It's the *data* that is big, not the
 *index*, right?  So I don't think SOLR-303 (distributed search) is
required here.


 Obviously as the number of documents increases the index size must
increase to some degree -- I think linearly?  But what index size will 
result for 7M documents over 50K words where we're talking just 2 fields 
per doc: 1 id field and one OCR field of ~1.4M?  Ballpark?


Regarding single word queries, do you think, say, 0.5 sec/query to 
return 7M score-ranked IDs is possible/reasonable in this scenario?





Should we expect Solr indexing time to slow significantly as we scale 
up?  What kind of query performance could we expect?  Is it totally 
naive even to consider Solr at this kind of scale?




You may want to check out the lucene benchmark stuff
http://lucene.apache.org/java/docs/benchmarks.html

http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html 




ryan




Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Phillip Farber



Otis Gospodnetic wrote:

Hi,
Some quick notes, since it's late here.

- You'll need to wait for SOLR-303 - there is no way even a big machine will be 
able to search such a large index in a reasonable amount of time, plus you may 
simply not have enough RAM for such a large index.



Are you basing this on data similar to what Mike Klaas outlines?

Quoting Mike Klaas:

That's 280K tokens per document, assuming ~5 chars/word.  That's 2 
trillion tokens.  Lucene's posting list compression is decent, but 
you're still talking about a minimum of 2-4TB for the index (that's 
assuming 1 or 2 bytes per token). 


and

Well, the average compressed posting list will be at least 80MB that 
needs to be read from the NAS and decoded and ranked.  Since the size is 
exponentially distributed, common terms will be much bigger and rarer 
terms much smaller.


End of quoting Mike Klaas.
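
The arithmetic behind those numbers, spelled out as a quick back-of-the-envelope
check (the 1-2 bytes per token is Mike's assumption, not a measurement):

  // IndexSizeEstimate.java - back-of-the-envelope only
  public class IndexSizeEstimate {
      public static void main(String[] args) {
          double bytesPerDoc   = 1.4e6;   // ~1.4 MB of OCR text per document
          double charsPerToken = 5.0;     // ~5 chars per word
          double docs          = 7e6;     // target corpus size

          double tokensPerDoc = bytesPerDoc / charsPerToken;   // ~280K tokens/doc
          double totalTokens  = tokensPerDoc * docs;           // ~2 trillion tokens

          System.out.printf("tokens/doc   ~ %.0f%n", tokensPerDoc);
          System.out.printf("total tokens ~ %.2e%n", totalTokens);
          System.out.printf("index size   ~ %.1f-%.1f TB%n",
                  totalTokens * 1 / 1e12,       // 1 byte per token
                  totalTokens * 2 / 1e12);      // 2 bytes per token
      }
  }

This lands at roughly 280,000 tokens/doc, about 2e12 tokens in total, and a
2.0-3.9 TB index, matching the 2-4 TB figure quoted above.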

We would need all 7M ids scored so we could push them through a filter 
query to reduce them to a much smaller number on the order of 100-10,000 
representing just those that correspond to items in a collection.


So to ask again, do you think it's possible to do this in, say, under 15 
seconds?  (I think I'm giving up on 0.5 sec. ...)




- I'd suggest you wait for Solr 1.3 (or some -dev version that uses the 
about-to-be-released Lucene 2.3)...performance reasons.

- As for avoiding index duplication - how about having a SAN with a single copy 
of the index that all searchers (and the master) point to?




Yes, we're thinking of a single copy of the index, using hardware-based
snapshot technology for the readers, while a dedicated indexing Solr instance
updates the index.  Reasonable?
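
A rough sketch of that setup: each read-only searcher shares the index
directory on the NAS, and after the indexer finishes its nightly run you send
each one an empty commit so it reopens its searcher (and re-warms its caches)
against the updated index. Host names are placeholders; this is an
illustration, not a tested configuration.

  // RefreshReaders.java - posts <commit/> to each read-only Solr instance
  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class RefreshReaders {
      private static final String[] READERS = {
          "http://search1:8983/solr/update",   // hypothetical hosts
          "http://search2:8983/solr/update"
      };

      public static void main(String[] args) throws Exception {
          for (String reader : READERS) {
              HttpURLConnection conn =
                  (HttpURLConnection) new URL(reader).openConnection();
              conn.setDoOutput(true);
              conn.setRequestMethod("POST");
              conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
              OutputStream out = conn.getOutputStream();
              out.write("<commit/>".getBytes("UTF-8"));   // reopen searcher, warm caches
              out.close();
              System.out.println(reader + " -> HTTP " + conn.getResponseCode());
              conn.disconnect();
          }
      }
  }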





Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Phillip Farber [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Friday, January 18, 2008 5:26:21 PM
Subject: Solr feasibility with terabyte-scale data

Hello everyone,

We are considering Solr 1.2 to index and search a terabyte-scale dataset
of OCR.  Initially our requirements are simple: basic tokenizing, score
sorting only, no faceting.  The schema is simple too.  A document
consists of a numeric id, stored and indexed and a large text field,
indexed not stored, containing the OCR typically ~1.4Mb.  Some limited
faceting or additional metadata fields may be added later.


The data in question currently amounts to about 1.1Tb of OCR (about 1M
docs) which we expect to increase to 10Tb over time.  Pilot tests on the
desktop w/ 2.6 GHz P4 with 2.5 Gb memory, java 1Gb heap on ~180 Mb of
data via HTTP suggest we can index at a rate sufficient to keep up with
the inputs (after getting over the 1.1 Tb hump).  We envision nightly
commits/optimizes.


We expect to have a low QPS rate (<10) and probably will not need
millisecond query response.


Our environment makes available Apache on blade servers (Dell 1955 dual
dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*,
high-performance NAS system over a dedicated (out-of-band) GbE switch
(Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting
with 2 blades and will add as demands require.

While we have a lot of storage, the idea of master/slave Solr Collection
Distribution to add more Solr instances clearly means duplicating an
immense index.  Is it possible to use one instance to update the index
on NAS while other instances only read the index and commit to keep
their caches warm instead?


Should we expect Solr indexing time to slow significantly as we scale 
up?  What kind of query performance could we expect?  Is it totally 
naive even to consider Solr at this kind of scale?


Given these parameters is it realistic to think that Solr could handle 
the task?


Any advice/wisdom greatly appreciated,

Phil






Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Erick Erickson
Just to add another wrinkle, how clean is your OCR? I've seen it
range from very nice (i.e. 99.9% of the words are actually words) to
horrible (60%+ of the words are nonsense). I saw one attempt
to OCR a family tree. As in a stylized tree with the data
hand-written along the various branches in every orientation. Not a
recognizable word in the bunch <G>

Best
Erick

On Jan 22, 2008 2:05 PM, Phillip Farber [EMAIL PROTECTED] wrote:



 Ryan McKinley wrote:
 
  We are considering Solr 1.2 to index and search a terabyte-scale
  dataset of OCR.  Initially our requirements are simple: basic
  tokenizing, score sorting only, no faceting.   The schema is simple
  too.  A document consists of a numeric id, stored and indexed and a
  large text field, indexed not stored, containing the OCR typically
  ~1.4Mb.  Some limited faceting or additional metadata fields may be
  added later.
 
  I have not done anything on this scale...  but with:
  https://issues.apache.org/jira/browse/SOLR-303 it will be possible to
  split a large index into many smaller indices and return the union of
  all results.  This may or may not be necessary depending on what the
  data actually looks like (if your text just uses 100 words, your index
  may not be that big)
 
  How many documents are you talking about?
 

 Currently 1M docs @ ~1.4M/doc.  Scaling to 7M docs.  This is OCR so we
 are talking perhaps 50K words total to index so as you point out the
 index might not be too big.  It's the *data* that is big, not the
 *index*, right?  So I don't think SOLR-303 (distributed search) is
 required here.

  Obviously as the number of documents increases the index size must
 increase to some degree -- I think linearly?  But what index size will
 result for 7M documents over 50K words where we're talking just 2 fields
 per doc: 1 id field and one OCR field of ~1.4M?  Ballpark?

 Regarding single word queries, do you think, say, 0.5 sec/query to
 return 7M score-ranked IDs is possible/reasonable in this scenario?


 
  Should we expect Solr indexing time to slow significantly as we scale
  up?  What kind of query performance could we expect?  Is it totally
  naive even to consider Solr at this kind of scale?
 
 
  You may want to check out the lucene benchmark stuff
  http://lucene.apache.org/java/docs/benchmarks.html
 
 
 http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html
 
 
 
  ryan
 
 



Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Mike Klaas

On 22-Jan-08, at 4:20 PM, Phillip Farber wrote:


We would need all 7M ids scored so we could push them through a  
filter query to reduce them to a much smaller number on the order  
of 100-10,000 representing just those that correspond to items in a  
collection.


You could pass the filter to Solr to improve the speed dramatically.
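
For example, the collection restriction can be expressed as a Solr filter query
(fq) so that filtering and caching happen inside Solr rather than by
post-processing millions of ids on the client. The field and collection names
below are invented for illustration.

  // CollectionFilterQuery.java - builds a filtered query URL; names are hypothetical
  import java.net.URLEncoder;

  public class CollectionFilterQuery {
      public static void main(String[] args) throws Exception {
          String q  = URLEncoder.encode("some ocr query", "UTF-8");
          String fq = URLEncoder.encode("collection_id:coll123", "UTF-8");
          String url = "http://localhost:8983/solr/select"
                     + "?q=" + q
                     + "&fq=" + fq            // filter applied (and cached) in Solr
                     + "&fl=id,score&rows=100";
          System.out.println(url);
      }
  }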

So to ask again, do you think it's possible to do this in, say,  
under 15 seconds?  (I think I'm giving up on 0.5 sec. ...)


At this point, no one is going to be able to answer your question
unless they have done something similar.  The largest individual
index I've worked with is on the order of 10GB, and one thing I've
learned is not to extrapolate several orders of magnitude beyond my
experience.


-Mike


Re: Solr feasibility with terabyte-scale data

2008-01-20 Thread Otis Gospodnetic
Hi,
Some quick notes, since it's late here.

- You'll need to wait for SOLR-303 - there is no way even a big machine will be 
able to search such a large index in a reasonable amount of time, plus you may 
simply not have enough RAM for such a large index.

- I'd suggest you wait for Solr 1.3 (or some -dev version that uses the 
about-to-be-released Lucene 2.3)...performance reasons.

- As for avoiding index duplication - how about having a SAN with a single copy 
of the index that all searchers (and the master) point to?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Phillip Farber [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Friday, January 18, 2008 5:26:21 PM
Subject: Solr feasibility with terabyte-scale data

Hello everyone,

We are considering Solr 1.2 to index and search a terabyte-scale dataset
of OCR.  Initially our requirements are simple: basic tokenizing, score
sorting only, no faceting.  The schema is simple too.  A document
consists of a numeric id, stored and indexed and a large text field,
indexed not stored, containing the OCR typically ~1.4Mb.  Some limited
faceting or additional metadata fields may be added later.

The data in question currently amounts to about 1.1Tb of OCR (about 1M
docs) which we expect to increase to 10Tb over time.  Pilot tests on the
desktop w/ 2.6 GHz P4 with 2.5 Gb memory, java 1Gb heap on ~180 Mb of
data via HTTP suggest we can index at a rate sufficient to keep up with
the inputs (after getting over the 1.1 Tb hump).  We envision nightly
commits/optimizes.

We expect to have a low QPS rate (<10) and probably will not need
millisecond query response.

Our environment makes available Apache on blade servers (Dell 1955 dual
dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*,
high-performance NAS system over a dedicated (out-of-band) GbE switch
(Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting
with 2 blades and will add as demands require.

While we have a lot of storage, the idea of master/slave Solr Collection
Distribution to add more Solr instances clearly means duplicating an
immense index.  Is it possible to use one instance to update the index
on NAS while other instances only read the index and commit to keep
their caches warm instead?

Should we expect Solr indexing time to slow significantly as we scale 
up?  What kind of query performance could we expect?  Is it totally 
naive even to consider Solr at this kind of scale?

Given these parameters is it realistic to think that Solr could handle 
the task?

Any advice/wisdom greatly appreciated,

Phil






Re: Solr feasibility with terabyte-scale data

2008-01-19 Thread Ryan McKinley


We are considering Solr 1.2 to index and search a terabyte-scale dataset 
of OCR.  Initially our requirements are simple: basic tokenizing, score 
sorting only, no faceting.   The schema is simple too.  A document 
consists of a numeric id, stored and indexed and a large text field, 
indexed not stored, containing the OCR typically ~1.4Mb.  Some limited 
faceting or additional metadata fields may be added later.


I have not done anything on this scale...  but with:
https://issues.apache.org/jira/browse/SOLR-303 it will be possible to 
split a large index into many smaller indices and return the union of 
all results.  This may or may not be necessary depending on what the 
data actually looks like (if your text just uses 100 words, your index
may not be that big)


How many documents are you talking about?



Should we expect Solr indexing time to slow significantly as we scale 
up?  What kind of query performance could we expect?  Is it totally 
naive even to consider Solr at this kind of scale?




You may want to check out the lucene benchmark stuff
http://lucene.apache.org/java/docs/benchmarks.html

http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html


ryan