Re: Solr feasibility with terabyte-scale data
Quick reply to Otis and Ken. Otis: After a night's sleep I think you are absolutely right that some HPC grid like Hadoop or perhaps GlusterHPC should be used, regardless of my last comment. Ken: Looked at the arch of Katta and it looks really nice. I really believe that Katta could be something which lives as a subproject of Lucene, since it surely fills a gap that is not filled by Nutch. Nutch surely does similar stuff, but this could actually be something that Nutch uses as a component to distribute the crawled index. I will try to join the Stefan & friends team. Kindly //Marcus

On Sat, May 10, 2008 at 6:03 PM, Marcus Herou [EMAIL PROTECTED] wrote: [snip]
Re: Solr feasibility with terabyte-scale data
Thanks Ken. I will take a look, be sure of that :) Kindly //Marcus

On Fri, May 9, 2008 at 10:26 PM, Ken Krugler [EMAIL PROTECTED] wrote: [snip]
Re: Solr feasibility with terabyte-scale data
Hi Otis. Thanks for the insights. Nice to get feedback from a Technorati guy. Nice to see that your snippet is almost a copy of mine; gives me the right gut feeling about this :) I'm quite familiar with Hadoop, as you can see if you check out the code of my OS project AbstractCache - http://dev.tailsweep.com/projects/abstractcache/. AbstractCache is a project which aims to create storage solutions based on the Map and SortedMap interfaces. I use it everywhere in Tailsweep.com and used it as well at my former employer Eniro.se (largest yellow pages site in Sweden). It has been in constant development for five years. Since I'm a cluster freak of nature, I love a project named GlusterFS, where they have managed to create a system without master/slave[s] and NameNode. The advantage of this is that it is a lot more scalable; the drawback is that you can get into split-brain situations, which people on the mailing list are complaining about. Anyway, I tend to solve this with JGroups membership, where the coordinator can be any machine in the cluster, but in the group-joining process the first machine to join gets the privilege of becoming coordinator (see the sketch at the end of this message). But even with JGroups you can run into race conditions of all kinds (distributed locks, for example). I've created an alternative to the Hadoop file system (mostly for fun) where you just add an object to the cluster and, based on what algorithm you choose, it is raided or striped across the cluster. Anyway, this was off topic, but I think my experience in building membership-aware clusters will help me in this particular case. Kindly //Marcus

On Fri, May 9, 2008 at 6:54 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Marcus, You are headed in the right direction. We've built a system like this at Technorati (Lucene, not Solr) and had components like the namenode or controller that you mention. If you look at the Hadoop project, you will see something similar in concept (NameNode), though it deals with raw data blocks, their placement in the cluster, etc. As a matter of fact, I am currently running its re-balancer in order to move some of the blocks around in the cluster. That matches what you are describing for moving documents from one shard to the other. Of course, you can simplify things and just have this central piece be aware of any new servers, and simply get it to place any new docs on the new servers and create a new shard there. Or you can get fancy and take into consideration the hardware resources - the CPU, the disk space, the memory - and use that to figure out how much each machine in your cluster can handle, and maximize its use based on this knowledge. :) I think Solr and Nutch are in desperate need of this central component (must not be a SPOF!) for shard management.

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: marcusherou [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 2:37:19 AM Subject: Re: Solr feasibility with terabyte-scale data [snip]
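A minimal sketch of the coordinator election Marcus describes, against the JGroups 2.x API (JChannel, ReceiverAdapter): JGroups orders a View's member list by join time, so the usual convention is to treat the first member as coordinator. The class and cluster names are illustrative, not from his code:

    import org.jgroups.Address;
    import org.jgroups.JChannel;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    // Sketch: the first (oldest) member of the JGroups view acts as coordinator.
    public class CoordinatorAwareNode extends ReceiverAdapter {
        private JChannel channel;
        private volatile Address coordinator;

        public void start() throws Exception {
            channel = new JChannel();           // default protocol stack
            channel.setReceiver(this);
            channel.connect("indexer-cluster"); // illustrative cluster name
        }

        // Called on every membership change; the member list is ordered
        // by join time, so the oldest surviving member comes first.
        public void viewAccepted(View view) {
            coordinator = (Address) view.getMembers().get(0);
        }

        public boolean isCoordinator() {
            return channel.getLocalAddress().equals(coordinator);
        }
    }

Note this inherits the weakness Marcus mentions: under a network partition each side elects its own "first member", which is exactly the split-brain case.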
Re: Solr feasibility with terabyte-scale data
Hi. I will as well head down a path like yours within some months from now. Currently I have an index of ~10M docs and only store ids in the index, for performance and distribution reasons. When we enter a new market I'm assuming we will soon hit 100M, and quite soon after that 1G documents. Each document has on average about 3-5k of data. We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures as shared storage (think of it as a SAN, or shared storage at least: one mount point). Hope this will be the right choice; only the future can tell.

Since we are developing a search engine, I frankly don't think even having hundreds of Solr instances serving the index will cut it performance-wise if we have one big index. I totally agree with the others claiming that you most definitely will go OOM or hit some other constraint of Solr if you must have the whole result in memory, sort it, and create an XML response. I did hit such constraints when I couldn't afford the instances to have enough memory, and I had only 1M docs back then. And think of it... optimizing a TB index will take a long, long time, and you really want to have an optimized index if you want to reduce search time.

I am thinking of a sharding solution where I fragment the index over the disk(s) and let each Solr instance have only a little piece of the total index. This will require a master database or namenode (or, simpler, just a properties file in each index dir) of some sort, to know which docs are located on which machine, or at least how many docs each shard has. This is to ensure that whenever you introduce a new Solr instance with a new shard, the master indexer will know which shard to prioritize. This is probably not enough either, since all new docs will go to the new shard until it is filled (has the same size as the others); only then will all shards receive docs in a load-balanced fashion. So whenever you want to add a new indexer, you probably need to initiate a stealing process where it steals docs from the others until it reaches some sort of threshold (10 servers = each shard should have 1/10 of the docs, or such). I think this will cut it and enable us to grow with the data.

I think doing a distributed reindexing will as well be a good thing when it comes to cutting both indexing and optimizing time. Probably each indexer should buffer its shard locally on RAID1 SCSI disks, optimize it, and then just copy it to the main index to minimize the burden on the shared storage.

Let's say the indexing part will be all fancy and working at TB scale; now we come to searching. I personally believe, after talking to other guys who have built big search engines, that you need to introduce a controller-like searcher on the client side which itself searches all of the shards and merges the responses. Perhaps Distributed Solr solves this; I will love to test it whenever my new installation of servers and enclosures is finished. Currently my idea is something like this.
    public Page<Document> search(SearchDocumentCommand sdc) {
        Set<Integer> ids = documentIndexers.keySet();
        int nrOfSearchers = ids.size();
        int totalItems = 0;
        Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
        for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();) {
            Integer id = iterator.next();
            List<DocumentIndexer> indexers = documentIndexers.get(id);
            DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
            SearchDocumentCommand sdc2 = copy(sdc);
            sdc2.setPage(sdc.getPage() / nrOfSearchers);
            // Was search(sdc) in the original post; the copied command with
            // the adjusted page is presumably what was intended.
            Page<Document> res = indexer.search(sdc2);
            totalItems += res.getTotalItems();
            docs.addAll(res);
        }
        if (sdc.getComparator() != null) {
            Collections.sort(docs, sdc.getComparator());
        }
        docs.setTotalItems(totalItems);
        return docs;
    }

This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers. I switch between Solr and raw Lucene back and forth, benchmarking and comparing stuff, so I have two implementations of DocumentIndexer (SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy. I think this approach is quite OK, but the paging stuff is broken, I think. However, the searching speed will at best stay constant in proportion to the number of searchers, probably a lot worse. To get even more speed, each document indexer should be put into a separate thread, with something like EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread pool. The FutureResult times out after, let's say, 750 msec, and the client ignores all searchers which are slower. Probably some performance metrics should be gathered about each searcher, so the client knows which indexers to prefer over the others. But of course if you have 50 searchers, having each client thread spawn yet another 50 threads isn't a good thing either. So perhaps
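A minimal sketch of that threaded fan-out with a 750 msec budget, using java.util.concurrent (the java.util successor of the EDU.oswego classes named above). DocumentIndexer, SearchDocumentCommand and Page come from the message; everything else is illustrative. invokeAll cancels whatever has not finished when the timeout expires, so slow searchers are simply ignored:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    // Sketch: query all shards in parallel, keep only results arriving within 750 ms.
    public List<Page<Document>> searchAll(List<DocumentIndexer> shards,
                                          final SearchDocumentCommand sdc,
                                          ExecutorService pool) throws InterruptedException {
        List<Callable<Page<Document>>> tasks = new ArrayList<Callable<Page<Document>>>();
        for (final DocumentIndexer shard : shards) {
            tasks.add(new Callable<Page<Document>>() {
                public Page<Document> call() throws Exception {
                    return shard.search(sdc);
                }
            });
        }
        // Blocks for at most 750 ms; unfinished tasks come back cancelled.
        List<Future<Page<Document>>> futures = pool.invokeAll(tasks, 750, TimeUnit.MILLISECONDS);
        List<Page<Document>> results = new ArrayList<Page<Document>>();
        for (Future<Page<Document>> f : futures) {
            if (!f.isCancelled()) {
                try {
                    results.add(f.get()); // completed within the budget
                } catch (ExecutionException e) {
                    // this shard failed; skip it, as the message suggests
                }
            }
        }
        return results;
    }

Sharing one bounded pool across all client threads also avoids the 50-threads-spawning-50-threads problem raised at the end of the message.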
Re: Solr feasibility with terabyte-scale data
Hi, we have an index of ~300GB, which is at least approaching the ballpark you're in. Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable' index, so we can just scale out horizontally across commodity hardware with no problems at all. We're also using the multicore features available in the development Solr version to reduce the granularity of core size by an order of magnitude: this makes for lots of small commits, rather than a few long ones. There was mention somewhere in the thread of document collections: if you're going to be filtering by collection, I'd strongly recommend partitioning too. It makes scaling so much less painful! James

On 8 May 2008, at 23:37, marcusherou wrote: [snip]
Re: Solr feasibility with terabyte-scale data
Cool. Since you must certainly already have a good partitioning scheme, could you elaborate at a high level on how you set this up? I'm certain that I will shoot myself in the foot both once and twice before getting it right, but this is what I'm good at: to never stop trying :) However, it is nice to start playing at least on the right side of the football field, so a little push in the back would be really helpful. Kindly //Marcus

On Fri, May 9, 2008 at 9:36 AM, James Brady [EMAIL PROTECTED] wrote: [snip]
Re: Solr feasibility with terabyte-scale data
So our problem is made easier by having complete index partitionability by a user_id field. That means at one end of the spectrum we could have one monolithic index for everyone, while at the other end of the spectrum we could have individual cores for each user_id. At the moment, we've gone for a halfway house somewhere in the middle: I've got several large EC2 instances (currently 3), each running a single master/slave pair of Solr servers. The servers have several cores (currently 10 - a guesstimated good number). As new users register, I automatically distribute them across cores. I would like to do something with clustering users based on geo-location, so that cores can get 'time off' for maintenance and optimization during that user cluster's nighttime. I'd also like to move in the one-core-per-user direction as dynamic core creation becomes available. It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: it might prevent a lot of headaches, and they've already solved a lot of the tricky problems. There are a number of ridiculously sized projects using it to solve their scale problems, not least Yahoo... James

On 9 May 2008, at 01:17, Marcus Herou wrote: [snip]
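A minimal sketch of the register-time core assignment James describes; the stable-hash scheme and all names here are assumptions of mine, not his actual code:

    // Sketch: map a user_id to one of N cores on one of M hosts, stably.
    public class CoreAssigner {
        private final int hostCount;     // e.g. 3 EC2 instances
        private final int coresPerHost;  // e.g. 10 cores each

        public CoreAssigner(int hostCount, int coresPerHost) {
            this.hostCount = hostCount;
            this.coresPerHost = coresPerHost;
        }

        public String coreFor(String userId) {
            // Mask the sign bit rather than Math.abs(), which breaks on MIN_VALUE.
            int bucket = (userId.hashCode() & 0x7fffffff) % (hostCount * coresPerHost);
            int host = bucket / coresPerHost;
            int core = bucket % coresPerHost;
            return "host" + host + "/core" + core; // illustrative naming
        }
    }

Pure hashing reshuffles users when hosts are added, so a lookup table written once at registration time (which "I automatically distribute them" hints at) is probably the safer design.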
Re: Solr feasibility with terabyte-scale data
Hi Marcus, It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: it might prevent a lot of headaches, and they've already solved a lot of the tricky problems. There are a number of ridiculously sized projects using it to solve their scale problems, not least Yahoo... You should also look at a new project called Katta: http://katta.wiki.sourceforge.net/ First code check-in should be happening this weekend, so I'd wait until Monday to take a look :) -- Ken

On 9 May 2008, at 01:17, Marcus Herou wrote: [snip]
RE: Solr feasibility with terabyte-scale data
A useful schema trick: MD5 or SHA-1 ids. We generate our unique ID with the MD5 cryptographic checksumming algorithm. This takes any number of bytes of data and produces a 128-bit value: effectively 128 random bits. Engineered MD5 collisions have been demonstrated, but there are no reports of two different real-world datasets accidentally giving the same checksum. This gives some handy things:

a) A fixed-size unique ID field, giving fixed space requirements. The standard representation is 32 hex characters (16 bytes), i.e. 'deadbeefdeadbeefdeadbeefdeadbeef'. You could make a special 128-bit Lucene data type for this.

b) The ability to change your mind about the uniqueness formula for your data.

c) A handy primary key for cross-correlating in other databases. Think external DBs which supply data for some records; the primary key is the MD5 signature.

d) The ability to randomly pick subsets of your data. The record 'id:deadbeefdeadbeefdeadbeefdeadbeef' will match the wildcard string 'deadbeef*', and also 'd*'. 'd*' selects an essentially random subset of your data, 1/16 of the total size; fixing two leading hex digits (e.g. 'de*') gives 1/256, because MD5 output is so close to uniformly distributed.

This should go on a wiki page 'SchemaDesignTips'. Cheers, Lance Norskog
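A minimal sketch of generating such an id with the JDK's MessageDigest; the method name and input are illustrative:

    import java.math.BigInteger;
    import java.security.MessageDigest;

    // Sketch: turn arbitrary record bytes into a 32-hex-char MD5 id.
    public static String md5Id(byte[] recordBytes) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(recordBytes);          // 16 bytes = 128 bits
        // Zero-pad to a constant 32 characters so every id has fixed width.
        return String.format("%032x", new BigInteger(1, digest));
    }

A wildcard query like id:de* then samples roughly 1/256 of the corpus, as described above.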
Re: Solr feasibility with terabyte-scale data
You can't believe how much it pains me to see such a nice piece of work live so separately. But I also think I know why it happened :(. Do you know if Stefan & Co. have the intention to bring it under some contrib/ around here? Would that not make sense? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Ken Krugler [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 4:26:19 PM Subject: Re: Solr feasibility with terabyte-scale data [snip]
Re: Solr feasibility with terabyte-scale data
Hi Otis,

> You can't believe how much it pains me to see such a nice piece of work live so separately. But I also think I know why it happened :(. Do you know if Stefan & Co. have the intention to bring it under some contrib/ around here? Would that not make sense?

I'm not working on the project, so I can't speak for Stefan & friends... but my guess is that it's going to live separately, as something independent of Solr/Nutch. If you view it as search plumbing that's usable in multiple environments, then that makes sense. If you view it as replicating core Solr (or Nutch) functionality, then it sucks. Not sure what the outcome will be. -- Ken

- Original Message From: Ken Krugler [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 4:26:19 PM Subject: Re: Solr feasibility with terabyte-scale data [snip]
Re: Solr feasibility with terabyte-scale data
From what I can tell from the overview on http://katta.wiki.sourceforge.net/, it's a partial replication of Solr/Nutch functionality, plus some goodies. It might have been better to work those goodies into some friendly contrib/, be it Solr, Nutch, Hadoop, or Lucene. Anyhow, let's see what happens there! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Ken Krugler [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 5:37:19 PM Subject: Re: Solr feasibility with terabyte-scale data [snip]
Re: Solr feasibility with terabyte-scale data
For sure this is a problem. We have considered some strategies. One might be to use a dictionary to clean up the OCR, but that gets hard for proper names and technical jargon. Another is to use stop words (which has the unfortunate side effect of making phrase searches like "to be or not to be" impossible). I've heard you can't make a silk purse out of a sow's ear... Phil

Erick Erickson wrote: Just to add another wrinkle, how clean is your OCR? I've seen it range from very nice (i.e. 99.9% of the words are actually words) to horrible (60%+ of the words are nonsense). I saw one attempt to OCR a family tree. As in a stylized tree with the data hand-written along the various branches, in every orientation. Not a recognizable word in the bunch <g> Best, Erick

On Jan 22, 2008 2:05 PM, Phillip Farber [EMAIL PROTECTED] wrote: [snip]
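A minimal sketch of the dictionary-cleanup idea, assuming nothing more than an in-memory wordlist; as Phil notes, it falsely drops proper names and technical jargon, so treat it as a crude pre-filter only:

    import java.util.Set;

    // Sketch: drop OCR tokens that are not in a known-word list.
    public class OcrCleaner {
        private final Set<String> dictionary;

        public OcrCleaner(Set<String> dictionary) {
            this.dictionary = dictionary;
        }

        public String clean(String ocrText) {
            StringBuilder out = new StringBuilder();
            for (String token : ocrText.split("\\s+")) {
                // Normalize before lookup: lowercase, strip non-letters.
                String word = token.toLowerCase().replaceAll("[^a-z]", "");
                if (dictionary.contains(word)) {
                    out.append(token).append(' ');
                }
                // else: nonsense token, dropped (also drops names/jargon!)
            }
            return out.toString().trim();
        }
    }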
RE: Solr feasibility with terabyte-scale data
We use two indexed copies of the same text, one with stemming and stopwords and the other with neither. We do phrase search on the second. You might use two different OCR implementations and cross-correlate the output.

Lance

-Original Message- From: Phillip Farber [EMAIL PROTECTED] Sent: Wednesday, January 23, 2008 8:15 AM To: solr-user@lucene.apache.org Subject: Re: Solr feasibility with terabyte-scale data For sure this is a problem. We have considered some strategies. ...
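Lance's two-copies scheme, sketched against the current Lucene API (the thread itself predates these classes); the field names are illustrative:

    import java.nio.file.Paths;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class DualFieldIndexer {
        public static void main(String[] args) throws Exception {
            // ocr_stemmed: stemming + stopwords for recall; ocr_exact keeps
            // every token so phrases like "to be or not to be" stay searchable.
            Analyzer perField = new PerFieldAnalyzerWrapper(
                    new EnglishAnalyzer(),                          // stems, drops stopwords
                    Map.of("ocr_exact", new WhitespaceAnalyzer())); // keeps all tokens
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/indexes/ocr")),
                    new IndexWriterConfig(perField))) {
                Document doc = new Document();
                String ocr = "To be or not to be, that is the question.";
                doc.add(new TextField("ocr_stemmed", ocr, Field.Store.NO));
                doc.add(new TextField("ocr_exact", ocr, Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }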
Re: Solr feasibility with terabyte-scale data
Ryan McKinley wrote: We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed, and a large text field, indexed not stored, containing the OCR, typically ~1.4Mb. Some limited faceting or additional metadata fields may be added later.

I have not done anything on this scale... but with https://issues.apache.org/jira/browse/SOLR-303 it will be possible to split a large index into many smaller indices and return the union of all results. This may or may not be necessary depending on what the data actually looks like (if your text just uses 100 words, your index may not be that big). How many documents are you talking about?

Currently 1M docs @ ~1.4M/doc. Scaling to 7M docs. This is OCR, so we are talking perhaps 50K words total to index, so as you point out the index might not be too big. It's the *data* that is big, not the *index*, right? So I don't think SOLR-303 (distributed search) is required here. Obviously as the number of documents increases the index size must increase to some degree -- I think linearly? But what index size will result for 7M documents over 50K words where we're talking just 2 fields per doc: 1 id field and one OCR field of ~1.4M? Ballpark?

Regarding single word queries, do you think, say, 0.5 sec/query to return 7M score-ranked IDs is possible/reasonable in this scenario?

Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale?

You may want to check out the lucene benchmark stuff: http://lucene.apache.org/java/docs/benchmarks.html http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html

ryan
Re: Solr feasibility with terabyte-scale data
Otis Gospodnetic wrote: Hi, Some quick notes, since it's late here. - You'll need to wait for SOLR-303 - there is no way even a big machine will be able to search such a large index in a reasonable amount of time, plus you may simply not have enough RAM for such a large index.

Are you basing this on data similar to what Mike Klaas outlines? Quoting Mike Klaas: "That's 280K tokens per document, assuming ~5 chars/word. That's 2 trillion tokens. Lucene's posting list compression is decent, but you're still talking about a minimum of 2-4TB for the index (that's assuming 1 or 2 bytes per token)." and "Well, the average compressed posting list will be at least 80MB that needs to be read from the NAS and decoded and ranked. Since the size is exponentially distributed, common terms will be much bigger and rarer terms much smaller." End of quote from Mike Klaas. (A quick check of that arithmetic follows this message.)

We would need all 7M ids scored so we could push them through a filter query to reduce them to a much smaller number, on the order of 100-10,000, representing just those that correspond to items in a collection. So to ask again, do you think it's possible to do this in, say, under 15 seconds? (I think I'm giving up on 0.5 sec. ...)

- I'd suggest you wait for Solr 1.3 (or some -dev version that uses the about-to-be-released Lucene 2.3)...performance reasons.

- As for avoiding index duplication - how about having a SAN with a single copy of the index that all searchers (and the master) point to?

Yes, we're thinking of a single copy of the index using hardware-based snapshot technology for the readers, while a dedicated indexing Solr instance updates the index. Reasonable?

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Phillip Farber [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, January 18, 2008 5:26:21 PM Subject: Solr feasibility with terabyte-scale data Hello everyone, ...
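For readers who want to verify Mike Klaas's numbers, the back-of-envelope works out as follows, assuming (as he does) ~1.4 MB of OCR per document, ~5 bytes per token, and 1-2 bytes per posting:

    \[ \frac{1.4\times10^{6}\ \text{bytes/doc}}{5\ \text{bytes/token}} = 2.8\times10^{5}\ \text{tokens/doc} \]
    \[ 7\times10^{6}\ \text{docs} \times 2.8\times10^{5}\ \text{tokens/doc} \approx 2\times10^{12}\ \text{tokens} \]
    \[ 2\times10^{12}\ \text{tokens} \times (1\ \text{to}\ 2)\ \text{bytes/token} \approx 2\ \text{to}\ 4\ \text{TB of postings} \]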
Re: Solr feasibility with terabyte-scale data
Just to add another wrinkle, how clean is your OCR? I've seen it range from very nice (i.e. 99.9% of the words are actually words) to horrible (60%+ of the words are nonsense). I saw one attempt to OCR a family tree. As in a stylized tree with the data hand-written along the various branches in every orientation. Not a recognizable word in the bunch.

Best Erick

On Jan 22, 2008 2:05 PM, Phillip Farber [EMAIL PROTECTED] wrote: Ryan McKinley wrote: We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. ...
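A quick, unscientific way to put a number on cleanliness in the 99.9%-versus-60% sense Erick describes is the fraction of tokens found in a reference word list. The word-list path below is an assumption, and proper names and jargon will depress the score, so treat it as a rough gauge only:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class OcrQuality {
        // Fraction of tokens found in a reference word list: near 1.0 for
        // clean OCR, 0.6 or worse for the hopeless cases described above.
        public static double dictionaryHitRate(String ocrText, Set<String> dict) {
            String[] tokens = ocrText.toLowerCase().split("\\W+");
            if (tokens.length == 0) return 0.0;
            long hits = 0;
            for (String t : tokens) if (dict.contains(t)) hits++;
            return (double) hits / tokens.length;
        }

        public static void main(String[] args) throws Exception {
            Set<String> dict = new HashSet<>(
                    Files.readAllLines(Paths.get("/usr/share/dict/words")));
            // "nof" is a simulated OCR error: 5 of 6 tokens hit, ~0.83.
            System.out.println(dictionaryHitRate("to be or nof to be", dict));
        }
    }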
Re: Solr feasibility with terabyte-scale data
On 22-Jan-08, at 4:20 PM, Phillip Farber wrote: We would need all 7M ids scored so we could push them through a filter query to reduce them to a much smaller number on the order of 100-10,000 representing just those that correspond to items in a collection.

You could pass the filter to Solr to improve the speed dramatically.

So to ask again, do you think it's possible to do this in, say, under 15 seconds? (I think I'm giving up on 0.5 sec. ...)

At this point, no one is going to be able to answer your question unless they have done something similar. The largest individual index I've worked with is on the order of 10GB, and one thing I've learned is to not extrapolate several orders of magnitude beyond my experience.

-Mike
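Mike's suggestion to push the filter into Solr, sketched with the modern SolrJ client (the thread predates it); the URL, core name, and field names are made up for illustration:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FilteredSearch {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/ocr").build()) {
                SolrQuery q = new SolrQuery("ocr:whale");
                // The filter query restricts results to the collection's
                // items on the Solr side, so the client never has to pull
                // back and post-filter all 7M scored ids.
                q.addFilterQuery("collection_id:42");
                q.setFields("id", "score");
                q.setRows(10000);
                QueryResponse rsp = solr.query(q);
                System.out.println("hits: " + rsp.getResults().getNumFound());
            }
        }
    }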
Re: Solr feasibility with terabyte-scale data
Hi, Some quick notes, since it's late here.

- You'll need to wait for SOLR-303 - there is no way even a big machine will be able to search such a large index in a reasonable amount of time, plus you may simply not have enough RAM for such a large index.

- I'd suggest you wait for Solr 1.3 (or some -dev version that uses the about-to-be-released Lucene 2.3)...performance reasons.

- As for avoiding index duplication - how about having a SAN with a single copy of the index that all searchers (and the master) point to?

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Phillip Farber [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, January 18, 2008 5:26:21 PM Subject: Solr feasibility with terabyte-scale data

Hello everyone,

We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed, and a large text field, indexed not stored, containing the OCR, typically ~1.4Mb. Some limited faceting or additional metadata fields may be added later.

The data in question currently amounts to about 1.1Tb of OCR (about 1M docs), which we expect to increase to 10Tb over time. Pilot tests on the desktop (2.6 GHz P4 with 2.5 Gb memory, 1Gb java heap) on ~180 Mb of data via HTTP suggest we can index at a rate sufficient to keep up with the inputs (after getting over the 1.1 Tb hump). We envision nightly commits/optimizes. We expect a low QPS rate (10) and probably will not need millisecond query response.

Our environment makes available Apache on blade servers (Dell 1955 dual dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*, high-performance NAS system over a dedicated (out-of-band) GbE switch (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting with 2 blades and will add as demands require.

While we have a lot of storage, the idea of master/slave Solr Collection Distribution to add more Solr instances clearly means duplicating an immense index. Is it possible to use one instance to update the index on NAS while other instances only read the index and commit to keep their caches warm instead?

Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale? Given these parameters is it realistic to think that Solr could handle the task?

Any advice/wisdom greatly appreciated, Phil
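The single-copy pattern Otis suggests, and the read-only reopen Phil asks about, sketched at the Lucene level with the modern API; the mount point is made up, and Solr-side details such as cache autowarming are omitted:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class SharedIndexReader {
        public static void main(String[] args) throws Exception {
            // Every searcher node points at the single copy on shared storage.
            FSDirectory dir = FSDirectory.open(Paths.get("/mnt/san/solr-index"));
            DirectoryReader reader = DirectoryReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);

            // Periodically pick up segments committed by the one writer node;
            // openIfChanged returns null when nothing new has been committed.
            DirectoryReader newer = DirectoryReader.openIfChanged(reader);
            if (newer != null) {
                reader.close();
                reader = newer;
                searcher = new IndexSearcher(reader); // warm caches here
            }
        }
    }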
Re: Solr feasibility with terabyte-scale data
We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed, and a large text field, indexed not stored, containing the OCR, typically ~1.4Mb. Some limited faceting or additional metadata fields may be added later.

I have not done anything on this scale... but with https://issues.apache.org/jira/browse/SOLR-303 it will be possible to split a large index into many smaller indices and return the union of all results. This may or may not be necessary depending on what the data actually looks like (if your text just uses 100 words, your index may not be that big). How many documents are you talking about?

Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale?

You may want to check out the lucene benchmark stuff: http://lucene.apache.org/java/docs/benchmarks.html http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html

ryan
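To make the "union of all results" idea concrete, here is a sketch of the score-ordered merge a SOLR-303-style coordinator has to perform over shard responses. The Hit record and class are made-up names, each shard list is assumed already sorted by descending score, and real distributed search must also reconcile global idf, which this ignores:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class ShardMerge {
        record Hit(String id, float score) {}

        // K-way merge of per-shard result lists into one global top-n.
        static List<Hit> mergeTopN(List<List<Hit>> shards, int n) {
            // Heap entries are {shardIndex, positionInShard}, ordered so the
            // highest-scoring next hit across all shards comes out first.
            PriorityQueue<int[]> heap = new PriorityQueue<>(
                    Comparator.comparingDouble(
                            (int[] c) -> shards.get(c[0]).get(c[1]).score()).reversed());
            for (int s = 0; s < shards.size(); s++)
                if (!shards.get(s).isEmpty()) heap.add(new int[] {s, 0});
            List<Hit> out = new ArrayList<>();
            while (!heap.isEmpty() && out.size() < n) {
                int[] c = heap.poll();
                out.add(shards.get(c[0]).get(c[1]));
                if (c[1] + 1 < shards.get(c[0]).size())
                    heap.add(new int[] {c[0], c[1] + 1});
            }
            return out;
        }
    }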