Re: ClassCastException from custom request handler
OK, problem solved! Well, worked around. I gave up on the new-style plugin loading in a multicore Jetty setup, and packaged my plugin up in a rebuilt solr.war. I had tried this before, but only putting the class files in WEB-INF/lib. If I put a jar file in there instead, it works.

2009/8/4 Chantal Ackermann chantal.ackerm...@btelligent.de

James Brady schrieb:
Yeah, I was thinking T would be SolrRequestHandler too. Eclipse's debugger can't tell me...

You could try disassembling. Or: Eclipse opens classes in a very rudimentary format when there is no source code attached. Maybe it shows the actual return type there, instead of T.

Lots of other handlers are created with no problem before my plugin falls over, so I don't think it's a problem with T not being what we expected. Do you know of any working examples of plugins I can download and build in my environment to see what happens?

No, sorry. I've only overridden the EntityProcessor from DataImportHandler, and that is not configured in solrconfig.xml.

2009/8/4 Chantal Ackermann chantal.ackerm...@btelligent.de

The code is from AbstractPluginLoader in the Solr plugin package, 1.3 (the regular stable release, no svn checkout), lines 80-84:

  @SuppressWarnings("unchecked")
  protected T create(ResourceLoader loader, String name, String className, Node node) throws Exception {
    return (T) loader.newInstance(className, getDefaultPackages());
  }

--
http://twitter.com/goodgravy
512 300 4210
http://webmynd.com/
Sent from Bury, United Kingdom
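(For anyone repeating this workaround, the repackaging amounts to something like the following - the jar and path names here are made up:)

  mkdir exploded && cd exploded
  jar xf /path/to/original/solr.war
  cp /path/to/livecores.jar WEB-INF/lib/
  jar cf ../solr.war .

The rebuilt solr.war then replaces the one Jetty deploys, so the plugin rides inside the webapp's own classloader instead of sitting on Jetty's system classpath.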
Re: ClassCastException from custom request handler
Solr version: 1.3.0 694707

From solrconfig.xml:

  <requestHandler name="livecores" class="LiveCoresHandler" />

The handler:

  public class LiveCoresHandler extends RequestHandlerBase {
    public void init(NamedList args) { }
    public String getDescription() { return ""; }
    public String getSource() { return ""; }
    public String getSourceId() { return ""; }
    public NamedList getStatistics() { return new NamedList(); }
    public String getVersion() { return ""; }

    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
      Collection<String> names = req.getCore().getCoreDescriptor()
          .getCoreContainer().getCoreNames();
      rsp.add("cores", names);
      // if the cores are dynamic, you prob don't want to cache
      rsp.setHttpCaching(false);
    }
  }

2009/8/4 Avlesh Singh avl...@gmail.com

"I'm sure I have the class name right - changing it to something patently incorrect results in the expected org.apache.solr.common.SolrException: Error loading class ..., rather than the ClassCastException."

You are right about that, James. Which Solr version are you using? Can you please paste the relevant pieces of your solrconfig.xml and the request handler class you have created?

Cheers
Avlesh

On Mon, Aug 3, 2009 at 10:51 PM, James Brady james.colin.br...@gmail.com wrote:
[...]

--
http://twitter.com/goodgravy
512 300 4210
http://webmynd.com/
Sent from Bury, United Kingdom
Re: ClassCastException from custom request handler
Hi, the LiveCoresHandler is in the default package - the behaviour's the same if I have it in a properly namespaced package too...

The requestHandler name can either be a path (starting with '/') or a qt name: http://wiki.apache.org/solr/SolrRequestHandler

2009/8/4 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com

What is the package of LiveCoresHandler? I guess the requestHandler name should be name="/livecores".

On Tue, Aug 4, 2009 at 5:04 PM, James Brady james.colin.br...@gmail.com wrote:
[...]

--
http://twitter.com/goodgravy
512 300 4210
http://webmynd.com/
Sent from Bury, United Kingdom
Re: ClassCastException from custom request handler
There is *something* strange going on with classloaders; when I put my .class files in the right place in WEB-INF/lib in a repackaged solr.war file, it's not found by the plugin loader (Error loading class). So the plugin classloader isn't seeing stuff inside WEB-INF/lib. That explains why the plugin loader sees my class files when I point jetty.class.path at the right directory, but in that situation I also need to point jetty.class.path at the Solr JARs explicitly.

Still, how would ClassCastExceptions be caused by classloader paths not being set correctly? I don't follow you... To get a ClassCastException, the class to cast to must have been found. The cast-to class must either not be in the object's inheritance hierarchy, or be built against a different version, no?

2009/8/4 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com

I guess this is a classloader issue. It is worth trying to put it in the WEB-INF/lib of the solr.war.

On Tue, Aug 4, 2009 at 5:35 PM, James Brady james.colin.br...@gmail.com wrote:

Hi, the LiveCoresHandler is in the default package - the behaviour's the same if I have it in a properly namespaced package too... The requestHandler name can either be a path (starting with '/') or a qt name: http://wiki.apache.org/solr/SolrRequestHandler

Starting with '/' helps in accessing it directly.

[...]
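(As an aside for the archives: the "built against a different version" case is only half the story. Two classloaders each defining a class of the same name produces exactly this symptom too. A self-contained sketch - the jar path and class names are made up, this is not Solr's code:)

  import java.net.URL;
  import java.net.URLClassLoader;

  public class LoaderDemo {
      public static void main(String[] args) throws Exception {
          // Two parentless classloaders over the same jar each define
          // their own Class object for every class in it.
          URL[] cp = { new URL("file:///tmp/handler.jar") }; // hypothetical jar
          ClassLoader a = new URLClassLoader(cp, null);
          ClassLoader b = new URLClassLoader(cp, null);
          Class<?> ifaceFromA = a.loadClass("com.example.Handler");
          Object impl = b.loadClass("com.example.HandlerImpl").newInstance();
          // Same fully qualified name, different Class objects:
          System.out.println(ifaceFromA.isInstance(impl)); // prints false
          // ...so a cast to A's Handler throws ClassCastException,
          // even though the class was "found".
      }
  }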
Re: ClassCastException from custom request handler
Hi Chantal! I've included a stack trace below. I've attached a debugger to the server starting up, and it is finding my class file as expected... I agree it looks like something is wrong with how I've deployed the compiled code - perhaps different Solr versions at compile time and run time? However, I've checked and rechecked that and can't see a problem!

The actual ClassCastException is being thrown in an anonymous AbstractPluginLoader subclass's create method: http://svn.apache.org/viewvc/lucene/solr/tags/release-1.3.0/src/java/org/apache/solr/util/plugin/AbstractPluginLoader.java?revision=695557 It's the cast to SolrRequestHandler which fails.

Aug 4, 2009 4:24:25 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created /update/csv: org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper
Aug 4, 2009 4:24:25 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created /admin/: org.apache.solr.handler.admin.AdminHandlers
Aug 4, 2009 4:24:25 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassCastException: com.jmsbrdy.LiveCoresHandler
  at org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
  at org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
  at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:444)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:323)
  at org.apache.solr.core.CoreContainer.load(CoreContainer.java:216)
  at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:104)
  at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
  at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)

At the moment, my deployment is:
1. compile my single Java file with an Ant script (pointing at the Solr JARs from an exploded solr.war)
2. copy that class file's directory tree (com/jmsbrdy/LiveCoresHandler.class) to a lib directory in the root of my Jetty install
3. add lib to Jetty's classpath
4. add the Solr JARs from the exploded war to Jetty's classpath
5. start the server

Can you see any problems there?

2009/8/4 Chantal Ackermann chantal.ackerm...@btelligent.de

Hi James!

James Brady schrieb:
There is *something* strange going on with classloaders; when I put my .class files in the right place in WEB-INF/lib in a repackaged solr.war file, it's not found by the plugin loader (Error loading class). So the plugin classloader isn't seeing stuff inside WEB-INF/lib. That explains why the plugin loader sees my class files when I point jetty.class.path at the right directory, but in that situation I also need to point jetty.class.path at the Solr JARs explicitly.

You cannot be sure that it sees *your* files. It only sees a class that qualifies with the name that is requested in your code. It's obviously not the class the code expects, though - as it results in a ClassCastException at some point. It might help to have a look at where and why that casting went wrong. I wrote a custom EntityProcessor and deployed it first under WEB-INF/classes, and now in the plugin directory, and that worked without a problem. My first guess is that something with your packaging is wrong - what do you mean by default package? What is the full name of your class, and what does its path in the file system look like? Can you paste the stack trace of the exception?

Chantal

[...]
Re: ClassCastException from custom request handler
Yeah, I was thinking T would be SolrRequestHandler too. Eclipse's debugger can't tell me... Lots of other handlers are created with no problem before my plugin falls over, so I don't think it's a problem with T not being what we expected. Do you know of any working examples of plugins I can download and build in my environment to see what happens?

2009/8/4 Chantal Ackermann chantal.ackerm...@btelligent.de

The code is from AbstractPluginLoader in the Solr plugin package, 1.3 (the regular stable release, no svn checkout), lines 80-84:

  @SuppressWarnings("unchecked")
  protected T create(ResourceLoader loader, String name, String className, Node node) throws Exception {
    return (T) loader.newInstance(className, getDefaultPackages());
  }

--
http://twitter.com/goodgravy
512 300 4210
http://webmynd.com/
Sent from Bury, United Kingdom
ClassCastException from custom request handler
Hi, I'm creating a custom request handler to return a list of live cores in Solr. On startup, I get this exception for each core:

Jul 31, 2009 5:20:39 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassCastException: LiveCoresHandler
  at org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
  at org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
  at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:444)

I've tried a few variations on the class definition, including extending RequestHandlerBase (as suggested here: http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2 ) and implementing SolrRequestHandler directly.

I'm sure that the Solr libraries I built against and those I'm running on are the same version too, as I unzipped the Solr war file and copied the relevant jars out of there to build against.

Any ideas on what could be causing the ClassCastException? I've attached a debugger to the running Solr process but it didn't shed any light on the issue...

Thanks!
James
Re: ClassCastException from custom request handler
Hi, thanks for your suggestions!

I'm sure I have the class name right - changing it to something patently incorrect results in the expected org.apache.solr.common.SolrException: Error loading class ..., rather than the ClassCastException.

I did have some problems getting my class on the app server's classpath. I'm running with solr.home set to multicore, but creating a multicore/lib directory and putting my request handler class in there resulted in Error loading class errors. I found that setting jetty.class.path to include multicore/lib (and also explicitly pointing at Solr's core and common JARs) fixed the Error loading class errors, leaving these ClassCastExceptions...

2009/8/3 Avlesh Singh avl...@gmail.com

Can you cross-check the class attribute for your handler in solrconfig.xml? My guess is that it is specified as solr.LiveCoresHandler. It should be the fully qualified class name - com.foo.path.to.LiveCoresHandler - instead. Moreover, I am damn sure that you did not forget to drop your jar into solr.home/lib. Checking once again might not be a bad idea :)

Cheers
Avlesh

On Mon, Aug 3, 2009 at 9:11 PM, James Brady james.colin.br...@gmail.com wrote:
[...]

--
http://twitter.com/goodgravy
512 300 4210
http://webmynd.com/
Sent from Bury, United Kingdom
Re: Truncated XML responses from CoreAdminHandler
Hi Mark, you're right - a custom request handler sounds like the right option. I've created a handler as you suggested, but I'm having problems on Solr startup (my class is LiveCoresHandler):

Jul 31, 2009 5:20:39 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassCastException: LiveCoresHandler
  at org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
  at org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
  at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:444)

I've tried a few variations on the class definition, including extending RequestHandlerBase (as suggested here: http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2 ) and implementing SolrRequestHandler directly. I'm sure that the Solr libraries I built against and those I'm running on are the same version too, as I unzipped the Solr war file and copied the relevant jars out of there to build against.

Any ideas on what could be causing the ClassCastException? I've attached a debugger to the running Solr process but it didn't shed any light on the issue...

Thanks!
James

2009/7/20 Mark Miller markrmil...@gmail.com

Hi James,

That is very odd behavior! I'm not sure what is causing it at the moment, but that is not a great way to get all of the core names anyway. It also gathers a *lot* of information for each core that you don't need, including index statistics from Luke. It's very heavyweight for what you want. So while I hope we get to the bottom of this, here is what I would recommend:

Create your own plugin RequestHandler. This is very simple - often they just extend RequestHandlerBase, but for this you don't even need to. You can leave most of the RequestHandler methods unimplemented if you'd like - you just want to override/add to:

  public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp)

and in that method you can have a very simple impl:

  Collection<String> names = req.getCore().getCoreDescriptor()
      .getCoreContainer().getCoreNames();
  rsp.add("cores", names);
  // if the cores are dynamic, you prob don't want to cache
  rsp.setHttpCaching(false);

Then just plug your simple RequestHandler into {solr.home}/lib and add it to solrconfig.xml. You might also add a JIRA issue requesting the feature for future versions - but that's prob the best solution for 1.3 - I'm not seeing the functionality there.

--
- Mark
http://www.lucidimagination.com

On Sat, Jul 18, 2009 at 9:02 PM, James Brady james.colin.br...@gmail.com wrote:
[...]
Truncated XML responses from CoreAdminHandler
The Solr application I'm working on has many concurrently active cores - of the order of 1000s at a time. The management application depends on being able to query Solr for the current set of live cores, a requirement I've been satisfying using the STATUS core admin handler method. However, once the number of active cores reaches a particular threshold (which I haven't determined exactly), the response to the STATUS method is truncated, resulting in malformed XML.

My debugging so far has revealed:
- when doing STATUS queries from the local machine, they succeed, untruncated, 90% of the time
- when local STATUS queries do fail, they are always truncated to the same length: 73685 bytes in my case
- when doing STATUS queries from a remote machine, they fail due to truncation every time
- remote STATUS queries are always truncated to the same length: 24704 bytes in my case
- the failing STATUS queries take visibly longer to complete on the client - a few seconds for a truncated result versus under 1 second for an untruncated result
- all STATUS queries return a successful 200 HTTP code
- all STATUS queries are logged as returning in ~700ms in Solr's info log
- during failing (truncated) responses, Solr's CPU usage spikes to saturation
- behaviour seems the same whatever client I use: wget, curl, Python, ...

Using Solr 1.3.0 694707, Jetty 6.1.3.

At the moment, the main puzzle for me is that the local and remote behaviour is so different. It leads me to think that it is something to do with network transmission speed. But the response really isn't that big (untruncated it's ~1MB), and the CPU spike seems to suggest that something in the process of serialising the core information is taking too long and causing a timeout?

Any suggestions on settings to tweak, ways to get extra debug information, or ways to ascertain the active core list by some other means would be much appreciated!

James
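(For context, the STATUS request in question is the stock CoreAdmin call - something like this, with host and port assumed:)

  curl "http://localhost:8983/solr/admin/cores?action=STATUS"

It returns one status block per live core, including Luke-level index statistics, hence the ~1MB response at thousands of cores.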
Last modified time for cores, taking into account uncommitted changes
Hi, the lastModified field in the Solr core status seems to only be updated when a commit/optimize operation takes place. Is there any way to determine when a core has been changed, including any uncommitted add operations?

Thanks,
James
Re: Persistent, seemingly unfixable corrupt indices
Thanks for your answers, Michael! I was using a pre-1.3 Solr build, but I've now upgraded to the 1.3 release and run the new CheckIndex shipped as part of the Lucene 2.4 dev build, and I'm still getting the CorruptIndexException: docs out of order exceptions, I'm afraid.

On a fresh start, on newly checked indices, I actually get a lot of exceptions like:

SEVERE: java.lang.RuntimeException: [was class org.mortbay.jetty.EofException] null
  at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
  at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
  at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
  at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
  at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
  at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
  at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)

before any CorruptIndexExceptions - could that be the root cause? Unfortunately the indices are large and contain confidential information; is there anything else I can do to identify where the problem is and why CheckIndex isn't catching it?

Thanks
James

2009/2/23 Michael McCandless luc...@mikemccandless.com

Actually, even in 2.3.1, CheckIndex checks for docs-out-of-order both within and across segments, so now I'm at a loss as to why it's not catching your case. Are any of these indexes small enough to post somewhere I could access?

Mike

James Brady wrote:
[...]
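(For anyone searching the archives: CheckIndex runs standalone against a stopped index. The invocation is roughly as below - jar name and index path are assumed - and note that -fix repairs the index by dropping the bad segments and every document in them:)

  java -cp lucene-core-2.4.jar org.apache.lucene.index.CheckIndex /path/to/index
  java -cp lucene-core-2.4.jar org.apache.lucene.index.CheckIndex /path/to/index -fix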
Persistent, seemingly unfixable corrupt indices
Hi, my indices sometimes become corrupted - normally when Solr has to be KILLed. These are not normally too much of a problem, as Lucene's CheckIndex tool can normally detect missing/broken segments and fix them. However, I now have a few indices throwing errors like this:

INFO: [core4] webapp=/solr path=/update params={} status=0 QTime=2
Exception in thread Thread-75 org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: docs out of order (1124 <= 1138)
  at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order (1124 <= 1138)
  at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:502)
  at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:456)
  at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:425)
  at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:389)
  at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
  at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3109)
  at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
  at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)

and

INFO: [core7] webapp=/solr path=/update params={} status=500 QTime=5457
Feb 22, 2009 12:14:07 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.lucene.index.CorruptIndexException: docs out of order (242 <= 248)
  at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:502)
  at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:456)
  at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:425)
  at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:389)
  at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
  at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3109)
  at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
  at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:193)
  at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1800)
  at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1795)
  at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1791)
  at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2398)
  at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1465)
  at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1424)
  at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:278)

CheckIndex reports these cores as being completely healthy, and yet I can't commit new documents into them. Rebuilding indices isn't an option for me: is there any other way to fix this? If not, any ideas on what I can do to prevent it in the future?

Many thanks,
James
Fwd: Separate error logs
OK, so java.util.logging has no way of sending error messages to a separate log without writing your own Handler/Filter code. If we just skip over the absurdity of that, and the rage it makes me feel, what are my options here?

What I'm looking for is for all records to go to one file, and records of ERROR level and above to go to a separate log. Can I write my own Handlers/Filters, drop them on Jetty's classpath, and refer to them in my logging.properties - i.e. without rebuilding the whole WAR with my files added? Is Solr 1.4 (and its nice SLF4J logging) in a state ready for intensive production usage?

Thanks!
James

---------- Forwarded message ----------
From: James Brady james.colin.br...@gmail.com
Date: 2009/1/30
Subject: Re: Separate error logs
To: solr-user@lucene.apache.org

Oh... I should really have found that myself :/ Thank you!

2009/1/30 Ryan McKinley ryan...@gmail.com

Check: http://wiki.apache.org/solr/SolrLogging
You configure whatever flavor of logger to write errors to a separate log.

On Jan 30, 2009, at 4:36 PM, James Brady wrote:

Hi all, what's the best way for me to split Solr/Lucene error messages off to a separate log?

Thanks
James
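(The Handler code this forces on you is at least small. A minimal sketch, with the class name and log path invented; compile it, drop it on Jetty's classpath, and reference it from logging.properties:)

  import java.io.IOException;
  import java.util.logging.FileHandler;
  import java.util.logging.Level;

  // A FileHandler that only accepts SEVERE (java.util.logging's "error")
  // records, so they land in a separate error log.
  public class SevereFileHandler extends FileHandler {
      public SevereFileHandler() throws IOException, SecurityException {
          super("/var/log/solr/error.log", true); // hypothetical path; true = append
          setLevel(Level.SEVERE);
      }
  }

(and in logging.properties, alongside the existing all-records handler:)

  handlers = java.util.logging.FileHandler, SevereFileHandler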
Re: Recent document boosting with dismax
Great, thanks for that, Chris!

2009/2/3 Chris Hostetter hossman_luc...@fucit.org

: Hi, no, the date_added field was one per document.

I would suggest adding multiValued="false" to your date fieldType so that Solr can enforce that for you -- otherwise we can't be 100% sure. If it really is only a single-valued field, then I suspect you're right about the index corruption being the source of your problem, but it's not necessarily a permanent problem. Try optimizing your index; that should merge all the segments and purge any terms that aren't actually part of live documents (I think) ... if that doesn't work, rebuilding will be your best bet (and with multiValued="false", indexing will error if you are inadvertently sending multiple values per document).

: I'm having lots of other problems (un-related) with corrupt indices - could
: it be that in running the org.apache.lucene.index.CheckIndex utility, and
: losing some documents in the process, the ordinal part of my boost function
: is permanently broken?

-Hoss
Re: Recent document boosting with dismax
Hi, no, the date_added field was one per document.

2009/2/1 Erik Hatcher e...@ehatchersolutions.com

Is your date_added field multiValued, and you've assigned multiple values to some documents?

Erik

On Jan 31, 2009, at 4:12 PM, James Brady wrote:

Hi, I'm following the recipe here for boosting recent documents: http://wiki.apache.org/solr/SolrRelevancyFAQ#head-b1b1cdedcb9cd9bfd9c994709b4d7e540359b1fd

  bf=recip(rord(date_added),1,1000,1000)

On some of my servers I've started getting errors like this:

SEVERE: java.lang.RuntimeException: there are more terms than documents in field "date_added", but it's impossible to sort on tokenized fields
  at org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:379)
  at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
  at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:352)
  at org.apache.solr.search.function.ReverseOrdFieldSource.getValues(ReverseOrdFieldSource.java:55)
  at org.apache.solr.search.function.ReciprocalFloatFunction.getValues(ReciprocalFloatFunction.java:56)
  at org.apache.solr.search.function.FunctionQuery$AllScorer.<init>(FunctionQuery.java:103)
  at org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:81)
  at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:232)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143)
  at org.apache.lucene.search.Searcher.search(Searcher.java:118)
  ...

The date_added field is stored as a vanilla Solr date type:

  <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

I'm having lots of other problems (un-related) with corrupt indices - could it be that in running the org.apache.lucene.index.CheckIndex utility, and losing some documents in the process, the ordinal part of my boost function is permanently broken?

Thanks!
James
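(For concreteness, the schema.xml change Hoss suggests in the follow-up would look something like this - the field declaration itself is assumed, and multiValued="false" is the addition:)

  <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
  <field name="date_added" type="date" indexed="true" stored="true" multiValued="false"/>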
Separate error logs
Hi all, what's the best way for me to split Solr/Lucene error messages off to a separate log?

Thanks
James
Re: Separate error logs
Oh... I should really have found that myself :/ Thank you!

2009/1/30 Ryan McKinley ryan...@gmail.com

Check: http://wiki.apache.org/solr/SolrLogging
You configure whatever flavor of logger to write errors to a separate log.

On Jan 30, 2009, at 4:36 PM, James Brady wrote:

Hi all, what's the best way for me to split Solr/Lucene error messages off to a separate log?

Thanks
James
Disk usage after document deletion
Hi, I have a number of indices that are supposed to maintain windows of indexed content - the last month's worth of data, for example. At the moment, I'm cleaning out old documents with a simple cron job making requests like:

  <delete><query>date_added:[* TO NOW-30DAYS]</query></delete>

I was expecting disk usage to plateau pretty sharply as the number of documents in the index reaches equilibrium. However, the usage keeps on going up after 30 days, albeit not as quickly, even if I optimise the index. Can anyone offer an explanation for this? Should document deletions followed by optimises have as much of an effect on disk usage as I was expecting?

Thanks!
James
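(For reference, the cron job posts the delete, plus a follow-up commit, to the update handler - roughly like this, with the URL assumed:)

  curl http://localhost:8983/solr/update -H "Content-Type: text/xml" \
       --data-binary "<delete><query>date_added:[* TO NOW-30DAYS]</query></delete>"
  curl http://localhost:8983/solr/update -H "Content-Type: text/xml" \
       --data-binary "<commit/>"

Deletes only mark documents; the space is reclaimed when the affected segments are merged away or the index is optimized.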
Re: Disk usage after document deletion
The number of documents varies - sometimes it increases, sometimes it decreases - month to month. However, the index size increases monotonically. I was expecting some gradual growth, as I believe Lucene retains terms that are no longer referenced from any documents, so you end up with the superset of all possible terms in the end. However, index size growth continues at roughly half the speed of its growth during the filling-up period.

2009/1/26 Ryan McKinley ryan...@gmail.com

On Jan 25, 2009, at 6:06 PM, James Brady wrote:
[...]

Depends what you are expecting ;) Are you sure that the number or size of docs from month to month is consistent? If you have more docs each month than the previous one, or if more data is stored, then a month's data would be bigger too.

ryan
Strategy for presenting fresh data
Hi, the product I'm working on requires new documents to be searchable very quickly (inside 60 seconds is my goal). The corpus is also going to grow very large, although it is perfectly partitionable by user.

The approach I tried first was to have write-only masters and read-only slaves, with data being replicated from one to the other postCommit and postOptimise. This allowed new documents to be visible inside 5 minutes or so (until the indexes got so large that re-opening IndexSearchers took forever, that is...), but still not good enough.

Now I am considering cutting out the commit / replicate / re-open cycle by augmenting Solr with a RAMDirectory per core. Your thoughts on the following approach would be much appreciated: searches would be forked to both the RAMDirectory and FSDirectory, while writes would go to the RAMDirectory only. The RAMDirectory would be flushed back to the FSDirectory regularly, using IndexWriter.addIndexes (or addIndexesNoOptimize). Effectively, I'd be creating a searchable queue in front of a regularly committed and optimised conventional index.

As this seems to be a useful pattern (and is mentioned tangentially in Lucene in Action), is there already support for this in Lucene?

Thanks,
James
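(A rough sketch of the flush step under that design, using the Lucene 2.x-era API - the class name and path are mine, and the searcher-swap bookkeeping is omitted:)

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.store.RAMDirectory;

  public class RamFlusher {
      // Merge the in-memory "searchable queue" into the main on-disk index.
      public static void flush(RAMDirectory ram, String indexPath) throws Exception {
          Directory fs = FSDirectory.getDirectory(indexPath);
          IndexWriter writer = new IndexWriter(fs, new StandardAnalyzer(), false);
          writer.addIndexesNoOptimize(new Directory[] { ram });
          writer.close();
          // The caller would then swap in a fresh, empty RAMDirectory for new
          // writes and re-open searchers over both directories.
      }
  }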
Multicore capability: dynamically creating 1000s of cores?
Hi, there was some talk on JIRA about whether multicore would be able to manage tens of thousands of cores, and dynamically create hundreds every day: https://issues.apache.org/jira/browse/SOLR-350?focusedCommentId=12571282#action_12571282

The issue of multicore configuration was left open in SOLR-350 (I don't think a new issue was opened?), so it's not clear to me whether what Otis described will be possible in the 1.3 timeframe. Can anyone involved in SOLR-350 elaborate on how dynamic creation, closing, and opening of cores will work in the future?

A real-world deployment of this would require associated admin tasks for each core too: setting up cron jobs, enabling and starting rsync, and so on. So core configuration via Solr isn't a requirement for me: I can script the creation and configuration of a new core directory alongside the other admin tasks. The open questions are whether I'll be able to notify Solr that there is a pre-configured core ready to be used - i.e. the configuration set in multicore.xml - and whether this multi-multi-core approach will scale to the levels that Otis mentioned.

Thanks!
James.
Re: Solr feasibility with terabyte-scale data
Hi, we have an index of ~300GB, which is at least approaching the ballpark you're in. Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable' index, so we can just scale out horizontally across commodity hardware with no problems at all. We're also using the multicore features available in development Solr versions to reduce the granularity of core size by an order of magnitude: this makes for lots of small commits, rather than few long ones.

There was mention somewhere in the thread of document collections: if you're going to be filtering by collection, I'd strongly recommend partitioning too. It makes scaling so much less painful!

James

On 8 May 2008, at 23:37, marcusherou wrote:

Hi. I will as well head down a path like yours within some months from now. Currently I have an index of ~10M docs and only store ids in the index, for performance and distribution reasons. When we enter a new market I'm assuming we will soon hit 100M, and quite soon after that 1G documents. Each document has on average about 3-5k of data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures as shared storage (think of it as a SAN, or shared storage at least: one mount point). Hope this will be the right choice; only the future can tell.

Since we are developing a search engine, I frankly don't think even having 100's of SOLR instances serving the index will cut it performance-wise if we have one big index. I totally agree with the others claiming that you will most definitely go OOM or hit some other constraint of SOLR if you must have the whole result in memory, sort it, and create an XML response. I did hit such constraints when I couldn't afford the instances to have enough memory, and I had only 1M docs back then. And think of it... optimizing a TB index will take a long, long time, and you really want to have an optimized index if you want to reduce search time.

I am thinking of a sharding solution where I fragment the index over the disk(s) and let each SOLR instance have only a little piece of the total index. This will require a master database or namenode (or, simpler, just a properties file in each index dir) of some sort, to know which docs are located on which machine, or at least how many docs each shard has. This is to ensure that whenever you introduce a new SOLR instance with a new shard, the master indexer will know which shard to prioritize. This is probably not enough either, since all new docs will go to the new shard until it is filled (has the same size as the others); only then will all shards receive docs in a load-balanced fashion. So whenever you want to add a new indexer you probably need to initiate a stealing process, where it steals docs from the others until it reaches some sort of threshold (10 servers = each shard should have 1/10 of the docs, or such).

I think this will cut it, enabling us to grow with the data. I think doing a distributed reindexing will be a good thing as well when it comes to cutting both indexing and optimizing time. Probably each indexer should buffer its shard locally on RAID1 SCSI disks, optimize it, and then just copy it to the main index to minimize the burden on the shared storage.

Let's say the indexing part will be all fancy and working at TB scale; now we come to searching. I personally believe, after talking to other guys who have built big search engines, that you need to introduce a controller-like searcher on the client side which itself searches all of the shards and merges the responses. Perhaps Distributed Solr solves this; I will love to test it whenever my new installation of servers and enclosures is finished.

Currently my idea is something like this:

  public Page<Document> search(SearchDocumentCommand sdc)
  {
      Set<Integer> ids = documentIndexers.keySet();
      int nrOfSearchers = ids.size();
      int totalItems = 0;
      Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
      for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
      {
          Integer id = iterator.next();
          List<DocumentIndexer> indexers = documentIndexers.get(id);
          DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
          SearchDocumentCommand sdc2 = copy(sdc);
          sdc2.setPage(sdc.getPage() / nrOfSearchers);
          Page<Document> res = indexer.search(sdc2);
          totalItems += res.getTotalItems();
          docs.addAll(res);
      }
      if (sdc.getComparator() != null)
      {
          Collections.sort(docs, sdc.getComparator());
      }
      docs.setTotalItems(totalItems);
      return docs;
  }

This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers. I switch between Solr and raw Lucene back and forth, benchmarking and comparing stuff, so I have two implementations of DocumentIndexer (SolrDocumentIndexer and
Re: Solr feasibility with terabyte-scale data
So our problem is made easier by having complete index partitionability by a user_id field. That means at one end of the spectrum we could have one monolithic index for everyone, while at the other end we could have individual cores for each user_id.

At the moment, we've gone for a halfway house somewhere in the middle: I've got several large EC2 instances (currently 3), each running a single master/slave pair of Solr servers. The servers have several cores (currently 10 - a guesstimated good number). As new users register, I automatically distribute them across cores. I would like to do something with clustering users based on geo-location, so that cores will get 'time off' for maintenance and optimization during that user cluster's nighttime. I'd also like to move in the one-core-per-user direction as dynamic core creation becomes available.

It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: it might prevent a lot of headaches, and they've already solved a lot of the tricky problems. There are a number of ridiculously sized projects using it to solve their scale problems, not least Yahoo...

James

On 9 May 2008, at 01:17, Marcus Herou wrote:

Cool. Since you must certainly already have a good partitioning scheme, could you elaborate on a high level how you set this up? I'm certain that I will shoot myself in the foot both once and twice before getting it right, but this is what I'm good at: to never stop trying :) However, it is nice to start playing at least on the right side of the football field, so a little push in the back would be really helpful.

Kindly
//Marcus

On Fri, May 9, 2008 at 9:36 AM, James Brady [EMAIL PROTECTED] wrote:
[...]
IOException: Mark invalid while analyzing HTML
Hi, I'm seeing a problem mentioned in SOLR-42, "Highlighting problems with HTMLStripWhitespaceTokenizerFactory": https://issues.apache.org/jira/browse/SOLR-42 I'm indexing HTML documents, and am getting reams of "Mark invalid" IOExceptions:

SEVERE: java.io.IOException: Mark invalid
 at java.io.BufferedReader.reset(Unknown Source)
 at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
 at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
 at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
 at java.io.Reader.read(Unknown Source)
 at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:56)
 at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118)
 at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:249)
 at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
 at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:92)
 at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:45)
 at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
 at org.apache.solr.analysis.RemoveDuplicatesTokenFilter.process(RemoveDuplicatesTokenFilter.java:33)
 at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:82)
 at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:79)
 at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1518)
 at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1407)
 at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1116)
 at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2440)
 at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2422)
 at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1445)

This is using a ~1 week old version of Solr 1.3 from SVN. One workaround mentioned in that Jira issue was to move HTML stripping outside of Solr; can anyone suggest a better approach than that? Thanks James
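If it helps anyone following up: the client-side stripping workaround needn't be elaborate. A minimal sketch, assuming documents are prepared in Python before being posted to Solr (a crude regex strip; a real HTML parser copes better with entities, scripts and malformed markup):

    import re

    TAG_RE = re.compile(r'<[^>]+>')

    def strip_html(raw):
        # Replace each tag with a space so adjacent words don't fuse.
        return TAG_RE.sub(' ', raw)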
Re: Master / slave setup with multicore
Ah, wait, my fault - I didn't have the right Solr port configured in the slave, so snapinstaller was committing on the master :/ Thanks, James On 2 May 2008, at 09:17, Bill Au wrote: snapinstall calls commit to trigger Solr to use the new index. Do you see the commit request in your Solr log? Anything in the snapinstaller log? Bill On Thu, May 1, 2008 at 8:35 PM, James Brady [EMAIL PROTECTED] wrote: Hi Ryan, thanks for that! I have one outstanding question: when I take a snapshot on the master, then snappull and snapinstall on the slave, the new index is not being used; restarting the slave server does pick up the changes, however. Has anyone else had this problem with recent development builds? In case anyone is trying to do multicore replication, here are some of the things I've done to get it working. These could go on the wiki somewhere - what do people think? Obviously, having as much shared configuration as possible is ideal. On the master, I have core-specific: - scripts.conf, for webapp_name, master_data_dir and master_status_dir - solrconfig.xml, for the post-commit and post-optimise snapshooter locations On the slave, I have core-specific: - scripts.conf, as above I've also customised snappuller to accept a different rsync module name (hard-coded to 'solr' at present). This module name is set in the slave's scripts.conf James On 29 Apr 2008, at 13:44, Ryan McKinley wrote: On Apr 29, 2008, at 3:09 PM, James Brady wrote: Hi all, I'm aiming to use the new multicore features in development versions of Solr. My ideal setup would be to have master / slave servers on the same machine, snapshotting across from the 'write' to the 'read' server at intervals. This was all fine with Solr 1.2, but the rsync snappuller configuration doesn't seem to be set up to allow for multicore replication in 1.3. The rsyncd.conf file allows for several data directories to be defined, but the snappuller script only handles a single directory, expecting the Lucene index to be directly inside that directory. What's the best practice / best suggestions for replicating a multicore update server out to search servers? Currently, for multicore replication you will need to install the snap* scripts for _each_ core. The scripts all expect a single core, so for multiple cores you will just need to install them multiple times. ryan
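For concreteness, a per-core scripts.conf on the master might look like this (a sketch: only the variable names come from the message above, the values are invented examples):

    webapp_name=solr/core0
    master_data_dir=/home/solr/multicore/core0/data
    master_status_dir=/home/solr/multicore/core0/logs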
Master / slave setup with multicore
Hi all, I'm aiming to use the new multicore features in development versions of Solr. My ideal setup would be to have master / slave servers on the same machine, snapshotting across from the 'write' to the 'read' server at intervals. This was all fine with Solr 1.2, but the rsync snappuller configuration doesn't seem to be set up to allow for multicore replication in 1.3. The rsyncd.conf file allows for several data directories to be defined, but the snappuller script only handles a single directory, expecting the Lucene index to be directly inside that directory. What's the best practice / best suggestions for replicating a multicore update server out to search servers? Thanks, James
Re: Queuing adds and commits
Depending on your application, it might be useful to take control of the queueing yourself: it was for me! I needed quick turnarounds for submitting a document to be indexed, which Solr can't guarantee right now. To address it, I wrote a persistent queueing server, accessed by XML-RPC, which has the benefit of adding a low-cost layer of indirection between client-side and server-side stuff, and properly serialises the order in which events arrive. James On 27 Apr 2008, at 05:33, Phillip Farber wrote: A while back Hoss described Solr queuing behavior: searches can go on happily while commits/adds are happening, and multiple adds can happen in parallel, ... but all adds block while a commit is taking place. I just give all of the clients that update the index a really large timeout value (i.e. 60 seconds or so) and don't worry about queuing up indexing requests. I am worried about those adds queuing up behind an ongoing commit before being processed in parallel. What if a document is added multiple times and each time it is added it has different field values? You'd want the newest, last-queued version of that document to win, i.e. to be the version that represents the document in the index. But processing in parallel suggests that the time order of the adds of that document could be lost. Does Solr timestamp the documents in the indexing queue to prevent an earlier queued version of the document from being the last version indexed? Thanks, Phil
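A minimal sketch of that single-writer pattern (the XML-RPC front end is omitted; the connection setup assumes the old solr.py client and may differ in detail): one worker thread drains the queue, so adds are applied strictly in arrival order and commits are batched.

    import Queue
    import threading
    import solr  # the solr.py client

    pending = Queue.Queue()
    conn = solr.SolrConnection(host='localhost:8983')

    def enqueue(fields):
        # Called by the XML-RPC handler; returns immediately.
        pending.put(fields)

    def writer():
        while True:
            fields = pending.get()   # blocks until a document arrives
            conn.add(**fields)       # one writer => strict arrival order
            if pending.empty():
                conn.commit()        # batch commits: only when caught up

    threading.Thread(target=writer).start()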
Default core in multi-core
Hi all, In the latest trunk version, default='true' doesn't have the effect I would have expected when running in multicore mode. The example multicore.xml has: <core name="core0" instanceDir="core0" default="true"/> <core name="core1" instanceDir="core1"/> But queries such as /solr/select?q=*:* and /solr/admin/ are executed against core1, not core0 as I would have expected: it seems that the last core defined in multicore.xml is the de facto 'default'. Is this a bug or am I missing something? Thanks, James
Favouring recent matches
Hello all, In Lucene in Action (replicated here: http://www.theserverside.com/tt/articles/article.tss?l=ILoveLucene), the theserverside.com team say "The date boost has been really important for us." I'm looking for some advice on the best way to actually implement this - the only way I can see to do it right now is to set a boost for documents at index time that increases linearly over time. However, I'm wary of skewing Lucene's scoring in some strange way, or of interfering with the document boosts I'm setting for other reasons. Any suggestions? Thanks James
Fwd: Favouring recent matches
Sorry, I really should have directly explained what I was looking to do: theserverside.com gives higher scores to documents that were added more recently. I'd like to do the same, without the date boost being too overbearing (or unnoticeable...) - some ideas on how to approach this would be great. James Begin forwarded message: From: James Brady [EMAIL PROTECTED] Date: 8 March 2008 19:41:56 PST To: solr-user@lucene.apache.org Subject: Favouring recent matches Hello all, In Lucene in Action (replicated here: http://www.theserverside.com/tt/articles/article.tss?l=ILoveLucene), the theserverside.com team say "The date boost has been really important for us." I'm looking for some advice on the best way to actually implement this - the only way I can see to do it right now is to set a boost for documents at index time that increases linearly over time. However, I'm wary of skewing Lucene's scoring in some strange way, or of interfering with the document boosts I'm setting for other reasons. Any suggestions? Thanks James
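One query-time alternative, sketched under the assumption of a dismax handler and a sortable date field (here called date_added; the constants are tunable guesses): Solr's function queries can add a recency boost without touching index-time boosts, e.g.

    bf=recip(rord(date_added),1,1000,1000)^0.5

recip(x,m,a,b) evaluates a/(m*x+b), and rord() is the reverse ordinal of the field value, so more recent documents pick up a gentle boost that decays for older ones - avoiding a permanent index-time skew of the scores.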
Re: Strategy for handling large (and growing) index: horizontal partitioning?
Hi Kevin, Thanks for your suggestions - I've got about 6 million, and am being quite stingy with my schema at the moment I'm afraid. If anything, the size of each document is going to go up, not down, but I might be able to prune some older, unused data. James On 3 Mar 2008, at 14:33, Kevin Lewandowski wrote: How many documents are in the index? If you haven't already done this, I'd take a really close look at your schema and make sure you're only storing the things that should really be stored, same with the indexed fields. I drastically reduced my index size just by changing some indexed/stored options on a few fields. On Thu, Feb 28, 2008 at 10:54 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: James, I can't comment more on the SN's arch choices. Here is the story about your questions - 1 Solr instance can hold 1+ indices, either via JNDI (see the wiki) or via the new multi-core support, which works but is still being hacked on - See SOLR-303 in JIRA for distributed search. Yonik committed it just the other day, so now that's in nightly builds if you want to try it. There are 2 wiki pages about that, too; see the Recent Changes log on the wiki to quickly find them. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: James Brady [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, February 29, 2008 1:11:07 AM Subject: Re: Strategy for handling large (and growing) index: horizontal partitioning? Hi Otis, Thanks for your comments -- I didn't realise the wiki is open to editing; my apologies. I've put in a few words to try and clear things up a bit. So determining n will probably be a best guess followed by trial and error; that's fine. I'm still not clear about whether single Solr servers can operate across several indices, however... can anyone give me some pointers here? An alternative would be to have 1 index per instance, and n instances per server, where n is small. This might actually be a practical solution -- I'm spending ~20% of my time committing, so I should probably only have 3 or 4 indices in total per server to avoid two committing at the same time. Your mention of The Large Social Network was interesting! A social network's data is by definition pretty poorly partitioned by user id, so unless they've done something extremely clever, like co-locating social cliques in the same indices, I would have thought it would be a sub-optimal architecture. If my friends and I are scattered across different indices, each search would have to be massively federated. James On 28 Feb 2008, at 20:49, Otis Gospodnetic wrote: James, Regarding your questions about n users per index - this is a fine approach. The largest Social Network that you know of uses the same approach for various things, including full-text indices (not Solr, but close). You'd have to maintain a user-shard/index mapping somewhere, of course. What should the n be, you ask? Look at the overall index size, I'd say, against server capabilities (RAM, disk, CPU), and increase n up to a point where you're maximizing your hardware at some target query rate. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: James Brady To: solr-user@lucene.apache.org Sent: Wednesday, February 27, 2008 10:08:02 PM Subject: Strategy for handling large (and growing) index: horizontal partitioning? Hi all, Our current setup is a master and slave pair on a single machine, with an index size of ~50GB.
Query and update times are still respectable, but commits are taking ~20% of time on the master, while our daily index optimise can take up to 4 hours... Here's the most relevant part of solrconfig.xml: <useCompoundFile>true</useCompoundFile> <mergeFactor>10</mergeFactor> <maxBufferedDocs>1000</maxBufferedDocs> <maxMergeDocs>1</maxMergeDocs> <maxFieldLength>1</maxFieldLength> I've given both master and slave 2.5GB of RAM. Does an index optimise read and re-write the whole thing? If so, taking about 4 hours is pretty good! However, the documentation here: http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b states "Optimizations can take nearly ten minutes to run...", which leads me to think that we've grossly misconfigured something... Firstly, we would obviously love any way to reduce this optimise time - I have yet to experiment extensively with the settings above, and with optimise frequency, but some general guidance would be great. Secondly, this index size is increasing monotonically over time as we acquire new users. We need to take action to ensure we can scale in the future. The approach we're favouring at the moment is horizontal partitioning of indices by user id, as our data suits this scheme well. A given index would hold the indexed data for n users, where n would probably be between 1 and 100 users, and we will have multiple indices per search server. Running a server per index is impractical, especially for a small n
Re: Strategy for handling large (and growing) index: horizontal partitioning?
Hi, yes, a post-optimise copy takes 45 minutes at present. Disk IO is definitely the bottleneck, you're right -- iostat was showing 100% utilisation for the 5 hours it took to optimise yesterday... The master and slave are on the same disk, and it's definitely on my list to fix that, but the searcher is so lightly loaded compared to the indexer that I don't think it will win us too much. As there has been another optimise time question on the list today, could I request that the 10 minute claim is taken off the CollectionDistribution wiki page? It's extremely misleading for newcomers who don't necessarily realise that an optimise entails reading and writing the whole index, and that optimise time is going to be at least O(n). James On 28 Feb 2008, at 09:07, Walter Underwood wrote: Have you timed how long it takes to copy the index files? Optimizing can never be faster than that, since it must read every byte and write a whole new set. Disc speed may be your bottleneck. You could also look at disc access rates in a monitoring tool. Is there read contention between the master and slave for the same disc? wunder On 2/27/08 7:08 PM, James Brady [EMAIL PROTECTED] wrote: Hi all, Our current setup is a master and slave pair on a single machine, with an index size of ~50GB. Query and update times are still respectable, but commits are taking ~20% of time on the master, while our daily index optimise can take up to 4 hours... Here's the most relevant part of solrconfig.xml: <useCompoundFile>true</useCompoundFile> <mergeFactor>10</mergeFactor> <maxBufferedDocs>1000</maxBufferedDocs> <maxMergeDocs>1</maxMergeDocs> <maxFieldLength>1</maxFieldLength> I've given both master and slave 2.5GB of RAM. Does an index optimise read and re-write the whole thing? If so, taking about 4 hours is pretty good! However, the documentation here: http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b states "Optimizations can take nearly ten minutes to run...", which leads me to think that we've grossly misconfigured something... Firstly, we would obviously love any way to reduce this optimise time - I have yet to experiment extensively with the settings above, and with optimise frequency, but some general guidance would be great. Secondly, this index size is increasing monotonically over time as we acquire new users. We need to take action to ensure we can scale in the future. The approach we're favouring at the moment is horizontal partitioning of indices by user id, as our data suits this scheme well. A given index would hold the indexed data for n users, where n would probably be between 1 and 100 users, and we will have multiple indices per search server. Running a server per index is impractical, especially for a small n, so is a single Solr instance capable of managing multiple searchers and writers in this way? Following on from that, does anyone know of limiting factors in Solr or Lucene that would influence our decision on the value of n - the number of users per index? Thanks! James
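To put a rough number on that lower bound (illustrative arithmetic only; ~60MB/s is an assumed sustained throughput for a single commodity disk): an optimise must read and rewrite the full 50GB, so at best 2 x 50 x 1024MB / 60MB/s, which is roughly 1700s, or nearly half an hour of pure sequential IO before any merge CPU cost or seek contention - and with master and slave sharing one disk, contention pushes the real figure far higher.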
Re: Strategy for handling large (and growing) index: horizontal partitioning?
Hi Otis, Thanks for your comments -- I didn't realise the wiki is open to editing; my apologies. I've put in a few words to try and clear things up a bit. So determining n will probably be a best guess followed by trial and error; that's fine. I'm still not clear about whether single Solr servers can operate across several indices, however... can anyone give me some pointers here? An alternative would be to have 1 index per instance, and n instances per server, where n is small. This might actually be a practical solution -- I'm spending ~20% of my time committing, so I should probably only have 3 or 4 indices in total per server to avoid two committing at the same time. Your mention of The Large Social Network was interesting! A social network's data is by definition pretty poorly partitioned by user id, so unless they've done something extremely clever, like co-locating social cliques in the same indices, I would have thought it would be a sub-optimal architecture. If my friends and I are scattered across different indices, each search would have to be massively federated. James On 28 Feb 2008, at 20:49, Otis Gospodnetic wrote: James, Regarding your questions about n users per index - this is a fine approach. The largest Social Network that you know of uses the same approach for various things, including full-text indices (not Solr, but close). You'd have to maintain a user-shard/index mapping somewhere, of course. What should the n be, you ask? Look at the overall index size, I'd say, against server capabilities (RAM, disk, CPU), and increase n up to a point where you're maximizing your hardware at some target query rate. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: James Brady [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, February 27, 2008 10:08:02 PM Subject: Strategy for handling large (and growing) index: horizontal partitioning? Hi all, Our current setup is a master and slave pair on a single machine, with an index size of ~50GB. Query and update times are still respectable, but commits are taking ~20% of time on the master, while our daily index optimise can take up to 4 hours... Here's the most relevant part of solrconfig.xml: <useCompoundFile>true</useCompoundFile> <mergeFactor>10</mergeFactor> <maxBufferedDocs>1000</maxBufferedDocs> <maxMergeDocs>1</maxMergeDocs> <maxFieldLength>1</maxFieldLength> I've given both master and slave 2.5GB of RAM. Does an index optimise read and re-write the whole thing? If so, taking about 4 hours is pretty good! However, the documentation here: http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b states "Optimizations can take nearly ten minutes to run...", which leads me to think that we've grossly misconfigured something... Firstly, we would obviously love any way to reduce this optimise time - I have yet to experiment extensively with the settings above, and with optimise frequency, but some general guidance would be great. Secondly, this index size is increasing monotonically over time as we acquire new users. We need to take action to ensure we can scale in the future. The approach we're favouring at the moment is horizontal partitioning of indices by user id, as our data suits this scheme well. A given index would hold the indexed data for n users, where n would probably be between 1 and 100 users, and we will have multiple indices per search server. Running a server per index is impractical, especially for a small n, so is a single Solr instance capable of managing multiple searchers and writers in this way?
Following on from that, does anyone know of limiting factors in Solr or Lucene that would influence our decision on the value of n - the number of users per index? Thanks! James
Strategy for handling large (and growing) index: horizontal partitioning?
Hi all, Our current setup is a master and slave pair on a single machine, with an index size of ~50GB. Query and update times are still respectable, but commits are taking ~20% of time on the master, while our daily index optimise can take up to 4 hours... Here's the most relevant part of solrconfig.xml: <useCompoundFile>true</useCompoundFile> <mergeFactor>10</mergeFactor> <maxBufferedDocs>1000</maxBufferedDocs> <maxMergeDocs>1</maxMergeDocs> <maxFieldLength>1</maxFieldLength> I've given both master and slave 2.5GB of RAM. Does an index optimise read and re-write the whole thing? If so, taking about 4 hours is pretty good! However, the documentation here: http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b states "Optimizations can take nearly ten minutes to run...", which leads me to think that we've grossly misconfigured something... Firstly, we would obviously love any way to reduce this optimise time - I have yet to experiment extensively with the settings above, and with optimise frequency, but some general guidance would be great. Secondly, this index size is increasing monotonically over time as we acquire new users. We need to take action to ensure we can scale in the future. The approach we're favouring at the moment is horizontal partitioning of indices by user id, as our data suits this scheme well. A given index would hold the indexed data for n users, where n would probably be between 1 and 100 users, and we will have multiple indices per search server. Running a server per index is impractical, especially for a small n, so is a single Solr instance capable of managing multiple searchers and writers in this way? Following on from that, does anyone know of limiting factors in Solr or Lucene that would influence our decision on the value of n - the number of users per index? Thanks! James
Re: will hardlinks work across partitions?
Unfortunately, you cannot hard link across mount points. Snapshooter uses cp -lr, which, on my Linux machine at least, fails with: cp: cannot create link `/mnt2/myuser/linktest': Invalid cross-device link James On 23 Feb 2008, at 14:34, Brian Whitman wrote: Will the hardlink snapshot scheme work across physical disk partitions? Can I snapshoot to a different partition than the one holding the live solr index?
Bug fix for Solr Python bindings
Hi, Currently, the solr.py Python binding casts all key and value arguments blindly to strings. The following changes deal with Unicode properly and respect multi-valued parameters passed in as lists:

142a143,153
>     def __makeField(self, lst, f, v):
>         if not isinstance(f, basestring):
>             f = str(f)
>         if not isinstance(v, basestring):
>             v = str(v)
>         lst.append('<field name="')
>         lst.append(self.escapeKey(f))
>         lst.append('">')
>         lst.append(self.escapeVal(v))
>         lst.append('</field>')
143,147c154,158
<         lst.append('<field name="')
<         lst.append(self.escapeKey(str(f)))
<         lst.append('">')
<         lst.append(self.escapeVal(str(v)))
<         lst.append('</field>')
---
>         if isinstance(v, list):  # multi-valued
>             for value in v:
>                 self.__makeField(lst, f, value)
>         else:
>             self.__makeField(lst, f, v)

James
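With the patch applied, multi-valued fields can be passed straight through as lists (a sketch; the connection details are made up and the add() call assumes the old solr.py interface):

    import solr

    c = solr.SolrConnection(host='localhost:8983')
    # 'tag' would be a multiValued field in schema.xml; each list element
    # becomes its own <field name="tag"> element in the update message.
    c.add(id='doc1', title=u'caf\xe9 reviews', tag=['food', 'paris'])
    c.commit()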
Fwd: Performance help for heavy indexing workload
Hi again, More analysis showed that the extraordinarily long query times only appear when I specify a sort. A concrete example: For a querystring such as: ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther= the QTime is ~500ms. For a querystring such as: ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&sort=date_added%20asc the QTime is ~75s. I.e. I am using the StandardRequestHandler to search for a user-entered term ("apache" above) and filtering by a user_id field. This seems to be the case for every sort option except score asc and score desc. Please tell me Solr doesn't sort all matching documents before applying boolean filters? James Begin forwarded message: From: James Brady [EMAIL PROTECTED] Date: 11 February 2008 23:38:16 GMT-08:00 To: solr-user@lucene.apache.org Subject: Performance help for heavy indexing workload Hello, I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching. At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds. This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. In total, there are about 1.5m documents in the index. I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds. Obviously, every workload varies, but could anyone comment on whether this sort of hardware should, with proper configuration, be able to manage this sort of workload? I can't see signs of Solr being IO-bound, CPU-bound or memory-bound, although my scheduled commit operation, or perhaps GC, does spike up the CPU utilisation at intervals. Any help appreciated! James
Re: Performance help for heavy indexing workload
Hi - thanks to everyone for their responses. A couple of extra pieces of data which should help me optimise: documents are very rarely updated once in the index, and I can throw away index data older than 7 days. So, based on advice from Mike and Walter, it seems my best option will be to have seven separate indices. Six indices will never change, holding data from the six previous days. One index will change, holding data from the current day. Deletions and updates will be handled by effectively storing a revocation list in the mutable index. In this way, I will only need to perform Solr commits (yes, I did mean Solr commits rather than database commits below - my apologies) on the current day's index, and closing and opening new searchers for these commits shouldn't be as painful as it is currently. To do this, I need to work out how to do the following: - parallel multi-search through Solr - move to a new index on a scheduled basis (probably committing and optimising the old index at that point) - ideally, properly warm new searchers in the background to further improve search performance on the changing index Does that sound like a reasonable strategy in general, and has anyone got advice on the specific points I raise above? Thanks, James On 12 Feb 2008, at 11:45, Mike Klaas wrote: On 11-Feb-08, at 11:38 PM, James Brady wrote: Hello, I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching. At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds. This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. In total, there are about 1.5m documents in the index. I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds. By database commit do you mean Solr commit? If so, that is far too frequent if you are sorting on big fields. I use Solr to serve queries for ~10m docs on a medium size EC2 instance. This is an optimized configuration where highlighting is broken off into a separate index, and load is balanced into two subindices of 5m docs apiece. I do a good deal of faceting but no sorting. The only reason this is possible is that the index is only updated every few days. On another box we have a several-hundred-thousand document index which is updated relatively frequently (autocommit time: 20s). These are merged with the static-er index to create an illusion of real-time index updates. When Lucene supports efficient, reopen()able fieldcache updates, this situation might improve, but the above architecture would still probably be better. Note that the second index can be on the same machine. -Mike
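For the scheduled rollover, the routing itself is trivial (a sketch; the index URLs are hypothetical, and cross-index search would use Solr 1.3's distributed search or client-side merging, as noted in the list above):

    import datetime

    INDEXES = ['http://localhost:8983/solr/day%d' % i for i in range(7)]

    def write_index():
        # Only today's slot ever receives adds, deletes and commits; the
        # other six stay static, so their searchers never need reopening.
        return INDEXES[datetime.date.today().toordinal() % 7]

    def search_indexes():
        # Queries fan out across all seven and results are merged.
        return INDEXES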
Performance help for heavy indexing workload
Hello, I'm looking for some configuration guidance to help improve performance of my application, which tends to do a lot more indexing than searching. At present, it needs to index around two documents / sec - a document being the stripped content of a webpage. However, performance was so poor that I've had to disable indexing of the webpage content as an emergency measure. In addition, some search queries take an inordinate length of time - regularly over 60 seconds. This is running on a medium sized EC2 instance (2 x 2GHz Opterons and 8GB RAM), and there's not too much else going on on the box. In total, there are about 1.5m documents in the index. I'm using a fairly standard configuration - the things I've tried changing so far have been parameters like maxMergeDocs, mergeFactor and the autoCommit options. I'm only using the StandardRequestHandler, no faceting. I have a scheduled task causing a database commit every 15 seconds. Obviously, every workload varies, but could anyone comment on whether this sort of hardware should, with proper configuration, be able to manage this sort of workload? I can't see signs of Solr being IO-bound, CPU-bound or memory-bound, although my scheduled commit operation, or perhaps GC, does spike up the CPU utilisation at intervals. Any help appreciated! James
Unicode bug in python client code
Hi all, I was passing Python unicode objects to solr.add and got this sort of error:

 ...
 File "/Users/jamesbrady/Documents/workspace/YelServer/yel/solr.py", line 152, in add
   self.__add(lst, fields)
 File "/Users/jamesbrady/Documents/workspace/YelServer/yel/solr.py", line 146, in __add
   lst.append(self.escapeVal(str(v)))
 UnicodeEncodeError: 'ascii' codec can't encode characters in position 30-31: ordinal not in range(128)

Here's a diff which properly checks the object type before calling str():

142a143,146
>     if not isinstance(f, basestring): f = str(f)
>     if not isinstance(v, basestring): v = str(v)
144c148
<     lst.append(self.escapeKey(str(f)))
---
>     lst.append(self.escapeKey(f))
146c150
<     lst.append(self.escapeVal(str(v)))
---
>     lst.append(self.escapeVal(v))

Keep up the good work - loving Solr so far! James
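With the type check in place, unicode values survive the trip (a minimal sketch; connection details are made up, add() as in the old solr.py client):

    import solr

    c = solr.SolrConnection(host='localhost:8983')
    # Previously the implicit str() forced an ascii encode and blew up on
    # non-ASCII characters; now unicode passes through to escapeVal intact.
    c.add(id='doc2', body=u'na\xefve caf\xe9 r\xe9sum\xe9')
    c.commit()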