Re: ClassCastException from custom request handler

2009-08-05 Thread James Brady
OK, problem solved! Well, worked around.

I gave up on the new style plugin loading in a multicore Jetty setup, and
packaged up my plugin in a rebuilt solr.war.

I had tried this before, but I had only put the loose class files in WEB-INF/lib.
If I put a jar file in there instead, it works.

2009/8/4 Chantal Ackermann chantal.ackerm...@btelligent.de



 James Brady schrieb:

 Yeah I was thinking T would be SolrRequestHandler too. Eclipse's debugger
 can't tell me...


 You could try disassembling. Or Eclipse opens classes in a very rudimentary
 format when there is no source code attached. Maybe it shows the actual
 return value there, instead of T.


 Lots of other handlers are created with no problem before my plugin falls
 over, so I don't think it's a problem with T not being what we expected.

 Do you know of any working examples of plugins I can download and build in
 my environment to see what happens?


 No, sorry. I've only overridden the EntityProcessor from DataImportHandler,
 and that is not configured in solrconfig.xml.




 2009/8/4 Chantal Ackermann chantal.ackerm...@btelligent.de

  Code is from AbstractPluginLoader in the solr plugin package, 1.3 (the
 regular stable release, no svn checkout).


  Lines 80-84:

 @SuppressWarnings("unchecked")
 protected T create( ResourceLoader loader, String name, String className, Node node ) throws Exception
 {
   return (T) loader.newInstance( className, getDefaultPackages() );
 }



 --
 http://twitter.com/goodgravy
 512 300 4210
 http://webmynd.com/
 Sent from Bury, United Kingdom




-- 
http://twitter.com/goodgravy
512 300 4210
http://webmynd.com/
Sent from Bury, United Kingdom


Re: ClassCastException from custom request handler

2009-08-04 Thread James Brady
Solr version: 1.3.0 694707

solrconfig.xml:
<requestHandler name="livecores" class="LiveCoresHandler" />

public class LiveCoresHandler extends RequestHandlerBase {
    public void init(NamedList args) { }
    public String getDescription() { return ""; }
    public String getSource() { return ""; }
    public String getSourceId() { return ""; }
    public NamedList getStatistics() { return new NamedList(); }
    public String getVersion() { return ""; }

    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) {
        Collection<String> names =
            req.getCore().getCoreDescriptor().getCoreContainer().getCoreNames();
        rsp.add("cores", names);
        // if the cores are dynamic, you prob don't want to cache
        rsp.setHttpCaching(false);
    }
}

2009/8/4 Avlesh Singh avl...@gmail.com

 
  I'm sure I have the class name right - changing it to something patently
  incorrect results in the expected "org.apache.solr.common.SolrException:
  Error loading class ...", rather than the ClassCastException.
 
 You are right about that, James.

 Which Solr version are you using?
 Can you please paste the relevant pieces in your solrconfig.xml and the
 request handler class you have created?

 Cheers
 Avlesh

 On Mon, Aug 3, 2009 at 10:51 PM, James Brady james.colin.br...@gmail.com
 wrote:

  Hi,
  Thanks for your suggestions!
 
  I'm sure I have the class name right - changing it to something patently
  incorrect results in the expected
  org.apache.solr.common.SolrException: Error loading class ..., rather
  than
  the ClassCastException.
 
  I did have some problems getting my class on the app server's classpath.
  I'm
  running with solr.home set to multicore, but creating a multicore/lib
  directory and putting my request handler class in there resulted in
 Error
  loading class errors.
 
  I found that setting jetty.class.path to include multicore/lib (and also
  explicitly point at Solr's core and common JARs) fixed the Error loading
  class errors, leaving these ClassCastExceptions...
 
  2009/8/3 Avlesh Singh avl...@gmail.com
 
   Can you cross check the class attribute for your handler in
  solrconfig.xml?
   My guess is that it is specified as solr.LiveCoresHandler. It should
 be
   fully qualified class name - com.foo.path.to.LiveCoresHandler instead.
  
   Moreover, I am damn sure that you did not forget to drop your jar into
   solr.home/lib. Checking once again might not be a bad idea :)
  
   Cheers
   Avlesh
  
   On Mon, Aug 3, 2009 at 9:11 PM, James Brady 
 james.colin.br...@gmail.com
   wrote:
  
Hi,
I'm creating a custom request handler to return a list of live cores
 in
Solr.
   
On startup, I get this exception for each core:
   
Jul 31, 2009 5:20:39 PM org.apache.solr.common. SolrException log
SEVERE: java.lang.ClassCastException: LiveCoresHandler
   at
   
 org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
   at
   
 org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
   at
   
   
  
 
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
   at
   
   
  
 
 org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
   at org.apache.solr.core.SolrCore.init(SolrCore.java:444)
   
I've tried a few variations on the class definition, including
  extending
RequestHandlerBase (as suggested here:
   
   
  
 
 http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2
)
and implementing SolrRequestHandler directly.
   
I'm sure that the Solr libraries I built against and those I'm
 running
  on
are the same version too, as I unzipped the Solr war file and copies
  the
relevant jars out of there to build against.
   
Any ideas on what could be causing the ClassCastException? I've
  attached
   a
debugger to the running Solr process but it didn't shed any light on
  the
issue...
   
Thanks!
James
   
  
 
 
 
  --
  http://twitter.com/goodgravy
  512 300 4210
  http://webmynd.com/
  Sent from Bury, United Kingdom
 




-- 
http://twitter.com/goodgravy
512 300 4210
http://webmynd.com/
Sent from Bury, United Kingdom


Re: ClassCastException from custom request handler

2009-08-04 Thread James Brady
Hi, the LiveCoresHandler is in the default package - the behaviour's the
same if I have it in a properly namespaced package too...

The requestHandler name can either be a path (starting with '/') or a
qt name:
http://wiki.apache.org/solr/SolrRequestHandler

2009/8/4 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

 what is the package of LiveCoresHandler ?
 I guess the requestHandler name should be name=/livecores

 On Tue, Aug 4, 2009 at 5:04 PM, James Bradyjames.colin.br...@gmail.com
 wrote:
  Solr version: 1.3.0 694707
 
  solrconfig.xml:
 <requestHandler name="livecores" class="LiveCoresHandler" />

  public class LiveCoresHandler extends RequestHandlerBase {
      public void init(NamedList args) { }
      public String getDescription() { return ""; }
      public String getSource() { return ""; }
      public String getSourceId() { return ""; }
      public NamedList getStatistics() { return new NamedList(); }
      public String getVersion() { return ""; }

      public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) {
          Collection<String> names =
              req.getCore().getCoreDescriptor().getCoreContainer().getCoreNames();
          rsp.add("cores", names);
          // if the cores are dynamic, you prob don't want to cache
          rsp.setHttpCaching(false);
      }
  }
 
  2009/8/4 Avlesh Singh avl...@gmail.com
 
  
   I'm sure I have the class name right - changing it to something
 patently
   incorrect results in the expected
 org.apache.solr.common.SolrException:
   Error loading class ..., rather than the ClassCastException.
  
  You are right about that, James.
 
  Which Solr version are you using?
  Can you please paste the relevant pieces in your solrconfig.xml and the
  request handler class you have created?
 
  Cheers
  Avlesh
 
  On Mon, Aug 3, 2009 at 10:51 PM, James Brady 
 james.colin.br...@gmail.com
  wrote:
 
   Hi,
   Thanks for your suggestions!
  
   I'm sure I have the class name right - changing it to something
 patently
   incorrect results in the expected
   org.apache.solr.common.SolrException: Error loading class ...,
 rather
   than
   the ClassCastException.
  
   I did have some problems getting my class on the app server's
 classpath.
   I'm
   running with solr.home set to multicore, but creating a
 multicore/lib
   directory and putting my request handler class in there resulted in
  Error
   loading class errors.
  
   I found that setting jetty.class.path to include multicore/lib (and
 also
   explicitly point at Solr's core and common JARs) fixed the Error
 loading
   class errors, leaving these ClassCastExceptions...
  
   2009/8/3 Avlesh Singh avl...@gmail.com
  
Can you cross check the class attribute for your handler in
   solrconfig.xml?
My guess is that it is specified as solr.LiveCoresHandler. It
 should
  be
fully qualified class name - com.foo.path.to.LiveCoresHandler
 instead.
   
Moreover, I am damn sure that you did not forget to drop your jar
 into
solr.home/lib. Checking once again might not be a bad idea :)
   
Cheers
Avlesh
   
On Mon, Aug 3, 2009 at 9:11 PM, James Brady 
  james.colin.br...@gmail.com
wrote:
   
 Hi,
 I'm creating a custom request handler to return a list of live
 cores
  in
 Solr.

 On startup, I get this exception for each core:

 Jul 31, 2009 5:20:39 PM org.apache.solr.common. SolrException log
 SEVERE: java.lang.ClassCastException: LiveCoresHandler
at

  org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
at

  org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
at


   
  
 
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at


   
  
 
 org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
at org.apache.solr.core.SolrCore.init(SolrCore.java:444)

 I've tried a few variations on the class definition, including
   extending
 RequestHandlerBase (as suggested here:


   
  
 
 http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2
 )
 and implementing SolrRequestHandler directly.

 I'm sure that the Solr libraries I built against and those I'm
  running
   on
 are the same version too, as I unzipped the Solr war file and
 copies
   the
 relevant jars out of there to build against.

 Any ideas on what could be causing the ClassCastException? I've
   attached
a
 debugger to the running Solr process but it didn't shed any light
 on
   the
 issue...

 Thanks!
 James

   
  
  
  
   --
   http://twitter.com/goodgravy
   512 300 4210
   http://webmynd.com/
   Sent from Bury, United Kingdom
  
 
 
 
 
  --
  http://twitter.com/goodgravy
  512 300 4210
  http://webmynd.com/
  Sent from Bury, United Kingdom

Re: ClassCastException from custom request handler

2009-08-04 Thread James Brady
There is *something* strange going on with classloaders; when I put my
.class files in the right place in WEB-INF/lib in a repackaged solr.war
file, they're not found by the plugin loader ("Error loading class").

So the plugin classloader isn't seeing stuff inside WEB-INF/lib.

That explains why the plugin loader sees my class files when I point
jetty.class.path at the right directory, but in that situation I also need
to point jetty.class.path at the Solr JARs explicitly.

Still, how would ClassCastExceptions be caused by class loader paths not
being set correctly? I don't follow you... To get a ClassCastException, the
class to cast to must have been found. The cast-to class must not be in the
object's inheritance hierarchy, or be built against a different version, no?
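
For what it's worth, the cross-classloader failure mode is easy to reproduce outside
Solr. Below is a minimal sketch (the plugins/ directory and MyHandler class are
hypothetical, purely for illustration): the same class name loaded by two unrelated
classloaders yields two distinct Class objects, so an instance created through one
fails an isInstance/cast check against the other - which is what happens if the
webapp and the Jetty system classpath each link against their own copy of the Solr
classes.

import java.net.URL;
import java.net.URLClassLoader;

public class ClassLoaderDemo {
    public static void main(String[] args) throws Exception {
        // hypothetical directory containing a compiled MyHandler.class
        URL[] path = { new URL("file:plugins/") };
        ClassLoader a = new URLClassLoader(path, null);   // null parent: no delegation
        ClassLoader b = new URLClassLoader(path, null);
        Class<?> ca = a.loadClass("MyHandler");
        Class<?> cb = b.loadClass("MyHandler");
        System.out.println(ca == cb);          // false: two distinct Class objects
        Object o = ca.newInstance();
        System.out.println(cb.isInstance(o));  // false: casting o to cb's type would
                                               // throw ClassCastException
    }
}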

2009/8/4 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

 I guess this is a classloader issue. it is worth trying to put it in
 the WEB-INF/lib of the solr.war


 On Tue, Aug 4, 2009 at 5:35 PM, James Bradyjames.colin.br...@gmail.com
 wrote:
  Hi, the LiveCoresHandler is in the default package - the behaviour's the
  same if I have it in a properly namespaced package too...
 
  The requestHandler name can either be a path (starting with '/') or a
  qt name:
  http://wiki.apache.org/solr/SolrRequestHandler
 starting w/ '/' helps in accessing it directly
 
  2009/8/4 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com
 
  what is the package of LiveCoresHandler ?
  I guess the requestHandler name should be name=/livecores
 
  On Tue, Aug 4, 2009 at 5:04 PM, James Bradyjames.colin.br...@gmail.com
 
  wrote:
   Solr version: 1.3.0 694707
  
   solrconfig.xml:
  <requestHandler name="livecores" class="LiveCoresHandler" />

   public class LiveCoresHandler extends RequestHandlerBase {
       public void init(NamedList args) { }
       public String getDescription() { return ""; }
       public String getSource() { return ""; }
       public String getSourceId() { return ""; }
       public NamedList getStatistics() { return new NamedList(); }
       public String getVersion() { return ""; }

       public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) {
           Collection<String> names =
               req.getCore().getCoreDescriptor().getCoreContainer().getCoreNames();
           rsp.add("cores", names);
           // if the cores are dynamic, you prob don't want to cache
           rsp.setHttpCaching(false);
       }
   }
  
   2009/8/4 Avlesh Singh avl...@gmail.com
  
   
I'm sure I have the class name right - changing it to something
patently
incorrect results in the expected
org.apache.solr.common.SolrException:
Error loading class ..., rather than the ClassCastException.
   
   You are right about that, James.
  
   Which Solr version are you using?
   Can you please paste the relevant pieces in your solrconfig.xml and
 the
   request handler class you have created?
  
   Cheers
   Avlesh
  
   On Mon, Aug 3, 2009 at 10:51 PM, James Brady
   james.colin.br...@gmail.com
   wrote:
  
Hi,
Thanks for your suggestions!
   
I'm sure I have the class name right - changing it to something
patently
incorrect results in the expected
org.apache.solr.common.SolrException: Error loading class ...,
rather
than
the ClassCastException.
   
I did have some problems getting my class on the app server's
classpath.
I'm
running with solr.home set to multicore, but creating a
multicore/lib
directory and putting my request handler class in there resulted in
   Error
loading class errors.
   
I found that setting jetty.class.path to include multicore/lib (and
also
explicitly point at Solr's core and common JARs) fixed the Error
loading
class errors, leaving these ClassCastExceptions...
   
2009/8/3 Avlesh Singh avl...@gmail.com
   
 Can you cross check the class attribute for your handler in
solrconfig.xml?
 My guess is that it is specified as solr.LiveCoresHandler. It
 should
   be
 fully qualified class name - com.foo.path.to.LiveCoresHandler
 instead.

 Moreover, I am damn sure that you did not forget to drop your jar
 into
 solr.home/lib. Checking once again might not be a bad idea :)

 Cheers
 Avlesh

 On Mon, Aug 3, 2009 at 9:11 PM, James Brady 
   james.colin.br...@gmail.com
 wrote:

  Hi,
  I'm creating a custom request handler to return a list of live
  cores
   in
  Solr.
 
  On startup, I get this exception for each core:
 
  Jul 31, 2009 5:20:39 PM org.apache.solr.common. SolrException
 log
  SEVERE: java.lang.ClassCastException: LiveCoresHandler
 at
 
  
 org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
 at
 
  
 org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
 at
 
 

   
  
  
 org.apache.solr.util.plugin.AbstractPluginLoader.load

Re: ClassCastException from custom request handler

2009-08-04 Thread James Brady
Hi Chantal!
I've included a stack trace below.

I've attached a debugger to the server starting up, and it is finding my
class file as expected... I agree it looks like something wrong with how
I've deployed the compiled code, but perhaps different Solr versions at
compile time and run time? However, I've checked and rechecked that and
can't see a problem!

The actual ClassCastException is being thrown in an anonymous
AbstractPluginLoader instance's create method:
http://svn.apache.org/viewvc/lucene/solr/tags/release-1.3.0/src/java/org/apache/solr/util/plugin/AbstractPluginLoader.java?revision=695557

It's the cast to SolrRequestHandler which fails.

Aug 4, 2009 4:24:25 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created /update/csv:
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper
Aug 4, 2009 4:24:25 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created /admin/: org.apache.solr.handler.admin.AdminHandlers
Aug 4, 2009 4:24:25 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassCastException: com.jmsbrdy.LiveCoresHandler
at
org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
at
org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at
org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
at org.apache.solr.core.SolrCore.init(SolrCore.java:444)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:323)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:216)
at
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:104)
at
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)

At the moment, my deployment is:

   1. compile my single Java file from an Ant script (pointing at the Solr
   JARs from an exploded solr.war)
   2. copy that class file's directory tree
   (com/jmsbrdy/LiveCoresHandler.class) to a lib in the root of my jetty
   install
   3. add lib to Jetty's class path
   4. add the Solr JARs from the exploded war to Jetty's class path
   5. start the server

Can you see any problems there?

2009/8/4 Chantal Ackermann chantal.ackerm...@btelligent.de

 Hi James!

 James Brady schrieb:

 There is *something* strange going on with classloaders; when I put my
 .class files in the right place in WEB-INF/lib in a repackaged solr.war
 file, they're not found by the plugin loader ("Error loading class").

 So the plugin classloader isn't seeing stuff inside WEB-INF/lib.

 That explains why the plugin loader sees my class files when I point
 jetty.class.path at the right directory, but in that situation I also need
 to point jetty.class.path at the Solr JARs explicitly.


 you cannot be sure that it sees *your* files. It only sees a class that
 qualifies with the name that is requested in your code. It's obviously not
 the class the code expects, though - as it results in a ClassCastException
 at some point. It might help to have a look at where and why that casting
 went wrong.

 I wrote a custom EntityProcessor and deployed it first under
 WEB-INF/classes, and now in the plugin directory, and that worked without a
 problem. My first guess is that something with your packaging is wrong -
 what do you mean by default package? What is the full name of your class
 and how does its path in the file system look like?

 Can you paste the stack trace of the exception?

 Chantal



 Still, how would ClassCastExceptions be caused by class loader paths not
 being set correctly? I don't follow you... To get a ClassCastException,
 the
 class to cast to must have been found. The cast-to class must not be in
 the
 object's inheritance hierarchy, or be built against a different version,
 no?

 2009/8/4 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

  I guess this is a classloader issue. it is worth trying to put it in
 the WEB-INF/lib of the solr.war


 On Tue, Aug 4, 2009 at 5:35 PM, James Bradyjames.colin.br...@gmail.com
 wrote:

 Hi, the LiveCoresHandler is in the default package - the behaviour's the
 same if I have it in a properly namespaced package too...

 The requestHandler name can either be a path (starting with '/') or a
 qt name:
 http://wiki.apache.org/solr/SolrRequestHandler

 starting w/ '/' helps in accessing it directly

 2009/8/4 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

 what is the package of LiveCoresHandler ?
 I guess the requestHandler name should be name=/livecores

 On Tue, Aug 4, 2009 at 5:04 PM, James Brady
 james.colin.br...@gmail.com
 wrote:

 Solr version: 1.3.0 694707

 solrconfig.xml:
   <requestHandler name="livecores" class="LiveCoresHandler" />

 public class LiveCoresHandler extends RequestHandlerBase {
   public void init(NamedList args) { }
   public String getDescription

Re: ClassCastException from custom request handler

2009-08-04 Thread James Brady
Yeah I was thinking T would be SolrRequestHandler too. Eclipse's debugger
can't tell me...

Lots of other handlers are created with no problem before my plugin falls
over, so I don't think it's a problem with T not being what we expected.

Do you know of any working examples of plugins I can download and build in
my environment to see what happens?

2009/8/4 Chantal Ackermann chantal.ackerm...@btelligent.de

 Code is from AbstractPluginLoader in the solr plugin package, 1.3 (the
 regular stable release, no svn checkout).


  Lines 80-84:
 @SuppressWarnings("unchecked")
 protected T create( ResourceLoader loader, String name, String className, Node node ) throws Exception
 {
   return (T) loader.newInstance( className, getDefaultPackages() );
 }




-- 
http://twitter.com/goodgravy
512 300 4210
http://webmynd.com/
Sent from Bury, United Kingdom


ClassCastException from custom request handler

2009-08-03 Thread James Brady
Hi,
I'm creating a custom request handler to return a list of live cores in
Solr.

On startup, I get this exception for each core:

Jul 31, 2009 5:20:39 PM org.apache.solr.common. SolrException log
SEVERE: java.lang.ClassCastException: LiveCoresHandler
at
org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
at
org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at
org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
at org.apache.solr.core.SolrCore.init(SolrCore.java:444)

I've tried a few variations on the class definition, including extending
RequestHandlerBase (as suggested here:
http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2)
and implementing SolrRequestHandler directly.

I'm sure that the Solr libraries I built against and those I'm running on
are the same version too, as I unzipped the Solr war file and copied the
relevant jars out of there to build against.

Any ideas on what could be causing the ClassCastException? I've attached a
debugger to the running Solr process but it didn't shed any light on the
issue...

Thanks!
James


Re: ClassCastException from custom request handler

2009-08-03 Thread James Brady
Hi,
Thanks for your suggestions!

I'm sure I have the class name right - changing it to something patently
incorrect results in the expected
org.apache.solr.common.SolrException: Error loading class ..., rather than
the ClassCastException.

I did have some problems getting my class on the app server's classpath. I'm
running with solr.home set to multicore, but creating a multicore/lib
directory and putting my request handler class in there resulted in "Error
loading class" errors.

I found that setting jetty.class.path to include multicore/lib (and also
explicitly pointing at Solr's core and common JARs) fixed the "Error loading
class" errors, leaving these ClassCastExceptions...

2009/8/3 Avlesh Singh avl...@gmail.com

 Can you cross check the class attribute for your handler in solrconfig.xml?
 My guess is that it is specified as solr.LiveCoresHandler. It should be
 fully qualified class name - com.foo.path.to.LiveCoresHandler instead.

 Moreover, I am damn sure that you did not forget to drop your jar into
 solr.home/lib. Checking once again might not be a bad idea :)

 Cheers
 Avlesh

 On Mon, Aug 3, 2009 at 9:11 PM, James Brady james.colin.br...@gmail.com
 wrote:

  Hi,
  I'm creating a custom request handler to return a list of live cores in
  Solr.
 
  On startup, I get this exception for each core:
 
  Jul 31, 2009 5:20:39 PM org.apache.solr.common. SolrException log
  SEVERE: java.lang.ClassCastException: LiveCoresHandler
 at
  org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
 at
  org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
 at
 
 
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
 at
 
 
 org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
 at org.apache.solr.core.SolrCore.init(SolrCore.java:444)
 
  I've tried a few variations on the class definition, including extending
  RequestHandlerBase (as suggested here:
 
 
 http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2
  )
  and implementing SolrRequestHandler directly.
 
  I'm sure that the Solr libraries I built against and those I'm running on
  are the same version too, as I unzipped the Solr war file and copied the
  relevant jars out of there to build against.
 
  Any ideas on what could be causing the ClassCastException? I've attached
 a
  debugger to the running Solr process but it didn't shed any light on the
  issue...
 
  Thanks!
  James
 




-- 
http://twitter.com/goodgravy
512 300 4210
http://webmynd.com/
Sent from Bury, United Kingdom


Re: Truncated XML responses from CoreAdminHandler

2009-07-31 Thread James Brady
Hi Mark,
You're right - a custom request handler sounds like the right option.

I've created a handler as you suggested, but I'm having problems on Solr
startup (my class is LiveCoresHandler):
Jul 31, 2009 5:20:39 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassCastException: LiveCoresHandler
at
org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
at
org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at
org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
at org.apache.solr.core.SolrCore.init(SolrCore.java:444)

I've tried a few variations on the class definition, including extending
RequestHandlerBase (as suggested here:
http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2)
and implementing SolrRequestHandler directly.

I'm sure that the Solr libraries I built against and those I'm running on
are the same version too, as I unzipped the Solr war file and copied the
relevant jars out of there to build against.

Any ideas on what could be causing the ClassCastException? I've attached a
debugger to the running Solr process but it didn't shed any light on the
issue...

Thanks!
James

2009/7/20 Mark Miller markrmil...@gmail.com

 Hi James,

 That is very odd behavior! I'm not sure what's causing it at the moment, but
 that is not a great way to get all of the core names anyway. It also
 gathers
 a *lot* of information for each core that you don't need, including index
 statistics from Luke. It's very heavyweight for what you want. So while I
 hope we get to the bottom of this, here is what I would recommend:

 Create your own plugin RequestHandler. This is very simple - often they
 just
 extend RequestHandlerBase, but for this you don't even need to. You can
 leave most of the RequestHandler methods unimplemented if you'd like - you
 just want to override/add to:

 public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp)

 and in that method you can have a very simple impl:

    Collection<String> names =
        req.getCore().getCoreDescriptor().getCoreContainer().getCoreNames();
    rsp.add("cores", names);
    // if the cores are dynamic, you prob don't want to cache
    rsp.setHttpCaching(false);

 Then just plug your simple RequestHandler into {solr.home}/lib and add it
 to
 solrconfig.xml.

 You might also add a JIRA issue requesting the feature for future versions
 -
 but that's prob the best solution for 1.3 - I'm not seeing the functionality
 there.

 --
 - Mark

 http://www.lucidimagination.com

 On Sat, Jul 18, 2009 at 9:02 PM, James Brady james.colin.br...@gmail.com
 wrote:

  The Solr application I'm working on has many concurrently active cores -
 of
  the order of 1000s at a time.
 
  The management application depends on being able to query Solr for the
  current set of live cores, a requirement I've been satisfying using the
  STATUS core admin handler method.
 
  However, once the number of active cores reaches a particular threshold
  (which I haven't determined exactly), the response to the STATUS method
 is
  truncated, resulting in malformed XML.
 
  My debugging so far has revealed:
 
- when doing STATUS queries from the local machine, they succeed,
untruncated, 90% of the time
- when local STATUS queries do fail, they are always truncated to the
same length: 73685 bytes in my case
- when doing STATUS queries from a remote machine, they fail due to
truncation every time
- remote STATUS queries are always truncated to the same length: 24704
bytes in my case
- the failing STATUS queries take visibly longer to complete on the
client - a few seconds for a truncated result versus 1 second for an
untruncated result
- all STATUS queries return a successful 200 HTTP code
- all STATUS queries are logged as returning in ~700ms in Solr's info
 log
- during failing (truncated) responses, Solr's CPU usage spikes to
saturation
- behaviour seems the same whatever client I use: wget, curl, Python,
 ...
 
  Using Solr 1.3.0 694707, Jetty 6.1.3.
 
  At the moment, the main puzzles for me are that the local and remote
  behaviour is so different. It leads me to think that it is something to
 do
  with the network transmission speed. But the response really isn't that
 big
  (untruncated it's ~1MB), and the CPU spike seems to suggest that
 something
  in the process of serialising the core information is taking too long and
  causing a timeout?
 
  Any suggestions on settings to tweak, ways to get extra debug
 information,
  or ascertain the active core list in some other way would be much
  appreciated!
 
  James
 




-- 
http://twitter.com/goodgravy
512 300 4210
http://webmynd.com/
Sent from Bury, United Kingdom


Truncated XML responses from CoreAdminHandler

2009-07-18 Thread James Brady
The Solr application I'm working on has many concurrently active cores - of
the order of 1000s at a time.

The management application depends on being able to query Solr for the
current set of live cores, a requirement I've been satisfying using the
STATUS core admin handler method.

However, once the number of active cores reaches a particular threshold
(which I haven't determined exactly), the response to the STATUS method is
truncated, resulting in malformed XML.

My debugging so far has revealed:

   - when doing STATUS queries from the local machine, they succeed,
   untruncated, 90% of the time
   - when local STATUS queries do fail, they are always truncated to the
   same length: 73685 bytes in my case
   - when doing STATUS queries from a remote machine, they fail due to
   truncation every time
   - remote STATUS queries are always truncated to the same length: 24704
   bytes in my case
   - the failing STATUS queries take visibly longer to complete on the
   client - a few seconds for a truncated result versus 1 second for an
   untruncated result
   - all STATUS queries return a successful 200 HTTP code
   - all STATUS queries are logged as returning in ~700ms in Solr's info log
   - during failing (truncated) responses, Solr's CPU usage spikes to
   saturation
   - behaviour seems the same whatever client I use: wget, curl, Python, ...

Using Solr 1.3.0 694707, Jetty 6.1.3.

At the moment, the main puzzles for me are that the local and remote
behaviour is so different. It leads me to think that it is something to do
with the network transmission speed. But the response really isn't that big
(untruncated it's ~1MB), and the CPU spike seems to suggest that something
in the process of serialising the core information is taking too long and
causing a timeout?

Any suggestions on settings to tweak, ways to get extra debug information,
or ascertain the active core list in some other way would be much
appreciated!

James


Last modified time for cores, taking into account uncommitted changes

2009-04-30 Thread James Brady
Hi,
The lastModified field the Solr status seems to only be updated when a
commit/optimize operation takes place.

Is there any way to determine when a core has been changed, including any
uncommitted add operations?


Thanks,
James


Re: Persistent, seemingly unfixable corrupt indices

2009-02-24 Thread James Brady
Thanks for your answers Michael! I was using a pre-1.3 Solr build, but I've
now upgraded to the 1.3 release and run the new CheckIndex shipped as part of
the Lucene 2.4 dev build, and I'm still getting the "CorruptIndexException:
docs out of order" exceptions, I'm afraid.

Upon a fresh start, on newly Checked indices, I actually get a lot of
Exceptions like:

SEVERE: java.lang.RuntimeException: [was class
org.mortbay.jetty.EofException] null
at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at
com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at
com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
at
org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
at
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)

Before any CorruptIndexExceptions - could that be the root cause?

Unfortunately the indices are large and contain confidential information; is
there anything else I can do to identify where the problem is and why
CheckIndex isn't catching it?

Thanks
James

2009/2/23 Michael McCandless luc...@mikemccandless.com


 Actually, even in 2.3.1, CheckIndex checks for docs-out-of-order both
 within and across segments, so now I'm at a loss as to why it's not catching
 your case.  Are any of these indexes small enough to post somewhere I could
 access?

 Mike


 James Brady wrote:

  Hi, My indices sometimes become corrupted - normally when Solr has to be
 KILLed. These are not normally too much of a problem, as
 Lucene's CheckIndex tool can normally detect missing / broken segments and
 fix them.

 However, I now have a few indices throwing errors like this:

 INFO: [core4] webapp=/solr path=/update params={} status=0 QTime=2
 Exception in thread Thread-75
 org.apache.lucene.index.MergePolicy$MergeException:
 org.apache.lucene.index.CorruptIndexException: docs out of order (1124 <=
 1138 )
 at

 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
 Caused by: org.apache.lucene.index.CorruptIndexException: docs out of
 order
 (1124 <= 1138 )
 at

 org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:502)
 at

 org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:456)
 at

 org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:425)
 at
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:389)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
 at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3109)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
 at

 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)

 and

 INFO: [core7] webapp=/solr path=/update params={} status=500 QTime=5457
 Feb 22, 2009 12:14:07 PM org.apache.solr.common.SolrException log
 SEVERE: org.apache.lucene.index.CorruptIndexException: docs out of order
 (242 <= 248 )
 at

 org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:502)
 at

 org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:456)
 at

 org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:425)
 at
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:389)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
 at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3109)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
 at

 org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:193)
 at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1800)
 at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1795)
 at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1791)
 at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2398)
 at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1465)
 at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1424)
 at

 org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:278)


 CheckIndex reports these cores as being completely healthy, and yet I
 can't
 commit new documents in to them.

 Rebuilding indices isn't an option for me: is there any other way to fix
 this? If not, any ideas on what I can do to prevent it in the future?

 Many thanks,
 James





Persistent, seemingly unfixable corrupt indices

2009-02-22 Thread James Brady
Hi, My indices sometimes become corrupted - normally when Solr has to be
KILLed. These are not normally too much of a problem, as
Lucene's CheckIndex tool can normally detect missing / broken segments and
fix them.

However, I now have a few indices throwing errors like this:

INFO: [core4] webapp=/solr path=/update params={} status=0 QTime=2
Exception in thread Thread-75
org.apache.lucene.index.MergePolicy$MergeException:
org.apache.lucene.index.CorruptIndexException: docs out of order (1124 <=
1138 )
at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order
(1124 <= 1138 )
at
org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:502)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:456)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:425)
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:389)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3109)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)

and

INFO: [core7] webapp=/solr path=/update params={} status=500 QTime=5457
Feb 22, 2009 12:14:07 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.lucene.index.CorruptIndexException: docs out of order
(242 <= 248 )
at
org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:502)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:456)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:425)
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:389)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3109)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
at
org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:193)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1800)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1795)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1791)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2398)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1465)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1424)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:278)


CheckIndex reports these cores as being completely healthy, and yet I can't
commit new documents in to them.

Rebuilding indices isn't an option for me: is there any other way to fix
this? If not, any ideas on what I can do to prevent it in the future?

Many thanks,
James


Fwd: Separate error logs

2009-02-06 Thread James Brady
OK, so java.util.logging has no way of sending error messages to a separate
log without writing your own Handler/Filter code.
If we just skip over the absurdity of that, and the rage it makes me feel,
what are my options here? What I'm looking for is for all records to go to
one file, and records of an ERROR level and above to go to a separate log.

Can I write my own Handlers/Filters, drop them on Jetty's classpath and
refer to them in my logging.properties? I.e. without rebuilding the whole
WAR, with my files added?
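
For reference, the Filter half of that is tiny - a minimal sketch, assuming a
hypothetical com.example.logging package and filtering at SEVERE
(java.util.logging's nearest equivalent to ERROR):

package com.example.logging;

import java.util.logging.Filter;
import java.util.logging.Level;
import java.util.logging.LogRecord;

// Lets through only records at SEVERE or above; attach it to the
// FileHandler that writes the error-only log.
public class SevereOnlyFilter implements Filter {
    public boolean isLoggable(LogRecord record) {
        return record.getLevel().intValue() >= Level.SEVERE.intValue();
    }
}

FileHandler does read a <handler>.filter key from logging.properties (e.g.
java.util.logging.FileHandler.filter = com.example.logging.SevereOnlyFilter), so the
class only needs to be on the container's classpath - though getting two differently
configured FileHandlers (one filtered, one catch-all) generally means subclassing
FileHandler as well, since its properties are keyed by class name.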

Is Solr 1.4 (and its nice SLF4J logging) in a state ready for intensive
production usage?

Thanks!
James

-- Forwarded message --
From: James Brady james.colin.br...@gmail.com
Date: 2009/1/30
Subject: Re: Separate error logs
To: solr-user@lucene.apache.org


Oh... I should really have found that myself :/
Thank you!

2009/1/30 Ryan McKinley ryan...@gmail.com

check:
 http://wiki.apache.org/solr/SolrLogging

 You configure whatever flavor logger to write error to a separate log



 On Jan 30, 2009, at 4:36 PM, James Brady wrote:

  Hi all, What's the best way for me to split Solr/Lucene error messages off
 to a separate log?

 Thanks
 James





Re: Recent document boosting with dismax

2009-02-03 Thread James Brady
Great, thanks for that, Chris!

2009/2/3 Chris Hostetter hossman_luc...@fucit.org


 : Hi, no, the date_added field was one per document.

 i would suggest adding multiValued="false" to your date fieldType so
 that Solr can enforce that for you -- otherwise we can't be 100% sure.

 if it really is only a single valued field, then i suspect you're right
 about the index corruption being the source of your problem, but it's
 not necessarily a permanent problem.  try optimizing your index, that
 should merge all the segments and purge any terms that aren't actually
 part of live documents (i think) ... if that doesn't work, rebuilding will
 be your best bet (and with that multiValued="false" will error if you
 are inadvertently sending multiple values per document)

 :  I'm having lots of other problems (un-related) with corrupt indices -
 :  could
 :  it be that in running the org.apache.lucene.index.CheckIndex utility,
 and
 :  losing some documents in the process, the ordinal part of my boost
 :  function
 :  is permanently broken?



 -Hoss




Re: Recent document boosting with dismax

2009-02-02 Thread James Brady
Hi, no, the date_added field was one per document.
2009/2/1 Erik Hatcher e...@ehatchersolutions.com

 Is your date_added field multiValued and you've assigned multiple values to
 some documents?

Erik


 On Jan 31, 2009, at 4:12 PM, James Brady wrote:

  Hi, I'm following the recipe here:

 http://wiki.apache.org/solr/SolrRelevancyFAQ#head-b1b1cdedcb9cd9bfd9c994709b4d7e540359b1fd
 for boosting recent documents: bf=recip(rord(date_added),1,1000,1000)

 On some of my servers I've started getting errors like this:

 SEVERE: java.lang.RuntimeException: there are more terms than documents in
 field "date_added", but it's impossible to sort on tokenized fields
 at

 org.apache.lucene.search.FieldCacheImpl$10.createValue(FieldCacheImpl.java:379)
 at
 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
 at

 org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:352)
 at

 org.apache.solr.search.function.ReverseOrdFieldSource.getValues(ReverseOrdFieldSource.java:55)
 at

 org.apache.solr.search.function.ReciprocalFloatFunction.getValues(ReciprocalFloatFunction.java:56)
 at

 org.apache.solr.search.function.FunctionQuery$AllScorer.init(FunctionQuery.java:103)
 at

 org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:81)
 at

 org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:232)
 at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143)
 at org.apache.lucene.search.Searcher.search(Searcher.java:118)
 ...

 The date_added field is stored as a vanilla Solr date type:
   fieldType name=date class=solr.DateField sortMissingLast=true
 omitNorms=true/

 I'm having lots of other problems (un-related) with corrupt indices -
 could
 it be that in running the org.apache.lucene.index.CheckIndex utility, and
 losing some documents in the process, the ordinal part of my boost
 function
 is permanently broken?

 Thanks!
 James





Separate error logs

2009-01-30 Thread James Brady
Hi all, What's the best way for me to split Solr/Lucene error messages off to
a separate log?

Thanks
James


Re: Separate error logs

2009-01-30 Thread James Brady
Oh... I should really have found that myself :/
Thank you!

2009/1/30 Ryan McKinley ryan...@gmail.com

 check:
 http://wiki.apache.org/solr/SolrLogging

 You configure whatever flavor logger to write error to a separate log



 On Jan 30, 2009, at 4:36 PM, James Brady wrote:

  Hi all, What's the best way for me to split Solr/Lucene error messages off
 to a separate log?

 Thanks
 James





Disk usage after document deletion

2009-01-25 Thread James Brady
Hi, I have a number of indices that are supposed to maintain windows of
indexed content - the last month's worth of data, for example.

At the moment, I'm cleaning out old documents with a simple cron job making
requests like:
<delete><query>date_added:[* TO NOW-30DAYS]</query></delete>
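
(For reference, the same delete-by-query can be issued from Java through SolrJ
instead of a raw HTTP POST - a rough sketch against the 1.3-era SolrJ API, with
the core URL as a placeholder:)

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PurgeOldDocs {
    public static void main(String[] args) throws Exception {
        // placeholder URL - point this at the core being cleaned up
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/core0");
        // same query the cron job posts
        server.deleteByQuery("date_added:[* TO NOW-30DAYS]");
        server.commit();  // deletions only become visible after a commit
    }
}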

I was expecting disk usage to plateau pretty sharply as the number of
documents in the index reaches equilibrium. However, the usage keeps on
going up, after 30 days, albeit not as quickly, even if I optimise the
index.

Can anyone offer an explanation for this? Should document deletions followed
by optimises have as much of an effect on disk usage as I was expecting?

Thanks!
James


Re: Disk usage after document deletion

2009-01-25 Thread James Brady
The number of documents varies - sometimes it increases, sometimes it
decreases - month to month.
However, the index size increases monotonically.

I was expecting some gradual growth as I expect Lucene retains terms that
are no longer referenced from any documents, so you'll end up with the
superset of all possible terms in the end.

However, index size growth probably continues at roughly half the speed of
its growth during the "filling up" period.

2009/1/26 Ryan McKinley ryan...@gmail.com


 On Jan 25, 2009, at 6:06 PM, James Brady wrote:

  Hi, I have a number of indices that are supposed to maintain windows
 of
 indexed content - the last month's worth of data, for example.

 At the moment, I'm cleaning out old documents with a simple cron job
 making
 requests like:
 <delete><query>date_added:[* TO NOW-30DAYS]</query></delete>

 I was expecting disk usage to plateau pretty sharply as the number of
 documents in the index reaches equilibrium. However, the usage keeps on
 going up, after 30 days, albeit not as quickly, even if I optimise the
 index.

 Can anyone offer an explanation for this? Should document deletions
 followed
 by optimises have as much of an effect on disk usage as I was expecting?


 Depends what you are expecting ;)

 Are you sure that the number or size of docs from month to month is
 consistent?  If you have more docs each month than the previous one, or if
 more data is stored, then a month's data would be bigger too.

 ryan



Strategy for presenting fresh data

2008-06-11 Thread James Brady

Hi,
The product I'm working on requires new documents to be searchable  
very quickly (inside 60 seconds is my goal). The corpus is also going  
to grow very large, although it is perfectly partitionable by user.


The approach I tried first was to have write-only masters and read- 
only slaves with data being replicated from one to another postCommit  
and postOptimise.


This allowed new documents to be visible inside 5 minutes or so (until  
the indexes got so large that re-opening IndexSearchers took forever,  
that is...), but still not good enough.


Now, I am considering cutting out the commit / replicate / re-open  
cycle by augmenting Solr with a RAMDirectory per core.


Your thoughts on the following approach would be much appreciated:

Searches would be forked to both the RAMDirectory and FSDirectory,  
while writes would go to the RAMDirectory only. The RAMDirectory would  
be flushed back to the FSDirectory regularly, using  
IndexWriter.addIndexes (or addIndexesNoOptimise).


Effectively, I'd be creating a searchable queue in front of a  
regularly committed and optimised conventional index.
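
To make the idea concrete, here is a rough sketch of the Lucene side of it,  
against the Lucene 2.x API of the time (paths, analyzer choice and the flush  
policy are placeholders, not a worked-out implementation):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class SearchableQueue {
    private final StandardAnalyzer analyzer = new StandardAnalyzer();
    private final Directory fsDir;                // the regular, committed on-disk index
    private RAMDirectory ramDir = new RAMDirectory();
    private IndexWriter ramWriter;

    public SearchableQueue(String indexPath) throws Exception {
        fsDir = FSDirectory.getDirectory(indexPath);
        ramWriter = new IndexWriter(ramDir, analyzer, true);  // writes go to RAM only
    }

    // New documents are added only to the RAM "queue".
    public void add(Document doc) throws Exception {
        ramWriter.addDocument(doc);
    }

    // Searches are forked across the on-disk index and the RAM segment.
    // (Readers are left unclosed for brevity; pending RAM docs only become
    // visible once the RAM writer has flushed.)
    public Hits search(Term t) throws Exception {
        IndexReader[] readers = { IndexReader.open(fsDir), IndexReader.open(ramDir) };
        IndexSearcher searcher = new IndexSearcher(new MultiReader(readers));
        return searcher.search(new TermQuery(t));
    }

    // Called periodically to move the queue onto disk, i.e. the
    // addIndexesNoOptimize step described above.
    public void flush() throws Exception {
        ramWriter.close();
        IndexWriter fsWriter = new IndexWriter(fsDir, analyzer, false);
        fsWriter.addIndexesNoOptimize(new Directory[] { ramDir });
        fsWriter.close();
        ramDir = new RAMDirectory();                          // start a fresh queue
        ramWriter = new IndexWriter(ramDir, analyzer, true);
    }
}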


As this seems to be a useful pattern (and is mentioned tangentially in  
Lucene in Action), is there already support for this in Lucene?


Thanks,
James


Multicore capability: dynamically creating 1000s of cores?

2008-05-16 Thread James Brady
Hi, there was some talk on JIRA about whether Multicore would be able  
to manage tens of thousands of cores, and dynamically create hundreds  
every day:
https://issues.apache.org/jira/browse/SOLR-350?focusedCommentId=12571282#action_12571282


The issue of multicore configuration was left open in SOLR-350 (I  
don't think a new issue was opened?), so it's not clear to me if what  
Otis described will be possible in the 1.3 timeframe.


Can anyone involved in SOLR-350 elaborate on how dynamic creation,  
closing and opening of cores will work in the future?


A real-world deployment of this would require associated admin tasks  
for each core too: setting up cron jobs, enabling and starting rsync  
and so on, so core configuration via Solr isn't a requirement for me:  
I can script the creation and configuration of a new core directory  
alongside the other admin tasks.


The open questions are whether I'll be able to notify Solr that there  
is a pre-configured core ready to be used - i.e. the configuration set  
in multicore.xml, and whether this multi-multi-core approach will  
scale to the levels that Otis mentioned.


Thanks!

James.

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread James Brady
Hi, we have an index of ~300GB, which is at least approaching the  
ballpark you're in.


Lucky for us, to coin a phrase we have an 'embarrassingly  
partitionable' index so we can just scale out horizontally across  
commodity hardware with no problems at all. We're also using the  
multicore features available in development Solr version to reduce  
granularity of core size by an order of magnitude: this makes for lots  
of small commits, rather than few long ones.


There was mention somewhere in the thread of document collections: if  
you're going to be filtering by collection, I'd strongly recommend  
partitioning too. It makes scaling so much less painful!


James

On 8 May 2008, at 23:37, marcusherou wrote:



Hi.

I will as well head into a path like yours within some months from  
now.
Currently I have an index of ~10M docs and only store id's in the  
index for

performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G  
documents. Each

document have in average about 3-5k data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA  
enclosures
as shared storage (think of it as a SAN or shared storage at least,  
one
mount point). Hope this will be the right choice, only future can  
tell.


Since we are developing a search engine I frankly don't think even  
having
100's of SOLR instances serving the index will cut it performance  
wise if we
have one big index. I totally agree with the others claiming that  
you most
definitely will go OOE or hit some other constraints of SOLR if you  
must
have the whole result in memory sort it and create a xml response. I  
did hit
such constraints when I couldn't afford the instances to have enough  
memory
and I had only 1M of docs back then. And think of it... Optimizing a  
TB
index will take a long long time and you really want to have an  
optimized

index if you want to reduce search time.

I am thinking of a sharding solution where I fragment the index over  
the

disk(s) and let each SOLR instance only have little piece of the total
index. This will require a master database or namenode (or simpler  
just a
properties file in each index dir) of some sort to know what docs is  
located

on which machine or at least how many docs each shard have. This is to
ensure that whenever you introduce a new SOLR instance with a new  
shard the
master indexer will know what shard to prioritize. This is probably  
not
enough either since all new docs will go to the new shard until it  
is filled
(have the same size as the others) only then will all shards receive  
docs in

a loadbalanced fashion. So whenever you want to add a new indexer you
probably need to initiate a stealing process where it steals docs  
from the
others until it reaches some sort of threshold (10 servers = each  
shard

should have 1/10 of the docs or such).

I think this will cut it and enabling us to grow with the data. I  
think
doing a distributed reindexing will as well be a good thing when it  
comes to
cutting both indexing and optimizing speed. Probably each indexer  
should
buffer it's shard locally on RAID1 SCSI disks, optimize it and then  
just
copy it to the main index to minimize the burden of the shared  
storage.


Let's say the indexing part will be all fancy and working i TB scale  
now we
come to searching. I personally believe after talking to other guys  
which
have built big search engines that you need to introduce a  
controller like
searcher on the client side which itself searches in all of the  
shards and
merges the response. Perhaps Distributed Solr solves this and will  
love to
test it whenever my new installation of servers and enclosures is  
finished.


Currently my idea is something like this.
public Page<Document> search(SearchDocumentCommand sdc)
{
    Set<Integer> ids = documentIndexers.keySet();
    int nrOfSearchers = ids.size();
    int totalItems = 0;
    Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());

    for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
    {
        Integer id = iterator.next();
        List<DocumentIndexer> indexers = documentIndexers.get(id);
        DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
        SearchDocumentCommand sdc2 = copy(sdc);
        sdc2.setPage(sdc.getPage()/nrOfSearchers);
        Page<Document> res = indexer.search(sdc);
        totalItems += res.getTotalItems();
        docs.addAll(res);
    }

    if(sdc.getComparator() != null)
    {
        Collections.sort(docs, sdc.getComparator());
    }

    docs.setTotalItems(totalItems);

    return docs;
}

This is my RaidedDocumentIndexer which wraps a set of  
DocumentIndexers. I
switch from Solr to raw Lucene back and forth benchmarking and  
comparing
stuff so I have two implementations of DocumentIndexer  
(SolrDocumentIndexer

and 

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread James Brady
So our problem is made easier by having complete index  
partitionability by a user_id field. That means at one end of the  
spectrum, we could have one monolithic index for everyone, while at  
the other end of the spectrum we could have individual cores for each  
user_id.


At the moment, we've gone for a halfway house somewhere in the middle:  
I've got several large EC2 instances (currently 3), each running a  
single master/slave pair of Solr servers. The servers have several  
cores (currently 10 - a guesstimated good number). As new users  
register, I automatically distribute them across cores. I would like  
to do something with clustering users based on geo-location so that  
cores will get 'time off' for maintenance and optimization for that  
user cluster's nighttime. I'd also like to move in the 1 core per user  
direction as dynamic core creation becomes available.


It seems a lot of what you're describing is really similar to  
MapReduce, so I think Otis' suggestion to look at Hadoop is a good  
one: it might prevent a lot of headaches and they've already solved a  
lot of the tricky problems. There are a number of ridiculously sized  
projects using it to solve their scale problems, not least Yahoo...


James

On 9 May 2008, at 01:17, Marcus Herou wrote:


Cool.

Since you must certainly already have a good partitioning scheme, could you elaborate at a high level on how you set this up?

I'm certain that I will shoot myself in the foot both once and twice before getting it right, but this is what I'm good at: never stop trying :)
However, it is nice to start playing at least on the right side of the football field, so a little push in the back would be really helpful.

Kindly

//Marcus



On Fri, May 9, 2008 at 9:36 AM, James Brady [EMAIL PROTECTED] 


wrote:

Hi, we have an index of ~300GB, which is at least approaching the  
ballpark

you're in.

Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable' index, so we can just scale out horizontally across commodity hardware with no problems at all. We're also using the multicore features available in the development Solr version to reduce the granularity of core size by an order of magnitude: this makes for lots of small commits, rather than few long ones.


There was mention somewhere in the thread of document collections: if
you're going to be filtering by collection, I'd strongly recommend
partitioning too. It makes scaling so much less painful!

James


On 8 May 2008, at 23:37, marcusherou wrote:



Hi.

I will head down a path like yours as well within some months from now. Currently I have an index of ~10M docs and only store ids in the index, for performance and distribution reasons. When we enter a new market I'm assuming we will soon hit 100M, and quite soon after that 1G documents. Each document has on average about 3-5k of data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures as shared storage (think of it as a SAN, or shared storage at least: one mount point). Hope this will be the right choice; only the future can tell.


Since we are developing a search engine, I frankly don't think even having 100s of Solr instances serving the index will cut it performance-wise if we have one big index. I totally agree with the others claiming that you will most definitely go OOM or hit some other constraints of Solr if you must have the whole result in memory, sort it and create an XML response. I did hit such constraints when I couldn't afford the instances to have enough memory, and I had only 1M docs back then. And think of it... optimizing a TB index will take a long, long time, and you really want to have an optimized index if you want to reduce search time.

I am thinking of a sharding solution where I fragment the index over the disk(s) and let each Solr instance have only a little piece of the total index. This will require a master database or namenode (or, simpler, just a properties file in each index dir) of some sort to know which docs are located on which machine, or at least how many docs each shard has. This is to ensure that whenever you introduce a new Solr instance with a new shard, the master indexer will know which shard to prioritize. This is probably not enough either, since all new docs will go to the new shard until it is filled (has the same size as the others); only then will all shards receive docs in a load-balanced fashion. So whenever you want to add a new indexer, you probably need to initiate a "stealing" process where it steals docs from the others until it reaches some sort of threshold (10 servers = each shard should have 1/10 of the docs, or such).
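
A rough illustration of the shard-prioritisation idea above (not Marcus's implementation; the names are invented): if the routing layer always sends a new document to the shard that currently holds the fewest documents, a freshly added, empty shard fills up first and the shards converge on an even distribution without an explicit "stealing" phase.

import java.util.HashMap;
import java.util.Map;

public class ShardChooser {
    // shard name -> current document count; assumes at least one shard is registered
    private final Map<String, Long> docCounts = new HashMap<String, Long>();

    public synchronized void addShard(String shardName) {
        docCounts.put(shardName, 0L);
    }

    // Pick the emptiest shard for the next document and bump its count.
    public synchronized String chooseShardForNewDoc() {
        String smallest = null;
        long min = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : docCounts.entrySet()) {
            if (e.getValue() < min) {
                min = e.getValue();
                smallest = e.getKey();
            }
        }
        docCounts.put(smallest, min + 1);
        return smallest;
    }
}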

I think this will cut it and enable us to grow with the data. I think doing a distributed reindexing will be a good thing as well when it comes to cutting both indexing and optimizing speed. Probably each indexer should buffer its shard locally on RAID1 SCSI disks, optimize

IOException: Mark invalid while analyzing HTML

2008-05-04 Thread James Brady

Hi,
I'm seeing a problem mentioned in SOLR-42, "Highlighting problems with HTMLStripWhitespaceTokenizerFactory":

https://issues.apache.org/jira/browse/SOLR-42

I'm indexing HTML documents, and am getting reams of "Mark invalid" IOExceptions:

SEVERE: java.io.IOException: Mark invalid
    at java.io.BufferedReader.reset(Unknown Source)
    at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
    at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
    at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
    at java.io.Reader.read(Unknown Source)
    at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:56)
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118)
    at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:249)
    at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
    at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:92)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:45)
    at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
    at org.apache.solr.analysis.RemoveDuplicatesTokenFilter.process(RemoveDuplicatesTokenFilter.java:33)
    at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:82)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:79)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1518)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1407)
    at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1116)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2440)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2422)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1445)



This is using a ~1 week old version of Solr 1.3 from SVN.

One workaround mentioned in that Jira issue was to move HTML stripping  
outside of Solr; can anyone suggest a better approach than that?
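
For anyone who does end up stripping the HTML on the client side, here is a very crude, hypothetical sketch of the idea; a real HTML parser would cope with malformed markup, entities and encodings far better than these regular expressions do.

public class HtmlStripper {
    public static String strip(String html) {
        return html
            .replaceAll("(?is)<script.*?</script>", " ")  // drop script blocks
            .replaceAll("(?is)<style.*?</style>", " ")    // drop style blocks
            .replaceAll("(?s)<[^>]+>", " ")               // drop remaining tags
            .replaceAll("&nbsp;", " ")                    // the most common entity
            .replaceAll("\\s+", " ")                      // collapse whitespace
            .trim();
    }

    public static void main(String[] args) {
        // prints: Hello world
        System.out.println(strip("<html><body><p>Hello &nbsp; <b>world</b></p></body></html>"));
    }
}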


Thanks
James



Re: Master / slave setup with multicore

2008-05-02 Thread James Brady
Ah, wait, my fault - I didn't have the right Solr port configured in the slave, so snapinstaller was committing the master :/


Thanks,
James

On 2 May 2008, at 09:17, Bill Au wrote:

snapinstall calls commit to trigger Solr to use the new index.  Do  
you see
the commit request in your Solr log?  Anything in the snapinstaller  
log?


Bill

On Thu, May 1, 2008 at 8:35 PM, James Brady [EMAIL PROTECTED] 


wrote:


Hi Ryan, thanks for that!

I have one outstanding question: when I take a snapshot on the master, then snappull and snapinstall on the slave, the new index is not being used; restarting the slave server does pick up the changes, however.

Has anyone else had this problem with recent development builds?

In case anyone is trying to do multicore replication, here are some of the things I've done to get it working. These could go on the wiki somewhere - what do people think?

Obviously, having as much shared configuration as possible is ideal. On the master, I have core-specific:
- scripts.conf, for webapp_name, master_data_dir and  
master_status_dir

- solrconfig.xml, for the post-commit and post-optimise snapshooter
locations

On the slave, I have core-specific:
- scripts.conf, as above

I've also customised snappuller to accept a different rsync module  
name
(hard coded to 'solr' at present). This module name is set in the  
slave

scripts.conf

James


On 29 Apr 2008, at 13:44, Ryan McKinley wrote:



On Apr 29, 2008, at 3:09 PM, James Brady wrote:


Hi all,
I'm aiming to use the new multicore features in development  
versions
of Solr. My ideal setup would be to have master / slave servers  
on the same
machine, snapshotting across from the 'write' to the 'read'  
server at

intervals.

This was all fine with Solr 1.2, but the rsync & snappuller configuration doesn't seem to be set up to allow for multicore replication in 1.3.

The rsyncd.conf file allows for several data directories to be
defined, but the snappuller script only handles a single directory,
expecting the Lucene index to be directly inside that directory.

What's the best practice / best suggestions for replicating a
multicore update server out to search servers?


Currently, for multicore replication you will need to install the  
snap*
scripts for _each_ core.  The scripts all expect a single core so  
for

multiple cores, you will just need to install it multiple times.

ryan








Master / slave setup with multicore

2008-04-29 Thread James Brady

Hi all,
I'm aiming to use the new multicore features in development versions  
of Solr. My ideal setup would be to have master / slave servers on the  
same machine, snapshotting across from the 'write' to the 'read'  
server at intervals.


This was all fine with Solr 1.2, but the rsync & snappuller configuration doesn't seem to be set up to allow for multicore replication in 1.3.


The rsyncd.conf file allows for several data directories to be  
defined, but the snappuller script only handles a single directory,  
expecting the Lucene index to be directly inside that directory.


What's the best practice / best suggestions for replicating a  
multicore update server out to search servers?


Thanks,
James


Re: Queuing adds and commits

2008-04-29 Thread James Brady
Depending on your application, it might be useful to take control of  
the queueing yourself: it was for me!


I needed quick turnarounds for submitting a document to be indexed,  
which Solr can't guarantee right now. To address it, I wrote a  
persistent queueing server, accessed by XML-RPC, which has the benefit  
of adding a low-cost layer of indirection between client-side and  
server-side stuff, and properly serialises the order in which events  
arrive.
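
The core of that design, heavily simplified (this is not James's actual server): a FIFO queue drained by a single worker thread, so add requests reach Solr strictly in arrival order while callers return immediately. The persistence and the XML-RPC front end he mentions are omitted here, and postToSolr is just a placeholder.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class IndexQueue {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<String>();

    public IndexQueue() {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    try {
                        String docXml = pending.take();  // blocks until work arrives
                        postToSolr(docXml);              // placeholder HTTP POST to /update
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Called by clients; returns as soon as the document is queued.
    public void enqueue(String docXml) {
        pending.add(docXml);
    }

    private void postToSolr(String docXml) {
        // placeholder: send docXml to Solr's /update handler; commit on a schedule
    }
}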


James


On 27 Apr 2008, at 05:33, Phillip Farber wrote:



A while back Hoss described Solr queuing behavior:

 searches can go on happily while commits/adds are happening, and
 multiple adds can happen in parallel, ... but all adds block while a
 commit is taking place.  i just give all of clients that update the
 index a really large timeout value (ie: 60 seconds or so) and don't
 worry about queing up indexing requests.

I am worried about those adds queuing up behind an ongoing commit  
before being processed in parallel.  What if a document is added  
multiple times and each time it is added it has different field  
values?  You'd want the newest, last queued version of that document  
to win, i.e. to be the version that represents the document in the  
index. But processing in parallel suggests that the time order of  
the adds of that document could be lost. Does Solr time stamp the  
documents in the indexing queue to prevent an earlier queued version  
of the document from being the last version indexed?


Thanks,

Phil





Default core in multi-core

2008-04-21 Thread James Brady

Hi all,
In the latest trunk version, default='true' doesn't have the effect I  
would have expected running in multi core mode.


The example multicore.xml has:
 <core name="core0" instanceDir="core0" default="true"/>
 <core name="core1" instanceDir="core1" />

But queries such as
/solr/select?q=*:*
and
/solr/admin/

are executed against core1, not core0 as I would have expected: it  
seems to be that the last core defined in multicore.xml is the de  
facto 'default'.


Is this a bug or am I missing something?

Thanks,
James


Favouring recent matches

2008-03-08 Thread James Brady

Hello all,
In "Lucene in Action" (replicated here: http://www.theserverside.com/tt/articles/article.tss?l=ILoveLucene), the theserverside.com team say "The date boost has been really important for us."


I'm looking for some advice on the best way to actually implement this  
- the only way I can see to do it right now is to set a boost for  
documents at index time that increases linearly over time. However,  
I'm wary of skewing Lucene's scoring in some strange way, or  
interfering with the document boosts I'm setting for other reasons.
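
As a hedged sketch of the index-time approach described above (not a recommendation from this thread): compute a boost that grows with recency and attach it through the boost attribute on the doc element of Solr's add XML. The decay shape and the 30-day constant are arbitrary choices for illustration, and field values are assumed to be already XML-escaped.

public class RecencyBoost {
    private static final long MS_PER_DAY = 24L * 60 * 60 * 1000;

    // A brand-new document gets a boost of 2.0, decaying towards 1.0 as it ages
    // (about 1.5 at 30 days old).
    public static float boostFor(long addedTimestampMs) {
        double ageDays = (System.currentTimeMillis() - addedTimestampMs) / (double) MS_PER_DAY;
        return (float) (1.0 + 1.0 / (1.0 + ageDays / 30.0));
    }

    public static String addXml(String id, String title, long addedTimestampMs) {
        return "<add><doc boost=\"" + boostFor(addedTimestampMs) + "\">"
             + "<field name=\"id\">" + id + "</field>"
             + "<field name=\"title\">" + title + "</field>"
             + "</doc></add>";
    }
}

The drawback hinted at above is real: an index-time boost is frozen at indexing time, so the notion of "recent" goes stale unless documents are periodically reindexed.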


Any suggestions?

Thanks
James


Fwd: Favouring recent matches

2008-03-08 Thread James Brady
Sorry, I really should have directly explained what I was looking to  
do: theserverside.com give higher scores to documents that were added  
more recently.


I'd like to do the same, without the date boost being too overbearing  
(or unnoticeable...) - some ideas on how to approach this would be  
great.


James

Begin forwarded message:


From: James Brady [EMAIL PROTECTED]
Date: 8 March 2008 19:41:56 PST
To: solr-user@lucene.apache.org
Subject: Favouring recent matches

Hello all,
In "Lucene in Action" (replicated here: http://www.theserverside.com/tt/articles/article.tss?l=ILoveLucene), the theserverside.com team say "The date boost has been really important for us."


I'm looking for some advice on the best way to actually implement  
this - the only way I can see to do it right now is to set a boost  
for documents at index time that increases linearly over time.  
However, I'm wary of skewing Lucene's scoring in some strange way,  
or interfering with the document boosts I'm setting for other reasons.


Any suggestions?

Thanks
James




Re: Strategy for handling large (and growing) index: horizontal partitioning?

2008-03-03 Thread James Brady

Hi Kevin,
Thanks for your suggestions - I've got about 6 million documents, and am being quite stingy with my schema at the moment, I'm afraid.


If anything, the size of each document is going to go up, not down,  
but I might be able to prune some older, unused data.


James

On 3 Mar 2008, at 14:33, Kevin Lewandowski wrote:


How many documents are in the index?

If you haven't already done this I'd take a really close look at your
schema and make sure you're only storing the things that should really
be stored, same with the indexed fields. I drastically reduced my
index size just by changing some indexed/stored options on a few
fields.

On Thu, Feb 28, 2008 at 10:54 PM, Otis Gospodnetic
[EMAIL PROTECTED] wrote:

James,

 I can't comment more on the SN's arch choices.

 Here is the story about your questions
 - 1 Solr instance can hold 1+ indices, either via JNDI (see Wiki)  
or via the new multi-core support which works, but is still being  
hacked on
 - See SOLR-303 in JIRA for distributed search.  Yonik committed  
it just the other day, so now that's in nightly builds if you want  
to try it.  There are 2 Wiki pages about that, too, see Recent  
changes log on the Wiki to quickly find them.



 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 

From: James Brady [EMAIL PROTECTED]
To: solr-user@lucene.apache.org




Sent: Friday, February 29, 2008 1:11:07 AM
Subject: Re: Strategy for handling large (and growing) index:  
horizontal partitioning?


Hi Otis,
Thanks for your comments -- I didn't realise the wiki is open to
editing; my apologies. I've put in a few words to try and clear
things up a bit.

So determining n will probably be a best guess followed by trial and
error, that's fine. I'm still not clear about whether single Solr
servers can operate across several indices, however.. can anyone  
give

me some pointers here?
An alternative would be to have 1 index per instance, and n  
instances

per server, where n is small. This might actually be a practical
solution -- I'm spending ~20% of my time committing, so I should
probably only have 3 or 4 indices in total per server to avoid two
committing at the same time.

Your mention of The Large Social Network was interesting! A social
network's data is by definition pretty poorly partitioned by user  
id,

so unless they've done something extremely clever like co-locating
social cliques in the same indices, I would have thought it would be a

sub-optimal architecture. If me and my friends are scattered around
different indices, each search would have to be federated massively.

James


On 28 Feb 2008, at 20:49, Otis Gospodnetic wrote:


James,

Regarding your questions about n users per index - this is a fine
approach.  The largest Social Network that you know of uses the
same approach for various things, including full-text indices (not
Solr, but close).  You'd have to maintain user-shard/index mapping
somewhere, of course.  What should the n be, you ask?  Look at the
overall index size, I'd say, against server capabilities (RAM,
disk, CPU), increase n up to a point where you're maximizing your
hardware at some target query rate.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 

From: James Brady




To: solr-user@lucene.apache.org
Sent: Wednesday, February 27, 2008 10:08:02 PM
Subject: Strategy for handling large (and growing) index:
horizontal partitioning?

Hi all,
Our current setup is a master and slave pair on a single machine,
with an index size of ~50GB.

Query and update times are still respectable, but commits are  
taking
~20% of time on the master, while our daily index optimise can  
up to

4 hours...
Here's the most relevant part of solrconfig.xml:
 <useCompoundFile>true</useCompoundFile>
 <mergeFactor>10</mergeFactor>
 <maxBufferedDocs>1000</maxBufferedDocs>
 <maxMergeDocs>1</maxMergeDocs>
 <maxFieldLength>1</maxFieldLength>

I've given both master and slave 2.5GB of RAM.

Does an index optimise read and re-write the whole thing? If so,
taking about 4 hours is pretty good! However, the documentation  
here:
http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b
states Optimizations can take nearly ten minutes to run... which
leads me to think that we've grossly misconfigured something...

Firstly, we would obviously love any way to reduce this  
optimise time
- I have yet to experiment extensively with the settings above,  
and

optimise frequency, but some general guidance would be great.

Secondly, this index size is increasing monotonically over time
and as
we acquire new users. We need to take action to ensure we can  
scale

in the future. The approach we're favouring at the moment is
horizontal partitioning of indices by user id as our data suits  
this
scheme well. A given index would hold the indexed data for n  
users,
where n would probably be between 1 and 100 users, and we will  
have

multiple indices per search server.

Running server per index is impractical, especially for a small  
n

Re: Strategy for handling large (and growing) index: horizontal partitioning?

2008-02-28 Thread James Brady
Hi, yes a post-optimise copy takes 45 minutes at present. Disk IO is  
definitely the bottleneck, you're right -- iostat was showing 100%  
utilisation for the 5 hours it took to optimise yesterday...


The master and slave are on the same disk, and it's definitely on my  
list to fix that, but the searcher is so lightly loaded compared to  
the indexer that I don't think it will win us too much.


As there has been another optimise time question on the list today  
could I request that the 10 minute claim is taken off the CollectionDistribution wiki page? It's extremely misleading for
newcomers who don't necessarily realise an optimise entails reading  
and writing the whole index, and that optimise time is going to be at  
least O(n)


James


On 28 Feb 2008, at 09:07, Walter Underwood wrote:


Have you timed how long it takes to copy the index files? Optimizing
can never be faster than that, since it must read every byte and write
a whole new set. Disc speed may be your bottleneck.

You could also look at disc access rates in a monitoring tool.

Is there read contention between the master and slave for the same  
disc?


wunder

On 2/27/08 7:08 PM, James Brady [EMAIL PROTECTED] wrote:


Hi all,
Our current setup is a master and slave pair on a single machine,
with an index size of ~50GB.

Query and update times are still respectable, but commits are taking
~20% of time on the master, while our daily index optimise can up to
4 hours...
Here's the most relevant part of solrconfig.xml:
 <useCompoundFile>true</useCompoundFile>
 <mergeFactor>10</mergeFactor>
 <maxBufferedDocs>1000</maxBufferedDocs>
 <maxMergeDocs>1</maxMergeDocs>
 <maxFieldLength>1</maxFieldLength>

I've given both master and slave 2.5GB of RAM.

Does an index optimise read and re-write the whole thing? If so,
taking about 4 hours is pretty good! However, the documentation here:
http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b
states Optimizations can take nearly ten minutes to run... which
leads me to think that we've grossly misconfigured something...

Firstly, we would obviously love any way to reduce this optimise time
- I have yet to experiment extensively with the settings above, and
optimise frequency, but some general guidance would be great.

Secondly, this index size is increasing monotonically over time and as
we acquire new users. We need to take action to ensure we can scale
in the future. The approach we're favouring at the moment is
horizontal partitioning of indices by user id as our data suits this
scheme well. A given index would hold the indexed data for n users,
where n would probably be between 1 and 100 users, and we will have
multiple indices per search server.

Running server per index is impractical, especially for a small n, so
is a single Solr instance capable of managing multiple searchers and
writers in this way? Following on from that, does anyone know of
limiting factors in Solr or Lucene that would influence our decision
on the value of n - the number of users per index?

Thanks!
James









Re: Strategy for handling large (and growing) index: horizontal partitioning?

2008-02-28 Thread James Brady

Hi Otis,
Thanks for your comments -- I didn't realise the wiki is open to  
editing; my apologies. I've put in a few words to try and clear  
things up a bit.


So determining n will probably be a best guess followed by trial and  
error, that's fine. I'm still not clear about whether single Solr  
servers can operate across several indices, however.. can anyone give  
me some pointers here?
An alternative would be to have 1 index per instance, and n instances  
per server, where n is small. This might actually be a practical  
solution -- I'm spending ~20% of my time committing, so I should  
probably only have 3 or 4 indices in total per server to avoid two  
committing at the same time.


Your mention of The Large Social Network was interesting! A social  
network's data is by definition pretty poorly partitioned by user id,  
so unless they've done something extremely clever like co-locating  
social cliques in the same indices, I would have thought it would be a
sub-optimal architecture. If me and my friends are scattered around  
different indices, each search would have to be federated massively.


James


On 28 Feb 2008, at 20:49, Otis Gospodnetic wrote:


James,

Regarding your questions about n users per index - this is a fine  
approach.  The largest Social Network that you know of uses the  
same approach for various things, including full-text indices (not  
Solr, but close).  You'd have to maintain user-shard/index mapping  
somewhere, of course.  What should the n be, you ask?  Look at the  
overall index size, I'd say, against server capabilities (RAM,  
disk, CPU), increase n up to a point where you're maximizing your  
hardware at some target query rate.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 

From: James Brady [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Wednesday, February 27, 2008 10:08:02 PM
Subject: Strategy for handling large (and growing) index:  
horizontal partitioning?


Hi all,
Our current setup is a master and slave pair on a single machine,
with an index size of ~50GB.

Query and update times are still respectable, but commits are taking
~20% of time on the master, while our daily index optimise can up to
4 hours...
Here's the most relevant part of solrconfig.xml:
 <useCompoundFile>true</useCompoundFile>
 <mergeFactor>10</mergeFactor>
 <maxBufferedDocs>1000</maxBufferedDocs>
 <maxMergeDocs>1</maxMergeDocs>
 <maxFieldLength>1</maxFieldLength>

I've given both master and slave 2.5GB of RAM.

Does an index optimise read and re-write the whole thing? If so,
taking about 4 hours is pretty good! However, the documentation here:
http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b
states Optimizations can take nearly ten minutes to run... which
leads me to think that we've grossly misconfigured something...

Firstly, we would obviously love any way to reduce this optimise time
- I have yet to experiment extensively with the settings above, and
optimise frequency, but some general guidance would be great.

Secondly, this index size is increasing monotonically over time and as
we acquire new users. We need to take action to ensure we can scale
in the future. The approach we're favouring at the moment is
horizontal partitioning of indices by user id as our data suits this
scheme well. A given index would hold the indexed data for n users,
where n would probably be between 1 and 100 users, and we will have
multiple indices per search server.

Running server per index is impractical, especially for a small n, so
is a single Solr instance capable of managing multiple searchers and
writers in this way? Following on from that, does anyone know of
limiting factors in Solr or Lucene that would influence our decision
on the value of n - the number of users per index?

Thanks!
James











Strategy for handling large (and growing) index: horizontal partitioning?

2008-02-27 Thread James Brady

Hi all,
Our current setup is a master and slave pair on a single machine,  
with an index size of ~50GB.


Query and update times are still respectable, but commits are taking  
~20% of time on the master, while our daily index optimise can up to  
4 hours...

Here's the most relevant part of solrconfig.xml:
<useCompoundFile>true</useCompoundFile>
<mergeFactor>10</mergeFactor>
<maxBufferedDocs>1000</maxBufferedDocs>
<maxMergeDocs>1</maxMergeDocs>
<maxFieldLength>1</maxFieldLength>

I've given both master and slave 2.5GB of RAM.

Does an index optimise read and re-write the whole thing? If so,  
taking about 4 hours is pretty good! However, the documentation here:
http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b
states Optimizations can take nearly ten minutes to run... which  
leads me to think that we've grossly misconfigured something...


Firstly, we would obviously love any way to reduce this optimise time  
- I have yet to experiment extensively with the settings above, and  
optimise frequency, but some general guidance would be great.


Secondly, this index size is increasing monotonically over time and as
we acquire new users. We need to take action to ensure we can scale  
in the future. The approach we're favouring at the moment is  
horizontal partitioning of indices by user id as our data suits this  
scheme well. A given index would hold the indexed data for n users,  
where n would probably be between 1 and 100 users, and we will have  
multiple indices per search server.


Running server per index is impractical, especially for a small n, so  
is a single Solr instance capable of managing multiple searchers and
writers in this way? Following on from that, does anyone know of  
limiting factors in Solr or Lucene that would influence our decision  
on the value of n - the number of users per index?


Thanks!
James





Re: will hardlinks work across partitions?

2008-02-24 Thread James Brady

Unfortunately, you cannot hard link across mount points.

Snapshooter uses cp -lr, which, on my Linux machine at least, fails  
with:
cp: cannot create link `/mnt2/myuser/linktest': Invalid cross-device  
link


James

On 23 Feb 2008, at 14:34, Brian Whitman wrote:

Will the hardlink snapshot scheme work across physical disk  
partitions? Can I snapshoot to a different partition than the one  
holding the live solr index?




Bug fix for Solr Python bindings

2008-02-19 Thread James Brady

Hi,
Currently, the solr.py Python binding casts all key and value  
arguments blindly to strings. The following changes deal with Unicode  
properly and respect multi-valued parameters passed in as lists:


131a132,142
>   def __makeField(self, lst, f, v):
>     if not isinstance(f, basestring):
>       f = str(f)
>     if not isinstance(v, basestring):
>       v = str(v)
>     lst.append('<field name="')
>     lst.append(self.escapeKey(f))
>     lst.append('">')
>     lst.append(self.escapeVal(v))
>     lst.append('</field>')
143,147c154,158
<     lst.append('<field name="')
<     lst.append(self.escapeKey(str(f)))
<     lst.append('">')
<     lst.append(self.escapeVal(str(v)))
<     lst.append('</field>')
---
>     if isinstance(v, list): # multi-valued
>       for value in v:
>         self.__makeField(lst, f, value)
>     else:
>       self.__makeField(lst, f, v)

James


Fwd: Performance help for heavy indexing workload

2008-02-12 Thread James Brady

Hi again,
More analysis showed that the extraordinarily long query times only appear when I specify a sort. A concrete example:


For a querystring such as: ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=

The QTime is ~500ms.
For a querystring such as: ?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&sort=date_added%20asc

The QTime is ~75s

I.e. I am using the StandardRequestHandler to search for a user  
entered term (apache above) and filtering by a user_id field.


This seems to be the case for every sort option except score asc and  
score desc. Please tell me Solr doesn't sort all matching documents  
before applying boolean filters?
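
Not an answer given in this thread, but a common restructuring for this kind of query: keep the user's term in q and move the user_id restriction into Solr's fq (filter query) parameter, so the restriction is applied as a separately cached filter rather than as part of the scored query. A sketch of building such a request (the parameter names are standard Solr parameters; the host and field names are placeholders):

import java.net.URLEncoder;

public class QueryUrlBuilder {
    public static String build(String host, String term, int userId, String sortField)
            throws Exception {
        return host + "/solr/select"
            + "?q=" + URLEncoder.encode(term, "UTF-8")
            + "&fq=" + URLEncoder.encode("user_id:" + userId, "UTF-8")
            + "&sort=" + URLEncoder.encode(sortField + " asc", "UTF-8")
            + "&start=0&rows=1&fl=*,score";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build("http://localhost:8983", "apache", 39, "date_added"));
    }
}

Note that the first sorted query after a commit still has to populate Lucene's FieldCache for date_added, which is often where most of that time goes.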


James

Begin forwarded message:


From: James Brady [EMAIL PROTECTED]
Date: 11 February 2008 23:38:16 GMT-08:00
To: solr-user@lucene.apache.org
Subject: Performance help for heavy indexing workload

Hello,
I'm looking for some configuration guidance to help improve  
performance of my application, which tends to do a lot more  
indexing than searching.


At present, it needs to index around two documents / sec - a  
document being the stripped content of a webpage. However,  
performance was so poor that I've had to disable indexing of the  
webpage content as an emergency measure. In addition, some search  
queries take an inordinate length of time - regularly over 60 seconds.


This is running on a medium sized EC2 instance (2 x 2GHz Opterons  
and 8GB RAM), and there's not too much else going on on the box. In  
total, there are about 1.5m documents in the index.


I'm using a fairly standard configuration - the things I've tried  
changing so far have been parameters like maxMergeDocs, mergeFactor  
and the autoCommit options. I'm only using the  
StandardRequestHandler, no faceting. I have a scheduled task  
causing a database commit every 15 seconds.


Obviously, every workload varies, but could anyone comment on  
whether this sort of hardware should, with proper configuration, be  
able to manage this sort of workload?


I can't see signs of Solr being IO-bound, CPU-bound or memory- 
bound, although my scheduled commit operation, or perhaps GC, does  
spike up the CPU utilisation at intervals.


Any help appreciated!
James




Re: Performance help for heavy indexing workload

2008-02-12 Thread James Brady

Hi - thanks to everyone for their responses.

A couple of extra pieces of data which should help me optimise -  
documents are very rarely updated once in the index, and I can throw  
away index data older than 7 days.


So, based on advice from Mike and Walter, it seems my best option  
will be to have seven separate indices. 6 indices will never change  
and hold data from the six previous days. One index will change and  
will hold data from the current day. Deletions and updates will be  
handled by effectively storing a revocation list in the mutable index.


In this way, I will only need to perform Solr commits (yes, I did  
mean Solr commits rather than database commits below - my apologies)  
on the current day's index, and closing and opening new searchers for  
these commits shouldn't be as painful as it is currently.


To do this, I need to work out how to do the following:
- parallel multi search through Solr
- move to a new index on a scheduled basis (probably commit and  
optimise the index at this point)
- ideally, properly warm new searchers in the background to further  
improve search performance on the changing index


Does that sound like a reasonable strategy in general, and has anyone  
got advice on the specific points I raise above?
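
On the "parallel multi search" point: if the seven per-day indices are opened with raw Lucene rather than through Solr, Lucene's ParallelMultiSearcher (present in the 2.x line that Solr ships with) can fan a query out over all of them and merge the results. A rough sketch; the index paths and field name are made up.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TopDocs;

public class SevenDaySearch {
    public static void main(String[] args) throws Exception {
        Searchable[] days = new Searchable[7];
        for (int i = 0; i < 7; i++) {
            days[i] = new IndexSearcher("/indexes/day" + i);  // six frozen + one live index
        }
        ParallelMultiSearcher searcher = new ParallelMultiSearcher(days);

        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        TopDocs hits = searcher.search(parser.parse("apache"), null, 10);
        System.out.println("total hits across all days: " + hits.totalHits);
        searcher.close();
    }
}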


Thanks,
James

On 12 Feb 2008, at 11:45, Mike Klaas wrote:


On 11-Feb-08, at 11:38 PM, James Brady wrote:


Hello,
I'm looking for some configuration guidance to help improve  
performance of my application, which tends to do a lot more  
indexing than searching.


At present, it needs to index around two documents / sec - a  
document being the stripped content of a webpage. However,  
performance was so poor that I've had to disable indexing of the  
webpage content as an emergency measure. In addition, some search  
queries take an inordinate length of time - regularly over 60  
seconds.


This is running on a medium sized EC2 instance (2 x 2GHz Opterons  
and 8GB RAM), and there's not too much else going on on the box.  
In total, there are about 1.5m documents in the index.


I'm using a fairly standard configuration - the things I've tried  
changing so far have been parameters like maxMergeDocs,  
mergeFactor and the autoCommit options. I'm only using the  
StandardRequestHandler, no faceting. I have a scheduled task  
causing a database commit every 15 seconds.


By database commit do you mean solr commit?  If so, that is far  
too frequent if you are sorting on big fields.


I use Solr to serve queries for ~10m docs on a medium size EC2  
instance.  This is an optimized configuration where highlighting is  
broken off into a separate index, and load balanced into two  
subindices of 5m docs a piece.  I do a good deal of faceting but no  
sorting.  The only reason that this is possible is that the index  
is only updated every few days.


On another box we have a several hundred thousand document index  
which is updated relatively frequently (autocommit time: 20s).   
These are merged with the static-er index to create an illusion of  
real-time index updates.


When lucene supports efficient, reopen()able fieldcache updates,
this situation might improve, but the above architecture would  
still probably be better.  Note that the second index can be on the  
same machine.


-Mike




Performance help for heavy indexing workload

2008-02-11 Thread James Brady

Hello,
I'm looking for some configuration guidance to help improve  
performance of my application, which tends to do a lot more indexing  
than searching.


At present, it needs to index around two documents / sec - a document  
being the stripped content of a webpage. However, performance was so  
poor that I've had to disable indexing of the webpage content as an  
emergency measure. In addition, some search queries take an  
inordinate length of time - regularly over 60 seconds.


This is running on a medium sized EC2 instance (2 x 2GHz Opterons and  
8GB RAM), and there's not too much else going on on the box. In  
total, there are about 1.5m documents in the index.


I'm using a fairly standard configuration - the things I've tried  
changing so far have been parameters like maxMergeDocs, mergeFactor  
and the autoCommit options. I'm only using the  
StandardRequestHandler, no faceting. I have a scheduled task causing  
a database commit every 15 seconds.


Obviously, every workload varies, but could anyone comment on whether  
this sort of hardware should, with proper configuration, be able to  
manage this sort of workload?


I can't see signs of Solr being IO-bound, CPU-bound or memory-bound,  
although my scheduled commit operation, or perhaps GC, does spike up  
the CPU utilisation at intervals.


Any help appreciated!
James

Unicode bug in python client code

2008-02-01 Thread James Brady

Hi all,
I was passing Python unicode objects to solr.add and got these sorts of errors:

...
  File "/Users/jamesbrady/Documents/workspace/YelServer/yel/solr.py", line 152, in add
    self.__add(lst,fields)
  File "/Users/jamesbrady/Documents/workspace/YelServer/yel/solr.py", line 146, in __add
    lst.append(self.escapeVal(str(v)))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 30-31: ordinal not in range(128)


Here's a diff which properly checks the object type before calling str 
():

142a143,146
>     if not isinstance(f, basestring):
>       f = str(f)
>     if not isinstance(v, basestring):
>       v = str(v)
144c148
<     lst.append(self.escapeKey(str(f)))
---
>     lst.append(self.escapeKey(f))
146c150
<     lst.append(self.escapeVal(str(v)))
---
>     lst.append(self.escapeVal(v))

Keep up the good work - loving Solr so far!
James