Re: Some new SOLR features

2008-09-19 Thread Noble Paul നോബിള്‍ नोब्ळ्
Why restart Solr? Reloading a core may be sufficient.
SOLR-561 already supports this.
-
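As a sketch of the core-reload route Noble Paul points at, the CoreAdmin handler's RELOAD action can be invoked over HTTP instead of restarting the server. This minimal helper only builds the request URL; the host, port, and core name are assumptions, and the default admin path is assumed to be `/admin/cores`:

```java
// Sketch: trigger a core reload (SOLR-561) via the CoreAdmin handler
// rather than restarting Solr. Host/port/core name are illustrative.
public class CoreReload {
    // Build the CoreAdmin RELOAD URL for a named core.
    static String reloadUrl(String solrBase, String coreName) {
        return solrBase + "/admin/cores?action=RELOAD&core=" + coreName;
    }

    public static void main(String[] args) {
        String url = reloadUrl("http://localhost:8983/solr", "core0");
        System.out.println(url);
        // In a live deployment one would then open the URL, e.g.:
        // new java.net.URL(url).openStream().close();
    }
}
```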


On Thu, Sep 18, 2008 at 5:17 PM, Jason Rutherglen
[EMAIL PROTECTED] wrote:
 Servlets is one thing.  For SOLR the situation is different.  There
 are always small changes people want to make, a new stop word, a small
 tweak to an analyzer.  Rebooting the server for these should not be
 necessary.  Ideally this is handled via a centralized console and
 deployed over the network (using RMI or XML) so that files do not need
 to be deployed.

 On Thu, Sep 18, 2008 at 7:41 AM, Mark Miller [EMAIL PROTECTED] wrote:
 Isn't this done in servlet containers for debugging-type work? Maybe an
 option, but I disagree that this should drive anything in Solr. It should
 really be turned off in production in servlet containers, IMO, as well.

 This can really be such a pain in the ass on a live site... someone touches
 web.xml and the app server reboots. *shudder* Seen it, don't dig it.

 Jason Rutherglen wrote:

 This should be done.  Great idea.

 On Wed, Sep 17, 2008 at 3:41 PM, Lance Norskog [EMAIL PROTECTED] wrote:


 My vote is for dynamically scanning a directory of configuration files.
 When
 a new one appears, or an existing file is touched, load it. When a
 configuration disappears, unload it.  This model works very well for
 servlet
 containers.

 Lance

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
 Seeley
 Sent: Wednesday, September 17, 2008 11:21 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Some new SOLR features

 On Wed, Sep 17, 2008 at 1:27 PM, Jason Rutherglen
 [EMAIL PROTECTED] wrote:


 If the configuration code is going to be rewritten then I would like
 to see the ability to dynamically update the configuration and schema
 without needing to reboot the server.


 Exactly.  Actually, multi-core allows you to instantiate a completely new
 core and swap it for the old one, but it's a bit of a heavyweight
 approach.

 The key is finding the right granularity of change.
 My current thought is that a schema object would not be mutable, but that
 one could easily swap in a new schema object for an index at any time.
  That
 would allow a single request to see a stable view of the schema, while
 preventing having to make every aspect of the schema thread-safe.



 Also I would like the
 configuration classes to just contain data and not have so many
 methods that operate on the filesystem.


 That's the plan... completely separate the serialized and in memory
 representations.



 This way the configuration
 object can be serialized, and loaded by the server dynamically.  It
 would be great for the schema to work the same way.


 Nothing will stop one from using java serialization for config
 persistence,
 however I am a fan of human-readable config files...
 so much easier to debug and support.  Right now, people can cut-n-paste
 relevant parts of their config in email for support, or to a wiki to
 explain
 things, etc.

 Of course, if you are talking about being able to have custom filters or
 analyzers (new classes that don't even exist on the server yet), then it
 does start to get interesting.  This intersects with deployment in
 general... and I'm not sure what the right answer is.
 What if Lucene or Solr needs an upgrade?  It would be nice if that could
 also automatically be handled in a large cluster... what are the
 options
 for handling that?  Is there a role here for OSGi to play?
  It sounds like at least some of that is outside of the Solr domain.

 An alternative to serializing everything would be to ship a new schema
 along
 with a new jar file containing the custom components.

 -Yonik









-- 
--Noble Paul


Re: Some new SOLR features

2008-09-19 Thread Jason Rutherglen
Yes, reloading a core can be used.  I guess the proposal is a way to
update the config and schema files over the network through SOLR
rather than via the filesystem.  This would make grid computing and
schema updates much faster.

On Fri, Sep 19, 2008 at 2:11 AM, Noble Paul നോബിള്‍ नोब्ळ्
[EMAIL PROTECTED] wrote:
 Why restart Solr? Reloading a core may be sufficient.
 SOLR-561 already supports this.
 -


 --
 --Noble Paul



Re: Some new SOLR features

2008-09-18 Thread Jason Rutherglen
Hi Yonik,

One approach I have been working on, and will integrate into SOLR, is
the ability to use serialized objects for the analyzers so that the
schema can be defined on the client side if need be.  The analyzer
classes will be dynamically loaded.  Alternatively, there is no need
for a schema at all, and plain Java objects can be defined and used.

I'd like to see the synonyms serialized as well.  When I mentioned
serialization, it was in regard to setting the configuration over the
Hadoop RMI (LUCENE-1336) protocol.  Instead of creating a new method
for each call one wants, the easiest approach in distributed computing
is to load a dynamic class that operates directly on SolrCore and
so can do whatever is necessary to get the work completed.  Creating
new methods in distributed computing is always a bad idea, IMO.
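The "ship a dynamic class instead of adding RPC methods" idea above can be sketched with a serializable command object. A stand-in interface replaces SolrCore, and all names here are illustrative, not Solr or Hadoop API:

```java
import java.io.*;
import java.util.*;

public class RemoteCommandDemo {
    // Stand-in for SolrCore; the real class has far more surface area.
    interface CoreLike { void setStopWords(Set<String> words); }

    // One generic entry point: new behaviors need no new RPC methods.
    interface CoreCommand extends Serializable { void execute(CoreLike core); }

    // Example command: install a new stop word without a server restart.
    static class AddStopWord implements CoreCommand {
        private final String word;
        AddStopWord(String word) { this.word = word; }
        public void execute(CoreLike core) {
            core.setStopWords(Collections.singleton(word));
        }
    }

    // Round-trip a command through bytes, as a network transport would.
    static CoreCommand roundTrip(CoreCommand cmd) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(cmd);
        oos.close();
        ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        return (CoreCommand) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        final Set<String>[] received = new Set[1];
        CoreLike core = words -> received[0] = words;
        roundTrip(new AddStopWord("the")).execute(core);
        System.out.println(received[0]);
    }
}
```

The point of the sketch is that only `CoreCommand` crosses the wire; new behaviors are new command classes, not new protocol methods.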

In realtime indexing one will not be able to simply reindex all the
time, so either a dynamic schema or no schema at all is best.
Otherwise the documents would need a schemaVersion field, which gets
messy (I have looked into this).

Jason

On Wed, Sep 17, 2008 at 5:10 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
 On Wed, Sep 17, 2008 at 4:50 PM, Henrib [EMAIL PROTECTED] wrote:
 Yonik Seeley wrote:

 ...multi-core allows you to instantiate a completely
 new core and swap it for the old one, but it's a bit of a heavyweight
 approach
 ...a schema object would not be mutable, but
 that one could easily swap in a new schema object for an index at any
 time...


 Not sure I understand what we gain; if you change the schema, you'll most
 likely have to reindex as well.

 That's management at a higher level in a way.
 There are enough ways that one could change the schema in a compatible
 way (say like just adding query-time synonyms, etc) that it does seem
 like we should permit it.

 Or are you saying we should have a shortcut for the whole operation of
 creating a new core, reindexing content, and replacing an existing core?

 Eventually, it seems like we should be able to handle re-indexing when
 necessary.
 And we should consider the ability to change some config without
 necessarily reloading *everything*.

 -Yonik



Re: Some new SOLR features

2008-09-18 Thread Jason Rutherglen
This should be done.  Great idea.

On Wed, Sep 17, 2008 at 3:41 PM, Lance Norskog [EMAIL PROTECTED] wrote:
 My vote is for dynamically scanning a directory of configuration files. When
 a new one appears, or an existing file is touched, load it. When a
 configuration disappears, unload it.  This model works very well for servlet
 containers.

 Lance
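Lance's directory-scanning model might be sketched as a simple polling scanner: a hypothetical helper, not Solr code, which reports new or touched files and forgets deleted ones (a real implementation might use `java.nio` file watching instead):

```java
import java.io.File;
import java.util.*;

// Poll a config directory: report files that appeared or were touched
// since the last scan, and drop state for files that disappeared.
public class ConfigDirScanner {
    private final Map<String, Long> lastSeen = new HashMap<>();

    // Returns the names of new or modified files.
    public List<String> scan(File dir) {
        List<String> changed = new ArrayList<>();
        Set<String> present = new HashSet<>();
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                present.add(f.getName());
                Long prev = lastSeen.get(f.getName());
                if (prev == null || prev != f.lastModified()) {
                    changed.add(f.getName());      // load or reload this config
                    lastSeen.put(f.getName(), f.lastModified());
                }
            }
        }
        lastSeen.keySet().retainAll(present);      // unload vanished configs
        return changed;
    }

    public static void main(String[] args) throws Exception {
        File dir = java.nio.file.Files.createTempDirectory("solrconf").toFile();
        new File(dir, "synonyms.txt").createNewFile();
        ConfigDirScanner scanner = new ConfigDirScanner();
        System.out.println(scanner.scan(dir));     // first pass sees the new file
        System.out.println(scanner.scan(dir));     // nothing changed: empty list
    }
}
```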





Re: Some new SOLR features

2008-09-18 Thread Jason Rutherglen
 That would allow a single request to see a stable view of the
 schema, while preventing having to make every aspect of the schema
 thread-safe.

Yes that is the best approach.

 Nothing will stop one from using java serialization for config
 persistence,

Persistence should not use serialization.  Serialization is for
transporting the configuration over the wire for automated upgrades.
This could be done in XML as well, but it would be good to support
both models.

 Is there a role here for OSGi to play?

Yes.  Eclipse uses OSGi successfully, and for grid computing in Java,
taking advantage of what Java can do with dynamic classloading, OSGi
is the way to go.  Every search project I have worked on needs this
stuff to be far easier than it is now.  The current distributed
computing model in SOLR may work, but it will not work reliably and
will break a lot.  When it does break, there is no way to know what
happened, and this will create excessive downtime for users.  I have
had excessive downtime in production even with the current simple
master-slave architecture because there is no failover.  Failover
should be in the current system because it would be easy to implement
with the rsync-based batch replication.



Re: Some new SOLR features

2008-09-18 Thread Jason Rutherglen
Servlets is one thing.  For SOLR the situation is different.  There
are always small changes people want to make, a new stop word, a small
tweak to an analyzer.  Rebooting the server for these should not be
necessary.  Ideally this is handled via a centralized console and
deployed over the network (using RMI or XML) so that files do not need
to be deployed.

On Thu, Sep 18, 2008 at 7:41 AM, Mark Miller [EMAIL PROTECTED] wrote:
 Isn't this done in servlet containers for debugging-type work? Maybe an
 option, but I disagree that this should drive anything in Solr. It should
 really be turned off in production in servlet containers, IMO, as well.

 This can really be such a pain in the ass on a live site... someone touches
 web.xml and the app server reboots. *shudder* Seen it, don't dig it.








Re: Some new SOLR features

2008-09-18 Thread Mark Miller
Dynamic changes are not what I'm against... I'm against dynamic changes
that are triggered by the app noticing that the config has changed.


Jason Rutherglen wrote:

Servlets is one thing.  For SOLR the situation is different.  There
are always small changes people want to make, a new stop word, a small
tweak to an analyzer.  Rebooting the server for these should not be
necessary.  Ideally this is handled via a centralized console and
deployed over the network (using RMI or XML) so that files do not need
to be deployed.





Re: Some new SOLR features

2008-09-18 Thread Jason Rutherglen
Yes, so it's probably best to make the changes through a remote
interface so that the app can make the appropriate internal changes.
File-based system changes are less than ideal, agreed; however, I
suppose with an open source project such as SOLR the kitchen-sink
effect happens and it will find its way in there anyway.  The hard
part is organizing the project so that it does not get too bloated
with everyone's features, and so that features can be pluggable
outside of the core releases.  There are many things that may be best
as contrib modules, implemented as OSGi-based add-ons rather than
placed into the standard releases (I don't have any particular ones in
mind).  The standard for SOLR contribs could be OSGi.  This would
greatly assist SOLR in becoming grid-computing friendly.  Ideally
SOLR 2.0 would be cleaner, standardized, and most of the features
pluggable.  That would allow for consistent release cycles and make
grid computing simpler to implement.  SOLR seems like it could be
heading toward bloat, which could increasingly confuse new users.
Instead, people could implement their own modules and upload them to
the contrib section, or implement their own proprietary ones.

I am curious: what is the recommended place to put query expansion
code (such as adding boosting, adding phrase queries, and so on)?  Is
it now best to use a SearchComponent?  Would it be possible in the
future to make SearchComponents OSGi-enabled?

On Thu, Sep 18, 2008 at 7:56 AM, Mark Miller [EMAIL PROTECTED] wrote:
 Dynamic changes are not what I'm against... I'm against dynamic changes
 that are triggered by the app noticing that the config has changed.


Re: Some new SOLR features

2008-09-17 Thread Yonik Seeley
On Tue, Sep 16, 2008 at 10:12 AM, Jason Rutherglen
[EMAIL PROTECTED] wrote:
  SQL database such as H2
 Mainly to offer joins and be able to perform hierarchical queries.

Can you define or give an example of what you mean by hierarchical queries?
A downside of any type of cross-document queries (like joins) is that
it tends to limit scalability.  Of course, I think it's acceptable to
have some query types that only work on a single shard, since that may
continue to cover the majority of users.

Along the same lines, I think it would be useful to have a highly
integrated extension point for stored fields (so they could be
retrieved from external systems if needed).

-Yonik
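The "highly integrated extension point for stored fields" Yonik floats might look roughly like the sketch below: stored values resolved through a pluggable source so they could come from an external system. The interface and names are hypothetical, not Solr's actual API:

```java
import java.util.HashMap;
import java.util.Map;

public class StoredFieldSourceDemo {
    // Hypothetical extension point: resolve a stored field for a document.
    interface StoredFieldSource {
        String get(int docId, String fieldName);
    }

    // In-memory stand-in for an external store (database, cache, etc.).
    static class MapSource implements StoredFieldSource {
        private final Map<String, String> data = new HashMap<>();
        void put(int docId, String field, String value) {
            data.put(docId + "/" + field, value);
        }
        public String get(int docId, String field) {
            return data.get(docId + "/" + field);
        }
    }

    public static void main(String[] args) {
        MapSource src = new MapSource();
        src.put(7, "title", "Some new SOLR features");
        System.out.println(src.get(7, "title"));
        // prints: Some new SOLR features
    }
}
```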


Re: Some new SOLR features

2008-09-17 Thread Jason Rutherglen
If the configuration code is going to be rewritten then I would like
to see the ability to dynamically update the configuration and schema
without needing to reboot the server.  Also I would like the
configuration classes to just contain data and not have so many
methods that operate on the filesystem.  This way the configuration
object can be serialized, and loaded by the server dynamically.  It
would be great for the schema to work the same way.

Yonik, what is the best way to get this type of thing going?  Where
in the code do you want to implement the distributed RMI Hadoop stuff?

On Tue, Sep 16, 2008 at 1:07 PM, Henrib [EMAIL PROTECTED] wrote:



 ryantxu wrote:


 Yes, include would get us some of the way there, but not far enough
 (IMHO).  The problem is that (as written) you still need to have all
 the configs spattered about various directories.



 It does not allow us to go *all* the way, but it does allow putting
 configuration files in one directory (plus schema & conf can have specific
 names set for each CoreDescriptor).
 There is actually a test where the config & schema are shared & the
 dataDir can be set as a property.
 Still a step forward...
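For illustration, a multicore solr.xml along the lines Henrib describes, with a shared config directory and a per-core dataDir property. The exact element and attribute names here are assumptions about the Solr 1.3/1.4 format and may differ from the shipped syntax:

```xml
<!-- Hypothetical multicore config: two cores sharing one conf directory,
     each with its own dataDir supplied as a property. -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="shared">
      <property name="dataDir" value="/var/data/core0"/>
    </core>
    <core name="core1" instanceDir="shared">
      <property name="dataDir" value="/var/data/core1"/>
    </core>
  </cores>
</solr>
```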





Re: Some new SOLR features

2008-09-17 Thread Jason Rutherglen
  Can you define or give an example of what you mean by hierarchical queries?

Good question; I think Erik Hatcher had more ideas on that.  I was
imagining joins or subqueries like SQL offers.  Clearly they won't be
efficient, but it may be easier than implementing joins natively in
SOLR (or is it?).

Joins limit scalability, that is true; I guess that's just the nature
of the operation, unless there is some other way to do it.  Doesn't
Oracle implement some sort of distributed join in their clustering
solution?  Is it worth it?

On Wed, Sep 17, 2008 at 12:25 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
 On Tue, Sep 16, 2008 at 10:12 AM, Jason Rutherglen
 [EMAIL PROTECTED] wrote:
  SQL database such as H2
 Mainly to offer joins and be able to perform hierarchical queries.

 Can you define or give an example of what you mean by hierarchical queries?
 A downside of any type of cross-document queries (like joins) is that
 it tends to limit scalability.  Of course, I think it's acceptable to
 have some query types that only work on a single shard, since that may
 continue to cover the majority of users.

 Along the same lines, I think it would be useful to have a highly
 integrated extension point for stored fields (so they could be
 retrieved from external systems if needed).

 -Yonik



Re: Some new SOLR features

2008-09-17 Thread Yonik Seeley
On Wed, Sep 17, 2008 at 1:27 PM, Jason Rutherglen
[EMAIL PROTECTED] wrote:
 If the configuration code is going to be rewritten then I would like
 to see the ability to dynamically update the configuration and schema
 without needing to reboot the server.

Exactly.  Actually, multi-core allows you to instantiate a completely
new core and swap it for the old one, but it's a bit of a heavyweight
approach.

The key is finding the right granularity of change.
My current thought is that a schema object would not be mutable, but
that one could easily swap in a new schema object for an index at any
time.  That would allow a single request to see a stable view of the
schema, while preventing having to make every aspect of the schema
thread-safe.
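The swap-not-mutate idea can be sketched in plain Java (class names here are hypothetical illustrations, not actual Solr API): a request pins one immutable snapshot for its lifetime, while an admin thread can publish a replacement atomically at any point.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical immutable schema object: never mutated in place, so no
// per-field locking is needed.
final class IndexSchemaSnapshot {
    private final Map<String, String> fieldTypes;

    IndexSchemaSnapshot(Map<String, String> fieldTypes) {
        this.fieldTypes = Map.copyOf(fieldTypes);  // defensive, immutable copy
    }

    String fieldType(String field) {
        return fieldTypes.get(field);
    }
}

final class SchemaHolder {
    private final AtomicReference<IndexSchemaSnapshot> current = new AtomicReference<>();

    SchemaHolder(IndexSchemaSnapshot initial) {
        current.set(initial);
    }

    // A request pins one snapshot for its whole lifetime: a stable view,
    // even if another thread swaps in a new schema mid-request.
    IndexSchemaSnapshot snapshotForRequest() {
        return current.get();
    }

    // Publish a wholly new schema; in-flight requests keep their old view.
    void swap(IndexSchemaSnapshot replacement) {
        current.set(replacement);
    }
}
```

The design choice here is exactly what Yonik describes: individual snapshots need no thread-safety because they are immutable, and only the single reference swap is a concurrency point.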

 Also I would like the
 configuration classes to just contain data and not have so many
 methods that operate on the filesystem.

That's the plan... completely separate the serialized and in memory
representations.

 This way the configuration
 object can be serialized, and loaded by the server dynamically.  It
 would be great for the schema to work the same way.

Nothing will stop one from using java serialization for config
persistence; however, I am a fan of human-readable config files...
so much easier to debug and support.  Right now, people can
cut-n-paste relevant parts of their config in email for support, or to
a wiki to explain things, etc.

Of course, if you are talking about being able to have custom filters
or analyzers (new classes that don't even exist on the server yet),
then it does start to get interesting.  This intersects with
deployment in general... and I'm not sure what the right answer is.
What if Lucene or Solr needs an upgrade?  It would be nice if that
could also automatically be handled in a large cluster... what are
the options for handling that?  Is there a role here for OSGi to play?
 It sounds like at least some of that is outside of the Solr domain.

An alternative to serializing everything would be to ship a new schema
along with a new jar file containing the custom components.

-Yonik


RE: Some new SOLR features

2008-09-17 Thread Lance Norskog
My vote is for dynamically scanning a directory of configuration files. When
a new one appears, or an existing file is touched, load it. When a
configuration disappears, unload it.  This model works very well for servlet
containers.

Lance
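As a rough illustration of the model Lance describes (all names invented, not real Solr code), one polling pass over a directory listing classifies each config file as load, reload, or unload:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a config-directory watcher: load new files, reload touched
// ones, unload removed ones.  A real impl would act on cores instead of
// returning event strings.
final class ConfigDirWatcher {
    private final Map<String, Long> seen = new HashMap<>();  // name -> last mtime

    // One polling pass over the current directory listing (name -> mtime).
    List<String> poll(Map<String, Long> listing) {
        List<String> events = new ArrayList<>();
        for (Map.Entry<String, Long> e : listing.entrySet()) {
            Long prev = seen.get(e.getKey());
            if (prev == null) {
                events.add("load " + e.getKey());        // new file appeared
            } else if (!prev.equals(e.getValue())) {
                events.add("reload " + e.getKey());      // existing file touched
            }
        }
        for (String name : new ArrayList<>(seen.keySet())) {
            if (!listing.containsKey(name)) {
                events.add("unload " + name);            // file disappeared
            }
        }
        seen.clear();
        seen.putAll(listing);
        return events;
    }
}
```

A background thread would call `poll` with the directory's `File.lastModified()` values every few seconds, which is essentially how servlet containers hot-deploy from a webapps directory.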

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Wednesday, September 17, 2008 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: Some new SOLR features

On Wed, Sep 17, 2008 at 1:27 PM, Jason Rutherglen
[EMAIL PROTECTED] wrote:
 If the configuration code is going to be rewritten then I would like 
 to see the ability to dynamically update the configuration and schema 
 without needing to reboot the server.

Exactly.  Actually, multi-core allows you to instantiate a completely new
core and swap it for the old one, but it's a bit of a heavyweight approach.

The key is finding the right granularity of change.
My current thought is that a schema object would not be mutable, but that
one could easily swap in a new schema object for an index at any time.  That
would allow a single request to see a stable view of the schema, while
preventing having to make every aspect of the schema thread-safe.

 Also I would like the
 configuration classes to just contain data and not have so many 
 methods that operate on the filesystem.

That's the plan... completely separate the serialized and in memory
representations.

 This way the configuration
 object can be serialized, and loaded by the server dynamically.  It 
 would be great for the schema to work the same way.

Nothing will stop one from using java serialization for config persistence;
however, I am a fan of human-readable config files...
so much easier to debug and support.  Right now, people can cut-n-paste
relevant parts of their config in email for support, or to a wiki to explain
things, etc.

Of course, if you are talking about being able to have custom filters or
analyzers (new classes that don't even exist on the server yet), then it
does start to get interesting.  This intersects with deployment in
general... and I'm not sure what the right answer is.
What if Lucene or Solr needs an upgrade?  It would be nice if that could
also automatically be handled in a large cluster... what are the options
for handling that?  Is there a role here for OSGi to play?
 It sounds like at least some of that is outside of the Solr domain.

An alternative to serializing everything would be to ship a new schema along
with a new jar file containing the custom components.

-Yonik



Re: Some new SOLR features

2008-09-17 Thread Yonik Seeley
On Wed, Sep 17, 2008 at 4:50 PM, Henrib [EMAIL PROTECTED] wrote:
 Yonik Seeley wrote:

 ...multi-core allows you to instantiate a completely
 new core and swap it for the old one, but it's a bit of a heavyweight
 approach
 ...a schema object would not be mutable, but
 that one could easily swap in a new schema object for an index at any
 time...


 Not sure I understand what we gain; if you change the schema, you'll most
 likely have to reindex as well.

That's management at a higher level in a way.
There are enough ways that one could change the schema in a compatible
way (say like just adding query-time synonyms, etc) that it does seem
like we should permit it.

 Or are you saying we should have a shortcut for the
 whole operation of
 creating a new core, reindexing content, and replacing an existing core?

Eventually, it seems like we should be able to handle re-indexing when
necessary.
And we should consider the ability to change some config without
necessarily reloading *everything*.

-Yonik


Re: Some new SOLR features

2008-09-16 Thread Jason Rutherglen
Hello Ryan,

  SQL database such as H2

Mainly to offer joins and be able to perform hierarchical queries.
Also any other types of queries a hybrid SQL search system would
offer.  This is something that is best built into SOLR rather than
Lucene.  It seems like a lot of the users of SOLR work with SQL
databases as well.  It would seem natural to integrate the two.  Also
the Summize realtime search system that Twitter purchased worked by
integrating with MySQL.  The way to do something similar in Lucene
would be to integrate with a Java SQL database.  Also hierarchical
queries could be performed faster using this method (though I could be
wrong, if there is a better way).
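As a sketch of the kind of query this would enable (a hypothetical, plain-Java stand-in for what a hybrid SQL+search backend like H2 would answer natively), a parent/child "join" amounts to a SQL EXISTS subquery over two record sets:

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical records standing in for indexed documents.
record Parent(String id, String title) {}
record Child(String parentId, String tag) {}

final class JoinSketch {
    // Equivalent SQL:
    //   SELECT p.* FROM parents p WHERE EXISTS
    //     (SELECT 1 FROM children c WHERE c.parentId = p.id AND <match>)
    static List<Parent> parentsWithMatchingChild(List<Parent> parents,
                                                 List<Child> children,
                                                 Predicate<Child> match) {
        // Collect ids of parents that have at least one matching child...
        Set<String> hits = children.stream()
                .filter(match)
                .map(Child::parentId)
                .collect(Collectors.toSet());
        // ...then keep only those parents.
        return parents.stream()
                .filter(p -> hits.contains(p.id()))
                .collect(Collectors.toList());
    }
}
```

This is the operation that is awkward to express over a flat Lucene index but trivial for a SQL engine, which is the crux of the H2 proposal.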

 to have multiple lucene indexes within a single SolrCore?

I don't like the whole multicore thing from an administrative
perspective.  That means each index needs a separate schema and
configuration etc.  That becomes hard to manage if there are 10+
indexes required, and is definitely not as simple as an SQL database,
which does not require so many separate directories and manual
configuration.  It would be simple to add this into SOLR.  In general,
though, I have trouble figuring out many of the design decisions of
SOLR, and so hesitate to implement things that seem to go
against the SOLR design model (is there one?).

 9. Distributed search and updates using an object serialization which

Where would I start with integrating this into SOLR?  Need some help
on that part of it.  Tell me what's best and I'll integrate it, it
should be the easiest on the list.

Jason

On Mon, Sep 15, 2008 at 11:44 AM, Ryan McKinley [EMAIL PROTECTED] wrote:


 Here are my gut reactions to this list... in general, most of this comes
 down to sounds great, if someone did the work I'm all for it!

 Also, no need to post to solr-user AND solr-dev, probably better to think of
 solr-user as a superset of solr-dev.


 1. Machine learning based suggest feature
 https://issues.apache.org/jira/browse/LUCENE-626 which is implemented
 in a way similar to what Google does in their suggest implementation.  The
 fuzzy-based spellchecker is ok, but it would be better to incorporate
 user behavior.
 2. Realtime updates https://issues.apache.org/jira/browse/LUCENE-1313
 and work being planned for IndexWriter
 3. Realtime untokenized field updates
 https://issues.apache.org/jira/browse/LUCENE-1292

 Without knowing the details of these patches, everything sounds great.

 In my view, SOLR should offer a nice interface to anything in lucene
 core/contrib


 4. BM25 Scoring

 Again, no idea, but if implemented in lucene, yes


 5. Integration with an open source SQL database such as H2.  This
 would mean under the hood, SOLR would enable storing data in a
 relational database to allow for joins and things.  It would need to
 be combined with realtime updates.  H2 has Lucene integration, but it
 is the usual index-everything-at-once, non-incremental approach.  The new
 system would simply index as a new row in a table is added.  The SOLR
 schema could allow for certain fields being stored in an SQL database.

 Sounds interesting -- what is the basic problem you are addressing?

 (It seems you are pointing to something specific, and describing your
 solution)



 6. SOLR schema allowing for multiple indexes without using the
 multicore.  The indexes could be defined like SQL tables in the
 schema.xml file.

 Is this just a configuration issue?  I definitely hope we can make
 configuration easier in the future.

 As is, a custom handler can look at multiple indexes... why is there a need
 to have multiple lucene indexes within a single SolrCore?



 6. Crowding feature ala GBase
 http://code.google.com/apis/base/attrs-queries.html#crowding which is
 similar to Field Collapsing.  I am thinking it is advantageous from a
 performance perspective to obtain an excessive amount of results, then
 filter down the result set, rather than first sort a result set.

 Again, sounds great!  I would love to see it.


 7. Improved relevance based on user clicks of individual query results
 for individual queries.  This can be thought of as similar to what
 Digg does.  I'm sure Google does something similar.  It is a feature
 that would be of value to almost any SOLR implementation.

 Agreed -- if there is a good way to quickly update a field used for
 sorting/scoring, this would happen


 8. Integration of LocalSolr into the standard SOLR distribution.
 Location is something many sites use these days and is standard in
 GBase and most likely other products like FAST.

 I'm working on it & it will be a lucene contrib package and cooked into the
 core solr distribution.



 9. Distributed search and updates using an object serialization which
 could use.  https://issues.apache.org/jira/browse/LUCENE-1336  This
 allows span queries, custom payload queries, custom similarities,
 custom analyzers, without compiling and deploying and a new SOLR war
 file to individual servers.


 sounds good (but I have no technical basis to say so)

Re: Some new SOLR features

2008-09-16 Thread Ryan McKinley


On Sep 16, 2008, at 10:12 AM, Jason Rutherglen wrote:


Hello Ryan,


SQL database such as H2


Mainly to offer joins and be able to perform hierarchical queries.
Also any other types of queries a hybrid SQL search system would
offer.  This is something that is best built into SOLR rather than
Lucene.  It seems like a lot of the users of SOLR work with SQL
databases as well.  It would seem natural to integrate the two.  Also
the Summize realtime search system that Twitter purchased worked by
integrating with Mysql.  The way to do something similar in Lucene
would be to integrate with a Java SQL database.  Also hierarchical
queries could be performed faster using this method (though I could be
wrong, if there is a better way).



Definitely sounds interesting -- not on my personal TODO list, but I
can see the value and would support this direction (perhaps as a
contrib?)
For starters, it seems like everything could happen in a custom  
RequestHandler  (perhaps QueryComponent?)




to have multiple lucene indexes within a single SolrCore?


I don't like the whole multicore thing from an administrative
perspective.  That means each index needs a separate schema and
configuration etc.  That becomes hard to manage if there are 10+
indexes required, and is definitely not as simple as an SQL database,
which does not require so many separate directories and manual
configuration.


I 100% agree that multicore configuration gets unwieldy quickly.
That said, what I'm hearing from you is that the config is problematic,
not that you really need multiple lucene indexes in the same SolrCore.


FYI -- the name SolrCore is perhaps legacy from when it was static
and had access to the only index available.  With MultiCore we
removed all the static access and each lucene index gets a SolrCore.
Maybe better to think of SolrCore as SolrIndex -- everything you can
do with one index.


Yes, I would like to see a way to specify all the fieldtypes /  
handlers in one location and then only specify what fields are  
available for each core.


So yes -- I agree.  In 2.0, I hope to flush out configs so they are  
not monstrous.




 It would be simple to add this into SOLR.  In general,
though, I have trouble figuring out many of the design decisions of
SOLR, and so hesitate to implement things that seem to go
against the SOLR design model (is there one?).



The 1.X line is organic growth from an internal CNET architecture.
I hope the 2.X line will have more consistent design model...

As far as getting around the existing multicore configs, I do this
in my code by overriding:


  protected CoreContainer.Initializer createInitializer() {
    return new CoreContainer.Initializer();
  }
in SolrDispatchFilter.

I actually initialize the CoreContainer manually (pulling some info  
from a SQL database)




9. Distributed search and updates using an object serialization which


Where would I start with integrating this into SOLR?  Need some help
on that part of it.  Tell me what's best and I'll integrate it, it
should be the easiest on the list.



not sure ;)  Distributed search is one of the areas I have not looked  
at


ryan



Re: Some new SOLR features

2008-09-16 Thread Henrib



ryantxu wrote:
 
 ...
 Yes, I would like to see a way to specify all the fieldtypes /   
 handlers in one location and then only specify what fields are   
 available for each core. 
 
 So yes -- I agree.  In 2.0, I hope to flush out configs so they are   
 not monstrous. 
 ...
 

What about using include so each core can have a minimal specific
configuration and schema & everything else shared between them?
Something akin to what's allowed by SOLR-646.
Just couldn't resist :-)
Henri

-- 
View this message in context: 
http://www.nabble.com/Some-new-SOLR-features-tp19494251p19515526.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Some new SOLR features

2008-09-16 Thread Ryan McKinley


ryantxu wrote:


...
Yes, I would like to see a way to specify all the fieldtypes /
handlers in one location and then only specify what fields are
available for each core.

So yes -- I agree.  In 2.0, I hope to flush out configs so they are
not monstrous.
...



What about using include so each core can have a minimal specific
configuration and schema & everything else shared between them?
Something akin to what's allowed by SOLR-646.
Just couldn't resist :-)
Henri



somehow I knew that was coming :)

Yes, include would get us some of the way there, but not far enough
(IMHO).  The problem is that (as written) you still need to have all
the configs scattered across various directories.



ryan


Re: Some new SOLR features

2008-09-16 Thread Henrib



ryantxu wrote:
 
 
 Yes, include would get us some of the way there, but not far enough
 (IMHO).  The problem is that (as written) you still need to have all
 the configs scattered across various directories.
 
 

It does not allow us to go *all* the way, but it does allow putting
configuration files in one directory (plus schema & conf can have specific
names set for each CoreDescriptor).
There actually is a test where the config & schema are shared & can set the
dataDir as a property.
Still a step forward...

-- 
View this message in context: 
http://www.nabble.com/Some-new-SOLR-features-tp19494251p19516242.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Some new SOLR features

2008-09-15 Thread Ryan McKinley




Here are my gut reactions to this list... in general, most of this  
comes down to sounds great, if someone did the work I'm all for it!


Also, no need to post to solr-user AND solr-dev, probably better to  
think of solr-user as a superset of solr-dev.




1. Machine learning based suggest feature
https://issues.apache.org/jira/browse/LUCENE-626 which is implemented
in a way similar to what Google does in their suggest implementation.  The
fuzzy-based spellchecker is ok, but it would be better to incorporate
user behavior.
2. Realtime updates https://issues.apache.org/jira/browse/LUCENE-1313
and work being planned for IndexWriter
3. Realtime untokenized field updates
https://issues.apache.org/jira/browse/LUCENE-1292


Without knowing the details of these patches, everything sounds great.

In my view, SOLR should offer a nice interface to anything in lucene  
core/contrib




4. BM25 Scoring


Again, no idea, but if implemented in lucene, yes



5. Integration with an open source SQL database such as H2.  This
would mean under the hood, SOLR would enable storing data in a
relational database to allow for joins and things.  It would need to
be combined with realtime updates.  H2 has Lucene integration, but it
is the usual index-everything-at-once, non-incremental approach.  The new
system would simply index as a new row in a table is added.  The SOLR
schema could allow for certain fields being stored in an SQL database.


Sounds interesting -- what is the basic problem you are addressing?

(It seems you are pointing to something specific, and describing your  
solution)





6. SOLR schema allowing for multiple indexes without using the
multicore.  The indexes could be defined like SQL tables in the
schema.xml file.


Is this just a configuration issue?  I definitely hope we can make
configuration easier in the future.


As is, a custom handler can look at multiple indexes... why is there a
need to have multiple lucene indexes within a single SolrCore?





6. Crowding feature ala GBase
http://code.google.com/apis/base/attrs-queries.html#crowding which is
similar to Field Collapsing.  I am thinking it is advantageous from a
performance perspective to obtain an excessive amount of results, then
filter down the result set, rather than first sort a result set.


Again, sounds great!  I would love to see it.
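The over-fetch-then-collapse idea quoted above can be sketched generically (hypothetical names; not Solr's eventual field-collapsing API): fetch an excessive ranked list, then keep at most k hits per collapse key while preserving rank order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of "crowding": given an over-fetched, best-first ranked list,
// keep at most maxPerGroup hits per collapse key (e.g. per site).
final class Crowding {
    static <T, K> List<T> collapse(List<T> ranked, Function<T, K> key, int maxPerGroup) {
        Map<K, Integer> counts = new HashMap<>();
        List<T> out = new ArrayList<>();
        for (T hit : ranked) {               // e.g. 10x the page size was fetched
            K k = key.apply(hit);
            int n = counts.merge(k, 1, Integer::sum);
            if (n <= maxPerGroup) {
                out.add(hit);                // under the group's quota: keep it
            }
        }
        return out;
    }
}
```

The performance argument in the quoted item is that this pass is a cheap linear filter over an already-sorted list, rather than a re-sort of the full result set.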



7. Improved relevance based on user clicks of individual query results
for individual queries.  This can be thought of as similar to what
Digg does.  I'm sure Google does something similar.  It is a feature
that would be of value to almost any SOLR implementation.


Agreed -- if there is a good way to quickly update a field used for  
sorting/scoring, this would happen




8. Integration of LocalSolr into the standard SOLR distribution.
Location is something many sites use these days and is standard in
GBase and most likely other products like FAST.


I'm working on it & it will be a lucene contrib package and cooked
into the core solr distribution.





9. Distributed search and updates using an object serialization which
could use.  https://issues.apache.org/jira/browse/LUCENE-1336  This
allows span queries, custom payload queries, custom similarities,
custom analyzers, without compiling and deploying and a new SOLR war
file to individual servers.



sounds good (but I have no technical basis to say so)


ryan