Re: Some new SOLR features
Why restart Solr at all? Reloading a core may be sufficient; SOLR-561 already supports this.

On Thu, Sep 18, 2008 at 5:17 PM, Jason Rutherglen [EMAIL PROTECTED] wrote:
> Servlets are one thing; for Solr the situation is different. There are always small changes people want to make: a new stop word, a small tweak to an analyzer. Rebooting the server for these should not be necessary. Ideally this is handled via a centralized console and deployed over the network (using RMI or XML) so that files do not need to be deployed.

--
--Noble Paul
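For reference, reloading a core without a restart goes through the CoreAdmin handler in multicore Solr. A rough invocation looks like the following; the host, port, and core name (`core0`) are assumptions for the example:

```shell
# Ask the CoreAdmin handler to reload core0 in place (no JVM restart);
# the core re-reads its solrconfig.xml and schema.xml from disk.
curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0'
```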
Re: Some new SOLR features
Yes, reloading a core can be used. I guess the proposal is a way to update the config and schema files over the network through Solr, rather than via the filesystem. This will make grid computing and schema updates much faster.

On Fri, Sep 19, 2008 at 2:11 AM, Noble Paul നോബിള്‍ नोब्ळ् [EMAIL PROTECTED] wrote:
> Why restart Solr? Reloading a core may be sufficient; SOLR-561 already supports this.
Re: Some new SOLR features
Hi Yonik,

One approach I have been working on, and will integrate into Solr, is the ability to use serialized objects for the analyzers so that the schema can be defined on the client side if need be. The analyzer classes will be dynamically loaded. Alternatively, there is no need for a schema at all and plain Java objects can be defined and used. I'd like to see the synonyms serialized as well.

When I mentioned serialization, it was in regard to setting the configuration over the Hadoop RMI (LUCENE-1336) protocol. Instead of creating methods for each new call one wants, the easiest approach in distributed computing is to have a dynamically loaded class that operates directly on SolrCore and so can do whatever is necessary to get the work completed. Creating new methods in distributed computing is always a bad idea IMO.

In realtime indexing one will not be able to simply reindex all the time, so either a dynamic schema, or no schema at all, is best. Otherwise the documents would need a schemaVersion field, and that gets messy; I have looked at this.

Jason

On Wed, Sep 17, 2008 at 5:10 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
> On Wed, Sep 17, 2008 at 4:50 PM, Henrib [EMAIL PROTECTED] wrote:
>> Yonik Seeley wrote:
>>> ...multi-core allows you to instantiate a completely new core and swap it for the old one, but it's a bit of a heavyweight approach ...a schema object would not be mutable, but that one could easily swap in a new schema object for an index at any time...
>>
>> Not sure I understand what we gain; if you change the schema, you'll most likely have to reindex as well. That's management at a higher level, in a way.
>
> There are enough ways that one could change the schema in a compatible way (say, just adding query-time synonyms, etc.) that it does seem like we should permit it.
>
>> Or are you saying we should have a shortcut for the whole operation of creating a new core, reindexing content, and replacing an existing core?
>
> Eventually, it seems like we should be able to handle re-indexing when necessary. And we should consider the ability to change some config without necessarily reloading *everything*.
>
> -Yonik
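The dynamic class loading Jason describes comes down to instantiating classes by name from a client-supplied schema. A minimal, Solr-free sketch follows; a stdlib class stands in for a custom analyzer, and all names here are illustrative, not actual Solr or Lucene code:

```java
import java.util.Map;

// Sketch of config-driven class loading: the "schema" maps field names
// to analyzer class names, and classes are instantiated reflectively at
// runtime. In a real system the loader would be a URLClassLoader pointed
// at a downloaded jar; Class.forName on the default classpath suffices
// for the sketch.
public class DynamicLoadDemo {
    static Object analyzerFor(Map<String, String> schema, String field) throws Exception {
        String className = schema.get(field);
        Class<?> cls = Class.forName(className);
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // java.util.ArrayList stands in for a custom analyzer class.
        Map<String, String> schema = Map.of("title", "java.util.ArrayList");
        Object analyzer = analyzerFor(schema, "title");
        System.out.println(analyzer.getClass().getName()); // java.util.ArrayList
    }
}
```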
Re: Some new SOLR features
This should be done. Great idea.

On Wed, Sep 17, 2008 at 3:41 PM, Lance Norskog [EMAIL PROTECTED] wrote:
> My vote is for dynamically scanning a directory of configuration files. When a new one appears, or an existing file is touched, load it. When a configuration disappears, unload it. This model works very well for servlet containers.
>
> Lance
Re: Some new SOLR features
> That would allow a single request to see a stable view of the schema, while preventing having to make every aspect of the schema thread-safe.

Yes, that is the best approach.

> Nothing will stop one from using java serialization for config persistence,

Persistence should not use serialization. Serialization is for transport over the wire, for automated upgrades of the configuration. This could be done in XML as well; it would be good to support both models.

> Is there a role here for OSGi to play?

Yes. Eclipse uses OSGi successfully, and for grid computing in Java, and for taking advantage of what Java can do with dynamic classloading, OSGi is the way to go. Every search project I have worked on needs this stuff to be far easier than it is now.

The current distributed computing model in Solr may work, but it will not work reliably and will break a lot. When it does break, there is no way to know what happened. This will create excessive downtime for users. I have had excessive downtime in production even with the current simple master-slave architecture, because there is no failover. Failover should be in the current system, because it is easy to implement with the rsync-based batch replication.

On Wed, Sep 17, 2008 at 2:21 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
> [full message elsewhere in thread]
Re: Some new SOLR features
Servlets are one thing; for Solr the situation is different. There are always small changes people want to make: a new stop word, a small tweak to an analyzer. Rebooting the server for these should not be necessary. Ideally this is handled via a centralized console and deployed over the network (using RMI or XML) so that files do not need to be deployed.

On Thu, Sep 18, 2008 at 7:41 AM, Mark Miller [EMAIL PROTECTED] wrote:
> Isn't this done in servlet containers for debugging-type work? Maybe an option, but I disagree that this should drive anything in Solr. It should really be turned off in production in servlet containers too, IMO. This can really be such a pain in the ass on a live site... someone touches web.xml and the app server reboots. *shudder* Seen it, don't dig it.
Re: Some new SOLR features
Dynamic changes are not what I'm against... I'm against dynamic changes that are triggered by the app noticing that the config has changed.

Jason Rutherglen wrote:
> Servlets are one thing; for Solr the situation is different. There are always small changes people want to make: a new stop word, a small tweak to an analyzer. Rebooting the server for these should not be necessary. Ideally this is handled via a centralized console and deployed over the network (using RMI or XML) so that files do not need to be deployed.
Re: Some new SOLR features
Yes, so it's probably best to make the changes through a remote interface so that the app can make the appropriate internal changes. File-based system changes are less than ideal, agreed; however, I suppose with an open source project such as Solr the kitchen-sink effect happens and it will find its way in there anyway. The hard part is organizing the project so that it does not get too bloated with everyone's features, and so that features are pluggable outside of the core releases. There are many things that may be best as contrib modules, implemented as OSGi-based add-ons rather than placed into the standard releases (though I don't have examples off hand). The standard for Solr contribs could be OSGi. This would greatly assist Solr in becoming grid-computing friendly.

Ideally Solr 2.0 would be cleaner, standardized, and most of the features pluggable. This would allow for consistent release cycles and make grid computing simpler to implement. Solr seems like it could be going in the direction of bloat, which could increasingly confuse new users. Instead, users could either implement their own modules and upload them to the contrib section, or implement their own proprietary ones.

I am curious about the recommended place to put query expansion code (such as adding boosting, adding phrase queries, and so on). Is it now best to use a SearchComponent? Is it possible in the future to make SearchComponents OSGi-enabled?

On Thu, Sep 18, 2008 at 7:56 AM, Mark Miller [EMAIL PROTECTED] wrote:
> Dynamic changes are not what I'm against... I'm against dynamic changes that are triggered by the app noticing that the config has changed.
Re: Some new SOLR features
On Tue, Sep 16, 2008 at 10:12 AM, Jason Rutherglen [EMAIL PROTECTED] wrote:
>> SQL database such as H2
>
> Mainly to offer joins and be able to perform hierarchical queries.

Can you define or give an example of what you mean by hierarchical queries?

A downside of any type of cross-document query (like joins) is that it tends to limit scalability. Of course, I think it's acceptable to have some query types that only work on a single shard, since that may continue to cover the majority of users.

Along the same lines, I think it would be useful to have a highly integrated extension point for stored fields (so they could be retrieved from external systems if needed).

-Yonik
Re: Some new SOLR features
If the configuration code is going to be rewritten, then I would like to see the ability to dynamically update the configuration and schema without needing to reboot the server. Also, I would like the configuration classes to just contain data and not have so many methods that operate on the filesystem. This way the configuration object can be serialized and loaded by the server dynamically. It would be great for the schema to work the same way. Yonik, what is the best way to get this type of thing going? Where in the code do you want to implement the distributed Hadoop RMI stuff?

On Tue, Sep 16, 2008 at 1:07 PM, Henrib [EMAIL PROTECTED] wrote:
> ryantxu wrote:
>> Yes, include would get us some of the way there, but not far enough (IMHO). The problem is that (as written) you still need to have all the configs spattered about various directories.
>
> It does not allow us to go *all* the way, but it does allow putting configuration files in one directory (plus schema and conf can have specific names set for each CoreDescriptor). There is actually a test where the config and schema are shared and the dataDir is set as a property. Still a step forward...
>
> --
> View this message in context: http://www.nabble.com/Some-new-SOLR-features-tp19494251p19516242.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Some new SOLR features
> Can you define or give an example of what you mean by hierarchical queries?

Good question; I think Erik Hatcher had more ideas on that. I was imagining joins or subqueries like SQL offers. Clearly they won't be efficient, but it's easier than implementing joins in Solr (or is it?). Joins limit scalability, that is true; I guess that's just the nature of them, unless there is some other way to do it. Doesn't Oracle implement some sort of distributed join in their clustering solution? Is it worth it?

On Wed, Sep 17, 2008 at 12:25 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
> A downside of any type of cross-document query (like joins) is that it tends to limit scalability. Of course, I think it's acceptable to have some query types that only work on a single shard, since that may continue to cover the majority of users.
Re: Some new SOLR features
On Wed, Sep 17, 2008 at 1:27 PM, Jason Rutherglen [EMAIL PROTECTED] wrote: If the configuration code is going to be rewritten then I would like to see the ability to dynamically update the configuration and schema without needing to reboot the server. Exactly. Actually, multi-core allows you to instantiate a completely new core and swap it for the old one, but it's a bit of a heavyweight approach. The key is finding the right granularity of change. My current thought is that a schema object would not be mutable, but that one could easily swap in a new schema object for an index at any time. That would allow a single request to see a stable view of the schema, while preventing having to make every aspect of the schema thread-safe. Also I would like the configuration classes to just contain data and not have so many methods that operate on the filesystem. That's the plan... completely separate the serialized and in memory representations. This way the configuration object can be serialized, and loaded by the server dynamically. It would be great for the schema to work the same way. Nothing will stop one from using java serialization for config persistence, however I am a fan of human readable for config files... so much easier to debug and support. Right now, people can cut-n-paste relevant parts of their config in email for support, or to a wiki to explain things, etc. Of course, if you are talking about being able to have custom filters or analyzers (new classes that don't even exist on the server yet), then it does start to get interesting. This intersects with deployment in general... and I'm not sure what the right answer is. What if Lucene or Solr needs an upgrade? It would be nice if that could also automatically be handled in a a large cluster... what are the options for handling that? Is there a role here for OSGi to play? It sounds like at least some of that is outside of the Solr domain. 
An alternative to serializing everything would be to ship a new schema along with a new jar file containing the custom components. -Yonik
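The swap-don't-mutate pattern Yonik describes could be sketched roughly as follows. This is a minimal illustration only, not Solr's actual classes; `Schema` and `SolrIndex` here are hypothetical names:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: an immutable schema that is replaced wholesale, never mutated.
final class Schema {
    private final Map<String, String> fieldTypes; // field name -> type name

    Schema(Map<String, String> fieldTypes) {
        // Defensive copy so the schema is truly immutable after construction.
        this.fieldTypes = Map.copyOf(fieldTypes);
    }

    String typeOf(String field) {
        return fieldTypes.get(field);
    }
}

final class SolrIndex {
    // Readers grab the reference once per request and see a stable view;
    // writers publish a whole new Schema instead of mutating in place.
    private final AtomicReference<Schema> schema;

    SolrIndex(Schema initial) {
        this.schema = new AtomicReference<>(initial);
    }

    Schema currentSchema() {
        return schema.get();
    }

    void swapSchema(Schema replacement) {
        schema.set(replacement);
    }
}
```

A request that captures `currentSchema()` once at its start keeps a consistent view even if `swapSchema` runs mid-request, so nothing inside `Schema` itself needs to be thread-safe.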
RE: Some new SOLR features
My vote is for dynamically scanning a directory of configuration files. When a new one appears, or an existing file is touched, load it. When a configuration disappears, unload it. This model works very well for servlet containers. Lance -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Wednesday, September 17, 2008 11:21 AM To: solr-user@lucene.apache.org Subject: Re: Some new SOLR features On Wed, Sep 17, 2008 at 1:27 PM, Jason Rutherglen [EMAIL PROTECTED] wrote: If the configuration code is going to be rewritten then I would like to see the ability to dynamically update the configuration and schema without needing to reboot the server. Exactly. Actually, multi-core allows you to instantiate a completely new core and swap it for the old one, but it's a bit of a heavyweight approach. The key is finding the right granularity of change.
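Lance's directory-scanning model could be sketched as a simple poll loop that diffs file modification times between scans. This is an illustrative sketch only, not a proposed Solr API; the class and event names are made up:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: detect added, touched, and removed config files between scans.
final class ConfigDirScanner {
    private final File dir;
    private final Map<String, Long> lastSeen = new HashMap<>(); // file name -> mtime

    ConfigDirScanner(File dir) {
        this.dir = dir;
    }

    // Returns human-readable events; a real implementation would
    // load/reload/unload the corresponding core for each event.
    List<String> scan() {
        List<String> events = new ArrayList<>();
        Map<String, Long> current = new HashMap<>();
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                current.put(f.getName(), f.lastModified());
            }
        }
        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long old = lastSeen.get(e.getKey());
            if (old == null) {
                events.add("load " + e.getKey());      // new file appeared
            } else if (!old.equals(e.getValue())) {
                events.add("reload " + e.getKey());    // existing file touched
            }
        }
        for (String name : lastSeen.keySet()) {
            if (!current.containsKey(name)) {
                events.add("unload " + name);          // file disappeared
            }
        }
        lastSeen.clear();
        lastSeen.putAll(current);
        return events;
    }
}
```

One design caveat, echoing Mark's concern above: a poll loop like this should be opt-in and probably disabled in production, since an accidental `touch` would trigger a reload.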
Re: Some new SOLR features
On Wed, Sep 17, 2008 at 4:50 PM, Henrib [EMAIL PROTECTED] wrote: Yonik Seeley wrote: ...multi-core allows you to instantiate a completely new core and swap it for the old one, but it's a bit of a heavyweight approach ...a schema object would not be mutable, but that one could easily swap in a new schema object for an index at any time...

Not sure I understand what we gain; if you change the schema, you'll most likely have to reindex as well. That's management at a higher level in a way.

There are enough ways that one could change the schema in a compatible way (say, just adding query-time synonyms, etc.) that it does seem like we should permit it.

Or are you saying we should have a shortcut for the whole operation of creating a new core, reindexing content, and replacing an existing core?

Eventually, it seems like we should be able to handle re-indexing when necessary. And we should consider the ability to change some config without necessarily reloading *everything*. -Yonik
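The heavyweight approach discussed here can already be driven through the CoreAdmin handler over HTTP; roughly like this, assuming a multicore setup with the admin handler enabled (core names and instance directories below are only examples):

```shell
# Build a replacement core with the new schema, reindex into it, then swap it in:
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=core0-new&instanceDir=core0-new'
# ... reindex documents into core0-new ...
curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=core0&other=core0-new'
```

After the SWAP, requests addressed to core0 hit the freshly built index, and the old one can be unloaded at leisure.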
Re: Some new SOLR features
Hello Ryan,

SQL database such as H2

Mainly to offer joins and be able to perform hierarchical queries. Also any other types of queries a hybrid SQL search system would offer. This is something that is best built into SOLR rather than Lucene. It seems like a lot of the users of SOLR work with SQL databases as well, so it would seem natural to integrate the two. Also, the Summize realtime search system that Twitter purchased worked by integrating with MySQL. The way to do something similar in Lucene would be to integrate with a Java SQL database. Also, hierarchical queries could be performed faster using this method (though I could be wrong, if there is a better way).

to have multiple lucene indexes within a single SolrCore?

I don't like the whole multicore thing from an administrative perspective. That means each index needs a separate schema and configuration etc. That becomes hard to manage if there are 10+ indexes required, and it is definitely not as simple as an SQL database, which does not require so many separate directories and manual configuration. It would be simple to add this into SOLR. In general though, I have trouble figuring out many of the design decisions of SOLR and so hesitate to implement things that seem to go against the SOLR design model (is there one?).

9. Distributed search and updates using an object serialization which

Where would I start with integrating this into SOLR? Need some help on that part of it. Tell me what's best and I'll integrate it; it should be the easiest on the list.

Jason
Re: Some new SOLR features
On Sep 16, 2008, at 10:12 AM, Jason Rutherglen wrote:

Hello Ryan, SQL database such as H2 Mainly to offer joins and be able to perform hierarchical queries. Also any other types of queries a hybrid SQL search system would offer. This is something that is best built into SOLR rather than Lucene. It seems like a lot of the users of SOLR work with SQL databases as well. It would seem natural to integrate the two. Also the Summize realtime search system that Twitter purchased worked by integrating with MySQL. The way to do something similar in Lucene would be to integrate with a Java SQL database. Also hierarchical queries could be performed faster using this method (though I could be wrong, if there is a better way).

Definitely sounds interesting -- not on my personal TODO list, but I can see the value and would support this direction (perhaps as a contrib?). For starters, it seems like everything could happen in a custom RequestHandler (perhaps QueryComponent?).

to have multiple lucene indexes within a single SolrCore? I don't like the whole multi core thing from an administrative perspective. That means each index needs a separate schema and configuration etc. That becomes hard to manage if there are 10+ indexes required and is definitely not as simple as an SQL database, which does not require so many separate directories and manual configuration.

I 100% agree that multicore configuration gets unwieldy quickly. That said, what I'm hearing from you is that the config is problematic, not that you really need multiple lucene indexes in the same SolrCore. FYI -- the name SolrCore is perhaps legacy from when it was static and had access to the only index available. With MultiCore we removed all the static access and each lucene index gets a SolrCore. Maybe better to think of SolrCore as SolrIndex -- everything you can do with one index.

Yes, I would like to see a way to specify all the fieldtypes / handlers in one location and then only specify what fields are available for each core. So yes -- I agree. In 2.0, I hope to flesh out configs so they are not monstrous.

It would be simple to add this into SOLR. In general though I have trouble figuring out many of the design decisions of SOLR and so hesitate to implement things that seem to go against the SOLR design model (is there one?).

The 1.X line is organic growth from an internal CNET architecture. I hope the 2.X line will have a more consistent design model... As far as getting around the existing multicore configs, I do this in my code by overriding:

protected CoreContainer.Initializer createInitializer() { return new CoreContainer.Initializer(); }

in SolrDispatchFilter. I actually initialize the CoreContainer manually (pulling some info from a SQL database).

9. Distributed search and updates using a object serialization which

Where would I start with integrating this into SOLR? Need some help on that part of it. Tell me what's best and I'll integrate it, it should be the easiest on the list.

not sure ;) Distributed search is one of the areas I have not looked at

ryan
Re: Some new SOLR features
ryantxu wrote: ... Yes, I would like to see a way to specify all the fieldtypes / handlers in one location and then only specify what fields are available for each core. So yes -- I agree. In 2.0, I hope to flesh out configs so they are not monstrous. ...

What about using include so each core can have a minimal specific configuration and schema, with everything else shared between them? Something akin to what's allowed by SOLR-646. Just couldn't resist :-) Henri -- View this message in context: http://www.nabble.com/Some-new-SOLR-features-tp19494251p19515526.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Some new SOLR features
ryantxu wrote: ... Yes, I would like to see a way to specify all the fieldtypes / handlers in one location and then only specify what fields are available for each core. So yes -- I agree. In 2.0, I hope to flesh out configs so they are not monstrous. ... What about using include so each core can have a minimal specific configuration and schema, with everything else shared between them? Something akin to what's allowed by SOLR-646. Just couldn't resist :-) Henri

somehow I knew that was coming :) Yes, include would get us some of the way there, but not far enough (IMHO). The problem is that (as written) you still need to have all the configs spattered about various directories. ryan
Re: Some new SOLR features
ryantxu wrote: Yes, include would get us some of the way there, but not far enough (IMHO). The problem is that (as written) you still need to have all the configs spattered about various directories.

It does not allow us to go *all* the way, but it does allow us to put configuration files in one directory (plus schema/config can have specific names set for each CoreDescriptor). There actually is a test where the config and schema are shared and the dataDir is set as a property. Still a step forward... -- View this message in context: http://www.nabble.com/Some-new-SOLR-features-tp19494251p19516242.html Sent from the Solr - User mailing list archive at Nabble.com.
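For illustration, the include approach might look something like this in a per-core solrconfig.xml. This is a hedged sketch: the exact mechanics depend on SOLR-646 and on XInclude support in the config loader, and the file names and paths below are made up:

```xml
<!-- core0/conf/solrconfig.xml: keep only core-specific settings here -->
<config xmlns:xi="http://www.w3.org/2001/XInclude">
  <!-- Pull the shared handlers/components from one common file -->
  <xi:include href="../../shared/common-solrconfig.xml"/>
  <!-- Per-core bits stay local; dataDir via property substitution -->
  <dataDir>${dataDir:/var/solr/core0/data}</dataDir>
</config>
```

Each core's file then shrinks to a handful of lines, while the shared fieldtypes and handlers live in one place, which is roughly the "minimal specific configuration" Henri describes.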
Re: Some new SOLR features
Here are my gut reactions to this list... in general, most of this comes down to sounds great, if someone did the work I'm all for it! Also, no need to post to solr-user AND solr-dev, probably better to think of solr-user as a superset of solr-dev.

1. Machine learning based suggest feature https://issues.apache.org/jira/browse/LUCENE-626 which is implemented, and is similar to what Google does in their suggest implementation. The Fuzzy based spellchecker is ok, but it would be better to incorporate user behavior.

2. Realtime updates https://issues.apache.org/jira/browse/LUCENE-1313 and work being planned for IndexWriter

3. Realtime untokenized field updates https://issues.apache.org/jira/browse/LUCENE-1292

Without knowing the details of these patches, everything sounds great. In my view, SOLR should offer a nice interface to anything in lucene core/contrib.

4. BM25 Scoring

Again, no idea, but if implemented in lucene, yes.

5. Integration with an open source SQL database such as H2. This would mean under the hood, SOLR would enable storing data in a relational database to allow for joins and things. It would need to be combined with realtime updates. H2 has Lucene integration but it is the usual index-everything-at-once, non-incremental approach. The new system would simply index as a new row in a table is added. The SOLR schema could allow for certain fields being stored in an SQL database.

Sounds interesting -- what is the basic problem you are addressing? (It seems you are pointing to something specific, and describing your solution)

6. SOLR schema allowing for multiple indexes without using the multicore. The indexes could be defined like SQL tables in the schema.xml file.

Is this just a configuration issue? I definitely hope we can make configuration easier in the future. As is, a custom handler can look at multiple indexes... why is there a need to have multiple lucene indexes within a single SolrCore?

6. Crowding feature a la GBase http://code.google.com/apis/base/attrs-queries.html#crowding which is similar to Field Collapsing. I am thinking it is advantageous from a performance perspective to obtain an excessive amount of results, then filter down the result set, rather than first sort a result set.

Again, sounds great! I would love to see it.

7. Improved relevance based on user clicks of individual query results for individual queries. This can be thought of as similar to what Digg does. I'm sure Google does something similar. It is a feature that would be of value to almost any SOLR implementation.

Agreed -- if there is a good way to quickly update a field used for sorting/scoring, this would happen.

8. Integration of LocalSolr into the standard SOLR distribution. Location is something many sites use these days and is standard in GBase and most likely other products like FAST.

I'm working on it; it will be a lucene contrib package and cooked into the core solr distribution.

9. Distributed search and updates using an object serialization which could use https://issues.apache.org/jira/browse/LUCENE-1336 This allows span queries, custom payload queries, custom similarities, custom analyzers, without compiling and deploying a new SOLR war file to individual servers.

sounds good (but I have no technical basis to say so)

ryan
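The wire-format idea in item 9 amounts to shipping serialized query objects between servers rather than query strings. A bare-bones sketch of the round trip, using plain Java serialization (the `CustomQuery` class is hypothetical, not Lucene's, and a real deployment would also need the class bytes available on the remote side, which is what LUCENE-1336 addresses):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch: a custom "query" serialized to bytes, as it might be shipped
// from a coordinating node to individual search servers.
final class CustomQuery implements Serializable {
    private static final long serialVersionUID = 1L;
    final String field;
    final String term;
    final float boost;

    CustomQuery(String field, String term, float boost) {
        this.field = field;
        this.term = term;
        this.boost = boost;
    }

    // Encode the query for the wire.
    static byte[] toBytes(CustomQuery q) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(q);
        }
        return bos.toByteArray();
    }

    // Decode on the receiving server.
    static CustomQuery fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (CustomQuery) ois.readObject();
        }
    }
}
```

The point of the sketch is only the shape of the exchange: arbitrary query objects (spans, payload queries, custom similarities) survive the trip intact, with no string query syntax in the middle.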