RE: CoreAdmin STATUS performance
Hi Stefan,

I have opened issue SOLR-4302 and attached the suggested patch.

Regards,
Shahar.

-----Original Message-----
From: Stefan Matheis [mailto:matheis.ste...@gmail.com]
Sent: Sunday, January 13, 2013 3:11 PM
To: solr-user@lucene.apache.org
Subject: Re: CoreAdmin STATUS performance

Shahar, would you mind if I asked you to open a JIRA issue for that, attaching your changes as a regular patch? Perhaps we could use it in the UI, in those cases where we don't need the full set of information.

Stefan

On Sunday, January 13, 2013 at 12:28 PM, Shahar Davidson wrote:

Shawn, Per and anyone else who has participated in this thread - thank you! I have finally resorted to applying a minor patch to the Solr code. [...]
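For reference, a sketch of what the patched STATUS request could look like from the client side. The parameter name `indexInfo` follows the patch attached to SOLR-4302 and may change before it is merged; the base URL is illustrative.

```java
// Build a CoreAdmin STATUS URL that asks Solr to skip the expensive
// per-index statistics (segmentCount, sizeInBytes, numDocs, ...).
// The "indexInfo" parameter name is taken from the SOLR-4302 patch.
public class StatusUrl {
    static String statusUrl(String baseUrl, boolean includeIndexInfo) {
        StringBuilder url = new StringBuilder(baseUrl)
                .append("/admin/cores?action=STATUS&wt=json");
        if (!includeIndexInfo) {
            url.append("&indexInfo=false"); // fast path: names and static info only
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(statusUrl("http://localhost:8983/solr", false));
    }
}
```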
RE: CoreAdmin STATUS performance
Thanks for sharing this info, Per - it may prove valuable for me in the future.

Shahar.

-----Original Message-----
From: Per Steffensen [mailto:st...@designware.dk]
Sent: Thursday, January 10, 2013 6:10 PM
To: solr-user@lucene.apache.org
Subject: Re: CoreAdmin STATUS performance

The collections are created dynamically - not on update, though. We use one collection per month, and we have a timer job running (every hour or so) which checks whether all collections that need to exist actually do - if not, it creates the missing collection(s). [...]
RE: CoreAdmin STATUS performance
Shawn, Per and anyone else who has participated in this thread - thank you!

I have finally resorted to applying a minor patch to the Solr code. I noticed that most of the time of the STATUS request is spent collecting index-related info (such as segmentCount, sizeInBytes, numDocs, etc.). In the STATUS request I added support for a new parameter which, if present, skips collection of the index info (so the request returns only the general static info, among it the core name). This cuts the request time by two orders of magnitude: in my case it dropped from around 800ms to around 1ms-4ms.

Regards,
Shahar.

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Thursday, January 10, 2013 5:14 PM
To: solr-user@lucene.apache.org
Subject: Re: CoreAdmin STATUS performance

Are there a *huge* number of SolrJ clients in the wild, or is it something like a server farm where you are in control of everything? [...]
Re: CoreAdmin STATUS performance
Shahar, would you mind if I asked you to open a JIRA issue for that, attaching your changes as a regular patch? Perhaps we could use it in the UI, in those cases where we don't need the full set of information.

Stefan

On Sunday, January 13, 2013 at 12:28 PM, Shahar Davidson wrote:

Shawn, Per and anyone else who has participated in this thread - thank you! I have finally resorted to applying a minor patch to the Solr code. I noticed that most of the time of the STATUS request is spent collecting index-related info (such as segmentCount, sizeInBytes, numDocs, etc.). [...]
Re: CoreAdmin STATUS performance
If you are using ZK-coordinated Solr (SolrCloud - you need 4.0+) you can maintain an in-memory, always-up-to-date data structure containing this information - the ClusterState. You can get it through CloudSolrServer or ZkStateReader: you connect to ZK once and it automatically updates the in-memory ClusterState with changes.

Regards,
Per Steffensen

On 1/9/13 4:38 PM, Shahar Davidson wrote:

Hi All, I have a client app that uses SolrJ and needs to collect the names (and just the names) of all loaded cores. I have about 380 Solr cores on a single Solr server (net index size is about 220GB). Running the STATUS action takes about 800ms - that seems a bit too long, given my requirements. So here are my questions: 1) Is there any way to get _only_ the core name of all cores? 2) Why does the STATUS request take such a long time, and is there a way to improve its performance? Thanks, Shahar.
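The pattern Per describes - connect to ZooKeeper once, then keep an in-memory view current via callbacks instead of polling STATUS - can be sketched without the ZooKeeper wiring. The class and method names here are invented for illustration; in real SolrCloud, ZkStateReader plays this role.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the ZkStateReader pattern: an always-current, in-memory view of
// the cluster, updated by watch callbacks rather than by polling STATUS.
public class ClusterStateView {
    private final AtomicReference<Set<String>> collections =
            new AtomicReference<>(Set.of());

    // In SolrCloud, a ZooKeeper watch would invoke this whenever the
    // cluster state changes; here it is called directly for illustration.
    void onClusterStateChange(Set<String> current) {
        collections.set(current);
    }

    // Readers get the cached view: no network round trip per query.
    Set<String> collections() {
        return collections.get();
    }
}
```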
RE: CoreAdmin STATUS performance
Thanks, Per. I'm currently not using SolrCloud, but that's a good tip to keep in mind.

Thanks,
Shahar.

-----Original Message-----
From: Per Steffensen [mailto:st...@designware.dk]
Sent: Thursday, January 10, 2013 10:02 AM
To: solr-user@lucene.apache.org
Subject: Re: CoreAdmin STATUS performance

If you are using ZK-coordinated Solr (SolrCloud - you need 4.0+) you can maintain an in-memory, always-up-to-date data structure containing this information - the ClusterState. [...]
Re: CoreAdmin STATUS performance
On 1/10/13 10:09 AM, Shahar Davidson wrote:

search request, the system must be aware of all available cores in order to execute distributed search on _all_ relevant cores

For this purpose I would definitely recommend that you go SolrCloud. Furthermore, we do something extra: we have several collections, each containing data from a specific period in time - the timestamp of incoming data decides which collection it is indexed into. One important search criterion for our clients is search on a timestamp interval, so most searches can be restricted to only a subset of all our collections. Instead of putting the logic that calculates the subset of collections to search (given the timestamp search interval) in the clients, we let clients do dumb searches that just supply the timestamp interval. The subset of collections to search is calculated server-side from the timestamp interval in the search query. We handle this in a Solr SearchComponent which we place early in the chain of SearchComponents. Maybe you can get some inspiration from this approach, if it is relevant for you.

Regards,
Per Steffensen
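The server-side calculation Per describes boils down to mapping a timestamp interval onto the monthly collections it overlaps. A minimal sketch - the `coll_YYYY_MM` naming scheme is invented for illustration, and a real SearchComponent would read the interval from the query parameters:

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

// Map a timestamp search interval to the subset of monthly collections
// that can contain matching documents, so the distributed search fans
// out only to those collections.
public class CollectionSubset {
    static List<String> collectionsFor(YearMonth from, YearMonth to) {
        List<String> names = new ArrayList<>();
        for (YearMonth ym = from; !ym.isAfter(to); ym = ym.plusMonths(1)) {
            names.add(String.format("coll_%d_%02d", ym.getYear(), ym.getMonthValue()));
        }
        return names;
    }

    public static void main(String[] args) {
        // A November-to-January interval touches exactly three collections.
        System.out.println(collectionsFor(YearMonth.of(2012, 11), YearMonth.of(2013, 1)));
    }
}
```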
RE: CoreAdmin STATUS performance
Hi Per,

Thanks for your reply! That's a very interesting approach.

In your system, how are the collections created? In other words, are the collections created dynamically upon an update (for example, per new day)? If they are created dynamically, who handles their creation (client/server) and how is it done? I'd love to hear more about it!

Appreciate your help,
Shahar.

-----Original Message-----
From: Per Steffensen [mailto:st...@designware.dk]
Sent: Thursday, January 10, 2013 1:23 PM
To: solr-user@lucene.apache.org
Subject: Re: CoreAdmin STATUS performance

For this purpose I would definitely recommend that you go SolrCloud. [...]
Re: CoreAdmin STATUS performance
On 1/10/2013 2:09 AM, Shahar Davidson wrote:

As for your first question, the core info needs to be gathered upon every search request because cores are created dynamically. When a user initiates a search request, the system must be aware of all available cores in order to execute a distributed search on _all_ relevant cores (the user must get reliable and up-to-date data). The reason 800ms seems like a lot to me is that the overall execution takes about 2500ms, and a large part of it is due to the STATUS request. The minimal-interval concept is a good idea and indeed we've considered it, yet it poses a slight problem when building a real-time system which needs to return the most up-to-date data. I am just trying to understand if there's some other way to hasten the STATUS reply (for example, by asking the STATUS request to return just certain core attributes, such as the name, instead of collecting everything).

Are there a *huge* number of SolrJ clients in the wild, or is it something like a server farm where you are in control of everything?

If it's the latter, what I think I would do is have an asynchronous thread that periodically (every few seconds) updates the client's view of what cores exist. When a query is made, it will use that information, speeding up your queries by 800 milliseconds and ensuring that new cores will not have long delays before they become searchable.

If you have a huge number of clients in the wild, it would still be possible, but ensuring that those clients get updated might be hard.

If you also delete cores as well as add them, that complicates things. You'd have to have the clients be smart enough to exclude the last core on the list (by whatever sorting mechanism you require), and you'd have to wait long enough (30 seconds, maybe?) before *actually* deleting the last core, to be sure that no clients are accessing it.

Or you could use SolrCloud, as Per suggested, but with 4.1, not the released 4.0. SolrCloud manages your cores for you automatically. You'd probably be using a slightly customized SolrCloud, including the custom hashing capability added by SOLR-2592. I don't know what other customizations you might need.

Thanks,
Shawn
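Shawn's asynchronous-refresh idea can be sketched as follows. The `fetcher` stands in for the actual CoreAdmin STATUS call (via SolrJ or HTTP), and the class name and the synchronous `refreshNow` helper are invented for illustration.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Keep a periodically refreshed snapshot of the core list so that
// query-time code never pays the ~800ms STATUS cost.
public class CoreListRefresher {
    private final Supplier<List<String>> fetcher; // wraps the real STATUS call
    private volatile List<String> cores = Collections.emptyList();
    private final ScheduledExecutorService exec =
            Executors.newSingleThreadScheduledExecutor();

    CoreListRefresher(Supplier<List<String>> fetcher, long periodSeconds) {
        this.fetcher = fetcher;
        exec.scheduleAtFixedRate(this::refreshNow, 0, periodSeconds, TimeUnit.SECONDS);
    }

    void refreshNow() {
        cores = fetcher.get();
    }

    // Query-time lookup: just a volatile read, no STATUS round trip.
    List<String> currentCores() {
        return cores;
    }

    void shutdown() {
        exec.shutdown();
    }
}
```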
Re: CoreAdmin STATUS performance
The collections are created dynamically - not on update, though. We use one collection per month, and we have a timer job running (every hour or so) which checks whether all collections that need to exist actually do exist - if not, it creates the missing collection(s). The rule is that the collection for next month has to exist as soon as we enter the current month, so the first time the timer job runs on e.g. 1 July it will create the August collection. We never get data with a timestamp in the future, so as long as the timer job gets to run once within every month, we will always have the needed collections ready.

We create collections using the new Collection API in Solr. We used to manage creation of every single shard/replica/core of the collections through the Core Admin API, but since a Collection API was introduced we decided that we'd better use that. In 4.0 it did not have the features we needed, which triggered SOLR-4114, SOLR-4120 and SOLR-4140, all of which will be available in 4.1. With those features we are now using the Collection API.

BTW, our timer job also handles deletion of old collections. In our system you can configure how many historic month-collections to keep before it is OK to delete them. Let's say this is configured to 3: as soon as 1 July arrives, the timer job will delete the March collection (the historic collections to keep have just become the April, May and June collections). This way we always have at least 3 months of historic data, and late in a month close to 4 months. It does not matter that we keep a little too much history, as long as we do not go below the lower limit on the length of historic data. We also use the new Collection API for deletion.

Regards,
Per Steffensen

On 1/10/13 3:04 PM, Shahar Davidson wrote:

Hi Per, Thanks for your reply! That's a very interesting approach. In your system, how are the collections created? [...]
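The timer-job rules Per describes reduce to simple month arithmetic. A sketch - the `coll_YYYY_MM` naming and the method names are invented, and the actual create/delete calls would go through the Collection API over HTTP:

```java
import java.time.YearMonth;

// Month arithmetic behind the timer job: which collection must exist,
// and which one has aged out, for a given current month.
public class EnsureCollections {
    // Rule from the thread: the collection for next month must exist
    // as soon as the current month begins.
    static String requiredCollection(YearMonth current) {
        YearMonth next = current.plusMonths(1);
        return String.format("coll_%d_%02d", next.getYear(), next.getMonthValue());
    }

    // Retention: keeping `keepMonths` historic collections, the one that
    // becomes deletable when a new month starts.
    static String expiredCollection(YearMonth current, int keepMonths) {
        YearMonth old = current.minusMonths(keepMonths + 1);
        return String.format("coll_%d_%02d", old.getYear(), old.getMonthValue());
    }

    public static void main(String[] args) {
        // On 1 July 2013 the job creates August and, with keep=3, deletes March.
        System.out.println(requiredCollection(YearMonth.of(2013, 7)));
        System.out.println(expiredCollection(YearMonth.of(2013, 7), 3));
    }
}
```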
CoreAdmin STATUS performance
Hi All,

I have a client app that uses SolrJ and needs to collect the names (and just the names) of all loaded cores. I have about 380 Solr cores on a single Solr server (net index size is about 220GB). Running the STATUS action takes about 800ms - that seems a bit too long, given my requirements.

So here are my questions:
1) Is there any way to get _only_ the core name of all cores?
2) Why does the STATUS request take such a long time, and is there a way to improve its performance?

Thanks,
Shahar.
Re: CoreAdmin STATUS performance
On 1/9/2013 10:38 AM, Shahar Davidson wrote:

Hi All, I have a client app that uses SolrJ and needs to collect the names (and just the names) of all loaded cores. [...] 1) Is there any way to get _only_ the core name of all cores?

If you have access to the filesystem, you could just read solr.xml, where all cores are listed.
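A sketch of that approach, assuming the pre-SolrCloud solr.xml layout where each core appears as a `<core name="..."/>` element; it requires filesystem access to the Solr home, and the class and method names are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Extract core names directly from solr.xml instead of calling STATUS.
// Caveat: this sees what is *configured*, which can lag behind what is
// currently loaded if cores are created or unloaded at runtime.
public class SolrXmlCores {
    static List<String> coreNames(String solrXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(solrXml)));
            NodeList cores = doc.getElementsByTagName("core");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < cores.getLength(); i++) {
                names.add(((Element) cores.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new RuntimeException("could not parse solr.xml", e);
        }
    }
}
```

In practice you would read the file content (e.g. from `<solrHome>/solr.xml`) and pass it in.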
Re: CoreAdmin STATUS performance
On 1/9/2013 8:38 AM, Shahar Davidson wrote:

I have a client app that uses SolrJ and needs to collect the names (and just the names) of all loaded cores. [...] 2) Why does the STATUS request take such a long time, and is there a way to improve its performance?

I'm curious why 800 milliseconds isn't fast enough. How often do you actually need to gather this information?

If you are incorporating it into something that will be accessed a lot (such as a status servlet page), put a minimum-interval capability into the part of the program that contacts Solr. If it's been less than that minimum interval (5-10 seconds could be a recommended starting point) since the last time the information was gathered, just use the previously stored response rather than making a new request.

I have used this approach in a homegrown status servlet written with SolrJ. I have been trying to come up with a way to generalize the paradigm so it can be incorporated directly into a future SolrJ version.

Thanks,
Shawn
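Shawn's minimum-interval idea as a minimal sketch. The class and method names are invented, and the `Supplier` would wrap the real SolrJ STATUS request:

```java
import java.util.function.Supplier;

// Serve a cached STATUS response unless the minimum interval has elapsed,
// so a frequently hit status page issues at most one real request per window.
public class ThrottledStatus<T> {
    private final Supplier<T> fetch;   // the expensive STATUS request
    private final long minIntervalMs;  // e.g. 5000-10000 ms, per the thread
    private T cached;
    private long lastFetchMs = Long.MIN_VALUE;

    ThrottledStatus(Supplier<T> fetch, long minIntervalMs) {
        this.fetch = fetch;
        this.minIntervalMs = minIntervalMs;
    }

    synchronized T get() {
        long now = System.currentTimeMillis();
        if (cached == null || now - lastFetchMs >= minIntervalMs) {
            cached = fetch.get();
            lastFetchMs = now;
        }
        return cached;
    }
}
```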