[jira] [Commented] (SOLR-6595) Improve error response in case distributed collection cmd fails
[ https://issues.apache.org/jira/browse/SOLR-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732042#comment-16732042 ]

Jason Gerlowski commented on SOLR-6595:
---

I'm not going to have much time in the immediate future to finish this up, so I wanted to summarize the progress so far:
- The latest patch sets the "status" property to 500 when the "failure" list is present and non-empty.
- Because of this, SolrJ will now throw exceptions in failure cases where it previously allowed the request to fail silently. This causes some tests to fail that were passing (incorrectly) before. I investigated a few examples of this, and most were in test setup/cleanup where the expectations were a bit off. There weren't a ton of these failures though, and they should be simpler to debug thanks to other recent test flakiness improvements.
- I investigated making changes to SolrJ that would attach a NamedList to SolrExceptions thrown because of a 500, but didn't pursue that too far. It's probably a separate JIRA anyway.

> Improve error response in case distributed collection cmd fails
> ---
>
> Key: SOLR-6595
> URL: https://issues.apache.org/jira/browse/SOLR-6595
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.10
> Environment: SolrCloud with Client SSL
> Reporter: Sindre Fiskaa
> Assignee: Jason Gerlowski
> Priority: Minor
> Attachments: SOLR-6595.patch
>
> Followed the description https://cwiki.apache.org/confluence/display/solr/Enabling+SSL and generated a self signed key pair. Configured a few solr-nodes and used the collection api to create a new collection. -I get error message when specify the nodes with the createNodeSet param. When I don't use createNodeSet param the collection gets created without error on random nodes. Could this be a bug related to the createNodeSet param?-
> *Update: It failed due to what turned out to be invalid client certificate on the overseer, and returned the following response:*
> {code:xml}
> <response>
> <lst name="responseHeader"><int name="status">0</int><int name="QTime">185</int></lst>
> <lst name="failure">
> <str>org.apache.solr.client.solrj.SolrServerException:IOException occured when talking to server at: https://vt-searchln04:443/solr</str>
> </lst>
> </response>
> {code}
> *Update: Three problems:*
> # Status=0 when the cmd did not succeed (only ZK was updated, but cores not created due to failing to connect to shard nodes to talk to core admin API).
> # The error printed does not tell which action failed. Would be helpful to either get the msg from the original exception or at least some message saying "Failed to create core, see log on Overseer <node.name>"
> # State of collection is not clean since it exists as far as ZK is concerned but cores not created. Thus retrying the CREATECOLLECTION cmd would fail. Should Overseer detect error in distributed cmds and rollback changes already made in ZK?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6595) Improve error response in case distributed collection cmd fails
[ https://issues.apache.org/jira/browse/SOLR-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702292#comment-16702292 ]

Jason Gerlowski commented on SOLR-6595:
---

Thinking aloud here, and I guess also soliciting feedback.

The current patch sets 500 as the value for the "status" property, as well as the HTTP status code on the response. The expectation in most other places seems to be that the "status" property matches the HTTP status code. So this seems like the technically correct thing to do from an API perspective.

There is a downside to this though: SolrJ converts non-200 responses into exceptions. So while the failure information is still in the response, SolrJ users can't get at it. (This isn't strictly true... SolrJ tries its best to come up with a good exception message by looking for properties like "error" and "failure". But that's a pale substitute for giving users access to the response itself if they want it.)

It'd be cool if SolrJ users could access the original response in exceptional cases. Maybe we should attach the parsed NamedList to RemoteSolrExceptions that get thrown by SolrJ. That seems like a separate JIRA, but I wanted to raise it here since it bears on these response changes indirectly.
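The idea of attaching the parsed response to the exception can be sketched roughly as follows. This is only an illustration, not SolrJ code: `ResponseCarryingException`, `ClientSketch`, and `handle` are hypothetical names, and a plain `Map` stands in for Solr's `NamedList`.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical exception type (not part of SolrJ) that keeps the parsed response body.
class ResponseCarryingException extends RuntimeException {
    private final int status;
    private final Map<String, Object> response;

    ResponseCarryingException(int status, String message, Map<String, Object> response) {
        super(message);
        this.status = status;
        this.response = Collections.unmodifiableMap(response);
    }

    int getStatus() { return status; }

    // Gives callers read-only access to the response the server actually sent.
    Map<String, Object> getResponse() { return response; }
}

// Hypothetical stand-in for a client that turns non-200 responses into exceptions.
class ClientSketch {
    static Map<String, Object> handle(int httpStatus, Map<String, Object> parsedBody) {
        if (httpStatus != 200) {
            throw new ResponseCarryingException(
                httpStatus, "Request failed with status " + httpStatus, parsedBody);
        }
        return parsedBody;
    }

    public static void main(String[] args) {
        Map<String, Object> body = new LinkedHashMap<>();
        // Failure text taken from the issue description; the flat layout is a simplification.
        body.put("failure", "org.apache.solr.client.solrj.SolrServerException:"
            + "IOException occured when talking to server at: https://vt-searchln04:443/solr");
        try {
            handle(500, body);
        } catch (ResponseCarryingException e) {
            // The caller can still inspect the original "failure" entry.
            System.out.println(e.getStatus() + " -> " + e.getResponse().get("failure"));
        }
    }
}
```

With something along these lines, the exception message can stay short while callers who want the details read them from the attached response.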
[jira] [Commented] (SOLR-6595) Improve error response in case distributed collection cmd fails
[ https://issues.apache.org/jira/browse/SOLR-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699566#comment-16699566 ]

Jason Gerlowski commented on SOLR-6595:
---

I've attached a patch here which ensures that any collection-api response with a non-empty "failure" property also has its status set to 500. This has the advantage of covering things more generically and saves us from constantly finding new cases where the status property (and HTTP status code) is incorrect. (There are a few different JIRAs open at the moment for similar issues with various collection APIs.)

Reviewers might notice that I change the status to 500 not by throwing a SolrException as is common, but by introducing a field in SolrQueryResponse as a "status-override". I didn't like deviating from the normal way of doing things, and I don't love introducing yet another way to set the API status, but I had trouble finding a good way to flatten the often-nested structure of the "failure" map into a message for a SolrException without losing tons of information that could help the user out. If anyone sees a better way here, I'd love some review/feedback.

This change triggers a few additional test failures: the API calls in these tests have apparently been failing for some time, but we never noticed since the response status obscured the problem. So this patch includes fixes for a number of these tests. I'm still building confidence that I've caught all of these cases, and I'm hoping to flush out more status-related test failures through the week. If my runs stop finding issues by the end of the week, I'll be looking to commit.
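The rule the patch describes (a non-empty "failure" entry forces status 500, while the nested failure details stay in the response untouched) might look roughly like the sketch below. It is not the patch itself: `FailureStatusSketch` and `effectiveStatus` are hypothetical names, plain `Map` objects stand in for Solr's response types, and the `node1` key is made up for illustration.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: plain Maps stand in for Solr's response objects.
class FailureStatusSketch {

    // Returns 500 when the response has a non-empty "failure" entry, otherwise the current status.
    static int effectiveStatus(Map<String, Object> response, int currentStatus) {
        Object failure = response.get("failure");
        if (failure instanceof Map && !((Map<?, ?>) failure).isEmpty()) {
            return 500;
        }
        if (failure instanceof Collection && !((Collection<?>) failure).isEmpty()) {
            return 500;
        }
        return currentStatus;
    }

    public static void main(String[] args) {
        // "node1" is an illustrative key; the message text comes from the issue description.
        Map<String, Object> failure = new LinkedHashMap<>();
        failure.put("node1", "org.apache.solr.client.solrj.SolrServerException:"
            + "IOException occured when talking to server at: https://vt-searchln04:443/solr");

        Map<String, Object> response = new LinkedHashMap<>();
        response.put("failure", failure);

        // The status is overridden; the nested failure map is left as it was.
        System.out.println("status=" + effectiveStatus(response, 0));
        System.out.println("failure=" + response.get("failure"));
    }
}
```

Overriding the status this way avoids flattening the nested failure structure into a single exception message, which is the trade-off the comment describes.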
[jira] [Commented] (SOLR-6595) Improve error response in case distributed collection cmd fails
[ https://issues.apache.org/jira/browse/SOLR-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692329#comment-16692329 ]

Jason Gerlowski commented on SOLR-6595:
---

Wanted to check in on this and see which of the original concerns are still issues:

bq. Status=0 when the cmd did not succeed

Still a problem, though it will soon be fixed for CREATE, the reporter's original example here.

bq. The error printed does not tell which action failed

Still a problem, but a hard one: it's tough to guess which bits in the exception chain are the helpful bits. The top and root of the chain are the most likely entries to be interesting, but not always. Any truncation of the exception chain is going to reduce the chance we're conveying the important part.

bq. State of collection is not clean since it exists as far as ZK is concerned but cores not created

This _should_ have already been fixed in SOLR-8983.

So I'd argue that fixing the {{status}} property should be our main goal. To that end, I've attached a patch fixing this problem for CREATE on SOLR-5970. I don't like the narrowness of that fix though, so I will spend some time seeing if there's a way it can be generalized at a different level of our collection API processing.

Going to assign this to myself.
[jira] [Commented] (SOLR-6595) Improve error response in case distributed collection cmd fails
[ https://issues.apache.org/jira/browse/SOLR-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575214#comment-15575214 ]

Jan Høydahl commented on SOLR-6595:
---

I wonder if the error reporting might have been solved during all the refactoring of the overseer, async operations, etc.? Anyone?
[jira] [Commented] (SOLR-6595) Improve error response in case distributed collection cmd fails
[ https://issues.apache.org/jira/browse/SOLR-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036518#comment-15036518 ]

mugeesh commented on SOLR-6595:
---

Nobody in the conversation above says clearly how to solve this. I am getting the same error in solr-5.3. Please provide the exact command for creating a collection/core.
[jira] [Commented] (SOLR-6595) Improve error response in case distributed collection cmd fails
[ https://issues.apache.org/jira/browse/SOLR-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183481#comment-14183481 ]

Jan Høydahl commented on SOLR-6595:
---

Appreciate feedback and discussion on how to solve this...
[jira] [Commented] (SOLR-6595) Improve error response in case distributed collection cmd fails
[ https://issues.apache.org/jira/browse/SOLR-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168385#comment-14168385 ]

Jan Høydahl commented on SOLR-6595:
---

Comments on the three problems listed in the updated problem description:
# What error code to return? Anything is better than 0. In this case it's a server configuration error, so 5xx? But where to modify the status code? Perhaps {{OverseerCollectionProcessor#processResponse()}}?
# How about printing the Exception-class names of all intermediate exceptions in the chain and then the message from the original one?
# Rollback of a partially successful collection create would be interesting, but deserves its own JIRA perhaps :-)
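The second suggestion (class names of every exception in the chain, followed by the message of the original one) could be sketched as below. This is a minimal illustration, not Solr code: `ExceptionChainSketch` and `summarize` are hypothetical names, the depth limit of 50 is an arbitrary guard, and the exceptions in `main` are made-up samples.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Solr.
class ExceptionChainSketch {

    // Joins the simple class names of the cause chain, then appends the root cause's message.
    static String summarize(Throwable top) {
        List<String> names = new ArrayList<>();
        Throwable current = top;
        Throwable root = top;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; current != null && depth < 50; depth++) {
            names.add(current.getClass().getSimpleName());
            root = current;
            current = current.getCause();
        }
        return String.join(" -> ", names) + ": " + root.getMessage();
    }

    public static void main(String[] args) {
        // Made-up sample chain for illustration only.
        Throwable chain = new IllegalStateException("create failed",
            new RuntimeException("request failed",
                new IOException("connection refused")));
        // prints: IllegalStateException -> RuntimeException -> IOException: connection refused
        System.out.println(summarize(chain));
    }
}
```

This keeps the whole chain visible in compact form, at the cost of dropping the intermediate messages.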