[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583211#comment-16583211 ] Erick Erickson commented on LUCENE-8264: OK, unless there are objections I'm going to close this as "Won't Fix". At minimum "fixing what we can" (things like adding docValues and the like) belongs at a "consumer of Lucene" level. The original intent here was to be able to go from Lucene X-2 -> Lucene X-1 -> Lucene X. That's simply not gonna happen for the reasons enumerated here. There might be some marginal utility in getting all segments from X-1 -> X, but I don't see it as enough of a benefit to be worth the effort. > Allow an option to rewrite all segments > --- > > Key: LUCENE-8264 > URL: https://issues.apache.org/jira/browse/LUCENE-8264 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Erick Erickson >Assignee: Erick Erickson >Priority: Major > > For the background, see SOLR-12259. > There are several use-cases that would be much easier, especially during > upgrades, if we could specify that all segments get rewritten. > One example: Upgrading 5x->6x->7x. When segments are merged, they're > rewritten into the current format. However, there's no guarantee that a > particular segment _ever_ gets merged so the 6x-7x upgrade won't necessarily > be successful. > How many merge policies support this is an open question. I propose to start > with TMP and raise other JIRAs as necessary for other merge policies. > So far the usual response has been "re-index from scratch", but that's > increasingly difficult as systems get larger. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495893#comment-16495893 ] Erick Erickson commented on LUCENE-8264: Sorry, been away for a while. [~janhoy] re: UninvertDocValuesMergePolicyFactory: true, but wouldn't it be nice from an ops perspective to just be able to do this as a single operation? [~simonw] Thanks for the pointer, I'll look at this when I have a bit more breathing room. I think you're right, this is probably a Solr/ES issue in terms of making it convenient from an admin perspective. I suspect it'll be something like an API command that does the wrapping magic. How to make it maximally convenient is something we'll wrestle to the ground.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474157#comment-16474157 ] Jan Høydahl commented on LUCENE-8264: - [~erickerickson] check out SOLR-10046, which seems to do what you want in <2> through {{UninvertDocValuesMergePolicyFactory}} already, or do I misunderstand?
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474131#comment-16474131 ] Simon Willnauer commented on LUCENE-8264: - [~erickerickson] # For N-1 -> N we have _org.apache.lucene.index.UpgradeIndexMergePolicy_ ? # In order to add DV I think this should be done by wrapping a codec reader. I personally think this is quite an edge case and should be done in the higher-level application, i.e. Solr itself. You can do this quite easily with _org.apache.lucene.index.OneMergeWrappingMergePolicy_, similar to what we do in the soft delete case in _SoftDeletesRetentionMergePolicy_. Am I missing something?
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469894#comment-16469894 ] Erick Erickson commented on LUCENE-8264: See comment 9-May.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452072#comment-16452072 ] Michael McCandless commented on LUCENE-8264: I don't think it's realistic to expect Lucene to carry forward an index forever. This really is the difference between an index and a database: we do not store, precisely, the original documents. We store an efficient derived/computed index from them. Yes, Solr/ES can add database-like behavior where they hold the true original source of the document and use that to rebuild Lucene indices over time. But Lucene really is just a "search index" and we need to be free to make important improvements with time.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451952#comment-16451952 ] Simon Willnauer commented on LUCENE-8264: - I totally agree with Robert here, good collection of valid technical points. We can't let lurking corruptions happen. The improvements made to norms here are awesome and we need to move forward with stuff like this. Also, after looking at the details, I am convinced the guarantees that this restriction gives us are crucial to the future of Lucene. We can't support lurking corruptions for users that come from ancient versions by converting (merging / rewriting segments) from N-X to N in steps that nobody ever tested. Also the points about the database aspect are very much valid. We need raw data to re-create these indices reliably, and if you are running on top of a search engine you need to account for reindexing. Btw. we have implicitly had this restriction in ES since 1.0. We always only supported N-1 major versions for ES indices, yet they happen to correspond to N-1 Lucene major versions. A lot of work has also gone into supporting searching across major versions of ES to allow users to stay on older versions for retention policy purposes. Some of these conversations are not easy but necessary for us to prevent support insanity. That said, I think there might be room for N-X at some point as long as the guarantee is only N-1. At some point we might allow the min index created version to be 7 even if you are on 9. But for us to make progress we need to be free to break and only guarantee N-1. Also, what this means is that your indices are supported for ~2.5 years, which is the historical major release cadence. I think it's important to keep this in mind.
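The rule described in this comment can be sketched in a few lines of plain Java. This is only a toy model under stated assumptions: `IndexCompat`, `strictFloor` and `canOpen` are hypothetical names made up for illustration, not Lucene API, and the real check lives inside Lucene and is more involved. The sketch keys on the major version that originally created the index, which is why rewriting segments alone does not help.

```java
/** Toy model of the created-version check; hypothetical names, not Lucene's API. */
final class IndexCompat {
    /** The strict guarantee from the discussion: only the previous major is supported. */
    static int strictFloor(int currentMajor) {
        return currentMajor - 1;
    }

    /**
     * An index can be opened if the major version that originally created it
     * lies between the supported floor and the running version (inclusive).
     */
    static boolean canOpen(int createdMajor, int currentMajor, int minCreatedMajor) {
        return createdMajor >= minCreatedMajor && createdMajor <= currentMajor;
    }

    public static void main(String[] args) {
        // Created on 6, running 8, strict N-1 floor: the created version is below the floor.
        System.out.println(canOpen(6, 8, strictFloor(8)));
        // Created on 7, running 9, with the floor relaxed to 7 as floated in the comment.
        System.out.println(canOpen(7, 9, 7));
    }
}
```

Passing the floor as a parameter mirrors the remark that the guarantee can stay at N-1 while the actual minimum might be relaxed later.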
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451382#comment-16451382 ] Robert Muir commented on LUCENE-8264: - It's not possible to warn-only: the encoding of things changed completely. I think the key issue here is Lucene is an *index* not a *database*. Because it is a lossy *index* and does not retain all of the user's data, it's not possible to safely migrate some things automagically. In the norms case IndexWriter needs to re-analyze the text ("re-index") and compute stats to get back the value, so it can be re-encoded. The function is {{y = f(x)}} and if {{x}} is not available it's not possible, so Lucene can't do it. Also related to this change, in some cases it's necessary for the user to migrate away from index-time boosts. The removal of these is what opened the door to Adrien's more efficient encoding here. So the user has to decide to put such feature values into a NumericDocValuesField and use expressions/function queries to combine with the document's score, or via the new FeatureField (which can be much more efficient), or whatever. This case is interesting because it emphasizes there are other things besides just the original document's text that need to be dealt with on upgrades. I don't agree with the idea that Lucene should be forced to drag along all kinds of nonsense data and slowly corrupt itself over time, or that some improvements aren't possible because the format can't be changed. Instead I think projects like Solr that advertise themselves as a *database* need to add the ability to regenerate a new Lucene index efficiently (e.g. minimizing network traffic across distributed nodes, etc). They need to use the additional stuff they have (e.g. original user's data, abstractions of some sort over Lucene stuff like scoring features) to make this easier. Lucene is just the indexing/search library.
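The {{y = f(x)}} argument can be made concrete with a small sketch. The quantizer below is a made-up stand-in chosen for brevity; it is not Lucene's actual norm encoding, and `LossyNormDemo`, `encode` and `preimage` are hypothetical names. It only shows the general shape of the problem: many field lengths collapse onto one stored value, so the original length cannot be recovered from the stored value alone.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a lossy norm: NOT Lucene's real encoding, only an illustration. */
final class LossyNormDemo {
    /** Hypothetical quantizer: store only the bit length of the field length. */
    static byte encode(int fieldLength) {
        if (fieldLength <= 0) {
            return 0;
        }
        return (byte) (32 - Integer.numberOfLeadingZeros(fieldLength));
    }

    /** All lengths in [1, max] that map to the given stored value. */
    static List<Integer> preimage(byte stored, int max) {
        List<Integer> lengths = new ArrayList<>();
        for (int len = 1; len <= max; len++) {
            if (encode(len) == stored) {
                lengths.add(len);
            }
        }
        return lengths;
    }

    public static void main(String[] args) {
        byte stored = encode(100);
        // Several distinct lengths share this stored value, so f cannot be inverted.
        System.out.println("stored=" + stored + " candidates=" + preimage(stored, 200).size());
    }
}
```

Re-encoding into a different scheme needs the true length, and under this toy quantizer that means going back to the source text, which is the "re-index" step the comment refers to.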
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451326#comment-16451326 ] Jan Høydahl commented on LUCENE-8264: - I'm also puzzled by the strictness introduced by LUCENE-7837 from 8.0. I'm fine with keeping that behaviour as the default, but we could add a config option to fall back to warn-only, so that Lucene users such as offline upgrade tools can choose to handle created=N-2 situations in a custom way.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450134#comment-16450134 ] Erick Erickson commented on LUCENE-8264: bq. I am a true -1 to making a tool that will screw up scoring, sorry. It was surprising to me how many applications I became involved in where scoring was irrelevant. I still have to check my assumptions at the door when working with a new client on that score (little pun there). Conversely, scoring is everything to other clients I work with and screwing up scoring would be a major problem for them. One-size-fits-all doesn't reflect my experience at all though. Having something that silently "did the best it could" automagically would lead to its own problems, so having something like this silently kick in isn't a good option. I'm not going to enjoy the conversations that start with "Well, you have to re-index from scratch for your app or stay on version 7x forever, there is no other option". Yet explaining weird results to a customer isn't very much fun either, especially when it's a surprise to them. At least when they upgrade and things don't load at all they won't be surprised by subtle problems. Surprised by total inability to do anything, maybe. But that's not subtle. I also dread taking customer X and trying to explain to them all the gotchas with a tool that upgrades manually. "Well, you'll be able to search but if you originally indexed with X, then the consequence will be Y" through about 30 iterations. So I'm a little lost here on what to do. _Strongly_ recommending that people reindex is obvious, but then maybe the fallback is to send Uwe a lot of business... So is this going to be the official stance going forward? Lucene supports version N-1 if (and only if) it was originally created with N-1? Or will this upgrade problem go away absent the original problem and people will be able to go from an index produced with 8->9->10? Or is that TBD?
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449944#comment-16449944 ] Mark Miller commented on LUCENE-8264: - bq. Sorry, I don't think the discussion makes much sense. The discussion makes sense; it sounds like you think making some kind of tool doesn't make sense. bq. The stuff like norms changes requires reindex, like the inverted index, the data is stored in a lossy way. Lucene can't do anything about it: it's an index. That's been covered in the discussion: sometimes you can't do anything, and that's why Lucene currently has this limitation. bq. I am a true -1 to making a tool that will screw up scoring, sorry. Same old helpful Robert. Type fast and carry a big veto. Happy belated birthday!
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449787#comment-16449787 ] David Smiley commented on LUCENE-8264: -- Fascinating discussion. Shawn said: {quote}Now I'm hearing differently ... that any user who has successfully done this has just gotten lucky, and that there's no guarantee for the future. {quote} I don't think it's quite that bleak. I believe each segment records metadata of the Lucene version, so we could explicitly know whether or not the index contains segments older than the current version. One could even write a tool to spit out the IDs of those documents to facilitate a re-index of just those documents.
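A rough sketch of the tool idea from this comment, in plain Java. Everything here is a hypothetical stand-in: `Segment` is a made-up holder for per-segment metadata (name, the major version that wrote it, and the IDs of its documents), not a Lucene class, and a real tool would have to read this information from the index itself. The sketch only shows the selection logic: collect the document IDs that live in segments written by an older major version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a "which documents need re-indexing" report; not Lucene API. */
final class StaleSegmentReport {
    /** Hypothetical stand-in for per-segment metadata. */
    static final class Segment {
        final String name;
        final int writtenByMajor;
        final List<String> docIds;

        Segment(String name, int writtenByMajor, List<String> docIds) {
            this.name = name;
            this.writtenByMajor = writtenByMajor;
            this.docIds = docIds;
        }
    }

    /** IDs of all documents stored in segments older than the current major version. */
    static List<String> docIdsNeedingReindex(List<Segment> segments, int currentMajor) {
        List<String> ids = new ArrayList<>();
        for (Segment segment : segments) {
            if (segment.writtenByMajor < currentMajor) {
                ids.addAll(segment.docIds);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
            new Segment("_0", 6, Arrays.asList("doc-1", "doc-2")),
            new Segment("_1", 7, Arrays.asList("doc-3")));
        // Only documents from the segment written by major version 6 are reported.
        System.out.println(docIdsNeedingReindex(segments, 7));
    }
}
```

The output of such a report could feed a partial re-index from the original source, which only works if that source is still available.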
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449759#comment-16449759 ] Uwe Schindler commented on LUCENE-8264: --- I have no problem with not making that tool public. This is how I have been earning my money for a few months: bringing stone-age indexes up to date (adding docvalues). Those people know that's wrong, and scoring is not an issue for them in most cases. If it is, we work on reindexing, but sometimes that's really impossible. All those people were Lucene-only customers. It's cool, because people back in the 2.x/3.x days were already using Lucene as their only storage; unfortunately not everything was also stored, so some stuff is "not easy" to reindex in a fast way (like extracting all text again from PDF files).
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449727#comment-16449727 ] Robert Muir commented on LUCENE-8264: - Sorry, I don't think the discussion makes much sense. The stuff like norms changes requires reindex, like the inverted index, the data is stored in a lossy way. Lucene can't do anything about it: it's an index. I am a true -1 to making a tool that will screw up scoring, sorry.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449710#comment-16449710 ] Robert Muir commented on LUCENE-8264: - {quote} To be absolutely honest I was surprised by this as well. I think the reasons behind this change make sense to me but the implications are big. I am not sure if the strictness here comes only from the broken TermVectors offsets or not but if so can we discuss relaxing this a bit. {quote} Besides the term vectors stuff: bugs in norms got fixed (I worked with Adrien to fix one of them), the representation improved (Adrien improved that), and other things.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449702#comment-16449702 ] Uwe Schindler commented on LUCENE-8264: --- Simon: This is exactly the case: people have simple analysis chains or they are using ClassicAnalyzer, which is still supported. In my case, one of the customers just had his own version of the old analyzer from 2.x days and ported it, just to keep the indexes working. The only issue was missing doc values (no fieldcache anymore), so there was the requirement to add doc values. Based on that, the upgrade path was easy: take the tokens as is (offsets were not an issue, because there was no highlighting, only "old-style" highlighting with reanalyzing text). The Analyzers were ported. The IndexUpgrade we wrote was just like the above code (using UninvertingReader to add doc values) and additionally added all (filter-wrapped) segments of the original index one by one, so keeping the old segment structure. So yes, we should think of adding some infrastructure to do manual upgrades (configurable), so you don't have to hack crazy filterreaders. Maybe add some options like "add docvalues for field x", "convert Numeric/Trie fields to Points" (even that is possible, with some limitations, by using UninvertingReader!), "keep tokens alive", or "drop offsets completely" (e.g., if broken). But as said before, by default: don't support that without manual intervention. But we should not let people fail completely when upgrading.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449666#comment-16449666 ] Simon Willnauer commented on LUCENE-8264:
To be absolutely honest, I was surprised by this as well. The reasons behind this change make sense to me, but the implications are big. I am not sure whether the strictness here comes only from the broken TermVectors offsets, but if so, can we discuss relaxing this a bit? This change took a couple of committers by surprise (including myself), and I wonder if we can take a step back and revisit this decision. While there are a bunch of other issues when you, for instance, go from 3.x to 7.x (your tokenization / analysis chain isn't supported anymore, etc.), there are valid use cases for upgrading your index via background merges that rewrite the index format. Issues like unsupported analysis chains should be handled by higher-level apps like Solr or ES. There are tons of people who use Lucene as a retrieval engine doing very simple whitespace tokenization, and for them a merge from 3.x to 7.x might be just fine. I think it would be good to have the conversation again, even though the changes were communicated very openly. [~jpountz] [~thetaphi] [~rcmuir] [~mikemccand] [~dweiss] WDYT?
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449219#comment-16449219 ] Shawn Heisey commented on LUCENE-8264:
I'm with Erick. I had always understood that you could use IndexUpgrader to migrate one major version at a time and use the index with a much newer version, as long as all the field type classes writing data into the index are still around in the newer version. Now I'm hearing differently ... that any user who has successfully done this has just gotten lucky, and that there's no guarantee for the future. I am not greatly impacted by this, because I always prefer to build indexes from scratch whenever I upgrade, and I recommend to anyone who will listen that they do the same. We do have users with extremely large indexes, who have a very difficult time reindexing from scratch.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448895#comment-16448895 ] Erick Erickson commented on LUCENE-8264:
So let me see if I have this straight. The guarantee is that Lucene N will read indexes produced with N-1 if (and only if) the index was originally created entirely by version N-1? IOW if Lucene N-2 was used at any point in history no guarantees are made. And any upgrade process that would overcome that issue would be on a case-by-case, one-off basis that would stand no chance of making it into the Lucene code base.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447804#comment-16447804 ] Uwe Schindler commented on LUCENE-8264:
Ah, I was always using a FilterLeafReader with a SlowCodecReaderWrapper (because those people needed to add DocValues from UninvertingReader, the one from Solr). So yes, in my case, the FilterLeafReader was returning {{new LeafMetadata(Version.LUCENE_7_0_0, Version.LUCENE_7_0_0, null)}} in {{getMetadata()}}. Very easy, and it worked. Broken offsets were not an issue there; if they had been, we could have removed them by changing the field metadata in the filter reader. I just wanted to say: it's good that we still allow people to do the migration, but I agree with you: this should not be provided by default in Solr or Lucene. People need to know about the side effects.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447767#comment-16447767 ] Simon Willnauer commented on LUCENE-8264:
> It worked at least until 7.x. As I said, you can remove offsets if needed.
> And of course a FilterLeafReader together with SlowCodecReaderWrapper is
> definitely needed.
I am not so sure about this; at least [this|https://github.com/apache/lucene-solr/blob/branch_7x/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L2756] will fail, and it has been in there since 7.0.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447759#comment-16447759 ] Uwe Schindler commented on LUCENE-8264:
It worked at least until 7.x. As I said, you can remove offsets if needed. And of course a FilterLeafReader together with SlowCodecReaderWrapper is definitely needed.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447758#comment-16447758 ] Simon Willnauer commented on LUCENE-8264:
[~thetaphi] I don't think this is going to work here. IndexWriter#validateMergeReader will prevent you from doing this unless you add some evil hacks.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447756#comment-16447756 ] Uwe Schindler commented on LUCENE-8264:
This way it is also possible to add DocValues using UninvertingReader. That was another thing the old customers needed to do (they sorted on a date field). With FilterLeafReaders you can do a lot of stuff. If you can't fix offsets, you may also simply remove them. For some people this is the only way.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447746#comment-16447746 ] Uwe Schindler commented on LUCENE-8264:
Uhm, Dawid, that won't work with hard copies. I explicitly said: merge the CodecReaders!
{code:java}
// leaves() returns LeafReaderContext objects; unwrap each one to get its reader.
for (LeafReaderContext ctx : directoryReader.leaves()) {
  // For a plain DirectoryReader the leaves are SegmentReaders, which are CodecReaders.
  target.addIndexes((CodecReader) ctx.reader());
  target.flush(); // this ensures that the segment is not merged NOW
}
{code}
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447713#comment-16447713 ] Dawid Weiss commented on LUCENE-8264:
Ah, right. I wasn't aware of that. I like Uwe's idea though; if you use HardlinkCopyDirectoryWrapper, that seems like a fairly cheap way to bring segments_* up to date, together with the segments?
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447707#comment-16447707 ] Simon Willnauer commented on LUCENE-8264:
[~dweiss] I think you are not aware of the fact that an index that was created with N-2 won't be supported by N even if you rewrite all segments. The created version is baked into the segments file, and Lucene will not open it even if all segments are on N or N-1. There are several reasons for this, for instance to reject broken offsets in term vectors in Lucene 7. We can never enforce limits like this if we keep on upgrading stuff behind the scenes that didn't have these protections.
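The rule Simon describes can be illustrated with a small toy model. This is only a sketch and not Lucene code: the class `IndexCompat`, the nested `ToyIndex` type, and the methods `rewriteAllSegments` and `canOpen` are hypothetical names invented for this illustration, and the check is a simplification of what the thread describes (a recorded creation version that survives segment rewrites).

```java
import java.util.Collections;
import java.util.List;

/** Toy model of the "created version" rule discussed in this thread; NOT Lucene code. */
public final class IndexCompat {

    /** Hypothetical stand-in for an index: the major version that created it, plus one major version per segment. */
    public static final class ToyIndex {
        final int createdMajor;
        final List<Integer> segmentMajors;

        public ToyIndex(int createdMajor, List<Integer> segmentMajors) {
            this.createdMajor = createdMajor;
            this.segmentMajors = segmentMajors;
        }

        /** Rewriting every segment changes the segment versions but leaves createdMajor untouched. */
        public ToyIndex rewriteAllSegments(int currentMajor) {
            return new ToyIndex(createdMajor, Collections.nCopies(segmentMajors.size(), currentMajor));
        }
    }

    /** Version N opens an index only if it was created by N or N-1 and every segment is on N or N-1. */
    public static boolean canOpen(ToyIndex index, int currentMajor) {
        if (index.createdMajor < currentMajor - 1) {
            return false; // created too long ago, regardless of the state of the segments
        }
        for (int segmentMajor : index.segmentMajors) {
            if (segmentMajor < currentMajor - 1) {
                return false; // a segment that was never rewritten
            }
        }
        return true;
    }
}
```

In this model, an index created by major version 6 and fully rewritten by version 7 still reports 6 as its creation version, so `canOpen(index, 8)` stays false however the segments are rewritten afterwards.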
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447703#comment-16447703 ] Uwe Schindler commented on LUCENE-8264:
IndexUpgrader and UpgradeIndexMergePolicy should do the job, but they merge all segments. One thing that works: open the old index and merge it segment by segment, using addIndexes(Leaf/CodecReader), into a completely new one. Important: do it step by step, so the segments don't get merged. Not sure how this behaves with corrupt offsets.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447700#comment-16447700 ] Adrien Grand commented on LUCENE-8264:
The problem I described isn't related to merges but to the data that is stored in segments. A 6x index won't be upgradeable to 8x, even if you merge all segments so that they use the 7x file formats. This will require a reindex.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447686#comment-16447686 ] Dawid Weiss commented on LUCENE-8264:
Adrien, I think the point here is to trigger a rewrite of all segments to forcefully update them from N-1 to N, without waiting for the merges (which would do the same) to occur; if merges take a long time or never happen, this can cause exactly the sort of problem you've described.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447672#comment-16447672 ] Adrien Grand commented on LUCENE-8264:
Lucene only supports reading indices generated by versions N or N-1, so I don't think we should help users go into unsupported territory by attempting 5x->6x->7x upgrades? For the record, it might "work" in that very particular case because we didn't change anything on top of the codec API between 5x and 6x (I think?), but a 6x->7x->8x would be problematic since 7x started rejecting corrupt offsets and changed the encoding of norms, so attempting a 6x->7x->8x upgrade this way could propagate corrupt offsets to 8x and would corrupt norms. To avoid this trap we started recording the creation version in LUCENE-7703 and LUCENE-7756, and then failed opening 6x (or less) indices with 8x in LUCENE-7837.
[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments
[ https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447257#comment-16447257 ] Shawn Heisey commented on LUCENE-8264:
On the dev list, [~yriveiro] replied to this issue. His indexes are up to 15 terabytes. (yowza!) Reindexing from scratch on an index that big is something you can't just decide to do one day. I really like the idea of rewriting all segments without merging them. The way that IndexUpgrader currently works can cause the LUCENE-7976 problems.