[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-08-16 Thread Erick Erickson (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583211#comment-16583211
 ] 

Erick Erickson commented on LUCENE-8264:


OK, unless there are objections I'm going to close this as "Won't Fix". At 
minimum "fixing what we can" (things like adding docValues and the like) 
belongs at a "consumer of Lucene" level.

The original intent here was to be able to go from Lucene X-2 -> Lucene 
X-1 -> Lucene X. That's simply not gonna happen for the reasons enumerated here.

There might be some marginal utility in getting all segments from X-1 -> X, but 
I don't see it as enough of a benefit to be worth the effort.

> Allow an option to rewrite all segments
> ---
>
> Key: LUCENE-8264
> URL: https://issues.apache.org/jira/browse/LUCENE-8264
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
>
> For the background, see SOLR-12259.
> There are several use-cases that would be much easier, especially during 
> upgrades, if we could specify that all segments get rewritten. 
> One example: Upgrading 5x->6x->7x. When segments are merged, they're 
> rewritten into the current format. However, there's no guarantee that a 
> particular segment _ever_ gets merged so the 6x-7x upgrade won't necessarily 
> be successful.
> How many merge policies support this is an open question. I propose to start 
> with TMP and raise other JIRAs as necessary for other merge policies.
> So far the usual response has been "re-index from scratch", but that's 
> increasingly difficult as systems get larger.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-05-30 Thread Erick Erickson (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495893#comment-16495893
 ] 

Erick Erickson commented on LUCENE-8264:


Sorry, been away for a while.

[~janhoy] re: UninvertDocValuesMergePolicyFactory. True, but wouldn't it be 
nice from an ops perspective to just be able to do this as a single operation?

[~simonw] Thanks for the pointer, I'll look at this when I have a bit more 
breather. I think you're right, this is probably a Solr/ES issue in terms of 
making it convenient from an admin basis. I suspect it'll be something like an 
API command that does the wrapping magic. How to make it maximally convenient 
is something we'll wrestle to the ground.







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-05-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474157#comment-16474157
 ] 

Jan Høydahl commented on LUCENE-8264:
-

[~erickerickson] check out SOLR-10046, which seems to do what you want in <2> 
through {{UninvertDocValuesMergePolicyFactory}} already, or do I misunderstand?







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-05-14 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474131#comment-16474131
 ] 

Simon Willnauer commented on LUCENE-8264:
-

[~erickerickson]

 
 # For N-1 -> N we have _org.apache.lucene.index.UpgradeIndexMergePolicy_ ?
 # In order to add DV I think this should be done by wrapping a codec reader. I 
personally think this is quite an edge case and should be done in the 
higher-level application, i.e. Solr itself. You can do this quite easily with 
_org.apache.lucene.index.OneMergeWrappingMergePolicy_, similar to what we do in 
the soft-delete case in _SoftDeletesRetentionMergePolicy_.

Am I missing something?

 







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-05-09 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469894#comment-16469894
 ] 

Erick Erickson commented on LUCENE-8264:


See comment 9-May.







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452072#comment-16452072
 ] 

Michael McCandless commented on LUCENE-8264:


I don't think it's realistic to expect Lucene to carry forward an index 
forever.  This really is the difference between an index and a database: we do 
not store, precisely, the original documents.  We store an efficient 
derived/computed index from them.  Yes, Solr/ES can add database-like behavior 
where they hold the true original source of the document and use that to 
rebuild Lucene indices over time.  But Lucene really is just a "search index" 
and we need to be free to make important improvements with time.







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-25 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451952#comment-16451952
 ] 

Simon Willnauer commented on LUCENE-8264:
-

I totally agree with Robert here, good collection of valid technical points. We 
can't let lurking corruptions happen. The improvements made to norms here are 
awesome and we need to move forward with stuff like this. Also, after looking at 
the details, I am convinced the guarantees that this restriction gives us are 
crucial to the future of Lucene. We can't support lurking corruptions for users 
that come from ancient versions by converting (merging / rewriting segments) 
from N-X to N in steps that nobody ever tested.

Also the points about the database aspect are very much valid. We need raw data 
to re-create these indices reliably and if you are running on top of a search 
engine you need to account for reindexing.

Btw. we have had this restriction in ES since 1.0 implicitly. We have always 
only supported N-1 major versions for ES indices, and they happen to 
correspond to N-1 Lucene major versions. A lot of work has also gone 
into supporting searching across major versions of ES to allow users to stay on 
older versions for retention policy purposes. Some of these conversations are 
not easy but necessary for us to prevent support insanity. 

That said, I think there might be room for N-X at some point as long as the 
guarantee is only N-1. At some point we might allow the min index created 
version to be 7 even if you are on 9. But for us to make progress we need to be 
free to break and only guarantee N-1. 

Also, what this means is that your indices are supported for ~2.5 years, which 
is the historical major release cadence. I think it's important to keep this in 
mind.
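
The "only guarantee N-1" rule described above can be sketched as a small check. 
This is a hypothetical model, not Lucene or ES code: the class name 
{{IndexCompatibilitySketch}} and the method {{isSupported}} are made up for 
illustration, and the rule is assumed to depend only on the major version that 
originally created the index.

```java
// Hypothetical model of the "only guarantee N-1" rule; not Lucene or ES code.
final class IndexCompatibilitySketch {
    private IndexCompatibilitySketch() {}

    /**
     * Returns true if an index originally created with major version
     * createdMajor may be opened by software at major version currentMajor.
     */
    static boolean isSupported(int createdMajor, int currentMajor) {
        if (createdMajor > currentMajor) {
            return false; // index was created by a newer release than the reader
        }
        // Only the current major (N) and the previous one (N-1) are accepted.
        return currentMajor - createdMajor <= 1;
    }
}
```

Under this model an index created with 7 opens on 8, while an index created 
with 6 does not, even if every segment was later rewritten by 7.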







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451382#comment-16451382
 ] 

Robert Muir commented on LUCENE-8264:
-

It's not possible to warn-only: the encoding of things changed completely. I 
think the key issue here is Lucene is an *index* not a *database*. Because it 
is a lossy *index* and does not retain all of the user's data, it's not possible 
to safely migrate some things automagically. In the norms case IndexWriter 
needs to re-analyze the text ("re-index") and compute stats to get back the 
value, so it can be re-encoded. The function is {{y = f(x)}} and if {{x}} is 
not available it's not possible, so Lucene can't do it.
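
To make the {{y = f(x)}} argument concrete, here is a minimal sketch of a lossy 
encoding. It is NOT Lucene's actual norm encoding: the class 
{{LossyNormSketch}} and its bucketing scheme are assumptions, chosen only to 
show that distinct inputs collapse to the same stored value, so {{x}} cannot be 
recovered from {{y}} alone.

```java
// Hypothetical illustration only: NOT Lucene's real norm encoding.
final class LossyNormSketch {
    private LossyNormSketch() {}

    /** Encodes a field length (token count) into one byte by keeping only its bit length. */
    static byte encode(int fieldLength) {
        if (fieldLength < 0) {
            throw new IllegalArgumentException("fieldLength must be >= 0");
        }
        // Number of significant bits: 0 for 0, 1 for 1, 2 for 2..3, 3 for 4..7, ...
        return (byte) (32 - Integer.numberOfLeadingZeros(fieldLength));
    }

    /** Best-effort decode: returns the smallest length that maps to this byte. */
    static int decodeLowerBound(byte norm) {
        return norm == 0 ? 0 : 1 << (norm - 1);
    }
}
```

In this sketch the lengths 100 and 127 are stored as the same byte, and 
decoding gives back only a lower bound. Re-encoding under a different scheme 
would need the original text to be analyzed again.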

Also related to this change, in some cases, it's necessary for the user to 
migrate away from index-time boosts. The removal of these is what opened the 
door to Adrien's more efficient encoding here. So the user has to decide to put 
such feature values into a NumericDocValuesField and use expressions/function 
queries to combine with the document's score, or use the new FeatureField (which 
can be much more efficient), or whatever. This case is interesting because it 
emphasizes there are other things besides just the original document's text 
that need to be dealt with on upgrades.

I don't agree with the idea that Lucene should be forced to drag along all 
kinds of nonsense data and slowly corrupt itself over time, or that some 
improvements aren't possible because the format can't be changed. Instead I 
think projects like Solr that advertise themselves as a *database* need to add 
the ability to regenerate a new Lucene index efficiently (e.g. minimizing 
network traffic across distributed nodes, etc). They need to use the additional 
stuff they have (e.g. original user's data, abstractions of some sort over 
Lucene stuff like scoring features) to make this easier. Lucene is just the 
indexing/search library.







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451326#comment-16451326
 ] 

Jan Høydahl commented on LUCENE-8264:
-

I'm also puzzled by the strictness introduced by LUCENE-7837 from 8.0. I'm 
fine with keeping that behaviour as the default, but we could add a config 
option to fall back to warn-only, so that Lucene users such as offline upgrade 
tools can choose to handle created=N-2 situations in a custom way.







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-24 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450134#comment-16450134
 ] 

Erick Erickson commented on LUCENE-8264:


bq. I am a true -1 to making a tool that will screw up scoring, sorry.

It was surprising to me how many applications I became involved in where 
scoring was irrelevant. I still have to check my assumptions at the door when 
working with a new client on that score (little pun there).

Conversely, scoring is everything to other clients I work with and screwing up 
scoring would be a major problem for them.

One-size-fits-all doesn't reflect my experience at all though.

Having something that silently "did the best it could" automagically would lead 
to its own problems, so having something like this silently kick in isn't a 
good option.

I'm not going to enjoy the conversations that start with "Well, you have to 
re-index from scratch for your app or stay on version 7x forever, there is no 
other option".

Yet explaining weird results to a customer isn't very much fun either, 
especially when it's a surprise to them. At least when they upgrade and things 
don't load at all they won't be surprised by subtle problems. Surprised by 
total inability to do anything, maybe. But that's not subtle.

I also dread taking customer X and trying to explain to them all the gotchas 
with a tool that upgrades manually. "Well, you'll be able to search but if you 
originally indexed with X, then the consequence will be Y" through about 30 
iterations.

So I'm a little lost here on what to do. _Strongly_ recommending that people 
reindex is the obvious answer, but then maybe the fallback is to send Uwe a lot 
of business...

So is this going to be the official stance going forward? Lucene supports 
version N-1 if (and only if) it was originally created with N-1? Or will this 
upgrade problem go away absent the original problem and people will be able to 
go from an index produced with 8->9->10? Or is that TBD?











[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-24 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449944#comment-16449944
 ] 

Mark Miller commented on LUCENE-8264:
-

bq. Sorry, I don't think the discussion makes much sense. 

The discussion makes sense; it sounds like you think making some kind of tool 
doesn't make sense.

bq. The stuff like norms changes requires reindex, like the inverted index, the 
data is stored in a lossy way. Lucene can't do anything about it: it's an index.

That's been covered in the discussion - sometimes you can't do anything and 
that's why Lucene currently has this limitation.

bq. I am a true -1 to making a tool that will screw up scoring, sorry.

Same old helpful Robert. Type fast and carry a big veto. Happy belated birthday!







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-24 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449787#comment-16449787
 ] 

David Smiley commented on LUCENE-8264:
--

Fascinating discussion.

Shawn said:
{quote}Now I'm hearing differently ... that any user who has successfully done 
this has just gotten lucky, and that there's no guarantee for the future.
{quote}
I don't think it's quite that bleak.  I believe each segment records metadata 
of the Lucene version, so we could explicitly know whether or not the index 
contains segments older than the current version.  One could even write a tool 
to spit out the IDs of those documents to facilitate a re-index of just those 
documents.
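
The check described above could be sketched roughly as follows. This is a 
plain-Java model under stated assumptions: {{SegmentMeta}} is a hypothetical 
stand-in for whatever per-segment version metadata the index records, and 
reading it from a real index is outside the scope of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model; SegmentMeta is NOT a Lucene class.
final class OldSegmentFinderSketch {
    private OldSegmentFinderSketch() {}

    /** Stand-in for per-segment metadata: a name plus the major version that wrote it. */
    static final class SegmentMeta {
        final String name;
        final int writtenByMajor;

        SegmentMeta(String name, int writtenByMajor) {
            this.name = name;
            this.writtenByMajor = writtenByMajor;
        }
    }

    /** Returns the names of segments written by a major version older than currentMajor. */
    static List<String> segmentsOlderThan(List<SegmentMeta> segments, int currentMajor) {
        List<String> old = new ArrayList<>();
        for (SegmentMeta segment : segments) {
            if (segment.writtenByMajor < currentMajor) {
                old.add(segment.name);
            }
        }
        return old;
    }
}
```

A tool built on this idea would then list the document IDs stored in the 
returned segments, so that only those documents need to be re-indexed.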







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449759#comment-16449759
 ] 

Uwe Schindler commented on LUCENE-8264:
---

I have no problem with not making that tool public. This is how I have been 
earning my money for the past few months: bringing stone-aged indexes up to 
date (adding docvalues). Those people know that's wrong, and scoring is not an 
issue for them in most cases. If it is, we work on reindexing, but sometimes 
that's really impossible. All those people were Lucene-only customers. It's 
cool, because people back in the 2.x/3.x days were already using Lucene as 
their only storage; unfortunately, not everything was also stored, so some 
stuff is "not easy" to reindex quickly (like extracting all text again from 
PDF files).







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449727#comment-16449727
 ] 

Robert Muir commented on LUCENE-8264:
-

Sorry, I don't think the discussion makes much sense. The stuff like norms 
changes requires reindex, like the inverted index, the data is stored in a 
lossy way. Lucene can't do anything about it: it's an index. I am a true -1 to 
making a tool that will screw up scoring, sorry.







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449710#comment-16449710
 ] 

Robert Muir commented on LUCENE-8264:
-

{quote}
to be absolutely honest I was surprised by this as well. I think the reasons 
behind this change make sense to me but the implications are big. I am not sure 
if the strictness here comes only from the broken TermVectors offsets or not 
but if so can we discuss relaxing this a bit. 
{quote}

Besides the term vectors stuff: bugs in norms got fixed (I worked with Adrien 
to fix one of them), the representation improved (Adrien improved that), and 
other things.







[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449702#comment-16449702
 ] 

Uwe Schindler commented on LUCENE-8264:
---

Simon: This is exactly the case: People have simple analysis chains or they are 
using ClassicAnalyzer, which is still supported. In my case, one of the customers 
just had his own version of the old analyzer from 2.x days and ported it, just 
to keep the indexes working. The only issue was missing doc values (no 
fieldcache anymore), so there was the requirement to add doc values. Based on 
that, the upgrade path was easy: Take the tokens as is (offsets were not an 
issue, because no highlighting - only "old-style" highlighting with reanalyzing 
text). The Analyzers were ported. The index upgrader we wrote was just like the 
above code (using UninvertingReader to add doc values), additionally adding 
all (filter-wrapped) segments of the original index one by one, so keeping the 
old segment structure.
So yes, we should think of adding some infrastructure to do manual upgrades 
(configurable), so you don't have to hack crazy filter readers. Maybe add some 
options like "add docvalues for field x", "convert Numeric/Trie field to Points" 
(even that is possible, with some limitations, by using UninvertingReader!), 
"keep tokens alive", "drop offsets completely" (e.g., if broken).
But as said before, by default: Don't support that without manual intervention. 
But we should not let people fail completely when upgrading.




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449666#comment-16449666
 ] 

Simon Willnauer commented on LUCENE-8264:
-

To be absolutely honest, I was surprised by this as well. The reasons 
behind this change make sense to me, but the implications are big. I am not sure 
if the strictness here comes only from the broken TermVectors offsets or not, 
but if so, can we discuss relaxing this a bit? This change hit a couple of 
committers by surprise (including myself), and I wonder if we can take a step 
back and revisit this decision. While there are a bunch of other issues 
when you, for instance, go from 3.x to 7.x (your tokenization / analysis 
chain isn't supported anymore, etc.), there are valid use cases for upgrading 
your index via background merges rewriting the index format. Issues like 
unsupported analysis chains should be handled by higher-level apps like Solr 
or ES. There are tons of people who use Lucene as a retrieval engine 
doing very simple whitespace tokenization; for them a merge from 3.x to 7.x 
might be just fine. I think it would be good to have the conversation again, 
even though the changes were communicated very openly. [~jpountz] [~thetaphi] 
[~rcmuir] [~mikemccand] [~dweiss] WDYT?




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449219#comment-16449219
 ] 

Shawn Heisey commented on LUCENE-8264:
--

I'm with Erick.  I had always understood that you could use IndexUpgrader to 
migrate one major version at a time and use the index with a much newer 
version, as long as all the field type classes writing data into the index are 
still around in the newer version.

Now I'm hearing differently ... that any user who has successfully done this 
has just gotten lucky, and that there's no guarantee for the future.

I am not greatly impacted by this, because I always prefer to build indexes 
from scratch whenever I upgrade, and I recommend to anyone who will listen that 
they do the same.  We do have users with extremely large indexes, who have a 
very difficult time reindexing from scratch.




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448895#comment-16448895
 ] 

Erick Erickson commented on LUCENE-8264:


So let me see if I have this straight. The guarantee is that Lucene N will read 
indexes produced with N-1 if (and only if) the index was originally created 
entirely by version N-1? IOW if Lucene N-2 was used at any point in history no 
guarantees are made.

And any upgrade process that would overcome that issue would be on a 
case-by-case, one-off basis that would stand no chance of making it into the 
Lucene code base.




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447804#comment-16447804
 ] 

Uwe Schindler commented on LUCENE-8264:
---

Ah, I was always using a FilterLeafReader with a SlowCodecReaderWrapper 
(because those people needed to add DocValues from UninvertingReader - the one 
from Solr). So yes, in my case, the FilterLeafReader was returning {{new 
LeafMetaData(Version.LUCENE_7_0_0.major, Version.LUCENE_7_0_0, null)}} in 
{{getMetaData()}}. Very easy and worked. Broken offsets were not an issue there; 
if they had been, we could have removed them by changing the field metadata 
from the filter reader.

I just wanted to say: It's good that we still allow the migration, but I 
agree with you: This should not be provided by default in Solr or Lucene. 
People need to know about the side effects.




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447767#comment-16447767
 ] 

Simon Willnauer commented on LUCENE-8264:
-

> It worked at least until 7.x. As I said, you can remove offsets if needed. 
> And of course a FilterLeafReader together with SlowCodecReaderWrapper is 
> definitely needed.

I am not so sure about this; at least 
[this|https://github.com/apache/lucene-solr/blob/branch_7x/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L2756]
 will fail, and it has been in there since 7.0.






[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447759#comment-16447759
 ] 

Uwe Schindler commented on LUCENE-8264:
---

It worked at least until 7.x. As I said, you can remove offsets if needed. And 
of course a FilterLeafReader together with SlowCodecReaderWrapper is definitely 
needed.




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447758#comment-16447758
 ] 

Simon Willnauer commented on LUCENE-8264:
-

[~thetaphi] I don't think this is going to work here. 
IndexWriter#validateMergeReader will prevent you from doing this unless you add 
some evil hacks.




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447756#comment-16447756
 ] 

Uwe Schindler commented on LUCENE-8264:
---

This way it is also possible to add DocValues using UninvertingReader. That was 
another thing the old customers needed to do (they sorted on a date field). 
With FilterLeafReaders you can do a lot of stuff. If you can't fix offsets, you 
may also simply remove them. But for some people it's the only way.




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447746#comment-16447746
 ] 

Uwe Schindler commented on LUCENE-8264:
---

Um, Dawid, that won't work with hard copies. I explicitly said: Merge the 
CodecReaders!!!:
{code:java}
// leaves() yields LeafReaderContext objects; unwrap each segment's reader
for (LeafReaderContext ctx : directoryReader.leaves()) {
  target.addIndexes((CodecReader) ctx.reader());
  target.flush(); // this ensures that the segment is not merged NOW
}
{code}




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447713#comment-16447713
 ] 

Dawid Weiss commented on LUCENE-8264:
-

Ah, right. I wasn't aware of that. I like Uwe's idea though; if you use 
HardlinkCopyDirectoryWrapper, that seems like a fairly cheap way to bring 
segments_* up to date, together with the segments?




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447707#comment-16447707
 ] 

Simon Willnauer commented on LUCENE-8264:
-

[~dweiss] I think you are not aware of the fact that an index that was created 
with N-2 won't be supported by N even if you rewrite all segments. The created 
version is baked into the segments file, and Lucene will not open it even if all 
segments are on N or N-1. There are several reasons for this, for instance to 
reject broken offsets in term vectors in Lucene 7. We can never enforce limits 
like this if we keep on upgrading stuff behind the scenes that didn't have 
these protections. 
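
As a rough illustration of the point above, here is a toy model in plain Java. It is not Lucene's actual code: {{CreatedVersionModel}}, {{IndexInfo}}, {{rewriteAllSegments}} and {{canOpen}} are made-up names, and the real check lives in Lucene's segments-file handling. The sketch only assumes what is stated in this thread: the creation major version is recorded once per index, and a reader of major version N refuses anything created before N-1.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model (hypothetical names, not Lucene classes) of the created-version check. */
public class CreatedVersionModel {

    /** Stand-in for the index-level metadata kept in the segments file. */
    static final class IndexInfo {
        final int createdMajor;            // major version that first created the index
        final List<Integer> segmentMajors; // major version that wrote each segment

        IndexInfo(int createdMajor, List<Integer> segmentMajors) {
            this.createdMajor = createdMajor;
            this.segmentMajors = segmentMajors;
        }

        /** Rewriting every segment updates the segment versions but keeps createdMajor. */
        IndexInfo rewriteAllSegments(int writerMajor) {
            return new IndexInfo(createdMajor,
                    Collections.nCopies(segmentMajors.size(), writerMajor));
        }
    }

    /** A reader of major version N accepts only indexes created and written by N or N-1. */
    static boolean canOpen(int readerMajor, IndexInfo index) {
        if (index.createdMajor < readerMajor - 1) {
            return false; // created too long ago, regardless of segment formats
        }
        for (int major : index.segmentMajors) {
            if (major < readerMajor - 1) {
                return false; // at least one segment is still in a too-old format
            }
        }
        return true;
    }

    public static void main(String[] args) {
        IndexInfo created6 = new IndexInfo(6, Arrays.asList(6, 6, 7));
        IndexInfo rewritten = created6.rewriteAllSegments(7);

        System.out.println(canOpen(7, created6));  // true: 6 is N-1 for a 7x reader
        System.out.println(canOpen(8, rewritten)); // false: all segments are 7x, but created by 6x
        System.out.println(canOpen(8, new IndexInfo(7, Arrays.asList(7)))); // true
    }
}
```

In this model, rewriting all segments is enough to go from N-1 to N, but it cannot rescue an index whose recorded creation version is N-2.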




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447703#comment-16447703
 ] 

Uwe Schindler commented on LUCENE-8264:
---

IndexUpgrader and UpgradeIndexMergePolicy should do the job, but they merge all 
segments. One thing that works: Open the old index and merge segment by segment 
using addIndexes(Leaf/CodecReader) into a completely new one. Important: Do it 
step by step, so the segments don't get merged. Not sure how this behaves with 
corrupt offsets.




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447700#comment-16447700
 ] 

Adrien Grand commented on LUCENE-8264:
--

The problem I described isn't related to merges but to the data that is stored 
in segments. A 6x index won't be upgradeable to 8x, even if you merge all 
segments so that they use the 7x file formats. This will require a reindex.




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447686#comment-16447686
 ] 

Dawid Weiss commented on LUCENE-8264:
-

Adrien, I think the point here is to trigger a rewrite of all segments to 
forcefully update them from N-1 to N, without waiting for the merges (that 
would do the same) to occur; if merges take a long time or never happen, this 
can cause exactly the sort of problem you've described.




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-23 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447672#comment-16447672
 ] 

Adrien Grand commented on LUCENE-8264:
--

Lucene only supports reading indices generated by versions N or N-1, so I don't 
think we should help users go into unsupported territory by attempting 
5x->6x->7x upgrades? For the record, it might "work" in that very particular 
case because we didn't change anything on top of the codec API between 5x and 
6x (I think?), but a 6x->7x->8x would be problematic since 7x started rejecting 
corrupt offsets and changed the encoding of norms, so attempting a 6x->7x->8x 
upgrade this way could propagate corrupt offsets to 8x and would corrupt norms. 
To avoid this trap we started recording the creation version in LUCENE-7703 and 
LUCENE-7756, and then failed opening 6x (or less) indices with 8x in 
LUCENE-7837.




[jira] [Commented] (LUCENE-8264) Allow an option to rewrite all segments

2018-04-22 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447257#comment-16447257
 ] 

Shawn Heisey commented on LUCENE-8264:
--

On the dev list, [~yriveiro] replied to this issue.  His indexes are up to 15 
terabytes.  (yowza!)

Reindexing from scratch on an index that big is something you can't just decide 
to do one day.

I really like the idea of rewriting all segments without merging them.  The way 
that IndexUpgrader currently works can cause the LUCENE-7976 problems.
