Re: TestCodecs running time
See you already did that Mike :). Thanks! Now the tests run in 2s. Shai On Fri, Apr 9, 2010 at 12:49 PM, Michael McCandless luc...@mikemccandless.com wrote: It's also slow because it repeats all the tests for each of the core codecs (standard, sep, pulsing, intblock). I think it's fine to reduce the number of iterations -- just make sure there's no seed to newRandom() so the distributed testing is effective. Mike On Fri, Apr 9, 2010 at 12:43 AM, Shai Erera ser...@gmail.com wrote: Hi I've noticed that TestCodecs takes an insanely long time to run on my machine - between 35-40 seconds. Is that expected? The reason why it runs so long seems to be that its threads make (each) 4000 iterations ... is that really required to ensure correctness? Shai - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
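The pattern Mike describes can be sketched as follows. This is an illustrative standalone class, not the actual TestCodecs source (the class name, constant, and loop body are hypothetical): cut the per-thread iteration count, but construct the Random without a fixed seed so distributed runs still explore different seeds.

```java
import java.util.Random;

// Illustrative sketch, not the real TestCodecs: fewer iterations per run,
// but an unseeded Random so each distributed test run covers a new seed.
public class CodecTestSketch {
    static final int NUM_TEST_ITER = 200; // reduced from the original 4000

    public static void main(String[] args) {
        Random random = new Random(); // no fixed seed: varies across runs
        int checked = 0;
        for (int i = 0; i < NUM_TEST_ITER; i++) {
            // stand-in for a codec round-trip check on random terms/docs
            int doc = random.nextInt(1000);
            checked++;
        }
        System.out.println("iterations=" + checked);
    }
}
```

The trade-off: any single run exercises fewer cases, but because the seed differs per run, repeated CI runs collectively cover as much of the input space as one long seeded run.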
SnapshotDeletionPolicy throws NPE if no commit happened
SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes, for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of failing with an NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? Shai
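The proposed guard can be sketched like this. The class and field names are hypothetical, not the actual SnapshotDeletionPolicy source; the point is just the check-and-throw shape.

```java
// Hypothetical sketch, not the real SnapshotDeletionPolicy: guard snapshot()
// so callers get a descriptive IllegalStateException instead of an NPE.
public class SnapshotGuardSketch {
    // Stands in for the last IndexCommit the policy saw via onInit/onCommit;
    // null until the first commit happens.
    private Object lastCommit;

    public Object snapshot() {
        if (lastCommit == null) {
            throw new IllegalStateException(
                "no index commit to snapshot: call IndexWriter.commit() (or close()) first");
        }
        return lastCommit;
    }
}
```

With a guard like this, a backup process that calls snapshot() before any commit fails fast with a clear message rather than an NPE from deep inside the policy.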
Re: Proposal about Version API relaxation
We can remove Version, because all incompatible changes go straight to a new major release, which we release more often, yes. 3.x is going to be released after 4.0 if bugs are found and fixed, or if people ask to backport some (minor?) features, and some dev has time for this. The question of what to call major release in X.Y.Z scheme - X or Y, is there, but immaterial :) I think it's okay to settle with X.Y, we have major releases and bugfixes, what can that third number be used for? On Thu, Apr 15, 2010 at 09:29, Shai Erera ser...@gmail.com wrote: So then I don't understand this: {quote} * A major release always bumps the major release number (2.x - 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. {quote} What's different than what's done today? How can we remove Version in that world, if we need to maintain full back-compat between 3.1 and 3.2, index and API-wise? We'll still need to deprecate and come up w/ new classes every time, and we'll still need to maintain runtime changes back-compat. Unless you're telling me we'll start releasing major releases more often? Well ... then we're saying the same thing, only I think that instead of releasing 4, 5, 6, 7, 8 every 6 months, we can release 3.1, 3.2, 3.5 ... because if you look back, every minor release included API deprecations as well as back-compat breaks. That means that every minor release should have been a major release right? Point is, if I understand correctly and you agree w/ my statement above - I don't see why anyone would release a 3.x after 4.0 is out unless someone really wants to work hard on maintaining back-compat of some features. If it's just a numbering thing, then I don't think it matters what is defined as 'major' vs. 'minor'. One way is to define 'major' as X and minor X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. 
I prefer the latter but don't have any strong feelings against the former. Just pointing out that X will grow more rapidly than today. That's all. So did I get it right? Shai On Thu, Apr 15, 2010 at 8:19 AM, Mark Miller markrmil...@gmail.com wrote: I don't read what you wrote and what Mike wrote as even close to the same. - Mark http://www.lucidimagination.com (mobile) On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote: Ahh ... a dream finally comes true ... what a great way to start a day :). +1 !!! I have some questions/comments though: * Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their segments when they move from 2.x to 3.x before 4.0 lands and they'll need to call optimize() to ensure 4.0 still works on their index. I hope that will still be the case? Otherwise I don't see how we can prevent reindexing by apps. ** Index behavioral/runtime changes, like those of Analyzers, are ok to require a reindex, as proposed. So after 3.1 is out, trunk can break the API and 3.2 will have a new set of API? Cool and convenient. For how long do we keep the 3.1 branch around? Also, it used to only fix bugs, but from now on it'll be allowed to introduce new features, if they maintain back-compat? So 3.1.1 can have 'flex' (going for the extreme on purpose) if someone maintains back-compat? I think the back-compat on branches should be only for index runtime changes. There's no point, in my opinion, to maintain API back-compat anymore for jars drop-in, if apps will need to upgrade from 3.1 to 3.1.1 just to get a new feature but get it API back-supported? As soon as they upgrade to 3.2, that means a new set of API right? Major releases will just change the index structure format then? Or move to Java 1.6? Well ... not even that because as I understand it, 3.2 can move to Java 1.6 ... no API back-compat right :). That's definitely a great step forward ! 
Shai On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda va...@osafoundation.org wrote: On Thu, 15 Apr 2010, Earwin Burrfoot wrote: Can't believe my eyes. +1 Likewise. +1 ! Andi.. On Thu, Apr 15, 2010 at 01:22, Michael McCandless luc...@mikemccandless.com wrote: On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey mar...@rectangular.com wrote: Essentially, we're free to break back compat within Lucy at any time, but we're not able to break back compat within a stable fork like Lucy1, Lucy2, etc. So what we'll probably do during normal development with Analyzers is just change them and note the break in the Changes file. So... what if we change up how we develop and release Lucene: * A major release always bumps the major release number (2.x - 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs),
Re: SnapshotDeletionPolicy throws NPE if no commit happened
We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears. On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote: SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of falling on NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: SnapshotDeletionPolicy throws NPE if no commit happened
Well ... one can still call commit() or close() right after IW creation. And this is a very rare case to be hit by. Was just asking about whether we want to add an explicit and clear protective code about it or not. Shai On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.com wrote: We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears. On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote: SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of falling on NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
Well ... I think that version numbers mean more than we'd like them to mean, as people perceive them. Let's discuss the format X.Y.Z: When X is changed, it should mean something 'big' happened - index structure has changed (e.g. the flexible scoring work), new Java version supported (Java 1.6) and even stuff like 'flex' which includes statements like if you don't want your app to slow down, consider reindexing. Such things signal a major change in Lucene, sometimes even just policy changes (Java version supported) and therefore I think we should reserve the ability to bump X when such things happen. Another thing is the index structure back-compat policy - today Lucene supports X-1 index structure, but during upgrades of X.Y versions, your segments are gradually migrated. Eventually, when you upgrade to 4.0 you should know whether you have a 2.x index, and call optimize just in case if you're not sure it's not migrated yet (if you've upgraded to 3.x). If we start bumping up 'X' too often, we'll either need to change the X-1 policy to X-N, which will just complicate matters for users. Or we'll keep the X-1 policy, but people will need to call optimize more frequently. Y should change on a regular basis, and no back-compat API-wise or index runtime-wise is guaranteed. So the Collector and per-segment searches in 2.9 could go w/o deprecating tons of API, so is the TokenStream work. Changes to Analyzer's runtime capabilities will also be allowed between Y revisions. Z should change when bugfixes are fixed, or when features are backported. Really ... we rarely fix bugs on a released Y branch, and I don't expect tons of features will be backported to a Y branch (to create a Z+1 release). Therefore this should not confuse anyone. So all I'm saying is that instead of increasing X whenever the API, index structure or runtime behavior has changed, I'm simply proposing to differentiate between really major changes to those that just say 'we're not back-compat compliant'. 
But above all, I'd like to see this change happening, so if I need to surrender to the X vs. X+Y approach, I will. Just think it will create some confusion. BTW, w/ all that - does it mean 'backwards' can be dropped, or at least test-backwards activated only on a branch which we decide needs it? That'll be really great. Shai On Thu, Apr 15, 2010 at 10:24 AM, Earwin Burrfoot ear...@gmail.com wrote: We can remove Version, because all incompatible changes go straight to a new major release, which we release more often, yes. 3.x is going to be released after 4.0 if bugs are found and fixed, or if people ask to backport some (minor?) features, and some dev has time for this. The question of what to call major release in X.Y.Z scheme - X or Y, is there, but immaterial :) I think it's okay to settle with X.Y, we have major releases and bugfixes, what that third number can be used for? On Thu, Apr 15, 2010 at 09:29, Shai Erera ser...@gmail.com wrote: So then I don't understand this: {quote} * A major release always bumps the major release number (2.x - 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. {quote} What's different than what's done today? How can we remove Version in that world, if we need to maintain full back-compat between 3.1 and 3.2, index and API-wise? We'll still need to deprecate and come up w/ new classes every time, and we'll still need to maintain runtime changes back-compat. Unless you're telling me we'll start releasing major releases more often? Well ... then we're saying the same thing, only I think that instead of releasing 4, 5, 6, 7, 8 every 6 months, we can release 3.1, 3.2, 3.5 ... because if you look back, every minor release included API deprecations as well as back-compat breaks. That means that every minor release should have been a major release right? 
Point is, if I understand correctly and you agree w/ my statement above - I don't see why anyone would release a 3.x after 4.0 is out unless someone really wants to work hard on maintaining back-compat of some features. If it's just a numbering thing, then I don't think it matters what is defined as 'major' vs. 'minor'. One way is to define 'major' as X and minor X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former. Just pointing out that X will grow more rapidly than today. That's all. So did I get it right? Shai On Thu, Apr 15, 2010 at 8:19 AM, Mark Miller markrmil...@gmail.com wrote: I don't read what you wrote and what Mike wrote as even close to the same. - Mark http://www.lucidimagination.com (mobile) On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote: Ahh ... a dream finally comes true ... what
Re: SnapshotDeletionPolicy throws NPE if no commit happened
BTW, even if it's a stupid thing to do, someone can today create SDP and call snapshot without ever creating IW. And it's not an impossible scenario. Consider a backup-aware application which creates SDP first, then passes it to the indexing process and the backup process, separately. The backup process doesn't need to know of IW at all, and might call snapshot() before IW was even created, and SDP.onInit was called. It's a possibility, not saying it's a great and safe architecture. So this is really about do we want to write clear protective code, or allow the NPE? Shai 2010/4/15 Shai Erera ser...@gmail.com Well ... one can still call commit() or close() right after IW creation. And this is a very rare case to be hit by. Was just asking about whether we want to add an explicit and clear protective code about it or not. Shai On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.comwrote: We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears. On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote: SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of falling on NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: SnapshotDeletionPolicy throws NPE if no commit happened
Presumably you'd also hit this exception if the DP deletes all commit points, right? I like IllegalStateException. Mike 2010/4/15 Shai Erera ser...@gmail.com: BTW, even if it's a stupid thing to do, someone can today create SDP and call snapshot without ever creating IW. And it's not an impossible scenario. Consider a backup-aware application which creates SDP first, then passes it to the indexing process and the backup process, separately. The backup process doesn't need to know of IW at all, and might call snapshot() before IW was even created, and SDP.onInit was called. It's a possibility, not saying it's a great and safe architecture. So this is really about do we want to write clear protective code, or allow the NPE? Shai 2010/4/15 Shai Erera ser...@gmail.com Well ... one can still call commit() or close() right after IW creation. And this is a very rare case to be hit by. Was just asking about whether we want to add an explicit and clear protective code about it or not. Shai On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.com wrote: We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears. On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote: SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of falling on NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? 
Shai
Re: TestCodecs running time
Yah :) TestStressIndexing2 is another slow one... I'll go fix it... Mike On Thu, Apr 15, 2010 at 2:15 AM, Shai Erera ser...@gmail.com wrote: See you already did that Mike :). Thanks! Now the tests run in 2s. Shai On Fri, Apr 9, 2010 at 12:49 PM, Michael McCandless luc...@mikemccandless.com wrote: It's also slow because it repeats all the tests for each of the core codecs (standard, sep, pulsing, intblock). I think it's fine to reduce the number of iterations -- just make sure there's no seed to newRandom() so the distributed testing is effective. Mike On Fri, Apr 9, 2010 at 12:43 AM, Shai Erera ser...@gmail.com wrote: Hi I've noticed that TestCodecs takes an insanely long time to run on my machine - between 35-40 seconds. Is that expected? The reason why it runs so long seems to be that its threads make (each) 4000 iterations ... is that really required to ensure correctness? Shai
Re: Proposal about Version API relaxation
2010/4/15 Shai Erera ser...@gmail.com: One way is to define 'major' as X and minor X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former. I prefer X.Y, ie, changes to Y only is a minor release (mostly bug fixes but maybe small features); changes to X is a major release. I think that's more standard, ie, people will generally grok that 3.3 - 4.0 is a major change but 3.3 - 3.4 isn't. So this proposal would change how Lucene releases are numbered. Ie, the next release would be 4.0. Bug fixes / small features would then be 4.1. Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. No... in the proposal, you must re-index on upgrading to the next major release (3.x - 4.0). I think supporting old indexes, badly (what we do today) is not a great solution. EG on upgrading to 3.1 you'll immediately see a search perf hit since the flex emulation layer is running. It's a trap. It's this freedom, I think, that'd let us drop Version entirely. It's the back-compat of the index that is the major driver for having Version today (eg so that the analyzers can produce tokens matching your old index). EG Terrier seems to have the same requirement -- note the bold All indexes must be rebuilt: http://terrier.org/docs/current/whats_new.html Also, Lucene isn't a primary store (like a filesytem or a database). We expect that your true content still lives somewhere else. So why do we go to such great lengths to keep the index format for so long...? BTW, w/ all that - does it mean 'backwards' can be dropped, or at least test-backwards activated only on a branch which we decide needs it? That'll be really great. I think the stable branches (2.x, 3.x) would have backwards tests created the moment they are branched, to make sure as we fix bugs / backport minor features we don't break back compat, along that branch. 
I don't think we need the .Z part of a release numbering -- our numbers would look like most other software projects. 3.0 is a major release, 3.1, 3.2, 3.3 fix bugs / add minor features, etc. If flex were done in this world I would've finished it a lot faster! A huge amount of time went into the cross back compat emulation layers (pre-flex APIs and pre-flex index). Also, we will still need to maintain the Backwards section in CHANGES (or move it to API Changes), to help people upgrade from release to release. I think we'd create a migration guide to explain how apps migrate to the next major release (this is what other projects do), eg like this: http://community.jboss.org/wiki/Hibernate3MigrationGuides#A42 Unless you're telling me we'll start releasing major releases more often? I think this is mostly orthogonal? We could still do major releases frequently or rarely with this model... however, it would give us more freedom to do major releases frequently (vs today where every major release sets a scary back-compat-burden stake in the ground). I don't see why would anyone releases a 3.x after 4.0 is out unless someone really wants to work hard on maintaining back-compat of some features I think the minor releases on the stable branch (3.1, 3.2, 3.3) would be mostly bug fixes, but maybe also minor features if contributors/developers had the itch to make them available on the stable (3.x) branch. How much dev happens on the stable branch can be largely determined by itch... Mike
Merging the Mailing Lists
Looks like we are ready to go to merge the Lucene and Solr dev mailing lists. The new list will be d...@lucene.apache.org. All existing subscribers will automatically be subscribed to the new list. For more info, see https://issues.apache.org/jira/browse/INFRA-2567. -Grant
[jira] Resolved: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1278. Resolution: Won't Fix I think the pulsing codec (wraps any other codec, but inlines low-freq terms directly into the terms dict) solves this? Add optional storing of document numbers in term dictionary --- Key: LUCENE-1278 URL: https://issues.apache.org/jira/browse/LUCENE-1278 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.3.1 Reporter: Jason Rutherglen Priority: Minor Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, TestTermEnumDocs.java Add optional storing of document numbers in term dictionary. String index field cache and range filter creation will be faster. Example read code: {noformat} TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS); do { Term term = termEnum.term(); if (term == null || term.field() != field) break; int[] docs = termEnum.docs(); } while (termEnum.next()); {noformat} Example write code: {noformat} Document document = new Document(); document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS)); indexWriter.addDocument(document); {noformat} -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (LUCENE-2395) Add a scoring DistanceQuery that does not need caches and separate filters
Add a scoring DistanceQuery that does not need caches and separate filters -- Key: LUCENE-2395 URL: https://issues.apache.org/jira/browse/LUCENE-2395 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Uwe Schindler Fix For: 3.1 In a chat with Chris Male and my own ideas when implementing for PANGAEA, I thought about the broken distance query in contrib. It lacks the following features: - It needs a query for the enclosing bbox (which is constant score) - It needs a separate filter for filtering out distances - It has no scoring, so if somebody wants to sort by distance, he needs to use the custom sort. For that to work, spatial caches distance calculation (which is broken for multi-segment search) The idea is now to combine all three things into one query, but customizable: We first thought about extending CustomScoreQuery and calculate the distance from FieldCache in the customScore method and return a score of 1 for distance=0, score=0 on the max distance and score<0 for farther hits, that are in the bounding box but not in the distance circle. To filter out such negative scores, we would need to override the scorer in CustomScoreQuery which is private. My proposal is now to use a very stripped down CustomScoreQuery (but not extend it) that does call a method getDistance(docId) in its scorer's advance and nextDoc that calculates the distance for the current doc. It stores this distance also in the scorer. If the distance > maxDistance it throws away the hit and calls nextDoc() again. The score() method will return per default weight.value*(maxDistance - distance)/maxDistance and uses the precalculated distance. So the distance is only calculated one time in nextDoc()/advance(). 
To be able to plug in custom scoring, the following methods in the query can be overridden: - float getDistanceScore(double distance) - returns per default: (maxDistance - distance)/maxDistance; allows score customization - DocIdSet getBoundingBoxDocIdSet(Reader, LatLng sw, LatLng ne) - returns a DocIdSet for the bounding box. Per default it returns e.g. the docIdSet of a NRF or a cartesian tier filter. You can even plug in any other DocIdSet, e.g. wrap a Query with QueryWrapperFilter - support a setter for the GeoDistanceCalculator that is used by the scorer to get the distance. This query is almost finished in my head, it just needs coding :-)
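The proposed default distance-to-score mapping can be sketched as a standalone illustration (class and method names are assumptions, not the eventual Lucene code): hits at the center score 1.0, hits at maxDistance score 0.0, and hits beyond maxDistance are rejected, which in the real scorer would mean calling nextDoc() again.

```java
// Illustrative sketch of the proposed default scoring in LUCENE-2395.
// Only the arithmetic is shown; the weight.value factor and the scorer
// plumbing around nextDoc()/advance() are omitted.
public class DistanceScoreSketch {
    // Default mapping: (maxDistance - distance) / maxDistance
    static float getDistanceScore(double distance, double maxDistance) {
        return (float) ((maxDistance - distance) / maxDistance);
    }

    // Hits outside the distance circle are thrown away by the scorer.
    static boolean accept(double distance, double maxDistance) {
        return distance <= maxDistance;
    }
}
```

Note how the formula naturally goes negative beyond maxDistance, which is exactly why the proposal filters such hits in the scorer rather than letting negative scores leak into the result set.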
[jira] Commented: (LUCENE-2395) Add a scoring DistanceQuery that does not need caches and separate filters
[ https://issues.apache.org/jira/browse/LUCENE-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857278#action_12857278 ] Chris Male commented on LUCENE-2395: +1 This will replace the work I was doing on improving the DistanceFilter and the DistanceSortSource. Instead we will have a proper DistanceQuery where the sorting is done through the existing sorting-by-score functionality in Lucene. The CartesianShapeFilter will then be able to be used as a Filter with the new Query. This also addresses the current problems with caching calculated distances and means that Spatial will work per segment. Add a scoring DistanceQuery that does not need caches and separate filters -- Key: LUCENE-2395 URL: https://issues.apache.org/jira/browse/LUCENE-2395 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Uwe Schindler Fix For: 3.1 In a chat with Chris Male and my own ideas when implementing for PANGAEA, I thought about the broken distance query in contrib. It lacks the following features: - It needs a query for the enclosing bbox (which is constant score) - It needs a separate filter for filtering out distances - It has no scoring, so if somebody wants to sort by distance, he needs to use the custom sort. For that to work, spatial caches distance calculation (which is broken for multi-segment search) The idea is now to combine all three things into one query, but customizable: We first thought about extending CustomScoreQuery and calculate the distance from FieldCache in the customScore method and return a score of 1 for distance=0, score=0 on the max distance and score<0 for farther hits, that are in the bounding box but not in the distance circle. To filter out such negative scores, we would need to override the scorer in CustomScoreQuery which is private. 
My proposal is now to use a very stripped down CustomScoreQuery (but not extend it) that does call a method getDistance(docId) in its scorer's advance and nextDoc that calculates the distance for the current doc. It stores this distance also in the scorer. If the distance > maxDistance it throws away the hit and calls nextDoc() again. The score() method will return per default weight.value*(maxDistance - distance)/maxDistance and uses the precalculated distance. So the distance is only calculated one time in nextDoc()/advance(). To be able to plug in custom scoring, the following methods in the query can be overridden: - float getDistanceScore(double distance) - returns per default: (maxDistance - distance)/maxDistance; allows score customization - DocIdSet getBoundingBoxDocIdSet(Reader, LatLng sw, LatLng ne) - returns a DocIdSet for the bounding box. Per default it returns e.g. the docIdSet of a NRF or a cartesian tier filter. You can even plug in any other DocIdSet, e.g. wrap a Query with QueryWrapperFilter - support a setter for the GeoDistanceCalculator that is used by the scorer to get the distance. This query is almost finished in my head, it just needs coding :-)
Re: Proposal about Version API relaxation
Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. Up until now, Lucene migrated my segments gradually, and before I upgraded from X+1 to X+2 I could run optimize() to ensure my index will be readable by X+2. I don't think I can myself agree to it, let alone convince all the stakeholders in my company who adopt Lucene today in numerous projects, to let go of such capability. We've been there before (requiring reindexing on version upgrades) w/ some offerings and customers simply didn't like it and were forced to use an enterprise-class search engine which offered less (and didn't use Lucene, up until recently !). Until we moved to Lucene ... What's Solr's take on it? I differentiate between structural changes and runtime changes. I, myself, don't mind if we let go of back-compat support for runtime changes, such as those generated by analyzers. For a couple of reasons, the most important ones are (1) these are not so frequent (but so is index structural change) and (2) that's a decision I, as the application developer, makes - using or not a newer version of an Analyzer. I don't mind working hard to make a 2.x Analyzer version work in the 3.x world, but I cannot make a 2.x index readable by a 3.x Lucene jar, if the latter doesn't support it. That's the key difference, in my mind, between the two. I can choose not to upgrade at all to a newer analyzer version ... but I don't want to be forced to stay w/ older Lucene versions and features because of that ... well people might say that it's not Lucene's problem, but I beg to differ. 
Lucene benefits from wider and faster adoption and we rely on new features to be adopted quickly. That might be jeopardized if we let go of that strong capability, IMO. What we can do is provide an index migration tool ... but personally I don't know what's the difference between that and gradually migrating segments as they are merged, code-wise. I mean - it has to be the same code. Only an index migration tool may take days to complete on a very large index, while the ongoing migration takes ~0 time when you come to upgrade to a newer Lucene release. And the note about Terrier requiring reindexing ... well I can't say it's a strength of it but a damn big weakness IMO. About the release pace, I don't think we can suddenly release every 2 years ... makes people think the project is stuck. And some out there are not so fond of using a 'trunk' version and release it w/ their products because trunk is perceived as ongoing development (which it is) and thus less stable, or is likely to change and most importantly harder to maintain (as the consumer). So I still think we should release more often than not. That's why I wanted to differentiate X and Y, but I don't mind if we release just X ... if that's so important to people. BTW Mike, Eclipse's releases are like Lucene, and in fact I don't know of so many projects that just release X ... many of them seem to release X.Y. I don't understand why we're treating this as an all-or-nothing thing. We can let go of API back-compat, which clearly has no effect on index structure and content. We can even let go of index runtime changes for all I care. But I simply don't think we can let go of index structure back-support. Shai On Thu, Apr 15, 2010 at 1:12 PM, Michael McCandless luc...@mikemccandless.com wrote: 2010/4/15 Shai Erera ser...@gmail.com: One way is to define 'major' as X and minor X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former. 
I prefer X.Y, ie, changes to Y only is a minor release (mostly bug fixes but maybe small features); changes to X is a major release. I think that's more standard, ie, people will generally grok that 3.3 - 4.0 is a major change but 3.3 - 3.4 isn't. So this proposal would change how Lucene releases are numbered. Ie, the next release would be 4.0. Bug fixes / small features would then be 4.1. Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. No... in the proposal, you must re-index on upgrading to the next major release (3.x - 4.0). I think supporting old indexes, badly (what we do today) is not a great solution. EG on upgrading to 3.1 you'll immediately see a search perf hit since the flex emulation layer is running. It's a trap. It's this freedom, I think, that'd let us drop Version entirely. It's the back-compat of the index that is the major driver for having Version today (eg so that the analyzers can produce tokens matching your old index). EG Terrier seems
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how it's helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
Sometimes it's REALLY impossible to reindex, or it has an absolutely prohibitive cost to do in a running production system (I can't shut it down for maintenance, so I need a lot of hardware to reindex ~5 billion documents; I have no idea what the costs are to retrieve that data all over again, but I estimate it to be quite a lot). And providing a way to migrate existing indexes to new Lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer Lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, I just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old Lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
It's open source: if you feel this way, you can put in the work to add features to some version branch from trunk in a backwards-compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. But this kinda stuff shouldn't hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. 
Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
I think an index upgrade tool is okay? While you still definitely have to code it, things like if idxVer==m doOneStuff elseif idxVer==n doOtherStuff else blowUp are kept away from Lucene innards and we all profit? On Thu, Apr 15, 2010 at 16:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. 
It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
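Earwin's point in the message above - keep the per-version dispatch (if idxVer==m doOneStuff elseif idxVer==n doOtherStuff else blowUp) in a standalone upgrade tool, away from Lucene innards - could be sketched roughly as below. The class name, format version numbers, and converter-step strings are all hypothetical, not an actual Lucene API.

```java
// Hypothetical sketch of a standalone index-conversion tool: all version
// dispatch lives here, so the core library only ever reads CURRENT_FORMAT.
import java.util.ArrayList;
import java.util.List;

public class IndexUpgraderSketch {
    static final int CURRENT_FORMAT = 4; // invented for illustration

    // Returns the list of conversion steps needed to bring an index
    // from formatVersion up to CURRENT_FORMAT, or fails loudly ("blowUp").
    static List<String> planUpgrade(int formatVersion) {
        if (formatVersion > CURRENT_FORMAT) {
            throw new IllegalArgumentException(
                "Index format " + formatVersion + " is newer than this tool");
        }
        List<String> steps = new ArrayList<>();
        // Each hop is one self-contained converter step.
        for (int v = formatVersion; v < CURRENT_FORMAT; v++) {
            steps.add("convert-" + v + "-to-" + (v + 1));
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(planUpgrade(2)); // [convert-2-to-3, convert-3-to-4]
        System.out.println(planUpgrade(4)); // [] -- already current
    }
}
```

The point of the shape: an index that is several versions behind is walked forward one hop at a time, and an index the tool doesn't understand fails immediately instead of being half-read by emulation code inside the library.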
Re: Proposal about Version API relaxation
Thanks Danil - you reminded me of another reason why reindexing is impossible - fetching the data, even if it's available, is too damn costly. Robert, I think you're driven by Analyzer changes ... been too much around them I'm afraid :). A major version upgrade is a move to Java 1.5 for example. I can do that, and I don't see why I need to reindex my data because of that. And I simply don't buy that "do this work on your own" ... people can take a snapshot of the code, maintain it separately and you'll never hear back from them. Who benefits - neither! It's open source - true, but it's way past the "Hey look, I'm a new open source project w/ a dozen users, I can do whatever I want" stage. Lucene is a respected open source project, w/ serious adoption and deployments. People trust the select few committers here to do it right for them, so they don't need to invest the time and resources in developing core IR stuff. And now you're pushing a "do it yourself" approach? I simply don't get or buy it. When were you ever stuck maintaining a back-compat change because the index structure changed? I bet not so many of us, or shall I say just the few Mikes out there? So how hard is it to require such back-compat support? I wholeheartedly agree that we shouldn't keep back-compat on Analyzer changes, nor on bugs such as that one which changed the position of the field from -1 to 0 (a while ago - don't remember the exact details). Shai On Thu, Apr 15, 2010 at 3:17 PM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. 
I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
I can live w/ that Earwin ... I prefer the ongoing upgrades still, but I won't hold off the back-compat policy change vote because of that. Shai On Thu, Apr 15, 2010 at 3:30 PM, Earwin Burrfoot ear...@gmail.com wrote: I think an index upgrade tool is okay? While you still definetly have to code it, things like if idxVer==m doOneStuff elseif idxVer==n doOtherStuff else blowUp are kept away from lucene innards and we all profit? On Thu, Apr 15, 2010 at 16:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. 
Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... 
what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
I like the idea of index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. 
On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
I think you guys miss the entire point. The idea that you can keep getting all the new features without reindexing is merely an illusion. Instead, features simply aren't being added at all, because the policy makes it too cumbersome. Why is it problematic to have a different SVN branch/release, with lots of new features, that requires you to reindex and change your app? If it's too difficult to reindex, the fact that features exist elsewhere that you cannot access doesn't break your app. It's the same as it is today: there are features you cannot access, except they do not even exist in Apache SVN at all, even trunk, because of these problems. On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot ear...@gmail.com wrote: I like the idea of index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. 
On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. 
-- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot ear...@gmail.com wrote: I like the idea of index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. 
It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
Well ... I could argue that it's you who miss the point :). I completely don't buy the "all the new features" comment -- how many new features are in a major release which force you to consider reindexing? Yet there are many of them that change the API. How will I know whether a release supports my index or not? Why do I need to work hard to back-port all the newly developed issues onto a branch I use? How many of those branches will exist? Will they all run nightly unit tests? Can I cut a release of such a branch myself? Or will I need the PMC or a VOTE? This will get complicated pretty fast ... Lucene is not a "do it yourself" kit - we try so hard to have the best defaults, best out of the box experience ... best everything for our users. Even w/ Analyzers we try so damn hard. While we could have simply componentized everything and told the users you can use those filters, tokenizers, segment mergers, policies etc. to make up your indexing application ... And I don't think there are features out there that exist and are not contributed because people are afraid of the index format changes ... obviously if they have done it, they're past the fear of handling index format ... I'd like to hear of one such feature. I'd bet there are such out there that are not contributed for IP, business, and laziness reasons. Shai On Thu, Apr 15, 2010 at 3:56 PM, Robert Muir rcm...@gmail.com wrote: I think you guys miss the entire point. The idea that you can keep getting all the new features without reindexing is merely an illusion Instead, features simply aren't being added at all, because the policy makes it too cumbersome. Why is it problematic to have a different SVN branch/release, with lots of new features, but requires you to reindex and change your app? If its too difficult to reindex, it doesnt break your app that features exist elsewhere that you cannot access. 
It's the same as it is today, there are features you cannot access, except they do not even exist in Apache SVN at all, even trunk, because of these problems. On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot ear...@gmail.com wrote: I like the idea of an index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: it's open source, if you feel this way, you can put in the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldn't hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintenance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore.
It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from
Re: Proposal about Version API relaxation
I realize that just transforming an old index won't give me anything new. The applications usually evolve. Let's take as example 2.9 (relatively few changes in index structure, but Trie was a nice addition, per segment search and reload was a blessing): - There are 4 billion documents which don't have numeric ranges (but those still got faster reopen) - But for next 1 billion documents in another index i do have numeric ranges. The whole application works in ONE environment from same codebase. Splitting it into several environments based on whatever version of lucene happened to be current at index creation date, and maintaining branches of code would be quite a PITA for a developer (and very error prone) So yeah, I won't get new features for old indexes if i transform them to new format, but new indexes will be able to use them. And my application as a whole will be much cleaner and easier to maintain (I'm a lazy developer that thinks that he is already overworked) I just want my system as a whole to evolve together with lucene without dropping the indexes I already have, keeping tens of branches of code, or remembering how things worked back in 2005 just to slightly modify the analyzer because data in 2010 changed a bit. Danil. On Thu, Apr 15, 2010 at 15:56, Robert Muir rcm...@gmail.com wrote: I think you guys miss the entire point. The idea that you can keep getting all the new features without reindexing is merely an illusion. Instead, features simply aren't being added at all, because the policy makes it too cumbersome. Why is it problematic to have a different SVN branch/release, with lots of new features, but requires you to reindex and change your app? If it's too difficult to reindex, it doesn't break your app that features exist elsewhere that you cannot access. Its the same as it is today, there are features you cannot access, except they do not even exist in apache SVN at all, even trunk, because of these problems.
On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot ear...@gmail.com wrote: I like the idea of index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. 
On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
True. Just need the tool. On Thu, Apr 15, 2010 at 16:39, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. It's still harder. Consider a common scenario where you have one master and the index being replicated to multiple slaves. One would need to stop replication to an upgraded slave until the master is also upgraded. Some people can't even stop replication because they use something like a SAN to share the index. I'm just pointing out that there is a lot of value for many people in back-compatible indexes... I'm not trying to make any points about when that back compat should be broken. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
wrong, it doesn't fix the analyzers problem. you need to reindex. On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
+1 On Apr 14, 2010, at 5:22 PM, Michael McCandless wrote: On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey mar...@rectangular.com wrote: Essentially, we're free to break back compat within Lucy at any time, but we're not able to break back compat within a stable fork like Lucy1, Lucy2, etc. So what we'll probably do during normal development with Analyzers is just change them and note the break in the Changes file. So... what if we change up how we develop and release Lucene: * A major release always bumps the major release number (2.x -> 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. This would match how many other projects work (KS/Lucy, as Marvin describes above; Apache Tomcat; Hibernate; log4j; FreeBSD; etc.). The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and, if any devs have the itch, they could freely back-port improvements from trunk as long as they kept back-compat within the branch. I think in such a future world, we could: * Remove Version entirely! * Not worry at all about back-compat when developing on trunk * Give proper names to new improved classes instead of StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing today; rename existing classes. * Let analyzers freely, incrementally improve * Use interfaces without fear * Stop spending the truly substantial time (look @ Uwe's awesome back-compat layer for analyzers!) that we now must spend when adding new features, for back-compat * Be more free to introduce very new not-fully-baked features/APIs, marked as experimental, on the expectation that once they are used (in trunk) they will iterate/change/improve vs trying so hard to get things right on the first go for fear of future back compat horrors. Thoughts...?
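Read as a compatibility rule, the proposal above (no index or API back compat across major releases, full back compat along a branch) can be sketched like this; the function name and the (major, minor) tuple encoding are purely illustrative, not any actual Lucene API:

```python
def can_read_index(index_version, library_version):
    """Return True if a library at `library_version` can read an index
    written at `index_version`, under the proposed policy: full back
    compat within a major branch, none across major releases.

    Versions are (major, minor) tuples; along a branch, a newer library
    can read indexes written by the same or an older minor release.
    """
    index_major, index_minor = index_version
    lib_major, lib_minor = library_version
    if index_major != lib_major:
        return False  # no index back compat across major releases
    # full back compat within the branch (no forward compat, though:
    # a 3.1 library cannot read a 3.3 index)
    return index_minor <= lib_minor
```

For example, a 3.3 library would read a 3.1 index directly, while a 3.0 library facing a 2.9 index would need a conversion tool or a reindex.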
Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2396) remove version from contrib/analyzers.
remove version from contrib/analyzers. -- Key: LUCENE-2396 URL: https://issues.apache.org/jira/browse/LUCENE-2396 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Robert Muir Contrib/analyzers has no backwards-compatibility policy, so let's remove Version so the API is consumable. if you think we shouldn't do this, then instead explicitly state and vote on what the backwards compatibility policy for contrib/analyzers should be instead, or move it all to core. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
I do think major versions should be able to read the previous version index. Still, even being able to do that is no guarantee that it will produce correct results. Likewise, even having an upgrade tool is no guarantee that correct results will be produced. So, my take is that we strive for it, but we all have to realize, and document, that it might not always be possible. Let's just be practical and pragmatic. Past history indicates we are capable of, for the most part, reading the prev. version index and upgrading it. If it can't be done automatically, then we can consider a tool. If the tool won't work, then we will have to reindex. It doesn't have to be an all-or-nothing decision made in the void. We've always been very practical here about making decisions on problems that are directly facing us, so I would suggest we move forward with the new approach (which I agree makes more sense and is pretty prevalent across a lot of projects) and we take this issue on a case-by-case basis. -Grant On Apr 15, 2010, at 9:49 AM, Yonik Seeley wrote: On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. It's still harder. Consider a common scenario where you have one master and the index being replicated to multiple slaves. One would need to stop replication to an upgraded slave until the master is also upgraded. Some people can't even stop replication because they use something like a SAN to share the index. I'm just pointing out that there is a lot of value for many people in back-compatible indexes... I'm not trying to make any points about when that back compat should be broken.
-Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
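Grant's case-by-case position amounts to a small decision cascade: read the old index directly where possible, fall back to an offline conversion tool, and only then require a full reindex. A hypothetical sketch (the function and its string results are illustrative only, not any real Lucene API):

```python
def migration_strategy(index_major, lib_major, tool_available):
    """Pick a migration path case by case: direct read, then upgrade
    tool, then full reindex as the last resort."""
    if index_major == lib_major:
        return "read directly"     # same major branch: nothing to do
    if index_major == lib_major - 1 and tool_available:
        return "run upgrade tool"  # one major behind, and a tool shipped
    return "reindex"               # no automatic path left
```

The point of the cascade is that "reindex" is reached only after the cheaper options are ruled out, which matches the "practical and pragmatic" framing above.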
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857321#action_12857321 ] Robert Muir commented on LUCENE-2396: - Additionally, i would like to remove all CHANGES from backwards compatibility policy from contrib/CHANGES. contrib has no backwards compatibility policy, so it makes no sense. these are just ordinary changes for Contrib. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 17:49, Robert Muir rcm...@gmail.com wrote: wrong, it doesn't fix the analyzers problem. you need to reindex. On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. Couldn't care less about analyzers. There are two kinds of breaks in index compatibility - soft and hard ones. Hard break is - your index structure changed, you're using a new encoding for numeric fields, such kind of things. Soft break is - you fixed a stemmer, so now 'some' words are stemmed differently, such kind of things. With hard breaks you have to do an offline reindex, and then switch over. With soft breaks you can sometimes just enqueue all your documents and do reindexation online - that breaks a small percentage of your queries for a small period of time. Something you can bear, if that saves you from doing manual labor. I never claimed an index upgrade tool should fix your tokens, offsets and whatnot. It is power-user stuff that allows you to turn some hard breaks into soft breaks, and then decide on your own how to handle the latter. We also can hit some index format changes that deny any kind of automatic conversion. Well, too sad. We'll just skip issuing an index upgrade tool on that release. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
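The possibility that a release simply skips its upgrade tool can be modeled as chaining per-major offline converters, one hop per major release; this is a hypothetical sketch of how a user would plan a multi-major migration, not a real Lucene tool:

```python
def plan_upgrade(index_major, target_major, tools_shipped):
    """Chain per-release offline upgrade tools: an index at major N is
    converted hop by hop (N -> N+1 -> ...). `tools_shipped` is the set
    of majors for which a conversion tool was actually released; if a
    hop's tool is missing (that release skipped it), the chain breaks
    and None signals that only a full reindex remains."""
    hops = []
    for major in range(index_major, target_major):
        if major + 1 not in tools_shipped:
            return None  # no tool for this hop: reindex required
        hops.append((major, major + 1))
    return hops
```

So an index at major 2 targeting major 4 needs both the 2-to-3 and 3-to-4 tools; if either release skipped its tool, the whole chain fails and the user is back to reindexing.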
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857325#action_12857325 ] Robert Muir commented on LUCENE-2396: - Also, i would like to remove all deprecated methods from contrib/analyzers as well. this again shouldn't be a problem, as it has no back compat policy. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
Agree. However I don't see how Lucene could suddenly change so much that even a conversion tool is impossible to create. After all it's all about terms, positions and frequencies. Yeah... some additions such as payloads may appear, disappear, or evolve into something new, but those are on the user's side anyway. Analyzers are indeed a delicate problem, as when StandardAnalyzer (which probably 90% of users use) generates a different set of terms for the same string. But again it's a user-side problem. Nothing stops him from ripping StandardAnalyzer out of whatever version of lucene, adapting it to the newer indexing API, plugging it in and continuing. I already use 50% customized analyzers, my own query parser and so on. I have junits for (hopefully) all cases I need to cover, so if a new Analyzer misbehaves, it's my responsibility. Danil. On Thu, Apr 15, 2010 at 16:56, Grant Ingersoll gsing...@apache.org wrote: I do think major versions should be able to read the previous version index. Still, even being able to do that is no guarantee that it will produce correct results. Likewise, even having an upgrade tool is no guarantee that correct results will be produced. So, my take is that we strive for it, but we all have to realize, and document, that it might not always be possible. Let's just be practical and pragmatic. Past history indicates we are capable of, for the most part, reading the prev. version index and upgrading it. If it can't be done automatically, then we can consider a tool. If the tool won't work, then we will have to reindex. It doesn't have to be an all-or-nothing decision made in the void. We've always been very practical here about making decisions on problems that are directly facing us, so I would suggest we move forward with the new approach (which I agree makes more sense and is pretty prevalent across a lot of projects) and we take this issue on a case-by-case basis.
-Grant On Apr 15, 2010, at 9:49 AM, Yonik Seeley wrote: On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. It's still harder. Consider a common scenario where you have one master and the index being replicated to multiple slaves. One would need to stop replication to an upgraded slave until the master is also upgraded. Some people can't even stop replication because they use something like a SAN to share the index. I'm just pointing out that there is a lot of value for many people in back-compatible indexes... I'm not trying to make any points about when that back compat should be broken. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
On Wed, Apr 14, 2010 at 5:22 PM, Michael McCandless luc...@mikemccandless.com wrote: * There is no back compat across major releases (index nor APIs), but full back compat within branches. This would match how many other projects work (KS/Lucy, as Marvin describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.). Sort of... except many of these projects listed above care a lot about back compat, even between major releases. So while we could always break back compat, we shouldn't do so unless it's necessary. It's not an all-or-nothing scenario though... requiring re-indexing seems reasonable, but changing APIs around when there's not a good reason behind it (other than someone liked the name a little better) should still be approached with caution. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
Coming in late to the discussion, and without really understanding the underlying Lucene issues, but... The size of the problem of reindexing is under-appreciated, I think. Somewhere in my company is the original data I indexed. But the effort it would take to resurrect it is O(unknown). An unfortunate reality of commercial products is that they often receive very little love for extended periods of time until all of a sudden more work is required. There ensues an extended period of re-orientation, even if the people who originally worked on the project are still around. *Assuming* the data is available to reindex (and there are many reasons besides poor practice on the part of the company that it may not be), remembering/finding out exactly which of the various backups you made of the original data is the one that's actually in your product can be highly non-trivial. Compounded by the fact that the product manager will be adamant about Do NOT surprise our customers. So I can be in a spot of saying I *think* I have the original data set, and I *think* I have the original code used to index it, and if I get a new version of Lucene I *think* I can recreate the index and I *think* that the user will see the expected change. After all that effort is completed, I *think* we'll see the expected changes, but we won't know until we try. That puts me in a very precarious position. This assumes that I have a reasonable chance of getting the original data. But say I've been indexing data from a live feed. Sure as hell hope I stored the data somewhere, because going back to the source and saying please resend me 10 years worth of data that I have in my index is...er...hard. Or say that the original provider has gone out of business, or the licensing arrangement specifies a one-time transmission of data that may not be retained in its original form, or... The point of this long diatribe is that there are many reasons why reindexing is impossible and/or impractical.
Making any decision that requires reindexing for a new version is locking a user into a version potentially forever. We should not underestimate how painful that can be and should never think that just reindex is acceptable in all situations. It's not. Period. Be very clear that some number of Lucene users will absolutely not be able to reindex. We may still make a decision that requires this, but let's make it without deluding ourselves that it's a possible solution for everyone. So an upgrade tool seems like a reasonable compromise. I agree that being hampered in what we can develop in Lucene by having to accommodate reading old indexes slows new features etc. It's always nice to be able to work without dealing with pesky legacy issues <G>. Perhaps splitting out the indexing upgrades into a separate program lets us accommodate both concerns. FWIW Erick On Thu, Apr 15, 2010 at 9:42 AM, Danil ŢORIN torin...@gmail.com wrote: True. Just need the tool. On Thu, Apr 15, 2010 at 16:39, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
reasonable, but changing APIs around when there's not a good reason behind it (other than someone liked the name a little better) should still be approached with caution. Changing names is a good enough reason :) They make a darn difference between having to read a book to be able to use some library, or just playing around with it for a bit. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
If you absolutely cannot re-index, and you have *no* access to the data again - you are one ballsy mofo to upgrade to a new version of Lucene for features. It means you likely BASE jump in your free time? On 04/15/2010 10:14 AM, Erick Erickson wrote: Coming in late to the discussion, and without really understanding the underlying Lucene issues, but... The size of the problem of reindexing is under-appreciated, I think. Somewhere in my company is the original data I indexed. But the effort it would take to resurrect it is O(unknown). An unfortunate reality of commercial products is that they often receive very little love for extended periods of time until all of a sudden more work is required. There ensues an extended period of re-orientation, even if the people who originally worked on the project are still around. *Assuming* the data is available to reindex (and there are many reasons besides poor practice on the part of the company that it may not be), remembering/finding out exactly which of the various backups you made of the original data is the one that's actually in your product can be highly non-trivial. Compounded by the fact that the product manager will be adamant about Do NOT surprise our customers. So I can be in a spot of saying I *think* I have the original data set, and I *think* I have the original code used to index it, and if I get a new version of Lucene I *think* I can recreate the index and I *think* that the user will see the expected change. After all that effort is completed, I *think* we'll see the expected changes, but we won't know until we try. That puts me in a very precarious position. This assumes that I have a reasonable chance of getting the original data. But say I've been indexing data from a live feed. Sure as hell hope I stored the data somewhere, because going back to the source and saying please resend me 10 years worth of data that I have in my index is...er...hard.
Or say that the original provider has gone out of business, or the licensing arrangement specifies a one-time transmission of data that may not be retained in its original form, or... The point of this long diatribe is that there are many reasons why reindexing is impossible and/or impractical. Making any decision that requires reindexing for a new version is locking a user into a version potentially forever. We should not underestimate how painful that can be and should never think that just reindex is acceptable in all situations. It's not. Period. Be very clear that some number of Lucene users will absolutely not be able to reindex. We may still make a decision that requires this, but let's make it without deluding ourselves that it's a possible solution for everyone. So an upgrade tool seems like a reasonable compromise. I agree that being hampered in what we can develop in Lucene by having to accommodate reading old indexes slows new features etc. It's always nice to be able to work without dealing with pesky legacy issues <G>. Perhaps splitting out the indexing upgrades into a separate program lets us accommodate both concerns. FWIW Erick On Thu, Apr 15, 2010 at 9:42 AM, Danil ŢORIN torin...@gmail.com wrote: True. Just need the tool. On Thu, Apr 15, 2010 at 16:39, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up.
-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
'Cause some exec finally noticed the product was losing market share. Or got a wild hair strategically placed. My point is only that we should be clear that some number of Lucene users *will* be in such a position. I'm actually fine with a decision that we're not going to support such a scenario, but let's be clear that that's the decision we're making. And corporate competence aside, there's still licensing that may prevent me archiving the raw data. Erick On Thu, Apr 15, 2010 at 10:20 AM, Earwin Burrfoot ear...@gmail.com wrote: I think the need to upgrade to latest and greatest lucene for poor corporate users that lost all their data is somewhat overblown. Why the heck do you need to upgrade if your app rotted in neglect for years?? On Thu, Apr 15, 2010 at 18:14, Erick Erickson erickerick...@gmail.com wrote: [...]
Re: Proposal about Version API relaxation
The app is not rotted, it's alive and kicking, and gets a lot of TLC. There are some older indexes that use some features and there are newer indexes that will benefit greatly from newer features. All running in one freaking big distributed application. Leveraging lucene versions by updating to a newer lucene for new indexes, and changing the analyzer chain of old indexes in a way that doesn't affect (too much) the search results they used to get, is a logical way from my point of view. I only ask for a tool to convert from the old lucene format to the new one. I don't expect magic to happen, but give me the possibility to go forward and let me worry about backward compatibility of search results. On Thu, Apr 15, 2010 at 17:20, Earwin Burrfoot ear...@gmail.com wrote: I think the need to upgrade to latest and greatest lucene for poor corporate users that lost all their data is somewhat overblown. Why the heck do you need to upgrade if your app rotted in neglect for years?? On Thu, Apr 15, 2010 at 18:14, Erick Erickson erickerick...@gmail.com wrote: [...]
[jira] Updated: (LUCENE-2395) Add a scoring DistanceQuery that does not need caches and separate filters
[ https://issues.apache.org/jira/browse/LUCENE-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2395: -- Attachment: DistanceQuery.java A first idea of the Query, it does not even compile as some classes are missing (coming with Chris' later patches), but it shows how it should work and how it's customizable. Add a scoring DistanceQuery that does not need caches and separate filters -- Key: LUCENE-2395 URL: https://issues.apache.org/jira/browse/LUCENE-2395 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Uwe Schindler Fix For: 3.1 Attachments: DistanceQuery.java In a chat with Chris Male and my own ideas when implementing for PANGAEA, I thought about the broken distance query in contrib. It lacks the following features: - It needs a query/filter for the enclosing bbox (which is constant score) - It needs a separate filter for filtering out hits too far away (inside bbox but outside distance limit) - It has no scoring, so if somebody wants to sort by distance, he needs to use the custom sort. For that to work, spatial caches distance calculation (which is broken for multi-segment search) The idea is now to combine all three things into one query, but customizable: We first thought about extending CustomScoreQuery and calculating the distance from FieldCache in the customScore method and returning a score of 1 for distance=0, score=0 at the max distance, and score<0 for farther hits that are in the bounding box but not in the distance circle. To filter out such negative scores, we would need to override the scorer in CustomScoreQuery, which is private. My proposal is now to use a very stripped down CustomScoreQuery (but not extend it) that does call a method getDistance(docId) in its scorer's advance and nextDoc that calculates the distance for the current doc. It stores this distance also in the scorer. If the distance > maxDistance it throws away the hit and calls nextDoc() again. 
The score() method will return by default weight.value*(maxDistance - distance)/maxDistance and uses the precalculated distance. So the distance is only calculated one time in nextDoc()/advance(). To be able to plug in custom scoring, the following methods in the query can be overridden: - float getDistanceScore(double distance) - returns by default: (maxDistance - distance)/maxDistance; allows score customization - DocIdSet getBoundingBoxDocIdSet(Reader, LatLng sw, LatLng ne) - returns a DocIdSet for the bounding box. By default it returns e.g. the docIdSet of a NRF or a cartesian tier filter. You can even plug in any other DocIdSet, e.g. wrap a Query with QueryWrapperFilter - support a setter for the GeoDistanceCalculator that is used by the scorer to get the distance. - a LatLng provider (similar to CustomScoreProvider/ValueSource) that returns for a given doc id the lat/lng. This method is called per IndexReader one time in scorer creation and will retrieve the coordinates. By that we support FieldCache or whatever. This query is almost finished in my head, it just needs coding :-) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
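The default distance-to-score mapping and the skip-too-far-hits loop described above can be sketched in plain Java. All class and method names here are illustrative only, not taken from the attached DistanceQuery.java:

```java
// Sketch of the default scoring and nextDoc() behavior described above.
// Hypothetical names; the real proposal wires this into a Lucene Scorer.
public class DistanceScoreSketch {

    /** By default: 1.0 at distance 0, falling linearly to 0.0 at maxDistance. */
    public static float getDistanceScore(double distance, double maxDistance) {
        return (float) ((maxDistance - distance) / maxDistance);
    }

    /**
     * Mimics the scorer's nextDoc() loop: hits farther than maxDistance are
     * skipped; the accepted doc's distance would be cached for score().
     */
    public static int nextAcceptedDoc(double[] distanceByDoc, int startDoc,
                                      double maxDistance) {
        for (int doc = startDoc; doc < distanceByDoc.length; doc++) {
            if (distanceByDoc[doc] <= maxDistance) {
                return doc;
            }
        }
        return Integer.MAX_VALUE; // stands in for NO_MORE_DOCS
    }
}
```

Because the distance is computed once in nextDoc()/advance() and cached, score() never recomputes it; overriding getDistanceScore alone changes the ranking function without touching the iteration logic.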
[jira] Assigned: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-2396: --- Assignee: Robert Muir remove version from contrib/analyzers. -- Key: LUCENE-2396 URL: https://issues.apache.org/jira/browse/LUCENE-2396 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Contrib/analyzers has no backwards-compatibility policy, so let's remove Version so the API is consumable. if you think we shouldn't do this, then instead explicitly state and vote on what the backwards compatibility policy for contrib/analyzers should be instead, or move it all to core.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857373#action_12857373 ] Michael McCandless commented on LUCENE-2324: bq. The usual design is a queued ingestion pipeline, where a pool of indexer threads take docs out of a queue and feed them to an IndexWriter, I think? bq. Mainly, because I think apps with such an affinity that you describe are very rare? Hmm, I suspect it's not that rare. Yes, one design is a single indexing queue w/ a dedicated thread pool only for indexing, but a push model is equally valid, where your app already has separate threads (or thread pools) servicing different content sources, so when a doc arrives at one of those source-specific threads, it's that thread that indexes it, rather than handing off to a separate pool. Lucene is used in a very wide variety of apps -- we shouldn't optimize the indexer on such hard app-specific assumptions. bq. And if a user really has so different docs, maybe the right answer would be to have more than one single index? Hmm, but the app shouldn't have to resort to this... (it doesn't have to today). But... could we allow an add/updateDocument call to express this affinity, explicitly? If you index homogeneous docs you wouldn't use it, but, if you index drastically different docs that fall into clear categories, expressing the affinity can get you a good gain in indexing throughput. This may be the best solution, since then one could pass the affinity even through a thread pool, and then we would fall back to thread binding if the document class wasn't declared? I mean this is virtually identical to having more than one index, since the DW is like its own index. It just saves some of the copy-back/merge cost of addIndexes... bq. Even if today an app utilizes the thread affinity, this only results in maybe somewhat faster indexing performance, but the benefits would be lost after flushing/merging. 
Yes, this optimization is only about the initial flush, but, it's potentially sizable. Merging matters less since typically it's not the bottleneck (happens in the BG, quickly enough). On the right apps, thread affinity can make a huge difference. EG if you allow up to 8 thread states, and the threads are indexing content w/ highly divergent terms (eg, one language per thread, or, docs w/ very different field names), in the worst case you'll be up to 1/8 as efficient since each term must now be copied in up to 8 places instead of one. We have a high per-term RAM cost (reduced thanks to the parallel arrays, but, still high). bq. If we assign docs randomly to available DocumentsWriterPerThreads, then we should on average make good use of the overall memory? It really depends on the app -- if the term space is highly thread-dependent (above examples) you can end up flushing much more frequently for a given RAM buffer. bq. Alternatively we could also select the DWPT from the pool of available DWPTs that has the highest amount of free memory? Hmm... this would be kinda costly binder? You'd need a pqueue? Thread affinity (or the explicit affinity) is a single map/array/member lookup. But it's an interesting idea... bq. If you do have a global RAM management, how would the flushing work? E.g. when a global flush is triggered because all RAM is consumed, and we pick the DWPT with the highest amount of allocated memory for flushing, what will the other DWPTs do during that flush? Wouldn't we have to pause the other DWPTs to make sure we don't exceed the maxRAMBufferSize? The other DWs would keep indexing :) That's the beauty of this approach... a flush of one DW doesn't stop all other DWs from indexing, unlike today. And you want to serialize the flushing right? Ie, only one DW flushes at a time (the others keep indexing). Hmm, I suppose flushing more than one should be allowed (OS/IO have a lot of concurrency, esp since IO goes into write cache)... 
perhaps that's the best way to balance index vs flush time? EG we pick one to flush @ 90%, if we cross 95% we pick another to flush, another at 100%, etc. bq. Of course we could say always flush when 90% of the overall memory is consumed, but how would we know that the remaining 10% won't fill up during the time the flush takes? Regardless of the approach for document -> DW binding, this is an issue (ie it's non-differentiating here)? Ie the other DWs continue to consume RAM while one DW is flushing. I think the low/high water mark is an OK solution here? Or the tiered flushing (I think I like that better :) ). bq. Having a fully decoupled memory management is compelling I think, mainly because it makes everything so much simpler. A DWPT could decide itself when it's time to flush, and the other ones can keep going independently. I'm all for simplifying things, which you've already nicely done here, but not if it's at the cost of a
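The tiered flushing idea ("pick one to flush @ 90%, if we cross 95% we pick another, another at 100%") can be sketched roughly as follows. The thresholds and all names are purely illustrative of the discussion, not part of any IndexWriter API:

```java
// Rough sketch of tiered flushing: as global RAM use crosses successive
// thresholds, one more DocumentsWriterPerThread is selected to flush while
// the remaining DWPTs keep indexing. Hypothetical; not from a Lucene patch.
public class TieredFlushSketch {
    // Hypothetical tiers from the discussion: 90%, 95%, 100% of the budget.
    private static final double[] TIERS = {0.90, 0.95, 1.00};

    /** Number of concurrent DWPT flushes the current RAM usage calls for. */
    public static int flushesWanted(long usedBytes, long budgetBytes) {
        double ratio = (double) usedBytes / budgetBytes;
        int wanted = 0;
        for (double tier : TIERS) {
            if (ratio >= tier) {
                wanted++;
            }
        }
        return wanted;
    }
}
```

The point of the tiers is the balance discussed above: a single flush does not pause the other writers, and if they fill RAM faster than one flush can drain it, additional flushes kick in rather than a hard stop.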
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857375#action_12857375 ] Tim Smith commented on LUCENE-2324: --- bq. But... could we allow an add/updateDocument call to express this affinity, explicitly? i would love to be able to explicitly define a segment affinity for documents i'm feeding this would then allow me to say: all docs from table a has affinity 1 all docs from table b has affinity 2 this would ideally result in indexing documents from each table into a different segment (obviously, i would then need to be able to have segment merging be affinity aware so optimize/merging would only merge segments that share an affinity) Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1 Attachments: lucene-2324.patch, LUCENE-2324.patch See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO CPU. -- This message is automatically generated by JIRA. 
[jira] Updated: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2396: Attachment: LUCENE-2396.patch attached is a patch, including CHANGES rewording. All Lucene/Solr tests pass. If no one objects, I plan to commit in a day or two.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857380#action_12857380 ] Jason Rutherglen commented on LUCENE-2324: -- bq. only one DW flushes at a time (the others keep indexing). I think it's best to simply flush at 90% for now. We already exceed the ram buffer size because of over allocation? Perhaps we can view the ram buffer size as a rough guideline, not a hard and fast limit, because, let's face it, we're using Java, which is about as inexact when it comes to RAM consumption as it gets? Also, hopefully it would move the patch along faster, and more complex algorithms could easily be added later.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857381#action_12857381 ] Michael McCandless commented on LUCENE-2324: {quote} i would love to be able to explicitly define a segment affinity for documents i'm feeding this would then allow me to say: all docs from table a has affinity 1 all docs from table b has affinity 2 {quote} Right, this is exactly what affinity would be good for -- so IW would try to send table a docs to their own DW(s) and table b docs to their own DW(s), which should give faster indexing than randomly binding to DWs. But: bq. this would ideally result in indexing documents from each table into a different segment (obviously, i would then need to be able to have segment merging be affinity aware so optimize/merging would only merge segments that share an affinity) This part I was not proposing :) The affinity would just be an optimization hint in creating the initial flushed segments, so IW can speed up indexing. Probably if you really want to keep the segments segregated like that, you should in fact index to separate indices?
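As a sketch of what an explicit affinity hint might look like (a hypothetical API shape; nothing like this exists in IndexWriter), the hint could simply select which per-thread DocumentsWriter a document lands in:

```java
// Hypothetical sketch of an affinity hint binding documents to one of N
// per-thread DocumentsWriters, per the discussion above. The method name
// and the idea of an int affinity are illustrative only.
public class AffinitySketch {

    /** Map an app-supplied affinity (e.g. 1 = table a, 2 = table b) to a DWPT slot. */
    public static int pickWriterSlot(int affinity, int numWriters) {
        // floorMod keeps the slot non-negative even for negative affinities.
        return Math.floorMod(affinity, numWriters);
    }
}
```

Docs sharing an affinity would then tend to flush into the same initial segments, which is the indexing-speed optimization Mike describes; as he notes, it would not constrain later merging.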
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857384#action_12857384 ] Uwe Schindler commented on LUCENE-2396: --- Are you sure you want to use LUCENE_CURRENT in some ctors?
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857386#action_12857386 ] Robert Muir commented on LUCENE-2396: - bq. Are you sure you want to use LUCENE_CURRENT in some ctors? The lucene core subclasses used by some analyzers require this, so another alternative is to create a static CONTRIB_ANALYZERS_VERSION = 3.1 for this purpose, and bump it every release. that's fine too.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857385#action_12857385 ] Tim Smith commented on LUCENE-2324: --- bq. Probably if you really want to keep the segments segregated like that, you should in fact index to separate indices? That's what i'm currently thinking i'll have to do. However, it would be ideal if i could either subclass IndexWriter or use IndexWriter directly with this affinity concept (potentially writing my own segment merger that is affinity aware). That makes it so i can easily use near real time indexing, as only one IndexWriter will be in the mix, as well as make managing deletes and a whole host of other issues with multiple indexes disappear. It also makes it so i can configure memory settings across all affinity groups instead of having to dynamically create them, each with their own memory bounds.
- If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
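Tim's affinity idea above can be sketched in a few lines: route each document to one of N private in-memory segment buffers by an application-supplied affinity key, so documents sharing a key always land in the same buffered segment. This is a minimal, hypothetical illustration; the names (AffinityRouter, SegmentBuffer, route) are invented for this sketch and none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of routing documents to private in-memory segment
// buffers by an affinity key. Not Lucene code; names are invented.
class AffinityRouter {
    static class SegmentBuffer {
        final List<String> docs = new ArrayList<>();
    }

    private final SegmentBuffer[] buffers;

    AffinityRouter(int n) {
        buffers = new SegmentBuffer[n];
        for (int i = 0; i < n; i++) buffers[i] = new SegmentBuffer();
    }

    // same key always hashes to the same buffer, so related docs
    // stay together in one private segment
    SegmentBuffer route(String affinityKey) {
        return buffers[Math.floorMod(affinityKey.hashCode(), buffers.length)];
    }

    void add(String affinityKey, String doc) {
        route(affinityKey).docs.add(doc);
    }

    public static void main(String[] args) {
        AffinityRouter r = new AffinityRouter(4);
        r.add("tenantA", "doc1");
        r.add("tenantA", "doc2");
        r.add("tenantB", "doc3");
        System.out.println(r.route("tenantA").docs.size()); // prints 2
    }
}
```

A single affinity-aware merge policy could then merge only buffers belonging to the same group, which is the part Tim suggests would need a custom segment merger.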
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857388#action_12857388 ] Shai Erera commented on LUCENE-2396: Robert I think this is great! Can we move more analyzers from core here? I think however that a backwards section in CHANGES is important because it alerts users about those analyzers whose runtime behavior changed. Otherwise how would the poor users know that? It doesn't mean you need to maintain back compat support, but at least alert them when things change. Even if we eventually decide to remove API bw compat completely, a section in CHANGES will still be required to help users upgrade easily. remove version from contrib/analyzers. -- Key: LUCENE-2396 URL: https://issues.apache.org/jira/browse/LUCENE-2396 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-2396.patch Contrib/analyzers has no backwards-compatibility policy, so let's remove Version so the API is consumable. If you think we shouldn't do this, then instead explicitly state and vote on what the backwards compatibility policy for contrib/analyzers should be, or move it all to core.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857395#action_12857395 ] Robert Muir commented on LUCENE-2396: - {quote} Robert I think this is great! Can we move more analyzers from core here? I think however that a backwards section in changes is important because it alerts users about those analyzers whose runtime behavior changed. Otherwise how would the poor users know that? It doesn't mean you need to maintain back compat support but at least alert them when things change. {quote} I think this belongs in Changes in Runtime Behavior. It's a question of wording, which is why I renamed it as such in the patch. If folks want to move the analyzers in core into here, that would be great too, even better the Solr analyzers. We can call it a module if we want, or whatever. But for now, I'm working with what I've got.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857396#action_12857396 ] Shai Erera commented on LUCENE-2396: Static? Weren't you against that!? But if we remove back compat from analyzers why do we need Version? Or is this API bw that we remove?
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857398#action_12857398 ] Robert Muir commented on LUCENE-2396: - {quote} Static? Weren't you against that!? But if we remove back compat from analyzers why do we need Version? Or is this API bw that we remove? {quote} Whoah... don't get too excited :). *Internally* some of these contrib analyzers subclass stuff that's in Lucene core, which requires Version. If this stuff was moved into, say, contrib/analyzers, then we wouldn't need this *internal-only-use* Version arg.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857402#action_12857402 ] Uwe Schindler commented on LUCENE-2396: --- bq. Static? Weren't you against that!? He meant a static final! It is just to fix the analyzers that depend on core stuff to a specific version, until we have no more analyzers in core except Whitespace.
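What Uwe describes might look like the following hedged sketch: the contrib analyzer's public API takes no Version argument, while a private static final constant pins the behavior of the core components it wraps. The Version enum and classes here are toy stand-ins invented for illustration, not Lucene's org.apache.lucene.util.Version or any real analyzer.

```java
// Toy stand-in for Lucene's Version enum.
enum Version { LUCENE_30, LUCENE_31 }

// Stand-in for a core component that still requires a Version.
class CoreTokenizerStub {
    final Version matchVersion;
    CoreTokenizerStub(Version v) { this.matchVersion = v; }
}

// Sketch of a contrib analyzer whose public API exposes no Version:
// the constant below is internal-only, pinning core behavior.
class ContribAnalyzerSketch {
    private static final Version MATCH_VERSION = Version.LUCENE_31;

    CoreTokenizerStub newTokenizer() {
        return new CoreTokenizerStub(MATCH_VERSION);
    }
}
```

Callers never see or choose the version; when the core dependency eventually moves into the analyzers module, the constant disappears entirely.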
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857412#action_12857412 ] Robert Muir commented on LUCENE-2396: - bq. Until we have no more analyzers in core except Whitespace. Actually I think Whitespace belongs in the analyzers module too. I would suggest a TestAnalyzer in src/test, which might just be quick-and-dirty or whatever.
Re: Proposal about Version API relaxation
On 04/15/2010 09:49 AM, Robert Muir wrote: wrong, it doesn't fix the analyzers problem. you need to reindex. On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. Having read the thread, I have a few comments. Much of it is summary. The current proposal requires re-index on every upgrade to Lucene. Plain and simple. Robert is right about the analyzers. There are three levels of backward compatibility, though we usually talk about two. First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index. And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Second, the API. The current mechanism to use deprecations to migrate users to a new API is both a blessing and a curse. It is a blessing to end users so that they have a clear migration path. It is a curse to development because the API is bloated with the old and the new. Further, it causes unfortunate class naming, with the tendency to migrate away from the good name. It is a curse to end users because it can cause confusion. While I like the mechanism of deprecations to migrate me from one release to another, I'd be open to another mechanism. So much effort is put into API bw compat that might be better spent on another mechanism. E.g. thorough documentation. Third, the behavior. WRT analyzers (consisting of tokenizers, stemmers, stop words, ...): if the token stream changes, the index is no longer valid. It may appear to work, but it is broken.
The token stream applies not only to the indexed documents, but also to the user-supplied query. A simple example: if, from one release to another, the stop word 'a' is dropped, then phrase searches including 'a' won't work, as 'a' is not in the index. Even a simple, obvious bug fix that changes the stream is bad. Another behavior change is an upgrade in Java version. By forcing users to go to Java 5 with Lucene 3, the version of Unicode changed. This in itself causes a change in some token streams. With a change to a token stream, the index must be re-created to ensure expected behavior. If the original input is no longer available or the index cannot be rebuilt for whatever reason, then Lucene should not be upgraded. It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well, contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ Version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. The other problem I have is the assumption that re-index is feasible and that indexes are always server-based. Re-index feasibility has already been well-discussed on this thread from a server-side perspective. There are many client-side applications, like mine, where the index is built and used on the client's computer. In my scenario the user builds indexes individually for books. From the index perspective, the sentence is the Lucene document and the book is the index. Building an index is voluntary and takes time proportional to the size of the document and inversely proportional to the power of the computer. Our user base is those with ancient, underpowered laptops in third-world countries.
On those machines it might take 10 minutes to create an index and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. So what are my choices? (rhetorical) With each new release of my app, I'd like to exploit the latest and greatest features of Lucene. And I'm going to change my app with features which may or may not be related to the use of Lucene. Those latter features are what matter the most to my user base. They don't care what technologies are used to do searches. If the latest Lucene jar does not let me use Version (or some other mechanism) to maintain compatibility with an older index, the user will have to re-index. Or I can forgo any future upgrades with Lucene. Neither is very palatable. -- DM Smith
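DM's stop-word example can be made concrete with a toy simulation (plain-Java stand-ins, not Lucene code): if 'a' was treated as a stop word at index time, a phrase query that still contains 'a' can never match, even though the original text did contain it.

```java
import java.util.*;

// Toy simulation of analyzer drift: the stop-word set used at index
// time differs from the one used at query time, so phrase matching
// silently breaks. Not Lucene code; a deliberately naive sketch.
class StopWordDrift {
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+"))
            if (!stopWords.contains(t)) tokens.add(t);
        return tokens;
    }

    // naive exact-phrase match over token lists (ignores positions/gaps)
    static boolean phraseMatch(List<String> indexed, List<String> query) {
        return Collections.indexOfSubList(indexed, query) >= 0;
    }

    public static void main(String[] args) {
        String doc = "to be a rock and not to roll";
        Set<String> oldStops = Set.of("a"); // used when the index was built
        Set<String> newStops = Set.of();    // used at query time, after an "upgrade"

        List<String> indexed = analyze(doc, oldStops);
        // 'a' survives analysis of the query but was never indexed:
        System.out.println(phraseMatch(indexed, analyze("be a rock", newStops))); // false
        // with the original analyzer, the same phrase matches:
        System.out.println(phraseMatch(indexed, analyze("be a rock", oldStops))); // true
    }
}
```

Real Lucene phrase matching also involves position increments, but the failure mode is the same: index and query must be analyzed identically, which is exactly what Version (or re-indexing) is meant to guarantee.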
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857427#action_12857427 ] DM Smith commented on LUCENE-2396: -- Robert, I think this is a red herring. There has been an implicit bw compat policy, with all the effort to maintain bw compat in the analyzers. With the re-shuffling of contrib this has been made a bit murky and does need to be re-addressed. How is this any different from the discussion to eliminate Version altogether? I think that should be resolved first and this should follow the lead of that. How can one have a useful index across releases without a stable token stream? From the thread it is clear that few understand the impact of an analyzer on the usefulness of an index. If this succeeds there is little reason to maintain Version at all. -- DM
Re: Proposal about Version API relaxation
First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index. And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Optimize is a bad way to convert to current. 1. conversion is not guaranteed, optimizing already optimized index is a noop 2. it merges all your segments. if you use BalancedSegmentMergePolicy, that destroys your segment size distribution Dedicated upgrade tool (available both from command-line and programmatically) is a good way to convert to current. 1. conversion happens exactly when you need it, conversion happens for sure, no additional checks needed 2. it should leave all your segments as is, only changing their format It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. I counter - most non-trivial applications will use their own analyzers. The more modules - the merrier. You can choose precisely what you need. Our user base are those with ancient, underpowered laptops in 3-rd world countries. On those machines it might take 10 minutes to create an index and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. Major Lucene releases (feature-wise, not version-wise) happen like once in a year, or year-and-a-half. Is it that hard for your users to wait ten minutes once a year? 
-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
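The two conversion paths Earwin contrasts can be sketched abstractly: an optimize-style conversion merges everything into one segment, destroying the size distribution a merge policy maintains, while an upgrade-style conversion rewrites each segment's format in place and keeps the distribution. All names below are invented for illustration; this is not Lucene's API.

```java
import java.util.*;

// Abstract sketch of "optimize as converter" vs. a dedicated upgrade
// tool, per the discussion above. Invented types, not Lucene code.
class UpgradeSketch {
    static class Segment {
        final int docCount;
        final int formatVersion;
        Segment(int docCount, int formatVersion) {
            this.docCount = docCount;
            this.formatVersion = formatVersion;
        }
    }

    // optimize-style: merges all segments into one big one
    static List<Segment> optimize(List<Segment> index, int newFormat) {
        int total = index.stream().mapToInt(s -> s.docCount).sum();
        return List.of(new Segment(total, newFormat));
    }

    // upgrade-style: rewrites each segment's format, sizes untouched
    static List<Segment> upgradeInPlace(List<Segment> index, int newFormat) {
        List<Segment> out = new ArrayList<>();
        for (Segment s : index) out.add(new Segment(s.docCount, newFormat));
        return out;
    }

    public static void main(String[] args) {
        List<Segment> index = List.of(
            new Segment(1000, 2), new Segment(100, 2), new Segment(10, 2));
        System.out.println(optimize(index, 3).size());       // 1 segment left
        System.out.println(upgradeInPlace(index, 3).size()); // 3 segments kept
    }
}
```

The in-place path also sidesteps Earwin's first objection: an already-optimized index is a no-op for optimize, so conversion is not guaranteed, whereas a format rewrite always happens.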
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 1:30 PM, DM Smith dmsmith...@gmail.com wrote: Another behavior change is an upgrade in Java version. By forcing users to go to Java 5 with Lucene 3, the version of Unicode changed. This in itself causes a change in some token streams. ... It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. DM brings up some interesting points here. For example, the Porter stemmer in core, from 1970 or whenever, has essentially been frozen to all changes for some time now; it says so on Porter's site. This is not the case for non-English: things are very much in flux, including how the characters themselves are encoded on a computer. If we want to support languages other than English in Lucene, we have to make it possible to iterate and improve things without making 20 copies of something or scattering Version everywhere. -- Robert Muir rcm...@gmail.com
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857440#action_12857440 ] Robert Muir commented on LUCENE-2396: - bq. There has been an implicit bw compat policy, Part of the point of this patch for me was two things: # what would the code look like if we deleted the back compat cruft? # why do I constantly hear different ideas about what contrib/analyzers' back compat is and what it should be? I want it defined! At first I said this is a stupid idea, but I'm gonna delete all the backwards cruft from a few Analyzers and just give it a try... it's amazing how much easier it is to see what is going on when you delete the 1.8MB of backwards crap... a lot of which I put a lot of effort into myself. So I think we should instead use real versions for contrib/analyzers. You can be damn sure if you stick with lucene-analyzers-3.0.jar that your stemmer isn't going to change behavior... no matter how much backwards stuff we try to add, this is easiest and safest on everyone.
Re: Proposal about Version API relaxation
On 04/15/2010 01:50 PM, Earwin Burrfoot wrote: First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index. And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Optimize is a bad way to convert to current. 1. conversion is not guaranteed, optimizing already optimized index is a noop 2. it merges all your segments. if you use BalancedSegmentMergePolicy, that destroys your segment size distribution Dedicated upgrade tool (available both from command-line and programmatically) is a good way to convert to current. 1. conversion happens exactly when you need it, conversion happens for sure, no additional checks needed 2. it should leave all your segments as is, only changing their format It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. I counter - most non-trivial applications will use their own analyzers. The more modules - the merrier. You can choose precisely what you need. By and large an analyzer is a simple wrapper for a tokenizer and some filters. Are you suggesting that most non-trivial apps write their own tokenizers and filters? I'd find that hard to believe. For example, I don't know enough Chinese, Farsi, Arabic, Polish, ... to come up with anything better than what Lucene has to tokenize, stem or filter these. Our user base are those with ancient, underpowered laptops in 3-rd world countries. 
On those machines it might take 10 minutes to create an index and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. Major Lucene releases (feature-wise, not version-wise) happen like once in a year, or year-and-a-half. Is it that hard for your users to wait ten minutes once a year? I said that was for one index. Multiply that times the number of books available (300+) and yes, it is too much to ask. Even if a small subset is indexed, say 30, that's around 5 hours of waiting. Under consideration is the frequency of breakage. Some are suggesting a greater frequency than yearly. DM
[jira] Updated: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom Burton-West updated LUCENE-2393: Attachment: LUCENE-2393.patch New patch includes a (pre-flex) version of HighFreqTerms that finds the top N terms with the highest docFreq, looks up the total term frequency, and outputs the list of terms sorted by highest total term frequency (which approximates the largest entries in the *prx files). I'm not sure how to combine the GetTermInfo program with either version of HighFreqTerms in a way that leads to sane command line arguments and argument processing. I suppose that HighFreqTerms could have a flag that turns on or off the inclusion of total term frequency. Utility to output total term frequency and df from a lucene index - Key: LUCENE-2393 URL: https://issues.apache.org/jira/browse/LUCENE-2393 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Tom Burton-West Priority: Trivial Attachments: LUCENE-2393.patch, LUCENE-2393.patch This is a command line utility that takes a field name, term, and index directory and outputs the document frequency for the term and the total number of occurrences of the term in the index (i.e. the sum of the tf of the term for each document). It is useful for estimating the size of the term's entry in the *prx files and the consequent disk I/O demands.
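The statistics the HighFreqTerms/GetTermInfo utilities compute can be illustrated with a toy postings map: for each term, the document frequency is the number of documents containing it, and the total term frequency is the sum of per-document tf values; the top-N terms by total tf approximate the biggest *prx entries. The postings map below is a stand-in for a real index, not the patch's code.

```java
import java.util.*;
import java.util.stream.Collectors;

// Toy computation of df and total term frequency over a fake
// postings map (term -> (docId -> tf)). Not the LUCENE-2393 code.
class TermStatsSketch {
    static long totalTermFreq(Map<Integer, Integer> postings) {
        return postings.values().stream().mapToLong(Integer::longValue).sum();
    }

    static List<String> topByTotalTf(Map<String, Map<Integer, Integer>> index, int n) {
        return index.entrySet().stream()
            .sorted((a, b) -> Long.compare(totalTermFreq(b.getValue()),
                                           totalTermFreq(a.getValue())))
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> index = Map.of(
            "the", Map.of(0, 10, 1, 12),   // df=2, totalTf=22
            "lucene", Map.of(0, 3),        // df=1, totalTf=3
            "index", Map.of(1, 5, 2, 1));  // df=2, totalTf=6
        System.out.println(index.get("the").size());         // df of "the": 2
        System.out.println(totalTermFreq(index.get("the"))); // total tf: 22
        System.out.println(topByTotalTf(index, 2));          // [the, index]
    }
}
```

Note that docFreq alone would rank "the" and "index" equally here; it is the total tf that identifies the term dominating the positions file, which is the point of the patch.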
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857456#action_12857456 ] DM Smith commented on LUCENE-2396: -- {quote} So I think we should instead use real-versions for contrib/analyzers. You can be damn sure if you stick with lucene-analyzers-3.0.jar that your stemmer isn't going to change behavior... no matter how much backwards stuff we try to add, this is easiest and safest on everyone. {quote} I could live with that... maybe. What guarantee is there that lucene-analyzers-3.0.jar will work with lucene-core-3.7.jar? How does that work? How can I use lucene-analyzers-3.0.jar on old indexes and lucene-analyzers-3.5.jar on newer ones within the same package? What I'd like to see is that all analyzers and their parts are kept together in an analyzer jar (maybe more than one for the honking big analyzers we have today) and that it be elevated to core. (I think contrib gives the wrong impression.) And have a well-defined compatibility policy.
Re: Proposal about Version API relaxation
I'd like to remind that Mike's proposal has stable branches. We can branch off preflex trunk right now and wrap it up as 3.1. Current trunk is declared as future 4.0 and all backcompat cruft is removed from it. If some new features/bugfixes appear in trunk, and they don't break stuff - we backport them to 3.x branch, eventually releasing 3.2, 3.3, etc. Thus, devs are free to work without back-compat burden, bleeding edge users get their blood, conservative users get their stability + a subset of new features from stable branches.
[jira] Issue Comment Edited: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857456#action_12857456 ] DM Smith edited comment on LUCENE-2396 at 4/15/10 2:16 PM: --- {quote} So I think we should instead use real-versions for contrib/analyzers. You can be damn sure if you stick with lucene-analyzers-3.0.jar that your stemmer isn't going to change behavior... no matter how much backwards stuff we try to add, this is easiest and safest on everyone. {quote} I could live with that... maybe. What guarantee is there that lucene-analyzers-3.0.jar will work with lucene-core-3.7.jar? How does that work? How can I use lucene-analyzers-3.0.jar on old indexes and lucene-analyzers-3.5.jar on newer ones within the same application? What I'd like to see is that all analyzers and their parts are kept together in an analyzer jar (maybe more than one for the honking big analyzers we have today) and that it be elevated to core. (I think contrib gives the wrong impression.) And have a well-defined compatibility policy.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857466#action_12857466 ] Robert Muir commented on LUCENE-2396: - bq. I could live with that... maybe. What guarantee is there that lucene-analyzers-3.0.jar will work with lucene-core-3.7.jar? How does that work? Well, they should work, unless lucene-core breaks backwards compatibility with analyzers! {quote} How can I use lucene-analyzers-3.0.jar on old indexes and lucene-analyzers-3.5.jar on newer ones within the same application? What I'd like to see is that all analyzers and their parts are kept together in an analyzer jar (maybe more than one for the honking big analyzers as we have today) and that it be elevated to core. (I think contrib gives the wrong impression.) And have a well-defined compatibility policy. {quote} Well, I think asking for a well-defined backwards compatibility policy for 'all analyzers' is asking too much. Things are not so simple and sorted out like they are with English/porter stemming, etc. I'll go with the flow, we can stay with what we have now, and the language support will also likely remain weak like it is now. Currently I feel it's an immense up-front effort to contribute any analysis support; it has to be near-perfect, lest it cause future problems, because it's not easy to iterate with the current situation without creating a mess. Forget about applying little patches or improvements (assuming adequately relevance-tested / sane etc)... we've really only been able to fix bugs, add tests, and reorganize analyzers, because touching them at all means you have to add backwards compat cruft.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857471#action_12857471 ] Robert Muir commented on LUCENE-2396: - bq. How can I use lucene-analyzers-3.0.jar on old indexes and lucene-analyzers-3.5.jar on newer ones within the same application? Sorry DM, I meant to respond to this too! I think this is an advanced use case that doesn't justify complex backwards compatibility layers.
Re: Proposal about Version API relaxation
I seriously don't understand the fuss around index format back compat. How often does it actually change, such that it is too much to ask that X support X-1? I prefer to have ongoing segment merging but can live w/ a manual converter tool. Thing is - I'll probably not be able to develop one myself outside the scope of Lucene because I'll miss tons of API. So having Lucene declare and adhere to it seems reasonable to me. BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer who changed the index format. It's the difference of when it happens. And I also think that a manual migration tool will need access to some lower-level API which is not exposed today, and will generally not perform as well as online migration. But that's a side note... Shai On Thursday, April 15, 2010, Earwin Burrfoot ear...@gmail.com wrote: I'd like to remind that Mike's proposal has stable branches. We can branch off preflex trunk right now and wrap it up as 3.1. Current trunk is declared as future 4.0 and all backcompat cruft is removed from it. If some new features/bugfixes appear in trunk, and they don't break stuff - we backport them to 3.x branch, eventually releasing 3.2, 3.3, etc. Thus, devs are free to work without back-compat burden, bleeding edge users get their blood, conservative users get their stability + a subset of new features from stable branches. On Thu, Apr 15, 2010 at 22:02, DM Smith dmsmith...@gmail.com wrote: On 04/15/2010 01:50 PM, Earwin Burrfoot wrote: First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index.
And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Optimize is a bad way to convert to current. 1. conversion is not guaranteed; optimizing an already-optimized index is a no-op. 2. it merges all your segments; if you use BalancedSegmentMergePolicy, that destroys your segment size distribution. A dedicated upgrade tool (available both from command-line and programmatically) is a good way to convert to current. 1. conversion happens exactly when you need it, conversion happens for sure, no additional checks needed. 2. it should leave all your segments as is, only changing their format. It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well, contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ Version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. I counter - most non-trivial applications will use their own analyzers. The more modules - the merrier. You can choose precisely what you need. By and large an analyzer is a simple wrapper for a tokenizer and some filters. Are you suggesting that most non-trivial apps write their own tokenizers and filters? I'd find that hard to believe. For example, I don't know enough Chinese, Farsi, Arabic, Polish, ... to come up with anything better than what Lucene has to tokenize, stem or filter these. Our user base are those with ancient, underpowered laptops in third-world countries. On those machines it might take 10 minutes to create an index, and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. Major Lucene releases (feature-wise, not version-wise) happen like once in a year, or year-and-a-half.
Is it that hard for your users to wait ten minutes once a year? I said that was for one index. Multiply that by the number of books available (300+) and yes, it is too much to ask. Even if a small subset is indexed, say 30, that's around 5 hours of waiting. Under consideration is the frequency of breakage. Some are suggesting a greater frequency than yearly. DM -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
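As an aside on the point above that "by and large an analyzer is a simple wrapper for a tokenizer and some filters", here is a minimal self-contained sketch of that composition. The class name, its method, and the whitespace/lowercase/stop-word chain are invented for illustration and deliberately avoid the real Lucene TokenStream API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration of "analyzer = tokenizer + filters"; not Lucene's API.
public class ToyAnalyzer {
    private final Set<String> stopWords;

    public ToyAnalyzer(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    public List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) { // tokenizer: whitespace split
            String t = token.toLowerCase(Locale.ROOT);   // filter 1: lowercase
            if (!stopWords.contains(t)) {                // filter 2: stop-word removal
                out.add(t);
            }
        }
        return out;
    }
}
```

A real Lucene analyzer streams tokens incrementally instead of materializing a list, but the wrapper structure - one tokenizer followed by a chain of filters - is the same.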
Re: Proposal about Version API relaxation
Hello, I think some compatibility breaks should really be accepted, otherwise these requirements are going to kill technological advancement: the effort in backwards compatibility will grow and be more time-consuming and harder every day. A major release won't happen every day, likely not even every year, so it seems acceptable to have milestones defining compatibility boundaries: you need to be able to reset the complexity curve occasionally. Backporting a feature would benefit from being merged in the correct testsuite, and avoid the explosion of this matrix-like backwards compatibility test suite. BTW the current testsuite is likely covering all kinds of combinations which nobody is actually using or caring about. Also, if I were to discover a nice improvement in an Analyzer, and you were telling me that to contribute it I would have to face this amount of complexity... I would think twice before trying; honestly the current requirements are scary. +1 Sanne 2010/4/15 Earwin Burrfoot ear...@gmail.com: I'd like to remind that Mike's proposal has stable branches. We can branch off preflex trunk right now and wrap it up as 3.1. Current trunk is declared as future 4.0 and all backcompat cruft is removed from it. If some new features/bugfixes appear in trunk, and they don't break stuff - we backport them to 3.x branch, eventually releasing 3.2, 3.3, etc. Thus, devs are free to work without back-compat burden, bleeding edge users get their blood, conservative users get their stability + a subset of new features from stable branches.
Re: Proposal about Version API relaxation
Converting stuff is easier than emulating; that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857487#action_12857487 ] DM Smith commented on LUCENE-2396: -- bq. Well, I think asking for a well-defined backwards compatibility policy for 'all analyzers' is asking too much. Things are not so simple and sorted out like they are with English/porter stemming, etc. Some ramblings: I think things need to change/improve wrt analyzers, tokenizers and filters. The current Version mechanism is a roadblock. So is bw compat. I get that. When I asked for a well-defined compatibility policy, I was not suggesting that we go back to the old mechanism or keep the new Version mechanism. Just a clear statement of what the policy is. It might be on a per-class basis. One mechanism that would work is versioned Java package names or class names. The current release would get the good names. If a user wanted the old jar they'd have to get it from the current release (e.g. lucene-analyzers-3.5_3.0.jar) and then change their code to use the old stuff, which now has either a new package name or a new class name. Example: trStemmer.java is going to be changed as the first breaking change since 3.0, so trStemmer3_0.java is created as a copy and then trStemmer.java is changed. The compatibility policy would be that the jar is not a drop-in replacement, but that the old classes still exist, albeit with a different name. I have worked on some contributions w/ bw compat and it is a pain. I didn't like it. And that was both pre-Version and post-Version. I'd like to see Version go away, but I'm not sure I'd like bw compat to go away. As it is, I'm resigning myself that as I use each release of Lucene, I'm going to want more from it and that is likely to require index rebuilds. Right now I'm stuck with the 2.9 series, and what happens until I upgrade to 3.x or 4.x doesn't really impact me. It will impact me then.
I'll figure out how to deal with it and suck it up. Some other things I'd like to see: * Fully controllable Unicode support. The only way I see this is if we use ICU. It will take the Java version problem out of the picture. A user would have control of the version of Unicode by their control of the version of ICU. * An analyzer construction factory that would take a spec (of fields, tokenizers, stop words, stemmers, ...) and spit out a per-field analyzer. This would allow for the deprecation of the analyzers. These and others would be more readily tackled if the bw compat policy did not get in the way. bq. I'll go with the flow, we can stay with what we have now, and the language support will also likely remain weak like it is now. You know I don't want that ;) I was suggesting that this issue should wait to see what the outcome of the general Version discussion is. Even if it is negative, perhaps this can go forward. bq. Currently I feel it's an immense up-front effort to contribute any analysis support, it has to be near-perfect lest it cause future problems, because it's not easy to iterate with the current situation without creating a mess. With new stuff, even in core, if it is marked as experimental, it is outside the bw compat policy. That gives the opportunity to iterate. Dev branches are another good way. But please, keep up the good work!
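The versioned-class-name policy sketched in the comment above could look roughly like this. The stemmer name, the version-suffix convention, and the stemming rules are all hypothetical, chosen only to show the "frozen copy plus evolving current class" shape:

```java
// Hypothetical illustration of the versioned-class-name policy: before the
// first breaking change since 3.0, the old behavior is copied to a
// version-suffixed class, and the unsuffixed class is then free to change.
// "XxStemmer" and both stemming rules are invented for this example.

// Frozen copy: preserves the 3.0 behavior under a new name.
class XxStemmer3_0 {
    public String stem(String word) {
        // old rule: strip only a trailing "s"
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }
}

// Current class: keeps the good name, carries the new (breaking) behavior.
class XxStemmer {
    public String stem(String word) {
        // new rule: also strip a trailing "es"
        if (word.endsWith("es")) return word.substring(0, word.length() - 2);
        if (word.endsWith("s"))  return word.substring(0, word.length() - 1);
        return word;
    }
}
```

Under this policy the jar is not a drop-in replacement: a user who needs the old output must change their code to reference the suffixed class.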
Re: Proposal about Version API relaxation
On 04/15/2010 03:04 PM, Earwin Burrfoot wrote: Converting stuff is easier than emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. Will it be able to be used within a client application that creates and uses local indexes? I'm assuming it will be faster than re-indexing.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857490#action_12857490 ] Robert Muir commented on LUCENE-2396: - {quote} One mechanism that would work is versioned Java package names or class names. The current release would get the good names. If a user wanted the old jar they'd have to get it from the current release (e.g. lucene-analyzers-3.5_3.0.jar) and then change their code to use the old stuff which now has either a new package name or a new class name. Example, trStemmer.java is going to be changed as the first breaking change since 3.0, so trStemmer3_0.java is created as a copy and then trStemmer.java is changed. {quote} Right, but I don't think Lucene should manage this. I think if we assume normally versioned releases, a user with a really complex case that needs multiple versions of Lucene working in the same JVM, like you, could use some other tool (eclipse refactor or maybe google's jarjar) to rename things?
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 23:07, DM Smith dmsmith...@gmail.com wrote: Will it be able to be used within a client application that creates and uses local indexes? I'm assuming it will be faster than re-indexing. As I said earlier in the topic, it is obvious the tool has to have both programmatic and command-line interfaces. I will also reiterate - it only upgrades the index structurally. If you changed your analyzers - that's your problem and you have to deal with it.
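A rough sketch of what a dual-interface upgrade tool along these lines might look like; the Segment model, format numbers, and method names are all invented, and a real tool would rewrite segment files on disk rather than flip an integer:

```java
import java.util.List;

// Hypothetical sketch of the upgrade tool described above: one migration
// routine, callable programmatically or via main(). Only the structure of
// the idea is shown; nothing here is the real Lucene API.
public class UpgradeTool {
    static final int CURRENT_FORMAT = 4; // made-up format number

    static class Segment {
        final String name;
        int format;
        Segment(String name, int format) { this.name = name; this.format = format; }
    }

    // Programmatic interface: upgrade old-format segments in place,
    // leaving the segment structure (count, sizes) untouched - no merging.
    public static int upgrade(List<Segment> segments) {
        int converted = 0;
        for (Segment s : segments) {
            if (s.format < CURRENT_FORMAT) {
                s.format = CURRENT_FORMAT; // stands in for rewriting the files
                converted++;
            }
        }
        return converted;
    }

    // Command-line interface: the same routine behind a main().
    public static void main(String[] args) {
        System.out.println("index dir: " + (args.length > 0 ? args[0] : "."));
    }
}
```

The key property argued for in the thread is that upgrade() never merges: each segment keeps its identity and size distribution, which optimize() would destroy.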
Re: Proposal about Version API relaxation
Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... First, I would need to implement some code to reindex, because my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve it all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: Converting stuff is easier than emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857498#action_12857498 ] DM Smith commented on LUCENE-2396: -- {quote} Right, but I don't think Lucene should manage this. I think if we assume normally versioned releases, a user with a really complex case that needs multiple versions of Lucene working in the same JVM, like you, could use some other tool (eclipse refactor or maybe google's jarjar) to rename things? {quote} I can go along with this. I still think it might be good to let the dust settle on the general Version question before committing.
Re: Proposal about Version API relaxation
But seriously... are you moving across major Lucene releases every single day? If you are using 3.x, how does it hurt you if there is a version 4.x that you can't use without re-indexing? Why wouldn't you just stay on your stable branch (say 3.x)? 2010/4/15 jm jmugur...@gmail.com: Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
There's absolutely no, zero, nada, way to use a modified/fixed analyzer stack without reindexing. If you want it - reindex; if you don't - stick with the stable branch. If your stack is unchanged, but the index format changes - upgrade it with the proposed tool and be happy. Speaking as a happy plain user whose indexes take two days to be fully rebuilt, and who does it (though not always in full) at least once a month.
Re: Proposal about Version API relaxation
On 04/15/2010 03:12 PM, Earwin Burrfoot wrote: On Thu, Apr 15, 2010 at 23:07, DM Smith dmsmith...@gmail.com wrote: On 04/15/2010 03:04 PM, Earwin Burrfoot wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer who changed the index format. It's the difference of when it happens. Converting stuff is easier than emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. Will it be able to be used within a client application that creates and uses local indexes? I'm assuming it will be faster than re-indexing. As I said earlier in the topic, it is obvious the tool has to have both programmatic and command-line interfaces. I will also reiterate - it only upgrades the index structurally. If you changed your analyzers - that's your problem and you have to deal with it. Good. (Sorry I missed that. There's just too much in the thread to keep track of ;) As long as my old analyzers will still work with the new lucene-core jar, I'm fat, dumb and happy with the upgraded index.
Re: Proposal about Version API relaxation
The reason, Earwin, why online migration is faster is because when u finally need to *fully* migrate your index, chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated. During the lifecycle of the index, segments are merged anyway, so migrating them on the fly virtually costs nothing. At the end, when u upgrade to a Lucene version which doesn't support the previous index format, you'll in the worst case need to migrate a few large segments which were never merged. I don't know how many of those there will be as it really depends on the application, but I'd bet this process will touch just a few segments. And hence, throughput-wise it will be a lot faster. We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. Shai On Thursday, April 15, 2010, jm jmugur...@gmail.com wrote: Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... First, i would need to implement some code to reindex, cause my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours.
At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer who changed the index format. It's the difference of when it happens. Converting stuff is easier than emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
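[Editor's note: a minimal sketch of the on-the-fly migration idea discussed above. The migrate() method is only a proposal in this thread, and the Segment class and format fields here are invented for illustration -- they are not Lucene's real internals. The point it shows: only segments still on an old format get rewritten, so an index whose segments were already rewritten by ordinary merges has little left to do.]

```java
import java.util.ArrayList;
import java.util.List;

public class MigrateSketch {
    static final int CURRENT_FORMAT = 4;

    public static class Segment {
        public final String name;
        public int formatVersion;
        public Segment(String name, int formatVersion) {
            this.name = name;
            this.formatVersion = formatVersion;
        }
    }

    /** Rewrites only stale segments; returns the names of segments touched. */
    public static List<String> migrate(List<Segment> segments) {
        List<String> rewritten = new ArrayList<>();
        for (Segment s : segments) {
            if (s.formatVersion < CURRENT_FORMAT) {
                // A real implementation would rewrite this segment's files in
                // the current format (essentially a singleton "merge").
                s.formatVersion = CURRENT_FORMAT;
                rewritten.add(s.name);
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        List<Segment> index = new ArrayList<>();
        index.add(new Segment("_0", 3)); // old large segment, never merged
        index.add(new Segment("_5", 4)); // already current via normal merging
        index.add(new Segment("_7", 3)); // another stale segment
        System.out.println(migrate(index)); // prints [_0, _7]
    }
}
```

Contrast with optimize(): that would merge everything into one segment, destroying the segment-size distribution, whereas this touches only the stale segments and leaves the rest alone.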
Re: Proposal about Version API relaxation
On 04/15/2010 03:25 PM, Shai Erera wrote: We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. What about an index that has already called optimize()? I presume it will be upgraded with whatever is decided?
RE: Proposal about Version API relaxation
Hi Earwin, I am strongly +1 on this. I would also volunteer as Release Manager for 3.1, if nobody else wants to do this. I would like to take the preflex tag or some revisions before (maybe without the IndexWriterConfig, which is a really new API) to be the 3.1 branch. And after that port some of my post-flex changes like the StandardTokenizer refactoring back (so we can produce the old analyzer still without Java 1.4). So +1 on branching pre-flex and releasing it as 3.1 soon. The Unicode improvements justify a new release. I think also s1monw wants to have this. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Earwin Burrfoot [mailto:ear...@gmail.com] Sent: Thursday, April 15, 2010 8:15 PM To: java-dev@lucene.apache.org Subject: Re: Proposal about Version API relaxation I'd like to remind everyone that Mike's proposal has stable branches. We can branch off preflex trunk right now and wrap it up as 3.1. Current trunk is declared as future 4.0 and all backcompat cruft is removed from it. If some new features/bugfixes appear in trunk, and they don't break stuff - we backport them to the 3.x branch, eventually releasing 3.2, 3.3, etc. Thus, devs are free to work without back-compat burden, bleeding edge users get their blood, conservative users get their stability + a subset of new features from stable branches. On Thu, Apr 15, 2010 at 22:02, DM Smith dmsmith...@gmail.com wrote: On 04/15/2010 01:50 PM, Earwin Burrfoot wrote: First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index. And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Optimize is a bad way to convert to current. 1. conversion is not guaranteed, optimizing an already optimized index is a no-op 2. it merges all your segments.
if you use BalancedSegmentMergePolicy, that destroys your segment size distribution Dedicated upgrade tool (available both from command-line and programmatically) is a good way to convert to current. 1. conversion happens exactly when you need it, conversion happens for sure, no additional checks needed 2. it should leave all your segments as is, only changing their format It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. I counter - most non-trivial applications will use their own analyzers. The more modules - the merrier. You can choose precisely what you need. By and large an analyzer is a simple wrapper for a tokenizer and some filters. Are you suggesting that most non-trivial apps write their own tokenizers and filters? I'd find that hard to believe. For example, I don't know enough Chinese, Farsi, Arabic, Polish, ... to come up with anything better than what Lucene has to tokenize, stem or filter these. Our user base are those with ancient, underpowered laptops in 3-rd world countries. On those machines it might take 10 minutes to create an index and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. Major Lucene releases (feature-wise, not version-wise) happen like once in a year, or year-and-a-half. Is it that hard for your users to wait ten minutes once a year? I said that was for one index. Multiply that times the number of books available (300+) and yes, it is too much to ask. Even if a small subset is indexed, say 30, that's around 5 hours of waiting. 
Under consideration is the frequency of breakage. Some are suggesting a greater frequency than yearly. DM -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
2010/4/15 Shai Erera ser...@gmail.com: The reason Earwin why online migration is faster is because when u finally need to *fully* migrate your index, most chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated. During the lifecycle of the index, segments are merged anyway, so migrating them on the fly virtually costs nothing. At the end, when u upgrade to a Lucene version which doesn't support the previous index format, you'll on the worse case need to migrate few large segments which were never merged. I don't know how many of those there will be as it really depends on the application, but I'd bet this process will touch just a few segments. And hence, throughput wise it will be a lot faster. We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. We should not create such an API on IW, and we should build the offline migration tool as a separate thing :) Because otherwise we have to keep all back-compat stuff within IW, SR and friends as it is. Look at current SegmentReader.Norm code - there are three freaking places they can be loaded from. I will also reiterate the issue of the API. Fat index changes are almost certainly accompanied by API changes. With online migration we have to emulate new APIs over old segments, which is really cumbersome. With offline migration we only need to be able to read said segments in one manner or another. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857507#action_12857507 ] Robert Muir commented on LUCENE-2396: - bq. I can go along with this. Cool! bq. I still think it might be good to let the dust settle on the general Version question before committing. Sure... but we should still remember there's really no back compat for the stuff changed in this patch :) I'm also glad you mentioned the unicode issue, i mean if you are doing non-English, some of the ideas in lucene's back compat with analyzers are basically downright silly at the end of the day. Besides the fact that upgrading your JVM can cause java itself to treat text differently (which we currently cannot control), changes to the users operating system [potentially completely outside of the scope of your application!] can cause 'searches that worked before to not work anymore'. For example, if your users upgrade and their new input method generates U+09CE instead of U+09A4 U+09CD U+200D for Khanda-ta, the search won't match, even though perhaps they typed the exact same key on their keyboard. Unicode normalization does nothing in this case, and its your app's responsibility to be aware of stuff like this (Not Lucene's analyzers!) and deal with them. At the end of the day, I think a lot of what lucene considers our own backwards compatibility responsibility necessarily belongs in the app instead. {noformat} Versions of the Unicode Standard prior to Version 4.1 recommended that khanda ta be represented as the sequence U+09A4 bengali letter ta, U+09CD bengali sign virama, U+200D zero width joiner in all circumstances. U+09CE bengali letter khanda ta should instead be used explicitly in newly generated text, but users are cautioned that instances of the older representation may exist. {noformat} remove version from contrib/analyzers. 
-- Key: LUCENE-2396 URL: https://issues.apache.org/jira/browse/LUCENE-2396 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-2396.patch Contrib/analyzers has no backwards-compatibility policy, so let's remove Version so the API is consumable. If you think we shouldn't do this, then instead explicitly state and vote on what the backwards compatibility policy for contrib/analyzers should be, or move it all to core. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Proposal about Version API relaxation
Unfortunately, live searching against an old index can get very hairy. EG look at what I had to do for the flex API on pre-flex indexes (the flex emulation layer). It's also not great because it gives the illusion that all is good, yet, you've taken a silent hit (up to ~10% or so) in your search perf. Whereas building and maintaining a one-time index migration tool, in contrast, is much less work. I realize the migration tool has issues -- it fixes the hard changes but silently allows the soft changes to break (ie, your analyzers may not produce the same tokens, until we move all core analyzers outside of core, so they are separately versioned), but it seems like a good compromise here? Mike 2010/4/15 Shai Erera ser...@gmail.com: The reason Earwin why online migration is faster is because when u finally need to *fully* migrate your index, most chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated.
It would not be 10 min for one of our installations for sure... First, i would need to implement some code to reindex, cause my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer which changed the index format. It's the difference of when does it happen. Converting stuff is easier then emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
2010/4/15 Michael McCandless luc...@mikemccandless.com I realize the migration tool has issues -- it fixes the hard changes but silently allows the soft changes to break (ie, your analyzers may not produce the same tokens, until we move all core analyzers outside of core, so they are separately versioned), but it seems like a good compromise here? Well, let's consider doing that too. Since analyzers have this tough problem of being soft changes, I propose the following: 1. get rid of version 2. minimize the interface between the indexer and analysis 3. put analyzers in their own versioned jar files. This way, we could provide a realistic capability for users to use lucene-3.5.jar with lucene-3.2-analyzers.jar, and possibly have STRONGER analyzer back compat (e.g. if we minimize the damn thing enough, perhaps very old analyzers.jar's could even work across major releases). It's also much safer when you are using the same bytecodes you used before, instead of hairy back compat layers. I don't refer to Uwe's code here: it's perfect, but we can't force Uwe into writing the back compat for every big feature. -- Robert Muir rcm...@gmail.com
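[Editor's note: a sketch of point 2 above -- "minimize the interface between the indexer and analysis" -- under one assumption of what that minimized contract could look like. All names here are invented for illustration; Lucene's real TokenStream API is richer than this. The idea: if the indexer sees nothing but a next-token method, an analyzer jar compiled against an old release has almost no surface area to break.]

```java
import java.util.ArrayList;
import java.util.List;

public class MinimalAnalysisSketch {
    /** Hypothetical minimal contract: the only thing the indexer ever calls. */
    public interface SimpleTokenStream {
        String next(); // next token text, or null when exhausted
    }

    /** A trivial whitespace tokenizer implementing just that contract. */
    public static SimpleTokenStream whitespaceTokens(String text) {
        final String[] parts = text.trim().isEmpty()
            ? new String[0] : text.trim().split("\\s+");
        return new SimpleTokenStream() {
            private int i = 0;
            public String next() { return i < parts.length ? parts[i++] : null; }
        };
    }

    /** Stands in for the indexer: consumes tokens through the minimal API only. */
    public static List<String> index(SimpleTokenStream ts) {
        List<String> terms = new ArrayList<>();
        for (String t = ts.next(); t != null; t = ts.next()) terms.add(t);
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(index(whitespaceTokens("hello   lucene  world")));
        // prints [hello, lucene, world]
    }
}
```

Because the indexer depends on nothing but SimpleTokenStream, any jar shipping an implementation of it -- however old -- keeps working, which is the "lucene-3.5.jar with lucene-3.2-analyzers.jar" scenario from the mail.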
Re: Proposal about Version API relaxation
From IRC: why do I get the feeling that everyone is in heated agreement on the Version thread? there are some cases that mean people will have to reindex in those cases, we should tell people they will have to reindex then they can decide to upgrade or not all other cases, just do the sensible thing and test first I have yet to meet anyone who simply drops a new version into production and says go So, as I said earlier, why don't we just move forward with it, strive to support reading X-1 index format in X and let the user know the cases in which they will have to re-index. If a migration tool is necessary, then someone can write it at the appropriate time. Just as was said w/ the Solr merge, it's software. If it doesn't work, we can change it. Thank goodness we don't have a back compatibility policy for our policies! -Grant On Apr 15, 2010, at 3:35 PM, Michael McCandless wrote: Unfortunately, live searching against an old index can get very hairy. EG look at what I had to do for the flex API on pre-flex index flex emulation layer. It's also not great because it gives the illusion that all is good, yet, you've taken a silent hit (up to ~10% or so) in your search perf. Whereas building maintaining a one-time index migration tool, in contrast, is much less work. I realize the migration tool has issues -- it fixes the hard changes but silently allows the soft changes to break (ie, your analyzers my not produce the same tokens, until we move all core analyzers outside of core, so they are separately versioned), but it seems like a good compromise here? Mike 2010/4/15 Shai Erera ser...@gmail.com: The reason Earwin why online migration is faster is because when u finally need to *fully* migrate your index, most chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated. 
During the lifecycle of the index, segments are merged anyway, so migrating them on the fly virtually costs nothing. At the end, when u upgrade to a Lucene version which doesn't support the previous index format, you'll on the worse case need to migrate few large segments which were never merged. I don't know how many of those there will be as it really depends on the application, but I'd bet this process will touch just a few segments. And hence, throughput wise it will be a lot faster. We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. Shai On Thursday, April 15, 2010, jm jmugur...@gmail.com wrote: Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... First, i would need to implement some code to reindex, cause my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer which changed the index format. It's the difference of when does it happen. Converting stuff is easier then emulating, that's exactly why I want a separate tool. 
There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
I think this should split off the mega-thread :) On Thu, Apr 15, 2010 at 23:28, Uwe Schindler u...@thetaphi.de wrote: Hi Earwin, I am strongly +1 on this. I would also make the Release Manager for 3.1, if nobody else wants to do this. I would like to take the preflex tag or some revisions before (maybe without the IndexWriterConfig, which is a really new API) to be 3.1 branch. And after that port some of my post-flex-changes like the StandardTokenizer refactoring back (so we can produce the old analyzer still without Java 1.4). So +1 on branching pre-flex and release as 3.1 soon. The Unicode improvements rectify a new release. I think also s1monw wants to have this. Uwe -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
On Thu, 15 Apr 2010, Robert Muir wrote: 2010/4/15 Michael McCandless luc...@mikemccandless.com I realize the migration tool has issues -- it fixes the hard changes but silently allows the soft changes to break (ie, your analyzers my not produce the same tokens, until we move all core analyzers outside of core, so they are separately versioned), but it seems like a good compromise here? Well, lets consider doing that too. Since analyzers have this tough problem of being soft changes, I propose the following: 1. get rid of version 2. minimize the interface between the indexer and analysis 3. put analyzers in their own versioned jar files. Yes, every analyzer needs to have its own version and thus, jar file. Putting all analyzers into one versioned jar file joins them at the hip and suffers from the same versioning and compat problems we're currently facing in core. Andi.. this way, we could provide a realistic capability for users to use lucene-3.5.jar with lucene-3.2-analyzers.jar, and possibly have STRONGER analyzer back compat (e.g. if we minimize the damn thing enough, perhaps very old analyzers.jar's could even work across major releases). its also much safer when you are using the same bytecodes you used before, instead of hairy back compat layers. I don't refer to Uwe's code here: its perfect, but we cant force Uwe into writing the back compat for every big feature. -- Robert Muir rcm...@gmail.com
RE: Proposal about Version API relaxation
I wish we could have a face to face talk like in the evenings at ApacheCon :( Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Thursday, April 15, 2010 9:46 PM To: java-dev@lucene.apache.org Subject: Re: Proposal about Version API relaxation From IRC: why do I get the feeling that everyone is in heated agreement on the Version thread? there are some cases that mean people will have to reindex in those cases, we should tell people they will have to reindex then they can decide to upgrade or not all other cases, just do the sensible thing and test first I have yet to meet anyone who simply drops a new version into production and says go So, as I said earlier, why don't we just move forward with it, strive to support reading X-1 index format in X and let the user know the cases in which they will have to re-index. If a migration tool is necessary, then someone can write it at the appropriate time. Just as was said w/ the Solr merge, it's software. If it doesn't work, we can change it. Thank goodness we don't have a back compatibility policy for our policies! -Grant On Apr 15, 2010, at 3:35 PM, Michael McCandless wrote: Unfortunately, live searching against an old index can get very hairy. EG look at what I had to do for the flex API on pre-flex index flex emulation layer. It's also not great because it gives the illusion that all is good, yet, you've taken a silent hit (up to ~10% or so) in your search perf. Whereas building maintaining a one-time index migration tool, in contrast, is much less work. I realize the migration tool has issues -- it fixes the hard changes but silently allows the soft changes to break (ie, your analyzers my not produce the same tokens, until we move all core analyzers outside of core, so they are separately versioned), but it seems like a good compromise here? 
Mike 2010/4/15 Shai Erera ser...@gmail.com: The reason Earwin why online migration is faster is because when u finally need to *fully* migrate your index, most chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated. During the lifecycle of the index, segments are merged anyway, so migrating them on the fly virtually costs nothing. At the end, when u upgrade to a Lucene version which doesn't support the previous index format, you'll on the worse case need to migrate few large segments which were never merged. I don't know how many of those there will be as it really depends on the application, but I'd bet this process will touch just a few segments. And hence, throughput wise it will be a lot faster. We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. Shai On Thursday, April 15, 2010, jm jmugur...@gmail.com wrote: Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... First, i would need to implement some code to reindex, cause my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. 
At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer which changed the index format. It's the difference of when does it happen. Converting stuff is easier then emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
3. put analyzers in their own versioned jar files. Yes, every analyzer needs to have its own version and thus, jar file. Putting all analyzers into one versioned jar file joins them at the hip and suffers from the same versioning and compat problems we're currently facing in core. Andi.. that was actually a typo, sorry :) But maybe not a bad idea for the future. For now simply moving analyzers to its own jar file would be a great step! -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 3:50 PM, Robert Muir rcm...@gmail.com wrote: for now simply moving analyzers to its own jar file would be a great step! +1 -- why not consolidate all analyzers now? (And fix the indexer to require a minimal API = TokenStream minus reset/close.) Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
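Mike's suggested "minimal API = TokenStream minus reset/close" could look roughly like the following sketch: the indexer would only require the ability to step through tokens, nothing more. The interface and names here are illustrative assumptions for this sketch, not Lucene's actual TokenStream API.

```java
import java.util.*;

public class MinimalTokenSource {
    // The minimal contract the indexer would consume: just token iteration,
    // with no reset() and no close().
    interface TokenSource {
        String nextToken(); // returns null when the stream is exhausted
    }

    // A trivial whitespace tokenizer implementing only the minimal contract.
    static TokenSource whitespace(String text) {
        Iterator<String> it = Arrays.asList(text.trim().split("\\s+")).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        TokenSource ts = whitespace("hello lucene world");
        List<String> tokens = new ArrayList<>();
        for (String t = ts.nextToken(); t != null; t = ts.nextToken()) {
            tokens.add(t);
        }
        System.out.println(tokens);
    }
}
```

The appeal of such a narrow interface is that analyzers shipped in a separate jar only need to satisfy this one method to feed the indexer, decoupling their release cycle from core.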
Re: Proposal about Version API relaxation
+1 on the Analyzers as well. Earwin, I think I don't mind if we introduce migrate() elsewhere rather than on IW. What I meant to say is that if we stick w/ index format back-compat and ongoing migration, then such a method would be useful on IW for customers to call to ensure they're on the latest version. But if the majority here agree w/ a standalone tool, then I'm ok if it sits elsewhere. Grant, I'm all for 'just doing it and seeing what happens'. But I think we need to at least decide what we're going to do, so it's clear to everyone. Because I'd like to know, if I'm about to propose an index format change, whether I need to build a migration tool or not. Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them being asked to provide such a tool? That's to me a policy we should decide on ... whatever the consequences. But +1 for changing something! Analyzers first, API second. Shai On Thu, Apr 15, 2010 at 10:52 PM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Apr 15, 2010 at 3:50 PM, Robert Muir rcm...@gmail.com wrote: for now simply moving analyzers to its own jar file would be a great step! +1 -- why not consolidate all analyzers now? (And fix the indexer to require a minimal API = TokenStream minus reset/close.) Mike
Re: Proposal about Version API relaxation
On Apr 15, 2010, at 4:21 PM, Shai Erera wrote: +1 on the Analyzers as well. Earwin, I think I don't mind if we introduce migrate() elsewhere rather than on IW. What I meant to say is that if we stick w/ index format back-compat and ongoing migration, then such a method would be useful on IW for customers to call to ensure they're on the latest version. But if the majority here agree w/ a standalone tool, then I'm ok if it sits elsewhere. Grant, I'm all for 'just doing it and seeing what happens'. But I think we need to at least decide what we're going to do, so it's clear to everyone. Because I'd like to know, if I'm about to propose an index format change, whether I need to build a migration tool or not. Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them being asked to provide such a tool? That's to me a policy we should decide on ... whatever the consequences. As I said, we should strive for index compatibility, but even in the past, we said we did, and yet the implications weren't always clear. I think index compatibility is very important. I've seen plenty of times where reindexing is not possible. But even then, you still have the option of testing to find out whether you can update or not. If you can't update, then don't, until you can figure out how to do it. FWIW, I think our approach is much more proactive than 'see what happens'. I'd argue that in the past, our approach was 'see what happens', only the seeing didn't happen until after the release! -Grant
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 4:21 PM, Shai Erera ser...@gmail.com wrote: Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them being asked to provide such a tool? That's to me a policy we should decide on ... whatever the consequences. Just look at the 1.8MB of backwards-compat code in contrib/analyzers I want to remove in LUCENE-2396? Are you serious? I wrote most of that cruft to prevent reindexing, and you are trying to say I don't understand the fuss about it? We shouldn't make people reindex, but we should have the chance, even if we only do it ONE TIME, to reset Lucene to a new major version that has a bunch of stuff fixed we couldn't fix before, and more flexibility. Because with the current policy, it's like we are in 1.x forever - our version numbers are a joke! -- Robert Muir rcm...@gmail.com