Re: TestCodecs running time
See you already did that Mike :). Thanks! Now the tests run in 2s. Shai On Fri, Apr 9, 2010 at 12:49 PM, Michael McCandless luc...@mikemccandless.com wrote: It's also slow because it repeats all the tests for each of the core codecs (standard, sep, pulsing, intblock). I think it's fine to reduce the number of iterations -- just make sure there's no seed to newRandom() so the distributed testing is effective. Mike On Fri, Apr 9, 2010 at 12:43 AM, Shai Erera ser...@gmail.com wrote: Hi I've noticed that TestCodecs takes an insanely long time to run on my machine - between 35-40 seconds. Is that expected? The reason why it runs so long seems to be that its threads make (each) 4000 iterations ... is that really required to ensure correctness? Shai - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
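The pattern Mike describes can be sketched as follows. This is an illustrative standalone class, not the actual TestCodecs source (the class name, constant, and loop body are hypothetical): cut the per-thread iteration count, but construct the Random without a fixed seed so distributed runs still explore different seeds.

```java
import java.util.Random;

// Illustrative sketch, not the real TestCodecs: fewer iterations per run,
// but an unseeded Random so each distributed test run covers a new seed.
public class CodecTestSketch {
    static final int NUM_TEST_ITER = 200; // reduced from the original 4000

    public static void main(String[] args) {
        Random random = new Random(); // no fixed seed: varies across runs
        int checked = 0;
        for (int i = 0; i < NUM_TEST_ITER; i++) {
            // stand-in for a codec round-trip check on random terms/docs
            int doc = random.nextInt(1000);
            checked++;
        }
        System.out.println("iterations=" + checked);
    }
}
```

The trade-off: any single run exercises fewer cases, but because the seed differs per run, repeated CI runs collectively cover as much of the input space as one long seeded run.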
SnapshotDeletionPolicy throws NPE if no commit happened
SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes, for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of failing with an NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? Shai
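The proposed guard can be sketched like this. The class and field names are hypothetical, not the actual SnapshotDeletionPolicy source; the point is just the check-and-throw shape.

```java
// Hypothetical sketch, not the real SnapshotDeletionPolicy: guard snapshot()
// so callers get a descriptive IllegalStateException instead of an NPE.
public class SnapshotGuardSketch {
    // Stands in for the last IndexCommit the policy saw via onInit/onCommit;
    // null until the first commit happens.
    private Object lastCommit;

    public Object snapshot() {
        if (lastCommit == null) {
            throw new IllegalStateException(
                "no index commit to snapshot: call IndexWriter.commit() (or close()) first");
        }
        return lastCommit;
    }
}
```

With a guard like this, a backup process that calls snapshot() before any commit fails fast with a clear message rather than an NPE from deep inside the policy.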
Re: Proposal about Version API relaxation
We can remove Version, because all incompatible changes go straight to a new major release, which we release more often, yes. 3.x is going to be released after 4.0 if bugs are found and fixed, or if people ask to backport some (minor?) features, and some dev has time for this. The question of what to call major release in X.Y.Z scheme - X or Y, is there, but immaterial :) I think it's okay to settle with X.Y, we have major releases and bugfixes, what can that third number be used for? On Thu, Apr 15, 2010 at 09:29, Shai Erera ser...@gmail.com wrote: So then I don't understand this: {quote} * A major release always bumps the major release number (2.x - 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. {quote} What's different than what's done today? How can we remove Version in that world, if we need to maintain full back-compat between 3.1 and 3.2, index and API-wise? We'll still need to deprecate and come up w/ new classes every time, and we'll still need to maintain runtime changes back-compat. Unless you're telling me we'll start releasing major releases more often? Well ... then we're saying the same thing, only I think that instead of releasing 4, 5, 6, 7, 8 every 6 months, we can release 3.1, 3.2, 3.5 ... because if you look back, every minor release included API deprecations as well as back-compat breaks. That means that every minor release should have been a major release right? Point is, if I understand correctly and you agree w/ my statement above - I don't see why anyone would release a 3.x after 4.0 is out unless someone really wants to work hard on maintaining back-compat of some features. If it's just a numbering thing, then I don't think it matters what is defined as 'major' vs. 'minor'. One way is to define 'major' as X and minor X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. 
I prefer the latter but don't have any strong feelings against the former. Just pointing out that X will grow more rapidly than today. That's all. So did I get it right? Shai On Thu, Apr 15, 2010 at 8:19 AM, Mark Miller markrmil...@gmail.com wrote: I don't read what you wrote and what Mike wrote as even close to the same. - Mark http://www.lucidimagination.com (mobile) On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote: Ahh ... a dream finally comes true ... what a great way to start a day :). +1 !!! I have some questions/comments though: * Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their segments when they move from 2.x to 3.x before 4.0 lands and they'll need to call optimize() to ensure 4.0 still works on their index. I hope that will still be the case? Otherwise I don't see how we can prevent reindexing by apps. ** Index behavioral/runtime changes, like those of Analyzers, are ok to require a reindex, as proposed. So after 3.1 is out, trunk can break the API and 3.2 will have a new set of API? Cool and convenient. For how long do we keep the 3.1 branch around? Also, it used to only fix bugs, but from now on it'll be allowed to introduce new features, if they maintain back-compat? So 3.1.1 can have 'flex' (going for the extreme on purpose) if someone maintains back-compat? I think the back-compat on branches should be only for index runtime changes. There's no point, in my opinion, to maintain API back-compat anymore for jars drop-in, if apps will need to upgrade from 3.1 to 3.1.1 just to get a new feature but get it API back-supported? As soon as they upgrade to 3.2, that means a new set of API right? Major releases will just change the index structure format then? Or move to Java 1.6? Well ... not even that because as I understand it, 3.2 can move to Java 1.6 ... no API back-compat right :). That's definitely a great step forward ! 
Shai On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda va...@osafoundation.org wrote: On Thu, 15 Apr 2010, Earwin Burrfoot wrote: Can't believe my eyes. +1 Likewise. +1 ! Andi.. On Thu, Apr 15, 2010 at 01:22, Michael McCandless luc...@mikemccandless.com wrote: On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey mar...@rectangular.com wrote: Essentially, we're free to break back compat within Lucy at any time, but we're not able to break back compat within a stable fork like Lucy1, Lucy2, etc. So what we'll probably do during normal development with Analyzers is just change them and note the break in the Changes file. So... what if we change up how we develop and release Lucene: * A major release always bumps the major release number (2.x - 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs),
Re: SnapshotDeletionPolicy throws NPE if no commit happened
We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears. On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote: SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of falling on NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: SnapshotDeletionPolicy throws NPE if no commit happened
Well ... one can still call commit() or close() right after IW creation. And this is a very rare case to be hit by. Was just asking about whether we want to add an explicit and clear protective code about it or not. Shai On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.com wrote: We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears. On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote: SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of falling on NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
Well ... I think that version numbers mean more than we'd like them to mean, as people perceive them. Let's discuss the format X.Y.Z: When X is changed, it should mean something 'big' happened - index structure has changed (e.g. the flexible scoring work), new Java version supported (Java 1.6) and even stuff like 'flex' which includes statements like if you don't want your app to slow down, consider reindexing. Such things signal a major change in Lucene, sometimes even just policy changes (Java version supported) and therefore I think we should reserve the ability to bump X when such things happen. Another thing is the index structure back-compat policy - today Lucene supports X-1 index structure, but during upgrades of X.Y versions, your segments are gradually migrated. Eventually, when you upgrade to 4.0 you should know whether you have a 2.x index, and call optimize just in case if you're not sure it's not migrated yet (if you've upgraded to 3.x). If we start bumping up 'X' too often, we'll either need to change the X-1 policy to X-N, which will just complicate matters for users. Or we'll keep the X-1 policy, but people will need to call optimize more frequently. Y should change on a regular basis, and no back-compat API-wise or index runtime-wise is guaranteed. So the Collector and per-segment searches in 2.9 could go w/o deprecating tons of API, so is the TokenStream work. Changes to Analyzer's runtime capabilities will also be allowed between Y revisions. Z should change when bugfixes are fixed, or when features are backported. Really ... we rarely fix bugs on a released Y branch, and I don't expect tons of features will be backported to a Y branch (to create a Z+1 release). Therefore this should not confuse anyone. So all I'm saying is that instead of increasing X whenever the API, index structure or runtime behavior has changed, I'm simply proposing to differentiate between really major changes to those that just say 'we're not back-compat compliant'. 
But above all, I'd like to see this change happening, so if I need to surrender to the X vs. X+Y approach, I will. Just think it will create some confusion. BTW, w/ all that - does it mean 'backwards' can be dropped, or at least test-backwards activated only on a branch which we decide needs it? That'll be really great. Shai On Thu, Apr 15, 2010 at 10:24 AM, Earwin Burrfoot ear...@gmail.com wrote: We can remove Version, because all incompatible changes go straight to a new major release, which we release more often, yes. 3.x is going to be released after 4.0 if bugs are found and fixed, or if people ask to backport some (minor?) features, and some dev has time for this. The question of what to call major release in X.Y.Z scheme - X or Y, is there, but immaterial :) I think it's okay to settle with X.Y, we have major releases and bugfixes, what that third number can be used for? On Thu, Apr 15, 2010 at 09:29, Shai Erera ser...@gmail.com wrote: So then I don't understand this: {quote} * A major release always bumps the major release number (2.x - 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. {quote} What's different than what's done today? How can we remove Version in that world, if we need to maintain full back-compat between 3.1 and 3.2, index and API-wise? We'll still need to deprecate and come up w/ new classes every time, and we'll still need to maintain runtime changes back-compat. Unless you're telling me we'll start releasing major releases more often? Well ... then we're saying the same thing, only I think that instead of releasing 4, 5, 6, 7, 8 every 6 months, we can release 3.1, 3.2, 3.5 ... because if you look back, every minor release included API deprecations as well as back-compat breaks. That means that every minor release should have been a major release right? 
Point is, if I understand correctly and you agree w/ my statement above - I don't see why anyone would release a 3.x after 4.0 is out unless someone really wants to work hard on maintaining back-compat of some features. If it's just a numbering thing, then I don't think it matters what is defined as 'major' vs. 'minor'. One way is to define 'major' as X and minor X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former. Just pointing out that X will grow more rapidly than today. That's all. So did I get it right? Shai On Thu, Apr 15, 2010 at 8:19 AM, Mark Miller markrmil...@gmail.com wrote: I don't read what you wrote and what Mike wrote as even close to the same. - Mark http://www.lucidimagination.com (mobile) On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote: Ahh ... a dream finally comes true ... what
Re: SnapshotDeletionPolicy throws NPE if no commit happened
BTW, even if it's a stupid thing to do, someone can today create SDP and call snapshot without ever creating IW. And it's not an impossible scenario. Consider a backup-aware application which creates SDP first, then passes it to the indexing process and the backup process, separately. The backup process doesn't need to know of IW at all, and might call snapshot() before IW was even created, and SDP.onInit was called. It's a possibility, not saying it's a great and safe architecture. So this is really about do we want to write clear protective code, or allow the NPE? Shai 2010/4/15 Shai Erera ser...@gmail.com Well ... one can still call commit() or close() right after IW creation. And this is a very rare case to be hit by. Was just asking about whether we want to add an explicit and clear protective code about it or not. Shai On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.comwrote: We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears. On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote: SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of falling on NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: SnapshotDeletionPolicy throws NPE if no commit happened
Presumably you'd also hit this exception if the DP deletes all commit points, right? I like IllegalStateException. Mike 2010/4/15 Shai Erera ser...@gmail.com: BTW, even if it's a stupid thing to do, someone can today create SDP and call snapshot without ever creating IW. And it's not an impossible scenario. Consider a backup-aware application which creates SDP first, then passes it to the indexing process and the backup process, separately. The backup process doesn't need to know of IW at all, and might call snapshot() before IW was even created, and SDP.onInit was called. It's a possibility, not saying it's a great and safe architecture. So this is really about do we want to write clear protective code, or allow the NPE? Shai 2010/4/15 Shai Erera ser...@gmail.com Well ... one can still call commit() or close() right after IW creation. And this is a very rare case to be hit by. Was just asking about whether we want to add an explicit and clear protective code about it or not. Shai On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.com wrote: We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears. On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote: SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of falling on NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? 
Shai
Re: TestCodecs running time
Yah :) TestStressIndexing2 is another slow one... I'll go fix it... Mike On Thu, Apr 15, 2010 at 2:15 AM, Shai Erera ser...@gmail.com wrote: See you already did that Mike :). Thanks! Now the tests run in 2s. Shai On Fri, Apr 9, 2010 at 12:49 PM, Michael McCandless luc...@mikemccandless.com wrote: It's also slow because it repeats all the tests for each of the core codecs (standard, sep, pulsing, intblock). I think it's fine to reduce the number of iterations -- just make sure there's no seed to newRandom() so the distributed testing is effective. Mike On Fri, Apr 9, 2010 at 12:43 AM, Shai Erera ser...@gmail.com wrote: Hi I've noticed that TestCodecs takes an insanely long time to run on my machine - between 35-40 seconds. Is that expected? The reason why it runs so long seems to be that its threads make (each) 4000 iterations ... is that really required to ensure correctness? Shai
Re: Proposal about Version API relaxation
2010/4/15 Shai Erera ser...@gmail.com: One way is to define 'major' as X and minor X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former. I prefer X.Y, ie, changes to Y only is a minor release (mostly bug fixes but maybe small features); changes to X is a major release. I think that's more standard, ie, people will generally grok that 3.3 - 4.0 is a major change but 3.3 - 3.4 isn't. So this proposal would change how Lucene releases are numbered. Ie, the next release would be 4.0. Bug fixes / small features would then be 4.1. Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. No... in the proposal, you must re-index on upgrading to the next major release (3.x - 4.0). I think supporting old indexes, badly (what we do today) is not a great solution. EG on upgrading to 3.1 you'll immediately see a search perf hit since the flex emulation layer is running. It's a trap. It's this freedom, I think, that'd let us drop Version entirely. It's the back-compat of the index that is the major driver for having Version today (eg so that the analyzers can produce tokens matching your old index). EG Terrier seems to have the same requirement -- note the bold All indexes must be rebuilt: http://terrier.org/docs/current/whats_new.html Also, Lucene isn't a primary store (like a filesytem or a database). We expect that your true content still lives somewhere else. So why do we go to such great lengths to keep the index format for so long...? BTW, w/ all that - does it mean 'backwards' can be dropped, or at least test-backwards activated only on a branch which we decide needs it? That'll be really great. I think the stable branches (2.x, 3.x) would have backwards tests created the moment they are branched, to make sure as we fix bugs / backport minor features we don't break back compat, along that branch. 
I don't think we need the .Z part of a release numbering -- our numbers would look like most other software projects. 3.0 is a major release, 3.1, 3.2, 3.3 fix bugs / add minor features, etc. If flex were done in this world I would've finished it a lot faster! A huge amount of time went into the cross back compat emulation layers (pre-flex APIs and pre-flex index). Also, we will still need to maintain the Backwards section in CHANGES (or move it to API Changes), to help people upgrade from release to release. I think we'd create a migration guide to explain how apps migrate to the next major release (this is what other projects do), eg like this: http://community.jboss.org/wiki/Hibernate3MigrationGuides#A42 Unless you're telling me we'll start releasing major releases more often? I think this is mostly orthogonal? We could still do major releases frequently or rarely with this model... however, it would give us more freedom to do major releases frequently (vs today where every major release sets a scary back-compat-burden stake in the ground). I don't see why would anyone releases a 3.x after 4.0 is out unless someone really wants to work hard on maintaining back-compat of some features I think the minor releases on the stable branch (3.1, 3.2, 3.3) would be mostly bug fixes, but maybe also minor features if contributors/developers had the itch to make them available on the stable (3.x) branch. How much dev happens on the stable branch can be largely determined by itch... Mike
Merging the Mailing Lists
Looks like we are ready to go to merge the Lucene and Solr dev mailing lists. The new list will be d...@lucene.apache.org. All existing subscribers will automatically be subscribed to the new list. For more info, see https://issues.apache.org/jira/browse/INFRA-2567. -Grant
[jira] Resolved: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1278. Resolution: Won't Fix I think the pulsing codec (wraps any other codec, but inlines low-freq terms directly into the terms dict) solves this? Add optional storing of document numbers in term dictionary --- Key: LUCENE-1278 URL: https://issues.apache.org/jira/browse/LUCENE-1278 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.3.1 Reporter: Jason Rutherglen Priority: Minor Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, TestTermEnumDocs.java Add optional storing of document numbers in term dictionary. String index field cache and range filter creation will be faster. Example read code: {noformat} TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS); do { Term term = termEnum.term(); if (term == null || term.field() != field) break; int[] docs = termEnum.docs(); } while (termEnum.next()); {noformat} Example write code: {noformat} Document document = new Document(); document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS)); indexWriter.addDocument(document); {noformat} -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (LUCENE-2395) Add a scoring DistanceQuery that does not need caches and separate filters
Add a scoring DistanceQuery that does not need caches and separate filters -- Key: LUCENE-2395 URL: https://issues.apache.org/jira/browse/LUCENE-2395 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Uwe Schindler Fix For: 3.1 In a chat with Chris Male and my own ideas when implementing for PANGAEA, I thought about the broken distance query in contrib. It lacks the following features: - It needs a query for the enclosing bbox (which is constant score) - It needs a separate filter for filtering out distances - It has no scoring, so if somebody wants to sort by distance, he needs to use the custom sort. For that to work, spatial caches distance calculation (which is broken for multi-segment search) The idea is now to combine all three things into one query, but customizable: We first thought about extending CustomScoreQuery and calculate the distance from FieldCache in the customScore method and return a score of 1 for distance=0, score=0 on the max distance and score<0 for farther hits, that are in the bounding box but not in the distance circle. To filter out such negative scores, we would need to override the scorer in CustomScoreQuery which is private. My proposal is now to use a very stripped down CustomScoreQuery (but not extend it) that does call a method getDistance(docId) in its scorer's advance and nextDoc that calculates the distance for the current doc. It stores this distance also in the scorer. If the distance > maxDistance it throws away the hit and calls nextDoc() again. The score() method will return per default weight.value*(maxDistance - distance)/maxDistance and uses the precalculated distance. So the distance is only calculated one time in nextDoc()/advance(). 
To be able to plug in custom scoring, the following methods in the query can be overridden: - float getDistanceScore(double distance) - returns per default: (maxDistance - distance)/maxDistance; allows score customization - DocIdSet getBoundingBoxDocIdSet(Reader, LatLng sw, LatLng ne) - returns a DocIdSet for the bounding box. Per default it returns e.g. the docIdSet of a NRF or a cartesian tier filter. You can even plug in any other DocIdSet, e.g. wrap a Query with QueryWrapperFilter - support a setter for the GeoDistanceCalculator that is used by the scorer to get the distance. This query is almost finished in my head, it just needs coding :-)
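The proposed default distance-to-score mapping can be sketched as a standalone illustration (class and method names are assumptions, not the eventual Lucene code): hits at the center score 1.0, hits at maxDistance score 0.0, and hits beyond maxDistance are rejected, which in the real scorer would mean calling nextDoc() again.

```java
// Illustrative sketch of the proposed default scoring in LUCENE-2395.
// Only the arithmetic is shown; the weight.value factor and the scorer
// plumbing around nextDoc()/advance() are omitted.
public class DistanceScoreSketch {
    // Default mapping: (maxDistance - distance) / maxDistance
    static float getDistanceScore(double distance, double maxDistance) {
        return (float) ((maxDistance - distance) / maxDistance);
    }

    // Hits outside the distance circle are thrown away by the scorer.
    static boolean accept(double distance, double maxDistance) {
        return distance <= maxDistance;
    }
}
```

Note how the formula naturally goes negative beyond maxDistance, which is exactly why the proposal filters such hits in the scorer rather than letting negative scores leak into the result set.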
[jira] Commented: (LUCENE-2395) Add a scoring DistanceQuery that does not need caches and separate filters
[ https://issues.apache.org/jira/browse/LUCENE-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857278#action_12857278 ] Chris Male commented on LUCENE-2395: +1 This will replace the work I was doing on improving the DistanceFilter and the DistanceSortSource. Instead we will have a proper DistanceQuery where the sorting is done through the existing sorting-by-score functionality in Lucene. The CartesianShapeFilter will then be able to be used as a Filter with the new Query. This also addresses the current problems with caching calculated distances and means that Spatial will work per segment. Add a scoring DistanceQuery that does not need caches and separate filters -- Key: LUCENE-2395 URL: https://issues.apache.org/jira/browse/LUCENE-2395 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Uwe Schindler Fix For: 3.1 In a chat with Chris Male and my own ideas when implementing for PANGAEA, I thought about the broken distance query in contrib. It lacks the following features: - It needs a query for the enclosing bbox (which is constant score) - It needs a separate filter for filtering out distances - It has no scoring, so if somebody wants to sort by distance, he needs to use the custom sort. For that to work, spatial caches distance calculation (which is broken for multi-segment search) The idea is now to combine all three things into one query, but customizable: We first thought about extending CustomScoreQuery and calculate the distance from FieldCache in the customScore method and return a score of 1 for distance=0, score=0 on the max distance and score<0 for farther hits, that are in the bounding box but not in the distance circle. To filter out such negative scores, we would need to override the scorer in CustomScoreQuery which is private. 
My proposal is now to use a very stripped down CustomScoreQuery (but not extend it) that does call a method getDistance(docId) in its scorer's advance and nextDoc that calculates the distance for the current doc. It stores this distance also in the scorer. If the distance > maxDistance it throws away the hit and calls nextDoc() again. The score() method will return per default weight.value*(maxDistance - distance)/maxDistance and uses the precalculated distance. So the distance is only calculated one time in nextDoc()/advance(). To be able to plug in custom scoring, the following methods in the query can be overridden: - float getDistanceScore(double distance) - returns per default: (maxDistance - distance)/maxDistance; allows score customization - DocIdSet getBoundingBoxDocIdSet(Reader, LatLng sw, LatLng ne) - returns a DocIdSet for the bounding box. Per default it returns e.g. the docIdSet of a NRF or a cartesian tier filter. You can even plug in any other DocIdSet, e.g. wrap a Query with QueryWrapperFilter - support a setter for the GeoDistanceCalculator that is used by the scorer to get the distance. This query is almost finished in my head, it just needs coding :-)
Re: Proposal about Version API relaxation
Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. Up until now, Lucene migrated my segments gradually, and before I upgraded from X+1 to X+2 I could run optimize() to ensure my index will be readable by X+2. I don't think I can myself agree to it, let alone convince all the stakeholders in my company who adopt Lucene today in numerous projects, to let go of such capability. We've been there before (requiring reindexing on version upgrades) w/ some offerings and customers simply didn't like it and were forced to use an enterprise-class search engine which offered less (and didn't use Lucene, up until recently !). Until we moved to Lucene ... What's Solr's take on it? I differentiate between structural changes and runtime changes. I, myself, don't mind if we let go of back-compat support for runtime changes, such as those generated by analyzers. For a couple of reasons, the most important ones are (1) these are not so frequent (but so is index structural change) and (2) that's a decision I, as the application developer, makes - using or not a newer version of an Analyzer. I don't mind working hard to make a 2.x Analyzer version work in the 3.x world, but I cannot make a 2.x index readable by a 3.x Lucene jar, if the latter doesn't support it. That's the key difference, in my mind, between the two. I can choose not to upgrade at all to a newer analyzer version ... but I don't want to be forced to stay w/ older Lucene versions and features because of that ... well people might say that it's not Lucene's problem, but I beg to differ. 
Lucene benefits from wider and faster adoption and we rely on new features to be adopted quickly. That might be jeopardized if we let go of that strong capability, IMO. What we can do is provide an index migration tool ... but personally I don't know what's the difference between that and gradually migrating segments as they are merged, code-wise. I mean - it has to be the same code. Only an index migration tool may take days to complete on a very large index, while the ongoing migration takes ~0 time when you come to upgrade to a newer Lucene release. And the note about Terrier requiring reindexing ... well I can't say it's a strength of it but a damn big weakness IMO. About the release pace, I don't think we can suddenly release every 2 years ... makes people think the project is stuck. And some out there are not so fond of using a 'trunk' version and release it w/ their products because trunk is perceived as ongoing development (which it is) and thus less stable, or is likely to change and most importantly harder to maintain (as the consumer). So I still think we should release more often than not. That's why I wanted to differentiate X and Y, but I don't mind if we release just X ... if that's so important to people. BTW Mike, Eclipse's releases are like Lucene, and in fact I don't know of so many projects that just release X ... many of them seem to release X.Y. I don't understand why we're treating this as an all-or-nothing thing. We can let go of API back-compat, which clearly has no effect on index structure and content. We can even let go of index runtime changes for all I care. But I simply don't think we can let go of index structure back-support. Shai On Thu, Apr 15, 2010 at 1:12 PM, Michael McCandless luc...@mikemccandless.com wrote: 2010/4/15 Shai Erera ser...@gmail.com: One way is to define 'major' as X and minor X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former. 
I prefer X.Y, ie, changes to Y only is a minor release (mostly bug fixes but maybe small features); changes to X is a major release. I think that's more standard, ie, people will generally grok that 3.3 - 4.0 is a major change but 3.3 - 3.4 isn't. So this proposal would change how Lucene releases are numbered. Ie, the next release would be 4.0. Bug fixes / small features would then be 4.1. Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. No... in the proposal, you must re-index on upgrading to the next major release (3.x - 4.0). I think supporting old indexes, badly (what we do today) is not a great solution. EG on upgrading to 3.1 you'll immediately see a search perf hit since the flex emulation layer is running. It's a trap. It's this freedom, I think, that'd let us drop Version entirely. It's the back-compat of the index that is the major driver for having Version today (eg so that the analyzers can produce tokens matching your old index). EG Terrier seems
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how it's helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
Sometimes it's REALLY impossible to reindex, or it has an absolutely prohibitive cost to do in a running production system (I can't shut it down for maintenance, so I need a lot of hardware to reindex ~5 billion documents; I have no idea what the costs are to retrieve that data all over again, but I estimate it to be quite a lot). And providing a way to migrate existing indexes to new Lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer Lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, I just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old Lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
It's open source: if you feel this way, you can put in the work to add features to some version branch from trunk in a backwards-compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. But this kinda stuff shouldn't hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. 
Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
I think an index upgrade tool is okay? While you still definitely have to code it, things like if idxVer==m doOneStuff elseif idxVer==n doOtherStuff else blowUp are kept away from Lucene innards and we all profit? On Thu, Apr 15, 2010 at 16:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. 
It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
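Earwin's point in the message above - keep the per-version dispatch (if idxVer==m doOneStuff elseif idxVer==n doOtherStuff else blowUp) in a standalone upgrade tool, away from Lucene innards - could be sketched roughly as below. The class name, format version numbers, and converter-step strings are all hypothetical, not an actual Lucene API.

```java
// Hypothetical sketch of a standalone index-conversion tool: all version
// dispatch lives here, so the core library only ever reads CURRENT_FORMAT.
import java.util.ArrayList;
import java.util.List;

public class IndexUpgraderSketch {
    static final int CURRENT_FORMAT = 4; // invented for illustration

    // Returns the list of conversion steps needed to bring an index
    // from formatVersion up to CURRENT_FORMAT, or fails loudly ("blowUp").
    static List<String> planUpgrade(int formatVersion) {
        if (formatVersion > CURRENT_FORMAT) {
            throw new IllegalArgumentException(
                "Index format " + formatVersion + " is newer than this tool");
        }
        List<String> steps = new ArrayList<>();
        // Each hop is one self-contained converter step.
        for (int v = formatVersion; v < CURRENT_FORMAT; v++) {
            steps.add("convert-" + v + "-to-" + (v + 1));
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(planUpgrade(2)); // [convert-2-to-3, convert-3-to-4]
        System.out.println(planUpgrade(4)); // [] -- already current
    }
}
```

The point of the shape: an index that is several versions behind is walked forward one hop at a time, and an index the tool doesn't understand fails immediately instead of being half-read by emulation code inside the library.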
Re: Proposal about Version API relaxation
Thanks Danil - you reminded me of another reason why reindexing is impossible - fetching the data, even if it's available, is too damn costly. Robert, I think you're driven by Analyzer changes ... been too much around them I'm afraid :). A major version upgrade is a move to Java 1.5 for example. I can do that, and I don't see why I need to reindex my data because of that. And I simply don't buy that "do this work on your own" ... people can take a snapshot of the code, maintain it separately and you'll never hear back from them. Who benefits - neither! It's open source - true, but it's way past the "Hey look, I'm a new open source project w/ a dozen users, I can do whatever I want" stage. Lucene is a respected open source project, w/ serious adoption and deployments. People trust the select few committers here to do it right for them, so they don't need to invest the time and resources in developing core IR stuff. And now you're pushing a "do it yourself" approach? I simply don't get or buy it. When were you ever stuck maintaining a back-compat change because the index structure changed? I bet not so many of us, or shall I say just the few Mikes out there? So how hard is it to require such back-compat support? I wholeheartedly agree that we shouldn't keep back-compat on Analyzer changes, nor on bugs such as that one which changed the position of the field from -1 to 0 (a while ago - don't remember the exact details). Shai On Thu, Apr 15, 2010 at 3:17 PM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. 
I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
I can live w/ that Earwin ... I prefer the ongoing upgrades still, but I won't hold off the back-compat policy change vote because of that. Shai On Thu, Apr 15, 2010 at 3:30 PM, Earwin Burrfoot ear...@gmail.com wrote: I think an index upgrade tool is okay? While you still definetly have to code it, things like if idxVer==m doOneStuff elseif idxVer==n doOtherStuff else blowUp are kept away from lucene innards and we all profit? On Thu, Apr 15, 2010 at 16:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. 
Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... 
what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
I like the idea of index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. 
On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
I think you guys miss the entire point. The idea that you can keep getting all the new features without reindexing is merely an illusion. Instead, features simply aren't being added at all, because the policy makes it too cumbersome. Why is it problematic to have a different SVN branch/release, with lots of new features, that requires you to reindex and change your app? If it's too difficult to reindex, the fact that features exist elsewhere that you cannot access doesn't break your app. It's the same as it is today: there are features you cannot access, except they do not even exist in Apache SVN at all, even trunk, because of these problems. On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot ear...@gmail.com wrote: I like the idea of index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. 
On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. 
-- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot ear...@gmail.com wrote: I like the idea of index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. 
It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
Well ... I could argue that it's you who miss the point :). I completely don't buy the "all the new features" comment -- how many new features are in a major release which force you to consider reindexing? Yet there are many of them that change the API. How will I know whether a release supports my index or not? Why do I need to work hard to back-port all the newly developed issues onto a branch I use? How many of those branches will exist? Will they all run nightly unit tests? Can I cut a release of such a branch myself? Or will I need the PMC or a VOTE? This will get complicated pretty fast ... Lucene is not a "do it yourself" kit - we try so hard to have the best defaults, best out of the box experience ... best everything for our users. Even w/ Analyzers we try so damn hard. While we could have simply componentized everything and told the users you can use those filters, tokenizers, segment mergers, policies etc. to make up your indexing application ... And I don't think there are features out there that exist and are not contributed because people are afraid of the index format changes ... obviously if they have done it, they're past the fear of handling index format ... I'd like to hear of one such feature. I'd bet there are such out there that are not contributed for IP, business, and laziness reasons. Shai On Thu, Apr 15, 2010 at 3:56 PM, Robert Muir rcm...@gmail.com wrote: I think you guys miss the entire point. The idea that you can keep getting all the new features without reindexing is merely an illusion Instead, features simply aren't being added at all, because the policy makes it too cumbersome. Why is it problematic to have a different SVN branch/release, with lots of new features, but requires you to reindex and change your app? If its too difficult to reindex, it doesnt break your app that features exist elsewhere that you cannot access. 
It's the same as it is today, there are features you cannot access, except they do not even exist in Apache SVN at all, even trunk, because of these problems. On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot ear...@gmail.com wrote: I like the idea of an index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: it's open source, if you feel this way, you can put in the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldn't hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintenance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore.
It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from
Re: Proposal about Version API relaxation
I realize that just transforming an old index won't give me anything new. The applications usually evolve. Let's take as example 2.9 (relatively few changes in index structure, but Trie was a nice addition, per segment search and reload was a blessing): - There are 4 billion documents which don't have numeric ranges (but those still got faster reopen) - But for next 1 billion documents in another index i do have numeric ranges. The whole application works in ONE environment from same codebase. Splitting it into several environments based on whatever version of lucene happened to be current at index creation date, and maintaining branches of code would be quite a PITA for a developer (and very error prone) So yeah, I won't get new features for old indexes if i transform them to new format, but new indexes will be able to use them. And my application as a whole will be much cleaner and easier to maintain (I'm a lazy developer that thinks that he is already overworked) I just want my system as a whole to evolve together with lucene without dropping the indexes I already have, keeping tens of branches of code, or remembering how things worked back in 2005 just to slightly modify the analyzer because data in 2010 changed a bit. Danil. On Thu, Apr 15, 2010 at 15:56, Robert Muir rcm...@gmail.com wrote: I think you guys miss the entire point. The idea that you can keep getting all the new features without reindexing is merely an illusion. Instead, features simply aren't being added at all, because the policy makes it too cumbersome. Why is it problematic to have a different SVN branch/release, with lots of new features, but requires you to reindex and change your app? If it's too difficult to reindex, it doesn't break your app that features exist elsewhere that you cannot access. Its the same as it is today, there are features you cannot access, except they do not even exist in apache SVN at all, even trunk, because of these problems.
On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot ear...@gmail.com wrote: I like the idea of index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. 
On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
True. Just need the tool. On Thu, Apr 15, 2010 at 16:39, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. It's still harder. Consider a common scenario where you have one master and the index being replicated to multiple slaves. One would need to stop replication to an upgraded slave until the master is also upgraded. Some people can't even stop replication because they use something like a SAN to share the index. I'm just pointing out that there is a lot of value for many people in back-compatible indexes... I'm not trying to make any points about when that back compat should be broken. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
wrong, it doesn't fix the analyzers problem. you need to reindex. On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
+1 On Apr 14, 2010, at 5:22 PM, Michael McCandless wrote: On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey mar...@rectangular.com wrote: Essentially, we're free to break back compat within Lucy at any time, but we're not able to break back compat within a stable fork like Lucy1, Lucy2, etc. So what we'll probably do during normal development with Analyzers is just change them and note the break in the Changes file. So... what if we change up how we develop and release Lucene: * A major release always bumps the major release number (2.x -> 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. This would match how many other projects work (KS/Lucy, as Marvin describes above; Apache Tomcat; Hibernate; log4j; FreeBSD; etc.). The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and, if any devs have the itch, they could freely back-port improvements from trunk as long as they kept back-compat within the branch. I think in such a future world, we could: * Remove Version entirely! * Not worry at all about back-compat when developing on trunk * Give proper names to new improved classes instead of StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing today; rename existing classes. * Let analyzers freely, incrementally improve * Use interfaces without fear * Stop spending the truly substantial time (look @ Uwe's awesome back-compat layer for analyzers!) that we now must spend when adding new features, for back-compat * Be more free to introduce very new not-fully-baked features/APIs, marked as experimental, on the expectation that once they are used (in trunk) they will iterate/change/improve vs trying so hard to get things right on the first go for fear of future back compat horrors. Thoughts...?
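Read as a compatibility rule, the proposal above (no index or API back compat across major releases, full back compat along a branch) can be sketched like this; the function name and the (major, minor) tuple encoding are purely illustrative, not any actual Lucene API:

```python
def can_read_index(index_version, library_version):
    """Return True if a library at `library_version` can read an index
    written at `index_version`, under the proposed policy: full back
    compat within a major branch, none across major releases.

    Versions are (major, minor) tuples; along a branch, a newer library
    can read indexes written by the same or an older minor release.
    """
    index_major, index_minor = index_version
    lib_major, lib_minor = library_version
    if index_major != lib_major:
        return False  # no index back compat across major releases
    # full back compat within the branch (no forward compat, though:
    # a 3.1 library cannot read a 3.3 index)
    return index_minor <= lib_minor
```

For example, a 3.3 library would read a 3.1 index directly, while a 3.0 library facing a 2.9 index would need a conversion tool or a reindex.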
Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2396) remove version from contrib/analyzers.
remove version from contrib/analyzers. -- Key: LUCENE-2396 URL: https://issues.apache.org/jira/browse/LUCENE-2396 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Robert Muir Contrib/analyzers has no backwards-compatibility policy, so let's remove Version so the API is consumable. if you think we shouldn't do this, then instead explicitly state and vote on what the backwards compatibility policy for contrib/analyzers should be instead, or move it all to core. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
I do think major versions should be able to read the previous version index. Still, even being able to do that is no guarantee that it will produce correct results. Likewise, even having an upgrade tool is no guarantee that correct results will be produced. So, my take is that we strive for it, but we all have to realize, and document, that it might not always be possible. Let's just be practical and pragmatic. Past history indicates we are capable of, for the most part, reading the prev. version index and upgrading it. If it can't be done automatically, then we can consider a tool. If the tool won't work, then we will have to reindex. It doesn't have to be an all-or-nothing decision made in the void. We've always been very practical here about making decisions on problems that are directly facing us, so I would suggest we move forward with the new approach (which I agree makes more sense and is pretty prevalent across a lot of projects) and we take this issue on a case-by-case basis. -Grant On Apr 15, 2010, at 9:49 AM, Yonik Seeley wrote: On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. It's still harder. Consider a common scenario where you have one master and the index being replicated to multiple slaves. One would need to stop replication to an upgraded slave until the master is also upgraded. Some people can't even stop replication because they use something like a SAN to share the index. I'm just pointing out that there is a lot of value for many people in back-compatible indexes... I'm not trying to make any points about when that back compat should be broken.
-Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
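Grant's case-by-case position amounts to a small decision cascade: read the old index directly where possible, fall back to an offline conversion tool, and only then require a full reindex. A hypothetical sketch (the function and its string results are illustrative only, not any real Lucene API):

```python
def migration_strategy(index_major, lib_major, tool_available):
    """Pick a migration path case by case: direct read, then upgrade
    tool, then full reindex as the last resort."""
    if index_major == lib_major:
        return "read directly"     # same major branch: nothing to do
    if index_major == lib_major - 1 and tool_available:
        return "run upgrade tool"  # one major behind, and a tool shipped
    return "reindex"               # no automatic path left
```

The point of the cascade is that "reindex" is reached only after the cheaper options are ruled out, which matches the "practical and pragmatic" framing above.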
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857321#action_12857321 ] Robert Muir commented on LUCENE-2396: - Additionally, i would like to remove all CHANGES from backwards compatibility policy from contrib/CHANGES. contrib has no backwards compatibility policy, so it makes no sense. these are just ordinary changes for Contrib. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 17:49, Robert Muir rcm...@gmail.com wrote: wrong, it doesn't fix the analyzers problem. you need to reindex. On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. Couldn't care less about analyzers. There are two kinds of breaks in index compatibility - soft and hard ones. Hard break is - your index structure changed, you're using a new encoding for numeric fields, such kind of things. Soft break is - you fixed a stemmer, so now 'some' words are stemmed differently, such kind of things. With hard breaks you have to do an offline reindex, and then switch over. With soft breaks you can sometimes just enqueue all your documents and do reindexation online - that breaks a small percentage of your queries for a small period of time. Something you can bear, if that saves you from doing manual labor. I never claimed an index upgrade tool should fix your tokens, offsets and whatnot. It is power-user stuff that allows you to turn some hard breaks into soft breaks, and then decide on your own how to handle the latter. We also can hit some index format changes that deny any kind of automatic conversion. Well, too sad. We'll just skip issuing an index upgrade tool on that release. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
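The possibility that a release simply skips its upgrade tool can be modeled as chaining per-major offline converters, one hop per major release; this is a hypothetical sketch of how a user would plan a multi-major migration, not a real Lucene tool:

```python
def plan_upgrade(index_major, target_major, tools_shipped):
    """Chain per-release offline upgrade tools: an index at major N is
    converted hop by hop (N -> N+1 -> ...). `tools_shipped` is the set
    of majors for which a conversion tool was actually released; if a
    hop's tool is missing (that release skipped it), the chain breaks
    and None signals that only a full reindex remains."""
    hops = []
    for major in range(index_major, target_major):
        if major + 1 not in tools_shipped:
            return None  # no tool for this hop: reindex required
        hops.append((major, major + 1))
    return hops
```

So an index at major 2 targeting major 4 needs both the 2-to-3 and 3-to-4 tools; if either release skipped its tool, the whole chain fails and the user is back to reindexing.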
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857325#action_12857325 ] Robert Muir commented on LUCENE-2396: - Also, i would like to remove all deprecated methods from contrib/analyzers as well. this again shouldn't be a problem, as it has no back compat policy. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
Agree. However I don't see how Lucene could suddenly change so much that even a conversion tool is impossible to create. After all it's all about terms, positions and frequencies. Yeah... some additions such as payloads may appear, disappear, or evolve into something new, but those are on the user's side anyway. Analyzers are indeed a delicate problem, as when StandardAnalyzer (which probably 90% of users use) generates a different set of terms for the same string. But again it's a user-side problem. Nothing stops him from ripping StandardAnalyzer out of whatever version of lucene, adapting it to the newer indexing API, plugging it in and continuing. I already use 50% customized analyzers, my own query parser and so on. I have junits for (hopefully) all cases I need to cover, so if a new Analyzer misbehaves, it's my responsibility. Danil. On Thu, Apr 15, 2010 at 16:56, Grant Ingersoll gsing...@apache.org wrote: I do think major versions should be able to read the previous version index. Still, even being able to do that is no guarantee that it will produce correct results. Likewise, even having an upgrade tool is no guarantee that correct results will be produced. So, my take is that we strive for it, but we all have to realize, and document, that it might not always be possible. Let's just be practical and pragmatic. Past history indicates we are capable of, for the most part, reading the prev. version index and upgrading it. If it can't be done automatically, then we can consider a tool. If the tool won't work, then we will have to reindex. It doesn't have to be an all-or-nothing decision made in the void. We've always been very practical here about making decisions on problems that are directly facing us, so I would suggest we move forward with the new approach (which I agree makes more sense and is pretty prevalent across a lot of projects) and we take this issue on a case-by-case basis.
-Grant On Apr 15, 2010, at 9:49 AM, Yonik Seeley wrote: On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. It's still harder. Consider a common scenario where you have one master and the index being replicated to multiple slaves. One would need to stop replication to an upgraded slave until the master is also upgraded. Some people can't even stop replication because they use something like a SAN to share the index. I'm just pointing out that there is a lot of value for many people in back-compatible indexes... I'm not trying to make any points about when that back compat should be broken. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
On Wed, Apr 14, 2010 at 5:22 PM, Michael McCandless luc...@mikemccandless.com wrote: * There is no back compat across major releases (index nor APIs), but full back compat within branches. This would match how many other projects work (KS/Lucy, as Marvin describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.). Sort of... except many of these projects listed above care a lot about back compat, even between major releases. So while we could always break back compat, we shouldn't do so unless it's necessary. It's not an all-or-nothing scenario though... requiring re-indexing seems reasonable, but changing APIs around when there's not a good reason behind it (other than someone liked the name a little better) should still be approached with caution. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
Coming in late to the discussion, and without really understanding the underlying Lucene issues, but... The size of the problem of reindexing is under-appreciated, I think. Somewhere in my company is the original data I indexed. But the effort it would take to resurrect it is O(unknown). An unfortunate reality of commercial products is that they often receive very little love for extended periods of time until all of a sudden more work is required. There ensues an extended period of re-orientation, even if the people who originally worked on the project are still around. *Assuming* the data is available to reindex (and there are many reasons besides poor practice on the part of the company that it may not be), remembering/finding out exactly which of the various backups you made of the original data is the one that's actually in your product can be highly non-trivial. Compounded by the fact that the product manager will be adamant about Do NOT surprise our customers. So I can be in a spot of saying I *think* I have the original data set, and I *think* I have the original code used to index it, and if I get a new version of Lucene I *think* I can recreate the index and I *think* that the user will see the expected change. After all that effort is completed, I *think* we'll see the expected changes, but we won't know until we try. That puts me in a very precarious position. This assumes that I have a reasonable chance of getting the original data. But say I've been indexing data from a live feed. Sure as hell hope I stored the data somewhere, because going back to the source and saying please resend me 10 years worth of data that I have in my index is...er...hard. Or say that the original provider has gone out of business, or the licensing arrangement specifies a one-time transmission of data that may not be retained in its original form, or... The point of this long diatribe is that there are many reasons why reindexing is impossible and/or impractical.
Making any decision that requires reindexing for a new version is locking a user into a version potentially forever. We should not underestimate how painful that can be and should never think that just reindex is acceptable in all situations. It's not. Period. Be very clear that some number of Lucene users will absolutely not be able to reindex. We may still make a decision that requires this, but let's make it without deluding ourselves that it's a possible solution for everyone. So an upgrade tool seems like a reasonable compromise. I agree that being hampered in what we can develop in Lucene by having to accommodate reading old indexes slows new features etc. It's always nice to be able to work without dealing with pesky legacy issues <G>. Perhaps splitting out the indexing upgrades into a separate program lets us accommodate both concerns. FWIW Erick On Thu, Apr 15, 2010 at 9:42 AM, Danil ŢORIN torin...@gmail.com wrote: True. Just need the tool. On Thu, Apr 15, 2010 at 16:39, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
reasonable, but changing APIs around when there's not a good reason behind it (other than someone liked the name a little better) should still be approached with caution. Changing names is a good enough reason :) They make a darn difference between having to read a book to be able to use some library, or just playing around with it for a bit. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
If you absolutely cannot re-index, and you have *no* access to the data again - you are one ballsy mofo to upgrade to a new version of Lucene for features. It means you likely BASE jump in your free time? On 04/15/2010 10:14 AM, Erick Erickson wrote: Coming in late to the discussion, and without really understanding the underlying Lucene issues, but... The size of the problem of reindexing is under-appreciated, I think. Somewhere in my company is the original data I indexed. But the effort it would take to resurrect it is O(unknown). An unfortunate reality of commercial products is that they often receive very little love for extended periods of time until all of a sudden more work is required. There ensues an extended period of re-orientation, even if the people who originally worked on the project are still around. *Assuming* the data is available to reindex (and there are many reasons besides poor practice on the part of the company that it may not be), remembering/finding out exactly which of the various backups you made of the original data is the one that's actually in your product can be highly non-trivial. Compounded by the fact that the product manager will be adamant about Do NOT surprise our customers. So I can be in a spot of saying I *think* I have the original data set, and I *think* I have the original code used to index it, and if I get a new version of Lucene I *think* I can recreate the index and I *think* that the user will see the expected change. After all that effort is completed, I *think* we'll see the expected changes, but we won't know until we try. That puts me in a very precarious position. This assumes that I have a reasonable chance of getting the original data. But say I've been indexing data from a live feed. Sure as hell hope I stored the data somewhere, because going back to the source and saying please resend me 10 years worth of data that I have in my index is...er...hard.
Or say that the original provider has gone out of business, or the licensing arrangement specifies a one-time transmission of data that may not be retained in its original form, or... The point of this long diatribe is that there are many reasons why reindexing is impossible and/or impractical. Making any decision that requires reindexing for a new version is locking a user into a version potentially forever. We should not underestimate how painful that can be and should never think that just reindex is acceptable in all situations. It's not. Period. Be very clear that some number of Lucene users will absolutely not be able to reindex. We may still make a decision that requires this, but let's make it without deluding ourselves that it's a possible solution for everyone. So an upgrade tool seems like a reasonable compromise. I agree that being hampered in what we can develop in Lucene by having to accommodate reading old indexes slows new features etc. It's always nice to be able to work without dealing with pesky legacy issues <G>. Perhaps splitting out the indexing upgrades into a separate program lets us accommodate both concerns. FWIW Erick On Thu, Apr 15, 2010 at 9:42 AM, Danil ŢORIN torin...@gmail.com wrote: True. Just need the tool. On Thu, Apr 15, 2010 at 16:39, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up.
-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
'Cause some exec finally noticed the product was losing market share. Or got a wild hair strategically placed. My point is only that we should be clear that some number of Lucene users *will* be in such a position. I'm actually fine with a decision that we're not going to support such a scenario, but let's be clear that that's the decision we're making. And corporate competence aside, there's still licensing that may prevent me archiving the raw data. Erick On Thu, Apr 15, 2010 at 10:20 AM, Earwin Burrfoot ear...@gmail.com wrote: I think the need to upgrade to latest and greatest lucene for poor corporate users that lost all their data is somewhat overblown. Why the heck do you need to upgrade if your app rotted in neglect for years?? On Thu, Apr 15, 2010 at 18:14, Erick Erickson erickerick...@gmail.com wrote: [...]
Re: Proposal about Version API relaxation
The app is not rotted, it's alive and kicking, and gets a lot of TLC. There are some older indexes that use some features and there are newer indexes that will benefit greatly from newer features. All running in one freaking big distributed application. Leveraging lucene versions by updating to a newer lucene for new indexes, and changing the analyzer chain of old indexes in a way that doesn't affect (too much) the search results they used to get, is a logical way from my point of view. I only ask for a tool to convert from the old lucene format to the new one. I don't expect magic to happen, but give me the possibility to go forward and let me worry about backward compatibility of search results. On Thu, Apr 15, 2010 at 17:20, Earwin Burrfoot ear...@gmail.com wrote: I think the need to upgrade to latest and greatest lucene for poor corporate users that lost all their data is somewhat overblown. Why the heck do you need to upgrade if your app rotted in neglect for years?? On Thu, Apr 15, 2010 at 18:14, Erick Erickson erickerick...@gmail.com wrote: [...]
[jira] Updated: (LUCENE-2395) Add a scoring DistanceQuery that does not need caches and separate filters
[ https://issues.apache.org/jira/browse/LUCENE-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2395: -- Attachment: DistanceQuery.java A first idea of the Query, it does not even compile as some classes are missing (coming with Chris' later patches), but it shows how it should work and how it's customizable. Add a scoring DistanceQuery that does not need caches and separate filters -- Key: LUCENE-2395 URL: https://issues.apache.org/jira/browse/LUCENE-2395 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Uwe Schindler Fix For: 3.1 Attachments: DistanceQuery.java In a chat with Chris Male and my own ideas when implementing for PANGAEA, I thought about the broken distance query in contrib. It lacks the following features: - It needs a query/filter for the enclosing bbox (which is constant score) - It needs a separate filter for filtering out hits too far away (inside bbox but outside distance limit) - It has no scoring, so if somebody wants to sort by distance, he needs to use the custom sort. For that to work, spatial caches distance calculation (which is broken for multi-segment search) The idea is now to combine all three things into one query, but customizable: We first thought about extending CustomScoreQuery and calculating the distance from FieldCache in the customScore method and returning a score of 1 for distance=0, score=0 at the max distance, and score<0 for farther hits that are in the bounding box but not in the distance circle. To filter out such negative scores, we would need to override the scorer in CustomScoreQuery, which is private. My proposal is now to use a very stripped down CustomScoreQuery (but not extend it) that does call a method getDistance(docId) in its scorer's advance and nextDoc that calculates the distance for the current doc. It stores this distance also in the scorer. If the distance > maxDistance it throws away the hit and calls nextDoc() again. 
The score() method will return by default weight.value*(maxDistance - distance)/maxDistance and uses the precalculated distance. So the distance is only calculated one time in nextDoc()/advance(). To be able to plug in custom scoring, the following methods in the query can be overridden: - float getDistanceScore(double distance) - returns by default: (maxDistance - distance)/maxDistance; allows score customization - DocIdSet getBoundingBoxDocIdSet(Reader, LatLng sw, LatLng ne) - returns a DocIdSet for the bounding box. By default it returns e.g. the docIdSet of a NRF or a cartesian tier filter. You can even plug in any other DocIdSet, e.g. wrap a Query with QueryWrapperFilter - support a setter for the GeoDistanceCalculator that is used by the scorer to get the distance. - a LatLng provider (similar to CustomScoreProvider/ValueSource) that returns for a given doc id the lat/lng. This method is called per IndexReader one time in scorer creation and will retrieve the coordinates. By that we support FieldCache or whatever. This query is almost finished in my head, it just needs coding :-) -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
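The default distance-to-score mapping and the skip-too-far-hits loop described above can be sketched in plain Java. All class and method names here are illustrative only, not taken from the attached DistanceQuery.java:

```java
// Sketch of the default scoring and nextDoc() behavior described above.
// Hypothetical names; the real proposal wires this into a Lucene Scorer.
public class DistanceScoreSketch {

    /** By default: 1.0 at distance 0, falling linearly to 0.0 at maxDistance. */
    public static float getDistanceScore(double distance, double maxDistance) {
        return (float) ((maxDistance - distance) / maxDistance);
    }

    /**
     * Mimics the scorer's nextDoc() loop: hits farther than maxDistance are
     * skipped; the accepted doc's distance would be cached for score().
     */
    public static int nextAcceptedDoc(double[] distanceByDoc, int startDoc,
                                      double maxDistance) {
        for (int doc = startDoc; doc < distanceByDoc.length; doc++) {
            if (distanceByDoc[doc] <= maxDistance) {
                return doc;
            }
        }
        return Integer.MAX_VALUE; // stands in for NO_MORE_DOCS
    }
}
```

Because the distance is computed once in nextDoc()/advance() and cached, score() never recomputes it; overriding getDistanceScore alone changes the ranking function without touching the iteration logic.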
[jira] Assigned: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-2396: --- Assignee: Robert Muir remove version from contrib/analyzers. -- Key: LUCENE-2396 URL: https://issues.apache.org/jira/browse/LUCENE-2396 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Contrib/analyzers has no backwards-compatibility policy, so let's remove Version so the API is consumable. if you think we shouldn't do this, then instead explicitly state and vote on what the backwards compatibility policy for contrib/analyzers should be instead, or move it all to core.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857373#action_12857373 ] Michael McCandless commented on LUCENE-2324: bq. The usual design is a queued ingestion pipeline, where a pool of indexer threads take docs out of a queue and feed them to an IndexWriter, I think? bq. Mainly, because I think apps with such an affinity that you describe are very rare? Hmm, I suspect it's not that rare. Yes, one design is a single indexing queue w/ a dedicated thread pool only for indexing, but a push model is equally valid, where your app already has separate threads (or thread pools) servicing different content sources, so when a doc arrives at one of those source-specific threads, it's that thread that indexes it, rather than handing off to a separate pool. Lucene is used in a very wide variety of apps -- we shouldn't optimize the indexer on such hard app-specific assumptions. bq. And if a user really has so different docs, maybe the right answer would be to have more than one single index? Hmm, but the app shouldn't have to resort to this... (it doesn't have to today). But... could we allow an add/updateDocument call to express this affinity, explicitly? If you index homogeneous docs you wouldn't use it, but, if you index drastically different docs that fall into clear categories, expressing the affinity can get you a good gain in indexing throughput. This may be the best solution, since then one could pass the affinity even through a thread pool, and then we would fall back to thread binding if the document class wasn't declared? I mean this is virtually identical to having more than one index, since the DW is like its own index. It just saves some of the copy-back/merge cost of addIndexes... bq. Even if today an app utilizes the thread affinity, this only results in maybe somewhat faster indexing performance, but the benefits would be lost after flushing/merging. 
Yes, this optimization is only about the initial flush, but, it's potentially sizable. Merging matters less since typically it's not the bottleneck (happens in the BG, quickly enough). On the right apps, thread affinity can make a huge difference. EG if you allow up to 8 thread states, and the threads are indexing content w/ highly divergent terms (eg, one language per thread, or, docs w/ very different field names), in the worst case you'll be up to 1/8 as efficient since each term must now be copied in up to 8 places instead of one. We have a high per-term RAM cost (reduced thanks to the parallel arrays, but, still high). bq. If we assign docs randomly to available DocumentsWriterPerThreads, then we should on average make good use of the overall memory? It really depends on the app -- if the term space is highly thread-dependent (above examples) you can end up flushing much more frequently for a given RAM buffer. bq. Alternatively we could also select the DWPT from the pool of available DWPTs that has the highest amount of free memory? Hmm... this would be kinda costly binder? You'd need a pqueue? Thread affinity (or the explicit affinity) is a single map/array/member lookup. But it's an interesting idea... bq. If you do have a global RAM management, how would the flushing work? E.g. when a global flush is triggered because all RAM is consumed, and we pick the DWPT with the highest amount of allocated memory for flushing, what will the other DWPTs do during that flush? Wouldn't we have to pause the other DWPTs to make sure we don't exceed the maxRAMBufferSize? The other DWs would keep indexing :) That's the beauty of this approach... a flush of one DW doesn't stop all other DWs from indexing, unlike today. And you want to serialize the flushing right? Ie, only one DW flushes at a time (the others keep indexing). Hmm, I suppose flushing more than one should be allowed (OS/IO have a lot of concurrency, esp since IO goes into write cache)... 
perhaps that's the best way to balance index vs flush time? EG we pick one to flush @ 90%, if we cross 95% we pick another to flush, another at 100%, etc. bq. Of course we could say always flush when 90% of the overall memory is consumed, but how would we know that the remaining 10% won't fill up during the time the flush takes? Regardless of the approach for document -> DW binding, this is an issue (ie it's non-differentiating here)? Ie the other DWs continue to consume RAM while one DW is flushing. I think the low/high water mark is an OK solution here? Or the tiered flushing (I think I like that better :) ). bq. Having a fully decoupled memory management is compelling I think, mainly because it makes everything so much simpler. A DWPT could decide itself when it's time to flush, and the other ones can keep going independently. I'm all for simplifying things, which you've already nicely done here, but not if it's at the cost of a
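The tiered flushing idea ("pick one to flush @ 90%, if we cross 95% we pick another, another at 100%") can be sketched roughly as follows. The thresholds and all names are purely illustrative of the discussion, not part of any IndexWriter API:

```java
// Rough sketch of tiered flushing: as global RAM use crosses successive
// thresholds, one more DocumentsWriterPerThread is selected to flush while
// the remaining DWPTs keep indexing. Hypothetical; not from a Lucene patch.
public class TieredFlushSketch {
    // Hypothetical tiers from the discussion: 90%, 95%, 100% of the budget.
    private static final double[] TIERS = {0.90, 0.95, 1.00};

    /** Number of concurrent DWPT flushes the current RAM usage calls for. */
    public static int flushesWanted(long usedBytes, long budgetBytes) {
        double ratio = (double) usedBytes / budgetBytes;
        int wanted = 0;
        for (double tier : TIERS) {
            if (ratio >= tier) {
                wanted++;
            }
        }
        return wanted;
    }
}
```

The point of the tiers is the balance discussed above: a single flush does not pause the other writers, and if they fill RAM faster than one flush can drain it, additional flushes kick in rather than a hard stop.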
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857375#action_12857375 ] Tim Smith commented on LUCENE-2324: --- bq. But... could we allow an add/updateDocument call to express this affinity, explicitly? i would love to be able to explicitly define a segment affinity for documents i'm feeding this would then allow me to say: all docs from table a has affinity 1 all docs from table b has affinity 2 this would ideally result in indexing documents from each table into a different segment (obviously, i would then need to be able to have segment merging be affinity aware so optimize/merging would only merge segments that share an affinity) Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1 Attachments: lucene-2324.patch, LUCENE-2324.patch See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO CPU. -- This message is automatically generated by JIRA. 
[jira] Updated: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2396: Attachment: LUCENE-2396.patch attached is a patch, including CHANGES rewording. All Lucene/Solr tests pass. If no one objects, I plan to commit in a day or two.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857380#action_12857380 ] Jason Rutherglen commented on LUCENE-2324: -- bq. only one DW flushes at a time (the others keep indexing). I think it's best to simply flush at 90% for now. We already exceed the ram buffer size because of over allocation? Perhaps we can view the ram buffer size as a rough guideline, not a hard and fast limit, because, let's face it, we're using Java, which is about as inexact when it comes to RAM consumption as it gets? Also, hopefully it would move the patch along faster, and more complex algorithms could easily be added later.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857381#action_12857381 ] Michael McCandless commented on LUCENE-2324: {quote} i would love to be able to explicitly define a segment affinity for documents i'm feeding this would then allow me to say: all docs from table a has affinity 1 all docs from table b has affinity 2 {quote} Right, this is exactly what affinity would be good for -- so IW would try to send table a docs to their own DW(s) and table b docs to their own DW(s), which should give faster indexing than randomly binding to DWs. But: bq. this would ideally result in indexing documents from each table into a different segment (obviously, i would then need to be able to have segment merging be affinity aware so optimize/merging would only merge segments that share an affinity) This part I was not proposing :) The affinity would just be an optimization hint in creating the initial flushed segments, so IW can speed up indexing. Probably if you really want to keep the segments segregated like that, you should in fact index to separate indices?
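As a sketch of what an explicit affinity hint might look like (a hypothetical API shape; nothing like this exists in IndexWriter), the hint could simply select which per-thread DocumentsWriter a document lands in:

```java
// Hypothetical sketch of an affinity hint binding documents to one of N
// per-thread DocumentsWriters, per the discussion above. The method name
// and the idea of an int affinity are illustrative only.
public class AffinitySketch {

    /** Map an app-supplied affinity (e.g. 1 = table a, 2 = table b) to a DWPT slot. */
    public static int pickWriterSlot(int affinity, int numWriters) {
        // floorMod keeps the slot non-negative even for negative affinities.
        return Math.floorMod(affinity, numWriters);
    }
}
```

Docs sharing an affinity would then tend to flush into the same initial segments, which is the indexing-speed optimization Mike describes; as he notes, it would not constrain later merging.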
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857384#action_12857384 ] Uwe Schindler commented on LUCENE-2396: --- Are you sure you want to use LUCENE_CURRENT in some ctors?
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857386#action_12857386 ] Robert Muir commented on LUCENE-2396: - bq. Are you sure you want to use LUCENE_CURRENT in some ctors? The lucene core subclasses used by some analyzers require this, so another alternative is to create a static CONTRIB_ANALYZERS_VERSION = 3.1 for this purpose, and bump it every release. that's fine too.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857385#action_12857385 ] Tim Smith commented on LUCENE-2324: --- bq. Probably if you really want to keep the segments segregated like that, you should in fact index to separate indices? That's what i'm currently thinking i'll have to do. However, it would be ideal if i could either subclass IndexWriter or use IndexWriter directly with this affinity concept (potentially writing my own segment merger that is affinity aware). That makes it so i can easily use near real time indexing, as only one IndexWriter will be in the mix, as well as make managing deletes and a whole host of other issues with multiple indexes disappear. It also makes it so i can configure memory settings across all affinity groups instead of having to dynamically create them, each with their own memory bounds.
- If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
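Tim's affinity idea above can be sketched in a few lines: route each document to one of N private in-memory segment buffers by an application-supplied affinity key, so documents sharing a key always land in the same buffered segment. This is a minimal, hypothetical illustration; the names (AffinityRouter, SegmentBuffer, route) are invented for this sketch and none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of routing documents to private in-memory segment
// buffers by an affinity key. Not Lucene code; names are invented.
class AffinityRouter {
    static class SegmentBuffer {
        final List<String> docs = new ArrayList<>();
    }

    private final SegmentBuffer[] buffers;

    AffinityRouter(int n) {
        buffers = new SegmentBuffer[n];
        for (int i = 0; i < n; i++) buffers[i] = new SegmentBuffer();
    }

    // same key always hashes to the same buffer, so related docs
    // stay together in one private segment
    SegmentBuffer route(String affinityKey) {
        return buffers[Math.floorMod(affinityKey.hashCode(), buffers.length)];
    }

    void add(String affinityKey, String doc) {
        route(affinityKey).docs.add(doc);
    }

    public static void main(String[] args) {
        AffinityRouter r = new AffinityRouter(4);
        r.add("tenantA", "doc1");
        r.add("tenantA", "doc2");
        r.add("tenantB", "doc3");
        System.out.println(r.route("tenantA").docs.size()); // prints 2
    }
}
```

A single affinity-aware merge policy could then merge only buffers belonging to the same group, which is the part Tim suggests would need a custom segment merger.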
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857388#action_12857388 ] Shai Erera commented on LUCENE-2396: Robert I think this is great! Can we move more analyzers from core here? I think however that a backwards section in CHANGES is important because it alerts users about those analyzers whose runtime behavior changed. Otherwise how would the poor users know that? It doesn't mean you need to maintain back compat support, but at least alert them when things change. Even if we eventually decide to remove API bw compat completely, a section in CHANGES will still be required to help users upgrade easily. remove version from contrib/analyzers. -- Key: LUCENE-2396 URL: https://issues.apache.org/jira/browse/LUCENE-2396 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-2396.patch Contrib/analyzers has no backwards-compatibility policy, so let's remove Version so the API is consumable. If you think we shouldn't do this, then instead explicitly state and vote on what the backwards compatibility policy for contrib/analyzers should be, or move it all to core.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857395#action_12857395 ] Robert Muir commented on LUCENE-2396: - {quote} Robert I think this is great! Can we move more analyzers from core here? I think however that a backwards section in changes is important because it alerts users about those analyzers whose runtime behavior changed. Otherwise how would the poor users know that? It doesn't mean you need to maintain back compat support but at least alert them when things change. {quote} I think this belongs in Changes in Runtime Behavior. It's a question of wording, which is why I renamed it as such in the patch. If folks want to move the analyzers in core into here, that would be great too, even better the Solr analyzers. We can call it a module if we want, or whatever. But for now, I'm working with what I've got.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857396#action_12857396 ] Shai Erera commented on LUCENE-2396: Static? Weren't you against that!? But if we remove back compat from analyzers why do we need Version? Or is this API bw that we remove?
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857398#action_12857398 ] Robert Muir commented on LUCENE-2396: - {quote} Static? Weren't you against that!? But if we remove back compat from analyzers why do we need Version? Or is this API bw that we remove? {quote} Whoah... don't get too excited :). *Internally* some of these contrib analyzers subclass stuff that's in Lucene core, which requires Version. If this stuff was moved into, say, contrib/analyzers, then we wouldn't need this *internal-only-use* Version arg.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857402#action_12857402 ] Uwe Schindler commented on LUCENE-2396: --- bq. Static? Weren't you against that!? He meant a static final! It is just to fix the analyzers that depend on core stuff to a specific version, until we have no more analyzers in core except Whitespace.
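What Uwe describes might look like the following hedged sketch: the contrib analyzer's public API takes no Version argument, while a private static final constant pins the behavior of the core components it wraps. The Version enum and classes here are toy stand-ins invented for illustration, not Lucene's org.apache.lucene.util.Version or any real analyzer.

```java
// Toy stand-in for Lucene's Version enum.
enum Version { LUCENE_30, LUCENE_31 }

// Stand-in for a core component that still requires a Version.
class CoreTokenizerStub {
    final Version matchVersion;
    CoreTokenizerStub(Version v) { this.matchVersion = v; }
}

// Sketch of a contrib analyzer whose public API exposes no Version:
// the constant below is internal-only, pinning core behavior.
class ContribAnalyzerSketch {
    private static final Version MATCH_VERSION = Version.LUCENE_31;

    CoreTokenizerStub newTokenizer() {
        return new CoreTokenizerStub(MATCH_VERSION);
    }
}
```

Callers never see or choose the version; when the core dependency eventually moves into the analyzers module, the constant disappears entirely.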
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857412#action_12857412 ] Robert Muir commented on LUCENE-2396: - bq. Until we have no more analyzers in core except Whitespace. Actually I think Whitespace belongs in the analyzers module too. I would suggest a TestAnalyzer in src/test, which might just be quick-and-dirty or whatever.
Re: Proposal about Version API relaxation
On 04/15/2010 09:49 AM, Robert Muir wrote: wrong, it doesn't fix the analyzers problem. you need to reindex. On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot ear...@gmail.com wrote: On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote: Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster. Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up. Having read the thread, I have a few comments. Much of it is summary. The current proposal requires re-index on every upgrade to Lucene. Plain and simple. Robert is right about the analyzers. There are three levels of backward compatibility, though we usually talk about two. First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index. And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Second, the API. The current mechanism to use deprecations to migrate users to a new API is both a blessing and a curse. It is a blessing to end users so that they have a clear migration path. It is a curse to development because the API is bloated with the old and the new. Further, it causes unfortunate class naming, with the tendency to migrate away from the good name. It is a curse to end users because it can cause confusion. While I like the mechanism of deprecations to migrate me from one release to another, I'd be open to another mechanism. So much effort is put into API bw compat that might be better spent on another mechanism. E.g. thorough documentation. Third, the behavior. WRT analyzers (consisting of tokenizers, stemmers, stop words, ...): if the token stream changes, the index is no longer valid. It may appear to work, but it is broken.
The token stream applies not only to the indexed documents, but also to the user-supplied query. A simple example: if, from one release to another, the stop word 'a' is dropped, then phrase searches including 'a' won't work, as 'a' is not in the index. Even a simple, obvious bug fix that changes the stream is bad. Another behavior change is an upgrade in Java version. By forcing users to go to Java 5 with Lucene 3, the version of Unicode changed. This in itself causes a change in some token streams. With a change to a token stream, the index must be re-created to ensure expected behavior. If the original input is no longer available or the index cannot be rebuilt for whatever reason, then Lucene should not be upgraded. It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well, contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ Version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. The other problem I have is the assumption that re-index is feasible and that indexes are always server-based. Re-index feasibility has already been well-discussed on this thread from a server-side perspective. There are many client-side applications, like mine, where the index is built and used on the client's computer. In my scenario the user builds indexes individually for books. From the index perspective, the sentence is the Lucene document and the book is the index. Building an index is voluntary and takes time proportional to the size of the document and inversely proportional to the power of the computer. Our user base is those with ancient, underpowered laptops in third-world countries.
On those machines it might take 10 minutes to create an index and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. So what are my choices? (rhetorical) With each new release of my app, I'd like to exploit the latest and greatest features of Lucene. And I'm going to change my app with features which may or may not be related to the use of Lucene. Those latter features are what matter the most to my user base. They don't care what technologies are used to do searches. If the latest Lucene jar does not let me use Version (or some other mechanism) to maintain compatibility with an older index, the user will have to re-index. Or I can forgo any future upgrades with Lucene. Neither is very palatable. -- DM Smith
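DM's stop-word example can be made concrete with a toy simulation (plain-Java stand-ins, not Lucene code): if 'a' was treated as a stop word at index time, a phrase query that still contains 'a' can never match, even though the original text did contain it.

```java
import java.util.*;

// Toy simulation of analyzer drift: the stop-word set used at index
// time differs from the one used at query time, so phrase matching
// silently breaks. Not Lucene code; a deliberately naive sketch.
class StopWordDrift {
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+"))
            if (!stopWords.contains(t)) tokens.add(t);
        return tokens;
    }

    // naive exact-phrase match over token lists (ignores positions/gaps)
    static boolean phraseMatch(List<String> indexed, List<String> query) {
        return Collections.indexOfSubList(indexed, query) >= 0;
    }

    public static void main(String[] args) {
        String doc = "to be a rock and not to roll";
        Set<String> oldStops = Set.of("a"); // used when the index was built
        Set<String> newStops = Set.of();    // used at query time, after an "upgrade"

        List<String> indexed = analyze(doc, oldStops);
        // 'a' survives analysis of the query but was never indexed:
        System.out.println(phraseMatch(indexed, analyze("be a rock", newStops))); // false
        // with the original analyzer, the same phrase matches:
        System.out.println(phraseMatch(indexed, analyze("be a rock", oldStops))); // true
    }
}
```

Real Lucene phrase matching also involves position increments, but the failure mode is the same: index and query must be analyzed identically, which is exactly what Version (or re-indexing) is meant to guarantee.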
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857427#action_12857427 ] DM Smith commented on LUCENE-2396: -- Robert, I think this is a red herring. There has been an implicit bw compat policy, with all the effort to maintain bw compat in the analyzers. With the re-shuffling of contrib this has been made a bit murky and does need to be re-addressed. How is this any different from the discussion to eliminate Version altogether? I think that should be resolved first and this should follow the lead of that. How can one have a useful index across releases without a stable token stream? From the thread it is clear that few understand the impact of an analyzer on the usefulness of an index. If this succeeds there is little reason to maintain Version at all. -- DM
Re: Proposal about Version API relaxation
First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index. And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Optimize is a bad way to convert to current. 1. conversion is not guaranteed, optimizing already optimized index is a noop 2. it merges all your segments. if you use BalancedSegmentMergePolicy, that destroys your segment size distribution Dedicated upgrade tool (available both from command-line and programmatically) is a good way to convert to current. 1. conversion happens exactly when you need it, conversion happens for sure, no additional checks needed 2. it should leave all your segments as is, only changing their format It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. I counter - most non-trivial applications will use their own analyzers. The more modules - the merrier. You can choose precisely what you need. Our user base are those with ancient, underpowered laptops in 3-rd world countries. On those machines it might take 10 minutes to create an index and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. Major Lucene releases (feature-wise, not version-wise) happen like once in a year, or year-and-a-half. Is it that hard for your users to wait ten minutes once a year? 
-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
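The two conversion paths Earwin contrasts can be sketched abstractly: an optimize-style conversion merges everything into one segment, destroying the size distribution a merge policy maintains, while an upgrade-style conversion rewrites each segment's format in place and keeps the distribution. All names below are invented for illustration; this is not Lucene's API.

```java
import java.util.*;

// Abstract sketch of "optimize as converter" vs. a dedicated upgrade
// tool, per the discussion above. Invented types, not Lucene code.
class UpgradeSketch {
    static class Segment {
        final int docCount;
        final int formatVersion;
        Segment(int docCount, int formatVersion) {
            this.docCount = docCount;
            this.formatVersion = formatVersion;
        }
    }

    // optimize-style: merges all segments into one big one
    static List<Segment> optimize(List<Segment> index, int newFormat) {
        int total = index.stream().mapToInt(s -> s.docCount).sum();
        return List.of(new Segment(total, newFormat));
    }

    // upgrade-style: rewrites each segment's format, sizes untouched
    static List<Segment> upgradeInPlace(List<Segment> index, int newFormat) {
        List<Segment> out = new ArrayList<>();
        for (Segment s : index) out.add(new Segment(s.docCount, newFormat));
        return out;
    }

    public static void main(String[] args) {
        List<Segment> index = List.of(
            new Segment(1000, 2), new Segment(100, 2), new Segment(10, 2));
        System.out.println(optimize(index, 3).size());       // 1 segment left
        System.out.println(upgradeInPlace(index, 3).size()); // 3 segments kept
    }
}
```

The in-place path also sidesteps Earwin's first objection: an already-optimized index is a no-op for optimize, so conversion is not guaranteed, whereas a format rewrite always happens.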
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 1:30 PM, DM Smith dmsmith...@gmail.com wrote: Another behavior change is an upgrade in Java version. By forcing users to go to Java 5 with Lucene 3, the version of Unicode changed. This in itself causes a change in some token streams. ... It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. DM brings up some interesting points here. For example, the Porter stemmer in core, from 1970 or whenever, has essentially been frozen to all changes for some time now; it says so on Porter's site. This is not the case for non-English: things are very much in flux, including how the characters themselves are encoded on a computer. If we want to support languages other than English in Lucene, we have to make it possible to iterate and improve things without making 20 copies of something or scattering Version everywhere. -- Robert Muir rcm...@gmail.com
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857440#action_12857440 ] Robert Muir commented on LUCENE-2396: - bq. There has been an implicit bw compat policy, Part of the point of this patch for me was two things: # what would the code look like if we deleted the back compat cruft? # why do I constantly hear different ideas about what contrib/analyzers' back compat is and what it should be? I want it defined! At first I said this is a stupid idea, but I'm gonna delete all the backwards cruft from a few Analyzers and just give it a try... it's amazing how much easier it is to see what is going on when you delete the 1.8MB of backwards crap... a lot of which I put a lot of effort into myself. So I think we should instead use real versions for contrib/analyzers. You can be damn sure if you stick with lucene-analyzers-3.0.jar that your stemmer isn't going to change behavior... no matter how much backwards stuff we try to add, this is easiest and safest on everyone.
Re: Proposal about Version API relaxation
On 04/15/2010 01:50 PM, Earwin Burrfoot wrote: First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index. And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Optimize is a bad way to convert to current. 1. conversion is not guaranteed, optimizing already optimized index is a noop 2. it merges all your segments. if you use BalancedSegmentMergePolicy, that destroys your segment size distribution Dedicated upgrade tool (available both from command-line and programmatically) is a good way to convert to current. 1. conversion happens exactly when you need it, conversion happens for sure, no additional checks needed 2. it should leave all your segments as is, only changing their format It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. I counter - most non-trivial applications will use their own analyzers. The more modules - the merrier. You can choose precisely what you need. By and large an analyzer is a simple wrapper for a tokenizer and some filters. Are you suggesting that most non-trivial apps write their own tokenizers and filters? I'd find that hard to believe. For example, I don't know enough Chinese, Farsi, Arabic, Polish, ... to come up with anything better than what Lucene has to tokenize, stem or filter these. Our user base are those with ancient, underpowered laptops in 3-rd world countries. 
On those machines it might take 10 minutes to create an index and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. Major Lucene releases (feature-wise, not version-wise) happen like once in a year, or year-and-a-half. Is it that hard for your users to wait ten minutes once a year? I said that was for one index. Multiply that times the number of books available (300+) and yes, it is too much to ask. Even if a small subset is indexed, say 30, that's around 5 hours of waiting. Under consideration is the frequency of breakage. Some are suggesting a greater frequency than yearly. DM
[jira] Updated: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom Burton-West updated LUCENE-2393: Attachment: LUCENE-2393.patch New patch includes a (pre-flex) version of HighFreqTerms that finds the top N terms with the highest docFreq, looks up the total term frequency, and outputs the list of terms sorted by highest total term frequency (which approximates the largest entries in the *prx files). I'm not sure how to combine the GetTermInfo program with either version of HighFreqTerms in a way that leads to sane command line arguments and argument processing. I suppose that HighFreqTerms could have a flag that turns on or off the inclusion of total term frequency. Utility to output total term frequency and df from a lucene index - Key: LUCENE-2393 URL: https://issues.apache.org/jira/browse/LUCENE-2393 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Tom Burton-West Priority: Trivial Attachments: LUCENE-2393.patch, LUCENE-2393.patch This is a command line utility that takes a field name, term, and index directory and outputs the document frequency for the term and the total number of occurrences of the term in the index (i.e. the sum of the tf of the term for each document). It is useful for estimating the size of the term's entry in the *prx files and the consequent disk I/O demands.
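The statistics the HighFreqTerms/GetTermInfo utilities compute can be illustrated with a toy postings map: for each term, the document frequency is the number of documents containing it, and the total term frequency is the sum of per-document tf values; the top-N terms by total tf approximate the biggest *prx entries. The postings map below is a stand-in for a real index, not the patch's code.

```java
import java.util.*;
import java.util.stream.Collectors;

// Toy computation of df and total term frequency over a fake
// postings map (term -> (docId -> tf)). Not the LUCENE-2393 code.
class TermStatsSketch {
    static long totalTermFreq(Map<Integer, Integer> postings) {
        return postings.values().stream().mapToLong(Integer::longValue).sum();
    }

    static List<String> topByTotalTf(Map<String, Map<Integer, Integer>> index, int n) {
        return index.entrySet().stream()
            .sorted((a, b) -> Long.compare(totalTermFreq(b.getValue()),
                                           totalTermFreq(a.getValue())))
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> index = Map.of(
            "the", Map.of(0, 10, 1, 12),   // df=2, totalTf=22
            "lucene", Map.of(0, 3),        // df=1, totalTf=3
            "index", Map.of(1, 5, 2, 1));  // df=2, totalTf=6
        System.out.println(index.get("the").size());         // df of "the": 2
        System.out.println(totalTermFreq(index.get("the"))); // total tf: 22
        System.out.println(topByTotalTf(index, 2));          // [the, index]
    }
}
```

Note that docFreq alone would rank "the" and "index" equally here; it is the total tf that identifies the term dominating the positions file, which is the point of the patch.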
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857456#action_12857456 ] DM Smith commented on LUCENE-2396: -- {quote} So I think we should instead use real-versions for contrib/analyzers. You can be damn sure if you stick with lucene-analyzers-3.0.jar that your stemmer isn't going to change behavior... no matter how much backwards stuff we try to add, this is easiest and safest on everyone. {quote} I could live with that... maybe. What guarantee is there that lucene-analyzers-3.0.jar will work with lucene-core-3.7.jar? How does that work? How can I use lucene-analyzers-3.0.jar on old indexes and lucene-analyzers-3.5.jar on newer ones within the same package? What I'd like to see is that all analyzers and their parts are kept together in an analyzer jar (maybe more than one for the honking big analyzers we have today) and that it be elevated to core. (I think contrib gives the wrong impression.) And have a well-defined compatibility policy.
Re: Proposal about Version API relaxation
I'd like to remind that Mike's proposal has stable branches. We can branch off preflex trunk right now and wrap it up as 3.1. Current trunk is declared as future 4.0 and all backcompat cruft is removed from it. If some new features/bugfixes appear in trunk, and they don't break stuff - we backport them to 3.x branch, eventually releasing 3.2, 3.3, etc. Thus, devs are free to work without back-compat burden, bleeding edge users get their blood, conservative users get their stability + a subset of new features from stable branches.
[jira] Issue Comment Edited: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857456#action_12857456 ] DM Smith edited comment on LUCENE-2396 at 4/15/10 2:16 PM: --- {quote} So I think we should instead use real-versions for contrib/analyzers. You can be damn sure if you stick with lucene-analyzers-3.0.jar that your stemmer isn't going to change behavior... no matter how much backwards stuff we try to add, this is easiest and safest on everyone. {quote} I could live with that... maybe. What guarantee is there that lucene-analyzers-3.0.jar will work with lucene-core-3.7.jar? How does that work? How can I use lucene-analyzers-3.0.jar on old indexes and lucene-analyzers-3.5.jar on newer ones within the same application? What I'd like to see is that all analyzers and their parts are kept together in an analyzer jar (maybe more than one for the honking big analyzers we have today) and that it be elevated to core. (I think contrib gives the wrong impression.) And have a well-defined compatibility policy.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857466#action_12857466 ] Robert Muir commented on LUCENE-2396: - bq. I could live with that... maybe. What guarantee is there that lucene-analyzers-3.0.jar will work with lucene-core-3.7.jar? How does that work? Well, they should work, unless lucene-core breaks backwards compatibility with analyzers! {quote} How can I use lucene-analyzers-3.0.jar on old indexes and lucene-analyzers-3.5.jar on newer ones within the same application? What I'd like to see is that all analyzers and their parts are kept together in an analyzer jar (maybe more than one for the honking big analyzers as we have today) and that it be elevated to core. (I think contrib gives the wrong impression.) And have a well-defined compatibility policy. {quote} Well, I think asking for a well-defined backwards compatibility policy for 'all analyzers' is asking too much. Things are not so simple and sorted out like they are with English/porter stemming, etc. I'll go with the flow, we can stay with what we have now, and the language support will also likely remain weak like it is now. Currently I feel it's an immense up-front effort to contribute any analysis support; it has to be near-perfect, lest it cause future problems, because it's not easy to iterate with the current situation without creating a mess. Forget about applying little patches or improvements (assuming adequately relevance-tested / sane etc)... we've really only been able to fix bugs, add tests, and reorganize analyzers, because touching them at all means you have to add backwards compat cruft.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857471#action_12857471 ] Robert Muir commented on LUCENE-2396: - bq. How can I use lucene-analyzers-3.0.jar on old indexes and lucene-analyzers-3.5.jar on newer ones within the same application? Sorry DM, I meant to respond to this too! I think this is an advanced use case that doesn't justify complex backwards compatibility layers.
Re: Proposal about Version API relaxation
I seriously don't understand the fuss around index format back compat. How often does it actually change, such that it is too much to ask that X support X-1? I prefer to have ongoing segment merging but can live w/ a manual converter tool. Thing is - I'll probably not be able to develop one myself outside the scope of Lucene because I'll miss tons of API. So having Lucene declare and adhere to it seems reasonable to me. BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer who changed the index format. It's the difference of when it happens. And I also think that a manual migration tool will need access to some lower-level API which is not exposed today, and will generally not perform as well as online migration. But that's a side note... Shai On Thursday, April 15, 2010, Earwin Burrfoot ear...@gmail.com wrote: I'd like to remind that Mike's proposal has stable branches. We can branch off preflex trunk right now and wrap it up as 3.1. Current trunk is declared as future 4.0 and all backcompat cruft is removed from it. If some new features/bugfixes appear in trunk, and they don't break stuff - we backport them to 3.x branch, eventually releasing 3.2, 3.3, etc. Thus, devs are free to work without back-compat burden, bleeding edge users get their blood, conservative users get their stability + a subset of new features from stable branches. On Thu, Apr 15, 2010 at 22:02, DM Smith dmsmith...@gmail.com wrote: On 04/15/2010 01:50 PM, Earwin Burrfoot wrote: First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index.
And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Optimize is a bad way to convert to current. 1. conversion is not guaranteed; optimizing an already-optimized index is a no-op. 2. it merges all your segments; if you use BalancedSegmentMergePolicy, that destroys your segment size distribution. A dedicated upgrade tool (available both from command-line and programmatically) is a good way to convert to current. 1. conversion happens exactly when you need it, conversion happens for sure, no additional checks needed. 2. it should leave all your segments as is, only changing their format. It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well, contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ Version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. I counter - most non-trivial applications will use their own analyzers. The more modules - the merrier. You can choose precisely what you need. By and large an analyzer is a simple wrapper for a tokenizer and some filters. Are you suggesting that most non-trivial apps write their own tokenizers and filters? I'd find that hard to believe. For example, I don't know enough Chinese, Farsi, Arabic, Polish, ... to come up with anything better than what Lucene has to tokenize, stem or filter these. Our user base are those with ancient, underpowered laptops in third-world countries. On those machines it might take 10 minutes to create an index, and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. Major Lucene releases (feature-wise, not version-wise) happen like once in a year, or year-and-a-half.
Is it that hard for your users to wait ten minutes once a year? I said that was for one index. Multiply that by the number of books available (300+) and yes, it is too much to ask. Even if a small subset is indexed, say 30, that's around 5 hours of waiting. Under consideration is the frequency of breakage. Some are suggesting a greater frequency than yearly. DM -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
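As an aside on the point above that "by and large an analyzer is a simple wrapper for a tokenizer and some filters", here is a minimal self-contained sketch of that composition. The class name, its method, and the whitespace/lowercase/stop-word chain are invented for illustration and deliberately avoid the real Lucene TokenStream API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration of "analyzer = tokenizer + filters"; not Lucene's API.
public class ToyAnalyzer {
    private final Set<String> stopWords;

    public ToyAnalyzer(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    public List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) { // tokenizer: whitespace split
            String t = token.toLowerCase(Locale.ROOT);   // filter 1: lowercase
            if (!stopWords.contains(t)) {                // filter 2: stop-word removal
                out.add(t);
            }
        }
        return out;
    }
}
```

A real Lucene analyzer streams tokens incrementally instead of materializing a list, but the wrapper structure - one tokenizer followed by a chain of filters - is the same.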
Re: Proposal about Version API relaxation
Hello, I think some compatibility breaks should really be accepted, otherwise these requirements are going to kill technological advancement: the effort in backwards compatibility will grow and be more time-consuming and harder every day. A major release won't happen every day, likely not even every year, so it seems acceptable to have milestones defining compatibility boundaries: you need to be able to reset the complexity curve occasionally. Backporting a feature would benefit from being merged in the correct testsuite, and avoid the explosion of this matrix-like backwards compatibility test suite. BTW the current testsuite is likely covering all kinds of combinations which nobody is actually using or caring about. Also, if I were to discover a nice improvement in an Analyzer, and you were telling me that to contribute it I would have to face this amount of complexity... I would think twice before trying; honestly the current requirements are scary. +1 Sanne 2010/4/15 Earwin Burrfoot ear...@gmail.com: I'd like to remind that Mike's proposal has stable branches. We can branch off preflex trunk right now and wrap it up as 3.1. Current trunk is declared as future 4.0 and all backcompat cruft is removed from it. If some new features/bugfixes appear in trunk, and they don't break stuff - we backport them to 3.x branch, eventually releasing 3.2, 3.3, etc. Thus, devs are free to work without back-compat burden, bleeding edge users get their blood, conservative users get their stability + a subset of new features from stable branches.
Re: Proposal about Version API relaxation
Converting stuff is easier than emulating; that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857487#action_12857487 ] DM Smith commented on LUCENE-2396: -- bq. Well, I think asking for a well-defined backwards compatibility policy for 'all analyzers' is asking too much. Things are not so simple and sorted out like they are with English/porter stemming, etc. Some ramblings: I think things need to change/improve wrt analyzers, tokenizers and filters. The current Version mechanism is a roadblock. So is bw compat. I get that. When I asked for a well-defined compatibility policy, I was not suggesting that we go back to the old mechanism or keep the new Version mechanism. Just a clear statement of what the policy is. It might be on a per-class basis. One mechanism that would work is versioned Java package names or class names. The current release would get the good names. If a user wanted the old jar they'd have to get it from the current release (e.g. lucene-analyzers-3.5_3.0.jar) and then change their code to use the old stuff, which now has either a new package name or a new class name. Example: trStemmer.java is going to be changed as the first breaking change since 3.0, so trStemmer3_0.java is created as a copy and then trStemmer.java is changed. The compatibility policy would be that the jar is not a drop-in replacement, but that the old classes still exist, albeit with a different name. I have worked on some contributions w/ bw compat and it is a pain. I didn't like it. And that was both pre-Version and post-Version. I'd like to see Version go away, but I'm not sure I'd like bw compat to go away. As it is, I'm resigning myself that as I use each release of Lucene, I'm going to want more from it and that is likely to require index rebuilds. Right now I'm stuck with the 2.9 series, and what happens until I upgrade to 3.x or 4.x doesn't really impact me. It will impact me then.
I'll figure out how to deal with it and suck it up. Some other things I'd like to see: * Fully controllable Unicode support. The only way I see this is if we use ICU. It will take the Java version problem out of the picture. A user would have control of the version of Unicode by their control of the version of ICU. * An analyzer construction factory that would take a spec (of fields, tokenizers, stop words, stemmers, ...) and spit out a per-field analyzer. This would allow for the deprecation of the analyzers. These and others would be more readily tackled if the bw compat policy did not get in the way. bq. I'll go with the flow, we can stay with what we have now, and the language support will also likely remain weak like it is now. You know I don't want that ;) I was suggesting that this issue should wait to see what the outcome of the general Version discussion is. Even if it is negative, perhaps this can go forward. bq. Currently I feel it's an immense up-front effort to contribute any analysis support, it has to be near-perfect lest it cause future problems, because it's not easy to iterate with the current situation without creating a mess. With new stuff, even in core, if it is marked as experimental, it is outside the bw compat policy. That gives the opportunity to iterate. Dev branches are another good way. But please, keep up the good work!
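The versioned-class-name policy sketched in the comment above could look roughly like this. The stemmer name, the version-suffix convention, and the stemming rules are all hypothetical, chosen only to show the "frozen copy plus evolving current class" shape:

```java
// Hypothetical illustration of the versioned-class-name policy: before the
// first breaking change since 3.0, the old behavior is copied to a
// version-suffixed class, and the unsuffixed class is then free to change.
// "XxStemmer" and both stemming rules are invented for this example.

// Frozen copy: preserves the 3.0 behavior under a new name.
class XxStemmer3_0 {
    public String stem(String word) {
        // old rule: strip only a trailing "s"
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }
}

// Current class: keeps the good name, carries the new (breaking) behavior.
class XxStemmer {
    public String stem(String word) {
        // new rule: also strip a trailing "es"
        if (word.endsWith("es")) return word.substring(0, word.length() - 2);
        if (word.endsWith("s"))  return word.substring(0, word.length() - 1);
        return word;
    }
}
```

Under this policy the jar is not a drop-in replacement: a user who needs the old output must change their code to reference the suffixed class.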
Re: Proposal about Version API relaxation
On 04/15/2010 03:04 PM, Earwin Burrfoot wrote: Converting stuff is easier than emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. Will it be able to be used within a client application that creates and uses local indexes? I'm assuming it will be faster than re-indexing.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857490#action_12857490 ] Robert Muir commented on LUCENE-2396: - {quote} One mechanism that would work is versioned Java package names or class names. The current release would get the good names. If a user wanted the old jar they'd have to get it from the current release (e.g. lucene-analyzers-3.5_3.0.jar) and then change their code to use the old stuff which now has either a new package name or a new class name. Example, trStemmer.java is going to be changed as the first breaking change since 3.0, so trStemmer3_0.java is created as a copy and then trStemmer.java is changed. {quote} Right, but I don't think Lucene should manage this. I think if we assume normally versioned releases, a user with a really complex case that needs multiple versions of Lucene working in the same JVM, like you, could use some other tool (eclipse refactor or maybe google's jarjar) to rename things?
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 23:07, DM Smith dmsmith...@gmail.com wrote: Will it be able to be used within a client application that creates and uses local indexes? I'm assuming it will be faster than re-indexing. As I said earlier in the topic, it is obvious the tool has to have both programmatic and command-line interfaces. I will also reiterate - it only upgrades the index structurally. If you changed your analyzers - that's your problem and you have to deal with it.
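A rough sketch of what a dual-interface upgrade tool along these lines might look like; the Segment model, format numbers, and method names are all invented, and a real tool would rewrite segment files on disk rather than flip an integer:

```java
import java.util.List;

// Hypothetical sketch of the upgrade tool described above: one migration
// routine, callable programmatically or via main(). Only the structure of
// the idea is shown; nothing here is the real Lucene API.
public class UpgradeTool {
    static final int CURRENT_FORMAT = 4; // made-up format number

    static class Segment {
        final String name;
        int format;
        Segment(String name, int format) { this.name = name; this.format = format; }
    }

    // Programmatic interface: upgrade old-format segments in place,
    // leaving the segment structure (count, sizes) untouched - no merging.
    public static int upgrade(List<Segment> segments) {
        int converted = 0;
        for (Segment s : segments) {
            if (s.format < CURRENT_FORMAT) {
                s.format = CURRENT_FORMAT; // stands in for rewriting the files
                converted++;
            }
        }
        return converted;
    }

    // Command-line interface: the same routine behind a main().
    public static void main(String[] args) {
        System.out.println("index dir: " + (args.length > 0 ? args[0] : "."));
    }
}
```

The key property argued for in the thread is that upgrade() never merges: each segment keeps its identity and size distribution, which optimize() would destroy.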
Re: Proposal about Version API relaxation
Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... First, I would need to implement some code to reindex, because my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve it all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: Converting stuff is easier than emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things.
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857498#action_12857498 ] DM Smith commented on LUCENE-2396: -- {quote} Right, but I don't think Lucene should manage this. I think if we assume normally versioned releases, a user with a really complex case that needs multiple versions of Lucene working in the same JVM, like you, could use some other tool (eclipse refactor or maybe google's jarjar) to rename things? {quote} I can go along with this. I still think it might be good to let the dust settle on the general Version question before committing.
Re: Proposal about Version API relaxation
But seriously... are you moving across major Lucene releases every single day? If you are using 3.x, how does it hurt you if there is a version 4.x that you can't use without re-indexing? Why wouldn't you just stay on your stable branch (say 3.x)? 2010/4/15 jm jmugur...@gmail.com: Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
There's absolutely no, zero, nada, way to use a modified/fixed analyzer stack without reindexing. If you want it - reindex; if you don't - stick with the stable branch. If your stack is unchanged, but the index format changes - upgrade it with the proposed tool and be happy. Speaking as a happy plain user whose indexes take two days to be fully rebuilt, and who does it (though not always in full) at least once a month.
Re: Proposal about Version API relaxation
On 04/15/2010 03:12 PM, Earwin Burrfoot wrote: On Thu, Apr 15, 2010 at 23:07, DM Smith dmsmith...@gmail.com wrote: On 04/15/2010 03:04 PM, Earwin Burrfoot wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer who changed the index format. It's the difference of when it happens. Converting stuff is easier than emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. Will it be able to be used within a client application that creates and uses local indexes? I'm assuming it will be faster than re-indexing. As I said earlier in the topic, it is obvious the tool has to have both programmatic and command-line interfaces. I will also reiterate - it only upgrades the index structurally. If you changed your analyzers - that's your problem and you have to deal with it. Good. (Sorry I missed that. There's just too much in the thread to keep track of ;) As long as my old analyzers will still work with the new lucene-core jar, I'm fat, dumb and happy with the upgraded index.
Re: Proposal about Version API relaxation
The reason, Earwin, why online migration is faster is because when u finally need to *fully* migrate your index, chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated. During the lifecycle of the index, segments are merged anyway, so migrating them on the fly virtually costs nothing. At the end, when u upgrade to a Lucene version which doesn't support the previous index format, you'll in the worst case need to migrate a few large segments which were never merged. I don't know how many of those there will be as it really depends on the application, but I'd bet this process will touch just a few segments. And hence, throughput-wise it will be a lot faster. We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. Shai On Thursday, April 15, 2010, jm jmugur...@gmail.com wrote: Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... First, i would need to implement some code to reindex, cause my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours.
At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer who changed the index format. It's the difference of when it happens. Converting stuff is easier than emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
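[Editor's note: a minimal sketch of the on-the-fly migration idea discussed above. The migrate() method is only a proposal in this thread, and the Segment class and format fields here are invented for illustration -- they are not Lucene's real internals. The point it shows: only segments still on an old format get rewritten, so an index whose segments were already rewritten by ordinary merges has little left to do.]

```java
import java.util.ArrayList;
import java.util.List;

public class MigrateSketch {
    static final int CURRENT_FORMAT = 4;

    public static class Segment {
        public final String name;
        public int formatVersion;
        public Segment(String name, int formatVersion) {
            this.name = name;
            this.formatVersion = formatVersion;
        }
    }

    /** Rewrites only stale segments; returns the names of segments touched. */
    public static List<String> migrate(List<Segment> segments) {
        List<String> rewritten = new ArrayList<>();
        for (Segment s : segments) {
            if (s.formatVersion < CURRENT_FORMAT) {
                // A real implementation would rewrite this segment's files in
                // the current format (essentially a singleton "merge").
                s.formatVersion = CURRENT_FORMAT;
                rewritten.add(s.name);
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        List<Segment> index = new ArrayList<>();
        index.add(new Segment("_0", 3)); // old large segment, never merged
        index.add(new Segment("_5", 4)); // already current via normal merging
        index.add(new Segment("_7", 3)); // another stale segment
        System.out.println(migrate(index)); // prints [_0, _7]
    }
}
```

Contrast with optimize(): that would merge everything into one segment, destroying the segment-size distribution, whereas this touches only the stale segments and leaves the rest alone.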
Re: Proposal about Version API relaxation
On 04/15/2010 03:25 PM, Shai Erera wrote: We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. What about an index that has already called optimize()? I presume it will be upgraded with whatever is decided?
RE: Proposal about Version API relaxation
Hi Earwin, I am strongly +1 on this. I would also volunteer as Release Manager for 3.1, if nobody else wants to do this. I would like to take the preflex tag or some revisions before (maybe without the IndexWriterConfig, which is a really new API) to be the 3.1 branch. And after that port some of my post-flex changes like the StandardTokenizer refactoring back (so we can produce the old analyzer still without Java 1.4). So +1 on branching pre-flex and releasing it as 3.1 soon. The Unicode improvements justify a new release. I think also s1monw wants to have this. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Earwin Burrfoot [mailto:ear...@gmail.com] Sent: Thursday, April 15, 2010 8:15 PM To: java-dev@lucene.apache.org Subject: Re: Proposal about Version API relaxation I'd like to remind everyone that Mike's proposal has stable branches. We can branch off preflex trunk right now and wrap it up as 3.1. Current trunk is declared as future 4.0 and all backcompat cruft is removed from it. If some new features/bugfixes appear in trunk, and they don't break stuff - we backport them to the 3.x branch, eventually releasing 3.2, 3.3, etc. Thus, devs are free to work without back-compat burden, bleeding edge users get their blood, conservative users get their stability + a subset of new features from stable branches. On Thu, Apr 15, 2010 at 22:02, DM Smith dmsmith...@gmail.com wrote: On 04/15/2010 01:50 PM, Earwin Burrfoot wrote: First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index. And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Optimize is a bad way to convert to current. 1. conversion is not guaranteed, optimizing an already optimized index is a no-op 2. it merges all your segments.
if you use BalancedSegmentMergePolicy, that destroys your segment size distribution Dedicated upgrade tool (available both from command-line and programmatically) is a good way to convert to current. 1. conversion happens exactly when you need it, conversion happens for sure, no additional checks needed 2. it should leave all your segments as is, only changing their format It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. I counter - most non-trivial applications will use their own analyzers. The more modules - the merrier. You can choose precisely what you need. By and large an analyzer is a simple wrapper for a tokenizer and some filters. Are you suggesting that most non-trivial apps write their own tokenizers and filters? I'd find that hard to believe. For example, I don't know enough Chinese, Farsi, Arabic, Polish, ... to come up with anything better than what Lucene has to tokenize, stem or filter these. Our user base are those with ancient, underpowered laptops in 3-rd world countries. On those machines it might take 10 minutes to create an index and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. Major Lucene releases (feature-wise, not version-wise) happen like once in a year, or year-and-a-half. Is it that hard for your users to wait ten minutes once a year? I said that was for one index. Multiply that times the number of books available (300+) and yes, it is too much to ask. Even if a small subset is indexed, say 30, that's around 5 hours of waiting. 
Under consideration is the frequency of breakage. Some are suggesting a greater frequency than yearly. DM -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
2010/4/15 Shai Erera ser...@gmail.com: The reason Earwin why online migration is faster is because when u finally need to *fully* migrate your index, most chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated. During the lifecycle of the index, segments are merged anyway, so migrating them on the fly virtually costs nothing. At the end, when u upgrade to a Lucene version which doesn't support the previous index format, you'll on the worse case need to migrate few large segments which were never merged. I don't know how many of those there will be as it really depends on the application, but I'd bet this process will touch just a few segments. And hence, throughput wise it will be a lot faster. We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. We should not create such an API on IW, and we should build the offline migration tool as a separate thing :) Because otherwise we have to keep all back-compat stuff within IW, SR and friends as it is. Look at current SegmentReader.Norm code - there are three freaking places they can be loaded from. I will also reiterate the issue of the API. Fat index changes are almost certainly accompanied by API changes. With online migration we have to emulate new APIs over old segments, which is really cumbersome. With offline migration we only need to be able to read said segments in one manner or another. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857507#action_12857507 ] Robert Muir commented on LUCENE-2396: - bq. I can go along with this. Cool! bq. I still think it might be good to let the dust settle on the general Version question before committing. Sure... but we should still remember there's really no back compat for the stuff changed in this patch :) I'm also glad you mentioned the unicode issue, i mean if you are doing non-English, some of the ideas in lucene's back compat with analyzers are basically downright silly at the end of the day. Besides the fact that upgrading your JVM can cause java itself to treat text differently (which we currently cannot control), changes to the users operating system [potentially completely outside of the scope of your application!] can cause 'searches that worked before to not work anymore'. For example, if your users upgrade and their new input method generates U+09CE instead of U+09A4 U+09CD U+200D for Khanda-ta, the search won't match, even though perhaps they typed the exact same key on their keyboard. Unicode normalization does nothing in this case, and its your app's responsibility to be aware of stuff like this (Not Lucene's analyzers!) and deal with them. At the end of the day, I think a lot of what lucene considers our own backwards compatibility responsibility necessarily belongs in the app instead. {noformat} Versions of the Unicode Standard prior to Version 4.1 recommended that khanda ta be represented as the sequence U+09A4 bengali letter ta, U+09CD bengali sign virama, U+200D zero width joiner in all circumstances. U+09CE bengali letter khanda ta should instead be used explicitly in newly generated text, but users are cautioned that instances of the older representation may exist. {noformat} remove version from contrib/analyzers. 
-- Key: LUCENE-2396 URL: https://issues.apache.org/jira/browse/LUCENE-2396 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-2396.patch Contrib/analyzers has no backwards-compatibility policy, so let's remove Version so the API is consumable. If you think we shouldn't do this, then instead explicitly state and vote on what the backwards compatibility policy for contrib/analyzers should be, or move it all to core. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Proposal about Version API relaxation
Unfortunately, live searching against an old index can get very hairy. EG look at what I had to do for the flex API on pre-flex indexes (the flex emulation layer). It's also not great because it gives the illusion that all is good, yet, you've taken a silent hit (up to ~10% or so) in your search perf. Whereas building and maintaining a one-time index migration tool, in contrast, is much less work. I realize the migration tool has issues -- it fixes the hard changes but silently allows the soft changes to break (ie, your analyzers may not produce the same tokens, until we move all core analyzers outside of core, so they are separately versioned), but it seems like a good compromise here? Mike 2010/4/15 Shai Erera ser...@gmail.com: The reason Earwin why online migration is faster is because when u finally need to *fully* migrate your index, most chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated.
It would not be 10 min for one of our installations for sure... First, i would need to implement some code to reindex, cause my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer which changed the index format. It's the difference of when does it happen. Converting stuff is easier then emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
2010/4/15 Michael McCandless luc...@mikemccandless.com I realize the migration tool has issues -- it fixes the hard changes but silently allows the soft changes to break (ie, your analyzers may not produce the same tokens, until we move all core analyzers outside of core, so they are separately versioned), but it seems like a good compromise here? Well, let's consider doing that too. Since analyzers have this tough problem of being soft changes, I propose the following: 1. get rid of version 2. minimize the interface between the indexer and analysis 3. put analyzers in their own versioned jar files. This way, we could provide a realistic capability for users to use lucene-3.5.jar with lucene-3.2-analyzers.jar, and possibly have STRONGER analyzer back compat (e.g. if we minimize the damn thing enough, perhaps very old analyzers.jar's could even work across major releases). It's also much safer when you are using the same bytecodes you used before, instead of hairy back compat layers. I don't refer to Uwe's code here: it's perfect, but we can't force Uwe into writing the back compat for every big feature. -- Robert Muir rcm...@gmail.com
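[Editor's note: a sketch of point 2 above -- "minimize the interface between the indexer and analysis" -- under one assumption of what that minimized contract could look like. All names here are invented for illustration; Lucene's real TokenStream API is richer than this. The idea: if the indexer sees nothing but a next-token method, an analyzer jar compiled against an old release has almost no surface area to break.]

```java
import java.util.ArrayList;
import java.util.List;

public class MinimalAnalysisSketch {
    /** Hypothetical minimal contract: the only thing the indexer ever calls. */
    public interface SimpleTokenStream {
        String next(); // next token text, or null when exhausted
    }

    /** A trivial whitespace tokenizer implementing just that contract. */
    public static SimpleTokenStream whitespaceTokens(String text) {
        final String[] parts = text.trim().isEmpty()
            ? new String[0] : text.trim().split("\\s+");
        return new SimpleTokenStream() {
            private int i = 0;
            public String next() { return i < parts.length ? parts[i++] : null; }
        };
    }

    /** Stands in for the indexer: consumes tokens through the minimal API only. */
    public static List<String> index(SimpleTokenStream ts) {
        List<String> terms = new ArrayList<>();
        for (String t = ts.next(); t != null; t = ts.next()) terms.add(t);
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(index(whitespaceTokens("hello   lucene  world")));
        // prints [hello, lucene, world]
    }
}
```

Because the indexer depends on nothing but SimpleTokenStream, any jar shipping an implementation of it -- however old -- keeps working, which is the "lucene-3.5.jar with lucene-3.2-analyzers.jar" scenario from the mail.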
Re: Proposal about Version API relaxation
From IRC: why do I get the feeling that everyone is in heated agreement on the Version thread? there are some cases that mean people will have to reindex in those cases, we should tell people they will have to reindex then they can decide to upgrade or not all other cases, just do the sensible thing and test first I have yet to meet anyone who simply drops a new version into production and says go So, as I said earlier, why don't we just move forward with it, strive to support reading X-1 index format in X and let the user know the cases in which they will have to re-index. If a migration tool is necessary, then someone can write it at the appropriate time. Just as was said w/ the Solr merge, it's software. If it doesn't work, we can change it. Thank goodness we don't have a back compatibility policy for our policies! -Grant On Apr 15, 2010, at 3:35 PM, Michael McCandless wrote: Unfortunately, live searching against an old index can get very hairy. EG look at what I had to do for the flex API on pre-flex index flex emulation layer. It's also not great because it gives the illusion that all is good, yet, you've taken a silent hit (up to ~10% or so) in your search perf. Whereas building maintaining a one-time index migration tool, in contrast, is much less work. I realize the migration tool has issues -- it fixes the hard changes but silently allows the soft changes to break (ie, your analyzers my not produce the same tokens, until we move all core analyzers outside of core, so they are separately versioned), but it seems like a good compromise here? Mike 2010/4/15 Shai Erera ser...@gmail.com: The reason Earwin why online migration is faster is because when u finally need to *fully* migrate your index, most chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated. 
During the lifecycle of the index, segments are merged anyway, so migrating them on the fly virtually costs nothing. At the end, when u upgrade to a Lucene version which doesn't support the previous index format, you'll on the worse case need to migrate few large segments which were never merged. I don't know how many of those there will be as it really depends on the application, but I'd bet this process will touch just a few segments. And hence, throughput wise it will be a lot faster. We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. Shai On Thursday, April 15, 2010, jm jmugur...@gmail.com wrote: Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... First, i would need to implement some code to reindex, cause my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer which changed the index format. It's the difference of when does it happen. Converting stuff is easier then emulating, that's exactly why I want a separate tool. 
There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
I think this should split off the mega-thread :) On Thu, Apr 15, 2010 at 23:28, Uwe Schindler u...@thetaphi.de wrote: Hi Earwin, I am strongly +1 on this. I would also make the Release Manager for 3.1, if nobody else wants to do this. I would like to take the preflex tag or some revisions before (maybe without the IndexWriterConfig, which is a really new API) to be 3.1 branch. And after that port some of my post-flex-changes like the StandardTokenizer refactoring back (so we can produce the old analyzer still without Java 1.4). So +1 on branching pre-flex and release as 3.1 soon. The Unicode improvements rectify a new release. I think also s1monw wants to have this. Uwe -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
On Thu, 15 Apr 2010, Robert Muir wrote: 2010/4/15 Michael McCandless luc...@mikemccandless.com I realize the migration tool has issues -- it fixes the hard changes but silently allows the soft changes to break (ie, your analyzers my not produce the same tokens, until we move all core analyzers outside of core, so they are separately versioned), but it seems like a good compromise here? Well, lets consider doing that too. Since analyzers have this tough problem of being soft changes, I propose the following: 1. get rid of version 2. minimize the interface between the indexer and analysis 3. put analyzers in their own versioned jar files. Yes, every analyzer needs to have its own version and thus, jar file. Putting all analyzers into one versioned jar file joins them at the hip and suffers from the same versioning and compat problems we're currently facing in core. Andi.. this way, we could provide a realistic capability for users to use lucene-3.5.jar with lucene-3.2-analyzers.jar, and possibly have STRONGER analyzer back compat (e.g. if we minimize the damn thing enough, perhaps very old analyzers.jar's could even work across major releases). its also much safer when you are using the same bytecodes you used before, instead of hairy back compat layers. I don't refer to Uwe's code here: its perfect, but we cant force Uwe into writing the back compat for every big feature. -- Robert Muir rcm...@gmail.com
RE: Proposal about Version API relaxation
I wish we could have a face to face talk like in the evenings at ApacheCon :( Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Thursday, April 15, 2010 9:46 PM To: java-dev@lucene.apache.org Subject: Re: Proposal about Version API relaxation From IRC: why do I get the feeling that everyone is in heated agreement on the Version thread? there are some cases that mean people will have to reindex in those cases, we should tell people they will have to reindex then they can decide to upgrade or not all other cases, just do the sensible thing and test first I have yet to meet anyone who simply drops a new version into production and says go So, as I said earlier, why don't we just move forward with it, strive to support reading X-1 index format in X and let the user know the cases in which they will have to re-index. If a migration tool is necessary, then someone can write it at the appropriate time. Just as was said w/ the Solr merge, it's software. If it doesn't work, we can change it. Thank goodness we don't have a back compatibility policy for our policies! -Grant On Apr 15, 2010, at 3:35 PM, Michael McCandless wrote: Unfortunately, live searching against an old index can get very hairy. EG look at what I had to do for the flex API on pre-flex index flex emulation layer. It's also not great because it gives the illusion that all is good, yet, you've taken a silent hit (up to ~10% or so) in your search perf. Whereas building maintaining a one-time index migration tool, in contrast, is much less work. I realize the migration tool has issues -- it fixes the hard changes but silently allows the soft changes to break (ie, your analyzers my not produce the same tokens, until we move all core analyzers outside of core, so they are separately versioned), but it seems like a good compromise here? 
Mike 2010/4/15 Shai Erera ser...@gmail.com: The reason Earwin why online migration is faster is because when u finally need to *fully* migrate your index, most chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated. During the lifecycle of the index, segments are merged anyway, so migrating them on the fly virtually costs nothing. At the end, when u upgrade to a Lucene version which doesn't support the previous index format, you'll on the worse case need to migrate few large segments which were never merged. I don't know how many of those there will be as it really depends on the application, but I'd bet this process will touch just a few segments. And hence, throughput wise it will be a lot faster. We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. Shai On Thursday, April 15, 2010, jm jmugur...@gmail.com wrote: Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... First, i would need to implement some code to reindex, cause my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. 
At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer which changed the index format. It's the difference of when does it happen. Converting stuff is easier then emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
3. put analyzers in their own versioned jar files. Yes, every analyzer needs to have its own version and thus, jar file. Putting all analyzers into one versioned jar file joins them at the hip and suffers from the same versioning and compat problems we're currently facing in core. Andi.. that was actually a typo, sorry :) But maybe not a bad idea for the future. For now simply moving analyzers to its own jar file would be a great step! -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 3:50 PM, Robert Muir rcm...@gmail.com wrote: for now simply moving analyzers to its own jar file would be a great step! +1 -- why not consolidate all analyzers now? (And fix the indexer to require a minimal API = TokenStream minus reset/close.) Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
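Mike's suggested "minimal API = TokenStream minus reset/close" could look roughly like the following sketch: the indexer would only require the ability to step through tokens, nothing more. The interface and names here are illustrative assumptions for this sketch, not Lucene's actual TokenStream API.

```java
import java.util.*;

public class MinimalTokenSource {
    // The minimal contract the indexer would consume: just token iteration,
    // with no reset() and no close().
    interface TokenSource {
        String nextToken(); // returns null when the stream is exhausted
    }

    // A trivial whitespace tokenizer implementing only the minimal contract.
    static TokenSource whitespace(String text) {
        Iterator<String> it = Arrays.asList(text.trim().split("\\s+")).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        TokenSource ts = whitespace("hello lucene world");
        List<String> tokens = new ArrayList<>();
        for (String t = ts.nextToken(); t != null; t = ts.nextToken()) {
            tokens.add(t);
        }
        System.out.println(tokens);
    }
}
```

The appeal of such a narrow interface is that analyzers shipped in a separate jar only need to satisfy this one method to feed the indexer, decoupling their release cycle from core.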
Re: Proposal about Version API relaxation
+1 on the Analyzers as well. Earwin, I think I don't mind if we introduce migrate() elsewhere rather than on IW. What I meant to say is that if we stick w/ index format back-compat and ongoing migration, then such a method would be useful on IW for customers to call to ensure they're on the latest version. But if the majority here agree w/ a standalone tool, then I'm ok if it sits elsewhere. Grant, I'm all for 'just doing it and seeing what happens'. But I think we need to at least decide what we're going to do, so it's clear to everyone. Because I'd like to know, if I'm about to propose an index format change, whether I need to build a migration tool or not. Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them being asked to provide such a tool? That's to me a policy we should decide on ... whatever the consequences. But +1 for changing something! Analyzers first, API second. Shai On Thu, Apr 15, 2010 at 10:52 PM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Apr 15, 2010 at 3:50 PM, Robert Muir rcm...@gmail.com wrote: for now simply moving analyzers to its own jar file would be a great step! +1 -- why not consolidate all analyzers now? (And fix the indexer to require a minimal API = TokenStream minus reset/close.) Mike
Re: Proposal about Version API relaxation
On Apr 15, 2010, at 4:21 PM, Shai Erera wrote: +1 on the Analyzers as well. Earwin, I think I don't mind if we introduce migrate() elsewhere rather than on IW. What I meant to say is that if we stick w/ index format back-compat and ongoing migration, then such a method would be useful on IW for customers to call to ensure they're on the latest version. But if the majority here agree w/ a standalone tool, then I'm ok if it sits elsewhere. Grant, I'm all for 'just doing it and seeing what happens'. But I think we need to at least decide what we're going to do, so it's clear to everyone. Because I'd like to know, if I'm about to propose an index format change, whether I need to build a migration tool or not. Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them being asked to provide such a tool? That's to me a policy we should decide on ... whatever the consequences. As I said, we should strive for index compatibility, but even in the past, we said we did, and yet the implications weren't always clear. I think index compatibility is very important. I've seen plenty of times where reindexing is not possible. But even then, you still have the option of testing to find out whether you can update or not. If you can't update, then don't, until you can figure out how to do it. FWIW, I think our approach is much more proactive than 'see what happens'. I'd argue that in the past, our approach was 'see what happens', only the seeing didn't happen until after the release! -Grant
Re: Proposal about Version API relaxation
On Thu, Apr 15, 2010 at 4:21 PM, Shai Erera ser...@gmail.com wrote: Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them being asked to provide such a tool? That's to me a policy we should decide on ... whatever the consequences. Just look at the 1.8MB of backwards-compat code in contrib/analyzers I want to remove in LUCENE-2396? Are you serious? I wrote most of that cruft to prevent reindexing, and you are trying to say I don't understand the fuss about it? We shouldn't make people reindex, but we should have the chance, even if we only do it ONE TIME, to reset Lucene to a new major version that has a bunch of stuff fixed we couldn't fix before, and more flexibility. Because with the current policy, it's like we are in 1.x forever - our version numbers are a joke! -- Robert Muir rcm...@gmail.com