DO NOT REPLY [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()

2005-04-26 Thread bugzilla
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=31841>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=31841


[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #14784|0                           |1
        is obsolete|                            |




--- Additional Comments From [EMAIL PROTECTED]  2005-04-26 09:34 ---
Created an attachment (id=14841)
 --> (http://issues.apache.org/bugzilla/attachment.cgi?id=14841&action=view)
Additional patch for deprecation issue - corrected

This was just an oversight. I've replaced the remaining calls to query.weight()
in Searcher with Searcher.createWeight() and corrected that method so that it
calls query.weight() now.

-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: svn commit: r164695 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/Hit.java src/java/org/apache/lucene/search/HitIterator.java src/java/org/apache/lucene/search/Hits.java s

2005-04-26 Thread Jeremy Rayner
On 4/26/05, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Also, is "a future for a hit" a typo, or does that actually mean
> something?  This makes me think of Python's "future", but I'm not sure
> what this means in this context.

My feeling originally was that as the obtaining of the document 
was expensive, a Hit should be a bit like the 'Future Value' pattern,
where a Hit is just a promise to delve into Hits with a certain index
at some point in the future.
( see http://c2.com/cgi/wiki?FutureValue )  
Which interestingly enough now seems to be implemented in Doug Lea's
changes for Java 5
( http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/Future.html )

Although without the asynchronous element, I guess it is just lazy
initialization.

An alternative implementation of Hit could be a 'Virtual Proxy' (GoF, p. 207)
that stores a delegate, either a FutureHit or an ActualHit. The FutureHit
would be the starting delegate, but after the first call the delegate
reference is swapped over to an ActualHit.  This would eliminate the check
of 'resolved' at the start of each method, and therefore increase
performance.  However, it would incur the memory overhead of having three
classes instead of one.
So it's a trade-off between better performance and lower memory usage.
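
The two variants described above can be sketched roughly as follows. This is only an illustration, not the committed Hit class: DocSource, FlagHit and ProxyHit are hypothetical names, a String stands in for a Document, and DocSource plays the role of Hits.

```java
// Hypothetical stand-in for Hits: loading a document is the expensive step.
interface DocSource {
    String doc(int index);
}

// Variant 1: a single class that checks a 'resolved' flag on every call.
final class FlagHit {
    private final DocSource source;
    private final int index;
    private String doc;
    private boolean resolved;

    FlagHit(DocSource source, int index) {
        this.source = source;
        this.index = index;
    }

    String get() {
        if (!resolved) {               // this check runs on every call
            doc = source.doc(index);
            resolved = true;
        }
        return doc;
    }
}

// Variant 2: a virtual proxy that starts with a "future" delegate and swaps
// in an "actual" delegate after the first call.
final class ProxyHit {
    private interface Delegate {
        String get();
    }

    private Delegate delegate;

    ProxyHit(final DocSource source, final int index) {
        delegate = new Delegate() {            // plays the FutureHit role
            public String get() {
                final String doc = source.doc(index);
                delegate = new Delegate() {    // plays the ActualHit role
                    public String get() {
                        return doc;
                    }
                };
                return doc;
            }
        };
    }

    String get() {
        return delegate.get();         // no flag check, only a dispatch
    }
}
```

Both variants load the document at most once; the proxy trades the per-call flag check for two extra small classes.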

Thanks for allowing this change, it has now turned my previous Groovy example
( http://javanicus.com/blog2/items/178-index.html ) from

for ( i in 0 ..< hits.length() ) {
println(hits.doc(i)["filename"])
}

into

hits.each{
println(it.filename)
}

which leaves far fewer chances for typos :-)

jez.
-- 
http://javanicus.com/blog2




Re: SortTest failing

2005-04-26 Thread Wolf Siberski
Chuck Williams wrote:
> Otis Gospodnetic wrote:
> > And this makes me think that this broke during my last commit of Wolf's
> > patch for MultiSearcher and docFreq stuff.  However I did run 'ant
> > test' before commit and did see BUILD SUCCESSFUL, so I'm not 100% sure.
> > Anyone else seeing this error?
>
> Otis, I think that is a bug in the test case that Wolf's patch has
> exposed. It creates an index consisting of two copies of "full". This
> should be equivalent to an index where every document is duplicated
> (occurs twice), not to an index where each document occurs only once.
> With the new correct idf normalization, this increases all the docFreqs
> for the terms, which changes the score. Scores on the duplicated index
> should not be the same as scores on the unduplicated index.

Exactly. The test relied on the 'wrong' scoring of the previous MultiSearcher
implementation, and thus fails now. For some reason, the test didn't fail
when I submitted the patch. A correction is included as part of the addition
I submitted recently (http://issues.apache.org/bugzilla/show_bug.cgi?id=31841).

--Wolf
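
A small worked example of why the scores differ once docFreq is computed over the combined index. It assumes the classic default idf formula, ln(numDocs / (docFreq + 1)) + 1; IdfDemo is a hypothetical name and the document counts are made up for illustration.

```java
final class IdfDemo {
    // Assumed to mirror the classic default similarity:
    // idf = ln(numDocs / (docFreq + 1)) + 1
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // One copy of the index: the term occurs in 10 of 100 documents.
        double single = idf(10, 100);
        // Two copies searched as one: 20 of 200 documents.
        double duplicated = idf(20, 200);
        System.out.println("single copy idf:      " + single);
        System.out.println("duplicated index idf: " + duplicated);
    }
}
```

Because of the +1 in the denominator, doubling both counts does not cancel out, so the idf (and therefore the score) on the duplicated index is not the same as on a single copy.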


Re: UnscoredRangeQuery

2005-04-26 Thread Yonik Seeley
> ConstantScoreQuery would seem better, with the addition of the constant
> score value as a constructor argument.

OK, I changed the names to ConstantScoreQuery and ConstantScoreRangeQuery.

Should I add a constantScore field to the class, or just rely on the
boost (having the user call setBoost() if they want to change the
weight, or setting it myself in the constructor)?

-Yonik




DO NOT REPLY [Bug 34629] New: - Play with term postings or .. to a easy way to update

2005-04-26 Thread bugzilla

http://issues.apache.org/bugzilla/show_bug.cgi?id=34629

   Summary: Play with term postings or .. to a easy way to update
   Product: Lucene
   Version: 1.4
  Platform: All
OS/Version: other
Status: NEW
  Severity: enhancement
  Priority: P2
 Component: Index
AssignedTo: lucene-dev@jakarta.apache.org
ReportedBy: [EMAIL PROTECTED]


With this contribution you can add, delete or replace term/document relations.

A use case is very fast updating of documents. Example: updating 1 million
documents containing 2 fields takes a few seconds (see test case).




DO NOT REPLY [Bug 34629] - Play with term postings or .. to a easy way to update

2005-04-26 Thread bugzilla

http://issues.apache.org/bugzilla/show_bug.cgi?id=34629





--- Additional Comments From [EMAIL PROTECTED]  2005-04-26 18:29 ---
Created an attachment (id=14844)
 --> (http://issues.apache.org/bugzilla/attachment.cgi?id=14844&action=view)
the source 





DO NOT REPLY [Bug 34629] - [PATHC] Play with term postings or .. to a easy way to update

2005-04-26 Thread bugzilla

http://issues.apache.org/bugzilla/show_bug.cgi?id=34629


[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Play with term postings or  |[PATHC] Play with term
                   |.. to a easy way to update  |postings or .. to a easy way
                   |                            |to update







a new way to update without delete+add

2005-04-26 Thread Nicolas Maisonneuve
Hi,

I made a contribution in Bugzilla that allows manipulating the
posting lists: add, delete, or replace term/document relations.

One of the problems in Lucene is updating documents.
With this patch, you can update documents very quickly (see test case).

http://issues.apache.org/bugzilla/show_bug.cgi?id=34563

1 - This patch hasn't been tested much, so bugs and possible optimizations
may remain (but the idea is sound).
I added it to Bugzilla so that developers can fix any bugs that are found.

2 - This patch doesn't support term vectors.

3 - This patch works for optimized, non-compound indexes.

4 - It is only for updating existing documents, not for creating new ones.


Nicolas Maisonneuve




DO NOT REPLY [Bug 34629] - [PATCH] Play with term postings or .. to a easy way to update

2005-04-26 Thread bugzilla

http://issues.apache.org/bugzilla/show_bug.cgi?id=34629


[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[PATHC] Play with term      |[PATCH] Play with term
                   |postings or .. to a easy way|postings or .. to a easy way
                   |to update                   |to update







Re: a new way to update without delete+add

2005-04-26 Thread Erik Hatcher
On Apr 26, 2005, at 1:38 PM, Nicolas Maisonneuve wrote:
> One of the pb in Lucene is the updating of document.
> With this patch, you can update  documents  very quickly. (see test
> case)
>
> http://issues.apache.org/bugzilla/show_bug.cgi?id=34563

This is not the link you meant.  Here's the one you just submitted:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34629

> 1 -this patch wasn't tested a lot, so bugs/optimization can be found.
> (but the idea is ok).
> I added it in the bugzilla site so that developpers could fix it if
> some bugs are found.
> 2- this patch doesn't support Term Vector
> 3-  this patch works for optimized and no compounded   index .
> 4- it's just to update doc, no for create new doc
> Nicolas Maisonneuve


Re: broken compilation

2005-04-26 Thread Daniel Naber
On Friday 22 April 2005 05:29, Erik Hatcher wrote:

> I don't normally run this target, but one of our deprecated tests has a
> compilation issue.  I haven't researched when this broke, but could
> someone fix this please?

I fixed it. TermInfosTest is, however, not a real JUnit test case, so I 
wonder how useful it is at all...

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: svn commit: r164695 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/Hit.java src/java/org/apache/lucene/search/HitIterator.java src/java/org/apache/lucene/search/Hits.java s

2005-04-26 Thread Daniel Naber
On Tuesday 26 April 2005 02:21, [EMAIL PROTECTED] wrote:

> +  public String toString() {
> +    try {
> +      return getDocument().toString();
> +    } catch (IOException e) {
> +      return null;
> +    }
> +  }

Wouldn't it be better here to re-throw the exception as a RuntimeException?

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: broken compilation

2005-04-26 Thread Erik Hatcher
On Apr 26, 2005, at 2:02 PM, Daniel Naber wrote:
> On Friday 22 April 2005 05:29, Erik Hatcher wrote:
> > I don't normally run this target, but one of our deprecated tests has
> > a compilation issue.  I haven't researched when this broke, but could
> > someone fix this please?
>
> I fixed it. TermInfosTest is, however, not a real JUnit test case, so I
> wonder how useful it is at all...

I'm curious - did your fix change the code to go against a new API?  In
other words, is there something that has changed that breaks API
compatibility?

Erik


Re: svn commit: r164695 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/Hit.java src/java/org/apache/lucene/search/HitIterator.java src/java/org/apache/lucene/search/Hits.java s

2005-04-26 Thread Erik Hatcher
On Apr 26, 2005, at 2:38 PM, Daniel Naber wrote:
> On Tuesday 26 April 2005 02:21, [EMAIL PROTECTED] wrote:
> > +  public String toString() {
> > +    try {
> > +      return getDocument().toString();
> > +    } catch (IOException e) {
> > +      return null;
> > +    }
> > +  }
>
> Wouldn't it be better here to re-throw the exception as a
> RuntimeException?

I don't know would it?  I have no preference, though it seems ok to
me to simply return null since this is the toString method.  For a
Document, the toString is only useful for debugging anyway.

Erik


Re: broken compilation

2005-04-26 Thread Doug Cutting
Erik Hatcher wrote:
> > I fixed it. TermInfosTest is, however, not a real JUnit test case, so I
> > wonder how useful it is at all...
>
> I'm curious - did your fix change the code to go against a new API?

Yes, but not a public API.

> In other words, is there something that has changed that breaks API
> compatibility?

No.
Doug


DO NOT REPLY [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()

2005-04-26 Thread bugzilla

http://issues.apache.org/bugzilla/show_bug.cgi?id=31841





--- Additional Comments From [EMAIL PROTECTED]  2005-04-26 21:30 ---
I have applied the deprecation patch.

The solution to my patch difficulties was to use 'patch -l -F 4'.  This gets
around the end-of-line issues.

Thanks again, Wolf!





Re: svn commit: r164695 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/Hit.java src/java/org/apache/lucene/search/HitIterator.java src/java/org/apache/lucene/search/Hits.java s

2005-04-26 Thread Daniel Naber
On Tuesday 26 April 2005 21:09, Erik Hatcher wrote:

> I don't know would it?  I have no preference, though it seems ok to
> me to simply return null since this is the toString method.  For a
> Document, the toString is only useful for debugging anyway.

Yes, and during debugging it would be especially confusing to just hide the 
exception. Sure, people will see that there's a problem with a "null" 
document, but then why not show the exception directly?
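
For illustration, re-throwing with the original exception as the cause might look roughly like this. LazyHit and its getDocument() are hypothetical stand-ins, not the committed Hit class; only the exception handling is the point.

```java
import java.io.IOException;

// Hypothetical stand-in for Hit, with a String in place of a Document.
class LazyHit {
    private final boolean failOnLoad;

    LazyHit(boolean failOnLoad) {
        this.failOnLoad = failOnLoad;
    }

    // Stand-in for a getDocument() call that may have to read the index.
    String getDocument() throws IOException {
        if (failOnLoad) {
            throw new IOException("index unavailable");
        }
        return "Document<filename:a.txt>";
    }

    public String toString() {
        try {
            return getDocument();
        } catch (IOException e) {
            // Keep the IOException as the cause so its stack trace is not lost.
            throw new RuntimeException("could not load document", e);
        }
    }
}
```

With this variant a failing load shows up in the debugger or log as the underlying IOException rather than as the string "null".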

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: svn commit: r164695 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/Hit.java src/java/org/apache/lucene/search/HitIterator.java src/java/org/apache/lucene/search/Hits.java s

2005-04-26 Thread DM Smith
Erik Hatcher wrote:
> On Apr 26, 2005, at 2:38 PM, Daniel Naber wrote:
> > On Tuesday 26 April 2005 02:21, [EMAIL PROTECTED] wrote:
> > > +  public String toString() {
> > > +    try {
> > > +      return getDocument().toString();
> > > +    } catch (IOException e) {
> > > +      return null;
> > > +    }
> > > +  }
> >
> > Wouldn't it be better here to re-throw the exception as a
> > RuntimeException?
>
> I don't know would it?  I have no preference, though it seems ok
> to me to simply return null since this is the toString method.  For a
> Document, the toString is only useful for debugging anyway.

Two thoughts:
If getDocument().toString() cannot possibly throw an IOException, but it 
is part of the signature, then it does not matter.

Once Lucene requires Java 1.4, it would be better to use an assert in the
catch block and return "" instead of null, rather than throwing an error.
Assertions are disabled at runtime by default and are switched on with a
JVM flag such as -ea. Assertions are best used for situations that should
never happen.

public String toString() {
  try {
    return getDocument().toString();
  } catch (IOException e) {
    assert false : e;
    return "";
  }
}


DO NOT REPLY [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()

2005-04-26 Thread bugzilla

http://issues.apache.org/bugzilla/show_bug.cgi?id=31841





--- Additional Comments From [EMAIL PROTECTED]  2005-04-26 22:51 ---
Created an attachment (id=14846)
 --> (http://issues.apache.org/bugzilla/attachment.cgi?id=14846&action=view)
[PATCH] Fix to Query.combine() method and all specializations

This fixes the bugs in Query.combine() that were uncovered by the failing test
in the Highlighter.  Only Query.combine() remains -- all overrides in
BooleanQuery, RangeQuery, MultiTermQuery and PrefixQuery are deleted.  I
believe this fix is correct, robust relative to possible user Query
implementations, and generates optimal queries for at least the cases that are
built into Lucene (query rewriting of MultiTermQuerys and RangeQuerys).  It
is more robust relative to possible user Query implementations and covers more
optimization cases than the version I sent via email last night.  With this
patch, all Lucene and Highlighter tests pass (with the exception of the buggy
TestSort.testNormalizedScores(), which should be fixed by Wolf's patch).





Correction of Query.combine() bugs with new MultiSearcher

2005-04-26 Thread Chuck Williams
As noted in the patch description I just submitted, it should be a
complete, correct, robust (relative to possible user Query
implementations) and reasonably optimal solution for Query.combine().
It also simplifies the previous methods, deleting all overrides of
Query.combine() and Query.mergeBooleanQueries().  The current
implementation fails to account for queries that rewrite into different
primitive types on different sub-searchers and fails to account for the
fact that the rewritten query type of the first sub-searcher is nothing
special.  The new solution looks at all rewritten sub-searcher
queries as a whole and computes the (reasonably) best single query to
distribute.  This patch is slightly better than what I sent via email
last night:
 1.  It's a patch that can be applied in the usual way.
 2.  It handles the missing optimization cases I noted in last night's
email.
 3.  It fixes potential bugs that would not arise with Lucene's query
types but could arise with user-written queries (e.g., user queries that
rewrite differently in arbitrary ways for the different sub-searchers).

Doug and Wolf, please review the patch.  All tests pass.
Thanks,
Chuck
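
The idea of looking at all rewritten sub-searcher queries as a whole can be sketched roughly as follows. This is not the attached patch: queries are modelled as plain strings, and CombineSketch and combine(List) are hypothetical names used only for illustration.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class CombineSketch {
    // Each list element is the query as rewritten by one sub-searcher.
    // A real implementation would operate on Query objects and their clauses.
    static String combine(List<String> rewrittenPerSearcher) {
        // Collect the distinct rewrites, keeping first-seen order.
        Set<String> unique = new LinkedHashSet<String>(rewrittenPerSearcher);
        if (unique.size() == 1) {
            return unique.iterator().next();   // every sub-searcher agreed
        }
        // Otherwise distribute a disjunction of the distinct rewrites, so no
        // single sub-searcher's rewrite is treated as special.
        return "(" + String.join(" OR ", unique) + ")";
    }
}
```

The sketch leaves out the optimization cases mentioned in the message, such as merging clauses of rewritten boolean queries.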


Re: svn commit: r164695 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/Hit.java src/java/org/apache/lucene/search/HitIterator.java src/java/org/apache/lucene/search/Hits.java s

2005-04-26 Thread Jeremy Rayner
On 4/26/05, Daniel Naber <[EMAIL PROTECTED]> wrote:
> On Tuesday 26 April 2005 21:09, Erik Hatcher wrote:
> 
> > I don't know would it? I have no preference, though it seems ok to
> > me to simply return null since this is the toString method. For a
> > Document, the toString is only useful for debugging anyway.
> 
> Yes, and during debugging it would be especially confusing to just hide the
> exception. Sure, people will see that there's a problem with a "null"
> document, but then why not show the exception directly?
> 

Rather than return null, or throw an undesirable RuntimeException from the
toString() method, it may be more useful for the toString() to indicate the
critical parameters of the promised Hit, rather than the String representation
of one of the underlying members.

How about replacing the toString() method in Hit.java with...

  /**
   * Prints the parameters to be used to discover the promised result.
   */
  public String toString() {
    StringBuffer buffer = new StringBuffer();
    buffer.append("Hit<");
    buffer.append(hits.toString());
    buffer.append(" [");
    buffer.append(hitNumber);
    buffer.append("] ");
    if (resolved) {
      buffer.append("resolved");
    } else {
      buffer.append("unresolved");
    }
    buffer.append(">");
    return buffer.toString();
  }


which will return something like 
  "Hit<[EMAIL PROTECTED] [5] unresolved>"

and no RuntimeException or null in sight.

If the user of the API wants to deal with the potential IOException, then
they would write hit.getDocument().toString() and act accordingly.

Hope this helps,

jez.
-- 
http://javanicus.com/blog2




DO NOT REPLY [Bug 34359] - Cannot use Lucene in an unsigned applet due to Java security restrictions

2005-04-26 Thread bugzilla

http://issues.apache.org/bugzilla/show_bug.cgi?id=34359





--- Additional Comments From [EMAIL PROTECTED]  2005-04-26 22:51 ---
Thanks for opening this bug report. Hope it'll be accepted and resolved soon.






too many classes visible with "ant javadocs"

2005-04-26 Thread Daniel Naber
Hi,

the java API documentation now seems to contain classes which have no 
useful documentation and thus probably shouldn't be part of the API docs, 
e.g. Among, Testapp, SnowballProgram (maybe more).

Also, build.xml has a typo: "MorLikeThis" (Mor instead of More), but I 
don't know whether this has any effect.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: too many classes visible with "ant javadocs"

2005-04-26 Thread Erik Hatcher
On Apr 26, 2005, at 6:10 PM, Daniel Naber wrote:
> Hi,
>
> the java API documentation now seems to contain classes which have no
> useful documentation and thus probably shouldn't be part of the API
> docs, e.g. Among, Testapp, SnowballProgram (maybe more).
>
> Also, build.xml has a typo: "MorLikeThis" (Mor instead of More), but I
> don't know whether this has any effect.

Sorry for the typo.  This is still a work-in-progress - I only have a
little bit of time to devote to this per day and have checked in my
changes even though it's not entirely complete.  By all means feel free
to take over the build process refactorings if you'd like.  I'm a bad
estimator of my time and had hoped to have the packaging changes for a
1.9 release finished but they aren't yet.  I'll keep plodding along,
but won't be upset if someone else jumps in.

Erik


Re: too many classes visible with "ant javadocs"

2005-04-26 Thread Erik Hatcher
On Apr 26, 2005, at 6:10 PM, Daniel Naber wrote:
> the java API documentation now seems to contain classes which have no
> useful documentation and thus probably shouldn't be part of the API
> docs, e.g. Among, Testapp, SnowballProgram (maybe more).

This is now fixed.  I excluded net.sf.* from being javadoc'd.  This
causes a warning when generating documentation for
SnowballAnalyzer/Filter, but it's not a problem.

> Also, build.xml has a typo: "MorLikeThis" (Mor instead of More), but I
> don't know whether this has any effect.

Corrected, as well as made several adjustments to group packages
properly.  It took me a few tries to understand how the  
patterns work.

Erik


Re: [Performance] Streaming main memory indexing of single strings

2005-04-26 Thread Wolfgang Hoschek
I've uploaded slightly improved versions of the fast MemoryIndex  
contribution to http://issues.apache.org/bugzilla/show_bug.cgi?id=34585  
along with another contrib - PatternAnalyzer.
 	
For a quick overview without downloading code, there's javadoc for it
all at
http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

I'm happy to maintain these classes externally as part of the Nux  
project. But from the preliminary discussion on the list some time ago  
I gathered there'd be some wider interest, hence I prepared the  
contribs for the community. What would be the next steps for taking  
this further, if any?

Thanks,
Wolfgang.
/**
 * Efficient Lucene analyzer/tokenizer that preferably operates on a String
 * rather than a {@link java.io.Reader}, that can flexibly separate on a
 * regular expression {@link Pattern} (with behaviour identical to
 * {@link String#split(String)}), and that combines the functionality of
 * {@link org.apache.lucene.analysis.LetterTokenizer},
 * {@link org.apache.lucene.analysis.LowerCaseTokenizer},
 * {@link org.apache.lucene.analysis.WhitespaceTokenizer},
 * {@link org.apache.lucene.analysis.StopFilter} into a single efficient
 * multi-purpose class.
 *
 * If you are unsure what exactly a regular expression should look like,
 * consider prototyping by simply trying various expressions on some test
 * texts via {@link String#split(String)}. Once you are satisfied, give that
 * regex to PatternAnalyzer. Also see the
 * <a href="http://java.sun.com/docs/books/tutorial/extra/regex/">Java
 * Regular Expression Tutorial</a>.
 *
 * This class can be considerably faster than the "normal" Lucene tokenizers.
 * It can also serve as a building block in a compound Lucene
 * {@link org.apache.lucene.analysis.TokenFilter} chain. For example as in
 * this stemming example:
 *
 * PatternAnalyzer pat = ...
 * TokenStream tokenStream = new SnowballFilter(
 *     pat.tokenStream("content", "James is running round in the woods"),
 *     "English");
 *
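
The prototyping step suggested in the javadoc can be tried with nothing but the JDK. A minimal sketch, assuming a split on runs of non-word characters; SplitPrototype is a hypothetical helper and not part of the contribution.

```java
import java.util.regex.Pattern;

final class SplitPrototype {
    // Splits the text on the given regular expression, the same way
    // String#split(String) would.
    static String[] tokens(String text, String regex) {
        return Pattern.compile(regex).split(text);
    }

    public static void main(String[] args) {
        String text = "James is running round in the woods";
        // "\\W+" means "one or more non-word characters".
        for (String token : tokens(text, "\\W+")) {
            System.out.println(token);
        }
    }
}
```

Once the tokens look right on a few sample texts, the same expression can be handed to the analyzer.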


On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote:
I've now got the contrib code cleaned up, tested and documented into a  
decent state, ready for your review and comments.
Consider this a formal contrib (Apache license is attached).

The relevant files are attached to the following bug ID:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
For a quick overview without downloading code, there's some javadoc at
http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

There are several small open issues listed in the javadoc and also  
inside the code. Thoughts? Comments?

I've also got small performance patches for various parts of Lucene  
core (not submitted yet). Taken together they lead to substantially  
improved performance for MemoryIndex, and most likely also for Lucene  
in general. Some of them are more involved than others. I'm now  
figuring out how much performance each of these contributes and how to  
propose potential integration - stay tuned for some follow-ups to  
this.

The code as submitted would certainly benefit a lot from said patches,  
but they are not required for correct operation. It should work out of  
the box (currently only on 1.4.3 or lower). Try running

cd lucene-cvs
java org.apache.lucene.index.memory.MemoryIndexTest
with or without custom arguments to see it in action.
Before turning to a performance patch discussion I'd at this point
rather be most interested in folks giving it a spin, comments on the
API, or any other issues.

Cheers,
Wolfgang.
On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:
On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:
On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:
By the way, by now I have a version against 1.4.3 that is 10-100  
times faster (i.e. 3 - 20 index+query steps/sec) than the  
simplistic RAMDirectory approach, depending on the nature of the  
input data and query. From some preliminary testing it returns  
exactly what RAMDirectory returns.
Awesome.  Using the basic StringIndexReader I sent?
Yep, it's loosely based on the empty skeleton you sent.
I've been fiddling with it a bit more to get other query types.   
I'll add it to the contrib area when its a bit more robust.
Perhaps we could merge up once I'm ready and put that into the  
contrib area? My version now supports tokenization with any analyzer  
and it supports any arbitrary Lucene query. I might make the API for  
adding terms a little more general, perhaps allowing arbitrary  
Document objects if that's what other folks really need...


As an aside, is there any work going on to potentially support  
prefix (and infix) wild card queries ala "*fish"?
WildcardQuery supports wildcard characters anywhere in the string.   
QueryParser itself restricts expressions that have leading wildcards  
from be

Re: [Performance] Streaming main memory indexing of single strings

2005-04-26 Thread Erik Hatcher
Wolfgang,
You have provided a superb set of patches!  I'm in awe of the extensive  
documentation you've done.

There is nothing further you need to do, but be patient while we  
incorporate it into the contrib area somewhere.  Your PatternAnalyzer  
could fit into the contrib/analyzers area nicely.  I'm not quite sure  
where to put MemoryIndex - maybe it deserves to stand on its own in a  
new contrib area?  Or does it make sense to put this into misc (still  
in sandbox/misc)?  Or where?

Erik
On Apr 26, 2005, at 9:47 PM, Wolfgang Hoschek wrote:
I've uploaded slightly improved versions of the fast MemoryIndex  
contribution to  
http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 along with  
another contrib - PatternAnalyzer.
 	
For a quick overview without downloading code, there's javadoc for it
all at
http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

I'm happy to maintain these classes externally as part of the Nux  
project. But from the preliminary discussion on the list some time ago  
I gathered there'd be some wider interest, hence I prepared the  
contribs for the community. What would be the next steps for taking  
this further, if any?

Thanks,
Wolfgang.
/**
 * Efficient Lucene analyzer/tokenizer that preferably operates on a String
 * rather than a {@link java.io.Reader}, that can flexibly separate on a
 * regular expression {@link Pattern} (with behaviour identical to
 * {@link String#split(String)}), and that combines the functionality of
 * {@link org.apache.lucene.analysis.LetterTokenizer},
 * {@link org.apache.lucene.analysis.LowerCaseTokenizer},
 * {@link org.apache.lucene.analysis.WhitespaceTokenizer},
 * {@link org.apache.lucene.analysis.StopFilter} into a single efficient
 * multi-purpose class.
 *
 * If you are unsure what exactly a regular expression should look like,
 * consider prototyping by simply trying various expressions on some test
 * texts via {@link String#split(String)}. Once you are satisfied, give that
 * regex to PatternAnalyzer. Also see the
 * <a href="http://java.sun.com/docs/books/tutorial/extra/regex/">Java
 * Regular Expression Tutorial</a>.
 *
 * This class can be considerably faster than the "normal" Lucene tokenizers.
 * It can also serve as a building block in a compound Lucene
 * {@link org.apache.lucene.analysis.TokenFilter} chain. For example as in
 * this stemming example:
 *
 * PatternAnalyzer pat = ...
 * TokenStream tokenStream = new SnowballFilter(
 *     pat.tokenStream("content", "James is running round in the woods"),
 *     "English");
 *
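The prototyping advice in the javadoc above (try a regex with String#split first) can be sketched with plain JDK classes. This is only an illustration, not PatternAnalyzer's code: the class name SplitPrototype, the tokenize helper and the stop word list are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the contribution discussed here.
final class SplitPrototype {

    // Splits text on the separator regex, optionally lower-cases each piece,
    // and drops stop words: roughly the combination the javadoc describes.
    static List<String> tokenize(String text, Pattern separator,
                                 boolean toLowerCase, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String piece : separator.split(text)) {
            if (piece.isEmpty()) {
                continue; // a leading separator yields an empty first piece
            }
            String token = toLowerCase ? piece.toLowerCase(Locale.ROOT) : piece;
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "James is running round in the woods";

        // Step 1: prototype the separator with String#split.
        System.out.println(Arrays.toString(text.split("\\s+")));

        // Step 2: reuse the same kind of regex with lower-casing and stop words.
        Set<String> stop = new HashSet<>(Arrays.asList("is", "in", "the"));
        System.out.println(tokenize(text, Pattern.compile("\\W+"), true, stop));
    }
}
```

Once a regex gives the tokens you expect here, the same expression would be the one to hand to PatternAnalyzer.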


On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote:
I've now got the contrib code cleaned up, tested and documented into  
a decent state, ready for your review and comments.
Consider this a formal contrib (Apache license is attached).

The relevant files are attached to the following bug ID:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
For a quick overview without downloading code, there's some javadoc  
at  
http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package- 
summary.html

There are several small open issues listed in the javadoc and also  
inside the code. Thoughts? Comments?

I've also got small performance patches for various parts of Lucene  
core (not submitted yet). Taken together they lead to substantially  
improved performance for MemoryIndex, and most likely also for Lucene  
in general. Some of them are more involved than others. I'm now  
figuring out how much performance each of these contributes and how  
to propose potential integration - stay tuned for some follow-ups to  
this.

The code as submitted would certainly benefit a lot from said  
patches, but they are not required for correct operation. It should  
work out of the box (currently only on 1.4.3 or lower). Try running

cd lucene-cvs
java org.apache.lucene.index.memory.MemoryIndexTest
with or without custom arguments to see it in action.
Before turning to a performance patch discussion, at this point I'd  
be most interested in folks giving it a spin, comments on the  
API, or any other issues.

Cheers,
Wolfgang.
On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:
On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:
On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:
By the way, by now I have a version against 1.4.3 that is 10-100  
times faster (i.e. 3 - 20 index+query steps/sec) than the  
simplistic RAMDirectory approach, depending on the nature of the  
input data and query. From some preliminary testing it returns  
exactly what RAMDirectory returns.
Awesome.  Using the basic StringIndexReader I sent?
Yep, it's loosely based on the empty skeleton you sent.
I've been fiddling with it a bit more to get other query types.   
I'll add it to the contrib area when it's a bit more robust.
Perhaps we could merge up o

Re: Correct of Query.combine() bugs with new MultiSearcher

2005-04-26 Thread Erik Hatcher
I've confirmed Chuck's patch does fix the Highlighter test.  I'm set to 
commit it once it gets the thumbs-up from Doug.

Erik
On Apr 26, 2005, at 4:58 PM, Chuck Williams wrote:
As noted in the patch description I just submitted, it should be a 
complete, correct, robust (relative to possible user Query 
implementations) and reasonably optimal solution for Query.combine().  
It also simplifies the previous methods, deleting all overrides of 
Query.combine() and Query.mergeBooleanQueries().  The current 
implementation fails to account for queries that rewrite into 
different primitive types on different sub-searchers and fails to 
account for the fact that the rewritten query type of the first 
sub-searcher is nothing special.  The current solution looks at all 
rewritten subsearcher queries as a whole and computes the (reasonably) 
best single query to distribute.  This patch is slightly better than 
what I sent via email last night:
 1.  It's a patch that can be applied in the usual way
 2.  It handles the missing optimization cases I noted in last night's 
email
 3.  It fixes potential bugs that would not arise with Lucene's query 
types but could arise with user-written queries (e.g., user queries 
that rewrite differently in arbitrary ways for the different 
sub-searchers).

Doug and Wolf, please review the patch.  All tests pass.
Thanks,
Chuck
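
To illustrate the idea Chuck describes (look at the rewritten queries from all sub-searchers as a whole, with the first sub-searcher being nothing special), here is a minimal Java sketch. It is not the patch, which is not reproduced in this thread. The class CombineSketch, its combine method, and the modelling of a rewritten query as a flat list of optional clause strings are all assumptions for illustration; real Lucene queries are far richer.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not Lucene code: each sub-searcher's rewritten query
// is represented as a flat list of optional clauses (plain strings).
final class CombineSketch {

    // Returns the distinct clauses across all sub-searchers, in first-seen
    // order. Every sub-searcher is treated the same way; none is privileged.
    static List<String> combine(List<List<String>> rewrittenPerSearcher) {
        Set<String> unique = new LinkedHashSet<>();
        for (List<String> rewritten : rewrittenPerSearcher) {
            unique.addAll(rewritten);
        }
        return new ArrayList<>(unique);
    }
}
```

For example, if a wildcard query expands to [apple, ant] on one sub-searcher and to [apple, axe] on another, this sketch would distribute the single clause list [apple, ant, axe] to both.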


Re: Correct of Query.combine() bugs with new MultiSearcher

2005-04-26 Thread Chuck Williams
Thanks Erik.  If you don't hear more, I'm sure this fixes a whole class 
of problems and is better than the previous situation.  I'm also 
confident that it will do the right thing for all the query types built 
into Lucene.  My remaining uncertainty concerns whether user query types 
might somehow cause a problem, and in that regard the pre-patch 
implementation might throw an exception when the patch will try to do 
something that should work (this occurs when the new query types do 
non-trivial rewrites, sometimes rewriting into built-in query types).  
Such user query types that need to work with MultiSearcher might need to 
provide their own combine() method or a patch to this one.  That was 
true before as well.

My message below, however, is pretty poorly written!  
Clarifications to make it intelligible follow:

Erik Hatcher wrote:
I've confirmed Chuck's patch does fix the Highlighter test.  I'm set 
to commit it once it gets the thumbs-up from Doug.

Erik
On Apr 26, 2005, at 4:58 PM, Chuck Williams wrote:
As noted in the patch description I just submitted, it should be a 
complete, correct, robust (relative to possible user Query 
implementations) and reasonably optimal solution for 
Query.combine().  It also simplifies the previous methods, deleting 
all overrides of Query.combine() and Query.mergeBooleanQueries().  
The current implementation fails to account for queries that rewrite 
into different primitive types on different

"current implementation" = before this patch
sub-searchers and fails to account for the fact that the rewritten 
query type of the first sub-searcher is nothing special.  The current 
solution

"current solution" = this patch
looks at all rewritten subsearcher queries as a whole and computes 
the (reasonably) best single query to distribute.  This patch is 
slightly better than what I sent via email last night:
 1.  It's a patch that can be applied in the usual way
 2.  It handles the missing optimization cases I noted in last 
night's email
 3.  It fixes potential bugs that would not arise with Lucene's query 
types but could arise with user-written queries (e.g., user queries 
that rewrite differently in arbitrary ways for the different 
sub-searchers).

Doug and Wolf, please review the patch.  All tests pass.
Thanks,
Chuck

--
*Chuck Williams*
All Things Local
Founder and CEO
V: (415)464-1889
C: (415)846-9018
[EMAIL PROTECTED] 
AIM: hawimanawiz


Re: svn commit: r164695 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/search/Hit.java src/java/org/apache/lucene/search/HitIterator.java src/java/org/apache/lucene/search/Hits.java s

2005-04-26 Thread Otis Gospodnetic
Thanks for the Future pointers.  I actually used it, but didn't know it
by name.

Otis

--- Jeremy Rayner <[EMAIL PROTECTED]> wrote:
> On 4/26/05, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> > Also, is "a future for a hit" a typo, or does that actually mean
> > something?  This makes me think of Python's "future", but I'm not sure
> > what this means in this context.
> 
> My feeling originally was that as the obtaining of the document 
> was expensive, a Hit should be a bit like the 'Future Value' pattern,
> where a Hit is just a promise to delve into Hits with a certain index
> at some point in the future.
> ( see http://c2.com/cgi/wiki?FutureValue )  
> Which interestingly enough now seems to be implemented in Doug Lea's
> changes for Java 5
> ( http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/Future.html )
> 
> Although without the asynchronous element, I guess it is just lazy
> initialization.
> 
> An alternative implementation of Hit could be a 'Virtual Proxy (GOF:207)'
> that stores a delegate FutureHit or ActualHit.  The FutureHit could be the
> starting position, but after any call the delegate's reference is swapped
> over to ActualHit.  This would eliminate the check of 'resolved' at the
> start of each method, and therefore increase performance.  However, a
> memory overhead would be incurred by having three classes instead of one.
> So it's a better performance vs. less memory usage tradeoff.
> 
> Thanks for allowing this change, it has now turned my previous Groovy
> example ( http://javanicus.com/blog2/items/178-index.html ) from
> 
> for ( i in 0 ..< hits.length() ) {
>     println(hits.doc(i)["filename"])
> }
> 
> into
> 
> hits.each{
>     println(it.filename)
> }
> 
> which has far fewer chances for making typos :-)
> 
> jez.
> -- 
> http://javanicus.com/blog2
> 
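
The two designs Jeremy contrasts in the quoted message (a 'resolved' check at the start of each method versus a Virtual Proxy that swaps its delegate after the first call) can be sketched in Java as follows. These are hypothetical stand-ins, not Lucene's Hit class: the names LazyHit and ProxyHit, the String document type, and the IntFunction loader standing in for the expensive document fetch are all assumptions. Neither variant is thread-safe.

```java
import java.util.function.IntFunction;

// Variant 1: lazy initialization guarded by a 'resolved' flag.
final class LazyHit {
    private final IntFunction<String> loader; // stands in for the costly fetch
    private final int index;
    private String doc;
    private boolean resolved;

    LazyHit(IntFunction<String> loader, int index) {
        this.loader = loader;
        this.index = index;
    }

    String get() {
        if (!resolved) {            // checked on every call
            doc = loader.apply(index);
            resolved = true;
        }
        return doc;
    }
}

// Variant 2: Virtual Proxy that replaces its delegate after the first call.
final class ProxyHit {
    private interface Delegate {
        String get();
    }

    private Delegate delegate;

    ProxyHit(IntFunction<String> loader, int index) {
        // Initial "FutureHit" state: fetch once, then swap to "ActualHit".
        delegate = () -> {
            String doc = loader.apply(index);
            delegate = () -> doc;   // later calls skip the fetch and any flag
            return doc;
        };
    }

    String get() {
        return delegate.get();
    }
}
```

Both variants call the loader at most once per hit. The proxy trades the per-call flag check for an extra object and an indirect call, which is the performance versus memory tradeoff mentioned above.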