Re: renaming fields in index

2005-03-07 Thread Doug Cutting
Ajay Upadhyaya wrote: We have a large number of indexed documents (approx 1M); the size of the index is approx 1G. We have a few Keyword fields as well as a few UnStored fields. Now there is a requirement to change the names of a few fields. The application code can be easily changed to create the

Re: renaming fields in index

2005-03-07 Thread Doug Cutting
Andrzej Bialecki wrote: Doug Cutting wrote: You cannot easily change the field names in the index. ..through the existing API, that is. Because you can change the content of the *.fnm file appropriately, right? Right. One could write something that would re-write all of the .fnm files. It

Re: Interfaces

2005-03-17 Thread Doug Cutting
Erik Hatcher wrote: Ultimately, though, the decision to refactor the codebase to use interfaces more pervasively lies with Doug. Actually the decision lies not with me, but with the Lucene PMC as a group, according to Apache's voting process: http://www.apache.org/foundation/voting.html But, lik

Re: snowball analyzer uismo issue in spanish stemmer

2005-03-17 Thread Doug Cutting
Erik Hatcher wrote: I just tried regenerating, which automatically pulls from CVS, and got this error: /Users/erik/dev/lucene/java/contrib/snowball/snowball/website/p/ generator.c:425: internal compiler error: in extract_insn, at recog.c:2175 [apply] Please submit a full bug report, [a

Re: snowball analyzer uismo issue in spanish stemmer

2005-03-17 Thread Doug Cutting
Erik Hatcher wrote: If you see regeneration differences would you please commit them? There were no differences. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

Re: running out of file handles

2005-04-15 Thread Doug Cutting
Guillermo Payet wrote: In any case... the point being that we want to just have one IndexSearcher for the whole App. But.. we're starting to run out of file handles on our server, and an lsof returns lots and lots of these: java 22755 tomcat 320r REG 9,3 2992899 1177377 /var/ix/_2a8.cfs (deleted

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Doug Cutting
Erik Hatcher wrote: I think something like this would make a handy addition to our contrib area at least. Perhaps. What use cases cannot be met by regular expression matching? Doug

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Doug Cutting
Wolfgang Hoschek wrote: The classic fuzzy fulltext search and similarity matching that Lucene is good for :-) So you need a score that can be compared to other matches? This will be based on nothing but term frequency, which a regex can compute. With a single document there'll be no IDFs, so y

lucene 2.0?

2005-04-19 Thread Doug Cutting
Bernhard Messer wrote: I'm not a fan of outdated software or historical systems. So i think the best would be to keep lucene still backward compatible with version 1.9 and perform the switch to JDK 1.4 with lucene 2.0. That sounds like a good plan. Which raises the question, when should we make t

Re: DO NOT REPLY [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()

2005-04-20 Thread Doug Cutting
[EMAIL PROTECTED] wrote: http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 [EMAIL PROTECTED] changed: What|Removed |Added Status|NEW |RESOLVED

Re: DO NOT REPLY [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()

2005-04-21 Thread Doug Cutting
Wolf Siberski wrote: In each case applications should call a corresponding Searcher method. Here I don't agree completely and have another suggestion to resolve that issue. The affected methods are low-level API methods anyway, and even before their javadoc referred application developers to othe

Re: Fwd: Lucene and Groovy...

2005-04-25 Thread Doug Cutting
Erik Hatcher wrote: There are two .java files attached that may not make it through to the list. These are simple wrappers that do exactly what you'd expect. The idea is to make dealing with Lucene Hits more "Java like" with an Iterator, which in turn makes this much more amenable to Groovy. +

Re: broken compilation

2005-04-26 Thread Doug Cutting
Erik Hatcher wrote: I fixed it. TermInfosTest is, however, not a real JUnit test case, so I wonder how useful it is at all... I'm curious - did your fix change the code to go against a new API? Yes, but not a public API. In other words, is there something that has changed that breaks API compati

Re: [Performance] Streaming main memory indexing of single strings

2005-04-27 Thread Doug Cutting
Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area? That sounds good to me. Or does it make sense to put this into misc (still in sandbox/misc)? Or where? Isn't the goal for sandbox/ to go away, replaced with contrib/

ParallelReader

2005-04-28 Thread Doug Cutting
Please find attached something I wrote today. It has not yet been tested extensively, and the documentation could be improved, but I thought it would be good to get comments sooner rather than later. Would folks find this useful? Should it go into the core or in contrib? Doug Index: src/java/or

Re: build process changes

2005-05-02 Thread Doug Cutting
Thanks for doing all this! It looks great! Erik Hatcher wrote: However it seems much simpler for us to only distribute lucene-XX.tar.gz/zip and lucene-XX-src.tar.gz/.zip rather than distributing each contrib component separately. I agree. The current build process builds the same 4 distributio

Re: java.util.zip (was Questions about DeleteFile method)

2005-05-04 Thread Doug Cutting
Monsur Hossain wrote: George, what about SharpZipLib: http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx It's a third-party project, but it's written in C# and is under GPL. GPL unfortunately means that the library cannot be distributed by Apache with Lucene.Net. Doug

Re: svn commit: r168213 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/store/FSDirectory.java

2005-05-04 Thread Doug Cutting
I'd prefer if the list of file extensions was in a single place, and that place should be somewhere in the index package, not in the store package. Doug

Re: build process changes

2005-05-05 Thread Doug Cutting
Erik Hatcher wrote: My rationale for keeping all the contrib components in their own subdirectories was to allow room for eventual documentation or other files that might want to come along for the ride (like maybe a dependent ASL'd JAR?). That makes sense. I'd be happy to change it if that i

Re: svn commit: r168450 - in /lucene/java/trunk/src/java/org/apache/lucene: index/SegmentMerger.java search/MultiPhraseQuery.java search/MultiSearcher.java search/PhrasePrefixQuery.java search/PhraseQ

2005-05-06 Thread Doug Cutting
[EMAIL PROTECTED] wrote: don't declare Exceptions that are never thrown; remove an unused variable When these are implementing a public interface or abstract method I think it is good to keep the exception declaration, as it is a part of the interface. That way, if an exception needs to be thrown

multi-field highlighting

2005-05-06 Thread Doug Cutting
There's a post over at SearchEngineWatch theorizing about how Google produces summaries. http://forums.searchenginewatch.com/showthread.php?threadid=5448 Lucene's current highlighter doesn't easily support multi-fields, nor does it take phrasal matching into account. It might be useful to have

FieldCache parser

2005-05-10 Thread Doug Cutting
Attached is a patch that makes it possible to supply a user-specified parser to FieldCache. For example, one might use this to process a date field as ints even if it was not indexed as a decimal integer. Comments? Doug Index: src/java/org/apache/lucene/search/FieldCache.java =

Re: multi-field highlighting

2005-05-10 Thread Doug Cutting
ument. With that principle in mind I should really make sure that if I search for: ("Doug Cutting" AND lucene) OR google I shouldn't highlight "Doug Cutting" in a matching document that has google but not lucene. Shouldn't the search code already take care of tha

Re: Helping PyLucene and RubyLucene incubate

2005-05-10 Thread Doug Cutting
tch as a new Podling. The Nutch proposal vote is at: http://www.mail-archive.com/general@incubator.apache.org/msg04201.html The Nutch proposal is at: http://wiki.apache.org/incubator/NutchProposal The Nutch mentors are: Doug Cutting Erik Hatcher If you accept this podling, please add Erik and

constant scoring queries

2005-05-10 Thread Doug Cutting
Background: In http://issues.apache.org/bugzilla/show_bug.cgi?id=34673, Yonik Seeley proposes a ConstantScoreQuery, based on a Filter. And in http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg08007.html I proposed a mechanism to promote the use of Filters. Through all of this, Paul

Re: constant scoring queries

2005-05-11 Thread Doug Cutting
Yonik Seeley wrote: Could you elaborate on the advantage of having say a TermQuery that could be either normal-scoring or constant-scoring vs two different Query classes for doing this? They seem roughly equivalent. You could code it that way too. It would require exposing TermWeight and TermSco

patch to build.xml

2005-05-11 Thread Doug Cutting
Attached is a patch to build.xml and common-build.xml that makes 'ant test' succeed. The problem is that, classically, unit tests in Lucene are named Test*.java, but there are tests in contrib named *Test.java, and there are non-unit tests in src/test named *Test.java. Until this is resolved,

Re: ParallelReader

2005-05-12 Thread Doug Cutting
Doug Cutting wrote: > Would folks find this useful? Since the general feedback was positive, I committed this. Chuck Williams wrote: Yes, very useful, especially if you added one additional feature that looks straightforward from the code below. That is a facility to append the stored fie

Re: svn commit: r170003 - /lucene/java/trunk/build.xml

2005-05-13 Thread Doug Cutting
[EMAIL PROTECTED] wrote: controversial: do not fail the build for contrib components not building successfully. this is to make Gump happy for now, but in the future a more granular conditional build of each contrib project may be desirable +1 The contrib stuff doesn't have the same guarantees as

Re: One Byte is Seven bits too many? - A Design suggestion

2005-05-23 Thread Doug Cutting
Robert Engels wrote: I have always thought that the norms should be an interface, rather than fixed, as there are many uses of lucene where norms are not necessary, and the memory overhead is substantial. I agree, but that's not the whole story. If one seeks merely to avoid caching the norms i

Re: major searching performance improvement

2005-05-25 Thread Doug Cutting
Robert Engels wrote: Attached are files that dramatically improve the searching performance (2x improvement on several hardware configurations!) in a multithreaded, high concurrency environment. This looks like some good stuff! Can you perhaps break it down into independent, layered patches?

Re: major searching performance improvement

2005-05-26 Thread Doug Cutting
Robert Engels wrote: Ok. Attached are the updated files. I also forgot some of the changed files the first time around (CompoundFileReader also had synchronization that needed to be removed). Again, it would be much easier to understand if you supplied patches, i.e., diffs, so that we can focu

Re: major searching performance improvement

2005-05-26 Thread Doug Cutting
Robert Engels wrote: 2. I agree that creating NioFSDirectory rather than modifying FSDirectory. I originally felt the memory mapped files would be the fastest, but it also requires OS calls, the "caching" code is CONSIDERABLY faster, since it does not need to do any JNI, or make OS calls. On th

Re: Potential Segment corruption

2005-05-26 Thread Doug Cutting
Arvind Srinivasan wrote: Some options are: (1) Commit the counter after the newSegmentName call. This way we never reuse the segmentName. (2) Add a callback API to directory interface for a new Segment Creation allowing the directory interface to clean up, on a new segment write. (3) Provi

Re: Potential Segment corruption

2005-05-26 Thread Doug Cutting
Doug Cutting wrote: I've attached a patch. Does this fix things for you? Oops. That had a bug. Here's a revised patch. It now passes all unit tests. Doug Index: src/java/org/apache/lucene/store/FSDirectory.java =

Re: Potential Segment corruption

2005-05-26 Thread Doug Cutting
Arvind Srinivasan wrote: The patch on the follow up mail does look good. However, I have additional concerns: (a) deleteFile call may fail, e.g. the file is left open from the previous exception. This makes me believe the ideal scenario is to not reuse the segment name once the newSegment call iss

Re: FieldCache parser

2005-06-02 Thread Doug Cutting
Doug Cutting wrote: Attached is a patch that makes it possible to supply a user-specified parser to FieldCache. For example, one might use this to process a date field as ints even if it was not indexed as a decimal integer. As there were no objections, I have committed this patch. Doug

Re: Potential Segment corruption

2005-06-02 Thread Doug Cutting
Doug Cutting wrote: I think the fix is much simpler. This is a bug in FSDirectory. Directory.createOutput() should always create a new empty file, and FSDirectory's implementation does not ensure this. It should try to delete the file before opening it and/or call RandomAccessFile.setL
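The fix described above (delete before opening, and/or truncate on open) can be sketched roughly as follows. This is a minimal illustration, not the committed patch: the class and method names are hypothetical, and it assumes the truncated "RandomAccessFile.setL" in the snippet refers to the standard `RandomAccessFile.setLength` method.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical sketch; not Lucene's FSDirectory code.
class CreateOutputSketch {
    /** Opens the file for writing and guarantees that it starts out empty. */
    static RandomAccessFile createEmpty(File file) throws IOException {
        // Try to remove any stale file left over from an earlier, failed attempt.
        if (file.exists()) {
            file.delete(); // may fail, e.g. if another handle still holds the file
        }
        RandomAccessFile out = new RandomAccessFile(file, "rw");
        // Truncate as well, in case the delete above did not succeed.
        out.setLength(0);
        return out;
    }

    /** Writes three stale bytes, re-creates the file, and returns its new length. */
    static long lengthAfterRecreate() {
        try {
            File f = File.createTempFile("segment", ".tmp");
            try (RandomAccessFile stale = new RandomAccessFile(f, "rw")) {
                stale.write(new byte[] {1, 2, 3});
            }
            try (RandomAccessFile fresh = createEmpty(f)) {
                return fresh.length();
            } finally {
                f.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lengthAfterRecreate());
    }
}
```

Doing both steps is deliberate here: the delete handles the common case, and the truncation covers platforms where a delete can fail while the file is still open elsewhere.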

Re: class for delete/add access to an index

2005-06-03 Thread Doug Cutting
Daniel Naber wrote: What do you think? If this gets accepted, it also needs a better name. It looks reasonable to me. As for names, IndexWriter would be a good one for this, and IndexAppender would be a better name for what's now called IndexWriter. Unfortunately, I don't see a way to make

Re: compound file documentation

2005-06-03 Thread Doug Cutting
Daniel Naber wrote: can someone please check my changes to fileformats.xml regarding the compound format? (not yet on the website, call "ant" in the "site" directory to build the files locally). Looks good. One improvement: You could define FileData more formally as something like: FileData

Re: compound file documentation

2005-06-03 Thread Doug Cutting
Daniel Naber wrote: On Friday 03 June 2005 19:02, Doug Cutting wrote: FileLength[i] -> (i==FileCount) ? DataOffset[i+1] : EOF) - DataOffset[n] Not sure if that really helps. At least I find it confusing, as neither the "?" operator nor the "EOF" occurs anywhe
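The formula quoted above is cut off in this archive view, so the following is only one reading of it: each sub-file's length is the distance from its data offset to the next sub-file's offset, with the last sub-file running to the end of the compound file. The names are hypothetical, and the sketch assumes offsets are stored in ascending order.

```java
import java.util.Arrays;

// Hypothetical sketch; names and layout assumptions are not from the thread.
class CompoundLengthsSketch {
    /**
     * Derives each sub-file's length from the data offsets of a compound file.
     * eof is the total length of the compound file.
     */
    static long[] fileLengths(long[] dataOffsets, long eof) {
        long[] lengths = new long[dataOffsets.length];
        for (int i = 0; i < dataOffsets.length; i++) {
            // The last entry ends at EOF; every other entry ends where the next begins.
            long end = (i + 1 < dataOffsets.length) ? dataOffsets[i + 1] : eof;
            lengths[i] = end - dataOffsets[i];
        }
        return lengths;
    }

    public static void main(String[] args) {
        long[] lengths = fileLengths(new long[] {40, 100, 250}, 300);
        System.out.println(Arrays.toString(lengths)); // prints [60, 150, 50]
    }
}
```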

IndexFileNames

2005-06-06 Thread Doug Cutting
[EMAIL PROTECTED] wrote: --- lucene/java/trunk/src/java/org/apache/lucene/store/FSDirectory.java (original) +++ lucene/java/trunk/src/java/org/apache/lucene/store/FSDirectory.java Mon Jun 6 10:52:12 2005 @@ -52,8 +52,8 @@ if (name.endsWith("."+IndexReader.FILENAME_EXTENSIONS[i]))

Re: svn-commit: 168449 FSDirectory

2005-06-06 Thread Doug Cutting
Bernhard Messer wrote: Therefore i would like to propose two changes: 1) we should store the extension in a hash and not in String[] to have a faster lookup Do you mean to use something like: int lastDot = name.lastIndexOf('.'); if (lastDot >= 0) { String nameExt = name.substring(lastDot
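The hash-based lookup under discussion might look something like the sketch below. The class name is hypothetical, and the extension set is an illustrative subset (only `cfs` and `fnm`, which appear elsewhere in this listing), not the real list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the code that was committed.
class ExtensionLookupSketch {
    // Illustrative subset only; the full extension list is not given in the thread.
    private static final Set<String> EXTENSIONS =
        new HashSet<>(Arrays.asList("cfs", "fnm"));

    /** Returns true if the file name ends in one of the known extensions. */
    static boolean isIndexFile(String name) {
        int lastDot = name.lastIndexOf('.');
        if (lastDot < 0) {
            return false; // no extension at all
        }
        // One hash lookup replaces a linear scan with endsWith() over an array.
        return EXTENSIONS.contains(name.substring(lastDot + 1));
    }

    public static void main(String[] args) {
        System.out.println(isIndexFile("_2a8.cfs"));  // prints true
        System.out.println(isIndexFile("notes.txt")); // prints false
    }
}
```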

Re: IndexFileNames

2005-06-07 Thread Doug Cutting
Bernhard Messer wrote: sorry for the confusion. On the first look, i thought the new class IndexFileNames, containing the necessary constant values, fits perfect into org.apache.lucene.index. After a more detailed look, i get the feeling that it would be much better to place the new class into

Re: IndexFileNames

2005-06-09 Thread Doug Cutting
Bernhard Messer wrote: I finished the changes and committed the changes. There are two new classes in package org.apache.lucene.index. org.apache.lucene.index.IndexFileNames contains common lucene related filenames and extensions, the scope of the class itself and its members are package. org.

Re: 2nd call - [Vote] Wolfgang Hoschek for committer

2005-07-11 Thread Doug Cutting
+1 Doug

please help update Lucene status page

2005-08-02 Thread Doug Cutting
I started a Lucene status page for July at: http://wiki.apache.org/jakarta-lucene/Report-2005-07 Please help populate this page. It should contain news related to the Lucene top-level project (not just the Java sub-project) since Lucene became a top-level project at the beginning of this year

Re: Map-Reduce

2005-08-04 Thread Doug Cutting
Paul Smith wrote: I know there's a mapreduce branch in the nutch project, but is there any plan/talk of perhaps integrating something like that directly into the Lucene API? For projects that need a lower-level API like Lucene, rather than the crawl-like nature of Nutch, the potential to i

Re: Map-Reduce

2005-08-04 Thread Doug Cutting
Doug Cutting wrote: Perhaps we need to factor Nutch into two projects, one with NDFS and MapReduce and the other with the search-specific code. This falls almost exactly on package lines. The packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} are not dependent on the rest of Nutch. FYI

Re: Lucene does NOT use UTF-8.

2005-08-29 Thread Doug Cutting
Ken Krugler wrote: The remaining issue is dealing with old-format indexes. I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for b

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting
[EMAIL PROTECTED] wrote: How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact. I spoke a bit too soon. I should have looked at the code first. You're right, I don't think it would require more allocations. When con

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting
Yonik Seeley wrote: I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8? I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count. TermBuffer.java:66 Things

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting
Yonik Seeley wrote: A related problem exists even if the prefix length vInt is changed to represent the number of unicode chars (as opposed to number of java chars), right? The prefix length is no longer the offset into the char[] to put the suffix. Yes, I suppose this is a problem too. Sigh

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting
Yonik Seeley wrote: Where/how is the Lucene ordering of terms used? An ordering is necessary to be able to find things in the index. For the most part, the ordering doesn't seem to matter... the only query that comes to mind where it does matter is RangeQuery. For back-compatibility it would be

Re: Lucene does NOT use UTF-8.

2005-08-31 Thread Doug Cutting
Wolfgang Hoschek wrote: I don't know if it matters for Lucene usage. But if using CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a significant problem, it's probably due to startup/init time of these methods for individually converting many small strings, not inherently due to

Re: Uneffective writeBytes and readBytes [FIX]

2005-09-08 Thread Doug Cutting
I don't in general disagree with this sort of optimization, but I think a good fix is a bit more complicated than what you posted. Lukas Zapletal wrote: And here comes the fixes: OutputStream: /** * Writes an array of bytes. * * @param b *the bytes

Re: Uneffective writeBytes and readBytes [FIX]

2005-09-08 Thread Doug Cutting
Paul Elschot wrote: I suppose one of these cases is when many terms are used in a query. Would it be easily possible to make the buffer size for a term iterator depend on the number of documents to be iterated? Many terms only occur in a few documents, so this could be a nice win on total buf

Re: Delaying buffer allocation in BufferedIndexInput

2005-09-12 Thread Doug Cutting
Paul Elschot wrote: I tried delaying the buffer allocation in BufferedIndexInput by using this clone() method: public Object clone() { BufferedIndexInput clone = (BufferedIndexInput)super.clone(); clone.buffer = null; clone.bufferLength = 0; clone.bufferPosition = 0; clone.

Re: Version 1.9

2005-09-12 Thread Doug Cutting
Erik Hatcher wrote: I'm using the trunk of Subversion (pretty much what 1.9 will be) on all my projects and it is quite stable. I defer to the others on when we release it as 1.9 officially, though. I think the 1.9 release should be made soon. What is required is a motivated committer wit

Re: Fwd: [jira] Commented: (INFRA-199) Convert Lucene's Bugzilla to JIRA

2005-09-12 Thread Doug Cutting
Erik Hatcher wrote: I haven't seen this come across the java-dev list (I could have missed it though). Everyone ok with moving to JIRA? +1

Re: Version 1.9

2005-09-12 Thread Doug Cutting
Scott Ganyo wrote: What is required to make the release? The (somewhat dated) steps are at: http://wiki.apache.org/jakarta-lucene/ReleaseTodo Probably the first thing to do is to update these (cvs -> svn) and see if folks suggest any other improvements. We should start with a 1.9-rc1 relea

Re: Eliminating norms ... completley

2005-10-10 Thread Doug Cutting
Chris Hostetter wrote: 2) Can you think of a clean way for individual applications to eliminate norms (via subclassing the lucene code base - ie: no patching) Can't you simply subclass FilterIndexReader and override norms() to return a cached dummy array of Similarity.encodeNorm(1.0f) f

Re: Eliminating norms ... completley

2005-10-10 Thread Doug Cutting
Robert Engels wrote: Doesn't this cause a problem for highly interactive and large indexes? Since every update to the index requires the rewriting of the norms, and constructing a new array. The original complaint was primarily about search-time memory size, not update speed. I like the propo

Re: Lock class question

2005-10-24 Thread Doug Cutting
Marvin Humphrey wrote: What are the advantages of the With class? Why not just obtain the lock, run a block, and release the lock? The release should be in a 'finally' block. 'With' enforces that. Doug
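The point about 'finally' can be illustrated with a small wrapper that owns the acquire/release pair, so callers cannot forget the release. This is a sketch in present-day Java using `java.util.concurrent.locks.ReentrantLock`. It is not Lucene's `Lock.With` class, and all names are hypothetical.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical sketch of the pattern; not Lucene's Lock.With.
class WithLockSketch {
    /** Runs body while holding the lock; the lock is released even if body throws. */
    static <T> T with(ReentrantLock lock, Supplier<T> body) {
        lock.lock();
        try {
            return body.get();
        } finally {
            lock.unlock(); // runs on normal return and on exception alike
        }
    }

    /** Returns true if the lock is free again after the body has thrown. */
    static boolean releasedAfterFailure() {
        ReentrantLock lock = new ReentrantLock();
        try {
            with(lock, () -> { throw new IllegalStateException("boom"); });
        } catch (IllegalStateException expected) {
            // The body failed, but the finally block has already released the lock.
        }
        return !lock.isLocked();
    }

    public static void main(String[] args) {
        System.out.println(with(new ReentrantLock(), () -> "done"));
        System.out.println(releasedAfterFailure());
    }
}
```

Callers supply only the body; the wrapper decides when to acquire and release, which is what makes the 'finally' discipline enforceable rather than a convention.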

Welcome Yonik Seeley as committer!

2005-10-24 Thread Doug Cutting
Last week I proposed to the Lucene PMC that we make Yonik Seeley a committer on Lucene Java. I am pleased to announce that other PMC members agreed. Welcome, Yonik! Doug

Re: Welcome Yonik Seeley as committer!

2005-10-25 Thread Doug Cutting
Chris Hostetter wrote: 2) On the subject of committing and becoming a committer: I've noticed a few questions recently about why/when patches can/will-be committed; and Yonik's new status has me wondering about how people become committers, and what guidelines exist for committers to know how/when to

Re: Welcome Yonik Seeley as committer!

2005-10-25 Thread Doug Cutting
Erik Hatcher wrote: As for accepting patches - with Lucene I'm personally very conservative with applying patches. There are good reasons to be conservative. When a committer commits a patch he or she vouches for the quality of that patch. Any problems that ensue are, to some degree, the re

Re: [jira] Commented: (LUCENE-414) Java NIO patch against Lucene 1.9

2005-10-26 Thread Doug Cutting
Robert Engels wrote: The reason for using Nio and not IO is IO requires multiple file handles per file. There are already numerous bugs/work-arounds in Lucene to limit the use of file handles (as this is an OS-limited resource), so I did not wish to further increase the number of file descripto

Re: ApacheCon 2005 and Lucene

2005-10-27 Thread Doug Cutting
Grant Ingersoll wrote: Should I get the source and propose a patch or is there somebody who is in "charge" of the website? A patch would be great. The site is generated from the xdocs directory with 'ant docs'. You also need to check out jakarta-site2 as ../jakarta-site2. Doug

Re: bytecount as String and prefix length

2005-11-01 Thread Doug Cutting
Marvin Humphrey wrote: I think it's time to throw in the towel. Please don't give up. I think you're quite close. I would be careful using CharBuffer instead of char[] unless you're sure all methods you call are very efficient. You could try avoiding CharBuffer by adding something (ugly) l

Re: bytecount as String and prefix length

2005-11-01 Thread Doug Cutting
Another approach might be to, instead of converting UTF-8 to strings right away, change things to convert lazily, if at all. During index merging such conversion should never be needed. You needn't do this systematically throughout Lucene, but only where it makes a big difference. For exa

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-14 Thread Doug Cutting
[EMAIL PROTECTED] wrote: +23. Added regular expression queries, RegexQuery and SpanRegexQuery. +Note the same term enumeration caveats apply with these queries as +apply to WildcardQuery and other term expanding queries. +(Erik Hatcher) I don't like adding more error-prone stuff lik

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-14 Thread Doug Cutting
Erik Hatcher wrote: The downside is scoring closer matches (in say the WildcardQuery) would no longer be possible, right? Right. We could implement a scorer that keeps a byte array of scores instead of a bit vector, using Similarity.java's 8-bit float format. That would use more memory, but

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Doug Cutting
Paul Elschot wrote: I think losing the field boosts for PrefixQuery and friends would not be advisable. Field boosts have a very big range and from that a very big influence on the score and the order of the results in Hits. It should not be hard to add these. If a field name is provided, the

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Doug Cutting
Yonik Seeley wrote: As far as API goes, I guess there should be a constructor ConstantScoreQuery(Filter filter, String field) If field is non-null, then the field norm can be multiplied into the score. You could implement this with a scorer subclass that multiplies by the norm, removing a condi

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Doug Cutting
Paul Elschot wrote: Not using the document term frequencies in PrefixQuery would still leave these as a surprise factor between PrefixQuery and TermQuery. Should we dynamically decide to switch to FieldNormQuery when BooleanQuery.maxClauseCount is exceeded? That way queries that currently wo

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread Doug Cutting
Yonik Seeley wrote: Scoring recap... I think I've seen 4 different types of scoring mentioned in this thread for a term expanding query on a single field: 1) query_boost 2) query_boost * (field_boost * lengthNorm) 3) query_boost * (field_boost * lengthNorm) * tf(t in q) 4) query_boost * (field_b

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-16 Thread Doug Cutting
Yonik Seeley wrote: Totally untested, but here is a hack at what the scorer might look like when the number of terms is large. Looks plausible to me. You could instead use a byte[maxDoc] and encode/decode floats as you store and read them, to use a lot less RAM. // could also use a bitse

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-16 Thread Doug Cutting
Yonik Seeley wrote: Hmmm, very interesting idea. Less than one decimal digit of precision might be hard to swallow when you have to add scores together though: smallfloat(score1) + smallfloat(score2) + smallfloat(score3) Do you think that the 5/3 exponent/mantissa split is right for this, or wo

Re: Float.floatToRawIntBits

2005-11-16 Thread Doug Cutting
In general I would not take this sort of profiler output too literally. If floatToRawIntBits is 5x faster, then you'd expect a 16% improvement from using it, but my guess is you'll see far less. Still, it's probably worth switching & measuring as it might be significant. Doug Paul Smith wro
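As a rough check on the arithmetic above: if a call accounts for a fraction p of total run time and becomes s times faster, the run time drops by p(1 - 1/s). The 16% figure for a 5x speedup is therefore consistent with the call taking about 20% of profiled time. That 20% share is inferred here from the two numbers given; it is not stated in the visible part of the message.

```latex
\text{saving} = p\left(1 - \frac{1}{s}\right),
\qquad
0.16 = p\left(1 - \frac{1}{5}\right)
\;\Rightarrow\;
p = 0.2
```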

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-17 Thread Doug Cutting
Yonik Seeley wrote: I'm not sure I understand why this is. epsilon is based on 1, (smallest number such that 1-epsilon != 1, right?). What's special about 1? 1 is special for multiplication, but, you're right, not so special for addition, the operation in question. The thing that makes addi

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-17 Thread Doug Cutting
Yonik Seeley wrote: mantissa_bits=4, zeroExp=4: 1) 0.0021972656 2) 0.0024414062 70) 0.875 71) 0.9375 72) 1.0 73) 1.125 74) 1.25 75) 1.375 76) 1.5 254) 7340032.0 255) 7864320.0 This would be a good choice. I think the following is also a contender: mantissa_bits=5, zeroExp=2: 1) 0.033203125 2)

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-17 Thread Doug Cutting
Yonik Seeley wrote: Hmmm, is .03->2000 really enough range? Seems like the choice is between that and .0005->200 with one less mantissa bit. Consider the failure modes: With the .0005->200 range we'll fail to distinguish close-scoring matches in more common score ranges, while more c

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-17 Thread Doug Cutting
Yonik Seeley wrote: Do you think that underflow should map to the smallest representable number (like norm encoding does) or 0? The smallest representable, I think. Doug
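For readers following the small-float discussion, here is a sketch of one 8-bit encoding that reproduces the values quoted earlier in this thread for mantissa_bits=4, zeroExp=4 (code 72 is 1.0, code 73 is 1.125, code 255 is 7864320.0). The class and method names are hypothetical and this is not necessarily the code that was posted or committed. Positive underflow maps to the smallest nonzero code, as suggested above.

```java
// Hypothetical sketch; parameter names follow the table quoted in the thread.
class SmallFloatSketch {
    /** Encodes f as an unsigned 8-bit code stored in a byte. */
    static byte floatToByte(float f, int numMantissaBits, int zeroExp) {
        int fzero = (63 - zeroExp) << numMantissaBits;
        int bits = Float.floatToRawIntBits(f);
        // Keep the sign, the exponent and the top (numMantissaBits - 1) fraction bits.
        int small = bits >> (24 - numMantissaBits);
        if (small <= fzero) {
            // Zero and negative inputs map to 0; positive underflow maps to
            // the smallest nonzero code rather than to 0.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        } else if (small >= fzero + 0x100) {
            return (byte) 0xFF; // overflow saturates at the largest code
        }
        return (byte) (small - fzero);
    }

    /** Decodes a code produced by floatToByte with the same parameters. */
    static float byteToFloat(byte b, int numMantissaBits, int zeroExp) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xFF) << (24 - numMantissaBits);
        bits += (63 - zeroExp) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(floatToByte(1.0f, 4, 4));       // prints 72
        System.out.println(byteToFloat((byte) 73, 4, 4));  // prints 1.125
        System.out.println(floatToByte(1e-10f, 4, 4));     // prints 1 (underflow)
    }
}
```

A scorer could then keep a `byte[maxDoc]` of such codes instead of a `float[maxDoc]`, trading precision for a quarter of the memory.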

Re: Spans, appended fields, and term positions

2005-11-21 Thread Doug Cutting
Erik Hatcher wrote: Would this be an acceptable change to commit? The javadoc is pretty slim! This adds another field to Field which is not stored and will hence not be reflected in hit documents. So it will confuse folks. Doug

Re: Lock Dir Location

2005-11-21 Thread Doug Cutting
Marvin Humphrey wrote: As I work on the ports for FSDirectory and FSLock, I'm wondering why the file system directory itself shouldn't be used to hold the lockfiles. If there's a write permissions problem, such as an FSDirectory on CDrom, disableLocks takes care of it. Is there some scen

Re: svn commit: r348060 - in /lucene/java/trunk/src: java/org/apache/lucene/analysis/ java/org/apache/lucene/index/ test/org/apache/lucene/index/ test/org/apache/lucene/search/

2005-11-22 Thread Doug Cutting
[EMAIL PROTECTED] wrote: + * Invoked, by DocumentWriter, before indexing a Field instance if + * terms have already been added to that field. This allows custom + * analyzers to place an automatic position increment gap between + * Field instances using the same field name. The default

Re: open source YourKit licence

2005-12-01 Thread Doug Cutting
Yonik Seeley wrote: a) do any other committers want a license, and Why not just include all committer names? b) would we be willing to put their logo somewhere in exchange? Perhaps we should reserve that until we find that Lucene has been significantly improved by YourKit. Doug

Re: "Advanced" query language

2005-12-07 Thread Doug Cutting
Erik Hatcher wrote: While there have been several different topics brought up on this thread, it seems we're diverging from the original idea. Let's consider the most basic use case example here, and I'm making it intentionally as concrete as possible: A Swing client performs searches by

Re: Directory Implementation: Java Content Repository

2005-12-16 Thread Doug Cutting
Nicolas Belisle wrote: Since Java Content Repository uses java.io.InputStream, I extended RAMInputStream to achieve random reads from the java.io.InputStream. (Have a better idea?) So you're buffering the entire file? That doesn't sound good. If there are no provisions for random access, t

Re: indexreader refresh

2006-01-04 Thread Doug Cutting
Amol Bhutada wrote: If I have a reader and searcher on an indexdata folder and another indexwriter writing documents to the same indexdata folder, do I need to close existing reader and searcher and create new so that newly indexed data comes into search effect? [moved from user to dev list]

Re: indexreader refresh

2006-01-04 Thread Doug Cutting
Yes, that's a good start. Your patch does not handle deletions correctly. If a segment has had deletions since it was opened then its deletions file needs to be re-read. I also think returning a new IndexReader is preferable to modifying one, since an IndexReader is often used as a cache key

nightly builds!

2006-01-25 Thread Doug Cutting
I just set up nightly builds for Lucene on our new Solaris zone. These are at: http://cvs.apache.org/dist/lucene/java/nightly/ I've updated the header for the binary release page to note this: http://www.apache.org/dist/jakarta/lucene/binaries/ (BTW, we should sometime move our releases out of

Re: nightly builds!

2006-01-26 Thread Doug Cutting
Daniel Naber wrote: On Wednesday 25 January 2006 23:26, Doug Cutting wrote: I just set up nightly builds for Lucene on our new Solaris zone. These are at: http://cvs.apache.org/dist/lucene/java/nightly/ Thanks! What about putting that on the front page as a news item? +1 Doug

Re: Preventing "killer" queries

2006-02-07 Thread Doug Cutting
mark harwood wrote: For these outlier situations is it worth adding a "maxDf" property to TermQuery like BooleanQuery's maxClause query-time control? I could fix my problem in my own app-specific query construction code but I wonder if others would find it a useful fix to add to TermQuery in the

1.9 RC1

2006-02-13 Thread Doug Cutting
I'd like to push out a 1.9 release candidate in the next week or so. Are there any patches folks are really hoping to sneak into 1.9? If so, now's the time. Doug

Re: 1.9 RC1

2006-02-13 Thread Doug Cutting
Doug Cutting wrote: I'd like to push out a 1.9 release candidate in the next week or so. Are there any patches folks are really hoping to sneak into 1.9? If so, now's the time. This is a great time to improve the javadoc. I see lots of blank boxes which could use a bit of descri

Re: updating fieldNorms in mass

2006-02-14 Thread Doug Cutting
Chris Hostetter wrote: in the case where doc boosts and field boosts aren't used, it seems like it would be very easy to write a maintenance app that did something like... get instance of similarity based on input foreach fieldName in input { int[] termCounts = new int[maxDoc];

Re: 1.9 RC1

2006-02-14 Thread Doug Cutting
Chris Hostetter wrote: I'm not sure what the ASF/Lucene policy is on keeping Copyright/License statements in source files up to date, but should they all be updated to say "Copyright 2006 The Apache Software Foundation" prior to a 1.9 release? It shouldn't hurt! This week is pretty booked for

Re: 1.9 RC1

2006-02-15 Thread Doug Cutting
DM Smith wrote: Would that mean that 1.9 and 2.0 will be released at the same time? No. 2.0 will be released after 1.9. The primary change will be that all deprecated methods are removed; there may be other changes, but probably not many. Doug
