[jira] Commented: (LUCENE-600) ParallelWriter companion to ParallelReader

2009-08-31 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749678#action_12749678 ] Chuck Williams commented on LUCENE-600: --- A given logical Document must have the

[jira] Commented: (LUCENE-600) ParallelWriter companion to ParallelReader

2009-08-31 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749660#action_12749660 ] Chuck Williams commented on LUCENE-600: --- Erratum: "deletion changes do

[jira] Commented: (LUCENE-600) ParallelWriter companion to ParallelReader

2009-08-31 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749656#action_12749656 ] Chuck Williams commented on LUCENE-600: --- The version attached here is from ov

[jira] Commented: (LUCENE-600) ParallelWriter companion to ParallelReader

2009-08-31 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749599#action_12749599 ] Chuck Williams commented on LUCENE-600: --- The patent isn't on the parall

[jira] Commented: (LUCENE-600) ParallelWriter companion to ParallelReader

2009-08-31 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749450#action_12749450 ] Chuck Williams commented on LUCENE-600: --- I contributed the first patch to make f

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-21 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544644 ] Chuck Williams commented on LUCENE-1052: > It almost feels like we should have "hooks" that a

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-20 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544136 ] Chuck Williams commented on LUCENE-1052: I can report that in our application having a formula is critical

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-20 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544055 ] Chuck Williams commented on LUCENE-1052: I agree a general configuration system would be much better. Doug

[jira] Updated: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-19 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Williams updated LUCENE-1052: --- Attachment: termInfosConfigurer.patch termInfosConfigurer.patch extends the

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-18 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543383 ] Chuck Williams commented on LUCENE-1052: I believe this needs to be a formula as a reasonable bound on the

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-17 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543306 ] Chuck Williams commented on LUCENE-1052: Michael, thanks for creating an excellent production version of

Re: Term pollution from binary data

2007-11-12 Thread Chuck Williams
Doug Cutting wrote on 11/07/2007 09:26 AM: Hadoop's MapFile is similar to Lucene's term index, and supports a feature where only a subset of the index entries are loaded (determined by io.map.index.skip). It would not be difficult to add such a feature to Lucene by changing TermInfosReader#ens

Term pollution from binary data

2007-11-06 Thread Chuck Williams
Hi All, We are experiencing OOM's when binary data contained in text files (e.g., a base64 section of a text file) is indexed. We have extensive recognition of file types but have encountered binary sections inside of otherwise normal text files. We are using the default value of 128 for te

[jira] Created: (LUCENE-1037) Corrupt index: term out of order after forced stop during indexing

2007-10-28 Thread Chuck Williams (JIRA)
Type: Bug Components: Index Affects Versions: 2.0.1 Environment: Windows Server 2003 Reporter: Chuck Williams In testing a reboot during active indexing, upon restart this exception occurred: Caused by: java.io.IOException: term out of order ("ancestorForwa

[jira] Commented: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values

2007-01-23 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466904 ] Chuck Williams commented on LUCENE-762: --- I use FieldInfo heavily and many other package-level API's, bu

Re: Lucene 2.1, soon

2007-01-18 Thread Chuck Williams
I need to support NFS and would not want to rely on the reader refreshing in X minutes. Setting X too small risks a query failure and setting X too large wastes disk space. X would need to be set for 100% reader availability, implying a large value and a lot of disk space waste. I like the idea

Re: Lucene 2.1, soon

2007-01-18 Thread Chuck Williams
How about a direct solution with a reference count scheme? Segments files could be reference-counted, as well as individual segments either directly, possibly by interning SegmentInfo instances, or indirectly by reference counting all files via Directory. The most recent checkpoint and snapshot w
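The reference-count idea in the message above can be illustrated with a minimal sketch. This is not Lucene code: the class `FileRefCounts` and its methods are hypothetical names invented for illustration, and it assumes each reader or writer checkpoint calls `incRef` on the files it depends on and `decRef` when it lets go. A file becomes eligible for deletion only when its count returns to zero.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not part of Lucene): reference counts keyed by index file name. */
public class FileRefCounts {
    private final Map<String, Integer> counts = new HashMap<>();

    /** Record one more holder (for example a reader or a checkpoint) of the file. */
    public synchronized void incRef(String fileName) {
        counts.merge(fileName, 1, Integer::sum);
    }

    /** Drop one holder; returns true when no holder remains, so the file may be deleted. */
    public synchronized boolean decRef(String fileName) {
        Integer n = counts.get(fileName);
        if (n == null) {
            throw new IllegalStateException("not referenced: " + fileName);
        }
        if (n == 1) {
            counts.remove(fileName);
            return true;
        }
        counts.put(fileName, n - 1);
        return false;
    }

    /** Current number of holders; 0 for a file that is not referenced. */
    public synchronized int refCount(String fileName) {
        return counts.getOrDefault(fileName, 0);
    }
}
```

With counts kept this way, a reader would not need to refresh within a fixed time window: the files it holds simply stay on disk until it releases them.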

Re: Lucene 2.1, soon

2007-01-17 Thread Chuck Williams
Grant Ingersoll wrote on 01/17/2007 01:42 AM: > Also, I'm curious as to how many people use NFS in live systems. > I've got the requirement to support large indexes and collections of indexes on NAS devices, which from linux pretty much means NFS or CIFS. This doesn't seem unusual. Chuck -

Re: adding "explicit commits" to Lucene?

2007-01-17 Thread Chuck Williams
I don't see how to do commits without at least some new methods. There needs to be some way to roll back changes rather than committing them. If the commit action is IndexWriter.close() (even if just an interface) the user still needs another method to roll back. There are reasons to close an In

Re: adding "explicit commits" to Lucene?

2007-01-16 Thread Chuck Williams
Michael McCandless wrote on 01/16/2007 12:09 PM: > Doug Cutting wrote: >> Michael McCandless wrote: >>> We could indeed simply tie "close" to mean "commit now", and not add a >>> separate "commit" method. >>> >>> But what about the "bulk delete then bulk add" case? Ideally if a >>> reader refreshe

Re: adding "explicit commits" to Lucene?

2007-01-16 Thread Chuck Williams
Yonik Seeley wrote on 01/16/2007 11:29 AM: > On 1/16/07, robert engels <[EMAIL PROTECTED]> wrote: >> You have the same problem if there is an existing reader open, so >> what is the difference? You can't remove the segments there either. > > The disk space for the segments is currently removed if n

[jira] Commented: (LUCENE-756) Maintain norms in a single file .nrm

2007-01-16 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465240 ] Chuck Williams commented on LUCENE-756: --- I may have the only app that will be broken by the 10-day backwards

Re: adding "explicit commits" to Lucene?

2007-01-15 Thread Chuck Williams
lts and then joining those by UID will be much slower than BooleanQuery operating on ParallelReader with doc-id sorted postings. The alternative of a UID-based BooleanQuery would have similar challenges unless the postings were sorted by UID. But hey, that's permanent doc-ids. Chuck

Re: adding "explicit commits" to Lucene?

2007-01-15 Thread Chuck Williams
1 on all of this. I think it will detract from the simple and > efficient storage mechanism that Lucene uses. > > On Jan 15, 2007, at 11:19 PM, Chuck Williams wrote: > >> Ning Li wrote on 01/15/2007 06:29 PM: >>> On 1/14/07, Michael McCandless <[EMAIL PROTECTED]> wrot

Re: adding "explicit commits" to Lucene?

2007-01-15 Thread Chuck Williams
Ning Li wrote on 01/15/2007 06:29 PM: > On 1/14/07, Michael McCandless <[EMAIL PROTECTED]> wrote: >> * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature >> could have a more efficient implementation (just like Solr) when >> autoCommit is false, because deletes don't need t

Re: adding "explicit commits" to Lucene?

2007-01-15 Thread Chuck Williams
robert engels wrote on 01/15/2007 08:01 AM: > Is your parallel adding code available? > There is an early version in LUCENE-600, but without the enhancements described. I didn't update that version because it didn't capture any interest and requires Java 1.5 and so it seems will not be committed.

Re: adding "explicit commits" to Lucene?

2007-01-15 Thread Chuck Williams
Michael McCandless wrote on 01/15/2007 01:49 AM: > Chuck, > >> Possibly related, one of the ways I improved concurrency in >> ParallelWriter was to break up IndexWriter.addDocument() into one method >> to invert the document and create a RAMSegment and a second method that >> takes the RAMSegment

Re: adding "explicit commits" to Lucene?

2007-01-14 Thread Chuck Williams
Michael, This seems to me to be a great idea, especially the ability to support index transactions. ParallelWriter (original implementation in LUCENE-600 -- I have a much better one now) provides a companion writer to ParallelReader. It takes a Document, breaks it up into subdocuments associated

[jira] Commented: (LUCENE-772) Lucene infinite loop? In FieldsReader.uncompress called from IndexSearcher.doc

2007-01-12 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464358 ] Chuck Williams commented on LUCENE-772: --- I had many concurrency problems with java.util.zip and ended up

Re: Beyond Lucene 2.0 Index Design

2007-01-12 Thread Chuck Williams
Doug Cutting wrote on 01/12/2007 09:49 AM: > Marvin Humphrey wrote: >> Can you show us some code or pseudo-code for a BooleanScorer that >> would use impact-sorted posting lists? > > Another way to interpret this proposal is index-only: the low-level > indexing APIs should be general enough to per

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

2007-01-11 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464055 ] Chuck Williams commented on LUCENE-769: --- Robert, Could you attach your current implementation of reopen() as

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

2007-01-11 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464012 ] Chuck Williams commented on LUCENE-769: --- I have this same issue with a constantly changing large index where

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

2007-01-10 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463729 ] Chuck Williams commented on LUCENE-769: --- The test case uses only tiny documents, and the reported timings for

[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

2007-01-09 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463322 ] Chuck Williams commented on LUCENE-767: --- Isn't maxDoc always the same as the docCount of the segment, whi

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

2007-01-03 Thread Chuck Williams (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462122 ] Chuck Williams commented on LUCENE-510: --- Has an improvement been made to eliminate the reported 20% indexing

[jira] Commented: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values

2006-12-29 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-762?page=comments#action_12461460 ] Chuck Williams commented on LUCENE-762: --- Hi Grant, Maybe even better would be to have an appropriate method on FieldSelectorResult. E.g

[jira] Updated: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values

2006-12-28 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-762?page=all ] Chuck Williams updated LUCENE-762: -- Attachment: SizeFieldSelector.patch > [PATCH] Efficiently retrieve sizes of field val

[jira] Created: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values

2006-12-28 Thread Chuck Williams (JIRA)
: Store Affects Versions: 2.1 Reporter: Chuck Williams Sometimes an application would like to know how large a document is before retrieving it. This can be important for memory management or choosing between algorithms, especially in cases where documents might be very large

[jira] Commented: (LUCENE-754) FieldCache keeps hard references to readers, doesn't prevent multiple threads from creating same instance

2006-12-19 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-754?page=comments#action_12459791 ] Chuck Williams commented on LUCENE-754: --- This patch, together with LUCENE-750 (already committed) solved our problem completely. It sped up simultaneous

[jira] Commented: (LUCENE-754) FieldCache keeps hard references to readers, doesn't prevent multiple threads from creating same instance

2006-12-19 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-754?page=comments#action_12459763 ] Chuck Williams commented on LUCENE-754: --- Cool! This should solve at least part of my problem. Trying this now (along with finalizer removal patch that is

Re: 15 minute hang in IndexInput.clone() involving finalizers

2006-12-16 Thread Chuck Williams
va:175) > org.apache.lucene.store.BufferedIndexInput.clone(BufferedIndexInput.java:128) > > org.apache.lucene.store.FSIndexInput.clone(FSDirectory.java:564) > org.apache.lucene.index.SegmentTermDocs.<init>(SegmentTermDocs.java:45) Thanks, Chuck Chuck Williams wrote on 12/15/2006 08:22 A

Re: 15 minute hang in IndexInput.clone() involving finalizers

2006-12-15 Thread Chuck Williams
Yonik and Robert, thanks for the suggestions and pointer to the patch! We've looked at the synchronization involved with finalizers and don't see how it could cause the issue as running the finalizers themselves is outside the lock. The code inside the lock is simple fixed-time list manipulation,

15 minute hang in IndexInput.clone() involving finalizers

2006-12-15 Thread Chuck Williams
Hi All, I've had a bizarre anomaly arise in an application and am wondering if anybody has ever seen anything like this. Certain queries, in not easy to reproduce cases, take 15-20 minutes to execute rather than a few seconds. The same query is fast some times and anomalously slow others. This

Re: Locale string compare: Java vs. C#

2006-12-13 Thread Chuck Williams
Surprising but it looks to me like a bug in Java's collation rules for en-US. According to http://developer.mimer.com/collations/charts/UCA_latin.htm, \u00D8 (which is Latin Capital Letter O With Stroke) should be before U, implying -1 is the correct result. Java is returning 1 for all strengths

Re: Attached proposed modifications to Lucene 2.0 to support Field.Store.Encrypted

2006-12-05 Thread Chuck Williams
Mike Klaas wrote on 12/05/2006 11:38 AM: > On 12/5/06, negrinv <[EMAIL PROTECTED]> wrote: > >> Chris Hostetter wrote: > >> > If the code was not already in the core, and someone asked about >> adding it >> > I would argue against doing so on the grounds that some helpful >> utility >> > methods

Re: Efficiently expunging deletions of recently added documents

2006-12-05 Thread Chuck Williams
Thanks Ning. This is all very helpful. I'll make sure to be consistent with the new merge policy and its invariant conditions. Chuck Ning Li wrote on 12/05/2006 08:01 AM: > An old issue (http://issues.apache.org/jira/browse/LUCENE-325 new > method expungeDeleted() added to IndexWriter) request

Efficiently expunging deletions of recently added documents

2006-12-04 Thread Chuck Williams
Hi All, I'd like to open up the API to mergeSegments() in IndexWriter and am wondering if there are potential problems with this. I use ParallelReader and ParallelWriter (in jira) extensively as these provide the basis for fast bulk updates of small metadata fields. ParallelReader requires that

Re: [jira] Resolved: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-22 Thread Chuck Williams
Michael Busch wrote on 11/22/2006 08:47 AM: > Ning Li wrote: >> A possible design could be: >> First, in addDocument(), compute the byte size of a ram segment after >> the ram segment is created. In the synchronized block, when the newly >> created segment is added to ramSegmentInfos, also add its

[jira] Commented: (LUCENE-723) QueryParser support for MatchAllDocs

2006-11-21 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-723?page=comments#action_12451849 ] Chuck Williams commented on LUCENE-723: --- +1 With this could also come negative-only queries, e.g. -foo as a shortcut for *:* -foo > QueryParser supp

[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-21 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-709?page=all ] Chuck Williams updated LUCENE-709: -- Attachment: ramDirSizeManagement.patch This one should be golden as it addresses all the issues that have been raised and I believe the synchronization is

[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-17 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12450894 ] Chuck Williams commented on LUCENE-709: --- I didn't see Yonik's new version or comments until after my attach. Throwing IOExceptions when files t

[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-17 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-709?page=all ] Chuck Williams updated LUCENE-709: -- Attachment: ramDirSizeManagement.patch I've just attached my version of this patch. It includes a multi-threaded test case. I believe it is soun

[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-15 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12450301 ] Chuck Williams commented on LUCENE-709: --- I hadn't considered the case of such large values for maxBufferedDocs, and agree that the loop execution ti

[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-15 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12450260 ] Chuck Williams commented on LUCENE-709: --- Not synchronizing on the Hashtable, even if using an Enumerator, creates problems as the contents of the hash table

Re: ParallelMultiSearcher reimplementation

2006-11-13 Thread Chuck Williams
Doug Cutting wrote on 11/13/2006 10:50 AM: > Chuck Williams wrote: >> I followed this same logic in ParallelWriter and got burned. My first >> implementation (still the version submitted as a patch in jira) used >> dynamic threads to add the subdocuments to th

[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-10 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-709?page=all ] Chuck Williams updated LUCENE-709: -- Attachment: ramDirSizeManagement.patch > [PATCH] Enable application-level management of IndexWriter.ramDirectory s

[jira] Commented: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-10 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-709?page=comments#action_12448923 ] Chuck Williams commented on LUCENE-709: --- Mea Culpa! Bad bug on my part. Thanks for spotting it! I believe the solution is simple. RAMDirectory.files is a

[jira] Updated: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-09 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-709?page=all ] Chuck Williams updated LUCENE-709: -- Attachment: ramDirSizeManagement.patch > [PATCH] Enable application-level management of IndexWriter.ramDirectory s

[jira] Created: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-09 Thread Chuck Williams (JIRA)
Issue Type: Improvement Components: Index Affects Versions: 2.0.1 Environment: All Reporter: Chuck Williams IndexWriter currently only supports bounding of the in-memory index cache using maxBufferedDocs, which limits it to a fixed number of documents

Re: Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams
Michael Busch wrote on 11/09/2006 09:56 AM: > >> This sounds good. Michael, I'd love to see your patch, >> >> Chuck > > Ok, I'll probably need a few days before I can submit it (have to code > unit tests and check if it compiles with the current head), because > I'm quite busy with other stuff rig

Re: Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams
> > Yonik Seeley wrote: >> On 11/9/06, Chuck Williams <[EMAIL PROTECTED]> wrote: >>> Thanks Yonik! Poor wording on my part. I won't vary maxBufferedDocs, >>> just am making flushRamSegments() public and calling it externally >>> (properly sync

Re: Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams
Chuck Williams wrote on 11/09/2006 08:55 AM: > Yonik Seeley wrote on 11/09/2006 08:50 AM: > >> For best behavior, you probably want to be using the current >> (svn-trunk) version of Lucene with the new merge policy. It ensures >> there are mergeFactor segments with

Re: Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams
Yonik Seeley wrote on 11/09/2006 08:50 AM: > For best behavior, you probably want to be using the current > (svn-trunk) version of Lucene with the new merge policy. It ensures > there are mergeFactor segments with size <= maxBufferedDocs before > triggering a merge. This makes for faster indexin

Re: Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams
eeley wrote on 11/09/2006 08:37 AM: > On 11/9/06, Chuck Williams <[EMAIL PROTECTED]> wrote: >> My main concern is that the mergeFactor escalation merging logic will >> somehow behave poorly in the presence of dynamically varying initial >> segment sizes. > > Thi

Dynamically varying maxBufferedDocs

2006-11-09 Thread Chuck Williams
Hi All, Does anybody have experience dynamically varying maxBufferedDocs? In my app, I can never truncate docs and so work with maxFieldLength set to Integer.MAX_VALUE. Some documents are large, over 100 MBytes. Most documents are tiny. So a fixed value of maxBufferedDocs to avoid OOM's is too

Re: ParallelMultiSearcher reimplementation

2006-11-05 Thread Chuck Williams
Doug Cutting wrote on 11/03/2006 12:18 PM: > Chuck Williams wrote: >> Why would a thread pool be more controversial? Dynamically creating and >> garbaging threads has many downsides. > > The JVM already pools native threads, so mostly what's saved by thread

Re: ParallelMultiSearcher reimplementation

2006-11-03 Thread Chuck Williams
Chris Hostetter wrote on 11/03/2006 09:40 AM: > : Is there any timeline for when Java 1.5 packages will be allowed? > > I don't think i'll incite too much rioting to say "no there is no > timeline" > .. I may incite some rioting by saying "my guess is 1.5 packages will be > supported when the patch

Re: Include BM25 in Lucene?

2006-10-17 Thread Chuck Williams
Vic Bancroft wrote on 10/17/2006 02:44 AM: > In some of my group's usage of lucene over large document collections, > we have split the documents across several machines. This has led to > a concern of whether the inverse document frequency was appropriate, > since the score seems to be dependent

Re: Ferret's changes

2006-10-11 Thread Chuck Williams
David Balmain wrote on 10/10/2006 08:53 PM: > On 10/11/06, Chuck Williams <[EMAIL PROTECTED]> wrote: > > I personally would always store term vectors since I use a > StandardTokenizer and Stemming. In this case highlighting matches in > small documents is not trivial. Ferret&

Re: Ferret's changes

2006-10-10 Thread Chuck Williams
David Balmain wrote on 10/10/2006 03:56 PM: > Actually not using single doc segments was only possible due to the > fact that I have constant field numbers so both optimizations stem > from this one change. So I'm not sure if it is worth answering your > question but I'll try anyway. It obviousl

Re: Define end-of-paragraph

2006-10-03 Thread Chuck Williams
t the > best way > What do you think? > Thanks in advance > > Reuven Ivgi > > -Original Message- > From: Chuck Williams [mailto:[EMAIL PROTECTED] > Sent: Tuesday, October 03, 2006 10:58 AM > To: java-dev@lucene.apache.org > Subject: Re: Define end-of-para

Re: Define end-of-paragraph

2006-10-03 Thread Chuck Williams
Reuven Ivgi wrote on 10/02/2006 09:32 PM: > I want to divide a document to paragraphs, still having proximity search > within each paragraph > > How can I do that? > Is your issue that you want the paragraphs to be in a single document, but you want to limit proximity search to find matches on

Re: After kill -9 index was corrupt

2006-09-29 Thread Chuck Williams
t and the recovery code forgot to turn that off prior to the optimize! Thus a .cfs file was created, which confused the bulk updater -- it did not see a segment that was inside the cfs. Sorry for the false alarm and thanks to all who helped with the original question/concern, Chuck Chuck Williams

Re: After kill -9 index was corrupt

2006-09-11 Thread Chuck Williams
that appears to > be the likely culprit to me. > > On Sep 11, 2006, at 2:56 PM, Chuck Williams wrote: > >> robert engels wrote on 09/11/2006 07:34 AM: >>> A kill -9 should not affect the OS's writing of dirty buffers >>> (including directory modifications).

Re: After kill -9 index was corrupt

2006-09-11 Thread Chuck Williams
robert engels wrote on 09/11/2006 07:34 AM: > A kill -9 should not affect the OS's writing of dirty buffers > (including directory modifications). If this were the case, massive > system corruption would almost always occur every time a kill -9 was > used with any program. > > The only thing a kill

Re: After kill -9 index was corrupt

2006-09-11 Thread Chuck Williams
Paul Elschot wrote on 09/10/2006 09:15 PM: > On Monday 11 September 2006 02:24, Chuck Williams wrote: > >> Hi All, >> >> An application of ours under development had a memory leak that caused >> it to slow interminably. On linux, the application did no

After kill -9 index was corrupt

2006-09-10 Thread Chuck Williams
Hi All, An application of ours under development had a memory leak that caused it to slow interminably. On linux, the application did not respond to kill -15 in a reasonable time, so kill -9 was used to forcibly terminate it. After this the segments file contained a reference to a segment whose

Re: Combining search steps without re-searching

2006-08-28 Thread Chuck Williams
Andrzej Bialecki wrote on 08/28/2006 09:19 AM: > Chuck Williams wrote: >> I presume your search steps are anded, as in typical drill-downs? >> >> From a Lucene standpoint, each sequence of steps is a BooleanQuery of >> required clauses, one for each step.

Re: Combining search steps without re-searching

2006-08-28 Thread Chuck Williams
I presume your search steps are anded, as in typical drill-downs? From a Lucene standpoint, each sequence of steps is a BooleanQuery of required clauses, one for each step. To add a step, you extend the BooleanQuery with a new clause. To not re-evaluate the full query, you'd need some query th

[jira] Created: (LUCENE-659) [PATCH] PerFieldAnalyzerWrapper fails to implement getPositionIncrementGap()

2006-08-17 Thread Chuck Williams (JIRA)
Issue Type: Bug Components: Analysis Affects Versions: 2.0.1, 2.1 Environment: Any Reporter: Chuck Williams Attachments: PerFieldAnalyzerWrapper.patch The attached patch causes PerFieldAnalyzerWrapper to delegate calls to getPositionIncrementGap

Re: Strange behavior of positionIncrementGap

2006-08-12 Thread Chuck Williams
Yonik Seeley wrote on 08/12/2006 05:08 AM: > On 8/11/06, Chuck Williams <[EMAIL PROTECTED]> wrote: >> 1) a b C D ...results in: _gap_ _gap_ C _gap_ D >> 2) a B C D ...results in: _gap_ B _gap_ C _gap_ D >> 3) A b c D ...results in: A _gap_ _gap_ _gap_ D >> >

Re: Strange behavior of positionIncrementGap

2006-08-11 Thread Chuck Williams
Chris Hostetter wrote on 08/11/2006 09:08 AM: > (using lower case > to indicate no tokens produced and upper case to indicate tokens were > produced) ... > > 1) a b C _gap_ D ...results in: C _gap_ D > 2) a B _gap_ C _gap_ D ...results in: B _gap_ C _gap_ D > 3) A _gap_ b _gap_

Strange behavior of positionIncrementGap

2006-08-11 Thread Chuck Williams
Hi All, There is a strange treatment of positionIncrementGap in DocumentWriter.invertDocument(). The gap is inserted between all values of a field, except it is not inserted between values if the prefix of the value list up to that point has not yet generated a token. For example, if a field F
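The behavior described in the message above can be mimicked with a small stand-alone simulation. This is only a sketch of the rule as stated in the message, not the actual DocumentWriter code: the class `GapSketch`, the method `positions`, and the simplification that positions start at 0 and advance by 1 per token are all assumptions made for illustration. Each inner list holds the tokens produced by one value of the field, and an empty list stands for a value that produced no tokens.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of the gap rule described in the message; not Lucene code. */
public class GapSketch {
    /** Returns the position assigned to each emitted token, in order. */
    public static List<Integer> positions(List<List<String>> values, int gap) {
        List<Integer> out = new ArrayList<>();
        int next = 0;               // position the next token would receive
        boolean seenToken = false;  // has any earlier value produced a token?
        for (List<String> tokens : values) {
            if (seenToken) {
                next += gap;        // the gap is applied only after a first token exists
            }
            for (int i = 0; i < tokens.size(); i++) {
                out.add(next++);
                seenToken = true;
            }
        }
        return out;
    }
}
```

Under these assumptions and a gap of 10, values that tokenize as (nothing, nothing, C, D) give positions 0 and 11, so no gap precedes C, while (A, nothing, nothing, D) gives 0 and 31, because three gaps accumulate after A.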

Re: Using Lucene for Semantic search

2006-07-20 Thread Chuck Williams
I have built such a system, although not with Lucene at the time. I doubt you need to modify anything in Lucene to achieve this. You may want to index words, stems and/or concepts from the ontology. Concepts from the ontology may relate to words or phrases. Lucene's token structure is flexible,

Re: Lucene/Netbean Newbie looking for help

2006-07-10 Thread Chuck Williams
y suggestions? Or any pointers to getting the tests > to work in netbeans are appreciated.

Re: Global field semantics

2006-07-10 Thread Chuck Williams
Chris Hostetter wrote on 07/10/2006 12:31 PM: > So i guess we are on the same page that this kind of thing can be done at > the App level -- what benefits do you see moving them into the Lucene > index level? > Other than performance per David's and Marvin's ideas, the functionality benefits

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-10 Thread Chuck Williams
Yonik Seeley wrote on 07/10/2006 09:27 AM: > I'll rephrase my original question: > When implementing NewIndexModifier, what type of efficiencies do we > get by using the new protected methods of IndexWriter vs using the > public APIs of IndexReader and IndexWriter? I won't comment on Ning's imp

Re: Global field semantics

2006-07-10 Thread Chuck Williams
Chris Hostetter wrote on 07/10/2006 02:06 AM: > As near as i can tell, the large issue can be summarized with the following > sentiment: > > Performance gains could be realized if Field > properties were made fixed and homogeneous for > all Documents in an index. > This is cert

Re: Global field semantics

2006-07-10 Thread Chuck Williams
David Balmain wrote on 07/10/2006 01:04 AM: > The only problem I could find with this solution is that > fields are no longer in alphabetical order in the term dictionary but > I couldn't think of a use-case where this is necessary although I'm > sure there probably is one. So presumably fields ar

Re: Global field semantics

2006-07-09 Thread Chuck Williams
David Balmain wrote on 07/09/2006 06:44 PM: > On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote: >> Marvin Humphrey wrote on 07/08/2006 11:13 PM: >> > >> > On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote: >> > >> >> Many things would be

Re: Global field semantics

2006-07-09 Thread Chuck Williams
Marvin Humphrey wrote on 07/08/2006 11:13 PM: > > On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote: > >> Many things would be cleaner in Lucene if fields had a global semantics, >> i.e., if properties like text vs. binary, Index, Store, TermVector, the >> appropriate

[jira] Commented: (LUCENE-509) Performance optimization when retrieving a single field from a document

2006-07-09 Thread Chuck Williams (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-509?page=comments#action_12419926 ] Chuck Williams commented on LUCENE-509: --- LUCENE-545 does resolve this in a more general way, although the code to get precisely one field value efficiently is slightly

Re: Global field semantics

2006-07-08 Thread Chuck Williams
karl wettin wrote on 07/08/2006 12:27 PM: > On Sat, 2006-07-08 at 11:08 -0700, Chuck Williams wrote: > >> Karl, do you have specific reasons or use cases to normalize fields at >> Document rather than at Index? >> > > Nothing more than that the way the API

Re: Global field semantics

2006-07-08 Thread Chuck Williams
karl wettin wrote on 07/08/2006 10:27 AM: > On Sat, 2006-07-08 at 09:46 -0700, Chuck Williams wrote: > >> Many things would be cleaner in Lucene if fields had a global semantics, >> > > >> Has this been considered before? Are there good reasons this

Re: Java 1.5 (was ommented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided))

2006-07-08 Thread Chuck Williams
Doug Cutting wrote on 07/08/2006 09:41 AM: > Chuck Williams wrote: >> I only work in 1.5 and use its features extensively. I don't think >> about 1.4 at all, and so have no idea how heavily dependent the code in >> question is on 1.5. >> >> Unfortunately,

Global field semantics

2006-07-08 Thread Chuck Williams
Many things would be cleaner in Lucene if fields had a global semantics, i.e., if properties like text vs. binary, Index, Store, TermVector, the appropriate Analyzer, the assignment of Directory in ParallelReader (or ParallelWriter), etc. were a function of just the field name and the index. This

Re: Java 1.5 (was ommented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided))

2006-07-07 Thread Chuck Williams
DM Smith wrote on 07/07/2006 07:07 PM: > Otis, > First let me say, I don't want to rehash the arguments for or > against Java 1.5. This is an emotional issue for people on both sides. > However, I think you have identified that the core people need to > make a decision and the rest of us

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-06 Thread Chuck Williams
r to IndexModifier without > the warning that you should do all the deletions first, and then all > the additions - the BufferedWriter would manage this for you. > > On Jul 6, 2006, at 9:16 PM, Chuck Williams wrote: > >> Robert, >> >> Either you or I are missing somethi

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-06 Thread Chuck Williams
Robert, Either you or I are missing something basic. I'm not sure which. As I understand things, an IndexWriter and an IndexReader cannot both have the write lock at the same time (they use the same write lock file name). Only an IndexReader can delete and only an IndexWriter can add. So to up

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-06 Thread Chuck Williams
robert engels wrote on 07/06/2006 12:24 PM: > I guess we just chose a much simpler way to do this... > > Even with your code changes, to see the modification made using the > IndexWriter, it must be closed, and a new IndexReader opened. > > So a far simpler way is to get the collection of updates fi
