Google-developed posting list encoding

2010-04-14 Thread Mike Klaas
Can be quite a bit faster than vInt in some cases: http://www.ir.uwaterloo.ca/book/addenda-06-index-compression.html -Mike
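For readers unfamiliar with the baseline being compared against, here is a minimal sketch of a vInt-style encoder and decoder: seven payload bits per byte, low-order group first, with the high bit set on every byte except the last. The class and method names (VIntSketch, encode, decode) are made up for illustration and are not Lucene's API. The encoding described at the linked page is not reproduced here.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: a vInt-style variable-length integer codec.
// VIntSketch, encode and decode are hypothetical names, not Lucene classes.
final class VIntSketch {

    // Writes 7 bits per byte, least-significant group first.
    // The high bit of a byte is set when more bytes follow.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reads groups of 7 bits until a byte with a clear high bit is seen.
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return result;
    }
}
```

Decoding has to test a continuation bit on every byte, and that per-byte branch is the kind of cost a faster posting-list encoding would aim to avoid.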

Re: possible TermInfosReader speedup

2009-04-09 Thread Mike Klaas
On 8-Apr-09, at 11:13 PM, Michael Busch wrote: I was thinking about doing this as part of LUCENE-1195. However, I doubt that the net win will be very noticeable here. A common scenario is that you have an index with one big body field that has a lot of unique terms, plus several metafield

[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-23 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688449#action_12688449 ] Mike Klaas commented on LUCENE-1561: I agree that it is going to be almost imposs

Re: Modularization

2009-03-23 Thread Mike Klaas
On 23-Mar-09, at 2:41 PM, Michael McCandless wrote: I agree, but at least we need some clear criteria so the future decision process is more straightforward. Towards that... it seems like there are good reasons why something should be put into contrib: * It uses a version of JDK higher than

Re: Getting tokens from search results. Simple concept

2009-03-06 Thread Mike Klaas
On 5-Mar-09, at 2:42 PM, Chris Hostetter wrote: : What I would LOVE is if I could do it in a standard Lucene search like I : mentioned earlier. : Hit.doc[0].getHitTokenList() :confused: : Something like this... The Query/Scorer APIs don't provide any mechanism for information like that to b

[jira] Commented: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669843#action_12669843 ] Mike Klaas commented on LUCENE-1534: [quote]But if we feel that over-emphasizes t

Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2008-11-20 Thread Mike Klaas
On 19-Nov-08, at 5:12 AM, Michael McCandless (JIRA) wrote: How can the VM system possibly make good decisions about what to swap out? It can't know if a page is being used for terms dict index, terms dict, norms, stored fields, postings. LRU is not a good policy, because some pages (terms ind

Re: Setting Fix Version in JIRA

2008-09-23 Thread Mike Klaas
On 23-Sep-08, at 12:33 PM, Otis Gospodnetic wrote: Hi, When people add new issues to JIRA they most often don't set the "Fix Version" field. Would it not be better to have a default value for that field, so that new entries don't get forgotten when we filter by "Fix Version" looking for

Re: [jira] Closed: (LUCENE-1363) sub task of reopen performance

2008-08-22 Thread Mike Klaas
Wow, that was a fast resolution to this "issue" :) -Mike On 22-Aug-08, at 12:46 AM, F.Y. (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F.Y. closed LUCENE-1363. Resoluti

Re: per-field similarity

2008-06-25 Thread Mike Klaas
On 24-Jun-08, at 1:28 PM, Yonik Seeley wrote: Something to consider for Lucene 3 is to have something to retrieve Similarity per-field rather than passing the field name into some functions... +1 I've felt that this was the "proper" (and more useful) way to do things for a long time (http

Re: [jira] Updated: (LUCENE-1314) IndexReader.reopen(boolean force)

2008-06-23 Thread Mike Klaas
On 23-Jun-08, at 10:14 AM, Jason Rutherglen (JIRA) wrote: Does anyone know how to turn off Eclipse automatically changing the import statements? I am not making it reformat but if I edit some code in a file it sees fit to reformat the imports. http://www.google.com/search?q=turn%20off%20e

[jira] Commented: (LUCENE-1293) Tweaks to PhraseQuery.explain()

2008-05-29 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12600973#action_12600973 ] Mike Klaas commented on LUCENE-1293: It is meant for debugging, though I have f

Re: LazyRAMDirectory

2008-05-01 Thread Mike Klaas
On 1-May-08, at 10:03 AM, Timo Nentwig wrote: Hello developers, I do have enough memory to load the index completely into RAM but can't live with the fact that it takes multiple minutes to do so. So I came up with the idea of implementing a RAMDirectory proxy that does the Directory.copy()

Re: [jira] Created: (LUCENE-1195) Performance improvement for TermInfosReader

2008-02-26 Thread Mike Klaas
On 26-Feb-08, at 3:00 PM, Michael Busch (JIRA) wrote: 50,000 AND queries with 3 terms each: old: 152 secs new (with LRU cache): 112 secs (26% faster) 50,000 OR queries with 3 terms each: old: 175 secs new (with LRU cache): 133 secs (24% faster) For bigger ind
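The numbers quoted above refer to an LRU cache. As a rough illustration of the idea only (this is not the LUCENE-1195 patch, and LruCacheSketch is a made-up name), an access-ordered LinkedHashMap gives a small LRU cache in a few lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a tiny LRU cache built on LinkedHashMap.
// LruCacheSketch is a hypothetical class, not part of Lucene.
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        // accessOrder = true makes get() move an entry to the "most recent" end.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true
    // evicts the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

This sketch is not thread-safe. Shared use would need external synchronization or one cache per thread.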

[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support PhraseQuery, SpanQuery, ConstantScoreRangeQuery

2008-02-20 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570896#action_12570896 ] Mike Klaas commented on LUCENE-794: --- This may be largely irrelevant, but Solr h

Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

2008-02-10 Thread Mike Klaas
hing new to experienced designers/ developers - I am only offering a reminder. It is my observation (others will disagree !), but I think a lot of Lucene has some unneeded esoteric code, where the benefit doesn't match the cost. On Feb 10, 2008, at 5:48 PM, Mike Klaas wrote: While I

Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

2008-02-10 Thread Mike Klaas
While I agree in general that excessive optimization at the expense of code clarity is undesirable, you are overstating the point. 2X is a ridiculous threshold to apply to something as performance critical as a full text search engine. If search was twice as slow, lucene would be utterly

Re: detected corrupted index / performance improvement

2008-02-07 Thread Mike Klaas
// To write term vectors private FieldsWriter fieldsWriter; is my clue that several files are written at once. On Feb 7, 2008, at 5:19 PM, Mike Klaas wrote: On 7-Feb-08, at 2:00 PM, robert engels wrote: My point is that commit needs to be used in most applications, and the co

Re: detected corrupted index / performance improvement

2008-02-07 Thread Mike Klaas
On 7-Feb-08, at 2:00 PM, robert engels wrote: My point is that commit needs to be used in most applications, and the commit in Lucene is very slow. You don't have 2x the IO cost, mainly because only the log file needs to be sync'd. The index only has to be sync'd eventually, in order to

[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-05 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565942#action_12565942 ] Mike Klaas commented on LUCENE-1157: If you just want to exclude them from se

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Mike Klaas
On 10-Dec-07, at 1:20 PM, Shai Erera wrote: Thanks for the info. Too bad I use Windows ... Just allocate a bunch of memory and free it. This is Linux, but something similar should work on Windows: $ vmstat -S M procs ---memory-- r b swpd free buff cache 0 0 0

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Mike Klaas
On 10-Dec-07, at 12:11 PM, Shai Erera wrote: Actually, queries on large indexes are not necessarily I/O bound. It depends on how much of the posting list is being read into memory at once. I'm not that familiar with the inner-most of Lucene, but let's assume a posting element takes 4 bytes

Re: Performance Improvement for Search using PriorityQueue

2007-12-10 Thread Mike Klaas
On 10-Dec-07, at 11:31 AM, Shai Erera wrote: As you can see, the actual allocation time is really negligible and there isn't much difference in the avg. running times of the queries. However, the *current* runs performed a lot worse at the beginning, before the OS cache warmed up. This s

Re: O/S Search Comparisons

2007-12-10 Thread Mike Klaas
On 8-Dec-07, at 10:04 PM, Doron Cohen wrote: +1 I have been thinking about this too. Solr clearly demonstrates the benefits of this kind of approach, although even it doesn't make it seamless for users in the sense that they still need to divvy up the docs on the app side. Would be nice if t

Re: O/S Search Comparisons

2007-12-07 Thread Mike Klaas
There is a good chance that they were using stock indexing defaults, based on: Lucene: " In the present work, the simple applications bundled with the library were used to index the collection. " On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote: Yeah, I wasn't too excited over it and I certain

[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup

2007-11-21 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544630 ] Mike Klaas commented on LUCENE-693: --- Yonik: this is great! I applied and tested the patch and everything looks

Re: Payload API

2007-11-17 Thread Mike Klaas
On 17-Nov-07, at 5:49 PM, Yonik Seeley wrote: So I think we should change + finalize the payload API before Lucene 2.3 comes out. Single biggest drawback about current payloads is that there isn't any explicit support for adding different types of payloads to the same token. I don't really see

Re: Apache logs and data

2007-11-15 Thread Mike Klaas
On 15-Nov-07, at 5:33 AM, Grant Ingersoll wrote: Would people be interested in asking infrastructure to see if we can get our hands on things like JIRA search logs and any other search/query logs available? I'm thinking if we had this, plus the underlying data, we could start to use this i

[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup

2007-11-07 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540913 ] Mike Klaas commented on LUCENE-693: --- Paul wrote: > As just discussed on java-dev, the creation of an object dur

[jira] Commented: (LUCENE-1035) Optional Buffer Pool to Improve Search Performance

2007-10-26 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538126 ] Mike Klaas commented on LUCENE-1035: > Query set with average 590K results, retrieving docids for the first

Re: payload api (scorePayload)

2007-09-10 Thread Mike Klaas
On 10-Sep-07, at 3:00 PM, Grant Ingersoll wrote: What I truly pine for is a way to globally override Similarity on a per-field basis. Wishful thinking... Instead of wishful thinking, let's figure out a patch... :-) Someday, I will find the time to delve more deeply into lucene wishful

payload api (scorePayload)

2007-09-10 Thread Mike Klaas
This is the current api for scorePayload: public float scorePayload(byte [] payload, int offset, int length) { ISTM that this function depends greatly on the field--what if the end user wants to store two completely different kinds of values in different fields? Could fieldName be added?
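To make the request concrete, here is a sketch of what a field-aware variant could look like. It is written against plain Java rather than Lucene's Similarity class, and everything in it (PerFieldPayloadScorerSketch, PayloadScorer, register, the extra fieldName parameter) is hypothetical. It only shows how a field name would let two fields interpret payload bytes differently.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: dispatching payload scoring on the field name.
// None of these names are Lucene API; they are assumptions for the sketch.
final class PerFieldPayloadScorerSketch {

    // Same shape as the quoted scorePayload signature, minus the field.
    interface PayloadScorer {
        float score(byte[] payload, int offset, int length);
    }

    private final Map<String, PayloadScorer> scorersByField = new HashMap<>();

    void register(String fieldName, PayloadScorer scorer) {
        scorersByField.put(fieldName, scorer);
    }

    // Hypothetical field-aware variant: fields without a registered
    // scorer get a neutral factor of 1.0.
    float scorePayload(String fieldName, byte[] payload, int offset, int length) {
        PayloadScorer scorer = scorersByField.get(fieldName);
        return scorer == null ? 1.0f : scorer.score(payload, offset, length);
    }
}
```

With something like this, one field could treat its payload as a weight while another ignores it, which a signature without the field name cannot express.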

[jira] Commented: (LUCENE-850) Easily create queries that transform subquery scores arbitrarily

2007-08-30 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523979 ] Mike Klaas commented on LUCENE-850: --- To address the issue above, the following needs to be added

[jira] Updated: (LUCENE-850) Easily create queries that transform subquery scores arbitrarily

2007-08-27 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Klaas updated LUCENE-850: -- Attachment: CustomBoostQuery.java Here's an approach I think will work. Rename CustomScoreQue

[jira] Commented: (LUCENE-982) Create new method optimize(int maxNumSegments) in IndexWriter

2007-08-21 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521590 ] Mike Klaas commented on LUCENE-982: --- One heuristic that has been quite useful for us is to skip optimizing

[jira] Commented: (LUCENE-871) ISOLatin1AccentFilter a bit slow

2007-08-20 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521191 ] Mike Klaas commented on LUCENE-871: --- The switch statement is not equivalent to a list of sequential ifelses--it is

Re: [VOTE] Migrate Lucene to JDK 1.5 for 3.0 release

2007-07-26 Thread Mike Klaas
On 26-Jul-07, at 5:36 PM, Grant Ingersoll wrote: I propose we take the following path for migrating Lucene Java to JDK 1.5: 1. Put in any new deprecations we want, cleanups, etc. 2. Release 2.4 so all of Mike M's goodness is available to 1.4 users within the next 2-4 weeks using our new re

[jira] Commented: (LUCENE-850) Easily create queries that transform subquery scores arbitrarily

2007-07-03 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510001 ] Mike Klaas commented on LUCENE-850: --- Tim: That is typically done by adding an optional implicit phrase query

[jira] Commented: (LUCENE-850) Easily create queries that transform subquery scores arbitrarily

2007-07-03 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509998 ] Mike Klaas commented on LUCENE-850: --- Hi Doron, The main use case is the same as for documents (and to a lesser

[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-04-09 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487613 ] Mike Klaas commented on LUCENE-584: --- Instead of discarding the first run, the approach I usually take is to run 3

Re: [jira] Resolved: (LUCENE-796) Change Visibility of fields[] in MultiFieldQueryParser

2007-04-05 Thread Mike Klaas
On 4/4/07, Otis Gospodnetic (JIRA) <[EMAIL PROTECTED]> wrote: [ https://issues.apache.org/jira/browse/LUCENE-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-796. - Resolution: Fixed Makes s

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Mike Klaas
On 4/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : Thanks! But remember many Lucene apps won't see these speedups since I've : carefully minimized cost of tokenization and cost of document retrieval. I : think for many Lucene apps these are a sizable part of time spent indexing. true, bu

Re: Lucene and Javolution: A good mix ?

2007-04-05 Thread Mike Klaas
On 4/4/07, Jean-Philippe Robichaud <[EMAIL PROTECTED]> wrote: I understand your concerns! I was a little skeptical at the beginning. But even with the 1.5 jvm, the improvements still hold. Lucene creates a lot of "garbage" (strings, tokens, ...) either at index time or query time. While the

[jira] Updated: (LUCENE-850) Easily create queries that transform subquery scores arbitrarily

2007-03-26 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Klaas updated LUCENE-850: -- Attachment: prodscorer.patch.diff Generify the subquery handling logic of DisMax to make it easy to

[jira] Commented: (LUCENE-446) FunctionQuery - score based on field value

2007-03-26 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484195 ] Mike Klaas commented on LUCENE-446: --- I've often wanted to multiply the scores of two queries. I look

[jira] Created: (LUCENE-850) Easily create queries that transform subquery scores arbitrarily

2007-03-26 Thread Mike Klaas (JIRA)
Feature Components: Search Reporter: Mike Klaas Refactor DisMaxQuery into SubQuery(Query|Scorer) that admits easy subclassing. An example is given for multiplicatively combining scores. Note: patch is not clean; for demonstration purposes only.

Re: Resolving term vector even when not stored?

2007-03-16 Thread Mike Klaas
On 3/15/07, karl wettin <[EMAIL PROTECTED]> wrote: I propose a change of the current IndexReader.getTermFreqVector/s- code so that it /always/ returns the vector space model of a document, even when fields are set as Field.TermVector.NO. Is that crazy? Could be really slow, but except for tha

Re: [jira] Field constructor, avoiding String.intern()

2007-02-23 Thread Mike Klaas
On 2/23/07, James Kennedy <[EMAIL PROTECTED]> wrote: In our case, we're trying to optimize document() retrieval and we found that disabling the String interning in the Field constructor improved performance dramatically. I agree that interning should be an option on the constructor. Out of cur

Re: Concurrent merge

2007-02-20 Thread Mike Klaas
On 2/20/07, robert engels <[EMAIL PROTECTED]> wrote: What about a queue of segments to merge. The add document will add segments to the queue, if the queue contains too many segments it blocks. Another thread reads the segments from the queue and merges them. This would effectively block adding
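The proposal quoted above is a bounded producer/consumer queue: adding documents produces segments, a separate thread consumes and merges them, and producers block when the queue is full. A minimal sketch with java.util.concurrent follows. MergeQueueSketch and its methods are hypothetical names, and segments are represented by plain strings rather than any Lucene type.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: a bounded queue of segments awaiting merge.
// MergeQueueSketch is a made-up class; segment names stand in for real segments.
final class MergeQueueSketch {
    private final BlockingQueue<String> pending;

    MergeQueueSketch(int maxPendingSegments) {
        this.pending = new ArrayBlockingQueue<>(maxPendingSegments);
    }

    // Indexing side: blocks while the queue already holds the maximum
    // number of unmerged segments.
    void addSegment(String segmentName) throws InterruptedException {
        pending.put(segmentName);
    }

    // Non-blocking variant: returns false instead of waiting when full.
    boolean tryAddSegment(String segmentName) {
        return pending.offer(segmentName);
    }

    // Merge thread side: blocks until a segment is available.
    String takeSegment() throws InterruptedException {
        return pending.take();
    }

    // Non-blocking variant: returns null when nothing is queued.
    String pollSegment() {
        return pending.poll();
    }
}
```

The capacity is the knob: a small value throttles indexing quickly when merging falls behind, while a larger one lets more unmerged segments accumulate.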

[jira] Updated: (LUCENE-799) Garbage data when reading a compressed, text field, lazily

2007-02-12 Thread Mike Klaas (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Klaas updated LUCENE-799: -- Attachment: CompressedLazyTextPatch.patch test case and fix > Garbage data when reading a compres

[jira] Created: (LUCENE-799) Garbage data when reading a compressed, text field, lazily

2007-02-12 Thread Mike Klaas (JIRA)
Components: Store Affects Versions: 2.0.1, 2.1 Reporter: Mike Klaas Fix For: 2.0.1, 2.1 lazy compressed text fields is a case that was neglected during lazy field implementation. TestCase and patch provided.

Re: [jira] Commented: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values

2007-02-01 Thread Mike Klaas
On 1/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: Mike, Do you have any preference on making FieldInfo public versus moving the FieldSelector stuff into the index package? Not at all. Our use is pretty basic and will be easy to modify to conform to class movement/renaming. -Mike

Re: QueryParser Strips "++" out of my word "c++"

2007-01-26 Thread Mike Klaas
On 1/26/07, Joe Tang <[EMAIL PROTECTED]> wrote: Thanks for your reply Doron. It works partly for me. How should I customize the Analyzer so as to have the functionality of StandardAnalyzer as well as not stripping out some of the characters? Joe, See nutch's version of StandardAnalyzer: it add

Re: [jira] Commented: (LUCENE-762) [PATCH] Efficiently retrieve sizes of field values

2007-01-23 Thread Mike Klaas
On 1/23/07, Grant Ingersoll (JIRA) <[EMAIL PROTECTED]> wrote: [ https://issues.apache.org/jira/browse/LUCENE-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466885 ] Grant Ingersoll commented on LUCENE-762: This

Re: ThreadLocal leak (was Re: Leaking org.apache.lucene.index.* objects)

2006-12-19 Thread Mike Klaas
On 12/19/06, robert engels <[EMAIL PROTECTED]> wrote: I would suggest that in order to even bring up "thread local issues" in the future that the submitter supplies a pure Java NON-LUCENE test case that demonstrates the problem (just as you would if reporting a bug to Sun). All of the "guessing"

Re: potential indexing perormance improvement for compound index - cut IO - have more files though

2006-12-15 Thread Mike Klaas
On 12/14/06, Doron Cohen <[EMAIL PROTECTED]> wrote: But anyhow, this is not a negligible difference, and for real large indexes, and busy systems, when the just written non-compound segment is not in the system caches, it might have more effect. Possibly, search performance during indexing would

Re: Attached proposed modifications to Lucene 2.0 to support Field.Store.Encrypted

2006-12-05 Thread Mike Klaas
On 12/5/06, negrinv <[EMAIL PROTECTED]> wrote: Chris Hostetter wrote: > If the code was not already in the core, and someone asked about adding it > I would argue against doing so on the grounds that some helpful utility > methods (possibly in a contrib) would be just as useful, and would h

Re: Re: Attached proposed modifications to Lucene 2.0 to support Field.Store.Encrypted

2006-12-01 Thread Mike Klaas
On 12/1/06, negrinv <[EMAIL PROTECTED]> wrote: I think we should not make too many assumptions about performance until we can test alternative solutions. <> The small payload overhead will be amply offset in my opinion by the ability to be very selective about what is being encrypted, as opp

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

2006-09-22 Thread Mike Klaas (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436934 ] Mike Klaas commented on LUCENE-675: --- A few notes on benchmarks: First, it is important to realize that no benchmark will ever fully-capture all aspects of

Re: [jira] Created: (LUCENE-671) Hashtable based Document

2006-09-14 Thread Mike Klaas
On 9/14/06, Chris (JIRA) <[EMAIL PROTECTED]> wrote: If nothing else we would be interested in at least being able to extend Document, which is currently declared final. (Anyone know the performance gains on declaring a class final?) According to this, not much: http://www-128.ibm.com/develope

Re: LUCENE-584, was "Combining search steps without re-searching"

2006-08-30 Thread Mike Klaas
On 8/30/06, Paul Elschot <[EMAIL PROTECTED]> wrote: Well, I just posted a single patch file, and I'd like to know whether this patch applies cleanly. The patch itself has 841 lines and affects 11 files, so be careful, perhaps to the point of starting a new working copy. FWIW, I usually check o