Can be quite a bit faster than vInt in some cases:
http://www.ir.uwaterloo.ca/book/addenda-06-index-compression.html
-Mike
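For reference, a minimal sketch of the vInt baseline mentioned above, assuming the usual variable-length layout (seven payload bits per byte, low-order bits first, high bit set when more bytes follow). The class and method names are made up for illustration and are not Lucene API; the alternative encodings discussed at the linked page are not shown here.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: VIntSketch, encode and decode are hypothetical names.
final class VIntSketch {

    // Encode a non-negative int, 7 bits at a time, low-order bits first.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // last byte has the high bit clear
        return out.toByteArray();
    }

    // Decode a single value from the start of the array.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] encoded = encode(300);
        System.out.println(encoded.length + " bytes, decodes to " + decode(encoded));
    }
}
```

The decoder has to test a continuation bit on every byte, which is the per-byte branching that alternative schemes try to avoid.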
On 8-Apr-09, at 11:13 PM, Michael Busch wrote:
I was thinking about doing this as part of LUCENE-1195. However, I
doubt that the net win will be very noticeable here. A common
scenario is that you have an index with one big body field that has
a lot of unique terms, plus several metafield
[
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688449#action_12688449
]
Mike Klaas commented on LUCENE-1561:
I agree that it is going to be almost imposs
On 23-Mar-09, at 2:41 PM, Michael McCandless wrote:
I agree, but at least we need some clear criteria so the future
decision process is more straightforward. Towards that... it seems
like there are good reasons why something should be put into contrib:
* It uses a version of JDK higher than
On 5-Mar-09, at 2:42 PM, Chris Hostetter wrote:
: What I would LOVE is if I could do it in a standard Lucene search like I
: mentioned earlier.
: Hit.doc[0].getHitTokenList() :confused:
: Something like this...
The Query/Scorer APIs don't provide any mechanism for information like
that to b
[
https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669843#action_12669843
]
Mike Klaas commented on LUCENE-1534:
[quote]But if we feel that over-emphasizes t
On 19-Nov-08, at 5:12 AM, Michael McCandless (JIRA) wrote:
How can the VM system possibly make good decisions about what to swap
out? It can't know if a page is being used for terms dict index,
terms dict, norms, stored fields, postings. LRU is not a good policy,
because some pages (terms ind
On 23-Sep-08, at 12:33 PM, Otis Gospodnetic wrote:
Hi,
When people add new issues to JIRA they most often don't set the
"Fix Version" field. Would it not be better to have a default value
for that field, so that new entries don't get forgotten when we
filter by "Fix Version" looking for
Wow, that was a fast resolution to this "issue" :)
-Mike
On 22-Aug-08, at 12:46 AM, F.Y. (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
F.Y. closed LUCENE-1363.
Resoluti
On 24-Jun-08, at 1:28 PM, Yonik Seeley wrote:
Something to consider for Lucene 3 is to have something to retrieve
Similarity per-field rather than passing the field name into some
functions...
+1
I've felt that this was the "proper" (and more useful) way to do
things for a long time
(http
On 23-Jun-08, at 10:14 AM, Jason Rutherglen (JIRA) wrote:
Does anyone know how to turn off Eclipse automatically changing the
import statements? I am not making it reformat but if I edit some
code in a file it sees fit to reformat the imports.
http://www.google.com/search?q=turn%20off%20e
[
https://issues.apache.org/jira/browse/LUCENE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12600973#action_12600973
]
Mike Klaas commented on LUCENE-1293:
It is meant for debugging, though I have f
On 1-May-08, at 10:03 AM, Timo Nentwig wrote:
Hello developers,
I do have enough memory to load the index completely into RAM but can't live
with the fact that it takes multiple minutes to do so.
So I came up with the idea of implementing a RAMDirectory proxy that does the
Directory.copy()
On 26-Feb-08, at 3:00 PM, Michael Busch (JIRA) wrote:
50,000 AND queries with 3 terms each:
old: 152 secs
new (with LRU cache): 112 secs (26% faster)
50,000 OR queries with 3 terms each:
old: 175 secs
new (with LRU cache): 133 secs (24% faster)
For bigger ind
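As a quick check of the quoted figures, here is a small sketch that assumes "faster" means the reduction in elapsed time, (old - new) / old. The class and method names are illustrative only. The same timings expressed as a throughput gain give a larger number, which is worth keeping in mind when comparing benchmark reports.

```java
// Illustrative only: SpeedupSketch and its methods are hypothetical names.
final class SpeedupSketch {

    // Percent reduction in elapsed time: (old - new) / old.
    static long percentLessTime(double oldSecs, double newSecs) {
        return Math.round(100.0 * (oldSecs - newSecs) / oldSecs);
    }

    // Percent increase in throughput for the same workload: old / new - 1.
    static long percentMoreThroughput(double oldSecs, double newSecs) {
        return Math.round(100.0 * (oldSecs / newSecs - 1.0));
    }

    public static void main(String[] args) {
        System.out.println("AND: " + percentLessTime(152, 112) + "% less time, "
                + percentMoreThroughput(152, 112) + "% more throughput");
        System.out.println("OR:  " + percentLessTime(175, 133) + "% less time, "
                + percentMoreThroughput(175, 133) + "% more throughput");
    }
}
```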
[
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570896#action_12570896
]
Mike Klaas commented on LUCENE-794:
---
This may be largely irrelevant, but Solr h
hing new to experienced designers/
developers - I am only offering a reminder. It is my observation
(others will disagree !), but I think a lot of Lucene has some
unneeded esoteric code, where the benefit doesn't match the cost.
On Feb 10, 2008, at 5:48 PM, Mike Klaas wrote:
While I
While I agree in general that excessive optimization at the expense
of code clarity is undesirable, you are overstating the point. 2X is
a ridiculous threshold to apply to something as performance critical
as a full text search engine. If search were twice as slow, Lucene
would be utterly
// To write term vectors
private FieldsWriter fieldsWriter;
is my clue that several files are written at once.
On Feb 7, 2008, at 5:19 PM, Mike Klaas wrote:
On 7-Feb-08, at 2:00 PM, robert engels wrote:
My point is that commit needs to be used in most applications,
and the co
On 7-Feb-08, at 2:00 PM, robert engels wrote:
My point is that commit needs to be used in most applications, and
the commit in Lucene is very slow.
You don't have 2x the IO cost, mainly because only the log file
needs to be sync'd. The index only has to be sync'd eventually, in
order to
[
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565942#action_12565942
]
Mike Klaas commented on LUCENE-1157:
If you just want to exclude them from se
On 10-Dec-07, at 1:20 PM, Shai Erera wrote:
Thanks for the info. Too bad I use Windows ...
Just allocate a bunch of memory and free it. This is Linux, but
something similar should work on Windows:
$ vmstat -S M
procs ---memory--
r b swpd free buff cache
0 0 0
On 10-Dec-07, at 12:11 PM, Shai Erera wrote:
Actually, queries on large indexes are not necessarily I/O bound. It depends
on how much of the posting list is being read into memory at once. I'm not
that familiar with the innermost workings of Lucene, but let's assume a posting
element takes 4 bytes
On 10-Dec-07, at 11:31 AM, Shai Erera wrote:
As you can see, the actual allocation time is really negligible and there
isn't much difference in the avg. running times of the queries. However, the
*current* runs performed a lot worse at the beginning, before the OS cache
warmed up.
This s
On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:
+1 I have been thinking about this too. Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.
Would be nice if t
There is a good chance that they were using stock indexing defaults,
based on:
Lucene:
" In the present work, the simple applications
bundled with the library were used to index the collection. "
On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:
Yeah, I wasn't too excited over it and I certain
[
https://issues.apache.org/jira/browse/LUCENE-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544630
]
Mike Klaas commented on LUCENE-693:
---
Yonik: this is great! I applied and tested the patch and everything looks
On 17-Nov-07, at 5:49 PM, Yonik Seeley wrote:
So I think we should change + finalize the payload API before Lucene
2.3 comes out.
Single biggest drawback about current payloads is that there isn't any
explicit support for adding different types of payloads to the same
token.
I don't really see
On 15-Nov-07, at 5:33 AM, Grant Ingersoll wrote:
Would people be interested in asking infrastructure to see if we
can get our hands on things like JIRA search logs and any other
search/query logs available? I'm thinking if we had this, plus the
underlying data, we could start to use this i
[
https://issues.apache.org/jira/browse/LUCENE-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540913
]
Mike Klaas commented on LUCENE-693:
---
Paul wrote:
> As just discussed on java-dev, the creation of an object dur
[
https://issues.apache.org/jira/browse/LUCENE-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538126
]
Mike Klaas commented on LUCENE-1035:
> Query set with average 590K results, retrieving docids for the first
On 10-Sep-07, at 3:00 PM, Grant Ingersoll wrote:
What I truly pine for is a way to globally override Similarity on
a per-field basis. Wishful thinking...
Instead of wishful thinking, let's figure out a patch... :-)
Someday, I will find the time to delve more deeply into lucene wishful
This is the current api for scorePayload:
public float scorePayload(byte [] payload, int offset, int length) {
ISTM that this function depends greatly on the field--what if the end
user wants to store two completely different kinds of values in
different fields? Could fieldName be added?
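To make the question concrete, here is a hedged sketch of what a field-aware variant could look like. The four-argument overload, the class name, the "boost" field name, and the byte-as-tenths interpretation are all assumptions for illustration; only the three-argument shape comes from the API quoted above.

```java
// Illustrative only: not Lucene code. The field-aware overload is hypothetical.
final class FieldAwarePayloadSketch {

    // Same shape as the quoted API: nothing tells us which field the payload came from.
    static float scorePayload(byte[] payload, int offset, int length) {
        return 1.0f;
    }

    // Hypothetical variant with fieldName, so each field can decode its own payload format.
    static float scorePayload(String fieldName, byte[] payload, int offset, int length) {
        if (length == 0) {
            return 1.0f; // no payload: neutral factor
        }
        if ("boost".equals(fieldName)) {
            // Assumed format for this made-up field: one unsigned byte, in tenths.
            return (payload[offset] & 0xFF) / 10.0f;
        }
        return scorePayload(payload, offset, length); // other fields: neutral default
    }

    public static void main(String[] args) {
        byte[] payload = {25};
        System.out.println(scorePayload("boost", payload, 0, 1));
        System.out.println(scorePayload("body", payload, 0, 1));
    }
}
```

With the field name available, two fields can store unrelated payload encodings without the scoring code having to guess which one it is looking at.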
[
https://issues.apache.org/jira/browse/LUCENE-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523979
]
Mike Klaas commented on LUCENE-850:
---
To address the issue above, the following needs to be added
[
https://issues.apache.org/jira/browse/LUCENE-850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mike Klaas updated LUCENE-850:
--
Attachment: CustomBoostQuery.java
Here's an approach I think will work.
Rename CustomScoreQue
[
https://issues.apache.org/jira/browse/LUCENE-982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521590
]
Mike Klaas commented on LUCENE-982:
---
One heuristic that has been quite useful for us is to skip optimizing
[
https://issues.apache.org/jira/browse/LUCENE-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521191
]
Mike Klaas commented on LUCENE-871:
---
The switch statement is not equivalent to a list of sequential if-elses--it is
On 26-Jul-07, at 5:36 PM, Grant Ingersoll wrote:
I propose we take the following path for migrating Lucene Java to
JDK 1.5:
1. Put in any new deprecations we want, cleanups, etc.
2. Release 2.4 so all of Mike M's goodness is available to 1.4
users within the next 2-4 weeks using our new re
[
https://issues.apache.org/jira/browse/LUCENE-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510001
]
Mike Klaas commented on LUCENE-850:
---
Tim: That is typically done by adding an optional implicit phrase query
[
https://issues.apache.org/jira/browse/LUCENE-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509998
]
Mike Klaas commented on LUCENE-850:
---
Hi Doron,
The main use case is the same as for documents (and to a lesser
[
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487613
]
Mike Klaas commented on LUCENE-584:
---
Instead of discarding the first run, the approach I usually take is to run 3
On 4/4/07, Otis Gospodnetic (JIRA) <[EMAIL PROTECTED]> wrote:
[
https://issues.apache.org/jira/browse/LUCENE-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic resolved LUCENE-796.
-
Resolution: Fixed
Makes s
On 4/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: Thanks! But remember many Lucene apps won't see these speedups since I've
: carefully minimized cost of tokenization and cost of document retrieval. I
: think for many Lucene apps these are a sizable part of time spent indexing.
true, bu
On 4/4/07, Jean-Philippe Robichaud <[EMAIL PROTECTED]> wrote:
I understand your concerns!
I was a little skeptical at the beginning. But even with the 1.5 jvm,
the improvements still hold.
Lucene creates a lot of "garbage" (strings, tokens, ...) either at
index time or query time. While the
[
https://issues.apache.org/jira/browse/LUCENE-850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mike Klaas updated LUCENE-850:
--
Attachment: prodscorer.patch.diff
Generify the subquery handling logic of DisMax to make it easy to
[
https://issues.apache.org/jira/browse/LUCENE-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484195
]
Mike Klaas commented on LUCENE-446:
---
I've often wanted to multiply the scores of two queries. I look
Feature
Components: Search
Reporter: Mike Klaas
Refactor DisMaxQuery into SubQuery(Query|Scorer) that admits easy subclassing.
An example is given for multiplicatively combining scores.
Note: patch is not clean; for demonstration purposes only.
--
This message is
On 3/15/07, karl wettin <[EMAIL PROTECTED]> wrote:
I propose a change of the current IndexReader.getTermFreqVector(s)
code so that it /always/ returns the vector space model of a document,
even when fields are set as Field.TermVector.NO.
Is that crazy? Could be really slow, but except for tha
On 2/23/07, James Kennedy <[EMAIL PROTECTED]> wrote:
In our case, we're trying to optimize document() retrieval and we found that
disabling the String interning in the Field constructor improved performance
dramatically. I agree that interning should be an option on the constructor.
Out of cur
On 2/20/07, robert engels <[EMAIL PROTECTED]> wrote:
What about a queue of segments to merge. The add document will add
segments to the queue, if the queue contains too many segments it
blocks.
Another thread reads the segments from the queue and merges them.
This would effectively block adding
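The proposal quoted above maps fairly directly onto a bounded blocking queue. The following is a rough sketch under stated assumptions: segments are represented by plain strings, the "merge" is a stand-in that only records what it consumed, and every name here is hypothetical rather than anything in IndexWriter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: MergeQueueSketch is a hypothetical class, not Lucene code.
final class MergeQueueSketch {
    private static final String DONE = "<done>"; // sentinel that stops the merger thread

    private final BlockingQueue<String> pending;
    private final List<String> merged = new ArrayList<>();
    private final Thread merger;

    MergeQueueSketch(int maxPendingSegments) {
        pending = new ArrayBlockingQueue<>(maxPendingSegments);
        merger = new Thread(() -> {
            try {
                for (String seg = pending.take(); !seg.equals(DONE); seg = pending.take()) {
                    synchronized (merged) {
                        merged.add(seg); // stand-in for the real merge work
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        merger.start();
    }

    // Called by the indexing thread; blocks while the queue is full.
    void addSegment(String name) {
        try {
            pending.put(name);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queueing " + name, e);
        }
    }

    // Signals the merger to stop, waits for it, and returns what it processed.
    List<String> close() {
        try {
            pending.put(DONE);
            merger.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while closing", e);
        }
        synchronized (merged) {
            return new ArrayList<>(merged);
        }
    }

    public static void main(String[] args) {
        MergeQueueSketch queue = new MergeQueueSketch(2);
        for (int i = 0; i < 5; i++) {
            queue.addSegment("seg" + i);
        }
        System.out.println(queue.close());
    }
}
```

The queue capacity is the knob that trades indexing latency against the number of unmerged segments: a producer that outruns the merger simply waits in addSegment.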
[
https://issues.apache.org/jira/browse/LUCENE-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mike Klaas updated LUCENE-799:
--
Attachment: CompressedLazyTextPatch.patch
test case and fix
> Garbage data when reading a compres
Components: Store
Affects Versions: 2.0.1, 2.1
Reporter: Mike Klaas
Fix For: 2.0.1, 2.1
Lazy compressed text fields are a case that was neglected during the lazy field
implementation. Test case and patch provided.
On 1/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
Mike,
Do you have any preference on making FieldInfo public versus moving
the FieldSelector stuff into the index package?
Not at all. Our use is pretty basic and will be easy to modify to
conform to class movement/renaming.
-Mike
---
On 1/26/07, Joe Tang <[EMAIL PROTECTED]> wrote:
Thanks for your reply, Doron. It partly works for me.
How should I customize the Analyzer so as to have the functionality of
StandardAnalyzer as well as not stripping out some of the characters?
Joe,
See nutch's version of StandardAnalyzer: it add
On 1/23/07, Grant Ingersoll (JIRA) <[EMAIL PROTECTED]> wrote:
[
https://issues.apache.org/jira/browse/LUCENE-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466885
]
Grant Ingersoll commented on LUCENE-762:
This
On 12/19/06, robert engels <[EMAIL PROTECTED]> wrote:
I would suggest that in order to even bring up "thread local issues"
in the future that the submitter supplies a pure Java NON-LUCENE test
case that demonstrates the problem (just as you would if reporting a
bug to Sun).
All of the "guessing"
On 12/14/06, Doron Cohen <[EMAIL PROTECTED]> wrote:
But anyhow, this is not a negligible difference, and for real large
indexes, and busy systems, when the just written non-compound segment is
not in the system caches, it might have more effect. Possibly, search
performance during indexing would
On 12/5/06, negrinv <[EMAIL PROTECTED]> wrote:
Chris Hostetter wrote:
> If the code was not already in the core, and someone asked about adding it
> I would argue against doing so on the grounds that some helpful utility
> methods (possibly in a contrib) would be just as useful, and would h
On 12/1/06, negrinv <[EMAIL PROTECTED]> wrote:
I think we should not make too many assumptions about performance until we
can test alternative solutions.
<>
The small payload overhead will be amply offset in my opinion by the ability
to be very selective about what is being encrypted, as opp
[
http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436934 ]
Mike Klaas commented on LUCENE-675:
---
A few notes on benchmarks:
First, it is important to realize that no benchmark will ever fully capture all
aspects of
On 9/14/06, Chris (JIRA) <[EMAIL PROTECTED]> wrote:
If nothing else we would be interested in at least being able to extend
Document, which is currently declared final. (Anyone know the performance gains
on declaring a class final?)
According to this, not much:
http://www-128.ibm.com/develope
On 8/30/06, Paul Elschot <[EMAIL PROTECTED]> wrote:
Well, I just posted a single patch file, and I'd like to know whether this
patch applies cleanly. The patch itself has 841 lines and affects 11 files,
so be careful, perhaps to the point of starting a new working copy.
FWIW, I usually check o