I think this could be a generally useful feature?
+1. I could definitely use a "commitUserData" option for the same reasons.
Thinking more on this, we may not need to modify the index format at all for
this use-case. This is easily achieved in the current system by adding a
dummy document.
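For instance, a hedged sketch of the dummy-document idea (the field names here are invented, and updateDocument replaces any previous copy of the dummy doc):

Document meta = new Document();
meta.add(new Field("__metaKey", "indexMetadata",
        Field.Store.YES, Field.Index.UN_TOKENIZED));
meta.add(new Field("__metaValue", "lastDbSync=2008-05-01",
        Field.Store.YES, Field.Index.NO));
// deletes any previous doc with this key, then adds the new one
writer.updateDocument(new Term("__metaKey", "indexMetadata"), meta);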
Pros of keeping content only in the database:
* Need only one stored copy of the data (saves disk space)
Pros of storing a copy of the content in Lucene:
* A match is more easily explained. If you collapse multiple DB fields
into a single searchable field, e.g. customer first name and surname
database fields
Related:
https://issues.apache.org/jira/browse/LUCENE-725
>>this can't be nearly as fast as OpenBitSet.intersect() or union,
respectively, can it?
I had a similar concern but it doesn't seem that bad:
https://issues.apache.org/jira/browse/LUCENE-1187?focusedCommentId=12596546#action_12596546
The above test showed a slight improvement using a bitset.
If you index the queries, consider also that they can potentially be
indexed in an optimised form.
For example, take a phrase query for "Alonso Smith". You need only index
one of these terms - an incoming document must contain both terms to be
considered a match. If you chose to index this query
Can you describe this in a little more detail; I'm not exactly sure what you
mean.
Break your large text documents into multiple Lucene documents. Rather
than dividing them up into entirely discrete chunks of text, consider
storing/indexing *overlapping* sections of text with an overlap, as
sketched below.
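A minimal sketch of the chunking loop (window and overlap sizes are arbitrary choices here, and text, docId and writer are assumed to be in scope):

int window = 10000;  // characters per Lucene document
int overlap = 2000;  // characters shared with the previous chunk
for (int start = 0; start < text.length(); start += (window - overlap)) {
    int end = Math.min(start + window, text.length());
    Document d = new Document();
    d.add(new Field("docId", docId, Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("contents", text.substring(start, end),
            Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(d);
    if (end == text.length()) break;
}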
>>(a "write once" schema)
I like this idea. Enforcing consistent field-typing on instances of
fields with the same name does not seem like an unreasonable restriction
- especially given the upsides to this.
It doesn't dispense with all the full schema logic in Solr but seems
like a useful ba
If you can supply a JUnit test that recreates the problem, I think we can
start to make progress on this.
Amin Mohammed-Coleman wrote:
Hi
Apologies for re-sending this mail. Just wondering if anyone has
experienced the below. I'm not sure if this could happen due to the
nature of the document. It does
Hi Niels,
See the javadocs for IndexWriter.setRAMBufferSizeMB()
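For example, something like this (a sketch against the 2.3-era API, assuming an open Directory dir):

IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
// flush buffered docs to disk once they use roughly 48 MB, rather than
// accumulating them until the JVM heap is exhausted
writer.setRAMBufferSizeMB(48.0);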
Cheers
Mark
Niels Ott wrote:
Hi Lucene professionals!
This may sound like a dumb beginner's question, but anyway: can
Lucene run out of memory during indexing?
Should I use IndexWriter.flush() or .commit(), and if so, how often?
Scott Tiger wrote:
I want to get a BooleanFilter containing two RangeFilters from a query string.
The XMLQueryParser may be of interest.
See the BooleanFilter.xml and CachedFilter.xml examples in the
XMLQueryParser JUnit tests.
I typically use QueryTemplateManager to transform user input provided in
a
lucene user wrote:
Thanks for all your help!
We are using Lucene 2.1.0 and TermsFilter seems to be new in Lucene 2.2.0.
I have not been able to find SortedVIntList in the javadocs at all.
No, SortedVIntList is in the patch I provided a link to earlier.
Because both SortedVIntList and a r
Anyone care to suggest an approach to making this faster?
See TokenSources.java
Cheers
Mark
Cool Coder wrote:
>> Is there anyway I can specify which terms are "MUST", I mean they
have to appear in the result and some terms are optional,
One "hands off" approach you could try with this is to rewrite the
fuzzyQuery and then set the minimum number of terms you want a match on.
e.g.
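A sketch of the idea (assuming an open IndexReader and IndexSearcher; illustrative, not tested code):

FuzzyQuery fuzzy = new FuzzyQuery(new Term("contents", "smith"));
Query rewritten = fuzzy.rewrite(reader); // expands to a BooleanQuery of matching terms
if (rewritten instanceof BooleanQuery) {
    BooleanQuery bq = (BooleanQuery) rewritten;
    bq.setMinimumNumberShouldMatch(2); // require at least 2 of the expanded terms
    Hits hits = searcher.search(bq);
}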
The added IO is one factor. Another is the CPU load from doing many
edit-distance comparisons between index terms and the provided search
term. You can limit the number of edit distance comparisons conducted by
setting the minimum prefix length. This is a property of the QueryParser
if parsing
can use Soundex and then, if you're lucky, files==philes, but there's no
room for error and they either match or they don't - there is no measure
of similarity.
There's no free lunch here.
Timo Nentwig wrote:
On Saturday 24 November 2007 18:28:48 markharw00d wrote:
term. You can l
lucene user wrote:
Am I being clear?
Now you are.
I don't know what you mean by "PERSON_ANNOTATION works for Google".
I suppose I meant annotations in the sense GATE and UIMA refer to
annotations. They are like a highlighter pen marking a particular
section of a document and adding me
I need to highlight an entire document as it is displayed
See NullFragmenter
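A sketch of it in use (assuming the same analyzer as at index time):

Highlighter highlighter = new Highlighter(new QueryScorer(query));
highlighter.setTextFragmenter(new NullFragmenter()); // one fragment = the whole doc
highlighter.setMaxDocBytesToAnalyze(Integer.MAX_VALUE);
TokenStream tokens = analyzer.tokenStream("contents", new StringReader(text));
String highlighted = highlighter.getBestFragment(tokens, text);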
BooleanFilter in contrib is similar to ChainedFilter but just expresses
the boolean logic using the same vocabulary as BooleanQuery ("should"s,
"must"s and "not"s).
Cheers
Mark
Erick Erickson wrote:
I think you can just throw them all together in a
ChainedFilter and use the ChainedFilter wher
Let's say for the query algorithm, the word algorith is also a match;
how does the highlighter know that it should also highlight
occurrences of the word algorith? (I am not sure it does this anyway.)
The highlighter knows to highlight stemmed words because both the query
terms and the docume
There is a trick to indexing queries in this way... you need only index
the rarest term in queries which have one or more mandatory terms.
As an example - for the phrase query "XYZ Group limited" you need only
index the rarest term "XYZ" and thus avoid selecting the query for
execution with
See BooleanQuery.setMinimumNumberShouldMatch.
Add the addresses as "SHOULD" termQuery clauses and set
minimumNumberShouldMatch to the required value.
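For example:

BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("address", "london")), BooleanClause.Occur.SHOULD);
bq.add(new TermQuery(new Term("address", "paris")), BooleanClause.Occur.SHOULD);
bq.add(new TermQuery(new Term("address", "madrid")), BooleanClause.Occur.SHOULD);
bq.setMinimumNumberShouldMatch(2); // at least two of the three must match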
Cheers
Mark
- Original Message
From: Michael Prichard <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, January 21, 200
I also read something about web-based Luke, but can't find it in the
contrib in 2.3. Is it part of Lucene 2.3? How do I use it?
See here:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg13287.html
I think we decided to hold off until after the Lucene 2.3 release before
adding it to contrib.
Further to Grant's useful background - there is an analyzer specifically
for multi-word terms in "contrib".
See Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle
Cheers
Mark
Hi Ghinwa,
A Term is simply a unit of tokenization that has been indexed for a
Field, produced by a
Where can I find information regarding embedding Lucene with a database?
Thanks
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
http://issues.apache.org/jira/browse/LUCENE-434
Cheers
Mark
See https://issues.apache.org/jira/browse/LUCENE-794
Spencer Tickner wrote:
Hi List,
Thanks in advance for any help. I'm working with the contrib
highlighting class and am having issues when doing searches with a
phrase. I've been able to duplicate this behaviour in the
HighlighterTest class.
lucene-seme1 s wrote:
Can you please share the custom Analyzer you have?
Unfortunately it's not mine to share but see the Lucene Token and
Analyzer classes - it's not particularly hard to code.
Here you go:
Analyzer a = new StandardAnalyzer();
// open an index
String textFieldName = "contents";
IndexReader reader = IndexReader.open("E:/indexes/uksites");
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser qp = new QueryParser(textFieldName, a);
// (assumed continuation - the original snippet was truncated here)
Query query = qp.parse("your query");
Hits hits = searcher.search(query);
This looks like it is related to an issue I first raised here:
http://markmail.org/message/37ywsemfudpos6uh
At the time I identified 2 issues with FuzzyQuery - that the usual
"coord" and "idf" scoring factors shouldn't be applied to fuzzy queries.
The coord factor got fixed but idf remains a
Sebastin wrote:
Hi All,
Is there any possibility to avoid duplicate records in lucene 2.3.1?
At index-time or query time?
See DuplicateFilter in contrib/queries for a query-time filter
Cheers
Mark
>>could you define duplicate?
That's your choice of field that you want to de-dup on.
That could be a field such as "DatabasePrimaryKey" or perhaps a field
containing an MD5 hash of document content.
The DuplicateFilter ensures only one document can exist in results for
each unique value for that field.
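In code (field name illustrative):

Filter dedup = new DuplicateFilter("DatabasePrimaryKey");
Hits hits = searcher.search(query, dedup); // at most one hit per unique key value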
Nope, not seen that one.
Looks like the reference to no such field is in the Java instance data
sense, not the Lucene document sense.
Class versioning issues somewhere?
That method takes a parameter called "prohibited" which is the name of
the field reported in the error. Is the word "prohibite
I've not quite gone up to this scale yet, but here are some points for
consideration based on a smaller-scale system I have in production that
may be of interest:
By clustering I presume you are only talking about replication.
When we talk about scaling and using multiple machines we need to think
This is a deficiency in the highlighter functionality that has been
discussed several times before. The summary is - not a trivial fix.
See here for background:
http://marc2.theaimsgroup.com/?l=lucene-user&m=114631181214303&w=1
http://www.gossamer-threads.com/lists/engine?do=post_view_printa
people seem to just want to highlight the source text.
Any words of wisdom would be sorely appreciated.
- Mark
markharw00d wrote:
This is a deficiency in the highlighter functionality that has been
discussed several times before. The summary is - not a trivial fix.
See here for back
>>For what it's worth Mark (Miller), there *is* a need for "just
highlight the query terms without trying to get excerpts" functionality
>>- something a la Google cache (different colours...mmm, nice).
FWIW, the existing highlighter doesn't *have* to fragment - just pass a
NullFragmenter to the highlighter (via setTextFragmenter).
Not sure I fully understand the problem. The query is effectively
"allContent:someTitleText" and you want to highlight the string
"someTitleText" in the title field?
If you pass null as a fieldname to the QueryTermExtractor it will use
all term values, regardless of field, as strings to highlight.
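For example:

// a null field name gathers terms from every field in the query
WeightedTerm[] terms = QueryTermExtractor.getTerms(query, false, null);
for (int i = 0; i < terms.length; i++) {
    System.out.println(terms[i].getTerm());
}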
Be careful with your use of GATE and multiple threads.
I recently had some trouble with their Factory.delete.. methods which
ended up requiring a change to the core and this was applied to the 4.0
trunk. A 3.1 patch has not been released so you'll need to be using the
latest from SVN (now requi
>>most of the body text is the same, but I want to group them all under
one result.
I created this analyzer class to identify content that was "mostly
similar" but not necessarily identical.
http://issues.apache.org/jira/browse/LUCENE-725
If you feed a small set of documents through it (say y
Chris Hostetter wrote:
this is something anyone using the Lucene API can do as long as they use a
HitCollector ... the Nutch impl seems to actually spin up a separate thread
I'm keen to understand the pros and cons of these two approaches.
With the HitCollector approach is this just engineer
On app startup:
1) parse all Queries and place in an array.
2) Create a RAMIndex containing a doc for each query with content
consisting of the query's terms (see Query.extractTerms). For optimal
performance only index the rarest term for queries with multiple
mandatory criteria, e.g. phrase queries - see the sketch below.
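A sketch of steps 1 and 2 (names illustrative; note some Query types must be rewritten before extractTerms will work):

RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
for (int i = 0; i < queries.length; i++) {
    Set terms = new HashSet();
    queries[i].extractTerms(terms);
    Document doc = new Document();
    doc.add(new Field("queryId", String.valueOf(i),
            Field.Store.YES, Field.Index.UN_TOKENIZED));
    for (Iterator it = terms.iterator(); it.hasNext();) {
        Term t = (Term) it.next();
        // for queries with multiple mandatory terms, index only the rarest one
        doc.add(new Field("terms", t.text(), Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
    writer.addDocument(doc);
}
writer.close();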
rectly.
Thanks
Mélanie
-Original Message-
From: markharw00d [mailto:[EMAIL PROTECTED]
Sent: Monday, March 26, 2007 12:36 AM
To: java-user@lucene.apache.org
Subject: Re: Reverse search
On app startup:
1) parse all Queries and place in an array.
2) Create a RAMIndex containing a doc fo
See the JUnit test example for field-sensitive highlighting.
If you pass a fieldname to the QueryScorer constructor it only considers
query terms for that field - the default without one is all fields.
Cheers
Mark
The shortlisting isn't based on stop words - a score is produced to
prioritise term selections. The score uses the IDF (inverse document
frequency) of the original term and mixes in the "edit-distance" for
each of the fuzzy variations of original terms. Care is taken to ensure
that in the query
bhecht wrote:
Thanks Mark,
I have updated my previous post I guess, before you had a chance to read it.
Did you edit your post on Nabble? That edit didn't come through as a
message to java-user so I didn't see it.
You shouldn't need to call rewrite on your FuzzyLikeThisQuery unless you
wan
Here's a way to do it using the XML query parser in contrib
1) Create this query.xsl file (note use of cached double negative filter)
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
upperTerm="z"/>
- Is it feasible to do
it on a single machine with 1 GB of physical memory and a 1.3 GHz
processor? Can Lucene handle it
efficiently?
Yes, I've indexed all of English Wikipedia using 1 GB RAM (with a 3 GHz
processor)
- Secondly, I wanted to know: when doing a search, does Lucene load the
whole
Hi Mark,
Good summary. I was running some timings earlier and my results echo
your findings.
>>I am currently trying to think of some possible hybrid approach to
highlighting...
I was thinking along the lines of wrapping some core classes such as
IndexReader to somehow observe the query matching
It looks like we may have different cases.
I was hoping to answer the original question which was how to retrieve
pages of matching documents from a Lucene index (no database mentioned).
>>So far worked just fine. I have 5000 rows of items and I think will
still work fine later when I'd have
>>the case matters only for those words that should be included.
Jong, just want to check we're on the same page - you do know
MoreLikeThis has a kind of automatic stop-wording built in, yes?
MoreLikeThis looks at the document frequency of all terms in the "this"
text you provide and only selects
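The knobs controlling that selection look roughly like this (a sketch from memory of the contrib MoreLikeThis API):

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "contents" });
mlt.setMinDocFreq(5);      // ignore terms appearing in fewer than 5 docs
mlt.setMaxQueryTerms(25);  // keep only the best-scoring ~25 terms
Query q = mlt.like(new StringReader(docText));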
>>So I'm afraid I can't use the technique you recommend.
Ah, right - so the TermVector you use from the index will return mixed-
and lower-case versions of the same text.
One point to note - this would mean that of the 25 or so top terms
selected by MoreLikeThis for querying there is a reasonable
On a related topic: yesterday I posted a round-up of all the possible
filtering options in Lucene, with timings and example code, to the wiki:
http://wiki.apache.org/jakarta-lucene/FilteringOptions
One of the options demonstrated is along the lines of Chris's suggestion.
Cheers,
Mark
Sorry to contradict, Erik, but the Highlighter's QueryScorer will make
use of IDF, given a reader, in order to better prioritise which are the
"best" bits of a document.
However, in the particular example given, the criteria include several
non-text fields which are not useful for IDF and gener
Hi Vince, sounds like the same issue I highlighted recently on the
java-dev list.
See here:
http://www.nabble.com/Preventing-%22killer%22-queries-t1077895.html
The problem lies in the underlying cost of reading TermDocs for very
common terms (a problem for both queries and filters)
For your
Hi Daniel/Chris,
Unfortunately, the contrib/highlighter code in source control fails to
meet our needs in two ways:
1. We don't just want fragments, we want *all* of the text, with
highlights in the appropriate places (although we do offer a means
to display just the fragments as well).
[EMAIL PROTECTED] wrote:
Anyone have a doc or something that would allow me to explain this to execs?
Roughly speaking:
* Documents containing *all* the search terms are good
* Matches on rare words are better than for common words
* Long documents are not as good as short ones
* Documents wh
FilteredQuery has the side effect of passing zero-scoring docs to the
HitCollector.
This does break the contract for HitCollector.collect method because the
JavaDocs state:
"Called once for every non-zero scoring document, with the document
number and its score."
The quick fix is to simply add a test on the score in your collect() method.
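Something like this (a sketch against the old HitCollector API):

searcher.search(filteredQuery, new HitCollector() {
    public void collect(int doc, float score) {
        if (score > 0.0f) {
            // handle the genuinely matching doc
        }
    }
});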
See RangeFilter.bits() for some example code that creates a filter
from terms.
Also see TermsFilter in the "queries" module in the contrib section.
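TermsFilter usage is straightforward (field and values illustrative):

TermsFilter filter = new TermsFilter();
filter.addTerm(new Term("status", "active"));
filter.addTerm(new Term("status", "pending"));
Hits hits = searcher.search(query, filter);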
Hi Marios.
>>Isn't this wrong?
Yes, but this is an itch that no one has been sufficiently bothered
by to fix yet.
I still haven't had the time or a desperate need to implement this so it
will probably remain that way until someone feels strongly enough about
the problem to fix it. Highligh
I tried the cityName:city~0.8, and it is still not fast enough..
something around 2 seconds... to return only 2 results...
OK, so we trimmed down the search terms we actually used in the query but I suspect what you are
seeing is the effect of having to perform edit-distance comparisons on ALL
See QueryParser.setFuzzyPrefixLength()
This will apply to all fields parsed by the parser and is probably
generally advisable anyway to avoid server CPU overload.
Many production apps disable fuzzy searching completely in the search
syntax for this reason.
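For example:

QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
// only index terms sharing the first 3 characters get edit-distance tested
qp.setFuzzyPrefixLength(3);
Query q = qp.parse("name:alonso~");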
Erik Hatcher wrote:
wouldn't this work? +interested -"not interested"
Hi Erik.
Yes, sorry, my brain is disengaged with all the heat here - my example
wasn't great and my scenario may be more complex than I originally
outlined. I may have 20 different ways of saying "interested" and want
to q
Maybe this:
SpanNotQuery(interested, SpanNearQuery(not,interested))
with a SpanTermQuery for each term?
Thanks, Paul. This is working well for me and I can happily use multiple
SpanTermQueries embedded in a SpanOrQuery in place of each of the single
words in your example.
SpanNotQuery
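Put together, the combination looks something like this (a sketch; terms illustrative):

SpanQuery interested = new SpanOrQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("contents", "interested")),
    new SpanTermQuery(new Term("contents", "keen"))
});
SpanQuery not = new SpanTermQuery(new Term("contents", "not"));
// "not" immediately followed by any "interested" variant
SpanQuery notInterested = new SpanNearQuery(
        new SpanQuery[] { not, interested }, 0, true);
// keep "interested" spans except those inside a "not interested" span
Query q = new SpanNotQuery(interested, notInterested);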
Pass a field name to the QueryScorer constructor.
See "testFieldSpecificHighlighting" method in the Junit test for the
highlighter for an example.
Cheers
Mark
zhu jiang wrote:
Hi all,
For example, if I have a document with two fields text and num like
this:
text:foo bar
>>- get the top 1000 results WITHOUT executing query across whole data set
(Apologies if this is telling you something you are already fully aware of.)
- Counting matches doesn't involve scanning the text of all the docs so
may be less expensive than you think for a single index. It very quickly
l
Did you mean this?
http://marc.theaimsgroup.com/?l=lucene-user&m=108525376821114&w=2
Kelvin Tan wrote:
This is a bump post...
I'm wondering if there's any code (contributed, bugzilla, core or otherwise)
that provides document lazy-loading functionality, i.e. only eager-initialize
specific fields
So this is just the old problem of avoiding reading large, less
frequently accessed fields when you are trying to read just the smaller
more frequently accessed fields eg titles.
You can achieve this by:
a) Modifying Lucene using something like the code I originally posted
which stops reading
The short answer is "no", there is not support for this currently.
Implementing this support is possible but fiddly - there is a related
discussion here which outlines some of the challenges:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12435.html
Cheers,
Mark
>>Stemming doesn't have to produce intelligible words
True, yes this should be fine for general search requirements.
However, the code presented does make some attempt to produce
intelligible words eg parties=party unlike Porter stemmer's parties=parti
Does this make it a "lemmatizer"?
This is a f
All looks OK with that bit.
At the risk of sounding obvious - are you mistaking the results from
multiple documents for the highlighted content of just one document?
eg the end of your "for" loop looks like this:
System.out.print(result);
}
and you assume the printed display is from just
Thanks for pointing out this issue.
The bug was related to having a doc bigger than the maxDocBytesToAnalyze
setting. In this situation, the last fragment created was always sized
from the maxDocBytesToAnalyze position to the remainder of the doc (in your
case, quite large!)
I have fixed this in SVN
Fred Toth wrote:
Hi,
We have a need to present HTML documents with all search
terms highlighted. Everything I've seen regarding the Highlighter
code seems to point to the typical case of extracting relevant
fragments from the text for presentation of hit lists.
If you don't want to fragment yo
I suspect the most performant is as follows (but could require bags of
RAM). Here's the pseudo code:
[on IndexReader open, initialize map]
int[] luceneDocIdsByDbKey = new int[largestDbKey]; // could be a large array!
for (int i = 0; i < reader.maxDoc(); i++) {
    int dbKey = ...; // read the stored db key field from doc i
    luceneDocIdsByDbKey[dbKey] = i;
}
Lookups are then a single array access. Should be super-quick but
requires (int size * num db records) of memory.
Hi Erik,
Yes I was thinking that code could form the basis of a new highlighter.
I've just attached a QuerySpansExtractor to the bugzilla entry for the
new highlighter. This class produces Spans from queries other than
SpanXxxxQueries eg phrase, term and booleans.
I'm thinking you can throw the
about 4900 room units which I think is OK as far as
Still we have optimization work to do.
Assuming your availability is a year in advance and yours is a reputable chain
of hotels that books rooms by the day (not the hour!), you only need
4900 * 365 bits of true/false info to cache all the availability.
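For the record, 4900 * 365 = 1,788,500 bits, i.e. roughly 220 KB - easily small enough to keep the whole year's availability in memory.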
Would that show up in the TermVectors?
Yes, but you would need a scheme for identifying "original, unstemmed" terms vs
stems. For example, you could use another field and analyzer for the unstemmed forms.
Andrew Boyd wrote:
What about storing the unstemmed word with the same position as the
I posted the code I use to do this (based on a single index) here:
http://marc.theaimsgroup.com/?l=lucene-dev&m=111044178212335&w=2
Cheers
Mark
Thanks for the reminder, Otis.
I haven't done any more on this since this post:
http://archives.devshed.com/a/ml/200501-114586/lucene-query-sql-kind
The scalability concerns with the user-defined-functions I created
prevented me from taking it any further. A proper solution would need a
tight
MySQL has spatial extensions now too.
Your queries lack any free-text criteria so are probably best handled by
a database, not Lucene.
>>In case anyone's interested, I'm writing a zoomable/pannable world map
Save yourself some time. Just use the Google maps API. :-)
The "did you mean" implementation should ideally use all of the other
words in a query as context to guide the selection of spelling
alternatives. Google appear to do this - not sure if they use the doc
content or user queries to suggest the alternatives.
I've got some collocation-finding code wh
I don't know the behaviour of the Japanese Analyzer you are using.
Can you add to your example diagnosis the Token.getPositionIncrement,
Token.startOffset and Token.endOffset for each of the tokens?
The highlighter groups tokens with overlapping start and end offsets
into a single TokenGroup f
>>I believe I have heard that Span queries provide some way to access
document offset information for their hits somehow.
See http://marc.theaimsgroup.com/?l=lucene-user&m=112496111224218&w=2
Faithfully selecting extracts based *exactly* on query criteria will be
hard given complex queries eg
Isn't the trouble with introducing a scoring threshold based on raw
scores that the Similarity scoring mechanism is considering each
document in isolation? At this stage we don't know if the query is
generally a good one or not (ie spelt correctly, and not a Googlewhack
combination of rarely co-occurring terms).
I know there have been some posts discussing how to integrate Lucene
with Derby recently.
I've added an example project that works with both HSQLDB and Derby
here: http://issues.apache.org/jira/browse/LUCENE-434
The bindings allow you to use SQL that mixes database and Lucene
functionality i
Mag Gam wrote:
Does your example store the index in the derby db or somewhere else? I was
thinking of indexing a table in a seperate column.
The software is not an org.apache.lucene.store.Directory implementation
ie an FSDirectory alternative for persisting Lucene data in a relational
table
>>Basically your lucene_query function will return a true/false in one
of the query predicates for each record.
Almost, it returns a score - much more useful than just a boolean and
the key difference between a search engine and a database (partial
matching with relevance ranked scores). Thes
Here's an example I put together to illustrate the point.
package distance;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lu
To avoid caching 10,025 docs when you only want to see 10,000 to 10,025
(and assuming the user was paging through results) you might have to
remember the lowest score used in the previous page of results to avoid
adding those 10,000 docs with score > lastLowScore
to the HitQueue again.
Or using XMLEncoder:
import java.beans.XMLEncoder;
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
HashMap map = new HashMap();
map.put("foo", "bar");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
XMLEncoder encoder = new XMLEncoder(baos);
encoder.writeObject(map);
encoder.flush();
System.out.println(baos.toString());
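Reading it back is symmetrical (java.beans.XMLDecoder):

XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(baos.toByteArray()));
HashMap decoded = (HashMap) decoder.readObject();
decoder.close();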
If so, why not use it for the normal operation as well?
Because MemoryIndex only allows you to store/query one document.
It is fast, but I would not suggest running large numbers of queries against it.
Why not try storing the queries as documents in a special index and query
them using the subject document