Hi Siraj,
I think
https://lucene.apache.org/core/6_1_0/core/index.html?org/apache/lucene/search/ConstantScoreQuery.html
should be good enough.
On Fri, Jul 8, 2016 at 12:27 AM Siraj Haider wrote:
> We are in the process of upgrading from 2.x to 6.x. In 2.x we implemented
> our own similarity whe
Update: I have implemented my own subclasses of QueryParser, BooleanQuery,
BooleanScorer and Similarity to deal with this.
I have been successful in getting the exact behaviour I want... when
calling the .explain() method. However, the scores for some documents often
differ when calling IndexSearc
File a Jira for this particular doc fix since it is significant and not
just mere wordsmithing. Better yet, submit a patch since that's Javadoc,
although the exact form of the doc fix might be debatable, so a general
description of the problem should be sufficient, unless you feel motivated.
-- Ja
Oh thanks Mike, it did say somewhere. I guess it wouldn't hurt to make that
explanation more prominent, as I clearly missed it.
Never mind, I am working on my own solution for this, through subclassing
QueryParser, BooleanQuery, BooleanScorer, Similarity and a bunch of other
classes.
Cheers,
Dani
On 1/15/15 11:23 AM, danield wrote:
Hi Mike,
Thank you for your reply. Yes, I had thought of this, but it is not a
solution to my problem, and this is because the Term Frequency and therefore
the results will still be wrong, as prepending or appending a string to the
term will still make it a di
Hi Mike,
Thank you for your reply. Yes, I had thought of this, but it is not a
solution to my problem, and this is because the Term Frequency and therefore
the results will still be wrong, as prepending or appending a string to the
term will still make it a different term.
Similarly, I could use
In practice, normalization by field length proves to be more useful than
normalization by the sum of the lengths of all fields (document length),
which I think is what you seem to be after. Think of a book chapter
document with two fields: title and full text. It makes little sense to
weight
Corrections:
document2={field1:"term1", field2:"term1"}
Coord(query1,document2)= 1/1 = 1
(Doesn't affect the problem/observation)
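For readers following the arithmetic, the coord value in the correction above is just "matched clauses over total clauses". Here is a minimal standalone sketch, assuming the classic TF-IDF definition coord(overlap, maxOverlap) = overlap / maxOverlap. The class name CoordSketch is made up for illustration and nothing here calls Lucene itself.

```java
// Hypothetical helper for illustration only; it mirrors the classic
// coord(overlap, maxOverlap) definition and does not depend on Lucene.
final class CoordSketch {
    private CoordSketch() {}

    /** Fraction of the query's clauses that matched the document. */
    static float coord(int overlap, int maxOverlap) {
        if (maxOverlap <= 0) {
            throw new IllegalArgumentException("maxOverlap must be positive");
        }
        return overlap / (float) maxOverlap;
    }
}
```

With one matching clause out of one, CoordSketch.coord(1, 1) gives 1, which is the corrected value; one match out of two clauses would give 0.5.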
--
View this message in context:
http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307p4
Similarity.setDefault(new MySimilarity()) is certainly better than the
2 calls I recommended. Thanks.
I find it hard to see why one might not want to do this in normal
usage but have a vague recollection of someone once outlining some
obscure scenarios where different similarities at index and se
> [use] IndexWriterConfig.setSimilarity() and
> IndexSearcher.setSimilarity(), unless you are clever or like being confused.
>
> SweetSpotSimilarity might also be worth a look.
>
> --
> Ian.
Being even less clever, I just make sure I set:
Similarity.setDefault(new MySimilarity())
when crawl
You can override org.apache.lucene.search.Similarity/DefaultSimilarity
to tweak quite a lot of stuff.
computeNorm() may be the method you are interested in. Called at
indexing time so be sure to use the same implementation at index and
query time, using IndexWriterConfig.setSimilarity() and
Index
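To make the length-norm idea concrete, here is a rough standalone sketch, assuming the classic 1 / sqrt(numTerms) normalization that DefaultSimilarity applies per field. LengthNormSketch and flatNorm are hypothetical names, and the real value is baked into the index at indexing time, which is why the same Similarity has to be used on both sides.

```java
// Hypothetical illustration of per-field length normalization; not Lucene code.
final class LengthNormSketch {
    private LengthNormSketch() {}

    /** Classic normalization: shorter fields get a larger norm. */
    static float lengthNorm(int numTerms) {
        if (numTerms <= 0) {
            throw new IllegalArgumentException("numTerms must be positive");
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Assumed alternative: ignore field length altogether. */
    static float flatNorm(int numTerms) {
        return 1.0f;
    }
}
```

A 4-term field gets a norm of 0.5 and a 100-term field gets 0.1, so a match in the short field counts five times as much unless the norm is flattened.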
You can use RegexQuery (from contrib/regex) for this?
(In 3.1 there's a higher performance, very similar, RegexpQuery, too).
Mike
On Thu, Apr 8, 2010 at 10:10 AM, Hans-Henning Gabriel
wrote:
> Hello everybody,
>
> this is what I would like to do:
> I have an index with documents containing a fi
: "how do i set the score of each document result to be the score of that
: of the field that best matches the search terms"?
you'll want something like this pseudo code...
DisjunctionMaxQuery dq = new DMQ
foreach fieldname in list_of_fields {
BooleanQuery bq = new BQ
foreach word in l
I suppose this could be summarised as:
"how do i set the score of each document result to be the score of that
of the field that best matches the search terms"?
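The "score of the best-matching field" idea can be sketched without Lucene as below. This assumes the usual disjunction-max combination (the maximum per-field score plus a tie-breaker times the sum of the other fields' scores) and non-negative per-field scores. DisMaxSketch and its method are hypothetical names, not Lucene API.

```java
// Hypothetical illustration of disjunction-max style score combination.
final class DisMaxSketch {
    private DisMaxSketch() {}

    /**
     * Returns the best per-field score, plus tieBreaker times the sum of
     * the remaining per-field scores. A tieBreaker of 0 means "best field only".
     */
    static float combine(float tieBreaker, float... fieldScores) {
        float max = 0f;
        float sum = 0f;
        for (float score : fieldScores) {
            sum += score;
            max = Math.max(max, score);
        }
        return max + tieBreaker * (sum - max);
    }
}
```

For per-field scores of 2.0 and 0.5, a tie-breaker of 0 returns 2.0 (the best field alone), while 0.1 returns roughly 2.05.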
-Original Message-
From: Joel Halbert
Reply-To: java-user@lucene.apache.org
To: Lucene Users
Subject: similarity function
Da
y and constructing vector space.
>
> - RB
>
>
> - Original Message
> From: Shashi Kant
> To: java-user@lucene.apache.org
> Sent: Tuesday, June 23, 2009 3:20:16 PM
> Subject: Re: Similarity
>
> I suspect what you are looking for is "Latent Semantics
used for analyzing terms
semantically and constructing vector space.
- RB
- Original Message
From: Shashi Kant
To: java-user@lucene.apache.org
Sent: Tuesday, June 23, 2009 3:20:16 PM
Subject: Re: Similarity
I suspect what you are looking for is "Latent Semantics"
I suspect what you are looking for is "Latent Semantics" - it can
algorithmically infer that "iPod~iPhone" or "Apple~Steve Jobs". Google for
"Latent Semantic Indexing" or "Latent Semantic Analysis" - you can apply
some of those approaches using the TermVectors in Lucene index.
Ontologies such as Wo
Although I could be wrong, I'm wondering whether lengthNorm is the
correct one I should be overriding. I'm interested in the number of times a
term occurs in a document (the more occurrences, the higher the score), which
I believe is coord. I may well be barking up the wrong tree.
Cheers
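For what it's worth, in the classic TF-IDF formula the factor driven by how often a term occurs in a document is tf, while coord counts how many of the query's terms matched at all. Here is a small standalone sketch, assuming the sqrt(freq) definition used by DefaultSimilarity. TfSketch and linearTf are hypothetical names for illustration.

```java
// Hypothetical illustration of the term-frequency factor; not Lucene code.
final class TfSketch {
    private TfSketch() {}

    /** Classic tf: grows with the occurrence count, but with diminishing returns. */
    static float tf(int freq) {
        if (freq < 0) {
            throw new IllegalArgumentException("freq must not be negative");
        }
        return (float) Math.sqrt(freq);
    }

    /** Assumed alternative: every extra occurrence adds the same amount. */
    static float linearTf(int freq) {
        if (freq < 0) {
            throw new IllegalArgumentException("freq must not be negative");
        }
        return freq;
    }
}
```

Four occurrences give tf = 2 under the classic definition and 4 under the linear variant, so overriding tf is the likelier place to reward repeated occurrences.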
Sounds like your most difficult part will be the question parser using POS.
This is kind of old school but use something like the AliceBot AIML library
http://en.wikipedia.org/wiki/AIML
Where the subjective terms can be extracted from the questions, and indexed
separately.
Or as Grant and others
Hi Seid,
Do you have a reference for the article? I've done some QA in my day,
but don't recall reading that one.
At any rate, I do think it is possible to do what you are after. See
below.
On Mar 5, 2009, at 9:49 AM, Seid Mohammed wrote:
For my work, I have read an article stating th
Hi,
Given that you are trying to answer factoid questions to start
with, it is better to use OpenNLP components for
NER (named entity recognition) on the documents and use those tags as part
of your indexing process.
Regards
Vasu
On Thu, Mar 5, 2009 at 8:19 PM, Seid Mohamm
For those interested in my solution, I used this article as the basis for
implementing the requirements:
http://www.catalysoft.com/articles/StrikeAMatch.html
Thanks.
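For anyone who wants the gist of that article without following the link, here is a rough sketch of the letter-pair comparison as I understand it: split each string into adjacent character pairs per word, then score 2 * shared pairs / total pairs. This is not the poster's actual code, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a letter-pair similarity measure.
final class StrikeAMatchSketch {
    private StrikeAMatchSketch() {}

    /** Adjacent character pairs of every whitespace-separated word, upper-cased. */
    static List<String> wordLetterPairs(String text) {
        List<String> pairs = new ArrayList<>();
        for (String word : text.toUpperCase(Locale.ROOT).trim().split("\\s+")) {
            for (int i = 0; i + 1 < word.length(); i++) {
                pairs.add(word.substring(i, i + 2));
            }
        }
        return pairs;
    }

    /** Returns a value between 0 (no shared pairs) and 1 (same pairs). */
    static double similarity(String a, String b) {
        List<String> pairsA = wordLetterPairs(a);
        List<String> pairsB = wordLetterPairs(b);
        int total = pairsA.size() + pairsB.size();
        if (total == 0) {
            return 0.0;
        }
        int shared = 0;
        for (String pair : pairsA) {
            // Remove each matched pair so repeated pairs are not counted twice.
            if (pairsB.remove(pair)) {
                shared++;
            }
        }
        return (2.0 * shared) / total;
    }
}
```

"FRANCE" and "FRENCH" share the pairs FR and NC out of five pairs each, so the score is 2 * 2 / 10 = 0.4.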
- Original Message -
From: [EMAIL PROTECTED]
Sent: Thu, September 4, 2008 1:20
Subject:Re: Similarity percentage betwee
I would create 1-5 ngram sized shingles and measure the distance using
Tanimoto coefficient. That would probably work out just fine. You
might want to add more weight the greater the size of the shingle.
There are shingle filters in lucene/java/contrib/analyzers and there
is a Tanimoto dist
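The shingle idea can be tried out in plain Java before wiring up any Lucene filters. Here is a minimal sketch, assuming word shingles of 1 to 5 words and the set form of the Tanimoto coefficient (shared shingles over all distinct shingles). It does not weight larger shingles more heavily, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: word shingles compared with a set-based Tanimoto coefficient.
final class ShingleTanimotoSketch {
    private ShingleTanimotoSketch() {}

    /** All runs of minSize..maxSize consecutive words, lower-cased. Assumes minSize >= 1. */
    static Set<String> shingles(String text, int minSize, int maxSize) {
        Set<String> result = new HashSet<>();
        String trimmed = text.toLowerCase(Locale.ROOT).trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int size = minSize; size <= maxSize; size++) {
            for (int start = 0; start + size <= words.length; start++) {
                result.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
            }
        }
        return result;
    }

    /** Shared elements divided by distinct elements; 0 when both sets are empty. */
    static double tanimoto(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        int union = a.size() + b.size() - shared.size();
        return (double) shared.size() / union;
    }

    static double similarity(String a, String b) {
        return tanimoto(shingles(a, 1, 5), shingles(b, 1, 5));
    }
}
```

For "a b" versus "a c" the shingle sets are {a, b, "a b"} and {a, c, "a c"}, which share one of five distinct shingles, so the score is 0.2. Weighting by shingle size would mean summing weights where this sketch counts elements.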
Googling for "java string similarity" throws up some stuff you might
find useful.
--
Ian.
On Wed, Sep 3, 2008 at 11:58 PM, Thiago Moreira <[EMAIL PROTECTED]> wrote:
>
> Well, the similar definition that I'm looking for is the number 2, maybe
> the number 3, but to start the number 2 is enou
More details may change my opinion (not quite sure how others feel
yet), but with the way you've described it so far, it seems like all
you need is a basic string matcher:
For every message:
- if message.subject is found in the pool, then this
message is "similar to" the message in the poo
I don't know how much of this is a Lucene problem, but -- as I'm sure
you will inevitably hear from others on the list -- it depends on
what your definition of "similar" is.
By similar, do you mean:
1. Identical, except for variations in case (upper/lower)
2. Allow 1., but also allow prefix
Lucene In Action is a great book, but you can also have a look at
http://lucene.apache.org/java/docs/scoring.html for more info on
scoring and how to change the similarity and other details of
scoring. Also, search the archives for things you are interested in,
there is a lot of information
The PDF of Lucene in Action can be purchased from www.manning.com
I'd suggest reading and understanding Lucene in Action before you attempt
anything else :)
-Original Message-
From: Mahdi Rahimi [mailto:[EMAIL PROTECTED]
Sent: 26 June 2007 16:38
To: java-user@lucene.apache.org
Subject: Si
Subject: Re: Similarity for Span and Boolean query
: The equation for similarity is given on this web page:
: http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
:
: I would like to know what are the equations for similarity if the query
: is a span or boolean query
: The equation for similarity is given on this web page:
: http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
:
: I would like to know what are the equations for similarity if the query
: is a span or boolean query.
That equation does cover BooleanQueries -- the "c
On Dec 19, 2005, at 1:23 PM, Klaus wrote:
I) What is exactly written to the index? Is the index just an
inverted list?
Is there term weight scoring stored?
http://lucene.apache.org/java/docs/fileformats.html
1) Get all the documents from the index via the inverted list.
Yo
You can use the HitCollector mechanism to fill your array, but what you
are doing is essentially what the Hits object already does, plus it
provides caching
Eugene Ezekiel wrote:
Yes, but what I wanna be able to do is something like, fill an array of
say size 100 such that:
array[0] = similar
Yes, but what I wanna be able to do is something like, fill an array of
say size 100 such that:
array[0] = similarity value of query and doc(0)
array[1] = similarity value of query and doc(1)
Any idea how to fill this array?
Thanks.
--
Regards,
Eugene
Koji Sekiguchi wrote:
You can get sco
You can get scores by calling Hits.score(), so you need to run a
search first to get a Hits object.
regards,
Koji
> -Original Message-
> From: Eugene Ezekiel [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, December 07, 2005 6:03 PM
> To: java-user@lucene.apache.org
> Subject: Similarity scores fo