Hi Erick,
"debug":{ "rawquerystring":"title:\"title-123123123-end\"",
"querystring":"title:\"title-123123123-end\"",
"parsedquery":"(+(DisjunctionMaxQuery(((author_full:title)^7.0 |
(abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
(authors:title)^4.0 | (doi:title:)^1.0))
DisjunctionMaxQuery(((author_full:\"title 123123123 end\"~1)^7.0 |
(abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl 123123123
end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
(authors:\"title 123123123 end\"~1)^4.0 |
(doi:title-123123123-end)^1.0)))~1 ())/no_coord",
"parsedquery_toString":"+((((author_full:title)^7.0 |
(abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
(authors:title)^4.0 | (doi:title:)^1.0) ((author_full:\"title 123123123
end\"~1)^7.0 | (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl
123123123 end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
(authors:\"title 123123123 end\"~1)^4.0 |
(doi:title-123123123-end)^1.0))~1) ()", "explain":{ "23251":"\n16.848969
= sum of:\n 16.848969 = sum of:\n 16.848969 = max of:\n 16.848969 =
weight(abstract:titl in 23194) [], result of:\n 16.848969 =
score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 2.0 = boost\n
5.503748 = idf(docFreq=74, docCount=18297)\n 1.5306814 = tfNorm,
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
parameter b\n 186.49593 = avgFieldLength\n 28.444445 = fieldLength\n
3.816711E-5 = weight(title:titl in 23194) [], result of:\n 3.816711E-5 =
score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
"20495":"\n16.169483 = sum of:\n 16.169483 = sum of:\n 16.169483 = max
of:\n 16.169483 = weight(abstract:titl in 20489) [], result of:\n
16.169483 = score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.468952 =
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
= parameter b\n 186.49593 = avgFieldLength\n 40.96 = fieldLength\n
3.816711E-5 = weight(title:titl in 20489) [], result of:\n 3.816711E-5 =
score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
"28227":"\n15.670726 = sum of:\n 15.670726 = sum of:\n 15.670726 = max
of:\n 15.670726 = weight(abstract:titl in 28156) [], result of:\n
15.670726 = score(doc=28156,freq=2.0 = termFreq=2.0\n), product of:\n
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.4236413 =
tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75
= parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n
3.816711E-5 = weight(title:titl in 28156) [], result of:\n 3.816711E-5 =
score(doc=28156,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
"20375":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max
of:\n 15.052014 = weight(abstract:titl in 20369) [], result of:\n
15.052014 = score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 =
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
= parameter b\n 186.49593 = avgFieldLength\n 64.0 = fieldLength\n
3.816711E-5 = weight(title:titl in 20369) [], result of:\n 3.816711E-5 =
score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
"20381":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max
of:\n 15.052014 = weight(abstract:titl in 20375) [], result of:\n
15.052014 = score(doc=20375,freq=1.0 = termFreq=1.0\n), product of:\n
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 =
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
= parameter b\n 186.49593 = avgFieldLength\n 64.0 = fieldLength\n
3.816711E-5 = weight(title:titl in 20375) [], result of:\n 3.816711E-5 =
score(doc=20375,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
"29030":"\n13.699375 = sum of:\n 13.699375 = sum of:\n 13.699375 = max
of:\n 13.699375 = weight(abstract:titl in 28959) [], result of:\n
13.699375 = score(doc=28959,freq=2.0 = termFreq=2.0\n), product of:\n
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.2445496 =
tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75
= parameter b\n 186.49593 = avgFieldLength\n 256.0 = fieldLength\n
3.816711E-5 = weight(title:titl in 28959) [], result of:\n 3.816711E-5 =
score(doc=28959,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
"31444":"\n13.699375 = sum of:\n 13.699375 = sum of:\n 13.699375 = max
of:\n 13.699375 = weight(abstract:titl in 31373) [], result of:\n
13.699375 = score(doc=31373,freq=2.0 = termFreq=2.0\n), product of:\n
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.2445496 =
tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75
= parameter b\n 186.49593 = avgFieldLength\n 256.0 = fieldLength\n
3.816711E-5 = weight(title:titl in 31373) [], result of:\n 3.816711E-5 =
score(doc=31373,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
"30621":"\n13.096554 = sum of:\n 13.096554 = sum of:\n 13.096554 = max
of:\n 13.096554 = weight(abstract:titl in 30550) [], result of:\n
13.096554 = score(doc=30550,freq=1.0 = termFreq=1.0\n), product of:\n
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.189785 =
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
= parameter b\n 186.49593 = avgFieldLength\n 113.77778 = fieldLength\n
3.816711E-5 = weight(title:titl in 30550) [], result of:\n 3.816711E-5 =
score(doc=30550,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
"32067":"\n13.096554 = sum of:\n 13.096554 = sum of:\n 13.096554 = max
of:\n 13.096554 = weight(abstract:titl in 31996) [], result of:\n
13.096554 = score(doc=31996,freq=1.0 = termFreq=1.0\n), product of:\n
2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.189785 =
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
= parameter b\n 186.49593 = avgFieldLength\n 113.77778 = fieldLength\n
3.816711E-5 = weight(title:titl in 31996) [], result of:\n 3.816711E-5 =
score(doc=31996,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
"1935":"\n11.583146 = sum of:\n 11.583146 = sum of:\n 11.583146 = max
of:\n 11.583146 = weight(abstract:titl in 1934) [], result of:\n
11.583146 = score(doc=1934,freq=1.0 = termFreq=1.0\n), product of:\n 2.0
= boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.0522962 =
tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
= parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n
3.816711E-5 = weight(title:titl in 1934) [], result of:\n 3.816711E-5 =
score(doc=1934,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n"},
"QParser":"DisMaxQParser", "altquerystring":null, "boostfuncs":null,
Kind regards,
Darko Todoric
On 08/28/2017 06:35 PM, Erick Erickson wrote:
What are the results of adding &debug=query to the URL? The parsed
query will be especially illuminating.
Best,
Erick
On Mon, Aug 28, 2017 at 4:37 AM, Emir Arnautovic
<emir.arnauto...@sematext.com> wrote:
Hi Darko,
The issue is a wrong expectation: title-1-end is parsed into 3 tokens
(I am guessing), and mm=99% of 3 tokens is 2.97, which is rounded down to 2.
Since all your documents have the 'title' and 'end' tokens, they all match. If
you want to round up, you can use mm=-1%. That will result in zero matches (or
one match if you do not filter out the original document).
You have to play with your tokenizers and define what a similarity match
percentage means for you (if you want to stick with mm).
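The rounding described above can be sketched in a few lines of Python. This is only an illustration of the documented mm percentage rules (a positive percentage is rounded down; a negative percentage is the share of optional clauses that may be missing, also rounded down), not Solr's actual code. The function name min_should_match is hypothetical.

```python
def min_should_match(optional_clauses: int, mm_percent: int) -> int:
    """Number of optional clauses that must match for a percentage mm.

    Hypothetical helper that mirrors the documented rounding rules only.
    """
    if mm_percent >= 0:
        # Positive percentage: compute the share and round down.
        return optional_clauses * mm_percent // 100
    # Negative percentage: this share of clauses may be missing (rounded
    # down); every other clause is required.
    may_be_missing = optional_clauses * (-mm_percent) // 100
    return optional_clauses - may_be_missing


if __name__ == "__main__":
    print(min_should_match(3, 99))   # 99% of 3 is 2.97, rounded down
    print(min_should_match(3, -1))   # 1% of 3 is 0.03, so none may be missing
```

With three optional clauses, mm=99% requires only two of them, while mm=-1% requires all three. If the query ends up with two optional clauses instead, 99% of 2 is 1.98, which rounds down to 1; that would be consistent with the ~1 in the parsed query in the debug output at the top of this thread.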
Regards,
Emir
On 28.08.2017 09:17, Darko Todoric wrote:
Hm... I cannot get DisMax to work on my Solr...
In Solr I have documents with the titles:
- "title-1-end"
- "title-2-end"
- "title-3-end"
- ...
- ...
- "title-312-end"
and when I run the query
"http://localhost:8983/solr/SciLit/select?defType=dismax&indent=on&mm=99%&q=title:"title-123123123-end"&wt=json"
I get all documents from Solr :\
What am I doing wrong?
Also, I don't know if it affects the results, but on the "title" field I use
"WhitespaceTokenizerFactory".
Kind regards,
Darko
On 08/25/2017 06:38 PM, Junte Zhang wrote:
If you already have the title of the document, then you could run that
title as a new query against the whole index and exclude the source document
from the results as a filter.
You could use the DisMax query parser:
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
And then set the minimum match ratio of the OR clauses to 90%.
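As a rough sketch of that suggestion, the request could be assembled as below. The host and core name are taken from the URL used elsewhere in this thread; the unique key field name "id", the helper name build_similar_title_url, and the choice of qf=title are assumptions for illustration. Note that the percent sign in mm has to be URL-encoded (%25), which urlencode takes care of.

```python
from urllib.parse import urlencode


def build_similar_title_url(base_url: str, title: str, source_id: str,
                            mm: str = "90%") -> str:
    """Build a DisMax select URL that searches for a title and filters out
    the source document. Hypothetical helper; field names are assumptions."""
    params = {
        "defType": "dismax",
        "q": title,                    # the title text itself, no field prefix
        "qf": "title",                 # DisMax takes the fields to search from qf
        "mm": mm,                      # minimum share of optional clauses to match
        "fq": f'-id:"{source_id}"',    # exclude the source document (assumes key "id")
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_similar_title_url(
        "http://localhost:8983/solr/SciLit", "title-1-end", "23251"))
```

How many clauses mm is applied to still depends on how the title field is tokenized, so the 90% threshold is a percentage of matching tokens, not a character-level similarity.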
/JZ
-----Original Message-----
From: Darko Todoric [mailto:todo...@mdpi.com]
Sent: Friday, August 25, 2017 5:49 PM
To: solr-user@lucene.apache.org
Subject: Search by similarity?
Hi,
I have 90.000.000 documents in Solr and I need to compare the "title" of a
document against them and get all documents with more than 80% similarity.
PHP has "similar_text", but it's not so smart to load 90m documents into an
array...
Can I run some query in Solr which will give me the documents with more than
80% similarity?
Kind regards,
Darko Todoric
--
Darko Todoric
Web Engineer, MDPI DOO
Veljka Dugosevica 54, 11060 Belgrade, Serbia
+381 65 43 90 620
www.mdpi.com
Disclaimer: The information and files contained in this message are
confidential and intended solely for the use of the individual or entity to
whom they are addressed.
If you have received this message in error, please notify me and delete
this message from your system.
You may not copy this message in its entirety or in part, or disclose its
contents to anyone.
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/