Re: Getting unique key of a document inside of a Similarity class.

2015-02-20 Thread J-Pro

from all the examples of what you've described, i'm fairly certain all you
really need is a TFIDF based Similarity where coord(), idf(), tf() and
queryNorm() return 1 always, and you omitNorms from all fields.


Yeah, that's what I did in the very first iteration. It works only for 
cases #1 and #2. If you try query 3 and 4 with such Similarity, you'll get:


3. place:(34\ High\ Street)^3 = doc1(score=9), doc2(score=9)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 = doc1(score=16), 
doc2(score=9)


That is not what I need. As I described above, when multiple tokens
match for a field, SimScorer.score is called X times, where X is the
number of matched tokens (3 in cases #3 and #4), so the scores add up.
I need to score only once in this case, regardless of the number of
tokens.
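
The arithmetic behind those numbers can be modeled in plain Java, without
any Lucene classes. This is only a sketch under two assumptions: with
coord(), idf(), tf() and queryNorm() fixed at 1 and norms omitted, every
matching term clause contributes exactly its boost, and
place:(34\ High\ Street)^3 expands into three term clauses. FlatScoreModel
and its members are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the "everything returns 1" similarity; no Lucene classes involved.
final class FlatScoreModel {

    // One clause per query token, e.g. place:High^3.
    static final class TermClause {
        final String term; // "field:token"
        final int boost;

        TermClause(String field, String token, int boost) {
            this.term = field + ":" + token;
            this.boost = boost;
        }
    }

    // Indexed terms of a sample doc: one name token plus the three place tokens.
    static Set<String> doc(String name) {
        return new HashSet<>(Arrays.asList(
                "name:" + name, "place:34", "place:High", "place:Street"));
    }

    // place:(34\ High\ Street)^3 modeled as three term clauses with boost 3 each.
    static List<TermClause> placeQuery() {
        List<TermClause> q = new ArrayList<>();
        for (String token : new String[] {"34", "High", "Street"}) {
            q.add(new TermClause("place", token, 3));
        }
        return q;
    }

    // name:DocumentOne^7 OR place:(34\ High\ Street)^3
    static List<TermClause> nameOrPlaceQuery() {
        List<TermClause> q = placeQuery();
        q.add(new TermClause("name", "DocumentOne", 7));
        return q;
    }

    // Every matching clause adds its boost, mirroring one score() call per matched token.
    static int score(Set<String> docTerms, List<TermClause> query) {
        int total = 0;
        for (TermClause clause : query) {
            if (docTerms.contains(clause.term)) {
                total += clause.boost;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(score(doc("DocumentOne"), placeQuery()));       // 9
        System.out.println(score(doc("DocumentTwo"), placeQuery()));       // 9
        System.out.println(score(doc("DocumentOne"), nameOrPlaceQuery())); // 16
        System.out.println(score(doc("DocumentTwo"), nameOrPlaceQuery())); // 9
    }
}
```

Under these assumptions the model reproduces the 9 and 16 reported above:
the place weight is counted once per matched token instead of once per
field.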


How to do it? My first idea was a HashSet keyed on fieldName, so that
after scoring once, the field doesn't score anymore. But in that case
only the first document was scored (since the second and all other
documents have the same field name). So I understood that I also need
docID for that. And it worked fine until I found out (thank you for
that) that docID is segment-specific. So now I need a segmentID as well
(or something similar).
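
One way to picture the collision, and a possible way around it, is the
plain-Java model below. It is a sketch, not Lucene code: it assumes that
segment-local docIDs restart at 0 in every segment and that each segment
has a docBase offset (the thread mentions AtomicReaderContext exposing
one), so that docBase + docID is unique across the whole index.
ScoreOnceModel and its methods are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical plain-Java model of a "score each field once per document" guard.
final class ScoreOnceModel {
    private final Set<String> seen = new HashSet<>();

    // Returns the weight the first time a (field, index-wide doc) pair shows up, 0 afterwards.
    int score(String field, int docBase, int segmentDocId, int weight) {
        int globalDoc = docBase + segmentDocId; // assumed unique across segments of one reader
        return seen.add(field + "#" + globalDoc) ? weight : 0;
    }

    // Simulates one score() call per matched token and adds up the results.
    int scoreAllTokens(String field, int docBase, int segmentDocId, int weight, int tokens) {
        int total = 0;
        for (int i = 0; i < tokens; i++) {
            total += score(field, docBase, segmentDocId, weight);
        }
        return total;
    }

    public static void main(String[] args) {
        // Two one-document segments: both docs have segment-local docID 0.
        ScoreOnceModel withDocBase = new ScoreOnceModel();
        System.out.println(withDocBase.scoreAllTokens("place", 0, 0, 3, 3)); // doc1: 3
        System.out.println(withDocBase.scoreAllTokens("place", 1, 0, 3, 3)); // doc2: 3

        // Ignoring docBase (always 0) makes doc2 collide with doc1.
        ScoreOnceModel withoutDocBase = new ScoreOnceModel();
        System.out.println(withoutDocBase.scoreAllTokens("place", 0, 0, 3, 3)); // doc1: 3
        System.out.println(withoutDocBase.scoreAllTokens("place", 0, 0, 3, 3)); // doc2: 0
    }
}
```

The model leaves out threading and the reset between queries; it only
shows why the segment offset has to be part of the key.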




(You didn't give any examples of what you expect to happen with exclusion
clauses in your BooleanQueries


For my needs I won't need exclusion clauses, but in this case the same
would happen - it would score depending on the weight, because the
condition is true:


5. (NOT name:DocumentOne)^7 = doc2(score=7)


Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread J-Pro

Good afternoon.

I need to uniquely identify a document inside of a Similarity class 
during scoring. Is it possible to get value of unique key of a document 
at this point?


For some time I thought I could use the internal docID to achieve that.
Method score(int doc, float freq) is called after every query execution
for each matched doc. The doc argument equals 0, 1, 2, etc. for
successive indexed docs. But this is only the case when documents are
indexed in bulk, i.e. in a single HTTP request. When docs are indexed in
separate requests, these docIds equal 0 for all documents.


To summarize, here are 2 final questions:

1. Is the docId behavior described above a bug or a feature? Obviously,
if it's a bug and I can use docID to uniquely identify a document, then
my question is answered once this bug is fixed.
2. If the docId behavior described above is normal, then what is an
alternative way to uniquely identify a document inside of a Similarity
class during scoring? Can I get the unique key of the document being
scored in Similarity?


FYI: I have asked 1st question in #solr IRC channel. The person named 
hoss answered the following: you're seeing the *internal* docIds ... 
you can't assign any special meaning to them ... i believe that at the 
level of the Similarity class, these may even be per segment, which 
means that in the context of a SegmentReader they can be used to get 
things like docValues, but they don't have any meaning compared to your
uniqueKey (for example). This kinda makes me think that the answer to
the 1st question is that it's a feature. But I am still not sure and
don't know the answer to the 2nd question. Please help.


Thank you very much in advance.


Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread Chris Hostetter

: I need to uniquely identify a document inside of a Similarity class during
: scoring. Is it possible to get value of unique key of a document at this
: point?

Can you tell us a bit more about your usecase ... your problem description 
is a bit vague, and sounds like it may be an XY Problem...

https://people.apache.org/~hossman/#xyproblem
Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about the X so that we can understand the
full issue.  Perhaps the best solution doesn't involve Y at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341

: 1. Is docIds behavior described above a bug or a feature? Obviously, if it's a
: bug and I can use docID to uniquely identify a document, then my question is
: answered after this bug is fixed.
: 2. If docIds behavior described above is normal, then what is an alternative
: way of uniquely identify a document inside of a Similarity class during
: scoring? Can I get unique key of a scoring document in Similarity?

Assuming the method you are referring to (you didn't give a specific
class/interface name) is SimScorer.score(doc,freq) then the javadocs say...

doc - document id within the inverted index segment
freq - sloppy term frequency

...so for #1, yes this is definitely the per-segment docId.

for #2: the method for providing a SimScorer to lucene is by implementing
Similarity.simScorer(...) -- that method gets as an argument an
AtomicReaderContext context, which not only has an AtomicReader for the
individual segment, but also details about that segment's role in the
larger index.

As far as getting the Solr uniqueKey ... it's non trivial, and there are 
different things you could do depending on what your ultimate goal is (ie: 
see my earlier question about XY problem) ... my guess is from this low 
level down in the code you want to use DocValues (aka: FieldCache in older 
versions of lucene) on your uniqueKey field, then ask it for the 
fieldvalue of each internal docId that gets passed to your method -- 
either by using the per-segment DocValues, or by using the 
AtomicReaderContext's base information to determine the top level 
internal docId and use the top level DocValues/FieldCache

(the per-segment vs top level DocValues and internalId stuff can be kind 
of confusing -- start with whichever seems simpler based on your 
understanding of the internal lucene/solr APIs and worry about maybe 
switching to the other approach later once you have something working and 
see if it helps or hinders performance for your usecases)
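
The two lookup routes described above can be pictured with a plain-Java
model. This is a sketch only: the String arrays stand in for DocValues on
the uniqueKey field, segments are assumed to be laid out back to back so
that docBase is the running document count, and SegmentLookupModel with
all of its members is a hypothetical name, not a Lucene or Solr API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: each segment holds its own uniqueKey values, indexed by segment-local docID.
final class SegmentLookupModel {

    static final class Segment {
        final int docBase;         // index-wide docID of this segment's first document
        final String[] uniqueKeys; // stands in for per-segment DocValues on the uniqueKey field

        Segment(int docBase, String[] uniqueKeys) {
            this.docBase = docBase;
            this.uniqueKeys = uniqueKeys;
        }
    }

    // Lays segments out back to back, so docBase is the running document count.
    static List<Segment> build(String[]... keysPerSegment) {
        List<Segment> segments = new ArrayList<>();
        int docBase = 0;
        for (String[] keys : keysPerSegment) {
            segments.add(new Segment(docBase, keys));
            docBase += keys.length;
        }
        return segments;
    }

    // Per-segment route: the docID passed to score() indexes the segment's own values.
    static String perSegmentKey(Segment segment, int segmentDocId) {
        return segment.uniqueKeys[segmentDocId];
    }

    // Top-level route: find the segment that contains the index-wide docID.
    static String topLevelKey(List<Segment> segments, int globalDocId) {
        for (Segment s : segments) {
            int local = globalDocId - s.docBase;
            if (local >= 0 && local < s.uniqueKeys.length) {
                return s.uniqueKeys[local];
            }
        }
        throw new IllegalArgumentException("no such doc: " + globalDocId);
    }

    public static void main(String[] args) {
        List<Segment> segments = build(new String[] {"A", "B"}, new String[] {"C"});
        Segment second = segments.get(1);
        System.out.println(perSegmentKey(second, 0));                  // C
        System.out.println(topLevelKey(segments, second.docBase + 0)); // C
    }
}
```

Both routes resolve to the same key; the per-segment one skips the search
for the owning segment.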

-Hoss
http://www.lucidworks.com/


Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread Chris Hostetter

: 1. name:DocumentOne^7 = doc1(score=7)
: 2. name:DocumentOne^7 AND place:notExist^3 = doc1(score=7)
: 3. place:(34\ High\ Street)^3 = doc1(score=3), doc2(score=3)
: 4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 = doc1(score=10),
: doc2(score=3)
...
:  it's not clear why you need any sort of unique document identification for
:  you scoring algorithm .. from what you described, matches on fieldA should
:  get score A matches on fieldB should get score B ... why does it matter
:  which doc is which?
: 
: For case #3, for example, method SimScorer.score is called 3 times for each of
: these documents, total 6 times for both. I have added a
: ThreadLocal<HashSet<String>> to my custom similarity, which is cleared every
: time before new scoring session (after each query execution). This HashSet
: stores strings consisting of fieldName + docID. Every time score() is called,

Ah HA! ... this is why it's an XY problem... you've decided that you need 
a unique identifier for each doc so you can maintain a HashSet of all the 
times a doc matches a term in the query so you can count them ... you 
don't need to do any of that.

from all the examples of what you've described, i'm fairly certain all you 
really need is a TFIDF based Similarity where coord(), idf(), tf() and 
queryNorm() return 1 always, and you omitNorms from all fields.

that's it ... that should literally be everything you need to do.

(You didn't give any examples of what you expect to happen with exclusion 
clauses in your BooleanQueries, but the approach you were describing 
wouldn't give you any added advantages towards interesting MUST_NOT clauses 
either ... it would in fact only increase the scores for those docs in a 
way that is almost certainly not what you want)


-Hoss
http://www.lucidworks.com/


Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread J-Pro
Thank you for your answer, Chris. I will reply with inline comments as 
well. Please see below.



: I need to uniquely identify a document inside of a Similarity class during
: scoring. Is it possible to get value of unique key of a document at this
: point?

Can you tell us a bit more about your usecase ... your problem description
is a bit vague, and sounds like it may be an XY Problem...


Sure, sorry I did not do it before, I just wanted to take as little of
your valuable time as possible. In my custom Similarity class I am
trying to implement logic where the score calculation is based only on
the field weight and a field match - that's it. In other words, if a
field matches the query, I want the score method to return this field's
weight only, regardless of factors like: norms; coord; doc frequencies;
the fact that the field was multivalued and more than one value matched;
the fact that the field was tokenized into multiple tokens and more than
one token matched, etc. As far as I know, there is no such similarity
among the existing ones.
In order to implement this, I am trying to score only once for a
combination of a specific field + doc unique identifier. And I don't
care what this unique doc identifier is - it can be the unique key or it
can be the internal doc ID.
I had my implementation working, but as I understood from your answer, I
had it working only for one segment. So now I need to add a segment ID
or something like this to my combination.




Assuming the method you are referring to (you didn't give a specific
class/interface name) is SimScorer.score(doc,freq) then the javadocs say...

 doc - document id within the inverted index segment
 freq - sloppy term frequency

...so for #1, yes this is definitely the per-segment docId.


Yes, it's ExactSimScorer.score(int doc, int freq). Ah! Per segment! Here
we go, now I understand why it's 0 after every new commit! The SOLR docs
say new docs are written to a new segment. So question #1 is clear to
me. Thanks, Chris!




for #2: the method for providing a SimScorer to lucene is by implementing
Similarity.simScorer(...) -- that method gets as an argument an
AtomicReaderContext context, which not only has an AtomicReader for the
individual segment, but also details about that segment's role in the
larger index.


Interesting details, that may be exactly what I need. If I can somehow 
uniquely identify a document using its internal doc id + data from 
context (like segment id or something), that would be awesome. I have 
checked AtomicReaderContext, it has 'ord' (The readers ord in the 
top-level's leaves array) and 'docBase' (The readers absolute doc base) 
- probably what I need. Do you have any more information (maybe links to 
wikis) about this AtomicReaderContext, DocValues, low and top levels 
(other than javadoc in source code)? I have a high-level understanding, 
but it's obviously not enough for the problem I am solving. I would be 
more than happy to understand it.


Thank you very much for your time, Chris and other people who spend time 
on reading/answering this thread!


Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread Chris Hostetter

: Sure, sorry I did not do it before, I just wanted to take minimum of your
: valuable time. So in my custom Similarity class I am trying to implement such
: a logic, where score calculation is only based on field weight and a field
: match - that's it. In other words, if a field matches the query, I want
: score method to return this field's weight only, regardless of factors like:
: norms; coord; doc frequencies; fact that field was multivalued and more than
: one value matched; fact that field was tokenized as multiple tokens and more
: than one token matched, etc. As far as I know, there is no such a similarity
: in list of existing ones.

how are you defining/specifying these field weights?

it would help if you could give a concrete example of some sample docs, a 
sample query, and what results you would expect ... the sample input and 
sample output of the system you are interested in.

: In order to implement this, I am trying to score only once for a combination
: of a specific field + doc unique identifier. And I don't care what is this
: unique doc identifier - it can be unique key or it can be internal doc ID.

it's not clear why you need any sort of unique document identification for 
you scoring algorithm .. from what you described, matches on fieldA should 
get score A matches on fieldB should get score B ... why does it matter 
which doc is which?



-Hoss
http://www.lucidworks.com/


Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread J-Pro

how are you defining/specifying these field weights?


I define weights inside of a query (name:SomeName^7).



it would help if you could give a concrete example of some sample docs, a
sample query, and what results you would expect ... the sample input and
sample output of the system you are interested in.


Sure. Imagine we have 2 docs:

doc1
-
name:DocumentOne
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

doc2
-
name:DocumentTwo
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

I want the following queries to return docs with these scores:

1. name:DocumentOne^7 = doc1(score=7)
2. name:DocumentOne^7 AND place:notExist^3 = doc1(score=7)
3. place:(34\ High\ Street)^3 = doc1(score=3), doc2(score=3)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 = doc1(score=10), 
doc2(score=3)



If you're curious about why I need it, i.e. about my very initial
problem X: I need this scoring to be able to calculate a matching
percentage. That's a separate topic; I read a lot about it (including
http://wiki.apache.org/lucene-java/ScoresAsPercentages) and people say
it's either not doable or very, very complicated with SOLR. So I just
want to give it a try. For case #3 from above the matching percentage is
100% for both docs. For case #4 it's doc1:100% and doc2:30%.
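
The percentages quoted for cases #3 and #4 are consistent with dividing a
document's score by the sum of the field weights used in the query. The
sketch below assumes exactly that definition (it is not stated explicitly
in the thread); MatchPercentage is a hypothetical helper, not part of
Solr or Lucene.

```java
// Hypothetical helper, not part of Solr or Lucene.
final class MatchPercentage {

    // Percentage of the best achievable score, where the best score is assumed
    // to be the sum of the weights of all fields used in the query.
    static int percent(int score, int... queryFieldWeights) {
        int max = 0;
        for (int w : queryFieldWeights) {
            max += w;
        }
        if (max <= 0) {
            throw new IllegalArgumentException("query needs at least one positive weight");
        }
        return Math.round(100f * score / max);
    }

    public static void main(String[] args) {
        System.out.println(percent(3, 3));     // case #3, either doc: 100
        System.out.println(percent(10, 7, 3)); // case #4, doc1: 100
        System.out.println(percent(3, 7, 3));  // case #4, doc2: 30
    }
}
```

This only works if each field contributes its weight at most once per
document, which is the scoring behavior being asked for here.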




it's not clear why you need any sort of unique document identification for
you scoring algorithm .. from what you described, matches on fieldA should
get score A matches on fieldB should get score B ... why does it matter
which doc is which?


For case #3, for example, method SimScorer.score is called 3 times for
each of these documents, 6 times in total for both. I have added a
ThreadLocal<HashSet<String>> to my custom similarity, which is cleared
every time before a new scoring session (after each query execution).
This HashSet stores strings consisting of fieldName + docID. Every time
score() is called, I check this HashSet - if fieldName + docID exists, I
return 0 as the score, otherwise the field weight.
If there was no docID in this string (only the field name), then case #3
would return the following: doc1(score=3), doc2(score=0). If there was
no HashSet at all, case #3 would return: doc1(score=9), doc2(score=9),
since the query matched all 3 tokens for every doc.
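
The three outcomes for case #3 can be reproduced with a small plain-Java
model of the guard. It is a sketch under simplifying assumptions: a
single segment, doc1 and doc2 given docIDs 0 and 1, doc1 scored first,
and no threading. GuardKeyModel and KeyMode are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the HashSet guard with different key choices; not Lucene code.
final class GuardKeyModel {
    enum KeyMode { NONE, FIELD_ONLY, FIELD_AND_DOC }

    private final KeyMode mode;
    private final Set<String> seen = new HashSet<>();

    GuardKeyModel(KeyMode mode) {
        this.mode = mode;
    }

    // One simulated score() call: returns the weight unless the key was already recorded.
    int score(String field, int docId, int weight) {
        if (mode == KeyMode.NONE) {
            return weight;
        }
        String key = (mode == KeyMode.FIELD_ONLY) ? field : field + docId;
        return seen.add(key) ? weight : 0;
    }

    // Case #3: three matching place tokens per doc, weight 3; returns {doc1 total, doc2 total}.
    static int[] caseThree(KeyMode mode) {
        GuardKeyModel guard = new GuardKeyModel(mode);
        int[] totals = new int[2];
        for (int docId = 0; docId < 2; docId++) {
            for (int token = 0; token < 3; token++) {
                totals[docId] += guard.score("place", docId, 3);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        for (KeyMode mode : KeyMode.values()) {
            int[] t = caseThree(mode);
            System.out.println(mode + ": doc1=" + t[0] + ", doc2=" + t[1]);
        }
    }
}
```

With these assumptions NONE gives 9 and 9, FIELD_ONLY gives 3 and 0, and
FIELD_AND_DOC gives 3 and 3, matching the numbers described above.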


I know that what I'm doing is a hack, but that's the only way I've 
found so far to implement percentage matching. I just want to play 
around with it, see how it performs and decide whether to use it or not. 
But for that I need to uniquely identify a document while scoring :)