Re: Getting unique key of a document inside of a Similarity class.
: from all the examples of what you've described, i'm fairly certain all you
: really need is a TFIDF based Similarity where coord(), idf(), tf() and
: queryNorm() return 1 always, and you omitNorms from all fields.

Yeah, that's what I did in the very first iteration. It works only for cases #1 and #2. If you try queries #3 and #4 with such a Similarity, you'll get:

3. place:(34\ High\ Street)^3 = doc1(score=9), doc2(score=9)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 = doc1(score=16), doc2(score=9)

That is not what I need. As I described above, when multiple tokens match for a field, SimScorer.score is called X times, where X is the number of matched tokens (in cases #3 and #4 there are 3 tokens), so the scores add up. I need to score only once in this case, regardless of the number of tokens. How can I do that?

My first idea was a HashSet keyed on fieldName, so that after scoring once, it doesn't score anymore. But then only the first document was scored (since the second and all other documents have the same field name). So I understood that I also need the docID in the key. And that worked fine until I found out (thank you for that) that the docID is segment-specific. So now I need a segment ID as well (or something similar).

: (You didn't give any examples of what you expect to happen with exclusion
: clauses in your BooleanQueries ...)

For my needs I won't need exclusion clauses, but the same would happen there - it would score depending on weight, because the condition is true:

5. (NOT name:DocumentOne)^7 = doc2(score=7)
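To make the arithmetic behind scores 9 and 16 concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: TermClause, NaiveSumSketch and the token lists are hypothetical names made up for illustration, and the model simply assumes that every matching term clause contributes its boost once, which is the behaviour reported above for a Similarity whose factors all return 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one boosted term clause of a query (not a Lucene class).
class TermClause {
    final String field;
    final String token;
    final float boost;

    TermClause(String field, String token, float boost) {
        this.field = field;
        this.token = token;
        this.boost = boost;
    }
}

class NaiveSumSketch {
    // Adds the boost of every clause whose token occurs in the matching field of the doc.
    static float score(Map<String, List<String>> doc, List<TermClause> clauses) {
        float total = 0f;
        for (TermClause c : clauses) {
            List<String> tokens = doc.getOrDefault(c.field, Collections.<String>emptyList());
            if (tokens.contains(c.token)) {
                total += c.boost;
            }
        }
        return total;
    }

    // Builds a doc with the given name and the three place tokens from the thread.
    static Map<String, List<String>> doc(String name) {
        Map<String, List<String>> d = new HashMap<>();
        d.put("name", Arrays.asList(name));
        d.put("place", Arrays.asList("34", "High", "Street"));
        return d;
    }

    // Query #3: three term clauses on "place", each boosted by 3.
    static List<TermClause> query3() {
        return Arrays.asList(
            new TermClause("place", "34", 3f),
            new TermClause("place", "High", 3f),
            new TermClause("place", "Street", 3f));
    }

    // Query #4: query #3 plus name:DocumentOne boosted by 7.
    static List<TermClause> query4() {
        List<TermClause> q = new ArrayList<>(query3());
        q.add(new TermClause("name", "DocumentOne", 7f));
        return q;
    }

    public static void main(String[] args) {
        System.out.println("q3 doc1=" + score(doc("DocumentOne"), query3()));
        System.out.println("q3 doc2=" + score(doc("DocumentTwo"), query3()));
        System.out.println("q4 doc1=" + score(doc("DocumentOne"), query4()));
        System.out.println("q4 doc2=" + score(doc("DocumentTwo"), query4()));
    }
}
```

Under these assumptions the three place tokens each add 3, which is where the 9 comes from, and adding the name match on top gives 16 for doc1.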
Getting unique key of a document inside of a Similarity class.
Good afternoon.

I need to uniquely identify a document inside of a Similarity class during scoring. Is it possible to get the value of the unique key of a document at this point?

For some time I thought I could use the internal docID for that. The method score(int doc, float freq) is called after every query execution for each matched doc. For each indexed doc, the docID equals 0, 1, 2, etc. But this is only the case when documents are indexed in bulk, i.e. in a single HTTP request. When docs are indexed in separate requests, the docID equals 0 for all documents.

To summarize, here are my 2 final questions:

1. Is the docID behavior described above a bug or a feature? Obviously, if it's a bug and I can use docID to uniquely identify a document, then my question is answered once this bug is fixed.
2. If the docID behavior described above is normal, then what is an alternative way to uniquely identify a document inside of a Similarity class during scoring? Can I get the unique key of the document being scored in Similarity?

FYI: I asked the 1st question in the #solr IRC channel. The person named hoss answered the following:

: you're seeing the *internal* docIds ... you can't assign any special
: meaning to them ... i believe that at the level of the Similarity class,
: these may even be per segment, which means that in the context of a
: SegmentReader they can be used to get things like docValues, but they
: don't have any meaning compared to your uniqueKey (for example).

This kind of makes me think that the answer to the 1st question is that it's a feature. But I am still not sure, and I don't know the answer to the 2nd question.

Please help. Thank you very much in advance.
Re: Getting unique key of a document inside of a Similarity class.
: I need to uniquely identify a document inside of a Similarity class during
: scoring. Is it possible to get value of unique key of a document at this
: point?

Can you tell us a bit more about your usecase ... your problem description is a bit vague, and sounds like it may be an XY Problem...

https://people.apache.org/~hossman/#xyproblem

Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all?

See Also: http://www.perlmonks.org/index.pl?node_id=542341

: 1. Is docIds behavior described above a bug or a feature? Obviously, if it's a
: bug and I can use docID to uniquely identify a document, then my question is
: answered after this bug is fixed.
: 2. If docIds behavior described above is normal, then what is an alternative
: way of uniquely identify a document inside of a Similarity class during
: scoring? Can I get unique key of a scoring document in Similarity?

Assuming the method you are referring to (you didn't give a specific class/interface name) is SimScorer.score(doc,freq) then the javadocs say...

  doc - document id within the inverted index segment
  freq - sloppy term frequency

...so for #1, yes this is definitely the per-segment docId.

for #2: the method for providing a SimScorer to lucene is by implementing Similarity.simScorer(...) -- that method gets as an argument an AtomicReaderContext context, which not only has an AtomicReader for the individual segment, but also details about that segment's role in the larger index.

As far as getting the Solr uniqueKey ... it's non trivial, and there are different things you could do depending on what your ultimate goal is (ie: see my earlier question about XY problem) ...

my guess is from this low level down in the code you want to use DocValues (aka: FieldCache in older versions of lucene) on your uniqueKey field, then ask it for the field value of each internal docId that gets passed to your method -- either by using the per-segment DocValues, or by using the AtomicReaderContext's base information to determine the top level internal docId and use the top level DocValues/FieldCache.

(the per-segment vs top level DocValues and internalId stuff can be kind of confusing -- start with whichever seems simpler based on your understanding of the internal lucene/solr APIs and worry about maybe switching to the other approach later once you have something working and see if it helps or hinders performance for your usecases)

-Hoss
http://www.lucidworks.com/
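The "base information" step can be sketched in plain Java as follows. This is not Lucene code: SegmentSketch and GlobalDocIdSketch are hypothetical names, and the only thing borrowed from Lucene is the idea that each segment context carries an 'ord' and a 'docBase', so that top-level docId = docBase + per-segment docId. The sketch assumes each segment's docBase is the total size of the segments before it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-segment information a reader context carries.
class SegmentSketch {
    final int ord;      // position of the segment among the index's leaves
    final int docBase;  // top-level docId of the segment's first document
    final int maxDoc;   // number of document slots in the segment

    SegmentSketch(int ord, int docBase, int maxDoc) {
        this.ord = ord;
        this.docBase = docBase;
        this.maxDoc = maxDoc;
    }
}

class GlobalDocIdSketch {
    // Lays out segments back to back: each docBase is the sum of the earlier segment sizes.
    static List<SegmentSketch> layout(int... segmentSizes) {
        List<SegmentSketch> segments = new ArrayList<>();
        int docBase = 0;
        for (int ord = 0; ord < segmentSizes.length; ord++) {
            segments.add(new SegmentSketch(ord, docBase, segmentSizes[ord]));
            docBase += segmentSizes[ord];
        }
        return segments;
    }

    // Converts a per-segment docId into a top-level docId.
    static int toGlobal(SegmentSketch seg, int localDoc) {
        if (localDoc < 0 || localDoc >= seg.maxDoc) {
            throw new IllegalArgumentException("docId " + localDoc + " is outside segment " + seg.ord);
        }
        return seg.docBase + localDoc;
    }

    public static void main(String[] args) {
        // Three docs added in three separate commits: every per-segment docId is 0.
        for (SegmentSketch seg : layout(1, 1, 1)) {
            System.out.println("segment " + seg.ord + ": local 0 -> top-level " + toGlobal(seg, 0));
        }
    }
}
```

Note that a top-level docId derived this way only distinguishes documents within one open reader; segment merges can renumber documents, so it is suitable for per-query bookkeeping but not as a persistent key.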
Re: Getting unique key of a document inside of a Similarity class.
: 1. name:DocumentOne^7 = doc1(score=7)
: 2. name:DocumentOne^7 AND place:notExist^3 = doc1(score=7)
: 3. place:(34\ High\ Street)^3 = doc1(score=3), doc2(score=3)
: 4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 = doc1(score=10),
: doc2(score=3)
	...
: it's not clear why you need any sort of unique document identification for
: your scoring algorithm .. from what you described, matches on fieldA should
: get score A matches on fieldB should get score B ... why does it matter
: which doc is which?
:
: For case #3, for example, method SimScorer.score is called 3 times for each of
: these documents, total 6 times for both. I have added a
: ThreadLocal<HashSet<String>> to my custom similarity, which is cleared every
: time before new scoring session (after each query execution). This HashSet
: stores strings consisting of fieldName + docID. Every time score() is called,

Ah HA! ... this is why it's an XY problem... you've decided that you need a unique identifier for each doc so you can maintain a HashSet of all the times a doc matches a term in the query so you can count them ... you don't need to do any of that.

from all the examples of what you've described, i'm fairly certain all you really need is a TFIDF based Similarity where coord(), idf(), tf() and queryNorm() return 1 always, and you omitNorms from all fields.

that's it ... that should literally be everything you need to do.

(You didn't give any examples of what you expect to happen with exclusion clauses in your BooleanQueries, but the approach you were describing wouldn't give you any added advantages towards interesting MUST_NOT clauses either ... it would in fact only increase the scores for those docs in a way that is almost certainly not what you want)

-Hoss
http://www.lucidworks.com/
Re: Getting unique key of a document inside of a Similarity class.
Thank you for your answer, Chris. I will reply with inline comments as well. Please see below.

: : I need to uniquely identify a document inside of a Similarity class during
: : scoring. Is it possible to get value of unique key of a document at this
: : point?
:
: Can you tell us a bit more about your usecase ... your problem description
: is a bit vague, and sounds like it may be an XY Problem...

Sure, sorry I did not do it before, I just wanted to take a minimum of your valuable time. In my custom Similarity class I am trying to implement logic where the score calculation is based only on the field weight and a field match - that's it. In other words, if a field matches the query, I want the score method to return this field's weight only, regardless of factors like: norms; coord; doc frequencies; the fact that the field was multivalued and more than one value matched; the fact that the field was tokenized into multiple tokens and more than one token matched, etc. As far as I know, there is no such similarity among the existing ones.

In order to implement this, I am trying to score only once for a combination of a specific field + a unique doc identifier. And I don't care what this unique doc identifier is - it can be the unique key or it can be the internal doc ID. I had my implementation working, but as I understood from your answer, I had it working only for one segment. So now I need to add a segment ID or something like it to my combination.

: Assuming the method you are referring to (you didn't give a specific
: class/interface name) is SimScorer.score(doc,freq) then the javadocs say...
:
:   doc - document id within the inverted index segment
:   freq - sloppy term frequency
:
: ...so for #1, yes this is definitely the per-segment docId.

Yes, it's ExactSimScorer.score(int doc, int freq). Ah! Per segment! Here we go, now I understand why it's 0 after every new commit! The Solr documentation says new docs are written to a new segment. So question #1 is clear to me now. Thanks, Chris!

: for #2: the method for providing a SimScorer to lucene is by implementing
: Similarity.simScorer(...) -- that method gets as an argument an
: AtomicReaderContext context, which not only has an AtomicReader for the
: individual segment, but also details about that segment's role in the
: larger index.

Interesting details, that may be exactly what I need. If I can somehow uniquely identify a document using its internal doc id + data from the context (like a segment id or something), that would be awesome. I have checked AtomicReaderContext: it has 'ord' (the reader's ord in the top-level's leaves array) and 'docBase' (the reader's absolute doc base) - probably what I need.

Do you have any more information (maybe links to wikis) about AtomicReaderContext, DocValues, and the per-segment and top levels (other than the javadoc in the source code)? I have a high-level understanding, but it's obviously not enough for the problem I am solving. I would be more than happy to understand it.

Thank you very much for your time, Chris and the other people who spend time reading/answering this thread!
Re: Getting unique key of a document inside of a Similarity class.
: Sure, sorry I did not do it before, I just wanted to take minimum of your
: valuable time. So in my custom Similarity class I am trying to implement such
: a logic, where score calculation is only based on field weight and a field
: match - that's it. In other words, if a field matches the query, I want
: score method to return this field's weight only, regardless of factors like:
: norms; coord; doc frequencies; fact that field was multivalued and more than
: one value matched; fact that field was tokenized as multiple tokens and more
: than one token matched, etc. As far as I know, there is no such a similarity
: in list of existing ones.

how are you defining/specifying these field weights?

it would help if you could give a concrete example of some sample docs, a sample query, and what results you would expect ... the sample input and sample output of the system you are interested in.

: In order to implement this, I am trying to score only once for a combination
: of a specific field + doc unique identifier. And I don't care what is this
: unique doc identifier - it can be unique key or it can be internal doc ID.

it's not clear why you need any sort of unique document identification for your scoring algorithm .. from what you described, matches on fieldA should get score A matches on fieldB should get score B ... why does it matter which doc is which?

-Hoss
http://www.lucidworks.com/
Re: Getting unique key of a document inside of a Similarity class.
: how are you defining/specifying these field weights?

I define the weights inside of the query (name:SomeName^7).

: it would help if you could give a concrete example of some sample docs, a
: sample query, and what results you would expect ... the sample input and
: sample output of the system you are interested in.

Sure. Imagine we have 2 docs:

doc1 - name:DocumentOne place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)
doc2 - name:DocumentTwo place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

I want the following queries to return docs with these scores:

1. name:DocumentOne^7 = doc1(score=7)
2. name:DocumentOne^7 AND place:notExist^3 = doc1(score=7)
3. place:(34\ High\ Street)^3 = doc1(score=3), doc2(score=3)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 = doc1(score=10), doc2(score=3)

If you're curious about why I need it, i.e. about my very initial problem X: I need this scoring to be able to calculate a matching percentage. That's a separate topic; I have read a lot about it (including http://wiki.apache.org/lucene-java/ScoresAsPercentages) and people say it's either not doable or very complicated with Solr. So I just want to give it a try. For case #3 above, the matching percentage is 100% for both docs. For case #4 it's doc1: 100% and doc2: 30%.

: it's not clear why you need any sort of unique document identification for
: your scoring algorithm .. from what you described, matches on fieldA should
: get score A matches on fieldB should get score B ... why does it matter
: which doc is which?

For case #3, for example, the method SimScorer.score is called 3 times for each of these documents, 6 times in total for both. I have added a ThreadLocal<HashSet<String>> to my custom similarity, which is cleared every time before a new scoring session (after each query execution). This HashSet stores strings consisting of fieldName + docID. Every time score() is called, I check this HashSet: if fieldName + docID already exists, I return 0 as the score, otherwise the field weight.

If there were no docID in this string (only the field name), then case #3 would return the following: doc1(score=3), doc2(score=0). If there were no HashSet at all, case #3 would return doc1(score=9), doc2(score=9), since the query matched all 3 tokens for every doc.

I know that what I'm doing is a hack, but that's the only way I've found so far to implement percentage matching. I just want to play around with it, see how it performs and decide whether to use it or not. But for that I need to uniquely identify a document while scoring :)
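For readers following along, here is a minimal, Lucene-free sketch of the score-once idea for case #4. ScoreOnceSketch and its methods are hypothetical names; the key is built from the field name plus (docBase + per-segment docId), which is an assumption about how the segment could be folded into the key, not the poster's actual code. The percentage is taken relative to the sum of all query weights (7 + 3 = 10), which is also an assumption based on the numbers given above.

```java
import java.util.HashSet;
import java.util.Set;

class ScoreOnceSketch {
    // Keys already scored in the current query; the thread keeps this in a ThreadLocal.
    private final Set<String> seen = new HashSet<>();

    // Returns the field weight the first time a (field, document) pair is seen, 0 afterwards.
    float score(String field, int docBase, int localDoc, float weight) {
        String key = field + ':' + (docBase + localDoc);
        return seen.add(key) ? weight : 0f;
    }

    // Must be called before each new query so earlier matches do not suppress new ones.
    void reset() {
        seen.clear();
    }

    // Matching percentage relative to the highest achievable score for the query.
    static float percent(float score, float maxScore) {
        return 100f * score / maxScore;
    }

    // Replays case #4 with doc1 and doc2 in separate one-document segments (docBase 0 and 1).
    static float[] runCase4() {
        ScoreOnceSketch scorer = new ScoreOnceSketch();
        scorer.reset();
        float doc1 = scorer.score("name", 0, 0, 7f);
        float doc2 = 0f;
        for (int token = 0; token < 3; token++) {   // "34", "High", "Street" all match
            doc1 += scorer.score("place", 0, 0, 3f);
            doc2 += scorer.score("place", 1, 0, 3f);
        }
        return new float[] {doc1, doc2};
    }

    public static void main(String[] args) {
        float[] scores = runCase4();
        System.out.println("doc1=" + scores[0] + " (" + percent(scores[0], 10f) + "%)");
        System.out.println("doc2=" + scores[1] + " (" + percent(scores[1], 10f) + "%)");
    }
}
```

Because both documents have per-segment docId 0 here, leaving docBase out of the key would make doc2 collide with doc1 on the place field, which is the multi-segment problem discussed in this thread.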