Re: SOLR upgrade
Hi, following up on Charlie's detailed response, I would recommend carefully assessing the code you are using to interact with Apache Solr (on top of the Solr changes themselves). Assuming you are using some sort of client, it's extremely important to fully understand both the syntax and the semantics of each call. I have seen a lot of "compiling ok" search-API migrations that were fine syntactically but a disaster from the semantic perspective (missing important parameters etc.). In case you have plugins to maintain, this would be even more complicated than just making them compile. Regards -- Alessandro Benedetti Apache Lucene/Solr Committer Director, R&D Software Engineer, Search Consultant www.sease.io On Tue, 9 Feb 2021 at 11:01, Charlie Hull wrote: > Hi Lulu, > > I'm afraid you're going to have to recognise that Solr 5.2.1 is very > out-of-date and the changes between this version and the current 8.x > releases are significant. A direct jump is I think the only sensible > option. > > Although you could take the current configuration and attempt to upgrade > it to work with 8.x, I recommend that you should take the chance to look > at your whole infrastructure (from data ingestion through to query > construction) and consider what needs upgrading/redesigning for both > performance and future-proofing. You shouldn't just attempt a > lift-and-shift of the current setup - some things just won't work and > some may lock you into future issues. If you're running at large scale > (I've talked to some people at the BL before and I know you have some > huge indexes there!) then a redesign may be necessary for scalability > reasons (cost and feasibility). You should also consider your skills > base and how the team can stay up to date with Solr changes and modern > search practice. > > Hope this helps - this is a common situation which I've seen many times > before, you're certainly not the oldest version of Solr running I've > seen recently either! 
> > best > > Charlie > > On 09/02/2021 01:14, Paul, Lulu wrote: > > Hi SOLR team, > > > > Please may I ask for advice regarding upgrading the SOLR version (our > project currently running on solr-5.2.1) to the latest version? > > What are the steps, breaking changes and potential issues ? Could this > be done as an incremental version upgrade or a direct jump to the newest > version? > > > > Much appreciate the advice, Thank you! > > > > Best Wishes > > Lulu > > -- > Charlie Hull - Managing Consultant at OpenSource Connections Limited > > Founding member of The Search Network <https://thesearchnetwork.com/> > and co-author of Searching the Enterprise > <https://opensourceconnections.com/about-us/books-resources/> > tel/fax: +44 (0)8700 118334 > mobile: +44 (0)7767 825828 >
Re: Extremely Small Segments
Hi Yasoob, Can you check in the log when hard commits really happen? I have sometimes ended up with the auto soft/hard commit config in the wrong place of the solrconfig.xml and got unexpected behaviour for that reason. Your assumptions are correct: the ramBuffer flushes as soon as one of the thresholds is met for memory/doc count. For the auto-commit it's the same, but for time/docs. Are you sure there's no additional commit happening? Do you see those numbers on all shards/replicas? Which kind of replica are you using? Sharding a 10GB index may not be necessary; do you have any evidence you had to shard your index? Any performance benchmark? Cheers -- Alessandro Benedetti Apache Lucene/Solr Committer Director, R&D Software Engineer, Search Consultant www.sease.io On Fri, 12 Feb 2021 at 13:44, yasoobhaider wrote: > Hi > > I am migrating from master slave to Solr Cloud but I'm running into > problems > with indexing. > > Cluster details: > > 8 machines of 64GB memory, each hosting 1 replica. > 4 shards, 2 replica of each. Heap size is 16GB. > > Collection details: > > Total number of docs: ~250k (but only 50k are indexed right now) > Size of collection (master slave number for reference): ~10GB > > Our collection is fairly heavy with some dynamic fields with high > cardinality (of order of ~1000s), which is why the large heap size for even > a small collection. > > Relevant solrconfig settings: > > commit settings: > > > 1 > 360 > false > > > > ${solr.autoSoftCommit.maxTime:180} > > > index config: > > 500 > 1 > > class="org.apache.solr.index.TieredMergePolicyFactory"> > 10 > 10 > > > > class="org.apache.lucene.index.ConcurrentMergeScheduler"> > 6 > 4 > > > > Problem: > > I setup the cloud and started indexing at the throughput of our earlier > master-slave setup, but soon the machines ran into full blown Garbage > Collection. This throughput was not a lot though. We index the whole > collection overnight, so roughly ~250k documents in 6 hours. That's roughly > 12rps. 
> > So now I'm doing indexing at an extremely slow rate trying to find the > problem. > > Currently I'm indexing at 1 document/2seconds, so every minute ~30 > documents. > > Observations: > > 1. I'm noticing extremely small segments in the segments UI. Example: > > Segment _1h4: > #docs: 5 > #dels: 0 > size: 1,586,878 bytes > age: 2021-02-12T11:05:33.050Z > source: flush > > Why is lucene creating such small segments? I understood that segments are > created when ramBufferSizeMB or maxBufferedDocs limit is hit. Or on a hard > commit. Neither of those should lead to such small segments. > > 2. The index/ directory has a large number of files. For one shard with 30k > documents & 1.5GB size, there are ~450-550 files in this directory. I > understand that each segment is composed of a bunch of files. Even > accounting for that, the number of segments seems very large. > > Note: Nothing out of the ordinary in logs. Only /update request logs. > > Please help with making sense of the 2 observations above. > > > > -- > Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html >
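For anyone debugging the same symptom, a quick way to reason about it is to separate flush-born segments from merge-born ones, as the Segments UI data above allows. A minimal sketch; the dictionary shape below mirrors the fields shown in the UI (name, size, docs, source) and is an assumption, not the exact /admin/segments response format:

```python
# Hypothetical sketch: flag suspiciously small segments that were
# created by a flush (ramBuffer/commit) rather than by a merge.

def small_flush_segments(segments, max_bytes=10 * 1024 * 1024):
    """Return names of segments created by 'flush' smaller than max_bytes."""
    return [s["name"] for s in segments
            if s["source"] == "flush" and s["sizeInBytes"] < max_bytes]

# Sample data shaped like the Segments UI entries quoted above.
segments = [
    {"name": "_1h4", "sizeInBytes": 1_586_878, "docs": 5, "source": "flush"},
    {"name": "_1h5", "sizeInBytes": 512_000_000, "docs": 40_000, "source": "merge"},
]
print(small_flush_segments(segments))  # ['_1h4']
```

If most of the index is made of such flush segments, the trigger is almost always an unexpected (auto)commit rather than the ramBuffer thresholds.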
Re: Re: Interpreting Solr indexing times
I agree, documents may be gigantic or very small, with heavy text analysis or simple strings... so it's not possible to give an estimate here. But you could make use of the nightly benchmarks to get an idea of Lucene indexing speed (the engine inside Apache Solr): http://home.apache.org/~mikemccand/lucenebench/indexing.html Not sure we have something similar for Apache Solr officially. https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceData -> this is likely a bit outdated. Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
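Since no generic number is meaningful without context, a back-of-the-envelope throughput calculation against your own corpus is usually more honest than quoting a benchmark. A trivial sketch (the batch figures are invented for illustration):

```python
# Average indexing rate for a batch run: handy for sanity-checking
# any published benchmark number against your own situation.

def docs_per_second(total_docs, hours):
    """Documents indexed per second for a batch that runs for `hours`."""
    return total_docs / (hours * 3600)

# e.g. a hypothetical nightly batch of 250k documents finishing in 6 hours:
print(round(docs_per_second(250_000, 6), 1))  # 11.6
```

Comparing that rate across your own schema changes (analysis chains, field counts) tells you far more than any external figure.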
Re: leader election stuck after hosts restarts
I faced these problems a while ago; at the time I wrote a blog post which I hope could help: https://sease.io/2018/05/solrcloud-leader-election-failing.html ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: QueryResponse ordering
Hi Srinivas, Filter queries don't impact scoring, only matching. So, what is the ordering you are expecting? A bq (boost query) parameter will add a clause to the query, impacting the score in an additive way. The query you posted is a bit confusing; what was your intent there? To boost search results having "abc" as the PARTY.PARTY.ID? https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebq_BoostQuery_Parameter ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
[Free Online Meetups] London Information Retrieval Meetup
Hi all, The London Information Retrieval Meetup has moved online: https://www.meetup.com/London-Information-Retrieval-Meetup-Group It is a free evening meetup aimed at Information Retrieval enthusiasts and professionals who are curious to explore and discuss the latest trends in the field. It is technology agnostic, but you'll find many talks on Apache Solr and related technologies. Tomorrow (03.11 at 6:10 pm UK time) we will host the sixth London Information Retrieval meetup (fully remote). We will have two talks: *Talk 1* "Feature Extraction for Large-Scale Text Collections" from Luke Gallagher, PhD candidate, RMIT University *Talk 2* "A Learning to Rank Project on a Daily Song Ranking Problem" from Ilaria Petreti (IR/ML Engineer, Sease) and Anna Ruggero (R&D Software Engineer, Sease) If you fancy some Search Stories, feel free to register here: https://www.meetup.com/London-Information-Retrieval-Meetup-Group/events/273905485/ Cheers, have a nice evening! ------ Alessandro Benedetti Search Consultant, R&D Software Engineer, Director www.sease.io
Re: How to get boosted field and values?
Hi Taisuke, there are various ways of approaching boosting and scoring in Apache Solr. First of all you must decide if you are interested in a multiplicative or additive boost. A multiplicative boost will multiply the score of your search result by a certain factor, while an additive one will just add the factor to the final score. Using advanced query parsers such as the dismax and edismax you can use: *boost* parameter - multiplicative - takes a function as input - https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html#TheExtendedDisMaxQueryParser-TheboostParameter *bq* (boost query) - additive - https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebq_BoostQuery_Parameter *bf* (boost function) - additive - https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebf_BoostFunctions_Parameter This blog post is old but should help: https://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/ Then you can boost fields or even specific query clauses: 1) https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqf_QueryFields_Parameter 2) q=features:2^1.0 AND features:3^5.0 1.0 is the default: you are multiplying the score contribution of the term by 1.0, so no effect. features:3^5.0 means that the score contribution of a match for the term '3' in the field 'features' will be multiplied by 5.0 (you can also see this by enabling debug=results). Finally you can force the score contribution of a term to be a constant; it's not recommended unless you are truly confident you don't need other types of scoring: q=features:2^=1.0 AND features:3^=5.0 In this example your document id: 3 will have a score of 6.0. Not sure if this answers your question; if not, feel free to elaborate more. 
Cheers -- Alessandro Benedetti Search Consultant, R Software Engineer, Director www.sease.io On Thu, 19 Mar 2020 at 11:18, Taisuke Miyazaki wrote: > I'm using Solr 7.5.0. > I want to get boosted field and values per documents. > > e.g. > documents: > id: 1, features: [1] > id: 2, features: [1,2] > id: 3, features: [1,2,3] > > query: > bq: features:2^1.0 AND features:3^1.0 > > I expect results like below. > boosted: > - id: 2 > - field: features, value: 2 > - id: 3 > - field: features, value: 2 > - field: features, value: 3 > > I have an idea that set boost score like bit-flag, but it's not good I > think because I must send query twice. > > bit-flag: > bq: features:2^2.0 AND features:3^4.0 > docs: > - id: 1, score: 1.0(0x001) > - id: 2, score: 3.0(0x011) # have feature:2(2nd bit is 1) > - id: 3, score: 7.0(0x111) # have feature:2 and feature:3(2nd and 3rd > bit are 1) > check score value then I can get boosted field. > > Is there a better way? >
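Taisuke's bit-flag workaround from the message above can be sketched in a few lines. This assumes constant boosts (the ^= syntax) so the final score is an exact bit pattern on top of a known base score contributed by the main query; the weights and base are illustrative:

```python
# Sketch of the bit-flag idea: give each boosted term a distinct
# power-of-two constant score, then recover which terms matched a
# document from its total score. Assumes the base query contributes
# exactly 1.0 and the boosts are constants (features:2^=2.0 etc.),
# so the arithmetic is exact.

WEIGHTS = {"features:2": 2, "features:3": 4}

def matched_terms(score, base=1.0):
    """Decode which boosted terms a document matched from its score."""
    bits = int(score - base)
    return [term for term, weight in WEIGHTS.items() if bits & weight]

print(matched_terms(7.0))  # ['features:2', 'features:3']
print(matched_terms(3.0))  # ['features:2']
```

The decoding only needs one query, but as noted in the thread it is fragile: any other score contribution breaks the bit pattern, which is why constant (^=) boosts are essential here.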
Re: Re: Anyone have experience with Query Auto-Suggestor?
I have been working extensively on query autocompletion; these blog posts should be helpful to you: https://sease.io/2015/07/solr-you-complete-me.html https://sease.io/2018/06/apache-lucene-blendedinfixsuggester-how-it-works-bugs-and-improvements.html Your idea of using search quality evaluation to drive the autocompletion is interesting. How do you currently calculate the nDCG for a query? What's your ground truth? Using that approach you will autocomplete favouring query completions that your search engine is able to process better, not necessarily those closer to the user intent; still, it could work. We should differentiate here between the suggester dictionary (where the suggestions come from; in your case it could be your extracted data) and the kind of suggestion (in your case it could be the free-text suggester lookup). Cheers -- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director www.sease.io On Mon, 20 Jan 2020 at 17:02, David Hastings wrote: > Not a bad idea at all, however ive never used an external file before, just > a field in the index, so not an area im familiar with > > On Mon, Jan 20, 2020 at 11:55 AM Audrey Lorberfeld - > audrey.lorberf...@ibm.com wrote: > > > David, > > > > Thank you, that is useful. So, would you recommend using a (clean) field > > over an external dictionary file? We have lots of "top queries" and > measure > > their nDCG. A thought was to programmatically generate an external file > > where the weight per query term (or phrase) == its nDCG. Bad idea? 
> > > > Best, > > Audrey > > > > On 1/20/20, 11:51 AM, "David Hastings" > > wrote: > > > > Ive used this quite a bit, my biggest piece of advice is to choose a > > field > > that you know is clean, with well defined terms/words, you dont want > an > > autocomplete that has a massive dictionary, also it will make the > > start/reload times pretty slow > > > > On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld - > > audrey.lorberf...@ibm.com wrote: > > > > > Hi All, > > > > > > We plan to incorporate a query autocomplete functionality into our > > search > > > engine (like this: > > > https://lucene.apache.org/solr/guide/8_1/suggester.html > > > ). And I was wondering if anyone has personal experience with this > > > component and would like to share? Basically, we are just looking > > for some > > > best practices from more experienced Solr admins so that we have a > > starting > > > place to launch this in our beta. > > > > > > Thank you! > > > > > > Best, > > > Audrey > > > > > > > > > >
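Audrey's idea of programmatically generating a weighted external file could look like the sketch below. The tab-separated "suggestion<TAB>weight" layout is an assumption based on Solr's file dictionary format, so double-check it against the suggester documentation for your version; the queries and nDCG values are invented:

```python
# Sketch: emit a suggester dictionary file where each top query is
# weighted by its measured nDCG. Weights are scaled to integers so
# the relative ordering survives.

ndcg_by_query = {"solar panels": 0.91, "solr cloud setup": 0.74}

def dictionary_lines(scores, scale=1000):
    """One 'suggestion<TAB>weight' line per query, sorted for stable output."""
    return [f"{q}\t{round(ndcg * scale)}" for q, ndcg in sorted(scores.items())]

for line in dictionary_lines(ndcg_by_query):
    print(line)
```

Writing these lines to the suggester's source file and rebuilding the suggester would then rank completions by retrieval quality, with the caveat discussed above about user intent.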
Re: Query Regarding SOLR cross collection join
From the Join Query Parser code:

// most of these statistics are only used for the enum method
int fromSetSize;          // number of docs in the fromSet (that match the from query)
long resultListDocs;      // total number of docs collected
int fromTermCount;
long fromTermTotalDf;
int fromTermDirectCount;  // number of fromTerms that were too small to use the filter cache
int fromTermHits;         // number of fromTerms that intersected the from query
long fromTermHitsTotalDf; // sum of the df of the matching terms
int toTermHits;           // num of intersecting from terms that match a term in the to field
long toTermHitsTotalDf;   // sum of the df for the toTermHits
int toTermDirectCount;    // number of toTerms that we set directly on a bitset rather than doing set intersections
int smallSetsDeferred;    // number of small sets collected to be used later to intersect w/ bitset or create another small set

The toSetSize has nothing to do with MB of data read from the index: it is the size, in number of documents, of the resulting document set. Improving this would require a much deeper analysis I reckon, starting from your query and your data model down to the architecture involved. Cheers ------ Alessandro Benedetti Search Consultant, R&D Software Engineer, Director www.sease.io On Wed, 22 Jan 2020 at 13:27, Doss wrote: > HI, > > SOLR version 8.3.1 (10 nodes), zookeeper ensemble (3 nodes) > > One of our use cases requires joins, we are joining 2 large indexes. As > required by SOLR one index (2GB) has one shared and 10 replicas and the > other has 10 shard (40GB / Shard). > > The query takes too much time, some times in minutes how can we improve > this? 
> > Debug query produces one or more based on the number of shards (i believe) > > "time":303442, > "fromSetSize":0, > "toSetSize":81653955, > "fromTermCount":0, > "fromTermTotalDf":0, > "fromTermDirectCount":0, > "fromTermHits":0, > "fromTermHitsTotalDf":0, > "toTermHits":0, > "toTermHitsTotalDf":0, > "toTermDirectCount":0, > "smallSetsDeferred":0, > "toSetDocsAdded":0}, > > here what is the toSetSize mean? does it read 81MB of data from the > index? how can we reduce this? > > Read somewhere that the score join parser will be faster, but for me it > produces no results. I am using string type fields for from and to. > > > Thanks! >
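For reference, the two join flavours discussed in this thread differ only in their local params. A hedged sketch of both, expressed as q parameters; the collection and field names are hypothetical, and the score join requires from/to fields of suitable (non-tokenised) types as the thread notes:

```python
# Hypothetical join queries: find documents in the main collection whose
# party_id matches the id of supplier documents in a "parties" collection.

# The enum-method join discussed above (the one emitting the statistics
# like toSetSize in the debug output).
enum_join = {
    "q": "{!join from=party_id to=id fromIndex=parties}type:supplier",
}

# The score join variant mentioned at the end of the message: adding a
# score local param switches to a different implementation; score=none
# keeps it behaving as a filter.
score_join = {
    "q": "{!join from=party_id to=id fromIndex=parties score=none}type:supplier",
}
print(score_join["q"])
```

When the score join returns no results, the usual suspect is the from/to field types; both sides must hold comparable, untokenised values.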
Re: Is it possible to add stemming in a text_exact field
Edward is correct; furthermore, using a stemmer in an analysis chain that doesn't tokenise is going to work just for single-term queries and single-term field values... Not sure that was intended. Cheers -- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director www.sease.io On Wed, 22 Jan 2020 at 16:26, Edward Ribeiro wrote: > Hi, > > One possible solution would be to create a second field (e.g., > text_general) that uses DefaultTokenizer, or other tokenizer that breaks > the string into tokens, and use a copyField to copy the content from > text_exact to text_general. Then, you can use edismax parser to search both > fields, but giving text_exact a higher boost (qf=text_exact^5 > text_general). In this case, both fields should be indexed, but only one > needs to be stored. > > Edward > > On Wed, Jan 22, 2020 at 10:34 AM Dhanesh Radhakrishnan > > wrote: > > > Hello, > > I'm facing an issue with stemming. > > My search query is "restaurant dubai" and returns results. > > If I search "restaurants dubai" it returns no data. > > > > How to stem this keyword "restaurant dubai" with "restaurants dubai" ? > > > > I'm using a text exact field for search. > > > > > multiValued="true" omitNorms="false" omitTermFreqAndPositions="false"/> > > > > Here is the field definition > > > > > positionIncrementGap="100"> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Is there any solutions without changing the tokenizer class. > > > > > > > > > > Dhanesh S.R > >
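Edward's two-field approach translates into query parameters like the sketch below; the field names follow the thread, while the boost factor is purely illustrative:

```python
# Hypothetical edismax request for the copyField solution: search both
# the exact field and a tokenised/stemmed copy, boosting exact matches
# so they rank first while stemmed matches ("restaurants" -> "restaurant")
# still recall.
params = {
    "defType": "edismax",
    "q": "restaurants dubai",
    "qf": "text_exact^5 text_general",
}
print(params["qf"])
```

The exact field keeps its current analysis untouched; only the copy carries the tokenizer and stemmer, which is why no change to the existing tokenizer class is needed.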
Re: Spell check with data from database and not from english dictionary
Hi Seetesh, As you can see from the wiki [1], there are mainly two input sources for a spellcheck dictionary: 1) a file 2) the index (in a couple of different forms) If you prefer the file approach, it's up to you to produce the file, and you can certainly use whatever you like to fill it with data: it could come from an English dictionary or from a database. [1] https://lucene.apache.org/solr/guide/8_4/spell-checking.html -- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director www.sease.io On Thu, 23 Jan 2020 at 06:06, seeteshh wrote: > Hello all, > > Can the spell check feature be configured with words/data fetched from a > database and not from the English dictionary? > > Regards, > > Seetesh Hindlekar > > > > - > Seetesh Hindlekar > -- > Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html >
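For option 1, producing the file from a database is just a matter of dumping the vocabulary one word per line, which is the plain-text format the file-based spellchecker reads. A sketch with sqlite3 standing in for whatever database actually holds the vocabulary; the table and column names are hypothetical:

```python
# Sketch: build a file-based spellcheck dictionary (one word per line)
# from a database instead of an English word list.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany("INSERT INTO products VALUES (?)",
                 [("restaurant",), ("dubai",), ("hotel",)])

# De-duplicate and sort, then join into the file contents.
words = sorted({row[0] for row in conn.execute("SELECT name FROM products")})
dictionary = "\n".join(words)   # write this out as e.g. spellings.txt
print(dictionary)
```

The resulting file is then referenced from the spellcheck component's sourceLocation in solrconfig.xml; regenerate and rebuild whenever the database vocabulary changes.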
Re: [Apache Solr ReRanking] Sort Clauses Bug
Personally I was expecting the sort request parameter to be applied to the final search results: 1) run the original query, get the top K based on score 2) run the rerank query on the top K, recalculating the scores 3) finally apply the sort But when you mentioned "you expect the sort specified to be applied to both the “outer” and “inner” queries", I changed my mind: it is probably a better solution to give the user nice flexibility in controlling both the original query sort (to affect the top K retrieval) and the final sort (the one sorting the reranked results). *Currently the 'sort' global request parameter affects the way the top K are retrieved, then they are re-ranked.* Unfortunately the workaround you suggested through the local params of the rerank query parser doesn't seem to work at all in 8.1.1 :( unless it was introduced in 8.2. I think it is a good idea to create the Jira issue, with this in mind: 1) we want to be able to decide the sort for both the original query (to assess the top K) and the final results 2) we need to decide which request parameter should do what e.g. should the 'sort' request param affect *the original query* OR the final results? should the 'sort' in the local params of the reRank query parser affect the original query OR *the final results*? In bold my personal preference, but I don't have any hard position in this regard. Cheers -- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director www.sease.io On Thu, Sep 26, 2019 at 5:23 PM Erick Erickson wrote: > OK so to restate, you expect the sort specified to be applied to both the > “outer” and “inner” queries. Makes sense, seems like a good enhancement. > > Hmm, I wonder if you can put the sort parameter in with the rerank > specification, like: q={!rerank reRankQuery=$rqq reRankDocs=1200 > reRankWeight=3 sort="score desc, downloads desc”} > > That doesn’t address your initial point, just curious if it’d do as a > workaround meanwhile. 
> > Best, > Erick > > > > On Sep 26, 2019, at 10:54 AM, Alessandro Benedetti > wrote: > > > > In the first OK scenario, the search results are sorted with score desc, > > and when the score is identical, the secondary sort field is applied. > > > > In the KO scenario, only score desc is taken into consideration(the > > reranked score) , the secondary sort by the sort field is ignored. > > > > I suspect an intuitive expected result would be to have the same > behaviour > > that happens with no reranking, so: > > 1) sort of the final results by reranked score desc > > 2) when identical raranked score, sort by secondat sort field > > > > Is it clearer? > > Any wrong assumption? > > > > > > On Thu, 26 Sep 2019, 14:34 Erick Erickson, > wrote: > > > >> Hmmm, can we see a bit of sample output? I always have to read this > >> backwards, the outer query results are sent to the inner query, so my > >> _guess_ is that the sort is applied to the “q=*:*” and then the top > 1,200 > >> are sorted by score by the rerank. But then I’m often confused about > this. > >> > >> Erick > >> > >>> On Sep 25, 2019, at 5:47 PM, Alessandro Benedetti < > a.benede...@sease.io> > >> wrote: > >>> > >>> Hi all, > >>> I was playing a bit with the reranking capability and I discovered > that: > >>> > >>> *Sort by score, then by secondary field -> OK* > >>> http://localhost:8983/solr/books/select?q=vegeta ssj&*sort=score > >>> desc,downloads desc*=id,title,score,downloads > >>> > >>> *ReRank, Sort by score, then by secondary field -> KO* > >>> http://localhost:8983/solr/books/select?q=*:*={!rerank > >> reRankQuery=$rqq > >>> reRankDocs=1200 reRankWeight=3}=(vegeta ssj)&*sort=score > >> desc,downloads > >>> desc*=id,title,score,downloads > >>> > >>> Is this intended? It sounds counter-intuitive to me and I wanted to > check > >>> before opening a Jira issue > >>> Tested on 8.1.1 but it should be in master as well. 
> >>> > >>> Regards > >>> -- > >>> Alessandro Benedetti > >>> Search Consultant, R Software Engineer, Director > >>> www.sease.io > >> > >> > >
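For readers following the thread, the KO case is easier to see as discrete request parameters than as a URL; the values below are copied from the original report, with the rerank query referenced via parameter dereferencing ($rqq):

```python
# The reported KO configuration as request parameters: the rerank local
# params run a rescoring pass over the top K docs, while the global sort
# (per the thread) only shapes how the top K are retrieved, so the
# secondary 'downloads desc' clause is lost after reranking.
params = {
    "q": "*:*",
    "rq": "{!rerank reRankQuery=$rqq reRankDocs=1200 reRankWeight=3}",
    "rqq": "(vegeta ssj)",
    "sort": "score desc, downloads desc",
    "fl": "id,title,score,downloads",
}
print(params["rq"])
```

Dropping the rq/rqq pair reproduces the OK case, where the secondary sort field is honoured on ties.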
Re: [Apache Solr ReRanking] Sort Clauses Bug
In the first OK scenario, the search results are sorted by score desc, and when the score is identical, the secondary sort field is applied. In the KO scenario, only score desc is taken into consideration (the reranked score); the secondary sort by the sort field is ignored. I suspect an intuitive expected result would be to have the same behaviour that happens with no reranking, so: 1) sort of the final results by reranked score desc 2) when the reranked score is identical, sort by the secondary sort field Is it clearer? Any wrong assumption? On Thu, 26 Sep 2019, 14:34 Erick Erickson, wrote: > Hmmm, can we see a bit of sample output? I always have to read this > backwards, the outer query results are sent to the inner query, so my > _guess_ is that the sort is applied to the “q=*:*” and then the top 1,200 > are sorted by score by the rerank. But then I’m often confused about this. > > Erick > > > On Sep 25, 2019, at 5:47 PM, Alessandro Benedetti > wrote: > > > > Hi all, > > I was playing a bit with the reranking capability and I discovered that: > > > > *Sort by score, then by secondary field -> OK* > > http://localhost:8983/solr/books/select?q=vegeta ssj&*sort=score > > desc,downloads desc*&fl=id,title,score,downloads > > > > *ReRank, Sort by score, then by secondary field -> KO* > > http://localhost:8983/solr/books/select?q=*:*&rq={!rerank > reRankQuery=$rqq > > reRankDocs=1200 reRankWeight=3}&rqq=(vegeta ssj)&*sort=score > desc,downloads > > desc*&fl=id,title,score,downloads > > > > Is this intended? It sounds counter-intuitive to me and I wanted to check > > before opening a Jira issue > > Tested on 8.1.1 but it should be in master as well. > > > > Regards > > -- > > Alessandro Benedetti > > Search Consultant, R&D Software Engineer, Director > > www.sease.io > >
Re: Need more info on MLT (More Like This) feature
In addition to all the valuable information already shared, I am curious to understand why you think the results are unreliable. Most of the time it's the parameters that cause some of the terms of the original document/corpus to be ignored (as simple as the min/max document frequency to consider, or the min term frequency in the source doc). I have been working a lot on the MLT in the past years, presenting the work done (and the internals) at various conferences/meetups. I'll share some slides and some Jira issues that may help you: https://www.youtube.com/watch?v=jkaj89XwHHw&t=540s https://www.slideshare.net/SeaseLtd/how-the-lucene-more-like-this-works https://issues.apache.org/jira/browse/LUCENE-8326 https://issues.apache.org/jira/browse/LUCENE-7802 https://issues.apache.org/jira/browse/LUCENE-7498 Generally speaking I favour the MLT query parser: it builds the MLT query and gives you the chance to see it using the debug query. ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
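A minimal MLT query parser request, as favoured above, might look like the sketch below; the field name and document id are purely illustrative, and debugQuery is what exposes the MLT query that gets built:

```python
# Hypothetical MLT query parser request: build a "more like this" query
# from the stored content of document 'mydoc1', using only the
# description field, and show the generated query in the debug output.
params = {
    "q": "{!mlt qf=description mintf=1 mindf=1 maxqt=5}mydoc1",
    "fl": "id,score",
    "debugQuery": "true",
}
print(params["q"])
```

Inspecting the parsed query in the debug section is the quickest way to see which terms survived the mintf/mindf thresholds, which is usually where "unreliable" results come from.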
[Apache Solr ReRanking] Sort Clauses Bug
Hi all, I was playing a bit with the reranking capability and I discovered that: *Sort by score, then by secondary field -> OK* http://localhost:8983/solr/books/select?q=vegeta ssj&*sort=score desc,downloads desc*=id,title,score,downloads *ReRank, Sort by score, then by secondary field -> KO* http://localhost:8983/solr/books/select?q=*:*={!rerank reRankQuery=$rqq reRankDocs=1200 reRankWeight=3}=(vegeta ssj)&*sort=score desc,downloads desc*=id,title,score,downloads Is this intended? It sounds counter-intuitive to me and I wanted to check before opening a Jira issue Tested on 8.1.1 but it should be in master as well. Regards ------ Alessandro Benedetti Search Consultant, R Software Engineer, Director www.sease.io
Re: MLT - unexpected design choice
Hi Maria, this is actually a great catch! I have been working a lot on the More Like This and this mistake never caught my attention. I agree with you; feel free to open a Jira issue. First of all, what you say makes sense. Secondly, it is the standard way used in the Lucene similarity calculations:

public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
  final long df = termStats.docFreq();
  final long docCount = collectionStats.docCount();
  final float idf = idf(df, docCount);
  return Explanation.match(idf, "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
      Explanation.match(df, "docFreq, number of documents containing term"),
      Explanation.match(docCount, "docCount, total number of documents with field"));
}

Indeed the int numDocs = ir.numDocs(); should actually be allocated per term in the for loop, using the field stats, something like: numDocs = ir.getDocCount(fieldName) Feel free to open the Jira issue and attach a patch with at least a test case that shows the bugfix. I will be available to do the review. Cheers ------ Alessandro Benedetti Search Consultant, R&D Software Engineer, Director www.sease.io On Tue, Jan 29, 2019 at 11:41 AM Matt Pearce wrote: > Hi Maria, > > Would it help to add a filter to your query to restrict the results to > just those where the description field is populated? Eg. add > > fq=description:[* TO *] > > to your query parameters. > > Apologies if I'm misunderstanding the problem! > > Best, > > Matt > > > On 28/01/2019 16:29, Maria Mestre wrote: > > Hi all, > > > > First of all, I’m not a Java developer, and a SolR newbie. I have worked > with Elasticsearch for some years (not contributing, just as a user), so I > think I have the basics of text search engines covered. I am always > learning new things though! > > > > I created an index in SolR and used more-like-this on it, by passing a > document_id. 
My data has a special feature, which is that one of the fields > is called “description” but is only populated about 10% of the time. Most > of the time it is empty. I am using that field to query similar documents. > > > > So I query the /mlt endpoint using these parameters (for example): > > > > {q=id:"0c7c4d74-0f37-44ea-8933-cd2ee7964457”, > > mlt=true, > > mlt.fl=description, > > mlt.mindf=1, > > mlt.mintf=1, > > mlt.maxqt=5, > > wt=json, > > mlt.interestingTerms=details} > > > > The issue I have is that when retrieving the key scored terms > (interestingTerms), the code uses the total number of documents in the > index, not the total number of documents with populated “description” > field. This is where it’s done in the code: > https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651 > > > > The effect of this choice is that the “idf” does not vary much, given > that numDocs >> number of documents with “description”, so the key terms > end up being just the terms with the highest term frequencies. > > > > It is inconsistent because the MLT-search then uses these extracted key > terms and scores all documents using an idf which is computed only on the > subset of documents with “description”. So one part of the MLT uses a > different numDocs than another part. This sounds like an odd choice, and > not expected at all, and I wonder if I’m missing something. > > > > Best, > > Maria > > > > > > > > > > > > > > -- > Matt Pearce > Flax - Open Source Enterprise Search > www.flax.co.uk >
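Maria's point can be made concrete with the idf formula quoted in the code above: log((docCount+1)/(docFreq+1)) + 1. The corpus sizes below are invented, but they mirror the "only ~10% populated" situation she describes:

```python
import math

# Illustration of the inconsistency: MoreLikeThis computes idf over the
# WHOLE index, while the search scores with idf over the docs that
# actually have the field. All figures below are hypothetical.

def idf(doc_count, doc_freq):
    """Lucene's classic idf: log((docCount+1)/(docFreq+1)) + 1."""
    return math.log((doc_count + 1) / (doc_freq + 1)) + 1

num_docs = 1_000_000   # whole index (what MoreLikeThis currently uses)
field_docs = 100_000   # docs with a populated "description" field
df = 5_000             # docs whose description contains the term

print(round(idf(num_docs, df), 3))    # 6.298, inflated by the whole index
print(round(idf(field_docs, df), 3))  # 3.996, what scoring actually uses
```

Because every term's idf is inflated by roughly the same constant (the ratio of total docs to field docs), the spread between rare and common terms flattens, and term frequency ends up dominating the interesting-terms selection, exactly the effect reported.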
Re: Question about elevations
As far as I remember the answer is no. You could take a deep look into the code, but as far as I remember the elevated doc ids must be in the index to be elevated. Those ids will be added to the query built server side, a sort of query expansion, and then the search is executed. Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: AW: Solr suggestions, best practices
I have done extensive work on auto suggestion; some additional resources from my company blog: https://sease.io/2015/07/solr-you-complete-me.html https://sease.io/2018/06/apache-lucene-blendedinfixsuggester-how-it-works-bugs-and-improvements.html Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Restrict search on term/phrase count in document.
I agree with Alexandre, it seems suspicious. Anyway, if you want to query on single-term frequency occurrences you can make use of the function range query parser: https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-FunctionRangeQueryParser And the functions: termfreq - returns the number of times the term appears in the field for that document, e.g. termfreq(text,'memory'). tf - term frequency; returns the term frequency factor for the given term, using the Similarity for the field. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. See also idf. E.g. tf(text,'solr'). Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
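For example (hypothetical field and term, assuming the stock frange syntax), you could keep only the documents where 'memory' occurs at least three times in the text field:

```
q=*:*
fq={!frange l=3}termfreq(text,'memory')
```

The frange parser turns the termfreq() function output into a filter, so no custom code is needed for simple per-document occurrence thresholds.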
Re: Phrase query as feature in LTR not working
Hi AshB, from what I see, this is the expected behavior. You pass this efi to your "isPook" feature: efi.query=thrones%20of%20game. Then you calculate: { "name" : "isPook", "class" : "org.apache.solr.ltr.feature.SolrFeature", "params" : { "fq": ["{!type=edismax qf=text v=$qq}=\"${query}\""] } } Given the document titles it seems incorrect, but what about the document text? Furthermore, if you are interested in exact phrase matching, I would first go with: https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-FieldQueryParser and then play with the following if more advanced phrase querying is needed: https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-ComplexPhraseQueryParser Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Scores with Solr Suggester
Hi Christine, it depends on the suggester implementation; the one that comes closest to having a score implementation is the BlendedInfix[1], but it is still in the TODO phase. Feel free to contribute it if you like! [1] https://sease.io/2018/06/apache-lucene-blendedinfixsuggester-how-it-works-bugs-and-improvements.html ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr 7 MoreLikeThis boost calculation
Hi Jesse, you are correct, the variable 'bestScore' used in createQuery(PriorityQueue q) should be named "minScore". It is used to normalise the term scores: tq = new BoostQuery(tq, boostFactor * myScore / bestScore); e.g. Queue -> Term1:100, Term2:50, Term3:20, Term4:10. The minScore will be 10 and the normalised scores will be: Term1:10, Term2:5, Term3:2, Term4:1. These values will be used to build the boosted term queries. I see no particular problem with that. What is your concern? ------- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
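A quick self-contained sketch of that normalisation (class and method names are mine, not the MoreLikeThis code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the normalisation above: each term score divided by the lowest
// score in the queue (the variable the thread suggests renaming to minScore).
public class MltBoostSketch {

    static Map<String, Double> normalise(Map<String, Double> termScores, double boostFactor) {
        double minScore = termScores.values().stream()
                .mapToDouble(Double::doubleValue).min().orElse(1.0);
        Map<String, Double> boosts = new LinkedHashMap<>();
        // boostFactor * myScore / minScore, as in the BoostQuery construction
        termScores.forEach((term, score) -> boosts.put(term, boostFactor * score / minScore));
        return boosts;
    }

    public static void main(String[] args) {
        Map<String, Double> queue = new LinkedHashMap<>();
        queue.put("Term1", 100.0);
        queue.put("Term2", 50.0);
        queue.put("Term3", 20.0);
        queue.put("Term4", 10.0);
        // prints {Term1=10.0, Term2=5.0, Term3=2.0, Term4=1.0}
        System.out.println(normalise(queue, 1.0));
    }
}
```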
Re: Solr sort by score not working properly
Hi, if you add the param debugQuery=on to the request you will see what happens under the hood and understand how the score is assigned. If you are new to the Lucene Similarity that your Solr version uses (BM25[1]), you can paste the debug score response here and we can briefly explain it to you the first time. First of all, we are not even sure the content field is actually used for scoring in your case; if it is, and it is used alone, the difference may be related to the field length (but that would be suspicious, as the documents are quite similar in length in your example). Are you sorting by score for any particular reason? It's been a while since I checked, but I doubt you get any benefit over the default (which ranks by score). So I recommend you send here the debug response and then possibly your select request handler config. Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: How to split index more than 2GB in size
Hi, in the first place, why do you want to split a 2 GB index? Nowadays that is a fairly small index. Secondly, what you reported is incomplete: I would expect a "Caused by" section in the stacktrace. These are generic recommendations; always spend time analysing the problem you had scrupulously: - SolrCloud problems often involve more than one node. Be sure to check the logs of all the nodes possibly involved. - Report the full stack trace to the community. - Report the full request which provoked the exception. Help is much easier this way :) Regards ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr 6.5 autosuggest suggests misspelt words and unwanted words
Hi, you should curate your data, that is fundamental to a healthy search solution, but let's see what you can do anyway: 1) Curate a dictionary of such bad words and then configure the analysis to skip them. 2) Have you tried different dictionary implementations? I would assume that each misspelled word has a low document frequency. You could use the HighFrequencyDictionaryFactory[1] and see how it goes. [1] https://lucene.apache.org/solr/guide/7_3/suggester.html#highfrequencydictionaryfactory ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
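A hypothetical sketch of option 2 in solrconfig.xml (field names, analyzer type, and the 1% threshold are placeholders to adapt); the threshold makes the dictionary skip terms that appear in fewer than that fraction of documents, which should prune most rare misspellings:

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
    <!-- skip terms appearing in less than 1% of the documents -->
    <str name="threshold">0.01</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>
```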
Re: How to exclude certain values in multi-value field filter query
The first idea that comes to mind is to build a single-valued copy field which concatenates the values. In this way you will have very specific values to filter on: query1 -(copyfield:(A B AB)) To concatenate you can use this update request processor: https://lucene.apache.org/solr/6_6_0//solr-core/org/apache/solr/update/processor/ConcatFieldUpdateProcessorFactory.html Regards ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
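A hypothetical update chain for solrconfig.xml (field names "category" and "category_concat" are placeholders): the multi-valued field is cloned to a helper field, then the helper field's values are concatenated into a single value you can filter on exactly:

```xml
<updateRequestProcessorChain name="concat-categories">
  <!-- copy the multi-valued field so the original stays untouched -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">category</str>
    <str name="dest">category_concat</str>
  </processor>
  <!-- collapse the clone's values into one delimited string -->
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">category_concat</str>
    <str name="delimiter"> </str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```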
Re: Solrj does not support ltr ?
Pretty sure you can't. As far as I know there is no client-side implementation to help with managed resources in general. Any contribution is welcome! ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Achieving AutoComplete feature using Solrj client
Indeed, you first configure it in the solrconfig.xml ( manually). Then you can query and parse the response as you like with the SolrJ client library. Cheers - --- Alessandro Benedetti Search Consultant, R Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Achieving AutoComplete feature using Solrj client
Hi, Tommaso and I contributed this a few years ago.[1] You can easily get the suggester response now from the Solr response. Of course you need to configure and enable the suggester first.[2][3][4] [1] https://issues.apache.org/jira/browse/SOLR-7719 [2] https://sease.io/2015/07/solr-you-complete-me.html [3] https://lucidworks.com/2015/03/04/solr-suggester/ [4] https://sease.io/2018/06/apache-lucene-blendedinfixsuggester-how-it-works-bugs-and-improvements.html ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr Suggest Component and OOM
I didn't get any answer to my questions (unless you meant you have 25 million different values for those fields...). Please read my answer again and elaborate further. Does your problem happen for both of the different suggesters? Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Logging Every document to particular core
Isn't the Transaction Log what you are looking for ? Read this good blog post as a reference : https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Cheers - --- Alessandro Benedetti Search Consultant, R Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Changing Field Assignments
On top of that, I would not recommend using schemaless mode in production. That mode is useful for experimenting and prototyping, but with a managed schema you have much more control over a production instance. Regards ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr Suggest Component and OOM
Hi, first of all the two different suggesters you are using are based on different data structures (with different memory utilisation): - FuzzyLookupFactory -> FST (in memory and stored binary on disk) - AnalyzingInfixLookupFactory -> auxiliary Lucene index. Both data structures should be very memory efficient (both in building and storage). What is the cardinality of the fields you are building suggestions from (site_address and site_address_other)? What is the memory situation in Solr when you start the suggester build? You are allocating much more memory to the Solr JVM process than to the OS (which in your situation means the OS cannot fit the entire index in memory, the ideal scenario). I would recommend putting some monitoring in place (there are plenty of open source tools for that). Regards ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: How to find out which search terms have matches in a search
I would recommend looking into the Highlighting feature[1]. There are a few implementations, and any of them should fit your user requirement. Regards [1] https://lucene.apache.org/solr/guide/7_3/highlighting.html ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Difference in fieldLengh and avgFieldLength in Solr 6.6 vs Solr 7.1
A shot in the dark, I have not double checked in detail, but: with Solr 7.x "Index-time boosts have been removed from Lucene, and are no longer available from Solr. If any boosts are provided, they will be ignored by the indexing chain. As a replacement, index-time scoring factors should be indexed in a separate field and combined with the query score using a function query. See the section Function Queries for more information." Are you using index-time boosts by any chance? If I remember correctly, the norms stored in the segment were affected by both the field length and the index-time boost. Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: BlendedInfixSuggester wiki errata corrige
Hi Cassandra, thanks for your reply. I did the fix in the official documentation as part of the bugfix I am working on: LUCENE-8343 <https://issues.apache.org/jira/browse/LUCENE-8343> Any feedback is welcome ! Cheers - --- Alessandro Benedetti Search Consultant, R Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: BlendedInfixSuggester wiki errata corrige
Errata corrige to my errata corrige post. E.g. with position of first match = 0 | 1 | 2 | 3:

Linear                 | 1 | 0.9 | 0.8 | 0.7
Reciprocal             | 1 | 1/2 | 1/3 | 1/4
Exponential Reciprocal | 1 | 1/4 | 1/9 | 1/16

----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
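For anyone wanting to double-check the decay of the three blenderTypes, a tiny sketch (class and method names are mine; coefficient formulas as described in the thread, with the exponent left at its 2.0 default):

```java
// Position coefficients for the three BlendedInfixSuggester blenderTypes:
//   linear:                 1 - 0.10 * position
//   reciprocal:             1 / (1 + position)
//   exponential reciprocal: 1 / (1 + position)^exponent, exponent default 2.0
public class BlenderDecaySketch {

    static double linear(int position)        { return 1 - 0.10 * position; }
    static double reciprocal(int position)    { return 1.0 / (1 + position); }
    static double expReciprocal(int position) { return 1.0 / Math.pow(1 + position, 2.0); }

    public static void main(String[] args) {
        for (int p = 0; p <= 3; p++) {
            System.out.printf("pos=%d linear=%.2f reciprocal=%.4f expReciprocal=%.4f%n",
                    p, linear(p), reciprocal(p), expReciprocal(p));
        }
    }
}
```

All three coefficients start at 1 for position 0 and decay from there; none of them rewards matches at the end of the suggestion.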
BlendedInfixSuggester wiki errata corrige
Hi all, I have been working quite a bit on the BlendedInfixSuggester: - to fix a bug: LUCENE-8343 <https://issues.apache.org/jira/browse/LUCENE-8343> - to bring an improvement: LUCENE-8347 <https://issues.apache.org/jira/browse/LUCENE-8347> I was reviewing the wiki documentation for the BlendedInfixSuggester[1]. This bit is incorrect, or at least confusing: "position_linear weightFieldValue * (1 - 0.10*position): Matches to the start will be given a higher score. This is the default. position_reciprocal weightFieldValue / (1 + position): Matches to the end will be given a higher score. exponent An optional configuration variable for position_reciprocal to control how fast the score will increase or decrease. Default 2.0." 1) The position_exponential_reciprocal blenderType is missing (it is the one the "exponent" applies to). 2) It is not true that position_reciprocal gives higher scores to matches at the end of a suggestion. All the blenderTypes boost matches at the beginning of the suggestion; the only difference is how fast the score of such terms decays with the position. E.g. with position of first match = 0 | 1 | 2 | 3:

Linear                 | 1 | 0.9 | 0.8 | 0.7
Reciprocal             | 1 | 1/2 | 1/3 | 1/4
Exponential Reciprocal | 1 | 1/4 | 1/8 | 1/16

I would be grateful if anyone could fix the documentation. Cheers [1] https://lucene.apache.org/solr/guide/7_3/suggester.html#blendedinfixlookupfactory ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr 7.3 suggest dictionary building fails in cloud mode with large number of rows
In addition to what Erick and Walter correctly mentioned: "heap usage varies from 5 gb to 12 gb . Initially it was 5 gb then increased to 12 gb gradually and decreasing to 5 gb again. (may be because of garbage collection) 10-12 GB maximum heap uses, allocated is 50 GB." Did I read that right? Is 50 GB allocated to the physical/virtual machine where Solr is running, or to the Solr JVM? The former is fine; the latter is considered bad practice unless you really need all that heap for your Solr process (which is extremely unlikely). You need to leave memory to the OS for memory mapping (which is heavily used by Solr). With such a big heap, your GC may indeed end up in long pauses. It is recommended to allocate to the Solr process as little as possible (according to your requirements). Regards ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr 7.3 suggest dictionary building fails in cloud mode with large number of rows
Hi Yogendra, you mentioned you are using SolrCloud. In SolrCloud an investigation is not isolated to a single Solr log: you see a timeout, so I would recommend checking both the nodes involved. When you say "heap usage is around 10 GB - 12 GB per node", do you refer to the effective usage by the Solr JVM or the allocated heap? Are you monitoring the memory utilisation of your Solr nodes? Are garbage collection cycles behaving correctly? When a timeout occurs, something bad happened in the communication between the Solr nodes. It could be network, but in your case it may be some stop-the-world situation caused by GC. ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Update Solr Document
There is no quick answer, it really depends on a lot of factors... TL;DR: updating a single document field will likely take more time in a bigger collection.

Partial Document Update: First of all, which field are you updating? Depending on the type and attributes you may end up in different scenarios[1]. For example, an in-place update would be much more convenient and less expensive, as it will not end up writing a new document in the index. Vice versa, a normal atomic update will cause an internal delete/re-index of the doc. What happens next will depend on the commit policies (or, in case you saturated the internal RAM buffer, the content of the segment will be flushed).

Solr Commit Policies: In Solr there is the concept of soft and hard commit. A soft commit is cheaper: it grants visibility, warms up the caches, and does minimal (potentially no) disk writing. A hard commit will in addition flush the current segment to disk (which brings all the background operations that Emir pointed out). Help yourself with this classic from Erick[2].

Warming the caches will take more time in a bigger collection (as the queries will be executed on a bigger index). Merging the segments in the background, if triggered, will take more time in a bigger collection.

[1] https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html#UpdatingPartsofDocuments-In-PlaceUpdates [2] https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
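A typical way to express those two commit policies in solrconfig.xml (the intervals here are hypothetical values to tune to your latency requirements):

```xml
<autoCommit>
  <!-- hard commit: flush segments to disk every 60s, without opening a new searcher -->
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <!-- soft commit: document visibility every 5s -->
  <maxTime>5000</maxTime>
</autoSoftCommit>
```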
Re: Weird behavioural differences between pf in dismax and edismax
Question in general for the community: what is the dismax capable of doing that the edismax is not? Is it really necessary to keep both of them, or could the dismax be deprecated? Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: solr-extracting features values
The current feature extraction implementation in Solr is oriented to the Learning To Rank re-ranking capability; it is not built for bulk feature extraction (to then train your model). I am afraid you will need to implement your own system that issues multiple queries to Solr with feature extraction enabled and then parses the results to build your training set. Do you have query-level or query-dependent features? In case you are lucky enough to have only document-level features, you may end up in a slightly simplified scenario. Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Weird behavioural differences between pf in dismax and edismax
I don't have any hard position on this. It's OK not to build a phrase boost if the input query is one term and it remains one term after the analysis for one of the pf fields. But if the term produces multiple tokens after query-time analysis, I do believe that building a phrase boost would be the correct interpretation (e.g. "wi-fi" with a query-time analyser which splits on '-'). Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Weird behavioural differences between pf in dismax and edismax
In my opinion, given the definitions of the dismax and edismax query parsers, they should behave the same for the parameters they have in common. To be a little bit extreme, I don't think we need the dismax query parser at all anymore (in the end edismax only offers more than the dismax). Finally, I do believe that even if the query is a single term (before or after the analysis for a pf field) it should boost the phrase anyway. A phrase of one word is still a phrase, isn't it? ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Could not load collection from ZK:
hi Aman, I had similar issues in the past and the reason was attributed to : SOLR-8868 <https://issues.apache.org/jira/browse/SOLR-8868> Which unfortunately is not solved yet. Did you manage to find a different cause in your case? hope that helps. Regards - --- Alessandro Benedetti Search Consultant, R Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Debugging/scoring question
Hi Mariano, from the documentation: docCount = total number of documents containing this field, in the range [1 .. {@link #maxDoc()}]. In your debug the fields involved in the score computation are indeed different (nomUsageE, prenomE). Does this make sense? Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Multiple languages, boosting and, stemming and KeywordRepeat
Hi Markus, can you show all the query parameters used when submitting the request to the request handler ? Can you also include the parsed query ( in the debug) I am curious to investigate this case. Cheers -- Alessandro Benedetti Search Consultant, R Software Engineer, Director www.sease.io On Thu, May 17, 2018 at 10:53 PM, Markus Jelsma <markus.jel...@openindex.io> wrote: > Hello, > > And sorry to disturb again. Does anyone of you have any meaningful opinion > on this peculiar matter? The RemoveDuplicates filter exists for a reason, > but with query-time KeywordRepeat filter it causes trouble in some cases. > Is it normal for the clauses to be absent in the debug output, but the > boost doubled in value? > > I like this behaviour, but is it a side effect that is considered a bug in > later versions? And where is the documentation in this. I cannot find > anything in the Lucene or Solr Javadocs, or the reference manual. > > Many thanks, again, > Markus > > > > -Original message- > > From:Markus Jelsma <markus.jel...@openindex.io> > > Sent: Wednesday 9th May 2018 17:39 > > To: solr-user <solr-user@lucene.apache.org> > > Subject: Multiple languages, boosting and, stemming and KeywordRepeat > > > > Hello, > > > > First, apologies for the weird subject line. > > > > We index many languages and search over all those languages at once, but > boost the language of the user's preference. To differentiate between > stemmed tokens and unstemmed tokens we use KeywordRepeat and > RemoveDuplicates, this works very well. > > > > However, we just stumbled over the following example, q=australia is not > stemmed in English, but its suffix is removed by the Romanian stemmer, > causing the Romanian results to be returned on top of English results, > despite language boosting. > > > > This is because the Romanian part of the query consists of the stemmed > and unstemmed version of the word, but the English part of the query is > just one clause per field (title, content etc). 
Thus the Romanian results > score roughtly twice that of English results. > > > > Now, this is of course really obvious, but the 'solution' is not. To > work around the problem i removed the RemoveDuplicates filter so i get two > clauses for English as well, really ugly but it works. What i don't > understand is the debug output, it doesn't list two identical clauses, > instead, it doubled the boost on the field, so instead of: > > > > 27.048403 = PayloadSpanQuery, product of: > > 27.048403 = weight(title_en:australia in 15850) > [SchemaSimilarity], result of: > > 27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0 > > ), product of: > > 7.4 = boost > > 3.084852 = idf(docFreq=14539, docCount=317894) > > 1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 > * (1 - b + b * fieldLength / avgFieldLength)) from: > > 4.0 = phraseFreq=4.0 > > 0.3 = parameter k1 > > 0.5 = parameter b > > 15.08689 = avgFieldLength > > 24.0 = fieldLength > > 1.0 = AveragePayloadFunction.docScore() > > > > I now get > > > > 54.096806 = PayloadSpanQuery, product of: > > 54.096806 = weight(title_en:australia in 15850) > [SchemaSimilarity], result of: > > 54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0 > > ), product of: > > 14.8 = boost > > 3.084852 = idf(docFreq=14539, docCount=317894) > > 1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 > * (1 - b + b * fieldLength / avgFieldLength)) from: > > 4.0 = phraseFreq=4.0 > > 0.3 = parameter k1 > > 0.5 = parameter b > > 15.08689 = avgFieldLength > > 24.0 = fieldLength > > 1.0 = AveragePayloadFunction.docScore() > > > > So instead of expecting two clauses in the debug, i get one but with a > doubled boost. > > > > The question is, is this supposed to be like this? > > > > Also, are there any real solutions to this problem? Removing the > RemoveDuplicats filter looks really silly. > > > > Many thanks! > > Markus > > >
Re: Regarding LTR feature
"FQ_filter were 365 but below in the debugging part the docfreq used in the payload_score calculation was 3360" If you are talking about the doc frequency of a term, obviously this is corpus based ( necessary for the TF /IDF calculations) so it wil not be affected by the filter queries. The payload score part may be different. Anyway, you mentioned that you assign the weights, in that case the learning to rank plugin may be not necessary at all. Regards - --- Alessandro Benedetti Search Consultant, R Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: How to implement Solr auto suggester and spell checker simultaneously on a single search box
Hi Sonal, if you want to go with a plain Solr suggester, what about the FuzzyLookupFactory? 1) It supports fuzzy matching (spellcheck-like behaviour). 2) It supports autocomplete. If you want context filtering as well, unfortunately the FST-based Solr suggesters don't support this feature. In that case I would recommend building your own autocompletion service on a dedicated Lucene index (to make it simple you could define an ad hoc Solr collection). Then, at query time, when a query doesn't return results you may want to execute a fuzzy query (to bring in the spellcheck functionality, or just run the spellcheck response collation from the main query). Cheers ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
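A hypothetical FuzzyLookupFactory configuration for solrconfig.xml (field, analyzer type, and maxEdits are placeholders to adapt):

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">fuzzySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <!-- number of edits tolerated when matching the typed prefix -->
    <str name="maxEdits">2</str>
  </lst>
</searchComponent>
```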
Re: Date Query Confusion
Hi Terry, let me go in order: "Tried creation_date: 2016-11. That's supposed to match documents with any November 2016 date. But actually produces: "Invalid Date String:'2016-11'"" Is "DateRangeField" the field type of your "creation_date" field?[1] You mentioned org.apache.solr.schema.TrieDateField: that is not going to work, you need the specific field type I mentioned to use that date range syntax. "And Solr doesn't seem to let me sort on a date field. Tried creation_date asc. Produced: "can not sort on multivalued field: creation_date"" Is your "creation_date" single-valued? If it is single-valued semantically, make sure it is defined as single-valued in the schema: Solr doesn't support sorting on multi-valued fields. Your schemaless conf may have assigned the multiValued attribute to that field. From the wiki[2]: "Solr can sort query responses according to document scores or the value of any field with a single value that is either indexed or uses DocValues (that is, any field whose attributes in the Schema include multiValued="false" and either docValues="true" or indexed="true" – if the field does not have DocValues enabled, the indexed terms are used to build them on the fly at runtime), provided that:" Hope this helps. Regards [1] https://lucene.apache.org/solr/guide/6_6/working-with-dates.html#WorkingwithDates-DateRangeFormatting [2] https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html#CommonQueryParameters-ThesortParameter ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
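A hypothetical schema sketch enabling that range syntax (names are placeholders):

```xml
<fieldType name="date_range" class="solr.DateRangeField"/>
<field name="creation_date" type="date_range" indexed="true" stored="true" multiValued="false"/>
```

With this, queries like creation_date:2016-11 or creation_date:[2016-11 TO 2017-01] should work. If I remember correctly, DateRangeField itself cannot be sorted on, so if you also need sorting, keep a separate single-valued TrieDateField copy for that purpose.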
Re: Regarding LTR feature
So Prateek: "You're right it doesn't have to be that accurate to the query time but our requirement is having a more solid control over our outputs from Solr like if we have 4 features then we can adjust the weights giving something like (40,20,20,20) to each feature such that the sum total of features for a document is 100 this is only possible if we could scale the feature outputs between 0-1." You are talking about weights, so I assume you are using a linear Learning To Rank model. Which library are you using to train your model? Does this library allow you to limit the summation of the linear weights and normalise the training set per feature? At query time LTR will just apply the model weights to the query-time feature vector. It makes sense to normalise each query-time feature using the training-time values. They should be close enough to the training set values (if not, the model is going to perform poorly anyway and you need to curate the training phase a little bit better). Remember the model is used to give an order to the results, not to make an accurate regression prediction. "Secondly, I also have a doubt regarding the scaling function like why it is not considering only the documents filtered out by the FQ filter and considering all the documents which match the query." At the moment I would not focus on that scenario: I am not convinced the LTR SolrFeature is compatible with that complex function query, and I am not convinced it is going to be performance-friendly anyway. I would need to investigate that properly. Regards ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Autocomplete returning shingles
Yes, faceting will work; you can use an old approach to autocompletion[1]. Be sure you add the shingle filter to the appropriate index-time analysis chain for the field you want. Facet values are extracted from the indexed terms, so calculating facets and filtering by prefix should do the trick. [1] https://solr.pl/en/2013/03/25/autocomplete-on-multivalued-fields-using-faceting/ ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
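A hypothetical request illustrating the approach ("title_shingle" is a placeholder copy field analysed with a shingle filter at index time):

```
q=*:*
rows=0
facet=true
facet.field=title_shingle
facet.prefix=apache so
facet.mincount=1
```

facet.prefix filters the indexed (shingled) terms by what the user has typed so far, and the facet counts come back ranked by frequency, which doubles as a crude popularity ordering for the suggestions.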
Re: Regarding LTR feature
Hi Prateek, I would assume you have that feature at training time as well; can't you use the training set to establish the parameters for the normalizer at query time? In the end, being a normalization, it doesn't have to be perfectly accurate to the query-time state, but it must reflect the relations the model learnt from the training set. Let me know! ----- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Regarding LTR feature
Mmmm, first of all, you know that each Solr feature is calculated per document, right? So you want to calculate the payload score for the document you are re-ranking, based on the query (your External Feature Information), and normalise across the different documents? I would go with this feature and use the LTR normalisation functionality:

{
  "store" : "my_feature_store",
  "name" : "in_aggregated_terms",
  "class" : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "q" : "{!payload_score f=aggregated_terms func=max v=${query}}" }
}

Then in the model you specify something like:

"name" : "myModelName",
"features" : [
  { "name" : "isBook" },
  ...
  { "name" : "in_aggregated_terms",
    "norm" : {
      "class" : "org.apache.solr.ltr.norm.MinMaxNormalizer",
      "params" : { "min" : "x", "max" : "y" }
    }
  }
]

Give it a try, let me know
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Autocomplete returning shingles
So, your problem is that you want to return shingle suggestions from an input field but apply multiple filter queries to the documents you fetch suggestions from. Are you building an auxiliary index for that? You need to design it accordingly. If you want to map each suggestion to a single document in the auxiliary index, when you build this auxiliary index you need to calculate the shingles client side and push multiple documents (one per suggestion) for each original field content. To do that automatically in Solr, I was thinking you could write an UpdateRequestProcessor that, given the input document, splits it into multiple docs; but unfortunately the current architecture of UpdateRequestProcessors takes 1 doc in input and returns just 1 doc in output, so it is not a viable approach. Unfortunately the shingle filter doesn't help here, as you want shingles in the output (analysers don't affect stored content). Cheers
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
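[Editor's note] A minimal sketch of the client-side shingling described above, producing one suggestion string per shingle so that each can be pushed as its own document into the auxiliary index. Class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: generate word shingles client side, mimicking what a shingle
// filter would produce at analysis time. Each returned string can become
// one suggestion document in the auxiliary index, carrying over whatever
// fields are needed for filter queries.
public class ShingleBuilder {

    public static List<String> shingles(String text, int minSize, int maxSize) {
        String[] tokens = text.toLowerCase().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int size = minSize; size <= maxSize; size++) {
            for (int start = 0; start + size <= tokens.length; start++) {
                result.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + size)));
            }
        }
        return result;
    }
}
```

For "Red Wine Glass" with sizes 1–2 this yields "red", "wine", "glass", "red wine", "wine glass" — one auxiliary document each.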
Re: Regarding LTR feature
Hi Prateek, with a query and FQs, Solr is expected to score a document only if that document is in the intersection of all the FQ results and the query results [1]. Then re-ranking happens, so effectively only the top K intersected documents will be re-ranked. If you are curious about the code, this can be debugged by running a variation of org.apache.solr.ltr.TestLTRWithFacet#testRankingSolrFacet (introducing filter queries) and setting a breakpoint somewhere around org/apache/solr/ltr/LTRRescorer.java:181. Can you elaborate on how you verified that it is currently not working like that? I am familiar with the LTR code and I would be surprised to see this different behaviour.

[1] https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: How to create a solr collection providing as much searching flexibility as possible?
Hi Raymond, as Charlie correctly stated, the input format is not that important; what is important is to focus on your requirements and properly design a configuration and data model to solve them. Extracting the information from such a data format is not going to be particularly challenging (as I assume you know the semantics of the structure). You need to build your Solr document according to the set of features you want to expose. Designing fields and field types will be fundamental to reach the search flexibility you are looking for. e.g.

*Feature*: expose a fast range search on a numerical field (Int)
*Implementation*: IntPointField [1] — an integer field (32-bit signed integer). This class encodes int values using a "Dimensional Points" based data structure that allows for very efficient searches for specific values, or ranges of values. For single-valued fields, docValues="true" must be used to enable sorting. [2]

Regards
[1] https://lucene.apache.org/solr/guide/7_3/field-types-included-with-solr.html
[2] https://lucene.apache.org/solr/guide/7_3/the-standard-query-parser.html#range-searches
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
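[Editor's note] A concrete schema sketch for that example; field and type names are placeholders:

```xml
<fieldType name="pint" class="solr.IntPointField" docValues="true"/>
<field name="quantity" type="pint" indexed="true" stored="true"/>
```

A range search then looks like `q=quantity:[10 TO 100]` (inclusive bounds; use curly braces, e.g. `quantity:{10 TO 100]`, to make a bound exclusive), per the standard query parser syntax referenced in [2].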
Re: How to create a solr collection providing as much searching flexibility as possible?
Hi Raymond, your requirements are quite vague; Solr offers you those capabilities, but you need to model your configuration and data accordingly. https://lucene.apache.org/solr/guide/7_3/solr-tutorial.html is a good starting point. After that you can study your requirements and design the search solution accordingly. Cheers
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Search Analytics Help
Michal, Doug was referring to an open source solution ready out of the box and just pluggable ( a sort of plug and play). Of course you can implement your own solution and using ELK or kafka is absolutely a valid option. Cheers -- Alessandro Benedetti Search Consultant, R Software Engineer, Director www.sease.io On Fri, Apr 27, 2018 at 10:21 AM, Michal Hlavac <m...@hlavki.eu> wrote: > Hi, > > you have plenty options. Without any special effort there is ELK. Parse > solr logs with logstash, feed elasticsearch with data, then analyze in > kibana. > > Another option is to send every relevant search request to kafka, then you > can do more sophisticated data analytic using kafka-stream API. Then use > ELK to feed elasticsearch with logstash kafka input plugin. For this > scenario you need to do some programming. I`ve already created this > component but I hadn't time to publish it. > > Another option is use only logstash to feed e.g. graphite database and > show results with grafana or combine all these options. > > You can also monitor SOLR instances by JMX logstash input plugin. > > Really don't understand what do you mean by saying that there is nothing > satisfactory. > > m. > > On štvrtok, 26. apríla 2018 22:23:30 CEST Doug Turnbull wrote: > > Honestly I haven’t seen anything satisfactory (yet). It’s a huge need in > > the open source community > > > > On Thu, Apr 26, 2018 at 3:38 PM Ennio Bozzetti <ebozze...@thorlabs.com> > > wrote: > > > > > Hello, > > > > > > I'm setting up SOLR on an internal website for my company and I would > like > > > to know if anyone can recommend an analytics that I can see what the > users > > > are searching for? Does the log in SOLR give me that information? > > > > > > Thank you, > > > Ennio Bozzetti > > > > > > -- > > CTO, OpenSource Connections > > Author, Relevant Search > > http://o19s.com/doug > > >
Re: Learning to Rank (LTR) with grouping
Are you using SolrCloud or any distributed search? If you are using just a single Solr instance, LTR should have no problem with pagination. The re-rank involves the top K and then you paginate. So if a document from the original-score page 1 ends up in page 3, you will see it at page three. Have you verified that: "Say, if an item (Y) from second page is moved to first page after re-ranking, while an item (X) from first page is moved away from the first page. ?" The top K shouldn't start from the "start" parameter; if it does, it is a bug. The situation changes a little with distributed search, where you can experience this behaviour:

*Pagination* Let's explore the scenario on a single Solr node and on a sharded architecture.

SINGLE SOLR NODE
reRankDocs=15
rows=10
This means each page is composed of 10 results. What happens when we hit page 2? The first 5 documents in the search results will have been rescored and affected by the reranking. The latter 5 documents will preserve the original score and original ranking. e.g.
Doc 11 – score=1.2
Doc 12 – score=1.1
Doc 13 – score=1.0
Doc 14 – score=0.9
Doc 15 – score=0.8
Doc 16 – score=5.7
Doc 17 – score=5.6
Doc 18 – score=5.5
Doc 19 – score=4.6
Doc 20 – score=2.4
This means that score(15) could be < score(16), but documents 15 and 16 are still in the expected order. The reason is that the top 15 documents are rescored and reranked and the rest is left unchanged.

*SHARDED ARCHITECTURE*
reRankDocs=15
rows=10
Shards number=2
When looking for page 2, Solr will trigger queries to the shards to collect 2 pages per shard:
Shard1 : 10 ReRanked docs (page1) + 5 ReRanked docs + 5 OriginalScored docs (page2)
Shard2 : 10 ReRanked docs (page1) + 5 ReRanked docs + 5 OriginalScored docs (page2)
Then the results will be merged, and possibly originally scored search results can end up above reranked docs.
A possible solution could be to normalise the scores to prevent any possibility that a reranked result is surpassed by originally scored ones.
Note: the problem only happens once you reach rows * page > reRankDocs. When reRankDocs is quite high, the problem will occur only in deep paging.
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
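[Editor's note] The single-node arithmetic in the example above can be sketched as a tiny helper: for a given page, how many results still carry a reranked score and how many keep the original one. Names are illustrative:

```java
// Sketch of the single-node reranking/pagination arithmetic: a result is
// reranked iff its absolute offset falls inside the top reRankDocs.
public class ReRankPageMath {

    // Number of results on the given 1-based page that were rescored.
    public static int rerankedOnPage(int page, int rows, int reRankDocs) {
        int start = (page - 1) * rows;           // offset of the first result on the page
        int covered = Math.max(0, reRankDocs - start);
        return Math.min(rows, covered);          // results on this page inside the top K
    }
}
```

With reRankDocs=15 and rows=10 this reproduces the example: page 1 is fully reranked, page 2 is half reranked, and page 3 onward keeps the original scores.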
Re: Run solr server using Java program
To do what? If you mean to start a Solr server instance, you have the bin/solr start script (or the Windows equivalent). You can set up your automation stack to be able to start Solr in one click. SolrJ is a client, which means you need Solr up and running. Cheers

On Fri, 20 Apr 2018, 16:51 rameshkjes wrote: > Using solrJ, I am able to access the solr core. But still I need to go to > command prompt to execute command for solr instance. Is there way to do > that? > > > > -- > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html >
Re: Run solr server using Java program
There are various client APIs for Apache Solr [1]; in your case what you need is SolrJ [2]. Cheers

[1] https://lucene.apache.org/solr/guide/7_3/client-apis.html
[2] https://lucene.apache.org/solr/guide/7_3/using-solrj.html#using-solrj
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: SolrCloud design question
Unless you use recent Solr 7.x features where replicas can have different properties [1], each replica is functionally the same at the Solr level. Zookeeper will elect a leader among them (so temporarily one replica will have more responsibilities), but (R1-R2-R3) does not really exist at the Solr level. It will just be Shard1 (ReplicaHost1, ReplicaHost2, ReplicaHost3). So you can't really shuffle anything at this level.
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: How to protect middile initials during search
Hi Wendy, I recommend properly configuring your analysis chain. You can start by posting it here and we can help. Generally speaking, you should first use the analysis tool in the Solr admin UI to verify that the analysis chain is configured as you expect; then you can move on to modelling the query appropriately. Cheers
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Learning to Rank (LTR) with grouping
Thanks for the response Shawn! In relation to this: "I feel fairly sure that most of them are unwilling to document their skills. If information like that is documented, it might saddle a committer with an obligation to work on issues affecting those areas when they may not have the free time available to cover that obligation."

I understand your point. I was referring to pure Lucene/Solr module interest/expertise more than skills, but I get that "it might saddle a committer with an obligation to work on issues affecting those areas when they may not have the free time available to cover that obligation." It shouldn't transmit an obligation (as no contributor operates under any SLA, being purely passion driven) but it might be a "suggestion". I was thinking of some way to avoid such long-standing Jiras. Let's pick this issue as an example. For what my opinion is worth, I believe it is quite useful. The last activity is from 22/May/17 15:23 and no committer commented after that. I would assume that committers with interest or expertise in Learning To Rank or Grouping initially didn't have free time to evaluate the patch and then maybe they just forgot. Could having some sort of tagging based on expertise at least avoid the "forget" part? Or should the contributor promote the issue and get as many "votes" from the community as possible to validate it? Just thinking out loud; it was just an idea (and I am not completely sure it could help), but I believe as a community we should manage contributions a little bit better. Of course I am open to any idea and perspective. Cheers
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Learning to Rank (LTR) with grouping
Hi Erick, I have a curiosity/suggestion regarding how to speed up pending (or forgotten) Jiras: is there a way to find the most suitable committer(s) for a task and tag them? Apache Lucene/Solr is a big project; is there anywhere on the official Apache Lucene/Solr website where each committer lists their modules of interest/expertise? That way, when a contributor creates a Jira and attaches a patch, the committers could get a notification if the module involved is one of their interests. This could be done manually (the contributor checks the committers' interests and manually tags them in the Jira) or automatically (integrating Jira modules with this "interests list" in some way). Happy to help in this direction. I understand that all of us, contributors (and committers), are just volunteers, so no SLA is expected at all; but did the fact that a fix version was already assigned affect how that Jira issue was addressed? Cheers
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Sorting using "packed" fields?
Hi Christopher, if you model your documents with a nested document approach ( like the one you mentioned) you should be able to achieve your requirement following this interesting blog [1] : *" ToParentBlockJoinQuery supports several score calculation modes. For example, a score for a parent could be calculated as a min(max) score among of all its children’s scores. So, with the piece of code below we can sort parent documents by their children’s prices in descending ordersort={!parent which=doc_type:parent score=max v=’+doc_type:child +{!func}price’} desc… "* Instead of using just the plain price function you could design your own function, such as : {!func}if(gt(query(prefix:),0),latest_submission,0) it's just a quick attempt to give you the idea, the function query I posted may need some refinement but it could work Cheers [1] https://blog.griddynamics.com/how-to-sort-parent-documents-by-child-attributes-in-solr/ -- Alessandro Benedetti Search Consultant, R Software Engineer, Director www.sease.io On Mon, Apr 16, 2018 at 9:48 PM, Christopher Schultz < ch...@christopherschultz.net> wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > All, > > I have documents that need to appear to have different attributes > depending upon which user is trying to search them. One of the fields > I currently have in the document is called "latest_submission" and > it's a multi-valued text field that contains fields packed with a > numeric identifier prefix and then the real data. Something like this: > > 101:2018-04-16T16:41:00Z > 102:2017-01-25T22:08:17Z > 103:2018-11-19T02:52:28Z > > When searching, I will know which prefixes are valid for a certain > user, so I know I can search by *other* fields and then pull-out the > values that are appropriate for a particular user. > > But if I want Solr/Lucene to searcg/sort by the "latest submission", I > need to be able to tell Solr/Lucene which values are appropriate to > use for that user. > > Is this kind of thing possible? 
I'd like to be able to issue a search > that says e.g.: > > find documents matching name:foo sort by latest_submission starting > with ("102:" or "103:") > > I'm just starting out with this data set, so I can completely change > the organization of the data within the index if necessary. > > Does anyone have any suggestions? > > I've seen some questions on the list about "child documents", and it > seems like that might be relevant. Right now, my input data looks like > this: > > { > { "name" : "document name", > "latest_submission" : [ "prefix:date", "prefix:date", etc. ] > } > } > > But that could easily be changed to be: > > { > { "name" : "document name", > "latest_submission" : { "prefix" : "101", > "date" : "[date]" }, > { "prefix" : "103", > "date" : "[date]" }, > } > } > > > Thanks, > - -chris >
Re: Match a phrase like "Apple iPhone 6 32GB white" with "iphone 6"
Hi Sami, I agree with Mikhail: if you have relatively complex data you could curate your own knowledge base for products and use it for Named Entity Recognition. You can then search a compatible_with field for the extracted entity. If the scenario is simpler, using the analysis chain you mentioned should work (provided the product names are always complete and well curated). Cheers
--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io

On Mon, Apr 9, 2018 at 10:40 AM, Adhyan Arizki <a.ari...@gmail.com> wrote: > You can just use synonyms for that.. rather hackish but it works > > On Mon, 9 Apr 2018, 05:06 Sami al Subhi, <s...@alsubhi.me> wrote: > > > I think this filter will output the desired result: > > > > indexing: > > "iPhone 6" will be indexed as "iphone 6" (always a single token) > > > > querying: > > so this will analyze "Apple iPhone 6 32GB white" to "apple", "apple > > iphone", > > "iphone", "iphone 6" and so on... > > then here a match will be achieved using the 4th token. > > > > > > I dont see how this will result in false positive matching. > > > > -- > > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html > > >
Re: LTR - OriginalScore query issue
From the Apache Solr tests:

loadFeature("SomeEdisMax", SolrFeature.class.getCanonicalName(),
    "{\"q\":\"{!edismax qf='title description' pf='description' mm=100% boost='pow(popularity, 0.1)' v='w1' tie=0.1}\"}");

*qf='title description'*
Can you try again using the proper expected syntax (with single quotes)? If it doesn't work, we may need to raise it as a bug. Regards
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: LTR - OriginalScore query issue
I understood your requirement; the SolrFeature feature type should be quite flexible. Have you tried:

{
  "store" : "myFeatureStoreDemo",
  "name" : "overallEdismaxScore",
  "class" : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : {
    "q" : "{!dismax qf=item_typel^3.0 brand^2.0 title^5.0}${user_query}"
  }
}

Cheers
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
RE: SpellCheck Reload
Hi Sadiki, the kind of spellchecker you are using builds an auxiliary Lucene index as a support data structure. That is what is used to provide the spellcheck suggestions.

"My question is, does "reloading the dictionary" mean completely erasing the current dictionary and starting from scratch (which is what I want)?"

What you want is to re-build the spellchecker. In the case of the IndexBasedSpellChecker, the main index is used to build the dictionary. When the spellchecker is initialised, a reader is opened from the latest index version available. If in the meantime your index has changed and commits have happened, just building the spellchecker *should* still use the old reader:

@Override
public void build(SolrCore core, SolrIndexSearcher searcher) throws IOException {
  IndexReader reader = null;
  if (sourceLocation == null) {
    // Load from Solr's index
    reader = searcher.getIndexReader();
  } else {
    // Load from Lucene index at given sourceLocation
    reader = this.reader;
  }

This means your dictionary is not going to see the recent changes. So what you need to do is:
1) reload the spellchecker -> which will initialise the source for the dictionary again, from the latest index commit
2) re-build the dictionary
Cheers
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
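[Editor's note] The two steps map onto the standard spellcheck request parameters. Assuming a request handler registered at /spell and a collection named mycollection (both placeholders), the requests could look like:

```
# 1) reload: re-initialise the spellchecker against the latest index reader
http://localhost:8983/solr/mycollection/spell?spellcheck=true&spellcheck.reload=true

# 2) build: re-build the dictionary from that reader
http://localhost:8983/solr/mycollection/spell?spellcheck=true&spellcheck.build=true
```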
Re: Some performance questions....
*Single Solr Instance VS Multiple Solr Instances on a Single Server*
I think there is no benefit in having multiple Solr instances on a single server, unless the heap memory required by the JVM would be too big. And remember that this is only loosely related to the index size (the inverted index is memory-mapped OFF heap, and docValues as well). On the other hand, of course, Apache Solr uses plenty of JVM heap memory as well (caches, temporary data structures during indexing, etc.).

> Deepak:
> > Well its kinda a given that when running ANYTHING under a VM you have an
> overhead..
***Deepak*** You mean you are assuming without any facts (performance benchmark with n without VM) ***Deepak***

I think Shawn detailed this quite extensively. I am no sysadmin or OS expert, but there is no need for benchmarks and I don't even understand your doubts. In information technology, any time you add additional layers of software you need adapters, which means additional instructions executed. It is obvious that having:
metal -> OS -> APP
is cheaper, instruction-wise, than:
metal -> OS -> VM -> APP
The APP will execute instructions in the VM, which will be responsible for translating those instructions for the underlying OS. Going direct, you skip one step. You can see the same thing when emulating a different OS: is it cheaper to run Windows applications on Windows directly, or to run them in a Windows VM on top of another OS?
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: LTR - OriginalScore query issue
From the snippet you posted, this is the query you run: q=id:"13245336". So the original score (for each document in the result set) can only be the score associated with that query. You then pass an EFI with different text. You can now use that information to calculate another feature if you want. You can define a SolrFeature:

{
  "store" : "myFeatureStore",
  "name" : "userTextCat",
  "class" : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "q" : "{! <localParams}${user_query}" }
}

e.g.

{
  "store" : "myFeatureStore",
  "name" : "titleTfIdf",
  "class" : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "q" : "{!field f=title}${user_query}" }
}

Cheers
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: LTR not able to upload org.apache.solr.ltr.model.MultipleAdditiveTreesModel
This is the piece of code involved:

try {
  // create an instance of the model
  model = solrResourceLoader.newInstance(
      className,
      LTRScoringModel.class,
      new String[0], // no sub packages
      new Class[] { String.class, List.class, List.class, String.class, List.class, Map.class },
      new Object[] { name, features, norms, featureStoreName, allFeatures, params });
  if (params != null) {
    SolrPluginUtils.invokeSetters(model, params.entrySet());
  }
} catch (final Exception e) {
  throw new ModelException("Model type does not exist " + className, e);
}

I admit it is generic and even contains a catch-all Exception clause, but wasn't it logging the stacktrace? Just out of curiosity, what was the entire stacktrace? This may help to improve it. Regards
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr Warming Up Doubts
I see quite a bit of confusion here:

*1. FirstSearcher* "I have added 2 frequently used queries, but all my autowarmCount are set to 0. I have also added a facet for warming. So if my autowarmCount=0, does this mean my queries are not getting cached?"
/The firstSearcher, as the name suggests, is the first searcher opened on the Solr instance at startup. newSearcher refers to the new searcher opened on every commit instead. If the autowarmCount for your caches is set to 0, it means that 0 entries from the old caches will be used to warm up the new caches (old caches get invalidated on both soft and hard commits)./

*2. useColdSearcher = false* "Despite reading many documents, I am not able to understand how it works after a full import (assuming this is not my first full-import)."
Normally when a commit happens, the new searcher is first warmed up and then registered to serve queries. If you want to use a cold searcher instead, you can set this property to true.

*3. maxWarmingSearchers not defined in solrconfig.* This refers to the number of warming searchers in the background: if you have frequent commits, you may have different searchers concurrently warming up. This parameter limits that number (normally to 2 searchers).

So, in short, you are definitely doing something wrong and your auto warming is not going to work as you would like :) Cheers
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
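[Editor's note] For reference, the static warming queries discussed in point 1 live in solrconfig.xml event listeners and run independently of the caches' autowarmCount. A sketch (the query and field name are placeholders):

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">frequent query</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
    </lst>
  </arr>
</listener>
```

The same listener can be registered for the newSearcher event so the warming also happens after each commit.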
Re: LTR not able to upload org.apache.solr.ltr.model.MultipleAdditiveTreesModel
Hi Roopa, that model changed name a few times; which Apache Solr version are you using? It is very likely you are using a class name that is not in sync with your Apache Solr version. Regards
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Need help with match contains query in SOLR
It was not clear at the beginning, but if I understood correctly you could:

*Index-Time Analysis* Use whatever char filter you need, the keyword tokenizer [1], and then the token filters you like (such as the lowercase filter, synonyms, etc.).
*Query-Time Analysis* Use a tokenizer you like (one that actually tokenizes, so not the keyword tokenizer), the shingle token filter [2], and whatever additional filters you need.

This should do the trick. Cheers
[1] https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-KeywordTokenizer
[2] https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
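[Editor's note] Putting the two analysis chains together, a sketch of such a "contains" field type; the type name and shingle sizes are illustrative:

```xml
<fieldType name="text_contains" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"
            outputUnigrams="true" tokenSeparator=" "/>
  </analyzer>
</fieldType>
```

A query then matches when one of its query-time shingles equals the whole indexed (keyword-tokenized) value, which is the "contains" behaviour described above.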
RE: Index size increases disproportionately to size of added field when indexed=false
Hi David, good to know that sorting solved your problem. I understand perfectly that, given the urgency of your situation, having the solution ready takes priority over continuing the investigation. I would recommend anyway opening a Jira issue for Apache Solr with all the information gathered so far. Your situation caught our attention, and changing the order of the documents in input definitely shouldn't affect the index size (by such a large factor). The fact that the optimize didn't change anything is even more suspicious. It may be an indicator that in some edge cases the ordering of input documents affects one of the index data structures. As a last thing, when you have time I would suggest:
1) index with the ordering which gives you a small index - optimize - take note of the size by index file extension
2) index with the ordering which gives you a big index - optimize - take note of the size by index file extension
and attach that to the Jira issue. Whenever someone picks it up, that will definitely help. Cheers
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: solr ltr jar is not able to recognize MultipleAdditiveTreesModel
You can not just use the model output from Ranklib. I opened this issue few months ago but I never had the right time/motivation to implement it [1] . You need to convert it in the Apache Solr LTR expected format. I remember a script should be available[2] [1] https://sourceforge.net/p/lemur/feature-requests/144/ [2] https://github.com/ryac/lambdamart-xml-to-json -- Alessandro Benedetti Search Consultant, R Software Engineer, Director www.sease.io On Thu, Feb 15, 2018 at 3:55 PM, Brian Yee <b...@wayfair.com> wrote: > I'm not sure if this will solve your problem, but you are using a very old > version of Ranklib. The most recent version is 2.9. > https://sourceforge.net/projects/lemur/files/lemur/RankLib-2.9/ > > > -Original Message- > From: kusha.pande [mailto:kusha.pa...@gmail.com] > Sent: Thursday, February 15, 2018 8:12 AM > To: solr-user@lucene.apache.org > Subject: solr ltr jar is not able to recognize MultipleAdditiveTreesModel > > Hi I am trying to upload a training model generated from ranklib jar using > lamdamart mart. > > The model is like > {"class":"org.apache.solr.ltr.model.MultipleAdditiveTreesModel", > "name":"lambdamartmodel", > "params" : { > "trees" :[ >{ > "id": "1", > "weight": "0.1", > "split": { > "feature": "8", > "threshold": "7.111333", > "split": [ > { >"pos": "left", >"feature": "8", >"threshold": "5.223557", >"split": [ > { > "pos": "left", > "feature": "8", > "threshold": "3.2083516", > "split": [ > { >"pos": "left", >"feature": "1", >"threshold": "100.0", >"split": [ > { > "pos": "left", > "feature": "8", > "threshold": "2.2626402", > "split": [ > { >"pos": "left", >"feature": "8", >"threshold": "2.2594802", >"split": [ > { > "pos": "left", > "output": "-1.6371088" > }, > { > "pos": "right", > "output": "-2.0" > } >] > }, > { >"pos": "right", >"feature": "8", >"threshold": "2.4438097", >"split": [ > { > "pos": "left", > "feature": "2", > "threshold": "0.05", > "split": [ > { >"pos": "left", >"output": "2.0" > }, .. 
> > > getting an exception as : > Exception: Status: 400 Bad Request > Response: { > "responseHeader":{ > "status":400, > "QTime":43}, > "error":{ > "metadata":[ > "error-class","org.apache.solr.common.SolrException", > "root-error-class","java.lang.RuntimeException"], > "msg":"org.apache.solr.ltr.model.ModelException: Model type does not > exist org.apache.solr.ltr.model.MultipleAdditiveTreesModel", > "code":400}} > . > > I have used RankLib-2.1-patched.jar to generate the model and converted the > generated xml to json. > > > > > > -- > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html >
Re: Index size increases disproportionately to size of added field when indexed=false
It's a silly question, but to confirm the direction Erick is suggesting: how many rows are in the DB? If updates are happening in Solr (causing the deletes), I would expect a greater number of documents in the DB than in the Solr index. Is the DB primary key (if any) the same as the uniqueKey field in Solr?

Regards
--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io

On Fri, Feb 16, 2018 at 10:18 AM, Howe, David <david.h...@auspost.com.au> wrote:
>
> Hi Emir,
>
> We have no copy field definitions. To keep things simple, we have a one
> to one mapping between the columns in our staging table and the fields in
> our Solr index.
>
> Regards,
>
> David
>
> David Howe
> Java Domain Architect
> Postal Systems
> Level 16, 111 Bourke Street Melbourne VIC 3000
>
> T 0391067904
> M 0424036591
> E david.h...@auspost.com.au
> W auspost.com.au
> W startrack.com.au
>
> Australia Post is committed to providing our customers with excellent
> service. If we can assist you in any way please telephone 13 13 18 or visit
> our website.
>
> The information contained in this email communication may be proprietary,
> confidential or legally professionally privileged. It is intended
> exclusively for the individual or entity to which it is addressed. You
> should only read, disclose, re-transmit, copy, distribute, act in reliance
> on or commercialise the information if you are authorised to do so.
> Australia Post does not represent, warrant or guarantee that the integrity
> of this email communication has been maintained nor that the communication
> is free of errors, virus or interference.
>
> If you are not the addressee or intended recipient please notify us by
> replying direct to the sender and then destroy any electronic or paper copy
> of this message.
Any views expressed in this email communication are taken > to be those of the individual sender, except where the sender specifically > attributes those views to Australia Post and is authorised to do so. > > Please consider the environment before printing this email. >
Re: Multiple context fields in suggester component
You can start from here: org/apache/solr/spelling/suggest/SolrSuggester.java:265

Cheers
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
RE: Index size increases disproportionately to size of added field when indexed=false
@Pratik: you should have investigated further. I understand that solved your issue, but even if you needed norms, it doesn't make sense that they caused your index to grow by a factor of 30. You must have hit a nasty bug if it really was just the norms.

@Howe: the relevant Lucene file extensions are:

- Compound File (.cfs, .cfe): an optional "virtual" file consisting of all the other index files, for systems that frequently run out of file handles
- Frequencies (.doc): contains the list of docs which contain each term, along with the frequency
- Field Data (.fdt): the stored fields for documents
- Positions (.pos): stores position information about where a term occurs in the index
- Term Index (.tip): the index into the Term Dictionary

So, David, you confirm that those two indexes have:

1) the same number of documents
2) identical documents (+ 1 new, not indexed, field each)
3) the same number of deleted documents
4) both were built from scratch (an empty index)

The matter is still suspicious:

- .cfs seems to highlight some sort of malfunctioning during indexing/committing in relation to the OS. How were you committing?
- .doc, .pos, .tip shouldn't change: assuming both indexes are optimised and you are only adding a not-indexed field, those data structures shouldn't be affected
- the stored content as well: that is too much of an increment

Can you send us the full configuration for the new field? You don't want norms, positions or frequencies for it. But even if they are the issue, you may have found a very edgy case, because even with all of them enabled you shouldn't incur such a penalty for just one additional tiny field.

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Index size increases disproportionately to size of added field when indexed=false
Hi Pratik,
how is it possible that just the norms for a single field were causing such a massive index size increment in your case? I think the field type was shared by multiple fields, but it is still suspicious in my opinion: norms shouldn't be that big. If I remember correctly, in old versions of Solr (before the drop of index-time boosts) the norms contained both an approximation of the length of the field and the index-time boost.

From your mailing list thread, you moved from 10 GB to 300 GB. It can't be just the norms; are you sure you didn't hit some bug?

Regards
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Using Synonyms as a feature with LTR
I see. As far as I know, it is not possible to run different query-time analysis chains against the same field. Not sure if anyone has been working on that.

Regards
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Not getting appropriate spell suggestions
Given your schema, the stemmer seems the most likely culprit. You need to disable it and re-index: just commenting it out is not going to work if you don't re-index.

Cheers
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Using Synonyms as a feature with LTR
"I can go with the "title" field and have that include the synonyms in analysis. Only problem is that the number of fields and number of synonyms files are quite a lot (~ 8 synonyms files) due to different weightage and type of expansion (exact vs partial) based on these. Hence going with this approach would mean creating more fields for all these synonyms (synonyms.txt) So, I am looking to build a custom parser for which I could supply the file and the field and that would expand the synonyms and return a score. " Having a binary or scalar feature is completely up to you and the way you configure the Solr feature. If you have 8 (copy?)fields with same content but different expansion, that is still ok. You can have 8 features, one per type of expansion. LTR will take care of the weight to be assigned to those features. "So, I am looking to build a custom parser for which I could supply the file and the field and that would expand the synonyms and return a score. "" I don't get this , can you elaborate ? Regards - --- Alessandro Benedetti Search Consultant, R Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Judging the MoreLikeThis results for relevancy
Let me answer point by point:

1) "Similarity" is misleading here if you interpret it as a probabilistic measure. Given a query, there is no "ideal document". With both TF-IDF and BM25 (which handles this problem better) you are scoring the documents: the higher the score, the higher the relevance of that document for the given query. BM25 does a better job here because its relevance function hits a saturation point, so it is closer to your expectation; this blog post from Doug should help [1].

2) "if document vector A is at a distance of 5 and 10 units from document vectors B and C respectively then can't we say that B is twice as relevant to A as C is to A? Or in terms of distance, C is twice as distant to A as B is to A?"

Not in Lucene, at least not strictly. The current MLT uses TF-IDF as its scoring formula. When the score of B is double the score of C, you can say that B is twice as relevant to A as C is — for Lucene. From a user's perspective this can be different (quoting Doug: "If an article mentions “dog” six times is it twice as relevant as an article mentioning “dog” 3 times? Most users say no").

3) Under the hood, MLT builds a Lucene query and retrieves documents from the index. When building the MLT query, to keep it simple, it extracts from the seed document a subset of terms which are considered representative of it (let's call them the relevant terms). This is managed through a parameter, but usually — and by default — you collect a limited set of relevant terms, not all of them. When retrieving similar documents, they are scored using TF-IDF (and, in the future, BM25). So first of all, you can get documents with a higher score than the original (it doesn't make sense in a probabilistic world, but this is how Lucene works). Swapping the documents, i.e. applying MLT to document B, could build a slightly different query. So:

score(b) given seed(a) != score(a) given seed(b)

I understand you think it doesn't make sense, but this is how Lucene works.

I do also understand that a lot of the time users want a percentage out of an MLT query. I will work in that direction for sure, step by step; first I need to have the MLT refactor approved and patched :)

[1] https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
RE: Index size increases disproportionately to size of added field when indexed=false
Hi David,
given the fact that you are actually building a new index from scratch, my shot in the dark didn't hit any target.

When you say "Once the import finishes we save the docker image in the AWS docker repository. We then build our cluster using that image as the base" — do you mean configuration-wise only? Will the new cluster have any starting index on disk? If I understood your latest statements correctly, I expect a "no" here.

So you are building a completely new index, and compared to the old index (which is completely separate) you see such a big difference in size. This is extremely suspicious. Optimizing, in the end, is just a huge merge that forces 1 (or N) final segments; given the additional information you gave me, it's not going to make much difference.

I would recommend checking how the index space is divided among the different file formats [1] (i.e. list how much space is taken by each specific extension). The stored content is in the .fdt files.

[1] https://lucene.apache.org/core/6_4_0/core/org/apache/lucene/codecs/lucene62/package-summary.html#file-names

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Multiple context fields in suggester component
The simple answer is no: only one context field is supported out of the box. The query you provide as the context filtering query (suggest.cfq=) is parsed, and a boolean query against the context field is created [1]. You will need some customisation if you are targeting that behaviour.

[1] query = new StandardQueryParser(contextFilterQueryAnalyzer).parse(contextFilter, CONTEXTS_FIELD_NAME);

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: facet.method=uif not working in solr cloud?
*Update*: this has actually already been solved by Hoss:
https://issues.apache.org/jira/browse/SOLR-11711
and this is the pull request:
https://github.com/apache/lucene-solr/pull/279/files

This should go live with 7.3.

Cheers
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: facet.method=uif not working in solr cloud?
+1, I believe it is a bug related to that patch in some way. facet.distrib.mco (the naming is not very explicit) should activate the feature in the patch, which forces the mincount in the distributed requests to be set to 1. The expected behaviour otherwise is that the distributed requests get the same value for the parameter that you originally set. Can you open a bug, Wei? We can investigate the part where the requests are distributed.

Regards
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: solr spell check index dictionary build failed issue
Shooting in the dark, it seems that two processes are trying to write to the same directory on disk. Is this directory shared by different Solr cores or Solr instances? If you share the relevant configuration from your solrconfig we may be able to help.

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Index size increases disproportionately to size of added field when indexed=false
I assume you re-index in full, right? My shot in the dark is that this increment is temporary. When you re-index, you effectively delete and re-add all documents (which means that even if the new field is just stored, you rebuild the entire index for all the fields). This creates new segments, and the old docs are only marked as deleted; until the background merge happens, the index can reach those sizes. The weird thing is why the merge didn't kick in... Have you configured any special segment-merging approach? What happens if you explicitly optimize?

Let us know...
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Using Synonyms as a feature with LTR
In the end, a feature is just a numerical value. How do you plan to use synonyms in a field to generate a numerical feature? Are you planning to define a binary feature for a field, set when there is a match on the synonyms? Or a feature whose value is the score for a query (with synonym expansion)?

I would start from the SolrFeature. Let's assume the "title" field has a field type that includes synonyms (query time):

{
  "store" : "featureStore",
  "name" : "hasTitleMatch",
  "class" : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "fq": [ "{!field f=title}${query}" ] }
}

Query-time analysis will be applied and the synonyms expanded, so the feature will have a value, which is the score returned for that query and the document being scored. You can play with that and design the feature that best fits your idea.

Regards
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
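Features like the hasTitleMatch example above (one per synonym-expanded field) can be generated and uploaded in one PUT to the LTR feature-store REST endpoint. This is a rough sketch: the collection name, field names and feature names are placeholders, and `${query}` must match the external feature information (efi) parameter you pass at query time.

```python
import json
import urllib.request

def make_synonym_feature(name, field, store="featureStore"):
    """Build one SolrFeature whose value is the query score against a
    synonym-expanded (copy)field -- one feature per type of expansion."""
    return {
        "store": store,
        "name": name,
        "class": "org.apache.solr.ltr.feature.SolrFeature",
        # "q" makes the feature value the relevance score of this query;
        # query-time analysis on the field expands the synonyms.
        "params": {"q": "{!field f=%s}${query}" % field},
    }

# Hypothetical fields, one per expansion type (exact vs partial):
features = [
    make_synonym_feature("titleExactSynMatch", "title_syn_exact"),
    make_synonym_feature("titlePartialSynMatch", "title_syn_partial"),
]

def upload_features(features, solr_url="http://localhost:8983/solr/mycollection"):
    """PUT the feature definitions to Solr's LTR feature-store endpoint."""
    req = urllib.request.Request(
        solr_url + "/schema/feature-store",
        data=json.dumps(features).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT")
    return urllib.request.urlopen(req)

if __name__ == "__main__":
    upload_features(features)  # requires a running Solr with LTR enabled
```

The trained model then just references these feature names; LTR learns the relative weight of each expansion type.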
Re: Using Context field unable to get autosuggestion for zip code having '-'.
With that configuration you want to autosuggest office names, filtering them by zip code. Not sure why you perform an ngram analysis, though. How do you want to filter by zip code: exact search, or edge ngram?

Regards
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
RE: Relevancy Tuning For Solr With Apache Nutch 2.3
Uhm, not really. I am just saying that if you are running a version >= 6.6.0, keep in mind that the index-time boost you think you are enabling is not actually working anymore. You are now mentioning a Nutch boost field: can you elaborate on that? It may be a completely different thing. How is this boost stored on the Solr side?

Cheers
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
RE: Relevancy Tuning For Solr With Apache Nutch 2.3
Regarding the boost from Nutch's side: if you refer to index-time boost, this was deprecated some time ago [1], at least as of 6.6.0.

[1] http://lucene.apache.org/solr/6_6_0/solr-solrj/deprecated-list.html

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Spellcheck collations results
Given this configuration, you may state that if no collation is returned, no tested collation returned results after:

- getting back a maximum of 7 corrections for the misspelled terms
- generating a maximum of 10,000 combinations of collations to evaluate
- testing 3 collations against the index to check whether they return results, then giving up

So there are scenarios where you don't get a collation even though one would actually have returned results:

- the collation involves a correction that was not among the 7 closest corrections
- the collation was not tested (not being included in the first 3 collation combinations)

We can go deeper if required; the spellchecker is quite a complex module :)

Cheers
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Judging the MoreLikeThis results for relevancy
Hi,
I have personally worked a lot with MoreLikeThis and I am close to contributing a refactor of that module (mostly to break up the monolithic giant facade class).

First of all, the MoreLikeThis handler returns the original document (not scored) plus the similar documents (scored). The original document is not considered by the MoreLikeThis query, so it is not returned as part of the results of the MLT Lucene query; it is just prepended to the response. If I remember correctly — but I am unable to check at the moment — you should be able to get the original document in the result set (with the max score) using the More Like This query parser. Please double-check that.

Generally speaking, TF-IDF is currently used under the hood, which means the score is not probabilistic. A document whose score is 50% of the original document's score is not necessarily "50% similar", but for your use case it may be a feasible approximation.

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
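That "feasible approximation" amounts to normalising each MLT hit's score by the top score. A minimal, purely illustrative sketch (the input is a list of hit dicts as you might extract from a Solr JSON response with fl=id,score; the id/score keys are assumptions about how you parse the response):

```python
def relative_similarity(docs, score_field="score"):
    """Express each MLT hit's TF-IDF score as a fraction of the top score.
    This is NOT a true probability of similarity -- just the rough
    approximation discussed above."""
    if not docs:
        return []
    top = max(d[score_field] for d in docs)
    return [{"id": d["id"], "similarity": d[score_field] / top} for d in docs]

# Hypothetical hits extracted from response["response"]["docs"]:
hits = [{"id": "a", "score": 12.4},
        {"id": "b", "score": 6.2},
        {"id": "c", "score": 3.1}]
# Under this approximation, "b" (half of "a"'s score) comes out around 0.5.
```

Keep in mind the caveats above: TF-IDF scores are not comparable across queries, so the fractions are only meaningful within a single MLT result set.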