Re: SOLR upgrade

2021-02-12 Thread Alessandro Benedetti
Hi,
following up on Charlie's detailed response, I would recommend carefully
assessing the code you are using to interact with Apache Solr (on top of the
Solr changes themselves).
Assuming you are using some sort of client, it's extremely important to
fully understand both the syntax and the semantics of each call.
I have seen a lot of "compiles OK" search-API migrations that were fine
syntactically but a disaster from the semantic perspective (missing
important parameters etc.).

In case you have plugins to maintain, this would be even more complicated
than just making them compile.

Regards
--
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant
www.sease.io


On Tue, 9 Feb 2021 at 11:01, Charlie Hull 
wrote:

> Hi Lulu,
>
> I'm afraid you're going to have to recognise that Solr 5.2.1 is very
> out-of-date and the changes between this version and the current 8.x
> releases are significant. A direct jump is I think the only sensible
> option.
>
> Although you could take the current configuration and attempt to upgrade
> it to work with 8.x, I recommend that you should take the chance to look
> at your whole infrastructure (from data ingestion through to query
> construction) and consider what needs upgrading/redesigning for both
> performance and future-proofing. You shouldn't just attempt a
> lift-and-shift of the current setup - some things just won't work and
> some may lock you into future issues. If you're running at large scale
> (I've talked to some people at the BL before and I know you have some
> huge indexes there!) then a redesign may be necessary for scalability
> reasons (cost and feasibility). You should also consider your skills
> base and how the team can stay up to date with Solr changes and modern
> search practice.
>
> Hope this helps - this is a common situation which I've seen many times
> before, you're certainly not the oldest version of Solr running I've
> seen recently either!
>
> best
>
> Charlie
>
> On 09/02/2021 01:14, Paul, Lulu wrote:
> > Hi SOLR team,
> >
> > Please may I ask for advice regarding upgrading the SOLR version (our
> project currently running on solr-5.2.1) to the latest version?
> > What are the steps, breaking changes and potential issues ? Could this
> be done as an incremental version upgrade or a direct jump to the newest
> version?
> >
> > Much appreciate the advice, Thank you!
> >
> > Best Wishes
> > Lulu
> >
> >
> >
>
> --
> Charlie Hull - Managing Consultant at OpenSource Connections Limited
> 
> Founding member of The Search Network <https://thesearchnetwork.com/>
> and co-author of Searching the Enterprise
> <https://opensourceconnections.com/about-us/books-resources/>
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828
>


Re: Extremely Small Segments

2021-02-12 Thread Alessandro Benedetti
Hi Yasoob,
Can you check in the logs when hard commits really happen?
I have sometimes ended up with the auto soft/hard commit configuration in the
wrong place in solrconfig.xml and, for that reason, got unexpected behaviour.
Your assumptions are correct: the ramBuffer flushes as soon as one of the
thresholds (memory or doc count) is met.
For the auto-commit it's the same, but based on time/docs.
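
As a sanity check, a minimal sketch of where the auto commit configuration is
expected to live in solrconfig.xml, inside the updateHandler element (the
maxTime values below are placeholders, not a recommendation):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:180000}</maxTime>
  </autoSoftCommit>
</updateHandler>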

Are you sure there's no additional commit happening?
Do you see those numbers on all shards/replicas?
Which kind of replica are you using?
Sharding a 10 GB index may not be necessary; do you have any evidence that you
had to shard your index?
Any performance benchmark?

Cheers
--
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant
www.sease.io


On Fri, 12 Feb 2021 at 13:44, yasoobhaider  wrote:

> Hi
>
> I am migrating from master slave to Solr Cloud but I'm running into
> problems
> with indexing.
>
> Cluster details:
>
> 8 machines of 64GB memory, each hosting 1 replica.
> 4 shards, 2 replica of each. Heap size is 16GB.
>
> Collection details:
>
> Total number of docs: ~250k (but only 50k are indexed right now)
> Size of collection (master slave number for reference): ~10GB
>
> Our collection is fairly heavy with some dynamic fields with high
> cardinality (of order of ~1000s), which is why the large heap size for even
> a small collection.
>
> Relevant solrconfig settings:
>
> commit settings:
>
> 
>   1
>   360
>   false
> 
>
> 
>   ${solr.autoSoftCommit.maxTime:180}
> 
>
> index config:
>
> 500
> 1
>
>  class="org.apache.solr.index.TieredMergePolicyFactory">
>   10
>   10
> 
>
>
> class="org.apache.lucene.index.ConcurrentMergeScheduler">
>  6
>  4
>
>
>
> Problem:
>
> I setup the cloud and started indexing at the throughput of our earlier
> master-slave setup, but soon the machines ran into full blown Garbage
> Collection. This throughput was not a lot though. We index the whole
> collection overnight, so roughly ~250k documents in 6 hours. That's roughly
> 12rps.
>
> So now I'm doing indexing at an extremely slow rate trying to find the
> problem.
>
> Currently I'm indexing at 1 document/2seconds, so every minute ~30
> documents.
>
> Observations:
>
> 1. I'm noticing extremely small segments in the segments UI. Example:
>
> Segment _1h4:
> #docs: 5
> #dels: 0
> size: 1,586,878 bytes
> age: 2021-02-12T11:05:33.050Z
> source: flush
>
> Why is lucene creating such small segments? I understood that segments are
> created when ramBufferSizeMB or maxBufferedDocs limit is hit. Or on a hard
> commit. Neither of those should lead to such small segments.
>
> 2. The index/ directory has a large number of files. For one shard with 30k
> documents & 1.5GB size, there are ~450-550 files in this directory. I
> understand that each segment is composed of a bunch of files. Even
> accounting for that, the number of segments seems very large.
>
> Note: Nothing out of the ordinary in logs. Only /update request logs.
>
> Please help with making sense of the 2 observations above.
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Re:Interpreting Solr indexing times

2021-01-13 Thread Alessandro Benedetti
I agree, documents may be gigantic or very small,  with heavy text analysis
or simple strings ...
so it's not possible to give an evaluation here.
But you could make use of the nightly benchmark to give you an idea of
Lucene indexing speed (the engine inside Apache Solr) :

http://home.apache.org/~mikemccand/lucenebench/indexing.html

Not sure we have something similar for Apache Solr officially.
https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceData -> this
is probably a bit outdated.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: leader election stuck after hosts restarts

2021-01-13 Thread Alessandro Benedetti
I faced these problems a while ago; at the time I created a blog post
which I hope can help:
https://sease.io/2018/05/solrcloud-leader-election-failing.html



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: QueryResponse ordering

2021-01-13 Thread Alessandro Benedetti
Hi Srinivas,
Filter queries don't impact scoring, only matching.
So, what is the ordering you are expecting?
A bq (boost query) parameter will add a clause to the query, impacting the
score in an additive way.
The query you posted is a bit confusing, what was your intent there?
To boost search results having "abc" as the PARTY.PARTY.ID?
https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebq_BoostQuery_Parameter
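
As a rough illustration of the difference (everything except PARTY.PARTY.ID is
made up here):

q=your query&fq=PARTY.PARTY.ID:abc
  -> restricts the result set, no effect on the score or the ordering

q=your query&defType=edismax&bq=PARTY.PARTY.ID:abc^10
  -> all matches are returned, but documents with PARTY.PARTY.ID=abc get an
     additive score boost and tend to rank higher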



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


[Free Online Meetups] London Information Retrieval Meetup

2020-11-02 Thread Alessandro Benedetti
Hi all,
The London Information Retrieval Meetup has moved online:

https://www.meetup.com/London-Information-Retrieval-Meetup-Group

It is a free evening meetup aimed at Information Retrieval passionates and
professionals who are curious to explore and discuss the latest trends in
the field.

It is technology agnostic, but you'll find many talks on Apache Solr and
related technologies.

Tomorrow (03.11 at 6:10 pm Uk time) we will host the sixth London
Information Retrieval meetup (fully remote).
We will have two talks:
*Talk 1*
"Feature Extraction for Large-Scale Text Collections"
from Luke Gallagher, PhD candidate, RMIT University
*Talk 2*
"A Learning to Rank Project on a Daily Song Ranking Problem"
from Ilaria Petreti (IR/ML Engineer, Sease) and Anna Ruggero (R&D Software
Engineer, Sease)

If you fancy some Search Stories, feel free to register here:
https://www.meetup.com/London-Information-Retrieval-Meetup-Group/events/273905485/

Cheers

have a nice evening!
------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


Re: How to get boosted field and values?

2020-03-20 Thread Alessandro Benedetti
Hi Taisuke,
there are various ways of approaching boosting and scoring in Apache Solr.
First of all you must decide if you are interested in multiplicative or
additive boost.
Multiplicative will multiply the score of your search result by a certain
factor while the additive will just add the factor to the final score.

Using advanced query parsers such as dismax and edismax you can use:
*boost* parameter - multiplicative - takes function in input -
https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html#TheExtendedDisMaxQueryParser-TheboostParameter
*bq*(boost query) - additive -
https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebq_BoostQuery_Parameter
*bf*(boost function) - additive -
https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebf_BoostFunctions_Parameter

This blog post is old but should help :
https://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/
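
For example, on a single request the three can be combined like this (the
field names and functions here are only illustrative):

q=laptop&defType=edismax&qf=title^2 description
  &boost=popularity                               (multiplicative, function)
  &bq=category:electronics^5                      (additive, query clause)
  &bf=recip(ms(NOW,release_date),3.16e-11,1,1)    (additive, function)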

Then you can boost fields or even specific query clauses:

 1)
https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqf_QueryFields_Parameter

2) q= features:2^1.0 AND features:3^5.0

1.0 is the default: you are multiplying the score contribution of the term
by 1.0, so no effect.
features:3^5.0 means that the score contribution of a match for the term
'3' in the field 'features' will be multiplied by 5.0 (you can also see
that by enabling debug=results).

Finally you can force the score contribution of a term to be a constant,
it's not recommended unless you are truly confident you don't need other
types of scoring:
q= features:2^=1.0 AND features:3^=5.0

in this example your document  id: 3 will have a score of 6.0

Not sure if this answers your question, if not feel free to elaborate more.

Cheers

--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Thu, 19 Mar 2020 at 11:18, Taisuke Miyazaki 
wrote:

> I'm using Solr 7.5.0.
> I want to get boosted field and values per documents.
>
> e.g.
> documents:
>   id: 1, features: [1]
>   id: 2, features: [1,2]
>   id: 3, features: [1,2,3]
>
> query:
>   bq: features:2^1.0 AND features:3^1.0
>
> I expect results like below.
> boosted:
>   - id: 2
> - field: features, value: 2
>   - id: 3
> - field: features, value: 2
> - field: features, value: 3
>
> I have an idea that set boost score like bit-flag, but it's not good I
> think because I must send query twice.
>
> bit-flag:
>   bq: features:2^2.0 AND features:3^4.0
>   docs:
> - id: 1, score: 1.0(0x001)
> - id: 2, score: 3.0(0x011) # have feature:2(2nd bit is 1)
> - id: 3, score: 7.0(0x111) # have feature:2 and feature:3(2nd and 3rd
> bit are 1)
> check score value then I can get boosted field.
>
> Is there a better way?
>


Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-23 Thread Alessandro Benedetti
I have been working extensively on query autocompletion, these blogs should
be helpful to you:

https://sease.io/2015/07/solr-you-complete-me.html
https://sease.io/2018/06/apache-lucene-blendedinfixsuggester-how-it-works-bugs-and-improvements.html

Your idea of using search quality evaluation to drive the autocompletion is
interesting.
How do you currently calculate the NDCG for a query? What's your ground
truth?
With that approach you will autocomplete favouring query completions that
your search engine is able to process better, not necessarily the ones closer
to the user intent; still, it could work.

We should differentiate here between the suggester dictionary (where the
suggestions come from, in your case it could be your extracted data) and
the kind of suggestion (that in your case could be the free text suggester
lookup)
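
For instance, a file-based dictionary with your nDCG-derived weights could look
roughly like this (file name, weights, field type and the chosen lookupImpl are
just placeholders):

top_queries.txt (one entry per line, suggestion <TAB> weight):
solr query autocomplete<TAB>0.92
search relevance tuning<TAB>0.85

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">querySuggester</str>
    <str name="dictionaryImpl">FileDictionaryFactory</str>
    <str name="sourceLocation">top_queries.txt</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>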

Cheers
--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Mon, 20 Jan 2020 at 17:02, David Hastings 
wrote:

> Not a bad idea at all, however ive never used an external file before, just
> a field in the index, so not an area im familiar with
>
> On Mon, Jan 20, 2020 at 11:55 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > David,
> >
> > Thank you, that is useful. So, would you recommend using a (clean) field
> > over an external dictionary file? We have lots of "top queries" and
> measure
> > their nDCG. A thought was to programmatically generate an external file
> > where the weight per query term (or phrase) == its nDCG. Bad idea?
> >
> > Best,
> > Audrey
> >
> > On 1/20/20, 11:51 AM, "David Hastings" 
> > wrote:
> >
> > Ive used this quite a bit, my biggest piece of advice is to choose a
> > field
> > that you know is clean, with well defined terms/words, you dont want
> an
> > autocomplete that has a massive dictionary, also it will make the
> > start/reload times pretty slow
> >
> > On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> >
> > > Hi All,
> > >
> > > We plan to incorporate a query autocomplete functionality into our
> > search
> > > engine (like this:
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
> > > ). And I was wondering if anyone has personal experience with this
> > > component and would like to share? Basically, we are just looking
> > for some
> > > best practices from more experienced Solr admins so that we have a
> > starting
> > > place to launch this in our beta.
> > >
> > > Thank you!
> > >
> > > Best,
> > > Audrey
> > >
> >
> >
> >
>


Re: Query Regarding SOLR cross collection join

2020-01-23 Thread Alessandro Benedetti
From the Join Query Parser code:

"// most of these statistics are only used for the enum method

int fromSetSize;  // number of docs in the fromSet (that match
the from query)
long resultListDocs;  // total number of docs collected
int fromTermCount;
long fromTermTotalDf;
int fromTermDirectCount;  // number of fromTerms that were too small
to use the filter cache
int fromTermHits; // number of fromTerms that intersected the from query
long fromTermHitsTotalDf; // sum of the df of the matching terms
int toTermHits;   // num if intersecting from terms that match
a term in the to field
long toTermHitsTotalDf;   // sum of the df for the toTermHits
int toTermDirectCount;// number of toTerms that we set directly on
a bitset rather than doing set intersections
int smallSetsDeferred;// number of small sets collected to be used
later to intersect w/ bitset or create another small set

"

The toSetSize has nothing to do with MB of data read from the index, it is
the size in number of docs of the resulting set of documents.

Improving this would require a much deeper analysis, I reckon, starting from
your query and your data model through to the architecture involved.
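
As a reference point for the syntax (collection and field names here are made
up), the standard join and the score join variant look like:

fq={!join from=party_id to=id fromIndex=parties}status:active
fq={!join from=party_id to=id fromIndex=parties score=none}status:active

The score=none|min|max|avg|total local parameter switches to the score join
implementation; when it returns no results it is worth double checking that
the from/to fields really contain matching values and compatible field types.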

Cheers
------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Wed, 22 Jan 2020 at 13:27, Doss  wrote:

> HI,
>
> SOLR version 8.3.1 (10 nodes), zookeeper ensemble (3 nodes)
>
> One of our use cases requires joins, we are joining 2 large indexes. As
> required by SOLR one index (2GB) has one shared and 10 replicas and the
> other has 10 shard (40GB / Shard).
>
> The query takes too much time, some times in minutes how can we improve
> this?
>
> Debug query produces one or more based on the number of shards (i believe)
>
> "time":303442,
> "fromSetSize":0,
> "toSetSize":81653955,
> "fromTermCount":0,
> "fromTermTotalDf":0,
> "fromTermDirectCount":0,
> "fromTermHits":0,
> "fromTermHitsTotalDf":0,
> "toTermHits":0,
> "toTermHitsTotalDf":0,
> "toTermDirectCount":0,
> "smallSetsDeferred":0,
> "toSetDocsAdded":0},
>
> here what is the  toSetSize  mean? does it read 81MB of data from the
> index? how can we reduce this?
>
> Read somewhere that the score join parser will be faster, but for me it
> produces no results. I am using string type fields for from and to.
>
>
> Thanks!
>


Re: Is it possible to add stemming in a text_exact field

2020-01-23 Thread Alessandro Benedetti
Edward is correct; furthermore, using a stemmer in an analysis chain that
doesn't tokenise is going to work only for single-term queries and
single-term field values...
Not sure that was intended ...
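
A minimal sketch of Edward's suggestion (the field type and names below are
assumptions based on his message):

<field name="text_general" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="text_exact" dest="text_general"/>

q=restaurants dubai&defType=edismax&qf=text_exact^5 text_general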

Cheers


--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Wed, 22 Jan 2020 at 16:26, Edward Ribeiro 
wrote:

> Hi,
>
> One possible solution would be to create a second field (e.g.,
> text_general) that uses DefaultTokenizer, or other tokenizer that breaks
> the string into tokens, and use a copyField to copy the content from
> text_exact to text_general. Then, you can use edismax parser to search both
> fields, but giving text_exact a higher boost (qf=text_exact^5
> text_general). In this case, both fields should be indexed, but only one
> needs to be stored.
>
> Edward
>
> On Wed, Jan 22, 2020 at 10:34 AM Dhanesh Radhakrishnan  >
> wrote:
>
> > Hello,
> > I'm facing an issue with stemming.
> > My search query is "restaurant dubai" and returns  results.
> > If I search "restaurants dubai" it returns no data.
> >
> > How to stem this keyword "restaurant dubai" with "restaurants dubai" ?
> >
> > I'm using a text exact field for search.
> >
> >  > multiValued="true" omitNorms="false" omitTermFreqAndPositions="false"/>
> >
> > Here is the field definition
> >
> >  > positionIncrementGap="100">
> > 
> >
> >
> >
> >
> > 
> > 
> >   
> >   
> >   
> >   
> >
> > 
> >
> > Is there any solutions without changing the tokenizer class.
> >
> >
> >
> >
> > Dhanesh S.R
> >
> > --
> > IMPORTANT: This is an e-mail from HiFX IT Media Services Pvt. Ltd. Its
> > content are confidential to the intended recipient. If you are not the
> > intended recipient, be advised that you have received this e-mail in
> error
> > and that any use, dissemination, forwarding, printing or copying of this
> > e-mail is strictly prohibited. It may not be disclosed to or used by
> > anyone
> > other than its intended recipient, nor may it be copied in any way. If
> > received in error, please email a reply to the sender, then delete it
> from
> > your system.
> >
> > Although this e-mail has been scanned for viruses, HiFX
> > cannot ultimately accept any responsibility for viruses and it is your
> > responsibility to scan attachments (if any).
> >
> > ​Before you print this email
> > or attachments, please consider the negative environmental impacts
> > associated with printing.
> >
>


Re: Spell check with data from database and not from english dictionary

2020-01-23 Thread Alessandro Benedetti
Hi Seetesh,
As you can see from the wiki [1] there are mainly two input sources for a
spellcheck dictionary:
1) a file
2) the index (in a couple of different forms)

If you prefer the file approach, it's your call to produce the file and you
can certainly use whatever you like to fill the data.
It could be from the English dictionary or from a database.


[1] https://lucene.apache.org/solr/guide/8_4/spell-checking.html
--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Thu, 23 Jan 2020 at 06:06, seeteshh  wrote:

> Hello all,
>
> Can the spell check feature be configured with words/data fetched from a
> database and not from the English dictionary?
>
> Regards,
>
> Seetesh Hindlekar
>
>
>
> -
> Seetesh Hindlekar
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: [Apache Solr ReRanking] Sort Clauses Bug

2019-09-26 Thread Alessandro Benedetti
Personally I was expecting the sort request parameter to be applied on the
final search results:
1) run original query, get top K based on score
2) run re rank query on the top K, recalculate the scores
3) finally apply the sort

But when you mentioned "you expect the sort specified to be applied to both
the “outer” and “inner” queries",
I changed my mind, it is probably a better solution to give the user a nice
flexibility on controlling both the original query sort (to affect the top
K retrieval) and the final sort (the one sorting the reranked results).

*Currently the 'sort' global request parameter affects the way the top K
are retrieved, then they are re-ranked.*
Unfortunately the workaround you suggested through the local params of the
rerank query parser doesn't seem to work at all in 8.1.1 :(
Unless it was introduced in 8.2 I think it is a good idea to create the
jira issue, with this in mind:
1) we want to be able to decide the sort for both the original query(to
assess the top K) and the final results
2) we need to decide which request parameter should do what
e.g.
should the 'sort' request param affect *the original query* OR the final
results?
should the 'sort' in the local params of the reRank query parser affect
 the original query OR *the final results*?

In bold my personal preference, but I don't have any hard position in this
regard.

Cheers
--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Thu, Sep 26, 2019 at 5:23 PM Erick Erickson 
wrote:

> OK so to restate, you expect the sort specified to be applied to both the
> “outer” and “inner” queries. Makes sense, seems like a good enhancement.
>
> Hmm, I wonder if you can put the sort parameter in with the rerank
> specification, like: q={!rerank reRankQuery=$rqq reRankDocs=1200
> reRankWeight=3 sort="score desc, downloads desc”}
>
> That doesn’t address your initial point, just curious if it’d do as a
> workaround meanwhile.
>
> Best,
> Erick
>
>
> > On Sep 26, 2019, at 10:54 AM, Alessandro Benedetti 
> wrote:
> >
> > In the first OK scenario, the search results are sorted with score desc,
> > and when the score is identical, the secondary sort field is applied.
> >
> > In the KO scenario, only score desc is taken into consideration(the
> > reranked score) , the secondary sort by the sort field is ignored.
> >
> > I suspect an intuitive expected result would be to have the same
> behaviour
> > that happens with no reranking, so:
> > 1) sort of the final results by reranked score desc
> > 2) when identical raranked score, sort by secondat sort field
> >
> > Is it clearer?
> > Any wrong assumption?
> >
> >
> > On Thu, 26 Sep 2019, 14:34 Erick Erickson, 
> wrote:
> >
> >> Hmmm, can we see a bit of sample output? I always have to read this
> >> backwards, the outer query results are sent to the inner query, so my
> >> _guess_ is that the sort is applied to the “q=*:*” and then the top
> 1,200
> >> are sorted by score by the rerank. But then I’m often confused about
> this.
> >>
> >> Erick
> >>
> >>> On Sep 25, 2019, at 5:47 PM, Alessandro Benedetti <
> a.benede...@sease.io>
> >> wrote:
> >>>
> >>> Hi all,
> >>> I was playing a bit with the reranking capability and I discovered
> that:
> >>>
> >>> *Sort by score, then by secondary field -> OK*
> >>> http://localhost:8983/solr/books/select?q=vegeta ssj&*sort=score
> >>> desc,downloads desc*=id,title,score,downloads
> >>>
> >>> *ReRank, Sort by score, then by secondary field -> KO*
> >>> http://localhost:8983/solr/books/select?q=*:*={!rerank
> >> reRankQuery=$rqq
> >>> reRankDocs=1200 reRankWeight=3}=(vegeta ssj)&*sort=score
> >> desc,downloads
> >>> desc*=id,title,score,downloads
> >>>
> >>> Is this intended? It sounds counter-intuitive to me and I wanted to
> check
> >>> before opening a Jira issue
> >>> Tested on 8.1.1 but it should be in master as well.
> >>>
> >>> Regards
> >>> --
> >>> Alessandro Benedetti
> >>> Search Consultant, R&D Software Engineer, Director
> >>> www.sease.io
> >>
> >>
>
>


Re: [Apache Solr ReRanking] Sort Clauses Bug

2019-09-26 Thread Alessandro Benedetti
In the first OK scenario, the search results are sorted with score desc,
and when the score is identical, the secondary sort field is applied.

In the KO scenario, only score desc is taken into consideration(the
reranked score) , the secondary sort by the sort field is ignored.

I suspect an intuitive expected result would be to have the same behaviour
that happens with no reranking, so:
1) sort the final results by reranked score desc
2) when the reranked score is identical, sort by the secondary sort field

Is it clearer?
Any wrong assumption?


On Thu, 26 Sep 2019, 14:34 Erick Erickson,  wrote:

> Hmmm, can we see a bit of sample output? I always have to read this
> backwards, the outer query results are sent to the inner query, so my
> _guess_ is that the sort is applied to the “q=*:*” and then the top 1,200
> are sorted by score by the rerank. But then I’m often confused about this.
>
> Erick
>
> > On Sep 25, 2019, at 5:47 PM, Alessandro Benedetti 
> wrote:
> >
> > Hi all,
> > I was playing a bit with the reranking capability and I discovered that:
> >
> > *Sort by score, then by secondary field -> OK*
> > http://localhost:8983/solr/books/select?q=vegeta ssj&*sort=score
> > desc,downloads desc*=id,title,score,downloads
> >
> > *ReRank, Sort by score, then by secondary field -> KO*
> > http://localhost:8983/solr/books/select?q=*:*={!rerank
> reRankQuery=$rqq
> > reRankDocs=1200 reRankWeight=3}=(vegeta ssj)&*sort=score
> desc,downloads
> > desc*=id,title,score,downloads
> >
> > Is this intended? It sounds counter-intuitive to me and I wanted to check
> > before opening a Jira issue
> > Tested on 8.1.1 but it should be in master as well.
> >
> > Regards
> > --
> > Alessandro Benedetti
> > Search Consultant, R&D Software Engineer, Director
> > www.sease.io
>
>


Re: Need more info on MLT (More Like This) feature

2019-09-26 Thread Alessandro Benedetti
In addition to all the valuable information already shared, I am curious to
understand why you think the results are unreliable.
Most of the time it is the parameters that cause some of the terms of the
original document/corpus to be ignored (something as simple as the min/max
document frequency to consider, or the min term frequency in the source doc).

I have been working a lot on the MLT in the past years and presenting the
work done (and internals) at various conferences/meetups.

I'll share some slides and some Jira issues that may help you:

https://www.youtube.com/watch?v=jkaj89XwHHw&t=540s
https://www.slideshare.net/SeaseLtd/how-the-lucene-more-like-this-works

https://issues.apache.org/jira/browse/LUCENE-8326
https://issues.apache.org/jira/browse/LUCENE-7802
https://issues.apache.org/jira/browse/LUCENE-7498

Generally speaking I favour the MLT query parser: it builds the MLT query
and gives you the chance to inspect it using the debug query.
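
For instance (the field name and document id are placeholders), something like
this returns similar documents and, with debug enabled, the rewritten MLT
query:

q={!mlt qf=description mintf=1 mindf=2 maxqt=10}SOME_DOC_ID&debug=results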



-----
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


[Apache Solr ReRanking] Sort Clauses Bug

2019-09-25 Thread Alessandro Benedetti
Hi all,
I was playing a bit with the reranking capability and I discovered that:

*Sort by score, then by secondary field -> OK*
http://localhost:8983/solr/books/select?q=vegeta ssj&*sort=score
desc,downloads desc*&fl=id,title,score,downloads

*ReRank, Sort by score, then by secondary field -> KO*
http://localhost:8983/solr/books/select?q=*:*&rq={!rerank reRankQuery=$rqq
reRankDocs=1200 reRankWeight=3}&rqq=(vegeta ssj)&*sort=score desc,downloads
desc*&fl=id,title,score,downloads

Is this intended? It sounds counter-intuitive to me and I wanted to check
before opening a Jira issue
Tested on 8.1.1 but it should be in master as well.

Regards
------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


Re: MLT - unexpected design choice

2019-01-29 Thread Alessandro Benedetti
Hi Maria,
this is actually a great catch!
I have been working a lot on the More Like This and this mistake never
caught my attention.

I agree with you, feel free to open a Jira Issue.

First of all what you say, makes sense.
Secondly it is the way it is the standard way used in the similarity Lucene
calculations :








public Explanation idfExplain(CollectionStatistics collectionStats,
                              TermStatistics termStats) {
  final long df = termStats.docFreq();
  final long docCount = collectionStats.docCount();
  final float idf = idf(df, docCount);
  return Explanation.match(idf,
      "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
      Explanation.match(df, "docFreq, number of documents containing term"),
      Explanation.match(docCount,
          "docCount, total number of documents with field"));
}

Indeed the "int numDocs = ir.numDocs();" should actually be allocated
per term in the for loop, using the field stats, something like:

numDocs = ir.getDocCount(fieldName)

Feel free to open the Jira issue and attach a patch with at least a
testCase that shows the bugfix.

I will be available for doing the review.


Cheers

------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Tue, Jan 29, 2019 at 11:41 AM Matt Pearce  wrote:

> Hi Maria,
>
> Would it help to add a filter to your query to restrict the results to
> just those where the description field is populated? Eg. add
>
> fq=description:[* TO *]
>
> to your query parameters.
>
> Apologies if I'm misunderstanding the problem!
>
> Best,
>
> Matt
>
>
> On 28/01/2019 16:29, Maria Mestre wrote:
> > Hi all,
> >
> > First of all, I’m not a Java developer, and a SolR newbie. I have worked
> with Elasticsearch for some years (not contributing, just as a user), so I
> think I have the basics of text search engines covered. I am always
> learning new things though!
> >
> > I created an index in SolR and used more-like-this on it, by passing a
> document_id. My data has a special feature, which is that one of the fields
> is called “description” but is only populated about 10% of the time. Most
> of the time it is empty. I am using that field to query similar documents.
> >
> > So I query the /mlt endpoint using these parameters (for example):
> >
> > {q=id:"0c7c4d74-0f37-44ea-8933-cd2ee7964457”,
> > mlt=true,
> > mlt.fl=description,
> > mlt.mindf=1,
> > mlt.mintf=1,
> > mlt.maxqt=5,
> > wt=json,
> > mlt.interestingTerms=details}
> >
> > The issue I have is that when retrieving the key scored terms
> (interestingTerms), the code uses the total number of documents in the
> index, not the total number of documents with populated “description”
> field. This is where it’s done in the code:
> https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651
> >
> > The effect of this choice is that the “idf” does not vary much, given
> that numDocs >> number of documents with “description”, so the key terms
> end up being just the terms with the highest term frequencies.
> >
> > It is inconsistent because the MLT-search then uses these extracted key
> terms and scores all documents using an idf which is computed only on the
> subset of documents with “description”. So one part of the MLT uses a
> different numDocs than another part. This sounds like an odd choice, and
> not expected at all, and I wonder if I’m missing something.
> >
> > Best,
> > Maria
> >
> >
> >
> >
> >
> >
>
> --
> Matt Pearce
> Flax - Open Source Enterprise Search
> www.flax.co.uk
>


Re: Question about elevations

2018-11-19 Thread Alessandro Benedetti
As far as I remember the answer is no.
You could take a deep look into the code, but as far as I remember the
elevated doc Ids must be in the index to be elevated.
Those ids will be added to the query that is built, a sort of server-side
query expansion, and then the search is executed.

Cheers





-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: AW: Solr suggestions, best practices

2018-11-19 Thread Alessandro Benedetti
I have done extensive work on auto suggestion, some additional resource from
my company blog :

https://sease.io/2015/07/solr-you-complete-me.html
<https://sease.io/2015/07/solr-you-complete-me.html>  

https://sease.io/2018/06/apache-lucene-blendedinfixsuggester-how-it-works-bugs-and-improvements.html
<https://sease.io/2018/06/apache-lucene-blendedinfixsuggester-how-it-works-bugs-and-improvements.html>
  

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Restrict search on term/phrase count in document.

2018-11-19 Thread Alessandro Benedetti
I agree with Alexandre, it seems suspicious.
Anyway, if you want to query on the occurrence count of a single term you
could make use of the function range query parser:

https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-FunctionRangeQueryParser

And the function:

termfreq
Returns the number of times the term appears in the field for that document.
termfreq(text,'memory')

tf
Term frequency; returns the term frequency factor for the given term, using
the Similarity for the field. The tf-idf value increases proportionally to
the number of times a word appears in the document, but is offset by the
frequency of the word in the corpus, which helps to control for the fact
that some words are generally more common than others. See also idf.
tf(text,'solr')
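
For instance, to restrict results to documents where the term 'memory' occurs
at least twice in the text field (field and term are just examples):

q=*:*&fq={!frange l=2}termfreq(text,'memory')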

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Phrase query as feature in LTR not working

2018-11-19 Thread Alessandro Benedetti
Hi AshB, from what I see, this is the expected behavior.

You pass this efi to your "isPook" feature : efi.query=thrones%20of%20game*.
Then you calculate:

{ 
"name" : "isPook", 
"class" : "org.apache.solr.ltr.feature.SolrFeature", 
"params" : { 
  "fq": ["{!type=edismax qf=*text* v=$qq}=\"${query}\""] 
} 
  } 

Given the document titles, it seems incorrect, but what about the document
text ?
Furthermore, if you are interested in exact phrase match, I would first go
with :

https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-FieldQueryParser

and then play with the following if more advanced phrase querying is
needed:

https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-ComplexPhraseQueryParser
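
As a quick reference (field name and phrases are only examples):

q={!field f=title}game of thrones
   -> analyses the whole input and builds a phrase query on the title field

q={!complexphrase inOrder=true}title:"game of thron*"
   -> phrase query that also supports wildcards inside the phrase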

Cheers




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Scores with Solr Suggester

2018-07-04 Thread Alessandro Benedetti
Hi Christine,
it depends on the suggester implementation; the one that got closest to
having a score implementation is the BlendedInfix[1], but it is still in the
TO DO phase.
Feel free to contribute it if you like!

[1]
https://sease.io/2018/06/apache-lucene-blendedinfixsuggester-how-it-works-bugs-and-improvements.html



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr 7 MoreLikeThis boost calculation

2018-06-29 Thread Alessandro Benedetti
Hi Jesse,
you are correct, the variable 'bestScore' used in the
createQuery(PriorityQueue q) should be "minScore".

It is used to normalise the term scores:
tq = new BoostQuery(tq, boostFactor * myScore / bestScore);
e.g.

Queue -> Term1:100 , Term2:50, Term3:20, Term4:10

The minScore will be 10 and the normalised score will be :
Term1:10 , Term2:5, Term3:2, Term4:1

These values will be used to build the boost term queries.

I see no particular problem with that.
What is your concern ?



-
-------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr sort by score not working properly

2018-06-22 Thread Alessandro Benedetti
Hi,
if you add to the request the param : debugQuery=on you will see what
happens under the hood and understand how the score is assigned.

If you are new to the Lucene Similarity that Solr version uses ( BM25[1])
you can paste here the debug score response and we can briefly explain it to
you the first time.

First of all, we are not even sure the content field is actually used for
scoring in your case; if it is, and it is the only field used, the difference
may be related to the field length (but that would be suspicious, as the
fields are quite similar in length in your example).
Are you sorting by score for any reason?
It's been a while since I checked, but I doubt you get any benefit over the
default (which ranks by score).

So I recommend you to send here the debug response and then possibly your
select request handler config.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to split index more than 2GB in size

2018-06-20 Thread Alessandro Benedetti
Hi,
in the first place, why do you want to split a 2 GB index?
Nowadays that is a fairly small index.

Secondly, what you reported is incomplete.
I would expect a "Caused by" section in the stacktrace.

These are generic recommendations; always spend time analysing the problem
you had scrupulously:
- SolrCloud problems often involve more than one node. Be sure to check the
logs of all the nodes possibly involved.
- Report the full stack trace to the community
- Report your full request which provoked the exception

Help is much easier this way :)

Regards




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr 6.5 autosuggest suggests misspelt words and unwanted words

2018-06-20 Thread Alessandro Benedetti
Hi,
you should curate your data, that is fundamental to have a healthy search
solution, but let's see what you can do anyway:

1) curate a dictionary of such bad words and then configure the analysis to
skip them
2) Have you tried different dictionary implementations? I would assume that
each misspelled word has a low document frequency. You could use the
HighFrequencyDictionaryFactory[1] and see how it goes (a configuration sketch
follows below).


[1]
https://lucene.apache.org/solr/guide/7_3/suggester.html#highfrequencydictionaryfactory
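
A configuration sketch (suggester name, field and threshold value are just
examples); the threshold is the minimum fraction of documents a term must
appear in to enter the dictionary:

<lst name="suggester">
  <str name="name">mySuggester</str>
  <str name="lookupImpl">FuzzyLookupFactory</str>
  <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
  <float name="threshold">0.005</float>
  <str name="field">title</str>
</lst>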



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to exclude certain values in multi-value field filter query

2018-06-19 Thread Alessandro Benedetti
The first idea that comes to my mind is to build a single-valued copy field
which concatenates the values.
In this way you will have very specific values to filter on:

query1 -(copyfield:(A B AB))

To concatenate you can use this update request processor:
https://lucene.apache.org/solr/6_6_0//solr-core/org/apache/solr/update/processor/ConcatFieldUpdateProcessorFactory.html
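
A rough sketch of the update chain (field names are placeholders); the original
multi-valued field is cloned first so it stays untouched:

<updateRequestProcessorChain name="concat-values" default="true">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">tags</str>
    <str name="dest">tags_concat</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">tags_concat</str>
    <str name="delimiter"> </str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>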

Regards




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solrj does not support ltr ?

2018-06-19 Thread Alessandro Benedetti
Pretty sure you can't.
As far as I know there is no client-side implementation to help with managed
resources in general.
Any contribution is welcome!



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Achieving AutoComplete feature using Solrj client

2018-06-18 Thread Alessandro Benedetti
Indeed, you first configure it in the solrconfig.xml ( manually).

Then you can query and parse the response as you like with the SolrJ client
library.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Achieving AutoComplete feature using Solrj client

2018-06-18 Thread Alessandro Benedetti
Hi,
Tommaso and I contributed this a few years ago. [1]
You can now easily get the suggester response from the Solr response; a SolrJ
sketch follows the references below.
Of course you need to configure and enable the suggester first. [2][3][4]


[1] https://issues.apache.org/jira/browse/SOLR-7719
[2] https://sease.io/2015/07/solr-you-complete-me.html
[3] https://lucidworks.com/2015/03/04/solr-suggester/
[4]
https://sease.io/2018/06/apache-lucene-blendedinfixsuggester-how-it-works-bugs-and-improvements.html
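
With SolrJ the response parsing looks roughly like this (handler name,
dictionary name and collection are assumptions based on a typical setup):

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SuggesterResponse;

SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build();
SolrQuery query = new SolrQuery();
query.setRequestHandler("/suggest");            // the suggest handler configured in solrconfig.xml
query.setParam("suggest", "true");
query.setParam("suggest.dictionary", "mySuggester");
query.setParam("suggest.q", "sol");

QueryResponse response = client.query("mycollection", query);
SuggesterResponse suggesterResponse = response.getSuggesterResponse();
// map of dictionary name -> suggested terms
Map<String, List<String>> suggestions = suggesterResponse.getSuggestedTerms();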



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr Suggest Component and OOM

2018-06-14 Thread Alessandro Benedetti
I didn't get any answer to my questions (unless you meant you have 25
million different values for those fields ...).
Please read my answer again and elaborate further.
Does your problem happen for the 2 different suggesters?

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Logging Every document to particular core

2018-06-14 Thread Alessandro Benedetti
Isn't the Transaction Log what you are looking for ?

Read this good blog post as a reference :
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Changing Field Assignments

2018-06-12 Thread Alessandro Benedetti
On top of that, I would not recommend using the schemaless mode in
production.
That mode is useful for experimenting and prototyping, but with a managed
schema you have much more control over a production instance.

Regards



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr Suggest Component and OOM

2018-06-12 Thread Alessandro Benedetti
Hi,
first of all the two different suggesters you are using are based on
different data structures ( with different memory utilisation) :

- FuzzyLookupFactory -> FST ( in memory and stored binary on disk)
- AnalyzingInfixLookupFactory -> Auxiliary Lucene Index

Both data structures should be very memory efficient (both in building
and storage).
What is the cardinality of the fields you are building suggestions from
(site_address and site_address_other)?
What is the memory situation in Solr when you start building the suggester?
You are allocating much more memory to the Solr JVM process than you are
leaving to the OS (which, in your situation, means the OS cache cannot fit
the entire index, the ideal scenario).

I would recommend putting some monitoring in place (there are plenty of open
source tools to do that).

Regards



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to find out which search terms have matches in a search

2018-06-12 Thread Alessandro Benedetti
I would recommend looking into the Highlighting feature [1].
There are a few implementations and they should all be fine for your user
requirement.
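
For example (field names are placeholders), a request like:

q=solar panels&hl=true&hl.fl=title,content&hl.snippets=2

returns a "highlighting" section per document, showing which of the query terms
matched and in which fields.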

Regards

[1] https://lucene.apache.org/solr/guide/7_3/highlighting.html



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Difference in fieldLengh and avgFieldLength in Solr 6.6 vs Solr 7.1

2018-06-08 Thread Alessandro Benedetti
A shot in the dark, I have not double-checked in detail, but:

With Solr 7.x
"Index-time boosts have been removed from Lucene, and are no longer
available from Solr. If any boosts are provided, they will be ignored by the
indexing chain. As a replacement, index-time scoring factors should be
indexed in a separate field and combined with the query score using a
function query. See the section Function Queries for more information."

Are you using index-time boosts by any chance?
If I remember correctly, the norms stored in the segment were affected by both
the field length and the index-time boost.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: BlendedInfixSuggester wiki errata corrige

2018-06-06 Thread Alessandro Benedetti
Hi Cassandra,
thanks for your reply.
I did the fix in the official documentation as part of the bugfix I am
working on:

LUCENE-8343 
<https://issues.apache.org/jira/browse/LUCENE-8343> 

Any feedback is welcome !

Cheers






-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: BlendedInfixSuggester wiki errata corrige

2018-06-05 Thread Alessandro Benedetti
Errata corrige to my Errata corrige post :

e.g. 

Position of first match  |  0  |  1   |   2    |  3
Linear                   |  1  |  0.9 |  0.8   |  0.7
Reciprocal               |  1  |  1/2 |  1/3   |  1/4
Exponential Reciprocal   |  1  |  1/4 | *1/9*  |  1/16



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


BlendedInfixSuggester wiki errata corrige

2018-06-05 Thread Alessandro Benedetti
Hi all,
I have been working quite a bit on the BlendedInfixSuggester :
- to fix a bug :  LUCENE-8343
<https://issues.apache.org/jira/browse/LUCENE-8343>  
- to bring an improvement :  LUCENE-8347
<https://issues.apache.org/jira/browse/LUCENE-8347>  

I was reviewing the wiki documentation for the BlendedInfixSuggester[1].

This bit is incorrect or at least confusing :

"position_linear
weightFieldValue * (1 - 0.10*position): Matches to the start will be given a
higher score. This is the default.

position_reciprocal
weightFieldValue / (1 + position): *Matches to the end will be given a
higher score*.

exponent
An optional configuration variable for position_reciprocal to control how
fast the score will increase or decrease. Default 2.0."

1) the *position_exponential_reciprocal* blenderType is missing (it is the
one the "exponent" applies to)

2) It is not true that position_reciprocal gives higher scores to
matches at the end of a suggestion.
All the blenderTypes boost matches at the beginning of the suggestion; the
only difference is how fast the score of such terms decays with the position:

e.g.

Position of first match  |  0  |  1   |  2   |  3
Linear                   |  1  |  0.9 |  0.8 |  0.7
Reciprocal               |  1  |  1/2 |  1/3 |  1/4
Exponential Reciprocal   |  1  |  1/4 |  1/8 |  1/16

I would be grateful if anyone can fix the documentation.

Cheers

[1]
https://lucene.apache.org/solr/guide/7_3/suggester.html#blendedinfixlookupfactory



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr 7.3 suggest dictionary building fails in cloud mode with large number of rows

2018-06-05 Thread Alessandro Benedetti
In addition to what Erick and Walter correctly mentioned :

"heap usage varies from 5 gb to 12 gb . Initially it was 5 gb then increased 
to 12 gb gradually and decreasing to 5 gb again. (may be because of garbage 
collection) 
10-12 GB maximum  heap uses, allocated is 50 GB. "

Did I read that right?
Are the 50 GB allocated to the physical/virtual machine where Solr is running,
or to the Solr JVM?
If it is the former, that is fine; the latter is considered bad practice unless
you really need all that heap for your Solr process (which is extremely
unlikely).

You need to leave memory to the OS for memory mapping (which is heavily used
by Solr).
With such a big heap, your GC may indeed end up in long pauses.
It is recommended to allocate to the Solr process as little heap as possible
(according to your requirements).

Regards



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr 7.3 suggest dictionary building fails in cloud mode with large number of rows

2018-06-04 Thread Alessandro Benedetti
Hi Yogendra,
you mentioned you are using SolrCloud.
In SolrCloud an investigation is not isolated to a single Solr log: you
see a timeout, so I would recommend checking both the nodes involved.

When you say : " heap usage is around 10 GB - 12 GB per node.", do you refer
to the effective usage by the Solr JVM or the allocated heap ?
Are you monitoring the memory utilisation for your Solr nodes ?
Are Garbage Collection cycles behaving correctly ?
When a timeout occurs, something bad happened in the communication between
the Solr nodes.
It could be the network, but in your case it may be a stop-the-world situation
caused by GC.




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Update Solr Document

2018-06-01 Thread Alessandro Benedetti
There is no quick answer, it really depends on a lot of factors...
*TL;DR* : Updating a single document field will likely take more time in a
bigger collection.

*Partial Document Update*
First of all, which field are you updating ?
Depending on the type and attributes you may end up in different
scenarios[1].
For example, an in-place update would be much more convenient and less
expensive, as it will not end up writing a new document in the index.
Vice versa, a normal atomic update will cause an internal delete/re-index of
the doc.
What happens next will depend on the commit policies (or, in case you
saturated the internal RAM buffer, the content of the segment will be
flushed).
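
A minimal sketch of an atomic update (collection and field names are
placeholders); only the listed field is changed and the document is internally
re-indexed from its stored fields:

curl http://localhost:8983/solr/mycollection/update -H 'Content-Type: application/json' -d '
[{ "id": "doc-1", "popularity": { "set": 42 } }]'

For this to be an in-place update instead, the target field must be a
single-valued, docValues-only numeric field (non-indexed, non-stored), as
described in [1].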

*Solr Commit Policies*
In Solr there is the concept of soft and hard commit.
A soft commit is cheaper: it grants visibility, warms up the caches, and does
minimal (potentially no) disk writing.
A hard commit will, in addition, flush the current segment to disk (which
brings all the background operations that Emir pointed out).
Help yourself with this great classic from Erick [2].
*Warming the caches* will take more time in a bigger collection (as the
queries will be executed on a bigger index).
*Merging the segments* in the background, if it is triggered, will take more
time in a bigger collection.


[1] 
https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html#UpdatingPartsofDocuments-In-PlaceUpdates
<https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html#UpdatingPartsofDocuments-In-PlaceUpdates>
  
[2]  understanding-transaction-logs-softcommit-and-commit-in-sorlcloud
<https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/>
  



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Weird behavioural differences between pf in dismax and edismax

2018-05-30 Thread Alessandro Benedetti
Question in general for the community :
what is the dismax capable of doing that the edismax is not?
Is it really necessary to keep both of them, or could the dismax be
deprecated?

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: solr-extracting features values

2018-05-30 Thread Alessandro Benedetti
The current feature extraction implementation in Solr is oriented to the
Learning To Rank re-ranking capability; it is not built for bulk feature
extraction (to then train your model).

I am afraid you will need to implement your own system that runs multiple
queries against Solr with feature extraction enabled and then parses the
results to build your training set.
Do you have query-level or query-dependent features?
In case you are lucky enough to have only document-level features, you may
end up in a slightly simplified scenario.
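
For reference, the per-document feature values come back through the LTR
document transformer, roughly like this (the feature store name and efi
parameter are assumptions):

q=some query&fl=id,score,[features store=myFeatureStore efi.user_query=some query]

Each returned document then carries a "[features]" entry with the computed
feature vector, which your external script can parse and accumulate.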

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Weird behavioural differences between pf in dismax and edismax

2018-05-29 Thread Alessandro Benedetti
I don't have any hard position on this; it's OK not to build a phrase boost
if the input query is 1 term and it remains one term after the analysis for
one of the pf fields.

But if the term produces multiple tokens after query-time analysis, I do
believe that building a phrase boost would be the correct interpretation
(e.g. wi-fi with a query-time analyser which splits on '-').

Cheers







-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Weird behavioural differences between pf in dismax and edismax

2018-05-29 Thread Alessandro Benedetti
In my opinion, given the definition of the dismax and edismax query parsers,
they should behave the same for the parameters they have in common.
To be a little bit extreme, I don't think we need the dismax query parser at
all anymore (in the end edismax only offers more than the dismax).

Finally, I do believe that even if the query is a single term (before or
after the analysis for a pf field) it should boost the phrase anyway.
A phrase of 1 word is still a phrase, isn't it?





-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Could not load collection from ZK:

2018-05-24 Thread Alessandro Benedetti
hi Aman,
I had similar issues in the past and the reason was attributed to :

SOLR-8868 <https://issues.apache.org/jira/browse/SOLR-8868>  

Which unfortunately is not solved yet.

Did you manage to find a different cause in your case?

hope that helps.

Regards



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Debugging/scoring question

2018-05-23 Thread Alessandro Benedetti
Hi Mariano,
From the documentation:

docCount = total number of documents containing this field, in the range [1
.. {@link #maxDoc()}]

In your debug output the fields involved in the score computation are indeed
different (nomUsageE, prenomE).

Does this make sense ?

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-18 Thread Alessandro Benedetti
Hi Markus,
can you show all the query parameters used when submitting the request to
the request handler?
Can you also include the parsed query (from the debug output)?

I am curious to investigate this case.

Cheers

--
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
www.sease.io

On Thu, May 17, 2018 at 10:53 PM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hello,
>
> And sorry to disturb again. Does anyone of you have any meaningful opinion
> on this peculiar matter? The RemoveDuplicates filter exists for a reason,
> but with query-time KeywordRepeat filter it causes trouble in some cases.
> Is it normal for the clauses to be absent in the debug output, but the
> boost doubled in value?
>
> I like this behaviour, but is it a side effect that is considered a bug in
> later versions? And where is the documentation in this. I cannot find
> anything in the Lucene or Solr Javadocs, or the reference manual.
>
> Many thanks, again,
> Markus
>
>
>
> -Original message-
> > From:Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Wednesday 9th May 2018 17:39
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Multiple languages, boosting and, stemming and KeywordRepeat
> >
> > Hello,
> >
> > First, apologies for the weird subject line.
> >
> > We index many languages and search over all those languages at once, but
> boost the language of the user's preference. To differentiate between
> stemmed tokens and unstemmed tokens we use KeywordRepeat and
> RemoveDuplicates, this works very well.
> >
> > However, we just stumbled over the following example, q=australia is not
> stemmed in English, but its suffix is removed by the Romanian stemmer,
> causing the Romanian results to be returned on top of English results,
> despite language boosting.
> >
> > This is because the Romanian part of the query consists of the stemmed
> and unstemmed version of the word, but the English part of the query is
> just one clause per field (title, content etc). Thus the Romanian results
> score roughtly twice that of English results.
> >
> > Now, this is of course really obvious, but the 'solution' is not. To
> work around the problem i removed the RemoveDuplicates filter so i get two
> clauses for English as well, really ugly but it works. What i don't
> understand is the debug output, it doesn't list two identical clauses,
> instead, it doubled the boost on the field, so instead of:
> >
> > 27.048403 = PayloadSpanQuery, product of:
> >   27.048403 = weight(title_en:australia in 15850)
> [SchemaSimilarity], result of:
> > 27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> > ), product of:
> >   7.4 = boost
> >   3.084852 = idf(docFreq=14539, docCount=317894)
> >   1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1
> * (1 - b + b * fieldLength / avgFieldLength)) from:
> > 4.0 = phraseFreq=4.0
> > 0.3 = parameter k1
> > 0.5 = parameter b
> > 15.08689 = avgFieldLength
> > 24.0 = fieldLength
> >   1.0 = AveragePayloadFunction.docScore()
> >
> > I now get
> >
> > 54.096806 = PayloadSpanQuery, product of:
> >   54.096806 = weight(title_en:australia in 15850)
> [SchemaSimilarity], result of:
> > 54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> > ), product of:
> >   14.8 = boost
> >   3.084852 = idf(docFreq=14539, docCount=317894)
> >   1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1
> * (1 - b + b * fieldLength / avgFieldLength)) from:
> > 4.0 = phraseFreq=4.0
> > 0.3 = parameter k1
> > 0.5 = parameter b
> > 15.08689 = avgFieldLength
> > 24.0 = fieldLength
> >   1.0 = AveragePayloadFunction.docScore()
> >
> > So instead of expecting two clauses in the debug, i get one but with a
> doubled boost.
> >
> > The question is, is this supposed to be like this?
> >
> > Also, are there any real solutions to this problem? Removing the
> RemoveDuplicats filter looks really silly.
> >
> > Many thanks!
> > Markus
> >
>


Re: Regarding LTR feature

2018-05-17 Thread Alessandro Benedetti
"FQ_filter were 365 but below in the 
debugging part the docfreq used in the payload_score calculation was 
3360" 

If you are talking about the document frequency of a term, this is obviously
corpus based (necessary for the TF/IDF calculations), so it will not be
affected by the filter queries.
The payload score part may be different.

Anyway, you mentioned that you assign the weights yourself; in that case the
Learning To Rank plugin may not be necessary at all.

Regards




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to implement Solr auto suggester and spell checker simultaneously on a single search box

2018-05-17 Thread Alessandro Benedetti
Hi Sonal,
if you want to go with a plain Solr suggester, what about the
FuzzyLookupFactory?

1) it supports fuzzy matching (spellcheck-like behaviour)
2) it supports autocomplete
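A minimal configuration sketch in solrconfig.xml could look like this (the
suggester name, field and analyzer field type are illustrative):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">fuzzySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>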

If you want the context filtering as well, unfortunately the FST based Solr
suggesters don't support this feature.

In that case I would recommend building your own autocompletion service on a
dedicated Lucene index (to make it simple, you could define an ad hoc Solr
collection).

Then, at query time, when a query doesn't return results you may want to
execute a fuzzy query (to provide the spellcheck functionality), or just run
the spellcheck response collation from the main query.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Date Query Confusion

2018-05-17 Thread Alessandro Benedetti
Hi Terry,
let me take these in order:

/"Tried creation_date: 2016-11.  That's supposed to match 
documents with any November 2016 date.  But actually produces:  
|"Invalid Date String:'2016-11'| "/

Is *DateRangeField* the field type for your field "creation_date"? [1]
You mentioned org.apache.solr.schema.TrieDateField: that is not going to
work, you need the DateRangeField type I mentioned to use that date range
syntax.

/"||And Solr doesn't seem to let me sort on a date field.  Tried 
creation_date asc  Produced: |"can not sort on multivalued field: 
creation_date"| "/

Is your "creation_date" field single valued?
If it is single valued semantically, make sure it is defined as single
valued in the schema.
Solr doesn't support sorting on multi-valued fields.
Your schemaless configuration may have assigned the multiValued attribute to
that field.
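A sketch of what the schema could look like for both needs (field names are
illustrative; the DateRangeField handles the partial-date syntax, while a
separate single-valued, docValues-enabled date field is what you would sort
on):

<fieldType name="dateRange" class="solr.DateRangeField"/>
<field name="creation_date" type="dateRange" indexed="true" stored="true"
       multiValued="false"/>

<!-- assuming a "date" field type (e.g. TrieDateField), populated at index
     time with the full timestamp and used only for sorting -->
<field name="creation_date_sort" type="date" indexed="false" stored="false"
       docValues="true" multiValued="false"/>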

From the Wiki[2]:
"Solr can sort query responses according to document scores or the value of
any field with a single value that is either indexed or uses DocValues (that
is, any field whose attributes in the Schema include multiValued="false" and
either docValues="true" or indexed="true" – if the field does not have
DocValues enabled, the indexed terms are used to build them on the fly at
runtime), provided that:"

Hope this helps,

Regards



[1]
https://lucene.apache.org/solr/guide/6_6/working-with-dates.html#WorkingwithDates-DateRangeFormatting
[2]
https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html#CommonQueryParameters-ThesortParameter



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Regarding LTR feature

2018-05-09 Thread Alessandro Benedetti
So Prateek :

"You're right it doesn't have to be that accurate to the query time but our
requirement is having a more solid control over our outputs from Solr like
if we have 4 features then we can adjust the weights giving something like
(40,20,20,20) to each feature such that the sum total of features for a
document is 100 this is only possible if we could scale the feature outputs
between 0-1."
You are talking about weights so I assume you are using a linear Learning To
Rank model.
Which library are you using to train your model?
Does this library allow you to limit the summation of the linear weights
and normalise the training set per feature?

At query time LTR will just apply the model weights to the query time
feature vector.
It makes sense to normalise each query time feature using the training time
values.
They should be close enough to the training set values (if not, the model is
going to perform poorly anyway and you need to curate the training phase a
little bit better).
Remember the model is used to give an order to the results, not to make an
accurate regression prediction.


"Secondly, I also have a doubt regarding the scaling function like why it is
not considering only the documents filtered out by the FQ filter and
considering all the documents which match the query."

At the moment I would not focus on that scenario: I am not convinced the
LTR SolrFeature is compatible with that complex function query, and I am not
convinced it is going to be performance friendly anyway.
I would need to investigate that properly.

Regards



-----
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Autocomplete returning shingles

2018-05-04 Thread Alessandro Benedetti
Yes, faceting will work; you can use an old approach used for
autocompletion [1].
Be sure you add the shingle filter to the appropriate index-time analysis
for the field you want.
Facet values are extracted from the indexed terms, so calculating facets
and filtering by prefix should do the trick.
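A minimal sketch (field name, type name and query values are illustrative):
shingles produced at index time on a dedicated field, then faceting on it
with a prefix filter:

<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

q=*:*&rows=0&facet=true&facet.field=title_shingled&facet.prefix=apple ip&facet.limit=10&facet.mincount=1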

[1]
https://solr.pl/en/2013/03/25/autocomplete-on-multivalued-fields-using-faceting/



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Regarding LTR feature

2018-05-04 Thread Alessandro Benedetti
Hi Prateek,
I would assume you have that feature at training time as well; can't you use
the training set to establish the parameters for the normaliser at query
time?

In the end, being a normalisation, it doesn't have to be perfectly accurate
with respect to the query-time state, but it must reflect the relations the
model learnt from the training set.
Let me know!



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Regarding LTR feature

2018-05-03 Thread Alessandro Benedetti
Mmmm, first of all, you know that each Solr feature is calculated per
document, right?
So you want to calculate the payload score for the document you are
re-ranking, based on the query (your External Feature Information), and
normalise it across the different documents?

I would go with this feature and use the normalization LTR functionality :

{
  "store" : "my_feature_store",
  "name" : "in_aggregated_terms",
  "class" : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "q" : "{!payload_score f=aggregated_terms func=max v=${query}}" }
}

Then in the model you specify something like:

{
  "name" : "myModelName",
  "features" : [
    { "name" : "isBook" },
    ...
    {
      "name" : "in_aggregated_terms",
      "norm" : {
        "class" : "org.apache.solr.ltr.norm.MinMaxNormalizer",
        "params" : { "min" : "x", "max" : "y" }
      }
    }
  ],
  ...
}

Give it a try, let me know




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Autocomplete returning shingles

2018-05-03 Thread Alessandro Benedetti
So, your problem is you want to return shingle suggestions from a field in
input but apply multiple filter queries to the documents you want to fetch
suggestions from.

Are you building an auxiliary index for that ?
You need to design it accordingly.
If you want to map each suggestion to a single document in the auxiliary
index, when you build this auxiliary index you need to calculate the
shingles client side and push multiple documents (one per suggestion) for
each original field content.

To do that automatically in Solr I was thinking you could write an
UpdateRequestProcessor that, given the input document, splits it into
multiple docs, but unfortunately the current architecture of
UpdateRequestProcessors takes 1 doc in input and returns just 1 doc in
output.
So it is not a viable approach.

Unfortunately the shingle filter doesn't help here, as you want shingles in
the output (analysers don't affect stored content).

Cheers




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Regarding LTR feature

2018-04-30 Thread Alessandro Benedetti
Hi Prateek,
with query and FQ Solr is expected to score a document only if that document
is a match of all the FQ results intersected with the query results [1].
Then re-ranking happens, so effectively, only the top K intersected
documents will be re-ranked.

If you are curious about the code, this can be debugged running a variation
of org.apache.solr.ltr.TestLTRWithFacet#testRankingSolrFacet (introducing
filter queries ) and setting the breakpoint somewhere around :
org/apache/solr/ltr/LTRRescorer.java:181

Can you elaborate on how you verified that it is currently not working like
that?
I am familiar with the LTR code and I would be surprised to see a different
behaviour.

[1] https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to create a solr collection providing as much searching flexibility as possible?

2018-04-30 Thread Alessandro Benedetti
Hi Raymond,
as Charlie correctly stated, the input format is not that important, what is
important is to focus on your requirements and properly design a
configuration and data model to solve them.

Extracting the information from such a data format is not going to be
particularly challenging (as I assume you know the semantics of such a
structure).
You need to build your Solr documents according to the set of features you
want to expose.
Designing fields and field types will be fundamental to reach the search
flexibility you are looking for.

e.g.
*Feature*: expose a fast range search on a numerical field (Int)
*Implementation* : 
[1] 
IntPointField
Integer field (32-bit signed integer). This class encodes int values using a
"Dimensional Points" based data structure that allows for very efficient
searches for specific values, or ranges of values. For single valued fields,
docValues="true" must be used to enable sorting.
[2]
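
A minimal sketch of what that could look like (the field name is
illustrative, assuming a "pint" field type backed by solr.IntPointField as in
the default schema):

<field name="quantity" type="pint" indexed="true" stored="true" docValues="true"/>

q=quantity:[10 TO 100]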

Regards

[1]
https://lucene.apache.org/solr/guide/7_3/field-types-included-with-solr.html
[2]
https://lucene.apache.org/solr/guide/7_3/the-standard-query-parser.html#range-searches



-----
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to create a solr collection providing as much searching flexibility as possible?

2018-04-28 Thread Alessandro Benedetti
Hi Raymond,
your requirements are quite vague, Solr offers you those capabilities but
you need to model your configuration and data accordingly.

https://lucene.apache.org/solr/guide/7_3/solr-tutorial.html
is a good starting point.
After that you can study your requirements and design the search solution
accordingly.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Search Analytics Help

2018-04-27 Thread Alessandro Benedetti
Michal,
Doug was referring to an open source solution ready out of the box and just
pluggable (a sort of plug and play).
Of course you can implement your own solution, and using ELK or Kafka is
absolutely a valid option.

Cheers


--
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
www.sease.io

On Fri, Apr 27, 2018 at 10:21 AM, Michal Hlavac <m...@hlavki.eu> wrote:

> Hi,
>
> you have plenty options. Without any special effort there is ELK. Parse
> solr logs with logstash, feed elasticsearch with data, then analyze in
> kibana.
>
> Another option is to send every relevant search request to kafka, then you
> can do more sophisticated data analytic using kafka-stream API. Then use
> ELK to feed elasticsearch with logstash kafka input plugin. For this
> scenario you need to do some programming. I`ve already created this
> component but I hadn't time to publish it.
>
> Another option is use only logstash to feed e.g. graphite database and
> show results with grafana or combine all these options.
>
> You can also monitor SOLR instances by JMX logstash input plugin.
>
> Really don't understand what do you mean by saying that there is nothing
> satisfactory.
>
> m.
>
> On štvrtok, 26. apríla 2018 22:23:30 CEST Doug Turnbull wrote:
> > Honestly I haven’t seen anything satisfactory (yet). It’s a huge need in
> > the open source community
> >
> > On Thu, Apr 26, 2018 at 3:38 PM Ennio Bozzetti <ebozze...@thorlabs.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I'm setting up SOLR on an internal website for my company and I would
> like
> > > to know if anyone can recommend an analytics that I can see what the
> users
> > > are searching for? Does the log in SOLR give me that information?
> > >
> > > Thank you,
> > > Ennio Bozzetti
> > >
> > > --
> > CTO, OpenSource Connections
> > Author, Relevant Search
> > http://o19s.com/doug
>
>
>


Re: Learning to Rank (LTR) with grouping

2018-04-24 Thread Alessandro Benedetti
Are you using SolrCloud or any distributed search ?

If you are using just a single Solr instance, LTR should have no problem
with pagination.
The re-rank involves the top K and then you paginate.
So if a document from the original-score page 1 ends up on page 3, you will
see it on page 3.
Have you verified that: "Say, if an item (Y) from second page is moved to
first page after re-ranking, while an item (X) from first page is moved away
from the first page"?
The top K shouldn't start from the "start" parameter; if it does, it is a bug.

The situation changes a little with distributed search, where you can
experience this behaviour:

*Pagination*
Let’s explore the scenario on a single Solr node and on a sharded
architecture.

SINGLE SOLR NODE

reRankDocs=15
rows=10
This means each page is composed by 10 results.
What happens when we hit the page 2 ?
The first 5 documents in the search results will have been rescored and
affected by the reranking.
The latter 5 documents will preserve the original score and original
ranking.

e.g.
Doc 11 – score= 1.2
Doc 12 – score= 1.1
Doc 13 – score= 1.0
Doc 14 – score= 0.9
Doc 15 – score= 0.8
Doc 16 – score= 5.7
Doc 17 – score= 5.6
Doc 18 – score= 5.5
Doc 19 – score= 4.6
Doc 20 – score= 2.4
This means that score(15) could be < score(16), but document 15 and 16 are
still in the expected order.
The reason is that the top 15 documents are rescored and reranked and the
rest is left unchanged.

*SHARDED ARCHITECTURE*

reRankDocs=15
rows=10
Shards number=2
When looking for page 2, Solr will trigger queries to the shards to
collect 2 pages per shard:
Shard1 : 10 ReRanked docs (page1) + 5 ReRanked docs + 5 OriginalScored docs
(page2)
Shard2 : 10 ReRanked docs (page1) + 5 ReRanked docs + 5 OriginalScored docs
(page2)

Then the results will be merged and, possibly, original-scored search results
can end up above re-ranked docs.
A possible solution could be to normalise the scores, to prevent any
possibility that a re-ranked result is surpassed by original-scored ones.

Note: the problem is going to happen once you reach rows * page > reRankDocs.
In situations where reRankDocs is quite high, the problem will occur only in
deep paging.
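
For reference, the re-rank request driving all of this is typically something
like (the model name, query and efi value are illustrative):

q=laptop&start=10&rows=10&fl=id,score&rq={!ltr model=myModel reRankDocs=15 efi.user_query='laptop'}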



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Run solr server using Java program

2018-04-20 Thread Alessandro Benedetti
To do what?
If you mean to start a Solr server instance, you have solr.sh (or the
Windows starter script).
You can set up your automation stack to be able to start up Solr in one click.
SolrJ is a client, which means you need Solr up and running.
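
As a minimal sketch, the startup script can also be invoked from Java itself
(the install path and port are illustrative and depend on your environment):

import java.io.IOException;

public class StartSolr {
    public static void main(String[] args) throws IOException, InterruptedException {
        // launch a local Solr install; adjust the path to your own setup
        ProcessBuilder pb = new ProcessBuilder("/opt/solr/bin/solr", "start", "-p", "8983");
        pb.inheritIO(); // forward the startup output to this process' console
        Process solr = pb.start();
        int exit = solr.waitFor(); // "solr start" returns once the node is launched
        System.out.println("bin/solr exited with code " + exit);
    }
}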

Cheers

On Fri, 20 Apr 2018, 16:51 rameshkjes,  wrote:

> Using solrJ, I am able to access the solr core. But still I need to go to
> command prompt to execute command for solr instance. Is there way to do
> that?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Run solr server using Java program

2018-04-20 Thread Alessandro Benedetti
There are various client APIs to use Apache Solr [1]; in your case what you
need is SolrJ [2].
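
A minimal SolrJ sketch (the core URL and query are illustrative):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrJExample {
    public static void main(String[] args) throws Exception {
        // point the client at your own Solr instance and collection/core
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        QueryResponse response = client.query(new SolrQuery("*:*"));
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        client.close();
    }
}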

Cheers

[1] https://lucene.apache.org/solr/guide/7_3/client-apis.html
[2] https://lucene.apache.org/solr/guide/7_3/using-solrj.html#using-solrj



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: SolrCloud design question

2018-04-20 Thread Alessandro Benedetti
Unless you use recent Solr 7.x features where replicas can have different
properties [1], each replica is functionally the same at the Solr level.
Zookeeper will elect a leader among them (so temporarily one replica will have
more responsibilities), but (R1-R2-R3) does not really exist at the Solr level.
It will just be Shard1 (ReplicaHost1, ReplicaHost2, ReplicaHost3).

So you can't really shuffle anything at this level.




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to protect middile initials during search

2018-04-20 Thread Alessandro Benedetti
Hi Wendy,
I recommend properly configuring your analysis chain.
You can start by posting it here and we can help.

Generally speaking, you should first use the analysis tool in the Solr admin
UI to verify that the analysis chain is configured as you expect; then you can
move on to modelling the query appropriately.

Cheers




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Learning to Rank (LTR) with grouping

2018-04-17 Thread Alessandro Benedetti
Thanks for the response Shawn !

In relation to this : 
"I feel fairly sure that most of them are unwilling to document their
skills.  
If information like that is documented, it might saddle a committer with 
an obligation to work on issues affecting those areas when they may not 
have the free time available to cover that obligation. "

I understand your point.
I was referring to pure Lucene/Solr modules interest/expertise more than
skills but I get that "it might saddle a committer with 
an obligation to work on issues affecting those areas when they may not 
have the free time available to cover that obligation."

It shouldn't impose an obligation (as no contributor operates under any
SLA; it is purely passion driven), but it might be a "suggestion".
I was thinking of some way to avoid such long-standing Jiras.
Let's pick this issue as an example.
From my point of view I believe it is quite useful.
The last activity is from 22/May/17 15:23 and no committer commented after
that.
I would assume that committers with an interest or expertise in Learning To
Rank or Grouping initially didn't have the free time to evaluate the patch and
then maybe they just forgot.
Could some sort of tagging based on expertise at least avoid the
"forget" part?
Or should the contributor promote the issue and get as many "votes" from
the community as possible to show that the issue matters?
Just thinking out loud; it was just an idea (and I am not completely sure it
could help), but I believe as a community we should manage contributions a
little bit better. Of course I am open to any idea and perspective.

Cheers




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Learning to Rank (LTR) with grouping

2018-04-17 Thread Alessandro Benedetti
Hi Erick,
I have a curiosity/suggestion regarding how to speed up pending (or
forgotten) Jiras:
is there a way to find out the most suitable committer(s) for the task and
tag them?

Apache Lucene/Solr is a big project; is there anywhere on the official
Apache Lucene/Solr website where each committer lists their modules of
interest/expertise?
In this way, when a contributor creates a Jira and attaches a patch, the
committers could get a notification if the module involved in the Jira is one
of their interests.
This could be done manually (the contributor checks the committers' interests
and manually tags them in the Jira) or automatically (integrating Jira
components with this "interests list" in some way).
Happy to help in this direction.

I understand that all of us contributors (and committers) are just
volunteers, so no SLA is expected at all, but did the fact that a fix version
was already assigned affect how that Jira issue was addressed?


Cheers
 



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Sorting using "packed" fields?

2018-04-17 Thread Alessandro Benedetti
Hi Christopher,
if you model your documents with a nested document approach ( like the one
you mentioned) you should be able to achieve your requirement following
this interesting blog [1] :



*" ToParentBlockJoinQuery supports several score calculation modes. For
example, a score for a parent could be calculated as a min(max) score among
of all its children’s scores. So, with the piece of code below we can sort
parent documents by their children’s prices in descending
ordersort={!parent which=doc_type:parent score=max v=’+doc_type:child
+{!func}price’} desc… "*

Instead of using just the plain price function you could design your own
function, such as :

{!func}if(gt(query(prefix:),0),latest_submission,0)

It's just a quick attempt to give you the idea; the function query I posted
may need some refinement, but it could work.

Cheers

[1]
https://blog.griddynamics.com/how-to-sort-parent-documents-by-child-attributes-in-solr/

--
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
www.sease.io

On Mon, Apr 16, 2018 at 9:48 PM, Christopher Schultz <
ch...@christopherschultz.net> wrote:

>
> All,
>
> I have documents that need to appear to have different attributes
> depending upon which user is trying to search them. One of the fields
> I currently have in the document is called "latest_submission" and
> it's a multi-valued text field that contains fields packed with a
> numeric identifier prefix and then the real data. Something like this:
>
> 101:2018-04-16T16:41:00Z
> 102:2017-01-25T22:08:17Z
> 103:2018-11-19T02:52:28Z
>
> When searching, I will know which prefixes are valid for a certain
> user, so I know I can search by *other* fields and then pull-out the
> values that are appropriate for a particular user.
>
> But if I want Solr/Lucene to searcg/sort by the "latest submission", I
> need to be able to tell Solr/Lucene which values are appropriate to
> use for that user.
>
> Is this kind of thing possible? I'd like to be able to issue a search
> that says e.g.:
>
>   find documents matching name:foo sort by latest_submission starting
> with ("102:" or "103:")
>
> I'm just starting out with this data set, so I can completely change
> the organization of the data within the index if necessary.
>
> Does anyone have any suggestions?
>
> I've seen some questions on the list about "child documents", and it
> seems like that might be relevant. Right now, my input data looks like
> this:
>
> {
>   { "name" : "document name",
> "latest_submission" : [ "prefix:date", "prefix:date", etc. ]
>   }
> }
>
> But that could easily be changed to be:
>
> {
>   { "name" : "document name",
> "latest_submission" : { "prefix" : "101",
> "date" : "[date]" },
>   { "prefix" : "103",
> "date" : "[date]" },
>   }
> }
>
>
> Thanks,
> - -chris
>


Re: Match a phrase like "Apple iPhone 6 32GB white" with "iphone 6"

2018-04-09 Thread Alessandro Benedetti
Hi Sami,
I agree with Mikhail: if you have relatively complex data you could curate
your own knowledge base for products and use it for Named Entity Recognition.
You could then search a compatible_with field for the extracted entity.

If the scenario is simpler, the analysis chain you mentioned should work
(provided the product names are always complete and well curated).

Cheers





--
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
www.sease.io

On Mon, Apr 9, 2018 at 10:40 AM, Adhyan Arizki <a.ari...@gmail.com> wrote:

> You can just use synonyms for that.. rather hackish but it works
>
> On Mon, 9 Apr 2018, 05:06 Sami al Subhi, <s...@alsubhi.me> wrote:
>
> > I think this filter will output the desired result:
> >
> > 
> >
> >
> >
> > 
> > 
> >
> >
> >
> > 
> >
> > indexing:
> > "iPhone 6" will be indexed as "iphone 6" (always a single token)
> >
> > querying:
> > so this will analyze "Apple iPhone 6 32GB white" to "apple", "apple
> > iphone",
> > "iphone", "iphone 6" and so on...
> > then here a match will be achieved using the 4th token.
> >
> >
> >  I dont see how this will result in false positive matching.
> >
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> >
>


Re: LTR - OriginalScore query issue

2018-03-19 Thread Alessandro Benedetti
From the Apache Solr tests:

loadFeature(
"SomeEdisMax",
SolrFeature.class.getCanonicalName(),
"{\"q\":\"{!edismax qf='title description' pf='description' mm=100%
boost='pow(popularity, 0.1)' v='w1' tie=0.1}\"}");


*qf='title description'*

Can you try again using the expected syntax (with single quotes)?
If it doesn't work we may need to raise it as a bug.

Regards



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: LTR - OriginalScore query issue

2018-03-16 Thread Alessandro Benedetti
I understood your requirement; the SolrFeature feature type should be quite
flexible. Have you tried:

{ 
name: "overallEdismaxScore", 
class: "org.apache.solr.ltr.feature.SolrFeature", 
params: { 
q: "{!dismax qf=item_typel^3.0 brand^2.0 title^5.0}${user_query}" 
}, 
store: "myFeatureStoreDemo", 
} 

Cheers



-----
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: SpellCheck Reload

2018-03-15 Thread Alessandro Benedetti
Hi Sadiki,
the kind of spellchecker you are using builds an auxiliary Lucene index as a
supporting data structure.
That index is then used to provide the spellcheck suggestions.

"My question is, does "reloading the dictionary" mean completely erasing the
current dictionary and starting from scratch (which is what I want)? "

What you want is to re-build the spellchecker.
In the case of the IndexBasedSpellChecker, the main index is used to build
the dictionary.
When the spellchecker is initialised, a reader is opened from the latest
index version available.

If in the meantime your index has changed and commits have happened, just
re-building the spellchecker *should* still use the old reader:

@Override
  public void build(SolrCore core, SolrIndexSearcher searcher) throws
IOException {
IndexReader reader = null;
if (sourceLocation == null) {
  // Load from Solr's index
  reader = searcher.getIndexReader();
} else {
  // Load from Lucene index at given sourceLocation
  reader = this.reader;
}

This means your dictionary is not going to see any substantial changes.

So what you need to do is :

1) reload the spellchecker -> which will initialise again the source for the
dictionary to the latest index commit
2) re-build the dictionary
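
In practice both can be done with request parameters on the spellcheck
component, e.g. (the handler name and query are illustrative and depend on
your solrconfig.xml):

http://localhost:8983/solr/mycore/spell?spellcheck=true&spellcheck.reload=true&spellcheck.build=true&spellcheck.q=test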



Cheers







-
-------
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Some performance questions....

2018-03-15 Thread Alessandro Benedetti
*Single Solr Instance VS Multiple Solr Instances on a Single Server*

I think there is no benefit in having multiple Solr instances on a single
server, unless the heap memory required by the JVM is too big.
And remember that this has relatively little to do with the index size (the
inverted index is memory-mapped OFF heap, and docValues as well).
On the other hand, of course, Apache Solr uses plenty of JVM heap memory as
well (caches, temporary data structures during indexing, etc.).

> Deepak: 
> 
> Well its kinda a given that when running ANYTHING under a VM you have an 
> overhead..

***Deepak*** 
You mean you are assuming without any facts (performance benchmark with n 
without VM) 
 ***Deepak*** 
I think Shawn detailed this quite extensively. I am no sysadmin or OS
expert, but there is no need for benchmarks and I don't even understand your
doubts.
In information technology, any time you add additional layers of software you
need adapters, which means additional instructions executed.
It is obvious that:
metal -> OS -> APP is cheaper, instruction-wise, than
metal -> OS -> VM -> APP
The APP will execute instructions in the VM, which will be responsible for
translating those instructions for the underlying OS.
Going direct, you skip one step.
You can think about this when you emulate a different OS: is it cheaper to run
Windows on a machine directly to execute Windows applications, or to run a
Windows VM on top of another OS to execute Windows applications?



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: LTR - OriginalScore query issue

2018-03-15 Thread Alessandro Benedetti
From the snippet you posted, this is the query you run:
q=id:"13245336"

So the original score (for each document in the result set) can only be the
score associated with that query.

You then pass an EFI with a different text.
You can now use that information to calculate another feature if you want.
You can define a SolrFeature :

{
"store" : "myFeatureStore",
"name" : "userTextCat",
"class" : "org.apache.solr.ltr.feature.SolrFeature",
"params" : { "q" : "{! <localParams}${user_query}" }
  }

e.g.
{
"store" : "myFeatureStore",
"name" : "titleTfIdf",
"class" : "org.apache.solr.ltr.feature.SolrFeature",
"params" : { "q" : "{!field f=title}${user_query}" }
  }

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: LTR not able to upload org.apache.solr.ltr.model.MultipleAdditiveTreesModel

2018-03-14 Thread Alessandro Benedetti
This is the piece of code involved :

"try {
  // create an instance of the model
  model = solrResourceLoader.newInstance(
  className,
  LTRScoringModel.class,
  new String[0], // no sub packages
  new Class[] { String.class, List.class, List.class, String.class,
List.class, Map.class },
  new Object[] { name, features, norms, featureStoreName,
allFeatures, params });
  if (params != null) {
SolrPluginUtils.invokeSetters(model, params.entrySet());
  }
} catch (final Exception e) {
  throw new ModelException("Model type does not exist " + className, e);
}"

I admit it is generic and even contains a catch-all "Exception" clause, but
wasn't it logging the stack trace?
Just out of curiosity, what was the entire stack trace?

This may help to improve it.

Regards



-----
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr Warming Up Doubts

2018-03-14 Thread Alessandro Benedetti
I see quite a bit of confusion here :

*1. FirstSearcher* I have added some 2 frequent used query but all my 
autowarmCount are set to 0. I have also added facet for warming. So if my 
autowarmCount=0, does this mean by queries are not getting cached. 

/The first searcher, as the name suggests, is the first searcher opened on the
Solr instance at startup.
NewSearcher refers to the new searcher opened on every commit instead.
If the autowarm count for your caches is set to 0, it means that 0 entries
from the old caches will be used to warm up the new caches (the old caches get
invalidated on both soft and hard commits)./


*2. useColdSearcher = false* Despite reading many document, i am not able 
to understand how it works after full import (assuming this is not my first 
full-import) 

Normally, when a commit happens, the new searcher is first warmed up and then
registered to serve queries.
If you want to use a cold (un-warmed) searcher you can, by setting this
property to true.

*3. not defined maxWarmingSearchers in solrconfig.* 
This refers to the number of warming searchers in the background; if you have
frequent commits you may have several searchers concurrently warming up.
This parameter limits that number (normally to 2 searchers).
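
A sketch of how these pieces typically fit together in solrconfig.xml
(the queries, cache sizes and field names are illustrative):

<query>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">frequent query one</str>
        <str name="facet">true</str>
        <str name="facet.field">category</str>
      </lst>
    </arr>
  </listener>
  <useColdSearcher>false</useColdSearcher>
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>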

So, in short, you are definitely doing something wrong and your auto warming
is not going to work as you like :)

Cheers




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: LTR not able to upload org.apache.solr.ltr.model.MultipleAdditiveTreesModel

2018-03-14 Thread Alessandro Benedetti
Hi Roopa,
that model class changed name a few times; which Apache Solr version are you
using?
It is very likely you are using a class name that is not in sync with your
Apache Solr version.

Regards



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Need help with match contains query in SOLR

2018-02-20 Thread Alessandro Benedetti
It was not clear at the beginning, but if I understood correctly you could:

*Index Time analysis*
Use whatever charFilter you need, the keyword tokenizer [1] and then the
token filters you like (such as the lowercase filter, synonyms, etc.).

*Query Time Analysis*
You can use any tokenizer you like (one that actually tokenizes, so not the
keyword tokenizer), the shingle token filter [2] and whatever additional
filters you need.
This should do the trick.
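
A sketch of such a field type (the name and shingle sizes are illustrative):

<fieldType name="text_contains" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="5" outputUnigrams="true"/>
  </analyzer>
</fieldType>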

Cheers

[1]
https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-KeywordTokenizer
[2]
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-19 Thread Alessandro Benedetti
Hi David,
good to know that sorting solved your problem.
I understand perfectly that given the urgency of your situation, having the
solution ready takes priority over continuing with the investigations.

I would anyway recommend opening a Jira issue for Apache Solr with all the
information gathered so far.
Your situation caught our attention, and changing the order of the input
documents definitely shouldn't affect the index size (by such a large
factor).
The fact that the optimize didn't change anything is even more suspicious.
It may be an indicator that in some edge cases the ordering of the input
documents affects one of the index data structures.
As a last thing when you have time I would suggest to :

1) index the ordering which gives you a small index - Optimize - Take note
of the size by index file extension

2) index the ordering which gives you a big index - Optimize - Take note of
the size by index file extension

And attach that to the Jira issue.
Whenever someone picks it up, that would definitely help.

Cheers




-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: solr ltr jar is not able to recognize MultipleAdditiveTreesModel

2018-02-16 Thread Alessandro Benedetti
You cannot just use the model output from RankLib.
I opened this issue a few months ago but I never had the time/motivation to
implement it [1].
You need to convert it into the format expected by the Apache Solr LTR module.
I remember a script should be available [2].

[1]  https://sourceforge.net/p/lemur/feature-requests/144/

[2] https://github.com/ryac/lambdamart-xml-to-json

--
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
www.sease.io

On Thu, Feb 15, 2018 at 3:55 PM, Brian Yee <b...@wayfair.com> wrote:

> I'm not sure if this will solve your problem, but you are using a very old
> version of Ranklib. The most recent version is 2.9.
> https://sourceforge.net/projects/lemur/files/lemur/RankLib-2.9/
>
>
> -Original Message-
> From: kusha.pande [mailto:kusha.pa...@gmail.com]
> Sent: Thursday, February 15, 2018 8:12 AM
> To: solr-user@lucene.apache.org
> Subject: solr ltr jar is not able to recognize MultipleAdditiveTreesModel
>
> Hi I am trying to upload a training model generated from ranklib jar using
> lamdamart mart.
>
> The model is like
> {"class":"org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
> "name":"lambdamartmodel",
> "params" : {
> "trees" :[
>{
>   "id": "1",
>   "weight": "0.1",
>   "split": {
>  "feature": "8",
>  "threshold": "7.111333",
>  "split": [
> {
>"pos": "left",
>"feature": "8",
>"threshold": "5.223557",
>"split": [
>   {
>  "pos": "left",
>  "feature": "8",
>  "threshold": "3.2083516",
>  "split": [
> {
>"pos": "left",
>"feature": "1",
>"threshold": "100.0",
>"split": [
>   {
>  "pos": "left",
>  "feature": "8",
>  "threshold": "2.2626402",
>  "split": [
> {
>"pos": "left",
>"feature": "8",
>"threshold": "2.2594802",
>"split": [
>   {
>  "pos": "left",
>  "output": "-1.6371088"
>   },
>   {
>  "pos": "right",
>  "output": "-2.0"
>   }
>]
> },
> {
>"pos": "right",
>"feature": "8",
>"threshold": "2.4438097",
>"split": [
>   {
>  "pos": "left",
>  "feature": "2",
>  "threshold": "0.05",
>  "split": [
> {
>"pos": "left",
>"output": "2.0"
> }, ..
>
>
> getting an exception as :
> Exception: Status: 400 Bad Request
> Response: {
>   "responseHeader":{
> "status":400,
> "QTime":43},
>   "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.RuntimeException"],
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not
> exist org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
> "code":400}}
> .
>
> I have used RankLib-2.1-patched.jar to generate the model and converted the
> generated xml to json.
>
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-16 Thread Alessandro Benedetti
It's a silly thing, but to confirm the direction that Erick is suggesting:
how many rows are in the DB?
If updates are happening on Solr (causing the deletes), I would expect a
greater number of documents in the DB than in the Solr index.
Is the DB primary key (if any) the same as the uniqueKey field in Solr?

Regards

--
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
www.sease.io

On Fri, Feb 16, 2018 at 10:18 AM, Howe, David <david.h...@auspost.com.au>
wrote:

>
> Hi Emir,
>
> We have no copy field definitions.  To keep things simple, we have a one
> to one mapping between the columns in our staging table and the fields in
> our Solr index.
>
> Regards,
>
> David
>
> David Howe
> Java Domain Architect
> Postal Systems
> Level 16, 111 Bourke Street Melbourne VIC 3000
>
> T  0391067904
>
> M  0424036591
>
> E  david.h...@auspost.com.au
>
> W  auspost.com.au
> W  startrack.com.au
>
> Australia Post is committed to providing our customers with excellent
> service. If we can assist you in any way please telephone 13 13 18 or visit
> our website.
>
> The information contained in this email communication may be proprietary,
> confidential or legally professionally privileged. It is intended
> exclusively for the individual or entity to which it is addressed. You
> should only read, disclose, re-transmit, copy, distribute, act in reliance
> on or commercialise the information if you are authorised to do so.
> Australia Post does not represent, warrant or guarantee that the integrity
> of this email communication has been maintained nor that the communication
> is free of errors, virus or interference.
>
> If you are not the addressee or intended recipient please notify us by
> replying direct to the sender and then destroy any electronic or paper copy
> of this message. Any views expressed in this email communication are taken
> to be those of the individual sender, except where the sender specifically
> attributes those views to Australia Post and is authorised to do so.
>
> Please consider the environment before printing this email.
>


Re: Multiple context fields in suggester component

2018-02-15 Thread Alessandro Benedetti
You can start from here :

org/apache/solr/spelling/suggest/SolrSuggester.java:265

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Alessandro Benedetti
@Pratik: you should have investigated. I understand that solved your issue,
but in case you needed norms, it doesn't make sense that they caused your
index to grow by a factor of 30.
You must have hit a nasty bug if it was just the norms.

@Howe : 

*Compound File*  .cfs, .cfe  An optional "virtual" file consisting of all the
other index files, for systems that frequently run out of file handles.

*Frequencies*  .doc  Contains the list of docs which contain each term, along
with the frequency.

*Field Data*  .fdt  The stored fields for documents.

*Positions*  .pos  Stores position information about where a term occurs in
the index.

*Term Index*  .tip  The index into the Term Dictionary.

So, David, can you confirm that those two indexes have:

1) same number of documents
2) identical documents ( + 1 new field each not indexed)
3) same number of deleted documents
4) they both were born from scratch ( an empty index)

The matter is still suspicious:
- .cfs seems to highlight some sort of malfunction during indexing/committing
in relation to the OS. What commit strategy were you using?

- .doc, .pos, .tip -> they shouldn't change: assuming both indexes are
optimised and you are only adding a non-indexed field, those data structures
shouldn't be affected

- the stored content as well: that is too large an increment

Can you send us the full configuration for the new field?
You don't want norms, positions or frequencies for it.
But in case they are the issue, you may have found a real edge case, because
even with all of them enabled you shouldn't incur such a penalty for just one
additional tiny field.
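
For reference, a stored-only field of this kind would typically be declared
like this (the name and type are illustrative):

<field name="extra_payload" type="string" indexed="false" stored="true"
       docValues="false" multiValued="false"/>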



-
-------
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Alessandro Benedetti
Hi Pratik,
how is it possible that just the norms for a single field caused such a
massive index size increase in your case?

In your case I think it was a field type used by multiple fields, but it's
still suspicious in my opinion: norms shouldn't be that big.
If I remember correctly, in old versions of Solr (before index-time boosts
were dropped) norms contained both an approximation of the field length and
the index-time boost.
From your mailing list thread you moved from 10 GB to 300 GB.
It can't be just the norms; are you sure you didn't hit some bug?

Regards



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Using Synonyms as a feature with LTR

2018-02-14 Thread Alessandro Benedetti
I see.
According to what I know, it is not possible to run different query-time
analysis for the same field.

Not sure if anyone was working on that.

Regards



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Not getting appropriate spell suggestions

2018-02-14 Thread Alessandro Benedetti
Given your schema, the stemmer seems the most likely culprit.
You need to disable it and re-index.
Just commenting it out is not going to work if you don't re-index.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Using Synonyms as a feature with LTR

2018-02-14 Thread Alessandro Benedetti
"I can go with the "title" field and have that include the synonyms in 
analysis. Only problem is that the number of fields and number of synonyms 
files are quite a lot (~ 8 synonyms files) due to different weightage and 
type of expansion (exact vs partial) based on these. Hence going with this 
approach would mean creating more fields for all these synonyms 
(synonyms.txt) 

So, I am looking to build a custom parser for which I could supply the file 
and the field and that would expand the synonyms and return a score. "

Having a binary or a scalar feature is completely up to you and the way you
configure the Solr feature.
If you have 8 (copy?) fields with the same content but different expansions,
that is still OK.
You can have 8 features, one per type of expansion.
LTR will take care of the weight to be assigned to those features.

"So, I am looking to build a custom parser for which I could supply the file 
and the field and that would expand the synonyms and return a score. ""
I don't get this , can you elaborate ?

Regards



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Judging the MoreLikeThis results for relevancy

2018-02-14 Thread Alessandro Benedetti
So let me answer point by point :

1) Similarity is misleading here if you interpret it as a probabilistic
measure.
Given a query, there is no such thing as the "ideal document". With both
TF-IDF and BM25 (which handles this better) you are scoring the document: the
higher the score, the higher the relevance of that document for the given
query. BM25 does a better job here; its relevance function hits a saturation
point, so it is closer to your expectation. This blog from Doug should
help [1].

2) "if document vector A is at a 
distance of 5 and 10 units from document vectors B and C respectively then 
can't we say that B is twice as relevant to A as C is to A? Or in terms of 
distance, C is twice as distant to  A and B is to A?"

Not in Lucene, at least not strictly.
The current MLT uses TF-IDF as its scoring formula.
When the score of B is double the score of C, you can say that, for Lucene, B
is twice as relevant to A as C is.
From a user perspective this can be different (quoting Doug: "If an
article mentions “dog” six times is it twice as relevant as an article
mentioning “dog” 3 times? Most users say no").

3) MLT under the hood builds a Lucene query and retrieves documents from the
index.
When building the MLT query, to keep it simple, it extracts from the seed
document a subset of terms which are considered representative of the seed
document (let's call them relevant terms).
This is controlled through a parameter; usually, and by default, you collect
a limited set of relevant terms (not all the terms).
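For reference, on the MoreLikeThis handler those knobs are exposed as request
parameters, e.g. (the handler path and field names are illustrative):

/mlt?q=id:123&mlt.fl=title,description&mlt.maxqt=25&mlt.mintf=1&mlt.mindf=1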
When retrieving similar documents, you score them using TF-IDF (and in the
future BM25).
So first of all, you can have documents with higher scores than the original
(it doesn't make sense in a probabilistic world, but this is how Lucene
works).
Swapping the documents, so applying the MLT to document B, you could build a
slightly different query.
So:
given seed(a), score(b) != score(a) given seed(b)

I understand you think it doesn't make sense, but this is how Lucene works.

I do also understand that a lot of times users want a percentage out of a
MLT query.
I will work toward that direction for sure, step by step, first I need to
have the MLT refactor approved and patched :)




[1]
https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Alessandro Benedetti
Hi David, 
given the fact that you are actually building a new index from scratch, my
shot in the dark didn't hit any target.
When you say  : "Once the import finishes we save the docker image in the
AWS docker repository.  We then build our cluster using that image as the
base"

Do you mean just configuration wise?
Will the new cluster have any starting index on disk?
If I understood your latest statements correctly, I expect a NO here.

So you are building a completely new index and, comparing it to the old index
(which is completely separate), you notice such a big difference in size.
This is extremely suspicious.
Optimising, in the end, is just a huge merge to force one (or N) final
segments.
Given the additional information you gave me, it's not going to make much
difference.

I would recommend checking how the index space is divided among the different
file formats [1]
(i.e. list how much space is dedicated to each extension).

Stored content is in the .fdt files.


[1]
https://lucene.apache.org/core/6_4_0/core/org/apache/lucene/codecs/lucene62/package-summary.html#file-names



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Multiple context fields in suggester component

2018-02-13 Thread Alessandro Benedetti
The simple answer is no.
Only one context field is supported out of the box.
The query you provide as the context filtering query (suggest.cfq=) is
going to be parsed, and a boolean query on the context field is created [1].

You will need some customizations if you are targeting that behavior.

[1] query = new
StandardQueryParser(contextFilterQueryAnalyzer).parse(contextFilter,
CONTEXTS_FIELD_NAME);
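
For reference, a rough sketch of how the single supported context field is
wired in the suggester configuration (all names here are hypothetical):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="contextField">category</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

At query time the filter is then passed as suggest.cfq=; anything beyond this
single contextField needs the customization mentioned above.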




--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: facet.method=uif not working in solr cloud?

2018-02-13 Thread Alessandro Benedetti
*Update*: this has actually already been solved by Hoss.

https://issues.apache.org/jira/browse/SOLR-11711 and this is the Pull
Request : https://github.com/apache/lucene-solr/pull/279/files

This should go live with 7.3 

Cheers



--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: facet.method=uif not working in solr cloud?

2018-02-13 Thread Alessandro Benedetti
+1

I believe it is a bug related to that patch in some way.
facet.distrib.mco (the naming is not very explicit) should activate the
feature in the patch, which forces the mincount in the distributed requests
to be set to 1.

The expected normal behavior is that you pass to the distributed requests
the same value for the parameter that you originally set.

Can you open a bug, Wei?
We can investigate the part where the requests are distributed.

Regards



--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: solr spell check index dictionary build failed issue

2018-02-13 Thread Alessandro Benedetti
Shooting in the dark, it seems that two processes are trying to write to the
same disk directory.
Is this directory shared by different Solr cores or Solr instances?

If you share the relevant configuration from your solrconfig.xml we may be
able to help.



--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Alessandro Benedetti
I assume you re-index in full, right?
My shot in the dark is that this increase is temporary.
You re-index, so you effectively delete and re-add all documents (this means
that even if the new field is just stored, you rebuild the entire index for
all the fields).
New segments are created and the old docs are marked as deleted.
Until the background merge happens, the index could reach those sizes.

The weird thing is why the merge didn't kick in...
Have you configured any special approach to segment merging?

What happens if you explicitly optimize?
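
Just as a rough sketch (core name, host and port are hypothetical), an explicit
optimize can be triggered through the update handler:

http://localhost:8983/solr/your_core/update?optimize=true&maxSegments=1

optimize=true forces a merge down to maxSegments final segments; on a large
index this is an expensive operation, so run it off-peak.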

Let us know ...




--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Using Synonyms as a feature with LTR

2018-02-12 Thread Alessandro Benedetti
In the end a feature will just be a numerical value.
How do you plan to use synonyms in a field to generate a numerical feature ?

Are you planning to define a binary feature for a field, in case there is a
match on the synonyms?
Or a feature which contains a score for a query (with synonym expansion)?

I would start from the SolrFeature; let's assume the "title" field has a
field type that includes synonyms (query time):

{
"store" : "featureStore",
"name" : "hasTitleMatch",
"class" : "org.apache.solr.ltr.feature.SolrFeature",
"params" : {
  "fq": [ "{!field f=title}${query}" ]
}
}

Query-time analysis will be applied and the synonyms expanded.
So the feature will have a value, which is the score returned for the query
against the document being scored.
You can play with that and design the feature that best fits your idea.
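
As a rough sketch (model name and query text are hypothetical), the ${query}
placeholder would then be filled at request time through the external feature
information of the LTR re-ranking query:

rq={!ltr model=myModel reRankDocs=100 efi.query='user query text'}

so the same user query, analysed by the synonym-aware field type, produces the
score that feeds the hasTitleMatch feature.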

Regards








--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Using Context field unable to get autosuggestion for zip code having '-'.

2018-02-08 Thread Alessandro Benedetti
With that configuration you want to autosuggest office names, filtering them
by zip code.

Not sure why you perform an ngram analysis, though.
How do you want to filter by zip code? Exact search? Edge ngram?

Regards



--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Relevancy Tuning For Solr With Apache Nutch 2.3

2018-02-08 Thread Alessandro Benedetti
uhm, not really.
I am just saying that if you are running a version >= 6.6.0, keep in mind that
the index-time boost you think you are enabling is not actually working
anymore.

You are now mentioning a Nutch boost field...
Can you elaborate on that?
It may be a completely different thing...
How is this boost stored on the Solr side?

Cheers



--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Relevancy Tuning For Solr With Apache Nutch 2.3

2018-02-08 Thread Alessandro Benedetti
Regarding "boost from nutch's side":

if you refer to index-time boost, this was deprecated a while ago [1],
at least from 6.6.0.

[1] http://lucene.apache.org/solr/6_6_0/solr-solrj/deprecated-list.html



--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Spellcheck collations results

2018-02-08 Thread Alessandro Benedetti
Given this configuration you may state that, if no collation is returned,
there was no collation returning results after:
- getting back a maximum of 7 corrections for misspelled terms
- evaluating a maximum of 10,000 collation combinations for the extended
results
- testing 3 collations against the index to check whether results are
returned, and then giving up

So there are scenarios where you don't get a collation, but one actually
would have returned results:

- the collation involves a correction that was not included in the closest 7
corrections
- the collation was not tested (not being included in the first 3 collation
combinations)
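
For reference, a rough sketch of the request parameters that typically drive
these thresholds (values matching the numbers above; tune them against your
handler defaults):

spellcheck=true
&spellcheck.count=7
&spellcheck.collate=true
&spellcheck.maxCollations=1
&spellcheck.maxCollationTries=3
&spellcheck.maxCollationEvaluations=10000
&spellcheck.collateExtendedResults=true

Raising spellcheck.maxCollationTries increases the chance that a working
collation is found, at the cost of extra internal queries.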

We can go deeper if required; the spellcheck is quite a complex module
:)

Cheers



--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Judging the MoreLikeThis results for relevancy

2018-02-08 Thread Alessandro Benedetti
Hi,
I have been personally working a lot with the MoreLikeThis and I am close to
contributing a refactor of that module (mostly to break up the monolithic
giant facade class).

First of all, the MoreLikeThis handler will return the original document
(not scored) + the similar documents (scored).
The original document is not considered by the MoreLikeThis query, so it is
not returned as part of the results of the MLT Lucene query; it is just
added to the beginning of the response.

If I remember correctly (I am unable to check at the moment), you should be
able to get the original document in the result set (with max score)
using the More Like This query parser.
Please double-check that.
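
As a rough sketch (field names and the document id are hypothetical), the MLT
query parser form would look like:

q={!mlt qf=title,description mintf=1 mindf=1}YOUR_DOC_ID

where the document whose id is passed acts as the seed and the qf fields drive
the extraction of the relevant terms.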

Generally speaking, at the moment TF-IDF is used under the hood, which means
the score is not probabilistic.
So if a document has a score that is 50% of the original document's score, it
doesn't mean it is 50% similar, but for your use case it may be a feasible
approximation.



--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

