Re: [ANNOUNCE] Apache Solr 8.8.1 released

2021-02-27 Thread David Smiley
The corresponding docker image has been released as well:
https://hub.docker.com/_/solr
(credit to Tobias Kässmann for helping)

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Feb 23, 2021 at 10:39 AM Timothy Potter 
wrote:

> The Lucene PMC is pleased to announce the release of Apache Solr 8.8.1.
>
>
> Solr is the popular, blazing fast, open source NoSQL search platform from
> the Apache Lucene project. Its major features include powerful full-text
> search, hit highlighting, faceted search, dynamic clustering, database
> integration, rich document handling, and geospatial search. Solr is highly
> scalable, providing fault tolerant distributed search and indexing, and
> powers the search and navigation features of many of the world's largest
> internet sites.
>
>
> Solr 8.8.1 is available for immediate download at:
>
>
>   <https://lucene.apache.org/solr/downloads.html>
>
>
> ### Solr 8.8.1 Release Highlights:
>
>
> Fix for a SolrJ backwards compatibility issue when upgrading the server to
> 8.8.0 without upgrading SolrJ to 8.8.0.
>
>
> Please refer to the Upgrade Notes in the Solr Ref Guide for information on
> upgrading from previous Solr versions:
>
>
>   <https://lucene.apache.org/solr/guide/8_8/solr-upgrade-notes.html>
>
>
> Please read CHANGES.txt for a full list of bugfixes:
>
>
>   <https://lucene.apache.org/solr/8_8_1/changes/Changes.html>
>
>
> Solr 8.8.1 also includes bugfixes in the corresponding Apache Lucene
> release:
>
>
>   <https://lucene.apache.org/core/8_8_1/changes/Changes.html>
>
>
>
> Note: The Apache Software Foundation uses an extensive mirroring network
> for
>
> distributing releases. It is possible that the mirror you are using may not
> have
>
> replicated the release yet. If that is the case, please try another mirror.
>
> This also applies to Maven access.
>
> 
>


Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-19 Thread David Smiley
Even if you could do an "fl" with the ability to exclude certain fields, it
raises the question of what goes into the document cache.  The doc cache is
doc oriented, not field oriented.  So there needs to be some sort of
stand-in value if you don't want to cache a value there, and that ends
up being LazyField if you have that feature enabled, or possibly wasted
space if you don't have that enabled.  So I don't think the ability to
exclude fields in "fl" would obsolete enableLazyFieldLoading, which I think
is what you are implying?
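
For reference, the toggle being discussed lives in solrconfig.xml's <query>
section (shown here with its usual out-of-the-box value):

  <enableLazyFieldLoading>true</enableLazyFieldLoading>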

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Feb 19, 2021 at 10:10 AM Gus Heck  wrote:

> Actually I suspect it's there because the ability to exclude fields
> rather than include them is still pending...
> https://issues.apache.org/jira/browse/SOLR-3191
> See also
> https://issues.apache.org/jira/browse/SOLR-10367
> https://issues.apache.org/jira/browse/SOLR-9467
>
> All of these and lazy field loading are motivated by the case where you
> have a very large stored field and you sometimes don't want it, but do want
> everything else, and an explicit list of fields is not convenient (i.e. the
> field list would have to be hard coded in an application, or alternately
> require some sort of schema parsing to build a list of possible fields or
> other severe ugliness..)
>
> -Gus
>
> On Thu, Feb 18, 2021 at 8:42 AM David Smiley  wrote:
>
> > IMO enableLazyFieldLoading is a small optimization for most apps.  It
> saves
> > memory in the document cache at the expense of increased latency if your
> > usage pattern wants a field later that wasn't requested earlier.  You'd
> > probably need detailed metrics/benchmarks to observe a difference, and
> you
> > might reach a conclusion that enableLazyFieldLoading is best at "false"
> for
> > you irrespective of the bug.  I suspect it may have been developed for
> > particularly large document use-cases where you don't normally need some
> > large text fields for retrieval/highlighting.  For example imagine if you
> > stored the entire input data as JSON in a _json_ field or some-such.
> > Nowadays, I'd set large="true" on such a field, which is a much newer
> > option.
> >
> > I was able to tweak my test to have only alphabetic IDs, and the test
> still
> > failed.  I don't see how the ID's contents/format could cause any effect.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Thu, Feb 18, 2021 at 5:04 AM Nussbaum, Ronen <
> ronen.nussb...@verint.com
> > >
> > wrote:
> >
> > > You're right, I was able to reproduce it too without highlighting.
> > > Regarding the existing bug, I think there might be an additional issue
> > > here because it happens only when the id field contains an underscore
> > > (didn't check for other special characters).
> > > Currently I have no other choice but to use
> > > enableLazyFieldLoading=false.
> > > I hope it won't have a significant performance impact.
> > >
> > > -Original Message-
> > > From: David Smiley 
> > > Sent: Thursday, 18 February 2021 01:03
> > > To: solr-user 
> > > Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy Field
> > > Loading => Invalid Index
> > >
> > > I think the issue is this existing bug, but needs to refer to
> > > toSolrInputDocument instead of toSolrDoc:
> > > https://issues.apache.org/jira/browse/SOLR-13034
> > > Highlighting isn't involved; you just need to somehow get a document
> > > cached with lazy fields.  In a test I was able to do this simply by
> > doing a
> > > query that only returns the "id" field.  No highlighting.
> > >
> > > ~ David Smiley
> > > Apache Lucene/Solr Search Developer
> > > http://www.linkedin.com/in/davidwsmiley
> > >
> > >
> > > On Wed, Feb 17, 2021 at 10:28 AM David Smiley 
> > wrote:
> > >
> > > > Thanks for more details.  I was able to reproduce this locally!  I
> > > > hacked a test to look similar to what you are doing.  BTW it's okay
> to
> > > > fill out a JIRA imperfectly; they can always be edited :-).  Once I
> > > > better understand the nature of the bug today, I'll file an issue and
> > > respond with it here.
> > > >
> > > > ~ David Smiley
> > > > Apache Lucene/Solr Search Developer
> > > > http://www.linkedin.com/in/davidwsmiley
> > > >
> >

Re: Congratulations to the new Apache Solr PMC Chair, Jan Høydahl!

2021-02-18 Thread David Smiley
Congratulations Jan!

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Feb 18, 2021 at 1:56 PM Anshum Gupta  wrote:

> Hi everyone,
>
> I’d like to inform everyone that the newly formed Apache Solr PMC nominated
> and elected Jan Høydahl for the position of the Solr PMC Chair and Vice
> President. This decision was approved by the board in its February 2021
> meeting.
>
> Congratulations Jan!
>
> --
> Anshum Gupta
>


Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-18 Thread David Smiley
IMO enableLazyFieldLoading is a small optimization for most apps.  It saves
memory in the document cache at the expense of increased latency if your
usage pattern wants a field later that wasn't requested earlier.  You'd
probably need detailed metrics/benchmarks to observe a difference, and you
might reach a conclusion that enableLazyFieldLoading is best at "false" for
you irrespective of the bug.  I suspect it may have been developed for
particularly large document use-cases where you don't normally need some
large text fields for retrieval/highlighting.  For example imagine if you
stored the entire input data as JSON in a _json_ field or some-such.
Nowadays, I'd set large="true" on such a field, which is a much newer
option.
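
Purely as an illustrative sketch (the field name is hypothetical; "large"
fields must be stored and single-valued), that option goes on the field
definition in the schema:

  <field name="_json_" type="string" indexed="false" stored="true" large="true"/>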

I was able to tweak my test to have only alphabetic IDs, and the test still
failed.  I don't see how the ID's contents/format could cause any effect.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Feb 18, 2021 at 5:04 AM Nussbaum, Ronen 
wrote:

> You're right, I was able to reproduce it too without highlighting.
> Regarding the existing bug, I think there might be an additional issue
> here because it happens only when the id field contains an underscore (didn't
> check for other special characters).
> Currently I have no other choice but to use enableLazyFieldLoading=false.
> I hope it won't have a significant performance impact.
>
> -----Original Message-
> From: David Smiley 
> Sent: Thursday, 18 February 2021 01:03
> To: solr-user 
> Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy Field
> Loading => Invalid Index
>
> I think the issue is this existing bug, but needs to refer to
> toSolrInputDocument instead of toSolrDoc:
> https://issues.apache.org/jira/browse/SOLR-13034
> Highlighting isn't involved; you just need to somehow get a document
> cached with lazy fields.  In a test I was able to do this simply by doing a
> query that only returns the "id" field.  No highlighting.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, Feb 17, 2021 at 10:28 AM David Smiley  wrote:
>
> > Thanks for more details.  I was able to reproduce this locally!  I
> > hacked a test to look similar to what you are doing.  BTW it's okay to
> > fill out a JIRA imperfectly; they can always be edited :-).  Once I
> > better understand the nature of the bug today, I'll file an issue and
> respond with it here.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Wed, Feb 17, 2021 at 6:36 AM Nussbaum, Ronen
> > 
> > wrote:
> >
> >> Hello David,
> >>
> >> Thank you for your reply.
> >> It was very hard but finally I discovered how to reproduce it. I
> >> thought of opening an issue but wasn't sure about the components and
> priority.
> >> I used the "tech products" configset, with the following changes:
> >> 1. Added <fieldType name="_nest_path_" class="solr.NestPathField" />
> >> 2. Added <field name="text_en" type="text_en" indexed="true"
> >> stored="true" termVectors="true" termOffsets="true" termPositions="true"
> >> required="false" multiValued="true" />
> >> Then I inserted one document with a nested child e.g.
> >> {id:"abc_1", utterances:{id:"abc_1-1", text_en:"Solr is great"}}
> >>
> >> To reproduce:
> >> Do a search with surround and unified highlighter:
> >>
> >> hl.fl=text_en&hl.method=unified&hl=on&q=%7B!surround%7Dtext_en%3A4W("solr"%2C"great")
> >>
> >> Now, try to update the parent e.g. {id:"abc_1", categories_i:{add:1}}
> >>
> >> Important: it happens only when "id" contains underscore characters!
> >> If you use "abc-1" instead, it works.
> >>
> >> Thanks in advance,
> >> Ronen.
> >>
> >> -Original Message-
> >> From: David Smiley 
> >> Sent: Sunday, 14 February 2021 19:17
> >> To: solr-user 
> >> Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy
> >> Field Loading => Invalid Index
> >>
> >> Hello Ronen,
> >>
> >> Can you please file a JIRA issue?  Some quick searches did not turn
> >> anything up.  It would be super helpful to me if you could list a
> >> series of steps with Solr out-of-the-box in 8.8 including what data
> >> to index and query.

Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-17 Thread David Smiley
I think the issue is this existing bug, but needs to refer to
toSolrInputDocument instead of toSolrDoc:
https://issues.apache.org/jira/browse/SOLR-13034
Highlighting isn't involved; you just need to somehow get a document cached
with lazy fields.  In a test I was able to do this simply by doing a query
that only returns the "id" field.  No highlighting.
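
To sketch that concretely (borrowing the repro data from this thread): a
request like

  /select?q=*:*&fl=id

populates the documentCache with lazy fields, and then an atomic update of
one of those docs, e.g. {"id":"abc_1", "categories_i":{"add":1}}, trips the
serialization error.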

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Feb 17, 2021 at 10:28 AM David Smiley  wrote:

> Thanks for more details.  I was able to reproduce this locally!  I hacked
> a test to look similar to what you are doing.  BTW it's okay to fill out a
> JIRA imperfectly; they can always be edited :-).  Once I better understand
> the nature of the bug today, I'll file an issue and respond with it here.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, Feb 17, 2021 at 6:36 AM Nussbaum, Ronen 
> wrote:
>
>> Hello David,
>>
>> Thank you for your reply.
>> It was very hard but finally I discovered how to reproduce it. I thought
>> of opening an issue but wasn't sure about the components and priority.
>> I used the "tech products" configset, with the following changes:
>> 1. Added <fieldType name="_nest_path_" class="solr.NestPathField" />
>> 2. Added <field name="text_en" type="text_en" indexed="true"
>> stored="true" termVectors="true" termOffsets="true" termPositions="true"
>> required="false" multiValued="true" />
>> Then I inserted one document with a nested child e.g.
>> {id:"abc_1", utterances:{id:"abc_1-1", text_en:"Solr is great"}}
>>
>> To reproduce:
>> Do a search with surround and unified highlighter:
>>
>> hl.fl=text_en&hl.method=unified&hl=on&q=%7B!surround%7Dtext_en%3A4W("solr"%2C"great")
>>
>> Now, try to update the parent e.g. {id:"abc_1", categories_i:{add:1}}
>>
>> Important: it happens only when "id" contains underscore characters! If
>> you use "abc-1" instead, it works.
>>
>> Thanks in advance,
>> Ronen.
>>
>> -Original Message-
>> From: David Smiley 
>> Sent: Sunday, 14 February 2021 19:17
>> To: solr-user 
>> Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy Field
>> Loading => Invalid Index
>>
>> Hello Ronen,
>>
>> Can you please file a JIRA issue?  Some quick searches did not turn
>> anything up.  It would be super helpful to me if you could list a series of
>> steps with Solr out-of-the-box in 8.8 including what data to index and
>> query.  Solr already includes the "tech products" sample data; maybe that
>> can illustrate the problem?  It's not clear if nested schema or nested docs
>> are actually required in your example.  If you share the JIRA issue with
>> me, I'll chase this one down.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Sun, Feb 14, 2021 at 11:16 AM Ronen Nussbaum 
>> wrote:
>>
>> > Hi All,
>> >
>> > I discovered a strange behaviour with this combination.
>> > Not only the atomic update fails, the child documents are not properly
>> > indexed, and you can't use highlights on their text fields. Currently
>> > there is no workaround other than reindex.
>> >
>> > Checked on 8.3.0, 8.6.1 and 8.8.0.
>> > 1. Configure nested schema.
>> > 2. enableLazyFieldLoading is true (default).
>> > 3. Run a search with hl.method=unified and hl.fl=<text fields>
>> > 4. Trying to do an atomic update on some of the *parents* of
>> > the returned documents from #3.
>> >
>> > You get an error: "TransactionLog doesn't know how to serialize class
>> > org.apache.lucene.document.LazyDocument$LazyField".
>> >
>> > Now trying to run #3 again yields an error message that the text field
>> > is indexed without positions.
>> >
>> > If enableLazyFieldLoading is false or if using the default highlighter
>> > this doesn't happen.
>> >
>> > Ronen.
>> >
>>
>>
>> This electronic message may contain proprietary and confidential
>> information of Verint Systems Inc., its affiliates and/or subsidiaries. The
>> information is intended to be for the use of the individual(s) or
>> entity(ies) named above. If you are not the intended recipient (or
>> authorized to receive this e-mail for the intended recipient), you may not
>> use, copy, disclose or distribute to anyone this message or any information
>> contained in this message. If you have received this electronic message in
>> error, please notify us by replying to this e-mail.
>>
>


Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-17 Thread David Smiley
Thanks for more details.  I was able to reproduce this locally!  I hacked a
test to look similar to what you are doing.  BTW it's okay to fill out a
JIRA imperfectly; they can always be edited :-).  Once I better understand
the nature of the bug today, I'll file an issue and respond with it here.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Feb 17, 2021 at 6:36 AM Nussbaum, Ronen 
wrote:

> Hello David,
>
> Thank you for your reply.
> It was very hard but finally I discovered how to reproduce it. I thought
> of opening an issue but wasn't sure about the components and priority.
> I used the "tech products" configset, with the following changes:
> 1. Added <fieldType name="_nest_path_" class="solr.NestPathField" />
> 2. Added <field name="text_en" type="text_en" indexed="true" stored="true"
> termVectors="true" termOffsets="true" termPositions="true" required="false"
> multiValued="true" />
> Then I inserted one document with a nested child e.g.
> {id:"abc_1", utterances:{id:"abc_1-1", text_en:"Solr is great"}}
>
> To reproduce:
> Do a search with surround and unified highlighter:
>
> hl.fl=text_en&hl.method=unified&hl=on&q=%7B!surround%7Dtext_en%3A4W("solr"%2C"great")
>
> Now, try to update the parent e.g. {id:"abc_1", categories_i:{add:1}}
>
> Important: it happens only when "id" contains underscore characters! If
> you use "abc-1" instead, it works.
>
> Thanks in advance,
> Ronen.
>
> -Original Message-
> From: David Smiley 
> Sent: Sunday, 14 February 2021 19:17
> To: solr-user 
> Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy Field
> Loading => Invalid Index
>
> Hello Ronen,
>
> Can you please file a JIRA issue?  Some quick searches did not turn
> anything up.  It would be super helpful to me if you could list a series of
> steps with Solr out-of-the-box in 8.8 including what data to index and
> query.  Solr already includes the "tech products" sample data; maybe that
> can illustrate the problem?  It's not clear if nested schema or nested docs
> are actually required in your example.  If you share the JIRA issue with
> me, I'll chase this one down.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Sun, Feb 14, 2021 at 11:16 AM Ronen Nussbaum  wrote:
>
> > Hi All,
> >
> > I discovered a strange behaviour with this combination.
> > Not only the atomic update fails, the child documents are not properly
> > indexed, and you can't use highlights on their text fields. Currently
> > there is no workaround other than reindex.
> >
> > Checked on 8.3.0, 8.6.1 and 8.8.0.
> > 1. Configure nested schema.
> > 2. enableLazyFieldLoading is true (default).
> > 3. Run a search with hl.method=unified and hl.fl=<text fields>
> > 4. Trying to do an atomic update on some of the *parents* of
> > the returned documents from #3.
> >
> > You get an error: "TransactionLog doesn't know how to serialize class
> > org.apache.lucene.document.LazyDocument$LazyField".
> >
> > Now trying to run #3 again yields an error message that the text field
> > is indexed without positions.
> >
> > If enableLazyFieldLoading is false or if using the default highlighter
> > this doesn't happen.
> >
> > Ronen.
> >
>
>
> This electronic message may contain proprietary and confidential
> information of Verint Systems Inc., its affiliates and/or subsidiaries. The
> information is intended to be for the use of the individual(s) or
> entity(ies) named above. If you are not the intended recipient (or
> authorized to receive this e-mail for the intended recipient), you may not
> use, copy, disclose or distribute to anyone this message or any information
> contained in this message. If you have received this electronic message in
> error, please notify us by replying to this e-mail.
>


Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-14 Thread David Smiley
Hello Ronen,

Can you please file a JIRA issue?  Some quick searches did not turn
anything up.  It would be super helpful to me if you could list a series of
steps with Solr out-of-the-box in 8.8 including what data to index and
query.  Solr already includes the "tech products" sample data; maybe that
can illustrate the problem?  It's not clear if nested schema or nested docs
are actually required in your example.  If you share the JIRA issue with
me, I'll chase this one down.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, Feb 14, 2021 at 11:16 AM Ronen Nussbaum  wrote:

> Hi All,
>
> I discovered a strange behaviour with this combination.
> Not only the atomic update fails, the child documents are not properly
> indexed, and you can't use highlights on their text fields. Currently there
> is no workaround other than reindex.
>
> Checked on 8.3.0, 8.6.1 and 8.8.0.
> 1. Configure nested schema.
> 2. enableLazyFieldLoading is true (default).
> > 3. Run a search with hl.method=unified and hl.fl=<text fields>
> 4. Trying to do an atomic update on some of the *parents* of the returned
> documents from #3.
>
> You get an error: "TransactionLog doesn't know how to serialize class
> org.apache.lucene.document.LazyDocument$LazyField".
>
> Now trying to run #3 again yields an error message that the text field is
> indexed without positions.
>
> If enableLazyFieldLoading is false or if using the default highlighter this
> doesn't happen.
>
> Ronen.
>


Re: Incorrect distance returned for indexed polygone shape

2021-01-31 Thread David Smiley
Closing the loop here (I think you reached out to me in multiple other
channels) -- Solr only supports calculating the distance between points.
If you manage to index your data with a centroid point and a separate
numeric field that is an approximate radius, you can get something that may
be good enough for what you want to do.  Basically, calculate the geodist
but subtract the radius field... maybe something like this (untested!):
sort=sub(geodist(),radius) desc.  Use LatLonPointSpatialField to store
point data if you can (if it's appropriate); it succeeded RPT for that.
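
Spelled out as request parameters (a hypothetical sketch; "centroid" and
"radius" are assumed field names for a LatLonPointSpatialField and a numeric
field respectively):

  &sfield=centroid&pt=50.53,-9.5722616&sort=sub(geodist(),radius) asc

where ascending order puts the nearest shapes first.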

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Jan 20, 2021 at 6:00 AM Famas  wrote:

> I am using `geodist()` in a solr query, like this:
> `select?fl=*,_dist_:geodist()&fq={!geofilt
>
> d=30444}&indent=on&pt=50.53,-9.5722616&q=*:*&sfield=geo&spatial=true&wt=json`
> However, it seems like distance calculations aren’t working.  Here’s an
> example query where the pt is several hundred kilometers away from the
> POLYGON. The problem is that the calculated geodist is always `20015.115`.
>
> This is my query response:
>
> ```
> {
>   "responseHeader":{
> "status":0,
> "QTime":0,
> "params":{
>   "q":"*:*",
>   "pt":"50.53,-9.5722616",
>   "indent":"on",
>   "fl":"*,_dist_:geodist()",
>   "fq":"{!geofilt d=30444}",
>   "sfield":"geo",
>   "spatial":"true",
>   "wt":"json"}},
>   "response":{"numFound":3,"start":0,"docs":[
>   {
> "id":"1",
> "document_type_id":"1",
> "geo":["POLYGON ((3.837490081787109 43.61234105514181,
> 3.843669891357422 43.57877424689641, 3.893280029296875 43.57205863840097,
> 3.9458084106445312 43.58872191986938, 3.921947479248047 43.62762639320158,
> 3.8663291931152344 43.63321761913266, 3.837490081787109
> 43.61234105514181))"],
> "_version_":1689241382273679360,
> "timestamp":"2021-01-18T16:08:40.484Z",
> "_dist_":20015.115},
>   {
> "id":"4",
> "document_type_id":"4",
> "geo":["POLYGON ((-0.94482421875 45.10454630976873, -0.98876953125
> 44.6061127451739, 0.06591796875 44.134913443750726, 0.32958984375
> 45.1510532655634, -0.94482421875 45.10454630976873))"],
> "_version_":1689244486784253952,
> "timestamp":"2021-01-18T16:58:01.177Z",
> "_dist_":20015.115},
>   {
> "id":"8",
> "document_type_id":"8",
> "geo":["POLYGON ((-2.373046875 48.29781249243716, -2.28515625
> 48.004625021133904, -1.5380859375 47.76886840424207, -0.32958984375
> 47.79839667295524, -0.5712890625 48.531157010976706, -2.373046875
> 48.29781249243716))"],
> "_version_":1689252312264998912,
> "timestamp":"2021-01-18T19:02:24.137Z",
> "_dist_":20015.115}]
>   }}
> ```
> This is my solr field type definition:
> ```xml
> <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType" maxDistErr="0.001"
>
> spatialContextFactory="org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
>
> validationRule="repairBuffer0"
> distErrPct="0.025"
> distanceUnits="kilometers"
> autoIndex="true"/>
>
> <field name="geo" type="geo" indexed="true" stored="true"/>
>
> ```
> This is how I index my polygon:
>
> ```json
> {
>   "id": 12,
>   "document_type_id": 12,
>   "geo": "POLYGON ((3.77105712890625 43.61171961774284, 3.80401611328125
> 43.57939602461448, 3.8610076904296875 43.59580863402625, 3.8603210449218746
> 43.61519958447072, 3.826675415039062 43.628123412124616, 3.7827301025390625
> 43.63110543935801, 3.77105712890625 43.61171961774284))"
> }
> ```
>
> By the way I'm using solr 6.6 and I found this issue about it:
>
> https://issues.apache.org/jira/browse/SOLR-12899
>
> Is there an explanation?!
> Any help would be appreciated!
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Performance issue with Solr 8.6.1 Unified Highlighter does not occur on Solr 6.

2021-01-29 Thread David Smiley
https://issues.apache.org/jira/browse/SOLR-10321 -- near the end my opinion
is we should just omit the field if there is no highlight, which would
address your need to do this work-around.  Glob or no glob.  PR welcome!

It's satisfying seeing that the Unified Highlighter is so much faster than
the original.  I aim to make UH the default in 9.0.  SOLR-12901
<https://issues.apache.org/jira/browse/SOLR-12901>

It's kinda depressing that the weightMatcher mode is slow when there are
many fields because I was hoping this choice might eventually be permanent
in order to obsolete lots of code in the highlighter.  I can guess why it's
slow -- and I filed an issue --
https://issues.apache.org/jira/browse/LUCENE-9712 -- a tough one!  Don't
expect anything from me there for the foreseeable future.  It'd take either
some ugly hack that has some limited qualifications, or a substantial
rewrite of much of the UH.  At least there's the classic non-weightMatcher
mode, which works faithfully, albeit with some of its own gotchas around
obscure/custom query compatibility.

You said the original highlighter performs at ~1.5 seconds.  For the UH, I
suspect your offset source is postings from the index to get such fantastic
numbers that you get with it; right?  For curiosity's sake, can you please
set hl.offsetSource=ANALYSIS and tell me what speed you get?  Set
hl.weightMatches=false as well.  My hope is that it's still substantially
better than the original highlighter.
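
That is, re-run the slow query with parameters along these lines:

  hl=true&hl.method=unified&hl.offsetSource=ANALYSIS&hl.weightMatches=false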

Just because hl.requireFieldMatch=false is the default, doesn't mean it's
the _right_ choice for everyone's app :-).  I tend to think Solr should
flip this in 9.0 for both accuracy & performance sake.  And unset
hl.maxAnalyzedChars -- mostly an obsolete safety with the UH being so much
faster.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Jan 29, 2021 at 2:46 AM Kerwin  wrote:

> On another note, since response time is in question: since Solr 6 I have
> been using a custom highlighter that overrides the encodeSnippets() method
> in the UnifiedSolrHighlighter class, because Solr sends back a blank array
> (ZERO_LEN_STR_ARRAY) in the response payload for fields that do not match.
> Here is the code before:
> if (snippet == null) {
>   //TODO reuse logic of DefaultSolrHighlighter.alternateField
>   summary.add(field, ZERO_LEN_STR_ARRAY);
> } 
>
> So I had removed this clause and made the following change:
>
> if (snippet != null) {
>// we used a special snippet separator char and we can now split on
> it.
>   summary.add(field, snippet.split(SNIPPET_SEPARATOR));
> }
>
> This has not changed in Solr 8 either, which for 76 fields gives a very large
> payload. So I will keep this custom code for now.
>
> On Fri, Jan 29, 2021 at 12:28 PM Kerwin  wrote:
>
>> Hi David,
>>
>> Thanks so much for your reply.
>> hl.weightMatches was indeed the culprit. After setting it to false, I am
>> now getting the same sub-second response as Solr 6. I am using Solr 8.6.1.
>>
>> Here are the tests I carried out:
>> hl.requireFieldMatch=true&hl.weightMatches=true  (2458 ms)
>> hl.requireFieldMatch=false&hl.weightMatches=true (3964 ms)
>> hl.requireFieldMatch=true&hl.weightMatches=false (158 ms)
>> hl.requireFieldMatch=false&hl.weightMatches=false (169 ms) (CHOSEN since
>> this is consistent with our earlier setting).
>>
>> Thanks again, I will inform our other teams as well doing the Solr
>> upgrade to check the CHANGES.txt doc related to this.
>>
>


Re: Performance issue with Solr 8.6.1 Unified Highlighter does not occur on Solr 6.

2021-01-28 Thread David Smiley
Hello Kerwin,

Firstly, hopefully you've seen the upgrade notes:
https://lucene.apache.org/solr/guide/8_7/solr-upgrade-notes.html
8.6 fixes a performance regression found in 8.5; perhaps you are using 8.5?

Missing from the upgrade notes but found in the CHANGES.txt for 8.0
is hl.weightMatches=true is now the default.  Try setting it to false.
Does that help performance much?  It's documented on the highlighting page
of the ref guide:
https://lucene.apache.org/solr/guide/8_7/highlighting.html#the-unified-highlighter

You might want to try toggling hl.requireFieldMatch=true (defaults to
false).  For a query with dismax, it makes no semantic difference since all
clauses target all fields, unless users know how to query only specific
fields and do that.  It may impact performance significantly when there are
many fields.  Try a matrix of toggling this and hl.weightMatches (2x2=4
tests).

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Jan 27, 2021 at 2:20 AM Kerwin  wrote:

> Hi,
>
> While upgrading to Solr 8 from 6 the Unified highlighter begins to have
> performance issues going from approximately 100ms to more than 4 seconds
> with 76 fields in the hl.q  and hl.fl parameters. So I played with
> different options and found that the hl.q parameter needs to reference just
> one field for the performance issue to vanish. I do not know why this would
> be so. Could you check if this is a bug or something else? This is not the
> case if I use the original highlighter, which has the same performance on
> Solr 6 and Solr 8 of ~1.5 seconds. The highlighting payload is also mostly
> the same in all the cases.
>
> Prior Solr 8 configuration with bad performance of > 4sec:
> hl.q={!edismax qf="field1 field2 ..field76" v=$qq}
> hl.fl=field1 field2 ..field76
>
> Solr 8 configuration with original Solr 6 performance of ~ 100 ms:
> hl.q={!edismax qf="field1" v=$qq}
> hl.fl=field1 field2 ..field76
>
> Other highlighting parameters:
> hl=true
> hl.method=unified
> hl.fragsize=200
> hl.bs.type=WORD
> hl.bs.language=en
> hl.snippets=10
>
> If I remove the hl.q parameter altogether, the performance time shoots up
> to 6-7 seconds, since our user query is quite large with more fields and is
> more complicated, I suspect.
>


Re: Exact and non exact highlighting

2021-01-22 Thread David Smiley
I'm very familiar with using the Unifier Highligher on a project with this
requirement.  The main "trick" we used was using only one field but
analyzing both ways with a term differentiator (e.g. a leading symbol), and
then coupled with a custom query parser that knows a phrase query is to be
highlighted using the "exact" analysis as opposed to stemmed/approximate
analysis.  As one can imagine, there was a lot of custom code involved here
for many search requirements; this complexity wasn't just for the
highlighting matter.  Any way, using one stored field and multiple indexed
fields (ignoring their stored content if any) is a known feature request:
https://issues.apache.org/jira/browse/SOLR-1105  There's even a patch.  I
would love to help get this feature into Solr if you want to take over
there!  The patch needs some work; I really disagree with touching the Solr
schema.  If you are up for it, comment on that issue to let the original
contributor know you want to help move this forward.  Maybe they do too.
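
To illustrate the term-differentiator idea only (this is not our exact
scheme; "#" stands in for whatever reserved symbol you pick):

  input "Running" indexes two terms:  run  and  #running
  query  Running    ->  text:run        (stemmed/approximate analysis)
  query  "Running"  ->  text:#running   (exact analysis, via the custom parser)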

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Jan 22, 2021 at 12:46 PM df2832368_...@amberoad.de wrote:

> Hello folks,
>
> I am currently working on an issue where we need to enable exact
> highlighting on a text field.
>
> Only problem is that it should also be possible to have also parts of the
> query which don't need to be exact.(e.g. "Hello World" Test, so "Hello
> World" needs to be an exact match, but tests would also match test.)
>
> We have a text field with our normal analyzer pipeline (stemming,...) and
> a copy field which has a decreased pipeline(lowercase filter).
>
> For searching this does its job fine and only returns the correct results
> by translating the query to its supposed fields (e.g. "Hello World" Test
> -> text_exact:"Hello World" AND text:Test)
>
> Now the problem: The highlighting is now split into the two text fields
> (which makes sense). So we somehow want to combine those two highlights
> (they have the same stored text) to get appropriate "tags" and also scores.
>
> I haven't found a neat solution to this problem by now and would like to
> ask if someone has done something similar or has a clear idea on what to do.
>
> I have tried to tinker a bit around our custom extension of the unified
> highlighter and tried to somehow merge the passages returned by the
> highlighter. But this is quite tedious and error-prone. The next idea was
> to do a two-step process by first getting the positions of the exact match
> in the text_exact field and afterwards somehow filter only highlights that
> have these positions inside. (But I suppose this idea would still not solve
> the "tag"(/) problem .)
>
> I would be grateful for any help you can offer.
>
> Jan


Re: Highlighting large text fields

2021-01-12 Thread David Smiley
The last update to highlighting that I think is pertinent to
whether highlights match or not is v7.6 which added that hl.weightMatches
option.  So I recommend upgrading to at least that if you want to
experiment further.  But... uh.weightMatches highlights more accurately and
as such is more likely to not highlight as much as you are highlighting
now, and highlighting more is your goal right now it appears.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Jan 12, 2021 at 2:45 PM Shaun Campbell 
wrote:

> That's great David.  So hl.maxAnalyzedChars isn't that critical. I'll whack
> it right up and see what happens.
>
> I'm running 7.4 from a few years ago. Should I upgrade?
>
> For your info this is what I'm doing with Solr
> https://dev.fundingawards.nihr.ac.uk/search.
>
> Thanks
> Shaun
>
> On Tue, 12 Jan 2021 at 19:33, David Smiley  wrote:
>
> > On Tue, Jan 12, 2021 at 1:08 PM Shaun Campbell  >
> > wrote:
> >
> > > Hi David
> > >
> > > Getting closer now.
> > >
> > > First of all, a bit of a mistake on my part. I have two cores set up
> and
> > I
> > > was changing the solrconfig.xml on the wrong core doh!!  That's why
> > > highlighting wasn't being turned off.
> > >
> > > I think I've got the unified highlighter working.
> > > storeOffsetsWithPositions was already configured on my field type
> > > definition, not the field definition, so that was ok.
> > >
> > > What it boils down to now I think is hl.maxAnalyzedChars. I'm getting
> > > highlighting on some records and not others, making it confusing as to
> > > where the match is with my dismax parser.  I increased
> > > my hl.maxAnalyzedChars to 130 and now it's highlighting more
> records.
> > > Two questions:
> > >
> > > 1. Have you any guidelines as to what could be a
> > > maximum hl.maxAnalyzedChars without impacting performance or memory?
> > >
> >
> > With storeOffsetsWithPositions, highlighting is super-fast, and so this
> > hl.maxAnalyzedChars threshold is of marginal utility, like only to cap
> the
> > amount of memory used if you have some truly humongous docs and it's okay
> > to only highlight the first X megabytes of them.  Maybe set it to 100MB
> > worth of text, or something like that.
> >
> >
> > > 2. Do you know a way to query the maximum length of text in a field so
> > that
> > > I can set hl.maxAnalyzedChars accordingly?  Just thinking I can
> probably
> > > modify my java indexer to log the maximum content length.  Actually, I
> > > probably don't want the maximum but some value that highlights 90-95%
> > > records
> > >
> >
> > Eh... not really.  Maybe some approximation hacks involving function
> > queries on norms but I'd not bother in favor of just using a high
> threshold
> > such that this won't be an issue.
> >
> > All this said, this threshold is *not* the only reason why you might not
> be
> > getting highlights that you expect.  If you are using a recent Solr
> > version, you might try toggling the hl.weightMatches boolean, which could
> > make a difference for certain query arrangements.  There's a JIRA issue
> > pertaining to this one, and I haven't investigated it yet.
> >
> > ~ David
> >
> >
> > >
> > > Thanks
> > > Shaun
> > >
> > > On Tue, 12 Jan 2021 at 16:30, David Smiley  wrote:
> > >
> > > > On Tue, Jan 12, 2021 at 9:39 AM Shaun Campbell <
> > campbell.sh...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi David
> > > > >
> > > > > First of all I wanted to say I'm working off your book!!  Third
> > > edition,
> > > > > and I think it's a bit out of date now. I was just going to try
> > > following
> > > > > the section on the Postings highlighter, but I see that's been
> > absorbed
> > > > > into the Unified highlighter. I find your book easier to follow
> than
> > > the
> > > > > official documentation though.
> > > > >
> > > >
> > > > Thanks :-D.  I do maintain the Solr Reference Guide for the parts of
> > > code I
> > > > touch, including highlighting, so I hope what's there makes sense
> too.
> > > >
> > > >
> > > > > I am going to try to configure the unified highlighter, and I will
> > add
> > > > that

Re: Highlighting large text fields

2021-01-12 Thread David Smiley
On Tue, Jan 12, 2021 at 1:08 PM Shaun Campbell 
wrote:

> Hi David
>
> Getting closer now.
>
> First of all, a bit of a mistake on my part. I have two cores set up and I
> was changing the solrconfig.xml on the wrong core doh!!  That's why
> highlighting wasn't being turned off.
>
> I think I've got the unified highlighter working.
> storeOffsetsWithPositions was already configured on my field type
> definition, not the field definition, so that was ok.
>
> What it boils down to now I think is hl.maxAnalyzedChars. I'm getting
> highlighting on some records and not others, making it confusing as to
> where the match is with my dismax parser.  I increased
> my hl.maxAnalyzedChars to 130 and now it's highlighting more records.
> Two questions:
>
> 1. Have you any guidelines as to what could be a
> maximum hl.maxAnalyzedChars without impacting performance or memory?
>

With storeOffsetsWithPositions, highlighting is super-fast, and so this
hl.maxAnalyzedChars threshold is of marginal utility, like only to cap the
amount of memory used if you have some truly humongous docs and it's okay
to only highlight the first X megabytes of them.  Maybe set it to 100MB
worth of text, or something like that.
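
For instance, something like hl.maxAnalyzedChars=100000000 -- an
illustrative value on the order of 100MB of text, not a tuned recommendation.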


> 2. Do you know a way to query the maximum length of text in a field so that
> I can set hl.maxAnalyzedChars accordingly?  Just thinking I can probably
> modify my java indexer to log the maximum content length.  Actually, I
> probably don't want the maximum but some value that highlights 90-95% of
> records.
>

Eh... not really.  Maybe some approximation hacks involving function
queries on norms but I'd not bother in favor of just using a high threshold
such that this won't be an issue.

All this said, this threshold is *not* the only reason why you might not be
getting highlights that you expect.  If you are using a recent Solr
version, you might try toggling the hl.weightMatches boolean, which could
make a difference for certain query arrangements.  There's a JIRA issue
pertaining to this one, and I haven't investigated it yet.

~ David


>
> Thanks
> Shaun
>
> On Tue, 12 Jan 2021 at 16:30, David Smiley  wrote:
>
> > On Tue, Jan 12, 2021 at 9:39 AM Shaun Campbell  >
> > wrote:
> >
> > > Hi David
> > >
> > > First of all I wanted to say I'm working off your book!!  Third
> edition,
> > > and I think it's a bit out of date now. I was just going to try
> following
> > > the section on the Postings highlighter, but I see that's been absorbed
> > > into the Unified highlighter. I find your book easier to follow than
> the
> > > official documentation though.
> > >
> >
> > Thanks :-D.  I do maintain the Solr Reference Guide for the parts of
> code I
> > touch, including highlighting, so I hope what's there makes sense too.
> >
> >
> > > I am going to try to configure the unified highlighter, and I will add
> > that
> > > storeOffsetsWithPositions to the schema (which I saw in your book) and
> I
> > > will try indexing again from scratch.  Was getting some funny things
> > going
> > > on where I thought I'd turned highlighting off and it was still giving
> me
> > > highlights.
> > >
> >
> > hl=true/false
> >
> >
> > > Actually just re-reading your email again, are you saying that you
> can't
> > > configure highlighting in solrconfig.xml? That's where I always
> configure
> > > original highlighting in my dismax search handler. Am I supposed to add
> > > highlighting to each request?
> > >
> >
> > You can set highlighting and other *parameters* in solrconfig.xml for
> > request handlers.  But the dedicated <highlighting> plugin info is only
> > for the original and Fast Vector Highlighters.
> >
> > ~ David
> >
> >
> > >
> > > Thanks
> > > Shaun
> > >
> > > On Mon, 11 Jan 2021 at 20:57, David Smiley  wrote:
> > >
> > > > Hello!
> > > >
> > > > I worked on the UnifiedHighlighter a lot and want to help you!
> > > >
> > > > On Mon, Jan 11, 2021 at 9:58 AM Shaun Campbell <
> > campbell.sh...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > I've been using highlighting for a while, using the original
> > > highlighter,
> > > > > and just come across a problem with fields that contain a large
> > amount
> > > of
> > > > > text, approx 250k characters. I only have about 2,000 records but
> > each
> > > > one
> > > > > contains a journal publication to search through.

Re: Highlighting large text fields

2021-01-12 Thread David Smiley
On Tue, Jan 12, 2021 at 9:39 AM Shaun Campbell 
wrote:

> Hi David
>
> First of all I wanted to say I'm working off your book!!  Third edition,
> and I think it's a bit out of date now. I was just going to try following
> the section on the Postings highlighter, but I see that's been absorbed
> into the Unified highlighter. I find your book easier to follow than the
> official documentation though.
>

Thanks :-D.  I do maintain the Solr Reference Guide for the parts of code I
touch, including highlighting, so I hope what's there makes sense too.


> I am going to try to configure the unified highlighter, and I will add that
> storeOffsetsWithPositions to the schema (which I saw in your book) and I
> will try indexing again from scratch.  Was getting some funny things going
> on where I thought I'd turned highlighting off and it was still giving me
> highlights.
>

hl=true/false


> Actually just re-reading your email again, are you saying that you can't
> configure highlighting in solrconfig.xml? That's where I always configure
> original highlighting in my dismax search handler. Am I supposed to add
> highlighting to each request?
>

You can set highlighting and other *parameters* in solrconfig.xml for
request handlers.  But the dedicated <highlighting> plugin info is only for
the original and Fast Vector Highlighters.
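
For example, a minimal sketch (handler name and field are hypothetical):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="hl">true</str>
      <str name="hl.method">unified</str>
      <str name="hl.fl">content</str>
    </lst>
  </requestHandler>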

~ David


>
> Thanks
> Shaun
>
> On Mon, 11 Jan 2021 at 20:57, David Smiley  wrote:
>
> > Hello!
> >
> > I worked on the UnifiedHighlighter a lot and want to help you!
> >
> > On Mon, Jan 11, 2021 at 9:58 AM Shaun Campbell  >
> > wrote:
> >
> > > I've been using highlighting for a while, using the original
> highlighter,
> > > and just come across a problem with fields that contain a large amount
> of
> > > text, approx 250k characters. I only have about 2,000 records but each
> > one
> > > contains a journal publication to search through.
> > >
> > > What I noticed is that some records didn't return a highlight even
> though
> > > they matched on the content. I noticed the hl.maxAnalyzedChars
> parameter
> > > and increased that, but  it allowed some records to be highlighted, but
> > not
> > > all, and then it caused memory problems on the server.  Performance is
> > also
> > > very poor.
> > >
> >
> > I've been thinking hl.maxAnalyzedChars should maybe default to no limit
> --
> > it's a performance threshold but perhaps better to opt-in to such a limit
> > than scratch your head for a long time wondering why a search result
> isn't
> > showing highlights.
> >
> >
> > > To try to fix this I've tried  to configure the unified highlighter in
> my
> > > solrconfig.xml instead.   It seems to be working but again I'm missing
> > some
> > > highlighted records.
> > >
> >
> > There is no configuration of that highlighter in solrconfig.xml; it's
> > entirely parameter driven (runtime).
> >
> >
> > > The other thing is I've tried to adjust my unified highlighting
> settings
> > in
> > > solrconfig.xml and they don't  seem to be having any effect even after
> > > restarting Solr.  I was just wondering whether there is any
> highlighting
> > > information stored at index time. It's taking over 4hours to index my
> > > records so it's not easy to keep reindexing my content.
> > >
> > > Any ideas on how to handle highlighting of large content  would be
> > > appreciated.
> > >
> > > Shaun
> > >
> >
> > Please read the documentation here thoroughly:
> >
> >
> https://lucene.apache.org/solr/guide/8_6/highlighting.html#the-unified-highlighter
> > (or earlier version as applicable)
> > Since you have large bodies of text to highlight, you would strongly
> > benefit from putting offsets into the search index (and re-index) --
> > storeOffsetsWithPositions.  That's an option on the field/fieldType in
> your
> > schema; it may not be obvious reading the docs.  You have to opt-in to
> > that; Solr doesn't normally store any info in the index for highlighting.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
>


Re: Highlighting large text fields

2021-01-11 Thread David Smiley
Hello!

I worked on the UnifiedHighlighter a lot and want to help you!

On Mon, Jan 11, 2021 at 9:58 AM Shaun Campbell 
wrote:

> I've been using highlighting for a while, using the original highlighter,
> and just come across a problem with fields that contain a large amount of
> text, approx 250k characters. I only have about 2,000 records but each one
> contains a journal publication to search through.
>
> What I noticed is that some records didn't return a highlight even though
> they matched on the content. I noticed the hl.maxAnalyzedChars parameter
> and increased that, but  it allowed some records to be highlighted, but not
> all, and then it caused memory problems on the server.  Performance is also
> very poor.
>

I've been thinking hl.maxAnalyzedChars should maybe default to no limit --
it's a performance threshold but perhaps better to opt-in to such a limit
than scratch your head for a long time wondering why a search result isn't
showing highlights.


> To try to fix this I've tried  to configure the unified highlighter in my
> solrconfig.xml instead.   It seems to be working but again I'm missing some
> highlighted records.
>

There is no configuration of that highlighter in solrconfig.xml; it's
entirely parameter driven (runtime).


> The other thing is I've tried to adjust my unified highlighting settings in
> solrconfig.xml and they don't  seem to be having any effect even after
> restarting Solr.  I was just wondering whether there is any highlighting
> information stored at index time. It's taking over 4hours to index my
> records so it's not easy to keep reindexing my content.
>
> Any ideas on how to handle highlighting of large content  would be
> appreciated.
>
> Shaun
>

Please read the documentation here thoroughly:
https://lucene.apache.org/solr/guide/8_6/highlighting.html#the-unified-highlighter
(or earlier version as applicable)
Since you have large bodies of text to highlight, you would strongly
benefit from putting offsets into the search index (and re-index) --
storeOffsetsWithPositions.  That's an option on the field/fieldType in your
schema; it may not be obvious reading the docs.  You have to opt-in to
that; Solr doesn't normally store any info in the index for highlighting.
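
For example, a sketch of what that looks like in the schema (the field name
is hypothetical, and a re-index is required for it to take effect):

  <field name="content" type="text_general" indexed="true" stored="true"
         storeOffsetsWithPositions="true"/>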

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


Re: SPLITSHARD - data loss of child documents

2020-12-19 Thread David Smiley
https://issues.apache.org/jira/browse/SOLR-11191 and I assigned it to
myself just now.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Dec 17, 2020 at 9:50 AM Mike Drob  wrote:

> I was under the impression that split shard doesn’t work with child
> documents, if that is missing from the ref guide we should update it
>
> On Thu, Dec 17, 2020 at 4:30 AM Nussbaum, Ronen  >
> wrote:
>
> > Hi Everyone,
> >
> > We're using version 8.6.1 with nested documents.
> > I used the SPLITSHARD API and after it finished successfully, I've
> noticed
> > the following:
> >
> >   1.  Most of child documents are missing - before the split: ~600M,
> > after: 68M
> >   2.  Retrieving a document with its children, shows child documents that
> > do not belong to this parent (their parentID value is different than
> > parent's ID).
> >
> > I didn't see any limitation in the API documentation.
> > Do you have any suggestions?
> >
> > Thanks in advance,
> > Ronen.
> >
> >
> > This electronic message may contain proprietary and confidential
> > information of Verint Systems Inc., its affiliates and/or subsidiaries.
> The
> > information is intended to be for the use of the individual(s) or
> > entity(ies) named above. If you are not the intended recipient (or
> > authorized to receive this e-mail for the intended recipient), you may
> not
> > use, copy, disclose or distribute to anyone this message or any
> information
> > contained in this message. If you have received this electronic message
> in
> > error, please notify us by replying to this e-mail.
> >
>


Re: data import handler deprecated?

2020-11-30 Thread David Smiley
Yes, absolutely to what Eric said.  We goofed on news / release highlights
on how to communicate what's happening in Solr.  From a Solr insider point
of view, we are "deprecating" because strictly speaking, the code isn't in
our codebase any longer.  From a user point of view (the audience of news /
release notes), the functionality has *moved*.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Nov 30, 2020 at 8:04 AM Eric Pugh 
wrote:

> You don’t need to abandon DIH right now….   You can just use the Github
> hosted version….   The more people who use it, the better a community it
> will form around it!It’s a bit chicken and egg, since no one is
> actively discussing it, submitting PR’s etc, it may languish.   If you use
> it, and test it, and support other community folks using it, then it will
> continue on!
>
>
>
> > On Nov 29, 2020, at 12:12 PM, Dmitri Maziuk 
> wrote:
> >
> > On 11/29/2020 10:32 AM, Erick Erickson wrote:
> >
> >> And I absolutely agree with Walter that the DB is often where
> >> the bottleneck lies. You might be able to
> >> use multiple threads and/or processes to query the
> >> DB if that’s the case and you can find some kind of partition
> >> key.
> >
> > IME the difficult part has always been dealing with incremental updates,
> if we were to roll our own, my vote would be for a database trigger that
> does a POST in whichever language the DBMS likes.
> >
> > But this has not been a part of our "solr 6.5 update" project until now.
> >
> > Thanks everyone,
> > Dima
>
> ___
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com <
> http://www.opensourceconnections.com/> | My Free/Busy <
> http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.
>
>


Re: Faceting: !terms vs mincount precedence

2020-11-17 Thread David Smiley
This is confusing because when you write {!terms}, it suggests a reference
to the TermsQParser, but when you write {!terms=a,b,c} it suggests
local-params, with key "terms" and value "a,b,c" -- entirely different
things.  I think that "terms" local-param to faceting was a purely internal
thing that wasn't documented; it existed as an internal implementation
detail.  Then someone (I think Christine, if not then Mikhail) observed it
wasn't documented, and added some basic docs.  Now you come along and try
to use it with other things that unsurprisingly it just wasn't designed
for.  That's my estimation of the matter... and *if* true, illustrates that
maybe some internal params should stay internal and don't need to be
publicly documented.  I confess I've used that faceting local-param in an
app once before too; it's useful.  I know my response isn't a direct answer
to your question RE mincount... perhaps it can be made to work?
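
To illustrate the two different things with a hypothetical field:

  q={!terms f=genre_s}fantasy,scifi              <-- the TermsQParser
  facet.field={!terms='fantasy,scifi'}genre_s    <-- the faceting local-param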

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Nov 17, 2020 at 8:21 AM Jason Gerlowski 
wrote:

> Hey all,
>
> I was using the {!terms} local parameter on some traditional field
> facets to make sure particular values were returned.
>
> e.g.
> facet=true&facet.field={!terms='fantasy,scifi,mystery'}genre_s&f.genre_s.facet.mincount=2
>
> On single-shard collections in 8.6.3 this worked as I expected -
> "fantasy", "scifi", and "mystery" were the only 3 field values
> returned, and "mystery" was returned despite its count value being
> less than the specified "mincount".  But on a multi-shard collection
> "mystery" isn't returned (presumably because a "mincount" check
> filters out the values on the facet aggregator node).
>
> What are the expected semantics when "{!terms}" and "mincount" are
> used together?  Should mincount filter out values in {!terms}, or
> should those values be excluded from any mincount filtering?  The
> behavior is clearly inconsistent between single and multi-shard, so it
> deserves a JIRA either way.  Just trying to figure out what the
> expected behavior is.
>
> Best,
>
> Jason
>


Re: [ANNOUNCE] Apache Solr 8.7.0 released

2020-11-09 Thread David Smiley
FYI an updated Docker image was just published a few hours ago:
https://hub.docker.com/_/solr

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Nov 4, 2020 at 9:06 AM Atri Sharma  wrote:

> 3/11/2020, Apache Solr™ 8.7 available
>
> The Lucene PMC is pleased to announce the release of Apache Solr 8.7
>
> Solr is the popular, blazing fast, open source NoSQL search platform
> from the Apache Lucene project. Its major features include powerful
> full-text search, hit highlighting, faceted search and analytics, rich
> document parsing, geospatial search, extensive REST APIs as well as
> parallel SQL. Solr is enterprise grade, secure and highly scalable,
> providing fault tolerant distributed search and indexing, and powers
> the search and navigation features of many of the world's largest
> internet sites.
>
>
> The release is available for immediate download at:
>
>
> https://lucene.apache.org/solr/downloads.html
>
>
> Please read CHANGES.txt for a detailed list of changes:
>
>
> https://lucene.apache.org/solr/8_7_0/changes/Changes.html
>
>
> Solr 8.7.0 Release Highlights
>
>
> SOLR-14588 -- Circuit Breakers Infrastructure and Real JVM Based Circuit
> Breaker
>
>
> SOLR-14615 –- CPU Based Circuit Breaker
>
>
> SOLR-14537 -- Improve performance of ExportWriter
>
>
> SOLR-14651 -- The MetricsHistoryHandler Can Be Disabled
>
>
> A summary of important changes is published in the Solr Reference
> Guide at https://lucene.apache.org/solr/guide/8_7/solr-upgrade-notes.html.
> For the most exhaustive list, see the full release notes at
> https://lucene.apache.org/solr/8_7_0/changes/Changes.html or by
> viewing the CHANGES.txt file accompanying the distribution.  Solr's
> release notes usually don't include Lucene layer changes.  Lucene's
> release notes are at
> https://lucene.apache.org/core/8_7_0/changes/Changes.html
>
>
> Note: The Apache Software Foundation uses an extensive mirroring network
> for
>
> distributing releases. It is possible that the mirror you are using may
> not have
>
> replicated the release yet. If that is the case, please try another mirror.
>
> This also applies to Maven access.
>
> 
>
> --
> Regards,
>
> Atri
> Apache Concerted
>


Re: Solr 8.6.3

2020-10-22 Thread David Smiley
Kris,

From a user's standpoint, the DIH is not deprecated.  I think we as a
project screwed up the messaging around components in Solr that are
*moving* in terms of code maintenance.  That is not deprecation, yet we
referred to it as such, hence your understandable confusion.  I corrected
the warning about this in 8.7, so you won't see that again.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Oct 15, 2020 at 4:13 PM Kris Gurusamy 
wrote:

> I've just downloaded solr 8.6.3 and trying to create DIH for loading
> structured XML. I found out that DIH will be deprecated soon with version
> 9.0. What is the equivalent of DIH in new solr version? How do I import
> structured XML data which is very custom and index in Solr new version? Any
> help is appreciated.
>
> Regards
>
> Kris Gurusamy
> Director, Engineering
> kgurus...@xpanse.com
> www.xpanse.com
>
> On 10/15/20, 1:08 PM, "Anshum Gupta (Jira)"  wrote:
>
>
>  [
> https://issues.apache.org/jira/browse/SOLR-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Anshum Gupta resolved SOLR-14938.
> -
> Resolution: Invalid
>
> [~krisgurusamy] - Please ask questions regarding usage on the Solr
> user mailing list.
>
> JIRA is meant for issue tracking purposes.
>
> > Solr 8.6.3
> > --
> >
> > Key: SOLR-14938
> > URL:
> https://issues.apache.org/jira/browse/SOLR-14938
> > Project: Solr
> >  Issue Type: Bug
> >  Security Level: Public(Default Security Level. Issues are
> Public)
> >  Components: contrib - DataImportHandler
> >Reporter: Krishnan
> >Priority: Major
> >
> > I've just downloaded solr 8.6.3 and trying to create DIH for loading
> structured XML. I found out that DIH will be deprecated soon with version
> 9.0. What is the equivalent of DIH in new solr version? How do I import
> structured XML data which is very custom and index in Solr new version? Any
> help is appreciated.
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
>
>


HEY, are you using the Analytics contrib?

2020-09-03 Thread David Smiley
I wonder who is using the Analytics contrib?  Why do you use it instead of
other Solr features like the JSON Faceting module that seem to have
competing functionality?  My motivation is to ascertain whether it ought to be
maintained as a 3rd party plugin/package or remain as a 1st party contrib
where Solr maintainers continue to maintain it.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


Re: What is the Best way to block certain types of queries/ query patterns in Solr?

2020-09-03 Thread David Smiley
The general assumption in deploying a search platform is that you are going
to front it with a service you write that has the search features you care
about, and only those.  Only this service or other administrative functions
should reach Solr.  Be wary of making your service so flexible that it
passes arbitrary parameters to Solr as-is -- parameters you don't know
about in advance (i.e. use an allow-list).
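
As a rough sketch of that allow-list idea (all names here are made up, not
an API Solr provides):

  import java.util.Map;
  import java.util.Set;
  import java.util.stream.Collectors;

  // Hypothetical filter in the service that fronts Solr: only parameters
  // the service explicitly supports are ever forwarded.
  public class SolrParamAllowList {
    private static final Set<String> ALLOWED =
        Set.of("q", "rows", "start", "sort", "fq");

    public static Map<String, String[]> filter(Map<String, String[]> params) {
      return params.entrySet().stream()
          .filter(e -> ALLOWED.contains(e.getKey()))
          .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
  }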

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Aug 31, 2020 at 10:57 AM Mark Robinson 
wrote:

> Hi,
> I had come across a mail (Oct, 2019 one) which suggested the best way is to
> handle it before it reaches Solr. I was curious whether:-
>1. Jetty query filter can be used (came across something like
> that, need to check)
> 2. Any new features in Solr itself (like in a request handler...or
> solrconfig, schema etc..)
>
> Thanks!
> Mark
>


Re: Error on searches containing specific character pattern

2020-09-03 Thread David Smiley
Hi,

I looked at the code at those line numbers and it seems simply impossible
that an ArrayIndexOutOfBoundsException could be thrown there because it's
guarded by a condition ensuring the array is of length 1.
https://github.com/apache/lucene-solr/blob/2752d50dd1dcf758a32dc573d02967612a2cf1ff/lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java#L653

If you can reproduce this with the "techproducts" schema, please share the
complete query.  If there's a problem here, I suspect the synonyms you have
may be pertinent.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Sep 1, 2020 at 11:50 PM Andy @ BlueFusion 
wrote:

> Hi All,
>
> I have an 8.6.0 instance that is working well with one exception.
>
> It returns an error when the search term follows a pattern of numbers &
> alpha characters such as:
>
>   * 1a1 aa
>   * 1a1 1aa
>   * 1a1 11
>
> Similar patterns that don't error
>
>   * 1a1 a
>   * 1a1 1
>   * 1a11 aa
>   * 11a1 aa
>   * 1a1aa
>   * 11a11 aa
>
> The error is:
>
> |"trace":"java.lang.ArrayIndexOutOfBoundsException: 0\n\t at
> org.apache.lucene.util.QueryBuilder.newSynonymQuery(QueryBuilder.java:653)\n\t
>
> at
> org.apache.solr.parser.SolrQueryParserBase.newSynonymQuery(SolrQueryParserBase.java:617)\n\t
>
> at
> org.apache.lucene.util.QueryBuilder.analyzeGraphBoolean(QueryBuilder.java:533)\n\t
>
> at
> org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:320)\n\t
>
> at
> org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:240)\n\t
>
> at
> org.apache.solr.parser.SolrQueryParserBase.newFieldQuery(SolrQueryParserBase.java:524)\n\t
>
> at
> org.apache.solr.parser.QueryParser.newFieldQuery(QueryParser.java:62)\n\t
> at
> org.apache.solr.parser.SolrQueryParserBase.getFieldQuery(SolrQueryParserBase.java:1122)\n\t
>
> at
> org.apache.solr.parser.QueryParser.MultiTerm(QueryParser.java:593)\n\t
> at org.apache.solr.parser.QueryParser.Query(QueryParser.java:142)\n\t at
> org.apache.solr.parser.QueryParser.Clause(QueryParser.java:282)\n\t at
> org.apache.solr.parser.QueryParser.Query(QueryParser.java:162)\n\t at
> org.apache.solr.parser.QueryParser.Clause(QueryParser.java:282)\n\t at
> org.apache.solr.parser.QueryParser.Query(QueryParser.java:162)\n\t at
> org.apache.solr.parser.QueryParser.Clause(QueryParser.java:282)\n\t at
> org.apache.solr.parser.QueryParser.Query(QueryParser.java:162)\n\t at
> org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:131)\n\t
> at
> org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:260)\n\t
>
> at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:49)\n\t
> at org.apache.solr.search.QParser.getQuery(QParser.java:174)\n\t at
> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:160)\n\t
>
> at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:302)\n\t
>
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211)\n\t
>
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2596)\n\t at
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:799)\n\t
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:578)\n\t
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:419)\n\t
>
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351)\n\t
>
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)\n\t
>
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)\n\t
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\t
>
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\t
>
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\t
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\t
>
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)\n\t
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\t
>
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)\n\t
>
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\t
>
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)\n\t
>
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)\n\t
>
> at
> org.eclip

[CVE-2020-13941] Apache Solr information disclosure vulnerability

2020-08-14 Thread David Smiley
Reported in SOLR-14515 (private) and fixed in SOLR-14561 (public), released
in Solr version 8.6.0.
The Replication handler (
https://lucene.apache.org/solr/guide/8_6/index-replication.html#http-api-commands-for-the-replicationhandler)
allows the commands backup, restore and deleteBackup. Each of these takes a
location parameter, which was not validated, i.e. you could read/write to
any location the solr user can access.

On a Windows system, SMB paths such as \\10.0.0.99\share\folder may also be
used, leading to:
* The possibility of restoring another SolrCore from a server on the
network (or mounted remote file system) may lead to:
** Exposing search index data that the attacker should otherwise not have
access to
** Replacing the index data entirely by loading it from a remote file
system that the attacker controls

* Launching SMB attacks which may result in:
** The exfiltration of sensitive data such as OS user hashes (NTLM/LM
hashes),
** In case of misconfigured systems, SMB Relay Attacks which can lead to
user impersonation on SMB Shares or, in a worse-case scenario, Remote Code
Execution

The solution implemented to address these issues was to:
* Restrict the location parameter to trusted paths
* Prevent remote connection when using Windows UNC Paths

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


Re: org.apache.lucene.util.fst.FST taking up lot of Java Heap Memory

2020-08-07 Thread David Smiley
Since you have a typical use-case (point data, queries that are
rectangles), I strongly encourage you to migrate to LatLonPointSpatialField:

https://builds.apache.org/job/Solr-reference-guide-master/javadoc/spatial-search.html#latlonpointspatialfield
It's based off an internal "BKD" tree index (doesn't use FSTs) which is
different than the terms-based index used by the RPT field that you are
using, which employs FSTs.  To be clear, FSTs are awesome, but the BKD index
is tailored for numeric data whereas terms/FSTs are not.
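
For reference, a minimal schema sketch of that migration ("my_field" is
from your mail; the type name is arbitrary):

  <fieldType name="location" class="solr.LatLonPointSpatialField"
             docValues="true"/>
  <field name="my_field" type="location" indexed="true" stored="true"/>

Your existing rectangle queries, e.g. my_field:[45,-94 TO 46,-93], should
continue to work against that field type.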

If your FSTs are/were taking up so much memory, you are probably not using
Solr 8.4.0 or beyond, which moved to having the FSTs off-heap -- at least
the ones associated with the field indexes.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Aug 6, 2020 at 8:19 PM sanjay dutt
 wrote:

> FieldType defined with class solr.SpatialRecursivePrefixTreeFieldType
>
> In this we are adding points only, although the collection has a few fields
> with points data and other fieldTypes as well.
> And one of the queries looks like
> (my_field: [45,-94 TO 46,-93]+OR+my_field: [42,-94 TO 43,-93])
>
> Thanks and Regards,Sanjay Dutt
>
> On Thursday, August 6, 2020, 12:10:04 AM GMT+5:30, David Smiley <
> dsmi...@apache.org> wrote:
>
>  What is the Solr field type definition for this field?  And what sort of
> spatial data do you add here -- just points or what?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Aug 3, 2020 at 10:09 PM sanjay dutt
>  wrote:
>
> > Hello Solr community,
> > On our Production SolrCloud Server, OutOfMemory has been occurring on lot
> > of instances. When I download the HEAP DUMP and analyzed it. I got to
> know
> > that in multiple HEAP DUMPS there are lots of instances
> > of org.apache.lucene.codecs.blocktree.BlockTreeTermsReader  which has the
> > highest retained heap memory and further I have checked the
> > outgoing-reference for those objects,
> > the  org.apache.lucene.util.fst.FST is the one which occupy 90% of the
> heap
> > memory.
> > it's like
> > Production HEAP memory :- 12GBout of
> > which  org.apache.lucene.codecs.blocktree.BlockTreeTermsReader total
> retained
> > heap :- 7-8 GB(vary from instance to
> > instance)and org.apache.lucene.util.fst.FST total retained heap :- 6-7 GB
> > Upon further looking I have calculated the total retained heap for
> > FieldReader.fieldInfo.name="my_field" is around 7GB. Now this is the
> same
> > reader which also contains reference to org.apache.lucene.util.fst.FST.
> > Now "my_field" is the field on which we are performing spatial searches.
> > Do spatial searches use FST internally, and hence we are seeing a lot of
> heap
> > memory used by FST only?
> > Is there any way we can optimize the spatial searches so that they take
> less
> > memory?
> > Can someone please give me any pointer that from where Should I start
> > looking to debug the above issue.
> > Thanks and Regards,Sanjay Dutt
> > Sent from Yahoo Mail on Android
>


Re: org.apache.lucene.util.fst.FST taking up lot of Java Heap Memory

2020-08-05 Thread David Smiley
What is the Solr field type definition for this field?  And what sort of
spatial data do you add here -- just points or what?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Aug 3, 2020 at 10:09 PM sanjay dutt
 wrote:

> Hello Solr community,
> On our Production SolrCloud Server, OutOfMemory has been occurring on lot
> of instances. When I download the HEAP DUMP and analyzed it. I got to know
> that in multiple HEAP DUMPS there are lots of instances
> of org.apache.lucene.codecs.blocktree.BlockTreeTermsReader  which has the
> highest retained heap memory and further I have checked the
> outgoing-reference for those objects,
> the  org.apache.lucene.util.fst.FST is the one which occupy 90% of the heap
> memory.
> it's like
> Production HEAP memory :- 12GB, out of
> which  org.apache.lucene.codecs.blocktree.BlockTreeTermsReader total retained
> heap :- 7-8 GB (varies from instance to
> instance) and org.apache.lucene.util.fst.FST total retained heap :- 6-7 GB
> Upon further looking I have calculated the total retained heap for
> FieldReader.fieldInfo.name="my_field" is around 7GB. Now this is the same
> reader which also contains reference to org.apache.lucene.util.fst.FST.
> Now "my_field" is the field on which we are performing spatial searches.
> Do spatial searches use FST internally, and hence we are seeing a lot of heap
> memory used by FST only?
> Is there any way we can optimize the spatial searches so that they take less
> memory?
> Can someone please give me any pointer that from where Should I start
> looking to debug the above issue.
> Thanks and Regards,Sanjay Dutt
> Sent from Yahoo Mail on Android


Re: Out of memory errors with Spatial indexing

2020-07-06 Thread David Smiley
I believe you are experiencing this bug: LUCENE-5056
<https://issues.apache.org/jira/browse/LUCENE-5056>
The fix would probably be adjusting code in here
org.apache.lucene.spatial.query.SpatialArgs#calcDistanceFromErrPct

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Jul 6, 2020 at 5:18 AM Sunil Varma  wrote:

> Hi David
> Thanks for your response. Yes, I noticed that all the data causing the issue
> were at the poles. I tried the "RptWithGeometrySpatialField" field type
> definition but get a "Spatial context does not support S2 spatial
> index" error. Setting spatialContextFactory="Geo3D" I still see the
> original OOM error.
>
> On Sat, 4 Jul 2020 at 05:49, David Smiley  wrote:
>
> > Hi Sunil,
> >
> > Your shape is at a pole, and I'm aware of a bug causing an exponential
> > explosion of needed grid squares when you have polygons super-close to
> the
> > pole.  Might you try S2PrefixTree instead?  I forget if this would fix it
> > or not by itself.  For indexing non-point data, I recommend
> > class="solr.RptWithGeometrySpatialField" which internally is based off a
> > combination of a coarse grid and storing the original vector geometry for
> > accurate verification:
> >  <fieldType name="location_rpt" class="solr.RptWithGeometrySpatialField"
> >   prefixTree="s2" />
> > The internally coarser grid will lessen the impact of that pole bug.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Fri, Jul 3, 2020 at 7:48 AM Sunil Varma 
> > wrote:
> >
> > > We are seeing OOM errors  when trying to index some spatial data. I
> > believe
> > > the data itself might not be valid but it shouldn't cause the Server to
> > > crash. We see this on both Solr 7.6 and Solr 8. Below is the input that
> > is
> > > causing the error.
> > >
> > > {
> > > "id": "bad_data_1",
> > > "spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
> > > 1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
> > > 74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
> > > 1.000150474662E30)"
> > > }
> > >
> > > Above dynamic field is mapped to field type "location_rpt" (
> > > solr.SpatialRecursivePrefixTreeFieldType).
> > >
> > >   Any pointers to get around this issue would be highly appreciated.
> > >
> > > Thanks!
> > >
> >
>


Re: unified highlighter performance in solr 8.5.1

2020-07-05 Thread David Smiley
Here's my PR, which includes some edits to the ref guide docs where I tried
to clarify these settings a little too.
https://github.com/apache/lucene-solr/pull/1651
~ David


On Sat, Jul 4, 2020 at 8:44 AM Nándor Mátravölgyi 
wrote:

> I guess that's fair. Let's have hl.fragsizeIsMinimum=true as default.
>
> On 7/4/20, David Smiley  wrote:
> > I doubt that WORD mode is impacted much by hl.fragsizeIsMinimum in terms
> of
> > quality of the highlight since there are vastly more breaks to pick from.
> > I think that setting is more useful in SENTENCE mode if you can stand the
> > perf hit.  If you agree, then why not just let this one default to
> "true"?
> >
> > We agree on better documenting the perf trade-off.
> >
> > Thanks again for working on these settings, BTW.
> >
> > ~ David
> >
> >
> > On Fri, Jul 3, 2020 at 1:25 PM Nándor Mátravölgyi <
> nandor.ma...@gmail.com>
> > wrote:
> >
> >> Since the issue seems to be affecting the highlighter differently
> >> based on which mode it is using, having different defaults for the
> >> modes could be explored.
> >>
> >> WORD may have the new defaults as it has little effect on performance
> >> and it creates nicer highlights.
> >> SENTENCE should have the defaults that produce reasonable performance.
> >> The docs could document this while also mentioning that the UH's
> >> performance is highly dependent on the underlying Java String/Text?
> >> Iterator.
> >>
> >> One can argue that having different defaults based on mode is
> >> confusing. In this case I think the defaults should be changed to have
> >> the SENTENCE mode perform better. Maybe the options for nice
> >> highlights with WORD mode could be put into the docs in this case as
> >> some form of an example.
> >>
> >> As long as I can use the UH with nicely aligned snippets in WORD mode
> >> I'm fine with any defaults. I explicitly set them in the config and in
> >> the queries most of the time anyways.
> >>
> >
>


Re: unified highlighter performance in solr 8.5.1

2020-07-03 Thread David Smiley
I doubt that WORD mode is impacted much by hl.fragsizeIsMinimum in terms of
quality of the highlight since there are vastly more breaks to pick from.
I think that setting is more useful in SENTENCE mode if you can stand the
perf hit.  If you agree, then why not just let this one default to "true"?

We agree on better documenting the perf trade-off.

Thanks again for working on these settings, BTW.

~ David


On Fri, Jul 3, 2020 at 1:25 PM Nándor Mátravölgyi 
wrote:

> Since the issue seems to be affecting the highlighter differently
> based on which mode it is using, having different defaults for the
> modes could be explored.
>
> WORD may have the new defaults as it has little effect on performance
> and it creates nicer highlights.
> SENTENCE should have the defaults that produce reasonable performance.
> The docs could document this while also mentioning that the UH's
> performance is highly dependent on the underlying Java String/Text?
> Iterator.
>
> One can argue that having different defaults based on mode is
> confusing. In this case I think the defaults should be changed to have
> the SENTENCE mode perform better. Maybe the options for nice
> highlights with WORD mode could be put into the docs in this case as
> some form of an example.
>
> As long as I can use the UH with nicely aligned snippets in WORD mode
> I'm fine with any defaults. I explicitly set them in the config and in
> the queries most of the time anyways.
>


Re: Out of memory errors with Spatial indexing

2020-07-03 Thread David Smiley
Hi Sunil,

Your shape is at a pole, and I'm aware of a bug causing an exponential
explosion of needed grid squares when you have polygons super-close to the
pole.  Might you try S2PrefixTree instead?  I forget if this would fix it
or not by itself.  For indexing non-point data, I recommend
class="solr.RptWithGeometrySpatialField" which internally is based off a
combination of a coarse grid and storing the original vector geometry for
accurate verification:

 <fieldType name="location_rpt" class="solr.RptWithGeometrySpatialField"
   prefixTree="s2" />

The internally coarser grid will lessen the impact of that pole bug.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Jul 3, 2020 at 7:48 AM Sunil Varma  wrote:

> We are seeing OOM errors  when trying to index some spatial data. I believe
> the data itself might not be valid but it shouldn't cause the Server to
> crash. We see this on both Solr 7.6 and Solr 8. Below is the input that is
> causing the error.
>
> {
> "id": "bad_data_1",
> "spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
> 1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
> 74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
> 1.000150474662E30)"
> }
>
> Above dynamic field is mapped to field type "location_rpt" (
> solr.SpatialRecursivePrefixTreeFieldType).
>
>   Any pointers to get around this issue would be highly appreciated.
>
> Thanks!
>


Re: unified highlighter performance in solr 8.5.1

2020-07-03 Thread David Smiley
I think we should flip the default of hl.fragsizeIsMinimum to be 'true',
thus have the behavior close to what preceded 8.5.
(a) it was very recently (<= 8.4) the previous behavior and so may require
less tuning for users in 8.6 henceforth
(b) it's significantly faster for long text -- seems to be 2x to 5x for
long documents (assuming no change in hl.fragAlignRatio).  If the user
additionally configures hl.fragAlignRatio to 0 (also the previous behavior;
0.5 is the new default), I saw another 6x on top of that for "doc3" in the
test data Michal prepared.

Although I like that the sizing looks nicer, I think that is more from the
introduction and new default of hl.fragAlignRatio=0.5 than it is
hl.fragsizeIsMinimum=false.  We might even consider lowering
hl.fragAlignRatio to say 0.3 and retain pretty reasonable highlights
(avoids the extreme cases occurring with '0') and additional performance
benefit from that.
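
In request-parameter form, that combination would be something like:

  hl=true&hl.method=unified&hl.fragsizeIsMinimum=true&hl.fragAlignRatio=0.3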

What do you think Nandor, Michal?

I'm hoping a change in settings (+ some better notes/docs on this) could
slip into an 8.6, all done by myself ASAP.

~ David


On Fri, Jun 19, 2020 at 2:32 PM Nándor Mátravölgyi 
wrote:

> Hi!
>
> With the provided test I've profiled the preceding() and following()
> calls on the base Java iterators in the different options.
>
> === default highlighter arguments ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 1130 calls of
> baseIter.preceding() took 1.039629 seconds in total
> - from LengthGoalBreakIterator.following(): 1140 calls of
> baseIter.following() took 0.340679 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1150 calls of
> baseIter.preceding() took 0.099344 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1100 calls of
> baseIter.following() took 0.015156 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1200 calls of
> baseIter.preceding() took 0.001006 seconds in total
> - from LengthGoalBreakIterator.following(): 1700 calls of
> baseIter.following() took 0.006278 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1710 calls of
> baseIter.preceding() took 0.016320 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1090 calls of
> baseIter.following() took 0.000527 seconds in total
>
> === hl.fragsizeIsMinimum=true=0 ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 860 calls of
> baseIter.following() took 0.012593 seconds in total
> - from LengthGoalBreakIterator.preceding(): 870 calls of
> baseIter.preceding() took 0.022208 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1360 calls of
> baseIter.following() took 0.004789 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1370 calls of
> baseIter.preceding() took 0.015983 seconds in total
>
> === hl.fragsizeIsMinimum=true ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 980 calls of
> baseIter.following() took 0.010253 seconds in total
> - from LengthGoalBreakIterator.preceding(): 980 calls of
> baseIter.preceding() took 0.341997 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1670 calls of
> baseIter.following() took 0.005150 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1680 calls of
> baseIter.preceding() took 0.013657 seconds in total
>
> === hl.fragAlignRatio=0 ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 1070 calls of
> baseIter.preceding() took 1.312056 seconds in total
> - from LengthGoalBreakIterator.following(): 1080 calls of
> baseIter.following() took 0.678575 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1080 calls of
> baseIter.preceding() took 0.020507 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1080 calls of
> baseIter.following() took 0.006977 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 880 calls of
> baseIter.preceding() took 0.000706 seconds in total
> - from LengthGoalBreakIterator.following(): 1370 calls of
> baseIter.following() took 0.004110 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1380 calls of
> baseIter.preceding() took 0.014752 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1380 calls of
> baseIter.following() took 0.000106 seconds in total
>
> There is definitely a big difference between SENTENCE and WORD. I'm
> not sure how we can improve the logic on our side while keeping the
> features as is. Since the number of calls is roughly the same for when
> the performance is good and bad, it seems to depend on what the text
> is that the iterator is traversing.
>


Re: Master Slave Terminology

2020-06-17 Thread David Smiley
priv...@lucene.apache.org, but it should have been public; expect it to
spill out to the dev list today.

~ David


On Wed, Jun 17, 2020 at 11:14 AM Mike Drob  wrote:

> Hi Jan,
>
> Can you link to the discussion? I searched the dev list and didn’t see
> anything, is it on slack or a jira or somewhere else?
>
> Mike
>
> On Wed, Jun 17, 2020 at 1:51 AM Jan Høydahl  wrote:
>
> > Hi Kaya,
> >
> > Thanks for bringing it up. The topic is already being discussed by
> > developers, so expect to see some change in this area; Not over-night,
> but
> > incremental.
> > Also, if you want to lend a helping hand, patches are more than welcome
> as
> > always.
> >
> > Jan
> >
> > > 17. jun. 2020 kl. 04:22 skrev Kayak28 :
> > >
> > > Hello, Community:
> > >
> > > As GitHub and Python will replace terminologies related to
> > > slavery,
> > > why don't we replace master-slave for Solr as well?
> > >
> > > https://developers.srad.jp/story/18/09/14/0935201/
> > >
> >
> https://developer-tech.com/news/2020/jun/15/github-replace-slavery-terms-master-whitelist/
> > >
> > > --
> > >
> > > Sincerely,
> > > Kaya
> > > github: https://github.com/28kayak
> >
> >
>


Re: Facet Performance

2020-06-17 Thread David Smiley
I strongly recommend setting indexed=true on a field you facet on for the
purposes of efficient refinement (fq=field:value).  But it strictly isn't
required, as you have discovered.
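
For instance, such a facet field could look like this in the schema (the
field name is from the thread; the rest is just a sketch):

  <field name="D_Destination" type="string" indexed="true" stored="false"
         docValues="true"/>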

~ David


On Wed, Jun 17, 2020 at 9:02 AM Michael Gibney 
wrote:

> facet.method=enum works by executing a query (against indexed values)
> for each indexed value in a given field (which, for indexed=false, is
> "no values"). So that explains why facet.method=enum no longer works.
> I was going to suggest that you might not want to set indexed=false on
> the docValues facet fields anyway, since the indexed values are still
> used for facet refinement (assuming your index is distributed).
>
> What's the number of unique values in the relevant fields? If it's low
> enough, setting docValues=false and indexed=true and using
> facet.method=enum (with a sufficiently large filterCache) is
> definitely a viable option, and will almost certainly be faster than
> docValues-based faceting. (As an aside, noting for future reference:
> high-cardinality facets over high-cardinality DocSet domains might be
> able to benefit from a term facet count cache:
> https://issues.apache.org/jira/browse/SOLR-13807)
>
> I think you didn't specifically mention whether you acted on Erick's
> suggestion of setting "uninvertible=false" (I think Erick accidentally
> said "uninvertible=true") to fail fast. I'd also recommend doing that,
> perhaps even above all else -- it shouldn't actually *do* anything,
> but will help ensure that things are behaving as you expect them to!
>
> Michael
>
> On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
>  wrote:
> >
> > Thanks, I've implemented some queries that improve the first-hit
> execution for faceting.
> >
> > Since turning off indexed on those fields, we've noticed that
> facet.method=enum no longer returns the facets when used.
> > Using facet.method=fc/fcs is significantly slower compared to
> facet.method=enum for us. Why do these two differences exist?
> >
> > On 16/06/2020, 17:52, "Erick Erickson"  wrote:
> >
> > Ok, I see the disconnect... Necessary parts of the index are read
> from disk
> > lazily. So your newSearcher or firstSearcher query needs to do
> whatever
> > operation causes the relevant parts of the index to be read. In this
> case,
> > probably just facet on all the fields you care about. I'd add
> sorting too
> > if you sort on different fields.
> >
> > The *:* query without facets or sorting does virtually nothing due
> to some
> > special handling...
> >
> > On Tue, Jun 16, 2020, 10:48 James Bodkin <
> james.bod...@loveholidays.com>
> > wrote:
> >
> > > I've been trying to build a query that I can use in newSearcher
> based off
> > > the information in your previous e-mail. I thought you meant to
> build a *:*
> > > query as per Query 1 in my previous e-mail but I'm still seeing the
> > > first-hit execution.
> > > Now I'm wondering if you meant to create a *:* query with each of
> the
> > > fields as part of the fl query parameters or a *:* query with each
> of the
> > > fields and values as part of the fq query parameters.
> > >
> > > At the moment I've been running these manually as I expected that
> I would
> > > see the first-execution penalty disappear by the time I got to
> query 4, as
> > > I thought this would replicate the actions of the newSeacher.
> > > Unfortunately we can't use the autowarm count that is available as
> part of
> > > the filterCache/filterCache due to the custom deployment mechanism
> we use
> > > to update our index.
> > >
> > > Kind Regards,
> > >
> > > James Bodkin
> > >
> > > On 16/06/2020, 15:30, "Erick Erickson" 
> wrote:
> > >
> > > Did you try the autowarming like I mentioned in my previous
> e-mail?
> > >
> > > > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> > > james.bod...@loveholidays.com> wrote:
> > > >
> > > > We've changed the schema to enable docValues for these
> fields and
> > > this led to an improvement in the response time. We found a further
> > > improvement by also switching off indexed as these fields are used
> for
> > > faceting and filtering only.
> > > > Since those changes, we've found that the first-execution for
> > > queries is really noticeable. I thought this would be the
> filterCache based
> > > on what I saw in NewRelic however it is probably trying to read the
> > > docValues from disk. How can we use the autowarming to improve
> this?
> > > >
> > > > For example, I've run the following queries in sequence and
> each
> > > query has a first-execution penalty.
> > > >
> > > > Query 1:
> > > >
> > > > q=*:*
> > > > facet=true
> > > > facet.field=D_DepartureAirport
> > > > facet.field=D_Destination
> > > > facet.limit=-1
> > > > rows=0
> > >

Re: Why Did It Match?

2020-05-29 Thread David Smiley
I've used the highlighter in the past for this but it has to do a lot more
work than "explain".  Typically that extra work is analysis of the fields'
text again.  Still, the highlighter can make sense when the individual
fields aren't otherwise searchable because you are searching on an
aggregate catch-all field.
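
For example, a request along these lines (field names are hypothetical)
highlights the individual fields even though the query runs against the
catch-all:

  q=catch_all:(vitamin c)&hl=true&hl.fl=material,product_number&hl.requireFieldMatch=false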

~ David


On Thu, May 28, 2020 at 6:40 PM Walter Underwood 
wrote:

> Are you sure they will wonder? I’d try it without that and see if the
> simpler UI is easier to use. Simple almost always wins the A/B test.
>
> You can use the highlighter to see if a field matched a term. Only use
> explain if you need all the scores.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On May 28, 2020, at 3:37 PM, Webster Homer <
> webster.ho...@milliporesigma.com> wrote:
> >
> > Thank you.
> >
> > The problem is that Endeca just provided this information. The website
> users see how each search result matched the query.
> > For example this is displayed for a hit:
> > 1 Product Result
> >
> > |  Match Criteria: Material, Product Number
> >
> > The business users will wonder why we cannot provide this information
> with the new system.
> >
> > -Original Message-
> > From: Erick Erickson 
> > Sent: Thursday, May 28, 2020 4:38 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Why Did It Match?
> >
> > Yes, debug=explain is expensive. Expensive in the sense that I’d never
> add it to every query. But if your business users are trying to understand
> why query X came back the way it did by examining individual queries, then
> I wouldn’t worry.
> >
> > You can easily see how expensive it is in your situation by looking at
> the timings returned. Debug is just a component just like facet etc and the
> time it takes is listed separately in the timings section of debug output…
> >
> > Best,
> > Erick
> >
> >> On May 28, 2020, at 4:52 PM, Webster Homer <
> webster.ho...@milliporesigma.com> wrote:
> >>
> >> My concern was that I thought that explain is resource heavy, and was
> only used for debugging queries.
> >>
> >> -Original Message-
> >> From: Doug Turnbull 
> >> Sent: Thursday, May 21, 2020 4:06 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Why Did It Match?
> >>
> >> Is your concern that the Solr explain functionality is slower than
> Endecas?
> >> Or harder to understand/interpret?
> >>
> >> If the latter, I might recommend http://splainer.io as one solution
> >>
> >> On Thu, May 21, 2020 at 4:52 PM Webster Homer <
> webster.ho...@milliporesigma.com> wrote:
> >>
> >>> My company is working on a new website. The old/current site is
> >>> powered by Endeca. The site under development is powered by Solr
> >>> (currently 7.7.2)
> >>>
> >>> Out of the box, Endeca provides the capability to show how a query
> >>> was matched in the search. The business users like this
> >>> functionality, in solr this functionality is an expensive debug
> >>> option. Is there another way to get this information from a query?
> >>>
> >>> Webster Homer
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >> --
> >> Doug Turnbull | CTO | OpenSource Connections, LLC | 240.476.9983
> >> Author: Relevant Search; Contributor: AI Powered Search
> >>
> >>

Re: unified highlighter performance in solr 8.5.1

2020-05-27 Thread David Smiley
try setting hl.fragsizeIsMinimum=true
I did some benchmarking and found that this helps quite a bit
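e.g., with the rest of your configuration unchanged, something like:

  hl=true&hl.method=unified&hl.bs.type=SENTENCE&hl.fragsizeIsMinimum=true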


BTW I used the highlights.alg benchmark file, with some changes to make it
more reflective of your scenario -- offsets in postings, and used "enwiki"
(English Wikipedia) docs, which are larger than the Reuters ones (so it
appears, anyway).  I had to do a bit of hacking to use the
LengthGoalBreakIterator, which wasn't previously used by this framework.

~ David


On Tue, May 26, 2020 at 4:42 PM Michal Hlavac  wrote:

> fine, I'l try to write simple test, thanks
>
>
>
> On Tuesday 26 May 2020 17:44:52 CEST David Smiley wrote:
>
> > Please create an issue.  I haven't reproduced it yet but it seems
> unlikely
>
> > to be user-error.
>
> >
>
> > ~ David
>
> >
>
> >
>
> > On Mon, May 25, 2020 at 9:28 AM Michal Hlavac  wrote:
>
> >
>
> > > Hi,
>
> > >
>
> > > I have field:
>
> > > 
> > > stored="true" indexed="false" storeOffsetsWithPositions="true"/>
>
> > >
>
> > > and configuration:
>
> > > true
>
> > > unified
>
> > > true
>
> > > content_txt_sk_highlight
>
> > > 2
>
> > > true
>
> > >
>
> > > Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms
> which
>
> > > is really slow.
>
> > > Same query with hl.bs.type=WORD takes from 8 - 45 ms
>
> > >
>
> > > is this normal behaviour or should I create issue?
>
> > >
>
> > > thanks, m.
>
> > >
>
> >
>
>


Re: unified highlighter performance in solr 8.5.1

2020-05-26 Thread David Smiley
Please create an issue.  I haven't reproduced it yet but it seems unlikely
to be user-error.

~ David


On Mon, May 25, 2020 at 9:28 AM Michal Hlavac  wrote:

> Hi,
>
> I have field:
>  stored="true" indexed="false" storeOffsetsWithPositions="true"/>
>
> and configuration:
> true
> unified
> true
> content_txt_sk_highlight
> 2
> true
>
> Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which
> is really slow.
> Same query with hl.bs.type=WORD takes from 8 - 45 ms
>
> is this normal behaviour or should I create issue?
>
> thanks, m.
>


Re: unified highlighter performance in solr 8.5.1

2020-05-25 Thread David Smiley
Wow that's terrible!
So this problem is for SENTENCE in particular, and it's a regression in
8.5?  I'll see if I can reproduce this with the Lucene benchmark module.

I figure you have some meaty text, like "page" size or longer?

~ David


On Mon, May 25, 2020 at 10:38 AM Michal Hlavac  wrote:

> I did same test on solr 8.4.1 and response times are same for both
> hl.bs.type=SENTENCE and hl.bs.type=WORD
>
> m.
>
> On Monday 25 May 2020 15:28:24 CEST Michal Hlavac wrote:
>
>
> Hi,
>
> I have field:
>  stored="true" indexed="false" storeOffsetsWithPositions="true"/>
>
> and configuration:
> true
> unified
> true
> content_txt_sk_highlight
> 2
> true
>
> Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which
> is really slow.
> Same query with hl.bs.type=WORD takes from 8 - 45 ms
>
> is this normal behaviour or should I create issue?
>
> thanks, m.
>
>
>


Re: highlighting a whole html document using Unified highlighter

2020-05-24 Thread David Smiley
These strategies are not mutually exclusive.  Yes I do suggest having the
HTML in whole go into one searchable field to satisfy your highlighting
use-case.  But I can imagine you will also want some document metadata in
separate fields.  It's up to you to parse that out somehow and add it.  You
mentioned you are using bin/post but, IMO, that capability is more for
quick experimentation / tutorials, some POCs, or very simple use-cases.  I
doubt you can do what I suggest while still using bin/post.  You might be
able to use "SolrCell" AKA ExtractingRequestHandler directly, which is what
bin/post does with HTML.
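
For example, something like this (core name and file are placeholders;
literal.* and captureAttr are standard SolrCell parameters):

  curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&captureAttr=true&commit=true" \
    -F "myfile=@page.html"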

Good luck!

~ David


On Sun, May 24, 2020 at 10:52 AM Serkan KAZANCI 
wrote:

> Hi David,
>
> I have many meta-tags in html documents like   content="2019-10-15T23:59:59Z"> which matches the field descriptions in
> schema file.
>
> As I understand, you propose to index the whole html document as one text
> file and map it to a search field (do you?). That would take care of the
> html highlight issue, however I would lose the field information coming
> from meta-tags .
>
> So is it possible to index the html document as html document ?
> (preserving the field data coming from meta-tags and not strip the html
> tags)
>
> Then I could use solr.HTMLStripCharFilterFactory for analysis.
>
> Thank You,
>
> Serkan,
>
>
>
>
> -Original Message-
> From: David Smiley [mailto:dsmi...@apache.org]
> Sent: Sunday, May 24, 2020 5:26 PM
> To: solr-user
> Subject: Re: highlighting a whole html document using Unified highlighter
>
> Instead of stripping the HTML for the stored value, leave it be and remove
> it during the analysis stage with solr.HTMLStripCharFilterFactory
> <
> https://builds.apache.org/job/Solr-reference-guide-master/javadoc/charfilterfactories.html#solr-htmlstripcharfilterfactory
> >
> This means the searchable text will only be the visible text, basically.
> And the highlighter will only highlight what's searchable.
>
> I suggest doing some experimentation for searching for words that you know
> are directly adjacent (no spaces) to opening and closing tags to make sure
> that the inserted HTML markup for the highlight balance correctly.  Use a
> "phrase query" (quoted) as well, and see if you can highlight around markup
> like "phrasequery" to see what happens.  You might need to set
> hl.weightMatches=false to ensure the words separately are highlighted.  I
> suspect you will find there is a problem, and the root cause is here:
> LUCENE-5734 <https://issues.apache.org/jira/browse/LUCENE-5734>   It's on
> my long TODO list but hasn't bitten me lately so I've neglected it.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Sun, May 24, 2020 at 7:20 AM Serkan KAZANCI 
> wrote:
>
> > Thanks Jörn for the answer,
> >
> > I use post tool to index html documents, so the html tags are stripped
> > when indexed and stored. The remaining text is mapped to the field
> content
> > by default.
> >
> > hl.fragsize=0 works perfect for the indexed document, but I can only
> > display highlighted text-only version of html document because the html
> > tags are stripped.
> >
> > So is it possible to index and store the html document without stripping
> > the html tags, so that when the document is displayed with hl.fragsize=0
> > parameter, it is displayed as original html document?
> >
> > Or
> >
> > Is it possible to give a whole html document as a parameter to the
> Unified
> > highlighter so that output is also a highlighted html document?
> >
> > Or
> >
> > Do you have a better idea to highlight the keywords of the whole html
> > document?
> >
> >  Thanks,
> >
> >  Serkan
> >
> > -Original Message-
> > From: Jörn Franke [mailto:jornfra...@gmail.com]
> > Sent: Sunday, May 24, 2020 1:22 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: highlighting a whole html document using Unified highlighter
> >
> > hl.fragsize=0
> >
> > https://lucene.apache.org/solr/guide/8_5/highlighting.html
> >
> >
> >
> > > Am 24.05.2020 um 11:49 schrieb Serkan KAZANCI :
> > >
> > > Hi,
> > >
> > >
> > >
> > > I use solr to search over a million html documents, when a document is
> > > searched and displayed, I want to highlight the keywords that are used
> to
> > > find and access the document.
> > >
> > >
> > >
> > > Unified highlighter is fast, accurate and supports different languages
> > but
> > > only highlights passages with given parameters.
> > >
> > >
> > >
> > > How can I highlight a whole html document using Unified highlighter? I
> > have
> > > written a php code but it cannot do the complex word stemming
> functions.
> > >
> > >
> > >
> > >
> > >
> > > Thanks,
> > >
> > >
> > >
> > > Serkan
> > >
> >
> >
>
>


Re: highlighting a whole html document using Unified highlighter

2020-05-24 Thread David Smiley
Instead of stripping the HTML for the stored value, leave it be and remove
it during the analysis stage with solr.HTMLStripCharFilterFactory
<https://builds.apache.org/job/Solr-reference-guide-master/javadoc/charfilterfactories.html#solr-htmlstripcharfilterfactory>
This means the searchable text will only be the visible text, basically.
And the highlighter will only highlight what's searchable.
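
A minimal sketch of such a field type (the name and the filters after the
char filter are just examples):

  <fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>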

I suggest doing some experimentation for searching for words that you know
are directly adjacent (no spaces) to opening and closing tags to make sure
that the inserted HTML markup for the highlight balance correctly.  Use a
"phrase query" (quoted) as well, and see if you can highlight around markup
like "phrasequery" to see what happens.  You might need to set
hl.weightMatches=false to ensure the words separately are highlighted.  I
suspect you will find there is a problem, and the root cause is here:
LUCENE-5734 <https://issues.apache.org/jira/browse/LUCENE-5734>   It's on
my long TODO list but hasn't bitten me lately so I've neglected it.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, May 24, 2020 at 7:20 AM Serkan KAZANCI 
wrote:

> Thanks Jörn for the answer,
>
> I use post tool to index html documents, so the html tags are stripped
> when indexed and stored. The remaining text is mapped to the field content
> by default.
>
> hl.fragsize=0 works perfect for the indexed document, but I can only
> display highlighted text-only version of html document because the html
> tags are stripped.
>
> So is it possible to index and store the html document without stripping
> the html tags, so that when the document is displayed with hl.fragsize=0
> parameter, it is displayed as original html document?
>
> Or
>
> Is it possible to give a whole html document as a parameter to the Unified
> highlighter so that output is also a highlighted html document?
>
> Or
>
> Do you have a better idea to highlight the keywords of the whole html
> document?
>
>  Thanks,
>
>  Serkan
>
> -Original Message-
> From: Jörn Franke [mailto:jornfra...@gmail.com]
> Sent: Sunday, May 24, 2020 1:22 PM
> To: solr-user@lucene.apache.org
> Subject: Re: highlighting a whole html document using Unified highlighter
>
> hl.fragsize=0
>
> https://lucene.apache.org/solr/guide/8_5/highlighting.html
>
>
>
> > Am 24.05.2020 um 11:49 schrieb Serkan KAZANCI :
> >
> > Hi,
> >
> >
> >
> > I use solr to search over a million html documents, when a document is
> > searched and displayed, I want to highlight the keywords that are used to
> > find and access the document.
> >
> >
> >
> > Unified highlighter is fast, accurate and supports different languages
> but
> > only highlights passages with given parameters.
> >
> >
> >
> > How can I highlight a whole html document using Unified highlighter? I
> have
> > written a php code but it cannot do the complex word stemming functions.
> >
> >
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Serkan
> >
>
>


Re: hl.preserveMulti in Unified highlighter?

2020-05-23 Thread David Smiley
Better late than never?  I added some new mail filters to bring topics of
interest to my attention.

Anyway, this seems like an important use-case.

Anthony:  You'd probably benefit from also setting hl.bs.type=WHOLE since
clearly you want whole values (no snippets/fragments of values).  If I get
around to implementing hl.preserveMulti for the UH, I'll have it make this
assumption likewise.
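
i.e., something like:

  hl=true&hl.method=unified&hl.bs.type=WHOLE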

~ David


On Sat, May 23, 2020 at 1:48 PM Walter Underwood 
wrote:

> I’m a little amused that this thread has become active after almost two
> months of silence.
>
> I think we just used the old highlighter. I don’t even remember now.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On May 23, 2020, at 9:14 AM, Anthony Groves  wrote:
> >
> > Hi Walter,
> >
> > I did something very similar to what David is suggesting when switching
> > from the PostingsHighlighter to the UnifiedHighlighter in Solr 7.
> >
> > In order to include non-highlighted items (exact ordering) when using
> > preserveMulti, we used a custom PassageFormatter that ignored the start
> and
> > end offsets:
> >
> https://github.com/oreillymedia/ifpress-solr-plugin/blob/bf3b07c5be32fbcfa7b6fdfd439d511ef60dab68/src/main/java/com/ifactory/press/db/solr/highlight/HighlightFormatter.java#L35
> >
> > I was actually surprised to see not much of a performance hit from
> > essentially removing the offset usage, but our highlighted fields aren't
> > extremely large :-)
> >
> > Hope that helps!
> > Anthony
> >
> > *Anthony Groves*  | Technical Lead, Search
> >
> > O'Reilly Media, Inc.  | https://www.linkedin.com/in/anthonygroves/
> >
> >
> > On Fri, May 22, 2020 at 4:59 PM David Smiley 
> > wrote:
> >
> >> Hi Walter,
> >>
> >> No, the UnifiedHighlighter does not behave as if this setting were true.
> >>
> >> The docs say:
> >>
> >> `hl.preserveMulti`::
> >> If `true`, multi-valued fields will return all values in the order they
> >> were saved in the index. If `false`, the default, only values that match
> >> the highlight request will be returned.
> >>
> >>
> >> The first sentence there is the essence of it.  Notice it's not
> conditional
> >> on wether there are highlights or not.  The UH won't return values
> lacking
> >> a highlight. Even hl.defaultSummary isn't triggered because *some* of
> the
> >> values have a highlight.
> >>
> >> As I look at the pertinent code right now, I imagine a solution would
> be to
> >> provide a custom PassageFormatter.  If we can assume for this use-case
> that
> >> you can use hl.bs.type=WHOLE as well, then a a simpler PassageFormatter
> >> could basically ignore the passage starts & ends and merely mark up the
> >> original content in entirety, which is a null concatenated sequence of
> all
> >> the values for this field for a document.
> >>
> >> ~ David
> >>
> >>
> >> On Fri, Mar 29, 2019 at 2:02 PM Walter Underwood  >
> >> wrote:
> >>
> >>> We are testing 6.6.1.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>> On Mar 29, 2019, at 11:02 AM, Walter Underwood  >
> >>> wrote:
> >>>>
> >>>> In testing, hl.preserveMulti=true works with the unified highlighter.
> >>> But the documentation says that the parameter is only implemented in
> the
> >>> original highlighter.
> >>>>
> >>>> Is the documentation wrong? Can we trust this to keep working with
> >>> unified?
> >>>>
> >>>> wunder
> >>>> Walter Underwood
> >>>> wun...@wunderwood.org
> >>>> http://observer.wunderwood.org/  (my blog)
> >>>>
> >>>>> On Mar 26, 2019, at 12:08 PM, Walter Underwood <
> wun...@wunderwood.org
> >>>
> >>> wrote:
> >>>>>
> >>>>> It looks like hl.preserveMulti is only implemented in the Original
> >>> highlighter. Has anyone looked at doing this for the Unified
> highlighter?
> >>>>>
> >>>>> We need to preserve order in the highlights for a multi-valued field.
> >>>>>
> >>>>> wunder
> >>>>> Walter Underwood
> >>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> >>>>> http://observer.wunderwood.org/  (my blog)
> >>>>>
> >>>>
> >>>
> >>>
> >>
>
>


Re: Creating custom PassageFormatter

2020-05-22 Thread David Smiley
You've probably gotten your answer now but "no".  Basically, you'd need to
specify your own subclass of UnifiedSolrHighlighter in solrconfig.xml like
this (the subclass name is just an example):

  <searchComponent class="solr.HighlightComponent" name="highlight">
    <highlighting class="com.mycompany.MyUnifiedSolrHighlighter"/>
  </searchComponent>

> ... "Error loading class 'solr.highlight.CustomPassageFormatter'".
>
> Example from solrconfig.xml:
> <formatter class="solr.highlight.CustomPassageFormatter">
> </formatter>
>
> I'm asking if this is still the right way? Is the "formatter" tag in XML
> valid option for Unified Highlighter?
>
> Thank you.
>
> Kind regards,
>   Damjan
>


Re: hl.preserveMulti in Unified highlighter?

2020-05-22 Thread David Smiley
Hi Walter,

No, the UnifiedHighlighter does not behave as if this setting were true.

The docs say:

`hl.preserveMulti`::
If `true`, multi-valued fields will return all values in the order they
were saved in the index. If `false`, the default, only values that match
the highlight request will be returned.


The first sentence there is the essence of it.  Notice it's not conditional
on wether there are highlights or not.  The UH won't return values lacking
a highlight. Even hl.defaultSummary isn't triggered because *some* of the
values have a highlight.

As I look at the pertinent code right now, I imagine a solution would be to
provide a custom PassageFormatter.  If we can assume for this use-case that
you can use hl.bs.type=WHOLE as well, then a a simpler PassageFormatter
could basically ignore the passage starts & ends and merely mark up the
original content in entirety, which is a null concatenated sequence of all
the values for this field for a document.

~ David


On Fri, Mar 29, 2019 at 2:02 PM Walter Underwood 
wrote:

> We are testing 6.6.1.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Mar 29, 2019, at 11:02 AM, Walter Underwood 
> wrote:
> >
> > In testing, hl.preserveMulti=true works with the unified highlighter.
> But the documentation says that the parameter is only implemented in the
> original highlighter.
> >
> > Is the documentation wrong? Can we trust this to keep working with
> unified?
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Mar 26, 2019, at 12:08 PM, Walter Underwood 
> wrote:
> >>
> >> It looks like hl.preserveMulti is only implemented in the Original
> highlighter. Has anyone looked at doing this for the Unified highlighter?
> >>
> >> We need to preserve order in the highlights for a multi-valued field.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org 
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >
>
>


Re: Alternate Fields for Unified Highlighter

2020-05-22 Thread David Smiley
Feel free to file an issue; I know it's not supported.  I also don't think
it's a big deal because you can just ask Solr to return the
"alternateField", thus letting the client side choose when to use that.  I
suppose it might be large, so I can imagine a concern there.  It'd be nice
if Solr had a DocTransformer to accomplish that.
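
For example (field names from your mail), something like:

  fl=id,score,content&hl.fl=content_en,content_es

lets the client fall back to the stored "content" value whenever a document
has no highlighting entry.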

I know it's been awhile; I'm curious how the UH has been working for you,
assuming you are using it.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, Jun 2, 2019 at 6:47 AM Furkan KAMACI  wrote:

> Hi All,
>
> I want to switch to Unified Highlighter due to performance reasons for my
> Solr 7.6 I was using these fields
>
> solrQuery.addHighlightField("content_*")
> .set("f.content_en.hl.alternateField", "content")
> .set("f.content_es.hl.alternateField", "content")
> .set("hl.useFastVectorHighlighter", "true");
> .set("hl.maxAlternateFieldLength", 300);
>
> As far as I see, there is no definitions for alternate fields for unified
> highlighter. How can I configure such a configuration?
>
> Kind Regards,
> Furkan KAMACI
>


Re: unified highlighter methods works unexpected

2020-05-22 Thread David Smiley
Hi Roland,

I was not able to reproduce this.  I modified the techproducts sample config
to change the name field to use a new field type that had a trivial
edgengram config.  Then I composed this query, based a little on some of
your parameters, and it did find highlights:
http://localhost:8983/solr/techproducts/select?defType=edismax=id%2Cname=name=unified=on=3%3C74%25=%22hard%20dri%22=name%20text=true=0.1

If you could give me reproduction instructions with
techproducts, then I can help diagnose the underlying problem and possibly
fix it.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Apr 2, 2020 at 9:02 AM Szűcs Roland 
wrote:

> Hi All,
>
> I use Solr 8.4.1 and implement suggester functionality. As part of the
> suggestions I would like to show product info so I had to implement this
> functionality with normal query parsers instead of suggester component. I
> applied an edgengramm filter without stemming to fasten the analysis of the
> query which is crucial for the suggester functionality.
> I could use the Highlight component with edismax query parser without any
> problem. This is a typical output if hl.method=original (this is the
> default):
> { "responseHeader":{ "status":0, "QTime":4, "params":{ "mm":"3<74%",
> "q":"Arany
> Já", "tie":"0.1", "defType":"edismax", "hl":"true", "echoParams":"all", "qf
> ":"author_ngram^5 title_ngram^10", "fl":"id,imageUrl,title,price",
> "pf":"author_ngram^15
> title_ngram^30", "hl.fl":"title", "hl.method":"original", "_":
> "1585830768672"}}, "response":{"numFound":2,"start":0,"docs":[ {
> "id":"369",
> "title":"Arany János összes költeményei", "price":185.0, "imageUrl":"
> https://cdn.bknw.net/prd/covers_big/369.jpg"}, { "id":"26321",
> "title":"Arany
> János összes költeményei", "price":1400.0, "imageUrl":"
> https://cdn.bknw.net/prd/covers_big/26321.jpg"}] }, "highlighting":{
> "369":{
> "title":["\n \n Arany\n \n János összes költeményei"]}, "
> 26321":{ "title":["\n \n Arany\n \n János összes
> költeményei"]}}}
>
> If I change the method to unified, I get unexpected result:
> { "responseHeader":{ "status":0, "QTime":5, "params":{ "mm":"3<74%",
> "q":"Arany
> Já", "tie":"0.1", "defType":"edismax", "hl":"true", "echoParams":"all", "qf
> ":"author_ngram^5 title_ngram^10", "fl":"id,imageUrl,title,price",
> "pf":"author_ngram^15
> title_ngram^30", "hl.fl":"title", "hl.method":"unified",
> "_":"1585830768672"
> }}, "response":{"numFound":2,"start":0,"docs":[ { "id":"369",
> "title":"Arany
> János összes költeményei", "price":185.0, "imageUrl":"
> https://cdn.bknw.net/prd/covers_big/369.jpg"}, { "id":"26321",
> "title":"Arany
> János összes költeményei", "price":1400.0, "imageUrl":"
> https://cdn.bknw.net/prd/covers_big/26321.jpg"}] }, "highlighting":{
> "369":{
> "title":[]}, "26321":{ "title":[]}}}
>
> Any idea why the newest method fails to deliver the same results?
>
> Thanks,
> Roland
>


Re: Unified highlighter with storeOffsetsWithPositions and termVectors giving an exception

2020-05-22 Thread David Smiley
FWIW I tried this on the techproducts schema with a modification to the
name field, but did not see the issue.

I suspect you did not re-index after making these schema changes.  If you
did, then also check that the collection (or core) truly started fresh
(never had any previous schema) because if you tried it one way then merely
deleted/replaced the documents after changing the schema, then some
internal metadata in the underlying index data tends to persist.  I suspect
some of the options flipped here might stay sticky.

If that really isn't it, then you might suggest to me exactly how to
reproduce this from what Solr ships with, like the techproducts example
schema and dataset.

~ David


On Sun, Jul 21, 2019 at 10:07 PM Richard Walker 
wrote:

> On 22 Jul 2019, at 11:32 am, Richard Walker 
> wrote:
> > I'm trying out the advice in the user guide
> > (
> https://lucene.apache.org/solr/guide/8_1/highlighting.html#schema-options-and-performance-considerations
> )
> > for using the unified highlighter.
> >
> > ...
> > * "set storeOffsetsWithPositions to true"
> > * "set termVectors to true but no other term vector
> >  related options on the field being highlighted"
> ...
>
> I completely forgot to mention that I also tried _just_:
>
> > * "set storeOffsetsWithPositions to true"
>
> i.e., without _also_ setting termVectors, and this _doesn't_
> give the exception.
>
> So it seems to be the _combination_ of:
> * unified highlighter
> * storeOffsetsWithPositions
> * termVectors
>
> that seems to be giving the exception.
>
>


Re: Highlighting Solr 8

2020-05-22 Thread David Smiley
What did you end up doing, Eric?  Did you migrate to the Unified
Highlighter?
~ David


On Wed, Oct 16, 2019 at 4:36 PM Eric Allen 
wrote:

> Thanks for the reply.
>
> Currently we are migrating from solr4 to solr8 under solr 4 we wrote our
> own highlighter because the provided one was too slow for our documents.
>
> We deal with many large documents, but we have full term vectors already.
> So as I understand it from my reading of the code the unified highlighter
> should be fast even on these large documents.
>
> The concern about alternate fields was if the highlighter was slow we
> could just return highlights from one field if they existed and if not then
> highlight the other fields.
>
> From my research I'm leaning towards returning highlights from all the
> fields we are interested in because I feel it will be fast.
>
> Eric Allen - Software Developer, NetDocuments
> eric.al...@netdocuments.com | O: 801.989.9691 | C: 801.989.9691
>
> -Original Message-
> From: sasarun 
> Sent: Wednesday, October 16, 2019 2:45 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Highlighting Solr 8
>
> Hi Eric,
>
> Unified highlighter does not have an option to provide alternate field
> when highlighting. That option is available with Orginal and fast vector
> highlighter. As indicated in the Solr documentation, Unified is the
> recommended method for highlighting to meet most of the use cases. Please
> do share more details in case you are facing any specific issue with
> highlighting.
>
> Thanks,
>
> Arun
>
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
>


Re: Unified highlighter- unable to get results - can get results with original and termvector highlighters

2020-05-22 Thread David Smiley
Hello,

Did you get it to work eventually?

Try setting hl.weightMatches=false and see if that helps.  Whether this
helps or not, I'd like to have a deeper understanding of the internal
structure of the Query (not the original query string).  What query parser
are you using?  If you pass debug=query to Solr then you'll get a parsed
version of the query that would be helpful to me.
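For example, taking the query from your mail (untested):

  https://solr-server/index/c1/select?q=title_text%3Azelda&hl=true&hl.fl=title_text&hl.method=unified&debug=query

The "parsedquery" entry in the debug section of the response is the part I'd
like to see.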

~ David


On Mon, May 11, 2020 at 10:46 AM Warren, David [USA] 
wrote:

> I am running Solr 8.4 and am attempting to use its highlighting feature.
> It appears to work well when I use the original highlighter or the term
> vector highlighter, but when I try to use the unified highlighter, I get no
> results returned.  My Google searches so far have not revealed anybody
> having this same problem (perhaps user error on my part), hence why I’m
> asking a question to the Solr mailing list.
>
> I am running a query which searches the “title_text” field for a term and
> highlights it.
> The configuration for title_text is this:
> <field name="title_text" type="..." multiValued="true" termVectors="true"/>
>
> The query looks like this:
>
> https://solr-server/index/c1/select?hl.fl=title_text&hl.method=unified&hl=true&q=title_text%3Azelda
>
> If hl.method=original or hl.method=termvector, I get back results in the
> highlighting section with “Zelda” surrounded by  tags.
> If hl.method=unified, all results in the highlighting section are blank.
>
> I’ve attached a remote debugger to my Solr server and verified that the
> unified highlighter class
> (org/apache/solr/highlight/UnifiedSolrHighlighter.java) is being invoked
> when I set hl.method=unified.  And I do not see any errors in the Solr logs.
>
> Any idea what I’m doing wrong? In looking at the Solr highlighting
> documentation, I didn’t see any additional configuration which needs to be
> done to get the unified highlighter to work.
>
> I realize I have not provided a bunch of information here, but obviously
> can provide more if needed.
>
> Thank you,
> David Warren
> Booz | Allen | Hamilton
> 703-625-0311 mobile
>
>


Re: Syntax error while parsing Spatial Query as string

2020-02-14 Thread David Smiley
You are asking on solr-user but your scenario seems pure Lucene.

For Lucene and indexing point-data, I strongly recommend LatLonPoint.  For
Solr, same scenario, the Solr adaptation of the same functionality is
LatLonPointSpatialField.  I know this doesn't directly address your
question.  Just looking at your email and reported error, it seems you are
supplying some custom syntax.  If you wish to proceed with the
SpatialStrategy/Spatial4j based framework, then see SpatialExample.java in
the tests which serve as documentation by example.  FYI PointVectorStrategy
is slated for removal in 9.0 as it's obsolete.
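A quick sketch of the LatLonPoint approach in plain Lucene (the field name is
an example; 0.25 miles is about 402 meters):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.LatLonPoint;
  import org.apache.lucene.search.Query;

  // indexing
  Document doc = new Document();
  doc.add(new LatLonPoint("location", 45.5099231, -122.8515139));

  // querying: everything within the radius, given in meters
  Query query = LatLonPoint.newDistanceQuery("location", 45.5099231, -122.8515139, 402.0);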

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Feb 14, 2020 at 6:47 AM vas aj  wrote:

> Hi team,
>
> I am using Lucene 6.6.2, Spatial4j 0.7, lucene-spatial-extras 6.6.2. I am
> trying to create a Spatial Query string for a given longitude, latitude &
> radius in miles.
>
> The query string generated using SpatialHelper (code as attached ) for
> long: -122.8515139 & lat: 45.5099231 in .25 miles radius  is as follow :
>
> #(+location__x:[-122.85667708964212 TO -122.84635071035788]
> +location__y:[45.50630481040378 TO 45.51354138959622])
> #frange(DistanceValueSource(PointVectorStrategy field:location
> ctx=SpatialContext.GEO, Pt(x=-122.8515139,y=45.5099231))):[0.0 TO
> 0.00361828959621958]
>
> My lucene index is as follows:
> create lucene index --name=myLuceneIndex --region=stations --field=title
> --analyzer=org.apache.lucene.analysis.en.EnglishAnalyzer
>
> I get error
> Syntax Error, cannot parse
> ConstantScore(#(+location__x:[-122.85667708964212 TO -122.84635071035788]
> +location__y:[45.50630481040378 TO 45.51354138959622])
> #frange(DistanceValueSource(PointVectorStrategy field:location
> ctx=SpatialContext.GEO, Pt(x=-122.8515139,y=45.5099231))):[0.0 TO
> 0.00361828959621958]):
>
> What am I doing wrong ?
>
> Regards,
> Aj
>


Re: Dependency log4j-slf4j-impl for solr-core:7.5.0 causing a number of build problems

2020-01-16 Thread David Smiley
Ultimately if you deduce the problem, file a JIRA issue and share it with
me; I will look into it.  I care about this matter too; I hate having to
exclude logging dependencies on the consuming end.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Jan 15, 2020 at 9:03 PM Wolf, Chris (ELS-CON) 
wrote:

> I am having several issues due to the slf4j implementation dependency
> “log4j-slf4j-impl” being declared as a dependency of solr-core:7.5.0.   The
> first issue observed when starting the app is this:
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/Users/ma-wolf2/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.7/log4j-slf4j-impl-2.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/Users/ma-wolf2/.m2/repository/ch/qos/logback/logback-classic/1.1.3/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type
> [org.apache.logging.slf4j.Log4jLoggerFactory]
>
> I first got wind that this might not be just myself from this thread:
>
> https://lucene.472066.n3.nabble.com/log4j-slf4j-impl-dependency-in-solr-core-td4449635.html#a4449891
>
>
>   *   If there are any users that integrate solr-core into their own code,
> it's currently a bit of a land-mine situation to change logging
> implementations.  If there's a way we can include log4j jars at build
> time, but remove the log4j dependency on the published solr-core
> artifact, that might work well.  We should do our best to make it so
> people can use EmbeddedSolrServer without log4j jars.
>
> There are two dimensions to this dependency problem:
>
>   *   Building a war file (this runs with a warning)
>   *   Building a spring-boot executable JAR with embedded servlet
> container (doesn’t run)
>
> When building a WAR and deploying, I get the “multiple SLF4J bindings”
> warning, but the app works. However, I want the convenience of a
> spring-boot executable JAR with embedded servlet container, but in that
> case, I get that warning followed by a fatal NoClassDefFoundError/
> ClassNotFoundException – which is a show-stopper.  If I hack the built
> spring-boot FAT jar and remove “log4j-slf4j-impl.jar” then the app works.
>
> For the WAR build, the proper version of log4j-slf4j-impl.jar was included
> – 2.11.0, but,for some reason when building the spring-boot fat (uber) jar,
> it was building with log4j-slf4j-impl:2.7 so of course it will croak.
>
> There are several issues:
>
>   1.  I don’t want log4j-slf4j-impl at all
>   2.  Somehow the version of “log4j-slf4j-impl” being used for the build
> is 2.7 rather then the expected 2.11.0
>   3.  Due to the version issue, the app croaks with
> ClassNotFoundException: org.apache.logging.log4j.util.ReflectionUtil
>
> For issue #1, I tried:
>   <dependency>
>     <groupId>org.apache.solr</groupId>
>     <artifactId>solr-core</artifactId>
>     <version>7.5.0</version>
>     <exclusions>
>       <exclusion>
>         <groupId>org.apache.logging.log4j</groupId>
>         <artifactId>log4j-slf4j-impl</artifactId>
>       </exclusion>
>     </exclusions>
>   </dependency>
>
> All to no avail, as that dependency ends up in the packaged build - for
> WAR, it’s version 2.11.0, so even though it’s a bad build, the app runs,
> but for building a spring-boot executable JAR with embedded webserver, for
> some reason, it switches log4j-slf4j-impl from version 2.11.0 to 2.7
> (2.11.0  works, but should not even be there)
>
> I also tried this:
>
> https://docs.spring.io/spring-boot/docs/current/maven-plugin/examples/exclude-dependency.html
>
> …that didn’t work either.
>
> I’m thinking that solr-core should have added a classifier of “provided”
> for “log4j-slf4j-impl”, but that’s conjecture of a possible solution going
> forward, but does anyone know how I can exclude  “log4j-slf4j-impl”  from a
> spring-boot build?
>
>
>
>
>
>


Re: Solr spatial search - overlapRatio of polygons

2020-01-08 Thread David Smiley
Your English is perfect.

I forwarded my response without your contact info.

I *do* follow solr-users but only certain key words like "spatial" (and
some other topics) and some words related to that domain (e.g. polygon,
etc.).  So your post would have gotten my attention.

On Wed, Jan 8, 2020 at 1:16 PM David Smiley  wrote:

> My response to a direct email (copying here with permission):
>
> It's possible; you'll certainly have to write some code here to make this
> work, including some new Solr plugin; perhaps ValueSourceParser that can
> compute a more accurate overlap.  Such a thing would have to get the
> Spatial4J Shape from the RptWithGeometrySpatialField (see getValues).  Then
> some casting to unwrap it to get to a JTS Geometry.  All this is going to
> be slow, so I propose you use Solr query re-ranking to only do this on the
> top results that are based on the bounding-box overlap ratio as an
> approximation.
>
> https://lucene.apache.org/solr/guide/8_3/query-re-ranking.html
>
> -- Forwarded message -
> From: Marc
> Date: Tue, Jan 7, 2020 at 6:14 AM
> Subject: Solr spatial search - overlapRatio of polygons
> To: David Smiley 
>
>
>
> Dear Mr Smiley,
>
> I have a tricky question concerning the spatial search features of
> Solr and therefore I am directly contacting you, as a specialist.
>
> Currently I am developing a new catalogue for our map collection with
> Solr. I would like to sort the search results by the overlap ratio of
> the search rectangle and the polygon of the map corners. Solr provides
> such a feature for comparing and sorting bounding boxes only.
> But it should be possible to compare polygons with the help of JTS
> functions
> (
> locationtech.github.io/jts/javadoc/org/locationtech/jts/geom/Geometry.html).
>
> With intersection() you can compute the geometry of the overlapping
> part. Afterwards you may calculate the size of it with getArea() and
> compare it with the size of the search rectangle.
> Is there a way to use such JTS functions in a Solr query? Or do you
> know another option to sort by the overlap ratio of polygons?
>
>
>


Fwd: Solr spatial search - overlapRatio of polygons

2020-01-08 Thread David Smiley
My response to a direct email (copying here with permission):

It's possible; you'll certainly have to write some code here to make this
work, including some new Solr plugin; perhaps ValueSourceParser that can
compute a more accurate overlap.  Such a thing would have to get the
Spatial4J Shape from the RptWithGeometrySpatialField (see getValues).  Then
some casting to unwrap it to get to a JTS Geometry.  All this is going to
be slow, so I propose you use Solr query re-ranking to only do this on the
top results that are based on the bounding-box overlap ratio as an
approximation.

https://lucene.apache.org/solr/guide/8_3/query-re-ranking.html
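A sketch of the core computation once you have both JTS geometries (the
method name is mine; handling of empty or invalid geometries is omitted):

  import org.locationtech.jts.geom.Geometry;

  static double overlapRatio(Geometry mapOutline, Geometry searchRect) {
    Geometry overlap = mapOutline.intersection(searchRect); // shared region
    double rectArea = searchRect.getArea();
    return rectArea == 0 ? 0 : overlap.getArea() / rectArea;
  }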

-- Forwarded message -
From: Marc
Date: Tue, Jan 7, 2020 at 6:14 AM
Subject: Solr spatial search - overlapRatio of polygons
To: David Smiley 



Dear Mr Smiley,

I have a tricky question concerning the spatial search features of
Solr and therefore I am directly contacting you, as a specialist.

Currently I am developing a new catalogue for our map collection with
Solr. I would like to sort the search results by the overlap ratio of
the search rectangle and the polygon of the map corners. Solr provides
such a feature for comparing and sorting bounding boxes only.
But it should be possible to compare polygons with the help of JTS
functions
(locationtech.github.io/jts/javadoc/org/locationtech/jts/geom/Geometry.html).

With intersection() you can compute the geometry of the overlapping
part. Afterwards you may calculate the size of it with getArea() and
compare it with the size of the search rectangle.
Is there a way to use such JTS functions in a Solr query? Or do you
know another option to sort by the overlap ratio of polygons?


Re: [ANNOUNCE] Apache Solr 8.3.1 released

2019-12-09 Thread David Smiley
Thanks.  I observe we too often write in that way and leave it up to the
reader to assume we don’t intentionally add bugs :-)

On Mon, Dec 9, 2019 at 5:45 AM Colvin Cowie 
wrote:

> Oh, just looking at the way the announcement reads on
> http://lucene.apache.org/solr/news.html :
> Solr 8.3.1 Release Highlights:
>
>- JavaBinCodec has concurrent modification of CharArr resulting in
>corrupt internode updates
>
> That kind of sounds like the corrupt internode updates is something that
> has been *introduced* by the release rather than being fixed. Maybe that
> could just be changed to:
>
>- Fixed: JavaBinCodec has concurrent modification of CharArr resulting
>in corrupt internode updates
>
>
> Thanks
>
> On Fri, 6 Dec 2019 at 01:22, Paras Lehana 
> wrote:
>
> > Yup, now reflected. :)
> >
> > On Thu, 5 Dec, 2019, 19:43 Erick Erickson, 
> > wrote:
> >
> > > It’s there for me when I click on your link.
> > >
> > > > On Dec 5, 2019, at 1:08 AM, Paras Lehana  >
> > > wrote:
> > > >
> > > > Hey Ishan,
> > > >
> > > > Cannot find 8.3.1 here:
> https://lucene.apache.org/solr/downloads.html
> > > (8.3.0
> > > > is listed here).
> > > >
> > > > Anyways, I'm downloading it from here:
> > > > https://archive.apache.org/dist/lucene/solr/8.3.1/
> > > >
> > > >
> > > >
> > > > On Wed, 4 Dec 2019 at 20:27, Rahul Goswami 
> > > wrote:
> > > >
> > > >> Thanks Ishan. I was just going through the list of fixes in 8.3.1
> > > >> (published in changes.txt) and couldn't see the below JIRA.
> > > >>
> > > >> SOLR-13971 :
> > Velocity
> > > >> response writer's resource loading now possible only through startup
> > > >> parameters.
> > > >>
> > > >> Is it linked appropriately? Or is it some access rights issue for
> > > non-PMC
> > > >> members like me ?
> > > >>
> > > >> Thanks,
> > > >> Rahul
> > > >>
> > > >>
> > > >> On Wed, Dec 4, 2019 at 7:12 AM Noble Paul 
> > wrote:
> > > >>
> > > >>> Thanks ishan
> > > >>>
> > > >>> On Wed, Dec 4, 2019, 3:32 PM Ishan Chattopadhyaya <
> > > >>> ichattopadhy...@gmail.com>
> > > >>> wrote:
> > > >>>
> > >  ## 3 December 2019, Apache Solr™ 8.3.1 available
> > > 
> > >  The Lucene PMC is pleased to announce the release of Apache Solr
> > > 8.3.1.
> > > 
> > >  Solr is the popular, blazing fast, open source NoSQL search
> platform
> > >  from the Apache Lucene project. Its major features include
> powerful
> > >  full-text search, hit highlighting, faceted search, dynamic
> > >  clustering, database integration, rich document handling, and
> > >  geospatial search. Solr is highly scalable, providing fault
> tolerant
> > >  distributed search and indexing, and powers the search and
> > navigation
> > >  features of many of the world's largest internet sites.
> > > 
> > >  Solr 8.3.1 is available for immediate download at:
> > > 
> > >   
> > > 
> > >  ### Solr 8.3.1 Release Highlights:
> > > 
> > >   * JavaBinCodec has concurrent modification of CharArr resulting
> in
> > >  corrupt internode updates
> > >   * findRequestType in AuditEvent is more robust
> > >   * CoreContainer.auditloggerPlugin is volatile now
> > >   * Velocity response writer's resource loading now possible only
> > >  through startup parameters
> > > 
> > > 
> > >  Please read CHANGES.txt for a full list of changes:
> > > 
> > >   
> > > 
> > >  Solr 8.3.1 also includes bugfixes in the corresponding Apache
> > >  Lucene release:
> > > 
> > >   
> > > 
> > >  Note: The Apache Software Foundation uses an extensive mirroring
> > > >> network
> > >  for
> > >  distributing releases. It is possible that the mirror you are
> using
> > > may
> > >  not have
> > >  replicated the release yet. If that is the case, please try
> another
> > > >>> mirror.
> > >  This also applies to Maven access.
> > > 
> > > 
> > -
> > >  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > >  For additional commands, e-mail: dev-h...@lucene.apache.org
> > > 
> > > 
> > > >>>
> > > >>
> > > >
> > > >
> > > > --
> > > > --
> > > > Regards,
> > > >
> > > > *Paras Lehana* [65871]
> > > > Development Engineer, Auto-Suggest,
> > > > IndiaMART Intermesh Ltd.
> > > >
> > > > 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > > > Noida, UP, IN - 201303
> > > >
> > > > Mob.: +91-9560911996
> > > > Work: 01203916600 | Extn:  *8173*
> > > >
> > >
> > >
> >
> >
>

Re: Re: Need urgent help with Solr spatial search using SpatialRecursivePrefixTreeFieldType

2019-10-01 Thread David Smiley
Do you know how URLs are structured?  They include name=value pairs
separated by ampersands.  This takes precedence over the contents of any
particular name or value.  Consequently, looking at your parentheses doesn't
make sense since the open and close parens span ampersands and thus fall into
different filter queries.  I think you can completely remove those parentheses,
in fact.  Also try a tool like Postman to compose your queries rather than
direct URL manipulation.

&sfield=adminLatLon
&d=80
&fq= {!geofilt pt=33.0198431,-96.6988856} OR {!geofilt pt=50.2171726,8.265894}

Notice the leading space after 'fq'.  This is a syntax parsing gotcha that
has to do with how embedded queries are parsed, which is what you need to
do as you need to compose two with an operator.  It'd be kinda awkward to
fix that gotcha in Solr.  There are other techniques too, but this is the
most succinct.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Oct 1, 2019 at 7:34 AM anushka gupta <
anushka_gu...@external.mckinsey.com> wrote:

> Thanks,
>
> Could you please help me in combining two geofilt fqs as the following
> gives
> error, it treats ")" as part of the d parameter and gives error that
> 'd=80)'
> is not a valid param:
>
>
> ({!geofilt}=adminLatLon=33.0198431,-96.6988856=80)+OR+({!geofilt}=adminLatLon=50.2171726,8.265894=80)
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Re: Need urgent help with Solr spatial search using SpatialRecursivePrefixTreeFieldType

2019-09-30 Thread David Smiley
"sort" is a regular request parameter.  In your non-working query, you
specified it as a local-param inside geofilt which isn't where it belongs.
If you want to sort from two points then you need to make up your mind on
how to combine the distances into some greater aggregate function (e.g.
min/max/sum).
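Something like this might work, using geodist's explicit-argument form
mentioned elsewhere on this list (a sketch; verify against your field type):

  &sort=min(geodist(adminLatLon,33.0198431,-96.6988856),geodist(adminLatLon,50.2171726,8.265894)) asc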

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Sep 30, 2019 at 10:22 AM Anushka Gupta <
anushka_gu...@external.mckinsey.com> wrote:

> Hi,
>
>
>
> I want to be able to filter on different cities and also sort the results
> based on geoproximity. But sorting doesn’t work:
>
>
>
>
> admin_directory_search_geolocation?q=david=({!geofilt+sfield=adminLatLon+pt=33.0198431,-96.6988856+d=80+sort=min(geodist(33.0198431,-96.6988856))})+OR+({!geofilt+sfield=adminLatLon+pt=50.2171726,8.265894+d=80+sort=min(geodist(50.2171726,8.265894))})
>
>
>
> Sorting works fine if I add ‘&’ in geofilt condition like :
> q=david={!geofilt=adminLatLon=33.0198431,-96.6988856=80=geodist(33.0198431,-96.6988856)}
>
>
>
> But when I combine the two FQs then sorting doesn’t work.
>
>
>
> Please help.
>
>
>
>
>
> Best regards,
>
> Anushka gupta
>
>
>
>
>
>
>
> *From:* David Smiley 
> *Sent:* Friday, September 13, 2019 10:29 PM
> *To:* Anushka Gupta 
> *Subject:* [EXT]Re: Need urgent help with Solr spatial search using
> SpatialRecursivePrefixTreeFieldType
>
>
>
> Hello,
>
>
>
> Please don't email me directly for public help.  CC is okay if you send it
> to solr-user@lucene.apache.org so that the Solr community can benefit
> from my answer or might even answer it.
>
>
> ~ David Smiley
>
> Apache Lucene/Solr Search Developer
>
> http://www.linkedin.com/in/davidwsmiley
>
>
>
>
>
> On Wed, Sep 11, 2019 at 11:27 AM Anushka Gupta <
> anushka_gu...@external.mckinsey.com> wrote:
>
> Hello David,
>
>
>
> I read a lot of articles of yours regarding Solr spatial search using
> SpatialRecursivePrefixTreeFieldType. But unfortunately it doesn’t work for
> me when I combine filter query with my keyword search.
>
>
>
> Solr Version used : Solr 7.1.0
>
>
>
> I have declared fields as :
>
>
>
> <fieldType name="..." class="solr.SpatialRecursivePrefixTreeFieldType"
>   geo="true" maxDistErr="0.001" distErrPct="0.025"
>   distanceUnits="kilometers"/>
>
> <field name="adminLatLon" type="..." stored="true" multiValued="true" />
>
>
>
>
>
> Field values are populated like :
>
> adminLatLon: [50.2171726,8.265894]
>
>
>
> Query is :
>
>
> /solr/ac3_persons/admin_directory_search_location?q=Idstein=Idstein={!geofilt%20cache=false%20cost=100}=adminLatLon=50.2171726,8.265894=500=recip(geodist(),2,200,20)=true
>
>
>
> My request handler is :
>
> admin_directory_search_location
>
>
>
> I get results if I do :
>
> /solr/ac3_persons/admin_directory_search_location?q=*:*
> =Idstein={!geofilt%20cache=false%20cost=100}=adminLatLon=50.2171726,8.265894=500=recip(geodist(),2,200,20)=true
>
>
>
> But I do not get results when I add any keyword in q.
>
>
>
> I am stuck in this issue since last many days. Could you please help with
> the same.
>
>
>
>
>
> Thanks,
>
> Anushka Gupta
>
>
>
>


Re: Solr Backup restore

2019-09-13 Thread David Smiley
It would help if you could devise a simple set of command line steps to
reproduce/demonstrate the problem using the "bin/solr -e solrcloud" setup.
The problem you see ought to be reproducible here if there is a problem.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Sep 12, 2019 at 10:13 AM Mohammed Farhan Ejaz 
wrote:

> Hello,
>
> I have a Solr Cloud with 2 node cluster. It has 2 replicas one on each node
> with a single shard.
>
> The cores created are <collection>_shard1_replica1 and
> <collection>_shard1_replica2.
>
> When I create a collection back up and restore the documents are indexed
> properly on both the nodes, but the cores created are
> <>_shard1_replica0 and
> <>_shard1_replica1
>
> Additionally, when I delete or add documents it gets only deleted from one
> node which means the replication does not work. I noticed on one node I do
> not have the index folder on one of the nodes from where document is not
> getting deleted or added.
>
> What could I be possibly doing wrong?
>


Re: Migrating Bounding box from Lucene to Solr

2019-09-09 Thread David Smiley
Hi Amjad,

As you've seen from the ref guide, an arbitrary rectangle query *is*
supported.  Your query looks fine, though I can't be sure if the particular
shape/coordinates are what you intend.  You have a horizontal line in the
vicinity of the US state of Oklahoma.  Your data, on the other hand, is in
the UK.  It's also unclear what field type you are using.  If you have a
polygon then use RptWithGeometrySpatialField and provide it as such using
either WKT or GeoJSON.  Supplying a list of points runs the risk that the
query won't actually intersect those points.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Sep 9, 2019 at 10:10 AM Amjad Khan  wrote:

> Hi,
> I am migrating my code from Lucene to Solr and stuck on bounding box query.
>
> As per lucene we had this query below
>
> IndexService regionIndex = 
> IndexServiceFactory.getIndexService("HotelRegionIndexService");
>
> AbstractQuerySearcher querySearcher = regionIndex.getSearcher();
>
> SpatialArgs spatialArgs = new SpatialArgs(SpatialOperation.Intersects, shape);
> Filter filter = querySearcher.getSpatialStrategy().makeFilter(spatialArgs);
>
> List regionResults = querySearcher.executeQuery(new 
> MatchAllDocsQuery(), filter,
>   HotelSearchConfig.getMaxRegionsInBoundingBox(), null, null);
>
>
> However in solr bounding box takes only center lat Lon and radius, that
> does not work since our client pass us 4 coordinates and I want to make it
> backward compatible.
>
> So found in solr document to use range query
> https://lucene.apache.org/solr/guide/8_1/spatial-search.html#filtering-by-an-arbitrary-rectangle
>
> But even after using this we are not getting data return by solr query
>
> As an example, select?q=*:*&fq=POLYGON_VERTICES:[35,-96+TO+36,-95] did not
> return any record.
>
> We have data in solrcloud in this format below
>
>
>
>
> Any help will be appreciated.
>
> Thanks
>


Re: Query field alias - issue with circular reference

2019-09-08 Thread David Smiley
No but this seems like a decent enhancement request.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Aug 9, 2019 at 3:07 AM Jaroslaw Rozanski 
wrote:

> Hi Folks,
>
>
>
> Question about query field aliases.
>
>
>
> Assuming one has fields:
>
>  * foo1
>  * foo2
> Sending "defType=edismax&q=foo:hello&f.foo.qf=foo1 foo2" will work.
>
>
>
> But what in case of, when one has fields:
>
>  * foo
>  * foo1
> Say we want to add behaviour to queries that are already in use. We want
> to search in existing "foo" and "foo1" without making query changes.
>
>
>
> Sending "defType=edismax&q=foo:hello&f.foo.qf=foo foo1" will *not* work.
> The error is "org.apache.solr.search.SyntaxError: Field aliases lead to a
> cycle".
>
>
>
> So, is there anyway, to extend search query for the existing field without
> modifying index?
>
>
> --
> Jaroslaw Rozanski | m...@jarekrozanski.eu
>


Re: upgrading from solr4 to solr8 searches taking 4 to 10 times as long to return

2019-09-07 Thread David Smiley
Also consider substituting grouping with expand/collapse (see the ref
guide).  The latter performs much better in general, although grouping does
have certain options that are uniquely valuable like ensuring that facet
counts look at the aggregate (if you want that).  I wish we could outright
remove grouping; it's a complexity weight on our codebase.
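For example, a grouping request on your "fingerprint" field can usually be
replaced with something like this (a sketch; adjust to your needs):

  &fq={!collapse field=fingerprint}&expand=true

The main result list keeps one document per fingerprint, and the "expanded"
section of the response carries the rest of each group.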

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sat, Sep 7, 2019 at 5:15 PM David Smiley 
wrote:

> 10s of seconds to respond to a simple match-all query, especially to just
> a single shard via using distrib=false, is very bizarre.  What's the
> "QTime" on one of these? -- also super long or sub-second?
>
> I took a brief look at your schema with a hunch.  I see you have
> docValues=true on your ID field -- makes sense to me.  You also have
> version=1.5 on the schema instead of 1.6.  Why did you not do 1.6?  With
> 1.5 useDocValuesAsStored is false by default.  try toggling the version
> number to 1.6.  And try your query with "fl=id" and see how that changes
> the times.
>
> I also took a look at your solrconfig.xml with a hunch, and now think I
> found the smoking gun.  I see you've modified the /select request handler
> to add a bunch of defaults, including, of all things, grouping.  Thus when
> you report to us your queries are simple *:* queries, the reality is far
> different.  I wish people would treat /select as immutable and instead
> create request handlers for their apps' needs.
>
> Nonetheless my investigation here only reveals that your test queries are
> actually very complex and thus explains their overall slowness.  We don't
> know why Solr 8 performs slower than Solr 4 here.  For that I think we've
> given you some tips.  Get back to a simple query and compare that.  Try
> certain features in isolation (e.g. *just* the grouping).  Maybe it's
> that.  You might experiment with switching "fingerprint" (the string field
> you group on) from docValues=true to false to see if it's a docValues perf
> issue compared to uninverting.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Sat, Sep 7, 2019 at 3:06 PM Russell Bahr  wrote:
>
>> Hi David,
>> I ran the *:* query 10 times against all 30 servers and the results
>> (below)
>> were similar across all of them. I agree working against a single server
>> is
>> easier troubleshooting, but I do not know where to start.
>>
>> Server shard replica, Matches, Time, Pass
>> 16 1 n2 2989421 78800 1
>> 20 1 n1 2989559 63246 1
>> 23 1 n8 2989619 55141 1
>> 28 1 n6 2989619 65536 1
>> 17 1 n4 2989818 56694 1
>> 26 2 n10 2990088 63485 1
>> 21 2 n18 2990145 68077 1
>> 11 2 n16 2990145 62271 1
>> 13 2 n12 2990242 68564 1
>> 27 2 n14 2990242 63739 1
>> 10 3 n26 2988056 69117 1
>> 25 3 n24 2988056 73750 1
>> 12 3 n28 2988096 61948 1
>> 6 3 n20 2988123 62174 1
>> 19 3 n22 2988123 65826 1
>> 1 4 n30 2985457 60404 1
>> 29 4 n34 2985457 68498 1
>> 30 4 n38 2985604 72034 1
>> 9 4 n36 2902757 65943 1
>> 15 4 n32 2985948 67208 1
>> 7 5 n48 2992278 63098 1
>> 5 5 n42 2992363 69503 1
>> 8 5 n44 2992363 66818 1
>> 4 5 n40 2992397 66784 1
>> 14 5 n46 2883495 58759 1
>> 3 6 n56 2878221 52265 1
>> 22 6 n58 2878221 53768 1
>> 24 6 n52 2878326 62174 1
>> 2 6 n50 2878326 53143 1
>> 18 6 n54 2878326 59044 1
>>
>> Results from 10 passes
>> p-solr-8-16.obscured.com:8983/solr/content_shard1_replica_n2/ 69697.8
>> 4599.8171896
>> Query time milliseconds [78800, 65549, 68045, 72151, 62774, 69168, 66459,
>> 74336, 69028, 70668]
>> p-solr-8-20.obscured.com:8983/solr/content_shard1_replica_n1/ 58310.5
>> 4531.23621224
>> Query time milliseconds [63246, 59626, 61001, 59366, 53028, 58693, 58832,
>> 64633, 54659, 50021]
>> p-solr-8-23.obscured.com:8983/solr/content_shard1_replica_n8/ 57778.5
>> 4659.23933348
>> Query time milliseconds [55141, 55194, 59100, 62614, 65425, 59261, 58961,
>> 59259, 53799, 49031]
>> p-solr-8-28.obscured.com:8983/solr/content_shard1_replica_n6/ 64944.1
>> 3382.61379705
>> Query time milliseconds [65536, 67825, 69829, 60059, 63616, 67588, 68443,
>> 60853, 62666, 63026]
>> p-solr-8-17.obscured.com:8983/solr/content_shard1_replica_n4/ 58018.9
>> 4821.9028851
>> Query time milliseconds [56694, 58900, 55404, 51590, 66034, 51256, 57109,
>> 57515, 63530, 62157]
>> p-solr-8-26.obscured.com:8983/solr/content_shard2_replica_n10/ 59366.6
>> 5036.84751936
>> Query time milliseconds [63485, 53315, 64845, 62077, 54313, 52

Re: upgrading from solr4 to solr8 searches taking 4 to 10 times as long to return

2019-09-07 Thread David Smiley
10s of seconds to respond to a simple match-all query, especially to just a
single shard via using distrib=false, is very bizarre.  What's the "QTime"
on one of these? -- also super long or sub-second?

I took a brief look at your schema with a hunch.  I see you have
docValues=true on your ID field -- makes sense to me.  You also have
version=1.5 on the schema instead of 1.6.  Why did you not do 1.6?  With
1.5 useDocValuesAsStored is false by default.  Try toggling the version
number to 1.6.  And try your query with "fl=id" and see how that changes
the times.

I also took a look at your solrconfig.xml with a hunch, and now think I
found the smoking gun.  I see you've modified the /select request handler
to add a bunch of defaults, including, of all things, grouping.  Thus when
you report to us your queries are simple *:* queries, the reality is far
different.  I wish people would treat /select as immutable and instead
create request handlers for their apps' needs.

Nonetheless my investigation here only reveals that your test queries are
actually very complex and thus explains their overall slowness.  We don't
know why Solr 8 performs slower than Solr 4 here.  For that I think we've
given you some tips.  Get back to a simple query and compare that.  Try
certain features in isolation (e.g. *just* the grouping).  Maybe it's
that.  You might experiment with switching "fingerprint" (the string field
you group on) from docValues=true to false to see if it's a docValues perf
issue compared to uninverting.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sat, Sep 7, 2019 at 3:06 PM Russell Bahr  wrote:

> Hi David,
> I ran the *:* query 10 times against all 30 servers and the results (below)
> were similar across all of them. I agree working against a single server is
> easier troubleshooting, but I do not know where to start.
>
> Server shard replica, Matches, Time, Pass
> 16 1 n2 2989421 78800 1
> 20 1 n1 2989559 63246 1
> 23 1 n8 2989619 55141 1
> 28 1 n6 2989619 65536 1
> 17 1 n4 2989818 56694 1
> 26 2 n10 2990088 63485 1
> 21 2 n18 2990145 68077 1
> 11 2 n16 2990145 62271 1
> 13 2 n12 2990242 68564 1
> 27 2 n14 2990242 63739 1
> 10 3 n26 2988056 69117 1
> 25 3 n24 2988056 73750 1
> 12 3 n28 2988096 61948 1
> 6 3 n20 2988123 62174 1
> 19 3 n22 2988123 65826 1
> 1 4 n30 2985457 60404 1
> 29 4 n34 2985457 68498 1
> 30 4 n38 2985604 72034 1
> 9 4 n36 2902757 65943 1
> 15 4 n32 2985948 67208 1
> 7 5 n48 2992278 63098 1
> 5 5 n42 2992363 69503 1
> 8 5 n44 2992363 66818 1
> 4 5 n40 2992397 66784 1
> 14 5 n46 2883495 58759 1
> 3 6 n56 2878221 52265 1
> 22 6 n58 2878221 53768 1
> 24 6 n52 2878326 62174 1
> 2 6 n50 2878326 53143 1
> 18 6 n54 2878326 59044 1
>
> Results from 10 passes
> p-solr-8-16.obscured.com:8983/solr/content_shard1_replica_n2/ 69697.8
> 4599.8171896
> Query time milliseconds [78800, 65549, 68045, 72151, 62774, 69168, 66459,
> 74336, 69028, 70668]
> p-solr-8-20.obscured.com:8983/solr/content_shard1_replica_n1/ 58310.5
> 4531.23621224
> Query time milliseconds [63246, 59626, 61001, 59366, 53028, 58693, 58832,
> 64633, 54659, 50021]
> p-solr-8-23.obscured.com:8983/solr/content_shard1_replica_n8/ 57778.5
> 4659.23933348
> Query time milliseconds [55141, 55194, 59100, 62614, 65425, 59261, 58961,
> 59259, 53799, 49031]
> p-solr-8-28.obscured.com:8983/solr/content_shard1_replica_n6/ 64944.1
> 3382.61379705
> Query time milliseconds [65536, 67825, 69829, 60059, 63616, 67588, 68443,
> 60853, 62666, 63026]
> p-solr-8-17.obscured.com:8983/solr/content_shard1_replica_n4/ 58018.9
> 4821.9028851
> Query time milliseconds [56694, 58900, 55404, 51590, 66034, 51256, 57109,
> 57515, 63530, 62157]
> p-solr-8-26.obscured.com:8983/solr/content_shard2_replica_n10/ 59366.6
> 5036.84751936
> Query time milliseconds [63485, 53315, 64845, 62077, 54313, 52607, 65389,
> 55977, 63486, 58172]
> p-solr-8-21.obscured.com:8983/solr/content_shard2_replica_n18/ 61844.1
> 4623.13444537
> Query time milliseconds [68077, 61117, 64284, 65393, 60580, 57495, 58068,
> 67454, 62370, 53603]
> p-solr-8-11.obscured.com:8983/solr/content_shard2_replica_n16/ 61179.1
> 4224.86040401
> Query time milliseconds [62271, 66059, 67076, 55706, 60905, 58617, 56561,
> 66308, 57100, 61188]
> p-solr-8-13.obscured.com:8983/solr/content_shard2_replica_n12/ 69578.3
> 3986.83530998
> Query time milliseconds [68564, 67411, 71644, 75938, 73772, 69780, 67438,
> 72479, 66368, 62389]
> p-solr-8-27.obscured.com:8983/solr/content_shard2_replica_n14/ 59808.2
> 4896.04649579
> Query time milliseconds [63739, 59873, 65775, 50280, 63009, 60955, 55516,
> 64130, 60016, 54789]
> p-solr-8-10.obscured.com:8983/solr/content_shard3_replica_n26/ 66038.1
> 3363.2

Re: upgrading from solr4 to solr8 searches taking 4 to 10 times as long to return

2019-09-05 Thread David Smiley
I suggest first working with a single machine to see if it responds
substantially slower with the new version.  Just find one of yours and
issue it a query that will resolve locally (distrib=false param).  Your
current collection level queries are internally issuing such queries, and
so with a little bit of sleuthing, looking at logs, you can find a shard
level query like this.  If it's quick then there's some distributed aspect
to investigate.  But you'll probably see the slowness here, and the problem
is better scoped and easier to diagnose.  At this point look at timings
with debug=timing to see information on each of the components.  That may
give you a strong clue.  If it's in the QueryComponent which actually
executes the underlying search then you have some further digging to do.
Use a profiler like JVisualVM.
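For example, against a single core (core name taken from elsewhere in this
thread; untested):

  curl "http://p-solr-8-16.obscured.com:8983/solr/content_shard1_replica_n2/select?q=*:*&distrib=false&debug=timing"

The "timing" section of the debug output breaks the time down per search
component.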

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


Re: ExecutorService support in SolrIndexSearcher

2019-08-30 Thread David Smiley
It'd take some work to do that.  Years ago I recall Etsy did a POC and
shared their experience at Lucene/Solr Revolution in Washington DC; I
attended the presentation with great interest.  One of the major obstacles,
if I recall, was the Collector needs to support this mode of operation, and
in particular Solr's means of flipping bits in a big bitset to accumulate
the DocSet had to be careful so that multiple threads don't try to
overwrite the same underlying "long" in the long[].

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Aug 26, 2019 at 7:02 AM Aghasi Ghazaryan
 wrote:

> Hi,
>
> Lucene's IndexSearcher
> <
> http://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/IndexSearcher.html#IndexSearcher-org.apache.lucene.index.IndexReaderContext-java.util.concurrent.ExecutorService-
> >
> supports
> running searches for each segment separately, using the provided
> ExecutorService.
> I wonder why SolrIndexSearcher does not support the same as it may improve
> queries performance a lot?
>
> Thanks, looking forward to hearing from you.
>
> Regards
> Aghasi Ghazaryan
>


Re: Solution for long time highlighting

2019-08-30 Thread David Smiley
Ah, multi-threaded highlighting.  I implemented that once as a precursor to
ultimately other better things -- the UnifiedHighlighter.

Your ExecutorService ought to be a field on the handler.  In inform() you
can call SolrCore.addCloseHook to ensure this executor is shut down.
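A sketch of that wiring (the class name is mine and details are trimmed; not
a drop-in implementation):

  import java.io.IOException;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import org.apache.solr.core.CloseHook;
  import org.apache.solr.core.SolrCore;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.handler.component.SearchComponent;
  import org.apache.solr.util.plugin.SolrCoreAware;

  public class TimeoutHighlightComponent extends SearchComponent implements SolrCoreAware {
    private ExecutorService exec;             // one executor for the component's lifetime
    private SearchComponent subSearchComponent;

    @Override
    public void inform(SolrCore core) {
      subSearchComponent = core.getSearchComponent("highlight");
      exec = Executors.newSingleThreadExecutor();
      core.addCloseHook(new CloseHook() {
        @Override public void preClose(SolrCore c) { exec.shutdownNow(); }
        @Override public void postClose(SolrCore c) {}
      });
    }

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {}

    @Override
    public void process(ResponseBuilder rb) throws IOException {
      // submit subSearchComponent.process(rb) to exec with a timeout, as in
      // your code, but reusing the executor instead of creating one per request
    }

    @Override
    public String getDescription() {
      return "highlighting with a timeout";
    }
  }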

I suggest looking at this presentation from a few years ago I did with
Bloomberg at Lucene/Solr Revolution:
https://www.youtube.com/watch?v=tv5qKDKW8kk&t=14s
The UnifiedHighlighter is not enabled by default.  See the documentation:
https://builds.apache.org/job/Solr-reference-guide-master/javadoc/highlighting.html

Still... there is perhaps some value in multi-threading the highlighting
for huge docs, but I think we ultimately found no need after re-engineering
the highlighter.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Aug 28, 2019 at 10:36 AM SOLR4189  wrote:

> Hi all.
>
> In our team we thought about some tricky solution for queries with long
> time
> highlighting. For example, highlighting that takes more than 25 seconds.
> So,
> we created our component that wraps highlighting component of SOLR in this
> way:
>
> public void inform(SolrCore core) {
> . . . .
> subSearchComponent = core.getSearchComponent("highlight");
> . . . .
> }
>
> public void process(ResponseBuilder rb) throws Exception {
> long timeout = 25000;
> ExecutorService exec = null;
> try {
> exec = Executors.newSingleThreadExecutor();
> Future<Exception> future = exec.submit(() -> {
> try {
> subSearchComponent.process(rb);
> } catch (IOException e) {
> return e;
> }
> return null;
> });
> Exception ex = future.get(timeout, TimeUnit.MILLISECONDS);
> if (ex != null) {
> throw ex;
> }
> } catch ( TimeoutException toe) {
> . . . .
> } catch (Exception e) {
>throw new IOException(e);
> } finally {
> if (exec != null) {
> exec.shutdownNow();
> }
> }
> }
>
> This solution works, but sometimes we see that searchers stay open and as a
> result our RAM usage is pretty high (like a memory leak of
> SolrIndexSearcher
> objects). And only after a SOLR service restart they disappear.
>
> What do you think about this solution?
> Maybe exists some built-in function for it?
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


[CVE-2019-0193] Apache Solr, Remote Code Execution via DataImportHandler

2019-07-31 Thread David Smiley
The DataImportHandler, an optional but popular module to pull in data from
databases and other sources, has a feature in which the whole DIH
configuration can come from a request's "dataConfig" parameter. The debug
mode of the DIH admin screen uses this to allow convenient debugging /
development of a DIH config. Since a DIH config can contain scripts, this
parameter is a security risk. Starting with version 8.2.0 of Solr, use of
this parameter requires setting the Java System property
"enable.dih.dataConfigParam" to true.

Mitigations:
* Upgrade to 8.2.0 or later, which is secure by default.
* or, edit solrconfig.xml to configure all DataImportHandler usages with an
"invariants" section listing the "dataConfig" parameter set to an empty
string (see the sketch after this list).
* Ensure your network settings are configured so that only trusted traffic
communicates with Solr, especially to the DIH request handler.  This is a
best practice for all of Solr.
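A sketch of that solrconfig.xml change (the handler name and config file
name are examples):

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
    <lst name="invariants">
      <!-- pin dataConfig to an empty string so requests can't override it -->
      <str name="dataConfig"></str>
    </lst>
  </requestHandler>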

Credits:
* Michael Stepankin (JPMorgan Chase)

References:
* https://issues.apache.org/jira/browse/SOLR-13669
* https://cwiki.apache.org/confluence/display/solr/SolrSecurity

Please direct any replies as either comments in the JIRA issue above or to
solr-user@lucene.apache.org


Re: Solr Geospatial Polygon Indexing/Querying Issue

2019-07-30 Thread David Smiley
On Tue, Jul 30, 2019 at 4:41 PM Sanders, Marshall (CAI - Atlanta) <
marshall.sande...@coxautoinc.com> wrote:

> I’ll explain the context around the use case we’re trying to solve and
> then attempt to respond as best I can to each of your points.  What we have
> is a list of documents that in our case the location is sometimes a point
> and sometimes a circle.  These basically represent (in our case) inventory
> at a physical location (point) or inventory that can be delivered to you
> within X km (configurable per document) which represents the circle use
> case.  We want to be able to allow a user to say I want all documents
> within X distance of my location, but also all documents that are able to
> be delivered to your point where the delivery distance is defined on the
> inventory (creating the circle).
>

That background info helps me understand things!


> This is why we were actually trying to combine both point based data and
> poly/circle data into a single geospatial field, since I don’t believe you
> could do something like fq=geofilt(latlng, x, y, d) OR
> geofilt(latlngCircle, x, y, 1) but perhaps we’re just not getting quite the
> right syntax, etc.
>

Oh quite possible :-).   It would look something like this:   fq= {!geofilt
sfield=latLng d=queryDistance} OR {!geofilt sfield=latLngCircle
d=0}&pt=myLocation
Notice the space after the fq= which is critical so that the first
local-params (i.e. first geofilt) does not "own" the entire filter query
string end to end.  Due to the space, the whole thing is parsed by the
default lucene/standard query parser, and then we have the two clauses
clearly there.  The second geofilt has distance 0; it'd be nice if it
internally optimized to a point but nonetheless it's fine.  Alternatively
there's another syntax to embed WKT where you can specify a point
explicitly... something like this: ...  {!field f=latLngCircle
v="Intersects(POINT(x y))"}

That said, it's also just fine to do as you were planning -- have one RPT
based field for the shape representation (mixture of points and circles),
and one LLPSF field purely for the center point that is used for sorting.
That LLPSF field would be indexed=false docValues=true since you wouldn't
be filtering on it.
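A sketch of such a field (the names are examples):

  <fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>
  <field name="latLngCenter" type="location" indexed="false" stored="false" docValues="true"/>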

>
> * Generally RptWithGeometrySpatialField should be used over
> SpatialRecursivePrefixTreeFieldType unless you want heatmaps or are willing
> to make trade-offs in higher index size and lossy precision in order to get
> faster search.  It's up to you; if you benchmark both I'd love to hear how
> it went.
>
> -We may explore both but typically we’re more interested in speed
> than accuracy, benchmarking it may be a very interesting exercise however.
> For sorting for instance we’re actually using sqedist instead of geodist
> because we’re not overly concerned about sorting accuracy.
>

Okay... though geodist on a LLPSF field is remarkably optimized.


> * I see you are using Geo3D, which is not the default.  Geo3D is strict
> about the coordinate order -- counter-clickwise.  Your triangle is
> clockwise and thus it has an inverted interpretation -- thus it's a shape
> that covers nearly the whole globe.  I recently documented this
> https://issues.apache.org/jira/browse/SOLR-13467 but it's not published
> yet since it's so new.
>
> - Thanks for this clarification as well.  I had read this in the
> WKT docs too, again something we tried but really weren’t sure about what
> the right answer was and had been going back and forth on.  The
> documentation seems to specify that you need to specify either JTS or
> Geo3d, but doesn’t provide much info/guidance about which to use when and
> since JTS required adding another jar manually and therefore complicates
> our build process significantly (at least vs using Geo3D) we tried Geo3D.
> I’d love to hear more about the tradeoffs and other considerations between
> the two, but sounds like we should switch to JTS (the default, correct?)
>

The default spatialContextFactory is something internal; not JTS or Geo3D.
Based on your requirements, you needn't specify either JTS or Geo3D, mostly
because you don't actually need polygons.  I wouldn't bother specifying it
unless you want to experiment with some benchmarking.  JTS would give you
nothing here but Geo3D + prefixTree=S2 (in Solr 8.2) might be faster.


> * You can absolutely index a circle in Solr -- this is something cool and
> somewhat unique. And you don't need format=legacy.  The documentation needs
> to call it out better, though it at least refers to circles as a "buffered
> point" which is the currently supported way of representing it, and it does
> have one example.  Search for "BUFFER" and you'll see a WKT-like syntax to
> do it.  BUFFER is not standard WKT; it was added on to do this.  The first
> arg is a X Y center, and 2nd arg is a distance in decimal degrees (not
> km).  BTW Geo3D is a good choice here but not essential either.
>
> -   This sounds very promising and we’ll 

Re: Solr Geospatial Polygon Indexing/Querying Issue

2019-07-25 Thread David Smiley
Hello Marshall,

I worked on a lot of this functionality.  I have lots to say:

* Personally, I find it highly confusing to have a field named "latlng" and
have it be anything other than a simple point -- it's all you have if given
a single latitude longitude pair.  If you intend for the data to be a
circle (either exactly or approximated) then perhaps call it latLngCircle
* geodist() and for that matter any other attempt to get the distance to a
non-point shape is not going to work -- either error or confusing results;
I forget.  This is hard to do and the logic isn't there for it, and
probably wouldn't perform to user's expectations if it did.  This ought to
be documented but seems not to be.
* Generally RptWithGeometrySpatialField should be used
over SpatialRecursivePrefixTreeFieldType unless you want heatmaps or are
willing to make trade-offs in higher index size and lossy precision in
order to get faster search.  It's up to you; if you benchmark both I'd love
to hear how it went.
* In WKT format, the ordinate order is "X Y" (thus longitude then
latitude).  Looking at your triangle, it is extremely close to Antarctica,
and I'm skeptical you intended that. This is not directly documented AFAICT
but it's such a common mistake that it ought to be called out in the docs.
* I see you are using Geo3D, which is not the default.  Geo3D is strict
about the coordinate order -- counter-clockwise.  Your triangle is
clockwise and thus it has an inverted interpretation -- thus it's a shape
that covers nearly the whole globe.  I recently documented this
https://issues.apache.org/jira/browse/SOLR-13467 but it's not published yet
since it's so new.
* You can absolutely index a circle in Solr -- this is something cool and
somewhat unique. And you don't need format=legacy.  The documentation needs
to call this out better, though it at least refers to circles as a
"buffered point" which is the currently supported way of representing it,
and it does have one example.  Search for "BUFFER" and you'll see a
WKT-like syntax to do it.  BUFFER is not standard WKT; it was added on to
do this.  The first arg is a X Y center, and 2nd arg is a distance in
decimal degrees (not km).  BTW Geo3D is a good choice here but not
essential either.
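For example, to index a circle centered at a point, with the radius in
decimal degrees (coordinates are from your document; the radius here is just
illustrative):

  "latlng": ["BUFFER(POINT(-84.4412613 33.7942704), 0.1)"]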

Back to your core requirement -- you want to index circles and sort results
by distance.  Can you please elaborate better on this... distance to the
outer ring of the circle or the center point?  Center point is easy to do
simply by putting the center point additionally in a field using
LatLonPointSpatialField and use geodist referring to that.  Also,

FYI geodist() is a function that can take arguments directly which makes
more sense when multiple spatial fields are in play.  Sadly this aspect is
not documented.  Suffice it to say, if you do geodist(latLng) (maybe
quoted?) then it'll use that field, and parse "pt" param from the request.
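Putting that together (the field name "latLngCenter" is an example):

  &pt=33.9798087,-94.3286133&sort=geodist(latLngCenter) asc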

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Jul 23, 2019 at 2:32 PM Sanders, Marshall (CAI - Atlanta) <
marshall.sande...@coxautoinc.com> wrote:

> We’re trying to index a polygon into solr and then filter/calculate
> geodist on the polygon (ideally we actually want a circle, but it looks
> like that’s not really supported officially by wkt/geojson and instead you
> have to switch format=”legacy” which seems like something that might be
> removed in the future so don’t want to rely on it).
>
> Here’s the info from schema:
> <field name="latlng" type="location_rpt" indexed="true" stored="true"
> multiValued="true"/>
>
> <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
>geo="true" distErrPct="0.025" maxDistErr="0.09"
> distanceUnits="kilometers"
> spatialContextFactory="Geo3D"/>
>
>
> We’ve tried indexing some different data, but to keep it as simple as
> possible we started with a triangle (will eventually add more points to
> approximate a circle).  Here’s an example document that we’ve added just
> for testing:
>
> {
> "latlng": ["POLYGON((33.7942704 -84.4412613, 33.7100611 -84.4028091,
> 33.7802888 -84.3279648, 33.7942704 -84.4412613))"],
> "ID": "284598223"
> }
>
>
> However, it seems like filtering/distance calculations aren’t working (at
> least not the way we are used to doing it for points).  Here’s an example
> query where the pt is several hundred kilometers away from the polygon, yet
> the document still returns.  Also, it seems that regardless of origin point
> or polygon location the calculated geodist is always 20015.115
>
> Example query:
>
> select?d=1&fl=ID,latlng,geodist()&fq=%7B!geofilt%7D&indent=on&pt=33.9798087,-94.3286133&q=*:*&sfield=latlng&wt=json
>
> Example documents coming back anyway:
> "docs": [
> {
> "latlng": ["PO

Re: highlighting not working as expected

2019-06-10 Thread David Smiley
Please try hl.method=unified and tell us if that helps.
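
For example, keeping the rest of your request the same and adding just the
one parameter:

  http://localhost/solr/mytest/select?q=rotte&hl=on&hl.fl=Sagstitel&hl.method=unified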

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Jun 3, 2019 at 4:06 AM Martin Frank Hansen (MHQ)  wrote:

> Hi,
>
> I am having some difficulties making highlighting work. For some reason
> the highlighting feature only works on some fields but not on other fields
> even though these fields are stored.
>
> An example of a request looks like this:
> http://localhost/solr/mytest/select?fl=id,doc.Type,Journalnummer,Sagstitel&hl.fl=Sagstitel&hl.simple.post=%3C/b%3E&hl.simple.pre=%3Cb%3E&hl=on&q=rotte
>
> It simply returns an empty set, for all documents even though I can see
> several documents which have “Sagstitel” containing the word “rotte”
> (rotte=rat).  What am I missing here?
>
> I am using the standard highlighter as below.
>
>
> <searchComponent class="solr.HighlightComponent" name="highlight">
>   <highlighting>
>     <!-- Configure the standard fragmenter -->
>     <fragmenter name="gap"
>                 default="true"
>                 class="solr.highlight.GapFragmenter">
>       <lst name="defaults">
>         <int name="hl.fragsize">100</int>
>       </lst>
>     </fragmenter>
>
>     <!-- A regular-expression-based fragmenter -->
>     <fragmenter name="regex"
>                 class="solr.highlight.RegexFragmenter">
>       <lst name="defaults">
>         <!-- slightly smaller fragsizes work better because of slop -->
>         <int name="hl.fragsize">70</int>
>         <!-- allow 50% slop on fragment sizes -->
>         <float name="hl.regex.slop">0.5</float>
>         <!-- a basic sentence pattern -->
>         <str name="hl.regex.pattern">[-\w ,/\n\&quot;&apos;]{20,200}</str>
>       </lst>
>     </fragmenter>
>
>     <!-- Configure the standard formatter -->
>     <formatter name="html"
>                default="true"
>                class="solr.highlight.HtmlFormatter">
>       <lst name="defaults">
>         <str name="hl.simple.pre"><![CDATA[<b>]]></str>
>         <str name="hl.simple.post"><![CDATA[</b>]]></str>
>       </lst>
>     </formatter>
>
>     <!-- Configure the standard encoder -->
>     <encoder name="html"
>              class="solr.highlight.HtmlEncoder" />
>
>     <!-- Configure the standard fragListBuilder -->
>     <fragListBuilder name="simple"
>                      class="solr.highlight.SimpleFragListBuilder"/>
>
>     <!-- Configure the single fragListBuilder -->
>     <fragListBuilder name="single"
>                      class="solr.highlight.SingleFragListBuilder"/>
>
>     <!-- Configure the weighted fragListBuilder -->
>     <fragListBuilder name="weighted"
>                      default="true"
>                      class="solr.highlight.WeightedFragListBuilder"/>
>
>     <!-- default tag FragmentsBuilder -->
>     <fragmentsBuilder name="default"
>                       default="true"
>                       class="solr.highlight.ScoreOrderFragmentsBuilder">
>     </fragmentsBuilder>
>
>     <!-- multi-colored tag FragmentsBuilder -->
>     <fragmentsBuilder name="colored"
>                       class="solr.highlight.ScoreOrderFragmentsBuilder">
>       <lst name="defaults">
>       </lst>
>     </fragmentsBuilder>
>
>     <boundaryScanner name="default"
>                      default="true"
>                      class="solr.highlight.SimpleBoundaryScanner">
>       <lst name="defaults">
>         <str name="hl.bs.maxScan">10</str>
>         <str name="hl.bs.chars">.,!? &#9;&#10;&#13;</str>
>       </lst>
>     </boundaryScanner>
>
>     <boundaryScanner name="breakIterator"
>                      class="solr.highlight.BreakIteratorBoundaryScanner">
>       <lst name="defaults">
>         <!-- type should be one of CHARACTER, WORD (default), LINE or SENTENCE -->
>         <str name="hl.bs.type">WORD</str>
>         <!-- language is used when constructing the Locale for the BreakIterator -->
>         <str name="hl.bs.language">da</str>
>       </lst>
>     </boundaryScanner>
>   </highlighting>
> </searchComponent>
>
> Hope that someone can help; thanks in advance.
>
> Best regards
> Martin
>
>


Re: Range query syntax on a polygon field is returning all documents

2019-05-12 Thread David Smiley
I answered in StackOverflow but will paste it here:

Geo3D requires that polygons adhere to the "right hand rule", and thus the
exterior ring must be in counter-clockwise order and holes must be
clockwise.  If you make this mistake then the meaning of the shape is
inverted, and thus that little rectangle in Alberta Canada represents the
inverse of that place.  Consequently most shapes will cover nearly the
entire globe!  There is certainly a documentation issue needed in Solr to
this effect.  Even I didn't know until I debugged this today!  It appears
some of the GIS industry is migrating to this rule as well:
http://mapster.me/right-hand-rule-geojson-fixer/

Separately: I would be very curious to see how Geo3D compares to JTS after
you get it working.  Additionally, you likely ought to use
solr.RptWithGeometrySpatialField instead of
solr.SpatialRecursivePrefixTreeFieldType to get the full accuracy of the
vector geometry instead of settling on a grid representation of shapes,
otherwise your queries might get false-positives for just being close to an
indexed shape.  Another thing to try is using prefixTree="s2" which is a
not-yet-documented prefixTree that supposedly is much more efficient for
Geo3D specifically.
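
For reference, a field type sketch combining those suggestions (the name is
arbitrary and this particular combination is untested here):

  <fieldType name="geom_rpt" class="solr.RptWithGeometrySpatialField"
             spatialContextFactory="Geo3D" prefixTree="s2" geo="true"/>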

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Mar 20, 2019 at 2:00 PM David Smiley 
wrote:

> Hi Mitchell,
>
> Seems like there's a bug based on what you've shown.
> * Can you please try RptWithGeometrySpatialField instead
> of SpatialRecursivePrefixTreeFieldType to see if the problem goes away?
> This could point to a precision issue; though still what you've seen is
> suspicious.
> * Can you try one other query syntax e.g. bbox query parser to see if the
> problem goes away?  I doubt this is it but you seem to point to the syntax
> being related.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Mar 18, 2019 at 12:24 AM Mitchell Bösecke <
> mitchell.bose...@forcorp.com> wrote:
>
>> Hi everyone,
>>
>> I'm trying to index geodetic polygons and then query them out using an
>> arbitrary rectangle. When using the Geo3D spatial context factory, the
>> data
>> indexes just fine but using a range query (as per the solr documentation)
>> does not seem to filter the results appropriately (I get all documents
>> back).
>>
>> When I switch it to JTS, everything works as expected. However, it
>> significantly slowed down the initial indexing time. A sample size of 3000
>> documents took 3 seconds with Geo3D and 50 seconds with JTS.
>>
>> I've documented my journey in detail on stack overflow:
>> https://stackoverflow.com/q/55212622/1017571
>>
>>1. Can I not use the range query syntax with Geo3D? I.e. am I
>>misreading the documentation?
>>2. Is it expected that using JTS will *significantly* slow down the
>>indexing time?
>>
>> Thanks for any insight.
>>
>> --
>> Mitchell Bosecke, B.Sc.
>> Senior Application Developer
>>
>> FORCORP
>> Suite 200, 15015 - 123 Ave NW,
>> Edmonton, AB, T5V 1J7
>> www.forcorp.com
>> (d) 780.733.0494
>> (o) 780.452.5878 ext. 263
>> (f) 780.453.3986
>>
>


Re: Date format issue in solr select query.

2019-05-09 Thread David Smiley
(the correct list here is solr-user, not dev)

Solr has minimal support for _formatting_ the response; that's generally up
to the application that builds the UI.  If you want Solr to retain the
original input precision which appears to be lost here, then use a typical
copyField approach to a string stored field.  This is necessary because
primitive field types (date, float, int, etc.) normalize the input when the
value is internally stored.  Perhaps it shouldn't do that -- as you show
here the surface form (original) may indicate the precision.
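
A minimal schema sketch of that copyField approach (the extra field name is
an assumption):

  <field name="initial_release_date_str" type="string" indexed="false" stored="true"/>
  <copyField source="initial_release_date" dest="initial_release_date_str"/>

The copy happens on the raw input, before the date normalization, so the
string field returns "2019-02-28" exactly as it was submitted.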

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 8, 2019 at 10:42 PM Karthik Gunasekaran <
karthik.gunaseka...@stats.govt.nz> wrote:

> Hi,
>
> I am new to solr. I am using solr7.6 version.
>
>
>
> The problem which I am facing is to format the date for a specific field.
>
>
>
> Explanation of my issue:
>
>
>
> I have a collection named “DateFieldTest”
>
> It has few fields out of “initial_release_date” is a field of type pdate.
>
> We are loading the data into the collection as below
>
>
>
>
>
> [
>
>   {
>
> "id": 0,
>
> "Number": 0,
>
> "String": "This is a string 0",
>
> "initial_release_date": "2019-02-28"
>
>   },
>
>   {
>
> "ID": 1,
>
> "Number": 1,
>
> "String": "This is a string 1",
>
> " initial_release_date ": "2019-02-28"
>
>   }]
>
>
>
> When we do a select query as
> http://localhost:8983/solr/DateFieldTest/select?q=*:*
>
> We are getting the output as,
>
> {
>
>   "responseHeader":{
>
> "zkConnected":true,
>
> "status":0,
>
> "QTime":0,
>
> "params":{
>
>   "q":"*:*"}},
>
>   "response":{"numFound":1000,"start":0,"docs":[
>
>   {
>
> "id":"0",
>
> "Number":[0],
>
> "String":["This is a Māori macron 0"],
>
> "initial_release_date":["2019-02-28T00:00:00Z"],
>
> "_version_":1633015101576445952},
>
>   {
>
> "ID":[1],
>
> "Number":[1],
>
> "String":["This is a Māori macron 1"],
>
> "initial_release_date":["2019-02-28T00:00:00Z"],
>
> "_version_":1633015101949739008},
>
>
>
> But our use case requires the initial_release_date field in the output of
> the above query to be formatted as yyyy-MM-dd.
>
> The query automatically adds a time component to the date field, which we
> don’t want to happen.
>
> Can someone please help me to resolve this issue to get only date value
> without time in my select query.
>
>
>
> Thanks,
>
> Karthik Gunasekaran
>
> Senior Applications Developer | kaiwhakawhanake Pūmanawa Tautono
>
> Digital Business  - Channels | Ngā Ratonga Mamati - Ngā Hongere
>
> Digital Business Services | Ngā Ratonga Pakihi Mamati
>
> Stats NZ Tatauranga Aotearoa
> * DDI* +64 4 931 4347 | stats.govt.nz <http://www.stats.govt.nz/>


Re: Reverse-engineering existing installation

2019-05-02 Thread David Smiley
Consider trying to diff configs from a default at the version it was copied
from, if possible. Even better, the configs should be in source control and
then you can browse history with commentary and sometimes links to issue
trackers and code reviews.

Also a big part that you can’t see by staring at configs is what the
queries look like. You should examine the system interacting with Solr to
observe embedded comments/docs for insights.

On Thu, May 2, 2019 at 11:21 PM Doug Reeder  wrote:

> The documentation for SOLR is good.  However it is oriented toward setting
> up a new installation, with the data model known.
>
> I have inherited an existing installation.  Aspects of the data model I
> know, but there's a lot of ways things could have been configured in SOLR,
> and for some cases, I don't know what SOLR was supposed to do.
>
> Can you reccomend any documentation on working out the configuration of an
> existing installation?
>
-- 
Sent from Gmail Mobile


Re: Unable to tag queries (q) in SOLR >= 7.2

2019-04-30 Thread David Smiley
Hi Frederik,

In your example, I think you may have typed it manually since there are
mistakes like df=edismax which I think you meant defType=edismax.  Any way,
assuming you need local-param syntax in 'q' (for tagging or whatever other
reason), then this means you must specify the query parser there and *not*
defType (don't set defType or set it to "lucene" which is the default).
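
That is, something along these lines (a sketch using your example terms):

  select?q={!edismax tag=mytag}house

With the default "lucene" defType, the local-params both select the edismax
parser and carry the tag for later exclusion in facets.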

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Apr 30, 2019 at 8:17 AM Fredrik Rodland  wrote:

> Hi.
>
> It seems SOLR-11501 may have changed more than just the ability to control
> the query parser set through {!queryparser}.  We tag our queries to provide
> facets both with and without the query in the same request, just as tagging
> in fq described here:
> https://lucene.apache.org/solr/guide/6_6/faceting.html#Faceting-TaggingandExcludingFilters
>
> After upgrading to 7.2 this does not work anymore (for the q-parameter)
> using edismax.  We’ve tried to add the uf-parameter:
>
> select?q={!tag%3Dmytag}house=query=0=query=edismax
>
> But this only results in q being allowed through, but not parsed - i.e.:
> "+(+DisjunctionMaxQuery(((synrank80:tagmytagingeniør)^8.0 |
> (stemrank40:tagmytagingeniør)^4.0…
>
> Does anybody have any experience or tips for enabling tagging of queries
> for SOLR >= 7.2?
>
> Regards
>
> Fredrik


Re: Spatial Search using two separate fields for lat and long

2019-04-13 Thread David Smiley
Hi,

I think your requirement of exporting back to CSV is fine but it's quite
normal for there to be some transformation steps on input and/or output...
and that such steps you mostly do yourself (not Solr).  That said, one
straight-forward solution is to have your spatial field be redundant with
the lat & lon separately.  Your spatial field could be stored=false, and
the separate fields would be stored but otherwise not be indexed or have
other characteristics that add weight.  The result is efficient; no
redundancies.
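
A schema sketch of that layout (field and type names are assumptions):

  <field name="latlon" type="location" indexed="true" stored="false"/>
  <field name="lat" type="pdouble" indexed="false" stored="true"/>
  <field name="lon" type="pdouble" indexed="false" stored="true"/>

At index time you'd concatenate the two columns into "lat,lon" for the
spatial field, and at export time you'd read the two stored fields straight
back into their Excel columns.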

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Apr 3, 2019 at 1:54 AM Tim Hedlund  wrote:

> Hi all,
>
> I'm importing documents (rows in excel file) that includes latitude and
> longitude fields. I want to use those two separate fields for searching
> with a bounding box. Is this possible (not using deprecated LatLonType) or
> do I need to combine them into one single field when indexing? The reason I
> want to keep the fields as two separate ones is that I want to be able to
> export from solr back to exact same excel file structure, i.e. solr fields
> maps exactly to excel columns.
>
> I'm using solr 7. Any thoughts or suggestions would be appreciated.
>
> Regards
> Tim
>
>


Re: Slower indexing speed in Solr 8.0.0

2019-04-03 Thread David Smiley
Hi Edwin,

I'd like to rule something out.  Does your schema define a field "_root_"?
If you don't have nested documents then remove it.  Its presence adds
indexing weight in 8.0 that was not there previously.  I'm not sure how
much though; I've hoped small but who knows.
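
Concretely: if your schema still has a line like the following (this is the
stock definition) and you have no parent/child documents, try removing it:

  <field name="_root_" type="string" indexed="true" stored="false" docValues="false"/>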

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Apr 2, 2019 at 10:17 PM Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I am setting up the latest Solr 8.0.0, and I am re-indexing the data from
> scratch in Solr 8.0.0
>
> However, I found that the indexing speed is slower in Solr 8.0.0, as
> compared to the earlier version like Solr 7.7.1. I have not changed the
> schema.xml and solrconfig.xml yet, just did a change of the
> luceneMatchVersion in solrconfig.xml to 8.0.0
> <luceneMatchVersion>8.0.0</luceneMatchVersion>
>
> On average, the speed is about 40% to 50% slower. For example, the indexing
> speed was about 17 mins in Solr 7.7.1, but now it takes about 25 mins to
> index the same set of data.
>
> What could be the reason that causes the indexing to be slower in Solr
> 8.0.0?
>
> Regards,
> Edwin
>


Re: Slower indexing speed in Solr 8.0.0

2019-04-03 Thread David Smiley
What/where is this benchmark?  I recall once Ishan was working with a
volunteer to set up something like Lucene has, but sadly it was not
successful.

On Wed, Apr 3, 2019 at 6:04 AM Đạt Cao Mạnh  wrote:

> Hi guys,
>
> I'm seeing the same problems with Shalin's nightly indexing benchmark. This
> happens around this period:
> git log --before=2018-12-07 --after=2018-11-21
>
> On Wed, Apr 3, 2019 at 8:45 AM Toke Eskildsen  wrote:
>
>> On Wed, 2019-04-03 at 15:24 +0800, Zheng Lin Edwin Yeo wrote:
>> > Yes, I am using DocValues for most of my fields.
>>
>> So that's a culprit. Thank you.
>>
>> > Currently we can't share the test data yet as some of the records are
>> > sensitive. Do you have any data from CSV file that you can test?
>>
>> Not really. I asked because it was a relatively easy way to do testing
>> (replicate your indexing flow with both Solr 7 & 8 as end-points,
>> attach JVisualVM to the Solrs and compare the profiles).
>>
>>
>> I'll put on my to-do to create a test or two with the scenario
>> "indexing from CSV with many DocValues fields". I'll try and generate
>> some test data and see if I can reproduce with them. If this is to be a
>> JIRA, that's needed anyway. Can't promise when I'll get to it, sorry.
>>
>> If this does turn out to be the cause of your performance regression,
>> the fix (if possible) will be for a later Solr version. Currently it is
>> not possible to tweak the docValues indexing parameters outside of code
>> changes.
>>
>>
>> Do note that we're still operating on guesses here. The cause for your
>> regression might easily be elsewhere.
>>
>> - Toke Eskildsen, Royal Danish Library
>>
>>
>>
>
> --
> *Best regards,*
> *Cao Mạnh Đạt*
>
> D.O.B: 31-07-1991 | Cell: (+84) 946.328.329 | E-mail: caomanhdat...@gmail.com
>
-- 
Sent from Gmail Mobile


Re: Range query syntax on a polygon field is returning all documents

2019-03-20 Thread David Smiley
Hi Mitchell,

Seems like there's a bug based on what you've shown.
* Can you please try RptWithGeometrySpatialField instead
of SpatialRecursivePrefixTreeFieldType to see if the problem goes away?
This could point to a precision issue; though still what you've seen is
suspicious.
* Can you try one other query syntax e.g. bbox query parser to see if the
problem goes away?  I doubt this is it but you seem to point to the syntax
being related.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Mar 18, 2019 at 12:24 AM Mitchell Bösecke <
mitchell.bose...@forcorp.com> wrote:

> Hi everyone,
>
> I'm trying to index geodetic polygons and then query them out using an
> arbitrary rectangle. When using the Geo3D spatial context factory, the data
> indexes just fine but using a range query (as per the solr documentation)
> does not seem to filter the results appropriately (I get all documents
> back).
>
> When I switch it to JTS, everything works as expected. However, it
> significantly slowed down the initial indexing time. A sample size of 3000
> documents took 3 seconds with Geo3D and 50 seconds with JTS.
>
> I've documented my journey in detail on stack overflow:
> https://stackoverflow.com/q/55212622/1017571
>
>1. Can I not use the range query syntax with Geo3D? I.e. am I
>misreading the documentation?
>2. Is it expected that using JTS will *significantly* slow down the
>indexing time?
>
> Thanks for any insight.
>
> --
> Mitchell Bosecke, B.Sc.
> Senior Application Developer
>
> FORCORP
> Suite 200, 15015 - 123 Ave NW,
> Edmonton, AB, T5V 1J7
> www.forcorp.com
> (d) 780.733.0494
> (o) 780.452.5878 ext. 263
> (f) 780.453.3986
>


Re: Nested geofilt query for LTR feature

2019-03-20 Thread David Smiley
Hi,

I've never used the LTR module, but I suspect I might know what the error
is.  I think that the "query" Function Query has parsing limitations on
what you pass to it.  At least it used to.  Try to put the embedded query
onto another parameter and then refer to it with a dollar-sign.  See the
examples here:
https://builds.apache.org/job/Solr-reference-guide-master/javadoc/function-queries.html#query-function
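
Applied to your feature, that might look roughly like this (an untested
sketch; "distq" is an arbitrary parameter name):

  {
    "name":"twoDist",
    "class":"org.apache.solr.ltr.feature.SolrFeature",
    "params":{
      "q":"{!func}product(2,query($distq))",
      "distq":"{!geofilt sfield=latlon score=kilometers filter=false pt=${ltrpt} d=5000}"
    },
    "store":"ltrFeatureStore"
  }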

Also, I think it's a bit inefficient to wrap a query function query around
a geofilt query that exposes a distance as a score.  If you want the
distance then call the "geodist" function query.

Additionally if you dump the full stack trace here, it might be helpful.
Getting a RuntimeException suggests we need to do a better job of
wrapping/cleaning errors internally.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Mar 14, 2019 at 11:43 PM Kamuela Lau  wrote:

> Hello,
>
> I'm currently using Solr 7.2.2 and trying to use the LTR contrib module to
> rerank queries.
> For my LTR model, I would like to use a feature that is essentially a
> "normalized distance," a value between 0 and 1 which is based on distance.
>
> When using geodist() to define a feature in the feature store, I received a
> "failed to parse feature query" error, and thus I am using the below
> geofilt query for distance.
>
> {
>   "name":"dist",
>   "class":"org.apache.solr.ltr.feature.SolrFeature",
>   "params":{"q":"{!geofilt sfield=latlon score=kilometers filter=false
> pt=${ltrpt} d=5000}"},
>   "store":"ltrFeatureStore"
> }
>
> This feature correctly returns the distance between ltrpt and the sfield
> latlon (LatLonPointSpatialField).
> As I mentioned previously, I would like a feature which uses this distance
> in another function. To test this functionality, I tried to define a
> feature which multiplies the distance by two:
>
> {
>   "name":"twoDist",
>   "class":"org.apache.solr.ltr.feature.SolrFeature",
>   "params":{"q":"{!func}product(2,query({!geofilt v= sfield=latlon
> score=kilometers filter=false pt=${ltrpt} d=5000},0.0))"},
>   "store":"ltrFeatureStore"
> }
>
> When trying to extract this feature, I receive the following error:
>
> java.lang.RuntimeException: Exception from createWeight for SolrFeature
> [name=multDist, params={q={!func}product(2,query({!geofilt v= sfield=latlon
> score=kilometers filter=false pt=${ltrpt} d=5000},0.0))}]  missing sfield
> for spatial request
>
> However, when I define the following in fl for a regular, non-reranked
> query, I find that it is correctly parsed and I receive the correct value,
> which is twice the value of geodist() (pt2 is defined in a different part
> of the query):
> fl=score,geodist(),{!func}product(2,query({!geofilt v= sfield=latlon
> score=kilometers filter=false pt=${pt2} d=5},0.0))
>
> For reference, below is what I have defined in my schema:
>
>
> <fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>
> <field name="latlon" type="location" indexed="true" stored="true" docValues="true"/>
>
> Is this the correct, intended behavior? If so, is my query for this
> correct, or should I go about extracting this sort of feature a different
> way?
>


Re: regarding debugging solr in eclipse

2019-01-18 Thread David Smiley
On Fri, Jan 18, 2019 at 9:20 AM Scott Stults <
sstu...@opensourceconnections.com> wrote:

> This blog article might help:
>
> https://opensourceconnections.com/blog/2013/04/13/how-to-debug-solr-with-eclipse/
>
>
I don't use Eclipse but I believe things are better now than the
instructions given.  The setups for both Eclipse and IntelliJ include a "run
configuration" (or whatever it's called in Eclipse), and thus you needn't
run things manually at the CLI, nor do you need to set up a new run config
with the ports set.

~ David


>
>
> On Fri, Jan 18, 2019 at 6:53 AM SAGAR INGALE 
> wrote:
>
> > Can anybody tell me how to debug solr in eclipse, if possible how can I
> > build a maven project and launch the jetty server in debug mode?
> > Thanks. Regards
> >
>
>
> --
> Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
> | 434.409.2780 <(434)%20409-2780>
> http://www.opensourceconnections.com
>
-- 
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Solr 7.2.1 Stream API throws null pointer exception when used with collapse filter query

2019-01-03 Thread David Smiley
File a JIRA issue please

On Thu, Jan 3, 2019 at 5:20 PM gopikannan  wrote:

> Hi,
>I am getting a null pointer exception when a streaming search is done with
> a collapse filter query. When debugging, the last element in the FixedBitSet
> array is null. Please let me know if I can raise an issue.
>
>
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/export/ExportWriter.java#L232
>
>
> http://localhost:8983/stream/?expr=search(coll_a ,sort="field_a
>
> asc",fl="field_a,field_b,field_c,field_d",qt="/export",q="*:*",fq="(filed_b:x)",fq="{!collapse
> field=field_c sort='field_d desc'}")
>
> org.apache.solr.servlet.HttpSolrCall null:java.lang.NullPointerException
> at org.apache.lucene.util.BitSetIterator.<init>(BitSetIterator.java:61)
> at org.apache.solr.handler.ExportWriter.writeDocs(ExportWriter.java:243)
> at
> org.apache.solr.handler.ExportWriter.lambda$null$1(ExportWriter.java:222)
> at
>
> org.apache.solr.response.JSONWriter.writeIterator(JSONResponseWriter.java:523)
> at
>
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:180)
> at org.apache.solr.response.JSONWriter$2.put(JSONResponseWriter.java:559)
> at
> org.apache.solr.handler.ExportWriter.lambda$null$2(ExportWriter.java:222)
> at
> org.apache.solr.response.JSONWriter.writeMap(JSONResponseWriter.java:547)
> at
>
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:198)
> at org.apache.solr.response.JSONWriter$2.put(JSONResponseWriter.java:559)
> at
> org.apache.solr.handler.ExportWriter.lambda$write$3(ExportWriter.java:220)
> at
> org.apache.solr.response.JSONWriter.writeMap(JSONResponseWriter.java:547)
> at org.apache.solr.handler.ExportWriter.write(ExportWriter.java:218)
> at org.apache.solr.core.SolrCore$3.write(SolrCore.java:2627)
> at
>
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:49)
>
-- 
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Geofilt and distance measurement problems using SpatialRecursivePrefixTreeFieldType field type

2018-12-23 Thread David Smiley
For latitude and longitude data, I recommend "lat,lon" and never use "x
y".  Perhaps the latter should be an error when geo=true (and the inverse
when geo=false) but it isn't.  Yes, the documentation could be better!

On Fri, Dec 21, 2018 at 4:31 AM Peter Lancaster <
peter.lancas...@findmypast.com> wrote:

> Hi David,
>
> Ignore my previous reply.
>
> I think you've supplied the answer. Yes we do need to use a space to index
> points in an rpt field, but when we do that the order is flipped from
> Lat,Lon to Lon Lat, so we need to re-index our data. In my defence that is
> far from obvious in the documentation.
>
> Thanks again for your help.
>
> Cheers,
> Peter.
>
> -Original Message-
> From: David Smiley [mailto:david.w.smi...@gmail.com]
> Sent: 21 December 2018 04:44
> To: solr-user@lucene.apache.org
> Subject: Re: Geofilt and distance measurement problems using
> SpatialRecursivePrefixTreeFieldType field type
>
> Hi Peter,
>
> Use of an RPT field for distance sorting/boosting is to be avoided where
> possible because it's very inefficient at this specific use-case.  Simply
> use LatLonType for this task, and continue to use RPT for the filter/search
> use-case.
>
> Also I see you putting a space between the coordinates instead of a
> comma...   yet you have geo (latitude & longitude data) so this is a bit
> confusing.  Do "lat,lon".  I think a space will be interpreted as "x y"
> (thus reversed).  Perhaps you've mixed up the coordinates and this
> explains the error?  A quick lookup of your sample coordinates suggests to
> me this is likely the problem.  It's a common mistake.
>
> BTW this:
> maxDistErr="0.2" distanceUnits="kilometers"
> means 200m accuracy (or better).  Is this what you want?  Just checking.
>
> ~ David
>
> On Thu, Dec 13, 2018 at 6:38 AM Peter Lancaster <
> peter.lancas...@findmypast.com> wrote:
>
> > I am currently using Solr 5.5.2 and implementing a GeoSpatial search
> > that returns results within a radius in Km of a specified LatLon.
> > Using a field of type solr.LatLonType and a geofilt query this gives
> > good results but is much slower than our regular queries. Using a bbox
> > query is faster but of course less accurate.
> >
> > I then attempted to use a field of type
> > solr.SpatialRecursivePrefixTreeFieldType to check performance and
> > because I want to be able to do searches within a polygon eventually.
> > The field is defined as follows
> >
> > <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
> >
> spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
> > geo="true" distErrPct="0.05" maxDistErr="0.2"
> > distanceUnits="kilometers" autoIndex="true"/>
> >
> > <field name="LatLonRPT__location_rpt" type="location_rpt" indexed="true"
> > stored="true" multiValued="false" omitNorms="true" />
> >
> > I'm just using it to index single points right now. The problem is
> > that the distance calculation is not working correctly. It seems to
> > overstate the distances for differences in longitude.
> >
> > For example a query for
> > =Id,LatLonRPT__location_rpt,_dist_:geodist()=LatLonRPT__loca
> > tion_rpt=53.409490 -2.979677={!geofilt
> > sfield=LatLonRPT__location_rpt pt="53.409490 -2.979677" d=25} returns
> >
> > {
> > "Id": "HAR/CH1/80763270",
> > "LatLonRPT__location_rpt": "53.2 -2.91",
> > "_dist_": 24.295607
> > },
> > {
> > "Id": "HAR/CH42/1918283949",
> > "LatLonRPT__location_rpt": "53.393239 -3.028859",
> > "_dist_": 5.7587695
> > }
> >
> > The true distances for these results are 23.67 and 3.73 km and other
> > results at a true distance of 17 km aren't returned within the 25 km
> radius.
> >
> > The explain has the following
> >
> > +IntersectsPrefixTreeQuery(IntersectsPrefixTreeQuery(fieldName=LatLonR
> > +PT__location_rpt,queryShape=Circle(Pt(x=53.40949,y=-2.979677),
> > d=0.2° 25.00km),detailLevel=6,prefixGridScanLevel=7))
> >
> > Is my set up incorrect in some way or is the
> > SpatialRecursivePrefixTreeFieldType not suitable for doing radius
> > searches on points in this way?
> >
> > Thanks in anticipation for any suggestions.
> >
> > Peter Lancaster.
> >

Re: Geofilt and distance measurement problems using SpatialRecursivePrefixTreeFieldType field type

2018-12-20 Thread David Smiley
Hi Peter,

Use of an RPT field for distance sorting/boosting is to be avoided where
possible because it's very inefficient at this specific use-case.  Simply
use LatLonType for this task, and continue to use RPT for the filter/search
use-case.

Also I see you putting a space between the coordinates instead of a
comma...   yet you have geo (latitude & longitude data) so this is a bit
confusing.  Do "lat,lon".  I think a space will be interpreted as "x y"
(thus reversed).  Perhaps you've mixed up the coordinates and this explains
the error?  A quick lookup of your sample coordinates suggests to me this
is likely the problem.  It's a common mistake.
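
Concretely, with your sample point that would be:

  pt=53.409490,-2.979677

rather than "53.409490 -2.979677", which gets read as "x y" (longitude then
latitude) -- and the same applies to the indexed values.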

BTW this:
maxDistErr="0.2" distanceUnits="kilometers"
means 200m accuracy (or better).  Is this what you want?  Just checking.

~ David

On Thu, Dec 13, 2018 at 6:38 AM Peter Lancaster <
peter.lancas...@findmypast.com> wrote:

> I am currently using Solr 5.5.2 and implementing a GeoSpatial search that
> returns results within a radius in Km of a specified LatLon. Using a field
> of type solr.LatLonType and a geofilt query this gives good results but is
> much slower than our regular queries. Using a bbox query is faster but of
> course less accurate.
>
> I then attempted to use a field of type
> solr.SpatialRecursivePrefixTreeFieldType to check performance and because I
> want to be able to do searches within a polygon eventually. The field is
> defined as follows
>
> <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
> spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
> geo="true" distErrPct="0.05" maxDistErr="0.2"
> distanceUnits="kilometers" autoIndex="true"/>
>
> <field name="LatLonRPT__location_rpt" type="location_rpt" indexed="true"
> stored="true" multiValued="false" omitNorms="true" />
>
> I'm just using it to index single points right now. The problem is that
> the distance calculation is not working correctly. It seems to overstate
> the distances for differences in longitude.
>
> For example a query for
> =Id,LatLonRPT__location_rpt,_dist_:geodist()=LatLonRPT__location_rpt=53.409490
> -2.979677={!geofilt sfield=LatLonRPT__location_rpt pt="53.409490
> -2.979677" d=25} returns
>
> {
> "Id": "HAR/CH1/80763270",
> "LatLonRPT__location_rpt": "53.2 -2.91",
> "_dist_": 24.295607
> },
> {
> "Id": "HAR/CH42/1918283949",
> "LatLonRPT__location_rpt": "53.393239 -3.028859",
> "_dist_": 5.7587695
> }
>
> The true distances for these results are 23.67 and 3.73 km and other
> results at a true distance of 17 km aren't returned within the 25 km radius.
>
> The explain has the following
>
> +IntersectsPrefixTreeQuery(IntersectsPrefixTreeQuery(fieldName=LatLonRPT__location_rpt,queryShape=Circle(Pt(x=53.40949,y=-2.979677),
> d=0.2° 25.00km),detailLevel=6,prefixGridScanLevel=7))
>
> Is my set up incorrect in some way or is the
> SpatialRecursivePrefixTreeFieldType not suitable for doing radius searches
> on points in this way?
>
> Thanks in anticipation for any suggestions.
>
> Peter Lancaster.
>

-- 
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Rectangle with rotation in Solr

2018-09-13 Thread David Smiley
Polygon is the only way.

On Wed, Aug 29, 2018 at 7:46 AM Zahra Aminolroaya 
wrote:

> I have locations given as 4-tuples of (longitude,latitude) which are like
> rectangles, and I want to index them. Solr's BBoxField with minX, maxX, maxY
> and minY only considers rectangles which do not have rotations. Suppose my
> rectangle is rotated 45 degrees clockwise relative to the axes; how can I
> define rotation in bbox? Is using RPT (polygon) the only way?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Impact/Performance of maxDistErr

2018-05-30 Thread David Smiley
I suggest using the "Intersects" spatial predicate when either the data is
all points or if the query is a point.  It's semantically equivalent and
the algorithm is much faster.
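
So instead of Contains, a query sketch (the field name and coordinates are
illustrative, not from your setup):

  fq=geom:"Intersects(POINT(10.14 54.32))"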

On Wed, May 30, 2018 at 3:25 AM Jens Viebig  wrote:

> Thanks for the detailed answer David, that helps a lot to understand!
> Best Regards
>
> Jens
>
> P.S. Currently the only search we are doing on the polygon is
> Contains(POINT(x,y))
>
>
> On 29.05.2018 at 13:30, David Smiley wrote:
>
> Hello Jens,
> With solr.RptWithGeometrySpatialField, you always get an accurate result
> thanks to the "WithGeometry" part.  The "Rpt" part is a grid index, and
> most of the parameters pertain to that.  maxDistErr controls the highest
> resolution grid.  No shape will be indexed to higher resolutions than this,
> though it may be at coarser resolutions depending on distErrPct.  The
> configuration you chose initially (that turned out to be slow for you) was
> a meter, and then you changed it to a kilometer and got fast indexing
> results.  I figure the size of your indexed shapes are on average a
> kilometer in size (give or take an order of magnitude).  It's hard to guess
> how your query shapes compare to your indexed shapes as there are multiple
> possibilities that could yield similar query performance when changing
> maxDistErr so much.
>
> The bottom line is that you should dial up maxDistErr as much as you can
> get away with -- that is, as long as query performance is good.  So you
> did the right thing :-).  That number will probably be a distance somewhat
> less than the average indexed shape diameter, or average query shape
> diameter, whichever is greater.  Perhaps 1/10th smaller; if I had to pick.
> The default setting, I think a meter, is probably not a good default for
> this field type.
>
> Note you could also try increasing distErrPct some, maybe to as much as
> .25, though I wouldn't go much higher, as it may yield gridded shapes that
> are so coarse as to not have interior cells.  Depending on what your query
> shapes typically look like and indexed shapes relative to each other, that
> may be significant or may not be.  If the indexed shapes are often much
> larger than your query shape then it's significant.
>
> ~ David
>
> On Fri, May 25, 2018 at 6:59 AM Jens Viebig  wrote:
>
>> Hello,
>>
>> we are indexing a polygon with 4 points (non-rectangular, field-of-view
>> of a camera) in a RptWithGeometrySpatialField alongside some more fields,
>> to perform searches that check if a point is within this polygon
>>
>> We started using the default configuration found in several examples
>> online:
>>
>> <fieldType name="..." class="solr.RptWithGeometrySpatialField"
>> spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>>geo="true" distErrPct="0.15" maxDistErr="0.001"
>> distanceUnits="kilometers" />
>>
>> We discovered that with this setting the indexing (soft commit) speed is
>> very slow
>> For 1 documents it takes several minutes to finish the commit
>>
>> If we disable this field, indexing+soft commit is only 3 seconds for
>> 1 docs,
>> if we set maxDistErr to 1, indexing speed is at around 5 seconds, so a
>> huge performance gain against the several minutes we had before
>>
>> I tried to find out via the documentation what the impact of
>> "maxDistErr" is on search results but didn't quite find an in-depth explanation.
>> From the tests we did, the search results still seem to be very accurate
>> even if the covered space of the polygon is less than 1 km, and search speed
>> did not suffer.
>>
>> So I would love to learn more about the differences between having
>> maxDistErr="0.001" vs maxDistErr="1" on a RptWithGeometrySpatialField and
>> what problems we could run into with the bigger value.
>>
>> Thanks
>> Jens
>>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Impact/Performance of maxDistErr

2018-05-29 Thread David Smiley
Hello Jens,
With solr.RptWithGeometrySpatialField, you always get an accurate result
thanks to the "WithGeometry" part.  The "Rpt" part is a grid index, and
most of the parameters pertain to that.  maxDistErr controls the highest
resolution grid.  No shape will be indexed to higher resolutions than this,
though it may be at coarser resolutions depending on distErrPct.  The
configuration you chose initially (that turned out to be slow for you) was
a meter, and then you changed it to a kilometer and got fast indexing
results.  I figure the size of your indexed shapes are on average a
kilometer in size (give or take an order of magnitude).  It's hard to guess
how your query shapes compare to your indexed shapes as there are multiple
possibilities that could yield similar query performance when changing
maxDistErr so much.

The bottom line is that you should dial up maxDistErr as much as you can
get away with -- that is, as long as query performance is good.  So you
did the right thing :-).  That number will probably be a distance somewhat
less than the average indexed shape diameter, or average query shape
diameter, whichever is greater.  Perhaps 1/10th smaller; if I had to pick.
The default setting, I think a meter, is probably not a good default for
this field type.

Note you could also try increasing distErrPct some, maybe to as much as
.25, though I wouldn't go much higher, as it may yield gridded shapes that
are so coarse as to not have interior cells.  Depending on what your query
shapes typically look like and indexed shapes relative to each other, that
may be significant or may not be.  If the indexed shapes are often much
larger than your query shape then it's significant.
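
Putting the two knobs together, a tuned variant of your field type might look
like this (a sketch using the values discussed above, not a recommendation):

  <fieldType name="geom_rpt" class="solr.RptWithGeometrySpatialField"
      spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
      geo="true" distErrPct="0.15" maxDistErr="1" distanceUnits="kilometers"/>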

~ David

On Fri, May 25, 2018 at 6:59 AM Jens Viebig  wrote:

> Hello,
>
> we are indexing a polygon with 4 points (non-rectangular, field-of-view of
> a camera) in a RptWithGeometrySpatialField alongside some more fields, to
> perform searches that check if a point is within this polygon
>
> We started using the default configuration found in several examples
> online:
>
> <fieldType name="..." class="solr.RptWithGeometrySpatialField"
> spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>geo="true" distErrPct="0.15" maxDistErr="0.001"
> distanceUnits="kilometers" />
>
> We discovered that with this setting the indexing (soft commit) speed is
> very slow
> For 1 documents it takes several minutes to finish the commit
>
> If we disable this field, indexing+soft commit is only 3 seconds for 1
> docs,
> if we set maxDistErr to 1, indexing speed is at around 5 seconds, so a
> huge performance gain against the several minutes we had before
>
> I tried to find out via the documentation what the impact of "maxDistErr"
> is on search results but didn't quite find an in-depth explanation.
> From the tests we did, the search results still seem to be very accurate
> even if the covered space of the polygon is less than 1 km, and search speed
> did not suffer.
>
> So I would love to learn more about the differences between having
> maxDistErr="0.001" vs maxDistErr="1" on a RptWithGeometrySpatialField and
> what problems we could run into with the bigger value.
>
> Thanks
> Jens
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: ClassCastException: o.a.l.d.Field cannot be cast to o.a.l.d.StoredField

2018-04-26 Thread David Smiley
> but how would a DocumentTransformer affect UpdateLog replay?

Oh right; nevermind that silly theory ;-)

On Thu, Apr 26, 2018 at 10:42 AM Markus Jelsma 
wrote:

> Hello David,
>
> Yes it was sporadic indeed, but how would a DocumentTransformer affect
> UpdateLog replay?
>
> We removed the cast, no idea how it got there.
>
> Thanks,
> Markus
>
> -Original message-
> > From:David Smiley 
> > Sent: Thursday 26th April 2018 16:31
> > To: solr-user@lucene.apache.org
> > Subject: Re: ClassCastException: o.a.l.d.Field cannot be cast to
> o.a.l.d.StoredField
> >
> > I'm not sure but I wonder why you would want to cast it in the first
> > place.  Field is the base class; all its subclasses are in one way or
> > another utilities/conveniences.  In other words, if you ever see code
> > casting Field to some subclass, there's a good chance it's fundamentally
> > wrong or making assumptions that aren't necessarily true.
> >
> > If the problem you saw appears sporadic, there's a good chance it is in
> > some way related to updateLog replay.
> >
> > On Tue, Apr 24, 2018 at 7:13 AM Markus Jelsma <
> markus.jel...@openindex.io>
> > wrote:
> >
> > > Hello,
> > >
> > > We have a DocumentTransformer that gets a Field from the SolrDocument
> and
> > > casts it to StoredField (although apparently we don't need to cast).
> This
> > > works well in tests and fine in production, except for some curious,
> > > unknown and unreproducible, cases, throwing the ClassCastException.
> > >
> > > I can, and will, just remove the cast to fix the rare exception, but in
> > > what cases could the exception get thrown?
> > >
> > > Many thanks,
> > > Markus
> > >
> > --
> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> > http://www.solrenterprisesearchserver.com
> >
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Highlighter throwing InvalidTokenOffsetsException for field with large number of synonyms

2018-04-26 Thread David Smiley
Yay!  I'm glad the UnifiedHighlighter is serving you well.  I was about to
suggest it.  If you think the fragmentation/snippeting could be improved in
a general way then post a JIRA for consideration.  Note: identical results
with the original Highlighter is a non-goal.

On Mon, Apr 23, 2018 at 10:14 PM howed  wrote:

> Finally got back to looking at this, and found that the solution was to
> switch to the unified highlighter
> <https://lucene.apache.org/solr/guide/7_2/highlighting.html#choosing-a-highlighter>,
> which doesn't seem to have the same problem with my complex
> synonyms.  This required some tweaking of the highlighting parameters and
> my
> code as it doesn't highlight exactly the same as the default highlighter,
> but all is working now.
>
> Thanks again for the assistance.
>
> David
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: ClassCastException: o.a.l.d.Field cannot be cast to o.a.l.d.StoredField

2018-04-26 Thread David Smiley
I'm not sure but I wonder why you would want to cast it in the first
place.  Field is the base class; all its subclasses are in one way or
another utilities/conveniences.  In other words, if you ever see code
casting Field to some subclass, there's a good chance it's fundamentally
wrong or making assumptions that aren't necessarily true.

If the problem you saw appears sporadic, there's a good chance it is in
some way related to updateLog replay.

On Tue, Apr 24, 2018 at 7:13 AM Markus Jelsma 
wrote:

> Hello,
>
> We have a DocumentTransformer that gets a Field from the SolrDocument and
> casts it to StoredField (although apparently we don't need to cast). This
> works well in tests and fine in production, except for some curious,
> unknown and unreproducible, cases, throwing the ClassCastException.
>
> I can, and will, just remove the cast to fix the rare exception, but in
> what cases could the exception get thrown?
>
> Many thanks,
> Markus
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: PreAnalyzed URP and SchemaRequest API

2018-04-13 Thread David Smiley
Yes I could imagine big gains from this strategy if OpenNLP is in the
analysis chain ;-)

On Fri, Apr 13, 2018 at 5:01 PM Markus Jelsma 
wrote:

> Hello David,
>
> If JSON serialization is too bulky, we could also opt for
> SimplePreAnalyzed right? At least as a FieldType it is possible, if not
> with URP, it just needs some work.
>
> Regarding results; we haven't done it yet, and won't for some time, but we
> will when we reintroduce OpenNLP in the analysis chain. We tried to
> introduce POS-tagging on our own two years ago, but it wasn't suited for
> production because it was too heavy on the CPU. Indexing data suddenly took
> eight to ten times longer in a SolrCloud environment with three replicas.
>
> If we offload our current chains without OpenNLP, it will only benefit
> when large fields pass through a regex, and for decompounding the Germanic
> languages we ingest. Offloading just this cost is a micro-optimization;
> offloading the various OpenNLP char and token filters is really beneficial.
>
> Regarding a dependency on Lucene core and analysis-common, it would be
> helpful, but we'll manage.
>
> Thanks again,
> Markus
>
> -Original message-
> > From:David Smiley 
> > Sent: Thursday 12th April 2018 19:16
> > To: solr-user@lucene.apache.org
> > Subject: Re: PreAnalyzed URP and SchemaRequest API
> >
> > Ah ok.
> > I've wondered how much value there is in pre-analysis.  The serialization
> > of the analyzed form in JSON is bulky.  If you can share any results, I'd
> > be interested to hear how it went.  It's an optimization so you should be
> > able to know how much better it is.  Of course it isn't for everybody --
> > only when the analysis chain is sufficiently complex.
> >
> > On Mon, Apr 9, 2018 at 9:45 AM Markus Jelsma  >
> > wrote:
> >
> > > Hello David,
> > >
> > > The remote client has everything on the class path but just calling
> > > setTokenStream is not going to work. Remotely, all i get from
> SchemaRequest
> > > API is an AnalyzerDefinition. I haven't found any Solr code that allows
> me
> > > to transform that directly into an analyzer. If i had that, it would
> make
> > > things easy.
> > >
> > > As far as i see it, i need to reconstruct a real Analyzer using
> > > AnalyzerDefinition's information. It won't be a problem, but it is
> > > cumbersome.
> > >
> > > Thanks anyway,
> > > Markus
> > >
> > > -Original message-
> > > > From:David Smiley 
> > > > Sent: Thursday 5th April 2018 19:38
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: PreAnalyzed URP and SchemaRequest API
> > > >
> > > > Is this really a problem when you could easily enough create a
> TextField
> > > > and call setTokenStream?
> > > >
> > > > Does your remote client have Solr-core and all its dependencies on
> the
> > > > classpath?   That's one way to do it... and presumably the direction
> you
> > > > are going because you're asking how to work with PreAnalyzedParser
> which
> > > is
> > > > in solr-core.  *Alternatively*, only bring in Lucene core and
> construct
> > > > things yourself in the right format.  You could copy
> PreAnalyzedParser
> > > into
> > > > your codebase so that you don't have to reinvent any wheels, even
> though
> > > > that's awkward.  Perhaps that ought to be in Solrj?  But no we don't
> want
> > > > SolrJ depending on Lucene-core, though it'd make a fine "optional"
> > > > dependency.
> > > >
> > > > On Wed, Apr 4, 2018 at 4:53 AM Markus Jelsma <
> markus.jel...@openindex.io
> > > >
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > We intend to move to PreAnalyzed URP for analysis offloading.
> Browsing
> > > the
> > > > > Javadocs i came across the SchemaRequest API looking for a way to
> get a
> > > > > Field object remotely, which i seem to need for
> > > > > JsonPreAnalyzedParser.toFormattedString(Field f). But all i can get
> > > from
> > > > > SchemaRequest API is FieldTypeRepresentation, which offers me
> > > > > getIndexAnalyzer() but won't allow me to construct a Field object.
> > > > >
> > > > > So, to analyze remotely i do need an index-time analyzer. I can
> get it,
> > > > > but not turn it into a Field object, which the PreAnalyzedParser
> for
> > > some
> > > > > reason wants.
> > > > >
> > > > > Any hints here? I must be looking the wrong way.
> > > > >
> > > > > Many thanks!
> > > > > Markus
> > > > >
> > > > --
> > > > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> > > > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> > > > http://www.solrenterprisesearchserver.com
> > > >
> > >
> > --
> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> > http://www.solrenterprisesearchserver.com
> >
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:

Re: PreAnalyzed URP and SchemaRequest API

2018-04-12 Thread David Smiley
Ah ok.
I've wondered how much value there is in pre-analysis.  The serialization
of the analyzed form in JSON is bulky.  If you can share any results, I'd
be interested to hear how it went.  It's an optimization so you should be
able to know how much better it is.  Of course it isn't for everybody --
only when the analysis chain is sufficiently complex.

On Mon, Apr 9, 2018 at 9:45 AM Markus Jelsma 
wrote:

> Hello David,
>
> The remote client has everything on the class path but just calling
> setTokenStream is not going to work. Remotely, all i get from SchemaRequest
> API is an AnalyzerDefinition. I haven't found any Solr code that allows me
> to transform that directly into an analyzer. If i had that, it would make
> things easy.
>
> As far as i see it, i need to reconstruct a real Analyzer using
> AnalyzerDefinition's information. It won't be a problem, but it is
> cumbersome.
>
> Thanks anyway,
> Markus
>
> -Original message-
> > From:David Smiley 
> > Sent: Thursday 5th April 2018 19:38
> > To: solr-user@lucene.apache.org
> > Subject: Re: PreAnalyzed URP and SchemaRequest API
> >
> > Is this really a problem when you could easily enough create a TextField
> > and call setTokenStream?
> >
> > Does your remote client have Solr-core and all its dependencies on the
> > classpath?   That's one way to do it... and presumably the direction you
> > are going because you're asking how to work with PreAnalyzedParser which
> is
> > in solr-core.  *Alternatively*, only bring in Lucene core and construct
> > things yourself in the right format.  You could copy PreAnalyzedParser
> into
> > your codebase so that you don't have to reinvent any wheels, even though
> > that's awkward.  Perhaps that ought to be in Solrj?  But no we don't want
> > SolrJ depending on Lucene-core, though it'd make a fine "optional"
> > dependency.
> >
> > On Wed, Apr 4, 2018 at 4:53 AM Markus Jelsma  >
> > wrote:
> >
> > > Hello,
> > >
> > > We intend to move to PreAnalyzed URP for analysis offloading. Browsing
> the
> > > Javadocs i came across the SchemaRequest API looking for a way to get a
> > > Field object remotely, which i seem to need for
> > > JsonPreAnalyzedParser.toFormattedString(Field f). But all i can get
> from
> > > SchemaRequest API is FieldTypeRepresentation, which offers me
> > > getIndexAnalyzer() but won't allow me to construct a Field object.
> > >
> > > So, to analyze remotely i do need an index-time analyzer. I can get it,
> > > but not turn it into a Field object, which the PreAnalyzedParser for
> some
> > > reason wants.
> > >
> > > Any hints here? I must be looking the wrong way.
> > >
> > > Many thanks!
> > > Markus
> > >
> > --
> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> > http://www.solrenterprisesearchserver.com
> >
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: PreAnalyzed URP and SchemaRequest API

2018-04-05 Thread David Smiley
Is this really a problem when you could easily enough create a TextField
and call setTokenStream?

Does your remote client have Solr-core and all its dependencies on the
classpath?   That's one way to do it... and presumably the direction you
are going because you're asking how to work with PreAnalyzedParser which is
in solr-core.  *Alternatively*, only bring in Lucene core and construct
things yourself in the right format.  You could copy PreAnalyzedParser into
your codebase so that you don't have to reinvent any wheels, even though
that's awkward.  Perhaps that ought to be in Solrj?  But no we don't want
SolrJ depending on Lucene-core, though it'd make a fine "optional"
dependency.
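
To make the first option concrete, a minimal sketch, assuming solr-core (or
a copied JsonPreAnalyzedParser) is on the client classpath; StandardAnalyzer
here is only a stand-in for whatever analysis chain you actually run:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.TextField;
  import org.apache.solr.schema.JsonPreAnalyzedParser;

  public class PreAnalyzeClient {
    // Analyze text locally and serialize it for Solr's PreAnalyzed machinery.
    public static String preAnalyze(String fieldName, String text) throws Exception {
      Analyzer analyzer = new StandardAnalyzer(); // stand-in for the real chain
      Field field = new TextField(fieldName, text, Field.Store.YES);
      // Replace the plain value's stream with the locally analyzed tokens.
      field.setTokenStream(analyzer.tokenStream(fieldName, text));
      return new JsonPreAnalyzedParser().toFormattedString(field);
    }
  }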

On Wed, Apr 4, 2018 at 4:53 AM Markus Jelsma 
wrote:

> Hello,
>
> We intend to move to PreAnalyzed URP for analysis offloading. Browsing the
> Javadocs i came across the SchemaRequest API looking for a way to get a
> Field object remotely, which i seem to need for
> JsonPreAnalyzedParser.toFormattedString(Field f). But all i can get from
> SchemaRequest API is FieldTypeRepresentation, which offers me
> getIndexAnalyzer() but won't allow me to construct a Field object.
>
> So, to analyze remotely i do need an index-time analyzer. I can get it,
> but not turn it into a Field object, which the PreAnalyzedParser for some
> reason wants.
>
> Any hints here? I must be looking the wrong way.
>
> Many thanks!
> Markus
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: querying vs. highlighting: complete freedom?

2018-04-03 Thread David Smiley
Thanks for your review!

On Tue, Apr 3, 2018 at 6:56 AM Arturas Mazeika  wrote:
...

> What I missed at the beginning of the documentation is the minimal set of
> requirements that is required to have sensible highlighting: somehow I
> have a feeling that one needs some of the information stored in schema in
> some form. This of course is mentioned later on in the corresponding
> section, but I'd write this explicitly.
>

Explicitly say what up front?  "Requirements" are somewhat loose/minimal.
We ought to clearly say that hl.fl fields need to be "stored".
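
E.g. the schema entry backing an hl.fl field would need stored="true"
(field and type names here are illustrative):

  <field name="trans" type="text_de" indexed="true" stored="true"/>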

...

> Is there a way to "load-balance" analyze-query-chain for the purpose of
> highlighting matches? In the url below, I need to specify a specific core.

...

I doubt it.  You'll have to do this yourself.  Why do you want to use this
for highlighting?  Is it to get the offsets returned to you?  There's a
JIRA or two for that already; someone ought to make that happen.
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: PreAnalyzed FieldType, and simultaneously importing JSON

2018-04-02 Thread David Smiley
Hello Markus,

It appears you are not familiar with PreAnalyzedUpdateProcessor?  Using
that is much more flexible -- you could have different URP chains for your
use-cases. IMO PreAnalyzedField ought to go away.  I argued for the URP
version and thus its superiority to the FieldType here:
https://issues.apache.org/jira/browse/SOLR-4619?focusedCommentId=13611191&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13611191
Sadly, the FieldType is the one that is documented in the ref guide, but
not the URP :-(
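
For the record, wiring it up looks something like this (a sketch; the
fieldName selectors and the parser param are as I recall them from the
factory's javadocs, with "json" or "simple" as parser values):

  <updateRequestProcessorChain name="pre-analyzed">
    <processor class="solr.PreAnalyzedUpdateProcessorFactory">
      <str name="fieldName">title</str>
      <str name="fieldName">body</str>
      <str name="parser">json</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>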

~ David

On Thu, Mar 29, 2018 at 5:06 PM Markus Jelsma 
wrote:

> Hello,
>
> We want to move to PreAnalyzed FieldType to offload our very heavy
> analysis chain away from the search cluster, so we have to configure our
> fields to accept pre-analyzed tokens in production.
>
> But we use the same schema in development environments too, and that is
> where we use JSON files, or stream (export/import) data directly from
> production servers into a development environment, again via JSON. And in
> case of disaster recovery, we can import the daily exported JSON bzipped
> files back into our production servers.
>
> But this JSON loading does not work with PreAnalyzed FieldType. So to load
> JSON we must reset all fields back to their respective language specific
> FieldTypes on-the-fly, we could automate, but it is a hassle we like to
> avoid.
>
> Have i overlooked any configuration parameters that can help? Must we
> automate the on-the-fly schema reconfiguration and reset to PreAnalyzed
> after JSON loading is finished?
>
> Many thanks!
> Markus
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: querying vs. highlighting: complete freedom?

2018-04-02 Thread David Smiley
Hi Arturas,

Both Erick and I had a go at improving the documentation here.  I hope it's
clearer.
https://builds.apache.org/job/Solr-reference-guide-master/javadoc/highlighting.html
The docs for hl.fl, hl.q, hl.qparser were all updated.  The meat of the
change was a new note in hl.fl including an example.  It's kinda hard to
document the problem you found but I hope the note will be somewhat
illustrative.

~ David

On Mon, Mar 26, 2018 at 3:12 AM Arturas Mazeika  wrote:

> Hi Erick,
>
> Adding a field-qualifier to the hl.q parameter solved the issue. My
> excitement is steaming over the roof! What a thorough answer: the
> explanation about the behavior of solr, how it tries to interpret what I
> mean when I supply a keyword without the field-qualifier. Very impressive.
> Would you care (re)posting this answer to stackoverflow? If that is too
> much of a hassle, I'll do this in a couple of days myself on your behalf.
>
> I am impressed by how well, thoroughly, quickly, and fully the question
> was answered.
>
> Steven's hint pushed me in this direction further: he suggested to use the
> query part of solr to filter and sort out the relevant answers in the 1st
> step and in the 2nd step he'd highlight all the keywords using CTRL+F (in
> the browser or some alternative viewer). This brought me to the next
> question:
>
> How can one match query terms with the analyze-chained documents in an
> efficient and distributed manner? My current understanding how to achieve
> this is the following:
>
> 1. Get the list of ids (contents) of the documents that match the query
> 2. Use the analysis screen (http://localhost:8983/solr/#/trans/analysis)
> to re-analyze the document and the query
> 3. Use the matching of the substrings from the original text to last
> filter/tokenizer/analyzer in the analyze-chain to map the terms of the
> query
> 4. Emulate CTRL+F highlighting
>
> Solr's web interface offers quite a bit to advance towards this goal. If
> one fires this request:
>
> * analysis.fieldvalue=Albert Einstein (14 March 1879 – 18 April 1955) was a
> German-born theoretical physicist[5] who developed the theory of
> relativity, one of the two pillars of modern physics (alongside quantum
> mechanics).&
> * analysis.query=reletivity theory
>
> to one of the cores of solr, one gets the steps 1-3 done:
>
>
> http://localhost:8983/solr/trans_shard1_replica_n1/analysis/field?wt=xml&analysis.showmatch=true&analysis.fieldvalue=Albert%20Einstein%20(14%20March%201879%20%E2%80%93%2018%20April%201955)%20was%20a%20German-born%20theoretical%20physicist[5]%20who%20developed%20the%20theory%20of%20relativity,%20one%20of%20the%20two%20pillars%20of%20modern%20physics%20(alongside%20quantum%20mechanics).&analysis.query=reletivity%20theory&analysis.fieldtype=text_en
>
> Questions:
>
> 1. Is there a way to "load-balance" this? In the above url, I need to
> specify a specific core. Is it possible to generalize it, so the core that
> receives the request is not necessarily the one that processes it? Or is
> this already distributed, in the sense that the receiving core and the
> processing cores are never the same?
>
> 2. The document was already analyze-chained. Is it possible to store this
> information so one does not need to re-analyze-chain it once more?
>
> Cheers
> Arturas
>
> On Fri, Mar 23, 2018 at 9:15 PM, Erick Erickson 
> wrote:
>
> > Arturas:
> >
> > Try to field-qualify your hl.q parameter. That looks like:
> >
> > hl.q=trans:Kundigung
> > or
> > hl.q=trans:Kündigung
> >
> > I saw the exact behavior you describe when I did _not_ specify the
> > field in the hl.q parameter, i.e.
> >
> > hl.q=Kundigung
> > or
> > hl.q=Kündigung
> >
> > didn't show all highlights.
> >
> > But when I did specify the field, it worked.
> >
> > Here's what I think is happening: Solr uses the default search
> > field when parsing an un-field-qualified query. I.e.
> >
> > q=something
> >
> > is parsed as
> >
> > q=default_search_field:something.
> >
> > The default field is controlled in solrconfig.xml with the "df"
> > parameter; you'll see entries like:
> > <str name="df">my_field</str>
> >
> > Also when I changed the "df" parameter to the field I was highlighting
> > on, I didn't need to specify the field on the hl.q parameter.
> >
> > hl.q=Kundigung
> > or
> > hl.q=Kündigung
> >
> > The default  field is usually "text", which knows nothing about
> > the German-specific filters you've applied unless you changed it.
> >
> > So in the absence of a field-qualification for the hl.q parameter Solr
> > was parsing the query according to the analysis chain specifed
> > in your default field, and probably passed ü through without
> > transforming it. Since your indexing analysis chain for that field
> > folded ü to just plain u, it wasn't found or highlighted.
> >
> > On the surface, this does seem like something that should be
> > changed, I'll go ahead and ping the dev list.
> >
> > NOTE: I was trying this on Solr 7.1
> >
> > Best,
> > Erick
> >
> > On Fri, Mar 23, 2018 at 12:03 PM, Arturas Mazeika 
> > wrote:
> > > Hi 

Re: Copying a SolrCloud collection to other hosts

2018-03-28 Thread David Smiley
Right, there is a shared filesystem requirement.  It would be nice if this
Solr feature could be enhanced to have more options like backing up
directly to another SolrCloud using replication/fetchIndex like your cool
solrcloud_manager thing.

On Wed, Mar 28, 2018 at 12:34 PM Jeff Wartes <jwar...@whitepages.com> wrote:

> The backup/restore still requires setting up a shared filesystem on all
> your nodes though right?
>
> I've been using the fetchindex trick in my solrcloud_manager tool for ages
> now: https://github.com/whitepages/solrcloud_manager#cluster-commands
> Some of the original features in that tool have been incorporated into
> Solr itself these days, but I still use clonecollection/copycollection
> regularly. (most recently with Solr 7.2)
>
>
> On 3/27/18, 9:55 PM, "David Smiley" <david.w.smi...@gmail.com> wrote:
>
> The backup/restore API is intended to address this.
>
> https://builds.apache.org/job/Solr-reference-guide-master/javadoc/making-and-restoring-backups.html
>
> Erick's advice is good (and I once drafted docs for the same scheme
> years
> ago as well), but I consider it dated -- it's what people had to do
> before
> the backup/restore API existed.  Internally, backup/restore is doing
> similar stuff.  It's easy to give backup/restore a try; surely you
> have by
> now?
>
> ~ David
>
> On Tue, Mar 6, 2018 at 9:47 AM Patrick Schemitz <p...@solute.de> wrote:
>
> > Hi List,
> >
> > so I'm running a bunch of SolrCloud clusters (each cluster is: 8
> shards
> > on 2 servers, with 4 instances per server, no replicas, i.e. 1 shard
> per
> > instance).
> >
> > Building the index afresh takes 15+ hours, so when I have to deploy
> a new
> > index, I build it once, on one cluster, and then copy (scp) over the
> > data//index directories (shutting down the Solr instances
> > first).
> >
> > I could get Solr 6.5.1 to number the shard/replica directories
> nicely via
> > the createNodeSet and createNodeSet.shuffle options:
> >
> > Solr 6.5.1 /var/lib/solr:
> >
> > Server node 1:
> > instance00/data/main_index_shard1_replica1
> > instance01/data/main_index_shard2_replica1
> > instance02/data/main_index_shard3_replica1
> > instance03/data/main_index_shard4_replica1
> >
> > Server node 2:
> > instance00/data/main_index_shard5_replica1
> > instance01/data/main_index_shard6_replica1
> > instance02/data/main_index_shard7_replica1
> > instance03/data/main_index_shard8_replica1
> >
> > However, while attempting to upgrade to 7.2.1, this numbering has
> changed:
> >
> > Solr 7.2.1 /var/lib/solr:
> >
> > Server node 1:
> > instance00/data/main_index_shard1_replica_n1
> > instance01/data/main_index_shard2_replica_n2
> > instance02/data/main_index_shard3_replica_n4
> > instance03/data/main_index_shard4_replica_n6
> >
> > Server node 2:
> > instance00/data/main_index_shard5_replica_n8
> > instance01/data/main_index_shard6_replica_n10
> > instance02/data/main_index_shard7_replica_n12
> > instance03/data/main_index_shard8_replica_n14
> >
> > This new numbering breaks my copy script, and furthermore, I'm
> worried
> > as to what happens when the numbering is different among target
> clusters.
> >
> > How can I switch this back to the old numbering scheme?
> >
> > Side note: is there a recommended way of doing this? Is the
> > backup/restore mechanism suitable for this? The ref guide is kind of
> terse
> > here.
> >
> > Thanks in advance,
> >
> > Ciao, Patrick
> >
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>
>
> --
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Copying a SolrCloud collection to other hosts

2018-03-27 Thread David Smiley
The backup/restore API is intended to address this.
https://builds.apache.org/job/Solr-reference-guide-master/javadoc/making-and-restoring-backups.html

Erick's advice is good (and I once drafted docs for the same scheme years
ago as well), but I consider it dated -- it's what people had to do before
the backup/restore API existed.  Internally, backup/restore is doing
similar stuff.  It's easy to give backup/restore a try; surely you have by
now?
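
A sketch of the two calls (collection names and the path are illustrative;
the location must be a path that every node can reach):

  http://localhost:8983/solr/admin/collections?action=BACKUP&name=mybackup&collection=main_index&location=/mnt/solr-backups

  http://localhost:8983/solr/admin/collections?action=RESTORE&name=mybackup&collection=main_index_restored&location=/mnt/solr-backups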

~ David

On Tue, Mar 6, 2018 at 9:47 AM Patrick Schemitz  wrote:

> Hi List,
>
> so I'm running a bunch of SolrCloud clusters (each cluster is: 8 shards
> on 2 servers, with 4 instances per server, no replicas, i.e. 1 shard per
> instance).
>
> Building the index afresh takes 15+ hours, so when I have to deploy a new
> index, I build it once, on one cluster, and then copy (scp) over the
> data//index directories (shutting down the Solr instances
> first).
>
> I could get Solr 6.5.1 to number the shard/replica directories nicely via
> the createNodeSet and createNodeSet.shuffle options:
>
> Solr 6.5.1 /var/lib/solr:
>
> Server node 1:
> instance00/data/main_index_shard1_replica1
> instance01/data/main_index_shard2_replica1
> instance02/data/main_index_shard3_replica1
> instance03/data/main_index_shard4_replica1
>
> Server node 2:
> instance00/data/main_index_shard5_replica1
> instance01/data/main_index_shard6_replica1
> instance02/data/main_index_shard7_replica1
> instance03/data/main_index_shard8_replica1
>
> However, while attempting to upgrade to 7.2.1, this numbering has changed:
>
> Solr 7.2.1 /var/lib/solr:
>
> Server node 1:
> instance00/data/main_index_shard1_replica_n1
> instance01/data/main_index_shard2_replica_n2
> instance02/data/main_index_shard3_replica_n4
> instance03/data/main_index_shard4_replica_n6
>
> Server node 2:
> instance00/data/main_index_shard5_replica_n8
> instance01/data/main_index_shard6_replica_n10
> instance02/data/main_index_shard7_replica_n12
> instance03/data/main_index_shard8_replica_n14
>
> This new numbering breaks my copy script, and furthermore, I'm worried
> as to what happens when the numbering is different among target clusters.
>
> How can I switch this back to the old numbering scheme?
>
> Side note: is there a recommended way of doing this? Is the
> backup/restore mechanism suitable for this? The ref guide is kind of terse
> here.
>
> Thanks in advance,
>
> Ciao, Patrick
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: InetAddressPoint support in Solr or other IP type?

2018-03-27 Thread David Smiley
(I overlooked your reply; sorry to leave you hanging)

From a simplicity standpoint, just use InetAddressPoint.  Solr has no
rules/restrictions as to which Lucene module it's in.

That said, I *suspect* a Terms PrefixTree aligned to each byte would offer
better query performance, presuming that typical range queries are
byte-to-byte (as they would be for IPs?).  The Points API internally makes
the splitting decision, and it's not customizable.  It's blind to how
people will realistically query the data; it just wants a balanced tree.
For the same reason, I *suspect* (but have not benchmarked to see) that
DateRangeField has better query performance than DatePointField.  That
said, a Points index is probably going to be leaner & faster to index.
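
If someone does take a stab at a field type, the underlying Lucene calls
are simple enough -- a sketch (field name illustrative):

  import java.net.InetAddress;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.InetAddressPoint;
  import org.apache.lucene.search.Query;

  public class IpPointSketch {
    public static void main(String[] args) throws Exception {
      Document doc = new Document();
      // one field class handles both IPv4 and IPv6
      doc.add(new InetAddressPoint("ip", InetAddress.getByName("192.168.1.2")));

      // CIDR-style subnet match: 10.2.0.0/16
      Query subnet = InetAddressPoint.newPrefixQuery(
          "ip", InetAddress.getByName("10.2.0.0"), 16);
      // arbitrary range
      Query range = InetAddressPoint.newRangeQuery("ip",
          InetAddress.getByName("10.2.0.0"),
          InetAddress.getByName("10.2.255.255"));
    }
  }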

~ David

On Fri, Mar 23, 2018 at 7:51 PM Mike Cooper <mcoo...@carbonblack.com> wrote:

> Thanks David. Is there a reason we wouldn't want to base the Solr
> implementation on the InetAddressPoint class?
>
>
> https://lucene.apache.org/core/7_2_1/misc/org/apache/lucene/document/InetAddressPoint.html
>
> I realize that is in the "misc" package for now, so it's not part of core
> Lucene. But it is nice in that it has one class for both ipv4 and ipv6 and
> it's based on point numerics rather than trie numerics which seem to be
> deprecated. I'm pretty familiar with the code base, I could take a stab at
> implementing this. I just wanted to make sure there wasn't something I was
> missing since I couldn't find any discussion on this.
>
> Michael Cooper
>
> -Original Message-
> From: David Smiley [mailto:david.w.smi...@gmail.com]
> Sent: Friday, March 23, 2018 5:14 PM
> To: solr-user@lucene.apache.org
> Subject: Re: InetAddressPoint support in Solr or other IP type?
>
> Hi,
>
> For IPv4, use TrieIntField with precisionStep=8
>
> For IPv6 https://issues.apache.org/jira/browse/SOLR-6741   There's nothing
> there yet; you could help out if you are familiar with the codebase.  Or
> you
> might try something relatively simple involving edge ngrams.
>
> ~ David
>
> On Thu, Mar 22, 2018 at 1:09 PM Mike Cooper <mcoo...@carbonblack.com>
> wrote:
>
> > I have scoured the web and cannot find any discussion of having the
> > Lucene InetAddressPoint type exposed in Solr. Is there a reason this
> > is omitted from the Solr supported types? Is it on the roadmap? Is
> > there an alternative recommended way to index and store Ipv4 and Ipv6
> > addresses for optimal range searches and subnet searches? Thanks for your
> > help.
> >
> >
> >
> > *Michael Cooper*
> >
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: InetAddressPoint support in Solr or other IP type?

2018-03-23 Thread David Smiley
Hi,

For IPv4, use TrieIntField with precisionStep=8
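
i.e. something like this in the schema (type and field names illustrative),
with the client packing the four octets into the int.  Note that addresses
above 127.255.255.255 go negative as a signed int, so the client would
likely need to flip the sign bit (XOR with 0x80000000) on the way in for
range queries to order correctly -- an assumption worth testing:

  <fieldType name="ip4" class="solr.TrieIntField" precisionStep="8" docValues="true"/>
  <field name="ip" type="ip4" indexed="true" stored="true"/>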

For IPv6 https://issues.apache.org/jira/browse/SOLR-6741   There's nothing
there yet; you could help out if you are familiar with the codebase.  Or
you might try something relatively simple involving edge ngrams.

~ David

On Thu, Mar 22, 2018 at 1:09 PM Mike Cooper  wrote:

> I have scoured the web and cannot find any discussion of having the Lucene
> InetAddressPoint type exposed in Solr. Is there a reason this is omitted
> from the Solr supported types? Is it on the roadmap? Is there an
> alternative recommended way to index and store Ipv4 and Ipv6 addresses for
> optimal range searches and subnet searches? Thanks for your help.
>
>
>
> *Michael Cooper*
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Sorting results for spatial search

2018-02-01 Thread David Smiley
quote: "The problem is that this includes children that DON’T touch the
search area in the sum. How can I only include the shapes from the first
query above in my sort?"

Unless I'm misunderstanding your intent, I think this is a simple matter of
adding the spatial filter to the parent join query you are sorting on.  So
something like this (not tested):

sort=query($sortQ) desc
sortQ={!parent which=is_parent:true score=total}
  +is_parent:false
  +{!func}density
  +gridcell_rpt:"Intersects(POLYGON((-20 70, -50 80, -20 20, 30 60, -10 40,
-20 70)))"

Separately from your question, you state that these are grid cells and thus
rectangles.  For rectangles, I recommend using BBoxField, which will
probably overall perform better (smaller index, faster queries).  If you
need an RPT field nonetheless (heatmaps?) then you could use the more
concise ENVELOPE syntax but it shouldn't matter since a polygon that is a
rectangle will internally be optimized to be one.
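
A sketch of the BBoxField alternative (names illustrative; ENVELOPE takes
minX, maxX, maxY, minY):

  <fieldType name="bbox" class="solr.BBoxField" geo="true"
             numberType="pdouble" distanceUnits="kilometers"/>
  <field name="gridcell_bbox" type="bbox" indexed="true" stored="true"/>

with per-cell index values like "ENVELOPE(10, 40, 40, 10)".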

On Wed, Jan 31, 2018 at 3:33 PM Leila Deljkovic <
leila.deljko...@koordinates.com> wrote:

> Hiya,
>
> So I have some nested documents in my index with this kind of structure:
> {
> "id": “parent",
> "gridcell_rpt": "POLYGON((30 10, 40 40, 20 40, 10 20, 30 10))",
> "density": “30"
>
> "_childDocuments_" : [
> {
> "id":"child1",
> "gridcell_rpt":"MULTIPOLYGON(((30 20, 45 40, 10 40, 30 20)))",
> "density":"25"
> },
> {
> "id":"child2",
> "gridcell_rpt":"MULTIPOLYGON(((15 5, 40 10, 10 20, 5 10, 15
> 5)))",
> "density":"5"
> }
> ]
> }
>
> The parent document is a WKT shape, and its children are “grid cells”,
> which are just divisions of the main shape (ie; cutting up the parent shape
> to get children shapes). The “density" is the feature count in each shape.
> When I query (through the Solr UI) I use “Intersects” to return parents
> which touch the search area (note that if a child is touching, the parent
> must also be touching).
>
> eg; fq={!field f=gridcell_rpt}Intersects(POLYGON((-20 70, -50 80,
> -20 20, 30 60, -10 40, -20 70)))
>
> and I want to sort the results by the sum of the densities of all the
> children touching the search area (so which parent has children that touch
> the search area, and how big the sum of these children’s densities is)
> something like {!parent which=is_parent:true score=total
> v='+is_parent:false +{!func}density'} desc
>
> The problem is that this includes children that DON’T touch the search
> area in the sum. How can I only include the shapes from the first query
> above in my sort?
>
> Cheers :)

-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Sum area polygon solr

2017-11-01 Thread David Smiley
Hi,

Ah, no -- sorry.  If you want to roll up your sleeves and write a Solr
plugin (a ValueSource in this case, perhaps) then you could lookup the
index polygon and then call out to JTS to compute the intersection and then
ask it for the area.  But that's going to be a very heavyweight computation
to score/sort on!  Instead, perhaps you can use BBoxField's overlapRatio to
compare bounding boxes which is relatively fast.
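
For example (field name illustrative), overlapRatio can drive the score
straight from the query:

  q={!field f=gridcell_bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 0))&sort=score desc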

~ David

On Tue, Oct 31, 2017 at 8:45 AM Samur Araujo  wrote:

> Hi all, is it possible to sum the area of a polygon in solr?
>
> Suppose I do an polygon intersect and I want to retrieve the total area of
> the resulting polygon.
>
> Is it possible?
>
> Best,
>
> --
> Head of Data
> Geophy
> www.geophy.com
>
> Nieuwe Plantage 54-55
> 2611XK  Delft
> +31 (0)70 7640725
>
> 1 Fore Street
> EC2Y 9DT  London
> +44 (0)20 37690760
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Retrieve DocIdSet from Query in lucene 5.x

2017-10-24 Thread David Smiley
See SolrIndexSearcher.getDocSet.  It may not be identical to what you want
but following what it does on through to DocSetUtil.createDocSet may be
enlightening.
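
A minimal sketch of that path, e.g. from inside a SearchComponent (assuming
the usual ResponseBuilder "rb" is in scope):

  import org.apache.lucene.search.Query;
  import org.apache.solr.search.DocIterator;
  import org.apache.solr.search.DocSet;
  import org.apache.solr.search.SolrIndexSearcher;

  // inside e.g. SearchComponent.process(ResponseBuilder rb):
  SolrIndexSearcher searcher = rb.req.getSearcher();
  Query query = rb.getQuery();
  DocSet docSet = searcher.getDocSet(query); // consults the filterCache
  DocIterator it = docSet.iterator();
  while (it.hasNext()) {
    int docId = it.nextDoc(); // internal Lucene doc id
    // ... use docId ...
  }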

On Fri, Oct 20, 2017 at 5:10 PM Jamie Johnson  wrote:

> I am trying to migrate some old code that used to retrieve DocIdSets from
> filters, but with Filters being deprecated in Lucene 5.x I am trying to
> move away from those classes but I'm not sure the right way to do this
> now.  Are there any examples of doing this?
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Solr Spatial Query Problem Hk.

2017-10-04 Thread David Smiley
Hi,

Firstly, if Solr returns an error referencing an exception then you can
look in Solr's logs for the stack trace, which helps debugging problems a
ton (at least for Solr devs).

I suspect that the problem here is that your schema might have a dynamic
field where *coordinates is defined to be a number.  The error suggests
this, at least.
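
If so, defining the field explicitly should fix it, since an explicit field
takes precedence over a dynamic-field pattern -- assuming the stock
location_rpt type (an RPT type) exists in your schema:

  <field name="geometry.coordinates" type="location_rpt" indexed="true" stored="true"/>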

On Wed, Sep 27, 2017 at 6:42 AM Can Ezgi Aydemir 
wrote:

> 1-
> http://localhost:8983/solr/nh/select?fq=geometry.coordinates:%22IsWithin(POLYGON((-80%2029,%20-90%2050,%20-60%2070,%200%200,%20-80%2029)))%20distErrPct=0%22


missing q=*:*


>
> 2-
> http://localhost:8983/solr/nh/select?q={!field%20f=geometry.coordinates}Intersects(POLYGON((-80%2029,%20-90%2050,%20-60%2070,%200%200,%20-80%2029)))
> 
> 3-
> http://localhost:8983/solr/nh/select?q=*:*&fq={!field%20f=geometry.coordinates}Intersects(POLYGON((-80%2029,%20-90%2050,%20-60%2070,%200%200,%20-80%2029)))
> 
>
> <response>
>  <lst name="responseHeader">
>   <int name="status">400</int>
>   <int name="QTime">1</int>
>   <lst name="params">
>    <str name="fq">geometry.coordinates:"IsWithin(POLYGON((-80 29, -90 50,
> -60 70, 0 0, -80 29))) distErrPct=0"</str>
>   </lst>
>  </lst>
>  <lst name="error">
>   <lst name="metadata">
>    <str name="error-class">org.apache.solr.common.SolrException</str>
>    <str name="root-error-class">org.apache.solr.common.SolrException</str>
>   </lst>
>   <str name="msg">Invalid Number: IsWithin(POLYGON((-80 29, -90 50, -60
> 70, 0 0, -80 29))) distErrPct=0</str>
>   <int name="code">400</int>
>  </lst>
> </response>
>
>
>
>
> Can Ezgi AYDEMİR
> Oracle Database Administrator
>
> İşlem Coğrafi Bilgi Sistemleri Müh. & Eğitim AŞ.
> 2024.Cadde No:14, Beysukent 06800, Ankara, Türkiye
> T : 0 312 233 50 00 .:. F : 0312 235 56 82
> E : cayde...@islem.com.tr .:. W : http://www.islem.com.tr/
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Sorting by distance resources with WKT polygon data

2017-09-19 Thread David Smiley
Hello,

Sorry for the belated response.

Solr only supports sorting from point or rectangles in the index.  For
rectangles use BBoxField.  For points, ideally use the new
LatLonPointSpatialField; failing that use LatLonType.  You can use RPT for
point data but I don't recommend sorting with it; use one of the others
just mentioned.
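
Concretely, something like this (docValues is what enables the distance
sort); your existing geofilt score=distance query should then sort as
expected:

  <fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>
  <field name="PositionGeo" type="location" indexed="true" stored="true"/>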

~ David

On Tue, Sep 12, 2017 at 5:09 PM Grondin Luc 
wrote:

> Hello,
>
> I am having difficulties with sorting by distance for resources indexed with
> WKT geolocation data. I have tried different field configurations and query
> parameters and I did not get working results.
>
> I am using SOLR 6.6 and JTS-core 1.14. My test sample includes resources
> with point coordinates plus one associated with a polygon. I tried using
> both fieldtypes "solr.SpatialRecursivePrefixTreeFieldType" and
> "solr.RptWithGeometrySpatialField". In both cases, I get good results if I
> do not care about sorting. The problem arises when I include sorting.
>
> With SpatialRecursivePrefixTreeFieldType:
>
> The best request I used, based on the documentation I could find, was:
>
> select?fl=*,score&q={!geofilt%20sfield=PositionGeo%20pt=45.52,-73.53%20d=10%20score=distance}&sort=score%20asc
>
> The distance appears to be correctly evaluated for resources indexed with
> point coordinates. However, it is wrong for the resource with a polygon
>
> <float name="score">2.3913236</float>
> <float name="score">4.3242383</float>
> <float name="score">4.671504</float>
> <float name="score">4.806902</float>
> <float name="score">20015.115</float>
>
> (Please note that I have verified the polygon externally and it is correct)
>
> With solr.RptWithGeometrySpatialField:
>
> I get an exception triggered by the presence of « score=distance » in the
> request «
> q={!geofilt%20sfield=PositionGeo%20pt=45.52,-73.53%20d=10%20score=distance}
> »
>
> java.lang.UnsupportedOperationException
> at
> org.apache.lucene.spatial.composite.CompositeSpatialStrategy.makeDistanceValueSource(CompositeSpatialStrategy.java:92)
> at
> org.apache.solr.schema.AbstractSpatialFieldType.getValueSourceFromSpatialArgs(AbstractSpatialFieldType.java:412)
> at
> org.apache.solr.schema.AbstractSpatialFieldType.getQueryFromSpatialArgs(AbstractSpatialFieldType.java:359)
> at
> org.apache.solr.schema.AbstractSpatialFieldType.createSpatialQuery(AbstractSpatialFieldType.java:308)
> at
> org.apache.solr.search.SpatialFilterQParser.parse(SpatialFilterQParser.java:80)
>
> From there, I am rather stuck with no ideas on how to resolve these
> problems. So advice in that regard would be much appreciated. I can
> provide more details if necessary.
>
> Thank you in advance,
>
>
>  ---
>   Luc Grondin
>   Digital information management analyst
>   Centre d'expertise numérique pour la recherche - Université de Montréal
>   telephone: 514-343-6111 ext. 3988  --
> luc.gron...@umontreal.ca
>
> --
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

