RE: Schema API specifying different analysers for query and index

2021-03-02 Thread ufuk yılmaz
It worked! Thanks Mr. Rafalovitch. I just removed the “type”: “query” keys from 
the JSON and used indexAnalyzer and queryAnalyzer in place of the analyzer JSON 
node.
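
For reference, a sketch of the replace-field-type request that this likely
corresponds to (reconstructed from the string_ci definition quoted below in the
thread; the collection name is a placeholder and the other field-type
properties are assumed unchanged):

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field-type": {
    "name": "string_ci",
    "class": "solr.TextField",
    "sortMissingLast": true,
    "omitNorms": true,
    "stored": true,
    "docValues": false,
    "indexAnalyzer": {
      "tokenizer": { "class": "solr.KeywordTokenizerFactory" },
      "filters": [ { "class": "solr.LowerCaseFilterFactory" } ]
    },
    "queryAnalyzer": {
      "tokenizer": { "class": "solr.StandardTokenizerFactory" },
      "filters": [ { "class": "solr.LowerCaseFilterFactory" } ]
    }
  }
}' http://localhost:8983/solr/<collection>/schema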

Sent from Mail for Windows 10

From: Alexandre Rafalovitch
Sent: 03 March 2021 01:19
To: solr-user
Subject: Re: Schema API specifying different analysers for query and index

RefGuide gives this for Adding; I would hope the Replace would be similar:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type":{
 "name":"myNewTextField",
 "class":"solr.TextField",
 "indexAnalyzer":{
"tokenizer":{
   "class":"solr.PathHierarchyTokenizerFactory",
   "delimiter":"/" }},
 "queryAnalyzer":{
"tokenizer":{
   "class":"solr.KeywordTokenizerFactory" }}}
}' http://localhost:8983/solr/gettingstarted/schema

So, indexAnalyzer/queryAnalyzer, rather than array:
https://lucene.apache.org/solr/guide/8_8/schema-api.html#add-a-new-field-type

Hope this works,
Alex.
P.s. Also check whether you are using the matching API and V1/V2 endpoint.

On Tue, 2 Mar 2021 at 15:25, ufuk yılmaz  wrote:
>
> Hello,
>
> I’m trying to change a field’s query analysers. The following works but it 
> replaces both index and query type analysers:
>
> {
> "replace-field-type": {
> "name": "string_ci",
> "class": "solr.TextField",
> "sortMissingLast": true,
> "omitNorms": true,
> "stored": true,
> "docValues": false,
> "analyzer": {
> "type": "query",
> "tokenizer": {
> "class": "solr.StandardTokenizerFactory"
> },
> "filters": [
> {
> "class": "solr.LowerCaseFilterFactory"
> }
> ]
> }
> }
> }
>
> I tried to change analyzer field to analyzers, to specify different analysers 
> for query and index, but it gave error:
>
> {
> "replace-field-type": {
> "name": "string_ci",
> "class": "solr.TextField",
> "sortMissingLast": true,
> "omitNorms": true,
> "stored": true,
> "docValues": false,
> "analyzers": [{
> "type": "query",
> "tokenizer": {
> "class": "solr.StandardTokenizerFactory"
> },
>     "filters": [
> {
> "class": "solr.LowerCaseFilterFactory"
> }
> ]
> },{
> "type": "index",
> "tokenizer": {
> "class": "solr.KeywordTokenizerFactory"
> },
> "filters": [
> {
> "class": "solr.LowerCaseFilterFactory"
> }
> ]
> }]
> }
> }
>
> "errorMessages":["Plugin init failure for [schema.xml]
> "msg":"error processing commands",...
>
> How can I specify different analyzers for query and index type when using 
> schema api?
>
> Sent from Mail for Windows 10
>



Re: Schema API specifying different analysers for query and index

2021-03-02 Thread Alexandre Rafalovitch
RefGuide gives this for Adding; I would hope the Replace would be similar:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type":{
 "name":"myNewTextField",
 "class":"solr.TextField",
 "indexAnalyzer":{
"tokenizer":{
   "class":"solr.PathHierarchyTokenizerFactory",
   "delimiter":"/" }},
 "queryAnalyzer":{
"tokenizer":{
   "class":"solr.KeywordTokenizerFactory" }}}
}' http://localhost:8983/solr/gettingstarted/schema

So, indexAnalyzer/queryAnalyzer, rather than array:
https://lucene.apache.org/solr/guide/8_8/schema-api.html#add-a-new-field-type

Hope this works,
Alex.
P.s. Also check whether you are using the matching API and V1/V2 endpoint.

On Tue, 2 Mar 2021 at 15:25, ufuk yılmaz  wrote:
>
> Hello,
>
> I’m trying to change a field’s query analysers. The following works but it 
> replaces both index and query type analysers:
>
> {
> "replace-field-type": {
> "name": "string_ci",
> "class": "solr.TextField",
> "sortMissingLast": true,
> "omitNorms": true,
> "stored": true,
> "docValues": false,
> "analyzer": {
> "type": "query",
> "tokenizer": {
> "class": "solr.StandardTokenizerFactory"
> },
> "filters": [
> {
> "class": "solr.LowerCaseFilterFactory"
> }
> ]
> }
> }
> }
>
> I tried to change analyzer field to analyzers, to specify different analysers 
> for query and index, but it gave error:
>
> {
> "replace-field-type": {
> "name": "string_ci",
> "class": "solr.TextField",
> "sortMissingLast": true,
> "omitNorms": true,
> "stored": true,
> "docValues": false,
> "analyzers": [{
> "type": "query",
> "tokenizer": {
> "class": "solr.StandardTokenizerFactory"
> },
>     "filters": [
> {
> "class": "solr.LowerCaseFilterFactory"
> }
> ]
> },{
> "type": "index",
> "tokenizer": {
> "class": "solr.KeywordTokenizerFactory"
> },
> "filters": [
> {
> "class": "solr.LowerCaseFilterFactory"
> }
> ]
> }]
> }
> }
>
> "errorMessages":["Plugin init failure for [schema.xml]
> "msg":"error processing commands",...
>
> How can I specify different analyzers for query and index type when using 
> schema api?
>
> Sent from Mail for Windows 10
>


Schema API specifying different analysers for query and index

2021-03-02 Thread ufuk yılmaz
Hello,

I’m trying to change a field’s query analysers. The following works, but it 
replaces both the index and query analysers:

{
"replace-field-type": {
"name": "string_ci",
"class": "solr.TextField",
"sortMissingLast": true,
"omitNorms": true,
"stored": true,
"docValues": false,
"analyzer": {
"type": "query",
"tokenizer": {
"class": "solr.StandardTokenizerFactory"
},
"filters": [
{
"class": "solr.LowerCaseFilterFactory"
}
]
}
}
}

I tried changing the analyzer field to analyzers to specify different analysers 
for query and index, but it gave an error:

{
"replace-field-type": {
"name": "string_ci",
"class": "solr.TextField",
"sortMissingLast": true,
"omitNorms": true,
"stored": true,
"docValues": false,
"analyzers": [{
"type": "query",
"tokenizer": {
"class": "solr.StandardTokenizerFactory"
},
"filters": [
{
"class": "solr.LowerCaseFilterFactory"
}
]
},{
"type": "index",
"tokenizer": {
"class": "solr.KeywordTokenizerFactory"
},
"filters": [
{
"class": "solr.LowerCaseFilterFactory"
}
]
}]
}
}

"errorMessages":["Plugin init failure for [schema.xml]
"msg":"error processing commands",...

How can I specify different analyzers for query and index when using the 
Schema API?

Sent from Mail for Windows 10



Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-19 Thread David Smiley
Even if you could do an "fl" with the ability to exclude certain fields, it
raises the question of what goes into the document cache.  The doc cache is
doc oriented, not field oriented.  So there needs to be some sort of
stand-in value if you don't want to cache a value there, and that ends
up being LazyField if you have that feature enabled, or possibly wasted
space if you don't.  So I don't think the ability to exclude fields in "fl"
would obsolete enableLazyFieldLoading, which I think you are implying?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Feb 19, 2021 at 10:10 AM Gus Heck  wrote:

> Actually I suspect it's there because the ability to exclude fields
> rather than include them is still pending...
> https://issues.apache.org/jira/browse/SOLR-3191
> See also
> https://issues.apache.org/jira/browse/SOLR-10367
> https://issues.apache.org/jira/browse/SOLR-9467
>
> All of these and lazy field loading are motivated by the case where you
> have a very large stored field and you sometimes don't want it, but do want
> everything else, and an explicit list of fields is not convenient (i.e. the
> field list would have to be hard coded in an application, or alternately
> require some sort of schema parsing to build a list of possible fields or
> other severe ugliness..)
>
> -Gus
>
> On Thu, Feb 18, 2021 at 8:42 AM David Smiley  wrote:
>
> > IMO enableLazyFieldLoading is a small optimization for most apps.  It
> saves
> > memory in the document cache at the expense of increased latency if your
> > usage pattern wants a field later that wasn't requested earlier.  You'd
> > probably need detailed metrics/benchmarks to observe a difference, and
> you
> > might reach a conclusion that enableLazyFieldLoading is best at "false"
> for
> > you irrespective of the bug.  I suspect it may have been developed for
> > particularly large document use-cases where you don't normally need some
> > large text fields for retrieval/highlighting.  For example imagine if you
> > stored the entire input data as JSON in a _json_ field or some-such.
> > Nowadays, I'd set large="true" on such a field, which is a much newer
> > option.
> >
> > I was able to tweak my test to have only alphabetic IDs, and the test
> still
> > failed.  I don't see how the ID's contents/format could cause any effect.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Thu, Feb 18, 2021 at 5:04 AM Nussbaum, Ronen <
> ronen.nussb...@verint.com
> > >
> > wrote:
> >
> > > You're right, I was able to reproduce it too without highlighting.
> > > Regarding the existing bug, I think there might be an additional issue
> > > here because it happens only when id field contains an underscore
> (didn't
> > > check for other special characters).
> > > Currently I have no other choice but to use
> enableLazyFieldLoading=false.
> > > I hope it wouldn't have a significant performance impact.
> > >
> > > -Original Message-
> > > From: David Smiley 
> > > Sent: יום ה 18 פברואר 2021 01:03
> > > To: solr-user 
> > > Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy Field
> > > Loading => Invalid Index
> > >
> > > I think the issue is this existing bug, but needs to refer to
> > > toSolrInputDocument instead of toSolrDoc:
> > > https://issues.apache.org/jira/browse/SOLR-13034
> > > Highlighting isn't involved; you just need to somehow get a document
> > > cached with lazy fields.  In a test I was able to do this simply by
> > doing a
> > > query that only returns the "id" field.  No highlighting.
> > >
> > > ~ David Smiley
> > > Apache Lucene/Solr Search Developer
> > > http://www.linkedin.com/in/davidwsmiley
> > >
> > >
> > > On Wed, Feb 17, 2021 at 10:28 AM David Smiley 
> > wrote:
> > >
> > > > Thanks for more details.  I was able to reproduce this locally!  I
> > > > hacked a test to look similar to what you are doing.  BTW it's okay
> to
> > > > fill out a JIRA imperfectly; they can always be edited :-).  Once I
> > > > better understand the nature of the bug today, I'll file an issue and
> > > respond with it here.
> > > >
> > > > ~ David Smiley
> > > > Apache Lucene/Solr Search Developer
> > > > http://www.linkedin.com/in/davidwsmiley
> > > >
> >

Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-19 Thread Gus Heck
Actually I suspect it's there because the ability to exclude fields
rather than include them is still pending...
https://issues.apache.org/jira/browse/SOLR-3191
See also
https://issues.apache.org/jira/browse/SOLR-10367
https://issues.apache.org/jira/browse/SOLR-9467

All of these and lazy field loading are motivated by the case where you
have a very large stored field and you sometimes don't want it, but do want
everything else, and an explicit list of fields is not convenient (i.e. the
field list would have to be hard coded in an application, or alternately
require some sort of schema parsing to build a list of possible fields or
other severe ugliness..)

-Gus

On Thu, Feb 18, 2021 at 8:42 AM David Smiley  wrote:

> IMO enableLazyFieldLoading is a small optimization for most apps.  It saves
> memory in the document cache at the expense of increased latency if your
> usage pattern wants a field later that wasn't requested earlier.  You'd
> probably need detailed metrics/benchmarks to observe a difference, and you
> might reach a conclusion that enableLazyFieldLoading is best at "false" for
> you irrespective of the bug.  I suspect it may have been developed for
> particularly large document use-cases where you don't normally need some
> large text fields for retrieval/highlighting.  For example imagine if you
> stored the entire input data as JSON in a _json_ field or some-such.
> Nowadays, I'd set large="true" on such a field, which is a much newer
> option.
>
> I was able to tweak my test to have only alphabetic IDs, and the test still
> failed.  I don't see how the ID's contents/format could cause any effect.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Thu, Feb 18, 2021 at 5:04 AM Nussbaum, Ronen  >
> wrote:
>
> > You're right, I was able to reproduce it too without highlighting.
> > Regarding the existing bug, I think there might be an additional issue
> > here because it happens only when id field contains an underscore (didn't
> > check for other special characters).
> > Currently I have no other choice but to use enableLazyFieldLoading=false.
> > I hope it wouldn't have a significant performance impact.
> >
> > -Original Message-
> > From: David Smiley 
> > Sent: יום ה 18 פברואר 2021 01:03
> > To: solr-user 
> > Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy Field
> > Loading => Invalid Index
> >
> > I think the issue is this existing bug, but needs to refer to
> > toSolrInputDocument instead of toSolrDoc:
> > https://issues.apache.org/jira/browse/SOLR-13034
> > Highlighting isn't involved; you just need to somehow get a document
> > cached with lazy fields.  In a test I was able to do this simply by
> doing a
> > query that only returns the "id" field.  No highlighting.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Wed, Feb 17, 2021 at 10:28 AM David Smiley 
> wrote:
> >
> > > Thanks for more details.  I was able to reproduce this locally!  I
> > > hacked a test to look similar to what you are doing.  BTW it's okay to
> > > fill out a JIRA imperfectly; they can always be edited :-).  Once I
> > > better understand the nature of the bug today, I'll file an issue and
> > respond with it here.
> > >
> > > ~ David Smiley
> > > Apache Lucene/Solr Search Developer
> > > http://www.linkedin.com/in/davidwsmiley
> > >
> > >
> > > On Wed, Feb 17, 2021 at 6:36 AM Nussbaum, Ronen
> > > 
> > > wrote:
> > >
> > >> Hello David,
> > >>
> > >> Thank you for your reply.
> > >> It was very hard but finally I discovered how to reproduce it. I
> > >> thought of issuing an issue but wasn't sure about the components and
> > priority.
> > >> I used the "tech products" configset, with the following changes:
> > >> 1. Added <fieldType name="_nest_path_" class="solr.NestPathField" />
> > >> 2. Added <field name="text_en" type="text_en" indexed="true"
> > >> stored="true" termVectors="true" termOffsets="true" termPositions="true"
> > >> required="false" multiValued="true" /> Then I inserted one document
> > >> with a nested child e.g.
> > >> {id:"abc_1", utterances:{id:"abc_1-1", text_en:"Solr is great"}}
> > >>
> > >> To reproduce:

Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-18 Thread David Smiley
IMO enableLazyFieldLoading is a small optimization for most apps.  It saves
memory in the document cache at the expense of increased latency if your
usage pattern wants a field later that wasn't requested earlier.  You'd
probably need detailed metrics/benchmarks to observe a difference, and you
might reach a conclusion that enableLazyFieldLoading is best at "false" for
you irrespective of the bug.  I suspect it may have been developed for
particularly large document use-cases where you don't normally need some
large text fields for retrieval/highlighting.  For example imagine if you
stored the entire input data as JSON in a _json_ field or some-such.
Nowadays, I'd set large="true" on such a field, which is a much newer
option.
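
(For illustration, a minimal sketch of such a field declaration; the field type
and attribute values here are assumptions, not taken from this thread. The
large attribute requires stored="true" and multiValued="false".)

<field name="_json_" type="string" indexed="false" stored="true"
   multiValued="false" docValues="false" large="true"/>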

I was able to tweak my test to have only alphabetic IDs, and the test still
failed.  I don't see how the ID's contents/format could cause any effect.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Feb 18, 2021 at 5:04 AM Nussbaum, Ronen 
wrote:

> You're right, I was able to reproduce it too without highlighting.
> Regarding the existing bug, I think there might be an additional issue
> here because it happens only when id field contains an underscore (didn't
> check for other special characters).
> Currently I have no other choice but to use enableLazyFieldLoading=false.
> I hope it wouldn't have a significant performance impact.
>
> -Original Message-
> From: David Smiley 
> Sent: יום ה 18 פברואר 2021 01:03
> To: solr-user 
> Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy Field
> Loading => Invalid Index
>
> I think the issue is this existing bug, but needs to refer to
> toSolrInputDocument instead of toSolrDoc:
> https://issues.apache.org/jira/browse/SOLR-13034
> Highlighting isn't involved; you just need to somehow get a document
> cached with lazy fields.  In a test I was able to do this simply by doing a
> query that only returns the "id" field.  No highlighting.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, Feb 17, 2021 at 10:28 AM David Smiley  wrote:
>
> > Thanks for more details.  I was able to reproduce this locally!  I
> > hacked a test to look similar to what you are doing.  BTW it's okay to
> > fill out a JIRA imperfectly; they can always be edited :-).  Once I
> > better understand the nature of the bug today, I'll file an issue and
> respond with it here.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Wed, Feb 17, 2021 at 6:36 AM Nussbaum, Ronen
> > 
> > wrote:
> >
> >> Hello David,
> >>
> >> Thank you for your reply.
> >> It was very hard but finally I discovered how to reproduce it. I
> >> thought of issuing an issue but wasn't sure about the components and
> priority.
> >> I used the "tech products" configset, with the following changes:
> >> 1. Added <fieldType name="_nest_path_" class="solr.NestPathField" />
> >> 2. Added <field name="text_en" type="text_en" indexed="true"
> >> stored="true" termVectors="true" termOffsets="true" termPositions="true"
> >> required="false" multiValued="true" /> Then I inserted one document
> >> with a nested child e.g.
> >> {id:"abc_1", utterances:{id:"abc_1-1", text_en:"Solr is great"}}
> >>
> >> To reproduce:
> >> Do a search with surround and unified highlighter:
> >>
>> hl.fl=text_en&hl.method=unified&hl=on&q=%7B!surround%7Dtext_en%3A4W("solr"%2C"great")
> >>
> >> Now, try to update the parent e.g. {id:"abc_1", categories_i:{add:1}}
> >>
> >> Important: it happens only when "id" contains underscore characters!
> >> If you'll use "abc-1" it would work.
> >>
> >> Thanks in advance,
> >> Ronen.
> >>
> >> -Original Message-
> >> From: David Smiley 
> >> Sent: יום א 14 פברואר 2021 19:17
> >> To: solr-user 
> >> Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy
> >> Field Loading => Invalid Index
> >>
> >> Hello Ronen,
> >>
> >> Can you please file a JIRA issue?  Some quick searches did not turn
> >> anything up.  It would be super helpful to me if you could list a
> >> series of steps with Solr out-of-the-box in 8.8 including what data
> >> to i

RE: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-18 Thread Nussbaum, Ronen
You're right, I was able to reproduce it too without highlighting.
Regarding the existing bug, I think there might be an additional issue here 
because it happens only when id field contains an underscore (didn't check for 
other special characters).
Currently I have no other choice but to use enableLazyFieldLoading=false. I 
hope it wouldn't have a significant performance impact.

-Original Message-
From: David Smiley 
Sent: יום ה 18 פברואר 2021 01:03
To: solr-user 
Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading 
=> Invalid Index

I think the issue is this existing bug, but needs to refer to 
toSolrInputDocument instead of toSolrDoc:
https://issues.apache.org/jira/browse/SOLR-13034
Highlighting isn't involved; you just need to somehow get a document cached 
with lazy fields.  In a test I was able to do this simply by doing a query that 
only returns the "id" field.  No highlighting.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Feb 17, 2021 at 10:28 AM David Smiley  wrote:

> Thanks for more details.  I was able to reproduce this locally!  I
> hacked a test to look similar to what you are doing.  BTW it's okay to
> fill out a JIRA imperfectly; they can always be edited :-).  Once I
> better understand the nature of the bug today, I'll file an issue and respond 
> with it here.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, Feb 17, 2021 at 6:36 AM Nussbaum, Ronen
> 
> wrote:
>
>> Hello David,
>>
>> Thank you for your reply.
>> It was very hard but finally I discovered how to reproduce it. I
>> thought of issuing an issue but wasn't sure about the components and 
>> priority.
>> I used the "tech products" configset, with the following changes:
>> 1. Added <fieldType name="_nest_path_" class="solr.NestPathField" />
>> 2. Added <field name="text_en" type="text_en" indexed="true"
>> stored="true" termVectors="true" termOffsets="true" termPositions="true"
>> required="false" multiValued="true" /> Then I inserted one document
>> with a nested child e.g.
>> {id:"abc_1", utterances:{id:"abc_1-1", text_en:"Solr is great"}}
>>
>> To reproduce:
>> Do a search with surround and unified highlighter:
>>
>> hl.fl=text_en&hl.method=unified&hl=on&q=%7B!surround%7Dtext_en%3A4W("solr"%2C"great")
>>
>> Now, try to update the parent e.g. {id:"abc_1", categories_i:{add:1}}
>>
>> Important: it happens only when "id" contains underscore characters!
>> If you'll use "abc-1" it would work.
>>
>> Thanks in advance,
>> Ronen.
>>
>> -Original Message-
>> From: David Smiley 
>> Sent: יום א 14 פברואר 2021 19:17
>> To: solr-user 
>> Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy
>> Field Loading => Invalid Index
>>
>> Hello Ronen,
>>
>> Can you please file a JIRA issue?  Some quick searches did not turn
>> anything up.  It would be super helpful to me if you could list a
>> series of steps with Solr out-of-the-box in 8.8 including what data
>> to index and query.  Solr already includes the "tech products" sample
>> data; maybe that can illustrate the problem?  It's not clear if
>> nested schema or nested docs are actually required in your example.
>> If you share the JIRA issue with me, I'll chase this one down.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Sun, Feb 14, 2021 at 11:16 AM Ronen Nussbaum 
>> wrote:
>>
>> > Hi All,
>> >
>> > I discovered a strange behaviour with this combination.
>> > Not only the atomic update fails, the child documents are not
>> > properly indexed, and you can't use highlights on their text
>> > fields. Currently there is no workaround other than reindex.
>> >
>> > Checked on 8.3.0, 8.6.1 and 8.8.0.
>> > 1. Configure nested schema.
>> > 2. enableLazyFieldLoading is true (default).
>> > 3. Run a search with hl.method=unified and hl.fl=> > fields> 4. Trying to do an atomic update on some of the *parents*
>> > fields> of
>> > the returned documents from #3.
>> >
>> > You get an error: "TransactionLog doesn't know how to serialize
>> > class org.apache.lucene.document.LazyDocument$LazyField".
>> >
>> > Now trying t

Re: Meaning of "Index" flag under properties and schema

2021-02-17 Thread Alexandre Rafalovitch
I wonder if looking more directly at the indexes would allow you to
get closer to the problem source.

Have you tried comparing/exploring the indexes with Luke? It is in the
Lucene distribution (not Solr), and there is a small explanation here:
https://mocobeta.medium.com/luke-become-an-apache-lucene-module-as-of-lucene-8-1-7d139c998b2

Regards,
   Alex.

On Wed, 17 Feb 2021 at 16:58, Vivaldi  wrote:
>
> I was getting “illegal argument exception length must be >= 1” when I used 
> significantTerms streaming expression, from this collection and field. I 
> asked about that as a separate question on this list. I will get the whole 
> exception stack trace the next time I am at the customer site.
>
> Why any other field in other collections doesn’t have that flag? We have 
> numerous indexed, non-indexed, docvalues fields in other collections but not 
> that row
>
> Sent from my iPhone
>
> > On 16 Feb 2021, at 20:42, Shawn Heisey  wrote:
> >
> >> On 2/16/2021 9:16 AM, ufuk yılmaz wrote:
> >> I didn’t realise that, sorry. The table is like:
> >> Flags       Indexed  Tokenized  Stored  UnInvertible
> >> Properties  Yes      Yes        Yes     Yes
> >> Schema      Yes      Yes        Yes     Yes
> >> Index       Yes      Yes        Yes     NO
> >> Problematic collection has a Index row under Schema row. No other 
> >> collection has it. I was asking about what the “Index” meant
> >
> > I am not completely sure, but I think that row means the field was found in 
> > the actual Lucene index.
> >
> > In the original message you mentioned "weird exceptions" but didn't include 
> > any information about them.  Can you give us those exceptions, and the 
> > requests that caused them?
> >
> > Thanks,
> > Shawn
>


Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-17 Thread David Smiley
I think the issue is this existing bug, but needs to refer to
toSolrInputDocument instead of toSolrDoc:
https://issues.apache.org/jira/browse/SOLR-13034
Highlighting isn't involved; you just need to somehow get a document cached
with lazy fields.  In a test I was able to do this simply by doing a query
that only returns the "id" field.  No highlighting.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Feb 17, 2021 at 10:28 AM David Smiley  wrote:

> Thanks for more details.  I was able to reproduce this locally!  I hacked
> a test to look similar to what you are doing.  BTW it's okay to fill out a
> JIRA imperfectly; they can always be edited :-).  Once I better understand
> the nature of the bug today, I'll file an issue and respond with it here.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, Feb 17, 2021 at 6:36 AM Nussbaum, Ronen 
> wrote:
>
>> Hello David,
>>
>> Thank you for your reply.
>> It was very hard but finally I discovered how to reproduce it. I thought
>> of issuing an issue but wasn't sure about the components and priority.
>> I used the "tech products" configset, with the following changes:
>> 1. Added > name="_nest_path_" class="solr.NestPathField" />
>> 2. Added > stored="true" termVectors="true" termOffsets="true" termPositions="true"
>> required="false" multiValued="true" />
>> Than I inserted one document with a nested child e.g.
>> {id:"abc_1", utterances:{id:"abc_1-1", text_en:"Solr is great"}}
>>
>> To reproduce:
>> Do a search with surround and unified highlighter:
>>
>> hl.fl=text_en&hl.method=unified&hl=on&q=%7B!surround%7Dtext_en%3A4W("solr"%2C"great")
>>
>> Now, try to update the parent e.g. {id:"abc_1", categories_i:{add:1}}
>>
>> Important: it happens only when "id" contains underscore characters! If
>> you'll use "abc-1" it would work.
>>
>> Thanks in advance,
>> Ronen.
>>
>> -Original Message-
>> From: David Smiley 
>> Sent: יום א 14 פברואר 2021 19:17
>> To: solr-user 
>> Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy Field
>> Loading => Invalid Index
>>
>> Hello Ronen,
>>
>> Can you please file a JIRA issue?  Some quick searches did not turn
>> anything up.  It would be super helpful to me if you could list a series of
>> steps with Solr out-of-the-box in 8.8 including what data to index and
>> query.  Solr already includes the "tech products" sample data; maybe that
>> can illustrate the problem?  It's not clear if nested schema or nested docs
>> are actually required in your example.  If you share the JIRA issue with
>> me, I'll chase this one down.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Sun, Feb 14, 2021 at 11:16 AM Ronen Nussbaum 
>> wrote:
>>
>> > Hi All,
>> >
>> > I discovered a strange behaviour with this combination.
>> > Not only the atomic update fails, the child documents are not properly
>> > indexed, and you can't use highlights on their text fields. Currently
>> > there is no workaround other than reindex.
>> >
>> > Checked on 8.3.0, 8.6.1 and 8.8.0.
>> > 1. Configure nested schema.
>> > 2. enableLazyFieldLoading is true (default).
>> > 3. Run a search with hl.method=unified and hl.fl=> > fields> 4. Trying to do an atomic update on some of the *parents* of
>> > the returned documents from #3.
>> >
>> > You get an error: "TransactionLog doesn't know how to serialize class
>> > org.apache.lucene.document.LazyDocument$LazyField".
>> >
>> > Now trying to run #3 again yields an error message that the text field
>> > is indexed without positions.
>> >
>> > If enableLazyFieldLoading is false or if using the default highlighter
>> > this doesn't happen.
>> >
>> > Ronen.
>> >
>>
>>
>> This electronic message may contain proprietary and confidential
>> information of Verint Systems Inc., its affiliates and/or subsidiaries. The
>> information is intended to be for the use of the individual(s) or
>> entity(ies) named above. If you are not the intended recipient (or
>> authorized to receive this e-mail for the intended recipient), you may not
>> use, copy, disclose or distribute to anyone this message or any information
>> contained in this message. If you have received this electronic message in
>> error, please notify us by replying to this e-mail.
>>
>


Re: Meaning of "Index" flag under properties and schema

2021-02-17 Thread Vivaldi
I was getting “illegal argument exception length must be >= 1” when I used the 
significantTerms streaming expression on this collection and field. I asked 
about that as a separate question on this list. I will get the whole exception 
stack trace the next time I am at the customer site.

Why doesn't any other field in the other collections have that flag? We have 
numerous indexed, non-indexed, and docValues fields in other collections, but 
none of them show that row.

Sent from my iPhone

> On 16 Feb 2021, at 20:42, Shawn Heisey  wrote:
> 
>> On 2/16/2021 9:16 AM, ufuk yılmaz wrote:
>> I didn’t realise that, sorry. The table is like:
>> Flags       Indexed  Tokenized  Stored  UnInvertible
>> Properties  Yes      Yes        Yes     Yes
>> Schema      Yes      Yes        Yes     Yes
>> Index       Yes      Yes        Yes     NO
>> Problematic collection has a Index row under Schema row. No other collection 
>> has it. I was asking about what the “Index” meant
> 
> I am not completely sure, but I think that row means the field was found in 
> the actual Lucene index.
> 
> In the original message you mentioned "weird exceptions" but didn't include 
> any information about them.  Can you give us those exceptions, and the 
> requests that caused them?
> 
> Thanks,
> Shawn



Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-17 Thread David Smiley
Thanks for more details.  I was able to reproduce this locally!  I hacked a
test to look similar to what you are doing.  BTW it's okay to fill out a
JIRA imperfectly; they can always be edited :-).  Once I better understand
the nature of the bug today, I'll file an issue and respond with it here.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Feb 17, 2021 at 6:36 AM Nussbaum, Ronen 
wrote:

> Hello David,
>
> Thank you for your reply.
> It was very hard but finally I discovered how to reproduce it. I thought
> of issuing an issue but wasn't sure about the components and priority.
> I used the "tech products" configset, with the following changes:
> 1. Added <fieldType name="_nest_path_" class="solr.NestPathField" />
> 2. Added <field name="text_en" type="text_en" indexed="true" stored="true"
> termVectors="true" termOffsets="true" termPositions="true" required="false"
> multiValued="true" />
> Then I inserted one document with a nested child e.g.
> {id:"abc_1", utterances:{id:"abc_1-1", text_en:"Solr is great"}}
>
> To reproduce:
> Do a search with surround and unified highlighter:
>
> hl.fl=text_en&hl.method=unified&hl=on&q=%7B!surround%7Dtext_en%3A4W("solr"%2C"great")
>
> Now, try to update the parent e.g. {id:"abc_1", categories_i:{add:1}}
>
> Important: it happens only when "id" contains underscore characters! If
> you'll use "abc-1" it would work.
>
> Thanks in advance,
> Ronen.
>
> -Original Message-
> From: David Smiley 
> Sent: יום א 14 פברואר 2021 19:17
> To: solr-user 
> Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy Field
> Loading => Invalid Index
>
> Hello Ronen,
>
> Can you please file a JIRA issue?  Some quick searches did not turn
> anything up.  It would be super helpful to me if you could list a series of
> steps with Solr out-of-the-box in 8.8 including what data to index and
> query.  Solr already includes the "tech products" sample data; maybe that
> can illustrate the problem?  It's not clear if nested schema or nested docs
> are actually required in your example.  If you share the JIRA issue with
> me, I'll chase this one down.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Sun, Feb 14, 2021 at 11:16 AM Ronen Nussbaum  wrote:
>
> > Hi All,
> >
> > I discovered a strange behaviour with this combination.
> > Not only the atomic update fails, the child documents are not properly
> > indexed, and you can't use highlights on their text fields. Currently
> > there is no workaround other than reindex.
> >
> > Checked on 8.3.0, 8.6.1 and 8.8.0.
> > 1. Configure nested schema.
> > 2. enableLazyFieldLoading is true (default).
> > 3. Run a search with hl.method=unified and hl.fl= > fields> 4. Trying to do an atomic update on some of the *parents* of
> > the returned documents from #3.
> >
> > You get an error: "TransactionLog doesn't know how to serialize class
> > org.apache.lucene.document.LazyDocument$LazyField".
> >
> > Now trying to run #3 again yields an error message that the text field
> > is indexed without positions.
> >
> > If enableLazyFieldLoading is false or if using the default highlighter
> > this doesn't happen.
> >
> > Ronen.
> >
>
>
> This electronic message may contain proprietary and confidential
> information of Verint Systems Inc., its affiliates and/or subsidiaries. The
> information is intended to be for the use of the individual(s) or
> entity(ies) named above. If you are not the intended recipient (or
> authorized to receive this e-mail for the intended recipient), you may not
> use, copy, disclose or distribute to anyone this message or any information
> contained in this message. If you have received this electronic message in
> error, please notify us by replying to this e-mail.
>


RE: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-17 Thread Nussbaum, Ronen
Hello David,

Thank you for your reply.
It was very hard, but finally I discovered how to reproduce it. I thought of 
filing an issue but wasn't sure about the components and priority.
I used the "tech products" configset, with the following changes:
1. Added <fieldType name="_nest_path_" class="solr.NestPathField" />
2. Added <field name="text_en" type="text_en" indexed="true" stored="true" termVectors="true" termOffsets="true" termPositions="true" required="false" multiValued="true" />
Then I inserted one document with a nested child e.g.
{id:"abc_1", utterances:{id:"abc_1-1", text_en:"Solr is great"}}

To reproduce:
Do a search with surround and unified highlighter:
hl.fl=text_en&hl.method=unified&hl=on&q=%7B!surround%7Dtext_en%3A4W("solr"%2C"great")

Now, try to update the parent e.g. {id:"abc_1", categories_i:{add:1}}

Important: it happens only when "id" contains underscore characters! If you'll 
use "abc-1" it would work.

Thanks in advance,
Ronen.

-Original Message-
From: David Smiley 
Sent: יום א 14 פברואר 2021 19:17
To: solr-user 
Subject: Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading 
=> Invalid Index

Hello Ronen,

Can you please file a JIRA issue?  Some quick searches did not turn anything 
up.  It would be super helpful to me if you could list a series of steps with 
Solr out-of-the-box in 8.8 including what data to index and query.  Solr 
already includes the "tech products" sample data; maybe that can illustrate the 
problem?  It's not clear if nested schema or nested docs are actually required 
in your example.  If you share the JIRA issue with me, I'll chase this one down.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, Feb 14, 2021 at 11:16 AM Ronen Nussbaum  wrote:

> Hi All,
>
> I discovered a strange behaviour with this combination.
> Not only the atomic update fails, the child documents are not properly
> indexed, and you can't use highlights on their text fields. Currently
> there is no workaround other than reindex.
>
> Checked on 8.3.0, 8.6.1 and 8.8.0.
> 1. Configure nested schema.
> 2. enableLazyFieldLoading is true (default).
> 3. Run a search with hl.method=unified and hl.fl= fields> 4. Trying to do an atomic update on some of the *parents* of
> the returned documents from #3.
>
> You get an error: "TransactionLog doesn't know how to serialize class
> org.apache.lucene.document.LazyDocument$LazyField".
>
> Now trying to run #3 again yields an error message that the text field
> is indexed without positions.
>
> If enableLazyFieldLoading is false or if using the default highlighter
> this doesn't happen.
>
> Ronen.
>


This electronic message may contain proprietary and confidential information of 
Verint Systems Inc., its affiliates and/or subsidiaries. The information is 
intended to be for the use of the individual(s) or entity(ies) named above. If 
you are not the intended recipient (or authorized to receive this e-mail for 
the intended recipient), you may not use, copy, disclose or distribute to 
anyone this message or any information contained in this message. If you have 
received this electronic message in error, please notify us by replying to this 
e-mail.


Re: Meaning of "Index" flag under properties and schema

2021-02-16 Thread Shawn Heisey

On 2/16/2021 9:16 AM, ufuk yılmaz wrote:

I didn’t realise that, sorry. The table is like:

Flags       Indexed  Tokenized  Stored  UnInvertible

Properties  Yes      Yes        Yes     Yes
Schema      Yes      Yes        Yes     Yes
Index       Yes      Yes        Yes     NO

Problematic collection has a Index row under Schema row. No other collection 
has it. I was asking about what the “Index” meant


I am not completely sure, but I think that row means the field was found 
in the actual Lucene index.


In the original message you mentioned "weird exceptions" but didn't 
include any information about them.  Can you give us those exceptions, 
and the requests that caused them?


Thanks,
Shawn


RE: Meaning of "Index" flag under properties and schema

2021-02-16 Thread ufuk yılmaz
I didn’t realise that, sorry. The table is like:

Flags       Indexed  Tokenized  Stored  UnInvertible

Properties  Yes      Yes        Yes     Yes
Schema      Yes      Yes        Yes     Yes
Index       Yes      Yes        Yes     NO


The problematic collection has an Index row under the Schema row. No other collection 
has it. I was asking about what the “Index” row meant.

-ufuk

Sent from Mail for Windows 10

From: Charlie Hull
Sent: 16 February 2021 18:48
To: solr-user@lucene.apache.org
Subject: Re: Meaning of "Index" flag under properties and schema

This list strips attachments so you'll have to figure out another way to 
show the difference,

Cheers

Charlie

On 16/02/2021 15:16, ufuk yılmaz wrote:
>
> There’s a collection at our customer’s site giving weird exceptions 
> when a particular field is involved (asked another question detailing 
> that).
>
> When I inspected it, there’s only one difference between it and other 
> dozens of fine working collections, which is,
>
> A text_general field in all other collections has the above 
> configuration without my artsy paint edits, but only that problematic 
> collection has an “index” flag with indexed tokenized and stored 
> checked. I never saw this “Index” flag before. What does it mean?
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for 
> Windows 10
>

-- 
Charlie Hull - Managing Consultant at OpenSource Connections Limited 

Founding member of The Search Network <https://thesearchnetwork.com/> 
and co-author of Searching the Enterprise 
<https://opensourceconnections.com/about-us/books-resources/>
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828



Re: Meaning of "Index" flag under properties and schema

2021-02-16 Thread Charlie Hull
This list strips attachments so you'll have to figure out another way to 
show the difference,


Cheers

Charlie

On 16/02/2021 15:16, ufuk yılmaz wrote:


There’s a collection at our customer’s site giving weird exceptions 
when a particular field is involved (asked another question detailing 
that).


When I inspected it, there’s only one difference between it and other 
dozens of fine working collections, which is,


A text_general field in all other collections has the above 
configuration without my artsy paint edits, but only that problematic 
collection has an “index” flag with indexed tokenized and stored 
checked. I never saw this “Index” flag before. What does it mean?


Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for 
Windows 10




--
Charlie Hull - Managing Consultant at OpenSource Connections Limited 

Founding member of The Search Network <https://thesearchnetwork.com/> 
and co-author of Searching the Enterprise 
<https://opensourceconnections.com/about-us/books-resources/>

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828


Meaning of "Index" flag under properties and schema

2021-02-16 Thread ufuk yılmaz

There’s a collection at our customer’s site giving weird exceptions when a 
particular field is involved (asked another question detailing that).

When I inspected it, there’s only one difference between it and other dozens of 
fine working collections, which is,


A text_general field in all other collections has the above configuration 
(without my artsy paint edits), but only that problematic collection has an 
“Index” flag with Indexed, Tokenized and Stored checked. I never saw this 
“Index” flag before. What does it mean?




Sent from Mail for Windows 10



Re: Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-14 Thread David Smiley
Hello Ronen,

Can you please file a JIRA issue?  Some quick searches did not turn
anything up.  It would be super helpful to me if you could list a series of
steps with Solr out-of-the-box in 8.8 including what data to index and
query.  Solr already includes the "tech products" sample data; maybe that
can illustrate the problem?  It's not clear if nested schema or nested docs
are actually required in your example.  If you share the JIRA issue with
me, I'll chase this one down.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, Feb 14, 2021 at 11:16 AM Ronen Nussbaum  wrote:

> Hi All,
>
> I discovered a strange behaviour with this combination.
> Not only the atomic update fails, the child documents are not properly
> indexed, and you can't use highlights on their text fields. Currently there
> is no workaround other than reindex.
>
> Checked on 8.3.0, 8.6.1 and 8.8.0.
> 1. Configure nested schema.
> 2. enableLazyFieldLoading is true (default).
> 3. Run a search with hl.method=unified and hl.fl=
> 4. Trying to do an atomic update on some of the *parents* of the returned
> documents from #3.
>
> You get an error: "TransactionLog doesn't know how to serialize class
> org.apache.lucene.document.LazyDocument$LazyField".
>
> Now trying to run #3 again yields an error message that the text field is
> indexed without positions.
>
> If enableLazyFieldLoading is false or if using the default highlighter this
> doesn't happen.
>
> Ronen.
>


Atomic Update (nested), Unified Highlighter and Lazy Field Loading => Invalid Index

2021-02-14 Thread Ronen Nussbaum
Hi All,

I discovered a strange behaviour with this combination.
Not only the atomic update fails, the child documents are not properly
indexed, and you can't use highlights on their text fields. Currently there
is no workaround other than reindex.

Checked on 8.3.0, 8.6.1 and 8.8.0.
1. Configure nested schema.
2. enableLazyFieldLoading is true (default).
3. Run a search with hl.method=unified and hl.fl=
4. Trying to do an atomic update on some of the *parents* of the returned
documents from #3.

You get an error: "TransactionLog doesn't know how to serialize class
org.apache.lucene.document.LazyDocument$LazyField".

Now trying to run #3 again yields an error message that the text field is
indexed without positions.

If enableLazyFieldLoading is false or if using the default highlighter this
doesn't happen.
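
(A minimal sketch of that workaround, assuming the stock solrconfig.xml layout;
the setting lives inside the <query> section:)

<enableLazyFieldLoading>false</enableLazyFieldLoading>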

Ronen.


Index rich document and view

2021-02-10 Thread Luke Oak
Hi,

I have all kinds of rich documents, such as Excel, PPT, PDF, Word, JPG..., and I 
know Tika or OCR can convert them to text and index them. But when I open the 
document, the format is changed. How can I keep the original document format? Is 
it possible in Solr?

If not, can I use an external field type to save the original file and load it 
when I want to view the document?

Thanks 

Sent from my iPhone

Index analyzer concatenate tokens

2021-01-29 Thread Florin Babes
Hello,
I'm trying to index the following token with payload "winter tires|1.4" as
an exact match, but I also want to apply the Hunspell lemmatizer to this token and
keep both the original and the lemma. So after all that I want to have the
following tokens:
"winter tires" with payload 1.4
"winter tire" with payload 1.4

I thought of doing it this way:

 







 



But what happens here is that the indexed tokens are "winter tires|1.4" and
"winter tire|1.4" because any filter
after solr.ConcatenateGraphFilterFactory does not apply.

Do you have any idea how I can concatenate the tokens from a stream without
using solr.ConcatenateGraphFilterFactory? Or how I can achieve the above?

Thanks.


how to use a compass lucene generated index with solr

2021-01-26 Thread Guglielmo Fanini
With (the latest) Lucene 8.7, is it possible to open a very old .cfs compound 
index file from Lucene 2.2 with "Luke"? Or alternatively, could it be possible 
to generate the .idx file for Luke from the .cfs?
The .cfs was generated by Compass on top of Lucene 2.2, not by Lucene directly.
Is it possible to use a Compass-generated index containing
_b.cfs
segments.gen
segments_d
with Solr?



Re: Possible bug on LTR when using solr 8.6.3 - index out of bounds DisiPriorityQueue.add(DisiPriorityQueue.java:102)

2021-01-06 Thread Florin Babes
Hello, Christine and thank you for your help!

So, we've investigated further based on your suggestions and have the
following things to note:

Reproducibility: We can reproduce the same queries on multiple runs, with
the same error.
Data as a factor: Our setup is single-sharded, so we can't investigate
further on this.
Feature vs. Model: We've also tried a dummy LinearModel with only two
features and the problem still occurs.
Identification of the troublesome feature(s): We've narrowed our model to
only two features and the problem always occurs (for some queries, not all)
when we have a feature with a mm=1 and a feature with a mm>=3. The problem
also occurs when we only do feature extraction and the problem seems to
always occur on the feature with the bigger mm. The errors seem to be
related to the size of the head DisiPriorityQueue created here:
https://github.com/apache/lucene-solr/blob/branch_8_6/lucene/core/src/java/org/apache/lucene/search/MinShouldMatchSumScorer.java#L107
as the error changes as we change the mm for the second feature:

1 feature with mm=1 and one with mm=3 -> Index 4 out of bounds for length 4
1 feature with mm=1 and one with mm=5 -> Index 2 out of bounds for length 2

You can find below the dummy feature-store.

[
{
"store": "dummystore",
"name": "similarity_name_mm_1",
"class": "org.apache.solr.ltr.feature.SolrFeature",
"params": {
"q": "{!dismax qf=name mm=1}${term}"
}
},
{
"store": "dummystore",
"name": "similarity_names_mm_3",
"class": "org.apache.solr.ltr.feature.SolrFeature",
"params": {
"q": "{!dismax qf=name mm=3}${term}"
}
}
]
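
For reference, a sketch of a feature-extraction-only request of the kind
described above (the collection name and query term are placeholders, and it
assumes the dummystore above is already deployed):

curl 'http://localhost:8983/solr/<collection>/select' \
  --data-urlencode 'q=name:oled' \
  --data-urlencode 'fl=id,score,[features store=dummystore efi.term=oled]'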

The problem starts occurring in Solr 8.6.0: we tried multiple versions below
and above 8.6, and the problem first appears in 8.6.0. We tend to believe it's
because of the following changes:
https://issues.apache.org/jira/browse/SOLR-14364 as they're the only major
changes related to LTR introduced in Solr 8.6.0.

I've created a Solr JIRA bug/issue ticket here:
https://issues.apache.org/jira/browse/SOLR-15071

Thank you for your help!

În mar., 5 ian. 2021 la 19:40, Christine Poerschke (BLOOMBERG/ LONDON) <
cpoersc...@bloomberg.net> a scris:

> Hello Florin Babes,
>
> Thanks for this detailed report! I agree you experiencing
> ArrayIndexOutOfBoundsException during SolrFeature computation sounds like a
> bug, would you like to open a SOLR JIRA issue for it?
>
> Here's some investigative ideas I would have, in no particular order:
>
> Reproducibility: if a failed query is run again, does it also fail second
> time around (when some caches may be used)?
>
> Data as a factor: is your setup single-sharded or multi-sharded? in a
> multi-sharded setup if the same query fails on some shards but succeeds on
> others (and all shards have some documents that match the query) then this
> could support a theory that a certain combination of data and features
> leads to the exception.
>
> Feature vs. Model: you mention use of a MultipleAdditiveTrees model, if
> the same features are used in a LinearModel instead, do the same errors
> happen? or if no model is used but only feature extraction is done, does
> that give errors?
>
> Identification of the troublesome feature(s): narrowing down to a single
> feature or a small combination of features could make it easier to figure
> out the problem. assuming the existing logging doesn't identify the
> features, replacing the org.apache.solr.ltr.feature.SolrFeature with a
> com.mycompany.solr.ltr.feature.MySolrFeature containing instrumentation
> could provide insights e.g. the existing code [2] logs feature names for
> UnsupportedOperationException and if it also caught
> ArrayIndexOutOfBoundsException then it could log the feature name before
> rethrowing the exception.
>
> Based on your detail below and this [3] conditional in the code probably
> at least two features will be necessary to hit the issue, but for
> investigative purposes two features could still be simplified potentially
> to effectively one feature e.g. if one feature is a SolrFeature and the
> other is a ValueFeature or if featureA and featureB are both SolrFeature
> features with _identical_ parameters but different names.
>
> Hope that helps.
>
> Regards,
>
> Christine
>
> [1]
> https://lucene.apache.org/solr/guide/8_6/learning-to-rank.html#extracting-features
> [2]
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.6.3/solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/SolrFeature.java#L243
> [3]
> https://github.com/apache/lucene-solr/blob/releases/lu

Re:Possible bug on LTR when using solr 8.6.3 - index out of bounds DisiPriorityQueue.add(DisiPriorityQueue.java:102)

2021-01-05 Thread Christine Poerschke (BLOOMBERG/ LONDON)
Hello Florin Babes,

Thanks for this detailed report! I agree you experiencing 
ArrayIndexOutOfBoundsException during SolrFeature computation sounds like a 
bug, would you like to open a SOLR JIRA issue for it?

Here's some investigative ideas I would have, in no particular order:

Reproducibility: if a failed query is run again, does it also fail second time 
around (when some caches may be used)?

Data as a factor: is your setup single-sharded or multi-sharded? in a 
multi-sharded setup if the same query fails on some shards but succeeds on 
others (and all shards have some documents that match the query) then this 
could support a theory that a certain combination of data and features leads to 
the exception.

Feature vs. Model: you mention use of a MultipleAdditiveTrees model, if the 
same features are used in a LinearModel instead, do the same errors happen? or 
if no model is used but only feature extraction is done, does that give errors?

Identification of the troublesome feature(s): narrowing down to a single 
feature or a small combination of features could make it easier to figure out 
the problem. assuming the existing logging doesn't identify the features, 
replacing the org.apache.solr.ltr.feature.SolrFeature with a 
com.mycompany.solr.ltr.feature.MySolrFeature containing instrumentation could 
provide insights e.g. the existing code [2] logs feature names for 
UnsupportedOperationException and if it also caught 
ArrayIndexOutOfBoundsException then it could log the feature name before 
rethrowing the exception.

Based on your detail below and this [3] conditional in the code probably at 
least two features will be necessary to hit the issue, but for investigative 
purposes two features could still be simplified potentially to effectively one 
feature e.g. if one feature is a SolrFeature and the other is a ValueFeature or 
if featureA and featureB are both SolrFeature features with _identical_ 
parameters but different names.

Hope that helps.

Regards,

Christine

[1] 
https://lucene.apache.org/solr/guide/8_6/learning-to-rank.html#extracting-features
[2] 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.6.3/solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/SolrFeature.java#L243
[3] 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.6.3/solr/contrib/ltr/src/java/org/apache/solr/ltr/LTRScoringQuery.java#L520-L525

From: solr-user@lucene.apache.org  At: 01/04/21 17:31:44  To: solr-user@lucene.apache.org
Subject: Possible bug on LTR when using solr 8.6.3 - index out of bounds 
DisiPriorityQueue.add(DisiPriorityQueue.java:102)

Hello,
We are trying to update Solr from 8.3.1 to 8.6.3. On Solr 8.3.1 we are
using LTR in production using a MultipleAdditiveTrees model. On Solr 8.6.3
we receive an error when we try to compute some SolrFeatures. We didn't
find any pattern in the queries that fail.
Example:
We have the following raw query parameters:
q=lg cx 4k oled 120 hz -> just one of many examples
term_dq=lg cx 4k oled 120 hz
rq={!ltr model=model reRankDocs=1000 store=feature_store
efi.term=${term_dq}}
defType=edismax,
mm=2<75%
The features are something like this:
{
  "name":"similarity_query_fileld_1",
  "class":"org.apache.solr.ltr.feature.SolrFeature",
  "params":{"q":"{!dismax qf=query_field_1 mm=1}${term}"},
  "store":"feature_store"
},
{
  "name":"similarity_query_field_2",
  "class":"org.apache.solr.ltr.feature.SolrFeature",
  "params":{"q":"{!dismax qf=query_field_2 mm=5}${term}"},
  "store":"feature_store"
}

We are testing ~6300 production queries and for about 1% of them we receive
that following error message:
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","java.lang.ArrayIndexOutOfBoundsException"],
"msg":"java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds
for length 2",

The stacktrace is :
org.apache.solr.common.SolrException:
java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
at org.apache.solr.search.ReRankCollector.topDocs(ReRankCollector.java:154)
at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1599)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1413)
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:596)
at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1513)
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:403)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:360)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:214

Possible bug on LTR when using solr 8.6.3 - index out of bounds DisiPriorityQueue.add(DisiPriorityQueue.java:102)

2021-01-04 Thread Florin Babes
Hello,
We are trying to update Solr from 8.3.1 to 8.6.3. On Solr 8.3.1 we are
using LTR in production using a MultipleAdditiveTrees model. On Solr 8.6.3
we receive an error when we try to compute some SolrFeatures. We didn't
find any pattern in the queries that fail.
Example:
We have the following raw query parameters:
q=lg cx 4k oled 120 hz -> just one of many examples
term_dq=lg cx 4k oled 120 hz
rq={!ltr model=model reRankDocs=1000 store=feature_store
efi.term=${term_dq}}
defType=edismax,
mm=2<75%
The features are something like this:
{
  "name":"similarity_query_fileld_1",
  "class":"org.apache.solr.ltr.feature.SolrFeature",
  "params":{"q":"{!dismax qf=query_field_1 mm=1}${term}"},
  "store":"feature_store"
},
{
  "name":"similarity_query_field_2",
  "class":"org.apache.solr.ltr.feature.SolrFeature",
  "params":{"q":"{!dismax qf=query_field_2 mm=5}${term}"},
  "store":"feature_store"
}

We are testing ~6300 production queries and for about 1% of them we receive
the following error message:
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","java.lang.ArrayIndexOutOfBoundsException"],
"msg":"java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds
for length 2",

The stacktrace is :
org.apache.solr.common.SolrException:
java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
at org.apache.solr.search.ReRankCollector.topDocs(ReRankCollector.java:154)
at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1599)
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1413)
at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:596)
at
org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1513)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:403)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:360)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:214)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2627)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:795)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:568)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1596)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1610)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1300)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1580)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1215)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
at
org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:500)
at
org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(Ab

How can i poll Solrcloud via API to get the sum of index size of all shards and replicas?

2020-12-09 Thread Roman Ivanov
Hello! We have a SolrCloud (7.4) cluster consisting of 90+ hosts (each of them
running multiple Solr nodes, e.g. on ports 8983, 8984, 8985), numerous
shards (each having several replicas) and numerous collections.

I was given a task to summarize the total index size (on disk) of a certain
collection. First I calculated it manually from the web interface via
copy-paste (the Cloud - Nodes tab of the HTTP interface on 8983), and there
were thousands of lines. It took several hours. Now I think this task needs
some automation. I read the API documentation and googled but still no
luck... And any possible solution could help somebody else in the future.

What I tried:
   1) If I poll one of the Solr cores via

"
http://solrhost1.somecorporatesite.org:8983/solr/admin/metrics?wt=JSON&prefix=INDEX
"

I get output like (**cores.json**):

"responseHeader":{
   "status":0,
"Qtime":2004},
 "metrics":{
   "solr.core.collectionname1-2020-12-05.shard12.replica_n240:{
   "INDEX.size":"456 bytes",
   "INDEX.sizeInBytes":456},
   "solr.core.collectionname2-2020-12-04.shard74.replica_n650:{
   "INDEX.size":"2.88 GB",
   "INDEX.sizeInBytes":3088933801},

... and so on, which is what I need, BUT only for one (local) core.
And there are more than 200 of them.

   2) I can get a list of all collections, shards and replicas via:


http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json

and it looks like (**collections.json**)

"responseHeader":{
  "status":0,
  "QTime":184},
"cluster":{
  "collections":{
  "collectionname1":{
  "pullReplicas":"0",
  "replicationFactor":"1",
  "shards":{
 "shard1":{
  "range":"8-80e0",
  "state":active",
  "replicas":{
 "core_node67":{
   "core":"collectionname123-2020-11-30_shard1_replica_n54",
   "node_name":"solrhost99.somecorporatesite.org:8985/solr",
   "state":"active",
   "type":"NRT",
   "force_ste_state":"false",
   "leader":"true"},
  "core_node548":{
 "core":"collectionname223-2020-11-29_shard1_replica_n448",
  "node_name":"solrhost77.somecorporatesite.org:8984/solr",
  "state":"active",
  "type":"NRT",
  "force_ste_state":"false"}}},
   "shard2":{
 "range":

... and so on, 117 156 lines

The question is: how can I insert the INDEX.size fields into the second
output (clusterstatus) to calculate the total disk space used by the indices?

In other words, I need the corresponding INDEX.size fields in the replicas
sections of **collections.json**.

Currently the whole Solr system consumes 100TB+ and is still growing, and we
need to know the pace of its growth. Many thanks in advance!
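
One way to automate this (a rough, untested sketch: it assumes jq is
installed, that the nodes answer on plain HTTP, and the localhost address and
collection name below are placeholders): take live_nodes from CLUSTERSTATUS,
then ask each node's metrics endpoint only for INDEX.sizeInBytes and sum the
values of the cores whose registry name contains the collection:

#!/bin/bash
# Sum INDEX.sizeInBytes over every core of one collection, across all live nodes.
COLLECTION="collectionname1"
TOTAL=0
for NODE in $(curl -s 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json' \
              | jq -r '.cluster.live_nodes[]' | sed 's#_solr#/solr#'); do
  # Each node reports its local cores as solr.core.<collection>.<shard>.<replica>
  BYTES=$(curl -s "http://${NODE}/admin/metrics?wt=json&group=core&prefix=INDEX.sizeInBytes" \
          | jq --arg c "$COLLECTION" \
               '[.metrics | to_entries[] | select(.key | contains($c)) | .value["INDEX.sizeInBytes"]] | add // 0')
  TOTAL=$((TOTAL + BYTES))
done
echo "${COLLECTION}: ${TOTAL} bytes"

Summing per node this way avoids merging the metrics output into the
clusterstatus JSON at all, because the registry names already carry the
collection, shard and replica.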


Re: Solr8.7 - How to optmize my index ?

2020-12-03 Thread Erick Erickson
Dave:

Yeah, every time there’s generic advice, there are some situations where it’s not 
the best choice ;).

In your situation, you’re trading off some space savings for moving up to 450G 
all at once. Which sounds like it is worthwhile to you, although I’d check perf 
numbers sometime.

You may want to check out expungeDeletes. That will deal only with segments 
with more than 10% deleted docs, and may get you most all of the benefits of 
optimize without the problems. Specifically, let’s say you have a segment right 
at the limit (5G by default) that has exactly one deleted doc. Optimize will 
rewrite that, expungeDeletes will not. It’s an open question whether there’s 
any practical difference, ‘cause if all the segments in your index have > 10% 
deleted documents, they all get rewritten in either case….
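
For reference, expungeDeletes can be sent through the update handler the same
way as an optimize; the host and collection name here are placeholders:

curl 'http://localhost:8983/solr/my_collection/update?commit=true&expungeDeletes=true'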

And the mechanism for optimize changed pretty significantly in Solr 7.5, the 
short form is that before that the result was a single massive segment, whereas 
after that the default max segment size of 5G is respected by default (although 
you can force to one segment if you take explicit actions).

Here are two articles that explain it all:
Pre Solr 7.4: 
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
Post Solr 7.4: 
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

Best,
Erick

> On Dec 2, 2020, at 11:05 PM, Dave  wrote:
> 
> I’m going to go against the advice SLIGHTLY, it really depends on how you 
> have things set up as far as your solr server hosting is done. If you’re 
> searching off the same solr server you’re indexing to, yeah don’t ever 
> optimize it will take care of itself, people much smarter than us, like 
> Erick/Walter/Yonik, have spent time on this and if they say don’t do it don't 
> do it. 
> 
> In my particular use case I do see a measured improvement from optimizing 
> every three or four months.  In my case a large portion, over 75% of the 
> documents, which each measure around 500KB to 3MB, get reindexed every month, 
> as the fields in the documents change every month, while documents are added 
> to it daily as well.  So when I can go from a 650gb index to a 450gb once in 
> a while it makes a difference if I only have 500gb of memory to work with on 
> the searchers and can fit all the segments straight to memory. Also I use the 
> old set up of master slave, so my indexing server, when it’s optimizing has 
> no impact on the searching servers.  Once the optimized index gets warmed 
> back up in the searcher I do notice improvement in my qtimes (I like to 
> think) however I’ve been using my same integration process of occasional hard 
> optimizations since 1.4, and it might just be i like to watch the index 
> inflate three times the size then shrivel up. Old habits die hard. 
> 
>> On Dec 2, 2020, at 10:28 PM, Matheo Software  
>> wrote:
>> 
>> Hi Erick,
>> Hi Walter,
>> 
>> Thanks for these information,
>> 
>> I will learn seriously about the solr article you gave me. 
>> I thought it was important to always delete and optimize collection.
>> 
>> More information concerning my collection,
>> Index size is about 390GB for 130M docs (3-5KB / doc), around 25 fields 
>> (indexed, stored).
>> Every Tuesday I do an update of around 1M docs and every Thursday I add new 
>> docs (around 50,000). 
>> 
>> Many thanks !
>> 
>> Regards,
>> Bruno
>> 
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerick...@gmail.com] 
>> Sent: Wednesday, December 2, 2020 14:07
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr8.7 - How to optmize my index ?
>> 
>> expungeDeletes is unnecessary, optimize is a superset of expungeDeletes.
>> The key difference is commit=true. I suspect if you’d waited until your 
>> indexing process added another doc and committed, you’d have seen the index 
>> size drop.
>> 
>> Just to check, you send the command to my_core but talk about collections.
>> Specifying the collection is sufficient, but I’ll assume that’s a typo and 
>> you’re really saying my_collection.
>> 
>> I agree with Walter like I always do, you shouldn’t be running optimize 
>> without some proof that it’s helping. About the only time I think it’s 
>> reasonable is when you have a static index, unless you can demonstrate 
>> improved performance. The optimize button was removed precisely because it 
>> was so tempting. In much earlier versions of Lucene, it made a demonstrable 
>> difference so was put front and center. In more recent versions of Solr 
>> optimize doesn’t help nearly as much so it was removed.
>> 
>> You say you have 38M deleted documents. How many documents total

Re: Solr8.7 - How to optmize my index ?

2020-12-02 Thread Dave
I’m going to go against the advice SLIGHTLY, it really depends on how you have 
things set up as far as your solr server hosting is done. If you’re searching 
off the same solr server you’re indexing to, yeah don’t ever optimize it will 
take care of itself, people much smarter than us, like Erick/Walter/Yonik, have 
spent time on this and if they say don’t do it don't do it. 

 In my particular use case I do see a measured improvement from optimizing 
every three or four months.  In my case a large portion, over 75% of the 
documents, which each measure around 500KB to 3MB, get reindexed every month, as 
the fields in the documents change every month, while documents are added to it 
daily as well.  So when I can go from a 650gb index to a 450gb once in a while 
it makes a difference if I only have 500gb of memory to work with on the 
searchers and can fit all the segments straight to memory. Also I use the old 
set up of master slave, so my indexing server, when it’s optimizing has no 
impact on the searching servers.  Once the optimized index gets warmed back up 
in the searcher I do notice improvement in my qtimes (I like to think) however 
I’ve been using my same integration process of occasional hard optimizations 
since 1.4, and it might just be i like to watch the index inflate three times 
the size then shrivel up. Old habits die hard. 

> On Dec 2, 2020, at 10:28 PM, Matheo Software  wrote:
> 
> Hi Erick,
> Hi Walter,
> 
> Thanks for these information,
> 
> I will learn seriously about the solr article you gave me. 
> I thought it was important to always delete and optimize collection.
> 
> More information concerning my collection,
> Index size is about 390GB for 130M docs (3-5KB / doc), around 25 fields 
> (indexed, stored).
> Every Tuesday I do an update of around 1M docs and every Thursday I add new 
> docs (around 50,000). 
> 
> Many thanks !
> 
> Regards,
> Bruno
> 
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com] 
> Sent: Wednesday, December 2, 2020 14:07
> To: solr-user@lucene.apache.org
> Subject: Re: Solr8.7 - How to optmize my index ?
> 
> expungeDeletes is unnecessary, optimize is a superset of expungeDeletes.
> The key difference is commit=true. I suspect if you’d waited until your 
> indexing process added another doc and committed, you’d have seen the index 
> size drop.
> 
> Just to check, you send the command to my_core but talk about collections.
> Specifying the collection is sufficient, but I’ll assume that’s a typo and 
> you’re really saying my_collection.
> 
> I agree with Walter like I always do, you shouldn’t be running optimize 
> without some proof that it’s helping. About the only time I think it’s 
> reasonable is when you have a static index, unless you can demonstrate 
> improved performance. The optimize button was removed precisely because it 
> was so tempting. In much earlier versions of Lucene, it made a demonstrable 
> difference so was put front and center. In more recent versions of Solr 
> optimize doesn’t help nearly as much so it was removed.
> 
> You say you have 38M deleted documents. How many documents total? If this is 
> 50% of your index, that’s one thing. If it’s 5%, it’s certainly not worth the 
> effort. You’re rewriting 466G of index, if you’re not seeing demonstrable 
> performance improvements, that’s a lot of wasted effort…
> 
> See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> and the linked article for what happens in pre 7.5 solr versions.
> 
> Best,
> Erick
> 
>> On Dec 1, 2020, at 2:31 PM, Info MatheoSoftware  
>> wrote:
>> 
>> Hi All,
>> 
>> 
>> 
>> I found the solution, I must do :
>> 
>> curl ‘http://xxx:8983/solr/my_core/update?optimize=true&commit=true&expungeDeletes=true’
>> 
>> 
>> 
>> It works fine
>> 
>> 
>> 
>> Thanks,
>> 
>> Bruno
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> From: Matheo Software [mailto:i...@matheo-software.com] Sent: Tuesday, 
>> December 1, 2020 13:28 To: solr-user@lucene.apache.org Subject: Solr8.7 
>> - How to optmize my index ?
>> 
>> 
>> 
>> Hi All,
>> 
>> 
>> 
>> With Solr5.4, I used the UI button but in Solr8.7 UI this button is missing.
>> 
>> 
>> 
>> So I decide to use the command line:
>> 
>> curl http://xxx:8983/solr/my_core/update?optimize=true
>> 
>> 
>> 
>> My collection my_core exists of course.
>> 
>> 
>> 
>> The answer of the command line is:
>> 
>> {
>> 
>> "responseHeader":{
&g

RE: Solr8.7 - How to optmize my index ?

2020-12-02 Thread Matheo Software
Hi Erick,
Hi Walter,

Thanks for these information,

I will learn seriously about the solr article you gave me.
I thought it was important to always delete and optimize collection.

More information concerning my collection,
Index size is about 390GB for 130M docs (3-5KB / doc), around 25 fields 
(indexed, stored).
Every Tuesday I do an update of around 1M docs and every Thursday I add new 
docs (around 50,000).

Many thanks !

Regards,
Bruno

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, December 2, 2020 14:07
To: solr-user@lucene.apache.org
Subject: Re: Solr8.7 - How to optmize my index ?

expungeDeletes is unnecessary, optimize is a superset of expungeDeletes.
The key difference is commit=true. I suspect if you’d waited until your 
indexing process added another doc and committed, you’d have seen the index 
size drop.

Just to check, you send the command to my_core but talk about collections.
Specifying the collection is sufficient, but I’ll assume that’s a typo and 
you’re really saying my_collection.

I agree with Walter like I always do, you shouldn’t be running optimize without 
some proof that it’s helping. About the only time I think it’s reasonable is 
when you have a static index, unless you can demonstrate improved performance. 
The optimize button was removed precisely because it was so tempting. In much 
earlier versions of Lucene, it made a demonstrable difference so was put front 
and center. In more recent versions of Solr optimize doesn’t help nearly as 
much so it was removed.

You say you have 38M deleted documents. How many documents total? If this is 
50% of your index, that’s one thing. If it’s 5%, it’s certainly not worth the 
effort. You’re rewriting 466G of index, if you’re not seeing demonstrable 
performance improvements, that’s a lot of wasted effort…

See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
and the linked article for what happens in pre 7.5 solr versions.

Best,
Erick

> On Dec 1, 2020, at 2:31 PM, Info MatheoSoftware  
> wrote:
>
> Hi All,
>
>
>
> I found the solution, I must do :
>
> curl ‘http://xxx:8983/solr/my_core/update?optimize=true&commit=true&expungeDeletes=true’
>
>
>
> It works fine
>
>
>
> Thanks,
>
> Bruno
>
>
>
>
>
>
>
> From: Matheo Software [mailto:i...@matheo-software.com] Sent: Tuesday,
> December 1, 2020 13:28 To: solr-user@lucene.apache.org Subject: Solr8.7
> - How to optmize my index ?
>
>
>
> Hi All,
>
>
>
> With Solr5.4, I used the UI button but in Solr8.7 UI this button is missing.
>
>
>
> So I decide to use the command line:
>
> curl http://xxx:8983/solr/my_core/update?optimize=true
>
>
>
> My collection my_core exists of course.
>
>
>
> The answer of the command line is:
>
> {
>
>  "responseHeader":{
>
>"status":0,
>
>"QTime":18}
>
> }
>
>
>
> But nothing change.
>
> I always have 38M deleted docs in my collection and directory size no
> change like with solr5.4.
>
> The size of the collection stay always at : 466.33Go
>
>
>
> Could you tell me how can I purge deleted docs ?
>
>
>
> Cordialement, Best Regards
>
> Bruno Mannina
>
> <http://www.matheo-software.com> www.matheo-software.com
>
> <http://www.patent-pulse.com> www.patent-pulse.com
>
> Tél. +33 0 970 738 743
>
> Mob. +33 0 634 421 817
>



Re: Solr8.7 - How to optmize my index ?

2020-12-02 Thread Erick Erickson
expungeDeletes is unnecessary, optimize is a superset of expungeDeletes.
The key difference is commit=true. I suspect if you’d waited until your
indexing process added another doc and committed, you’d have seen
the index size drop.

Just to check, you send the command to my_core but talk about collections.
Specifying the collection is sufficient, but I’ll assume that’s a typo and
you’re really saying my_collection.

I agree with Walter like I always do, you shouldn’t be running 
optimize without some proof that it’s helping. About the only time
I think it’s reasonable is when you have a static index, unless you can
demonstrate improved performance. The optimize button was
removed precisely because it was so tempting. In much earlier
versions of Lucene, it made a demonstrable difference so was put
front and center. In more recent versions of Solr optimize doesn’t
help nearly as much so it was removed.

You say you have 38M deleted documents. How many documents total? If this is
50% of your index, that’s one thing. If it’s 5%, it’s certainly not worth
the effort. You’re rewriting 466G of index, if you’re not seeing demonstrable
performance improvements, that’s a lot of wasted effort…
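
One quick way to check that ratio (a sketch; the host and collection name are
placeholders) is the Luke request handler, which reports numDocs, maxDoc and
deletedDocs for the index:

curl 'http://localhost:8983/solr/my_collection/admin/luke?numTerms=0&wt=json'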

See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
and the linked article for what happens in pre 7.5 solr versions.

Best,
Erick

> On Dec 1, 2020, at 2:31 PM, Info MatheoSoftware  
> wrote:
> 
> Hi All,
> 
> 
> 
> I found the solution, I must do :
> 
> curl ‘http://xxx:8983/solr/my_core/update?optimize=true&commit=true&expungeDeletes=true’
> 
> 
> 
> It works fine
> 
> 
> 
> Thanks,
> 
> Bruno
> 
> 
> 
> 
> 
> 
> 
> From: Matheo Software [mailto:i...@matheo-software.com]
> Sent: Tuesday, December 1, 2020 13:28
> To: solr-user@lucene.apache.org
> Subject: Solr8.7 - How to optmize my index ?
> 
> 
> 
> Hi All,
> 
> 
> 
> With Solr5.4, I used the UI button but in Solr8.7 UI this button is missing.
> 
> 
> 
> So I decide to use the command line:
> 
> curl http://xxx:8983/solr/my_core/update?optimize=true
> 
> 
> 
> My collection my_core exists of course.
> 
> 
> 
> The answer of the command line is:
> 
> {
> 
>  "responseHeader":{
> 
>"status":0,
> 
>"QTime":18}
> 
> }
> 
> 
> 
> But nothing change.
> 
> I always have 38M deleted docs in my collection and directory size no change
> like with solr5.4.
> 
> The size of the collection stay always at : 466.33Go
> 
> 
> 
> Could you tell me how can I purge deleted docs ?
> 
> 
> 
> Cordialement, Best Regards
> 
> Bruno Mannina
> 
> <http://www.matheo-software.com> www.matheo-software.com
> 
> <http://www.patent-pulse.com> www.patent-pulse.com
> 
> Tél. +33 0 970 738 743
> 
> Mob. +33 0 634 421 817
> 



Re: Solr8.7 - How to optmize my index ?

2020-12-01 Thread Walter Underwood
Even better DO NOT OPTIMIZE.

Just let Solr manage the indexes automatically.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 1, 2020, at 11:31 AM, Info MatheoSoftware  
> wrote:
> 
> Hi All,
> 
> 
> 
> I found the solution, I must do :
> 
> curl ‘http://xxx:8983/solr/my_core/update?optimize=true&commit=true&expungeDeletes=true’
> 
> 
> 
> It works fine
> 
> 
> 
> Thanks,
> 
> Bruno
> 
> 
> 
> 
> 
> 
> 
> From: Matheo Software [mailto:i...@matheo-software.com]
> Sent: Tuesday, December 1, 2020 13:28
> To: solr-user@lucene.apache.org
> Subject: Solr8.7 - How to optmize my index ?
> 
> 
> 
> Hi All,
> 
> 
> 
> With Solr5.4, I used the UI button but in Solr8.7 UI this button is missing.
> 
> 
> 
> So I decide to use the command line:
> 
> curl http://xxx:8983/solr/my_core/update?optimize=true
> 
> 
> 
> My collection my_core exists of course.
> 
> 
> 
> The answer of the command line is:
> 
> {
> 
>  "responseHeader":{
> 
>"status":0,
> 
>"QTime":18}
> 
> }
> 
> 
> 
> But nothing change.
> 
> I always have 38M deleted docs in my collection and directory size no change
> like with solr5.4.
> 
> The size of the collection stay always at : 466.33Go
> 
> 
> 
> Could you tell me how can I purge deleted docs ?
> 
> 
> 
> Cordialement, Best Regards
> 
> Bruno Mannina
> 
> <http://www.matheo-software.com> www.matheo-software.com
> 
> <http://www.patent-pulse.com> www.patent-pulse.com
> 
> Tél. +33 0 970 738 743
> 
> Mob. +33 0 634 421 817
> 



RE: Solr8.7 - How to optmize my index ?

2020-12-01 Thread Info MatheoSoftware
Hi All,



I found the solution, I must do :

curl ‘http://xxx:8983/solr/my_core/update?optimize=true&commit=true&expungeDeletes=true’



It works fine



Thanks,

Bruno







From: Matheo Software [mailto:i...@matheo-software.com]
Sent: Tuesday, December 1, 2020 13:28
To: solr-user@lucene.apache.org
Subject: Solr8.7 - How to optmize my index ?



Hi All,



With Solr5.4, I used the UI button but in Solr8.7 UI this button is missing.



So I decide to use the command line:

curl http://xxx:8983/solr/my_core/update?optimize=true



My collection my_core exists of course.



The answer of the command line is:

{

  "responseHeader":{

"status":0,

"QTime":18}

}



But nothing change.

I always have 38M deleted docs in my collection and directory size no change
like with solr5.4.

The size of the collection stay always at : 466.33Go



Could you tell me how can I purge deleted docs ?



Cordialement, Best Regards

Bruno Mannina

 <http://www.matheo-software.com> www.matheo-software.com

 <http://www.patent-pulse.com> www.patent-pulse.com

Tél. +33 0 970 738 743

Mob. +33 0 634 421 817



Solr8.7 - How to optmize my index ?

2020-12-01 Thread Matheo Software
Hi All,



With Solr5.4, I used the UI button but in Solr8.7 UI this button is missing.



So I decide to use the command line:

curl http://xxx:8983/solr/my_core/update?optimize=true



My collection my_core exists of course.



The answer of the command line is:

{

  "responseHeader":{

"status":0,

"QTime":18}

}



But nothing change.

I always have 38M deleted docs in my collection and directory size no change
like with solr5.4.

The size of the collection stay always at : 466.33Go



Could you tell me how can I purge deleted docs ?



Cordialement, Best Regards

Bruno Mannina

  www.matheo-software.com

  www.patent-pulse.com

Tél. +33 0 970 738 743

Mob. +33 0 634 421 817



Re: Can solr index replacement character

2020-12-01 Thread Erick Erickson
Solr handles UTF-8, so it should be able to. The problem you’ll have is
getting the UTF-8 characters to get through all the various transport
encodings, i.e. if you try to search from a browser, you need to encode
it so the browser passes it through. If you search through SolrJ, it needs
to be encoded at that level. If you use cURL, it needs another….
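
For example, with cURL the character has to be percent-encoded in the URL:
U+FFFD is the UTF-8 byte sequence EF BF BD, so a query against a hypothetical
collection and field (names are placeholders) would look like:

curl 'http://localhost:8983/solr/mycollection/select?q=myfield:%EF%BF%BD'

Whether that actually matches then depends on how the field's analyzer treats
the character.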

> On Dec 1, 2020, at 12:30 AM, Eran Buchnick  wrote:
> 
> Hi community,
> During integration tests with a new data source I have noticed a weird scenario
> where the replacement character can't be searched, though it seems to be stored.
> I mean, honestly, I don't want that irrelevant data stored in my index, but
> I wondered if Solr can index the replacement character (U+FFFD �) as a string and,
> if so, how to search for it?
> And in general, is there any built-in char filtration?!
> 
> Thanks



Can solr index replacement character

2020-11-30 Thread Eran Buchnick
Hi community,
During integration tests with a new data source I have noticed a weird scenario
where the replacement character can't be searched, though it seems to be stored.
I mean, honestly, I don't want that irrelevant data stored in my index, but
I wondered if Solr can index the replacement character (U+FFFD �) as a string and,
if so, how to search for it?
And in general, is there any built-in char filtration?!

Thanks
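
On the built-in filtration question: there is no char filtration unless it is
configured in the field type. If the goal is simply to drop U+FFFD at index
time, a char filter can do it. A sketch with illustrative names; note this only
applies to analyzed TextField types, not plain string fields:

<fieldType name="text_no_fffd" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strip the Unicode replacement character before tokenization -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\uFFFD" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>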


Index size issue. Migration from Solr-6.5.1 To Solr-8.6.3

2020-11-17 Thread Modassar Ather
Hi,

I am in the process of migrating from Solr-6.5.1 to Solr-8.6.3. The current
index size after optimisation is 2.4 TB. We use a 7TB disk for indexing as
the optimisation needs extra space.
Now with the newer Solr the un-optimised index itself came out at 5+TB,
which after optimisation reduced to 2.4TB. Analysing the index I found
almost 1.9 TB of .cfs files being created.

useCompoundFile is false by default and I have not enabled it, but still
the .cfs files are getting created.
During installation of Solr-8.6.3 I was getting ulimit-related errors
during Solr startup, which I fixed by increasing the limit. Solr is run as a
different user and not as the solr user.

Kindly provide your suggestions on how I can achieve the same size of
un-optimised index as that of  Solr-6.5.1 to save on hard disk cost.

Best,
Modassar
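
A possibly relevant knob (only a sketch of where to look, not a confirmed
diagnosis): whether merged segments are written as compound (.cfs) files is
also governed by the merge policy's noCFSRatio, which can be set explicitly in
the indexConfig section of solrconfig.xml, e.g.:

<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <!-- 0.0 = never write merged segments as compound files -->
    <double name="noCFSRatio">0.0</double>
  </mergePolicyFactory>
</indexConfig>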


Re: Frequent Index Replication Failure in solr.

2020-11-13 Thread David Hastings
looks like your repeater is grabbing a file that the master merged into a
different file. Why not lower how often you go from master->repeater,
and/or don't commit so often, so you can make the index faster?
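
Both of those are plain config changes; a sketch of the relevant pieces (the
URL, core name and intervals below are illustrative, not recommendations):

<!-- on the repeater/slaves: poll the upstream master less often -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core-name</str>
    <str name="pollInterval">00:30:00</str>
  </lst>
</requestHandler>

<!-- on the master: commit less frequently -->
<autoCommit>
  <maxTime>600000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>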

On Fri, Nov 13, 2020 at 12:13 PM Parshant Kumar
 wrote:

> All, please help on this
>
> On Tue, Nov 3, 2020, 6:01 PM Parshant Kumar 
> wrote:
>
> > Hi team,
> >
> > We are having solr architecture as *master->repeater-> 3 slave servers.*
> >
> > We are doing incremental indexing on the master server(every 20 min) .
> > Replication of index is done from master to repeater server(every 10
> mins)
> > and from repeater to 3 slave servers (every 3 hours).
> > *We are facing the frequent replication failure between master to
> repeater
> > server  as well as between repeater  to slave servers.*
> > On checking logs found that every time one of the below  exceptions
> > occurred whenever the replication has failed .
> >
> > 1)WARN : Error in fetching file: _4rnu_t.liv (downloaded 0 of 11505507
> > bytes)
> > java.io.EOFException: Unexpected end of ZLIB input stream
> > at
> > java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
> > at
> > java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
> > at
> >
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
> > at
> >
> org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:88)
> > at
> >
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:139)
> > at
> >
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
> > at
> >
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:160)
> > at
> >
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1443)
> > at
> >
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1409)
> >
> >
> > 2)
> > WARN : Error getting file length for [segments_568]
> > java.nio.file.NoSuchFileException:
> >
> /data/solr/search/application/core-conf/im-search/data/index.20200711012319226/segments_568
> > at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> > at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> > at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> > at
> >
> sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> > at
> >
> sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
> > at
> >
> sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> > at java.nio.file.Files.readAttributes(Files.java:1737)
> > at java.nio.file.Files.size(Files.java:2332)
> > at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
> > at
> >
> org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:615)
> > at
> >
> org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:588)
> > at
> >
> org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:335)
> >
> > 3)
> > WARN : Error in fetching file: _4nji.nvd (downloaded 507510784 of
> > 555377795 bytes)
> > org.apache.http.MalformedChunkCodingException: CRLF expected at end of
> > chunk
> > at
> > org.apache.http.impl.io
> .ChunkedInputStream.getChunkSize(ChunkedInputStream.java:255)
> > at
> > org.apache.http.impl.io
> .ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
> > at
> > org.apache.http.impl.io
> .ChunkedInputStream.read(ChunkedInputStream.java:186)
> > at
> >
> org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
> > at
> > java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
> > at
> > java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
> > at
> >
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
> > at
> >
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:128)
> > at
> >
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
> > at
> >
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexF

Re: Frequent Index Replication Failure in solr.

2020-11-13 Thread Parshant Kumar
All, please help on this

On Tue, Nov 3, 2020, 6:01 PM Parshant Kumar 
wrote:

> Hi team,
>
> We are having solr architecture as *master->repeater-> 3 slave servers.*
>
> We are doing incremental indexing on the master server(every 20 min) .
> Replication of index is done from master to repeater server(every 10 mins)
> and from repeater to 3 slave servers (every 3 hours).
> *We are facing the frequent replication failure between master to repeater
> server  as well as between repeater  to slave servers.*
> On checking logs found that every time one of the below  exceptions
> occurred whenever the replication has failed .
>
> 1)WARN : Error in fetching file: _4rnu_t.liv (downloaded 0 of 11505507
> bytes)
> java.io.EOFException: Unexpected end of ZLIB input stream
> at
> java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
> at
> java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
> at
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
> at
> org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:88)
> at
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:139)
> at
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
> at
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:160)
> at
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1443)
> at
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1409)
>
>
> 2)
> WARN : Error getting file length for [segments_568]
> java.nio.file.NoSuchFileException:
> /data/solr/search/application/core-conf/im-search/data/index.20200711012319226/segments_568
> at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at
> sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> at
> sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
> at
> sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> at java.nio.file.Files.readAttributes(Files.java:1737)
> at java.nio.file.Files.size(Files.java:2332)
> at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
> at
> org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:615)
> at
> org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:588)
> at
> org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:335)
>
> 3)
> WARN : Error in fetching file: _4nji.nvd (downloaded 507510784 of
> 555377795 bytes)
> org.apache.http.MalformedChunkCodingException: CRLF expected at end of
> chunk
> at
> org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:255)
> at
> org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
> at
> org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
> at
> org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
> at
> java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
> at
> java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
> at
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
> at
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:128)
> at
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
> at
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1458)
> at
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1409)
> at
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1390)
> at
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:872)
> at
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:438)
> at
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)
>
> *Replication configuration of master,repeater,slave's is given below:*
>
> 
> 
> ${enable.master:false}
> commit
> startup
> 00:00:10
> 
>
>
> *Commit Configuration master,repeater,slave's is given below :*
>
>  
> 10false
>

Partial updates on collection with router.field lead to duplicated index

2020-11-06 Thread Zhivko Donev
Hi All,

I believe that this is a bug on solr side but want to be sure before filing
a JIRA ticket.
Setup:
Solr Cloud 8.3
Collection with 2 shards, 2 replicas, router = compositeId,
router.field=routerField_s

I am adding a document and then updating it as follows:

{
"id":"1",
"routerField_s":"1"
}
-
/update?*_route_=1*
[{
"id":"1",
"routerField_s":"1",
"test_s":{"set":"1"}
}]
--
/update?*_route_=1*
[{
"id":"1",
"routerField_s":"1",
"test_s":{"set":"2"}
}]
--
/update?*_route_=1*
[{
"id":"1",
"routerField_s":"1",
"test_s":{"set":"3"}
}]

When I query the collection for document with id:1 and limit = 10 all seems
to be fine. However if I query with limit 1 the response is saying
numFound=4 (indicating duplicated index).
Moreover if I query the added field test_s for particular value I will get
matches for all of the updated values - 1,2 and 3

If I execute the update without the _route_ param everything seems to work
properly - can someone confirm this?
The same behaviour can be observed if I have the following for the
routerField_s:
"routerField_s":{"set":"1"}

If I try to update with just _route_ param and "id" inside the update body
the request is rejected stating that the "routerField_s" is missing and no
shard can be identified. This seems like expected behaviour.
At a bare minimum I believe that the documentation for updating parts of
the document should be updated with examples how to handle cases like this.
Ideally I would expect solr to reject any requests containing both _route_
param and "routerField_s" values as well as using the {"set":"value"} for
the "routerField_s".

And final question - Do I have any other options for fixing the duplicated
index beside:
1. Delete documents by query "id:{corrupted_id}", then add the document
again
2. Do a full reload to a new collection and switch to using it.

Any thoughts will be much appreciated.
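
For option 1, a plain JSON delete-by-query (sent without the _route_ param, so
it fans out to every shard) should remove all copies; the host, collection and
id below are placeholders:

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/mycollection/update?commit=true' \
  --data-binary '{"delete":{"query":"id:1"}}'

After that the document can be re-added with the correct route.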


Frequent Index Replication Failure in solr.

2020-11-03 Thread Parshant Kumar
Hi team,

We have a Solr architecture of *master -> repeater -> 3 slave servers.*

We do incremental indexing on the master server (every 20 min).
Replication of the index is done from master to repeater (every 10 mins)
and from the repeater to the 3 slave servers (every 3 hours).
*We are facing frequent replication failures both between master and repeater
and between the repeater and the slave servers.*
On checking the logs we found that one of the below exceptions
occurred every time the replication failed.

1)WARN : Error in fetching file: _4rnu_t.liv (downloaded 0 of 11505507
bytes)
java.io.EOFException: Unexpected end of ZLIB input stream
at
java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at
java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at
org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
at
org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:88)
at
org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:139)
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:160)
at
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1443)
at
org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1409)


2)
WARN : Error getting file length for [segments_568]
java.nio.file.NoSuchFileException:
/data/solr/search/application/core-conf/im-search/data/index.20200711012319226/segments_568
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at
sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
at
sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
at
sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
at java.nio.file.Files.readAttributes(Files.java:1737)
at java.nio.file.Files.size(Files.java:2332)
at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
at
org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:615)
at
org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:588)
at
org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:335)

3)
WARN : Error in fetching file: _4nji.nvd (downloaded 507510784 of 555377795
bytes)
org.apache.http.MalformedChunkCodingException: CRLF expected at end of chunk
at
org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:255)
at
org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
at
org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
at
org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at
java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
at
java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at
org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
at
org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:128)
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
at
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1458)
at
org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1409)
at
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1390)
at
org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:872)
at
org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:438)
at
org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)

*Replication configuration of master,repeater,slave's is given below:*



${enable.master:false}
commit
startup
00:00:10



*Commit Configuration master,repeater,slave's is given below :*

 
10false


Please help in finding the root cause of the replication failure. Let me
know if you have any queries.

Thanks

Parshant kumar










-- 



Re: Index Replication Failure

2020-10-20 Thread Parshant Kumar
Hi all, please check the details

On Sat, Oct 17, 2020 at 5:52 PM Parshant Kumar 
wrote:

>
>
> *Architecture is master->repeater->slave servers in hierarchy.*
>
> *One of the Below exceptions are occuring whenever replication fails.*
>
> 1)WARN : Error in fetching file: _4rnu_t.liv (downloaded 0 of 11505507
> bytes)
> java.io.EOFException: Unexpected end of ZLIB input stream
> at
> java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
> at
> java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
> at
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
> at
> org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:88)
> at
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:139)
> at
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
> at
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:160)
> at
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1443)
> at
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1409)
>
> 2)
> WARN : Error getting file length for [segments_568]
> java.nio.file.NoSuchFileException:
> /data/solr/search/application/core-conf/im-search/data/index.20200711012319226/segments_568
> at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at
> sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> at
> sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
> at
> sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> at java.nio.file.Files.readAttributes(Files.java:1737)
> at java.nio.file.Files.size(Files.java:2332)
> at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
> at
> org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:615)
> at
> org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:588)
> at
> org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:335)
>
>
> 3)
> WARN : Error in fetching file: _4nji.nvd (downloaded 507510784 of
> 555377795 bytes)
> org.apache.http.MalformedChunkCodingException: CRLF expected at end of
> chunk
> at
> org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:255)
> at
> org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
> at
> org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
> at
> org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
> at
> java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
> at
> java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
> at
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
> at
> org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:128)
> at
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
> at
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1458)
> at
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1409)
> at
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1390)
> at
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:872)
> at
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:438)
> at
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)
>
>
> *Replication configuration of master,repeater,slave's is given below:*
>
>  
> 
> ${enable.master:false}
> commit
> startup
> 00:00:10
> 
>
>
> *Commit Configuration master,repeater,slave's is given below :*
>
>  
> 10false
>
>
>
>
>
>
> On Sat, Oct 17, 2020 at 5:12 PM Erick Erickson 
> wrote:
>
>> None of your images made it through the mail server. You’ll
>> have to put them somewhere and provide a link.
>>
>> > On Oct 17, 2020, at 5:17 AM, Parshant Kumar <
>> parshant.ku...@indiamart.com.INVALID> wrote:
>> >
>> > Architecture image: If 

Re: Index Replication Failure

2020-10-17 Thread Parshant Kumar
*Architecture is master->repeater->slave servers in hierarchy.*

*One of the Below exceptions are occuring whenever replication fails.*

1)WARN : Error in fetching file: _4rnu_t.liv (downloaded 0 of 11505507
bytes)
java.io.EOFException: Unexpected end of ZLIB input stream
at
java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at
java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at
org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
at
org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:88)
at
org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:139)
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:160)
at
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1443)
at
org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1409)

2)
WARN : Error getting file length for [segments_568]
java.nio.file.NoSuchFileException:
/data/solr/search/application/core-conf/im-search/data/index.20200711012319226/segments_568
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at
sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
at
sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
at
sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
at java.nio.file.Files.readAttributes(Files.java:1737)
at java.nio.file.Files.size(Files.java:2332)
at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
at
org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:615)
at
org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:588)
at
org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:335)


3)
WARN : Error in fetching file: _4nji.nvd (downloaded 507510784 of 555377795
bytes)
org.apache.http.MalformedChunkCodingException: CRLF expected at end of chunk
at
org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:255)
at
org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
at
org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
at
org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at
java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
at
java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at
org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
at
org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:128)
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166)
at
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1458)
at
org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1409)
at
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1390)
at
org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:872)
at
org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:438)
at
org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)


*Replication configuration of master,repeater,slave's is given below:*

 

${enable.master:false}
commit
startup
00:00:10



*Commit Configuration master,repeater,slave's is given below :*

 
10false






On Sat, Oct 17, 2020 at 5:12 PM Erick Erickson 
wrote:

> None of your images made it through the mail server. You’ll
> have to put them somewhere and provide a link.
>
> > On Oct 17, 2020, at 5:17 AM, Parshant Kumar <
> parshant.ku...@indiamart.com.INVALID> wrote:
> >
> > Architecture image: If not visible in previous mail
> >
> >
> >
> >
> > On Sat, Oct 17, 2020 at 2:38 PM Parshant Kumar <
> parshant.ku...@indiamart.com> wrote:
> > Hi all,
> >
> > We are having solr architecture as below.
> >
> >
> >
> > We are facing the frequent replication failure between master to
> repeater server  as well as between repeater  to slave servers.
> > On checking logs found every time one of the below  exceptions occurred
> whenever the replication have failed.
> >
> > 1)
> >
> > 2)
> >
> >
> > 3)
> >
> >
> > The replica

Re: Index Replication Failure

2020-10-17 Thread Erick Erickson
None of your images made it through the mail server. You’ll
have to put them somewhere and provide a link.

> On Oct 17, 2020, at 5:17 AM, Parshant Kumar 
>  wrote:
> 
> Architecture image: If not visible in previous mail
> 
> 
> 
> 
> On Sat, Oct 17, 2020 at 2:38 PM Parshant Kumar  
> wrote:
> Hi all,
> 
> We are having solr architecture as below.
> 
> 
> 
> We are facing the frequent replication failure between master to repeater 
> server  as well as between repeater  to slave servers.
> On checking logs found every time one of the below  exceptions occurred 
> whenever the replication have failed. 
> 
> 1)
> 
> 2)
> 
> 
> 3)
> 
> 
> The replication configuration of master,repeater,slave's is given below:
> 
> 
> 
> Commit Configuration master,repeater,slave's is given below :
> 
> 
> 
> Replication between master and repeater occurs every 10 mins.
> Replication between repeater and slave servers occurs every 15 mins between 
> 4-7 am and after that in every 3 hours.
> 
> Thanks,
> Parshant Kumar
> 
> 
> 
> 
> 
> 
> 



Re: Index Replication Failure

2020-10-17 Thread Parshant Kumar
Architecture image: If not visible in previous mail

[image: image.png]


On Sat, Oct 17, 2020 at 2:38 PM Parshant Kumar 
wrote:

> Hi all,
>
> We are having solr architecture as below.
>
>
>
> *We are facing the frequent replication failure between master to repeater
> server  as well as between repeater  to slave servers.*
> On checking logs found every time one of the below  exceptions occurred
> whenever the replication have failed.
>
> 1)
> [image: image.png]
> 2)
> [image: image.png]
>
> 3)
> [image: image.png]
>
> The replication configuration of master,repeater,slave's is given below:
>
> [image: image.png]
>
> Commit Configuration master,repeater,slave's is given below :
>
> [image: image.png]
>
> Replication between master and repeater occurs every 10 mins.
> Replication between repeater and slave servers occurs every 15 mins
> between 4-7 am and after that in every 3 hours.
>
> Thanks,
> Parshant Kumar
>
>
>
>
>
>

-- 



Index Replication Failure

2020-10-17 Thread Parshant Kumar
Hi all,

We have the Solr architecture as below.



*We are facing frequent replication failures between the master and repeater
server, as well as between the repeater and slave servers.*
On checking the logs, we found that one of the below exceptions occurred
every time the replication failed.

1)
[image: image.png]
2)
[image: image.png]

3)
[image: image.png]

The replication configuration of master,repeater,slave's is given below:

[image: image.png]

Commit Configuration master,repeater,slave's is given below :

[image: image.png]

Replication between master and repeater occurs every 10 mins.
Replication between repeater and slave servers occurs every 15 mins between
4-7 am, and every 3 hours after that.

Thanks,
Parshant Kumar

-- 



Re: Index Deeply Nested documents and retrieve a full nested document in solr

2020-09-24 Thread Alexandre Rafalovitch
It is yes to both questions, but I am not sure if they play well
together for historical reasons.

For storing/parsing original JSON in any (custom) format:
https://lucene.apache.org/solr/guide/8_6/transforming-and-indexing-custom-json.html
(srcField parameter)
For indexing nested children (with named collections of subdocuments)
but in Solr's own JSON format:
https://lucene.apache.org/solr/guide/8_6/indexing-nested-documents.html

I am not sure if defining additional fields as per the second document
but indexing the first way will work together. A feedback on that
would be useful.
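A minimal sketch of the first (srcField) approach, assuming a collection called
mycollection, a stored _src_ field in the schema, and the JSON saved in trial.json
(all of those names are placeholders):

curl 'http://localhost:8983/solr/mycollection/update/json/docs?srcField=_src_&mapUniqueKeyOnly=true&df=text&commit=true' \
  -H 'Content-type:application/json' \
  --data-binary @trial.json

That stores the original JSON verbatim in _src_ and indexes the remaining values
into the default search field, which is the "keep the document, search it loosely"
end of the spectrum.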

Please also note that Solr is not intended to be the primary storage
(like a database). If you do atomic operations, the stored JSON will
get out of sync as it is not regenerated. Also, for the advanced
searches, you may want to normalize your data in different ways than
those your original data structure has. So, you may want to consider
an architecture where that JSON is stored separately or is retrieved
from original database and the Solr is focused on good search and
returning you just the record ID. That would actually allow you to
store a lot less in Solr (like just IDs) and focus on indexing in the
best way. Not saying it is the right way for your needs, just that is
a non-obvious architecture choice you may want to keep in mind as you
add Solr to your existing stack.

Regards,
   Alex.

On Thu, 24 Sep 2020 at 10:23, Abhay Kumar  wrote:
>
> Hello Team,
>
> Can someone please help to index the below sample json document into Solr.
>
> I have following queries on indexing multi level child document.
>
>
>   1.  Can we specify names to documents hierarchy such as "therapeuticareas" 
> or "sites" while indexing.
>   2.  How can we index document at multi-level hierarchy.
>
> I have following queries on retrieving the result.
>
>
>   1.  How can I retrieve result with full nested structure.
>
> [{
>"id": "NCT0102",
>"title": "Congenital Adrenal Hyperplasia: Calcium Channels as 
> Therapeutic Targets",
>"phase": "Phase 1/Phase 2",
>"status": "Completed",
>"studytype": "Interventional",
>"enrollmenttype": "",
>"sponsorname": ["National Center for Research Resources 
> (NCRR)"],
>"sponsorrole": ["lead"],
>"score": [0],
>"source": "National Center for Research Resources (NCRR)",
>"therapeuticareas": [{
>  "taid": "ta1",
>  "ta": "Lung Cancer",
>  "diseaseAreas": ["Oncology, 
> Respiratory tract diseases"],
>  "pubmeds": [{
> "pmbid": "pm1",
> "articleTitle": 
> "Consensus minimum data set for lung cancer multidisciplinary teams Results 
> of a Delphi process",
> "revisedDate": 
> "2018-12-11T18:30:00Z"
>  }],
>  "conferences": [{
> "confid": "conf1",
> "conferencename": 
> "American Academy of Neurology Annual Meeting",
> 
> "conferencetopic": "Avances en el manejo de los trastornos del movimiento 
> hipercineticos",
> "conferencedate": 
> "2019-05-08T18:30:00Z"
>  }]
>   },
>   {
>  "taid": "ta2",
>  "ta": "Breast Cancer",
>  "diseaseAreas": ["Oncology"],
>  "pubmeds": [],
>  "conferences": []
>   }
> 

Index Deeply Nested documents and retrieve a full nested document in solr

2020-09-24 Thread Abhay Kumar
Hello Team,

Can someone please help to index the below sample json document into Solr.

I have following queries on indexing multi level child document.


  1.  Can we specify names to documents hierarchy such as "therapeuticareas" or 
"sites" while indexing.
  2.  How can we index document at multi-level hierarchy.

I have following queries on retrieving the result.


  1.  How can I retrieve result with full nested structure.

[{
   "id": "NCT0102",
   "title": "Congenital Adrenal Hyperplasia: Calcium Channels as 
Therapeutic Targets",
   "phase": "Phase 1/Phase 2",
   "status": "Completed",
   "studytype": "Interventional",
   "enrollmenttype": "",
   "sponsorname": ["National Center for Research Resources (NCRR)"],
   "sponsorrole": ["lead"],
   "score": [0],
   "source": "National Center for Research Resources (NCRR)",
   "therapeuticareas": [{
 "taid": "ta1",
 "ta": "Lung Cancer",
 "diseaseAreas": ["Oncology, 
Respiratory tract diseases"],
 "pubmeds": [{
"pmbid": "pm1",
"articleTitle": 
"Consensus minimum data set for lung cancer multidisciplinary teams Results of 
a Delphi process",
"revisedDate": 
"2018-12-11T18:30:00Z"
 }],
 "conferences": [{
"confid": "conf1",
"conferencename": 
"American Academy of Neurology Annual Meeting",
"conferencetopic": 
"Avances en el manejo de los trastornos del movimiento hipercineticos",
"conferencedate": 
"2019-05-08T18:30:00Z"
 }]
  },
  {
 "taid": "ta2",
 "ta": "Breast Cancer",
 "diseaseAreas": ["Oncology"],
 "pubmeds": [],
 "conferences": []
  }
   ],

   "sites": [{
  "siteid": "site1",
  "type": "Hospital",
  "institutionname": "Methodist Health System",
  "country": "United States",
  "state": "Texas",
  "city": "Dallas",
  "zip": ""
   }],

   "investigators": [{
  "invid": "inv1",
  "investigatorname": "Bryan A Faller",
  "role": "Principal Investigator",
  "location": "",
  "score": ""
   }],

   "Drugs": [{
  "id": "11",
  "drugname": "Methotrexate",
  "activeIngredient": "Methotrexate Sodium"
   }]
}]

Thanks.
Abhay



Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Tim Casey
People usually want to do some analysis during index time.  This analysis
should be considered 'expensive', compared to any single query run.  You
can think of it as indexing every day, over a 86400 second day, vs a 200 ms
query time.

Normally, you want to index as honestly as possible.  That is, you want to
take what you are given and put it in the index they way it comes.  You do
this with a particular analyzer.  This produces a token stream, which is
then indexed.  (Solr does things way more complicated now, like two tokens
with the same index position and so on.  But a simple model to give a
foundational explanation.)

On the query side you can try all kinds of crazy things to find what you
want.  You can build synonyms at this point and query for them all.  You
can stem words, and query and so on.  You can build distance queries, two
words nearish to each other.

If you produce more tokens at index time, you are increasing the overall number of
documents returned, and assuming a single set of documents is the desired
search result, this will result in lower precision.  You will not always be
able to find the thing you want in the fixed set of early query results.
The only way to fix this is at index time.  It is much easier to make this
adjustment at query time.  Instead of stemming, make the query more exact
hopefully increasing precision.

This difference in cost leads, over the lifetime of a search universe, to a
tendency towards more complex queries and less complex indexing.

I would recommend avoiding indexing tricks for this reason.  If they are
required, and I am sure they are, then you may want to segment the queries
in such a way as to be able to answer over generation over the required
recall.  So, segment the differences by field.  Put time tokens in a time
field, so you don't get names of people 'june' while searching for 'jun',
for instance.

tim



On Thu, Sep 10, 2020 at 10:08 AM Walter Underwood 
wrote:

> It is very common for us to do more processing in the index analysis
> chain. In general, we do that when we want additional terms in the index to
> be searchable. Some examples:
>
> * synonyms: If the book title is “EMT” add “Emergency Medical Technician”.
> * ngrams: For prefix matching, generate all edge ngrams, for example for
> “french” add “f”, “fr” “fre”, “fren”, and “frenc”.
> * shingles: Make pairs, so the query “babysitter” can match “baby sitter”.
> * split on delimiters: break up compounds, so “baby sitter” can match
> “baby-sitter”. Do this before shingles and you get matches for
> “babysitter”, “baby-sitter”, and “baby sitter”.
> * remove HTML: we rarely see HTML in queries, but we never know when
> someone will get clever with the source text, sigh.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Sep 10, 2020, at 9:48 AM, Erick Erickson 
> wrote:
> >
> > When you want to do something different and index and query time. There,
> an answer that’s almost, but not quite, completely useless while being
> accurate ;)
> >
> > A concrete example is synonyms as have been mentioned. Say you have an
> index-time synonym definition of
> > A,B,C
> >
> > These three tokens will be “stacked” in the index wherever any of them
> are found.
> > A query "q=field:B” would find a document with any of the three tokens
> in the original. It would be wasteful for the query to be transformed into
> “q=field:(A B C)”…
> >
> > And take a very close look at WordDelimiterGraphFilterFactory. I’m
> pretty sure you’ll find the parameters are different. Say the parameters
> for the input 123-456-7890 cause WDGFF to add
> > 123, 456, 7890, 1234567890 to the index. Again, at query time you don’t
> need to repeat and have all of those tokens in the search itself.
> >
> > Best,
> > Erick
> >
> >> On Sep 10, 2020, at 12:41 PM, Alexandre Rafalovitch 
> wrote:
> >>
> >> There are a lot of different use cases and the separate analyzers for
> >> indexing and query is part of the Solr power. For example, you could
> >> apply ngram during indexing time to generate multiple substrings. But
> >> you don't want to do that during the query, because otherwise you are
> >> matching on 'shared prefix' instead of on what user entered. Thinking
> >> phone number directory where people may enter any suffix and you want
> >> to match it.
> >> See for example
> >>
> https://www.slideshare.net/arafalov/rapid-solr-schema-development-phone-directory
> >> , starting slide 16 onwards.
> >>
> >> Or, for non-production but fun use case:
> >>
> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55
> >> (s

Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Walter Underwood
It is very common for us to do more processing in the index analysis chain. In 
general, we do that when we want additional terms in the index to be 
searchable. Some examples:

* synonyms: If the book title is “EMT” add “Emergency Medical Technician”.
* ngrams: For prefix matching, generate all edge ngrams, for example for 
“french” add “f”, “fr” “fre”, “fren”, and “frenc”.
* shingles: Make pairs, so the query “babysitter” can match “baby sitter”.
* split on delimiters: break up compounds, so “baby sitter” can match 
“baby-sitter”. Do this before shingles and you get matches for “babysitter”, 
“baby-sitter”, and “baby sitter”.
* remove HTML: we rarely see HTML in queries, but we never know when someone 
will get clever with the source text, sigh.
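
A rough sketch of what that looks like in the schema (field name, synonym file and
gram sizes are just for illustration):

<fieldType name="text_search" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The query side stays plain; whatever the user types is matched against the extra
terms that were already generated at index time.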

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 10, 2020, at 9:48 AM, Erick Erickson  wrote:
> 
> When you want to do something different and index and query time. There, an 
> answer that’s almost, but not quite, completely useless while being accurate 
> ;)
> 
> A concrete example is synonyms as have been mentioned. Say you have an 
> index-time synonym definition of
> A,B,C
> 
> These three tokens will be “stacked” in the index wherever any of them are 
> found. 
> A query "q=field:B” would find a document with any of the three tokens in the 
> original. It would be wasteful for the query to be transformed into 
> “q=field:(A B C)”…
> 
> And take a very close look at WordDelimiterGraphFilterFactory. I’m pretty 
> sure you’ll find the parameters are different. Say the parameters for the 
> input 123-456-7890 cause WDGFF to add
> 123, 456, 7890, 1234567890 to the index. Again, at query time you don’t need 
> to repeat and have all of those tokens in the search itself.
> 
> Best,
> Erick
> 
>> On Sep 10, 2020, at 12:41 PM, Alexandre Rafalovitch  
>> wrote:
>> 
>> There are a lot of different use cases and the separate analyzers for
>> indexing and query is part of the Solr power. For example, you could
>> apply ngram during indexing time to generate multiple substrings. But
>> you don't want to do that during the query, because otherwise you are
>> matching on 'shared prefix' instead of on what user entered. Thinking
>> phone number directory where people may enter any suffix and you want
>> to match it.
>> See for example
>> https://www.slideshare.net/arafalov/rapid-solr-schema-development-phone-directory
>> , starting slide 16 onwards.
>> 
>> Or, for non-production but fun use case:
>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55
>> (search phonetically mapped Thai text in English).
>> 
>> Similarly, you may want to apply synonyms at query time only if you
>> want to avoid diluting some relevancy. Or at index type to normalize
>> spelling and help relevancy.
>> 
>> Or you may want to be doing some accent folding for sorting or
>> faceting (which uses indexed tokens).
>> 
>> Regards,
>>  Alex.
>> 
>> On Thu, 10 Sep 2020 at 11:19, Steven White  wrote:
>>> 
>>> Hi everyone,
>>> 
>>> In Solr's schema, I have come across field types that use a different logic
>>> for "index" than for "query".  To be clear, I"m talking about this block:
>>> 
>>>   >> positionIncrementGap="100">
>>> 
>>>  
>>> 
>>> 
>>>  
>>> 
>>>   
>>> 
>>> Why would one want to not use the same logic for both and simply use:
>>> 
>>>   >> positionIncrementGap="100">
>>> 
>>>  
>>> 
>>>   
>>> 
>>> What are real word use cases to use a different analyzer for index and
>>> query?
>>> 
>>> Thanks,
>>> 
>>> Steve
> 



Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Erick Erickson
When you want to do something different and index and query time. There, an 
answer that’s almost, but not quite, completely useless while being accurate ;)

A concrete example is synonyms as have been mentioned. Say you have an 
index-time synonym definition of
A,B,C

These three tokens will be “stacked” in the index wherever any of them are 
found. 
A query "q=field:B” would find a document with any of the three tokens in the 
original. It would be wasteful for the query to be transformed into “q=field:(A 
B C)”…

And take a very close look at WordDelimiterGraphFilterFactory. I’m pretty sure 
you’ll find the parameters are different. Say the parameters for the input 
123-456-7890 cause WDGFF to add
123, 456, 7890, 1234567890 to the index. Again, at query time you don’t need to 
repeat and have all of those tokens in the search itself.
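
A sketch of that asymmetry (the attribute values here are illustrative, not a
recommendation):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" preserveOriginal="1"/>
  <filter class="solr.FlattenGraphFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" preserveOriginal="0"/>
</analyzer>

With catenation on at index time, 123-456-7890 produces 123, 456, 7890 and
1234567890 in the index; the query side only splits, and any of those forms still
matches.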

Best,
Erick

> On Sep 10, 2020, at 12:41 PM, Alexandre Rafalovitch  
> wrote:
> 
> There are a lot of different use cases and the separate analyzers for
> indexing and query is part of the Solr power. For example, you could
> apply ngram during indexing time to generate multiple substrings. But
> you don't want to do that during the query, because otherwise you are
> matching on 'shared prefix' instead of on what user entered. Thinking
> phone number directory where people may enter any suffix and you want
> to match it.
> See for example
> https://www.slideshare.net/arafalov/rapid-solr-schema-development-phone-directory
> , starting slide 16 onwards.
> 
> Or, for non-production but fun use case:
> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55
> (search phonetically mapped Thai text in English).
> 
> Similarly, you may want to apply synonyms at query time only if you
> want to avoid diluting some relevancy. Or at index type to normalize
> spelling and help relevancy.
> 
> Or you may want to be doing some accent folding for sorting or
> faceting (which uses indexed tokens).
> 
> Regards,
>   Alex.
> 
> On Thu, 10 Sep 2020 at 11:19, Steven White  wrote:
>> 
>> Hi everyone,
>> 
>> In Solr's schema, I have come across field types that use a different logic
>> for "index" than for "query".  To be clear, I"m talking about this block:
>> 
>>> positionIncrementGap="100">
>>  
>>   
>>  
>>  
>>   
>>  
>>
>> 
>> Why would one want to not use the same logic for both and simply use:
>> 
>>> positionIncrementGap="100">
>>  
>>   
>>  
>>
>> 
>> What are real word use cases to use a different analyzer for index and
>> query?
>> 
>> Thanks,
>> 
>> Steve



Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Stavros Macrakis
I gave an example of why you might want to analyze the corpus differently
from the query just yesterday -- see
https://lucene.472066.n3.nabble.com/Lowercase-ing-everything-but-acronyms-td4462899.html

  -s

On Thu, Sep 10, 2020 at 11:19 AM Steven White  wrote:

> Hi everyone,
>
> In Solr's schema, I have come across field types that use a different logic
> for "index" than for "query".  To be clear, I"m talking about this block:
>
>  positionIncrementGap="100">
>   
>
>   
>   
>
>   
> 
>
> Why would one want to not use the same logic for both and simply use:
>
>  positionIncrementGap="100">
>   
>
>   
> 
>
> What are real word use cases to use a different analyzer for index and
> query?
>
> Thanks,
>
> Steve
>


Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Alexandre Rafalovitch
There are a lot of different use cases and the separate analyzers for
indexing and query is part of the Solr power. For example, you could
apply ngram during indexing time to generate multiple substrings. But
you don't want to do that during the query, because otherwise you are
matching on 'shared prefix' instead of on what user entered. Thinking
phone number directory where people may enter any suffix and you want
to match it.
See for example
https://www.slideshare.net/arafalov/rapid-solr-schema-development-phone-directory
, starting slide 16 onwards.

Or, for non-production but fun use case:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55
 (search phonetically mapped Thai text in English).

Similarly, you may want to apply synonyms at query time only if you
want to avoid diluting some relevancy. Or at index type to normalize
spelling and help relevancy.

Or you may want to be doing some accent folding for sorting or
faceting (which uses indexed tokens).

Regards,
   Alex.

On Thu, 10 Sep 2020 at 11:19, Steven White  wrote:
>
> Hi everyone,
>
> In Solr's schema, I have come across field types that use a different logic
> for "index" than for "query".  To be clear, I"m talking about this block:
>
>  positionIncrementGap="100">
>   
>
>   
>   
>
>   
> 
>
> Why would one want to not use the same logic for both and simply use:
>
>  positionIncrementGap="100">
>   
>
>   
> 
>
> What are real word use cases to use a different analyzer for index and
> query?
>
> Thanks,
>
> Steve


Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Thomas Corthals
Hi Steve

I have a real-world use case. We don't apply a synonym filter at index
time, but we do apply a managed synonym filter at query time. This allows
content managers to add new synonyms (or remove existing ones) "on the fly"
without having to reindex any documents.
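
A sketch of that setup (field type and managed resource names are just examples):

<fieldType name="text_managed" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ManagedSynonymGraphFilterFactory" managed="english"/>
  </analyzer>
</fieldType>

New mappings can then be PUT to /solr/<collection>/schema/analysis/synonyms/english
and picked up after a reload; since the synonyms only apply at query time, no
documents need to be reindexed.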

Thomas

On Thu, 10 Sep 2020 at 17:29, Dunham-Wilkie, Mike CITZ:EX <
mike.dunham-wil...@gov.bc.ca> wrote:

> Hi Steven,
>
> I can think of one case.  If we have an index of database table or column
> names, e.g., words like 'THIS_IS_A_TABLE_NAME', we may want to split the
> name at the underscores when indexing (as well as keep the original), since
> the individual parts might be significant and meaningful.  When querying,
> though, if the searcher types in THIS_IS_A_TABLE_NAME then they are likely
> looking for the whole string, so we wouldn't want to split it apart.
>
> There also seems to be a debate on whether the SYNONYM filter should be
> included on indexing, on querying, or on both.  Google "solr synonyms index
> vs query"
>
> Mike
>
> -Original Message-
> From: Steven White 
> Sent: September 10, 2020 8:19 AM
> To: solr-user@lucene.apache.org
> Subject: Why use a different analyzer for "index" and "query"?
>
>
>
> Hi everyone,
>
> In Solr's schema, I have come across field types that use a different
> logic for "index" than for "query".  To be clear, I"m talking about this
> block:
>
>  positionIncrementGap="100">
>   
>
>   
>   
>
>   
> 
>
> Why would one want to not use the same logic for both and simply use:
>
>  positionIncrementGap="100">
>   
>
>   
> 
>
> What are real word use cases to use a different analyzer for index and
> query?
>
> Thanks,
>
> Steve
>


RE: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Dunham-Wilkie, Mike CITZ:EX
Hi Steven, 

I can think of one case.  If we have an index of database table or column 
names, e.g., words like 'THIS_IS_A_TABLE_NAME', we may want to split the name 
at the underscores when indexing (as well as keep the original), since the 
individual parts might be significant and meaningful.  When querying, though, 
if the searcher types in THIS_IS_A_TABLE_NAME then they are likely looking for 
the whole string, so we wouldn't want to split it apart.

There also seems to be a debate on whether the SYNONYM filter should be 
included on indexing, on querying, or on both.  Google "solr synonyms index vs 
query"

Mike

-Original Message-
From: Steven White  
Sent: September 10, 2020 8:19 AM
To: solr-user@lucene.apache.org
Subject: Why use a different analyzer for "index" and "query"?



Hi everyone,

In Solr's schema, I have come across field types that use a different logic for 
"index" than for "query".  To be clear, I"m talking about this block:


  
   
  
  
   
  


Why would one want to not use the same logic for both and simply use:


  
   
  


What are real word use cases to use a different analyzer for index and query?

Thanks,

Steve


Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Steven White
Hi everyone,

In Solr's schema, I have come across field types that use different logic
for "index" than for "query".  To be clear, I'm talking about this block:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="..."/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="..."/>
  </analyzer>
</fieldType>

Why would one want to not use the same logic for both and simply use:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="..."/>
  </analyzer>
</fieldType>

What are real world use cases for using a different analyzer for index and
query?

Thanks,

Steve


Re: Real time index data

2020-08-26 Thread Jörn Franke
Maybe to add to this: additionally, try to batch the requests from the queue -
don't send them one by one, but take n items at a time.
On the Solr side, also look at the configuration of soft commits vs hard commits.
Soft commits determine how close to real time this is and can be.
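
As a rough illustration, the relevant solrconfig.xml section looks like this (the
interval values are placeholders to tune against your latency needs):

<autoCommit>
  <maxTime>60000</maxTime>          <!-- hard commit: flushes to disk, no new searcher -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>2000</maxTime>           <!-- soft commit: makes new documents searchable -->
</autoSoftCommit>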

> On 26.08.2020 at 11:36, Jörn Franke wrote:
> 
> You do not provide many details, but a queuing mechanism seems to be 
> appropriate for this use case.
> 
>> Am 26.08.2020 um 11:30 schrieb Tushar Arora :
>> 
>> Hi,
>> 
>> One of our use cases requires real time indexing of data in solr from DB.
>> Approximately, 30 rows are updated in a second in DB. And I also want these
>> to be updated in the index simultaneously.
>> Is the Queuing mechanism like Rabbitmq helpful in my case?
>> Please suggest the ways to achieve it.
>> 
>> Regards,
>> Tushar Arora


Re: Real time index data

2020-08-26 Thread Jörn Franke
You do not provide many details, but a queuing mechanism seems to be 
appropriate for this use case.

> On 26.08.2020 at 11:30, Tushar Arora wrote:
> 
> Hi,
> 
> One of our use cases requires real time indexing of data in solr from DB.
> Approximately, 30 rows are updated in a second in DB. And I also want these
> to be updated in the index simultaneously.
> Is the Queuing mechanism like Rabbitmq helpful in my case?
> Please suggest the ways to achieve it.
> 
> Regards,
> Tushar Arora


Real time index data

2020-08-26 Thread Tushar Arora
Hi,

One of our use cases requires real time indexing of data in solr from DB.
Approximately, 30 rows are updated in a second in DB. And I also want these
to be updated in the index simultaneously.
Is a queuing mechanism like RabbitMQ helpful in my case?
Please suggest the ways to achieve it.

Regards,
Tushar Arora


Re: How to forcefully open new searcher, in case when there is no change in Solr index

2020-08-10 Thread Erick Erickson
Are you also posting the same question as Akshay Murarka?
Please do not do this if so; use one e-mail address.


Would in-place updates serve your use-case better? See:
https://lucene.apache.org/solr/guide/8_1/updating-parts-of-documents.html
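
For reference, an atomic/in-place update is just a "set" on the field, e.g.
(collection, id and field name below are made up; for a true in-place update the
field must be non-indexed, non-stored, docValues=true):

curl -X POST -H 'Content-type:application/json' \
  'http://localhost:8983/solr/mycollection/update?commit=true' \
  --data-binary '[{"id":"doc1","popularity":{"set":42}}]'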

> On Aug 10, 2020, at 8:17 AM, raj.yadav  wrote:
> 
> I have a use case where none of the document in my solr index is changing but
> I still want to open a new searcher through the curl api. 
> 
> On executing the below curl command 
> curl
> "XXX.XX.XX.XXX:9744/solr/mycollection/update?openSearcher=true=true"
> it doesn't open a new searcher. 
> 
> Below is what I get in logs
> 2020-08-10 09:32:22.696 INFO  (qtp297786644-6824) [c:mycollection
> s:shard1_1_0 r:core_node6 x:mycollection_shard1_1_0_replica1]
> o.a.s.u.DirectUpdateHandler2 start
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> 2020-08-10 09:32:22.696 INFO  (qtp297786644-6819) [c:mycollection
> s:shard1_0_1 r:core_node5 x:mycollection_shard1_0_1_replica1]
> o.a.s.u.DirectUpdateHandler2 start
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> 2020-08-10 09:32:22.696 INFO  (qtp297786644-6829) [c:mycollection
> s:shard1_0_0 r:core_node4 x:mycollection_shard1_0_0_replica1]
> o.a.s.u.DirectUpdateHandler2 start
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> 2020-08-10 09:32:22.696 INFO  (qtp297786644-6824) [c:mycollection
> s:shard1_1_0 r:core_node6 x:mycollection_shard1_1_0_replica1]
> o.a.s.u.DirectUpdateHandler2 No uncommitted changes. Skipping IW.commit.
> 2020-08-10 09:32:22.696 INFO  (qtp297786644-6819) [c:mycollection
> s:shard1_0_1 r:core_node5 x:mycollection_shard1_0_1_replica1]
> o.a.s.u.DirectUpdateHandler2 No uncommitted changes. Skipping IW.commit.
> 2020-08-10 09:32:22.696 INFO  (qtp297786644-6766) [c:mycollection
> s:shard1_1_1 r:core_node7 x:mycollection_shard1_1_1_replica1]
> o.a.s.u.DirectUpdateHandler2 start
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> 2020-08-10 09:32:22.696 INFO  (qtp297786644-6829) [c:mycollection
> s:shard1_0_0 r:core_node4 x:mycollection_shard1_0_0_replica1]
> o.a.s.u.DirectUpdateHandler2 No uncommitted changes. Skipping IW.commit.
> 2020-08-10 09:32:22.696 INFO  (qtp297786644-6766) [c:mycollection
> s:shard1_1_1 r:core_node7 x:mycollection_shard1_1_1_replica1]
> o.a.s.u.DirectUpdateHandler2 No uncommitted changes. Skipping IW.commit.
> 2020-08-10 09:32:22.697 INFO  (qtp297786644-6824) [c:mycollection
> s:shard1_1_0 r:core_node6 x:mycollection_shard1_1_0_replica1]
> o.a.s.c.SolrCore SolrIndexSearcher has not changed - not re-opening:
> org.apache.solr.search.SolrIndexSearcher
> 2020-08-10 09:32:22.697 INFO  (qtp297786644-6819) [c:mycollection
> s:shard1_0_1 r:core_node5 x:mycollection_shard1_0_1_replica1]
> o.a.s.c.SolrCore SolrIndexSearcher has not changed - not re-opening:
> org.apache.solr.search.SolrIndexSearcher
> 2020-08-10 09:32:22.697 INFO  (qtp297786644-6829) [c:mycollection
> s:shard1_0_0 r:core_node4 x:mycollection_shard1_0_0_replica1]
> o.a.s.c.SolrCore SolrIndexSearcher has not changed - not re-opening:
> org.apache.solr.search.SolrIndexSearcher
> 2020-08-10 09:32:22.697 INFO  (qtp297786644-6824) [c:mycollection
> s:shard1_1_0 r:core_node6 x:mycollection_shard1_1_0_replica1]
> o.a.s.u.DirectUpdateHandler2 end_commit_flush
> 2020-08-10 09:32:22.697 INFO  (qtp297786644-6819) [c:mycollection
> s:shard1_0_1 r:core_node5 x:mycollection_shard1_0_1_replica1]
> o.a.s.u.DirectUpdateHandler2 end_commit_flush
> 2020-08-10 09:32:22.697 INFO  (qtp297786644-6829) [c:mycollection
> s:shard1_0_0 r:core_node4 x:mycollection_shard1_0_0_replica1]
> o.a.s.u.DirectUpdateHandler2 end_commit_flush
> 
> I don't want to do a complete reload of my collection.
> Is there any parameter that can be used to forcefully open a new searcher
> every time I do a commit with openSearcher=true. 
> 
> In our collection there are few ExternalFileField type and changes in the
> external file is not getting reflected on issuing commits (using the curl
> command mentioned above).
> 
> Thanks in advance for the help
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



How to forcefully open new searcher, in case when there is no change in Solr index

2020-08-10 Thread raj.yadav
I have a use case where none of the documents in my Solr index are changing, but
I still want to open a new searcher through the curl API.

On executing the below curl command
curl
"XXX.XX.XX.XXX:9744/solr/mycollection/update?openSearcher=true&commit=true"
it doesn't open a new searcher.

Below is what I get in logs
2020-08-10 09:32:22.696 INFO  (qtp297786644-6824) [c:mycollection
s:shard1_1_0 r:core_node6 x:mycollection_shard1_1_0_replica1]
o.a.s.u.DirectUpdateHandler2 start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2020-08-10 09:32:22.696 INFO  (qtp297786644-6819) [c:mycollection
s:shard1_0_1 r:core_node5 x:mycollection_shard1_0_1_replica1]
o.a.s.u.DirectUpdateHandler2 start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2020-08-10 09:32:22.696 INFO  (qtp297786644-6829) [c:mycollection
s:shard1_0_0 r:core_node4 x:mycollection_shard1_0_0_replica1]
o.a.s.u.DirectUpdateHandler2 start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2020-08-10 09:32:22.696 INFO  (qtp297786644-6824) [c:mycollection
s:shard1_1_0 r:core_node6 x:mycollection_shard1_1_0_replica1]
o.a.s.u.DirectUpdateHandler2 No uncommitted changes. Skipping IW.commit.
2020-08-10 09:32:22.696 INFO  (qtp297786644-6819) [c:mycollection
s:shard1_0_1 r:core_node5 x:mycollection_shard1_0_1_replica1]
o.a.s.u.DirectUpdateHandler2 No uncommitted changes. Skipping IW.commit.
2020-08-10 09:32:22.696 INFO  (qtp297786644-6766) [c:mycollection
s:shard1_1_1 r:core_node7 x:mycollection_shard1_1_1_replica1]
o.a.s.u.DirectUpdateHandler2 start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2020-08-10 09:32:22.696 INFO  (qtp297786644-6829) [c:mycollection
s:shard1_0_0 r:core_node4 x:mycollection_shard1_0_0_replica1]
o.a.s.u.DirectUpdateHandler2 No uncommitted changes. Skipping IW.commit.
2020-08-10 09:32:22.696 INFO  (qtp297786644-6766) [c:mycollection
s:shard1_1_1 r:core_node7 x:mycollection_shard1_1_1_replica1]
o.a.s.u.DirectUpdateHandler2 No uncommitted changes. Skipping IW.commit.
2020-08-10 09:32:22.697 INFO  (qtp297786644-6824) [c:mycollection
s:shard1_1_0 r:core_node6 x:mycollection_shard1_1_0_replica1]
o.a.s.c.SolrCore SolrIndexSearcher has not changed - not re-opening:
org.apache.solr.search.SolrIndexSearcher
2020-08-10 09:32:22.697 INFO  (qtp297786644-6819) [c:mycollection
s:shard1_0_1 r:core_node5 x:mycollection_shard1_0_1_replica1]
o.a.s.c.SolrCore SolrIndexSearcher has not changed - not re-opening:
org.apache.solr.search.SolrIndexSearcher
2020-08-10 09:32:22.697 INFO  (qtp297786644-6829) [c:mycollection
s:shard1_0_0 r:core_node4 x:mycollection_shard1_0_0_replica1]
o.a.s.c.SolrCore SolrIndexSearcher has not changed - not re-opening:
org.apache.solr.search.SolrIndexSearcher
2020-08-10 09:32:22.697 INFO  (qtp297786644-6824) [c:mycollection
s:shard1_1_0 r:core_node6 x:mycollection_shard1_1_0_replica1]
o.a.s.u.DirectUpdateHandler2 end_commit_flush
2020-08-10 09:32:22.697 INFO  (qtp297786644-6819) [c:mycollection
s:shard1_0_1 r:core_node5 x:mycollection_shard1_0_1_replica1]
o.a.s.u.DirectUpdateHandler2 end_commit_flush
2020-08-10 09:32:22.697 INFO  (qtp297786644-6829) [c:mycollection
s:shard1_0_0 r:core_node4 x:mycollection_shard1_0_0_replica1]
o.a.s.u.DirectUpdateHandler2 end_commit_flush

I don't want to do a complete reload of my collection.
Is there any parameter that can be used to forcefully open a new searcher
every time I do a commit with openSearcher=true?

In our collection there are a few ExternalFileField type fields, and changes in the
external file are not getting reflected on issuing commits (using the curl
command mentioned above).

Thanks in advance for the help



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Index files on Windows fileshare

2020-06-25 Thread Fiz N
Thanks Jason. Appreciate your response.

Thanks
Fiz N.

On Thu, Jun 25, 2020 at 5:42 AM Jason Gerlowski 
wrote:

> Hi Fiz,
>
> Since you're just looking for a POC solution, I think Solr's
> "bin/post" tool would probably help you achieve your first
> requirement.
>
> But I don't think "bin/post" gives you much control over the fields
> that get indexed - if you need the file path to be stored, you might
> be better off writing a small crawler in Java and using SolrJ to do
> the indexing.
>
> Good luck!
>
> Jason
>
> On Fri, Jun 19, 2020 at 9:34 AM Fiz N  wrote:
> >
> > Hello Solr experts,
> >
> > I am using standalone version of SOLR 8.5 on Windows machine.
> >
> > 1)  I want to index all types of files under different directory in the
> > file share.
> >
> > 2) I need to index  absolute path of the files and store it solr field. I
> > need that info so that end user can click and open the file(Pop-up)
> >
> > Could you please tell me how to go about this?
> > This is for POC purpose once we finalize the solution we would be further
> > going ahead with stable approach.
> >
> > Thanks
> > Fiz Nadian.
>


Re: Index files on Windows fileshare

2020-06-25 Thread Jason Gerlowski
Hi Fiz,

Since you're just looking for a POC solution, I think Solr's
"bin/post" tool would probably help you achieve your first
requirement.
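
For example, something along these lines (collection name and paths are
placeholders; since bin/post is a Unix shell script, on Windows the same tool is
usually run through post.jar):

bin/post -c mycollection /path/to/share

java -Dc=mycollection -Dauto=yes -Drecursive=yes -jar example\exampledocs\post.jar \\fileserver\share\docs

If I remember correctly, the tool uses each file's full path as the document id,
which may already get you part of the way to your second requirement.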

But I don't think "bin/post" gives you much control over the fields
that get indexed - if you need the file path to be stored, you might
be better off writing a small crawler in Java and using SolrJ to do
the indexing.

Good luck!

Jason

On Fri, Jun 19, 2020 at 9:34 AM Fiz N  wrote:
>
> Hello Solr experts,
>
> I am using standalone version of SOLR 8.5 on Windows machine.
>
> 1)  I want to index all types of files under different directory in the
> file share.
>
> 2) I need to index  absolute path of the files and store it solr field. I
> need that info so that end user can click and open the file(Pop-up)
>
> Could you please tell me how to go about this?
> This is for POC purpose once we finalize the solution we would be further
> going ahead with stable approach.
>
> Thanks
> Fiz Nadian.


Re: Index file on Windows fileshare..

2020-06-23 Thread Erick Erickson
The program I pointed you to should take about an hour to make work.

But otherwise, you can try the post tool:
https://lucene.apache.org/solr/guide/7_2/post-tool.html

Best,
Erick

> On Jun 23, 2020, at 8:45 AM, Fiz N  wrote:
> 
> Thanks Erick. Is there easy way of doing this? Index files from windows
> share folder to SOLR.
> This is for POC only.
> 
> Thanks
> Nadian.
> 
> On Mon, Jun 22, 2020 at 3:54 PM Erick Erickson 
> wrote:
> 
>> Consider running Tika in a client and indexing the docs to Solr.
>> At that point, you have total control over what’s indexed.
>> 
>> Here’s a skeletal program to get you started:
>> https://lucidworks.com/post/indexing-with-solrj/
>> 
>> Best,
>> Erick
>> 
>>> On Jun 22, 2020, at 1:21 PM, Fiz N  wrote:
>>> 
>>> Hello Solr experts,
>>> 
>>> I am using standalone version of SOLR 8.5 on Windows machine.
>>> 
>>> 1)  I want to index all types of files under different directory in the
>>> file share.
>>> 
>>> 2) I need to index  absolute path of the files and store it solr field. I
>>> need that info so that end user can click and open the file(Pop-up)
>>> 
>>> Could you please tell me how to go about this?
>>> This is for POC purpose once we finalize the solution we would be further
>>> going ahead with stable approach.
>>> 
>>> Thanks
>>> Fiz Nadian.
>> 
>> 



Re: Index file on Windows fileshare..

2020-06-23 Thread Fiz N
Thanks Erick. Is there an easy way of doing this? Indexing files from a Windows
share folder to Solr.
This is for POC only.

Thanks
Nadian.

On Mon, Jun 22, 2020 at 3:54 PM Erick Erickson 
wrote:

> Consider running Tika in a client and indexing the docs to Solr.
> At that point, you have total control over what’s indexed.
>
> Here’s a skeletal program to get you started:
> https://lucidworks.com/post/indexing-with-solrj/
>
> Best,
> Erick
>
> > On Jun 22, 2020, at 1:21 PM, Fiz N  wrote:
> >
> > Hello Solr experts,
> >
> > I am using standalone version of SOLR 8.5 on Windows machine.
> >
> > 1)  I want to index all types of files under different directory in the
> > file share.
> >
> > 2) I need to index  absolute path of the files and store it solr field. I
> > need that info so that end user can click and open the file(Pop-up)
> >
> > Could you please tell me how to go about this?
> > This is for POC purpose once we finalize the solution we would be further
> > going ahead with stable approach.
> >
> > Thanks
> > Fiz Nadian.
>
>


Re: Index file on Windows fileshare..

2020-06-22 Thread Erick Erickson
Consider running Tika in a client and indexing the docs to Solr. 
At that point, you have total control over what’s indexed.

Here’s a skeletal program to get you started:
https://lucidworks.com/post/indexing-with-solrj/
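
Not the code from that post, just a rough sketch of the idea (core name, UNC path
and field names are assumptions; SolrJ and Tika need to be on the classpath):

import java.io.InputStream;
import java.nio.file.*;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ShareIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/files").build();             // assumed core name
    AutoDetectParser parser = new AutoDetectParser();
    Files.walk(Paths.get("\\\\fileserver\\share\\docs"))          // assumed UNC path
        .filter(Files::isRegularFile)
        .forEach(path -> {
          try (InputStream in = Files.newInputStream(path)) {
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            parser.parse(in, handler, new Metadata());            // Tika text extraction
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", path.toAbsolutePath().toString());
            doc.addField("file_path_s", path.toAbsolutePath().toString());
            doc.addField("content_txt", handler.toString());
            solr.add(doc);
          } catch (Exception e) {
            System.err.println("Skipping " + path + ": " + e);
          }
        });
    solr.commit();
    solr.close();
  }
}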

Best,
Erick

> On Jun 22, 2020, at 1:21 PM, Fiz N  wrote:
> 
> Hello Solr experts,
> 
> I am using standalone version of SOLR 8.5 on Windows machine.
> 
> 1)  I want to index all types of files under different directory in the
> file share.
> 
> 2) I need to index  absolute path of the files and store it solr field. I
> need that info so that end user can click and open the file(Pop-up)
> 
> Could you please tell me how to go about this?
> This is for POC purpose once we finalize the solution we would be further
> going ahead with stable approach.
> 
> Thanks
> Fiz Nadian.



Index file on Windows fileshare..

2020-06-22 Thread Fiz N
Hello Solr experts,

I am using standalone version of SOLR 8.5 on Windows machine.

1) I want to index all types of files under different directories in the
file share.

2) I need to index the absolute path of the files and store it in a Solr field. I
need that info so that the end user can click and open the file (pop-up).

Could you please tell me how to go about this?
This is for POC purposes; once we finalize the solution we will go ahead with a
stable approach.

Thanks
Fiz Nadian.


Index files on Windows fileshare

2020-06-19 Thread Fiz N
Hello Solr experts,

I am using standalone version of SOLR 8.5 on Windows machine.

1) I want to index all types of files under different directories in the
file share.

2) I need to index the absolute path of the files and store it in a Solr field. I
need that info so that the end user can click and open the file (pop-up).

Could you please tell me how to go about this?
This is for POC purposes; once we finalize the solution we will go ahead with a
stable approach.

Thanks
Fiz Nadian.


Re: Solr 7.6 optimize index size increase

2020-06-17 Thread Erick Erickson
What Walter said. Although with Solr 7.6, unless you specify maxSegments 
explicitly,
you won’t create segments over the default 5G maximum.

And if you have in the past specified maxSegments so you have segments over 5G, 
optimize (again without specifying maxSegments) will do a “singleton merge” on 
them,
i.e. it’ll rewrite each large segment into a single new segment with all the 
deleted data
removed thus gradually shrinking it. This happens automatically if you delete
documents (update is a delete + add so counts), but you may have a significant
percentage of deleted docs in your index..
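
Concretely, that means issuing the optimize without a maxSegments parameter, e.g.
(collection name is illustrative):

curl 'http://localhost:8983/solr/mycollection/update?optimize=true'

or, if the goal is only to purge deleted documents, a commit with expungeDeletes,
which merges away segments carrying a high proportion of deletes:

curl 'http://localhost:8983/solr/mycollection/update?commit=true&expungeDeletes=true'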

Best,
Erick

> On Jun 17, 2020, at 12:39 PM, Walter Underwood  wrote:
> 
> From that short description, you should not be running optimize at all.
> 
> Just stop doing it. It doesn’t make that big a difference.
> 
> It may take your indexes a few weeks to get back to a normal state after the 
> forced merges.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jun 17, 2020, at 4:12 AM, Raveendra Yerraguntla 
>>  wrote:
>> 
>> Thank you David, Walt , Eric.
>> 1. First time bloated index generated , there is no disk space issue. one 
>> copy of index is 1/6 of disk capacity. we ran into disk capacity after more 
>> than 2  copies of bloated copies.2. Solr is upgraded from 5.*. in 5.* more 
>> than 5 segments is causing performance issue. Performance in 7.* is not 
>> measured for increasing segments. I will plan a PT to get optimum number. 
>> Application has incremental indexing multiple times in a work week.
>> I will keep you updated on the resolution.
>> Thanks again 
>>   On Tuesday, June 16, 2020, 07:34:26 PM EDT, Erick Erickson 
>>  wrote:  
>> 
>> It Depends (tm).
>> 
>> As of Solr 7.5, optimize is different. See: 
>> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
>> 
>> So, assuming you have _not_ specified maxSegments=1, any very large
>> segment (near 5G) that has _zero_ deleted documents won’t be merged.
>> 
>> So there are two scenarios:
>> 
>> 1> What Walter mentioned. The optimize process runs out of disk space
>>and leaves lots of crud around
>> 
>> 2> your “older segments” are just max-sized segments with zero deletions.
>> 
>> 
>> All that said… do you have demonstrable performance improvements after
>> optimizing? The entire name “optimize” is misleading, of course who
>> wouldn’t want an optimized index? In earlier versions of Solr (i.e. 4x),
>> it made quite a difference. In more recent Solr releases, it’s not as clear
>> cut. So before worrying about making optimize work, I’d recommend that
>> you do some performance tests on optimized and un-optimized indexes. 
>> If there are significant improvements, that’s one thing. Otherwise, it’s
>> a waste.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 16, 2020, at 5:36 PM, Walter Underwood  wrote:
>>> 
>>> For a full forced merge (mistakenly named “optimize”), the worst case disk 
>>> space
>>> is 3X the size of the index. It is common to need 2X the size of the index.
>>> 
>>> When I worked on Ultraseek Server 20+ years ago, it had the same merge 
>>> behavior.
>>> I implemented a disk space check that would refuse to merge if there wasn’t 
>>> enough
>>> free space. It would log an error and send an email to the admin.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On Jun 16, 2020, at 1:58 PM, David Hastings  
>>>> wrote:
>>>> 
>>>> I cant give you a 100% true answer but ive experienced this, and what
>>>> "seemed" to happen to me was that the optimize would start, and that will
>>>> drive the size up by 3 fold, and if you out of disk space in the process
>>>> the optimize will quit since, it cant optimize, and leave the live index
>>>> pieces in tact, so now you have the "current" index as well as the
>>>> "optimized" fragments
>>>> 
>>>> i cant say for certain thats what you ran into, but we found that if you
>>>> get an expanding disk it will keep growing and prevent this from happening,
>>>> then the index will contract and the disk will shrink back to only what it
>>>> needs.  saved me a lot of headaches not needing to ever worry about disk
>>>> space
>>>> 
>>>> On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla

Re: Solr 7.6 optimize index size increase

2020-06-17 Thread Walter Underwood
From that short description, you should not be running optimize at all.

Just stop doing it. It doesn’t make that big a difference.

It may take your indexes a few weeks to get back to a normal state after the 
forced merges.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2020, at 4:12 AM, Raveendra Yerraguntla 
>  wrote:
> 
> Thank you David, Walt , Eric.
> 1. First time bloated index generated , there is no disk space issue. one 
> copy of index is 1/6 of disk capacity. we ran into disk capacity after more 
> than 2  copies of bloated copies.2. Solr is upgraded from 5.*. in 5.* more 
> than 5 segments is causing performance issue. Performance in 7.* is not 
> measured for increasing segments. I will plan a PT to get optimum number. 
> Application has incremental indexing multiple times in a work week.
> I will keep you updated on the resolution.
> Thanks again 
>On Tuesday, June 16, 2020, 07:34:26 PM EDT, Erick Erickson 
>  wrote:  
> 
> It Depends (tm).
> 
> As of Solr 7.5, optimize is different. See: 
> https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> 
> So, assuming you have _not_ specified maxSegments=1, any very large
> segment (near 5G) that has _zero_ deleted documents won’t be merged.
> 
> So there are two scenarios:
> 
> 1> What Walter mentioned. The optimize process runs out of disk space
> and leaves lots of crud around
> 
> 2> your “older segments” are just max-sized segments with zero deletions.
> 
> 
> All that said… do you have demonstrable performance improvements after
> optimizing? The entire name “optimize” is misleading, of course who
> wouldn’t want an optimized index? In earlier versions of Solr (i.e. 4x),
> it made quite a difference. In more recent Solr releases, it’s not as clear
> cut. So before worrying about making optimize work, I’d recommend that
> you do some performance tests on optimized and un-optimized indexes. 
> If there are significant improvements, that’s one thing. Otherwise, it’s
> a waste.
> 
> Best,
> Erick
> 
>> On Jun 16, 2020, at 5:36 PM, Walter Underwood  wrote:
>> 
>> For a full forced merge (mistakenly named “optimize”), the worst case disk 
>> space
>> is 3X the size of the index. It is common to need 2X the size of the index.
>> 
>> When I worked on Ultraseek Server 20+ years ago, it had the same merge 
>> behavior.
>> I implemented a disk space check that would refuse to merge if there wasn’t 
>> enough
>> free space. It would log an error and send an email to the admin.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jun 16, 2020, at 1:58 PM, David Hastings  
>>> wrote:
>>> 
>>> I cant give you a 100% true answer but ive experienced this, and what
>>> "seemed" to happen to me was that the optimize would start, and that will
>>> drive the size up by 3 fold, and if you out of disk space in the process
>>> the optimize will quit since, it cant optimize, and leave the live index
>>> pieces in tact, so now you have the "current" index as well as the
>>> "optimized" fragments
>>> 
>>> i cant say for certain thats what you ran into, but we found that if you
>>> get an expanding disk it will keep growing and prevent this from happening,
>>> then the index will contract and the disk will shrink back to only what it
>>> needs.  saved me a lot of headaches not needing to ever worry about disk
>>> space
>>> 
>>> On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla
>>>  wrote:
>>> 
>>>> 
>>>> when optimize command is issued, the expectation after the completion of
>>>> optimization process is that the index size either decreases or at most
>>>> remain same. In solr 7.6 cluster with 50 plus shards, when optimize command
>>>> is issued, some of the shard's transient or older segment files are not
>>>> deleted. This is happening randomly across all shards. When unnoticed these
>>>> transient files makes disk full. Currently it is handled through monitors,
>>>> but question is what is causing the transient/older files remains there.
>>>> Are there any specific race conditions which laves the older files not
>>>> being deleted?
>>>> Any pointers around this will be helpful.
>>>> TIA
>> 



Re: Solr 7.6 optimize index size increase

2020-06-17 Thread Raveendra Yerraguntla
Thank you David, Walt, Eric.

1. The first time the bloated index was generated, there was no disk space issue. One
copy of the index is 1/6 of disk capacity. We ran into disk capacity limits only after
more than 2 bloated copies.

2. Solr was upgraded from 5.*. In 5.*, more than 5 segments was causing a performance
issue. Performance in 7.* has not been measured for an increasing number of segments.
I will plan a performance test (PT) to get the optimum number. The application does
incremental indexing multiple times in a work week.

I will keep you updated on the resolution.
Thanks again
On Tuesday, June 16, 2020, 07:34:26 PM EDT, Erick Erickson 
 wrote:  
 
 It Depends (tm).

As of Solr 7.5, optimize is different. See: 
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

So, assuming you have _not_ specified maxSegments=1, any very large
segment (near 5G) that has _zero_ deleted documents won’t be merged.

So there are two scenarios:

1> What Walter mentioned. The optimize process runs out of disk space
    and leaves lots of crud around

2> your “older segments” are just max-sized segments with zero deletions.


All that said… do you have demonstrable performance improvements after
optimizing? The entire name “optimize” is misleading, of course who
wouldn’t want an optimized index? In earlier versions of Solr (i.e. 4x),
it made quite a difference. In more recent Solr releases, it’s not as clear
cut. So before worrying about making optimize work, I’d recommend that
you do some performance tests on optimized and un-optimized indexes. 
If there are significant improvements, that’s one thing. Otherwise, it’s
a waste.

Best,
Erick

> On Jun 16, 2020, at 5:36 PM, Walter Underwood  wrote:
> 
> For a full forced merge (mistakenly named “optimize”), the worst case disk 
> space
> is 3X the size of the index. It is common to need 2X the size of the index.
> 
> When I worked on Ultraseek Server 20+ years ago, it had the same merge 
> behavior.
> I implemented a disk space check that would refuse to merge if there wasn’t 
> enough
> free space. It would log an error and send an email to the admin.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jun 16, 2020, at 1:58 PM, David Hastings  
>> wrote:
>> 
>> I cant give you a 100% true answer but ive experienced this, and what
>> "seemed" to happen to me was that the optimize would start, and that will
>> drive the size up by 3 fold, and if you out of disk space in the process
>> the optimize will quit since, it cant optimize, and leave the live index
>> pieces in tact, so now you have the "current" index as well as the
>> "optimized" fragments
>> 
>> i cant say for certain thats what you ran into, but we found that if you
>> get an expanding disk it will keep growing and prevent this from happening,
>> then the index will contract and the disk will shrink back to only what it
>> needs.  saved me a lot of headaches not needing to ever worry about disk
>> space
>> 
>> On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla
>>  wrote:
>> 
>>> 
>>> when optimize command is issued, the expectation after the completion of
>>> optimization process is that the index size either decreases or at most
>>> remain same. In solr 7.6 cluster with 50 plus shards, when optimize command
>>> is issued, some of the shard's transient or older segment files are not
>>> deleted. This is happening randomly across all shards. When unnoticed these
>>> transient files makes disk full. Currently it is handled through monitors,
>>> but question is what is causing the transient/older files remains there.
>>> Are there any specific race conditions which laves the older files not
>>> being deleted?
>>> Any pointers around this will be helpful.
>>> TIA
> 
  

Re: Solr 7.6 optimize index size increase

2020-06-16 Thread Erick Erickson
It Depends (tm).

As of Solr 7.5, optimize is different. See: 
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

So, assuming you have _not_ specified maxSegments=1, any very large
segment (near 5G) that has _zero_ deleted documents won’t be merged.

So there are two scenarios:

1> What Walter mentioned. The optimize process runs out of disk space
 and leaves lots of crud around

2> your “older segments” are just max-sized segments with zero deletions.


All that said… do you have demonstrable performance improvements after
optimizing? The entire name “optimize” is misleading, of course who
wouldn’t want an optimized index? In earlier versions of Solr (i.e. 4x),
it made quite a difference. In more recent Solr releases, it’s not as clear
cut. So before worrying about making optimize work, I’d recommend that
you do some performance tests on optimized and un-optimized indexes. 
If there are significant improvements, that’s one thing. Otherwise, it’s
a waste.

Best,
Erick

> On Jun 16, 2020, at 5:36 PM, Walter Underwood  wrote:
> 
> For a full forced merge (mistakenly named “optimize”), the worst case disk 
> space
> is 3X the size of the index. It is common to need 2X the size of the index.
> 
> When I worked on Ultraseek Server 20+ years ago, it had the same merge 
> behavior.
> I implemented a disk space check that would refuse to merge if there wasn’t 
> enough
> free space. It would log an error and send an email to the admin.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jun 16, 2020, at 1:58 PM, David Hastings  
>> wrote:
>> 
>> I cant give you a 100% true answer but ive experienced this, and what
>> "seemed" to happen to me was that the optimize would start, and that will
>> drive the size up by 3 fold, and if you out of disk space in the process
>> the optimize will quit since, it cant optimize, and leave the live index
>> pieces in tact, so now you have the "current" index as well as the
>> "optimized" fragments
>> 
>> i cant say for certain thats what you ran into, but we found that if you
>> get an expanding disk it will keep growing and prevent this from happening,
>> then the index will contract and the disk will shrink back to only what it
>> needs.  saved me a lot of headaches not needing to ever worry about disk
>> space
>> 
>> On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla
>>  wrote:
>> 
>>> 
>>> when optimize command is issued, the expectation after the completion of
>>> optimization process is that the index size either decreases or at most
>>> remain same. In solr 7.6 cluster with 50 plus shards, when optimize command
>>> is issued, some of the shard's transient or older segment files are not
>>> deleted. This is happening randomly across all shards. When unnoticed these
>>> transient files makes disk full. Currently it is handled through monitors,
>>> but question is what is causing the transient/older files remains there.
>>> Are there any specific race conditions which laves the older files not
>>> being deleted?
>>> Any pointers around this will be helpful.
>>> TIA
> 



Re: Solr 7.6 optimize index size increase

2020-06-16 Thread Walter Underwood
For a full forced merge (mistakenly named “optimize”), the worst case disk space
is 3X the size of the index. It is common to need 2X the size of the index.

When I worked on Ultraseek Server 20+ years ago, it had the same merge behavior.
I implemented a disk space check that would refuse to merge if there wasn’t 
enough
free space. It would log an error and send an email to the admin.
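
A rough sketch of that kind of guard around Solr today, assuming the index lives
under /var/solr/data and a collection named "mycollection" (both hypothetical):

INDEX_DIR=/var/solr/data
# current index size and free space on that filesystem, both in KB
INDEX_KB=$(du -sk "$INDEX_DIR" | cut -f1)
FREE_KB=$(df -Pk "$INDEX_DIR" | awk 'NR==2 {print $4}')
# refuse to merge unless roughly 3x the index size is free (the worst case above)
if [ "$FREE_KB" -lt $((3 * INDEX_KB)) ]; then
    echo "Refusing to optimize: less than 3x the index size free" >&2
    exit 1
fi
curl "http://localhost:8983/solr/mycollection/update?optimize=true"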

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 16, 2020, at 1:58 PM, David Hastings  
> wrote:
> 
> I cant give you a 100% true answer but ive experienced this, and what
> "seemed" to happen to me was that the optimize would start, and that will
> drive the size up by 3 fold, and if you out of disk space in the process
> the optimize will quit since, it cant optimize, and leave the live index
> pieces in tact, so now you have the "current" index as well as the
> "optimized" fragments
> 
> i cant say for certain thats what you ran into, but we found that if you
> get an expanding disk it will keep growing and prevent this from happening,
> then the index will contract and the disk will shrink back to only what it
> needs.  saved me a lot of headaches not needing to ever worry about disk
> space
> 
> On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla
>  wrote:
> 
>> 
>> when optimize command is issued, the expectation after the completion of
>> optimization process is that the index size either decreases or at most
>> remain same. In solr 7.6 cluster with 50 plus shards, when optimize command
>> is issued, some of the shard's transient or older segment files are not
>> deleted. This is happening randomly across all shards. When unnoticed these
>> transient files makes disk full. Currently it is handled through monitors,
>> but question is what is causing the transient/older files remains there.
>> Are there any specific race conditions which laves the older files not
>> being deleted?
>> Any pointers around this will be helpful.
>> TIA



Re: Solr 7.6 optimize index size increase

2020-06-16 Thread David Hastings
I can't give you a 100% certain answer, but I've experienced this. What
"seemed" to happen to me was that the optimize would start, which drives
the size up threefold, and if you run out of disk space in the process
the optimize quits (since it can't optimize) and leaves the live index
pieces intact. So now you have the "current" index as well as the
"optimized" fragments.

I can't say for certain that's what you ran into, but we found that if you
get an expanding disk it will keep growing and prevent this from happening;
then the index will contract and the disk will shrink back to only what it
needs. That saved me a lot of headaches by never needing to worry about disk
space.

On Tue, Jun 16, 2020 at 4:43 PM Raveendra Yerraguntla
 wrote:

>
> when optimize command is issued, the expectation after the completion of
> optimization process is that the index size either decreases or at most
> remain same. In solr 7.6 cluster with 50 plus shards, when optimize command
> is issued, some of the shard's transient or older segment files are not
> deleted. This is happening randomly across all shards. When unnoticed these
> transient files makes disk full. Currently it is handled through monitors,
> but question is what is causing the transient/older files remains there.
> Are there any specific race conditions which laves the older files not
> being deleted?
> Any pointers around this will be helpful.
>  TIA


Solr 7.6 optimize index size increase

2020-06-16 Thread Raveendra Yerraguntla

When the optimize command is issued, the expectation after the optimization
process completes is that the index size either decreases or at most remains
the same. In a Solr 7.6 cluster with 50-plus shards, when the optimize command is
issued, some of the shards' transient or older segment files are not deleted. This
happens randomly across all shards. When unnoticed, these transient files
fill the disk. Currently this is handled through monitors, but the question is what
is causing the transient/older files to remain there. Are there any specific race
conditions which leave the older files undeleted?
Any pointers around this will be helpful.
 TIA

Re: Index download speed while replicating is fixed at 5.1 in replication.html

2020-06-16 Thread Florin Babes
Hello,
The patch is to fix the display. It doesn't configure or limit the speed :)


On Tue, 16 Jun 2020 at 14:26, Shawn Heisey  wrote:

> On 6/14/2020 12:06 AM, Florin Babes wrote:
> > While checking ways to optimize the speed of replication I've noticed
> that
> > the index download speed is fixed at 5.1 in replication.html. There is a
> > reason for that? If not, I would like to submit a patch with the fix.
> > We are using solr 8.3.1.
>
> Looking at the replication.html file, the part that says "5.1 MB/s"
> appears to be purely display.  As far as I can tell, it's not
> configuring anything, and it's not gathering information from anywhere.
>
> So unless your solrconfig.xml is configuring a speed limit in the
> replication handler, I don't think there is one.
>
> I'm curious about exactly what you have in mind for a patch.
>
> Thanks,
> Shawn
>


Re: Index download speed while replicating is fixed at 5.1 in replication.html

2020-06-16 Thread Shawn Heisey

On 6/14/2020 12:06 AM, Florin Babes wrote:

While checking ways to optimize the speed of replication I've noticed that
the index download speed is fixed at 5.1 in replication.html. There is a
reason for that? If not, I would like to submit a patch with the fix.
We are using solr 8.3.1.


Looking at the replication.html file, the part that says "5.1 MB/s" 
appears to be purely display.  As far as I can tell, it's not 
configuring anything, and it's not gathering information from anywhere.


So unless your solrconfig.xml is configuring a speed limit in the 
replication handler, I don't think there is one.
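
(For completeness: if someone did want to throttle replication, my recollection
is that the ReplicationHandler takes a maxWriteMBPerSec setting. Something along
these lines in solrconfig.xml, though the exact placement is from memory, so
check the ref guide:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="maxWriteMBPerSec">16</str>
  </lst>
</requestHandler>

The 16 is just an illustrative value.)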


I'm curious about exactly what you have in mind for a patch.

Thanks,
Shawn


Index download speed while replicating is fixed at 5.1 in replication.html

2020-06-14 Thread Florin Babes
Hello,
While checking ways to optimize the speed of replication I've noticed that
the index download speed is fixed at 5.1 in replication.html. Is there a
reason for that? If not, I would like to submit a patch with the fix.
We are using solr 8.3.1.
Thanks,
Florin Babes


Re: index join without query criteria

2020-06-08 Thread Mikhail Khludnev
or probably -director_id:[* TO *]
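
If you want to keep the join but apply no extra restriction, a match-all query
inside the join should also work (untested against your schema):

fq={!join from=id fromIndex=movie_directors to=director_id}*:*

Or, if you only care that the field is populated on the movies themselves,
fq=director_id:[* TO *] with no join at all.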

On Mon, Jun 8, 2020 at 10:56 PM Hari Iyer  wrote:

> Hi,
>
> It appears that a query criteria is mandatory for a join. Taking this
> example from the documentation: fq={!join from=id fromIndex=movie_directors
> to=director_id}has_oscar:true. What if I want to find all movies that have
> a director (regardless of whether they have won an Oscar or not)? This
> query: fq={!join from=id fromIndex=movie_directors to=director_id} fails.
> Do I just have to make up a dummy criteria like fq={!join from=id
> fromIndex=movie_directors to=director_id}id:[* TO *]?
>
> Thanks,
> Hari.
>
>

-- 
Sincerely yours
Mikhail Khludnev


index join without query criteria

2020-06-08 Thread Hari Iyer
Hi,

It appears that a query criteria is mandatory for a join. Taking this example 
from the documentation: fq={!join from=id fromIndex=movie_directors 
to=director_id}has_oscar:true. What if I want to find all movies that have a 
director (regardless of whether they have won an Oscar or not)? This query: 
fq={!join from=id fromIndex=movie_directors to=director_id} fails. Do I just 
have to make up a dummy criteria like fq={!join from=id 
fromIndex=movie_directors to=director_id}id:[* TO *]?

Thanks,
Hari.



Re: Need help on handling large size of index.

2020-05-22 Thread Phill Campbell
Maybe your problems are in AWS land.


> On May 22, 2020, at 3:45 AM, Modassar Ather  wrote:
> 
> Thanks Erick and Phill.
> 
> We index data weekly once and that is why we do the optimisation and it has
> helped in faster query result. I will experiment with a fewer segments with
> the current hardware.
> The thing I am not  clear about is although there is no constant high usage
> of extra IOPs other than a couple of spike during optimisation why there is
> so much difference in optimisation time when there is extra IOPs vs no
> Extra IOPs.
> The optimisation on different datacenter machine which was of same
> configuration with SSD used to take 4-5 hours to optimise. This time to
> optimise is comparable to r5a.16xlarge with extra 3 IOPs time.
> 
> Best,
> Modassar
> 
> On Fri, May 22, 2020 at 12:56 AM Phill Campbell
>  wrote:
> 
>> The optimal size for a shard of the index is be definition what works best
>> on the hardware with the JVM heap that is in use.
>> More shards mean smaller sizes of the index for the shard as you already
>> know.
>> 
>> I spent months changing the sharing, the JVM heap, the GC values before
>> taking the system live.
>> RAM is important, and I run with enough to allow Solr to load the entire
>> index into RAM. From my understanding Solr uses the system to memory map
>> the index files. I might be wrong.
>> I experimented with less RAM and SSD drives and found that was another way
>> to get the performance I needed. Since RAM is cheaper, I choose that
>> approach.
>> 
>> Again we never optimize. When we have to recover we rebuild the index by
>> spinning up new machines and use a massive EMR (Map reduce job) to force
>> the data into the system. Takes about 3 hours. Solr can ingest data at an
>> amazing rate. Then we do a blue/green switch over.
>> 
>> Query time, from my experience with my environment, is improved with more
>> sharding and additional hardware. Not just more sharding on the same
>> hardware.
>> 
>> My fields are not stored either, except ID. There are some fields that are
>> indexed and have DocValues and those are used for sorting and facets. My
>> queries can have any number of wildcards as well, but my field’s data
>> lengths are maybe a maximum of 100 characters so proximity searching is not
>> too bad. I tokenize and index everything. I do not expand terms at query
>> time to get broader results, I index the alternatives and let the indexer
>> do what it does best.
>> 
>> If you are running in SolrCloud mode and you are using the embedded
>> zookeeper I would change that. Solr and ZK are very chatty with each other,
>> run ZK on machines in proximity to Solr.
>> 
>> Regards
>> 
>>> On May 21, 2020, at 2:46 AM, Modassar Ather 
>> wrote:
>>> 
>>> Thanks Phill for your response.
>>> 
>>> Optimal Index size: Depends on what you are optimizing for. Query Speed?
>>> Hardware utilization?
>>> We are optimising it for query speed. What I understand even if we set
>> the
>>> merge policy to any number the amount of hard disk will still be required
>>> for the bigger segment merges. Please correct me if I am wrong.
>>> 
>>> Optimizing the index is something I never do. We live with about 28%
>>> deletes. You should check your configuration for your merge policy.
>>> There is a delete of about 10-20% in our updates. We have no merge policy
>>> set in configuration as we do a full optimisation after the indexing.
>>> 
>>> Increased sharding has helped reduce query response time, but surely
>> there
>>> is a point where the colation of results starts to be the bottleneck.
>>> The query response time is my concern. I understand the aggregation of
>>> results may increase the search response time.
>>> 
>>> *What does your schema look like? I index around 120 fields per
>> document.*
>>> The schema has a combination of text and string fields. None of the field
>>> except Id field is stored. We also have around 120 fields. A few of them
>>> have docValues enabled.
>>> 
>>> *What does your queries look like? Mine are so varied that caching never
>>> helps, the same query rarely comes through.*
>>> Our search queries are combination of proximity, nested proximity and
>>> wildcards most of the time. The query can be very complex with 100s of
>>> wildcard and proximity terms in it. Different grouping option are also
>>> enabled on search result. And the search queries vary a lot.
>>&

Re: Need help on handling large size of index.

2020-05-22 Thread Modassar Ather
Thanks Erick and Phill.

We index data once a week, which is why we do the optimisation, and it has
helped produce faster query results. I will experiment with fewer segments on
the current hardware.
The thing I am not clear about is this: although there is no constant high usage
of the extra IOPs, other than a couple of spikes during optimisation, why is there
so much difference in optimisation time with extra IOPs vs. no
extra IOPs?
The optimisation on a different datacenter machine of the same
configuration with SSD used to take 4-5 hours. That time is comparable to the
r5a.16xlarge with extra 3 IOPs.

Best,
Modassar

On Fri, May 22, 2020 at 12:56 AM Phill Campbell
 wrote:

> The optimal size for a shard of the index is be definition what works best
> on the hardware with the JVM heap that is in use.
> More shards mean smaller sizes of the index for the shard as you already
> know.
>
> I spent months changing the sharing, the JVM heap, the GC values before
> taking the system live.
> RAM is important, and I run with enough to allow Solr to load the entire
> index into RAM. From my understanding Solr uses the system to memory map
> the index files. I might be wrong.
> I experimented with less RAM and SSD drives and found that was another way
> to get the performance I needed. Since RAM is cheaper, I choose that
> approach.
>
> Again we never optimize. When we have to recover we rebuild the index by
> spinning up new machines and use a massive EMR (Map reduce job) to force
> the data into the system. Takes about 3 hours. Solr can ingest data at an
> amazing rate. Then we do a blue/green switch over.
>
> Query time, from my experience with my environment, is improved with more
> sharding and additional hardware. Not just more sharding on the same
> hardware.
>
> My fields are not stored either, except ID. There are some fields that are
> indexed and have DocValues and those are used for sorting and facets. My
> queries can have any number of wildcards as well, but my field’s data
> lengths are maybe a maximum of 100 characters so proximity searching is not
> too bad. I tokenize and index everything. I do not expand terms at query
> time to get broader results, I index the alternatives and let the indexer
> do what it does best.
>
> If you are running in SolrCloud mode and you are using the embedded
> zookeeper I would change that. Solr and ZK are very chatty with each other,
> run ZK on machines in proximity to Solr.
>
> Regards
>
> > On May 21, 2020, at 2:46 AM, Modassar Ather 
> wrote:
> >
> > Thanks Phill for your response.
> >
> > Optimal Index size: Depends on what you are optimizing for. Query Speed?
> > Hardware utilization?
> > We are optimising it for query speed. What I understand even if we set
> the
> > merge policy to any number the amount of hard disk will still be required
> > for the bigger segment merges. Please correct me if I am wrong.
> >
> > Optimizing the index is something I never do. We live with about 28%
> > deletes. You should check your configuration for your merge policy.
> > There is a delete of about 10-20% in our updates. We have no merge policy
> > set in configuration as we do a full optimisation after the indexing.
> >
> > Increased sharding has helped reduce query response time, but surely
> there
> > is a point where the colation of results starts to be the bottleneck.
> > The query response time is my concern. I understand the aggregation of
> > results may increase the search response time.
> >
> > *What does your schema look like? I index around 120 fields per
> document.*
> > The schema has a combination of text and string fields. None of the field
> > except Id field is stored. We also have around 120 fields. A few of them
> > have docValues enabled.
> >
> > *What does your queries look like? Mine are so varied that caching never
> > helps, the same query rarely comes through.*
> > Our search queries are combination of proximity, nested proximity and
> > wildcards most of the time. The query can be very complex with 100s of
> > wildcard and proximity terms in it. Different grouping option are also
> > enabled on search result. And the search queries vary a lot.
> >
> > Oh, another thing, are you concerned about  availability? Do you have a
> > replication factor > 1? Do you run those replicas in a different region
> for
> > safety?
> > How many zookeepers are you running and where are they?
> > As of now we do not have any replication factor. We are not using
> zookeeper
> > ensemble but would like to move to it sooner.
> >
> > Best,
> > Modassar
> >

Re: Need help on handling large size of index.

2020-05-21 Thread Phill Campbell
The optimal size for a shard of the index is by definition what works best on
the hardware with the JVM heap that is in use.
More shards mean smaller sizes of the index for the shard, as you already know.

I spent months changing the sharding, the JVM heap, and the GC values before taking
the system live.
RAM is important, and I run with enough to allow Solr to load the entire index 
into RAM. From my understanding Solr uses the system to memory map the index 
files. I might be wrong.
I experimented with less RAM and SSD drives and found that was another way to 
get the performance I needed. Since RAM is cheaper, I choose that approach.

Again we never optimize. When we have to recover we rebuild the index by 
spinning up new machines and use a massive EMR (Map reduce job) to force the 
data into the system. Takes about 3 hours. Solr can ingest data at an amazing 
rate. Then we do a blue/green switch over.

Query time, from my experience with my environment, is improved with more 
sharding and additional hardware. Not just more sharding on the same hardware.

My fields are not stored either, except ID. There are some fields that are 
indexed and have DocValues and those are used for sorting and facets. My 
queries can have any number of wildcards as well, but my field’s data lengths 
are maybe a maximum of 100 characters so proximity searching is not too bad. I 
tokenize and index everything. I do not expand terms at query time to get 
broader results, I index the alternatives and let the indexer do what it does 
best.

If you are running in SolrCloud mode and you are using the embedded zookeeper I 
would change that. Solr and ZK are very chatty with each other, run ZK on 
machines in proximity to Solr.
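
(Pointing Solr at an external ensemble is just a startup flag, e.g.
bin/solr start -c -z "zk1:2181,zk2:2181,zk3:2181" with hypothetical host names.)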

Regards

> On May 21, 2020, at 2:46 AM, Modassar Ather  wrote:
> 
> Thanks Phill for your response.
> 
> Optimal Index size: Depends on what you are optimizing for. Query Speed?
> Hardware utilization?
> We are optimising it for query speed. What I understand even if we set the
> merge policy to any number the amount of hard disk will still be required
> for the bigger segment merges. Please correct me if I am wrong.
> 
> Optimizing the index is something I never do. We live with about 28%
> deletes. You should check your configuration for your merge policy.
> There is a delete of about 10-20% in our updates. We have no merge policy
> set in configuration as we do a full optimisation after the indexing.
> 
> Increased sharding has helped reduce query response time, but surely there
> is a point where the colation of results starts to be the bottleneck.
> The query response time is my concern. I understand the aggregation of
> results may increase the search response time.
> 
> *What does your schema look like? I index around 120 fields per document.*
> The schema has a combination of text and string fields. None of the field
> except Id field is stored. We also have around 120 fields. A few of them
> have docValues enabled.
> 
> *What does your queries look like? Mine are so varied that caching never
> helps, the same query rarely comes through.*
> Our search queries are combination of proximity, nested proximity and
> wildcards most of the time. The query can be very complex with 100s of
> wildcard and proximity terms in it. Different grouping option are also
> enabled on search result. And the search queries vary a lot.
> 
> Oh, another thing, are you concerned about  availability? Do you have a
> replication factor > 1? Do you run those replicas in a different region for
> safety?
> How many zookeepers are you running and where are they?
> As of now we do not have any replication factor. We are not using zookeeper
> ensemble but would like to move to it sooner.
> 
> Best,
> Modassar
> 
> On Thu, May 21, 2020 at 9:19 AM Shawn Heisey  wrote:
> 
>> On 5/20/2020 11:43 AM, Modassar Ather wrote:
>>> Can you please help me with following few questions?
>>> 
>>>- What is the ideal index size per shard?
>> 
>> We have no way of knowing that.  A size that works well for one index
>> use case may not work well for another, even if the index size in both
>> cases is identical.  Determining the ideal shard size requires
>> experimentation.
>> 
>> 
>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>> 
>>>- The optimisation takes lot of time and IOPs to complete. Will
>>>increasing the number of shards help in reducing the optimisation
>> time and
>>>IOPs?
>> 
>> No, changing the number of shards will not help with the time required
>> to optimize, and might make it slower.  Increasing the speed of the
>> disks won't help either.  Optimizing involves a lot more th

Re: Need help on handling large size of index.

2020-05-21 Thread Erick Erickson
Please consider _not_ optimizing. It’s kind of a misleading name anyway, and the
version of solr you’re using may have unintended consequences, see:

https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

There are situations where optimizing makes sense, but far too often people 
think
it’s A Good Thing (based almost entirely on the name, who _wouldn’t_ want an
optimized index?) without measuring, leading to tons of work to no real benefit.

Best,
Erick

> On May 21, 2020, at 4:58 AM, Modassar Ather  wrote:
> 
> Thanks Shawn for your response.
> 
> We have seen a performance increase in optimisation with a bigger number of
> IOPs. Without the IOPs we saw the optimisation took around 15-20 hours
> whereas the same index took 5-6 hours to optimise with higher IOPs.
> Yes the entire extra IOPs were never used to full other than a couple of
> spike in its usage. So not able to understand how the increased IOPs makes
> so much of difference.
> Can you please help me understand what it involves to optimise? Is it the
> more RAM/IOPs?
> 
> Search response time is very important. Please advise if we increase the
> shard with extra servers how much effect it may have on search response
> time.
> 
> Best,
> Modassar
> 
> On Thu, May 21, 2020 at 2:16 PM Modassar Ather 
> wrote:
> 
>> Thanks Phill for your response.
>> 
>> Optimal Index size: Depends on what you are optimizing for. Query Speed?
>> Hardware utilization?
>> We are optimising it for query speed. What I understand even if we set the
>> merge policy to any number the amount of hard disk will still be required
>> for the bigger segment merges. Please correct me if I am wrong.
>> 
>> Optimizing the index is something I never do. We live with about 28%
>> deletes. You should check your configuration for your merge policy.
>> There is a delete of about 10-20% in our updates. We have no merge policy
>> set in configuration as we do a full optimisation after the indexing.
>> 
>> Increased sharding has helped reduce query response time, but surely there
>> is a point where the colation of results starts to be the bottleneck.
>> The query response time is my concern. I understand the aggregation of
>> results may increase the search response time.
>> 
>> *What does your schema look like? I index around 120 fields per document.*
>> The schema has a combination of text and string fields. None of the field
>> except Id field is stored. We also have around 120 fields. A few of them
>> have docValues enabled.
>> 
>> *What does your queries look like? Mine are so varied that caching never
>> helps, the same query rarely comes through.*
>> Our search queries are combination of proximity, nested proximity and
>> wildcards most of the time. The query can be very complex with 100s of
>> wildcard and proximity terms in it. Different grouping option are also
>> enabled on search result. And the search queries vary a lot.
>> 
>> Oh, another thing, are you concerned about  availability? Do you have a
>> replication factor > 1? Do you run those replicas in a different region for
>> safety?
>> How many zookeepers are you running and where are they?
>> As of now we do not have any replication factor. We are not using
>> zookeeper ensemble but would like to move to it sooner.
>> 
>> Best,
>> Modassar
>> 
>> On Thu, May 21, 2020 at 9:19 AM Shawn Heisey  wrote:
>> 
>>> On 5/20/2020 11:43 AM, Modassar Ather wrote:
>>>> Can you please help me with following few questions?
>>>> 
>>>>- What is the ideal index size per shard?
>>> 
>>> We have no way of knowing that.  A size that works well for one index
>>> use case may not work well for another, even if the index size in both
>>> cases is identical.  Determining the ideal shard size requires
>>> experimentation.
>>> 
>>> 
>>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>> 
>>>>- The optimisation takes lot of time and IOPs to complete. Will
>>>>increasing the number of shards help in reducing the optimisation
>>> time and
>>>>    IOPs?
>>> 
>>> No, changing the number of shards will not help with the time required
>>> to optimize, and might make it slower.  Increasing the speed of the
>>> disks won't help either.  Optimizing involves a lot more than just
>>> copying data -- it will never use all the available disk bandwidth of
>>> m

Re: Need help on handling large size of index.

2020-05-21 Thread Modassar Ather
Thanks Shawn for your response.

We have seen a performance increase in optimisation with a bigger number of
IOPs. Without the extra IOPs the optimisation took around 15-20 hours,
whereas the same index took 5-6 hours to optimise with higher IOPs.
Yes, the extra IOPs were never fully used apart from a couple of
spikes, so I am not able to understand how the increased IOPs make
so much of a difference.
Can you please help me understand what optimisation involves? Is it mainly
more RAM/IOPs?

Search response time is very important. Please advise how much effect
increasing the number of shards on extra servers may have on search response
time.

Best,
Modassar

On Thu, May 21, 2020 at 2:16 PM Modassar Ather 
wrote:

> Thanks Phill for your response.
>
> Optimal Index size: Depends on what you are optimizing for. Query Speed?
> Hardware utilization?
> We are optimising it for query speed. What I understand even if we set the
> merge policy to any number the amount of hard disk will still be required
> for the bigger segment merges. Please correct me if I am wrong.
>
> Optimizing the index is something I never do. We live with about 28%
> deletes. You should check your configuration for your merge policy.
> There is a delete of about 10-20% in our updates. We have no merge policy
> set in configuration as we do a full optimisation after the indexing.
>
> Increased sharding has helped reduce query response time, but surely there
> is a point where the colation of results starts to be the bottleneck.
> The query response time is my concern. I understand the aggregation of
> results may increase the search response time.
>
> *What does your schema look like? I index around 120 fields per document.*
> The schema has a combination of text and string fields. None of the field
> except Id field is stored. We also have around 120 fields. A few of them
> have docValues enabled.
>
> *What does your queries look like? Mine are so varied that caching never
> helps, the same query rarely comes through.*
> Our search queries are combination of proximity, nested proximity and
> wildcards most of the time. The query can be very complex with 100s of
> wildcard and proximity terms in it. Different grouping option are also
> enabled on search result. And the search queries vary a lot.
>
> Oh, another thing, are you concerned about  availability? Do you have a
> replication factor > 1? Do you run those replicas in a different region for
> safety?
> How many zookeepers are you running and where are they?
> As of now we do not have any replication factor. We are not using
> zookeeper ensemble but would like to move to it sooner.
>
> Best,
> Modassar
>
> On Thu, May 21, 2020 at 9:19 AM Shawn Heisey  wrote:
>
>> On 5/20/2020 11:43 AM, Modassar Ather wrote:
>> > Can you please help me with following few questions?
>> >
>> > - What is the ideal index size per shard?
>>
>> We have no way of knowing that.  A size that works well for one index
>> use case may not work well for another, even if the index size in both
>> cases is identical.  Determining the ideal shard size requires
>> experimentation.
>>
>>
>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> > - The optimisation takes lot of time and IOPs to complete. Will
>> > increasing the number of shards help in reducing the optimisation
>> time and
>> > IOPs?
>>
>> No, changing the number of shards will not help with the time required
>> to optimize, and might make it slower.  Increasing the speed of the
>> disks won't help either.  Optimizing involves a lot more than just
>> copying data -- it will never use all the available disk bandwidth of
>> modern disks.  SolrCloud does optimizes of the shard replicas making up
>> a full collection sequentially, not simultaneously.
>>
>> > - We are planning to reduce each shard index size to 30GB and the
>> entire
>> > 3.5 TB index will be distributed across more shards. In this case
>> to almost
>> > 70+ shards. Will this help?
>>
>> Maybe.  Maybe not.  You'll have to try it.  If you increase the number
>> of shards without adding additional servers, I would expect things to
>> get worse, not better.
>>
>> > Kindly share your thoughts on how best we can use Solr with such a large
>> > index size.
>>
>> Something to keep in mind -- memory is the resource that makes the most
>> difference in performance.  Buying enough memory to get decent
>> performance out of an index that big would probably be very expensive.
>> You should probably explore ways to make your index smaller.  Another
>> idea is to split things up so the most frequently accessed search data
>> is in a relatively small index and lives on beefy servers, and data used
>> for less frequent or data-mining queries (where performance doesn't
>> matter as much) can live on less expensive servers.
>>
>> Thanks,
>> Shawn
>>
>


Re: Need help on handling large size of index.

2020-05-21 Thread Modassar Ather
Thanks Phill for your response.

Optimal Index size: Depends on what you are optimizing for. Query Speed?
Hardware utilization?
We are optimising it for query speed. My understanding is that even if we set the
merge policy to any number, the same amount of hard disk will still be required
for the bigger segment merges. Please correct me if I am wrong.

Optimizing the index is something I never do. We live with about 28%
deletes. You should check your configuration for your merge policy.
About 10-20% of our updates are deletes. We have no merge policy
set in the configuration, as we do a full optimisation after indexing.
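
For reference, with no explicit setting Solr just uses TieredMergePolicy with its
defaults; if we ever tune it instead of optimising, my understanding is that the
indexConfig section of solrconfig.xml takes something like this (values are only
illustrative):

<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicyFactory>
</indexConfig>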

Increased sharding has helped reduce query response time, but surely there
is a point where the colation of results starts to be the bottleneck.
The query response time is my concern. I understand the aggregation of
results may increase the search response time.

*What does your schema look like? I index around 120 fields per document.*
The schema has a combination of text and string fields. None of the fields
except the id field is stored. We also have around 120 fields. A few of them
have docValues enabled.

*What does your queries look like? Mine are so varied that caching never
helps, the same query rarely comes through.*
Our search queries are a combination of proximity, nested proximity and
wildcard terms most of the time. A query can be very complex, with 100s of
wildcard and proximity terms in it. Different grouping options are also
enabled on the search results, and the search queries vary a lot.

Oh, another thing, are you concerned about  availability? Do you have a
replication factor > 1? Do you run those replicas in a different region for
safety?
How many zookeepers are you running and where are they?
As of now we do not have any replication factor. We are not using zookeeper
ensemble but would like to move to it sooner.

Best,
Modassar

On Thu, May 21, 2020 at 9:19 AM Shawn Heisey  wrote:

> On 5/20/2020 11:43 AM, Modassar Ather wrote:
> > Can you please help me with following few questions?
> >
> > - What is the ideal index size per shard?
>
> We have no way of knowing that.  A size that works well for one index
> use case may not work well for another, even if the index size in both
> cases is identical.  Determining the ideal shard size requires
> experimentation.
>
>
> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> > - The optimisation takes lot of time and IOPs to complete. Will
> > increasing the number of shards help in reducing the optimisation
> time and
> > IOPs?
>
> No, changing the number of shards will not help with the time required
> to optimize, and might make it slower.  Increasing the speed of the
> disks won't help either.  Optimizing involves a lot more than just
> copying data -- it will never use all the available disk bandwidth of
> modern disks.  SolrCloud does optimizes of the shard replicas making up
> a full collection sequentially, not simultaneously.
>
> > - We are planning to reduce each shard index size to 30GB and the
> entire
> > 3.5 TB index will be distributed across more shards. In this case to
> almost
> > 70+ shards. Will this help?
>
> Maybe.  Maybe not.  You'll have to try it.  If you increase the number
> of shards without adding additional servers, I would expect things to
> get worse, not better.
>
> > Kindly share your thoughts on how best we can use Solr with such a large
> > index size.
>
> Something to keep in mind -- memory is the resource that makes the most
> difference in performance.  Buying enough memory to get decent
> performance out of an index that big would probably be very expensive.
> You should probably explore ways to make your index smaller.  Another
> idea is to split things up so the most frequently accessed search data
> is in a relatively small index and lives on beefy servers, and data used
> for less frequent or data-mining queries (where performance doesn't
> matter as much) can live on less expensive servers.
>
> Thanks,
> Shawn
>


Re: Need help on handling large size of index.

2020-05-20 Thread Shawn Heisey

On 5/20/2020 11:43 AM, Modassar Ather wrote:

Can you please help me with following few questions?

- What is the ideal index size per shard?


We have no way of knowing that.  A size that works well for one index 
use case may not work well for another, even if the index size in both 
cases is identical.  Determining the ideal shard size requires 
experimentation.


https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/


- The optimisation takes lot of time and IOPs to complete. Will
increasing the number of shards help in reducing the optimisation time and
IOPs?


No, changing the number of shards will not help with the time required 
to optimize, and might make it slower.  Increasing the speed of the 
disks won't help either.  Optimizing involves a lot more than just 
copying data -- it will never use all the available disk bandwidth of 
modern disks.  SolrCloud does optimizes of the shard replicas making up 
a full collection sequentially, not simultaneously.



- We are planning to reduce each shard index size to 30GB and the entire
3.5 TB index will be distributed across more shards. In this case to almost
70+ shards. Will this help?


Maybe.  Maybe not.  You'll have to try it.  If you increase the number 
of shards without adding additional servers, I would expect things to 
get worse, not better.



Kindly share your thoughts on how best we can use Solr with such a large
index size.


Something to keep in mind -- memory is the resource that makes the most 
difference in performance.  Buying enough memory to get decent 
performance out of an index that big would probably be very expensive. 
You should probably explore ways to make your index smaller.  Another 
idea is to split things up so the most frequently accessed search data 
is in a relatively small index and lives on beefy servers, and data used 
for less frequent or data-mining queries (where performance doesn't 
matter as much) can live on less expensive servers.


Thanks,
Shawn


Re: Need help on handling large size of index.

2020-05-20 Thread Phill Campbell
In my world your index size is common.

Optimal Index size: Depends on what you are optimizing for. Query Speed? 
Hardware utilization? 
Optimizing the index is something I never do. We live with about 28% deletes. 
You should check your configuration for your merge policy.
I run 120 shards, and I am currently redesigning for 256 shards.
Increased sharding has helped reduce query response time, but surely there is a
point where the collation of results starts to be the bottleneck.
I run the 120 shards on 90 r4.4xlarge instances with a replication factor of 3.

The things missing are:
What does your schema look like? I index around 120 fields per document.
What does your queries look like? Mine are so varied that caching never helps, 
the same query rarely comes through.
My system takes continuous updates, yours does not.

It is really up to you to experiment.

If you follow the development pattern of Design By Use (DBU) the first thing 
you do for solr and even for SQL is to come up with your queries first. Then 
design the schema. Then figure out how to distribute it for performance.

Oh, another thing, are you concerned about  availability? Do you have a 
replication factor > 1? Do you run those replicas in a different region for 
safety?
How many zookeepers are you running and where are they?

Lots of questions.

Regards

> On May 20, 2020, at 11:43 AM, Modassar Ather  wrote:
> 
> Hi,
> 
> Currently we have index of size 3.5 TB. These index are distributed across
> 12 shards under two cores. The size of index on each shards are almost
> equal.
> We do a delta indexing every week and optimise the index.
> 
> The server configuration is as follows.
> 
>   - Solr Version  : 6.5.1
>   - AWS instance type : r5a.16xlarge
>   - CPU(s)  : 64
>   - RAM  : 512GB
>   - EBS size  : 7 TB (For indexing as well as index optimisation.)
>   - IOPs  : 3 (For faster index optimisation)
> 
> 
> Can you please help me with following few questions?
> 
>   - What is the ideal index size per shard?
>   - The optimisation takes lot of time and IOPs to complete. Will
>   increasing the number of shards help in reducing the optimisation time and
>   IOPs?
>   - We are planning to reduce each shard index size to 30GB and the entire
>   3.5 TB index will be distributed across more shards. In this case to almost
>   70+ shards. Will this help?
>   - Will adding so many new shards increase the search response time and
>   possibly how much?
>   - If we have to increase the shards should we do it on a single larger
>   server or should do it on multiple small servers?
> 
> 
> Kindly share your thoughts on how best we can use Solr with such a large
> index size.
> 
> Best,
> Modassar



Need help on handling large size of index.

2020-05-20 Thread Modassar Ather
Hi,

Currently we have an index of size 3.5 TB. The index is distributed across
12 shards under two cores. The size of the index on each shard is almost
equal.
We do delta indexing every week and optimise the index.

The server configuration is as follows.

   - Solr Version  : 6.5.1
   - AWS instance type : r5a.16xlarge
   - CPU(s)  : 64
   - RAM  : 512GB
   - EBS size  : 7 TB (For indexing as well as index optimisation.)
   - IOPs  : 3 (For faster index optimisation)


Can you please help me with following few questions?

   - What is the ideal index size per shard?
    - The optimisation takes a lot of time and IOPs to complete. Will
    increasing the number of shards help in reducing the optimisation time and
    IOPs?
    - We are planning to reduce each shard's index size to 30GB, so the entire
    3.5 TB index will be distributed across more shards, in this case almost
    70+ shards. Will this help?
    - Will adding so many new shards increase the search response time and,
    if so, by how much?
    - If we have to increase the number of shards, should we do it on a single larger
    server or on multiple small servers?


Kindly share your thoughts on how best we can use Solr with such a large
index size.

Best,
Modassar


Re: Index using CSV file

2020-04-18 Thread Jörn Franke
Please also do not forget that you should create a schema in the Solr
collection so that the data is correctly indexed and you get fast and
correct query results.
I usually recommend reading one of the many Solr books out there to get
started. This will save you a lot of time.

> Am 18.04.2020 um 17:43 schrieb Jörn Franke :
> 
> 
> This you don’t do via the Solr UI. You have many choices amongst others 
> 1) write a client yourself that parses the csv and post it to the standard 
> Update handler 
> https://lucene.apache.org/solr/guide/8_4/uploading-data-with-index-handlers.html
> 2) use the Solr post tool 
> https://lucene.apache.org/solr/guide/8_4/post-tool.html
> 3) use a http client command line tool (eg curl) and post the data to the CSV 
> update handler: 
> https://lucene.apache.org/solr/guide/8_4/uploading-data-with-index-handlers.html
> 
> However, it would be useful to know what you exactly trying to achieve and 
> give more background on the project, what programming languages and 
> frameworks you (plan to) use etc to give you a more guided answer 
> 
>>> Am 18.04.2020 um 17:13 schrieb Shravan Kumar Bolla 
>>> :
>>> 
>> Hi,
>> 
>> I'm trying to import data from CSV file from Solr UI and I am completely new 
>> to Solr. Please provide the necessary configurations to achieve this.
>> 
>> 


Re: Index using CSV file

2020-04-18 Thread Jörn Franke
This you don’t do via the Solr UI. You have many choices; amongst others:
1) write a client yourself that parses the csv and post it to the standard 
Update handler 
https://lucene.apache.org/solr/guide/8_4/uploading-data-with-index-handlers.html
2) use the Solr post tool 
https://lucene.apache.org/solr/guide/8_4/post-tool.html
3) use a http client command line tool (eg curl) and post the data to the CSV 
update handler: 
https://lucene.apache.org/solr/guide/8_4/uploading-data-with-index-handlers.html

However, it would be useful to know what exactly you are trying to achieve, and to give
more background on the project, what programming languages and frameworks you
(plan to) use, etc., so you can get a more guided answer.
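
As a quick illustration of options 2 and 3, with a hypothetical collection name
"mycollection" and a local file data.csv:

# option 3: post the CSV straight to the update handler
curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
     -H 'Content-type: text/csv' --data-binary @data.csv

# option 2: the bundled post tool does the same thing from the Solr install dir
bin/post -c mycollection data.csv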

> Am 18.04.2020 um 17:13 schrieb Shravan Kumar Bolla 
> :
> 
> Hi,
> 
> I'm trying to import data from CSV file from Solr UI and I am completely new 
> to Solr. Please provide the necessary configurations to achieve this.
> 
> 


Index using CSV file

2020-04-18 Thread Shravan Kumar Bolla
Hi,

I'm trying to import data from a CSV file via the Solr UI, and I am completely new to
Solr. Please provide the necessary configuration to achieve this.




Re: ReversedWildcardFilter - should it be applied only at the index time?

2020-04-15 Thread TK Solr

It doesn't tell much:

"debug":{ "rawquerystring":"email:*@aol.com", "querystring":"email:*@aol.com", 
"parsedquery":"(email:*@aol.com)", "parsedquery_toString":"email:*@aol.com", 
"explain":{ "11d6e092-58b5-4c1b-83bc-f3b37e0797fd":{ "match":true, "value":1.0, 
"description":"email:*@aol.com"},


The email field uses ReversedWildcardFilter for both indexing and query.

On 4/15/20 12:04 PM, Erick Erickson wrote:

What do you see if you add &debug=query? That should tell you….

Best,
Erick


On Apr 15, 2020, at 2:40 PM, TK Solr  wrote:

Thank you.

Is there any harm if I use it on the query side too? In my case it seems working OK (even 
with withOriginal="false"), and even faster.
I see the query parser code is taking a look at index analyzer and applying 
ReversedWildcardFilter at query time. But I didn't
quite understand what happens if the query analyzer also uses 
ReversedWildcardFilter.

On 4/15/20 1:51 AM, Colvin Cowie wrote:

You only need apply it in the index analyzer:
https://lucene.apache.org/solr/8_4_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html
If it appears in the index analyzer, the query part of it is automatically
applied at query time.

The ReversedWildcardFilter indexes *every* token in reverse, with a special
character at the start ('\u0001' I believe) to avoid false positive matches
when the query term isn't reversed (e.g. if the term being indexed is mar,
then the reversed token would be \u0001ram, so a search for 'ram' wouldn't
accidentally match that). If *withOriginal* is set to true then it will
reverse the normal token as well as the reversed token.


On Thu, 9 Apr 2020 at 02:27, TK Solr  wrote:


I experimented with the index-time only use of ReversedWildcardFilter and
the
both time use.

My result shows using ReverseWildcardFilter both times runs twice as fast
but my
dataset is not very large (in the order of 10k docs), so I'm not sure if I
can
make a conclusion.

On 4/8/20 2:49 PM, TK Solr wrote:

In the usage example shown in ReversedWildcardFilter
<

https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#reversed-wildcard-filter>


in Solr Ref Guide,
and only usage find in managed-schema to define text_general_rev, the

filter

is used only for indexing.







maxPosQuestion="2"

maxFractionAsterisk="0.33" maxPosAsterisk="3" withOriginal="true"/>










Is it incorrect to use the same analyzer for query like?








maxPosQuestion="0"

maxFractionAsterisk="0" maxPosAsterisk="100" withOriginal="false"/>



In the description of filter, I see "Tokens without wildcards are not

reversed."

But the wildcard appears only in the query string. How can
ReversedWildcardFilter know if the wildcard is being used
if the filter is used only at the indexing time?

TK






Re: ReversedWildcardFilter - should it be applied only at the index time?

2020-04-15 Thread Erick Erickson
What do you see if you add &debug=query? That should tell you….
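
Something like this, with the collection name assumed:

curl 'http://localhost:8983/solr/yourcollection/select?q=email:*@aol.com&debug=query'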

Best,
Erick

> On Apr 15, 2020, at 2:40 PM, TK Solr  wrote:
> 
> Thank you.
> 
> Is there any harm if I use it on the query side too? In my case it seems 
> working OK (even with withOriginal="false"), and even faster.
> I see the query parser code is taking a look at index analyzer and applying 
> ReversedWildcardFilter at query time. But I didn't
> quite understand what happens if the query analyzer also uses 
> ReversedWildcardFilter.
> 
> On 4/15/20 1:51 AM, Colvin Cowie wrote:
>> You only need apply it in the index analyzer:
>> https://lucene.apache.org/solr/8_4_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html
>> If it appears in the index analyzer, the query part of it is automatically
>> applied at query time.
>> 
>> The ReversedWildcardFilter indexes *every* token in reverse, with a special
>> character at the start ('\u0001' I believe) to avoid false positive matches
>> when the query term isn't reversed (e.g. if the term being indexed is mar,
>> then the reversed token would be \u0001ram, so a search for 'ram' wouldn't
>> accidentally match that). If *withOriginal* is set to true then it will
>> reverse the normal token as well as the reversed token.
>> 
>> 
>> On Thu, 9 Apr 2020 at 02:27, TK Solr  wrote:
>> 
>>> I experimented with the index-time only use of ReversedWildcardFilter and
>>> the
>>> both time use.
>>> 
>>> My result shows using ReverseWildcardFilter both times runs twice as fast
>>> but my
>>> dataset is not very large (in the order of 10k docs), so I'm not sure if I
>>> can
>>> make a conclusion.
>>> 
>>> On 4/8/20 2:49 PM, TK Solr wrote:
>>>> In the usage example shown in ReversedWildcardFilter
>>>> <
>>> https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#reversed-wildcard-filter>
>>> 
>>>> in Solr Ref Guide,
>>>> and only usage find in managed-schema to define text_general_rev, the
>>> filter
>>>> is used only for indexing.
>>>> 
>>>> >>> positionIncrementGap="100">
>>>> 
>>>> 
>>>> >>> ignoreCase="true"/>
>>>> 
>>>> >> maxPosQuestion="2"
>>>> maxFractionAsterisk="0.33" maxPosAsterisk="3" withOriginal="true"/>
>>>> 
>>>> 
>>>> 
>>>> >>> ignoreCase="true" synonyms="synonyms.txt"/>
>>>> >>> ignoreCase="true"/>
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Is it incorrect to use the same analyzer for query like?
>>>> 
>>>> >>> positionIncrementGap="100">
>>>> 
>>>> 
>>>> 
>>>> 
>>>> >> maxPosQuestion="0"
>>>> maxFractionAsterisk="0" maxPosAsterisk="100" withOriginal="false"/>
>>>> 
>>>> 
>>>> 
>>>> In the description of filter, I see "Tokens without wildcards are not
>>> reversed."
>>>> But the wildcard appears only in the query string. How can
>>>> ReversedWildcardFilter know if the wildcard is being used
>>>> if the filter is used only at the indexing time?
>>>> 
>>>> TK
>>>> 
>>>> 



Re: ReversedWildcardFilter - should it be applied only at the index time?

2020-04-15 Thread TK Solr

Thank you.

Is there any harm if I use it on the query side too? In my case it seems to work
OK (even with withOriginal="false"), and it is even faster.
I see the query parser code takes a look at the index analyzer and applies
ReversedWildcardFilter at query time, but I didn't
quite understand what happens if the query analyzer also uses
ReversedWildcardFilter.


On 4/15/20 1:51 AM, Colvin Cowie wrote:

You only need apply it in the index analyzer:
https://lucene.apache.org/solr/8_4_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html
If it appears in the index analyzer, the query part of it is automatically
applied at query time.

The ReversedWildcardFilter indexes *every* token in reverse, with a special
character at the start ('\u0001' I believe) to avoid false positive matches
when the query term isn't reversed (e.g. if the term being indexed is mar,
then the reversed token would be \u0001ram, so a search for 'ram' wouldn't
accidentally match that). If *withOriginal* is set to true then it will
reverse the normal token as well as the reversed token.


On Thu, 9 Apr 2020 at 02:27, TK Solr  wrote:


I experimented with the index-time only use of ReversedWildcardFilter and
the
both time use.

My result shows using ReverseWildcardFilter both times runs twice as fast
but my
dataset is not very large (in the order of 10k docs), so I'm not sure if I
can
make a conclusion.

On 4/8/20 2:49 PM, TK Solr wrote:

In the usage example shown in ReversedWildcardFilter
<

https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#reversed-wildcard-filter>


in Solr Ref Guide,
and only usage find in managed-schema to define text_general_rev, the

filter

is used only for indexing.







maxPosQuestion="2"

maxFractionAsterisk="0.33" maxPosAsterisk="3" withOriginal="true"/>










Is it incorrect to use the same analyzer for query like?








maxPosQuestion="0"

maxFractionAsterisk="0" maxPosAsterisk="100" withOriginal="false"/>



In the description of filter, I see "Tokens without wildcards are not

reversed."

But the wildcard appears only in the query string. How can
ReversedWildcardFilter know if the wildcard is being used
if the filter is used only at the indexing time?

TK




Re: Solr index size has increased in solr 7.7.2

2020-04-15 Thread David Hastings
I wouldn't worry about the index size until you get above a half terabyte or
so.  Adding docValues and other features means you sacrifice things that
don't matter, like size.  Memory and SSDs are cheap.

On Wed, Apr 15, 2020 at 1:21 PM Rajdeep Sahoo 
wrote:

> Hi all
> We are migrating from solr 4.6 to solr 7.7.2.
> In solr 4.6 the size was 2.5 gb but here in solr 7.7.2 the solr index size
> is showing 6.8 gb with the same no of documents. Is it expected behavior or
> any suggestions how to optimize the size.
>


Solr index size has increased in solr 7.7.2

2020-04-15 Thread Rajdeep Sahoo
Hi all
We are migrating from Solr 4.6 to Solr 7.7.2.
In Solr 4.6 the index size was 2.5 GB, but in Solr 7.7.2 the index size
is showing 6.8 GB with the same number of documents. Is this expected behavior,
and are there any suggestions on how to optimize the size?


Re: ReversedWildcardFilter - should it be applied only at the index time?

2020-04-15 Thread Colvin Cowie
You only need apply it in the index analyzer:
https://lucene.apache.org/solr/8_4_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html
If it appears in the index analyzer, the query part of it is automatically
applied at query time.

The ReversedWildcardFilter indexes *every* token in reverse, with a special
character at the start ('\u0001' I believe) to avoid false positive matches
when the query term isn't reversed (e.g. if the term being indexed is mar,
then the reversed token would be \u0001ram, so a search for 'ram' wouldn't
accidentally match that). If *withOriginal* is set to true then it will
reverse the normal token as well as the reversed token.
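
You can also see exactly what gets indexed with the field analysis handler, e.g.
(collection name assumed):

curl 'http://localhost:8983/solr/yourcollection/analysis/field?analysis.fieldtype=text_general_rev&analysis.fieldvalue=mar'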


On Thu, 9 Apr 2020 at 02:27, TK Solr  wrote:

> I experimented with the index-time-only use of ReversedWildcardFilter and
> with using it at both index and query time.
>
> My results show that using ReversedWildcardFilter at both index and query
> time runs twice as fast, but my dataset is not very large (on the order of
> 10k docs), so I'm not sure I can draw a conclusion.
>
> On 4/8/20 2:49 PM, TK Solr wrote:
> > In the usage example shown for ReversedWildcardFilter in the Solr Ref Guide
> > <https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#reversed-wildcard-filter>,
> > and in the only usage I find in managed-schema (the text_general_rev
> > definition), the filter is used only for indexing:
> >
> > <fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="2"
> >             maxFractionAsterisk="0.33" maxPosAsterisk="3" withOriginal="true"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.SynonymGraphFilterFactory" ignoreCase="true" synonyms="synonyms.txt"/>
> >     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > Is it incorrect to use the same analyzer for query, like this?
> >
> > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="0"
> >             maxFractionAsterisk="0" maxPosAsterisk="100" withOriginal="false"/>
> >   </analyzer>
> > </fieldType>
> >
> > In the description of the filter, I see "Tokens without wildcards are not
> > reversed." But the wildcard appears only in the query string. How can
> > ReversedWildcardFilter know if the wildcard is being used, if the filter is
> > used only at indexing time?
> >
> > TK
> >
> >
>


Re: ReversedWildcardFilter - should it be applied only at the index time?

2020-04-08 Thread TK Solr
I experimented with the index-time-only use of ReversedWildcardFilter and
with using it at both index and query time.

My results show that using ReversedWildcardFilter at both index and query
time runs twice as fast, but my dataset is not very large (on the order of
10k docs), so I'm not sure I can draw a conclusion.


On 4/8/20 2:49 PM, TK Solr wrote:
In the usage example shown for ReversedWildcardFilter in the Solr Ref Guide
<https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#reversed-wildcard-filter>,
and in the only usage I find in managed-schema (the text_general_rev
definition), the filter is used only for indexing:

<fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="2"
            maxFractionAsterisk="0.33" maxPosAsterisk="3" withOriginal="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Is it incorrect to use the same analyzer for query, like this?

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="0"
            maxFractionAsterisk="0" maxPosAsterisk="100" withOriginal="false"/>
  </analyzer>
</fieldType>

In the description of the filter, I see "Tokens without wildcards are not
reversed." But the wildcard appears only in the query string. How can
ReversedWildcardFilter know if the wildcard is being used, if the filter is
used only at indexing time?

TK




ReversedWildcardFilter - should it be applied only at the index time?

2020-04-08 Thread TK Solr
In the usage example shown for ReversedWildcardFilter in the Solr Ref Guide
<https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#reversed-wildcard-filter>,
and in the only usage I find in managed-schema (the text_general_rev
definition), the filter is used only for indexing:

<fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="2"
            maxFractionAsterisk="0.33" maxPosAsterisk="3" withOriginal="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Is it incorrect to use the same analyzer for query, like this?

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="0"
            maxFractionAsterisk="0" maxPosAsterisk="100" withOriginal="false"/>
  </analyzer>
</fieldType>

In the description of the filter, I see "Tokens without wildcards are not
reversed." But the wildcard appears only in the query string. How can
ReversedWildcardFilter know if the wildcard is being used, if the filter is
used only at indexing time?

TK




RE: No files to download for index generation

2020-03-30 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
I wanted to ask *yet again* whether anyone could please clarify what this error 
means?

The wording could be interpreted as a benign "I found that there was nothing 
which needed to be done after all"; but were that to be the meaning of this 
error, why would it be flagged as an ERROR rather than as INFO or WARN?

Please advise


-Original Message-
From: Oakley, Craig (NIH/NLM/NCBI) [C] 
Sent: Wednesday, March 11, 2020 5:18 PM
To: solr-user@lucene.apache.org
Subject: RE: No files to download for index generation

I wanted to ask *again* whether anyone has any insight regarding this message

There seem to have been several people asking the question on this forum 
(Markus Jelsma on 8/23/19, Akreeti Agarwal on 12/27/19 and Vadim Ivanov on 
12/29/19)

The only response I have seen was five words from Erick Erickson on 12/27/19: 
"Not sure about that one"

Could someone please clarify what this error means?

The wording could be interpreted as a benign "I found that there was nothing 
which needed to be done after all"; but were that to be the meaning of this 
error, why would it be flagged as an ERROR rather than as INFO or WARN?


-Original Message-
From: Oakley, Craig (NIH/NLM/NCBI) [C] 
Sent: Monday, June 10, 2019 9:57 AM
To: solr-user@lucene.apache.org
Subject: RE: No files to download for index generation

Does anyone yet have any insight on interpreting the severity of this message?

-Original Message-
From: Oakley, Craig (NIH/NLM/NCBI) [C] 
Sent: Tuesday, June 04, 2019 4:07 PM
To: solr-user@lucene.apache.org
Subject: No files to download for index generation

We have occasionally been seeing an error such as the following:
2019-06-03 23:32:45.583 INFO  (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher Master's generation: 1424625
2019-06-03 23:32:45.583 INFO  (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher Master's version: 1559619115480
2019-06-03 23:32:45.583 INFO  (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher Slave's generation: 1424624
2019-06-03 23:32:45.583 INFO  (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher Slave's version: 1559619050130
2019-06-03 23:32:45.583 INFO  (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher Starting replication process
2019-06-03 23:32:45.587 ERROR (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher No files to download for index generation: 1424625

Is that last line actually an error as in "there SHOULD be files to download, 
but there are none"?

Or is it simply informative as in "there are no files to download, so we are 
all done here"?
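
One thing that may help narrow it down is to compare what both sides report
for the same core around the time of the error; a sketch, with host, port and
core name as placeholders:

curl "http://master-host:8983/solr/mycore/replication?command=details"
curl "http://slave-host:8983/solr/mycore/replication?command=details"

The details output includes each side's index version and generation, which
can show whether the slave believed it was behind when there was actually
nothing new to fetch.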


  1   2   3   4   5   6   7   8   9   10   >