Re: Order of applying tokens/filter

2020-10-05 Thread Jayadevan Maymala
>
> ICUNormalizer2CharFilterFactory name="nfkc_cf" (the default)
> WhitespaceTokenizerFactory
> SynonymGraphFilterFactory
> FlattenGraphFilterFactory
> KStemFilterFactory
> RemoveDuplicatesFilterFactory
>
One doubt related to this: ideally, the same sequence should be followed
for indexing and querying, right?
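
If it helps frame the question, here is a hedged sketch of the fieldType I
have in mind (the type and synonym file names are made up). The graph filters
seem to be the one documented exception: the reference guide applies
FlattenGraphFilterFactory at index time only, so the index and query chains
would differ in just that one filter:

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    <!-- index-time only: flattens the token graph produced by synonyms -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesFilterFactory"/>
  </analyzer>
</fieldType>
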
Regards,
Jayadevan


Re: Document centric external version conflict not returning 409

2020-10-05 Thread Deepu
Dear Team,
Any suggestions on the observation below?

Thanks & Regards,
Deepu

On Sun, Oct 4, 2020 at 7:57 PM Deepu  wrote:

> Hi Team,
>
> I am using Solr's document-centric external version configuration to
> control concurrent updates.
> I followed the sample configuration given in the GitHub path below.
>
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/solrconfig-externalversionconstraint.xml
>
> The document is not updated when the new document's external version is
> lower than the existing one, but the response status code is not 409; I
> still get status=0 in the update response. Do I need to configure anything
> else to get status=409 in the response?
>
> <updateRequestProcessorChain name="...">
>   <processor class="solr.DefaultValueUpdateProcessorFactory">
>     <str name="fieldName">live_b</str>
>     <bool name="value">true</bool>
>   </processor>
>
>   <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
>     <str name="versionField">_external_version_</str>
>     <str name="deleteVersionParam">_external_version_</str>
>   </processor>
>
>   <processor class="solr.DefaultValueUpdateProcessorFactory">
>     <str name="fieldName">live_b</str>
>     <bool name="value">false</bool>
>   </processor>
>
>   <processor class="solr.TimestampUpdateProcessorFactory">
>     <str name="fieldName">update_timestamp_tdt</str>
>   </processor>
>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
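>
> For completeness, my understanding (which may be wrong) is that this
> processor's ignoreOldUpdates option decides between the two behaviours:
> false (the default) rejects a stale update with an HTTP 409 version
> conflict, while true silently ignores it and reports success. A sketch of
> the processor with that option spelled out:
>
> <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
>   <str name="versionField">_external_version_</str>
>   <str name="deleteVersionParam">_external_version_</str>
>   <!-- false (default): reject stale updates with 409;
>        true: silently ignore them (status=0) -->
>   <bool name="ignoreOldUpdates">false</bool>
> </processor>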
>
>
> I am using Solr version 8.6.1 and SolrJ version 8.4.0.
>
> Thanks,
> Deepu
>
>
>


RE: Slow Solr 8 response for long query

2020-10-05 Thread Permakoff, Vadim
Hi Erick,
Thank you for looking into my question.

Below is the timing for Solr 6 and Solr 8. I see that the search time depends
on grouping: without grouping it is very fast and approximately the same for
both Solr 6 and Solr 8, but with grouping Solr 8 is much slower. The difference
grows with the number of returned results (groups). For 30 results the
difference is not that big, but for 300 results Solr 6 stays at almost the same
speed while Solr 8 is about 10 times slower. The data is the same, and the
indexing was done from scratch.
The documents are nested; we are searching children and grouping on a field
which may group children from different parents, though in this particular
case the groups are only from one parent.
This is the query example:
qt=/select=json=true=0=30=_text_sp_=VERY_LONG_BOOLEAN_QUERY_USING_SEVERAL_INDEXED_STRING_FIELDS_FROM_CHILDREN=OR=q=true=_nested_id:child=true=true=uniqueId=true=id,score=timing

Solr-8:
  "debug":{
"timing":{
  "time":22258.0,
  "prepare":{
"time":20.0,
"query":{
  "time":20.0},
"facet":{
  "time":0.0},
"facet_module":{
  "time":0.0},
"mlt":{
  "time":0.0},
"highlight":{
  "time":0.0},
"stats":{
  "time":0.0},
"expand":{
  "time":0.0},
"terms":{
  "time":0.0},
"debug":{
  "time":0.0}},
  "process":{
"time":22210.0,
"query":{
  "time":22210.0},
"facet":{
  "time":0.0},
"facet_module":{
  "time":0.0},
"mlt":{
  "time":0.0},
"highlight":{
  "time":0.0},
"stats":{
  "time":0.0},
"expand":{
  "time":0.0},
"terms":{
  "time":0.0},
"debug":{
  "time":0.0}

Solr-6:
  "debug":{
"timing":{
  "time":16157.0,
  "prepare":{
"time":14.0,
"query":{
  "time":14.0},
"facet":{
  "time":0.0},
"facet_module":{
  "time":0.0},
"mlt":{
  "time":0.0},
"highlight":{
  "time":0.0},
"stats":{
  "time":0.0},
"expand":{
  "time":0.0},
"terms":{
  "time":0.0},
"debug":{
  "time":0.0}},
  "process":{
"time":16133.0,
"query":{
  "time":16133.0},
"facet":{
  "time":0.0},
"facet_module":{
  "time":0.0},
"mlt":{
  "time":0.0},
"highlight":{
  "time":0.0},
"stats":{
  "time":0.0},
"expand":{
  "time":0.0},
"terms":{
  "time":0.0},
"debug":{
  "time":0.0}

Best Regards,
Vadim Permakoff


-----Original Message-----
From: Erick Erickson  
Sent: Wednesday, September 30, 2020 8:04 AM
To: solr-user@lucene.apache.org
Subject: Re: Slow Solr 8 response for long query

Increasing the number of rows should not have this kind of impact in either 
version of Solr, so I think there’s something fundamentally strange in your 
setup.

Whether returning 10 or 300 documents, every document has to be scored. There 
are two differences between 10 and 300 rows:

1> when returning 10 rows, Solr keeps a sorted list of 10 docs, just IDs and 
scores (assuming you're sorting by relevance); when returning 300, the list is 
300 long. I find it hard to believe that keeping a list 300 items long is 
making that much of a difference.

2> Solr needs to fetch/decompress/assemble 300 documents vs. 10 documents for 
the response. Regardless of the fields returned, the entire document will be 
decompressed if you return any fields that are not docValues=true. So it's 
possible that what you're seeing is related.
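
For illustration, a hedged schema sketch (not necessarily your actual
schema): if every field you request in fl is docValues=true, Solr can serve
the values from docValues and skip decompressing the stored document, which
is the cost in <2>:

<!-- with fl=id,score, serving id from docValues avoids the stored-field read -->
<field name="id" type="string" indexed="true" stored="false"
       docValues="true" useDocValuesAsStored="true"/>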

Try adding, as Alexandre suggests, debug=true to the query. Pay particular 
attention to the "timings" section too; that'll show you the time each 
component took _exclusive_ of step <2> above and should give a clue.


All that said, fq clauses don't score, so scoring is certainly involved in why 
the query takes so long to return even 10 rows but gets faster when you move 
the clause to a filter query; still, my intuition is that there's something 
else going on as well to account for the difference when you return 300 rows.

Best,
Erick

> On Sep 29, 2020, at 8:52 PM, Alexandre Rafalovitch  wrote:
>
> What do the debug versions of the query show between two versions?
>
> One thing that changed is the sow (split on whitespace) parameter, among 
> many. It is unlikely to be the cause, but I am mentioning it just in 
> case.
> https://lucene.apache.org/solr/guide/8_6/the-standard-query-parser.html#standard-query-parser-parameters
>
> Regards,
>    Alex

Reindexing major upgrades

2020-10-05 Thread Rafael Sousa
Hi all,

  I have a Solr 6.5 index that I should migrate to version 8.6.2.
Knowing that upgrading across more than one major version is now blocked
as of 8.6, what is the recommended way of porting an old 6.5 index to
8.6.2? Reindexing from scratch is not an option, so is there a way of
creating an 8.6.2 index from a pre-existing 6.5 index, or something like
that?

Thank you so much for your help.

Rafael


Re: MappingCharFilterFactory weird behaviour

2020-10-05 Thread Alexandre Rafalovitch
How do you know it does not apply?

My "Doh" moment is often forgetting that the stored version of a field is not
affected by analyzers. One has to look in the schema Admin UI to check the
indexed values.

Regards,
   Alex

On Mon., Oct. 5, 2020, 6:01 a.m. Lukas Brune wrote:

> Hello!
>
> I'm having some trouble using MappingCharFilterFactory in my schema.
> We're using it to replace some escaped HTML entities
> so HTMLStripCharFilterFactory can take care of those.
>
> When testing this out in the Analysis screen it works perfectly; however,
> when adding documents to Solr, the mapping doesn't seem to apply.
>
> We're currently copying some other fields into the field with the
> replacements, so it's a multiValued field. (I don't know if that makes a
> difference.)
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     // other stuff
>   </analyzer>
> </fieldType>
>
> <field name="..." type="..." multiValued="true" required="false"
>  termVectors="true" termPositions="true" termOffsets="true"/>
>
>
> Best Regards,
> *Lukas Brune* | Machine Learning Engineer & Web Developer | Comintelli AB
> lukas.br...@comintelli.com| Mobile:+46(0)706229823 |
> www.intelligence2day.com
>
> 
>


MappingCharFilterFactory weird behaviour

2020-10-05 Thread Lukas Brune
Hello!

I'm having some trouble using MappingCharFilterFactory in my schema.
We're using it to replace some escaped HTML entities
so HTMLStripCharFilterFactory can take care of those.

When testing this out in the Analysis screen it works perfectly; however,
when adding documents to Solr, the mapping doesn't seem to apply.

We're currently copying some other fields into the field with the
replacements, so it's a multiValued field. (I don't know if that makes a
difference.)

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    // other stuff
  </analyzer>
</fieldType>

<field name="..." type="..." multiValued="true" required="false"
 termVectors="true" termPositions="true" termOffsets="true"/>
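
For reference, the mappings in mapping.txt are along these lines
(simplified):

"&lt;" => "<"
"&gt;" => ">"
"&amp;" => "&"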


Best Regards,
*Lukas Brune* | Machine Learning Engineer & Web Developer | Comintelli AB
lukas.br...@comintelli.com| Mobile:+46(0)706229823 |
www.intelligence2day.com




Re: Solr 7.7 - Few Questions

2020-10-05 Thread Charlie Hull
Nested docs would be one approach; result grouping might be another. 
Regarding JOINs, the only way you're going to know is by some 
representative testing.
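
For illustration only (field names invented), the two query-time shapes
might look like:

q=text:widget&group=true&group.field=thread_id        (result grouping)
q={!parent which="doc_type:parent"}text:widget        (block join on nested docs)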


Charlie

On 05/10/2020 05:49, Rahul Goswami wrote:

Charlie,
Thanks for providing an alternate approach to doing this. It would be
interesting to know how one could go about organizing the docs in this
case (nested documents?), and how join queries would perform on a large
index (200 million+ docs).

Thanks,
Rahul



On Fri, Oct 2, 2020 at 5:55 AM Charlie Hull  wrote:


Hi Rahul,

In addition to the wise advice below: remember in Solr, a 'document' is
just the name for the thing that would appear as one of the results when
you search (analogous to a database record). It's not the same
conceptually as a 'Word document' or a 'PDF document'. If your source
documents are so big, consider how they might be broken into parts, or
whether you really need to index all of them for retrieval purposes, or
what parts of them need to be extracted as text. Thus, the Solr
documents don't necessarily need to be as large as your source documents.

Consider an email of size 20kb with ten PDF attachments, each 20MB. You
probably shouldn't push all this data into a single Solr document, but
you *could* index them as 11 separate Solr documents, with metadata
to indicate that one is an email and ten are PDFs, and a shared ID of
some kind to indicate they're related. Then at query time there are
various ways for you to group these together, so for example if the
query hit one of the PDFs you could show the user the original email,
plus the 9 other attachments, using the shared ID as a key.
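
A hedged sketch of those 11 documents (field names invented):

<add>
  <doc>
    <field name="id">msg-123</field>
    <field name="doc_type">email</field>
    <field name="thread_id">msg-123</field>
    <field name="text">body of the email...</field>
  </doc>
  <!-- one of the ten attachments; the other nine follow the same pattern -->
  <doc>
    <field name="id">msg-123-att-01</field>
    <field name="doc_type">pdf</field>
    <field name="thread_id">msg-123</field>
    <field name="text">text extracted from the PDF...</field>
  </doc>
</add>

Grouping on thread_id at query time (group=true&group.field=thread_id) is
then one way to pull an email and its attachments back together.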



HTH,



Charlie



On 02/10/2020 01:53, Rahul Goswami wrote:


Manisha,
In addition to what Shawn has mentioned above, I would also like you to
reevaluate your use case. Do you *need to* index the whole document? E.g.
if it's an email, the body of the email *might* be more important than any
attachments, in which case you could choose to only index the email body
and ignore (or only partially index) the text from attachments. If you
could afford to index the documents partially, you could consider Solr's
"Limit token count filter"; see the link below.

https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter

You'll need to configure it in the schema for the "index" analyzer for the
data type of the field with large text.
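
As a hedged sketch (the field type is invented and the 10,000 cap is
arbitrary), the filter sits at the end of the index analyzer of that type:

<fieldType name="text_limited" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index no more than the first 10,000 tokens of each field value -->
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
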
Indexing documents on the order of half a GB will definitely come back to
hurt your operations, if not now then later (think OOM, extremely slow
atomic updates, long-running merges etc.).
- Rahul
On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey  wrote:

On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
> We are using Apache Solr 7.7 on Windows platform. The data is synced to
> Solr using Solr.Net commit. The data is being synced to SOLR in batches.
> The document size is very huge (~0.5GB average) and solr indexing is
> taking a long time. Total document size is ~200GB. As the solr commit is
> done as a part of the API, the API calls are failing as document indexing
> is not completed.

A single document is five hundred megabytes?  What kind of documents do
you have?  You can't even index something that big without tweaking
configuration parameters that most people don't even know about.
Assuming you can even get it working, there's no way that indexing a
document like that is going to be fast.

> 1.  What is your advice on syncing such a large volume of data to Solr KB.

What is "KB"?  I have never heard of this in relation to Solr.

> 2.  Because of the search requirements, almost 8 fields are defined as
> Text fields.

I can't figure out what you are trying to say with this statement.

> 3.  Currently SOLR_JAVA_MEM is set to 2gb. Is that enough for such a
> large volume of data?

If just one of the documents you're sending to Solr really is five
hundred megabytes, then 2 gigabytes would probably be just barely enough
to index one document into an empty index ... and it would probably be
doing garbage collection so frequently that it would make things REALLY
slow.  I have no way to predict how much heap you will need.  That will
require experimentation.  I can tell you that 2GB is definitely not
enough.
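
If you do stay on Windows, the heap is normally raised in solr.in.cmd (or
with the -m flag to bin\solr start). A sketch only -- the 8g figure is an
arbitrary illustration, not a sizing recommendation:

REM solr.in.cmd
set SOLR_JAVA_MEM=-Xms8g -Xmx8g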


> 4.  How to set up Solr in production on Windows? Currently it's set up as
> a standalone engine and the client is requested to take a backup of the
> drive. Is there any other better way to do this? How to set up for
> disaster recovery?

I would suggest NOT doing it on Windows.  My reasons for that come down
to costs -- a Windows Server license isn't cheap.
That said, there's nothing wrong with running on Windows, but you're on
your own as far as running it as a service.  We only have a service
installer for UNIX-type systems.  Most of the testing for that is done
on Linux.

> 5.  How to benchmark the system requirements for such huge data

I do not know what all your needs are, so I have no way to answer this.
You're 

Re: Order of applying tokens/filter

2020-10-05 Thread Jayadevan Maymala
> ICUNormalizer2CharFilterFactory name="nfkc_cf" (the default)
> WhitespaceTokenizerFactory
> SynonymGraphFilterFactory
> FlattenGraphFilterFactory
> KStemFilterFactory
> RemoveDuplicatesFilterFactory
>
Thanks a lot. Very useful insights.

Regards,
Jayadevan