Re: Solr 8.0.0 - CPU usage 100% when indexed documents

2019-04-09 Thread vishal patel
Hi,

After your suggestion I changed the code:

String SOLR_URL = "http://localhost:7991/solr/actionscomments";
SolrClient solrClient = new HttpSolrClient.Builder(SOLR_URL).build();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", "ACTC6401895");
solrClient.add(document);
solrClient.commit();

My CPU usage still went high. My machine has 4 cores and no other application
is running on it.

After a lot of trying, I found the issue below.
Before, solrconfig.xml (6.1.0):

  <autoCommit>
    <maxTime>60</maxTime>
    <maxDocs>2</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>

After the change, solrconfig.xml (8.0.0):

  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <maxDocs>2</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>

Actually I am upgrading Solr 6.1.0 to 8.0.0. In 6.1.0 it works fine with
autoCommit maxTime 60, but in 8.0.0 the CPU usage goes high [the
commitScheduler thread runs for a long time].

Please give me more details on why this is happening in Solr 8.0.0.
Is there any mistake on my side? I attached solrconfig.xml in my previous
mail, so please verify it.


Sent from Outlook

From: Shawn Heisey 
Sent: Tuesday, April 9, 2019 1:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 8.0.0 - CPU usage 100% when indexed documents

On 4/8/2019 11:00 PM, vishal patel wrote:
> Sorry my mistake there is no class of that.
>
> I have add the data using below code.
> CloudSolrServer cloudServer = new CloudSolrServer(zkHost);
> cloudServer.setDefaultCollection("actionscomments");
> cloudServer.setParallelUpdates(true);
> List<SolrInputDocument> docs = new ArrayList<>();
> SolrInputDocument solrDocument = new SolrInputDocument();
> solrDocument.addField("id", "123");
> docs.add(solrDocument);
> cloudServer.add(docs, 1000);

Side note:  This code is not using SolrJ 8.0.0.  CloudSolrServer was
deprecated in version 5.0.0 and completely removed in version 6.0.0.
I'm surprised this code even works at all with Solr 8.0.0 -- you need to
upgrade to SolrJ 8 and use CloudSolrClient.
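
A minimal SolrJ 8 sketch of the same indexing with CloudSolrClient (the
zkHost value here is an assumption -- use your own ZooKeeper connect string):

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexOneDoc {
    public static void main(String[] args) throws Exception {
        String zkHost = "localhost:2181";  // assumption: your ZK connect string
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList(zkHost), Optional.empty()).build()) {
            client.setDefaultCollection("actionscomments");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "123");
            client.add(doc, 1000);  // commitWithin 1000 ms, as in your code
        }
    }
}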

How long does the system remain at 100 percent CPU when you index that
single document that only has one field?  If it's longer than a very
small fraction of a second, then my guess is that it's cache warming
queries using the CPU, not the indexing itself.

How many CPU cores are at 100 percent?  Is it just one, or multiple?  It
would be odd for it to be multiple, unless there is other activity going
on at the same time.

Thanks,
Shawn


Re: Which fieldType to use for JSON Array in Solr 6.5.0?

2019-04-09 Thread Abhijit Pawar
Can parent-child relationship be used in this scenario?
Anyone?

I see it needs an update handler:
https://lucene.apache.org/solr/guide/6_6/transforming-and-indexing-custom-json.html#transforming-and-indexing-custom-json


curl 'http://localhost:8983/solr/my_collection/update/json/docs?split=/|/orgs' \
  -H 'Content-type:application/json' -d '{
    "name": "Joe Smith",
    "phone": 876876687,
    "orgs": [
      { "name": "Microsoft", "city": "Seattle", "zip": 98052 },
      { "name": "Apple", "city": "Cupertino", "zip": 95014 }
    ]
  }'


On Tue, Apr 9, 2019 at 3:28 PM Shawn Heisey  wrote:

> On 4/9/2019 2:04 PM, Abhijit Pawar wrote:
> > Hello Guys,
> >
> > I am trying to index a JSON array in one of my collections in mongoDB in
> > Solr 6.5.0 however it is not getting indexed.
> >
> > I am using a DataImportHandler for this.
> >
> > *Here's how the data looks in mongoDB:*
> > {
> >   "idStr" : "5ca38e407b154dac08913a96",
> >   "sampleAttr" : "sampleAttrVal",
> >   "additionalInfo" : [
> >     {
> >       "name" : "Manufacturer",
> >       "value" : "Videocon"
> >     }
> >   ]
> > }
>
> That is not a structure that Solr knows how to handle.  Essentially what
> you have there is one document nested inside another.  Each of Solr's
> documents has a completely flat structure -- there is no possibility of
> a hierarchy within a single document.
>
> Solr does have support for parent/child documents, but it wouldn't be
> indexed like that.  I know almost nothing about how the parent/child
> document support works.  You would have to get help from someone else or
> consult the documentation.
>
> Thanks,
> Shawn
>


Re: Which fieldType to use for JSON Array in Solr 6.5.0?

2019-04-09 Thread David Hastings
Exactly, Solr is a search index, not a data store.  You need to flatten
your relationships.  Right tool for the job, etc.
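
For example, one possible flattening of the document from the original mail
into plain multivalued fields (the field names are just one choice):

{
  "idStr": "5ca38e407b154dac08913a96",
  "sampleAttr": "sampleAttrVal",
  "additionalInfo.name": ["Manufacturer"],
  "additionalInfo.value": ["Videocon"]
}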

On Tue, Apr 9, 2019 at 4:28 PM Shawn Heisey  wrote:

> On 4/9/2019 2:04 PM, Abhijit Pawar wrote:
> > Hello Guys,
> >
> > I am trying to index a JSON array in one of my collections in mongoDB in
> > Solr 6.5.0 however it is not getting indexed.
> >
> > I am using a DataImportHandler for this.
> >
> > *Here's how the data looks in mongoDB:*
> > {
> >   "idStr" : "5ca38e407b154dac08913a96",
> >   "sampleAttr" : "sampleAttrVal",
> >   "additionalInfo" : [
> >     {
> >       "name" : "Manufacturer",
> >       "value" : "Videocon"
> >     }
> >   ]
> > }
>
> That is not a structure that Solr knows how to handle.  Essentially what
> you have there is one document nested inside another.  Each of Solr's
> documents has a completely flat structure -- there is no possibility of
> a hierarchy within a single document.
>
> Solr does have support for parent/child documents, but it wouldn't be
> indexed like that.  I know almost nothing about how the parent/child
> document support works.  You would have to get help from someone else or
> consult the documentation.
>
> Thanks,
> Shawn
>


Re: Which fieldType to use for JSON Array in Solr 6.5.0?

2019-04-09 Thread Shawn Heisey

On 4/9/2019 2:04 PM, Abhijit Pawar wrote:

Hello Guys,

I am trying to index a JSON array in one of my collections in mongoDB in
Solr 6.5.0 however it is not getting indexed.

I am using a DataImportHandler for this.

*Here's how the data looks in mongoDB:*
{
  "idStr" : "5ca38e407b154dac08913a96",
  "sampleAttr" : "sampleAttrVal",
  "additionalInfo" : [
    {
      "name" : "Manufacturer",
      "value" : "Videocon"
    }
  ]
}


That is not a structure that Solr knows how to handle.  Essentially what 
you have there is one document nested inside another.  Each of Solr's 
documents has a completely flat structure -- there is no possibility of 
a hierarchy within a single document.


Solr does have support for parent/child documents, but it wouldn't be 
indexed like that.  I know almost nothing about how the parent/child 
document support works.  You would have to get help from someone else or 
consult the documentation.
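
As a sketch only (not verified against your schema; the child id is invented
for illustration), a nested version of your document would embed the children
under the parent, e.g. with the JSON "_childDocuments_" key:

{
  "id": "5ca38e407b154dac08913a96",
  "sampleAttr": "sampleAttrVal",
  "_childDocuments_": [
    {
      "id": "5ca38e407b154dac08913a96-info1",
      "name": "Manufacturer",
      "value": "Videocon"
    }
  ]
}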


Thanks,
Shawn


Which fieldType to use for JSON Array in Solr 6.5.0?

2019-04-09 Thread Abhijit Pawar
Hello Guys,

I am trying to index a JSON array in one of my collections in mongoDB in
Solr 6.5.0 however it is not getting indexed.

I am using a DataImportHandler for this.

*Here's how the data looks in mongoDB:*
{
  "idStr" : "5ca38e407b154dac08913a96",
  "sampleAttr" : "sampleAttrVal",
  "additionalInfo" : [
    {
      "name" : "Manufacturer",
      "value" : "Videocon"
    }
  ]
}

*data-source-config.xml:*

[DIH configuration stripped by the mailing list archive]

*managed-schema.xml:*

[field definitions stripped by the mailing list archive]

What fieldType should I use for a JSON array? I tried "strings" above,
however it doesn't seem to work.
Can someone help me with this?
Appreciate your response. Thanks.


RE: Solr Cache clear

2019-04-09 Thread Lewin Joy (TMNA)
Hmm. I am doing the same thing. But somehow, in my browser, after I select the 
core, it does not stay selected for me to view the stats/cache.
Attaching the gif for when I try it.

Anyway, that is a different issue from my side. Thanks for your input.

-Lewin

-Original Message-
From: Shawn Heisey  
Sent: Tuesday, April 9, 2019 1:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cache clear

On 4/9/2019 12:38 PM, Lewin Joy (TMNA) wrote:
> I just tried to go to the location you have specified. I could not see a 
> "CACHE" . I can see the "Statistics" section.
> I am using Solr 7.2 on solrcloud mode.

If you are trying to select a *collection* from a dropdown, you will not see 
this.  It will only show up when you select a *core* from the other dropdown.

In SolrCloud, collections are made up of one or more shards.  Shards are made 
up of one or more replicas.  Every shard replica is a core.

Here are some partial screenshots showing what I clicked on to get to the cache 
stats:

https://www.dropbox.com/s/ked4xe4w45e9qr2/cache-1.png?dl=0
https://www.dropbox.com/s/7v630mr2ii0rtee/cache-2.png?dl=0
https://www.dropbox.com/s/6czqtxq5qf8vzwz/cache-3.png?dl=0

The system where I captured these screenshots was not running in SolrCloud 
mode, so it did not have the "collections" dropdown that your admin UI will 
have.

When you reload a collection, Solr uses the information in ZooKeeper to locate 
all of the shard replicas that make up the collection, and reloads all those 
cores.  So a collection reload is basically equivalent to multiple core reloads.

Thanks,
Shawn


Re: Solr Cache clear

2019-04-09 Thread Shawn Heisey

On 4/9/2019 12:38 PM, Lewin Joy (TMNA) wrote:

I just tried to go to the location you have specified. I could not see a "CACHE" . I can 
see the "Statistics" section.
I am using Solr 7.2 on solrcloud mode.


If you are trying to select a *collection* from a dropdown, you will not 
see this.  It will only show up when you select a *core* from the other 
dropdown.


In SolrCloud, collections are made up of one or more shards.  Shards are 
made up of one or more replicas.  Every shard replica is a core.


Here are some partial screenshots showing what I clicked on to get to 
the cache stats:


https://www.dropbox.com/s/ked4xe4w45e9qr2/cache-1.png?dl=0
https://www.dropbox.com/s/7v630mr2ii0rtee/cache-2.png?dl=0
https://www.dropbox.com/s/6czqtxq5qf8vzwz/cache-3.png?dl=0

The system where I captured these screenshots was not running in 
SolrCloud mode, so it did not have the "collections" dropdown that your 
admin UI will have.


When you reload a collection, Solr uses the information in ZooKeeper to 
locate all of the shard replicas that make up the collection, and 
reloads all those cores.  So a collection reload is basically equivalent 
to multiple core reloads.
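
A collection reload goes through the Collections API -- for example, with a
placeholder collection name:

curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection'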


Thanks,
Shawn


RE: Solr Cache clear

2019-04-09 Thread Lewin Joy (TMNA)
Hi Shawn,

We are facing an issue where the caches got corrupted.
We are doing a json.facet and pivoting through 3 levels. We are taking 
allBuckets from the different levels. 

In the json.facet query, while doing the inner facets, we are keeping a limit. We 
notice that as we change the limit, we get a different value for allBuckets.
This got corrected after I explicitly applied one facet value as a filter 
one time. I am assuming it cleared the cache for that filter.

Now, I have a few other facet values with a similar issue. Assuming that this 
issue would get resolved if I clear the cache, I am checking these values 
once I reload the collection.

Anyway, if I am able to look at the cache sizes after the reload, that gives me 
more information.

I just tried to go to the location you have specified. I could not see a 
"CACHE" . I can see the "Statistics" section.
I am using Solr 7.2 on solrcloud mode. 

thanks
-Lewin

-Original Message-
From: Shawn Heisey  
Sent: Tuesday, April 9, 2019 1:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cache clear

On 4/9/2019 11:51 AM, Lewin Joy (TMNA) wrote:
> Hmm. I only tried reloading the collection as a whole. Not the core reload.
> Where do I see the cache sizes after reload?

If you do not know how to see the cache sizes, then what information are you 
looking at which has led you to the conclusion that the caches have not been 
cleared?

To get to cache stats:  In the admin UI, choose a core from the dropdown.  Then 
click on Plugins/Stats, then CACHE, and choose the cache you want to look at.

Thanks,
Shawn


Re: Solr Cache clear

2019-04-09 Thread Shawn Heisey

On 4/9/2019 11:51 AM, Lewin Joy (TMNA) wrote:

Hmm. I only tried reloading the collection as a whole. Not the core reload.
Where do I see the cache sizes after reload?


If you do not know how to see the cache sizes, then what information are 
you looking at which has led you to the conclusion that the caches have 
not been cleared?


To get to cache stats:  In the admin UI, choose a core from the 
dropdown.  Then click on Plugins/Stats, then CACHE, and choose the cache 
you want to look at.
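
The same stats can also be fetched over HTTP with the mbeans handler -- a
sketch, with host and core name as placeholders:

curl 'http://localhost:8983/solr/mycore/admin/mbeans?cat=CACHE&stats=true&wt=json'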


Thanks,
Shawn


Re: Solr Cache clear

2019-04-09 Thread Walter Underwood
I’d like to know this, too. We run benchmarks with log replay, starting with 
warming queries, then a measurement run. It is a pain to do a rolling restart 
of the whole cluster before each benchmark run.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 9, 2019, at 10:55 AM, Lewin Joy (TMNA)  wrote:
> 
> Thank you for email, Alex.
> 
> I have the autowarmCount set as 0. 
> So, this shouldn't prepopulate with old cache data.
> 
> -Lewin
> 
> -Original Message-
> From: Alexandre Rafalovitch  
> Sent: Monday, April 8, 2019 6:45 PM
> To: solr-user 
> Subject: Re: Solr Cache clear
> 
> You may have warming queries to prepopulate your cache. Check your 
> solrconfig.xml.
> 
> Regards,
>Alex
> 
> On Mon, Apr 8, 2019, 4:16 PM Lewin Joy (TMNA),  wrote:
> 
>> ** PROTECTED 関係者外秘 [Confidential - for internal parties only]
>> How do I clear the solr caches without restarting Solr cluster?
>> Is there a way?
>> I tried reloading the collection. But, it did not help.
>> 
>> Thanks,
>> Lewin
>> 
>> 



RE: Solr Cache clear

2019-04-09 Thread Lewin Joy (TMNA)
Thank you for email, Alex.

I have the autowarmCount set as 0. 
So, this shouldn't prepopulate with old cache data.

-Lewin

-Original Message-
From: Alexandre Rafalovitch  
Sent: Monday, April 8, 2019 6:45 PM
To: solr-user 
Subject: Re: Solr Cache clear

You may have warming queries to prepopulate your cache. Check your 
solrconfig.xml.
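
Warming queries are configured as a newSearcher event listener -- a minimal
sketch of what to look for in solrconfig.xml:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="rows">10</str></lst>
  </arr>
</listener>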

Regards,
Alex

On Mon, Apr 8, 2019, 4:16 PM Lewin Joy (TMNA),  wrote:

> ** PROTECTED 関係者外秘 [Confidential - for internal parties only]
> How do I clear the solr caches without restarting Solr cluster?
> Is there a way?
> I tried reloading the collection. But, it did not help.
>
> Thanks,
> Lewin
>
>


RE: Solr Cache clear

2019-04-09 Thread Lewin Joy (TMNA)
Hmm. I only tried reloading the collection as a whole. Not the core reload.
Where do I see the cache sizes after reload?

-Lewin

-Original Message-
From: Shawn Heisey  
Sent: Monday, April 8, 2019 5:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cache clear

On 4/8/2019 2:14 PM, Lewin Joy (TMNA) wrote:
> How do I clear the solr caches without restarting Solr cluster?
> Is there a way?
> I tried reloading the collection. But, it did not help.

When I reload a core on a test setup (solr 7.4.0), I see cache sizes reset.

What evidence are you seeing that reloading doesn't work?

Thanks,
Shawn


How to prevent solr from deleting cores when getting an empty config from zookeeper

2019-04-09 Thread Koen De Groote
Hello,

I recently ran into the following scenario:

Solr, version 7.5, in a docker container, running as cloud, with an
external zookeeper ensemble of 3 zookeepers. Instructions were followed to
make a root first; this was set correctly, as could be seen from the solr
logs outputting the connect info.

The root command is: "bin/solr zk mkroot /solr -z "

For a yet undetermined reason, the zookeeper ensemble had some kind of
split-brain occur. At a later point, Solr was restarted and then suddenly
all its directories were gone.

By which I mean: the directories containing the configuration and the data.
The stopwords, the schema, the solr config, the "shard1_replica_n2"
directories, those directories.

Those were gone without a trace.

As far as I can tell, solr started, asked zookeeper for its config,
zookeeper returned an empty config and consequently "made it so".

I am by no means very knowledgeable about solr internals. Can anyone chime
in as to what happened here and how to prevent it? Is more info needed?

Ideally, if something like this were to happen, I'd like for either solr to
not delete folders or if that's not possible, add some kind of pre-startup
check that stops solr from going any further if things go wrong.

Regards,
Koen


Re: Interesting Grouping/Facet issue

2019-04-09 Thread Shawn Heisey

On 4/9/2019 7:03 AM, Erie Data Systems wrote:

Solr 8.0.0, I have a HASHTAG string field I am trying to facet on to get
the most popular hashtags (top 100) across many sources. (SITE field is
string)

/select?facet.field=hashtag&facet=on&rows=0&q=%2Bhashtag:*%20%2BDT:[" .
date('Y-m-d') . "T00:00:00Z+TO+" . date('Y-m-d') .
"T23:59:59Z]&facet.limit=100&facet.mincount=1&facet.method=fc

It works, but not the way I feel it should... For example, if one site
has 1000 rows on today's date and they all have a HASHTAG in common, that
HASHTAG automatically rises to the top simply because one SITE has 1000
pages with the same HASHTAG.


That is exactly what faceting is designed to do.  It is behaving exactly 
as designed.



Is there a way to get a better, more even distribution of top HASHTAGs for a
given date, i.e. facet ... by a grouping or distinct or filter of some sort?
I'm more interested in knowing if a HASHTAG is used frequently among SITEs,
not just on one.


If you use pivot facets, first on the field you want to classify on, 
then on HASHTAG, that MIGHT get you what you want.


You could also try running many different facet queries, each one with a 
specific query and/or filter that achieves the results you want.


FYI:  Including "hashtag:*" in your query makes it a wildcard query. 
This is most likely VERY slow.  If you are trying to match all possible 
values in the hashtag field, then take it out, it's unnecessary.  If you 
are trying to match only documents where hashtag contains a value, then 
replace it with this for a performance improvement:


hashtag:[* TO *]

Range queries are almost always faster than wildcards.
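
A sketch of the original query with that replacement (date hardcoded for
illustration, other parameters unchanged):

/select?q=%2Bhashtag:[*+TO+*]%20%2BDT:[2019-04-09T00:00:00Z+TO+2019-04-09T23:59:59Z]&rows=0&facet=on&facet.field=hashtag&facet.limit=100&facet.mincount=1&facet.method=fc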

Thanks,
Shawn


Re: Moving index from stand-alone Solr 6.6.0 to 3 node Solr Cloud 6.6.0 with Zookeeper

2019-04-09 Thread Erick Erickson
Glad to hear it. Now, if you want to be really bold (and I haven’t verified it, 
but it _should_ work).

Rather than copy the index, try this:

1> spin up a one-replica empty collection
2> use the REPLICATION API to copy the index from the re-indexed source.
3> ADDREPLICAs as before.

<2> looks something like: 
http://_slave_host:port_/solr/_core_name_/replication?command=fetchindex&masterUrl=http://solr_with_new_index:port/solr/_core_name_/replication

_core_name_ in this case is something like collection1_shard1_replica1, i.e. 
what shows up in the “cores” dropdown.

The replication API is still used by SolrCloud for “full sync” and has been 
around forever, so it’s well-tested. Again, though, I don’t use this regularly 
so no guarantees…..

See: https://lucene.apache.org/solr/guide/7_5/index-replication.html
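
Concretely, with placeholder host names and a core name from the "cores"
dropdown, step <2> might look like:

curl 'http://target-host:8983/solr/collection1_shard1_replica_n1/replication?command=fetchindex&masterUrl=http://source-host:8983/solr/newcore/replication'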

Best,
Erick

> On Apr 9, 2019, at 12:38 AM, kevinc  wrote:
> 
> Thanks so much - your approaches worked a treat!
> 
> Best,
> Kevin.
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Performance problems with extremely common terms in collection (Solr 7.4)

2019-04-09 Thread Diego Ceccarelli
Another way to make queries faster is, if you can, to identify a subset of
documents that are in general relevant for the users (most recent ones,
most browsed, etc.), index those documents into a separate collection, and
then query the small collection, backing out to the full one if the small
one didn't have enough documents (caveat: the small collection could affect
the ranking because all term stats will be different..)
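
A minimal SolrJ sketch of that fallback (collection names and the threshold
are assumptions):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FallbackSearch {
    // Query the small "head" collection first; if it returns too few hits,
    // back out to the full collection.
    static QueryResponse search(SolrClient client, String userQuery) throws Exception {
        SolrQuery q = new SolrQuery(userQuery);
        QueryResponse rsp = client.query("small_collection", q);
        if (rsp.getResults().getNumFound() < 10) {  // assumed minimum-hits threshold
            rsp = client.query("full_collection", q);
        }
        return rsp;
    }
}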

Cheers,
Diego

On Mon, Apr 8, 2019, 15:59 Michael Gibney  wrote:

> In addition to Toke's suggestions (and those in the linked article), some
> more ideas:
> If single-term, bare queries are slow, it might be productive to check
> config/performance of your queryResultCache (I realize this doesn't
> directly address the concern of slow queries, but might nonetheless be
> helpful in practice).
> If multi-term queries that include these terms are slow, maybe check your
> mm config to make sure it's not more inclusive than necessary for your use
> case (scoring over union of docSets/clauses). If multi-term queries get
> faster by disabling pf, you could try disabling main-query pf, and invoke
> implicit phrase search (pseudo-pf) using ReRankQParser?
> If you're able to share your configs (built queries, indexing/fieldType
> config (positions, payloads?), etc.), that might enable more specific
> advice.
> I'm assuming the query-times posted are for queries that isolate the
> performance of main query only (i.e., no other components, like facets,
> etc.)?
> Michael
>
> On Mon, Apr 8, 2019 at 3:28 AM Ash Ramesh  wrote:
>
> > Hi Toke,
> >
> > Thanks for the prompt reply. I'm glad to hear that this is a common
> > problem. In regards to stop words, I've been thinking about trying that
> > out. In our business case, most of these terms are keywords related to
> > stock photography, therefore it's natural for 'photography' or
> 'background'
> > to appear commonly in a document's keyword list. it seems unlikely we can
> > use the common grams solution with our business case.
> >
> > Regards,
> >
> > Ash
> >
> > On Mon, Apr 8, 2019 at 5:01 PM Toke Eskildsen  wrote:
> >
> > > On Mon, 2019-04-08 at 09:58 +1000, Ash Ramesh wrote:
> > > > We have a corpus of 50+ million documents in our collection. I've
> > > > noticed that some queries with specific keywords tend to be extremely
> > > > slow.
> > > > E.g. the q=`photography' or q='background'. After digging into the
> > > > raw documents, I could see that these two terms appear in greater
> > > > than 90% of all documents, which means solr has to score each of
> > > > those documents.
> > >
> > > That is known behaviour, which can be remedied somewhat. Stop words are
> > > a common approach, but your samples do not seem to fit well with
> > > that. Instead you can look at Common Grams, where your high-frequency
> > > words get concatenated with surrounding words. This only works with
> > > phrases though. There's a nice article at
> > >
> > >
> > >
> >
> https://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
> > >
> > > - Toke Eskildsen, Royal Danish Library
> > >
> > >
> > >
> >
>


Re: Understanding Performance of Function Query

2019-04-09 Thread Erik Hatcher
maybe something like:

q=({!edismax ... v=$q1} OR {!edismax ... v=$q2} OR {!edismax ... v=$q3})

and setting q1, q2, q3 as needed (or all to the same, maybe with different qf’s
and such)

  Erik

> On Apr 9, 2019, at 09:12, sidharth228  wrote:
> 
> I did infact use "bf" parameter for individual edismax queries. 
> 
> However, the reason I can't condense these edismax queries into a single
> edismax query is because each of them uses different fields in "qf". 
> 
> Basically what I'm trying to do is this: each of these edismax queries (q1,
> q2, q3) has a logic, and scores docs using it. I am then trying to combine
> the scores (to get an overall score) from these scores later by summing
> them.
> 
> What options do I have of implementing this?
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Understanding Performance of Function Query

2019-04-09 Thread sidharth228
I did infact use "bf" parameter for individual edismax queries. 

However, the reason I can't condense these edismax queries into a single
edismax query is because each of them uses different fields in "qf". 

Basically what I'm trying to do is this: each of these edismax queries (q1,
q2, q3) has a logic, and scores docs using it. I am then trying to combine
the scores (to get an overall score) from these scores later by summing
them.

What options do I have of implementing this?




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Interesting Grouping/Facet issue

2019-04-09 Thread Erie Data Systems
Solr 8.0.0, I have a HASHTAG string field I am trying to facet on to get
the most popular hashtags (top 100) across many sources. (SITE field is
string)

/select?facet.field=hashtag&facet=on&rows=0&q=%2Bhashtag:*%20%2BDT:[" .
date('Y-m-d') . "T00:00:00Z+TO+" . date('Y-m-d') .
"T23:59:59Z]&facet.limit=100&facet.mincount=1&facet.method=fc

It works, but not the way I feel it should... For example, if one site
has 1000 rows on today's date and they all have a HASHTAG in common, that
HASHTAG automatically rises to the top simply because one SITE has 1000
pages with the same HASHTAG.

Is there a way to get a better, more even distribution of top HASHTAGs for a
given date, i.e. facet ... by a grouping or distinct or filter of some sort?
I'm more interested in knowing if a HASHTAG is used frequently among SITEs,
not just on one.

Hope this makes sense... any recommendations welcomed.

Thank you in advance,
-Craig


Re: Understanding Performance of Function Query

2019-04-09 Thread Erik Hatcher
Function queries in ‘q’ score EVERY DOCUMENT.   Use ‘bf’ or ‘boost’ for the 
function part, so it's only computed on main query matching docs.
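
For instance, a minimal sketch of the combined query with the main clause in
q and the other scores added via bf and the query() function (field names
are placeholders):

q={!edismax qf=field1 v=$uq}
&bf=query($q2)
&bf=query($q3)
&q2={!edismax qf=field2 v=$uq}
&q3={!edismax qf=field3 v=$uq}
&uq=<user query text>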

Erik

> On Apr 9, 2019, at 03:29, Sidharth Negi  wrote:
> 
> Hi,
> 
> I'm working with "edismax" and "function-query" parsers in Solr and have
> difficulty in understanding whether the query time taken by
> "function-query" makes sense. The query I'm trying to optimize looks as
> follows:
> 
> q={!func sum($q1,$q2,$q3)} where q1,q2,q3 are edismax queries.
> 
> The QTime returned by edismax queries takes well under 50ms but it seems
> that function-query is the rate determining step since combined query above
> takes around 200-300ms. I also analyzed the performance of function query
> using only constants.
> 
> The QTime results for different q are as follows:
> 
>   - 097ms for q={!func} sum(10,20)
>   - 109ms for q={!func} sum(10,20,30)
>   - 127ms for q={!func} sum(10,20,30,40)
>   - 145ms for q={!func} sum(10,20,30,40,50)
> 
> Does this trend make sense? Are function-queries expected to be this slow?
> 
> What makes edismax queries so much faster?
> 
> What can I do to optimize my original query (which has edismax subqueries
> q1,q2,q3) to work under 100ms?
> 
> I originally posted this question on StackOverflow with no success, so any
> help here would be appreciated.


edismax pt 2

2019-04-09 Thread Dwane Hall
Hi guys,



I’m just following up from an earlier question I raised on the forum regarding 
inconsistencies in edismax query behaviour and I think I may have discovered 
the cause of the problem.  From testing I've noticed that edismax query 
behaviour seems to change depending on the field types specified in the qf 
parameter.



Here’s an example, first using only solr.TextField fields.



(all fields are “text_general” – standard tokenizer, lower case filter only)

NAME solr.TextField

ADDRESS solr.TextField

EMAIL solr.TextField

PHONE_NUM solr.TextField



“qf":"NAME ADDRESS EMAIL PHONE_NUM”



"querystring":"peter john spain",

"parsedquery":"+(+DisjunctionMaxQuery((PHONE_NUM:peter | ADDRESS:peter | 
EMAIL:peter | NAME:peter)) +DisjunctionMaxQuery((PHONE_NUM:john | ADDRESS:john 
| EMAIL:john | NAME:john)) +DisjunctionMaxQuery((PHONE_NUM:spain | 
ADDRESS:spain | EMAIL:spain | NAME:spain)))",





Now, with no other configuration changes, when I introduce a date range field 
(solr.DateRangeField) called “DOB” into the qf parameter, the behaviour of the 
edismax parser changes dramatically.



DOB solr.DateRangeField



“qf":"NAME ADDRESS EMAIL PHONE_NUM DOB”



"querystring":"peter john spain",

"parsedquery":"+(+DisjunctionMaxQuery(((+PHONE_NUM:peter +PHONE_NUM:john 
+PHONE_NUM:spain) | (+EMAIL:peter +EMAIL:john +EMAIL:spain) | () | (+NAME:peter 
+NAME:john +NAME:spain) | (+ADDRESS:peter +ADDRESS:john +ADDRESS:spain",




Notice the difference between the “|” (OR) and “+” (AND) operators between 
terms, and also that every term is now mandatory in every field.  Is this the 
expected behaviour for the edismax query parser, or am I overlooking something 
that may be causing this inconsistency?



As always any comments or feedback is greatly appreciated,


Thanks



Dwane





Re: Solr 8.0.0 - CPU usage 100% when indexed documents

2019-04-09 Thread Shawn Heisey

On 4/8/2019 11:00 PM, vishal patel wrote:

Sorry my mistake there is no class of that.

I have add the data using below code.
CloudSolrServer cloudServer = new CloudSolrServer(zkHost);
cloudServer.setDefaultCollection("actionscomments");
cloudServer.setParallelUpdates(true);
List<SolrInputDocument> docs = new ArrayList<>();
SolrInputDocument solrDocument = new SolrInputDocument();
solrDocument.addField("id", "123");
docs.add(solrDocument);
cloudServer.add(docs, 1000);


Side note:  This code is not using SolrJ 8.0.0.  CloudSolrServer was 
deprecated in version 5.0.0 and completely removed in version 6.0.0. 
I'm surprised this code even works at all with Solr 8.0.0 -- you need to 
upgrade to SolrJ 8 and use CloudSolrClient.


How long does the system remain at 100 percent CPU when you index that 
single document that only has one field?  If it's longer than a very 
small fraction of a second, then my guess is that it's cache warming 
queries using the CPU, not the indexing itself.


How many CPU cores are at 100 percent?  Is it just one, or multiple?  It 
would be odd for it to be multiple, unless there is other activity going 
on at the same time.


Thanks,
Shawn


Re: Sql entity processor sortedmapbackedcache out of memory issue

2019-04-09 Thread Shawn Heisey

On 4/8/2019 11:47 PM, Srinivas Kashyap wrote:

I'm using DIH to index the data and the structure of the DIH is like below for 
solr core:


[DIH configuration stripped by the mailing list archive]
16 child entities


During indexing, the number of requests being made to the database was 
high (17 queries to process one document) and was utilizing most of the 
database connections, thereby blocking our web application.


If you have 17 entities, then one document will indeed take 17 queries. 
That's the nature of multiple DIH entities.



To tackle it, we implemented SORTEDMAPBACKEDCACHE with the cacheImpl parameter to 
reduce the number of requests to the database.


When you use SortedMapBackedCache on an entity, you are asking Solr to 
store the results of the entire query in memory, even if you don't need 
all of the results.  If the database has a lot of rows, that's going to 
take a lot of memory.


In your excerpt from the config, your inner entity doesn't have a WHERE 
clause.  Which means that it's going to retrieve all of the rows of the 
ABC table for *EVERY* single entry in the DEF table.  That's going to be 
exceptionally slow.  Normally the SQL query on inner entities will have 
some kind of WHERE clause that limits the results to rows that match the 
entry from the outer entity.
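
A sketch of both variants (table and column names are assumptions based on
your excerpt). The plain per-row WHERE version:

<entity name="def" query="SELECT ID, ... FROM DEF">
  <entity name="abc" query="SELECT * FROM ABC WHERE PARENT_ID = '${def.ID}'"/>
</entity>

and the cached version, where cacheKey/cacheLookup let DIH run the child
query once and join each parent row from the in-memory cache:

<entity name="def" query="SELECT ID, ... FROM DEF">
  <entity name="abc" query="SELECT PARENT_ID, ... FROM ABC"
          cacheImpl="SortedMapBackedCache"
          cacheKey="PARENT_ID" cacheLookup="def.ID"/>
</entity>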


You may need to write a custom indexing program that runs separately 
from Solr, possibly on an entirely different server.  That might be a 
lot more efficient than DIH.


Thanks,
Shawn


Understanding Performance of Function Query

2019-04-09 Thread Sidharth Negi
Hi,

I'm working with "edismax" and "function-query" parsers in Solr and have
difficulty in understanding whether the query time taken by
"function-query" makes sense. The query I'm trying to optimize looks as
follows:

q={!func sum($q1,$q2,$q3)} where q1,q2,q3 are edismax queries.

The QTime returned by edismax queries takes well under 50ms but it seems
that function-query is the rate determining step since combined query above
takes around 200-300ms. I also analyzed the performance of function query
using only constants.

The QTime results for different q are as follows:

   - 097ms for q={!func} sum(10,20)
   - 109ms for q={!func} sum(10,20,30)
   - 127ms for q={!func} sum(10,20,30,40)
   - 145ms for q={!func} sum(10,20,30,40,50)

Does this trend make sense? Are function-queries expected to be this slow?

What makes edismax queries so much faster?

What can I do to optimize my original query (which has edismax subqueries
q1,q2,q3) to work under 100ms?

I originally posted this question on StackOverflow with no success, so any
help here would be appreciated.


Re: Moving index from stand-alone Solr 6.6.0 to 3 node Solr Cloud 6.6.0 with Zookeeper

2019-04-09 Thread kevinc
Thanks so much - your approaches worked a treat!

Best,
Kevin.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Duplicated tokens in search string

2019-04-09 Thread rodio
Hi all,

We are trying to emulate in Solr 8.0 the behaviour of Solr 3.6, and we are
facing a problem that we cannot solve.

When we have duplicated tokens:

- Solr 8.0 scores the token only once but applies a huge boost
- Solr 3.6 scores each token individually, and the final score is lower

We are using the ClassicSimilarity algorithm but we cannot prevent that boosting.

Example: table 60 cm 50 cm

Solr 8.0

/11.096966 = sum of:
  4.3195267 = sum of:
4.3195267 = weight(name:table in 138556) [ClassicSimilarity], result of:
  4.3195267 = score(freq=1.0), product of:
8.639053 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
  62381 = docFreq, number of documents containing term
  129615816 = docCount, total number of documents with field
1.0 = tf(freq=1.0), with freq of:
  1.0 = freq, occurrences of term within document
0.5 = fieldNorm
  2.7624812 = weight(name:60 in 138556) [ClassicSimilarity], result of:
2.7624812 = score(freq=1.0), product of:
  5.5249624 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
1404402 = docFreq, number of documents containing term
129615816 = docCount, total number of documents with field
  1.0 = tf(freq=1.0), with freq of:
1.0 = freq, occurrences of term within document
  0.5 = fieldNorm
  4.0149584 = weight(name:cm in 138556) [ClassicSimilarity], result of:
4.0149584 = score(freq=1.0), product of:
*  2.0 = boost*
  4.0149584 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
6357381 = docFreq, number of documents containing term
129615816 = docCount, total number of documents with field
  1.0 = tf(freq=1.0), with freq of:
1.0 = freq, occurrences of term within document
  0.5 = fieldNorm
/

Solr 3.6

/3.098446 = (MATCH) product of:
  3.8730574 = (MATCH) sum of:
2.120801 = (MATCH) sum of:
  2.120801 = (MATCH) weight(name:table in 101441), product of:
0.4913325 = queryWeight(name:table), product of:
  8.632854 = idf(docFreq=135231, maxDocs=279245306)
  0.05691426 = queryNorm
4.316427 = (MATCH) fieldWeight(name:table in 101441), product of:
  1.0 = tf(termFreq(name:table)=1)
  8.632854 = idf(docFreq=135231, maxDocs=279245306)
  0.5 = fieldNorm(field=name, doc=101441)
0.8427305 = (MATCH) weight(name:60 in 101441), product of:
  0.30972046 = queryWeight(name:60), product of:
5.4418783 = idf(docFreq=3287778, maxDocs=279245306)
0.05691426 = queryNorm
  2.7209392 = (MATCH) fieldWeight(name:60 in 101441), product of:
1.0 = tf(termFreq(name:60)=1)
5.4418783 = idf(docFreq=3287778, maxDocs=279245306)
0.5 = fieldNorm(field=name, doc=101441)
0.45476305 = (MATCH) weight(name:cm in 101441), product of:
  0.22751924 = queryWeight(name:cm), product of:
3.9975789 = idf(docFreq=13936507, maxDocs=279245306)
0.05691426 = queryNorm
  1.9987894 = (MATCH) fieldWeight(name:cm in 101441), product of:
1.0 = tf(termFreq(name:cm)=1)
3.9975789 = idf(docFreq=13936507, maxDocs=279245306)
0.5 = fieldNorm(field=name, doc=101441)
0.45476305 = (MATCH) weight(name:cm in 101441), product of:
  0.22751924 = queryWeight(name:cm), product of:
3.9975789 = idf(docFreq=13936507, maxDocs=279245306)
0.05691426 = queryNorm
  1.9987894 = (MATCH) fieldWeight(name:cm in 101441), product of:
1.0 = tf(termFreq(name:cm)=1)
3.9975789 = idf(docFreq=13936507, maxDocs=279245306)
0.5 = fieldNorm(field=name, doc=101441)
  0.8 = coord(4/5)
/

Is it possible to configure this?

Thanks in advance!




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Sql entity processor sortedmapbackedcache out of memory issue

2019-04-09 Thread Srinivas Kashyap
Hello,

I'm using DIH to index the data and the structure of the DIH is like below for 
solr core:


[DIH configuration stripped by the mailing list archive]
16 child entities


During indexing, the number of requests being made to the database was 
high (17 queries to process one document) and was utilizing most of the 
database connections, thereby blocking our web application.

To tackle it, we implemented SORTEDMAPBACKEDCACHE with the cacheImpl parameter to 
reduce the number of requests to the database.

[cached entity configuration stripped by the mailing list archive]


We have a system with 8GB of physical memory (RAM), 5GB of it allocated to the 
JVM, and when we do a full-import, only 17 requests are made to the database. 
However, it shoots up memory consumption and drives the JVM out of memory. 
Whether we run out of memory depends on the number of records each entity 
brings into memory. For Dev and QA environments, the above memory config is 
sufficient. When we move to production, we have to increase the memory to 
around 16GB of RAM and 12GB for the JVM.

Is there any logic/configurations to limit the memory usage?

Thanks and Regards,
Srinivas Kashyap


DISCLAIMER:
E-mails and attachments from Bamboo Rose, LLC are confidential.
If you are not the intended recipient, please notify the sender immediately by 
replying to the e-mail, and then delete it without making copies or using it in 
any way.
No representation is made that this email or any attachments are free of 
viruses. Virus scanning is recommended and is the responsibility of the 
recipient.