Re: Performance Issue since Solr 7.7 with wt=javabin

2019-10-12 Thread Noble Paul
How are you consuming the output? Are you using solrj?

On Tue, Jun 18, 2019, 1:27 AM Andy Reek  wrote:

> Hi Solr team,
>
>
> we are using Solr version 7.1 as the search engine in our online shop (SAP
> Hybris). As a task I needed to migrate to the most recent Solr in the
> version 7 line (7.7). Doing this I faced extreme performance issues. After
> debugging and testing different setups I found out that they were caused
> by the parameter wt=javabin. These issues first appear in version 7.7;
> 7.6 is still as fast as 7.1.
>
>
> Just an example: doing a simple query for *:* with wt=javabin takes 0.2
> seconds in 7.6 and 34 seconds in 7.7!
>
>
> The schema.xml and solrconfig.xml are identical in both versions. Version
> 8.1 shows the same effect as 7.7. Using anything other than wt=javabin
> (e.g. wt=xml) is fast in every version - which is our current workaround.
>
>
>
> To reproduce this issue I have attached my used configsets folder plus
> some test data. This all can be tested with docker and wget:
>
>
> Solr 7.6:
>
> docker run -d --name solr7.6 -p 8983:8983 --rm -v
> $PWD/configsets/default:/opt/solr/server/solr/configsets/myconfig:ro
> solr:7.6-slim solr-create -c mycore -d
> /opt/solr/server/solr/configsets/myconfig
> docker cp $PWD/data.json solr7.6:/opt/solr/data.json
> docker exec -it --user solr solr7.6 bin/post -c mycore data.json
> wget "http://localhost:8983/solr/mycore/select?q=*:*&wt=javabin"
> (0.2s)
>
> Solr 7.7:
> docker run -d --name solr7.7 -p 18983:8983 --rm -v
> $PWD/configsets/default:/opt/solr/server/solr/configsets/myconfig:ro
> solr:7.7-slim solr-create -c mycore -d
> /opt/solr/server/solr/configsets/myconfig
> docker cp $PWD/data.json solr7.7:/opt/solr/data.json
> docker exec -it --user solr solr7.7 bin/post -c mycore data.json
> wget "http://localhost:18983/solr/mycore/select?q=*:*&wt=javabin"
> (34s)
>
> To me this seems like a bug. But if not, then please let me know what I did
> wrong ;-)
>
>
>
> Best Regards,
>
>
>
> *Andy Reek*
>
> Principal Software Developer
>
>
>
> *diva-e* Jena
>
> Mälzerstraße 3, 07745 Jena, Deutschland
>
> T:   +49 (3641) 3678 (223)
>
> F:   +49 (3641) 3678 101
>
> andy.r...@diva-e.com
>
>
>
> www.diva-e.com
>
>
>
> *diva-e* AGETO GmbH
>
> Handelsregister: HRB 210399 Amtsgericht Jena
>
> Geschäftsführung: Sascha Sauer, Sirko Schneppe, Axel Jahn
>
>


Re: Solr-8.2.0 Cannot create collection on CentOS 7.7

2019-10-12 Thread Peter Davie

Hi Shawn,

Thanks for your response.

I have downloaded JDK 11 from jdk.java.net and found that Solr is now 
able to create a collection successfully.  The 
java-11-openjdk-11.0.4.11-1.el7_7.x86_64 package on CentOS 7.7 should 
not be used with Solr.  There do not appear to be any later RPMs 
for Java 11 (or 12/13) on CentOS.


Peter

On 12/10/2019 5:30 am, Shawn Heisey wrote:

On 10/10/2019 11:01 PM, Peter Davie wrote:
I have just installed Solr 8.2.0 on CentOS 7.7.1908.   Java version 
is as follows:


openjdk version "11.0.4" 2019-07-16 LTS




Caused by: java.time.format.DateTimeParseException: Text 
'2019-10-11T04:46:03.971Z' could not be parsed: null




Note that I have tested this and it is working on Windows 10 with 
Solr 8.2.0 using the following Java version:


openjdk version "11.0.2" 2019-01-15


This is looking like either a bug in Java or a change in the Java API. 
The code being called here by the core initialization routines is all 
in Java -- Solr is validating that the Java date formatter it's trying 
to use can parse ISO date strings, and on the version of Java where 
this occurs, that validation is failing.
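The actual check happens in Java (Solr round-trips an ISO instant through a DateTimeFormatter during core initialization); the following Python sketch only illustrates the kind of parse-and-validate step that was failing, using the exact timestamp from the stack trace. It is not the Solr code path.

```python
from datetime import datetime, timezone

# The input string that failed to parse in the reported exception.
raw = "2019-10-11T04:46:03.971Z"

# Parse the ISO-8601 instant; a correct JVM/date library must accept this.
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
print(parsed.isoformat())  # 2019-10-11T04:46:03.971000+00:00
```

If the platform's date formatter cannot parse a string like this, Solr refuses to initialize the core, which matches the behaviour Peter observed with the CentOS OpenJDK package.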


I haven't been keeping an eye on our Jenkins tests, so I do not know 
if we have automated testing with OpenJDK 11.0.4 ... but it seems like 
if we do, that a large number of tests would be failing because of 
this ... unless maybe your install of OpenJDK has something wrong with 
it.


I fired up the cloud example on a fresh solr-8.2.0 download on Ubuntu 
18 with OpenJDK 11.0.4.  It created its default "gettingstarted" 
collection with no problems, and then I also created a new collection 
with the bin/solr command with no problems.  I do not have CentOS in 
my little lab.


Thanks,
Shawn

--
Peter Davie
(+61) (0)417 265 175
peter.da...@convergentsolutions.com.au 



Re: Solr 7.7 restore issue

2019-10-12 Thread Koen De Groote
I also ran into this while researching cluster policies, on Solr 7.6.

Exact same situation: introduce a rule to control placement of
collections. Backup. Delete. Restore. Solr complains it can't do it.

I don't need them just yet, so I stopped there, but reading this is quite
disturbing.

Does deleting the rule, restore and then immediately re-instating the rule
work?
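The sequence asked about here (drop the policy, restore, re-apply the policy) can be outlined against the Solr 7.x autoscaling endpoint. This sketch only builds the JSON payloads; actually POSTing them to /api/cluster/autoscaling and calling the Collections API RESTORE action against a live cluster is left out, and the backup name and location are made-up examples:

```python
import json

# Solr 7.x autoscaling endpoint (V2 API).
AUTOSCALING_PATH = "/api/cluster/autoscaling"

# The placement rule from the thread below: at most one replica
# of each shard per node.
policy = [{"replica": "<2", "shard": "#EACH", "node": "#ANY"}]

# Step 1: clear the cluster policy so restore can place replicas freely.
clear_payload = json.dumps({"set-cluster-policy": []})

# Step 2: restore via the Collections API (names here are examples only).
restore_params = {"action": "RESTORE", "name": "mybackup",
                  "collection": "test", "location": "/backups"}

# Step 3: immediately re-apply the original policy.
reapply_payload = json.dumps({"set-cluster-policy": policy})

print(clear_payload)
print(reapply_payload)
```

Whether the re-applied policy then holds for the already-restored replicas is exactly the open question in this thread.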



On Wed, Oct 9, 2019 at 6:33 AM Natarajan, Rajeswari <
rajeswari.natara...@sap.com> wrote:

> I am also facing the same issue. With Solr 7.6, restore fails with the
> below rule, which is meant to place one replica per node:
>
> "set-cluster-policy": [{
> "replica": "<2",
> "shard": "#EACH",
> "node": "#ANY"
> }]
>
> Without the rule the restore works. But we need this rule. Any suggestions
> to overcome this issue?
>
> Thanks,
> Rajeswari
>
> On 7/12/19, 11:00 AM, "Mark Thill"  wrote:
>
> I have a 4 node cluster.  My goal is to have 2 shards with two replicas
> each and only allowing 1 core on each node.  I have a cluster policy
> set to:
>
> [{"replica":"2", "shard": "#EACH", "collection":"test",
> "port":"8983"},{"cores":"1", "node":"#ANY"}]
>
> I then manually create a collection with:
>
> name: test
> config set: test
> numShards: 2
> replicationFact: 2
>
> This works and I get a collection that looks like what I expect.  I
> then
> backup this collection.  But when I try to restore the collection it
> fails
> and says
>
> "Error getting replica locations : No node can satisfy the rules"
> [{"replica":"2", "shard": "#EACH", "collection":"test",
> "port":"8983"},{"cores":"1", "node":"#ANY"}]
>
> If I set my cluster-policy rules back to [] and try to restore it then
> successfully restores my collection exactly how I expect it to be.  It
> appears that having any cluster-policy rules in place is affecting my
> restore, but the "error getting replica locations" is strange.
>
> Any suggestions?
>
> mark 
>
>
>


Re: igain query parser generating invalid output

2019-10-12 Thread Peter Davie

Hi,

I have created the bug report in Jira and attached the patch to it.

Kind Regards,
Peter

On 12/10/2019 2:34 am, Joel Bernstein wrote:

This sounds like a great patch. I can help with the review and commit after
the jira is created.

Thanks!

Joel


On Fri, Oct 11, 2019 at 1:06 AM Peter Davie <
peter.da...@convergentsolutions.com.au> wrote:


Hi,

I apologise in advance for the length of this email, but I want to share
my discovery steps to make sure that I haven't missed anything during my
investigation...

I am working on a classification project and will be using the
classify(model()) stream function to classify documents.  I have noticed
that models generated include many noise terms from the (lexically)
early part of the term list.  To test, I have used the /BBC articles
fulltext and category //dataset from Kaggle/
(https://www.kaggle.com/yufengdev/bbc-fulltext-and-category). I have
indexed the data into a Solr collection (news_categories) and am
performing the following operation to generate a model for documents
categorised as "BUSINESS" (only keeping the 100th iteration):

having(
  train(
  news_categories,
  features(
  news_categories,
  zkHost="localhost:9983",
  q="*:*",
  fq="role:train",
  fq="category:BUSINESS",
  featureSet="business",
  field="body",
  outcome="positive",
  numTerms=500
  ),
  fq="role:train",
  fq="category:BUSINESS",
  zkHost="localhost:9983",
  name="business_model",
  field="body",
  outcome="positive",
  maxIterations=100
  ),
  eq(iteration_i, 100)
)

The output generated includes "noise" terms such as "1,011.15", "10.3m", 
"01", "02", "03", "10.50", "04", "05", "06", "07", "09", and these terms 
all have the same value for idfs_ds ("-Infinity").

Investigating the "features()" output, it seems that the issue is that
the noise terms are being returned with NaN for the score_f field:

  "docs": [
{
  "featureSet_s": "business",
  "score_f": "NaN",
  "term_s": "1,011.15",
  "idf_d": "-Infinity",
  "index_i": 1,
  "id": "business_1"
},
{
  "featureSet_s": "business",
  "score_f": "NaN",
  "term_s": "10.3m",
  "idf_d": "-Infinity",
  "index_i": 2,
  "id": "business_2"
},
{
  "featureSet_s": "business",
  "score_f": "NaN",
  "term_s": "01",
  "idf_d": "-Infinity",
  "index_i": 3,
  "id": "business_3"
},
{
  "featureSet_s": "business",
  "score_f": "NaN",
  "term_s": "02",
  "idf_d": "-Infinity",
  "index_i": 4,
  "id": "business_4"
},...

I have examined the code within 
org/apache/solr/client/solrj/io/stream/FeatureSelectionStream.java and 
see that the scores being returned by {!igain} include NaN values, as
follows:

{
"responseHeader":{
  "zkConnected":true,
  "status":0,
  "QTime":20,
  "params":{
"q":"*:*",
"distrib":"false",
"positiveLabel":"1",
"field":"body",
"numTerms":"300",
"fq":["category:BUSINESS",
  "role:train",
  "{!igain}"],
"version":"2",
"wt":"json",
"outcome":"positive",
"_":"1569982496170"}},
"featuredTerms":[
  "0","NaN",
  "0.0051","NaN",
  "0.01","NaN",
  "0.02","NaN",
  "0.03","NaN",

Looking into org/apache/solr/search/IGainTermsQParserPlugin.java, it 
seems that when a term is not included in the positive or negative
documents, the docFreq calculation (docFreq = xc + nc) is 0, which means
that subsequent calculations result in NaN (division by 0) which
generates these meaningless values for the computed score.
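The failure mode can be reproduced outside Solr. The following is a rough Python model of an information-gain score for one term, not the exact arithmetic in IGainTermsQParserPlugin: in Java, xc / docFreq with docFreq == 0 evaluates to 0.0 / 0.0 == NaN, which then propagates into the final score; the guard mirrors the patch's behaviour of skipping such terms:

```python
import math

def entropy(p):
    # Binary entropy; defined as 0 at the endpoints.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def igain(xc, nc, pos_docs, total_docs):
    """xc: positive-class docs containing the term,
    nc: negative-class docs containing the term."""
    doc_freq = xc + nc
    if doc_freq == 0:
        return None  # patched behaviour: skip the term instead of scoring NaN
    p_with = doc_freq / total_docs
    h_class = entropy(pos_docs / total_docs)
    h_with = entropy(xc / doc_freq)  # in Java, 0/0 here is the NaN source
    rest = total_docs - doc_freq
    h_without = entropy((pos_docs - xc) / rest) if rest else 0.0
    return h_class - (p_with * h_with + (1 - p_with) * h_without)

print(igain(0, 0, 10, 20))  # None -> term skipped rather than NaN
print(igain(5, 1, 10, 20))  # a real (positive) score
```

Terms that appear in neither the positive nor the negative training documents carry no class information, so skipping them loses nothing.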

I have patched a local version of Solr to skip terms for which docFreq
is 0 in the finish() method of IGainTermsQParserPlugin and this is now
the result:

{
"responseHeader":{
  "zkConnected":true,
  "status":0,
  "QTime":260,
  "params":{
"q":"*:*",
"distrib":"false",
"positiveLabel":"1",
"field":"body",
"numTerms":"300",
"fq":["category:BUSINESS",
  "role:train",
  "{!igain}"],
"version":"2",
"wt":"json",
"outcome":"positive",
"_":"1569983546342"}},
"featuredTerms":[
  "3",-0.0173133558644304,
  "authority",-0.0173133558644304,
  "brand",-0.0173133558644304,
  "commission",-0.0173133558644304,
  "compared",-0.0173133558644304,
  "condition",-0.0173133558644304,
  "continuing",-0.0173133558644304,
  "deficit",-0.0173133558644304,
  "expectation",-0.0173133558644304,

To my (admittedly inexpert) eye, it seems like this is producing more
reasonable results.

With this change in