Re: No Live SolrServer available to handle this request

2017-12-05 Thread Selvam Raman
When I look at the Solr logs I find the below exception:

Caused by: java.io.IOException: Invalid JSON type java.lang.String, expected Map
	at org.apache.solr.schema.JsonPreAnalyzedParser.parse(JsonPreAnalyzedParser.java:86)
	at org.apache.solr.schema.PreAnalyzedField$PreAnalyzedTokenizer.decodeInput(PreAnalyzedField.java:345)
	at org.apache.solr.schema.PreAnalyzedField$PreAnalyzedTokenizer.access$000(PreAnalyzedField.java:280)
	at org.apache.solr.schema.PreAnalyzedField$PreAnalyzedAnalyzer$1.setReader(PreAnalyzedField.java:375)
	at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:202)
	at org.apache.lucene.search.uhighlight.AnalysisOffsetStrategy.tokenStream(AnalysisOffsetStrategy.java:58)
	at org.apache.lucene.search.uhighlight.MemoryIndexOffsetStrategy.getOffsetsEnums(MemoryIndexOffsetStrategy.java:106)
	... 37 more



I am setting a lot of fields (fq, score, highlighting, etc.) and then putting
them into the SolrQuery.
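
For reference, the failing case can be reproduced outside SolrJ with a direct
request; a minimal sketch, where the highlighted field name is an assumption
(the collection name comes from the error below):

curl 'http://localhost:8983/solr/Metadata2/select?q=synthesi*&hl=true&hl.method=unified&hl.fl=text'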

On Wed, Dec 6, 2017 at 11:22 AM, Selvam Raman  wrote:

> When I fire a query it returns the docs as expected. (Example:
> q=synthesis)
>
> I am facing the problem when I include a wildcard character in the query.
> (Example: q=synthesi*)
>
>
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://localhost:8983/solr/Metadata2:
> org.apache.solr.client.solrj.SolrServerException:
>
> No live SolrServers available to handle this request: [/solr/Metadata2_shard1_replica1,
>   solr/Metadata2_shard2_replica2,
>   solr/Metadata2_shard1_replica2]
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>



-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


No Live SolrServer available to handle this request

2017-12-05 Thread Selvam Raman
When I fire a query it returns the docs as expected. (Example:
q=synthesis)

I am facing the problem when I include a wildcard character in the query.
(Example: q=synthesi*)


org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://localhost:8983/solr/Metadata2:
org.apache.solr.client.solrj.SolrServerException:

No live SolrServers available to handle this
request:[/solr/Metadata2_shard1_replica1,
  solr/Metadata2_shard2_replica2,
  solr/Metadata2_shard1_replica2]

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Dataimport handler showing idle status with multiple shards

2017-12-05 Thread Sarah Weissman


From: Shawn Heisey 
Reply-To: "solr-user@lucene.apache.org" 
Date: Tuesday, December 5, 2017 at 1:31 PM
To: "solr-user@lucene.apache.org" 
Subject: Re: Dataimport handler showing idle status with multiple shards

On 12/5/2017 10:47 AM, Sarah Weissman wrote:
I’ve recently been using the dataimport handler to import records from a 
database into a Solr cloud collection with multiple shards. I have 6 dataimport 
handlers configured on 6 different paths all running simultaneously against the 
same DB. I’ve noticed that when I do this I often get “idle” status from the 
DIH even when the import is still running. The percentage of the time I get an 
“idle” response seems proportional to the number of shards. I.e., with 1 shard 
it always shows me non-idle status, with 2 shards I see idle about half the 
time I check the status, with 96 shards it seems to be showing idle almost all 
the time. I can see the size of each shard increasing, so I’m sure the import 
is still going.

I recently switched from 6.1 to 7.1 and I don’t remember this happening in 6.1. 
Does anyone know why the DIH would report idle when it’s running?

e.g.:
curl http://myserver:8983/solr/collection/dataimport6



To use DIH with SolrCloud, you should be sending your request directly
to a shard replica core, not the collection, so that you can be
absolutely certain that the import command and the status command are
going to the same place.  You MIGHT need to also have a distrib=false
parameter on the request, but I do not know whether that is required to
prevent the load balancing on the dataimport handler.



Thanks for the information, Shawn. I am relatively new to Solr cloud and I am 
used to running the dataimport from the admin dashboard, where it happens at 
the collection level, so I find it surprising that the right way to do this is 
at the core level. So, if I want to be able to check the status of my data 
import for N cores I would need to create N different data import configs that 
manually partition the collection and start each different config on a 
different core? That seems like it could get confusing. And then if I wanted to 
grow or shrink my shards I’d have to rejigger my data import configs every 
time. I kind of expect a distributed index to hide these details from me.

I only have one node at the moment, and I don’t understand how Solr cloud works 
internally well enough to understand what it means for the data import to be 
running on a shard vs. a node. It would be nice if doing a status query would 
at least tell you something, like the number of documents last indexed on that 
core, even if nothing is currently running. That way at least I could 
extrapolate how much longer the operation will take.



RE: Multiple cores versus a "source" field.

2017-12-05 Thread Phil Scadden
Thanks Walter. Your case does apply as both data stores do indeed cover the 
same kind of material, with many important terms in common. "source" + fq: 
coming up.
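
For reference, a minimal sketch of the "source" + fq approach (core, field
and value names here are illustrative), including the {!cache=false} variant
Erick mentions below; -g stops curl from globbing the braces and brackets:

curl -g 'http://localhost:8983/solr/reports/select?q=drilling&fq=source:gns'
curl -g 'http://localhost:8983/solr/reports/select?q=drilling&fq={!cache=false}year:[2000%20TO%202017]'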

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Tuesday, 5 December 2017 5:51 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Multiple cores versus a "source" field.

One more opinion on source field vs separate collections for multiple corpora.

Index statistics don’t really settle down until at least 100k documents. Below 
that, idf is pretty noisy. With Ultraseek, we used pre-calculated frequency 
data for collections under 10k docs.

If your corpora have similar word statistics, you might get more predictable 
relevance with a single collection. For example, if you have data sheets and 
press releases, but they are both about test instruments, then you might get 
some advantage from having more data points about the “text” and “title” fields.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 4, 2017, at 7:17 PM, Phil Scadden  wrote:
>
> Thanks Eric. I have already followed the solrj indexing very closely - I have 
> to do a lot of manipulation at indexing time. The other blog article is very 
> interesting as I do indeed use "year" (year of publication) and it is very 
> frequently used to filter queries. I will have a play with that now.
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, 5 December 2017 4:11 p.m.
> To: solr-user 
> Subject: Re: Multiple cores versus a "source" field.
>
> That's the unpleasant part of semi-structured documents (PDF, Word, whatever). 
> You never know the relationship between raw size and indexable text.
>
> Basically anything that you don't care to contribute to _scoring_ is often 
> better in an fq clause. You can also use {!cache=false} to bypass actually 
> using the cache if you know it's unlikely to be reused.
>
> Two other points:
>
> 1> you can offload the parsing to clients rather than Solr and gain
> more control over the process (assuming you haven't already). Here's a blog:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> 2> One reason to not go to fq clauses (except if you use
> {!cache=false}) is if you are using bare NOW in your clauses. For, say, ranges, 
> one common construct is fq=date:[NOW-1DAY TO NOW]. Here's another blog on the 
> subject:
> https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/
>
>
> Best,
> Erick
>
>
> On Mon, Dec 4, 2017 at 6:08 PM, Phil Scadden  wrote:
>>> You'll have a few economies of scale I think with a single core, but 
>>> frankly I don't know if they'd be enough to measure. You say the docs are 
>>> "quite large" though, >are you talking books? Magazine articles? is 20K 
>>> large or are the 20M?
>>
>> Technical reports. Sometimes up to 200MB PDFs, but that would include a lot 
>> of imagery. More typically 20MB. A 140MB PDF contained only 400KB of text.
>>
>> Thanks for the tip on fq: I will put that into code now as I have other fields 
>> used in similar fashion.

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


Re: Dataimport handler showing idle status with multiple shards

2017-12-05 Thread Shawn Heisey

On 12/5/2017 10:47 AM, Sarah Weissman wrote:

I’ve recently been using the dataimport handler to import records from a 
database into a Solr cloud collection with multiple shards. I have 6 dataimport 
handlers configured on 6 different paths all running simultaneously against the 
same DB. I’ve noticed that when I do this I often get “idle” status from the 
DIH even when the import is still running. The percentage of the time I get an 
“idle” response seems proportional to the number of shards. I.e., with 1 shard 
it always shows me non-idle status, with 2 shards I see idle about half the 
time I check the status, with 96 shards it seems to be showing idle almost all 
the time. I can see the size of each shard increasing, so I’m sure the import 
is still going.

I recently switched from 6.1 to 7.1 and I don’t remember this happening in 6.1. 
Does anyone know why the DIH would report idle when it’s running?

e.g.:
curl http://myserver:8983/solr/collection/dataimport6


When you send a DIH request to the collection name, SolrCloud is going 
to load balance that request across the cloud, just like it would with 
any other request.  Solr will look at the list of all responding nodes 
that host part of the collection and send multiple such requests to 
different cores (shards/replicas) across the cloud.  If there are four 
cores in the collection and the nodes hosting them are all working, then 
each of those cores would only see requests to /dataimport about one 
fourth of the time.


DIH imports happen at the core level, NOT the collection level, so when 
you start an import on a collection with four cores in the cloud, only 
one of those four cores is actually going to be doing the import, the 
rest of them are idle.


This behavior should happen with any version, so I would expect it in 
6.1 as well as 7.1.


To use DIH with SolrCloud, you should be sending your request directly 
to a shard replica core, not the collection, so that you can be 
absolutely certain that the import command and the status command are 
going to the same place.  You MIGHT need to also have a distrib=false 
parameter on the request, but I do not know whether that is required to 
prevent the load balancing on the dataimport handler.
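
A minimal sketch of such a request (the core name here is an assumption; the
handler path comes from the message above):

curl 'http://myserver:8983/solr/collection_shard1_replica1/dataimport6?command=status&distrib=false'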


A similar question came to this list two days ago, and I replied to that 
one yesterday.


http://lucene.472066.n3.nabble.com/Dataimporter-status-tp4365602p4365879.html

Somebody did open an issue a LONG time ago about this problem:

https://issues.apache.org/jira/browse/SOLR-3666

I just commented on the issue.

Thanks,
Shawn



Dataimport handler showing idle status with multiple shards

2017-12-05 Thread Sarah Weissman
Hi,

I’ve recently been using the dataimport handler to import records from a 
database into a Solr cloud collection with multiple shards. I have 6 dataimport 
handlers configured on 6 different paths all running simultaneously against the 
same DB. I’ve noticed that when I do this I often get “idle” status from the 
DIH even when the import is still running. The percentage of the time I get an 
“idle” response seems proportional to the number of shards. I.e., with 1 shard 
it always shows me non-idle status, with 2 shards I see idle about half the 
time I check the status, with 96 shards it seems to be showing idle almost all 
the time. I can see the size of each shard increasing, so I’m sure the import 
is still going.

I recently switched from 6.1 to 7.1 and I don’t remember this happening in 6.1. 
Does anyone know why the DIH would report idle when it’s running?

e.g.:
curl http://myserver:8983/solr/collection/dataimport6
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "initArgs":[
    "defaults",[
      "config","data-config6.xml"]],
  "status":"idle",
  "importResponse":"",
  "statusMessages":{}}

Thanks,
Sarah


Re: SolrIndexSearcher count

2017-12-05 Thread Rick Dig
No custom code at all.

On Dec 5, 2017 10:31 PM, "Erick Erickson"  wrote:

> Do you have any custom code in the mix anywhere?
>
> On Tue, Dec 5, 2017 at 5:02 AM, Rick Dig  wrote:
> > Hello all,
> > is it normal to have many instances (100+) of SolrIndexSearchers to be
> open
> > at the same time? Our Heap Analysis shows this to be the case.
> >
> > We have autoCommit for every 5 minutes, with openSearcher=true, would
> this
> > close the old searcher and create a new one or just create a new one with
> > the old one still not getting dereferenced? if so, when do the older
> > searchers get cleaned up ?
> >
> > thanks for your help
> > -rakshit
>


Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Doug Turnbull
It is challenging, as the performance of different use cases and domains
will be very dependent on the use case (there's no one globally perfect
relevance solution). But a good set of metrics to see *generally* how stock
Solr performs across a reasonable set of verticals would be nice.

My philosophy about Lucene-based search is that it's not a solution, but
rather a framework that should have sane defaults but large amounts of
configurability.

For example, I'm not sure there's a globally "right" answer for maxDoc vs
docCount.

Problems with docCount come into play when a corpus usually has an empty
field, but it's occasionally filled out. This creates a strong bias against
matches in that usually empty field, when previously a match in that field
was weighted very highly.

For example, if a product catalog has a user-editable tag field that is
rarely used, and a product description, such as

Product Name: Nice Pants!
Product Description: Come wear these pants!
Tags: [blue] [acid-wash]

Product Name: Acid Wash Pants
Product Description: Come wear these pants!
Tags: (empty)

In this case, the IDF for the acid wash match in tags is very low using
docCount whereas with maxDocs it was very high. Not sure what the right
answer is, but there is often a desire to want more complete docs to be
boosted much higher, which the "maxDocs" method does.
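
To make that concrete with the two products above, here is a rough worked
instance using Lucene's BM25 idf, log(1 + (N - df + 0.5) / (df + 0.5)), with
df = 1 for "acid-wash" in the tags field: taking N from docCount(tags) = 1
gives idf = log(1 + 0.5/1.5) ≈ 0.29, while taking N from maxDoc = 2 gives
idf = log(1 + 1.5/1.5) ≈ 0.69; the gap grows quickly as the catalog fills
with tag-less products.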

Another case where docCount can be a problem is copy fields: With copy
fields, you care that the original field had terms, even if for some reason
they were removed in the analysis chain. This can happen with some methods
we use for simple entity extraction.

Further, the definitions of BM25, etc., rely on corpus-level document
frequency for a term and don't have a concept of fields. BM25F can mostly
be implemented with BlendedTermQuery, which blends doc frequencies across
fields:
http://opensourceconnections.com/blog/2016/10/19/bm25f-in-lucene/


On Tue, Dec 5, 2017 at 10:28 AM alessandro.benedetti 
wrote:

> Thanks Yonik and thanks Doug.
>
> I agree with Doug on adding a few generic test corpora that Jenkins
> automatically runs some metrics on, to evaluate that Apache Lucene/Solr
> changes don't affect a golden truth too much.
> This can of course be very complex, but I think it is a direction the
> Apache Lucene/Solr community should work on.
>
> Given that, I do believe that in this case, moving from maxDocs (field
> independent) to docCount (field dependent) was a good move (and this
> specific multi-language use case is an example).
>
> Actually I also believe that theoretically docCount (field dependent) is
> still better than maxDocs (field dependent).
> This is because docCount (field dependent) represents a state in time
> associated with the current index, while maxDocs represents a historical
> consideration.
> A corpus of documents can change in time, and how rare a term is can
> drastically change (let's pick a highly dynamic domain such as news).
>
> Doug, were you able to generalise and abstract any consideration from what
> happened to your customers and why they got regressions moving from maxDocs
> to docCount (field dependent)?
>
>
>
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
-- 
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


Re: SolrIndexSearcher count

2017-12-05 Thread Erick Erickson
Do you have any custom code in the mix anywhere?

On Tue, Dec 5, 2017 at 5:02 AM, Rick Dig  wrote:
> Hello all,
> is it normal to have many instances (100+) of SolrIndexSearchers to be open
> at the same time? Our Heap Analysis shows this to be the case.
>
> We have autoCommit for every 5 minutes, with openSearcher=true, would this
> close the old searcher and create a new one or just create a new one with
> the old one still not getting dereferenced? if so, when do the older
> searchers get cleaned up ?
>
> thanks for your help
> -rakshit


Re: Logging in Solrcloud

2017-12-05 Thread Walter Underwood
HTTP request log, not solr.log.

This is intra-cluster:

10.98.15.241 - - [29/Oct/2017:23:59:57 +] "POST 
//sc16.prod2.cloud.cheggnet.com:8983/solr/questions_shard4_replica8/auto 
HTTP/1.1" 200 194

This is from outside (yes, we have long queries):

10.98.15.110 - - [29/Oct/2017:23:59:58 +] "GET 
//solr-cloud.prod2.cheggnet.com:8983/solr/questions/srp?qt=%2Fsrp&q=jack+and+jill+are+maneuvering+a+2800+kg+boat+near+a+dock.+initially+the+boat%27s+position+is+m+and+its+speed+is+1.9+m%2Fs.+as+the+boat+moves+to+position+m%2C+jack+exerts+a+force+n+and+jill+exerts&fq=source%3Atbs&start=0&rows=2&hl=true&hl.q=jack+and+jill+are+maneuvering+a+2800+kg+boat+near+a+dock.+initially+the+boat%27s+position+is+m+and+its+speed+is+1.9+m%2Fs.+as+the+boa

In your case, “gettingstarted_shard1_replica_n2” should mean that is an 
intra-cluster request. Also, “distrib=false” means it is for a single core.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 5, 2017, at 7:38 AM, Matzdorf, Stefan, Springer SBM DE 
>  wrote:
> 
> first of all, I'm using Solr 7.1.0 ...
> 
> I took a look into the logfile of Solr and see the following 2 log 
> statements for query "test":
> 
> 4350609 INFO  (qtp1918627686-691) [c:gettingstarted s:shard1 r:core_node5 
> x:gettingstarted_shard1_replica_n2] o.a.s.c.S.Request 
> [gettingstarted_shard1_replica_n2]  webapp=/solr path=/select 
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:8983/solr/gettingstarted_shard1_replica_n2/|http://127.0.1.1:7574/solr/gettingstarted_shard1_replica_n1/&rows=10&version=2&q=test&NOW=1512474643732&isShard=true&wt=javabin}
>  hits=0 status=0 QTime=0
> 
> 4350615 INFO  (qtp1918627686-20) [c:gettingstarted s:shard2 r:core_node8 
> x:gettingstarted_shard2_replica_n6] o.a.s.c.S.Request 
> [gettingstarted_shard2_replica_n6]  webapp=/solr path=/select 
> params={q=test&indent=on&wt=json} hits=0 status=0 QTime=7
> 
> 
> Both were logged by the org.apache.solr.core.SolrCore.Request logger (I 
> configured that to log on INFO level), but there is no information about what 
> kind of request (GET/POST etc.) comes in. It just logs what you can see above. 
> Do you use a different logger for that? (And by logger I mean the ones you can 
> configure under the Logging/Level menu in the Solr UI, where you choose what 
> you want to log.)
> 
> Regards
> Matze
> 
> 
> --
> Stefan Matzdorf
> Software Engineer
> B2X Platform Development
> 
> Springer Nature
> Heidelberger Platz 3, 14197 Berlin, Germany
> T  +4903827975072
> stefan.matzd...@springer.com
> www.springernature.com
> ---
> Springer Nature is one of the world’s leading global research, educational 
> and professional publishers, created in May 2015 through the combination of 
> Nature Publishing Group,
> Palgrave Macmillan, Macmillan Education and Springer Science+Business Media.
> ---
> Springer Science+Business Media Deutschland GmbH
> Registered Office: Berlin / Amtsgericht Berlin-Charlottenburg, HRB 152987 B
> Directors: Derk Haank, Martin Mos, Dr. Ulrich Vest
> 
> 
> From: Walter Underwood 
> Sent: Tuesday, 5 December 2017 16:20
> To: solr-user@lucene.apache.org
> Subject: Re: Logging in Solrcloud
> 
> In 6.5.1, the intra-cluster requests are POST, which makes them easy to 
> distinguish in the request logs. Also, the intra-cluster requests go to a 
> specific core instead of to the collection. So we use the request logs and 
> grep out the GET lines.
> 
> We are considering fronting every Solr process with a local nginx server. 
> That will allow us to limit concurrent connections. It will also give us a 
> log of just the client requests.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Dec 5, 2017, at 4:25 AM, Matzdorf, Stefan, Springer SBM DE 
>>  wrote:
>> 
>> To be more precise and provide some more details, I tried to simplify the 
>> problem by using the Solr examples that were delivered with Solr.
>> So I started bin/solr -e cloud, using 2 nodes, 2 shards and a replication factor of 2.
>> 
>> To understand the following, it might be important to know which ports are 
>> used:
>> node 1: 8983 (leader for shard1 and shard2)
>> node 2: 7574 (no leader at all)
>> 
>> 
>> In this example I searched for 3 terms in the following order: first on node 
>> 1 (8983 - leader) and then on node 2 (7574).
>> 
>> Sample1 (q=test):
>>   http://localhost:8983/solr/gettingstarted/select?indent=on&q=test&wt=json
>> 
>>   produced logs:
>> 1)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
>> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard1_replica_n1/|http://127.0.1.1:8983/solr/gettingstarted_shard1_replica_n2/&rows=10&version=2&q=test&NOW=1512474523045&isShard=true&wt=javabin}
>>  hits=0 status=0 QTime=1
>> 2)  [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
>> 

AW: Logging in Solrcloud

2017-12-05 Thread Matzdorf, Stefan, Springer SBM DE
First of all, I'm using Solr 7.1.0 ...

I took a look into the logfile of Solr and see the following 2 log statements 
for query "test":

4350609 INFO  (qtp1918627686-691) [c:gettingstarted s:shard1 r:core_node5 
x:gettingstarted_shard1_replica_n2] o.a.s.c.S.Request 
[gettingstarted_shard1_replica_n2]  webapp=/solr path=/select 
params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:8983/solr/gettingstarted_shard1_replica_n2/|http://127.0.1.1:7574/solr/gettingstarted_shard1_replica_n1/&rows=10&version=2&q=test&NOW=1512474643732&isShard=true&wt=javabin}
 hits=0 status=0 QTime=0

4350615 INFO  (qtp1918627686-20) [c:gettingstarted s:shard2 r:core_node8 
x:gettingstarted_shard2_replica_n6] o.a.s.c.S.Request 
[gettingstarted_shard2_replica_n6]  webapp=/solr path=/select 
params={q=test&indent=on&wt=json} hits=0 status=0 QTime=7


Both were logged by the org.apache.solr.core.SolrCore.Request logger (I 
configured that to log on INFO level), but there is no information about what 
kind of request (GET/POST etc.) comes in. It just logs what you can see above. 
Do you use a different logger for that? (And by logger I mean the ones you can 
configure under the Logging/Level menu in the Solr UI, where you choose what 
you want to log.)

Regards
Matze


--
Stefan Matzdorf
Software Engineer
B2X Platform Development

Springer Nature
Heidelberger Platz 3, 14197 Berlin, Germany
T  +4903827975072
stefan.matzd...@springer.com
www.springernature.com
---
Springer Nature is one of the world’s leading global research, educational and 
professional publishers, created in May 2015 through the combination of Nature 
Publishing Group,
Palgrave Macmillan, Macmillan Education and Springer Science+Business Media.
---
Springer Science+Business Media Deutschland GmbH
Registered Office: Berlin / Amtsgericht Berlin-Charlottenburg, HRB 152987 B
Directors: Derk Haank, Martin Mos, Dr. Ulrich Vest


From: Walter Underwood 
Sent: Tuesday, 5 December 2017 16:20
To: solr-user@lucene.apache.org
Subject: Re: Logging in Solrcloud

In 6.5.1, the intra-cluster requests are POST, which makes them easy to 
distinguish in the request logs. Also, the intra-cluster requests go to a 
specific core instead of to the collection. So we use the request logs and grep 
out the GET lines.

We are considering fronting every Solr process with a local nginx server. That 
will allow us to limit concurrent connections. It will also give us a log of 
just the client requests.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 5, 2017, at 4:25 AM, Matzdorf, Stefan, Springer SBM DE 
>  wrote:
>
> To be more precise and provide some more details, I tried to simplify the 
> problem by using the Solr examples that were delivered with Solr.
> So I started bin/solr -e cloud, using 2 nodes, 2 shards and a replication factor of 2.
>
> To understand the following, it might be important to know which ports are 
> used:
> node 1: 8983 (leader for shard1 and shard2)
> node 2: 7574 (no leader at all)
>
>
> In this example I searched for 3 terms in the following order: first on node 
> 1 (8983 - leader) and then on node 2 (7574).
>
> Sample1 (q=test):
>    http://localhost:8983/solr/gettingstarted/select?indent=on&q=test&wt=json
>
>    produced logs:
>  1)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard1_replica_n1/|http://127.0.1.1:8983/solr/gettingstarted_shard1_replica_n2/&rows=10&version=2&q=test&NOW=1512474523045&isShard=true&wt=javabin}
>  hits=0 status=0 QTime=1
>  2)  [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard2_replica_n4/|http://127.0.1.1:8983/solr/gettingstarted_shard2_replica_n6/&rows=10&version=2&q=test&NOW=1512474523045&isShard=true&wt=javabin}
>  hits=0 status=0 QTime=1
>
>
>
>    http://localhost:7574/solr/gettingstarted/select?indent=on&q=test&wt=json
>
>    produced logs:
>  1)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
> params={q=test&indent=on&wt=json} hits=0 status=0 QTime=17
>
> ##
> ##
>
> Sample2 (q=foo):
>    http://localhost:8983/solr/gettingstarted/select?indent=on&q=foo&wt=json
>
>    produced logs:
>  1)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard1_replica_n1/|http://127.0.1.1:8983/solr/gettingstarted_shard1_replica_n2/&rows=10&version=2&q=foo&NOW=1512474569299&isShard=true&wt=javabin}
>  hits=0 status=0 QTime=0
>
>
>
>    http://localhost:7574/solr/gettingstarted/select?indent=on&q=foo&wt=json
>
>    produced logs:
>  1) 

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti
Thanks Yonik and thanks Doug.

I agree with Doug on adding a few generic test corpora that Jenkins
automatically runs some metrics on, to evaluate that Apache Lucene/Solr
changes don't affect a golden truth too much.
This can of course be very complex, but I think it is a direction the Apache
Lucene/Solr community should work on.

Given that, I do believe that in this case, moving from maxDocs (field
independent) to docCount (field dependent) was a good move (and this
specific multi-language use case is an example).

Actually I also believe that theoretically docCount (field dependent) is
still better than maxDocs (field dependent).
This is because docCount (field dependent) represents a state in time
associated with the current index, while maxDocs represents a historical
consideration.
A corpus of documents can change in time, and how rare a term is can
drastically change (let's pick a highly dynamic domain such as news).

Doug, were you able to generalise and abstract any consideration from what
happened to your customers and why they got regressions moving from maxDocs
to docCount (field dependent)?




-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Java profiler?

2017-12-05 Thread Walter Underwood
Anybody have a favorite profiler to use with Solr? I've been asked to look at 
why our queries are slow at a detailed level.

Personally, I think they are slow because they are so long, up to 40 terms.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Logging in Solrcloud

2017-12-05 Thread Walter Underwood
In 6.5.1, the intra-cluster requests are POST, which makes them easy to 
distinguish in the request logs. Also, the intra-cluster requests go to a 
specific core instead of to the collection. So we use the request logs and grep 
out the GET lines.

We are considering fronting every Solr process with a local nginx server. That 
will allow us to limit concurrent connections. It will also give us a log of 
just the client requests.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
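
For reference, a minimal sketch of that request-log filtering (the log file
name here is illustrative; client requests are the GET lines, per the above):

grep '"GET ' 2017_10_29.request.log > client_requests.log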


> On Dec 5, 2017, at 4:25 AM, Matzdorf, Stefan, Springer SBM DE 
>  wrote:
> 
> To be more precise and provide some more details, I tried to simplify the 
> problem by using the Solr examples that were delivered with Solr.
> So I started bin/solr -e cloud, using 2 nodes, 2 shards and a replication factor of 2. 
> 
> To understand the following, it might be important to know which ports are 
> used:
> node 1: 8983 (leader for shard1 and shard2)
> node 2: 7574 (no leader at all)
> 
> 
> In this example I searched for 3 terms in the following order: first on node 
> 1 (8983 - leader) and then on node 2 (7574).
> 
> Sample1 (q=test):
>    http://localhost:8983/solr/gettingstarted/select?indent=on&q=test&wt=json
> 
>    produced logs:
>  1)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard1_replica_n1/|http://127.0.1.1:8983/solr/gettingstarted_shard1_replica_n2/&rows=10&version=2&q=test&NOW=1512474523045&isShard=true&wt=javabin}
>  hits=0 status=0 QTime=1
>  2)  [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard2_replica_n4/|http://127.0.1.1:8983/solr/gettingstarted_shard2_replica_n6/&rows=10&version=2&q=test&NOW=1512474523045&isShard=true&wt=javabin}
>  hits=0 status=0 QTime=1
> 
> 
> 
>    http://localhost:7574/solr/gettingstarted/select?indent=on&q=test&wt=json
> 
>    produced logs:
>  1)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
> params={q=test&indent=on&wt=json} hits=0 status=0 QTime=17
> 
> ##
> ##
> 
> Sample2 (q=foo):
>    http://localhost:8983/solr/gettingstarted/select?indent=on&q=foo&wt=json
> 
>    produced logs:
>  1)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard1_replica_n1/|http://127.0.1.1:8983/solr/gettingstarted_shard1_replica_n2/&rows=10&version=2&q=foo&NOW=1512474569299&isShard=true&wt=javabin}
>  hits=0 status=0 QTime=0
> 
> 
> 
>    http://localhost:7574/solr/gettingstarted/select?indent=on&q=foo&wt=json
> 
>    produced logs:
>  1) [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
> params={q=foo&indent=on&wt=json} hits=0 status=0 QTime=13
> 
> ##
> ##
> 
> Sample3 (q=test) NOTE - it's the same query as in sample1: 
>    http://localhost:8983/solr/gettingstarted/select?indent=on&q=test&wt=json
> 
>    produced logs:
>  1) [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard2_replica_n4/|http://127.0.1.1:8983/solr/gettingstarted_shard2_replica_n6/&rows=10&version=2&q=test&NOW=1512474643732&isShard=true&wt=javabin}
>  hits=0 status=0 QTime=0
> 
> 
>    http://localhost:7574/solr/gettingstarted/select?indent=on&q=test&wt=json
> 
>    produced logs:
>  1)  [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard2_replica_n4/|http://127.0.1.1:8983/solr/gettingstarted_shard2_replica_n6/&rows=10&version=2&q=test&NOW=1512474627254&isShard=true&wt=javabin}
>  hits=0 status=0 QTime=0
>  2)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
> params={q=test&indent=on&wt=json} hits=0 status=0 QTime=13
> 
> ##
> ##
> 
> Sample4 (q=baa):
>    http://localhost:8983/solr/gettingstarted/select?indent=on&q=baa&wt=json
> 
>    produced logs:
>  1)  [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
> params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard2_replica_n4/|http://127.0.1.1:8983/solr/gettingstarted_shard2_replica_n6/&rows=10&version=2&q=baa&NOW=1512474709460&isShard=true&wt=javabin}
>  hits=0 status=0 QTime=0
> 
> 
>

Re: Metadata passed with CURL (via literal) is not recognized by SOLR ...?

2017-12-05 Thread Jan . Christopher . Schluchtmann-EXT
Ok, I found the solution myself.

Reason for this behaviour was the "lowernames=true" configuration of the 
Tika request handler, which transformed "module-id" to "module_id". 
I added a fitting copyField to my schema and it seems to work now.


Maybe this information is useful for someone ... of course, it is 
mentioned in the manual, but finding it is the problem if you don't know 
what you are looking for. ;)
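
For reference, a quick way to check which field name actually landed in the
index is to ask for it explicitly; a sketch, with the query illustrative and
the core name taken from the request below:

curl 'http://localhost:8983/solr/ContiReqManCore/select?q=*:*&fl=id,module-id&rows=1'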


Regards
Jan



Mit freundlichen Grüßen/ With kind regards

Jan Schluchtmann
Systems Engineering Cluster Instruments
VW Group
Continental Automotive GmbH
Division Interior
ID S3 RM
VDO-Strasse 1, 64832 Babenhausen, Germany

Telefon/Phone: +49 6073 12-4346
Telefax: +49 6073 12-79-4346



From: jan.christopher.schluchtmann-...@continental-corporation.com
To: solr-user@lucene.apache.org
Date: 05.12.2017 11:02
Subject: Metadata passed with CURL (via literal) is not recognized by SOLR ...?



Hi!
I am trying to index RTF-files by uploading them to the Solr-Server with 
CURL.
I am trying to pass the required metadata via the 
"literal.<fieldname>=<value>" parameters.


The "id" and the "module-id" are mandatory in my schema.
The "id" is recognized correctly, as one can see in the Solr-response 
"doc=48a0xxx" ... but the "module-id" seems to be neglected.

Why is that?


Thanks in advance!!!



Here is the CURL-command I pass via Windows 10 Powershell:

SOLR-REQUEST:

curl.exe "
http://localhost:8983/solr/ContiReqManCore/update/extract/?commit=true=48a04d8e5da651c5-000ba8a6-1=000d8181=FPK_Medium_19S1=%2FFPK_Medium_19S1=000ba8a6=PVVTS_Functional_FPK_Medium_19S1=%2FFPK_Medium_19S1%2F02_Quality%2F10_Verification-Validation%2FPVVTS_Functional_FPK_Medium_19S1=PVVTS_Funct_=1

" -F "object-ole=@D:\(...)\PVVTS_Funct_263.rtf"


SOLR-RESPONSE:

{
  "responseHeader":{
    "status":400,
    "QTime":7},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"[doc=48a04d8e5da651c5-000ba8a6-1] missing required field: module-id",
    "code":400}
}


Mit freundlichen Grüßen/ With kind regards

Jan Schluchtmann
Systems Engineering Cluster Instruments
VW Group
Continental Automotive GmbH
Division Interior
ID S3 RM
VDO-Strasse 1, 64832 Babenhausen, Germany

Telefon/Phone: +49 6073 12-4346
Telefax: +49 6073 12-79-4346


Implicit routing changes to Composite while re-deploy configuration changes

2017-12-05 Thread Ketan Thanki
Hi,

I have implemented implicit routing with the below configuration.
I manually created one configuration set, 'AMS_Config', which contains the 
configuration files (schema, solrconfig, etc.).

Using 'AMS_Config' I have created 2 collections, model and workset, with the 
below command, which created 2 shards for each collection with 2 replicas per 
shard, one on each Solr instance.

Command: 
/admin/collections?action=CREATE&name=model&maxShardsPerNode=2&router.name=implicit&router.field=dr&shards=shard1,shard2&replicationFactor=2&collection.configName=AMS_Config

Collection Detail:
Model = Shard1,Shard2
Shard1 = node1,node2[leader]
Shard2 = node1[leader],node2

Configuration in admin UI on solr for model collection:
Shard count:2
configName:AMS_Config
replicationFactor:2
maxShardsPerNode:2
router:implicit
autoAddReplicas:false

After this I indexed documents to a particular shard by setting the 
router.field value: dr = shard1
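
For reference, a minimal sketch of such an update, with illustrative field
values (dr is the router.field configured above):

curl 'http://localhost:8983/solr/model/update?commit=true' -H 'Content-Type: application/json' -d '[{"id":"doc1","dr":"shard1"}]'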

Issue: after indexing the documents I made changes in the schema files and 
redeployed using the below command to set the latest configs:
zkcli.bat -cmd upconfig -confdir ../../solr/AMS_Config/conf -confname 
AMS_Config -z 

It changes the router value from implicit to compositeId, and now my documents 
are indexed across all shards. Why does this happen, and how can I avoid it?

Please do the needful.

Regards,
Ketan.




Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Doug Turnbull
Just a piece of feedback from clients on the original docCount change.

I have seen several cases with clients where the switch to docCount
surprised them and harmed relevance.

More broadly, I'm concerned that when we make these changes there's no
testing process against test corpora with judgments and relevance metrics
to understand their impact. I see it mentioned in a JIRA from time to time
that someone saw an improvement on a private collection in NDCG. And we
have to take their word for it.

Public testing of relevance against every build using stock settings could
be extremely valuable and would more easily justify these changes.
Something similar to the performance tests that are made.

Sadly I can only complain now :) I wish I had time to work on something
like this.

Doug

On Tue, Dec 5, 2017 at 7:38 AM Yonik Seeley  wrote:

> On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti
>  wrote:
> > "Lucene/Solr doesn't actually delete documents when you delete them, it
> > just marks them as deleted.  I'm pretty sure that the difference between
> > docCount and maxDoc is deleted documents.  Maybe I don't understand what
> > I'm talking about, but that is the best I can come up with. "
> >
> > Thanks Shawn, yes, that is correct and I was aware of it.
> > I was curious of another difference :
> > I think we confirmed that docCount is local to the field ( thanks Yonik
> for
> > that) so :
> >
> > docCount(index,field1)= # of documents in the index that currently have
> > value(s) for field1
> >
> > My question is :
> >
> > maxDocs(index,field1)= max # of documents in the index that had value(s)
> for
> > field1
> >
> > OR
> >
> > maxDocs(index)= max # of documents that appeared in the index ( field
> > independent)
>
> The latter.
> I imagine that's why docCount was introduced (to avoid changing the
> meaning of an existing term).
> FWIW, the scoring change was made in
> https://issues.apache.org/jira/browse/LUCENE-6711 for Lucene/Solr 6.0
>
> -Yonik
>
-- 
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


SolrIndexSearcher count

2017-12-05 Thread Rick Dig
Hello all,
is it normal to have many instances (100+) of SolrIndexSearchers to be open
at the same time? Our Heap Analysis shows this to be the case.

We have autoCommit for every 5 minutes, with openSearcher=true, would this
close the old searcher and create a new one or just create a new one with
the old one still not getting dereferenced? if so, when do the older
searchers get cleaned up ?

thanks for your help
-rakshit
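
For reference, the metrics API (Solr 6.4+) can at least show the stats of the
currently registered searcher; a sketch, with host and port illustrative:

curl 'http://localhost:8983/solr/admin/metrics?group=core&prefix=SEARCHER'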


Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Yonik Seeley
On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti
 wrote:
> "Lucene/Solr doesn't actually delete documents when you delete them, it
> just marks them as deleted.  I'm pretty sure that the difference between
> docCount and maxDoc is deleted documents.  Maybe I don't understand what
> I'm talking about, but that is the best I can come up with. "
>
> Thanks Shawn, yes, that is correct and I was aware of it.
> I was curious of another difference :
> I think we confirmed that docCount is local to the field ( thanks Yonik for
> that) so :
>
> docCount(index,field1)= # of documents in the index that currently have
> value(s) for field1
>
> My question is :
>
> maxDocs(index,field1)= max # of documents in the index that had value(s) for
> field1
>
> OR
>
> maxDocs(index)= max # of documents that appeared in the index ( field
> independent)

The latter.
I imagine that's why docCount was introduced (to avoid changing the
meaning of an existing term).
FWIW, the scoring change was made in
https://issues.apache.org/jira/browse/LUCENE-6711 for Lucene/Solr 6.0

-Yonik


Re: Logging in Solrcloud

2017-12-05 Thread Matzdorf, Stefan, Springer SBM DE
To be more precise and provide some more details, I tried to simplify the 
problem by using the Solr examples that were delivered with Solr.
So I started bin/solr -e cloud, using 2 nodes, 2 shards and a replication factor of 2. 

To understand the following, it might be important to know which ports are 
used:
 node 1: 8983 (leader for shard1 and shard2)
 node 2: 7574 (no leader at all)


In this example I searched for 3 terms in the following order: first on node 1 
(8983 - leader) and then on node 2 (7574).

Sample1 (q=test):
http://localhost:8983/solr/gettingstarted/select?indent=on&q=test&wt=json

produced logs:
  1)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard1_replica_n1/|http://127.0.1.1:8983/solr/gettingstarted_shard1_replica_n2/&rows=10&version=2&q=test&NOW=1512474523045&isShard=true&wt=javabin}
 hits=0 status=0 QTime=1
  2)  [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard2_replica_n4/|http://127.0.1.1:8983/solr/gettingstarted_shard2_replica_n6/&rows=10&version=2&q=test&NOW=1512474523045&isShard=true&wt=javabin}
 hits=0 status=0 QTime=1



http://localhost:7574/solr/gettingstarted/select?indent=on&q=test&wt=json

produced logs:
  1)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
params={q=test&indent=on&wt=json} hits=0 status=0 QTime=17

##
##

Sample2 (q=foo):
http://localhost:8983/solr/gettingstarted/select?indent=on&q=foo&wt=json

produced logs:
  1)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard1_replica_n1/|http://127.0.1.1:8983/solr/gettingstarted_shard1_replica_n2/&rows=10&version=2&q=foo&NOW=1512474569299&isShard=true&wt=javabin}
 hits=0 status=0 QTime=0



http://localhost:7574/solr/gettingstarted/select?indent=on&q=foo&wt=json

produced logs:
  1) [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
params={q=foo&indent=on&wt=json} hits=0 status=0 QTime=13

##
##

Sample3 (q=test) NOTE - it's the same query as in sample1: 
http://localhost:8983/solr/gettingstarted/select?indent=on&q=test&wt=json

produced logs:
  1) [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard2_replica_n4/|http://127.0.1.1:8983/solr/gettingstarted_shard2_replica_n6/&rows=10&version=2&q=test&NOW=1512474643732&isShard=true&wt=javabin}
 hits=0 status=0 QTime=0


http://localhost:7574/solr/gettingstarted/select?indent=on&q=test&wt=json

produced logs:
  1)  [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard2_replica_n4/|http://127.0.1.1:8983/solr/gettingstarted_shard2_replica_n6/&rows=10&version=2&q=test&NOW=1512474627254&isShard=true&wt=javabin}
 hits=0 status=0 QTime=0
  2)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
params={q=test&indent=on&wt=json} hits=0 status=0 QTime=13

##
##

Sample4 (q=baa):
http://localhost:8983/solr/gettingstarted/select?indent=on&q=baa&wt=json

produced logs:
  1)  [gettingstarted_shard2_replica_n4]  webapp=/solr path=/select 
params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/gettingstarted_shard2_replica_n4/|http://127.0.1.1:8983/solr/gettingstarted_shard2_replica_n6/&rows=10&version=2&q=baa&NOW=1512474709460&isShard=true&wt=javabin}
 hits=0 status=0 QTime=0


http://localhost:7574/solr/gettingstarted/select?indent=on&q=baa&wt=json

produced logs:
  1)  [gettingstarted_shard1_replica_n1]  webapp=/solr path=/select 
params={q=baa&indent=on&wt=json} hits=0 status=0 QTime=12




Sorry for these messy logs. 
I'll try to summarize.

For queries against node 1, the leading node, I never got those "short 
logs" containing just what I was querying. Instead I receive logs containing 
all this sharding information. Sometimes 2 equivalent ones (see sample 1) and 
sometimes just one log (samples 2-4). Note that I got different logs for the 
same query/request (sample 1 vs. sample 3).

For queries against node 2, not leading anything, I got those "short logs" 
every time. In addition to that, I also receive sometimes

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti
"Lucene/Solr doesn't actually delete documents when you delete them, it 
just marks them as deleted.  I'm pretty sure that the difference between 
docCount and maxDoc is deleted documents.  Maybe I don't understand what 
I'm talking about, but that is the best I can come up with. "

Thanks Shawn, yes, that is correct and I was aware of it.
I was curious of another difference :
I think we confirmed that docCount is local to the field ( thanks Yonik for
that) so :

docCount(index,field1)= # of documents in the index that currently have
value(s) for field1

My question is :

maxDocs(index,field1)= max # of documents in the index that had value(s) for
field1

OR

maxDocs(index)= max # of documents that appeared in the index ( field
independent)

Regards




-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Issue with CDCR bootstrapping in Solr 7.1

2017-12-05 Thread Amrit Sarkar
Tom,

Thank you for trying out a bunch of things with the CDCR setup. I was able to
replicate the exact issue on my setup; this is a problem.

I have opened a JIRA for the same:
https://issues.apache.org/jira/browse/SOLR-11724. Feel free to add any
relevant details as you like.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2

On Tue, Dec 5, 2017 at 2:23 AM, Tom Peters  wrote:

> Not sure how it's possible. But I also tried using the _default config and
> just adding in the source and target configuration to make sure I didn't
> have something wonky in my custom solrconfig that was causing this issue. I
> can confirm that until I restart the follower nodes, they will not receive
> the initial index.
>
> > On Dec 1, 2017, at 12:52 AM, Amrit Sarkar 
> wrote:
> >
> > Tom,
> >
> > (and take care not to restart the leader node otherwise it will replicate
> >> from one of the replicas which is missing the index).
> >
> > How is this possible? Ok I will look more into it. Appreciate if someone
> > else also chimes in if they have similar issue.
> >
> > Amrit Sarkar
> > Search Engineer
> > Lucidworks, Inc.
> > 415-589-9269
> > www.lucidworks.com
> > Twitter http://twitter.com/lucidworks
> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> > Medium: https://medium.com/@sarkaramrit2
> >
> > On Fri, Dec 1, 2017 at 4:49 AM, Tom Peters  wrote:
> >
> >> Hi Amrit, I tried issuing hard commits to the various nodes in the
> target
> >> cluster and it does not appear to cause the follower replicas to receive
> >> the initial index. The only way I can get the replicas to see the
> original
> >> index is by restarting those nodes (and take care not to restart the
> leader
> >> node otherwise it will replicate from one of the replicas which is
> missing
> >> the index).
> >>
> >>
> >>> On Nov 30, 2017, at 12:16 PM, Amrit Sarkar 
> >> wrote:
> >>>
> >>> Tom,
> >>>
> >>> This is very useful:
> >>>
>  I found a way to get the follower replicas to receive the documents
> from
>  the leader in the target data center, I have to restart the solr
> >> instance
>  running on that server. Not sure if this information helps at all.
> >>>
> >>>
> >>> You have to issue a hard commit on target after the bootstrapping is done.
> >>> Reloading makes the core open a new searcher. While an explicit commit is
> >>> issued at the target leader after the BS is done, followers are left
> >>> unattended though the docs are copied over.
> >>>
> >>> Amrit Sarkar
> >>> Search Engineer
> >>> Lucidworks, Inc.
> >>> 415-589-9269
> >>> www.lucidworks.com
> >>> Twitter http://twitter.com/lucidworks
> >>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>> Medium: https://medium.com/@sarkaramrit2
> >>>
> >>> On Thu, Nov 30, 2017 at 10:06 PM, Tom Peters 
> >> wrote:
> >>>
>  Hi Amrit,
> 
>  Starting with more documents doesn't appear to have made a difference.
>  This time I tried with >1000 docs. Here are the steps I took:
> 
>  1. Deleted the collection on both the source and target DCs.
> 
>  2. Recreated the collections.
> 
>  3. Indexed >1000 documents on source data center, hard commmit
> 
>  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>  $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound';
> >> done
>  solr01-a: 1368
>  solr01-b: 1368
>  solr01-c: 1368
>  solr02-a: 0
>  solr02-b: 0
>  solr02-c: 0
> 
>  4. Enabled CDCR and checked docs
> 
>  $ curl 'solr01-a:8080/solr/synacor/cdcr?action=START'
> 
>  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>  $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound';
> >> done
>  solr01-a: 1368
>  solr01-b: 1368
>  solr01-c: 1368
>  solr02-a: 0
>  solr02-b: 0
>  solr02-c: 1368
> 
>  Some additional notes:
> 
>  * I do not have numRecordsToKeep defined in my solrconfig.xml, so I
> >> assume
>  it will use the default of 100
> 
>  * I found a way to get the follower replicas to receive the documents
> >> from
>  the leader in the target data center, I have to restart the solr
> >> instance
>  running on that server. Not sure if this information helps at all.
> 
> > On Nov 30, 2017, at 11:22 AM, Amrit Sarkar 
>  wrote:
> >
> > Hi Tom,
> >
> > I see what you are saying and I too think this is a bug, but I will
>  confirm
> > once on the code. Bootstrapping should happen on all the nodes of the
> > target.
> >
> > Meanwhile can you index more than 100 documents in the source and do the
> > exact same experiment again. Followers will not 

Metadata passed with CURL (via literal) is not recognized by SOLR ...?

2017-12-05 Thread Jan . Christopher . Schluchtmann-EXT
Hi!
I am trying to index RTF-files by uploading them to the Solr-Server with 
CURL.
I am trying to pass the required metadata via the 
"literal.<fieldname>=<value>" parameters.


The "id" and the "module-id" are mandatory in my schema.
The "id" is recognized correctly, as one can see in the Solr-response 
"doc=48a0xxx" ... but the "module-id" seems to be neglected.

Why is that?


Thanks in advance!!!



Here is the CURL-command I pass via Windows 10 Powershell:

SOLR-REQUEST:

curl.exe "
http://localhost:8983/solr/ContiReqManCore/update/extract/?commit=true=48a04d8e5da651c5-000ba8a6-1=000d8181=FPK_Medium_19S1=%2FFPK_Medium_19S1=000ba8a6=PVVTS_Functional_FPK_Medium_19S1=%2FFPK_Medium_19S1%2F02_Quality%2F10_Verification-Validation%2FPVVTS_Functional_FPK_Medium_19S1=PVVTS_Funct_=1
" -F "object-ole=@D:\(...)\PVVTS_Funct_263.rtf"


SOLR-RESPONSE:

{
  "responseHeader":{
    "status":400,
    "QTime":7},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"[doc=48a04d8e5da651c5-000ba8a6-1] missing required field: module-id",
    "code":400}
}


Mit freundlichen Grüßen/ With kind regards

Jan Schluchtmann
Systems Engineering Cluster Instruments
VW Group
Continental Automotive GmbH
Division Interior
ID S3 RM
VDO-Strasse 1, 64832 Babenhausen, Germany

Telefon/Phone: +49 6073 12-4346
Telefax: +49 6073 12-79-4346

Re: Logging in Solrcloud

2017-12-05 Thread Emir Arnautović
Hi Stefan,
I am not aware of an option to log only client-side queries, but I think you 
can find a workaround with what you currently have. If you take a look at the 
log lines for a query that comes from the client and one that is the result of 
querying shards, you will see differences. The simplest one, if you are not 
using SolrJ for querying, would be the wt parameter: e.g. a client request 
might have wt=json while shard requests would have wt=javabin.
There are also parameters that are added by Solr for internal calls, so just 
compare log lines and you will find some discriminator in your version of Solr.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
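
For reference, a minimal sketch of counting only client-side queries that
way (the log path and patterns are illustrative):

grep 'path=/select' solr.log | grep -v 'wt=javabin' | wc -l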



> On 5 Dec 2017, at 07:58, Matzdorf, Stefan, Springer SBM DE 
>  wrote:
> 
> Hey everybody,
> 
> I have a question regarding query-request logging in SolrCloud. I've set 
> the "org.apache.solr.core.SolrCore.Request" logger to INFO level and it's 
> logging all those query requests. So far so good. BUT, as I'm running Solr in 
> cloud mode with 3 nodes and 3 shards per collection (with a replication factor 
> of 3, distributed across all 3 nodes), I get a logging statement from each node 
> as well as from each shard. That I get it from each node seems quite obvious to 
> me. Different server, different Solr instances... ok. But how could I avoid 
> also getting the logs from the shards themselves?
> 
> My main problem is that I would like to measure, classify etc. my queries. 
> But, for example, if I would like to count the number of queries it gets a bit 
> weird. From one request sent to the cloud I got 5-7 logging statements. (I 
> guess it depends on the results of found documents within a shard?!)
> 
> 
> If I could get just one log statement per node per request (in my case 3) that 
> would be good. But even then, I have to do some math to get the exact values. 
> At first look it seems quite easy, dividing by 3, but that's sadly not the 
> case. What happens if one node goes down? Then I would just get 2 
> log statements. That's also the reason why I can't set the log level to INFO 
> on just one node.
> 
> 
> 
> Long story short, is there a better way to log queries than setting 
> "org.apache.solr.core.SolrCore.Request" to INFO?
> 
> 
> Thanks in advance!