Re: Solr Cloud with 5 servers cluster failed due to Leader out of memory

2016-08-04 Thread Shawn Heisey
On 8/4/2016 8:14 PM, Tim Chen wrote:
> Couple of thoughts: 1, If Leader goes down, it should just go down,
> like dead down, so other servers can do the election and choose the
> new leader. This at least avoids bringing down the whole cluster. Am I
> right? 

Supplementing what Erick told you:

When a typical Java program throws OutOfMemoryError, program behavior is
completely unpredictable.  There are programming techniques that can be
used so that behavior IS predictable, but writing that code can be
challenging.

Solr 5.x and 6.x, when they are started on a UNIX/Linux system, use a
Java option to execute a script when OutOfMemoryError happens.  This
script kills Solr completely.  We are working on adding this capability
when running on Windows.
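
For reference, the flag the start script passes looks roughly like this (a
sketch -- the actual script path, port, and log directory depend on your
install layout, so treat these values as placeholders):

-XX:OnOutOfMemoryError="/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs"

When the JVM throws OutOfMemoryError it runs that script, which logs the
event and forcibly kills the Solr process so a healthy replica can take over.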

> 2, Apparently we should not push too many documents to Solr, how do
> you guys handle this? Set a limit somewhere? 

There are exactly two ways to deal with OOME problems: Increase the heap
or reduce Solr's memory requirements.  The number of documents you push
to Solr is unlikely to have a large effect on the amount of memory that
Solr requires.  Here's some information on this topic:

https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
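
If you go the increase-heap route: on a 5.x/6.x binary install the heap is
normally set in solr.in.sh (illustrative value -- size it to your actual
needs):

SOLR_HEAP="12g"

Since you are on 4.10 under Tomcat, the equivalent is raising -Xms/-Xmx in
whatever sets JAVA_OPTS or CATALINA_OPTS for your Tomcat instance.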

Thanks,
Shawn



Re: Solr Cloud with 5 servers cluster failed due to Leader out of memory

2016-08-04 Thread Erick Erickson
The fact that all the shards have the same leader is somewhat of a red
herring. Until you get hundreds of shards (perhaps across a _lot_ of
collections), the additional load on the leaders is hard to measure.
If you really see this as a problem, consider the BALANCESHARDUNIQUE
and REBALANCELEADERS Collection API commands, see:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-BalanceSliceUnique
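
The calls look like this (illustrative -- substitute your own host and
collection name):

http://host:8983/solr/admin/collections?action=BALANCESHARDUNIQUE&collection=myColl&property=preferredLeader
http://host:8983/solr/admin/collections?action=REBALANCELEADERS&collection=myColl

The first spreads the preferredLeader property evenly across your nodes; the
second then tries to make those replicas the actual leaders.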

That said, your OOM errors indicate you simply have too many Solr
collections doing too many things with too little memory.

bq: All the other non-leader servers rely on the Leader to finish
indexing the new documents.

This is not the case. The indexing is done independently on all
replicas. What's _probably_ happening here is that the leaders are
spinning off threads to pass the data on to the replicas and you're
running so close to the heap limit that spinning up those threads is
pushing you to OOM errors.

And, if my hypothesis is true, you'll soon run into problems on the
non-leaders as you index more and more documents to your collections.
Consider putting some serious effort into determining your hardware/JVM
needs; see: 
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick


Re: Problems using fieldType text_general in copyField

2016-08-04 Thread Alexandre Rafalovitch
Just as a note, TYPO3 uses a lot of include files, though I do not remember
which specific mechanism they rely on.

Regards,
Alex

On 5 Aug 2016 10:51 AM, "John Bickerstaff"  wrote:

> Many thanks for your time!  Yes, it does make sense.
>
> I'll give your recommendation a shot tomorrow and update the thread.
>
> On Aug 4, 2016 6:22 PM, "Chris Hostetter" 
> wrote:
>
>
> TL;DR: use entity includes *WITHOUT TOP LEVEL WRAPPER ELEMENTS* like in
> this example...
>
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/schema-snippet-types.incl
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/schema-xinclude.xml
>
>
> : The file I pasted last time is the file I was trying to include into the
> : main schema.xml.  It was when that file was getting processed that I got
> : the error  ['content' is not a glob and doesn't match any explicit field
> or
> : dynamicField. ]
>
> Ok -- so just to be crystal clear, you have two files, that look roughly
> like this...
>
> --- BEGIN schema.xml ---
> <?xml version="1.0" encoding="UTF-8" ?>
> <schema name="..." version="1.5">
>   <!-- ... field and type declarations ... -->
>   <xi:include href="statdx_custom_schema.xml"
> xmlns:xi="http://www.w3.org/2001/XInclude"/>
> </schema>
> --- END schema.xml ---
>
> -- BEGIN statdx_custom_schema.xml ---
> <?xml version="1.0" encoding="UTF-8" ?>
> <schema name="..." version="1.6">
>   <!-- ... field and copyField declarations ... -->
> </schema>
> --- END statdx_custom_schema.xml ---
>
> ...am I correct?
>
>
> I'm going to skip a lot of the nitty gritty and just summarize by saying
> that ultimately there are 2 problems here that combine to lead to the
> error you are getting:
>
> 1) what you are trying to do as far as the xinclude is not really what
> xinclude is designed for and doesn't work the way you (or any other sane
> person) would think it does.
>
> 2) for historical reasons, Solr is being sloppy in what <copyField/>
> entries it recognizes.  If anything the "bug" is that Solr is
> willing to try to load any parts of your include file at all -- if it were
> behaving consistently it should be ignoring all of it.
>
>
> Ok ... that seems terse, i'll clarify with a little of the nitty gritty...
>
>
> The root of the issue is really something you alluded to earlier that
> didn't make sense to me at the time because I didn't realize you were
> showing us the *includED* file when you said it...
>
> >>> I assumed (perhaps wrongly) that I could duplicate the <schema>
> >>>   </schema> arrangement from the schema.xml file.
>
> ...that assumption is the crux of the problem, because when the XML parser
> evaluates your xinclude, what it produces is functionally equivalent to if
> you had a schema.xml file that looked like this
>
> --- BEGIN EFFECTIVE schema.xml ---
> <?xml version="1.0" encoding="UTF-8" ?>
> <schema name="..." version="1.5">
>   <!-- ... field and type declarations ... -->
>   <schema name="..." version="1.6">
>     <!-- ... field and copyField declarations ... -->
>   </schema>
> </schema>
> --- END EFFECTIVE schema.xml ---
>
> ...that extra <schema> element nested inside of the original <schema>
> element is what's confusing the hell out of solr.  The <fieldType/> and
> <field/> parsing is fairly strict, and only expects to find them as top
> level elements (or, for historical purposes, as children of <types> and
> <fields> -- note the plurals) while the <copyField/> parsing is sloppy and
> finds the one that gives you an error.
>
> (Even if the <fieldType/> and <field/> parsing was equally sloppy, only the
> outermost <schema> tag would be recognized, so your default field props
> would be based on the version="1.5" declaration, not the version="1.6"
> declaration of the included file they'd be in ... which would be confusing
> as hell, so it's a good thing Solr isn't sloppy about that parsing too)
>
>
> In contrast to xincludes, XML Entity includes are (almost as a side effect
> of the triviality of their design) vastly superior 90% of the time, and
> capable of doing what you want.  The key diff being that Entity includes
> do not require that the file being included is valid XML -- it can be an
> arbitrary snippet of xml content (w/o a top level element) that will be
> inlined verbatim.  so you can/should do something like this...
>
> --- BEGIN schema.xml ---
> <?xml version="1.0" encoding="UTF-8" ?>
> <!DOCTYPE schema [
>   <!ENTITY statdx_custom_include SYSTEM "statdx_custom_schema.incl">
> ]>
> <schema name="..." version="1.5">
>   <!-- ... field and type declarations ... -->
>   &statdx_custom_include;
> </schema>
> --- END schema.xml ---
>
> -- BEGIN statdx_custom_schema.incl ---
> <!-- ... field and copyField declarations, with NO top level wrapper element ... -->
> --- END statdx_custom_schema.incl ---
>
>
> ...make sense?
>
>
> -Hoss
> http://www.lucidworks.com/
>


Solr Cloud with 5 servers cluster failed due to Leader out of memory

2016-08-04 Thread Tim Chen
Hi Guys,

Me again. :)

We have 5 Solr servers:
01-04 running Solr version 4.10 and a ZooKeeper service
05 running ZooKeeper only.

JVM Max Memory set to 10G.

We have around 20 collections, and for each collection there are 4 shards; for 
each shard there are 4 replicas sitting across the 4 Solr servers.

Unfortunately, most of the time all the shards have the same Leader (eg, Solr 
server 01).

Now, if we are adding a lot of documents to Solr, eventually Solr 01 (the 
Leader for all shards) throws Out of Memory in the Tomcat log and its service 
goes down (but port 8983 still responds to telnet).
At that moment I went to see the logs on Solr02, Solr03 and Solr04, and there 
were a lot of "Connection time out" errors; within another 2 minutes all three 
of those Solr servers' services went down too!

My feeling is that when there are a lot of documents being pushed in, the 
Leader is busy with indexing and is also requesting the other (non-leader) 
servers to do the indexing as well. All the other non-leader servers rely on 
the Leader to finish indexing the new documents. At a certain point the 
Solr01 (Leader) server has no more memory and gives up, but the other 
(non-leader) servers are still waiting for the Leader to respond. The whole 
Solr Cloud cluster breaks from here... no more requests are being served.

Couple of thoughts:
1, If Leader goes down, it should just go down, like dead down, so other 
servers can do the election and choose the new leader. This at least avoids 
bringing down the whole cluster. Am I right?
2, Apparently we should not push too many documents to Solr, how do you guys 
handle this? Set a limit somewhere?

Thanks,
Tim






Re: Problems using fieldType text_general in copyField

2016-08-04 Thread John Bickerstaff
Many thanks for your time!  Yes, it does make sense.

I'll give your recommendation a shot tomorrow and update the thread.

On Aug 4, 2016 6:22 PM, "Chris Hostetter"  wrote:


TL;DR: use entity includes *WITHOUT TOP LEVEL WRAPPER ELEMENTS* like in
this example...

https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/schema-snippet-types.incl
https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/schema-xinclude.xml


: The file I pasted last time is the file I was trying to include into the
: main schema.xml.  It was when that file was getting processed that I got
: the error  ['content' is not a glob and doesn't match any explicit field
or
: dynamicField. ]

Ok -- so just to be crystal clear, you have two files, that look roughly
like this...

--- BEGIN schema.xml ---
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="..." version="1.5">
  <!-- ... field and type declarations ... -->
  <xi:include href="statdx_custom_schema.xml"
xmlns:xi="http://www.w3.org/2001/XInclude"/>
</schema>
--- END schema.xml ---

-- BEGIN statdx_custom_schema.xml ---
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="..." version="1.6">
  <!-- ... field and copyField declarations ... -->
</schema>
--- END statdx_custom_schema.xml ---

...am I correct?


I'm going to skip a lot of the nitty gritty and just summarize by saying
that ultimately there are 2 problems here that combine to lead to the
error you are getting:

1) what you are trying to do as far as the xinclude is not really what
xinclude is designed for and doesn't work the way you (or any other sane
person) would think it does.

2) for historical reasons, Solr is being sloppy in what <copyField/>
entries it recognizes.  If anything the "bug" is that Solr is
willing to try to load any parts of your include file at all -- if it were
behaving consistently it should be ignoring all of it.


Ok ... that seems terse, i'll clarify with a little of the nitty gritty...


The root of the issue is really something you alluded to earlier that
didn't make sense to me at the time because I didn't realize you were
showing us the *includED* file when you said it...

>>> I assumed (perhaps wrongly) that I could duplicate the <schema>
>>>   </schema> arrangement from the schema.xml file.

...that assumption is the crux of the problem, because when the XML parser
evaluates your xinclude, what it produces is functionally equivalent to if
you had a schema.xml file that looked like this

--- BEGIN EFFECTIVE schema.xml ---
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="..." version="1.5">
  <!-- ... field and type declarations ... -->
  <schema name="..." version="1.6">
    <!-- ... field and copyField declarations ... -->
  </schema>
</schema>
--- END EFFECTIVE schema.xml ---

...that extra <schema> element nested inside of the original <schema>
element is what's confusing the hell out of solr.  The <fieldType/> and
<field/> parsing is fairly strict, and only expects to find them as top
level elements (or, for historical purposes, as children of <types> and
<fields> -- note the plurals) while the <copyField/> parsing is sloppy and
finds the one that gives you an error.

(Even if the <fieldType/> and <field/> parsing was equally sloppy, only the
outermost <schema> tag would be recognized, so your default field props
would be based on the version="1.5" declaration, not the version="1.6"
declaration of the included file they'd be in ... which would be confusing
as hell, so it's a good thing Solr isn't sloppy about that parsing too)


In contrast to xincludes, XML Entity includes are (almost as a side effect
of the triviality of their design) vastly superior 90% of the time, and
capable of doing what you want.  The key diff being that Entity includes
do not require that the file being included is valid XML -- it can be an
arbitrary snippet of xml content (w/o a top level element) that will be
inlined verbatim.  so you can/should do something like this...

--- BEGIN schema.xml ---
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE schema [
  <!ENTITY statdx_custom_include SYSTEM "statdx_custom_schema.incl">
]>
<schema name="..." version="1.5">
  <!-- ... field and type declarations ... -->
  &statdx_custom_include;
</schema>
--- END schema.xml ---

-- BEGIN statdx_custom_schema.incl ---
<!-- ... field and copyField declarations, with NO top level wrapper element ... -->
--- END statdx_custom_schema.incl ---


...make sense?


-Hoss
http://www.lucidworks.com/


Re: Problems using fieldType text_general in copyField

2016-08-04 Thread Chris Hostetter

TL;DR: use entity includes *WITHOUT TOP LEVEL WRAPPER ELEMENTS* like in 
this example...

https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/schema-snippet-types.incl
https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/schema-xinclude.xml


: The file I pasted last time is the file I was trying to include into the
: main schema.xml.  It was when that file was getting processed that I got
: the error  ['content' is not a glob and doesn't match any explicit field or
: dynamicField. ]

Ok -- so just to be crystal clear, you have two files, that look roughly 
like this...

--- BEGIN schema.xml ---
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="..." version="1.5">
  <!-- ... field and type declarations ... -->
  <xi:include href="statdx_custom_schema.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>
</schema>
--- END schema.xml ---

-- BEGIN statdx_custom_schema.xml ---
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="..." version="1.6">
  <!-- ... field and copyField declarations ... -->
</schema>
--- END statdx_custom_schema.xml ---

...am I correct?


I'm going to skip a lot of the nitty gritty and just summarize by saying 
that ultimately there are 2 problems here that combine to lead to the 
error you are getting:

1) what you are trying to do as far as the xinclude is not really what 
xinclude is designed for and doesn't work the way you (or any other sane 
person) would think it does.

2) for historical reasons, Solr is being sloppy in what <copyField/> 
entries it recognizes.  If anything the "bug" is that Solr is 
willing to try to load any parts of your include file at all -- if it were 
behaving consistently it should be ignoring all of it.


Ok ... that seems terse, i'll clarify with a little of the nitty gritty...


The root of the issue is really something you alluded to earlier that 
didn't make sense to me at the time because I didn't realize you were 
showing us the *includED* file when you said it...

>>> I assumed (perhaps wrongly) that I could duplicate the <schema>
>>>   </schema> arrangement from the schema.xml file.

...that assumption is the crux of the problem, because when the XML parser 
evaluates your xinclude, what it produces is functionally equivalent to if 
you had a schema.xml file that looked like this

--- BEGIN EFFECTIVE schema.xml ---
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="..." version="1.5">
  <!-- ... field and type declarations ... -->
  <schema name="..." version="1.6">
    <!-- ... field and copyField declarations ... -->
  </schema>
</schema>
--- END EFFECTIVE schema.xml ---

...that extra <schema> element nested inside of the original <schema> 
element is what's confusing the hell out of solr.  The <fieldType/> and 
<field/> parsing is fairly strict, and only expects to find them as top 
level elements (or, for historical purposes, as children of <types> and 
<fields> -- note the plurals) while the <copyField/> parsing is sloppy and 
finds the one that gives you an error.

(Even if the <fieldType/> and <field/> parsing was equally sloppy, only the 
outermost <schema> tag would be recognized, so your default field props 
would be based on the version="1.5" declaration, not the version="1.6" 
declaration of the included file they'd be in ... which would be confusing 
as hell, so it's a good thing Solr isn't sloppy about that parsing too)


In contrast to xincludes, XML Entity includes are (almost as a side effect 
of the triviality of their design) vastly superior 90% of the time, and 
capable of doing what you want.  The key diff being that Entity includes 
do not require that the file being included is valid XML -- it can be an 
arbitrary snippet of xml content (w/o a top level element) that will be 
inlined verbatim.  so you can/should do something like this...

--- BEGIN schema.xml ---
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE schema [
  <!ENTITY statdx_custom_include SYSTEM "statdx_custom_schema.incl">
]>
<schema name="..." version="1.5">
  <!-- ... field and type declarations ... -->
  &statdx_custom_include;
</schema>
--- END schema.xml ---

-- BEGIN statdx_custom_schema.incl ---
<!-- ... field and copyField declarations, with NO top level wrapper element ... -->
--- END statdx_custom_schema.incl ---


...make sense?


-Hoss
http://www.lucidworks.com/


Re: Problems using fieldType text_general in copyField

2016-08-04 Thread John Bickerstaff
Thanks and sorry for the misunderstanding.

The file I pasted last time is the file I was trying to include into the
main schema.xml.  It was when that file was getting processed that I got
the error  ['content' is not a glob and doesn't match any explicit field or
dynamicField. ]

I should note that I tried this with and without the <schema> tags and got
the same error both times.

I used this line inside the main schema.xml to include the one I pasted in
the last email.

<xi:include href="statdx_custom_schema.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>

What follows is my main schema.xml, which was an exact copy of the
schema.xml that ships with Solr in the techproducts sample.

At the very bottom of the text / file, you'll see the line I used to
include the "statdx_custom_schema.xml" file

==

[The pasted schema.xml -- an exact copy of the techproducts sample schema,
including its field, dynamicField, fieldType, and uniqueKey "id" declarations,
with the xinclude line for "statdx_custom_schema.xml" at the very bottom --
was stripped of its markup by the mailing list archive.]


Re: Problems using fieldType text_general in copyField

2016-08-04 Thread Chris Hostetter

: The schema is a copy of the techproducts sample.
: 
: Entire include here - and I take your point about the possibility of
: malformation - thanks.
: 
: I assumed (perhaps wrongly) that I could duplicate the <schema>
:   </schema> arrangement from the schema.xml file.

I really can't make heads or tails of what you're saying here -- what i 
asked you to provide was the full details on your schema.xml file and the 
file you are xincluding into it -- what you provided looks like a normal 
schema.xml, w/o any xinclude tags.  I also don't see any mention of any 
other files you are attempting to include in your schema.xml, so i can see 
what its structure looks like.

For comparison, here is an example of a schema.xml file that uses 
xinclude elements...
https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/schema-xinclude.xml

Here is a specific example of one xinclude element from that file...

<xi:include href="schema-snippet-type.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>
here is the file that is included by that xinclude element...
https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/schema-snippet-type.xml


...if you can provide the corresponding specifics for your schema -- 
showing the xinclude elements, and the files referenced by them -- we can 
try to help make sense of why it's not working for you.



: 
: I'm unfamiliar with xml entity includes, but I'll go look them up...
: 
: <?xml version="1.0" encoding="UTF-8" ?>
: <schema name="..." version="1.6">
: 
:    <!-- ... custom field declarations, stripped by the mail archive ... -->
: 
:    <field name="content" type="text_general" indexed="true" stored="true"
: multiValued="true"/>
: *//HERE IS WHERE "CONTENT" IS DEFINED*
: 
:    <!-- ... more field and copyField declarations ... -->
: 
:    <copyField source="content" dest="..."/>
:  /*/THROWING ERROR ABOUT
: "CONTENT" NOT EXISTING HERE*
: 
:    <fieldType name="text_general" class="solr.TextField"
: positionIncrementGap="100">
:      <!-- ... index and query analyzers using stopwords.txt and
: contentType_synonyms.txt ... -->
:    </fieldType>
: 
: </schema>
: 
: 
: On Thu, Aug 4, 2016 at 3:55 PM, Chris Hostetter 
: wrote:
: 
: >
: > you mentioned that the problem only happens when you use xinclude, but you
: > haven't shown us the details of your xinclude -- what exactly does your
: > schema.xml look like (with the xinclude call) and what exactly does the
: > file being included look like (entire contents)
: >
: > (I suspect the problem you are seeing is related to the way xinclude
: > doesn't really support "snippets" of malformed xml, and instead requires
: > some root tag -- i can't imagine what root tag you are using in the
: > included file that would play nicely with mixing/matching field
: > declarations. ... using xml entity includes may be a simpler/safer option)
: >
: >
: >
: > : Date: Thu, 4 Aug 2016 15:47:00 -0600
: > : From: John Bickerstaff 
: > : Reply-To: solr-user@lucene.apache.org
: > : To: solr-user@lucene.apache.org
: > : Subject: Re: Problems using fieldType text_general in copyField
: > :
: > : I would call this a bug...
: > :
: > : I'm going out on a limb and say that if you define a field in the
: > included
: > : XML file, you will get this error.
: > :
: > : As long as the field is defined first in schema.xml, you can "copyField"
: > it
: > : or whatever in the include file, but apparently fields MUST be created in
: > : the schema.xml file.
: > :
: > : That makes use of the include for custom things somewhat moot - at least
: > in
: > : my situation.
: > :
: > : I'd love to be wrong by the way, but that's what my tests suggest right
: > : now...
: > :
: > : On Thu, Aug 4, 2016 at 1:37 PM, John Bickerstaff <
: > j...@johnbickerstaff.com>
: > : wrote:
: > :
: > : > Summary:
: > : >
: > : > Using xinclude to include an xml file into schema.xml
: > : >
: > : > The following line
: > : >
: > : > <copyField source="content" dest="..."/>
: > : >
: > : > generates an error:  about a field being "not a glob and not matching
: > an
: > : > explicit field" even though I declare the field in the line just above.
: > : >
: > : > This seems to happen only for fieldType text_general?
: > : >
: > : > 
: > : >
: > : > Explanation:
: > : >
: > : > I need a little help - keep getting an error when trying to use the
: > : > ability to include an additional XML file.  I may be overlooking
: > something,
: > : > but if so, I need help to see it.
: > : >
: > : > I have the following two lines which throw zero errors when part of
: > : > schema.xml:
: > : >
: > : > <field name="content" type="text_general" indexed="true" stored="true"
: > : > multiValued="true"/>
: > : > <copyField source="content" dest="..."/>
: > : >
: > : > However, when I put this into an include file and use xinclude, then I
: > get
: > : > this error when starting Solr.
: > : >
: > : >
: > : >
: > : >- *statdx_shard1_replica3:* org.apache.solr.common.
: > : >SolrException:org.apache.solr.common.SolrException: Could not load
: > : >conf for core statdx_shard1_replica3: Can't load schema schema.xml:
: > : >copyField source :'content' is not a glob and doesn't match any
: > explicit
: > : >field or dynamicField.
: > : >
: > : >
: > : > Given that I am defining the field in the line right above the
: > copyField
: > : > statement, I

Re: Problems using fieldType text_general in copyField

2016-08-04 Thread John Bickerstaff
I get the same error with the Entity Includes - with or without the
<schema> tag...

I'm probably just going to make a section in schema.xml rather than worry
about this.

Includes are "nice to have" but not critical.

On Thu, Aug 4, 2016 at 4:25 PM, John Bickerstaff 
wrote:

> Found the Entity Includes - thanks.
>
> On Thu, Aug 4, 2016 at 4:22 PM, John Bickerstaff  wrote:
>
>> Thanks!
>>
>> The schema is a copy of the techproducts sample.
>>
>> Entire include here - and I take your point about the possibility of
>> malformation - thanks.
>>
>> I assumed (perhaps wrongly) that I could duplicate the <schema>
>>   </schema> arrangement from the schema.xml file.
>>
>> I'm unfamiliar with xml entity includes, but I'll go look them up...
>>
>> <?xml version="1.0" encoding="UTF-8" ?>
>> <schema name="..." version="1.6">
>>
>>    <!-- ... custom field declarations, stripped by the mail archive ... -->
>>
>>    <field name="content" type="text_general" indexed="true" stored="true"
>> multiValued="true"/> *//HERE IS WHERE "CONTENT" IS DEFINED*
>>
>>    <!-- ... more field and copyField declarations ... -->
>>
>>    <copyField source="content" dest="..."/>  /*/THROWING ERROR ABOUT
>> "CONTENT" NOT EXISTING HERE*
>>
>>    <fieldType name="text_general" class="solr.TextField"
>> positionIncrementGap="100">
>>      <!-- ... index and query analyzers using stopwords.txt and
>> contentType_synonyms.txt ... -->
>>    </fieldType>
>>
>> </schema>
>>
>>
>>
>> On Thu, Aug 4, 2016 at 3:55 PM, Chris Hostetter > > wrote:
>>
>>>
>>> you mentioned that the problem only happens when you use xinclude, but
>>> you
>>> haven't shown us the details of your xinclude -- what exactly does your
>>> schema.xml look like (with the xinclude call) and what exactly does the
>>> file being included look like (entire contents)
>>>
>>> (I suspect the problem you are seeing is related to the way xinclude
>>> doesn't really support "snippets" of malformed xml, and instead requires
>>> some root tag -- i can't imagine what root tag you are using in the
>>> included file that would play nicely with mixing/matching field
>>> declarations. ... using xml entity includes may be a simpler/safer
>>> option)
>>>
>>>
>>>
>>> : Date: Thu, 4 Aug 2016 15:47:00 -0600
>>> : From: John Bickerstaff 
>>> : Reply-To: solr-user@lucene.apache.org
>>> : To: solr-user@lucene.apache.org
>>> : Subject: Re: Problems using fieldType text_general in copyField
>>> :
>>> : I would call this a bug...
>>> :
>>> : I'm going out on a limb and say that if you define a field in the
>>> included
>>> : XML file, you will get this error.
>>> :
>>> : As long as the field is defined first in schema.xml, you can
>>> "copyField" it
>>> : or whatever in the include file, but apparently fields MUST be created
>>> in
>>> : the schema.xml file.
>>> :
>>> : That makes use of the include for custom things somewhat moot - at
>>> least in
>>> : my situation.
>>> :
>>> : I'd love to be wrong by the way, but that's what my tests suggest right
>>> : now...
>>> :
>>> : On Thu, Aug 4, 2016 at 1:37 PM, John Bickerstaff <
>>> j...@johnbickerstaff.com>
>>> : wrote:
>>> :
>>> : > Summary:
>>> : >
>>> : > Using xinclude to include an xml file into schema.xml
>>> : >
>>> : > The following line
>>> : >
>>> : > <copyField source="content" dest="..."/>
>>> : >
>>> : > generates an error:  about a field being "not a glob and not
>>> matching an
>>> : > explicit field" even though I declare the field in the line just
>>> above.
>>> : >
>>> : > This seems to happen only for fieldType text_general?
>>> : >
>>> : > 
>>> : >
>>> : > Explanation:
>>> : >
>>> : > I need a little help - keep getting an error when trying to use the
>>> : > ability to include an additional XML file.  I may be overlooking
>>> something,
>>> : > but if so, I need help to see it.
>>> : >
>>> : > I have the following two lines which throw zero errors when part of
>>> : > schema.xml:
>>> : >
>>> : > <field name="content" type="text_general" indexed="true" stored="true"
>>> : > multiValued="true"/>
>>> : > <copyField source="content" dest="..."/>
>>> : >
>>> : > However, when I put this into an include file and use xinclude, then
>>> I get
>>> : > this error when starting Solr.
>>> : >
>>> : >
>>> : >
>>> : >- *statdx_shard1_replica3:* org.apache.solr.common.
>>> : >SolrException:org.apache.solr.common.SolrException: Could not
>>> load
>>> : >conf for core statdx_shard1_replica3: Can't load schema
>>> schema.xml:
>>> : >copyField source :'content' is not a glob and doesn't match any
>>> explicit
>>> : >field or dynamicField.
>>> : >
>>> : >
>>> : > Given that I am defining the field in the line right above the
>>> copyField
>>> : > statement, I'm confused about why this works fine in schema.xml but
>>> NOT in
>>> : > an included file.
>>> : >
>>> 

Re: Out of sync deletions causing differing IDF

2016-08-04 Thread Upayavira
Thx for these, we'll give them both a try and see what difference they
make.

Upayavira

On Thu, 4 Aug 2016, at 12:27 PM, Erick Erickson wrote:
> Upayavira:
> 
> bq: I would have expected that, because the data is being indexed
> concurrently across replicas, that the pattern of delete/merge would be
> similar across replicas.
> 
> Except for the pesky timing issue. The timers start for autocommit when a
> request is received. So the time the autocommit timer expires won't be
> the same wall-clock time on all servers and thus may not have the same
> docs
> in the same segments. It would be _really nice_ if they did, because then
> we wouldn't have to fall back to full replication so often for recovery.
> 
> I think there's a JIRA out there for trying to coordinate all the commits
> across
> replicas in a shard, but I can't find it on a quick look.
> 
> Would distributed IDF help here?
> https://issues.apache.org/jira/browse/SOLR-1632 (even though this is
> really old, it's in 5.0+)
> 
> Best,
> Erick
> 
> On Thu, Aug 4, 2016 at 5:12 AM, Markus Jelsma
>  wrote:
> > Hello - your similarity should rely on numDocs instead; it solves the 
> > problem. I believe it is already fixed in trunk, but i am not sure.
> > Markus
> >
> > -Original message-
> >> From:Upayavira 
> >> Sent: Thursday 4th August 2016 13:59
> >> To: solr-user@lucene.apache.org
> >> Subject: Out of sync deletions causing differing IDF
> >>
> >> We have a system that has a reasonable number of changes going on on a
> >> daily basis (maybe 60m docs, and around 1m updates per day). Using Solr
> >> Cloud, the data is split into 10 shards and those shards are replicated.
> >>
> >> What we are finding is that the number of deletions is causing differing
> >> maxDocs across the different replicas, and that is causing significantly
> >> different IDF values between replicas of the same shard, giving
> >> different scores and thus different orders depending upon which replica
> >> we hit.
> >>
> >> I would have expected that, because the data is being indexed
> >> concurrently across replicas, that the pattern of delete/merge would be
> >> similar across replicas, but that doesn't seem to be the case in
> >> practice.
> >>
> >> We could, of course, optimise the index to merge down to a single
> >> segment. This would clear all deletes out, but would leave us in a worse
> >> place for the future, as now most of our deletes would be concentrated
> >> into a single large segment.
> >>
> >> Has anyone seen this sort of thing before, and does anyone have
> >> suggested strategies as to how to encourage IDF values into a similar
> >> range across replicas?
> >>
> >> Upayavira
> >>


Re: Problems using fieldType text_general in copyField

2016-08-04 Thread John Bickerstaff
Found the Entity Includes - thanks.

On Thu, Aug 4, 2016 at 4:22 PM, John Bickerstaff 
wrote:

> Thanks!
>
> The schema is a copy of the techproducts sample.
>
> Entire include here - and I take your point about the possibility of
> malformation - thanks.
>
> I assumed (perhaps wrongly) that I could duplicate the <schema>
>   </schema> arrangement from the schema.xml file.
>
> I'm unfamiliar with xml entity includes, but I'll go look them up...
>
> <?xml version="1.0" encoding="UTF-8" ?>
> <schema name="..." version="1.6">
>
>    <!-- ... custom field declarations, stripped by the mail archive ... -->
>
>    <field name="content" type="text_general" indexed="true" stored="true"
> multiValued="true"/> *//HERE IS WHERE "CONTENT" IS DEFINED*
>
>    <!-- ... more field and copyField declarations ... -->
>
>    <copyField source="content" dest="..."/>  /*/THROWING ERROR ABOUT
> "CONTENT" NOT EXISTING HERE*
>
>    <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100">
>      <!-- ... index and query analyzers using stopwords.txt and
> contentType_synonyms.txt ... -->
>    </fieldType>
>
> </schema>
>
>
>
> On Thu, Aug 4, 2016 at 3:55 PM, Chris Hostetter 
> wrote:
>
>>
>> you mentioned that the problem only happens when you use xinclude, but you
>> haven't shown us the details of your xinclude -- what exactly does your
>> schema.xml look like (with the xinclude call) and what exactly does the
>> file being included look like (entire contents)
>>
>> (I suspect the problem you are seeing is related to the way xinclude
>> doesn't really support "snippets" of malformed xml, and instead requires
>> some root tag -- i can't imagine what root tag you are using in the
>> included file that would play nicely with mixing/matching field
>> declarations. ... using xml entity includes may be a simpler/safer option)
>>
>>
>>
>> : Date: Thu, 4 Aug 2016 15:47:00 -0600
>> : From: John Bickerstaff 
>> : Reply-To: solr-user@lucene.apache.org
>> : To: solr-user@lucene.apache.org
>> : Subject: Re: Problems using fieldType text_general in copyField
>> :
>> : I would call this a bug...
>> :
>> : I'm going out on a limb and say that if you define a field in the
>> included
>> : XML file, you will get this error.
>> :
>> : As long as the field is defined first in schema.xml, you can
>> "copyField" it
>> : or whatever in the include file, but apparently fields MUST be created
>> in
>> : the schema.xml file.
>> :
>> : That makes use of the include for custom things somewhat moot - at
>> least in
>> : my situation.
>> :
>> : I'd love to be wrong by the way, but that's what my tests suggest right
>> : now...
>> :
>> : On Thu, Aug 4, 2016 at 1:37 PM, John Bickerstaff <
>> j...@johnbickerstaff.com>
>> : wrote:
>> :
>> : > Summary:
>> : >
>> : > Using xinclude to include an xml file into schema.xml
>> : >
>> : > The following line
>> : >
>> : > <copyField source="content" dest="..."/>
>> : >
>> : > generates an error:  about a field being "not a glob and not matching
>> an
>> : > explicit field" even though I declare the field in the line just
>> above.
>> : >
>> : > This seems to happen only for fieldType text_general?
>> : >
>> : > 
>> : >
>> : > Explanation:
>> : >
>> : > I need a little help - keep getting an error when trying to use the
>> : > ability to include an additional XML file.  I may be overlooking
>> something,
>> : > but if so, I need help to see it.
>> : >
>> : > I have the following two lines which throw zero errors when part of
>> : > schema.xml:
>> : >
>> : > <field name="content" type="text_general" indexed="true" stored="true"
>> : > multiValued="true"/>
>> : > <copyField source="content" dest="..."/>
>> : >
>> : > However, when I put this into an include file and use xinclude, then
>> I get
>> : > this error when starting Solr.
>> : >
>> : >
>> : >
>> : >- *statdx_shard1_replica3:* org.apache.solr.common.
>> : >SolrException:org.apache.solr.common.SolrException: Could not load
>> : >conf for core statdx_shard1_replica3: Can't load schema schema.xml:
>> : >copyField source :'content' is not a glob and doesn't match any
>> explicit
>> : >field or dynamicField.
>> : >
>> : >
>> : > Given that I am defining the field in the line right above the
>> copyField
>> : > statement, I'm confused about why this works fine in schema.xml but
>> NOT in
>> : > an included file.
>> : >
>> : > I experimented and found that any field of type "text_general" will
>> throw
>> : > this same error if it is part of the included xml file.  Other
>> fieldTypes
>> : > that I tried (string, int, double) did not have this issue.
>> : >
>> : > I'm using Solr 5.4, although I'm pulling custom config into an
>> included
>> : > file for purposes of moving to 6.1
>> : >
>> : > I have the following list of copyField commands in the included xml
>> file,
>> : > and get no errors on an

Re: Problems using fieldType text_general in copyField

2016-08-04 Thread John Bickerstaff
Thanks!

The schema is a copy of the techproducts sample.

Entire include here - and I take your point about the possibility of
malformation - thanks.

I assumed (perhaps wrongly) that I could duplicate the <schema>
  </schema> arrangement from the schema.xml file.

I'm unfamiliar with xml entity includes, but I'll go look them up...

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="..." version="1.6">

   <!-- ... custom field declarations, stripped by the mail archive ... -->

   <field name="content" type="text_general" indexed="true" stored="true"
multiValued="true"/> *//HERE IS WHERE "CONTENT" IS DEFINED*

   <!-- ... more field and copyField declarations ... -->

   <copyField source="content" dest="..."/>  /*/THROWING ERROR ABOUT
"CONTENT" NOT EXISTING HERE*

   <fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100">
     <!-- ... index and query analyzers using stopwords.txt and
contentType_synonyms.txt ... -->
   </fieldType>

</schema>

On Thu, Aug 4, 2016 at 3:55 PM, Chris Hostetter 
wrote:

>
> you mentioned that the problem only happens when you use xinclude, but you
> haven't shown us the details of your xinclude -- what exactly does your
> schema.xml look like (with the xinclude call) and what exactly does the
> file being included look like (entire contents)
>
> (I suspect the problem you are seeing is related to the way xinclude
> doesn't really support "snippets" of malformed xml, and instead requires
> some root tag -- i can't imagine what root tag you are using in the
> included file that would play nicely with mixing/matching field
> declarations. ... using xml entity includes may be a simpler/safer option)
>
>
>
> : Date: Thu, 4 Aug 2016 15:47:00 -0600
> : From: John Bickerstaff 
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user@lucene.apache.org
> : Subject: Re: Problems using fieldType text_general in copyField
> :
> : I would call this a bug...
> :
> : I'm going out on a limb and say that if you define a field in the
> included
> : XML file, you will get this error.
> :
> : As long as the field is defined first in schema.xml, you can "copyField"
> it
> : or whatever in the include file, but apparently fields MUST be created in
> : the schema.xml file.
> :
> : That makes use of the include for custom things somewhat moot - at least
> in
> : my situation.
> :
> : I'd love to be wrong by the way, but that's what my tests suggest right
> : now...
> :
> : On Thu, Aug 4, 2016 at 1:37 PM, John Bickerstaff <
> j...@johnbickerstaff.com>
> : wrote:
> :
> : > Summary:
> : >
> : > Using xinclude to include an xml file into schema.xml
> : >
> : > The following line
> : >
> : > <copyField source="content" dest="..."/>
> : >
> : > generates an error:  about a field being "not a glob and not matching
> an
> : > explicit field" even though I declare the field in the line just above.
> : >
> : > This seems to happen only for fieldType text_general?
> : >
> : > 
> : >
> : > Explanation:
> : >
> : > I need a little help - keep getting an error when trying to use the
> : > ability to include an additional XML file.  I may be overlooking
> something,
> : > but if so, I need help to see it.
> : >
> : > I have the following two lines which throw zero errors when part of
> : > schema.xml:
> : >
> : > <field name="content" type="text_general" indexed="true" stored="true"
> : > multiValued="true"/>
> : > <copyField source="content" dest="..."/>
> : >
> : > However, when I put this into an include file and use xinclude, then I
> get
> : > this error when starting Solr.
> : >
> : >
> : >
> : >- *statdx_shard1_replica3:* org.apache.solr.common.
> : >SolrException:org.apache.solr.common.SolrException: Could not load
> : >conf for core statdx_shard1_replica3: Can't load schema schema.xml:
> : >copyField source :'content' is not a glob and doesn't match any
> explicit
> : >field or dynamicField.
> : >
> : >
> : > Given that I am defining the field in the line right above the
> copyField
> : > statement, I'm confused about why this works fine in schema.xml but
> NOT in
> : > an included file.
> : >
> : > I experimented and found that any field of type "text_general" will
> throw
> : > this same error if it is part of the included xml file.  Other
> fieldTypes
> : > that I tried (string, int, double) did not have this issue.
> : >
> : > I'm using Solr 5.4, although I'm pulling custom config into an included
> : > file for purposes of moving to 6.1
> : >
> : > I have the following list of copyField commands in the included xml
> file,
> : > and get no errors on any but the "content" one.  It just so happens
> that
> : > "content" is the only field of type "text_general" in there.
> : >
> : >
> : > Any hints greatly appreciated.
> : >
> : >   <!-- ... list of copyField commands (stripped by the mail archive) ... -->
> : >
> :
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Can a MergeStrategy filter returned docs?

2016-08-04 Thread Joel Bernstein
Collapse will have dups unless you use the _route_ parameter to co-locate
documents with the same group onto the same shard.

In your scenario, co-locating docs sounds like it won't work because you
may have different grouping criteria.

The doc counts would be inflated unless you sent all the documents from the
shards to be merged and then de-duped them, which is how streaming
operates. But streaming has the capability to do these types of operations
in parallel and the merge strategy does not.
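
For illustration, co-location is normally done with compositeId routing at
index time (hypothetical IDs -- the part before the "!" determines the
shard):

<doc><field name="id">ACME CORP!doc42</field>...</doc>
<doc><field name="id">ACME CORP!doc43</field>...</doc>

Both documents hash to the same shard, so a collapse on VENDOR_NAME never
sees the same vendor on two shards -- but, as noted, that pins you to one
grouping field.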






Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Aug 4, 2016 at 6:04 PM, tedsolr  wrote:

> Perhaps my assumptions about merge are wrong. When I run a search with the
> collapsing filter (q=*:*&fq={!collapse field=VENDOR_NAME}...) I get "dupes"
> if the same VENDOR_NAME is on shard1 and shard2. Here's the response:
>
> "response": {
> "numFound": 24158,
> "start": 0,
> "docs": [
>   {
> "VENDOR_NAME": "01DB  METRAVIB SAS",
> "[shard]":
> "http://localhost:8983/solr/ShardTest1_shard1_0_replica1/|
> http://localhost:8984/solr/ShardTest1_shard1_0_replica2/";
>   },
>   {
> "VENDOR_NAME": "01DB  METRAVIB SAS",
> "[shard]":
> "http://localhost:8983/solr/ShardTest1_shard1_1_replica1/|
> http://localhost:8984/solr/ShardTest1_shard1_1_replica2/";
>   },
>   {
> "VENDOR_NAME": "1 BIG SELF STORE LTD",
> "[shard]":
> "http://localhost:8983/solr/ShardTest1_shard1_0_replica1/|
> http://localhost:8984/solr/ShardTest1_shard1_0_replica2/";
>   }
> ]
>   }
>
> You can see the same vendor is returned from shard1_1 and shard1_0. So I'm
> expecting the same results from my plugin (once I get it to work). I
> thought
> the merge strategy could be used to filter out the "duplicate" vendor. So
> would that require rebuilding the document list and then replacing the solr
> response like shardResponse.setSolrResponse()?
>
> And if that is the correct approach, I could return many more results than
> the user expected. If I'm thinking correctly, then worst case is no "dupes"
> between the shards and the returned result count is rows X shards. To make
> sure the correct results are returned based on the sort I'll also have to
> resort the merged results. So for a search like q=*:*&fl=vendor&sort=vendor
> asc...
>
> results example:
> shard 1 docs: { A, B, D }
> shard 2 docs: { B, C, D }
>
> So walking through the solr responses for each shard I end up with a return
> set of { A, B, C, D }
>
>
> Joel Bernstein wrote
> > Can you describe more about what you're trying to do in the merge? Why
> > does
> > it seem it's too late to drop documents in the merge?
> >
> > If you can provide a very simple example with some sample records and a
> > sample output, that would be helpful.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Aug 4, 2016 at 4:25 PM, tedsolr <
>
> > tsmith@
>
> > > wrote:
> >
> >> I've been struggling just to get my search plugin working for sharded
> >> collections, but I haven't ascertained if my end goal is even
> achievable.
> >> I
> >> have a plugin that groups documents that are considered duplicates
> (based
> >> on
> >> multiple fields - like the CollapsingQParserPlugin). When responses come
> >> back from different shards another culling will be necessary to remove
> >> dupes
> >> between the shards. In the merge() method it seems it will be too late
> to
> >> simply "drop" documents. Is this something that the client will just
> have
> >> to
> >> deal with? Maybe in the process() method of a search component? I was
> >> expecting to be able to preserve the requested return count, but that
> >> seems
> >> really unlikely now.
> >>
> >> Thanks for any suggestions,
> >> Ted v5.2.1
> >>
> >>
> >>
> >> --
> >> View this message in context: http://lucene.472066.n3.nabble.com/Can-a-
> >> MergeStrategy-filter-returned-docs-tp4290446.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Can-a-
> MergeStrategy-filter-returned-docs-tp4290446p4290458.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Can a MergeStrategy filter returned docs?

2016-08-04 Thread tedsolr
Perhaps my assumptions about merge are wrong. When I run a search with the
collapsing filter (q=*:*&fq={!collapse field=VENDOR_NAME}...) I get "dupes"
if the same VENDOR_NAME is on shard1 and shard2. Here's the response:

"response": {
"numFound": 24158,
"start": 0,
"docs": [
  {
"VENDOR_NAME": "01DB  METRAVIB SAS",
"[shard]":
"http://localhost:8983/solr/ShardTest1_shard1_0_replica1/|http://localhost:8984/solr/ShardTest1_shard1_0_replica2/";
  },
  {
"VENDOR_NAME": "01DB  METRAVIB SAS",
"[shard]":
"http://localhost:8983/solr/ShardTest1_shard1_1_replica1/|http://localhost:8984/solr/ShardTest1_shard1_1_replica2/";
  },
  {
"VENDOR_NAME": "1 BIG SELF STORE LTD",
"[shard]":
"http://localhost:8983/solr/ShardTest1_shard1_0_replica1/|http://localhost:8984/solr/ShardTest1_shard1_0_replica2/";
  }
]
  }

You can see the same vendor is returned from shard1_1 and shard1_0. So I'm
expecting the same results from my plugin (once I get it to work). I thought
the merge strategy could be used to filter out the "duplicate" vendor. So
would that require rebuilding the document list and then replacing the solr
response like shardResponse.setSolrResponse()?

And if that is the correct approach, I could return many more results than
the user expected. If I'm thinking correctly, then worst case is no "dupes"
between the shards and the returned result count is rows X shards. To make
sure the correct results are returned based on the sort I'll also have to
resort the merged results. So for a search like q=*:*&fl=vendor&sort=vendor
asc...

results example:
shard 1 docs: { A, B, D }
shard 2 docs: { B, C, D }

So walking through the solr responses for each shard I end up with a return
set of { A, B, C, D }


Joel Bernstein wrote
> Can you describe more about what you're trying to do in the merge? Why
> does
> it seem it's too late to drop documents in the merge?
> 
> If you can provide a very simple example with some sample records and a
> sample output, that would be helpful.
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Thu, Aug 4, 2016 at 4:25 PM, tedsolr <

> tsmith@

> > wrote:
> 
>> I've been struggling just to get my search plugin working for sharded
>> collections, but I haven't ascertained if my end goal is even achievable.
>> I
>> have a plugin that groups documents that are considered duplicates (based
>> on
>> multiple fields - like the CollapsingQParserPlugin). When responses come
>> back from different shards another culling will be necessary to remove
>> dupes
>> between the shards. In the merge() method it seems it will be too late to
>> simply "drop" documents. Is this something that the client will just have
>> to
>> deal with? Maybe in the process() method of a search component? I was
>> expecting to be able to preserve the requested return count, but that
>> seems
>> really unlikely now.
>>
>> Thanks for any suggestions,
>> Ted v5.2.1
>>
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/Can-a-
>> MergeStrategy-filter-returned-docs-tp4290446.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-a-MergeStrategy-filter-returned-docs-tp4290446p4290458.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problems using fieldType text_general in copyField

2016-08-04 Thread Chris Hostetter

you mentioned that the problem only happens when you use xinclude, but you 
haven't shown us the details of your xinclude -- what exactly does your 
schema.xml look like (with the xinclude call) and what exactly does the 
file being included look like (entire contents)

(I suspect the problem you are seeing is related to the way xinclude 
doesn't really support "snippets" of malformed xml, and instead requires 
some root tag -- i can't imagine what root tag you are using in the 
included file that would play nicely with mixing/matching field 
declarations. ... using xml entity includes may be a simpler/safer option)



: Date: Thu, 4 Aug 2016 15:47:00 -0600
: From: John Bickerstaff 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: Problems using fieldType text_general in copyField
: 
: I would call this a bug...
: 
: I'm going out on a limb and say that if you define a field in the included
: XML file, you will get this error.
: 
: As long as the field is defined first in schema.xml, you can "copyField" it
: or whatever in the include file, but apparently fields MUST be created in
: the schema.xml file.
: 
: That makes use of the include for custom things somewhat moot - at least in
: my situation.
: 
: I'd love to be wrong by the way, but that's what my tests suggest right
: now...
: 
: On Thu, Aug 4, 2016 at 1:37 PM, John Bickerstaff 
: wrote:
: 
: > Summary:
: >
: > Using xinclude to include an xml file into schema.xml
: >
: > The following line
: >
: > <copyField source="content" dest="..."/>
: >
: > generates an error:  about a field being "not a glob and not matching an
: > explicit field" even though I declare the field in the line just above.
: >
: > This seems to happen only for fieldType text_general?
: >
: > 
: >
: > Explanation:
: >
: > I need a little help - keep getting an error when trying to use the
: > ability to include an additional XML file.  I may be overlooking something,
: > but if so, I need help to see it.
: >
: > I have the following two lines which throw zero errors when part of
: > schema.xml:
: >
: > <field name="content" type="text_general" indexed="true" stored="true"
: > multiValued="true"/>
: > <copyField source="content" dest="..."/>
: >
: > However, when I put this into an include file and use xinclude, then I get
: > this error when starting Solr.
: >
: >
: >
: >- *statdx_shard1_replica3:* org.apache.solr.common.
: >SolrException:org.apache.solr.common.SolrException: Could not load
: >conf for core statdx_shard1_replica3: Can't load schema schema.xml:
: >copyField source :'content' is not a glob and doesn't match any explicit
: >field or dynamicField.
: >
: >
: > Given that I am defining the field in the line right above the copyField
: > statement, I'm confused about why this works fine in schema.xml but NOT in
: > an included file.
: >
: > I experimented and found that any field of type "text_general" will throw
: > this same error if it is part of the included xml file.  Other fieldTypes
: > that I tried (string, int, double) did not have this issue.
: >
: > I'm using Solr 5.4, although I'm pulling custom config into an included
: > file for purposes of moving to 6.1
: >
: > I have the following list of copyField commands in the included xml file,
: > and get no errors on any but the "content" one.  It just so happens that
: > "content" is the only field of type "text_general" in there.
: >
: >
: > Any hints greatly appreciated.
: >
: >   <!-- ... list of copyField commands (stripped by the mail archive) ... -->
: >
: 

-Hoss
http://www.lucidworks.com/


Re: Problems using fieldType text_general in copyField

2016-08-04 Thread John Bickerstaff
I would call this a bug...

I'm going out on a limb and say that if you define a field in the included
XML file, you will get this error.

As long as the field is defined first in schema.xml, you can "copyField" it
or whatever in the include file, but apparently fields MUST be created in
the schema.xml file.

That makes use of the include for custom things somewhat moot - at least in
my situation.

I'd love to be wrong by the way, but that's what my tests suggest right
now...

On Thu, Aug 4, 2016 at 1:37 PM, John Bickerstaff 
wrote:

> Summary:
>
> Using xinclude to include an xml file into schema.xml
>
> The following line
>
> <copyField source="content" dest="..."/>
>
> generates an error:  about a field being "not a glob and not matching an
> explicit field" even though I declare the field in the line just above.
>
> This seems to happen only for fieldType text_general?
>
> 
>
> Explanation:
>
> I need a little help - keep getting an error when trying to use the
> ability to include an additional XML file.  I may be overlooking something,
> but if so, I need help to see it.
>
> I have the following two lines which throw zero errors when part of
> schema.xml:
>
> <field name="content" type="text_general" indexed="true" stored="true"
> multiValued="true"/>
> <copyField source="content" dest="..."/>
>
> However, when I put this into an include file and use xinclude, then I get
> this error when starting Solr.
>
>
>
>- *statdx_shard1_replica3:* org.apache.solr.common.
>SolrException:org.apache.solr.common.SolrException: Could not load
>conf for core statdx_shard1_replica3: Can't load schema schema.xml:
>copyField source :'content' is not a glob and doesn't match any explicit
>field or dynamicField.
>
>
> Given that I am defining the field in the line right above the copyField
> statement, I'm confused about why this works fine in schema.xml but NOT in
> an included file.
>
> I experimented and found that any field of type "text_general" will throw
> this same error if it is part of the included xml file.  Other fieldTypes
> that I tried (string, int, double) did not have this issue.
>
> I'm using Solr 5.4, although I'm pulling custom config into an included
> file for purposes of moving to 6.1
>
> I have the following list of copyField commands in the included xml file,
> and get no errors on any but the "content" one.  It just so happens that
> "content" is the only field of type "text_general" in there.
>
>
> Any hints greatly appreciated.
>
>   <!-- ... list of copyField commands (stripped by the mail archive) ... -->
>


Re: Can a MergeStrategy filter returned docs?

2016-08-04 Thread Joel Bernstein
Can you describe more about what you're trying to do in the merge? Why does
it seem it's too late to drop documents in the merge?

If you can provide a very simple example with some sample records and a
sample output, that would be helpful.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Aug 4, 2016 at 4:25 PM, tedsolr  wrote:

> I've been struggling just to get my search plugin working for sharded
> collections, but I haven't ascertained if my end goal is even achievable. I
> have a plugin that groups documents that are considered duplicates (based
> on
> multiple fields - like the CollapsingQParserPlugin). When responses come
> back from different shards another culling will be necessary to remove
> dupes
> between the shards. In the merge() method it seems it will be too late to
> simply "drop" documents. Is this something that the client will just have
> to
> deal with? Maybe in the process() method of a search component? I was
> expecting to be able to preserve the requested return count, but that seems
> really unlikely now.
>
> Thanks for any suggestions,
> Ted v5.2.1
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Can-a-
> MergeStrategy-filter-returned-docs-tp4290446.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Can a MergeStrategy filter returned docs?

2016-08-04 Thread tedsolr
I've been struggling just to get my search plugin working for sharded
collections, but I haven't ascertained if my end goal is even achievable. I
have a plugin that groups documents that are considered duplicates (based on
multiple fields - like the CollapsingQParserPlugin). When responses come
back from different shards another culling will be necessary to remove dupes
between the shards. In the merge() method it seems it will be too late to
simply "drop" documents. Is this something that the client will just have to
deal with? Maybe in the process() method of a search component? I was
expecting to be able to preserve the requested return count, but that seems
really unlikely now.

Thanks for any suggestions,
Ted v5.2.1



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-a-MergeStrategy-filter-returned-docs-tp4290446.html
Sent from the Solr - User mailing list archive at Nabble.com.


Problems using fieldType text_general in copyField

2016-08-04 Thread John Bickerstaff
Summary:

Using xinclude to include an xml file into schema.xml

The following line

<copyField source="content" dest="..."/>

generates an error:  about a field being "not a glob and not matching an
explicit field" even though I declare the field in the line just above.

This seems to happen only for fieldType text_general?



Explanation:

I need a little help - keep getting an error when trying to use the ability
to include an additional XML file.  I may be overlooking something, but if
so, I need help to see it.

I have the following two lines which throw zero errors when part of
schema.xml:

<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
<copyField source="content" dest="..."/>

However, when I put this into an include file and use xinclude, then I get
this error when starting Solr.



   - *statdx_shard1_replica3:*
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
   Could not load conf for core statdx_shard1_replica3: Can't load schema
   schema.xml: copyField source :'content' is not a glob and doesn't match any
   explicit field or dynamicField.


Given that I am defining the field in the line right above the copyField
statement, I'm confused about why this works fine in schema.xml but NOT in
an included file.

I experimented and found that any field of type "text_general" will throw
this same error if it is part of the included xml file.  Other fieldTypes
that I tried (string, int, double) did not have this issue.

I'm using Solr 5.4, although I'm pulling custom config into an included
file for purposes of moving to 6.1

I have the following list of copyField commands in the included xml file,
and get no errors on any but the "content" one.  It just so happens that
"content" is the only field of type "text_general" in there.


Any hints greatly appreciated.

[copyField list stripped by the mail archive]


Re: Difference in boolean query parsing. Solr-5.4.0 VS Solr.6.1.0

2016-08-04 Thread Steve Rowe
It’s fairly likely these differences are a result of SOLR-2649[1] (released
with 5.5) and SOLR-8812[2] (released with 6.1).

If you haven’t seen it, I recommend you read Hoss's blog “Why Not AND, OR, And
NOT?”.

If you can, add parentheses to explicitly specify precedence.
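
For example, if the intent in the third query below is for "hardware AND
device" to bind as one clause, it can be written with explicit grouping:

  fl:(network OR (hardware AND device) OR system)

which removes any dependence on the parser's precedence rules.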

[1] https://issues.apache.org/jira/browse/SOLR-2649
[2] https://issues.apache.org/jira/browse/SOLR-8812

--
Steve
www.lucidworks.com

> On Aug 4, 2016, at 2:23 AM, Modassar Ather  wrote:
> 
> Hi,
> 
> During migration from Solr-5.4.1 to Solr-6.1.0 I saw a difference in the
> behavior of few of my boolean queries.
> As per my current understanding, the default operator comes in when there is
> no operator present between two terms.
> Also, both ANDed terms are marked mandatory unless one of them is
> introduced with NOT. Same is the case with OR.
> Please correct me if my understanding is wrong.
> 
> The below queries are parsed differently and causes a lot of difference in
> search result.
> The default operator used is AND and no mm is set.
> 
> 
> *Query  : *fl:(network hardware AND device OR system)
> *Solr.6.1.0 :* "+(+fl:network +fl:hardware fl:device fl:system)"
> *Solr-5.4.0 : *"+(fl:network +fl:hardware +fl:device fl:system)"
> 
> *Query  : *fl:(network OR hardware device system)
> *Solr.6.1.0 : *"+(fl:network fl:hardware +fl:device +fl:system)"
> *Solr-5.4.0 : *"+(fl:network fl:hardware fl:device fl:system)"
> 
> *Query  : *fl:(network OR hardware AND device OR system)
> *Solr.6.1.0 : *"+(fl:network +fl:hardware fl:device fl:system)"
> *Solr-5.4.0 : *"+(fl:network +fl:hardware +fl:device fl:system)"
> 
> *Query  : *fl:(network AND hardware AND device OR system)"
> *Solr.6.1.0 : *"+(+fl:network +fl:hardware fl:device fl:system)"
> *Solr-5.4.0 : *"+(+fl:network +fl:hardware +fl:device fl:system)"
> 
> Please help me understand the difference in parsing and its effect on
> search.
> 
> Thanks,
> Modassar



unique( )- How to override default of 100

2016-08-04 Thread Lewin Joy (TMS)
Hi,

I was looking at Solr’s countdistinct feature with unique and hll functions.
I am interested in getting accurate cardinality in cloud setup.

As per the link, the unique() function provides exact counts if the number of
values per node does not exceed 100 (the default).
How do I override this default to a much higher value?
Is it possible?
Refer: http://yonik.com/solr-count-distinct/
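
For reference, the two variants from that post side by side in a JSON Facet
request (the field name is illustrative): unique() is exact up to roughly 100
unique values per node, while hll() gives a HyperLogLog estimate at any scale:

  json.facet={
    distinct_exact    : "unique(author_s)",
    distinct_estimate : "hll(author_s)"
  }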


Thanks,
Lewin



Re: Out of sync deletions causing differing IDF

2016-08-04 Thread Erick Erickson
Upayavira:

bq: I would have expected that, because the data is being indexed
concurrently across replicas, that the pattern of delete/merge would be
similar across replicas.

Except for the pesky timing issue. The timers start for autocommit when a
request is received. So the time the autocommit timer expires won't be
the same wall-clock time on all servers, and thus replicas may not have the
same docs in the same segments. It would be _really nice_ if they did, because then
we wouldn't have to fall back to full replication so often for recovery.

I think there's a JIRA out there for trying to coordinate all the commits across
replicas in a shard, but I can't find it on a quick look.

Would distributed IDF help here?
https://issues.apache.org/jira/browse/SOLR-1632 (even though this is
really old, it's in 5.0+)
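
If it does, enabling one of the global-stats implementations is a one-line
change in solrconfig.xml (a sketch; ExactStatsCache trades an extra
distributed round trip for term statistics that are consistent across
replicas):

  <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>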

Best,
Erick

On Thu, Aug 4, 2016 at 5:12 AM, Markus Jelsma
 wrote:
> Hello - your similarity should rely on numDocs instead of maxDoc; that solves
> the problem. I believe it is already fixed in trunk, but I am not sure.
> Markus
>
> -Original message-
>> From:Upayavira 
>> Sent: Thursday 4th August 2016 13:59
>> To: solr-user@lucene.apache.org
>> Subject: Out of sync deletions causing differing IDF
>>
>> We have a system that has a reasonable number of changes going on on a
>> daily basis (maybe 60m docs, and around 1m updates per day). Using Solr
>> Cloud, the data is split into 10 shards and those shards are replicated.
>>
>> What we are finding is that the number of deletions is causing differing
>> maxDocs across the different replicas, and that is causing significantly
>> different IDF values between replicas of the same shard, giving
>> different scores and thus different orders depending upon which replica
>> we hit.
>>
>> I would have expected that, because the data is being indexed
>> concurrently across replicas, that the pattern of delete/merge would be
>> similar across replicas, but that doesn't seem to be the case in
>> practice.
>>
>> We could, of course, optimise the index to merge down to a single
>> segment. This would clear all deletes out, but would leave us in a worse
>> place for the future, as now most of our deletes would be concentrated
>> into a single large segment.
>>
>> Has anyone seen this sort of thing before, and does anyone have
>> suggested strategies as to how to encourage IDF values into a similar
>> range across replicas?
>>
>> Upayavira
>>


Re: Replication with managed resources?

2016-08-04 Thread rosbaldeston
Raised as https://issues.apache.org/jira/browse/SOLR-9382



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replication-with-managed-resources-tp4289880p4290386.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: QParsePlugin not working on sharded collection

2016-08-04 Thread tedsolr
So my implementation with a DocTransformer is causing an exception (with a
sharded collection):

ERROR - 2016-08-04 09:41:44.247; [ShardTest1 shard1_0 core_node3
ShardTest1_shard1_0_replica1] org.apache.solr.common.SolrException;
null:org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at
http://localhost:8983/solr/ShardTest1_shard1_0_replica1: parsing error
at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:538)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:235)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:227)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1220)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:218)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:183)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.common.SolrException: parsing error
at
org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:52)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:536)
... 12 more
Caused by: java.io.EOFException
at
org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:208)
at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:188)
at
org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:508)
at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:202)
at
org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:390)
at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:237)
at
org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:135)
at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:204)
at
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:126)
at
org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:50)
... 13 more

Here are the changes to TedQuery (I reduced the amount of data being
returned and map the docId to the document - like the [docid] transformer,
and put the map in the request context in the finish() method)

public void collect(int doc) throws IOException {
    count++;
    if (doc % 1 == 0) {
        mydata.put(Integer.valueOf(doc + super.docBase),
                   String.valueOf(doc + super.docBase));
        super.collect(doc);
    }
}

public void finish() throws IOException {
    ...
    rb.req.getContext().put("mystats", mydata);
    ...
}

Here's the transformer:

public class TedTransform extends TransformerFactory {
    @Override
    public DocTransformer create(String arg0, SolrParams arg1,
                                 SolrQueryRequest arg2) {
        return new TedTransformer(arg0, arg2);
    }

    private class TedTransformer extends TransformerWithContext {
        private final String f;
        private HashMap<Integer, String> data;

        public TedTransformer(String f, SolrQueryRequest r) {
            this.f = f;
        }

        @Override
        public String getName() {
            return null;
        }

        @Override
        public void transform(SolrDocument arg0, int arg1) throws IOException {
            if (context.req != null) {
                if (data == null) {
                    data = (HashMap<Integer, String>)
                            context.req.getContext().get("mystats");
                }
                arg0.setField(f, data.get(Integer.valueOf(arg1)));
            }
        }
    }
}

And I added the transformer to the solrconfig.xml:

<requestHandler name="..." class="solr.SearchHandler">
    <lst name="defaults">
        <str name="fq">{!TedFilter myvar=hello}</str>
        <str name="fl">[TedT]</str>
    </lst>
</requestHandler>

Why does this barf on multi-sharded collections?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/QParsePlugin-not-working-on-sharded-collection-tp4290249p4290390.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Why Nested document 'child' entity query (iterative count)repeatedly executing?

2016-08-04 Thread Mikhail Khludnev
It seems like debug reporting issue. It deserves a minor Jira.

On Thu, Jul 14, 2016 at 2:38 PM, Rajendran, Prabaharan wrote:

> Hi All,
>
> I am trying to index nested document. While start full data-import (from
> UI) following options selected Verbose, Commit, Debug & Debug-Mode.
>
> Raw Debug-Response shows that the "child" entity query executes repeatedly.
> Kindly help me to understand this - how can I resolve it? Have I missed
> anything?
>
> Sample data-config.xml
>
> <document>
>   <entity name="company" transformer="RegexTransformer" rootEntity="true"
>           query="SELECT * FROM COMPANY">
>     <field column="ID" name="c_id"/>
>     <field column="..." name="c_department"/>
>
>     <entity name="employee" child="true"
>             query="SELECT * FROM EMPLOYEE
>                    WHERE com_employee_id = '${company.ID}'">
>       <field column="NAME" name="e_name"/>
>       <field column="EXPERIENCE" name="e_experience"/>
>     </entity>
>   </entity>
> </document>
>
> 
> Raw Debug-Response:
>
> "entity:company",
> [
> "document#1",
> [
> "query",
> "SELECT * FROM COMPANY",
>
>"",
> "transformer:RegexTransformer",
> [
>   "entity:employee",
> [
> "query",
> "SELECT * FROM EMPLOYEE WHERE com_employee_id = '1924'",
> ]
> ]
> ],
> "document#2",
> [
> "transformer:RegexTransformer",
> [
>   "entity:employee",
>   [
> "query",
> "SELECT * FROM EMPLOYEE WHERE com_employee_id = '1924'",
> "query",
> "SELECT * FROM EMPLOYEE WHERE com_employee_id = '1924'",
> ]
>   ]
>  ]
> ]
>
> "document#3"
> "entity:employee",
> "query",
> "SELECT * FROM EMPLOYEE WHERE com_employee_id = '1924'",
> "query",
> "SELECT * FROM EMPLOYEE WHERE com_employee_id = '1924'",
> "query",
> "SELECT * FROM EMPLOYEE WHERE com_employee_id = '1924'"
>
>
> Thanks,
> Prabaharan
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Sort Facet Values by "Interestingness"?

2016-08-04 Thread Joel Bernstein
Ok let's explore how to use scoreNodes() with the facet() expression.

scoreNodes is a graph expression, so it expects certain fields to be on
each Tuple. The fields are:

1) node: The node field is the node id gathered by the gatherNodes()
function.
2) collection: This is the collection that the node belongs to
3) field: The field the node was gathered from.

The facet function does not include these fields automatically so we'll
need to adjust the tuples returned by the facet function using the select
function.

The pseudo code is:

scoreNodes(select(facet(...)))

In order to add detail to this let's take a simple case:

facet(collection1,
  q="*:*",
  buckets="author",
  bucketSorts="count(*) desc",
  bucketSizeLimit=100,
  count(*))

The tuples for this would look like this:

author : joel
count(*) : 5

author : jim
count(*) : 4


So three things need to be done to these tuples to work with scoreNodes:

1) the author field needs to be renamed "node", so it looks like the node
id of a gatherNodes function.
2) The "collection"  needs to be added.
3) The "field" needs to be added

So we can wrap a scoreNodes and a select function around the facet function
like this:

scoreNodes(
    select(facet(collection1,
                 q="*:*",
                 buckets="author",
                 bucketSorts="count(*) desc",
                 bucketSizeLimit=100,
                 count(*)),
           author as node,
           count(*),
           replace(collection, null, withValue=collection1),
           replace(field, null, withValue=author)))
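
After the select() wrapper, the tuples handed to scoreNodes() would look
roughly like this (values carried over from the example above):

node : joel
count(*) : 5
collection : collection1
field : author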

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 11:53 AM, Joel Bernstein  wrote:

> You first gather the candidates and then call the TermsComponent with a
> callback. The scoreNodes expression does this and it's very fast because
> Streaming expressions are run from a Solr node in the same cluster.
>
> The TermsComponent will return the global docFreq for the terms and global
> numDocs for the collection, so you'll be able to compute idf for each term.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Aug 3, 2016 at 11:22 AM, Ben Heuwing 
> wrote:
>
>> Hi Joel,
>>
>> thank you, this sounds great!
>>
>> As to your first proposal: I am a bit out of my depth here, as I have not
>> worked with streaming expressions so far. But I will try out your example
>> using the facet() expression on a simple use case as soon as you publish it.
>>
>> Using the TermsComponent directly, would that imply that I have to
>> retrieve all possible candidates and then send them back as a terms.list
>> to get their df? However, I assume that this would still be faster than
>> having 2 repeated facet-calls. Or did you suggest to use the component in a
>> customized RequestHandler?
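>>
>> (I imagine the request would look something like this - the field and
>> terms are made up:
>> /solr/collection1/terms?terms.fl=genre&terms.list=horror,comedy )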
>>
>> Regards,
>>
>> Ben
>>
>>
>> On 03.08.2016 at 14:57, Joel Bernstein wrote:
>>
>>> Also the TermsComponent now can export the docFreq for a list of terms and
>>> the numDocs for the index. This can be used as a general purpose mechanism
>>> for scoring facets with a callback.
>>>
>>> https://issues.apache.org/jira/browse/SOLR-9243
>>>
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>>
>>> On Wed, Aug 3, 2016 at 8:52 AM, Joel Bernstein
>>> wrote:
>>>
>>> What you're describing is implemented with Graph aggregations in this
 ticket using tf-idf. Other scoring methods can be implemented as well.

 https://issues.apache.org/jira/browse/SOLR-9193

 I'll update this thread with a description of how this can be used with
 the facet() streaming expression as well as with graph queries later
 today.



 Joel Bernstein
 http://joelsolr.blogspot.com/

 On Wed, Aug 3, 2016 at 8:18 AM,  wrote:

 Dear everybody,
>
> as the JSON-API now makes configuration of facets and sub-facets easier,
> there appears to be a lot of potential to enable instant calculation of
> facet-recommendations for a query, that is, to sort facets by their
> relative importance/interestingness/significance for a current query
> relative to the complete collection or relative to a result set defined by
> a different query.
>
> An example would be to show the most typical terms which are used in
> descriptions of horror-movies, in contrast to the most popular ones for
> this query, as these may include terms that occur as often in other
> genres.
>
> This feature has been discussed earlier in the context of solr:
> * http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
> * http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html
>
> In elasticsearch, the specific feature that I am looking for is called
> Significant Terms Aggregation:
> https://www.elast

Re: QParsePlugin not working on sharded collection

2016-08-04 Thread tedsolr
Thanks Erick, you answered my question by pointing out the aggregator. I
didn't realize a merge strategy was _required_ to return stats info when
there are multiple shards. I'm having trouble with my actual plugin so I've
scaled back to the simplest possible example. I'm adding to it little by
little to see what the last straw is.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/QParsePlugin-not-working-on-sharded-collection-tp4290249p4290365.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Out of sync deletions causing differing IDF

2016-08-04 Thread Markus Jelsma
Hello - your similarity should rely on numDocs instead of maxDoc; that solves
the problem. I believe it is already fixed in trunk, but I am not sure.
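
One way to read that advice, as a sketch against the Lucene 6.x API (the
class name is illustrative): override idfExplain() so idf is computed from
the per-field docCount rather than maxDoc.

import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.ClassicSimilarity;

public class DocCountSimilarity extends ClassicSimilarity {
    @Override
    public Explanation idfExplain(CollectionStatistics collectionStats,
                                  TermStatistics termStats) {
        long df = termStats.docFreq();
        // docCount() ignores deleted docs; fall back to maxDoc() when it is
        // reported as unavailable (-1)
        long docCount = collectionStats.docCount() == -1
                ? collectionStats.maxDoc() : collectionStats.docCount();
        float idf = idf(df, docCount);
        return Explanation.match(idf,
                "idf(docFreq=" + df + ", docCount=" + docCount + ")");
    }
}
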
Markus
 
-Original message-
> From:Upayavira 
> Sent: Thursday 4th August 2016 13:59
> To: solr-user@lucene.apache.org
> Subject: Out of sync deletions causing differing IDF
> 
> We have a system that has a reasonable number of changes going on on a
> daily basis (maybe 60m docs, and around 1m updates per day). Using Solr
> Cloud, the data is split into 10 shards and those shards are replicated.
> 
> What we are finding is that the number of deletions is causing differing
> maxDocs across the different replicas, and that is causing significantly
> different IDF values between replicas of the same shard, giving
> different scores and thus different orders depending upon which replica
> we hit.
> 
> I would have expected that, because the data is being indexed
> concurrently across replicas, that the pattern of delete/merge would be
> similar across replicas, but that doesn't seem to be the case in
> practice.
> 
> We could, of course, optimise the index to merge down to a single
> segment. This would clear all deletes out, but would leave us in a worse
> place for the future, as now most of our deletes would be concentrated
> into a single large segment.
> 
> Has anyone seen this sort of thing before, and does anyone have
> suggested strategies as to how to encourage IDF values into a similar
> range across replicas?
> 
> Upayavira
> 


Out of sync deletions causing differing IDF

2016-08-04 Thread Upayavira
We have a system that has a reasonable number of changes going on on a
daily basis (maybe 60m docs, and around 1m updates per day). Using Solr
Cloud, the data is split into 10 shards and those shards are replicated.

What we are finding is that the number of deletions is causing differing
maxDocs across the different replicas, and that is causing significantly
different IDF values between replicas of the same shard, giving
different scores and thus different orders depending upon which replica
we hit.

I would have expected that, because the data is being indexed
concurrently across replicas, that the pattern of delete/merge would be
similar across replicas, but that doesn't seem to be the case in
practice.

We could, of course, optimise the index to merge down to a single
segment. This would clear all deletes out, but would leave us in a worse
place for the future, as now most of our deletes would be concentrated
into a single large segment.

Has anyone seen this sort of thing before, and does anyone have
suggested strategies as to how to encourage IDF values into a similar
range across replicas?

Upayavira


ComplexPhraseQuery and range query

2016-08-04 Thread JM Rouand

Hi,

We are running solr 6.1.0
We are using ComplexPhraseQuery as the default parser.

It seems that range queries are not working with this parser.

Here is the query:
nbaff:({0 TO * })

Here is the field definition in the schema:
<fieldType name="..." class="solr.TrieIntField"
           positionIncrementGap="0" precisionStep="0"/>



Running query in debug mode gives this:

 "debug":{
"rawquerystring":"nbaff:({0 TO * })",
"querystring":"nbaff:({0 TO * })",
"parsedquery":"nbaff:{EXCEPTION(val=0) TO *}",
"parsedquery_toString":"nbaff:{0 TO *}",

Note that the query nbaff:0 is working properly.

Is there something special to do?

Regards
JMR


Re: problems with bulk indexing with concurrent DIH

2016-08-04 Thread Bernd Fehling
After updating to version 5.5.3 it looks good now.
I think LUCENE-6161 has fixed my problem.
Nevertheless, after updating my development system and recompiling
my plugins I will have a look at how DIH handles the "update" and also
at your advice about the uniqueKey.
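
For the record, the SolrJ route Erick suggests below would look roughly like
this (a minimal sketch; the URL, queue size, thread count, and the readDocs()
helper are all illustrative):

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Buffer up to 10000 docs; 8 background threads stream them to Solr.
        try (ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient(
                "http://localhost:8983/solr/core1", 10000, 8)) {
            for (SolrInputDocument doc : readDocs()) {
                client.add(doc);           // non-blocking: queued and sent in batches
            }
            client.blockUntilFinished();   // drain the queue
            client.commit();               // one commit at the end of the run
        }
    }

    // Stand-in for parsing the real XML input files.
    private static List<SolrInputDocument> readDocs() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "rec-1");
        doc.addField("title", "example record");
        return Arrays.asList(doc);
    }
}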

Best regards
Bernd

On 02.08.2016 at 16:16, Mikhail Khludnev wrote:
> These deletes seem really puzzling to me. Can you experiment with
> commenting out uniqueKey in schema.xml? My expectation is that the deletes
> should go away after that.
> 
> On Tue, Aug 2, 2016 at 4:50 PM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
> 
>> Hi Mikhail,
>>
>> there are no deletes at all from my point of view.
>> All records have unique id's.
>> No sharding at all, it is a single index and it is certified
>> that all DIH's get different data to load and no record is
>> sent twice to any DIH participating at concurrent loading.
>>
>> Only assumption so far, DIH is sending the records as "update"
>> (and not pure "add") to the indexer which will generate delete
>> files during merge. If the number of segments is high it will
>> take quite long to merge and check all records of all segments.
>>
>> I'm currently setting up SOLR 5.5.3 but that takes a while.
>> I also located an "overwrite" parameter somewhere in DIH which
>> will force an "add" and not an "update" to the index, but
>> couldn't figure out how to set the parameter with command.
>>
>> Bernd
>>
>>
>> On 02.08.2016 at 15:15, Mikhail Khludnev wrote:
>>> Bernd,
>>> But why do you have so many deletes? Is it expected?
>>> When you run DIHs concurrently, do you shard intput data by uniqueKey?
>>>
>>> On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling <
>>> bernd.fehl...@uni-bielefeld.de> wrote:
>>>
 If there is a problem in single index then it might also be in
>> CloudSolr.
 As far as I could figure out from INFOSTREAM, documents are added to
 segments
 and terms are "collected". Duplicate term are "deleted" (or whatever).
 These deletes (or whatever) are not concurrent.
 I have a lines like:
 BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes:
 infos=...
 BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes
>> took
 180028 msec
 ...
 BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
 infos=...
 BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes
>> took
 3411845 msec

 3411845 msec is about 57 minutes where the system is doing what???
 At least not indexing because only one JAVA process and no I/O at all!

 How can SolrJ help me now with this problem?

 Best
 Bernd


 On 27.07.2016 at 16:41, Erick Erickson wrote:
> Well, at least it'll be easier to debug in my experience. Simple
>> example.
> At some point you'll call CloudSolrClient.add(doc list). Comment just
 that
> out and you'll be able to isolate whether the issue is querying the be
>> or
> sending to Solr.
>
> Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
> routing...
>
> Best
> Erick
>
> On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de>
> wrote:
>
>> So writing some SolrJ doing the same job as the DIH script
>> and using that concurrent will solve my problem?
>> I'm not using Tika.
>>
>> I don't think that DIH is my problem, even if it is not the best
 solution
>> right now.
>> Nevertheless, you are right SolrJ has higher performance, but what
>> if I have the same problems with SolrJ like with DIH?
>>
>> If it runs with DIH it should run with SolrJ with additional
>> performance
>> boost.
>>
>> Bernd
>>
>>
>> On 27.07.2016 at 16:03, Erick Erickson:
>>> I'd actually recommend you move to a SolrJ solution
>>> or similar. Currently, you're putting a load on the Solr
>>> servers (especially if you're also using Tika) in addition
>>> to all indexing etc.
>>>
>>> Here's a sample:
>>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>>
>>> Dodging the question I know, but DIH sometimes isn't
>>> the best solution.
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
>>>  wrote:
 After enhancing the server with SSDs I'm trying to speed up
>> indexing.

 The server has 16 CPUs and more than 100G RAM.
 JAVA (1.8.0_92) has 24G.
 SOLR is 4.10.4.
 Plain XML data to load is 218G with about 96M records.
 This will result in a single index of 299G.

 I tried with 4, 8, 12 and 16 concurrent DIHs.
 16 and 12 were too much for 16 CPUs, so my test continued
>> with 8
>> concurrent DIHs.
 Then I was trying different  and 
>> settings
>> but now I'm stuck.
 I can't figure out what is the best setting for bul