Re: Cascading failures with replicas

2017-03-18 Thread Walter Underwood
6.3.0. No idea how it is happening, but I got two replicas on the same host 
after one host went down.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 18, 2017, at 8:35 PM, Erick Erickson  wrote:
> 
> Hmmm, I'm totally mystified about how Solr is "creating a new replica
> when one host is down". Are you saying this is happening
> automagically? You're right, the autoAddReplicas bit is HDFS-only, so having
> replicas just show up is completely weird. In days past,
> when a replica was discovered on disk when Solr started up, it would
> reconstruct itself _even if the collection had been deleted_. They
> would reappear in clusterstate.json, _not_ the individual state.json
> files under collections in ZK. Is this at all possible?
> 
> What version of Solr are you using anyway? The reconstruction I
> mentioned above is 4x IIRC.
> 
> Best,
> Erick
> 
> On Sat, Mar 18, 2017 at 5:45 PM, Walter Underwood  
> wrote:
>> Thanks. This is a very CPU-heavy workload, with ngram fields and very long 
>> queries. 16.7 million docs.
>> 
>> The whole cascading failure thing in search engines is hard. The first time 
>> I hit this was at Infoseek, over twenty years ago.
>> 
>>> On Mar 18, 2017, at 12:46 PM, Erick Erickson  
>>> wrote:
>>> 
>>> bug# 2, Solr shouldn't be adding replicas by itself unless you
>>> specified autoAddReplicas=true when you created the collection. It
>>> defaults to "false". So I'm not sure what's going on here.
>> 
>>"autoAddReplicas":"false",
>> 
>> in both collections. I thought that only worked with HDFS anyway.
>> 
>>> bug #3. The internal load balancers are round-robin, so this is
>>> expected. Not optimal I'll grant but expected.
>> 
>> Right. Still a bug. Should be round-robin on instances, not cores.
>> 
>>> bug #4. What shard placement rules are you using? There are a series
>>> of rules for replica placement and one of the criteria (IIRC) is
>>> exactly to try to distribute replicas to different hosts. Although
>>> there was some glitchiness whether two JVMs on the same _host_ were
>>> considered "the same host" or not.
>> 
>> Separate Amazon EC2 instances, one JVM per instance, no rules, other than 
>> the default.
>> 
>>"maxShardsPerNode":"1",
>> 
>>> bug #1 has been more or less of a pain for quite a while, work is ongoing 
>>> there.
>> 
>> Glad to share our logs.
>> 
>> wunder
>> 
>>> FWIW,
>>> Erick
>>> 
>>> On Fri, Mar 17, 2017 at 5:40 PM, Walter Underwood  
>>> wrote:
 I’m running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I 
 shut down Solr on one host because it got into some kind of bad, 
 can’t-recover state where it was causing timeouts across the whole cluster 
 (bug #1).
 
 I ran a load benchmark near the capacity of the cluster. This had run fine 
 in test, this was the prod cluster.
 
 Solr Cloud added a replica to replace the down node. The node with two 
 cores got double the traffic and started slowly flapping in and out of 
 service. The 95th percentile response spiked from 3 seconds to 100 
 seconds. At some point, another replica was made, with two replicas from 
 the same shard on the same instance. Naturally, that was overloaded, and I 
 killed the benchmark out of charity.
 
 Bug #2 is creating a new replica when one host is down. This should be an 
 option and default to “false”, because it causes the cascade.
 
 Bug #3 is sending equal traffic to each core without considering the host. 
 Each host should get equal traffic, not each core.
 
 Bug #4 is putting two replicas from the same shard on one instance. That 
 is just asking for trouble.
 
 When it works, this cluster is awesome.
 
 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)
 
 
>> 



Re: How on EARTH do I remove 's in schema file?

2017-03-18 Thread Erick Erickson
OK, you're defining a <fieldType>. It has one or two <analyzer> sections,
<analyzer type="index">blah blah blah</analyzer> and
<analyzer type="query">blah blah blah</analyzer>.


For the time being, these should pretty much be very, very similar if
not identical.

If you only have
<analyzer> in the fieldType, then the same analysis chain is used both
for indexing and querying.

The admin UI analysis page can either show you specific fields or
there's a "types" (or "fieldTypes", I forget which) later on in the
drop-down. So to see the analysis results, all you have to do is
define a fieldType and reload/restart.

Once you're satisfied with the fieldType, you assign it to a specific
field with the
<field> tag.
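
For instance, a bare-bones version of what's described above might look like
this (the names "text_possessive" and "name" are just placeholders):

<fieldType name="text_possessive" class="solr.TextField" positionIncrementGap="100">
  <analyzer>  <!-- no type attribute, so the same chain runs at index and query time -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/> <!-- strips trailing 's -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<field name="name" type="text_possessive" indexed="true" stored="true"/>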

Have fun!
Erick

On Sat, Mar 18, 2017 at 8:34 PM, donato  wrote:
> Thank you so much, Erick! I will try that!
>
> I do have one other question though... what sections do I do all of this in? 
> I see like four or five sections with different things in them. Do I use all 
> of those in each section or just in some? What is each section? What do they 
> do?
>
> Thanks again for your time. Truly. Thank you!
>
> 
> From: Erick Erickson [via Lucene] 
> Sent: Saturday, March 18, 2017 11:29:49 PM
> To: donato
> Subject: Re: How on EARTH do I remove 's in schema file?
>
> First, uncheck the "verbose" checkbox. The nitty-gritty information
> isn't relevant at this point.
>
> Second, hover over each of the light-gray abbreviations like "MCF", "PRCF" and such.
> You'll see the element of the analysis chain that it stands for, and the
> difference between the line before and this line is the effect of that
> element. For instance, on the query side you see that "patrick" is
> turned into "patrick", "patricks" and "patrick's" by "SF" which I'd
> guess is your SynonymFilter. But hovering over that will tell you
> exactly what element is producing those changes.
>
> Then it looks like you're using HTMLStripCharFilter, MappingCharFilter
> and PatternReplaceCharFilter (Factories all). Why do you think all
> those are necessary?
>
> So stop. Take a deep breath. My guess is that you've been trying a
> bunch of different approaches and the interactions of all the
> different parts are throwing you off. Start simple, with say
> StandardTokenizerFactory
> LowerCaseFilterFactory
> EnglishPossessiveFilterFactory
> PorterStemFilterFactory
>
> Use the analysis page and work your way toward complexity. Concentrate
> on the indexing side first. Enter all three of your variants (jack
> jacks jack's) in the box and press the button. Do not pass go. Do not
> collect $200 until you see the effects of your changes on the
> analysis page.
>
> Your stated goal here is that all of your variants reduce to "jack" in
> the example above. Don't bother querying until you see that result in
> your index.
>
> Tip: It is a bit clumsy to have to restart Solr every time you make
> changes in your schema (although if you're running stand-alone you can
> reload the core). So I often define several different field types with
> different possibilities and compare them after a single reload.
>
> Best,
> Erick
>
> On Sat, Mar 18, 2017 at 8:12 PM, vishal jain <[hidden 
> email]> wrote:
>
>> Try "stemEnglishPossessive" to remove the 's.
>>
>> On Sat, Mar 18, 2017 at 4:00 AM, donato <[hidden 
>> email]> wrote:
>>
>>> I have been racking my brain for days... I need to remove 's from say
>>> "patrick's" If I search for "patrick" or "patricks" I get the same number
>>> of
>>> results, however, if I search for "patrick's" it's a different number. I
>>> just want solr to ignore the 's. Can someone PLEASE help me? It is driving
>>> me nuts. Here is my schema file...
>>> Id  Name
>>>
>>>
>>>
>>> --
>>> View this message in context: http://lucene.472066.n3.
>>> nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
> 
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709p4325827.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Cascading failures with replicas

2017-03-18 Thread Erick Erickson
Hmmm, I'm totally mystified about how Solr is "creating a new replica
when one host is down". Are you saying this is happening
automagically? You're right, the autoAddReplicas bit is HDFS-only, so having
replicas just show up is completely weird. In days past,
when a replica was discovered on disk when Solr started up, it would
reconstruct itself _even if the collection had been deleted_. They
would reappear in clusterstate.json, _not_ the individual state.json
files under collections in ZK. Is this at all possible?

What version of Solr are you using anyway? The reconstruction I
mentioned above is 4x IIRC.

Best,
Erick

On Sat, Mar 18, 2017 at 5:45 PM, Walter Underwood  wrote:
> Thanks. This is a very CPU-heavy workload, with ngram fields and very long 
> queries. 16.7 million docs.
>
> The whole cascading failure thing in search engines is hard. The first time I 
> hit this was at Infoseek, over twenty years ago.
>
>> On Mar 18, 2017, at 12:46 PM, Erick Erickson  wrote:
>>
>> bug# 2, Solr shouldn't be adding replicas by itself unless you
>> specified autoAddReplicas=true when you created the collection. It
>> defaults to "false". So I'm not sure what's going on here.
>
> "autoAddReplicas":"false",
>
> in both collections. I thought that only worked with HDFS anyway.
>
>> bug #3. The internal load balancers are round-robin, so this is
>> expected. Not optimal I'll grant but expected.
>
> Right. Still a bug. Should be round-robin on instances, not cores.
>
>> bug #4. What shard placement rules are you using? There are a series
>> of rules for replica placement and one of the criteria (IIRC) is
>> exactly to try to distribute replicas to different hosts. Although
>> there was some glitchiness whether two JVMs on the same _host_ were
>> considered "the same host" or not.
>
> Separate Amazon EC2 instances, one JVM per instance, no rules, other than the 
> default.
>
> "maxShardsPerNode":"1",
>
>> bug #1 has been more or less of a pain for quite a while, work is ongoing 
>> there.
>
> Glad to share our logs.
>
> wunder
>
>> FWIW,
>> Erick
>>
>> On Fri, Mar 17, 2017 at 5:40 PM, Walter Underwood  
>> wrote:
>>> I’m running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I 
>>> shut down Solr on one host because it got into some kind of bad, 
>>> can’t-recover state where it was causing timeouts across the whole cluster 
>>> (bug #1).
>>>
>>> I ran a load benchmark near the capacity of the cluster. This had run fine 
>>> in test, this was the prod cluster.
>>>
>>> Solr Cloud added a replica to replace the down node. The node with two 
>>> cores got double the traffic and started slowly flapping in and out of 
>>> service. The 95th percentile response spiked from 3 seconds to 100 seconds. 
>>> At some point, another replica was made, with two replicas from the same 
>>> shard on the same instance. Naturally, that was overloaded, and I killed 
>>> the benchmark out of charity.
>>>
>>> Bug #2 is creating a new replica when one host is down. This should be an 
>>> option and default to “false”, because it causes the cascade.
>>>
>>> Bug #3 is sending equal traffic to each core without considering the host. 
>>> Each host should get equal traffic, not each core.
>>>
>>> Bug #4 is putting two replicas from the same shard on one instance. That is 
>>> just asking for trouble.
>>>
>>> When it works, this cluster is awesome.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>
>


Re: How on EARTH do I remove 's in schema file?

2017-03-18 Thread donato
Thank you so much, Erick! I will try that!

I do have one other question though... what sections do I do all of this in? I 
see like four or five sections with different things in them. Do I use all of 
those in each section or just in some? What is each section? What do they do?

Thanks again for your time. Truly. Thank you!


From: Erick Erickson [via Lucene] 
Sent: Saturday, March 18, 2017 11:29:49 PM
To: donato
Subject: Re: How on EARTH do I remove 's in schema file?

First, uncheck the "verbose" checkbox. The nitty-gritty information
isn't relevant at this point.

Second, hover over each of the light-gray abbreviations like "MCF", "PRCF" and such.
You'll see the element of the analysis chain that it stands for, and the
difference between the line before and this line is the effect of that
element. For instance, on the query side you see that "patrick" is
turned into "patrick", "patricks" and "patrick's" by "SF" which I'd
guess is your SynonymFilter. But hovering over that will tell you
exactly what element is producing those changes.

Then it looks like you're using HTMLStripCharFilter, MappingCharFilter
and PatternReplaceCharFilter (Factories all). Why do you think all
those are necessary?

So stop. Take a deep breath. My guess is that you've been trying a
bunch of different approaches and the interactions of all the
different parts are throwing you off. Start simple, with say
StandardTokenizerFactory
LowerCaseFilterFactory
EnglishPossessiveFilterFactory
PorterStemFilterFactory

Use the analysis page and work your way toward complexity. Concentrate
on the indexing side first. Enter all three of your variants (jack
jacks jack's) in the box and press the button. Do not pass go. Do not
collect $200 until you see the effects of your changes on the
analysis page.

Your stated goal here is that all of your variants reduce to "jack" in
the example above. Don't bother querying until you see that result in
your index.

Tip: It is a bit clumsy to have to restart Solr every time you make
changes in your schema (although if you're running stand-alone you can
reload the core). So I often define several different field types with
different possibilities and compare them after a single reload.

Best,
Erick

On Sat, Mar 18, 2017 at 8:12 PM, vishal jain <[hidden 
email]> wrote:

> Try "stemEnglishPossessive" to remove the 's.
>
> On Sat, Mar 18, 2017 at 4:00 AM, donato <[hidden 
> email]> wrote:
>
>> I have been racking my brain for days... I need to remove 's from say
>> "patrick's" If I search for "patrick" or "patricks" I get the same number
>> of
>> results, however, if I search for "patrick's" it's a different number. I
>> just want solr to ignore the 's. Can someone PLEASE help me? It is driving
>> me nuts. Here is my schema file...
>> Id  Name
>>
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.
>> nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709.html
>> Sent from the Solr - User mailing list archive at Nabble.com.







--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709p4325827.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How on EARTH do I remove 's in schema file?

2017-03-18 Thread Erick Erickson
First, uncheck the "verbose" checkbox. The nitty-gritty information
isn't relevant at this point.

Second, hover over each of the light-gray abbreviations like "MCF", "PRCF" and such.
You'll see the element of the analysis chain that it stands for, and the
difference between the line before and this line is the effect of that
element. For instance, on the query side you see that "patrick" is
turned into "patrick", "patricks" and "patrick's" by "SF" which I'd
guess is your SynonymFilter. But hovering over that will tell you
exactly what element is producing those changes.

Then it looks like you're using HTMLStripCharFilter, MappingCharFilter
and PatternReplaceCharFilter (Factories all). Why do you think all
those are necessary?

So stop. Take a deep breath. My guess is that you've been trying a
bunch of different approaches and the interactions of all the
different parts are throwing you off. Start simple, with say
StandardTokenizerFactory
LowerCaseFilterFactory
EnglishPossessiveFilterFactory
PorterStemFilterFactory

Use the analysis page and work your way toward complexity. Concentrate
on the indexing side first. Enter all three of your variants (jack
jacks jack's) in the box and press the button. Do not pass go. Do not
collect $200 until you see the effects of your changes on the
analysis page.
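
If you prefer the command line to the UI, the same information is exposed by
the field analysis handler; the core and field type names below are only
placeholders:

curl "http://localhost:8983/solr/yourcore/analysis/field?analysis.fieldtype=your_field_type&analysis.fieldvalue=jack+jacks+jack%27s&wt=json"

That returns the token stream after each stage of the analysis chain, which is
exactly what the admin UI page is showing you.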

Your stated goal here is that all of your variants reduce to "jack" in
the example above. Don't bother querying until you see that result in
your index.

Tip: It is a bit clumsy to have to restart Solr every time you make
changes in your schema (although if you're running stand-alone you can
reload the core). So I often define several different field types with
different possibilities and compare them after a single reload.

Best,
Erick

On Sat, Mar 18, 2017 at 8:12 PM, vishal jain  wrote:
> Try "stemEnglishPossessive" to remove the 's.
>
> On Sat, Mar 18, 2017 at 4:00 AM, donato  wrote:
>
>> I have been racking my brain for days... I need to remove 's from say
>> "patrick's" If I search for "patrick" or "patricks" I get the same number
>> of
>> results, however, if I search for "patrick's" it's a different number. I
>> just want solr to ignore the 's. Can someone PLEASE help me? It is driving
>> me nuts. Here is my schema file...
>> Id  Name
>>
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.
>> nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709.html
>> Sent from the Solr - User mailing list archive at Nabble.com.


Re: How on EARTH do I remove 's in schema file?

2017-03-18 Thread vishal jain
Try "stemEnglishPossessive" to remove the 's.
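
That is an attribute of WordDelimiterFilterFactory, e.g. something like this
(it already defaults to "1", so it only matters if it was switched off):

<filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1"/>

The standalone alternative is solr.EnglishPossessiveFilterFactory, which does
nothing except strip a trailing 's.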

On Sat, Mar 18, 2017 at 4:00 AM, donato  wrote:

> I have been racking my brain for days... I need to remove 's from say
> "patrick's" If I search for "patrick" or "patricks" I get the same number
> of
> results, however, if I search for "patrick's" it's a different number. I
> just want solr to ignore the 's. Can someone PLEASE help me? It is driving
> me nuts. Here is my schema file...
> Id  Name
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: OCR not working occasionally

2017-03-18 Thread Zheng Lin Edwin Yeo
Hi Rick,

Thanks for your reply.
I saw this error message for the file which has a failure.
Am I able to index such files together with the other files which store
text as an image, in the same indexing threads?


2017-03-19 01:02:26.610 INFO  (qtp1543727556-19) [c:collection1 s:shard1
r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2
start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2017-03-19 01:02:26.610 INFO  (qtp1543727556-19) [c:collection1 s:shard1
r:core_node1 x:collection1_shard1_replica2] o.a.s.u.SolrIndexWriter Calling
setCommitData with IW:org.apache.solr.update.SolrIndexWriter@2330f07c
2017-03-19 01:02:26.610 ERROR
(updateExecutor-2-thread-4-processing-n:192.168.99.1:8983_solr
x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1)
[c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2]
o.a.s.u.SolrCmdDistributor
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://192.168.99.1:8984/solr/collection1_shard1_replica1:
Expected mime type application/octet-stream but got text/html. 


Error 404 


HTTP ERROR: 404
Problem accessing /solr/collection1_shard1_replica1/update. Reason:
Not Found




at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:578)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(ConcurrentUpdateSolrClient.java:430)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
at
org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:293)
at
org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:282)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

2017-03-19 01:02:26.657 INFO  (qtp1543727556-19) [c:collection1 s:shard1
r:core_node1 x:collection1_shard1_replica2] o.a.s.s.SolrIndexSearcher
Opening [Searcher@77e108d5[collection1_shard1_replica2] main]
2017-03-19 01:02:26.658 INFO  (qtp1543727556-19) [c:collection1 s:shard1
r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2
end_commit_flush
2017-03-19 01:02:26.658 INFO
 (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr
x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1)
[c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2]
o.a.s.c.QuerySenderListener QuerySenderListener sending requests to
Searcher@77e108d5[collection1_shard1_replica2]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.4.2):C3)))}
2017-03-19 01:02:26.658 INFO
 (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr
x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1)
[c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2]
o.a.s.c.QuerySenderListener QuerySenderListener done.
2017-03-19 01:02:26.659 INFO
 (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr
x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1)
[c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2]
o.a.s.c.SolrCore [collection1_shard1_replica2] Registered new searcher
Searcher@77e108d5[collection1_shard1_replica2]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.4.2):C3)))}
2017-03-19 01:02:26.659 INFO  (qtp1543727556-19) [c:collection1 s:shard1
r:core_node1 x:collection1_shard1_replica2]
o.a.s.u.p.LogUpdateProcessorFactory [collection1_shard1_replica2]
 webapp=/solr path=/update
params={update.distrib=FROMLEADER=files-update-processor=true=true=true=false=
http://192.168.99.1:8983/solr/collection1_shard1_replica2/_end_point=true=javabin=2=false}{commit=}
0 49
2017-03-19 01:02:26.662 WARN  (qtp1543727556-139) [c:collection1 s:shard1
r:core_node1 x:collection1_shard1_replica2]
o.a.s.u.p.DistributedUpdateProcessor Error sending update to
http://192.168.99.1:8984/solr
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://192.168.99.1:8984/solr/collection1_shard1_replica1:
Expected mime type application/octet-stream but got text/html. 


Error 404 


HTTP ERROR: 404
Problem accessing /solr/collection1_shard1_replica1/update. Reason:
Not Found




at

Re: Cascading failures with replicas

2017-03-18 Thread Walter Underwood
Thanks. This is a very CPU-heavy workload, with ngram fields and very long 
queries. 16.7 million docs.

The whole cascading failure thing in search engines is hard. The first time I 
hit this was at Infoseek, over twenty years ago.

> On Mar 18, 2017, at 12:46 PM, Erick Erickson  wrote:
> 
> bug# 2, Solr shouldn't be adding replicas by itself unless you
> specified autoAddReplicas=true when you created the collection. It
> defaults to "false". So I'm not sure what's going on here.

"autoAddReplicas":"false",

in both collections. I thought that only worked with HDFS anyway.
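
(Those per-collection properties can be read back with the Collections API,
for example — the collection name here is a placeholder:

http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=yourcollection&wt=json

which lists autoAddReplicas, maxShardsPerNode and replicationFactor for the
collection.)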

> bug #3. The internal load balancers are round-robin, so this is
> expected. Not optimal I'll grant but expected.

Right. Still a bug. Should be round-robin on instances, not cores.

> bug #4. What shard placement rules are you using? There are a series
> of rules for replica placement and one of the criteria (IIRC) is
> exactly to try to distribute replicas to different hosts. Although
> there was some glitchiness whether two JVMs on the same _host_ were
> considered "the same host" or not.

Separate Amazon EC2 instances, one JVM per instance, no rules, other than the 
default.

"maxShardsPerNode":"1",

> bug #1 has been more or less of a pain for quite a while, work is ongoing 
> there.

Glad to share our logs.

wunder

> FWIW,
> Erick
> 
> On Fri, Mar 17, 2017 at 5:40 PM, Walter Underwood  
> wrote:
>> I’m running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I 
>> shut down Solr on one host because it got into some kind of bad, 
>> can’t-recover state where it was causing timeouts across the whole cluster 
>> (bug #1).
>> 
>> I ran a load benchmark near the capacity of the cluster. This had run fine 
>> in test, this was the prod cluster.
>> 
>> Solr Cloud added a replica to replace the down node. The node with two cores 
>> got double the traffic and started slowly flapping in and out of service. 
>> The 95th percentile response spiked from 3 seconds to 100 seconds. At some 
>> point, another replica was made, with two replicas from the same shard on 
>> the same instance. Naturally, that was overloaded, and I killed the 
>> benchmark out of charity.
>> 
>> Bug #2 is creating a new replica when one host is down. This should be an 
>> option and default to “false”, because it causes the cascade.
>> 
>> Bug #3 is sending equal traffic to each core without considering the host. 
>> Each host should get equal traffic, not each core.
>> 
>> Bug #4 is putting two replicas from the same shard on one instance. That is 
>> just asking for trouble.
>> 
>> When it works, this cluster is awesome.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 



Re: OCR not working occasionally

2017-03-18 Thread Rick Leir
Hi Edwin
The pdf file format can store text as an image, and then you need OCR to get 
the text. However, text is more commonly not stored as an image in the pdf, and 
then you should not use OCR to get the text.

Do you get an error message when you have a failure?
Cheers -- Rick

On March 18, 2017 12:01:17 PM EDT, Zheng Lin Edwin Yeo  
wrote:
>Hi,
>
>I'm facing the issue that Tesseract OCR is occasionally not able to extract
>the words in a PDF file attached to an EML file and index them into Solr.
>However, most of the time it can be extracted.
>
>What could be the reason that causes the file in the email attachment
>to fail to be extracted using OCR?
>
>I'm using Solr 6.4.2.
>
>Regards,
>Edwin

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: How on EARTH do I remove 's in schema file?

2017-03-18 Thread donato
Erick,

Here is the analysis: https://www.screencast.com/t/DKKklTXk

Do you need everything on that page? I'm not sure what I am looking for
here...

Also, this is my current schema.xml file: *DOWNLOAD HERE*. Not sure if I
have something in the wrong place/order?

Thanks again! I really need this done by Monday... 

Cheers.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709p4325809.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: stemEnglishPossessive and contractions

2017-03-18 Thread donato
Hi Herman,

I just noticed your post on possessives and I am having the same problem.
With St. Patrick's Day coming up, people are searching our site for
"patrick" and "patrick's" yet they are yielding different results. If we
search for "patrick" and "patricks" they yield the same results. I want all
three to yield the same results. 

Here is my schema file: *CLICK HERE*. Am I
missing something? Do I have the order wrong? Are they in the wrong place?

Thank you in advance. I am not too familiar with this stuff as of yet...

Cheers.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/stemEnglishPossessive-and-contractions-tp3434657p4325808.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How on EARTH do I remove 's in schema file?

2017-03-18 Thread Erick Erickson
bq: I'm not too familiar with this technology yet. I tried adding that
&debug=query at the end of my URL, but nothing happened.

You need to look at the raw response. There should be a section at the
end of the response where debug information is appended.
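
Something like this (the core name is a placeholder):

http://localhost:8983/solr/yourcore/select?q=patrick%27s&debug=query&wt=xml

then scroll down to the <lst name="debug"> block at the bottom to see how the
query was actually parsed.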

Please just paste the relevant bits of your xml file inline for the
field you're considering. The admin UI>>select core>>analysis page is
_really_ your friend here.

Best,
Erick

On Fri, Mar 17, 2017 at 5:29 PM, donato  wrote:
> Thanks for the response, Erik!
>
> Can you download my schema file here? CLICK HERE.
>
> I'm not too familiar with this technology yet. I tried adding that
> &debug=query at the end of my URL, but nothing happened.
>
> Thanks again for the response! All along, I just wanted queries for cat,
> cats, kitten and kitties to return the same number of results - and it does
> - partially because of the synonyms.txt file.
>
> But this apostrophe thing is killing me!
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709p4325718.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Cascading failures with replicas

2017-03-18 Thread Erick Erickson
bug# 2, Solr shouldn't be adding replicas by itself unless you
specified autoAddReplicas=true when you created the collection. It
defaults to "false". So I'm not sure what's going on here.
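
For reference, that flag gets specified when the collection is created, along
the lines of (names and sizes below are made up):

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=4&maxShardsPerNode=1&autoAddReplicas=false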

bug #3. The internal load balancers are round-robin, so this is
expected. Not optimal I'll grant but expected.

bug #4. What shard placement rules are you using? There are a series
of rules for replica placement and one of the criteria (IIRC) is
exactly to try to distribute replicas to different hosts. Although
there was some glitchiness whether two JVMs on the same _host_ were
considered "the same host" or not.
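
If you want to pin that down explicitly, rule-based placement can be given at
CREATE time, e.g. a rule such as

rule=shard:*,replica:<2,node:*

which says "never put more than one replica of any shard on the same node".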

bug #1 has been more or less of a pain for quite a while, work is ongoing there.

FWIW,
Erick

On Fri, Mar 17, 2017 at 5:40 PM, Walter Underwood  wrote:
> I’m running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I 
> shut down Solr on one host because it got into some kind of bad, 
> can’t-recover state where it was causing timeouts across the whole cluster 
> (bug #1).
>
> I ran a load benchmark near the capacity of the cluster. This had run fine in 
> test, this was the prod cluster.
>
> Solr Cloud added a replica to replace the down node. The node with two cores 
> got double the traffic and started slowly flapping in and out of service. The 
> 95th percentile response spiked from 3 seconds to 100 seconds. At some point, 
> another replica was made, with two replicas from the same shard on the same 
> instance. Naturally, that was overloaded, and I killed the benchmark out of 
> charity.
>
> Bug #2 is creating a new replica when one host is down. This should be an 
> option and default to “false”, because it causes the cascade.
>
> Bug #3 is sending equal traffic to each core without considering the host. 
> Each host should get equal traffic, not each core.
>
> Bug #4 is putting two replicas from the same shard on one instance. That is 
> just asking for trouble.
>
> When it works, this cluster is awesome.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Re: Managed schema used with Cloudera MapreduceIndexerTool and morphlines?

2017-03-18 Thread Erick Erickson
Hey Jay!

All I can say is "good luck with that". I do know Morphlines uses
EmbeddedSolrServer to do its work. So I don't really see a good way to
pluck just what you'd need for schemaless.

The MapReduceIndexerTool is carried right along with Solr though. IIRC
the Morphlines stuff is mostly the ETL process. Have you tried just
running an MRIT job with a current Solr? I have no idea whether it'd
work, but it seems like it "should"...
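
Roughly, the shape of such a job and its morphline is the following (every
path, host and name below is invented, and the exact jar and flags depend on
your distro):

hadoop jar search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphline.conf \
  --zk-host zk1:2181/solr --collection mycollection \
  --output-dir hdfs://nn/tmp/outdir --go-live hdfs://nn/data/input

morphlines : [{
  id : ingestJson
  importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
  commands : [
    { readJson {} }
    { extractJsonPaths { flatten : true, paths : { id : /id } } }
    { loadSolr { solrLocator : { collection : mycollection, zkHost : "zk1:2181/solr" } } }
  ]
}]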

Erick

On Fri, Mar 17, 2017 at 5:51 PM, Jay Hill  wrote:
> I've got a very difficult project to tackle. I've been tasked with using
> schemaless mode to index json files that we receive. The structure of the
> json files will always be very different as we're receiving files from
> different customers totally unrelated to one another. We are attempting to
> build a "one size fits all" approach to receiving documents from a wide
> variety of sources and then index them into Solr.
>
> We're running in Solr 5.3. The schemaless approach works well enough -
> until it doesn't. It seems to fail on type guessing and also gets confused
> indexing to different shards. If it was reliable it would be the perfect
> solution for our task. But the larger the JSON file the more likely it is
> to fail. At a certain size it just doesn't work.
>
> I've been advised by some experts and committers that schemaless is a good
> tool for prototyping, but risky to run in production, but we thought we
> would try it by doing offline indexing using the Cloudera
> MapReduceIndexerTool to build offline indexes - but still using managed
> schemas. This map reduce tool uses morphlines, which is a nifty ETL tool
> that pipes together a series of commands to transform data. For example a
> JSON or CSV file can be processed and loaded into a Solr index with a
> "readJSON" command piped to a "loadSolr" command, for a simple example.
>
> But the kite-sdk that manages the morphlines only seems to offer, as their
> latest version, solr *4.10.3*-cdh5.10.0 (their customized version of
> 4.10.3)
>
> So I can't see any way to integrate schemaless (which has dependencies
> after 4.10.3) with the morphlines.
>
> But I thought I would ask here: Anybody had ANY experience with morphlines
> to index to Solr? Any info would help me make sense of this.
>
> Cheers to all!


OCR not working occasionally

2017-03-18 Thread Zheng Lin Edwin Yeo
Hi,

I'm facing the issue that Tesseract OCR is occasionally not able to extract the
words in a PDF file attached to an EML file and index them into Solr. However,
most of the time it can be extracted.

What could be the reason that causes the file in the email attachment to fail
to be extracted using OCR?

I'm using Solr 6.4.2.

Regards,
Edwin


Re: Group by range results

2017-03-18 Thread Zheng Lin Edwin Yeo
You can try using JSON Facet.
It has a Range Facet, which you can use to group by the date range.

http://yonik.com/json-facet-api/#Range_Facet
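
Using the field names from your mail, a rough (untested) sketch would be:

json.facet={
  by_year : {
    type  : range,
    field : manufacturedate_dt,
    start : "2006-01-01T00:00:00Z",
    end   : "NOW/YEAR",
    gap   : "+1YEAR",
    facet : {
      min_price : "min(price)",
      max_price : "max(price)"
    }
  }
}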

Regards,
Edwin


On 16 March 2017 at 21:32, Mikhail Ibraheem 
wrote:

> Any help on this please?
>
>
>
> From: Mikhail Ibraheem
> Sent: 15 March, 2017 08:53 PM
> To: solr-user@lucene.apache.org
> Subject: Group by range results
>
>
>
> Hi,
>
> Can we group by ranges? something like:
>
> facet=true
> stats=true
> stats.field={!tag=piv1 min=true max=true}price
> facet.range={!tag=r1}manufacturedate_dt
> facet.range.start=2006-01-01T00:00:00Z
> facet.range.end=NOW/YEAR
> facet.range.gap=+1YEAR
> facet.pivot={!stats=piv1}r1
>
>
>
> Where I want the max price and min price for each range
> of manufacturedate_dt.
>
> Please advise.
>
>
>
> Thanks
>


Re: fq performance

2017-03-18 Thread Damien Kamerman
You may want to consider a join, esp. if you ever consider thousands of
groups, e.g.
fq={!join from=access_control_group
to=doc_group}access_control_user_id:USERID
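
(And if the filter really is just a flat list of ids, the terms query parser
is another way to write it without building a big boolean query, e.g.:

fq={!terms f=id}id1,id2,id3,...,id2000
)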

On 18 March 2017 at 05:57, Yonik Seeley  wrote:

> On Fri, Mar 17, 2017 at 2:17 PM, Shawn Heisey  wrote:
> > On 3/17/2017 8:11 AM, Yonik Seeley wrote:
> >> For Solr 6.4, we've managed to circumvent this for filter queries and
> >> other contexts where scoring isn't needed.
> >> http://yonik.com/solr-6-4/  "More efficient filter queries"
> >
> > Nice!
> >
> > If the filter looks like the following (because q.op=AND), does it still
> > use TermsQuery?
> >
> > fq=id:(id1 OR id2 OR id3 OR ... id2000)
>
> Yep, that works as well.  As does fq=id:id1 OR id:id2 OR id:id3 ...
> Was implemented here: https://issues.apache.org/jira/browse/SOLR-9786
>
> -Yonik
>