RE: Read or Capture Solr Logs

2015-03-24 Thread Markus Jelsma
Hello - "process the logs" means you have to build your own program that reads 
and processes the logs, and does whatever you need it to. In a custom 
SearchComponent you can implement e.g. process() [1], read the query, and do 
something with it.

[1]: 
http://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/handler/component/SearchComponent.html#process%28org.apache.solr.handler.component.ResponseBuilder%29
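
For illustration, a minimal sketch of such a component (the class name, logger, and registration line are illustrative assumptions, not code from this thread):

  import java.io.IOException;
  import org.apache.solr.common.params.CommonParams;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.handler.component.SearchComponent;
  import org.apache.solr.request.SolrQueryRequest;
  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  public class QueryCaptureComponent extends SearchComponent {
    private static final Logger log = LoggerFactory.getLogger(QueryCaptureComponent.class);

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
      // nothing to do before the query is executed
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
      SolrQueryRequest req = rb.req;
      String q = req.getParams().get(CommonParams.Q);
      // do whatever you need with the query string, e.g. write it to a log or a store
      log.info("captured query: {}", q);
    }

    @Override
    public String getDescription() {
      return "Captures incoming queries";
    }

    @Override
    public String getSource() {
      return null;
    }
  }

The component would then be registered in solrconfig.xml, e.g. <searchComponent name="querycapture" class="com.example.QueryCaptureComponent"/>, and listed under a request handler's last-components so it runs after the normal query component.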
 
-Original message-
 From:Nitin Solanki nitinml...@gmail.com
 Sent: Tuesday 24th March 2015 11:55
 To: solr-user@lucene.apache.org
 Subject: Re: Read or Capture Solr Logs
 
 Hi Markus,
   Can you please help me. How to do that?
 Using both Process the logs
 or make a simple SearchComponent implementation that reads
 SolrQueryRequest.
 
 On Tue, Mar 24, 2015 at 4:17 PM, Markus Jelsma markus.jel...@openindex.io
 wrote:
 
  Hello, you can either process the logs, or make a simple SearchComponent
  implementation that reads SolrQueryRequest.
 
  Markus
 
 
 
  -Original message-
   From:Nitin Solanki nitinml...@gmail.com
   Sent: Tuesday 24th March 2015 11:38
   To: solr-user@lucene.apache.org
   Subject: Read or Capture Solr Logs
  
   Hello,
   I want to read or capture all the queries which are searched
  by
   users. Any help on this?
  
 
 


Re: TooManyBasicQueries?

2015-03-24 Thread Erik Hatcher
Somehow a surround query is being constructed along the way.  Search your logs 
for “surround” and see if someone is maybe sneaking a q={!surround}… in there.  
If you’re passing input directly through from your application to Solr’s q 
parameter without any sanitizing or filtering, it’s possible a surround query 
parser could be asked for.
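
For reference, this limit is the surround query parser's own cap on generated clauses. If surround queries are actually intended, the cap can usually be raised per request with the parser's maxBasicQueries local parameter (parameter name from memory of the surround parser, so verify it against your Solr version), e.g.:

  q={!surround maxBasicQueries=2000}3w(apache, solr)

If surround queries are not intended, sanitizing the q parameter before it reaches Solr is the safer fix.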


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com




 On Mar 24, 2015, at 8:55 AM, Ian Rose ianr...@fullstory.com wrote:
 
 Hi Erik -
 
 Sorry, I totally missed your reply.  To the best of my knowledge, we are
 not using any surround queries (have to admit I had never heard of them
 until now).  We use solr.SearchHandler for all of our queries.
 
 Does that answer the question?
 
 Cheers,
 Ian
 
 
 On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher erik.hatc...@gmail.com
 wrote:
 
 It results from a surround query with too many terms.   Says the javadoc:
 
 * Exception thrown when {@link BasicQueryFactory} would exceed the limit
 * of query clauses.
 
 I’m curious, are you issuing a large {!surround} query or is it expanding
 to hit that limit?
 
 
 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com
 
 
 
 
 On Mar 13, 2015, at 9:44 AM, Ian Rose ianr...@fullstory.com wrote:
 
 I sometimes see the following in my logs:
 
 ERROR org.apache.solr.core.SolrCore  –
 org.apache.lucene.queryparser.surround.query.TooManyBasicQueries:
 Exceeded
 maximum of 1000 basic queries.
 
 
 What does this mean?  Does this mean that we have issued a query with too
 many terms?  Or that the number of concurrent queries running on the
 server
 is too high?
 
 Also, is this a builtin limit or something set in a config file?
 
 Thanks!
 - Ian
 
 



Re: TooManyBasicQueries?

2015-03-24 Thread Ian Rose
Hi Erik -

Sorry, I totally missed your reply.  To the best of my knowledge, we are
not using any surround queries (have to admit I had never heard of them
until now).  We use solr.SearchHandler for all of our queries.

Does that answer the question?

Cheers,
Ian


On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher erik.hatc...@gmail.com
wrote:

 It results from a surround query with too many terms.   Says the javadoc:

 * Exception thrown when {@link BasicQueryFactory} would exceed the limit
 * of query clauses.

 I’m curious, are you issuing a large {!surround} query or is it expanding
 to hit that limit?


 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com




  On Mar 13, 2015, at 9:44 AM, Ian Rose ianr...@fullstory.com wrote:
 
  I sometimes see the following in my logs:
 
  ERROR org.apache.solr.core.SolrCore  –
  org.apache.lucene.queryparser.surround.query.TooManyBasicQueries:
 Exceeded
  maximum of 1000 basic queries.
 
 
  What does this mean?  Does this mean that we have issued a query with too
  many terms?  Or that the number of concurrent queries running on the
 server
  is too high?
 
  Also, is this a builtin limit or something set in a config file?
 
  Thanks!
  - Ian




rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
Hi all -

I'm sure this topic has been covered before but I was unable to find any
clear references online or in the mailing list.

Are there any rules of thumb for how many cores (aka shards, since I am
using SolrCloud) is too many for one machine?  I realize there is no one
answer (depends on size of the machine, etc.) so I'm just looking for a
rough idea.  Something like the following would be very useful:

* People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
server without any problems.
* I have never heard of anyone successfully running X cores/shards on a
single machine, even if you throw a lot of hardware at it.

Thanks!
- Ian


Issues to create new core

2015-03-24 Thread Alejandro Jesus Mariño Molerio
Dear Solr Community: 
I just began to work with Solr. I chose Solr 5.0, but when I try to create a 
new core with the GUI, it shows the following error:  Error CREATEing SolrCore 
'datos': Unable to create core [datos] Caused by: Can't find resource 
'solrconfig.xml' in classpath or 'C:\solr\server\solr\datos\conf'. My question 
is simple: how can I fix this problem? 

Thanks in advance for your consideration. 
Alejandro. 


Solr replicas going in recovering state during heavy indexing

2015-03-24 Thread Gopal Jee
Hi
We have a large SolrCloud cluster. We have observed that during heavy
indexing, a large number of replicas go into a recovering or down state.
What could be the possible reason and/or fix for this issue?

Gopal


Re: maxReplicasPerNode

2015-03-24 Thread Shai Erera
Thanks Anshum,

About #3, in line with my answer to the previous question, Solr wouldn't
 auto-add a Replica to meet the replication factor when a node goes down.


Just to make sure the answer applies to both these cases:

   1. There are two replicas on node1 and node2. Solr won't add a replica
   to node1 when node2 goes down.
   2. The collection was created with rf=2, Solr creates replicas on node1
   and node2. If node2 goes down and a node3 comes up instead, will it be
   assigned a replica, or Solr does not do that also?

In short, is there any scenario where Solr would auto-add replicas (aside
from running on HDFS) to meet the 'rf' setting, or after the collection has
been created, ensuring RF is met is my responsibility?

Shai

On Tue, Mar 24, 2015 at 10:02 AM, Anshum Gupta ans...@anshumgupta.net
wrote:

 Hi Shai,

 As of now, all replicas for a collection are created to meet the specified
 replication factor at the time of collection creation. There's no way to
 defer that until more nodes are up. Your best bet is to have the nodes
 already up before you CREATE the collection or create the collection with a
 lower replication factor and then use ADDREPLICA.

 About auto-addition of replicas, that's kind of supported when using shared
 file system (HDFS) to host the index. It doesn't truly work as per your
 use-case i.e. it doesn't consider the intended replication factor but only
 brings up a Replica in case all replicas for a node are down, so that
 SolrCloud continues to be usable. It also doesn't auto-remove replica when
 the old node comes back up. You can read more about this in the
 Automatically Add Replicas in SolrCloud section here:
 https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS

 About #3, in line with my answer to the previous question, Solr wouldn't
 auto-add a Replica to meet the replication factor when a node goes down.


 On Tue, Mar 24, 2015 at 12:36 AM, Shai Erera ser...@gmail.com wrote:

  Hi
 
  I saw that we can define maxShardsPerNode when creating a collection,
 but I
  don't see that I can set something similar for replicas. My scenario is
 the
  following:
 
 - I setup one Solr node
 - Create collection with numShards=1 and replicationFactor=2
 - Hopefully, one replica is created on that node
 - When I bring up the second Solr node, the second replica will be
 created
 
  What I see is that both replicas are created on the first node, and when
 I
  bring up the second Solr node, none of the replicas are moved.
 
  I know that I can move one replica by calling ADDREPLICA on node2, then
  DELETEREPLICA on node1, but I was wondering if there's an automated way
 to
  do that.
 
  I've also considered creating the collection with replicationFactor=1 and
  when the second node comes up it will look for shards w/ one replica
 only,
  and assign themselves as the replica. But it means I have to own that
 piece
  of logic, where if Solr already does that, that's better.
 
  Also, from what I understand, if I create a collection w/ rf=2 and there
  are two nodes, then each node is assigned a replica. If one of the nodes
  comes down, and a 3rd node comes up, it will be assigned a replica -- is
  that correct?
 
  Another related question, if there are two replicas on node1 and node2,
 and
  node2 goes down -- will node1 be assigned the second replica as well?
 
  If this is explained somewhere, I'd appreciate if you can give me a
  pointer.
 
  Shai
 



 --
 Anshum Gupta
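
For reference, a replica can be added explicitly once an extra node is up, e.g. via the Collections API (host, collection, shard and node names below are placeholders):

  http://host:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=node2:8983_solr

followed by DELETEREPLICA on the old node if the intent is to move the replica rather than add one.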



Read or Capture Solr Logs

2015-03-24 Thread Nitin Solanki
Hello,
I want to read or capture all the queries which are searched by
users. Any help on this?


How to verify a document is indexed by all replicas

2015-03-24 Thread Shai Erera
Hi

Is there a recommended, preferably fast, way to check that a document is
indexed by all replicas? I currently do that by issuing a search request to
each replica, but was wondering if there's a faster way.
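
For reference, one way to do that per-replica check is to query each core directly with distrib=false so the request is not forwarded to another replica (core name and document id below are placeholders):

  http://host:8983/solr/mycollection_shard1_replica1/select?q=id:doc123&distrib=false&fl=id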

Even better, is there a way to verify all replicas of a shard are
up-to-date, e.g. by comparing their version or something? By up-to-date
I mean that they've all processed the same update requests that came
through.

If there's a replica lagging behind, I'd like to wait for it to catch up,
something like a checkpoint(), before I continue sending more updates.

Shai


Set search query logs into Solr

2015-03-24 Thread Nitin Solanki
Hello,
 I want to insert searched queries into the Solr log to track
users' input. I googled a lot but didn't find anything. Please help.
Your help will be appreciated...


Re: _text

2015-03-24 Thread Zheng Lin Edwin Yeo
Hi Philippe,

Are you using the default schemaFactory, in which your setting in
solrconfig.xml is <schemaFactory class="ManagedIndexSchemaFactory">, or you
have used your own defined schema.xml, in which your setting in
solrconfig.xml should be <schemaFactory class="ClassicIndexSchemaFactory"/>?


Regards,
Edwin


On 24 March 2015 at 17:40, phi...@free.fr wrote:


 Hello,

 my SOLR 5 Admin Panel displays the following error:

 23/03/2015 15:05:05 ERROR   SolrCore
 org.apache.solr.common.SolrException: undefined field: _text

 How should _text be defined in schema.xml?

 Many thanks.

 Philippe



Re: _text

2015-03-24 Thread phiroc
Hi Zheng,

I copied the SOLR 5 schema.xml file on Github (?), which contains the following 
line:

<schemaFactory class="ClassicIndexSchemaFactory"/>





- Original Message -
From: Zheng Lin Edwin Yeo edwinye...@gmail.com
To: solr-user@lucene.apache.org
Sent: Tuesday, 24 March 2015 10:59:49
Subject: Re: _text

Hi Philippe,

Are you using the default schemaFactory, in which your setting in
solrconfig.xml is <schemaFactory class="ManagedIndexSchemaFactory">, or you
have used your own defined schema.xml, in which your setting in
solrconfig.xml should be <schemaFactory class="ClassicIndexSchemaFactory"/>?


Regards,
Edwin


On 24 March 2015 at 17:40, phi...@free.fr wrote:


 Hello,

 my SOLR 5 Admin Panel displays the following error:

 23/03/2015 15:05:05 ERROR   SolrCore
 org.apache.solr.common.SolrException: undefined field: _text

 How should _text be defined in schema.xml?

 Many thanks.

 Philippe



RE: Read or Capture Solr Logs

2015-03-24 Thread Markus Jelsma
Hello, you can either process the logs, or make a simple SearchComponent 
implementation that reads SolrQueryRequest.

Markus

 
 
-Original message-
 From:Nitin Solanki nitinml...@gmail.com
 Sent: Tuesday 24th March 2015 11:38
 To: solr-user@lucene.apache.org
 Subject: Read or Capture Solr Logs
 
 Hello,
 I want to read or capture all the queries which are searched by
 users. Any help on this?
 


Re: _text

2015-03-24 Thread Zheng Lin Edwin Yeo
Hi Philippe,

That means you're using the physical schema.xml. You can check the file in
your collection, under conf folder. For mine I don't have the _text field
in my schema.xml. If you don't require it in your setup, you can try
removing it and see if it's ok?

Else you can use the schema.xml or the entire conf folder from the
techproducts example located at
{SOLR_HOME}\server\solr\configsets\sample_techproducts_configs\CONF, which
comes together with the Solr 5.0 package.
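
If something in solrconfig.xml (for example a df default or a copyField target) really does reference _text, a hedged sketch of a matching schema.xml entry would be along these lines (the field type and attributes are assumptions to adapt to your schema):

  <field name="_text" type="text_general" indexed="true" stored="false" multiValued="true"/>

Otherwise, removing the reference to _text from solrconfig.xml is usually the simpler fix.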




On 24 March 2015 at 18:12, phi...@free.fr wrote:

 Hi Zheng,

 I copied the SOLR 5 schema.xml file on Github (?), which contains the
 following line:

 <schemaFactory class="ClassicIndexSchemaFactory"/>





 - Original Message -
 From: Zheng Lin Edwin Yeo edwinye...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tuesday, 24 March 2015 10:59:49
 Subject: Re: _text

 Hi Philippe,

 Are you using the default schemaFactory, in which your setting in
 solrconfig.xml is <schemaFactory class="ManagedIndexSchemaFactory">, or you
 have used your own defined schema.xml, in which your setting in
 solrconfig.xml should be <schemaFactory
 class="ClassicIndexSchemaFactory"/>?


 Regards,
 Edwin


 On 24 March 2015 at 17:40, phi...@free.fr wrote:

 
  Hello,
 
  my SOLR 5 Admin Panel displays the following error:
 
  23/03/2015 15:05:05 ERROR   SolrCore
  org.apache.solr.common.SolrException: undefined field: _text
 
  How should _text be defined in schema.xml?
 
  Many thanks.
 
  Philippe
 



Re: Read or Capture Solr Logs

2015-03-24 Thread Nitin Solanki
Hi Markus,
  Can you please help me. How to do that?
Using both Process the logs or make a simple SearchComponent
implementation that reads SolrQueryRequest

On Tue, Mar 24, 2015 at 4:25 PM, Nitin Solanki nitinml...@gmail.com wrote:

 Hi Markus,
   Can you please help me. How to do that?
 Using both Process the logs
 or make a simple SearchComponent implementation that reads
 SolrQueryRequest.

 On Tue, Mar 24, 2015 at 4:17 PM, Markus Jelsma markus.jel...@openindex.io
  wrote:

 Hello, you can either process the logs, or make a simple SearchComponent
 implementation that reads SolrQueryRequest.

 Markus



 -Original message-
  From:Nitin Solanki nitinml...@gmail.com
  Sent: Tuesday 24th March 2015 11:38
  To: solr-user@lucene.apache.org
  Subject: Read or Capture Solr Logs
 
  Hello,
  I want to read or capture all the queries which are
 searched by
  users. Any help on this?
 





_text

2015-03-24 Thread phiroc

Hello,

my SOLR 5 Admin Panel displays the following error:

23/03/2015 15:05:05 ERROR   SolrCore
org.apache.solr.common.SolrException: undefined field: _text

How should _text be defined in schema.xml?

Many thanks.

Philippe


Re: Custom TokenFilter

2015-03-24 Thread Erick Erickson
bq: 13 moreCaused by: java.lang.ClassCastException: class
com.tamingtext.texttamer.solr.

This usually means you have jar files from different versions of Solr
in your classpath.

Best,
Erick
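
As a hedged sketch, a custom factory jar is normally wired in through a <lib> directive in solrconfig.xml (the path below is a placeholder), and that jar must be built against the same Solr/Lucene version the server is running:

  <lib dir="/path/to/tamingtext/lib" regex=".*\.jar" />

A stale solr-core or lucene-analyzers jar sitting in such a directory is a common way to end up with exactly this ClassCastException.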

On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote:
 Hi there,
 I'm trying to create my own TokenizerFactory (from tamingtext's book). After
 setting schema.xml and adding the path in solrconfig.xml, I start Solr. I
 have this error message:

 Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
     at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
     at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
     at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
     at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
     at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
     at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
     ... 7 more
 Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
     at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
     ... 12 more
 Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
     at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
     ... 13 more
 Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
     at java.lang.Class.asSubclass(Class.java:3208)
     at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
     at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
     at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
     at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
 Someone can help?
 Thanks.Regards.


Custom TokenFilter

2015-03-24 Thread Test Test
Hi there, 
I'm trying to create my own TokenizerFactory (from tamingtext's book). After
setting schema.xml and adding the path in solrconfig.xml, I start Solr. I have
this error message:

Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
    at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
    ... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
    ... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
    at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
    ... 13 more
Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
    at java.lang.Class.asSubclass(Class.java:3208)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
    at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
    at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
    at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)

Can someone help?
Thanks. Regards.


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Erick Erickson
Test Test:

From Hossman's apache page:

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is hidden in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.

Also, please format your stack trace for readability. On a quick
glance, you probably
have mis-matched jars in your classpath.

On Tue, Mar 24, 2015 at 1:35 PM, Test Test andymish...@yahoo.fr wrote:
 Hi there,
 I'm trying to create my own TokenizerFactory (from tamingtext's book). After
 setting schema.xml and adding the path in solrconfig.xml, I start Solr. I
 have this error message:

 Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
     at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
     at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
     at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
     at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
     at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
     at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
     ... 7 more
 Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
     at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
     ... 12 more
 Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
     at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
     ... 13 more
 Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
     at java.lang.Class.asSubclass(Class.java:3208)
     at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
     at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
     at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
     at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
 Someone can help?
 Thanks.Regards.


  On Tuesday, 24 March 2015 at 21:24, Jack Krupansky jack.krupan...@gmail.com wrote:


  I'm sure that I am quite unqualified to describe his hypothetical setup. I
 mean, he's the one using the term multi-tenancy, so it's for him to be
 clear.

 For me, it's a question of who has control over the config and schema and
 collection creation. Having more than one business entity controlling the
 configuration of a single (Solr) server is a recipe for disaster. Solr
 works well if there is an architect for the system. Ever hear the old
 saying Too many cooks spoil the stew?

 -- Jack Krupansky

 On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen t...@statsbiblioteket.dk
 wrote:

 Jack Krupansky [jack.krupan...@gmail.com] wrote:
  Don't confuse customers and tenants.

 Perhaps you could explain what you mean by multi-tenant in the context of
 Ian's setup? It is not clear to me what the distinction is in this case.

 - Toke Eskildsen






Re: maxReplicasPerNode

2015-03-24 Thread Anshum Gupta
Yes, it applies to both. Solr wouldn't auto-add replicas in either of those
cases (or any other case) to meet the rf specified at create time.

On Tue, Mar 24, 2015 at 2:22 AM, Shai Erera ser...@gmail.com wrote:

 Thanks Anshum,

  About #3, in line with my answer to the previous question, Solr wouldn't
  auto-add a Replica to meet the replication factor when a node goes down.
 

 Just to make sure the answer applies to both these cases:

1. There are two replicas on node1 and node2. Solr won't add a replica
to node1 when node2 goes down.
2. The collection was created with rf=2, Solr creates replicas on node1
and node2. If node2 goes down and a node3 comes up instead, will it be
assigned a replica, or Solr does not do that also?

 In short, is there any scenario where Solr would auto-add replicas (aside
 from running on HDFS) to meet the 'rf' setting, or after the collection has
 been created, ensuring RF is met is my responsibility?

 Shai

 On Tue, Mar 24, 2015 at 10:02 AM, Anshum Gupta ans...@anshumgupta.net
 wrote:

  Hi Shai,
 
  As of now, all replicas for a collection are created to meet the
 specified
  replication factor at the time of collection creation. There's no way to
  defer that until more nodes are up. Your best bet is to have the nodes
  already up before you CREATE the collection or create the collection
 with a
  lower replication factor and then use ADDREPLICA.
 
  About auto-addition of replicas, that's kind of supported when using
 shared
  file system (HDFS) to host the index. It doesn't truly work as per your
  use-case i.e. it doesn't consider the intended replication factor but
 only
  brings up a Replica in case all replicas for a node are down, so that
  SolrCloud continues to be usable. It also doesn't auto-remove replica
 when
  the old node comes back up. You can read more about this in the
  Automatically Add Replicas in SolrCloud section here:
  https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
 
   About #3, in line with my answer to the previous question, Solr wouldn't
  auto-add a Replica to meet the replication factor when a node goes down.
 
 
  On Tue, Mar 24, 2015 at 12:36 AM, Shai Erera ser...@gmail.com wrote:
 
   Hi
  
   I saw that we can define maxShardsPerNode when creating a collection,
  but I
   don't see that I can set something similar for replicas. My scenario is
  the
   following:
  
  - I setup one Solr node
  - Create collection with numShards=1 and replicationFactor=2
  - Hopefully, one replica is created on that node
  - When I bring up the second Solr node, the second replica will be
  created
  
   What I see is that both replicas are created on the first node, and
 when
  I
   bring up the second Solr node, none of the replicas are moved.
  
   I know that I can move one replica by calling ADDREPLICA on node2,
 then
   DELETEREPLICA on node1, but I was wondering if there's an automated way
  to
   do that.
  
   I've also considered creating the collection with replicationFactor=1
 and
   when the second node comes up it will look for shards w/ one replica
  only,
   and assign themselves as the replica. But it means I have to own that
  piece
   of logic, where if Solr already does that, that's better.
  
   Also, from what I understand, if I create a collection w/ rf=2 and
 there
   are two nodes, then each node is assigned a replica. If one of the
 nodes
   comes down, and a 3rd node comes up, it will be assigned a replica --
 is
   that correct?
  
   Another related question, if there are two replicas on node1 and node2,
  and
   node2 goes down -- will node1 be assigned the second replica as well?
  
   If this is explained somewhere, I'd appreciate if you can give me a
   pointer.
  
   Shai
  
 
 
 
  --
  Anshum Gupta
 




-- 
Anshum Gupta


Re: document contained more than 100000 characters

2015-03-24 Thread Shawn Heisey
On 3/23/2015 3:08 AM, Srinivas wrote:
 At present in my project we are using Apache Tika for reading file metadata.
 Whenever we handle large files (containing more than 100000 characters), Tika
 generates the error "file contained more than 100000 characters". So is it
 possible to handle large files using Tika? Please let me know.

This sounds like a Tika problem.  This is a solr mailing list.  You may
find some Tika expertise here, but this is the incorrect place for a
question about Tika.

Solr does use the Tika parser, in the contrib module for the
ExtractingRequestHandler.  I have never heard of such a limitation in
the context of the ExtractingRequestHandler, and I've heard some people
complain about OutOfMemory exceptions when they index 4 gigabyte PDF
files with our extracting handler ... so I am guessing that you are
using Tika in your own software.  If that is correct, you'll need to ask
your question on a Tika mailing list.

Thanks,
Shawn



Re: How to verify a document is indexed by all replicas

2015-03-24 Thread Erick Erickson
You can always issue a *:* query, but it'd have to be at least your
autoSoftCommit interval ago since the soft commit trigger will have
slightly different wall clock times.

But it shouldn't be necessary to wait I don't think. Since the
indexing request doesn't succeed until the docs have been written to
the tlogs, and since the tlogs will be replayed in the event of a
problem your data should be fine. Of course if you're indexing at a
very fast rate and your tlog is huge, it'll take a while

FWIW,
Erick

On Tue, Mar 24, 2015 at 4:59 AM, Shai Erera ser...@gmail.com wrote:
 Hi

 Is there a recommended, preferably fast, way to check that a document is
 indexed by all replicas? I currently do that by issuing a search request to
 each replica, but was wondering if there's a faster way.

 Even better, is there a way to verify all replicas of a shard are
 up-to-date, e.g. by comparing their version or something? By up-to-date
 I mean that they've all processed the same update requests that came
 through.

 If there's a replica lagging behind, I'd like to wait for it to catch up,
 something like a checkpoint(), before I continue sending more updates.

 Shai


Setting up SOLR 5 from an RPM

2015-03-24 Thread Tom Evans
Hi all

We're migrating to SOLR 5 (from 4.8), and our infrastructure guys
would prefer we installed SOLR from an RPM rather than extracting the
tarball where we need it. They are creating the RPM file themselves,
and it installs an init.d script and the equivalent of the tarball to
/opt/solr.

We're having problems running SOLR from the installed files, as SOLR
wants to (I think) extract the WAR file and create various temporary
files below /opt/solr/server.

We currently have this structure:

/data/solr - root directory of our solr instance
/data/solr/{logs,run} - log/run directories
/data/solr/cores - configuration for our cores and solr.in.sh
/opt/solr - the RPM installed solr 5

The user running solr can modify anything under /data/solr, but
nothing under /opt/solr.

Is this sort of configuration supported? Am I missing some variable in
our solr.in.sh that sets where temporary files can be extracted? We
currently set:

SOLR_PID_DIR=/data/solr/run
SOLR_HOME=/data/solr/cores
SOLR_LOGS_DIR=/data/solr/logs


Cheers

Tom


Re: How to remove an Alert

2015-03-24 Thread Shawn Heisey
On 3/23/2015 2:35 PM, jack.met...@hp.com wrote:
 I have a problem with [ ... briefly describe your problem here ... ]
 
   [ ... insert additional info here - keep it short and to the point ... ]
 
 Below are some SPM graphs showing the state of my system.
 Here's the 'Threads' graph:
   https://apps.sematext.com/spm-reports/s/aFUIR1fecb

You've used some kind of boilerplate help request, but forgot to edit it
for your specific problem.

Solr doesn't send alerts, so the subject of your message makes no sense
in a Solr context, and you haven't indicated how it connects with the
SPM graph you linked.

You'll need to ask an actual question and provide relevant details from
your system to support your question.

Thanks,
Shawn



Re: TooManyBasicQueries?

2015-03-24 Thread Ian Rose
Ah yes, right you are.  I had thought that `surround` required a different
endpoint, but I see now that someone is using a surround query.

Many thanks!

On Tue, Mar 24, 2015 at 10:02 AM, Erik Hatcher erik.hatc...@gmail.com
wrote:

 Somehow a surround query is being constructed along the way.  Search your
 logs for “surround” and see if someone is maybe sneaking a q={!surround}…
 in there.  If you’re passing input directly through from your application
 to Solr’s q parameter without any sanitizing or filtering, it’s possible a
 surround query parser could be asked for.


 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com




  On Mar 24, 2015, at 8:55 AM, Ian Rose ianr...@fullstory.com wrote:
 
  Hi Erik -
 
  Sorry, I totally missed your reply.  To the best of my knowledge, we are
  not using any surround queries (have to admit I had never heard of them
  until now).  We use solr.SearchHandler for all of our queries.
 
  Does that answer the question?
 
  Cheers,
  Ian
 
 
  On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher erik.hatc...@gmail.com
  wrote:
 
  It results from a surround query with too many terms.   Says the
 javadoc:
 
  * Exception thrown when {@link BasicQueryFactory} would exceed the limit
  * of query clauses.
 
  I’m curious, are you issuing a large {!surround} query or is it
 expanding
  to hit that limit?
 
 
  —
  Erik Hatcher, Senior Solutions Architect
  http://www.lucidworks.com
 
 
 
 
  On Mar 13, 2015, at 9:44 AM, Ian Rose ianr...@fullstory.com wrote:
 
  I sometimes see the following in my logs:
 
  ERROR org.apache.solr.core.SolrCore  –
  org.apache.lucene.queryparser.surround.query.TooManyBasicQueries:
  Exceeded
  maximum of 1000 basic queries.
 
 
  What does this mean?  Does this mean that we have issued a query with
 too
  many terms?  Or that the number of concurrent queries running on the
  server
  is too high?
 
  Also, is this a builtin limit or something set in a config file?
 
  Thanks!
  - Ian
 
 




Re: Solr replicas going in recovering state during heavy indexing

2015-03-24 Thread Erick Erickson
What do the Solr logs show happens on those servers when they go into
recovery? What have you tried to do to diagnose the problem? You might
review: http://wiki.apache.org/solr/UsingMailingLists

The first thing I'd check, though, is whether you're seeing large GC
pauses that exceed the Zookeeper timeout, thus ZK thinks the replica
is down and puts it into recovery. YOu can get this info by tracking
the GC cycles as here:
https://lucidworks.com/blog/garbage-collection-bootcamp-1-0/, the
section getting a view into garbage collection
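
As a hedged sketch, GC logging can be turned on with standard HotSpot flags (the log path is a placeholder):

  -verbose:gc -Xloggc:/var/log/solr/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime

and if long pauses are confirmed, the ZooKeeper session timeout can be raised, e.g. via zkClientTimeout in solr.xml (or the corresponding system property, depending on your version):

  <int name="zkClientTimeout">30000</int>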

Best,
Erick

On Tue, Mar 24, 2015 at 5:57 AM, Gopal Jee gopal@myntra.com wrote:
 Hi
 We have a large SolrCloud cluster. We have observed that during heavy
 indexing, a large number of replicas go into a recovering or down state.
 What could be the possible reason and/or fix for this issue?

 Gopal


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Erick Erickson
Well, there's a ticket out there for thousands of collections on a
single machine, although this is way out there. I often see 10-20
small cores on a 4-8 core machine if they're reasonably small (a few
million docs). I've seen a single replica strain a 128G, 16-core
machine if it has 300M docs...

Which is a way of saying ya gotta test with your data/query mix.

Wish there was a better answer.
Erick

On Tue, Mar 24, 2015 at 6:02 AM, Ian Rose ianr...@fullstory.com wrote:
 Hi all -

 I'm sure this topic has been covered before but I was unable to find any
 clear references online or in the mailing list.

 Are there any rules of thumb for how many cores (aka shards, since I am
 using SolrCloud) is too many for one machine?  I realize there is no one
 answer (depends on size of the machine, etc.) so I'm just looking for a
 rough idea.  Something like the following would be very useful:

 * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
 server without any problems.
 * I have never heard of anyone successfully running X cores/shards on a
 single machine, even if you throw a lot of hardware at it.

 Thanks!
 - Ian


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
Shards per collection, or across all collections on the node?

It will all depend on:

1. Your ingestion/indexing rate. High, medium or low?
2. Your query access pattern. Note that a typical query fans out to all
shards, so having more shards than CPU cores means less parallelism.
3. How many collections you will have per node.

In short, it depends on what you want to achieve, not some limit of Solr
per se.

Why are you even sharding the node anyway? Why not just run with a single
shard per node, and do sharding by having separate nodes, to maximize
parallel processing and availability?

Also be careful to be clear about using the Solr term shard (a slice,
across all replica nodes) as distinct from the Elasticsearch term shard
(a single slice of an index for a single replica, analogous to a Solr
core.)


-- Jack Krupansky

On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com wrote:

 Hi all -

 I'm sure this topic has been covered before but I was unable to find any
 clear references online or in the mailing list.

 Are there any rules of thumb for how many cores (aka shards, since I am
 using SolrCloud) is too many for one machine?  I realize there is no one
 answer (depends on size of the machine, etc.) so I'm just looking for a
 rough idea.  Something like the following would be very useful:

 * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
 server without any problems.
 * I have never heard of anyone successfully running X cores/shards on a
 single machine, even if you throw a lot of hardware at it.

 Thanks!
 - Ian



Re: maxReplicasPerNode

2015-03-24 Thread Shawn Heisey
On 3/24/2015 3:22 AM, Shai Erera wrote:
 If this is explained somewhere, I'd appreciate if you can give me a
 pointer.

I don't think it's explained anywhere, so that's a lack in the
documentation.

One problem with automatic replica addition in response to cluster
problems is that there is no mechanism (currently, at least) to indicate
that a node disappearance is intentional and temporary, and no way to
configure a minimum time interval before taking automatic action.  It
would be necessary to have these mechanisms before any kind of automatic
repair ability could be implemented.

Thanks,
Shawn



Re: Unable to setup solr cloud with multiple collections.

2015-03-24 Thread Erick Erickson
Why are you doing this in the first place? SolrCloud and master/slave
are fundamentally different. When running in SolrCloud mode, there is
no need whatsoever to configure replication as per the Wiki link
you've outlined above, that's for the older style master/slave setups.

Just change it back and watch the magic would be my advice.

So if you'd tell us why you thought this was necessary, perhaps we can
suggest alternatives because from a quick glance it looks unnecessary,
and in fact harmful.

Best,
Erick

On Mon, Mar 23, 2015 at 10:08 PM, sthita sthit...@gmail.com wrote:
 I have newly created a new collection and activated the replication for 4
 nodes(Including masters).
 After doing the config changes as suggested on
 http://wiki.apache.org/solr/SolrReplication
 http://wiki.apache.org/solr/SolrReplication
 The nodes of the newly created collections are down on solr cloud. We are
 not able to add or remove any document on newly created core i.e dict_cn in
 our case. All the configuration files  look ok on solr cloud

 http://lucene.472066.n3.nabble.com/file/n4194833/solr_issue.png

 This is my replication changes on solrconfig.xml

 <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy">
   <lst name="master">
     <str name="replicateAfter">commit</str>
     <str name="replicateAfter">startup</str>
     <str name="confFiles">solrconfig_cn.xml,schema_cn.xml</str>
   </lst>
   <lst name="slave">
     <str name="masterUrl">http://mail:8983/solr/dict_cn</str>
   </lst>
 </requestHandler>

 Note: I am using solr 4.4.0, zookeeper-3.4.5

 Can anyone help me on this ?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Unable-to-setup-solr-cloud-with-multiple-collections-tp4194833.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Issues to create new core

2015-03-24 Thread Erick Erickson
Tell us all the steps you went through to do this. Note that you
should _not_ be using the core admin in the admin UI if you're working
with SolrCloud.

For stand-alone Solr, the message above is probably caused by your not
having a conf directory set up already. The core admin UI expects that
you have a pre-existing directory with a conf directory that
contains solrconfig.xml, schema.xml, and all the rest of the
configuration files. You can specify this via some of the parameters
on the admin UI screen (see instanceDir and dataDir). Each core must
be in a separate directory or Bad Things Happen.
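
As a hedged sketch for Solr 5.x on Windows (the configset name is an assumption and the exact script options can differ between 5.x releases), the bundled script creates the core and copies a configset's conf directory for you:

  bin\solr.cmd create_core -c datos -d basic_configs

Alternatively, copy server\solr\configsets\basic_configs\conf into C:\solr\server\solr\datos\conf before using the core admin UI.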

HTH,
Erick

On Tue, Mar 24, 2015 at 7:01 AM, Alejandro Jesus Mariño Molerio
ajmar...@estudiantes.uci.cu wrote:
 Dear Solr Community:
 I just began to work with Solr. I choose Solr 5.0, but when I try to create a 
 new core with GUI, show the following error:  Error CREATEing SolrCore 
 'datos': Unable to create core [datos] Caused by: Can't find resource 
 'solrconfig.xml' in classpath or 'C:\solr\server\solr\datos\conf'. My 
 question is simple, How can fix this problem?.

 Thanks in advance for your consideration.
 Alejandro.


Re: Auto naming replicas via ADDREPLICA

2015-03-24 Thread Shawn Heisey
On 3/23/2015 10:48 AM, Shai Erera wrote:
 The 'name' param isn't set when I send the URL request (and it's also not
 specified in the reference guide), but only when I add the replica using
 SolrJ. I then tweaked my code to do the following:
 
   final CollectionAdminRequest.AddReplica addReplicaRequest = new
 CollectionAdminRequest.AddReplica() {
 @Override
 public SolrParams getParams() {
   final ModifiableSolrParams params = (ModifiableSolrParams)
 super.getParams();
   params.remove(CoreAdminParams.NAME);
   return params;
 }
   };
 
 And voila, the core is now also named mycollection_shard1_replica2, and I'm
 even able to add as many replicas as I want on this node (where before it
 failed since 'mycollection' already existed).
 
 The 'name' parameter is added by
 CollectionSpecificAdminRequest.getParams(). So how would you suggest to fix
 it:
 
1. Remove it in AddReplica.getParams() -- replicas will always be
auto-named. It makes sense as users didn't have control over it before, and
maybe they shouldn't.
2. Add a setCoreName to AddReplica request -- this would be nice if
someone wanted to control the name of the added replica, but otherwise
should not be included in the request
 
 Or maybe we fix the bug by doing #1 and consider #2 as a new feature allow
 naming replicas?

Doing both sounds like a good solution to me.  I'm trying to think of
some cautionary text for the javadoc on the new method, but I'm not
really sure what it should say.  Perhaps something like when this
method is not used, the new core will receive a name like
collection_shardN_replicaN, be aware that if you override it,
understanding the collection layout may be more difficult.

I'm hoping Mark and/or Yonik (or someone else, if they know) can comment
about why the AddReplica code had that behavior and whether this is a
good idea in the larger SolrCloud environment.

Thanks,
Shawn



Re: How to verify a document is indexed by all replicas

2015-03-24 Thread Shai Erera
Thanks Erick,

When a replica is down, no updates are sent to it. When it comes back up,
it discovers that it needs to catch-up with the leader. If there are many
events it falls back to index replication (slower). During this period of
time, is the replica considered ACTIVE or RECOVERING?

And, can I assume that at any given moment (aside from ZK connection
timeouts etc.) when I check the replicas' state, all the ones that report
ACTIVE are in sync with each other?

Shai

On Tue, Mar 24, 2015 at 5:04 PM, Erick Erickson erickerick...@gmail.com
wrote:

 You can always issue a *:* query, but it'd have to be at least your
 autoSoftCommit interval ago since the soft commit trigger will have
 slightly different wall clock times.

 But it shouldn't be necessary to wait I don't think. Since the
 indexing request doesn't succeed until the docs have been written to
 the tlogs, and since the tlogs will be replayed in the event of a
 problem your data should be fine. Of course if you're indexing at a
 very fast rate and your tlog is huge, it'll take a while

 FWIW,
 Erick

 On Tue, Mar 24, 2015 at 4:59 AM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  Is there a recommended, preferably fast, way to check that a document is
  indexed by all replicas? I currently do that by issuing a search request
 to
  each replica, but was wondering if there's a faster way.
 
  Even better, is there a way to verify all replicas of a shard are
  up-to-date, e.g. by comparing their version or something? By
 up-to-date
  I mean that they've all processed the same update requests that came
  through.
 
  If there's a replica lagging behind, I'd like to wait for it to catch up,
  something like a checkpoint(), before I continue sending more updates.
 
  Shai



Re: Unable to setup solr cloud with multiple collections.

2015-03-24 Thread sthita
Thanks Erick for your reply.
I am trying to create a new core, i.e. dict_cn, which is totally different in
terms of index data, configs, etc. from the existing core abc.
The core is created successfully on my master (i.e. mail) and I can run Solr
queries on this newly created core.
All the config files (schema.xml and solrconfig.xml) are on the mail server and
ZooKeeper helps me share all the config files with the other collections.
I did a similar setup for the other collection, so that the newly created core
should be available to all the collections, but it is still showing as down.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unable-to-setup-solr-cloud-with-multiple-collections-tp4194833p4195078.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
First off thanks everyone for the very useful replies thus far.

Shawn - thanks for the list of items to check.  #1 and #2 should be fine
for us and I'll check our ulimit for #3.

To add a bit of clarification, we are indeed using SolrCloud.  Our current
setup is to create a new collection for each customer.  For now we allow
SolrCloud to decide for itself where to locate the initial shard(s) but in
time we expect to refine this such that our system will automatically
choose the least loaded nodes according to some metric(s).

Having more than one business entity controlling the configuration of a
 single (Solr) server is a recipe for disaster. Solr works well if there is
 an architect for the system.


Jack, can you explain a bit what you mean here?  It looks like Toke caught
your meaning but I'm afraid it missed me.  What do you mean by business
entity?  Is your concern that with automatic creation of collections they
will be distributed willy-nilly across the cluster, leading to uneven load
across nodes?  If it is relevant, the schema and solrconfig are controlled
entirely by me and is the same for all collections.  Thus theoretically we
could actually just use one single collection for all of our customers
(adding a 'customer:whatever' type fq to all queries) but since we never
need to query across customers it seemed more performant (as well as safer
- less chance of accidentally leaking data across customers) to use
separate collections.
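
(For reference, the single-collection alternative described above would amount to adding a filter such as fq=customer_id:12345 to every query, where customer_id is a placeholder field name; it keeps the core count down but relies on the application never omitting that filter.)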

Better to give each tenant a separate Solr instance that you spin up and
 spin down based on demand.


Regarding this, if by tenant you mean customer, this is not viable for us
from a cost perspective.  As I mentioned initially, many of our customers
are very small so dedicating an entire machine to each of them would not be
economical (or efficient).  Or perhaps I am not understanding what your
definition of tenant is?

Cheers,
Ian



On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 Jack Krupansky [jack.krupan...@gmail.com] wrote:
  I'm sure that I am quite unqualified to describe his hypothetical setup.
 I
  mean, he's the one using the term multi-tenancy, so it's for him to be
  clear.

 It was my understanding that Ian used them interchangeably, but of course
 Ian it the only one that knows.

  For me, it's a question of who has control over the config and schema and
  collection creation. Having more than one business entity controlling the
  configuration of a single (Solr) server is a recipe for disaster.

 Thank you. Now your post makes a lot more sense. I will not argue against
 that.

 - Toke Eskildsen



Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Damien Kamerman
From my experience on a high-end server (256GB memory, 40 core CPU) testing
collection numbers with one shard and two replicas, the maximum that would
work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps
half of that), depending on your startup-time requirements. (Though I have
settled on 6,000 collection maximum with some patching. See SOLR-7191). You
could create multiple clouds after that, and choose the cloud least used to
create your collection.

Regarding memory usage I'd pencil in 6MB overhead (no docs) of Java heap per
collection.

On 25 March 2015 at 13:46, Ian Rose ianr...@fullstory.com wrote:

 First off thanks everyone for the very useful replies thus far.

 Shawn - thanks for the list of items to check.  #1 and #2 should be fine
 for us and I'll check our ulimit for #3.

 To add a bit of clarification, we are indeed using SolrCloud.  Our current
 setup is to create a new collection for each customer.  For now we allow
 SolrCloud to decide for itself where to locate the initial shard(s) but in
 time we expect to refine this such that our system will automatically
 choose the least loaded nodes according to some metric(s).

 Having more than one business entity controlling the configuration of a
  single (Solr) server is a recipe for disaster. Solr works well if there
 is
  an architect for the system.


 Jack, can you explain a bit what you mean here?  It looks like Toke caught
 your meaning but I'm afraid it missed me.  What do you mean by business
 entity?  Is your concern that with automatic creation of collections they
 will be distributed willy-nilly across the cluster, leading to uneven load
 across nodes?  If it is relevant, the schema and solrconfig are controlled
 entirely by me and is the same for all collections.  Thus theoretically we
 could actually just use one single collection for all of our customers
 (adding a 'customer:whatever' type fq to all queries) but since we never
 need to query across customers it seemed more performant (as well as safer
 - less chance of accidentally leaking data across customers) to use
 separate collections.

 Better to give each tenant a separate Solr instance that you spin up and
  spin down based on demand.


 Regarding this, if by tenant you mean customer, this is not viable for us
 from a cost perspective.  As I mentioned initially, many of our customers
 are very small so dedicating an entire machine to each of them would not be
 economical (or efficient).  Or perhaps I am not understanding what your
 definition of tenant is?

 Cheers,
 Ian



 On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen t...@statsbiblioteket.dk
 wrote:

  Jack Krupansky [jack.krupan...@gmail.com] wrote:
   I'm sure that I am quite unqualified to describe his hypothetical
 setup.
  I
   mean, he's the one using the term multi-tenancy, so it's for him to be
   clear.
 
  It was my understanding that Ian used them interchangeably, but of course
  Ian it the only one that knows.
 
   For me, it's a question of who has control over the config and schema
 and
   collection creation. Having more than one business entity controlling
 the
   configuration of a single (Solr) server is a recipe for disaster.
 
  Thank you. Now your post makes a lot more sense. I will not argue against
  that.
 
  - Toke Eskildsen
 




-- 
Damien Kamerman


Re: Using G1 with Apache Solr

2015-03-24 Thread Shawn Heisey
On 3/24/2015 3:48 PM, Kamran Khawaja wrote:
 I'm running Solr 4.7.2 with Java 7u75 with the following JVM params:
 
 -verbose:gc 
 -XX:+PrintGCDateStamps 
 -XX:+PrintGCDetails 
 -XX:+PrintAdaptiveSizePolicy 
 -XX:+PrintReferenceGC 
 -Xmx3072m 
 -Xms3072m 
 -XX:+UseG1GC 
 -XX:+UseLargePages 
 -XX:+AggressiveOpts 
 -XX:+ParallelRefProcEnabled 
 -XX:G1HeapRegionSize=8m 
 -XX:InitiatingHeapOccupancyPercent=35 
 
 
 What I'm currently seeing is that many of the gc pauses are under an
 acceptable 0.25 seconds but seeing way too many full GCs with an average
 stop time of 3.2 seconds.
 
 You can find the gc logs
 here: https://www.dropbox.com/s/v04b336v2k5l05e/g1_gc_7u75.log.gz?dl=0
 
 I initially tested without specifying the HeapRegionSize but that
 resulted in the humongous message in the gc logs and a ton of full gc
 pauses.

This is similar to the settings I've been working on that I've
documented on my wiki page, with better results than you are seeing, and
a larger heap than you have configured:

https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector

You have one additional option that I don't --
InitiatingHeapOccupancyPercent.  I would suggest running without that
option to see how it affects your GC times.

I'm curious what OS you're running under, whether the OS and Java are
64-bit, and whether you have actually enabled huge pages in your
operating system.  If it's Linux and you have enabled huge pages, have
you turned off transparent huge pages as documented by Oracle:

https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
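
As a hedged sketch on Linux (the sysfs path can differ per distribution, e.g. redhat_transparent_hugepage on some RHEL kernels):

  cat /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/enabled

The first command shows the current setting; the second disables THP until the next reboot.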

On my servers, I do *not* have huge pages configured in the operating
system, so the UseLargePages java option isn't doing anything.

One final thing ... Oracle developers have claimed that Java 8u40 has
some major improvements to the G1 collector, particularly for programs
that allocate very large objects.  Can you try 8u40?

Thanks,
Shawn



Re: Using G1 with Apache Solr

2015-03-24 Thread Shawn Heisey
On 3/24/2015 9:52 PM, Shawn Heisey wrote:
 On 3/24/2015 3:48 PM, Kamran Khawaja wrote:
 I'm running Solr 4.7.2 with Java 7u75 with the following JVM params:

I really got my wires crossed.  Kamran sent his message to the
hostpot-gc-use mailing list, not the solr-user list!

Thanks,
Shawn



Re: Solr 5.0 -- IllegalStateException: unexpected docvalues type NONE on result grouping

2015-03-24 Thread Shawn Heisey

On 3/12/2015 3:36 PM, Alexandre Rafalovitch wrote:

Manual optimize is no longer needed for modern Solr. It does great
optimization automatically. The only reason I recommended it here is
to make sure that all segments are brought up to the latest version
and the deleted documents are purged. That's something that also would
happen automatically eventually, but eventually was not an option
for you.

I am glad this helped. I am not 100% sure if you have to do it on each
shard in SolrCloud mode, but I suspect so.


In SolrCloud, whenever you send an optimize command to any shard replica 
in a collection, the entire collection will be optimized.  SolrCloud 
will do the optimization sequentially, not in parallel.  There is 
currently no way to optimize only one shard replica, and as far as I 
know, there is no way to ask for a parallel optimization.
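
For illustration, a minimal SolrJ sketch of triggering that collection-wide
optimize explicitly (assuming the 5.x CloudSolrClient API; the ZooKeeper
address and collection name are placeholders):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;

    public class OptimizeCollection {
      public static void main(String[] args) throws Exception {
        // connect via ZooKeeper so the request is routed to the collection
        CloudSolrClient client = new CloudSolrClient("zkhost1:2181");
        client.setDefaultCollection("mycollection");
        // waitFlush=true, waitSearcher=true, maxSegments=1
        client.optimize(true, true, 1);
        client.close();
      }
    }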


Alexandre's comments about the necessity of optimization (whether it's 
SolrCloud or not) is spot on.  The only time that optimization should be 
done on a modern Solr index is when you have a lot of deleted documents 
and want to clean those up, either to reclaim disk space or remove them 
from the relevancy calculation.


Most people do see a performance boost on an optimized index compared to 
a non-optimized index, but with a modern Solr install, you might 
actually see better performance on a multi-segment index when the 
indexing rate is high, because Lucene is moving to a model where there 
are per-segment caches that are not invalidated at commit time, only at 
merge time.


Thanks,
Shawn



Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
Don't confuse customers and tenants.

-- Jack Krupansky

On Tue, Mar 24, 2015 at 2:24 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Sorry Jack. That doesn't scale when you have millions of customers. And
 these are good problems to have!

 On Tue, Mar 24, 2015 at 10:47 AM, Jack Krupansky jack.krupan...@gmail.com
 
 wrote:

  Multi-tenancy is a bad idea for a single solr Cluster. Better to give
 each
  tenant a separate Solr instance that you spin up and spin down based on
  demand.
 
  Think about it: If there are a small number of tenants, just giving each
  their own machine will be cheaper than the effort spent managing a
  multi-tenant cluster, and if there are a large number of tenants or even a
  moderate number of large tenants, you can't expect them to all run
  reasonably on a relatively small cluster. Think about scalability.
 
 
  -- Jack Krupansky
 
  On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose ianr...@fullstory.com wrote:
 
   Let me give a bit of background.  Our Solr cluster is multi-tenant,
 where
   we use one collection for each of our customers.  In many cases, these
   customers are very tiny, so their collection consists of just a single
   shard on a single Solr node.  In fact, a non-trivial number of them are
   totally empty (e.g. trial customers that never did anything with their
   trial account).  However there are also some customers that are larger,
   requiring their collection to be sharded.  Our strategy is to try to
 keep
   the total documents in any one shard under 20 million (honestly not
 sure
   where my coworker got that number from - I am open to alternatives but
 I
   realize this is heavily app-specific).
  
   So my original question is not related to indexing or query traffic,
 but
   just the sheer number of cores.  For example, if I have 10 active cores
  on
   a machine and everything is working fine, should I expect that
 everything
   will still work fine if I add 10 nearly-idle cores to that machine?
 What
   about 100?  1000?  I figure the overhead of each core is probably
 fairly
   low but at some point starts to matter.
  
   Does that make sense?
   - Ian
  
  
   On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky 
  jack.krupan...@gmail.com
   
   wrote:
  
Shards per collection, or across all collections on the node?
   
It will all depend on:
   
1. Your ingestion/indexing rate. High, medium or low?
2. Your query access pattern. Note that a typical query fans out to
 all
shards, so having more shards than CPU cores means less parallelism.
3. How many collections you will have per node.
   
In short, it depends on what you want to achieve, not some limit of
  Solr
per se.
   
Why are you even sharding the node anyway? Why not just run with a
  single
shard per node, and do sharding by having separate nodes, to maximize
parallel processing and availability?
   
Also be careful to be clear about using the Solr term shard (a
 slice,
across all replica nodes) as distinct from the Elasticsearch term
  shard
(a single slice of an index for a single replica, analogous to a Solr
core.)
   
   
-- Jack Krupansky
   
On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com
  wrote:
   
 Hi all -

 I'm sure this topic has been covered before but I was unable to
 find
   any
 clear references online or in the mailing list.

 Are there any rules of thumb for how many cores (aka shards, since
 I
  am
 using SolrCloud) is too many for one machine?  I realize there is
  no
one
 answer (depends on size of the machine, etc.) so I'm just looking
  for a
 rough idea.  Something like the following would be very useful:

 * People commonly run up to X cores/shards on a mid-sized (4 or 8
  core)
 server without any problems.
 * I have never heard of anyone successfully running X cores/shards
  on a
 single machine, even if you throw a lot of hardware at it.

 Thanks!
 - Ian

   
  
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: How to verify a document is indexed by all replicas

2015-03-24 Thread Shalin Shekhar Mangar
Hi Shai,

To your original question on how to know if a document has been indexed at
all replicas -- You can add a min_rf=true parameter to your indexing
request and then Solr will add information to the response about how many
replicas gave an ack' to the leader. So if the returned number is equal to
the number of replicas, you can be sure that the doc has been indexed
everywhere.
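
For illustration, a rough SolrJ sketch of that min_rf check (the parameter name
and the getMinAchievedReplicationFactor() helper are assumptions based on the
4.10+/5.x CloudSolrClient API; verify against your SolrJ version):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.util.NamedList;

    public class MinRfCheck {
      public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zkhost1:2181");
        client.setDefaultCollection("mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setParam("min_rf", "2"); // ask the leader to report how many replicas acked

        NamedList<Object> rsp = client.request(req);
        int achieved = client.getMinAchievedReplicationFactor("mycollection", rsp);
        if (achieved < 2) {
          System.out.println("Only " + achieved + " replica(s) acknowledged the update");
        }
        client.close();
      }
    }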

More comments inline:

On Tue, Mar 24, 2015 at 8:18 AM, Shai Erera ser...@gmail.com wrote:

 Thanks Erick,

 When a replica is down, no updates are sent to it. When it comes back up,
 it discovers that it needs to catch-up with the leader. If there are many
 events it falls back to index replication (slower). During this period of
 time, is the replica considered ACTIVE or RECOVERING?


It is marked as recovering.


 And, can I assume that at any given moment (aside from ZK connection
 timeouts etc.) when I check the replicas' state, all the ones that report
 ACTIVE are in sync with each other?


Yes, 'active' replicas should be in sync but autoCommits can cause
inconsistency between replicas as to what is visible to searchers (even if
all replicas have indexed the same data). Also, checking the state of the
replica is not enough, one should always check for the state=active and
live-ness of the replica i.e. the node is marked live under /live_nodes in
ZK.


 Shai

 On Tue, Mar 24, 2015 at 5:04 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  You can always issue a *:* query, but it'd have to be at least your
  autoSoftCommit interval ago since the soft commit trigger will have
  slightly different wall clock times.
 
  But it shouldn't be necessary to wait I don't think. Since the
  indexing request doesn't succeed until the docs have been written to
  the tlogs, and since the tlogs will be replayed in the event of a
  problem your data should be fine. Of course if you're indexing at a
  very fast rate and your tlog is huge, it'll take a while
 
  FWIW,
  Erick
 
  On Tue, Mar 24, 2015 at 4:59 AM, Shai Erera ser...@gmail.com wrote:
   Hi
  
   Is there a recommended, preferably fast, way to check that a document
 is
   indexed by all replicas? I currently do that by issuing a search
 request
  to
   each replica, but was wondering if there's a faster way.
  
   Even better, is there a way to verify all replicas of a shard are
   up-to-date, e.g. by comparing their version or something? By
  up-to-date
   I mean that they've all processed the same update requests that came
   through.
  
   If there's a replica lagging behind, I'd like to wait for it to catch
 up,
   something like a checkpoint(), before I continue sending more updates.
  
   Shai
 




-- 
Regards,
Shalin Shekhar Mangar.


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
Let me give a bit of background.  Our Solr cluster is multi-tenant, where
we use one collection for each of our customers.  In many cases, these
customers are very tiny, so their collection consists of just a single
shard on a single Solr node.  In fact, a non-trivial number of them are
totally empty (e.g. trial customers that never did anything with their
trial account).  However there are also some customers that are larger,
requiring their collection to be sharded.  Our strategy is to try to keep
the total documents in any one shard under 20 million (honestly not sure
where my coworker got that number from - I am open to alternatives but I
realize this is heavily app-specific).

So my original question is not related to indexing or query traffic, but
just the sheer number of cores.  For example, if I have 10 active cores on
a machine and everything is working fine, should I expect that everything
will still work fine if I add 10 nearly-idle cores to that machine?  What
about 100?  1000?  I figure the overhead of each core is probably fairly
low but at some point starts to matter.

Does that make sense?
- Ian


On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Shards per collection, or across all collections on the node?

 It will all depend on:

 1. Your ingestion/indexing rate. High, medium or low?
 2. Your query access pattern. Note that a typical query fans out to all
 shards, so having more shards than CPU cores means less parallelism.
 3. How many collections you will have per node.

 In short, it depends on what you want to achieve, not some limit of Solr
 per se.

 Why are you even sharding the node anyway? Why not just run with a single
 shard per node, and do sharding by having separate nodes, to maximize
 parallel processing and availability?

 Also be careful to be clear about using the Solr term shard (a slice,
 across all replica nodes) as distinct from the Elasticsearch term shard
 (a single slice of an index for a single replica, analogous to a Solr
 core.)


 -- Jack Krupansky

 On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com wrote:

  Hi all -
 
  I'm sure this topic has been covered before but I was unable to find any
  clear references online or in the mailing list.
 
  Are there any rules of thumb for how many cores (aka shards, since I am
  using SolrCloud) is too many for one machine?  I realize there is no
 one
  answer (depends on size of the machine, etc.) so I'm just looking for a
  rough idea.  Something like the following would be very useful:
 
  * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
  server without any problems.
  * I have never heard of anyone successfully running X cores/shards on a
  single machine, even if you throw a lot of hardware at it.
 
  Thanks!
  - Ian
 



Re: Auto naming replicas via ADDREPLICA

2015-03-24 Thread Shai Erera
I use vanilla 5.0. I intended to fix it myself, but if you want to go
ahead, I'd be happy to review the patch.

Shai

On Tue, Mar 24, 2015 at 6:11 PM, Anshum Gupta ans...@anshumgupta.net
wrote:

  It certainly looks like a bug and the name shouldn't be added to the
 request automatically.
 Can you confirm what version of Solr are you using?

 If it turns out to be a bug in 5x/trunk I'll create a JIRA and fix it to
 both #1 and #2.

 On Mon, Mar 23, 2015 at 9:48 AM, Shai Erera ser...@gmail.com wrote:

  Shawn, that was a great tip!
 
  When I tried the URL, the core was named as expected
  (mycollection_shard1_replica2). I then compared the URLs as reported in
 the
  logs, and I believe I found the bug:
 
  SolrJ: [admin] webapp=null path=/admin/collections params={shard=shard1
  *name=mycollection*action=ADDREPLICA*collection=mycollection*
  wt=javabinversion=2}
 
  The 'name' param isn't set when I send the URL request (and it's also not
  specified in the reference guide), but only when I add the replica using
  SolrJ. I then tweaked my code to do the following:
 
final CollectionAdminRequest.AddReplica addReplicaRequest = new
  CollectionAdminRequest.AddReplica() {
  @Override
  public SolrParams getParams() {
final ModifiableSolrParams params = (ModifiableSolrParams)
  super.getParams();
params.remove(CoreAdminParams.NAME);
return params;
  }
};
 
  And voila, the core is now also named mycollection_shard1_replica2, and
 I'm
  even able to add as many replicas as I want on this node (where before it
  failed since 'mycollection' already existed).
 
  The 'name' parameter is added by
  CollectionSpecificAdminRequest.getParams(). So how would you suggest to
 fix
  it:
 
 1. Remove it in AddReplica.getParams() -- replicas will always be
 auto-named. It makes sense as users didn't have control over it
 before,
  and
 maybe they shouldn't.
 2. Add a setCoreName to AddReplica request -- this would be nice if
 someone wanted to control the name of the added replica, but otherwise
 should not be included in the request
 
  Or maybe we fix the bug by doing #1 and consider #2 as a new feature
 allow
  naming replicas?
 
  Shai
 
 
  On Mon, Mar 23, 2015 at 6:14 PM, Shawn Heisey apa...@elyograg.org
 wrote:
 
   On 3/23/2015 9:27 AM, Shai Erera wrote:
I have a Solr cluster started (all programmatically) with one Solr
  node,
one collection and one shard. I set the replicationFactor to 1. The
  name
   of
the result core was set to mycollection_shard1_replica1.
   
I then start a second Solr node and issue an ADDREPLICA command as
described in the reference guide, using following code:
   
  final CollectionAdminRequest.AddReplica addReplicaRequest = new
CollectionAdminRequest.AddReplica();
  addReplicaRequest.setCollectionName(mycollection);
  addReplicaRequest.setShardName(shard1);
  final CollectionAdminResponse response =
addReplicaRequest.process(solrClient);
   
The replica is added under a core named mycollection and not e.g.
mycollection_shard1_replica2.
  
   I'd call that a bug.
  
BTW, the example in the reference guide shows that issuing the
 request:
   
   
  
 
  http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard2&node=192.167.1.2:8983_solr
   
Results in
   
response
  lst name=responseHeader
int name=status0/int
int name=QTime3764/int
  /lst
  lst name=success
lst
  lst name=responseHeader
int name=status0/int
int name=QTime3450/int
  /lst
*  str name=coretest2_shard2_replica4/str*
  
   Did you try out a URL like that to see whether it also results in the
   misnamed core, or if it behaves correctly as the reference guide
  indicates?
  
   If the URL behaves correctly, I'd be curious what Solr logs for the URL
   request and the SolrJ request.
  
   Thanks,
   Shawn
  
  
 



 --
 Anshum Gupta



Re: Auto naming replicas via ADDREPLICA

2015-03-24 Thread Anshum Gupta
It certainly looks like a bug and the name shouldn't be added to the
request automatically.
Can you confirm what version of Solr are you using?

If it turns out to be a bug in 5x/trunk I'll create a JIRA and fix it to
both #1 and #2.

On Mon, Mar 23, 2015 at 9:48 AM, Shai Erera ser...@gmail.com wrote:

 Shawn, that was a great tip!

 When I tried the URL, the core was named as expected
 (mycollection_shard1_replica2). I then compared the URLs as reported in the
 logs, and I believe I found the bug:

 SolrJ: [admin] webapp=null path=/admin/collections params={shard=shard1
 *name=mycollection*action=ADDREPLICA*collection=mycollection*
 wt=javabinversion=2}

 The 'name' param isn't set when I send the URL request (and it's also not
 specified in the reference guide), but only when I add the replica using
 SolrJ. I then tweaked my code to do the following:

   final CollectionAdminRequest.AddReplica addReplicaRequest = new
 CollectionAdminRequest.AddReplica() {
 @Override
 public SolrParams getParams() {
   final ModifiableSolrParams params = (ModifiableSolrParams)
 super.getParams();
   params.remove(CoreAdminParams.NAME);
   return params;
 }
   };

 And voila, the core is now also named mycollection_shard1_replica2, and I'm
 even able to add as many replicas as I want on this node (where before it
 failed since 'mycollection' already existed).

 The 'name' parameter is added by
 CollectionSpecificAdminRequest.getParams(). So how would you suggest to fix
 it:

1. Remove it in AddReplica.getParams() -- replicas will always be
auto-named. It makes sense as users didn't have control over it before,
 and
maybe they shouldn't.
2. Add a setCoreName to AddReplica request -- this would be nice if
someone wanted to control the name of the added replica, but otherwise
should not be included in the request

 Or maybe we fix the bug by doing #1 and consider #2 as a new feature allow
 naming replicas?

 Shai


 On Mon, Mar 23, 2015 at 6:14 PM, Shawn Heisey apa...@elyograg.org wrote:

  On 3/23/2015 9:27 AM, Shai Erera wrote:
   I have a Solr cluster started (all programmatically) with one Solr
 node,
   one collection and one shard. I set the replicationFactor to 1. The
 name
  of
   the result core was set to mycollection_shard1_replica1.
  
   I then start a second Solr node and issue an ADDREPLICA command as
   described in the reference guide, using following code:
  
 final CollectionAdminRequest.AddReplica addReplicaRequest = new
   CollectionAdminRequest.AddReplica();
 addReplicaRequest.setCollectionName(mycollection);
 addReplicaRequest.setShardName(shard1);
 final CollectionAdminResponse response =
   addReplicaRequest.process(solrClient);
  
   The replica is added under a core named mycollection and not e.g.
   mycollection_shard1_replica2.
 
  I'd call that a bug.
 
   BTW, the example in the reference guide shows that issuing the request:
  
  
 
 http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard2&node=192.167.1.2:8983_solr
  
   Results in
  
   response
 lst name=responseHeader
   int name=status0/int
   int name=QTime3764/int
 /lst
 lst name=success
   lst
 lst name=responseHeader
   int name=status0/int
   int name=QTime3450/int
 /lst
   *  str name=coretest2_shard2_replica4/str*
 
  Did you try out a URL like that to see whether it also results in the
  misnamed core, or if it behaves correctly as the reference guide
 indicates?
 
  If the URL behaves correctly, I'd be curious what Solr logs for the URL
  request and the SolrJ request.
 
  Thanks,
  Shawn
 
 




-- 
Anshum Gupta


Re: Auto naming replicas via ADDREPLICA

2015-03-24 Thread Anshum Gupta
Either of them works for me. If you want to get your hands dirty, please go
ahead.
I can review/provide feedback if you need anything there. I'll just create
a JIRA to begin with.

On Tue, Mar 24, 2015 at 9:15 AM, Shai Erera ser...@gmail.com wrote:

 I use vanilla 5.0. I intended to fix it myself, but if you want to go
 ahead, I'd be happy to review the patch.

 Shai

 On Tue, Mar 24, 2015 at 6:11 PM, Anshum Gupta ans...@anshumgupta.net
 wrote:

  It certainly looks like a bug and the name shouldn't be added to the
  request automatically.
  Can you confirm what version of Solr are you using?
 
  If it turns out to be a bug in 5x/trunk I'll create a JIRA and fix it to
  both #1 and #2.
 
  On Mon, Mar 23, 2015 at 9:48 AM, Shai Erera ser...@gmail.com wrote:
 
   Shawn, that was a great tip!
  
   When I tried the URL, the core was named as expected
   (mycollection_shard1_replica2). I then compared the URLs as reported in
  the
   logs, and I believe I found the bug:
  
   SolrJ: [admin] webapp=null path=/admin/collections
 params={shard=shard1
   *name=mycollection*action=ADDREPLICA*collection=mycollection*
   wt=javabinversion=2}
  
   The 'name' param isn't set when I send the URL request (and it's also
 not
   specified in the reference guide), but only when I add the replica
 using
   SolrJ. I then tweaked my code to do the following:
  
 final CollectionAdminRequest.AddReplica addReplicaRequest = new
   CollectionAdminRequest.AddReplica() {
   @Override
   public SolrParams getParams() {
 final ModifiableSolrParams params = (ModifiableSolrParams)
   super.getParams();
 params.remove(CoreAdminParams.NAME);
 return params;
   }
 };
  
   And voila, the core is now also named mycollection_shard1_replica2, and
  I'm
   even able to add as many replicas as I want on this node (where before
 it
   failed since 'mycollection' already existed).
  
   The 'name' parameter is added by
   CollectionSpecificAdminRequest.getParams(). So how would you suggest to
  fix
   it:
  
  1. Remove it in AddReplica.getParams() -- replicas will always be
  auto-named. It makes sense as users didn't have control over it
  before,
   and
  maybe they shouldn't.
  2. Add a setCoreName to AddReplica request -- this would be nice if
  someone wanted to control the name of the added replica, but
 otherwise
  should not be included in the request
  
   Or maybe we fix the bug by doing #1 and consider #2 as a new feature
  allow
   naming replicas?
  
   Shai
  
  
   On Mon, Mar 23, 2015 at 6:14 PM, Shawn Heisey apa...@elyograg.org
  wrote:
  
On 3/23/2015 9:27 AM, Shai Erera wrote:
 I have a Solr cluster started (all programmatically) with one Solr
   node,
 one collection and one shard. I set the replicationFactor to 1. The
   name
of
 the result core was set to mycollection_shard1_replica1.

 I then start a second Solr node and issue an ADDREPLICA command as
 described in the reference guide, using following code:

   final CollectionAdminRequest.AddReplica addReplicaRequest = new
 CollectionAdminRequest.AddReplica();
   addReplicaRequest.setCollectionName(mycollection);
   addReplicaRequest.setShardName(shard1);
   final CollectionAdminResponse response =
 addReplicaRequest.process(solrClient);

 The replica is added under a core named mycollection and not e.g.
 mycollection_shard1_replica2.
   
I'd call that a bug.
   
 BTW, the example in the reference guide shows that issuing the
  request:


   
  
 
  http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard2&node=192.167.1.2:8983_solr

 Results in

 response
   lst name=responseHeader
 int name=status0/int
 int name=QTime3764/int
   /lst
   lst name=success
 lst
   lst name=responseHeader
 int name=status0/int
 int name=QTime3450/int
   /lst
 *  str name=coretest2_shard2_replica4/str*
   
Did you try out a URL like that to see whether it also results in the
misnamed core, or if it behaves correctly as the reference guide
   indicates?
   
If the URL behaves correctly, I'd be curious what Solr logs for the
 URL
request and the SolrJ request.
   
Thanks,
Shawn
   
   
  
 
 
 
  --
  Anshum Gupta
 




-- 
Anshum Gupta


Re: How to verify a document is indexed by all replicas

2015-03-24 Thread Shai Erera

 You can add a min_rf=true parameter to your indexing


Yeah, I read about it, but it doesn't help me in this case, as I'm
implementing a monitoring component over a SolrCloud instance, so I have
no handle on the indexing client. I would like the monitor to check the
replicas and report whether all replicas are in sync, some are not in
sync, or e.g. replicas 2 and 3 are further ahead than replica 1.

Also, checking the state of the
 replica is not enough, one should always check for the state=active and
 live-ness of the replica i.e. the node is marked live under /live_nodes in
 ZK.


Thanks, I've looked at code samples in tests and saw this is done, so I
copied the logic. E.g. an isReplicaAlive(Replica replica) helper checks both the
replica's state, as well as that the node it's on is in the cluster state's
live nodes.
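
For illustration, a rough sketch of that check against the cluster state
(class and method names are from my recollection of the 4.x/5.x SolrJ API and
should be double-checked):

    import java.util.Set;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class ReplicaLiveness {
      // the client must already be connected (e.g. client.connect() or after a request)
      static boolean isReplicaAlive(CloudSolrClient client, Replica replica) {
        ClusterState state = client.getZkStateReader().getClusterState();
        Set<String> liveNodes = state.getLiveNodes();
        String replicaState = replica.getStr(ZkStateReader.STATE_PROP);
        return "active".equals(replicaState) && liveNodes.contains(replica.getNodeName());
      }

      static boolean allReplicasAlive(CloudSolrClient client, String collection) {
        ClusterState state = client.getZkStateReader().getClusterState();
        for (Slice slice : state.getCollection(collection).getActiveSlices()) {
          for (Replica replica : slice.getReplicas()) {
            if (!isReplicaAlive(client, replica)) {
              return false;
            }
          }
        }
        return true;
      }
    }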

Also, verifying replicas are in sync via searching is not the best solution
at all. Apart from not being that fast, it also doesn't factor in documents
that are in the tlog, or in IW's RAM buffer, or even that a document may
have been updated. So I will change my test to ensuring that all replicas
of a slice are in state active (and on a live node) and rely on that being
OK.

Shai

On Tue, Mar 24, 2015 at 6:39 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Hi Shai,

 To your original question on how to know if a document has been indexed at
 all replicas -- You can add a min_rf=true parameter to your indexing
 request and then Solr will add information to the response about how many
 replicas gave an ack' to the leader. So if the returned number is equal to
 the number of replicas, you can be sure that the doc has been indexed
 everywhere.

 More comments inline:

 On Tue, Mar 24, 2015 at 8:18 AM, Shai Erera ser...@gmail.com wrote:

  Thanks Erick,
 
  When a replica is down, no updates are sent to it. When it comes back up,
  it discovers that it needs to catch-up with the leader. If there are many
  events it falls back to index replication (slower). During this period of
  time, is the replica considered ACTIVE or RECOVERING?
 
 
 It is marked as recovering.


  And, can I assume that at any given moment (aside from ZK connection
  timeouts etc.) when I check the replicas' state, all the ones that report
  ACTIVE are in sync with each other?
 
 
 Yes, 'active' replicas should be in sync but autoCommits can cause
 inconsistency between replicas as to what is visible to searchers (even if
 all replicas have indexed the same data). Also, checking the state of the
 replica is not enough, one should always check for the state=active and
 live-ness of the replica i.e. the node is marked live under /live_nodes in
 ZK.


  Shai
 
  On Tue, Mar 24, 2015 at 5:04 PM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
   You can always issue a *:* query, but it'd have to be at least your
   autoSoftCommit interval ago since the soft commit trigger will have
   slightly different wall clock times.
  
   But it shouldn't be necessary to wait I don't think. Since the
   indexing request doesn't succeed until the docs have been written to
   the tlogs, and since the tlogs will be replayed in the event of a
   problem your data should be fine. Of course if you're indexing at a
   very fast rate and your tlog is huge, it'll take a while
  
   FWIW,
   Erick
  
   On Tue, Mar 24, 2015 at 4:59 AM, Shai Erera ser...@gmail.com wrote:
Hi
   
Is there a recommended, preferably fast, way to check that a document
  is
indexed by all replicas? I currently do that by issuing a search
  request
   to
each replica, but was wondering if there's a faster way.
   
Even better, is there a way to verify all replicas of a shard are
up-to-date, e.g. by comparing their version or something? By
   up-to-date
I mean that they've all processed the same update requests that came
through.
   
If there's a replica lagging behind, I'd like to wait for it to catch
  up,
something like a checkpoint(), before I continue sending more
 updates.
   
Shai
  
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: maxReplicasPerNode

2015-03-24 Thread Shai Erera
Thanks guys, this makes sense I guess, from Solr's side.

Perhaps we can have a new Collections API like REDIRECTREPLICA or
something, that will redirect a replica to the new node.
This API can simply do ADDREPLICA on the new node, and DELETEREPLICA of the
node that doesn't exist anymore.
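
For illustration, a minimal sketch of that ADDREPLICA-then-DELETEREPLICA
sequence in SolrJ (the AddReplica/DeleteReplica classes and their setters are
assumptions based on the 5.x API, and the node and core-node names below are
placeholders):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class RedirectReplica {
      static void redirect(SolrClient client, String collection, String shard,
                           String newNode, String orphanedCoreNodeName) throws Exception {
        // first add a replica of the shard on the new node
        CollectionAdminRequest.AddReplica add = new CollectionAdminRequest.AddReplica();
        add.setCollectionName(collection);
        add.setShardName(shard);
        add.setNode(newNode); // e.g. "192.168.1.5:8983_solr"
        add.process(client);

        // then remove the replica that points at the node that no longer exists
        CollectionAdminRequest.DeleteReplica del = new CollectionAdminRequest.DeleteReplica();
        del.setCollectionName(collection);
        del.setShardName(shard);
        del.setReplica(orphanedCoreNodeName); // e.g. "core_node2"
        del.process(client);
      }
    }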

I guess I need to implement that for my use case now (I know that if a node
came down, it won't ever come back up again - there will be a new node
replacing it), so I'll see how it plays out and if it works well, I'll open
a JIRA issue. In my case, when the new node comes up, it can check the
cluster's status, and if it detects an orphaned replica, it will add
itself as a new replica and delete the orphaned one.

Let me know if you see a problem with how I intend to address that.

Shai

On Tue, Mar 24, 2015 at 6:01 PM, Anshum Gupta ans...@anshumgupta.net
wrote:

 Yes, it applies to both. Solr wouldn't auto-add replicas in either of those
 cases (or any other case) to meet the rf specified at create time.

 On Tue, Mar 24, 2015 at 2:22 AM, Shai Erera ser...@gmail.com wrote:

  Thanks Anshum,
 
  About #3, in line with my answer to the previous question, Solr wouldn't
   auto-add a Replica to meet the replication factor when a node goes
 down.
  
 
  Just to make sure the answer applies to both these cases:
 
 1. There are two replicas on node1 and node2. Solr won't add a replica
 to node1 when node2 goes down.
 2. The collection was created with rf=2, Solr creates replicas on
 node1
 and node2. If node2 goes down and a node3 comes up instead, will it be
 assigned a replica, or Solr does not do that also?
 
  In short, is there any scenario where Solr would auto-add replicas (aside
  from running on HDFS) to meet the 'rf' setting, or after the collection
 has
  been created, ensuring RF is met is my responsibility?
 
  Shai
 
  On Tue, Mar 24, 2015 at 10:02 AM, Anshum Gupta ans...@anshumgupta.net
  wrote:
 
   Hi Shai,
  
   As of now, all replicas for a collections are created to meet the
  specified
   replication factor at the time of collection creation. There's no way
 to
   defer that until more nodes are up. Your best bet is to have the nodes
   already up before you CREATE the collection or create the collection
  with a
   lower replication factor and then use ADDREPLICA.
  
   About auto-addition of replicas, that's kind of supported when using
  shared
   file system (HDFS) to host the index. It's doesn't truly work as per
 your
   use-case i.e. it doesn't consider the intended replication factor but
  only
   brings up a Replica in case all replicas for a node are down, so that
   SolrCloud continues to be usable. It also doesn't auto-remove replica
  when
   the old node comes back up. You can read more about this in the
   Automatically Add Replicas in SolrCloud section here:
   https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
  
   About #3, in line with my answer to the previous question, Solr wouldn't
   auto-add a Replica to meet the replication factor when a node goes
 down.
  
  
   On Tue, Mar 24, 2015 at 12:36 AM, Shai Erera ser...@gmail.com wrote:
  
Hi
   
I saw that we can define maxShardsPerNode when creating a collection,
   but I
don't see that I can set something similar for replicas. My scenario
 is
   the
following:
   
   - I setup one Solr node
   - Create collection with numShards=1 and replicationFactor=2
   - Hopefully, one replica is created on that node
   - When I bring up the second Solr node, the second replica will be
   created
   
What I see is that both replicas are created on the first node, and
  when
   I
bring up the second Solr node, none of the replicas are moved.
   
I know that I can move one replica by calling ADDREPLICA on node2,
  then
DELETEREPLICA on node1, but I was wondering if there's an automated
 way
   to
do that.
   
I've also considered creating the collection with replicationFactor=1
  and
when the second node comes up it will look for shards w/ one replica
   only,
and assign themselves as the replica. But it means I have to own that
   piece
of logic, where if Solr already does that, that's better.
   
Also, from what I understand, if I create a collection w/ rf=2 and
  there
are two nodes, then each node is assigned a replica. If one of the
  nodes
comes down, and a 3rd node comes up, it will be assigned a replica --
  is
that correct?
   
Another related question, if there are two replicas on node1 and
 node2,
   and
node2 goes down -- will node1 be assigned the second replica as well?
   
If this is explained somewhere, I'd appreciate if you can give me a
pointer.
   
Shai
   
  
  
  
   --
   Anshum Gupta
  
 



 --
 Anshum Gupta



One of three cores is missing userData and lastModified fields from /admin/cores

2015-03-24 Thread Aaron Daubman
Hey All,

On a Solr server running 4.10.2 with three cores, two return the expected
info from /solr/admin/cores?wt=json but the third is missing userData and
lastModified.

The first (artists) and third (tracks) cores from the linked screenshot are
the ones I care about. Unfortunately, the third (tracks) is the one missing
lastModified.

As far as I can see, that comes from:
https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_2/solr/core/src/java/org/apache/solr/handler/admin/LukeRequestHandler.java#L568

I can't trace back to see what would possible cause getUserData() to return
an empty Object, but that appears to be what is happening?

For these servers, indexes that are pre-optimized are shipped over to the
server and the server is re-started... nothing is actually ever committed
on these live servers. This should behave exactly the same for artists and
tracks, even though tracks is the one always missing lastUpdated.

Here's the output in img format, I'll paste the full JSON[1] below:
http://monosnap.com/image/XMyAfk5z3AvHgY39m0qAKAGlc3RACI.png

I'd like to be able to provide access to clients to grab lastUpdated time
for both indices so that they can see how old/stale the data they are
getting results back from is...

...alternately, is there any other way to expose easily how old (last
modified time?) the index for a core is?
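
For what it's worth, a rough SolrJ sketch of pulling that per-core status
programmatically (class and method names assumed from the 4.x/5.x
CoreAdminRequest API; lastModified will only be present when the index has
commit user data, which seems to be exactly what is missing here):

    import java.util.Date;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;
    import org.apache.solr.client.solrj.response.CoreAdminResponse;
    import org.apache.solr.common.util.NamedList;

    public class CoreAge {
      public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr");
        // CoreAdmin STATUS for a single core
        CoreAdminResponse status = CoreAdminRequest.getStatus("tracks", client);
        NamedList<Object> core = status.getCoreStatus("tracks");
        @SuppressWarnings("unchecked")
        NamedList<Object> index = (NamedList<Object>) core.get("index");
        Date lastModified = (Date) index.get("lastModified"); // null when userData is empty
        System.out.println("tracks lastModified: " + lastModified);
        client.close();
      }
    }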

Thanks,
  Aaron

1: Full JSON
---snip---
{
  responseHeader: {
status: 0,
QTime: 10
  },
  defaultCoreName: collection1,
  initFailures: {
  },
  status: {
artists: {
  name: artists,
  isDefaultCore: false,
  instanceDir: /opt/solr/search/solr/artists/,
  dataDir: /opt/solr/search/solr/artists/,
  config: solrconfig.xml,
  schema: schema.xml,
  startTime: 2015-03-24T14:12:23.667Z,
  uptime: 7335696,
  index: {
numDocs: 3360380,
maxDoc: 3360380,
deletedDocs: 0,
indexHeapUsageBytes: 63366952,
version: 421,
segmentCount: 1,
current: true,
hasDeletions: false,
directory:
org.apache.lucene.store.MMapDirectory:MMapDirectory@/opt/solr/search/solr/artists/index
lockFactory=NativeFSLockFactory@/opt/solr/search/solr/artists/index,
userData: {
  commitTimeMSec: 1427133705908
},
lastModified: 2015-03-23T18:01:45.908Z,
sizeInBytes: 25341305528,
size: 23.6 GB
  }
},
banana-int: {
  name: banana-int,
  isDefaultCore: false,
  instanceDir: /opt/solr/search/solr/banana-int/,
  dataDir: /opt/solr/search/solr/banana-int/data/,
  config: solrconfig.xml,
  schema: schema.xml,
  startTime: 2015-03-24T14:12:22.895Z,
  uptime: 7336472,
  index: {
numDocs: 3,
maxDoc: 3,
deletedDocs: 0,
indexHeapUsageBytes: 17448,
version: 135,
segmentCount: 3,
current: true,
hasDeletions: false,
directory:
org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/opt/solr/search/solr/banana-int/data/index
lockFactory=NativeFSLockFactory@/opt/solr/search/solr/banana-int/data/index;
maxCacheMB=48.0 maxMergeSizeMB=4.0),
userData: {
  commitTimeMSec: 1412796723183
},
lastModified: 2014-10-08T19:32:03.183Z,
sizeInBytes: 16196,
size: 15.82 KB
  }
},
tracks: {
  name: tracks,
  isDefaultCore: false,
  instanceDir: /opt/solr/search/solr/tracks/,
  dataDir: /opt/solr/search/solr/tracks/,
  config: solrconfig.xml,
  schema: schema.xml,
  startTime: 2015-03-24T14:12:23.656Z,
  uptime: 7335713,
  index: {
numDocs: 53268126,
maxDoc: 53268126,
deletedDocs: 0,
indexHeapUsageBytes: 517650552,
version: 100,
segmentCount: 1,
current: true,
hasDeletions: false,
directory:
org.apache.lucene.store.MMapDirectory:MMapDirectory@/opt/solr/search/solr/tracks/index
lockFactory=NativeFSLockFactory@/opt/solr/search/solr/tracks/index,
userData: {
},
sizeInBytes: 122892905007,
size: 114.45 GB
  }
}
  }
}
---snip---


Regarding detection of duplication

2015-03-24 Thread Iniyan
Hi,

My requirement is to detect duplication in titles after removing punctuation
marks, stop words, and accented characters.

I am trying to do an exact match. After that I am thinking of applying
filters.

I have tried solr.KeywordTokenizerFactory. It does exact matching. But
when I add

filter class=solr.StopFilterFactory ignoreCase=true
words=stopwords.txt
enablePositionIncrements=true /

Stop filter is not working.

But if I apply solr.StandardTokenizerFactory, I am not getting an exact
match.


Title:

What is a apple?
What is an apple?
What is the apple?

When I type What is a apple, I need to get all of the above.

Could you please let me know whether there is any tokenizer/filter matching my
requirement?
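
For illustration, the likely reason the stop filter does nothing after
KeywordTokenizerFactory is that the whole title comes out as a single token,
so there are no individual words for StopFilterFactory to drop. One alternative
(sketched below in plain Java, with an illustrative stop word list) is to
compute a normalized signature of the title (lowercased, punctuation stripped,
accents folded, stop words removed) and index/compare that in a separate field:

    import java.text.Normalizer;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class TitleSignature {
      // illustrative stop word list only; use your real stopwords.txt contents
      private static final Set<String> STOP_WORDS =
          new HashSet<String>(Arrays.asList("a", "an", "the", "is", "what"));

      static String signature(String title) {
        String s = Normalizer.normalize(title, Normalizer.Form.NFD)
            .replaceAll("\\p{M}", "")          // strip combining accent marks
            .toLowerCase()
            .replaceAll("[^a-z0-9\\s]", " ");  // drop punctuation
        StringBuilder out = new StringBuilder();
        for (String token : s.trim().split("\\s+")) {
          if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
            if (out.length() > 0) {
              out.append(' ');
            }
            out.append(token);
          }
        }
        return out.toString();
      }

      public static void main(String[] args) {
        // all three variants collapse to the same signature: "apple"
        System.out.println(signature("What is a apple?"));
        System.out.println(signature("What is an apple?"));
        System.out.println(signature("What is the apple?"));
      }
    }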



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Regarding-detection-of-duplication-tp4194975.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr and HDFS configuration

2015-03-24 Thread Michael Della Bitta
The ultimate answer is that you need to test your configuration with your
expected workflow.

However, the thing that mitigates the remote IO factor (hopefully) is that
the Solr HDFS stuff features a blockcache that should (when tuned
correctly) cache in RAM the blocks your Solr process needs the most.

Solr on HDFS currently doesn't have any sort of rack locality like there is
with say HBase colocated on the HDFS nodes. So you can expect that even
with Solr installed on the same nodes as your datanodes for HDFS, that
there will be remote IO.



Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/

On Tue, Mar 24, 2015 at 2:47 PM, Joseph Obernberger j...@lovehorsepower.com
 wrote:

 Hi All - does it make sense to run a solr shard on a node within an Hadoop
 cluster that is not a data node?  In that case all the data that node
 processes would need to come over the network, but you get the benefit of
 more CPU for things like faceting.
 Thank you!

 -Joe



Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Shawn Heisey
On 3/24/2015 11:22 AM, Ian Rose wrote:
 Let me give a bit of background.  Our Solr cluster is multi-tenant, where
 we use one collection for each of our customers.  In many cases, these
 customers are very tiny, so their collection consists of just a single
 shard on a single Solr node.  In fact, a non-trivial number of them are
 totally empty (e.g. trial customers that never did anything with their
 trial account).  However there are also some customers that are larger,
 requiring their collection to be sharded.  Our strategy is to try to keep
 the total documents in any one shard under 20 million (honestly not sure
 where my coworker got that number from - I am open to alternatives but I
 realize this is heavily app-specific).

 So my original question is not related to indexing or query traffic, but
 just the sheer number of cores.  For example, if I have 10 active cores on
 a machine and everything is working fine, should I expect that everything
 will still work fine if I add 10 nearly-idle cores to that machine?  What
 about 100?  1000?  I figure the overhead of each core is probably fairly
 low but at some point starts to matter.

One resource that may be exhausted faster than any other when you have a
lot of cores on a solr instance (especially when they are not idle) is
Java heap memory, so you might need to increase the java heap.  Memory
in the server is one of the most important resources you have for Solr
performance, and here I am talking about memory that is *not* used in
the Java heap (or any other program) -- the OS must be able to
effectively cache your index data or Solr performance will be terrible.

You have said Solr cluster and collection ... so that makes me think
you're running SolrCloud.  In cloud mode, you can't really use the
LotsOfCores functionality, where you mark cores transient and tell Solr
how many cores you'd like to have resident at the same time.  If you are
NOT in cloud mode, then you can use this feature:

http://wiki.apache.org/solr/LotsOfCores

In general, there are three resources other than memory which might
become exhausted with a large number of cores:

One resource is the maximum open files limit in the operating system,
which typically defaults to 1024.  Each core will typically have several
dozen files in its index, so it's very easy to reach 1024 open files.

The second resource is the maximum allowed threads in your servlet
container config -- each core you add requires more threads.  The
default maxThreads value in most containers is 200.  The Jetty container
included in the Solr download is preconfigured with a maxThreads value
of 10000, effectively removing the limit for most setups.

The third resource is related to the second -- some operating systems
implement threads as hidden processes, and many operating systems will
limit the number of processes that a user may start.  On Linux, this
limit is typically 1024, and may need to be increased.

I really need to add this kind of info to the wiki.

Thanks,
Shawn



Solr and HDFS configuration

2015-03-24 Thread Joseph Obernberger
Hi All - does it make sense to run a solr shard on a node within an 
Hadoop cluster that is not a data node?  In that case all the data that 
node processes would need to come over the network, but you get the 
benefit of more CPU for things like faceting.

Thank you!

-Joe


Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor

2015-03-24 Thread Martin Wunderlich
Hi Alex, 

Thanks again for the reply. See my response below inline. 

 On 22.03.2015 at 20:14, Alexandre Rafalovitch arafa...@gmail.com wrote:
 
 I am not entirely sure your problem is at the XSL level yet?
 
 *) I see problems with quotes in two places (in datasource, and in
 outer entity). Did you paste definitions from MSWord by any chance?

The file was created in a text editor. I am not sure which quotes you are 
referring to. They look fine to me and the XML file validates alright. Could you 
perhaps be more specific?

 *) I see that you declare outer entity to be rootEntity=true, so you
 will not get anything from inner documents

That’s correct, I have set the value to „false now 

 *) I don't see any XPath definitions in the inner entity, so the
 processor does not know how to actually map to the fields (that's
 different for SQLEntityProcessor which auto-maps).

As far as I know, the explicit mappings are not required when the result of the 
transformation is in the Solr default import format. The documentation says: 
useSolrAddSchema

- Set this to true if the content is in the form of the standard Solr update 
XML schema.

(https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler)

But maybe my interpretation here is incorrect. I was assuming that setting this 
attribute to „true“ will allow the DIH to directly process the resulting XML 
file as if I was importing it with the command line Java tool. 

 
 I would step back from inner DIH entity and make sure your outer
 entity actually captures something. Maybe by enabling dynamicField *
 with stored=true. See what you get into the schema. Then, add XPath
 against original XML, just to make sure you capture _something_. Then,
 XSLT and XPath.

OK, I will try to debug the DIH like this. Thanks again. 

Cheers, 

Martin
 
 


 
 Regards,
   Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/
 
 
 On 22 March 2015 at 12:36, Martin Wunderlich martin...@gmx.net wrote:
 Hi Alex,
 
 Thanks a lot for the reply and apologies for being unclear. The 
 XPathEntityProcessor provides an option to specify an XSLT file that should 
 be applied to the XML input prior to the actual data import. I am including 
 my current configuration below, with the respective attribute highlighted.
 
 I have checked various forums and documentation bits, but the config XML 
 seems ok to me. And yet, nothing gets imported.
 
 Cheers,
 
 Martin
 
 
 dataConfig
dataSource encoding=UTF-8
type=„FileDataSource /
entity
name=pickupdir
processor=FileListEntityProcessor
rootEntity=true
fileName=.*xml
baseDir=„/abs/path/to/source/dir/for/import/
recursive=true
newerThan=${dataimporter.last_index_time}
dataSource=null
 
entity
name=xml
processor=XPathEntityProcessor
stream=false
useSolrAddSchema=true
url=${pickupdir.fileAbsolutePath}
xsl=/abs/path/to/xslt/file/in/myCore/conf/transform.xsl
/entity
/entity
/document
 /dataConfig
 
 
 
 
 On 22.03.2015 at 01:18, Alexandre Rafalovitch arafa...@gmail.com wrote:
 
 What do you mean using DIH with XSLT together? DIH uses a basic XPath
 parser, but not full XSLT.
 
 So, it's not very clear what the question actually means. How did you
 configure it all?
 
 Regards,
  Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/
 
 
 On 21 March 2015 at 14:14, Martin Wunderlich martin...@gmx.net wrote:
 Hi all,
 
 I am trying to create a data import handler (DIH) to import XML files. The 
 source XML should be transformed using XSLT into the standard Solr import 
 format. I have tested the XSLT and successfully imported data using the 
 Java-based simple import tool. However, when I try to import the same XML 
 files with the same XSLT pre-processing using a DIH configured in 
 solrconfig.xml, it doesn’t work. I can execute the DIH from the admin 
 interface, but no documents get imported. The logging console doesn’t give 
 any errors.
 
 Could someone who has managed to successfully set up a similar 
 configuration (XML import via DIH with XSL pre-processing), provide with 
 the basic configuration, so that I can check what might be wrong in mine?
 
 Thanks a lot.
 
 Cheers,
 
 Martin
 
 
 



Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor

2015-03-24 Thread Alexandre Rafalovitch
type=„FileDataSource /

I am getting both missing closing quote and the opening quote is a
funny one (aligns on the bottom). But your response email also does
that, so maybe you are using some smart editor. Try checking this
conversation in a web archive if you can't see the unusual quotes.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 24 March 2015 at 15:41, Martin Wunderlich martin...@gmx.net wrote:
 Hi Alex,

 Thanks again for the reply. See my response below inline.

 On 22.03.2015 at 20:14, Alexandre Rafalovitch arafa...@gmail.com wrote:

 I am not entirely sure your problem is at the XSL level yet?

 *) I see problems with quotes in two places (in datasource, and in
 outer entity). Did you paste definitions from MSWord by any chance?

 The file was created in a text editor. I am not sure which quotes you are 
 referring to. They look fine to me and the XML file validates alright. Could 
 you perhaps be more specific?

 *) I see that you declare outer entity to be rootEntity=true, so you
 will not get anything from inner documents

 That’s correct, I have set the value to „false now

 *) I don't see any XPath definitions in the inner entity, so the
 processor does not know how to actually map to the fields (that's
 different for SQLEntityProcessor which auto-maps).

 As far as I know, the explicit mappings are not required when the result of 
 the transformation is in the Solr default import format. The documentation 
 says:
 useSolrAddSchema

 - Set this to true if the content is in the form of the standard Solr update 
 XML schema.

 (https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler)

 But maybe my interpretation here is incorrect. I was assuming that setting 
 this attribute to „true“ will allow the DIH to directly process the resulting 
 XML file as if I was importing it with the command line Java tool.


 I would step back from inner DIH entity and make sure your outer
 entity actually captures something. Maybe by enabling dynamicField *
 with stored=true. See what you get into the schema. Then, add XPath
 against original XML, just to make sure you capture _something_. Then,
 XSLT and XPath.

 OK, I will try to debug the DIH like this. Thanks again.

 Cheers,

 Martin





 Regards,
   Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 22 March 2015 at 12:36, Martin Wunderlich martin...@gmx.net wrote:
 Hi Alex,

 Thanks a lot for the reply and apologies for being unclear. The 
 XPathEntityProcessor provides an option to specify an XSLT file that should 
 be applied to the XML input prior to the actual data import. I am including 
 my current configuration below, with the respective attribute highlighted.

 I have checked various forums and documentation bits, but the config XML 
 seems ok to me. And yet, nothing gets imported.

 Cheers,

 Martin


 dataConfig
dataSource encoding=UTF-8
type=„FileDataSource /
entity
name=pickupdir
processor=FileListEntityProcessor
rootEntity=true
fileName=.*xml
baseDir=„/abs/path/to/source/dir/for/import/
recursive=true
newerThan=${dataimporter.last_index_time}
dataSource=null

entity
name=xml
processor=XPathEntityProcessor
stream=false
useSolrAddSchema=true
url=${pickupdir.fileAbsolutePath}
xsl=/abs/path/to/xslt/file/in/myCore/conf/transform.xsl
/entity
/entity
/document
 /dataConfig




 On 22.03.2015 at 01:18, Alexandre Rafalovitch arafa...@gmail.com wrote:

 What do you mean using DIH with XSLT together? DIH uses a basic XPath
 parser, but not full XSLT.

 So, it's not very clear what the question actually means. How did you
 configure it all?

 Regards,
  Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 21 March 2015 at 14:14, Martin Wunderlich martin...@gmx.net wrote:
 Hi all,

 I am trying to create a data import handler (DIH) to import XML files. 
 The source XML should be transformed using XSLT into the standard Solr 
 import format. I have tested the XSLT and successfully imported data 
 using the Java-based simple import tool. However, when I try to import 
 the same XML files with the same XSLT pre-processing using a DIH 
 configured in solrconfig.xml, it doesn’t work. I can execute the DIH from 
 the admin interface, but no documents get imported. The logging console 
 doesn’t give any errors.

 Could someone who has managed to successfully set up a similar 
 configuration 

RE: rough maximum cores (shards) per machine?

2015-03-24 Thread Toke Eskildsen
Jack Krupansky [jack.krupan...@gmail.com] wrote:
 Don't confuse customers and tenants.

Perhaps you could explain what you mean by multi-tenant in the context of Ian's 
setup? It is not clear to me what the distinction is in this case.

- Toke Eskildsen


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Test Test
Hi there,

I'm trying to create my own TokenizerFactory (from the Taming Text book). After
setting up schema.xml and adding the path in solrconfig.xml, I start Solr and get
this error message:

Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
        at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)
        at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
        at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
        at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
        at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
        ... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
        ... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
        at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
        ... 13 more
Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
        at java.lang.Class.asSubclass(Class.java:3208)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
        at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
        at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
        at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)

Can someone help?
Thanks. Regards.


 On Tuesday, 24 March 2015 at 21:24, Jack Krupansky jack.krupan...@gmail.com wrote:

 I'm sure that I am quite unqualified to describe his hypothetical setup. I
mean, he's the one using the term multi-tenancy, so it's for him to be
clear.

For me, it's a question of who has control over the config and schema and
collection creation. Having more than one business entity controlling the
configuration of a single (Solr) server is a recipe for disaster. Solr
works well if there is an architect for the system. Ever hear the old
saying Too many cooks spoil the stew?

-- Jack Krupansky

On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 Jack Krupansky [jack.krupan...@gmail.com] wrote:
  Don't confuse customers and tenants.

 Perhaps you could explain what you mean by multi-tenant in the context of
 Ian's setup? It is not clear to me what the distinction is in this case.

 - Toke Eskildsen



  

RE: rough maximum cores (shards) per machine?

2015-03-24 Thread Toke Eskildsen
Jack Krupansky [jack.krupan...@gmail.com] wrote:
 I'm sure that I am quite unqualified to describe his hypothetical setup. I
 mean, he's the one using the term multi-tenancy, so it's for him to be
 clear.

It was my understanding that Ian used them interchangeably, but of course Ian 
is the only one that knows.

 For me, it's a question of who has control over the config and schema and
 collection creation. Having more than one business entity controlling the
 configuration of a single (Solr) server is a recipe for disaster.

Thank you. Now your post makes a lot more sense. I will not argue against that.

- Toke Eskildsen


Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor

2015-03-24 Thread Martin Wunderlich
Very interesting. Thanks, Shawn. Here is what the config file looks like in the 
Solr admin console: 

https://www.dropbox.com/s/qtfclbvs8oze7lp/Bildschirmfoto%202015-03-24%20um%2021.11.12.png?dl=0

No problems with quotes here. It might have been Apple Mail that converted 
them. 

Cheers, 

Martin
 

 On 24.03.2015 at 20:59, Shawn Heisey apa...@elyograg.org wrote:
 
 On 3/24/2015 1:41 PM, Martin Wunderlich wrote:
 The file was created in a text editor. I am not sure which quotes you
 are referring to. They look fine to me and the XML file valides
 alright. Could you perhaps be more specific?
 
 This partial screenshot is your email to the list showing your
 dataconfig, as I see it in Thunderbird, with the unusual quote
 characters clearly indicated:
 
 https://www.dropbox.com/s/rbycy69xq4bn42l/solr-user-martin-wunderlich-quote-problem.png?dl=0
 
 Thanks,
 Shawn
 



Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
I'm sure that I am quite unqualified to describe his hypothetical setup. I
mean, he's the one using the term multi-tenancy, so it's for him to be
clear.

For me, it's a question of who has control over the config and schema and
collection creation. Having more than one business entity controlling the
configuration of a single (Solr) server is a recipe for disaster. Solr
works well if there is an architect for the system. Ever hear the old
saying Too many cooks spoil the stew?

-- Jack Krupansky

On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 Jack Krupansky [jack.krupan...@gmail.com] wrote:
  Don't confuse customers and tenants.

 Perhaps you could explain what you mean by multi-tenant in the context of
 Ian's setup? It is not clear to me what the distinction is in this case.

 - Toke Eskildsen



Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor

2015-03-24 Thread Shawn Heisey
On 3/24/2015 1:41 PM, Martin Wunderlich wrote:
 The file was created in a text editor. I am not sure which quotes you
 are referring to. They look fine to me and the XML file validates
 alright. Could you perhaps be more specific?

This partial screenshot is your email to the list showing your
dataconfig, as I see it in Thunderbird, with the unusual quote
characters clearly indicated:

https://www.dropbox.com/s/rbycy69xq4bn42l/solr-user-martin-wunderlich-quote-problem.png?dl=0

Thanks,
Shawn
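
For reference, here is a minimal data-config.xml sketch for the
FileListEntityProcessor/XPathEntityProcessor combination under discussion,
written with plain ASCII straight quotes throughout; the paths, entity names
and xpath expressions below are placeholders, not values from the original
config:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Outer entity only lists the files; it emits no documents itself -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/xml/files" fileName=".*\.xml"
            recursive="true" rootEntity="false" dataSource="null">
      <!-- Inner entity parses each file found by the outer one -->
      <entity name="record" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}" forEach="/records/record">
        <field column="id" xpath="/records/record/id"/>
        <field column="title" xpath="/records/record/title"/>
      </entity>
    </entity>
  </document>
</dataConfig>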



Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
Multi-tenancy is a bad idea for a single Solr cluster. Better to give each
tenant a separate Solr instance that you spin up and spin down based on
demand.

Think about it: If there are a small number of tenants, just giving each
their own machine will be cheaper than the effort spent managing a
multi-tenant cluster, and if there are a large number of tenants or even a
moderate number of large tenants, you can't expect them to all run
reasonably on a relatively small cluster. Think about scalability.


-- Jack Krupansky

On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose ianr...@fullstory.com wrote:

 Let me give a bit of background.  Our Solr cluster is multi-tenant, where
 we use one collection for each of our customers.  In many cases, these
 customers are very tiny, so their collection consists of just a single
 shard on a single Solr node.  In fact, a non-trivial number of them are
 totally empty (e.g. trial customers that never did anything with their
 trial account).  However there are also some customers that are larger,
 requiring their collection to be sharded.  Our strategy is to try to keep
 the total documents in any one shard under 20 million (honestly not sure
 where my coworker got that number from - I am open to alternatives but I
 realize this is heavily app-specific).

 So my original question is not related to indexing or query traffic, but
 just the sheer number of cores.  For example, if I have 10 active cores on
 a machine and everything is working fine, should I expect that everything
 will still work fine if I add 10 nearly-idle cores to that machine?  What
 about 100?  1000?  I figure the overhead of each core is probably fairly
 low but at some point starts to matter.

 Does that make sense?
 - Ian


 On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky jack.krupan...@gmail.com
 
 wrote:

  Shards per collection, or across all collections on the node?
 
  It will all depend on:
 
  1. Your ingestion/indexing rate. High, medium or low?
  2. Your query access pattern. Note that a typical query fans out to all
  shards, so having more shards than CPU cores means less parallelism.
  3. How many collections you will have per node.
 
  In short, it depends on what you want to achieve, not some limit of Solr
  per se.
 
  Why are you even sharding the node anyway? Why not just run with a single
  shard per node, and do sharding by having separate nodes, to maximize
  parallel processing and availability?
 
  Also be careful to be clear about using the Solr term shard (a slice,
  across all replica nodes) as distinct from the Elasticsearch term shard
  (a single slice of an index for a single replica, analogous to a Solr
  core.)
 
 
  -- Jack Krupansky
 
  On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com wrote:
 
   Hi all -
  
   I'm sure this topic has been covered before but I was unable to find
 any
   clear references online or in the mailing list.
  
   Are there any rules of thumb for how many cores (aka shards, since I am
   using SolrCloud) is too many for one machine?  I realize there is no
  one
   answer (depends on size of the machine, etc.) so I'm just looking for a
   rough idea.  Something like the following would be very useful:
  
   * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
   server without any problems.
   * I have never heard of anyone successfully running X cores/shards on a
   single machine, even if you throw a lot of hardware at it.
  
   Thanks!
   - Ian
  
 



Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Shalin Shekhar Mangar
Sorry Jack. That doesn't scale when you have millions of customers. And
these are good problems to have!

On Tue, Mar 24, 2015 at 10:47 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Multi-tenancy is a bad idea for a single Solr cluster. Better to give each
 tenant a separate Solr instance that you spin up and spin down based on
 demand.

 Think about it: If there are a small number of tenants, just giving each
 their own machine will be cheaper than the effort spent managing a
 multi-tenant cluster, and if there are a large number of tenants or even a
 moderate number of large tenants, you can't expect them to all run
 reasonably on a relatively small cluster. Think about scalability.


 -- Jack Krupansky

 On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose ianr...@fullstory.com wrote:

  Let me give a bit of background.  Our Solr cluster is multi-tenant, where
  we use one collection for each of our customers.  In many cases, these
  customers are very tiny, so their collection consists of just a single
  shard on a single Solr node.  In fact, a non-trivial number of them are
  totally empty (e.g. trial customers that never did anything with their
  trial account).  However there are also some customers that are larger,
  requiring their collection to be sharded.  Our strategy is to try to keep
  the total documents in any one shard under 20 million (honestly not sure
  where my coworker got that number from - I am open to alternatives but I
  realize this is heavily app-specific).
 
  So my original question is not related to indexing or query traffic, but
  just the sheer number of cores.  For example, if I have 10 active cores
 on
  a machine and everything is working fine, should I expect that everything
  will still work fine if I add 10 nearly-idle cores to that machine?  What
  about 100?  1000?  I figure the overhead of each core is probably fairly
  low but at some point starts to matter.
 
  Does that make sense?
  - Ian
 
 
  On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky 
 jack.krupan...@gmail.com
  
  wrote:
 
   Shards per collection, or across all collections on the node?
  
   It will all depend on:
  
   1. Your ingestion/indexing rate. High, medium or low?
   2. Your query access pattern. Note that a typical query fans out to all
   shards, so having more shards than CPU cores means less parallelism.
   3. How many collections you will have per node.
  
   In short, it depends on what you want to achieve, not some limit of
 Solr
   per se.
  
   Why are you even sharding the node anyway? Why not just run with a
 single
   shard per node, and do sharding by having separate nodes, to maximize
   parallel processing and availability?
  
   Also be careful to be clear about using the Solr term shard (a slice,
   across all replica nodes) as distinct from the Elasticsearch term
 shard
   (a single slice of an index for a single replica, analogous to a Solr
   core.)
  
  
   -- Jack Krupansky
  
   On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com
 wrote:
  
Hi all -
   
I'm sure this topic has been covered before but I was unable to find
  any
clear references online or in the mailing list.
   
Are there any rules of thumb for how many cores (aka shards, since I
 am
using SolrCloud) is too many for one machine?  I realize there is
 no
   one
answer (depends on size of the machine, etc.) so I'm just looking
 for a
rough idea.  Something like the following would be very useful:
   
* People commonly run up to X cores/shards on a mid-sized (4 or 8
 core)
server without any problems.
* I have never heard of anyone successfully running X cores/shards
 on a
single machine, even if you throw a lot of hardware at it.
   
Thanks!
- Ian
   
  
 




-- 
Regards,
Shalin Shekhar Mangar.


maxReplicasPerNode

2015-03-24 Thread Shai Erera
Hi

I saw that we can define maxShardsPerNode when creating a collection, but I
don't see that I can set something similar for replicas. My scenario is the
following:

   - I setup one Solr node
   - Create collection with numShards=1 and replicationFactor=2
   - Hopefully, one replica is created on that node
   - When I bring up the second Solr node, the second replica will be
   created

What I see is that both replicas are created on the first node, and when I
bring up the second Solr node, none of the replicas are moved.

I know that I can move one replica by calling ADDREPLICA on node2, then
DELETEREPLICA on node1, but I was wondering if there's an automated way to
do that.
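
(Roughly, and with placeholder collection/node names, that manual move is two
Collections API calls -- first add the new replica on node2, then delete the
old one on node1 once the new one is active:

http://node2:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=node2:8983_solr
http://node1:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node1

where core_node1 is whatever name the old replica carries in
clusterstate.json.)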

I've also considered creating the collection with replicationFactor=1 and
when the second node comes up it will look for shards w/ one replica only,
and assign themselves as the replica. But it means I have to own that piece
of logic, where if Solr already does that, that's better.

Also, from what I understand, if I create a collection w/ rf=2 and there
are two nodes, then each node is assigned a replica. If one of the nodes
comes down, and a 3rd node comes up, it will be assigned a replica -- is
that correct?

Another related question, if there are two replicas on node1 and node2, and
node2 goes down -- will node1 be assigned the second replica as well?

If this is explained somewhere, I'd appreciate if you can give me a pointer.

Shai


Re: maxReplicasPerNode

2015-03-24 Thread Anshum Gupta
Hi Shai,

As of now, all replicas for a collection are created to meet the specified
replication factor at the time of collection creation. There's no way to
defer that until more nodes are up. Your best bet is to have the nodes
already up before you CREATE the collection or create the collection with a
lower replication factor and then use ADDREPLICA.
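
As a sketch of that second option (collection, config and host names below
are placeholders, and extra parameters such as collection.configName are
omitted):

# While only node1 is up, create the collection with a single replica
http://node1:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=1&replicationFactor=1

# Once node2 has joined the cluster, add the second replica explicitly there
http://node1:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=node2:8983_solr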

About auto-addition of replicas, that's kind of supported when using a shared
file system (HDFS) to host the index. It doesn't truly work for your
use case, i.e. it doesn't consider the intended replication factor but only
brings up a Replica in case all replicas for a node are down, so that
SolrCloud continues to be usable. It also doesn't auto-remove the replica when
the old node comes back up. You can read more about this in the
"Automatically Add Replicas in SolrCloud" section here:
https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
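
As a rough sketch (placeholder names, and only meaningful when the index
lives on a shared file system such as HDFS), the flag is passed at collection
creation time:

http://node1:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&replicationFactor=2&autoAddReplicas=true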

About #3, in line with my answer to the previous question, Solr wouldn't
auto-add a Replica to meet the replication factor when a node goes down.


On Tue, Mar 24, 2015 at 12:36 AM, Shai Erera ser...@gmail.com wrote:

 Hi

 I saw that we can define maxShardsPerNode when creating a collection, but I
 don't see that I can set something similar for replicas. My scenario is the
 following:

- I setup one Solr node
- Create collection with numShards=1 and replicationFactor=2
- Hopefully, one replica is created on that node
- When I bring up the second Solr node, the second replica will be
created

 What I see is that both replicas are created on the first node, and when I
 bring up the second Solr node, none of the replicas are moved.

 I know that I can move one replica by calling ADDREPLICA on node2, then
 DELETEREPLICA on node1, but I was wondering if there's an automated way to
 do that.

 I've also considered creating the collection with replicationFactor=1 and
 when the second node comes up it will look for shards w/ one replica only,
 and assign themselves as the replica. But it means I have to own that piece
 of logic, where if Solr already does that, that's better.

 Also, from what I understand, if I create a collection w/ rf=2 and there
 are two nodes, then each node is assigned a replica. If one of the nodes
 comes down, and a 3rd node comes up, it will be assigned a replica -- is
 that correct?

 Another related question, if there are two replicas on node1 and node2, and
 node2 goes down -- will node1 be assigned the second replica as well?

 If this is explained somewhere, I'd appreciate if you can give me a
 pointer.

 Shai




-- 
Anshum Gupta


Problem with Terms Query Parser

2015-03-24 Thread Shamik Bandopadhyay
Hi,

  I'm trying to use Terms Query Parser for one of my use cases where I use
an implicit filter on a bunch of sources.

When I'm trying to run the following query,

fq={!terms f=Source}help,documentation,sfdc

I'm getting the following error.

<lst name="error"><str name="msg">Unknown query parser 'terms'</str><int name="code">400</int></lst>

What am I missing here? I'm using Solr 5.0.
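
(For reference, a sketch of how a query parser can be registered explicitly
in solrconfig.xml, in case the core's config does not pick it up by default;
the class name below is the standard one shipped with Solr:

<queryParser name="terms" class="solr.TermsQParserPlugin"/>

With that in place, the fq above should be handled by the terms parser.)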

Any pointers will be appreciated.

Regards,
Shamik