Re: Using Properties in /cdcr doesn't seem to work

2017-06-02 Thread Erick Erickson
You haven't really told us what you tried and what the failure was.

Is your problem getting the _configuration_ created or using the
system variables after they're created?

You need to tell us exactly _what_ you tried and exactly _how_ what
you tried didn't work. Details matter, particularly what came through
the Solr log files.

If your problem is using the Config API to set up that section of
solrconfig.xml, let's see the exact command you used. I suspect an
escaping issue, but there's no data to go on.
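
For reference, one way such properties are usually defined is with the Config
API's set-user-property command, roughly like this (a sketch; the URL,
collection name and value are placeholders):

  curl http://localhost:8983/solr/yourCollection/config -H 'Content-type:application/json' -d '{
    "set-user-property": {"TargetZk": "target-zk1:2181,target-zk2:2181/solr"}
  }'

and similarly for SourceCollection and TargetCollection. User properties set
that way are then available for ${...} substitution in solrconfig.xml, so the
exact JSON you sent (and any errors Solr logged) is what we'd need to see.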

Best,
Erick




On Fri, Jun 2, 2017 at 9:08 AM, Webster Homer  wrote:
> In the documentation for Solr cdcr there is an example of a source
> configuration that uses properties:
>
>   <lst name="replica">
>     <str name="zkHost">${TargetZk}</str>
>     <str name="source">${SourceCollection}</str>
>     <str name="target">${TargetCollection}</str>
>   </lst>
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462#CrossDataCenterReplication(CDCR)-InitialStartup
>
> I tried to configure cdcr for a collection to use properties. It always
> failed. The only thing that seems to work is the literal strings for the
> Target Zookeeper, source and target collection names.
>
> I used the Solr Config API to create the properties, but it just didn't
> work. It was a while back but I believe all it did was throw errors.
>
> Is there some way to make this work? If not it should be removed from the
> documentation
>
> This seems like it would be a useful feature, especially since CDCR doesn't
> support the use of aliases, SOLR-10679.
>


Re: Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud?

2017-06-02 Thread Erick Erickson
bq: fq value, say 20 char

Well, my guess here is that you're constructing a huge OR clause
(that's the usual case for such large fq clauses).

It's rare for such a clause to be generated identically very often. Do
you really expect to have this _exact_ clause created over and over
and over and over? Even one character's difference means the cache entry
won't be reused (even different orders: a clause like fq=id:(a OR b) will
not be reused for fq=id:(b OR a)).

So consider using the TermsQParserPlugin and set cache false for the fq clause.
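
For example (a sketch; the field name and values are illustrative), instead of

  fq=id:(doc1 OR doc2 OR doc3 ...)

something like

  fq={!terms f=id cache=false}doc1,doc2,doc3,...

avoids both parsing a huge boolean clause and filling the filterCache with
entries that will never be reused.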

Best,
Erick



On Fri, Jun 2, 2017 at 1:26 PM, Daniel Angelov  wrote:
> In this case, for example:
> http://host1:8983/solr/collName/admin/mbeans?stats=true
> returns us stats in the contex of the shard of "collName", living on host1,
> is not it?
>
> BR
> Daniel
>
> Am 02.06.2017 20:00 schrieb "Daniel Angelov" :
>
> Sorry for the typos in the previous mail, "fg" should be "fq"
>
> Am 02.06.2017 18:15 schrieb "Daniel Angelov" :
>
>> This means, that quering alias NNN pointing 3 collections, each 10 shards
>> and each 2 replicas, a query with very long fg value, say 20 char
>> string. First query with fq will cache all 20 chars 30 times (3 x 10
>> cores). The next query with the same fg, could not use the same cores as
>> the first time, i.e. could locate more mem in the unused replicas from the
>> first query. And in my case the soft commint is each 60 sec. this means a
>> lot of GC, is not it?
>>
>> BR
>> Daniel
>>
>> Am 02.06.2017 17:45 schrieb "Erick Erickson" :
>>
>>> bq: This means, if we have a collection with 2 replicas, there is a
>>> chance,
>>> that 2 queries with identical fq values can be served from different
>>> replicas of the same shards, this means, that the second query will not
>>> use
>>> the cached set from the first query, is not it?
>>>
>>> Yes. In practice autowarming is often used to pre-warm the caches, but
>>> again that's local to each replica, i.e. the fqs used to autowarm
>>> replica1 or shard1 may be different than the ones used to autowarm
>>> replica2 of shard1. What tends to happen is that the replicas "level
>>> out". Any fq clause that's common enough to be useful eventually hits
>>> all the replicas. And the most common ones are run during autowarming
>>> since it's an LRU queue.
>>>
>>> To understand why there isn't a common cache, consider that the
>>> filterCache is conceptually a map. The key is the fq clause and the
>>> value is a bitset where each bit corresponds to the _internal_ Lucene
>>> document ID which is just an integer 0-maxDoc. There are two critical
>>> points here:
>>>
>>> 1> the internal ID changes when segments are merged
>>> 2> different replicas will have different _internal_ ids for the same
>>> document. By "same" here I mean have the same <uniqueKey>.
>>>
>>> So completely sidestepping the question of the propagation delays of
>>> trying to consult some kind of central filterCache, the nature of that
>>> cache is such that you couldn't share it between replicas anyway.
>>>
>>> Best,
>>> Erick
>>>
>>> On Fri, Jun 2, 2017 at 8:31 AM, Daniel Angelov 
>>> wrote:
>>> > Thanks for the answer!
>>> > This means, if we have a collection with 2 replicas, there is a chance,
>>> > that 2 queries with identical fq values can be served from different
>>> > replicas of the same shards, this means, that the second query will not
>>> use
>>> > the cached set from the first query, is not it?
>>> >
>>> > Thanks
>>> > Daniel
>>> >
>>> > Am 02.06.2017 15:32 schrieb "Susheel Kumar" :
>>> >
>>> >> Thanks for the correction Shawn.  Yes its only the heap allocation
>>> settings
>>> >> are per host/JVM.
>>> >>
>>> >> On Fri, Jun 2, 2017 at 9:23 AM, Shawn Heisey 
>>> wrote:
>>> >>
>>> >> > On 6/1/2017 11:40 PM, Daniel Angelov wrote:
>>> >> > > Is the filter cache separate for each host and then for each
>>> >> > > collection and then for each shard and then for each replica in
>>> >> > > SolrCloud? For example, on host1 we have, coll1 shard1 replica1 and
>>> >> > > coll2 shard1 replica1, on host2 we have, coll1 shard2 replica2 and
>>> >> > > coll2 shard2 replica2. Does this mean, that we have 4 filter
>>> caches,
>>> >> > > i.e. separate memory for each core? If they are separated and for
>>> >> > > example, query1 is handling from coll1 shard1 replica1 and 1 sec
>>> later
>>> >> > > the same query is handling from coll2 shard1 replica1, this means,
>>> >> > > that the later query will not use the result set cached from the
>>> first
>>> >> > > query...
>>> >> >
>>> >> > That is correct.
>>> >> >
>>> >> > General notes about SolrCloud terminology: SolrCloud is organized
>>> around
>>> >> > collections.  Collections are made up of one or more shards.  Shards
>>> are
>>> >> > made up of one or more replicas.  Each replica is a Solr core.  A
>>> core
>>> >> > contains one Lucene index. 

Re: Steps for building solr/lucene code and starting server

2017-06-02 Thread Erick Erickson
You can just put a <lib> directive in your solrconfig.xml file
that points to the jar in analysis-extras.

I generally prefer that to copying things around on the theory that
it's one less thing to forget to copy sometime later...
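
For example, something along these lines in solrconfig.xml (the paths assume
the stock distribution/contrib layout; adjust them to wherever the jars
actually live in your checkout):

  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />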

Best,
Erick

On Fri, Jun 2, 2017 at 5:05 PM, Nawab Zada Asad Iqbal  wrote:
> When I do 'ant server', the libs from "./build/lucene-libs/" are copied
> over to "./server/solr-webapp/webapp/WEB-INF/lib/" . However, my required
> class is in a lib which is on:
> "./build/contrib/solr-analysis-extras/lucene-libs/"
>
> I guess, I need to do the contrib target?
>
>
> On Fri, Jun 2, 2017 at 4:20 PM, Nawab Zada Asad Iqbal 
> wrote:
>
>> Hi Erick
>>
>> "bin/solr start -e techproducts" works fine. It is probably because it is
>> not referring to 'org.apache.lucene.analysis.ic
>> u.ICUNormalizer2CharFilterFactory' in the schema.xml ?
>>
>> I am not sure what should I try. I am wondering if there is some document
>> about solr dev setup.
>>
>>
>> On Fri, Jun 2, 2017 at 8:29 AM, Erick Erickson 
>> wrote:
>>
>>> "ant server" should be sufficient. "dist" is useful for when
>>> you have custom _external_ programs (say SolrJ) that you
>>> want all the libraries collected in the same place. There's
>>> no need to "ant compile" as the "server" target
>>>
>>> I assume what you're seeing is a ClassNotFound error, right?
>>> I'm a bit puzzled since that filter isn't a contrib, so it should
>>> be found.
>>>
>>> What I'd do is just do the build first then start the example,
>>> "bin/solr start -e techproducts"
>>> Don't specify solrhome or anything else. Once that works,
>>> build up from there.
>>>
>>> Best,
>>> Erick
>>>
>>> On Fri, Jun 2, 2017 at 3:15 AM, Nawab Zada Asad Iqbal 
>>> wrote:
>>> > Hi,
>>> >
>>> > I have synced lucene-solr repo because I (will) have some custom code in
>>> > lucene and solr folders. What are the steps for starting solr server? My
>>> > schema.xml uses ICUNormalizer2CharFilterFactory (which I see in lucene
>>> > folder tree), but I don't know how to make it work with solr webapp. I
>>> know
>>> > the (luncene ant
>>> > target) 'compile',  (solr targets) 'dist', and 'server', but the order
>>> is
>>> > not clear to me.
>>> >
>>> > I have compiled lucene before doing 'ant server' in solr folder, but I
>>> > still see this error when I do 'bin/solr start -f -s ~/solrhome/' :-
>>> >
>>> > Caused by: org.apache.solr.common.SolrException: Plugin init failure
>>> for
>>> > [schema.xml] fieldType "text": Plugin init failure for [schema.xml]
>>> > analyzer/charFilter "nfkc": Error loading class
>>> > 'org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory'
>>> >
>>> >
>>> >
>>> > Thanks
>>> > Nawab
>>>
>>
>>


Re: Steps for building solr/lucene code and starting server

2017-06-02 Thread Nawab Zada Asad Iqbal
When I do 'ant server', the libs from "./build/lucene-libs/" are copied
over to "./server/solr-webapp/webapp/WEB-INF/lib/" . However, my required
class is in a lib which is on:
"./build/contrib/solr-analysis-extras/lucene-libs/"

I guess, I need to do the contrib target?


On Fri, Jun 2, 2017 at 4:20 PM, Nawab Zada Asad Iqbal 
wrote:

> Hi Erick
>
> "bin/solr start -e techproducts" works fine. It is probably because it is
> not referring to 'org.apache.lucene.analysis.ic
> u.ICUNormalizer2CharFilterFactory' in the schema.xml ?
>
> I am not sure what should I try. I am wondering if there is some document
> about solr dev setup.
>
>
> On Fri, Jun 2, 2017 at 8:29 AM, Erick Erickson 
> wrote:
>
>> "ant server" should be sufficient. "dist" is useful for when
>> you have custom _external_ programs (say SolrJ) that you
>> want all the libraries collected in the same place. There's
>> no need to "ant compile" as the "server" target
>>
>> I assume what you're seeing is a ClassNotFound error, right?
>> I'm a bit puzzled since that filter isn't a contrib, so it should
>> be found.
>>
>> What I'd do is just do the build first then start the example,
>> "bin/solr start -e techproducts"
>> Don't specify solrhome or anything else. Once that works,
>> build up from there.
>>
>> Best,
>> Erick
>>
>> On Fri, Jun 2, 2017 at 3:15 AM, Nawab Zada Asad Iqbal 
>> wrote:
>> > Hi,
>> >
>> > I have synced lucene-solr repo because I (will) have some custom code in
>> > lucene and solr folders. What are the steps for starting solr server? My
>> > schema.xml uses ICUNormalizer2CharFilterFactory (which I see in lucene
>> > folder tree), but I don't know how to make it work with solr webapp. I
>> know
>> > the (luncene ant
>> > target) 'compile',  (solr targets) 'dist', and 'server', but the order
>> is
>> > not clear to me.
>> >
>> > I have compiled lucene before doing 'ant server' in solr folder, but I
>> > still see this error when I do 'bin/solr start -f -s ~/solrhome/' :-
>> >
>> > Caused by: org.apache.solr.common.SolrException: Plugin init failure
>> for
>> > [schema.xml] fieldType "text": Plugin init failure for [schema.xml]
>> > analyzer/charFilter "nfkc": Error loading class
>> > 'org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory'
>> >
>> >
>> >
>> > Thanks
>> > Nawab
>>
>
>


Re: Steps for building solr/lucene code and starting server

2017-06-02 Thread Nawab Zada Asad Iqbal
Hi Erick

"bin/solr start -e techproducts" works fine. It is probably because it is
not referring to 'org.apache.lucene.analysis.ic
u.ICUNormalizer2CharFilterFactory' in the schema.xml ?

I am not sure what should I try. I am wondering if there is some document
about solr dev setup.


On Fri, Jun 2, 2017 at 8:29 AM, Erick Erickson 
wrote:

> "ant server" should be sufficient. "dist" is useful for when
> you have custom _external_ programs (say SolrJ) that you
> want all the libraries collected in the same place. There's
> no need to "ant compile" as the "server" target
>
> I assume what you're seeing is a ClassNotFound error, right?
> I'm a bit puzzled since that filter isn't a contrib, so it should
> be found.
>
> What I'd do is just do the build first then start the example,
> "bin/solr start -e techproducts"
> Don't specify solrhome or anything else. Once that works,
> build up from there.
>
> Best,
> Erick
>
> On Fri, Jun 2, 2017 at 3:15 AM, Nawab Zada Asad Iqbal 
> wrote:
> > Hi,
> >
> > I have synced lucene-solr repo because I (will) have some custom code in
> > lucene and solr folders. What are the steps for starting solr server? My
> > schema.xml uses ICUNormalizer2CharFilterFactory (which I see in lucene
> > folder tree), but I don't know how to make it work with solr webapp. I
> know
> > the (luncene ant
> > target) 'compile',  (solr targets) 'dist', and 'server', but the order is
> > not clear to me.
> >
> > I have compiled lucene before doing 'ant server' in solr folder, but I
> > still see this error when I do 'bin/solr start -f -s ~/solrhome/' :-
> >
> > Caused by: org.apache.solr.common.SolrException: Plugin init failure for
> > [schema.xml] fieldType "text": Plugin init failure for [schema.xml]
> > analyzer/charFilter "nfkc": Error loading class
> > 'org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory'
> >
> >
> >
> > Thanks
> > Nawab
>


Re: Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud?

2017-06-02 Thread Daniel Angelov
In this case, for example:
http://host1:8983/solr/collName/admin/mbeans?stats=true
returns us stats in the context of the shard of "collName", living on host1,
is not it?

BR
Daniel

Am 02.06.2017 20:00 schrieb "Daniel Angelov" :

Sorry for the typos in the previous mail, "fg" should be "fq"

Am 02.06.2017 18:15 schrieb "Daniel Angelov" :

> This means, that quering alias NNN pointing 3 collections, each 10 shards
> and each 2 replicas, a query with very long fg value, say 20 char
> string. First query with fq will cache all 20 chars 30 times (3 x 10
> cores). The next query with the same fg, could not use the same cores as
> the first time, i.e. could locate more mem in the unused replicas from the
> first query. And in my case the soft commint is each 60 sec. this means a
> lot of GC, is not it?
>
> BR
> Daniel
>
> Am 02.06.2017 17:45 schrieb "Erick Erickson" :
>
>> bq: This means, if we have a collection with 2 replicas, there is a
>> chance,
>> that 2 queries with identical fq values can be served from different
>> replicas of the same shards, this means, that the second query will not
>> use
>> the cached set from the first query, is not it?
>>
>> Yes. In practice autowarming is often used to pre-warm the caches, but
>> again that's local to each replica, i.e. the fqs used to autowarm
>> replica1 or shard1 may be different than the ones used to autowarm
>> replica2 of shard1. What tends to happen is that the replicas "level
>> out". Any fq clause that's common enough to be useful eventually hits
>> all the replicas. And the most common ones are run during autowarming
>> since it's an LRU queue.
>>
>> To understand why there isn't a common cache, consider that the
>> filterCache is conceptually a map. The key is the fq clause and the
>> value is a bitset where each bit corresponds to the _internal_ Lucene
>> document ID which is just an integer 0-maxDoc. There are two critical
>> points here:
>>
>> 1> the internal ID changes when segments are merged
>> 2> different replicas will have different _internal_ ids for the same
>> document. By "same" here I mean have the same <uniqueKey>.
>>
>> So completely sidestepping the question of the propagation delays of
>> trying to consult some kind of central filterCache, the nature of that
>> cache is such that you couldn't share it between replicas anyway.
>>
>> Best,
>> Erick
>>
>> On Fri, Jun 2, 2017 at 8:31 AM, Daniel Angelov 
>> wrote:
>> > Thanks for the answer!
>> > This means, if we have a collection with 2 replicas, there is a chance,
>> > that 2 queries with identical fq values can be served from different
>> > replicas of the same shards, this means, that the second query will not
>> use
>> > the cached set from the first query, is not it?
>> >
>> > Thanks
>> > Daniel
>> >
>> > Am 02.06.2017 15:32 schrieb "Susheel Kumar" :
>> >
>> >> Thanks for the correction Shawn.  Yes its only the heap allocation
>> settings
>> >> are per host/JVM.
>> >>
>> >> On Fri, Jun 2, 2017 at 9:23 AM, Shawn Heisey 
>> wrote:
>> >>
>> >> > On 6/1/2017 11:40 PM, Daniel Angelov wrote:
>> >> > > Is the filter cache separate for each host and then for each
>> >> > > collection and then for each shard and then for each replica in
>> >> > > SolrCloud? For example, on host1 we have, coll1 shard1 replica1 and
>> >> > > coll2 shard1 replica1, on host2 we have, coll1 shard2 replica2 and
>> >> > > coll2 shard2 replica2. Does this mean, that we have 4 filter
>> caches,
>> >> > > i.e. separate memory for each core? If they are separated and for
>> >> > > example, query1 is handling from coll1 shard1 replica1 and 1 sec
>> later
>> >> > > the same query is handling from coll2 shard1 replica1, this means,
>> >> > > that the later query will not use the result set cached from the
>> first
>> >> > > query...
>> >> >
>> >> > That is correct.
>> >> >
>> >> > General notes about SolrCloud terminology: SolrCloud is organized
>> around
>> >> > collections.  Collections are made up of one or more shards.  Shards
>> are
>> >> > made up of one or more replicas.  Each replica is a Solr core.  A
>> core
>> >> > contains one Lucene index.  It is not correct to say that a shard
>> has no
>> >> > replicas.  The leader *is* a replica.  If you have a leader and one
>> >> > follower, the shard has two replicas.
>> >> >
>> >> > Solr caches (including filterCache) exist at the core level, they
>> have
>> >> > no knowledge of other replicas, other shards, or the collection as a
>> >> > whole.  Susheel says that the caches are per host/JVM -- that's not
>> >> > correct.  Every Solr core in a JVM has separate caches, if they are
>> >> > defined in the configuration for that core.
>> >> >
>> >> > Your query scenario has even more separation -- it asks about
>> querying
>> >> > two completely different collections, which don't use the same cores.
>> >> >
>> >> > Thanks,
>> >> 

Re: Upgrading config from 4.5.0 to 6.5.1

2017-06-02 Thread Tony Wang
Hi Nawab,
We did it exactly the way Rick recommended.  When you apply your changes
from your old configs on top of the originals, it will give you errors for
incompatible settings.  For example, in the "text_general_edge_ngram"
fieldType setting, side="front" is no longer a valid attribute.
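
For example (the gram sizes here are illustrative), a 4.x analyzer filter like

  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>

has to become

  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>

since front-edge grams are now the only behavior and the side attribute is rejected.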

Tony


On Wed, May 31, 2017 at 3:53 PM, Rick Leir  wrote:

> Hi Nawab
> The recommended way is to use the new version of solrconfig.xml and apply
> your modifications to it. You will want to go through it looking for
> developments that would affect you.
> Cheers
> Rick
>
> On May 31, 2017 3:45:58 PM EDT, Nawab Zada Asad Iqbal 
> wrote:
> >Hi,
> >
> >I am upgrading 4.5.0 to latest stable bits and wondering what will be
> >the
> >quickest way to find out any obsolete or deprecated settings in config
> >files?
> >If I run the latest server with my old config (solr.xml,
> >solrconfig.xml,
> >schema.xml) files, will it warn for deprecated/less-optimal values?
> >
> >
> >Thanks
> >Nawab
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com




Re: Learn To Rank Questions

2017-06-02 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
Hi, 
Sorry for the delay, here are my replies: 

1. I'm not yet a spark user (but I'm working on that :)) 

2. I'm not sure I understand how you would use a feature that is not a float
in a model; in my experience all the learning-to-rank methods train and
predict from a list of floats. Could you provide more details on how you would
use TF-IDF vectors?

3. I've never played with payloads, but if you can access them from the
IndexReader then you can write a Java class extending Feature and return them.
If you want to boost certain documents at query time, you can use the efi
parameter, which allows you to inject parameters at query time.
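
For example (a sketch; the feature, field and model names are illustrative), a
feature in the feature store can reference an external parameter:

  {
    "name"  : "titleMatchUserQuery",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!field f=title}${userQuery}" }
  }

and the value is supplied per request via efi on the rerank query:

  rq={!ltr model=myModel reRankDocs=100 efi.userQuery='ipod'}&fl=id,score,[features]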

4. Thanks, that's a good point, we must provide an example. I'll work on that. 

Best,
Diego


From: solr-user@lucene.apache.org At: 05/15/17 13:30:06
To: solr-user@lucene.apache.org
Subject: Re: Learn To Rank Questions

1.
So I think it is a spark problem first (https://issues.apache.org/jir
a/browse/SPARK-10413). What we can do is to create our own model (cf
https://github.com/apache/lucene-solr/tree/master/solr/contr
ib/ltr/src/java/org/apache/solr/ltr/model) that applies the prediction, it
should be easy to do for a simple model, like logistic regression.
For PMML, the idea would also be to implement a Model that reuse a java lib
able to apply PMML.

2.
This function query gives you TF IDF of textField vs userQuery for the doc

 {!edismax qf='textField' mm=100% v=${userQuery} tie=0.1}

Also it seems to me LTR only allows float features which is a limitation.


3.
If the boost value is an index time boost I don't think it is possible. You
could put the feature you want in a field at index time and then use
FieldValueFeature
to extract it.

On Thu, May 11, 2017 at 8:16 PM, Grant Ingersoll 
wrote:

> Hi,
>
> Just getting up to speed on LTR and have a few questions (most of which are
> speculative at this point and exploratory, as I have a couple of talks
> coming up on this and other relevance features):
>
> 1. Has anyone looked at what's involved with supporting SparkML or other
> models (e.g. PMML)?
>
> 2. Has anyone looked at features for text?  i.e. returning TF-IDF vectors
> or similar.  FieldValueFeature is kind of like this, but I might want
> weights for the terms, not just the actual values.  I could get this via
> term vectors, but then it doesn't fit the framework.
>
> 3. How about payloads and/or things like boost values for documents as
> features?
>
> 4. Are there example docs of training and using the
> MultipleAdditiveTreesModel?  I see unit tests for them, but looking for
> something similar to the python script in the example dir.
>
> On 2 and 3, I imagine some of this can be done creatively via the
> SolrFeature and function queries.
>
> Thanks,
> Grant
>




Re: Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud?

2017-06-02 Thread Daniel Angelov
Sorry for the typos in the previous mail, "fg" should be "fq"

Am 02.06.2017 18:15 schrieb "Daniel Angelov" :

> This means, that quering alias NNN pointing 3 collections, each 10 shards
> and each 2 replicas, a query with very long fg value, say 20 char
> string. First query with fq will cache all 20 chars 30 times (3 x 10
> cores). The next query with the same fg, could not use the same cores as
> the first time, i.e. could locate more mem in the unused replicas from the
> first query. And in my case the soft commint is each 60 sec. this means a
> lot of GC, is not it?
>
> BR
> Daniel
>
> Am 02.06.2017 17:45 schrieb "Erick Erickson" :
>
>> bq: This means, if we have a collection with 2 replicas, there is a
>> chance,
>> that 2 queries with identical fq values can be served from different
>> replicas of the same shards, this means, that the second query will not
>> use
>> the cached set from the first query, is not it?
>>
>> Yes. In practice autowarming is often used to pre-warm the caches, but
>> again that's local to each replica, i.e. the fqs used to autowarm
>> replica1 or shard1 may be different than the ones used to autowarm
>> replica2 of shard1. What tends to happen is that the replicas "level
>> out". Any fq clause that's common enough to be useful eventually hits
>> all the replicas. And the most common ones are run during autowarming
>> since it's an LRU queue.
>>
>> To understand why there isn't a common cache, consider that the
>> filterCache is conceptually a map. The key is the fq clause and the
>> value is a bitset where each bit corresponds to the _internal_ Lucene
>> document ID which is just an integer 0-maxDoc. There are two critical
>> points here:
>>
>> 1> the internal ID changes when segments are merged
>> 2> different replicas will have different _internal_ ids for the same
>> document. By "same" here I mean have the same <uniqueKey>.
>>
>> So completely sidestepping the question of the propagation delays of
>> trying to consult some kind of central filterCache, the nature of that
>> cache is such that you couldn't share it between replicas anyway.
>>
>> Best,
>> Erick
>>
>> On Fri, Jun 2, 2017 at 8:31 AM, Daniel Angelov 
>> wrote:
>> > Thanks for the answer!
>> > This means, if we have a collection with 2 replicas, there is a chance,
>> > that 2 queries with identical fq values can be served from different
>> > replicas of the same shards, this means, that the second query will not
>> use
>> > the cached set from the first query, is not it?
>> >
>> > Thanks
>> > Daniel
>> >
>> > Am 02.06.2017 15:32 schrieb "Susheel Kumar" :
>> >
>> >> Thanks for the correction Shawn.  Yes its only the heap allocation
>> settings
>> >> are per host/JVM.
>> >>
>> >> On Fri, Jun 2, 2017 at 9:23 AM, Shawn Heisey 
>> wrote:
>> >>
>> >> > On 6/1/2017 11:40 PM, Daniel Angelov wrote:
>> >> > > Is the filter cache separate for each host and then for each
>> >> > > collection and then for each shard and then for each replica in
>> >> > > SolrCloud? For example, on host1 we have, coll1 shard1 replica1 and
>> >> > > coll2 shard1 replica1, on host2 we have, coll1 shard2 replica2 and
>> >> > > coll2 shard2 replica2. Does this mean, that we have 4 filter
>> caches,
>> >> > > i.e. separate memory for each core? If they are separated and for
>> >> > > example, query1 is handling from coll1 shard1 replica1 and 1 sec
>> later
>> >> > > the same query is handling from coll2 shard1 replica1, this means,
>> >> > > that the later query will not use the result set cached from the
>> first
>> >> > > query...
>> >> >
>> >> > That is correct.
>> >> >
>> >> > General notes about SolrCloud terminology: SolrCloud is organized
>> around
>> >> > collections.  Collections are made up of one or more shards.  Shards
>> are
>> >> > made up of one or more replicas.  Each replica is a Solr core.  A
>> core
>> >> > contains one Lucene index.  It is not correct to say that a shard
>> has no
>> >> > replicas.  The leader *is* a replica.  If you have a leader and one
>> >> > follower, the shard has two replicas.
>> >> >
>> >> > Solr caches (including filterCache) exist at the core level, they
>> have
>> >> > no knowledge of other replicas, other shards, or the collection as a
>> >> > whole.  Susheel says that the caches are per host/JVM -- that's not
>> >> > correct.  Every Solr core in a JVM has separate caches, if they are
>> >> > defined in the configuration for that core.
>> >> >
>> >> > Your query scenario has even more separation -- it asks about
>> querying
>> >> > two completely different collections, which don't use the same cores.
>> >> >
>> >> > Thanks,
>> >> > Shawn
>> >> >
>> >> >
>> >>
>>
>


Re: Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud?

2017-06-02 Thread Daniel Angelov
This means, that quering alias NNN pointing 3 collections, each 10 shards
and each 2 replicas, a query with very long fg value, say 20 char
string. First query with fq will cache all 20 chars 30 times (3 x 10
cores). The next query with the same fg, could not use the same cores as
the first time, i.e. could locate more mem in the unused replicas from the
first query. And in my case the soft commint is each 60 sec. this means a
lot of GC, is not it?

BR
Daniel

Am 02.06.2017 17:45 schrieb "Erick Erickson" :

> bq: This means, if we have a collection with 2 replicas, there is a chance,
> that 2 queries with identical fq values can be served from different
> replicas of the same shards, this means, that the second query will not use
> the cached set from the first query, is not it?
>
> Yes. In practice autowarming is often used to pre-warm the caches, but
> again that's local to each replica, i.e. the fqs used to autowarm
> replica1 or shard1 may be different than the ones used to autowarm
> replica2 of shard1. What tends to happen is that the replicas "level
> out". Any fq clause that's common enough to be useful eventually hits
> all the replicas. And the most common ones are run during autowarming
> since it's an LRU queue.
>
> To understand why there isn't a common cache, consider that the
> filterCache is conceptually a map. The key is the fq clause and the
> value is a bitset where each bit corresponds to the _internal_ Lucene
> document ID which is just an integer 0-maxDoc. There are two critical
> points here:
>
> 1> the internal ID changes when segments are merged
> 2> different replicas will have different _internal_ ids for the same
> document. By "same" here I mean have the same <uniqueKey>.
>
> So completely sidestepping the question of the propagation delays of
> trying to consult some kind of central filterCache, the nature of that
> cache is such that you couldn't share it between replicas anyway.
>
> Best,
> Erick
>
> On Fri, Jun 2, 2017 at 8:31 AM, Daniel Angelov 
> wrote:
> > Thanks for the answer!
> > This means, if we have a collection with 2 replicas, there is a chance,
> > that 2 queries with identical fq values can be served from different
> > replicas of the same shards, this means, that the second query will not
> use
> > the cached set from the first query, is not it?
> >
> > Thanks
> > Daniel
> >
> > Am 02.06.2017 15:32 schrieb "Susheel Kumar" :
> >
> >> Thanks for the correction Shawn.  Yes its only the heap allocation
> settings
> >> are per host/JVM.
> >>
> >> On Fri, Jun 2, 2017 at 9:23 AM, Shawn Heisey 
> wrote:
> >>
> >> > On 6/1/2017 11:40 PM, Daniel Angelov wrote:
> >> > > Is the filter cache separate for each host and then for each
> >> > > collection and then for each shard and then for each replica in
> >> > > SolrCloud? For example, on host1 we have, coll1 shard1 replica1 and
> >> > > coll2 shard1 replica1, on host2 we have, coll1 shard2 replica2 and
> >> > > coll2 shard2 replica2. Does this mean, that we have 4 filter caches,
> >> > > i.e. separate memory for each core? If they are separated and for
> >> > > example, query1 is handling from coll1 shard1 replica1 and 1 sec
> later
> >> > > the same query is handling from coll2 shard1 replica1, this means,
> >> > > that the later query will not use the result set cached from the
> first
> >> > > query...
> >> >
> >> > That is correct.
> >> >
> >> > General notes about SolrCloud terminology: SolrCloud is organized
> around
> >> > collections.  Collections are made up of one or more shards.  Shards
> are
> >> > made up of one or more replicas.  Each replica is a Solr core.  A core
> >> > contains one Lucene index.  It is not correct to say that a shard has
> no
> >> > replicas.  The leader *is* a replica.  If you have a leader and one
> >> > follower, the shard has two replicas.
> >> >
> >> > Solr caches (including filterCache) exist at the core level, they have
> >> > no knowledge of other replicas, other shards, or the collection as a
> >> > whole.  Susheel says that the caches are per host/JVM -- that's not
> >> > correct.  Every Solr core in a JVM has separate caches, if they are
> >> > defined in the configuration for that core.
> >> >
> >> > Your query scenario has even more separation -- it asks about querying
> >> > two completely different collections, which don't use the same cores.
> >> >
> >> > Thanks,
> >> > Shawn
> >> >
> >> >
> >>
>


Using Properties in /cdcr doesn't seem to work

2017-06-02 Thread Webster Homer
In the documentation for Solr cdcr there is an example of a source
configuration that uses properties:
   
  <lst name="replica">
    <str name="zkHost">${TargetZk}</str>
    <str name="source">${SourceCollection}</str>
    <str name="target">${TargetCollection}</str>
  </lst>

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462#CrossDataCenterReplication(CDCR)-InitialStartup

I tried to configure cdcr for a collection to use properties. It always
failed. The only thing that seems to work is the literal strings for the
Target Zookeeper, source and target collection names.

I used the Solr Config API to create the properties, but it just didn't
work. It was a while back but I believe all it did was throw errors.

Is there some way to make this work? If not it should be removed from the
documentation

This seems like it would be a useful feature, especially since CDCR doesn't
support the use of aliases, SOLR-10679.



Re: Can solrcloud be running on a read-only filesystem?

2017-06-02 Thread Erick Erickson
Mike:

That's one possibility. What I'm really asking for is to be sure that
there's a good reason (yours is one).

It's just that I've spent too much time in my life trying to get
something to work only to discover that it has marginal utility so I
like to ask "is this important enough to take time away from other
work you could be doing?" ;). If the answer is "yes", then.

The other possibility is to have multiple Solr instances reading one
Solr index, which is possible but risky. Having the filesystem R/O
would prevent multiple Solrs from writing to it, of course. If that's the
case we get to argue whether it's worth it or not ;)...

Best,
Erick

On Fri, Jun 2, 2017 at 8:47 AM, Mike Drob  wrote:
> To throw out one possibility, a read only file systems has no (low?)
> possibility of corruption. If you have a static index then you shouldn't
> need to be doing any recovery. Would still need to run ZK with RW
> filesystem, but mybe Solr could work?
>
> On Fri, Jun 2, 2017 at 10:15 AM, Erick Erickson 
> wrote:
>
>> As Susheel says, this is iffy, very iffy. You can disable tlogs
>> entirely through solrconfig.xml, you can _probably_
>> disable all of the Solr logging.
>>
>> You'd also have to _not_ run in SolrCloud. You say
>> "some of the nodes eventually are stuck in the recovering phase"
>> SolrCloud tries very hard to keep all of the replicas in sync.
>> To do this it _must_ be able to copy from the leader to the follower.
>> If it ever has to sync with the leader, it'll be stuck in recovery
>> as you can see.
>>
>> You could spend a lot of time trying to make this work, but
>> you haven't stated _why_ you want to. Perhaps there are
>> other ways to get the functionality you want.
>>
>> Best,
>> Erick
>>
>> On Fri, Jun 2, 2017 at 5:05 AM, Susheel Kumar 
>> wrote:
>> > I doubt it can run in readonly file system.  Even though there is no
>> > ingestion etc.  Solr still needs to write to logs/tlogs for synching /
>> > recovering etc
>> >
>> > Thnx
>> >
>> > On Fri, Jun 2, 2017 at 6:56 AM, Wudong Liu  wrote:
>> >
>> >> Hi All:
>> >>
>> >> We have a normal build/stage -> prod settings for our production
>> pipeline.
>> >> And we would build solr index in the build environment and then the
>> index
>> >> is copied to the prod environment.
>> >>
>> >> The solrcloud in prod seems working fine when the file system backing
>> it is
>> >> writable. However, we see many errors when the file system is readonly.
>> >> Many exceptions are thrown regarding the tlog file cannot be open for
>> write
>> >> when the solr nodes are restarted with the new data; some of the nodes
>> >> eventually are stuck in the recovering phase and never able to go back
>> >> online in the cloud.
>> >>
>> >> Just wondering is anyone has any experience on Solrcloud running in
>> >> readonly file system? Is it possible at all?
>> >>
>> >> Regards,
>> >> Wudong
>> >>
>>


Re: Can solrcloud be running on a read-only filesystem?

2017-06-02 Thread Mike Drob
To throw out one possibility, a read-only file system has no (low?)
possibility of corruption. If you have a static index then you shouldn't
need to be doing any recovery. You would still need to run ZK with a RW
filesystem, but maybe Solr could work?

On Fri, Jun 2, 2017 at 10:15 AM, Erick Erickson 
wrote:

> As Susheel says, this is iffy, very iffy. You can disable tlogs
> entirely through solrconfig.xml, you can _probably_
> disable all of the Solr logging.
>
> You'd also have to _not_ run in SolrCloud. You say
> "some of the nodes eventually are stuck in the recovering phase"
> SolrCloud tries very hard to keep all of the replicas in sync.
> To do this it _must_ be able to copy from the leader to the follower.
> If it ever has to sync with the leader, it'll be stuck in recovery
> as you can see.
>
> You could spend a lot of time trying to make this work, but
> you haven't stated _why_ you want to. Perhaps there are
> other ways to get the functionality you want.
>
> Best,
> Erick
>
> On Fri, Jun 2, 2017 at 5:05 AM, Susheel Kumar 
> wrote:
> > I doubt it can run in readonly file system.  Even though there is no
> > ingestion etc.  Solr still needs to write to logs/tlogs for synching /
> > recovering etc
> >
> > Thnx
> >
> > On Fri, Jun 2, 2017 at 6:56 AM, Wudong Liu  wrote:
> >
> >> Hi All:
> >>
> >> We have a normal build/stage -> prod settings for our production
> pipeline.
> >> And we would build solr index in the build environment and then the
> index
> >> is copied to the prod environment.
> >>
> >> The solrcloud in prod seems working fine when the file system backing
> it is
> >> writable. However, we see many errors when the file system is readonly.
> >> Many exceptions are thrown regarding the tlog file cannot be open for
> write
> >> when the solr nodes are restarted with the new data; some of the nodes
> >> eventually are stuck in the recovering phase and never able to go back
> >> online in the cloud.
> >>
> >> Just wondering is anyone has any experience on Solrcloud running in
> >> readonly file system? Is it possible at all?
> >>
> >> Regards,
> >> Wudong
> >>
>


Re: Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud?

2017-06-02 Thread Erick Erickson
bq: This means, if we have a collection with 2 replicas, there is a chance,
that 2 queries with identical fq values can be served from different
replicas of the same shards, this means, that the second query will not use
the cached set from the first query, is not it?

Yes. In practice autowarming is often used to pre-warm the caches, but
again that's local to each replica, i.e. the fqs used to autowarm
replica1 of shard1 may be different than the ones used to autowarm
replica2 of shard1. What tends to happen is that the replicas "level
out". Any fq clause that's common enough to be useful eventually hits
all the replicas. And the most common ones are run during autowarming
since it's an LRU queue.

To understand why there isn't a common cache, consider that the
filterCache is conceptually a map. The key is the fq clause and the
value is a bitset where each bit corresponds to the _internal_ Lucene
document ID which is just an integer 0-maxDoc. There are two critical
points here:

1> the internal ID changes when segments are merged
2> different replicas will have different _internal_ ids for the same
document. By "same" here I mean have the same <uniqueKey>.

So completely sidestepping the question of the propagation delays of
trying to consult some kind of central filterCache, the nature of that
cache is such that you couldn't share it between replicas anyway.
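
To make that concrete, here is a purely conceptual sketch (not Solr's actual
implementation) of what each core's filterCache looks like; the class name and
sizes are illustrative:

  import java.util.LinkedHashMap;
  import java.util.Map;
  import org.apache.lucene.util.FixedBitSet;

  // Each core keeps its own bounded, LRU-ish map from the fq string to a
  // bitset over that core's *internal* doc IDs (0..maxDoc-1).  Internal IDs
  // differ between replicas and change on segment merges, so an entry built
  // on one replica is meaningless anywhere else.
  public class ConceptualFilterCache extends LinkedHashMap<String, FixedBitSet> {
    private final int maxEntries;

    public ConceptualFilterCache(int maxEntries) {
      super(16, 0.75f, true);          // access-order iteration gives LRU behavior
      this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, FixedBitSet> eldest) {
      return size() > maxEntries;      // evict the least-recently-used fq entry
    }
  }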

Best,
Erick

On Fri, Jun 2, 2017 at 8:31 AM, Daniel Angelov  wrote:
> Thanks for the answer!
> This means, if we have a collection with 2 replicas, there is a chance,
> that 2 queries with identical fq values can be served from different
> replicas of the same shards, this means, that the second query will not use
> the cached set from the first query, is not it?
>
> Thanks
> Daniel
>
> Am 02.06.2017 15:32 schrieb "Susheel Kumar" :
>
>> Thanks for the correction Shawn.  Yes its only the heap allocation settings
>> are per host/JVM.
>>
>> On Fri, Jun 2, 2017 at 9:23 AM, Shawn Heisey  wrote:
>>
>> > On 6/1/2017 11:40 PM, Daniel Angelov wrote:
>> > > Is the filter cache separate for each host and then for each
>> > > collection and then for each shard and then for each replica in
>> > > SolrCloud? For example, on host1 we have, coll1 shard1 replica1 and
>> > > coll2 shard1 replica1, on host2 we have, coll1 shard2 replica2 and
>> > > coll2 shard2 replica2. Does this mean, that we have 4 filter caches,
>> > > i.e. separate memory for each core? If they are separated and for
>> > > example, query1 is handling from coll1 shard1 replica1 and 1 sec later
>> > > the same query is handling from coll2 shard1 replica1, this means,
>> > > that the later query will not use the result set cached from the first
>> > > query...
>> >
>> > That is correct.
>> >
>> > General notes about SolrCloud terminology: SolrCloud is organized around
>> > collections.  Collections are made up of one or more shards.  Shards are
>> > made up of one or more replicas.  Each replica is a Solr core.  A core
>> > contains one Lucene index.  It is not correct to say that a shard has no
>> > replicas.  The leader *is* a replica.  If you have a leader and one
>> > follower, the shard has two replicas.
>> >
>> > Solr caches (including filterCache) exist at the core level, they have
>> > no knowledge of other replicas, other shards, or the collection as a
>> > whole.  Susheel says that the caches are per host/JVM -- that's not
>> > correct.  Every Solr core in a JVM has separate caches, if they are
>> > defined in the configuration for that core.
>> >
>> > Your query scenario has even more separation -- it asks about querying
>> > two completely different collections, which don't use the same cores.
>> >
>> > Thanks,
>> > Shawn
>> >
>> >
>>


Re: Configuration of parallel indexing threads

2017-06-02 Thread Erick Erickson
that's pretty much my strategy.

I'll add parenthetically that I often see the bottleneck for indexing
to be acquiring the data from the system of record in the first place
rather than Solr. Assuming you're using SolrJ, an easy test is to
comment out the line that sends to Solr. There's usually some kind of
loop like:

while (more docs) {
gather 1,000 docs into a list
cloudSolrClient.add(docList);
docList.clear()
}

So just comment out the cloudSolrClient.add line. I've seen situations
where the program still takes 95% of the time it takes to actually
index to Solr, in which case you need to focus on getting the data in
the first place.

And you need to batch updates, see:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
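
A minimal SolrJ (6.x-era) sketch of that batched pattern follows; the zkHost,
collection name and document source are placeholders, and the add() call is
the line to comment out for the timing experiment described above:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexer {
    // "source" stands in for however you read docs from the system of record.
    public static void index(Iterable<SolrInputDocument> source) throws Exception {
      try (CloudSolrClient client = new CloudSolrClient.Builder()
              .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr").build()) {
        client.setDefaultCollection("myCollection");   // placeholder collection
        List<SolrInputDocument> docList = new ArrayList<>(1000);
        for (SolrInputDocument doc : source) {
          docList.add(doc);
          if (docList.size() >= 1000) {                // batch of ~1,000 docs
            client.add(docList);  // comment out to measure pure acquisition time
            docList.clear();
          }
        }
        if (!docList.isEmpty()) {
          client.add(docList);                         // flush the last partial batch
        }
        client.commit();
      }
    }
  }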

Good Luck!
Erick

On Fri, Jun 2, 2017 at 2:59 AM, gigo314  wrote:
> Thanks for the replies. Just to confirm that I got it right:
> 1. Since there is no setting to control index writers, is it fair to assume
> that Solr always indexes at maximum possible speed?
> 2. The way to control write speed is to control number of clients that are
> simultaneously posting data, right?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Configuration-of-parallel-indexing-threads-tp4338466p4338599.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud?

2017-06-02 Thread Daniel Angelov
Thanks for the answer!
This means, if we have a collection with 2 replicas, there is a chance,
that 2 queries with identical fq values can be served from different
replicas of the same shards, this means, that the second query will not use
the cached set from the first query, is not it?

Thanks
Daniel

Am 02.06.2017 15:32 schrieb "Susheel Kumar" :

> Thanks for the correction Shawn.  Yes its only the heap allocation settings
> are per host/JVM.
>
> On Fri, Jun 2, 2017 at 9:23 AM, Shawn Heisey  wrote:
>
> > On 6/1/2017 11:40 PM, Daniel Angelov wrote:
> > > Is the filter cache separate for each host and then for each
> > > collection and then for each shard and then for each replica in
> > > SolrCloud? For example, on host1 we have, coll1 shard1 replica1 and
> > > coll2 shard1 replica1, on host2 we have, coll1 shard2 replica2 and
> > > coll2 shard2 replica2. Does this mean, that we have 4 filter caches,
> > > i.e. separate memory for each core? If they are separated and for
> > > example, query1 is handling from coll1 shard1 replica1 and 1 sec later
> > > the same query is handling from coll2 shard1 replica1, this means,
> > > that the later query will not use the result set cached from the first
> > > query...
> >
> > That is correct.
> >
> > General notes about SolrCloud terminology: SolrCloud is organized around
> > collections.  Collections are made up of one or more shards.  Shards are
> > made up of one or more replicas.  Each replica is a Solr core.  A core
> > contains one Lucene index.  It is not correct to say that a shard has no
> > replicas.  The leader *is* a replica.  If you have a leader and one
> > follower, the shard has two replicas.
> >
> > Solr caches (including filterCache) exist at the core level, they have
> > no knowledge of other replicas, other shards, or the collection as a
> > whole.  Susheel says that the caches are per host/JVM -- that's not
> > correct.  Every Solr core in a JVM has separate caches, if they are
> > defined in the configuration for that core.
> >
> > Your query scenario has even more separation -- it asks about querying
> > two completely different collections, which don't use the same cores.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Steps for building solr/lucene code and starting server

2017-06-02 Thread Erick Erickson
"ant server" should be sufficient. "dist" is useful for when
you have custom _external_ programs (say SolrJ) that you
want all the libraries collected in the same place. There's
no need to "ant compile" as the "server" target

I assume what you're seeing is a ClassNotFound error, right?
I'm a bit puzzled since that filter isn't a contrib, so it should
be found.

What I'd do is just do the build first then start the example,
"bin/solr start -e techproducts"
Don't specify solrhome or anything else. Once that works,
build up from there.

Best,
Erick

On Fri, Jun 2, 2017 at 3:15 AM, Nawab Zada Asad Iqbal  wrote:
> Hi,
>
> I have synced lucene-solr repo because I (will) have some custom code in
> lucene and solr folders. What are the steps for starting solr server? My
> schema.xml uses ICUNormalizer2CharFilterFactory (which I see in lucene
> folder tree), but I don't know how to make it work with solr webapp. I know
> the (luncene ant
> target) 'compile',  (solr targets) 'dist', and 'server', but the order is
> not clear to me.
>
> I have compiled lucene before doing 'ant server' in solr folder, but I
> still see this error when I do 'bin/solr start -f -s ~/solrhome/' :-
>
> Caused by: org.apache.solr.common.SolrException: Plugin init failure for
> [schema.xml] fieldType "text": Plugin init failure for [schema.xml]
> analyzer/charFilter "nfkc": Error loading class
> 'org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory'
>
>
>
> Thanks
> Nawab


Re: Number of requests spike up, when i do the delta Import.

2017-06-02 Thread Erick Erickson
A similar pattern should work with .NET; all that's
necessary is a JDBC driver for connecting to the database
and a connection to a Solr node.

SolrNet will not be as performant as SolrJ, I'd guess,
since there's no equivalent to CloudSolrClient. You
can still use SolrNet; any connection to any Solr node
will "do the right thing".

Best,
Erick

On Fri, Jun 2, 2017 at 4:01 AM, Rick Leir  wrote:
> Vrin
> We had a good speedup from enabling a SQL cache. You also need to avoid 
> updating the DB tables so the cache does not get flushed.
> Cheers -- Rick
>
> On June 2, 2017 4:49:20 AM EDT, vrindavda  wrote:
>>Thanks Erick ,
>>
>>Could you please suggest some alternative to go with SolrNET.
>>
>>@jlman, I tried your way, that do reduces the number of request, but
>>delta-import still take longer than full-import. There is no
>>improvement in
>>performance.
>>
>>
>>
>>--
>>View this message in context:
>>http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-tp4338162p4338591.html
>>Sent from the Solr - User mailing list archive at Nabble.com.
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: Can solrcloud be running on a read-only filesystem?

2017-06-02 Thread Erick Erickson
As Susheel says, this is iffy, very iffy. You can disable tlogs
entirely through solrconfig.xml, you can _probably_
disable all of the Solr logging.
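
For the tlog part, that is typically done by leaving the updateLog element out
of the updateHandler section of solrconfig.xml, roughly (a sketch of the stock
config with the element commented out):

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- no transaction log: omit the updateLog element entirely
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    -->
  </updateHandler>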

You'd also have to _not_ run in SolrCloud. You say
"some of the nodes eventually are stuck in the recovering phase"
SolrCloud tries very hard to keep all of the replicas in sync.
To do this it _must_ be able to copy from the leader to the follower.
If it ever has to sync with the leader, it'll be stuck in recovery
as you can see.

You could spend a lot of time trying to make this work, but
you haven't stated _why_ you want to. Perhaps there are
other ways to get the functionality you want.

Best,
Erick

On Fri, Jun 2, 2017 at 5:05 AM, Susheel Kumar  wrote:
> I doubt it can run in readonly file system.  Even though there is no
> ingestion etc.  Solr still needs to write to logs/tlogs for synching /
> recovering etc
>
> Thnx
>
> On Fri, Jun 2, 2017 at 6:56 AM, Wudong Liu  wrote:
>
>> Hi All:
>>
>> We have a normal build/stage -> prod settings for our production pipeline.
>> And we would build solr index in the build environment and then the index
>> is copied to the prod environment.
>>
>> The solrcloud in prod seems working fine when the file system backing it is
>> writable. However, we see many errors when the file system is readonly.
>> Many exceptions are thrown regarding the tlog file cannot be open for write
>> when the solr nodes are restarted with the new data; some of the nodes
>> eventually are stuck in the recovering phase and never able to go back
>> online in the cloud.
>>
>> Just wondering is anyone has any experience on Solrcloud running in
>> readonly file system? Is it possible at all?
>>
>> Regards,
>> Wudong
>>


Re: _version_ / Versioning using timespan

2017-06-02 Thread Susheel Kumar
I see.  You can create a JIRA and submit patch and see if committers agree
or have different opinion/suggestion.

Thanks,
Susheel

On Fri, Jun 2, 2017 at 10:01 AM, Sergio García Maroto 
wrote:

> You are right about that but in some cases I may need to reindex my data
> and wanted to avoid deleting the full index so
> I can still server queries. I thought reindexing same version would be
> handy or at least to have the flexibility.
>
> On 2 June 2017 at 14:53, Susheel Kumar  wrote:
>
> > I see the difference now between using _version_ vs custom versionField.
> > Both seems to behave differently.  The _version_ field if used allows
> same
> > version to be updated and that's the perception I had in mind for custom
> > versionField.
> >
> > My question is why do you want to update the document if same version.
> > Shouldn't you pass higher version if the doc has changed and that makes
> the
> > update to be accepted ?
> >
> > On Fri, Jun 2, 2017 at 8:13 AM, Susheel Kumar 
> > wrote:
> >
> > > Just to confirm again before go too far,  are you able to execute these
> > > examples and see same output given under "Optimistic Concurrency".
> > > https://cwiki.apache.org/confluence/display/solr/
> > > Updating+Parts+of+Documents#UpdatingPartsofDocuments-In-PlaceUpdates
> > >
> > > Let me know which example you fail to get same output as described in.
> > >
> > > On Fri, Jun 2, 2017 at 5:11 AM, Sergio García Maroto <
> marot...@gmail.com
> > >
> > > wrote:
> > >
> > >> I had a look to the source code and I see
> > >> DocBasedVersionConstraintsProcessorFactory
> > >>
> > >> if (0 < ((Comparable)newUserVersion).compareTo((Comparable)
> > >> oldUserVersion)) {
> > >>   // log.info("VERSION returning true (proceed with update)"
> );
> > >>   return true;
> > >> }
> > >>
> > >> I can't find a way of overwriting same version without changing that
> > piece
> > >> of code.
> > >> Would be possible to add a parameter to the
> > >> "DocBasedVersionConstraintsProcessorFactory" something like
> > >> "overwrite.same.version=true"
> > >> so the new code would look like.
> > >>
> > >>
> > >> int compareTo = ((Comparable)newUserVersion).compareTo((Comparable)
> > >> oldUserVersion);
> > >> if ( ((overwritesameversion) && 0 <= compareTo) || (0 < compareTo)) {
> > >>   // log.info("VERSION returning true (proceed with update)"
> );
> > >>   return true;
> > >> }
> > >>
> > >>
> > >> Is that thing going to break anyhting? Can i do that change?
> > >>
> > >> Thanks
> > >> Sergio
> > >>
> > >>
> > >> On 2 June 2017 at 10:10, Sergio García Maroto 
> > wrote:
> > >>
> > >> > I am using  6.1.0.
> > >> > I tried with two different  field types, long and date.
> > >> >  > />
> > >> >  > stored="true"/>
> > >> >
> > >> > I am using this configuration on the solrconfig.xml
> > >> >
> > >> >   <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
> > >> >     <bool name="ignoreOldUpdates">false</bool>
> > >> >     <str name="versionField">UpdatedDateSD</str>
> > >> >   </processor>
> > >> >   ...
> > >> >
> > >> > i had a look to the wiki page and it says https://cwiki.apache.org/
> > >> > confluence/display/solr/Updating+Parts+of+Documents
> > >> >
> > >> > *Once configured, this update processor will reject (HTTP error code
> > >> 409)
> > >> > any attempt to update an existing document where the value of
> > >> > the my_version_l field in the "new" document is not greater then the
> > >> value
> > >> > of that field in the existing document.*
> > >> >
> > >> > Do you have any tip on how to get same versions not getting
> rejected.
> > >> >
> > >> > Thanks a lot.
> > >> >
> > >> >
> > >> > On 1 June 2017 at 19:04, Susheel Kumar 
> wrote:
> > >> >
> > >> >> Which version of solr are you using? I tested in 6.0 and if I
> supply
> > >> same
> > >> >> version, it overwrite/update the document exactly as per the wiki
> > >> >> documentation.
> > >> >>
> > >> >> Thanks,
> > >> >> Susheel
> > >> >>
> > >> >> On Thu, Jun 1, 2017 at 7:57 AM, marotosg 
> wrote:
> > >> >>
> > >> >> > Thanks a lot Susheel.
> > >> >> > I see this is actually what I need.  I have been testing it and
> > >> notice
> > >> >> the
> > >> >> > value of the field has to be always greater for a new document to
> > get
> > >> >> > indexed. if you send the same version number it doesn't work.
> > >> >> >
> > >> >> > Is it possible somehow to overwrite documents with the same
> > version?
> > >> >> >
> > >> >> > Thanks
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > --
> > >> >> > View this message in context: http://lucene.472066.n3.
> > >> >> > nabble.com/version-Versioning-using-timespan-
> > tp4338171p4338475.html
> > >> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> > >> >> >
> > >> >>
> > >> >
> > >> >
> > >>
> > >
> > >
> >
>


Re: why MULTILINESTRING can contains polygon in solr spatial search

2017-06-02 Thread David Smiley
Hi,
Solr 4.7 is old but is probably okay.  Is it easy to try a 6.x version?
 (note Spatial4j java package names have changed).  There are also multiple
new options pertinent to your scenario:
https://locationtech.github.io/spatial4j/apidocs/org/locationtech/spatial4j/context/jts/JtsSpatialContextFactory.html
* "useJtsMulti":"false" (defaults to true)
* "useJtsLineString":"false" (defaults to true)

Anyway, this could be due to the validationRule="repairBuffer0" logic if
per chance the indexed shape isn't considered "valid" (by JTS).

If flipping these options and using a recent Solr/Lucene/Spatial4j release
don't fix the issue, please file a JIRA issue to the Lucene project.

On Fri, Jun 2, 2017 at 5:53 AM kjdong  wrote:

> solr-version:4.7.0
>
> field spec as follows:
>  class="solr.SpatialRecursivePrefixTreeFieldType"
>
> spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
> validationRule="repairBuffer0" geo="true" distErrPct="0.025"
> maxDistErr="0.09" units="degrees" />
>
>  multiValued="true"/>
>
> And i index some MULTILINESTRING (wkt formatted  shape, the road data), and
> i query use "Intersects" spatial predicates like
> fq=geom:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))
> distErrPct=0".
>
> In fact, i want to query the shape(multiline) which is intersect with the
> query polygon, but the searched return document has nothing to do with the
> query polygon(aka, isDisjointTo), then i test it use JTS api ,it indeed
> return false, but solr think the line intersects with the polygon ,even
> contains. is this a bug? or repair it in advanced version?
>
> Geometry line = new WKTReader.read(the line  wkt text string);
> Geometry polygon= new WKTReader.read(the polygon wkt text string);
> line.intersects(polygon);//return false
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/why-MULTILINESTRING-can-contains-polygon-in-solr-spatial-search-tp4338593.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Enable Gzip compression Solr 6.0

2017-06-02 Thread nilaksh
Hi Rick,

I am not sure that Solr can take that stand once it stopped producing a
standalone war (the rationale for which is rather well documented here:
https://wiki.apache.org/solr/WhyNoWar).
If Solr asks users not to use standalone containers and wants to be used as a
server, then it should provide the optimisations, such as compression, that an
HTTP server normally would (HTTP being the supported protocol).

Communication between a Java web app and Solr also goes over the wire and, in
any decent setup, is likely to get congested under load. It is therefore
important to have the facility to gzip-compress communication between Solr and
its clients.
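
In the meantime, since Solr 6 ships on Jetty 9.3, one workaround appears to be
wrapping the handler chain in Jetty's GzipHandler. A rough, untested sketch,
placed inside the <Configure id="Server"> element of server/etc/jetty.xml
(the minGzipSize value is arbitrary):

  <Call name="insertHandler">
    <Arg>
      <New id="GzipHandler" class="org.eclipse.jetty.server.handler.gzip.GzipHandler">
        <Set name="minGzipSize">2048</Set>
      </New>
    </Arg>
  </Call>

This only compresses responses for clients that send Accept-Encoding: gzip, so
the client library has to ask for it as well.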



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Enable-Gzip-compression-Solr-6-0-tp4329496p4338648.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Performance Issue in Streaming Expressions

2017-06-02 Thread Joel Bernstein
Once you've scaled up the export from collection4 you can test the
performance of the join by moving the NullStream around the join.

parallel(null(innerJoin(collection 3, collection4)))

Again you'll want to test with different numbers of workers and replicas to
see where you max out performance of the join.
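
For concreteness, that shape would look something like this (collection,
worker-collection and field names are placeholders; partitionKeys and
qt="/export" matter once workers > 1):

  parallel(workerCollection,
           null(innerJoin(
                  search(collection3, q="*:*", fl="id,joinKey", sort="joinKey asc",
                         qt="/export", partitionKeys="joinKey"),
                  search(collection4, q="*:*", fl="id,joinKey", sort="joinKey asc",
                         qt="/export", partitionKeys="joinKey"),
                  on="joinKey")),
           workers="4",
           sort="joinKey asc")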


Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 2, 2017 at 10:25 AM, Joel Bernstein  wrote:

> innerJoin(intersect(innerJoin(collection1, collection2),
>innerJoin(collection 3, collection4)),
> collection5)
>
> Let's focus on:
>
> innerJoin(collection 3, collection4))
>
> The first thing to focus on is how fast is the export from collection4.
> You can test this with the NullStream with the following construct:
>
> null(search(collection4))
>
> The null stream will eat all the tuples and report back timing
> information. This will isolate the performance of the export from
> collection4.
>
> Once you have a baseline for how fast you can export from a single node,
> you can test with parallel export from a single node:
>
> parallel(null(search(collection4)))
>
> Then you can add replicas for collection4 and increase workers.
>
>
>
>
>
>
>
>
>
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Jun 1, 2017 at 11:51 PM, Susmit Shukla 
> wrote:
>
>> Hi,
>>
>> Which version of solr are you on?
>> Increasing memory may not be useful as streaming API does not keep stuff
>> in
>> memory (except may be hash joins).
>> Increasing replicas (not sharding) and pushing the join computation on
>> worker solr cluster with #workers > 1 would definitely make things faster.
>> Are you limiting your results at some cutoff? if yes, then SOLR-10698
>>  can be useful fix.
>> Also
>> binary response format for streaming would be faster. (available in 6.5
>> probably)
>>
>>
>>
>> On Thu, Jun 1, 2017 at 3:04 PM, thiaga rajan <
>> ecethiagu2...@yahoo.co.in.invalid> wrote:
>>
>> > We are working on a proposal and feeling streaming API along with export
>> > handler will best fit for our usecases. We are already of having a
>> > structure in solr in which we are using graph queries to produce
>> > hierarchical structure. Now from the structure we need to join couple of
>> > more collections. We have 5 different collections.
>> >   Collection 1- 800 k records.
>> > Collection 2- 200k records.
>>  Collection 3
>> > - 7k records.   Collection 4 - 6
>> > million records. Collection 5 - 150 k
>> records
>> > we are using the below strategy
>> > innerJoin( intersect( innerJoin(collection 1,collection 2),
>> > innerJoin(Collection 3, Collection 4)), collection 5).
>> >We are seeing performance is too slow when we start
>> having
>> > collection 4. Just with collection 1 2 5 the results are coming in 2
>> secs.
>> > The moment I have included collection 4 in the query I could see  a
>> > performance impact. I believe exporting large results from collection 4
>> is
>> > causing the issie. Currently I am using single sharded collection with
>> no
>> > replica. I thinking if we can increase the memory as first option to
>> > increase performance as processing doc values need more memory. Then if
>> > that did not worked I can check using parallel stream/ sharding. Kindly
>> > advise is there could be anything else I  missing?
>> > Sent from Yahoo Mail on Android
>>
>
>


Re: Performance Issue in Streaming Expressions

2017-06-02 Thread Joel Bernstein
innerJoin(intersect(innerJoin(collection1, collection2),
   innerJoin(collection 3, collection4)),
collection5)

Let's focus on:

innerJoin(collection 3, collection4))

The first thing to focus on is how fast is the export from collection4. You
can test this with the NullStream with the following construct:

null(search(collection4))

The null stream will eat all the tuples and report back timing information.
This will isolate the performance of the export from collection4.

Once you have a baseline for how fast you can export from a single node,
you can test with parallel export from a single node:

parallel(null(search(collection4)))

Then you can add replicas for collection4 and increase workers.
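
For example, something like this against the /stream handler (field and
collection names are placeholders):

  curl --data-urlencode 'expr=null(search(collection4, q="*:*", fl="id,joinKey", sort="joinKey asc", qt="/export"))' "http://localhost:8983/solr/collection4/stream"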













Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jun 1, 2017 at 11:51 PM, Susmit Shukla 
wrote:

> Hi,
>
> Which version of solr are you on?
> Increasing memory may not be useful as streaming API does not keep stuff in
> memory (except may be hash joins).
> Increasing replicas (not sharding) and pushing the join computation on
> worker solr cluster with #workers > 1 would definitely make things faster.
> Are you limiting your results at some cutoff? if yes, then SOLR-10698
>  can be useful fix. Also
> binary response format for streaming would be faster. (available in 6.5
> probably)
>
>
>
> On Thu, Jun 1, 2017 at 3:04 PM, thiaga rajan <
> ecethiagu2...@yahoo.co.in.invalid> wrote:
>
> > We are working on a proposal and feeling streaming API along with export
> > handler will best fit for our usecases. We are already of having a
> > structure in solr in which we are using graph queries to produce
> > hierarchical structure. Now from the structure we need to join couple of
> > more collections. We have 5 different collections.
> >   Collection 1- 800 k records.
> > Collection 2- 200k records.   Collection
> 3
> > - 7k records.   Collection 4 - 6
> > million records. Collection 5 - 150 k records
> > we are using the below strategy
> > innerJoin( intersect( innerJoin(collection 1,collection 2),
> > innerJoin(Collection 3, Collection 4)), collection 5).
> >We are seeing performance is too slow when we start having
> > collection 4. Just with collection 1 2 5 the results are coming in 2
> secs.
> > The moment I have included collection 4 in the query I could see  a
> > performance impact. I believe exporting large results from collection 4
> is
> > causing the issie. Currently I am using single sharded collection with no
> > replica. I thinking if we can increase the memory as first option to
> > increase performance as processing doc values need more memory. Then if
> > that did not worked I can check using parallel stream/ sharding. Kindly
> > advise is there could be anything else I  missing?
> > Sent from Yahoo Mail on Android
>


Re: _version_ / Versioning using timespan

2017-06-02 Thread Sergio García Maroto
You are right about that, but in some cases I may need to reindex my data and
I wanted to avoid deleting the full index so I can still serve queries. I
thought reindexing with the same version would be handy, or at least having
the flexibility to do so.

On 2 June 2017 at 14:53, Susheel Kumar  wrote:

> I see the difference now between using _version_ vs custom versionField.
> Both seems to behave differently.  The _version_ field if used allows same
> version to be updated and that's the perception I had in mind for custom
> versionField.
>
> My question is why do you want to update the document if same version.
> Shouldn't you pass higher version if the doc has changed and that makes the
> update to be accepted ?
>
> On Fri, Jun 2, 2017 at 8:13 AM, Susheel Kumar 
> wrote:
>
> > Just to confirm again before go too far,  are you able to execute these
> > examples and see same output given under "Optimistic Concurrency".
> > https://cwiki.apache.org/confluence/display/solr/
> > Updating+Parts+of+Documents#UpdatingPartsofDocuments-In-PlaceUpdates
> >
> > Let me know which example you fail to get same output as described in.
> >
> > On Fri, Jun 2, 2017 at 5:11 AM, Sergio García Maroto  >
> > wrote:
> >
> >> I had a look to the source code and I see
> >> DocBasedVersionConstraintsProcessorFactory
> >>
> >> if (0 < ((Comparable)newUserVersion).compareTo((Comparable)
> >> oldUserVersion)) {
> >>   // log.info("VERSION returning true (proceed with update)" );
> >>   return true;
> >> }
> >>
> >> I can't find a way of overwriting same version without changing that
> piece
> >> of code.
> >> Would be possible to add a parameter to the
> >> "DocBasedVersionConstraintsProcessorFactory" something like
> >> "overwrite.same.version=true"
> >> so the new code would look like.
> >>
> >>
> >> int compareTo = ((Comparable)newUserVersion).compareTo((Comparable)
> >> oldUserVersion);
> >> if ( ((overwritesameversion) && 0 <= compareTo) || (0 < compareTo)) {
> >>   // log.info("VERSION returning true (proceed with update)" );
> >>   return true;
> >> }
> >>
> >>
> >> Is that thing going to break anyhting? Can i do that change?
> >>
> >> Thanks
> >> Sergio
> >>
> >>
> >> On 2 June 2017 at 10:10, Sergio García Maroto 
> wrote:
> >>
> >> > I am using  6.1.0.
> >> > I tried with two different  field types, long and date.
> >> >  />
> >> >  stored="true"/>
> >> >
> >> > I am using this configuration on the solrconfig.xml
> >> >
> >> > 
> >> >
> >> >  false
> >> >  UpdatedDateSD
> >> >
> >> >   
> >> >
> >> >   
> >> >   
> >> >
> >> > i had a look to the wiki page and it says https://cwiki.apache.org/
> >> > confluence/display/solr/Updating+Parts+of+Documents
> >> >
> >> > *Once configured, this update processor will reject (HTTP error code
> >> 409)
> >> > any attempt to update an existing document where the value of
> >> > the my_version_l field in the "new" document is not greater then the
> >> value
> >> > of that field in the existing document.*
> >> >
> >> > Do you have any tip on how to get same versions not getting rejected.
> >> >
> >> > Thanks a lot.
> >> >
> >> >
> >> > On 1 June 2017 at 19:04, Susheel Kumar  wrote:
> >> >
> >> >> Which version of solr are you using? I tested in 6.0 and if I supply
> >> same
> >> >> version, it overwrite/update the document exactly as per the wiki
> >> >> documentation.
> >> >>
> >> >> Thanks,
> >> >> Susheel
> >> >>
> >> >> On Thu, Jun 1, 2017 at 7:57 AM, marotosg  wrote:
> >> >>
> >> >> > Thanks a lot Susheel.
> >> >> > I see this is actually what I need.  I have been testing it and
> >> notice
> >> >> the
> >> >> > value of the field has to be always greater for a new document to
> get
> >> >> > indexed. if you send the same version number it doesn't work.
> >> >> >
> >> >> > Is it possible somehow to overwrite documents with the same
> version?
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > View this message in context: http://lucene.472066.n3.
> >> >> > nabble.com/version-Versioning-using-timespan-
> tp4338171p4338475.html
> >> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> >> >> >
> >> >>
> >> >
> >> >
> >>
> >
> >
>


Re: Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud?

2017-06-02 Thread Susheel Kumar
Thanks for the correction, Shawn.  Yes, it's only the heap allocation settings
that are per host/JVM.

On Fri, Jun 2, 2017 at 9:23 AM, Shawn Heisey  wrote:

> On 6/1/2017 11:40 PM, Daniel Angelov wrote:
> > Is the filter cache separate for each host and then for each
> > collection and then for each shard and then for each replica in
> > SolrCloud? For example, on host1 we have, coll1 shard1 replica1 and
> > coll2 shard1 replica1, on host2 we have, coll1 shard2 replica2 and
> > coll2 shard2 replica2. Does this mean, that we have 4 filter caches,
> > i.e. separate memory for each core? If they are separated and for
> > example, query1 is handling from coll1 shard1 replica1 and 1 sec later
> > the same query is handling from coll2 shard1 replica1, this means,
> > that the later query will not use the result set cached from the first
> > query...
>
> That is correct.
>
> General notes about SolrCloud terminology: SolrCloud is organized around
> collections.  Collections are made up of one or more shards.  Shards are
> made up of one or more replicas.  Each replica is a Solr core.  A core
> contains one Lucene index.  It is not correct to say that a shard has no
> replicas.  The leader *is* a replica.  If you have a leader and one
> follower, the shard has two replicas.
>
> Solr caches (including filterCache) exist at the core level, they have
> no knowledge of other replicas, other shards, or the collection as a
> whole.  Susheel says that the caches are per host/JVM -- that's not
> correct.  Every Solr core in a JVM has separate caches, if they are
> defined in the configuration for that core.
>
> Your query scenario has even more separation -- it asks about querying
> two completely different collections, which don't use the same cores.
>
> Thanks,
> Shawn
>
>


Re: Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud?

2017-06-02 Thread Shawn Heisey
On 6/1/2017 11:40 PM, Daniel Angelov wrote:
> Is the filter cache separate for each host and then for each
> collection and then for each shard and then for each replica in
> SolrCloud? For example, on host1 we have, coll1 shard1 replica1 and
> coll2 shard1 replica1, on host2 we have, coll1 shard2 replica2 and
> coll2 shard2 replica2. Does this mean, that we have 4 filter caches,
> i.e. separate memory for each core? If they are separated and for
> example, query1 is handling from coll1 shard1 replica1 and 1 sec later
> the same query is handling from coll2 shard1 replica1, this means,
> that the later query will not use the result set cached from the first
> query... 

That is correct.

General notes about SolrCloud terminology: SolrCloud is organized around
collections.  Collections are made up of one or more shards.  Shards are
made up of one or more replicas.  Each replica is a Solr core.  A core
contains one Lucene index.  It is not correct to say that a shard has no
replicas.  The leader *is* a replica.  If you have a leader and one
follower, the shard has two replicas.

Solr caches (including filterCache) exist at the core level, they have
no knowledge of other replicas, other shards, or the collection as a
whole.  Susheel says that the caches are per host/JVM -- that's not
correct.  Every Solr core in a JVM has separate caches, if they are
defined in the configuration for that core.

Your query scenario has even more separation -- it asks about querying
two completely different collections, which don't use the same cores.
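
For reference, that per-core cache is whatever the core's solrconfig.xml
defines, e.g. something like the following (the sizes are only illustrative);
every replica created from that config gets its own independent instance:

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>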

Thanks,
Shawn



Re: Spread SolrCloud across two locations

2017-06-02 Thread Shawn Heisey
On 5/29/2017 8:57 AM, Jan Høydahl wrote:
> And if you start all three in DC1, you have 3+3 voting, what would
> then happen? Any chance of state corruption?
>
> I believe that my solution isolates manual change to two ZK nodes in
> DC2, while your requires config change to 1 in DC2 and manual
> start/stop of 1 in DC1.

I took the scenario to the zookeeper user list.  Here's the thread:

http://zookeeper-user.578899.n2.nabble.com/Yet-another-quot-two-datacenter-quot-discussion-td7583106.html

I'm not completely clear on what they're saying, but here's what I think
it means:  Dealing with a loss of DC1 by reconfiguring ZK servers in DC2
might work, or it might crash and burn once connectivity to DC1 is restored.

> Well, that’s not up to me to decide, it’s the customer environment
> that sets the constraints, they currently have 2 independent geo
> locations. And Solr is just a dependency of some other app they need
> to install, so doubt that they are very happy to start adding racks or
> independent power/network for this alone. Of course, if they already
> have such redundancy within one of the DCs, placing a 3rd ZK there is
> an ideal solution with probably good enough HA. If not, I’m looking
> for the 2nd best low-friction approach with software-only.

Even if all goes well with scripted reconfiguration of DC2, I don't
think I'd want to try and automate it, because of the chance for a brief
outage to trigger it.  Without automation, if the failure happened at
just the wrong moment, it could be a while before anyone notices, and it
might be hours after it gets noticed before relevant personnel are in a
position to run the reconfiguration script on DC2, during which you'd
have a read-only SolrCloud.

Frequently search is such a critical part of a web application that
if it doesn't work, there IS no web application.  That certainly
describes the systems that use the Solr installations that I manage. 
For that kind of application, damage to reputation caused by a couple of
hours where the website doesn't get any updates might be MUCH more
expensive than the monthly cost for a virtual private server from a
hosting company.

Thanks,
Shawn



Re: _version_ / Versioning using timespan

2017-06-02 Thread Susheel Kumar
I see the difference now between using _version_ vs. a custom versionField.
The two behave differently.  The _version_ field, if used, allows the same
version to be updated, and that was the behaviour I had in mind for the custom
versionField as well.

My question is: why do you want to update the document if it has the same
version? Shouldn't you pass a higher version if the doc has changed, so that
the update is accepted?

On Fri, Jun 2, 2017 at 8:13 AM, Susheel Kumar  wrote:

> Just to confirm again before go too far,  are you able to execute these
> examples and see same output given under "Optimistic Concurrency".
> https://cwiki.apache.org/confluence/display/solr/
> Updating+Parts+of+Documents#UpdatingPartsofDocuments-In-PlaceUpdates
>
> Let me know which example you fail to get same output as described in.
>
> On Fri, Jun 2, 2017 at 5:11 AM, Sergio García Maroto 
> wrote:
>
>> I had a look to the source code and I see
>> DocBasedVersionConstraintsProcessorFactory
>>
>> if (0 < ((Comparable)newUserVersion).compareTo((Comparable)
>> oldUserVersion)) {
>>   // log.info("VERSION returning true (proceed with update)" );
>>   return true;
>> }
>>
>> I can't find a way of overwriting same version without changing that piece
>> of code.
>> Would be possible to add a parameter to the
>> "DocBasedVersionConstraintsProcessorFactory" something like
>> "overwrite.same.version=true"
>> so the new code would look like.
>>
>>
>> int compareTo = ((Comparable)newUserVersion).compareTo((Comparable)
>> oldUserVersion);
>> if ( ((overwritesameversion) && 0 <= compareTo) || (0 < compareTo)) {
>>   // log.info("VERSION returning true (proceed with update)" );
>>   return true;
>> }
>>
>>
>> Is that thing going to break anyhting? Can i do that change?
>>
>> Thanks
>> Sergio
>>
>>
>> On 2 June 2017 at 10:10, Sergio García Maroto  wrote:
>>
>> > I am using  6.1.0.
>> > I tried with two different  field types, long and date.
>> > 
>> > 
>> >
>> > I am using this configuration on the solrconfig.xml
>> >
>> > 
>> >
>> >  false
>> >  UpdatedDateSD
>> >
>> >   
>> >
>> >   
>> >   
>> >
>> > i had a look to the wiki page and it says https://cwiki.apache.org/
>> > confluence/display/solr/Updating+Parts+of+Documents
>> >
>> > *Once configured, this update processor will reject (HTTP error code
>> 409)
>> > any attempt to update an existing document where the value of
>> > the my_version_l field in the "new" document is not greater then the
>> value
>> > of that field in the existing document.*
>> >
>> > Do you have any tip on how to get same versions not getting rejected.
>> >
>> > Thanks a lot.
>> >
>> >
>> > On 1 June 2017 at 19:04, Susheel Kumar  wrote:
>> >
>> >> Which version of solr are you using? I tested in 6.0 and if I supply
>> same
>> >> version, it overwrite/update the document exactly as per the wiki
>> >> documentation.
>> >>
>> >> Thanks,
>> >> Susheel
>> >>
>> >> On Thu, Jun 1, 2017 at 7:57 AM, marotosg  wrote:
>> >>
>> >> > Thanks a lot Susheel.
>> >> > I see this is actually what I need.  I have been testing it and
>> notice
>> >> the
>> >> > value of the field has to be always greater for a new document to get
>> >> > indexed. if you send the same version number it doesn't work.
>> >> >
>> >> > Is it possible somehow to overwrite documents with the same version?
>> >> >
>> >> > Thanks
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > View this message in context: http://lucene.472066.n3.
>> >> > nabble.com/version-Versioning-using-timespan-tp4338171p4338475.html
>> >> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >> >
>> >>
>> >
>> >
>>
>
>


Re: Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud?

2017-06-02 Thread Susheel Kumar
The heap allocation and cache settings are per host/JVM not for each
collection / shards. In SolrCloud you execute queries against a collection
and every other collection may have different schema/document id's and
all.  So answer to your question, query1 from coll1 can't use results
cached from query against coll2.

Thnx

On Fri, Jun 2, 2017 at 1:40 AM, Daniel Angelov 
wrote:

> Is the filter cache separate for each host and then for each collection and
> then for each shard and then for each replica in SolrCloud?
> For example, on host1 we have, coll1 shard1 replica1 and coll2 shard1
> replica1, on host2 we have, coll1 shard2 replica2 and coll2 shard2
> replica2. Does this mean, that we have 4 filter caches, i.e. separate
> memory for each core?
> If they are separated and for example, query1 is handling from coll1 shard1
> replica1 and 1 sec later the same query is handling from coll2 shard1
> replica1, this means, that the later query will not use the result set
> cached from the first query...
>
> BR
> Daniel
>


Re: _version_ / Versioning using timespan

2017-06-02 Thread Susheel Kumar
Just to confirm again before we go too far: are you able to execute these
examples and see the same output as given under "Optimistic Concurrency"?
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents#UpdatingPartsofDocuments-In-PlaceUpdates


Let me know for which example you fail to get the output described there.
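
For instance, this is the kind of update that should come back with an HTTP
409 conflict when the supplied _version_ no longer matches the stored one
(collection name, field name and version value here are only examples):

  curl -X POST 'http://localhost:8983/solr/mycollection/update?commit=true' \
    -H 'Content-Type: application/json' \
    -d '[{"id":"doc1", "title_s":"updated title", "_version_":1234567890123456789}]'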

On Fri, Jun 2, 2017 at 5:11 AM, Sergio García Maroto 
wrote:

> I had a look to the source code and I see
> DocBasedVersionConstraintsProcessorFactory
>
> if (0 < ((Comparable)newUserVersion).compareTo((Comparable)
> oldUserVersion)) {
>   // log.info("VERSION returning true (proceed with update)" );
>   return true;
> }
>
> I can't find a way of overwriting same version without changing that piece
> of code.
> Would be possible to add a parameter to the
> "DocBasedVersionConstraintsProcessorFactory" something like
> "overwrite.same.version=true"
> so the new code would look like.
>
>
> int compareTo = ((Comparable)newUserVersion).compareTo((Comparable)
> oldUserVersion);
> if ( ((overwritesameversion) && 0 <= compareTo) || (0 < compareTo)) {
>   // log.info("VERSION returning true (proceed with update)" );
>   return true;
> }
>
>
> Is that thing going to break anyhting? Can i do that change?
>
> Thanks
> Sergio
>
>
> On 2 June 2017 at 10:10, Sergio García Maroto  wrote:
>
> > I am using  6.1.0.
> > I tried with two different  field types, long and date.
> > 
> > 
> >
> > I am using this configuration on the solrconfig.xml
> >
> > 
> >
> >  false
> >  UpdatedDateSD
> >
> >   
> >
> >   
> >   
> >
> > i had a look to the wiki page and it says https://cwiki.apache.org/
> > confluence/display/solr/Updating+Parts+of+Documents
> >
> > *Once configured, this update processor will reject (HTTP error code 409)
> > any attempt to update an existing document where the value of
> > the my_version_l field in the "new" document is not greater then the
> value
> > of that field in the existing document.*
> >
> > Do you have any tip on how to get same versions not getting rejected.
> >
> > Thanks a lot.
> >
> >
> > On 1 June 2017 at 19:04, Susheel Kumar  wrote:
> >
> >> Which version of solr are you using? I tested in 6.0 and if I supply
> same
> >> version, it overwrite/update the document exactly as per the wiki
> >> documentation.
> >>
> >> Thanks,
> >> Susheel
> >>
> >> On Thu, Jun 1, 2017 at 7:57 AM, marotosg  wrote:
> >>
> >> > Thanks a lot Susheel.
> >> > I see this is actually what I need.  I have been testing it and
> notice
> >> the
> >> > value of the field has to be always greater for a new document to get
> >> > indexed. if you send the same version number it doesn't work.
> >> >
> >> > Is it possible somehow to overwrite documents with the same version?
> >> >
> >> > Thanks
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context: http://lucene.472066.n3.
> >> > nabble.com/version-Versioning-using-timespan-tp4338171p4338475.html
> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> >> >
> >>
> >
> >
>


Re: Can solrcloud be running on a read-only filesystem?

2017-06-02 Thread Susheel Kumar
I doubt it can run on a read-only file system.  Even if there is no ingestion,
Solr still needs to write to logs/tlogs for syncing / recovering, etc.

Thnx

On Fri, Jun 2, 2017 at 6:56 AM, Wudong Liu  wrote:

> Hi All:
>
> We have a normal build/stage -> prod settings for our production pipeline.
> And we would build solr index in the build environment and then the index
> is copied to the prod environment.
>
> The solrcloud in prod seems working fine when the file system backing it is
> writable. However, we see many errors when the file system is readonly.
> Many exceptions are thrown regarding the tlog file cannot be open for write
> when the solr nodes are restarted with the new data; some of the nodes
> eventually are stuck in the recovering phase and never able to go back
> online in the cloud.
>
> Just wondering is anyone has any experience on Solrcloud running in
> readonly file system? Is it possible at all?
>
> Regards,
> Wudong
>


Re: Number of requests spike up, when i do the delta Import.

2017-06-02 Thread Rick Leir
Vrin
We had a good speedup from enabling a SQL cache. You also need to avoid 
updating the DB tables so the cache does not get flushed. 
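If caching on the Solr side is an option too, DIH can cache child-entity
lookups itself. A sketch of a cached child entity in data-config.xml (the
cacheImpl/cacheKey/cacheLookup attributes are the standard DIH ones; table,
column and key names are made up):

  <entity name="detail" processor="SqlEntityProcessor"
          cacheImpl="SortedMapBackedCache"
          cacheKey="PARENT_ID" cacheLookup="parent.ID"
          query="SELECT PARENT_ID, DETAIL FROM DETAILS"/>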
Cheers -- Rick

On June 2, 2017 4:49:20 AM EDT, vrindavda  wrote:
>Thanks Erick ,
>
>Could you please suggest some alternative to go with SolrNET.
>
>@jlman, I tried your way, that do reduces the number of request, but
>delta-import still take longer than full-import. There is no
>improvement in
>performance. 
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-tp4338162p4338591.html
>Sent from the Solr - User mailing list archive at Nabble.com.

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Can solrcloud be running on a read-only filesystem?

2017-06-02 Thread Wudong Liu
Hi All:

We have a normal build/stage -> prod setup for our production pipeline: we
build the Solr index in the build environment and then copy the index to the
prod environment.

The SolrCloud in prod seems to work fine when the file system backing it is
writable. However, we see many errors when the file system is read-only.
Many exceptions are thrown because the tlog file cannot be opened for write
when the Solr nodes are restarted with the new data; some of the nodes
eventually get stuck in the recovering phase and are never able to come back
online in the cloud.

Just wondering if anyone has experience with SolrCloud running on a read-only
file system? Is it possible at all?

Regards,
Wudong


Steps for building solr/lucene code and starting server

2017-06-02 Thread Nawab Zada Asad Iqbal
Hi,

I have synced the lucene-solr repo because I will have some custom code in the
lucene and solr folders. What are the steps for starting the Solr server? My
schema.xml uses ICUNormalizer2CharFilterFactory (which I see in the lucene
folder tree), but I don't know how to make it work with the Solr webapp. I
know the lucene ant target 'compile' and the solr targets 'dist' and 'server',
but the order to run them in is not clear to me.

I compiled lucene before running 'ant server' in the solr folder, but I still
see this error when I run 'bin/solr start -f -s ~/solrhome/':

Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] fieldType "text": Plugin init failure for [schema.xml]
analyzer/charFilter "nfkc": Error loading class
'org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory'
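
For context, my understanding is that the ICU factories live in the
analysis-extras contrib, so presumably the core's solrconfig.xml needs the
jars on its classpath with <lib> directives along these lines (copied from the
binary-distribution layout; paths may well differ in a source build):

  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex=".*\.jar" />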



Thanks
Nawab


Re: Configuration of parallel indexing threads

2017-06-02 Thread gigo314
Thanks for the replies. Just to confirm that I got it right:
1. Since there is no setting to control index writers, is it fair to assume
that Solr always indexes at maximum possible speed?
2. The way to control write speed is to control the number of clients that are
simultaneously posting data, right?
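
For what it's worth, the client-side knobs I have in mind look roughly like
this with a recent SolrJ's ConcurrentUpdateSolrClient (URL, field names and
sizes are made up; the queue size and thread count bound how hard this one
client pushes the server):

  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class IndexerSketch {
    public static void main(String[] args) throws Exception {
      try (ConcurrentUpdateSolrClient client =
               new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycollection")
                   .withQueueSize(100)    // buffered docs before the caller blocks
                   .withThreadCount(4)    // parallel HTTP connections to Solr
                   .build()) {
        for (int i = 0; i < 10000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", Integer.toString(i));
          doc.addField("title_s", "document " + i);
          client.add(doc);
        }
        client.commit();
      }
    }
  }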



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Configuration-of-parallel-indexing-threads-tp4338466p4338599.html
Sent from the Solr - User mailing list archive at Nabble.com.


why MULTILINESTRING can contains polygon in solr spatial search

2017-06-02 Thread kjdong
solr-version:4.7.0

field spec as follows:

<fieldType class="solr.SpatialRecursivePrefixTreeFieldType"
   spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
   validationRule="repairBuffer0" geo="true" distErrPct="0.025"
   maxDistErr="0.09" units="degrees" />

<field name="geom" multiValued="true" />

I index some MULTILINESTRING shapes (WKT-formatted road data), and I query
with the "Intersects" spatial predicate, like
fq=geom:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))
distErrPct=0".

I want to find the shapes (multilines) that intersect the query polygon, but
the returned documents have nothing to do with the query polygon (they are
actually disjoint from it). When I test the same geometries with the JTS API
it indeed returns false, but Solr thinks the line intersects with the polygon,
or even contains it. Is this a bug? Has it been fixed in a later version?

Geometry line = new WKTReader().read(the line wkt text string);
Geometry polygon = new WKTReader().read(the polygon wkt text string);
line.intersects(polygon); // returns false





--
View this message in context: 
http://lucene.472066.n3.nabble.com/why-MULTILINESTRING-can-contains-polygon-in-solr-spatial-search-tp4338593.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Web Crawler - Robots.txt

2017-06-02 Thread Charlie Hull

On 02/06/2017 00:56, Doug Turnbull wrote:

Scrapy is fantastic and I use it scrape search results pages for clients to
take quality snapshots for relevance work


+1 for Scrapy; it was built by a team at Mydeco.com while we were 
building their search backend and has gone from strength to strength since.


Cheers

Charlie


Ignoring robots.txt sometimes legit comes up because a staging site might
be telling google not to crawl but don't care about a developer crawling
for internal purposes.

Doug
On Thu, Jun 1, 2017 at 6:34 PM Walter Underwood 
wrote:


Which was exactly what I suggested.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jun 1, 2017, at 3:31 PM, David Choi  wrote:

In the mean time I have found a better solution at the moment is to test

on

a site that allows users to crawl their site.

On Thu, Jun 1, 2017 at 5:26 PM David Choi 

wrote:



I think you misunderstand the argument was about stealing content. Sorry
but I think you need to read what people write before making bold
statements.

On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood 
wrote:


Let’s not get snarky right away, especially when you are wrong.

Corporations do not generally ignore robots.txt. I worked on a

commercial

web spider for ten years. Occasionally, our customers did need to

bypass

portions of robots.txt. That was usually because of a

poorly-maintained web

server, or because our spider could safely crawl some content that

would

cause problems for other crawlers.

If you want to learn crawling, don’t start by breaking the conventions

of

good web citizenship. Instead, start with sitemap.xml and crawl the
preferred portions of a site.

https://www.sitemaps.org/index.html <

https://www.sitemaps.org/index.html>


If the site blocks you, find a different site to learn on.

I like the looks of “Scrapy”, written in Python. I haven’t used it for
anything big, but I’d start with that for learning.

https://scrapy.org/ 

If you want to learn on a site with a lot of content, try ours,

chegg.com

But if your crawler gets out of hand, crawling too fast, we’ll block

it.

Any other site will do the same.

I would not base the crawler directly on Solr. A crawler needs a
dedicated database to record the URLs visited, errors, duplicates,

etc. The

output of the crawl goes to Solr. That is how we did it with Ultraseek
(before Solr existed).

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jun 1, 2017, at 3:01 PM, David Choi 

wrote:


Oh well I guess its ok if a corporation does it but not someone

wanting

to

learn more about the field. I actually have written a crawler before

as

well as the you know Inverted Index of how solr works but I just

thought

its architecture was better suited for scaling.

On Thu, Jun 1, 2017 at 4:47 PM Dave 

wrote:



And I mean that in the context of stealing content from sites that
explicitly declare they don't want to be crawled. Robots.txt is to be
followed.


On Jun 1, 2017, at 5:31 PM, David Choi 

wrote:


Hello,

I was wondering if anyone could guide me on how to crawl the web and
ignore the robots.txt since I can not index some big sites. Or if

someone

could point how to get around it. I read somewhere about a
protocol.plugin.check.robots
but that was for nutch.

The way I index is
bin/post -c gettingstarted https://en.wikipedia.org/

but I can't index the site I'm guessing because of the robots.txt.
I can index with
bin/post -c gettingstarted http://lucene.apache.org/solr

which I am guessing allows it. I was also wondering how to find the

name

of

the crawler bin/post uses.















--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: _version_ / Versioning using timespan

2017-06-02 Thread Sergio García Maroto
I had a look to the source code and I see
DocBasedVersionConstraintsProcessorFactory

if (0 < ((Comparable)newUserVersion).compareTo((Comparable)
oldUserVersion)) {
  // log.info("VERSION returning true (proceed with update)" );
  return true;
}

I can't find a way of overwriting the same version without changing that piece
of code.
Would it be possible to add a parameter to
DocBasedVersionConstraintsProcessorFactory, something like
"overwrite.same.version=true", so the new code would look like this:
so the new code would look like.


int compareTo = ((Comparable)newUserVersion).compareTo((Comparable)
oldUserVersion);
if ( ((overwritesameversion) && 0 <= compareTo) || (0 < compareTo)) {
  // log.info("VERSION returning true (proceed with update)" );
  return true;
}


Is that change going to break anything? Can I make that change?
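
Roughly, the patch I have in mind would look like this. This is only a sketch
against the factory source, not tested, and the parameter name is made up:

  // 1) In the factory's init(NamedList args), read the new flag:
  private boolean overwriteSameVersion = false;

  @Override
  public void init(NamedList args) {
    Object flag = args.remove("overwriteSameVersion");
    if (flag != null) {
      overwriteSameVersion = Boolean.parseBoolean(flag.toString());
    }
    super.init(args);
  }

  // 2) In the version comparison, also accept "equal" when the flag is set:
  int cmp = ((Comparable) newUserVersion).compareTo((Comparable) oldUserVersion);
  if (cmp > 0 || (overwriteSameVersion && cmp == 0)) {
    return true; // proceed with the update
  }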

Thanks
Sergio


On 2 June 2017 at 10:10, Sergio García Maroto  wrote:

> I am using  6.1.0.
> I tried with two different  field types, long and date.
> 
> 
>
> I am using this configuration on the solrconfig.xml
>
> 
>
>  false
>  UpdatedDateSD
>
>   
>
>   
>   
>
> i had a look to the wiki page and it says https://cwiki.apache.org/
> confluence/display/solr/Updating+Parts+of+Documents
>
> *Once configured, this update processor will reject (HTTP error code 409)
> any attempt to update an existing document where the value of
> the my_version_l field in the "new" document is not greater then the value
> of that field in the existing document.*
>
> Do you have any tip on how to get same versions not getting rejected.
>
> Thanks a lot.
>
>
> On 1 June 2017 at 19:04, Susheel Kumar  wrote:
>
>> Which version of solr are you using? I tested in 6.0 and if I supply same
>> version, it overwrite/update the document exactly as per the wiki
>> documentation.
>>
>> Thanks,
>> Susheel
>>
>> On Thu, Jun 1, 2017 at 7:57 AM, marotosg  wrote:
>>
>> > Thanks a lot Susheel.
>> > I see this is actually what I need.  I have been testing it and  notice
>> the
>> > value of the field has to be always greater for a new document to get
>> > indexed. if you send the same version number it doesn't work.
>> >
>> > Is it possible somehow to overwrite documents with the same version?
>> >
>> > Thanks
>> >
>> >
>> >
>> > --
>> > View this message in context: http://lucene.472066.n3.
>> > nabble.com/version-Versioning-using-timespan-tp4338171p4338475.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >
>>
>
>


Re: Number of requests spike up, when i do the delta Import.

2017-06-02 Thread vrindavda
Thanks Erick,

Could you please suggest some alternative to go with SolrNET?

@jlman, I tried your way; it does reduce the number of requests, but
delta-import still takes longer than full-import. There is no improvement in
performance.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-tp4338162p4338591.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: _version_ / Versioning using timespan

2017-06-02 Thread Sergio García Maroto
I am using  6.1.0.
I tried with two different  field types, long and date.



I am using this configuration on the solrconfig.xml


   
 false
 UpdatedDateSD
   
  
   
  
  

I had a look at the wiki page and it says
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents

*Once configured, this update processor will reject (HTTP error code 409)
any attempt to update an existing document where the value of
the my_version_l field in the "new" document is not greater then the value
of that field in the existing document.*

Do you have any tips on how to keep updates with the same version from being rejected?
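
For reference, this is the kind of request that gets rejected for me: sending
the document a second time with the same UpdatedDateSD value comes back with
an HTTP 409 (the collection name is made up, and the chain above has to be the
default or selected with update.chain):

  curl -X POST 'http://localhost:8983/solr/mycollection/update?commit=true' \
    -H 'Content-Type: application/json' \
    -d '[{"id":"doc1", "UpdatedDateSD":"2017-06-02T10:00:00Z"}]'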

Thanks a lot.


On 1 June 2017 at 19:04, Susheel Kumar  wrote:

> Which version of solr are you using? I tested in 6.0 and if I supply same
> version, it overwrite/update the document exactly as per the wiki
> documentation.
>
> Thanks,
> Susheel
>
> On Thu, Jun 1, 2017 at 7:57 AM, marotosg  wrote:
>
> > Thanks a lot Susheel.
> > I see this is actually what I need.  I have been testing it and  notice
> the
> > value of the field has to be always greater for a new document to get
> > indexed. if you send the same version number it doesn't work.
> >
> > Is it possible somehow to overwrite documents with the same version?
> >
> > Thanks
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.
> > nabble.com/version-Versioning-using-timespan-tp4338171p4338475.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>