solrcloud - How to delete a doc at a specific shard

2016-01-08 Thread elvis鱼人
My SolrCloud has 3 shards and 2 replicas,
and one shard's docs are duplicated; the document router is compositeId.
Who can help me?





Re: SOLR 5.4.0?

2016-01-08 Thread Ere Maijala
Sorry for taking so long. I can confirm that SOLR-8418 is fixed for me 
in a self-built 5.5.0 snapshot. Now the next obvious question is, any 
ETA for a release?


Regards,
Ere

On 31.12.2015, 19.15, Erick Erickson wrote:

Ere:

Can you help with testing the patch if it's important to you? Ramkumar
is working on it...


Best,
Erick

On Wed, Dec 30, 2015 at 11:07 PM, Ere Maijala  wrote:

Well, for us SOLR-8418 is a major issue. I haven't encountered other issues,
but that one was sort of a show-stopper.

--Ere

On 31.12.2015, 7.27, William Bell wrote:


How is SOLR 5.4.0? I heard there was a quick 5.4.1 coming out?

Any major issues?



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: date difference faceting

2016-01-08 Thread David Santamauro


For anyone wanting to know an answer, I used

facet.query={!frange l=0 u=3110400}ms(d_b,d_a)
facet.query={!frange l=3110401 u=6220800}ms(d_b,d_a)
facet.query={!frange l=6220801 u=15552000}ms(d_b,d_a)

etc ...

Not the prettiest nor most efficient but accomplishes what I need 
without re-indexing TBs of data.
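
For reference, the whole request looks something like this (the collection
name is a placeholder; POST avoids having to URL-escape the local-params
syntax):

curl "http://localhost:8983/solr/collection/select" \
  --data-urlencode "q=*:*" \
  --data-urlencode "rows=0" \
  --data-urlencode "facet=true" \
  --data-urlencode "facet.query={!frange l=0 u=3110400}ms(d_b,d_a)" \
  --data-urlencode "facet.query={!frange l=3110401 u=6220800}ms(d_b,d_a)"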


thanks.

On 01/08/2016 12:09 PM, Erick Erickson wrote:

I'm going to side-step your primary question and say that it's nearly
always best to do your calculations up-front during indexing to make
queries more efficient and thus serve more requests on the same
hardware. This assumes that the stat you're interested in is
predictable of course...

Best,
Erick

On Fri, Jan 8, 2016 at 2:23 AM, David Santamauro
 wrote:


Hi,

I have two date fields, d_a and d_b, both of type solr.TrieDateField, that
represent different events associated with a particular document. The
interval between these dates is relevant for corner-case statistics. The
interval is calculated as the difference: sub(d_b,d_a) and I've been able to

   stats=true&stats.field={!func}sub(d_b,d_a)

What I ultimately would like to report is the interval represented as a
range, which could be seen as facet.query

(pseudo code)
   facet.query=sub(d_b,d_a)[ * TO 8640 ] // day
   facet.query=sub(d_b,d_a)[ 8641 TO 60480 ] // week
   facet.query=sub(d_b,d_a)[ 60481 TO 259200 ] // month
etc.

Aside from actually indexing the difference in a separate field, is there
something obvious I'm missing? I'm on SOLR 5.2 in cloud mode.

thanks
David


Re: SOLR replicas performance

2016-01-08 Thread Tomás Fernández Löbbe
Hi Luca,
It looks like your queries are complex wildcard queries. My theory is that
you are CPU-bound: for a single query, one CPU core for each shard will be
at 100% for the duration of the sub-query. Smaller shards make these
sub-queries faster which is why 16 shards is better than 8 in your case.
* In your 16x1 configuration, you have exactly one shard per CPU core, so
in a single query, 16 subqueries will go to both nodes evenly and use one
of the CPU cores.
* In your 8x2 configuration, you still get to use one CPU core per shard,
but the shards are bigger, so maybe each subquery takes longer (for the
single query thread and 8x2 scenario I would expect CPU utilization to be
lower?).
* In your 16x2 case 16 subqueries will be distributed unevenly, and some
node will get more than 8 subqueries, which means that some of the
subqueries will have to wait for their turn for a CPU core. In addition,
more Solr cores will be competing for resources.
If this theory is correct, adding more replicas won't speed up your queries;
you need to either get faster CPUs or simplify your queries/configuration in
some way. Adding more replicas should improve your query throughput, but
only if you add them on more hardware, not the same machines.

...anyway, just a theory

Tomás

On Fri, Jan 8, 2016 at 7:40 AM, Shawn Heisey  wrote:

> On 1/8/2016 7:55 AM, Luca Quarello wrote:
> > I used solr5.3.1 and I sincerely expected response times with replica
> > configuration near to response times without replica configuration.
> >
> > Do you agree with me?
> >
> > I read here
> >
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html
> > that "Queries do not need to be routed to leaders; they can be handled by
> > any replica in a shard. Leaders are only needed for handling update
> > requests. "
> >
> > I haven't found this behaviour. In my case CONF2 and CONF3 have all
> replicas
> > on VM2 but analyzing core utilization during a request is 100% on both
> > machines. Why?
>
> Indexing is a little bit slower with replication -- the update must
> happen on all replicas.
>
> If your index is sharded (which I believe you did indicate in your
> initial message), you may find that all replicas get used even for
> queries.  It is entirely possible that some of the shard subqueries will
> be processed on one replica and some of them will be processed on other
> replicas.  I do not know if this commonly happens, but I would not be
> surprised if it does.  If the machines are sized appropriately for the
> index, this separation should speed up queries, because you have the
> resources of multiple machines handling one query.
>
> That phrase "sized appropriately" is very important.  Your initial
> message indicated that you have a 90GB index, and that you are running
> in virtual machines.  Typically VMs have fairly small memory sizes.  It
> is very possible that you simply don't have enough memory in the VM for
> good performance with an index that large.  With 90GB of index data on
> one machine, I would hope for at least 64GB of RAM, and I would prefer
> to have 128GB.  If there is more than 90GB of data on one machine, then
> even more memory would be needed.
>
> Thanks,
> Shawn
>
>


SolrCloud: Setting/finding node names for deleting replicas

2016-01-08 Thread Robert Brown

Hi,

I'm having trouble identifying a replica to delete...

I've created a 3-shard cluster, all 3 created on a single host, then 
added a replica for shard2 onto another host, no problem so far.


Now I want to delete the original shard, but got this error when trying 
a *replica* param value I thought would work...


shard2/uk available replicas are core_node1,core_node4

I can't find any mention of core_node1 or core_node4 via the admin UI, 
how would I know/find the name of each one?


Is it possible to set these names explicitly myself for easier maintenance?

Many thanks for any guidance,
Rob



Re: SOLR replicas performance

2016-01-08 Thread Luca Quarello
Hi Shawn,
I expect that indexing is a little bit slower with replication, but in my
case it is 3 times worse. I can't explain this.

The monitored consumption of resources: all the tests have shown I/O
utilization of 100MB/s while loading data into the disk cache, disk cache
utilization of 20GB, and core utilization of 100% (all 8 cores).

So it seems that the bottleneck is the CPU cores and not RAM. I don't expect
a performance improvement from increasing RAM. Am I wrong?


Thanks,
Luca

On Fri, Jan 8, 2016 at 4:40 PM, Shawn Heisey  wrote:

> On 1/8/2016 7:55 AM, Luca Quarello wrote:
> > I used solr5.3.1 and I sincerely expected response times with replica
> > configuration near to response times without replica configuration.
> >
> > Do you agree with me?
> >
> > I read here
> >
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html
> > that "Queries do not need to be routed to leaders; they can be handled by
> > any replica in a shard. Leaders are only needed for handling update
> > requests. "
> >
> > I haven't found this behaviour. In my case CONF2 and CONF3 have all
> replicas
> > on VM2 but analyzing core utilization during a request is 100% on both
> > machines. Why?
>
> Indexing is a little bit slower with replication -- the update must
> happen on all replicas.
>
> If your index is sharded (which I believe you did indicate in your
> initial message), you may find that all replicas get used even for
> queries.  It is entirely possible that some of the shard subqueries will
> be processed on one replica and some of them will be processed on other
> replicas.  I do not know if this commonly happens, but I would not be
> surprised if it does.  If the machines are sized appropriately for the
> index, this separation should speed up queries, because you have the
> resources of multiple machines handling one query.
>
> That phrase "sized appropriately" is very important.  Your initial
> message indicated that you have a 90GB index, and that you are running
> in virtual machines.  Typically VMs have fairly small memory sizes.  It
> is very possible that you simply don't have enough memory in the VM for
> good performance with an index that large.  With 90GB of index data on
> one machine, I would hope for at least 64GB of RAM, and I would prefer
> to have 128GB.  If there is more than 90GB of data on one machine, then
> even more memory would be needed.
>
> Thanks,
> Shawn
>
>


Re: enable disable filter query caching based on statistics

2016-01-08 Thread Alessandro Benedetti
I read the client was happy, so I am only curious to know more :)
Apart from readability, shouldn't it be more efficient to put the filters
directly in the main query if you don't cache?
(Checking the code: when not caching, the filter is added as a Lucene boolean
query clause contributing 0 to the score; maybe this is an indication that at
the current stage this affirmation is not true anymore. In the past it was a
better approach than having them in separate filters.)
How do you specify a filter to be a postFilter and run only over the query
result cache?
Of course I don't know if you are excluding filters via tags or have some
other requirements.
I saw you specified the gain in rpm; what about the query time?
Relatedly, the rest of the issue is also in the Solr comment in the source
code:

org/apache/solr/search/SolrIndexSearcher.java:1597
...

// now actually use the filter cache.
// for large filters that match few documents, this may be
// slower than simply re-executing the query.
if (out.docSet == null) {
  out.docSet = getDocSet(cmd.getQuery(), cmd.getFilter());
  DocSet bigFilt = getDocSet(cmd.getFilterList());
  if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt);
}

...

Cheers


Binoy:

bq: In such a case won't applying fqs normally be the same as applying
them as post filters

Certainly not, at least AFAIK...

By definition, regular FQs are calculated over the entire corpus
(not, NOT just the docs that satisfy the query). Then that entire
bitset is stored in the filterCache where it can be reused. Which
is why filterCache entries can be used for different queries.

Also by definition, post filters are _not_ calculated over the
entire corpus, they are only calculated for docs that
1> pass the query criteria
and
2> pass all lower-cost filters
so they will not apply at all to the next query, are not stored in
the filterCache etc.
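
As an illustration (the field names here are invented): because frange
implements the PostFilter interface, a clause like this with cache=false and
cost >= 100 is evaluated only against documents that pass the main query and
all cheaper filters:

fq={!frange l=0 u=100 cache=false cost=150}product(popularity,price)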

So I think what Matteo is seeing is that with a restrictive FQ clause,
very few docs have to be tested against most of the FQs.

Matteo:

My guess (and I'm not intimately familiar with the code) is that, indeed,
the restrictive clause is helping you a lot here. Frankly, I doubt that
adding a cost will make a measurable difference if the most restrictive
FQ clause is quite sparse.

I'm still puzzled in your test scenario why there is such a difference when
making all the filter queries cache=false. _Assuming_ that provincia and type
are relatively low-cardinality fields, they should all be in the
filterCache pretty
quickly. But perhaps ANDing the bitsets together is more expensive than the
advantage in this case. I'd be curious as to the hit ratio you were seeing.

But as you say, if the client is satisfied I'm not sure it's worth
pursuing...

Best,
Erick

On Tue, Jan 5, 2016 at 11:09 AM, Matteo Grolla 
wrote:
> Hi Erik,
>  the test was done on thousands of queries of that kind and millions of
> docs
> I went from <1500 qpm to ~ 6000 qpm on modest virtualized hardware (cpu
> bound and cpu was scarce)
> After that customer happy, time finished and didn't go further but
> definitely cost was something I'd try
> When I saw the presentation of CloudSearch where they explained that they
> were enabling/disabling caching based on fq statistics I thought this kind
> of problem was general enough that I could find a plugin already built
>
> 2016-01-05 19:17 GMT+01:00 Erick Erickson :
>
>>
>>
fq={!cache=false}n_rea:xxx&fq={!cache=false}provincia:,fq={!cache=false}type:
>>
>> You have a comma in front of the last fq clause, typo?
>>
>> Well, the whole point of caching filter queries is so that the
>> _second_ time you use it,
>> very little work has to be done. That comes at a cost of course for
>> first-time execution.
>> Basically any fq clause that you can guarantee won't be re-used should
>> have cache=false
>> set.
>>
>> I'd be surprised if the second time you use the provincia and type fq
>> clauses not caching
>> would be faster, but I've been surprised before. I guess anding two
>> bitsets together could
>> take more time than, say, testing a small number of individual
>> documents
>>
>> And I'm assuming that you're testing multiple queries rather than just
>> one-offs.
>>
>> If you _do_ know that some of your clauses are very restrictive, I
>> wonder what happens if
>> you add a cost in. fq's are evaluated in cost order (when
>> cache=false), so what happens
>> in this case?
>> fq={!cache=false cost=101}n_rea:xxx&fq={!cache=false
>> cost=102}provincia:&fq={!cache=false cost=103}type:
>>
>> Best,
>> Erick
>>
>> On Tue, Jan 5, 2016 at 9:41 AM, Matteo Grolla 
>> wrote:
>> > Thanks Erik and Binoy,
>> >  This is a case I stumbled upon: with queries like
>> >
>> >
>>
q=*:*&fq={!cache=false}n_rea:xxx&fq={!cache=false}provincia:,fq={!cache=false}type:
>> >
>> > where n_rea filter is highly selective
>> > I was able to make > 3x performance improvement disabling cache
>> >
>> > I think it's because the 

Re: SOLR replicas performance

2016-01-08 Thread Luca Quarello
Hi Tomas,
I give you other details.


   - The fragment field contains 3KB XML messages.
   - The queries that I used for the test are (I only change the word to
   search inside the fragment field between requests):
   curl "http://localhost:8983/solr/sepa/select?q=+fragment%3A*A*+&fq=marked%3AT&fq=-fragmentContentType%3ABULK&start=0&rows=100&sort=creationTimestamp+desc%2Cid+asc"

   - All the tests were executed inside VMs on dedicated HW, in detail:

2 Hypervisors ESX 5.5 on:

   - Server PowerEdge T420 - Dual Xeon E5-2420 with 128GB of RAM
   - RAID10 local storage, 4x Near Line SAS 7.200 (about 100MB/s guaranteed
   bandwidth)


I have executed another test with this configuration: 8 shards of 35M
documents on VM1 and 8 empty shards on VM2 (CONF4). The configuration is
without replicas.

We can now compare the response times (average / maximum, in seconds) for
CONF2 and CONF4:

   - without indexing operations

     CONF2:
       - sequential: 12,3 / 17,4
       - 5 parallel: 32,5 / 34,2
       - 10 parallel: 45,4 / 49
       - 20 parallel: 64,6 / 74

     CONF4:
       - sequential: 5 / 9,1
       - 5 parallel: 25 / 31
       - 10 parallel: 41 / 49
       - 20 parallel: 60 / 73

   - with indexing operations

     CONF2:
       - sequential: 12,3 / 19
       - 5 parallel: 39 / 40,8
       - 10 parallel: 56,6 / 62,9
       - 20 parallel: 79 / 116

     CONF4:
       - sequential: 15,5 / 17,5
       - 5 parallel: 30,7 / 38,3
       - 10 parallel: 57,5 / 64,2
       - 20 parallel: 60 / 81,4


During the test:

   - CONF2: the 8 cores on VM1 and the 8 cores on VM2 were 100% used (except
   for the sequential test without indexing operations, where usage was about
   80%).
   - CONF4: the 8 cores on VM1 were 100% used.


As you can see, performance is similar for the tests with 5 and 10 parallel
requests, both during indexing operations and without them, but very
different with sequential requests and with 20 parallel requests. I don't
understand why.

Thanks,
Luca

On Fri, Jan 8, 2016 at 6:47 PM, Tomás Fernández Löbbe  wrote:

> Hi Luca,
> It looks like your queries are complex wildcard queries. My theory is that
> you are CPU-bound: for a single query, one CPU core for each shard will be
> at 100% for the duration of the sub-query. Smaller shards make these
> sub-queries faster which is why 16 shards is better than 8 in your case.
> * In your 16x1 configuration, you have exactly one shard per CPU core, so
> in a single query, 16 subqueries will go to both nodes evenly and use one
> of the CPU cores.
> * In your 8x2 configuration, you still get to use one CPU core per shard,
> but the shards are bigger, so maybe each subquery takes longer (for the
> single query thread and 8x2 scenario I would expect CPU utilization to be
> lower?).
> * In your 16x2 case 16 subqueries will be distributed unevenly, and some
> node will get more than 8 subqueries, which means that some of the
> subqueries will have to wait for their turn for a CPU core. In addition,
> more Solr cores will be competing for resources.
> If this theory is correct, adding more replicas won't speed up your queries;
> you need to either get faster CPUs or simplify your queries/configuration in
> some way. Adding more replicas should improve your query throughput, but
> only if you add them on more hardware, not the same machines.
>
> ...anyway, just a theory
>
> Tomás
>
> On Fri, Jan 8, 2016 at 7:40 AM, Shawn Heisey  wrote:
>
> > On 1/8/2016 7:55 AM, Luca Quarello wrote:
> > > I used solr5.3.1 and I sincerely expected response times with replica
> > > configuration near to response times without replica configuration.
> > >
> > > Do you agree with me?
> > >
> > > I read here
> > >
> >
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html
> > > that "Queries do not need to be routed to leaders; they can be handled
> by
> > > any replica in a shard. Leaders are only needed for handling update
> > > requests. "
> > >
> > > I haven't found this behaviour. In my case CONF2 and CONF3 have all
> > replicas
> > > on VM2 but analyzing core utilization during a request is 100% on both
> > > machines. Why?
> >
> > Indexing is a little bit slower with replication -- the update must
> > happen on all replicas.
> >
> > If your index is sharded (which I believe you did indicate in your
> > initial message), you may find that all replicas get used even for
> > queries.  It is entirely possible that some of the shard subqueries will
> > be processed on one replica and some of them will be processed on other
> > replicas.  I do not know if this commonly happens, but I would not be
> > surprised if it does.  If the machines are sized appropriately for the
> > index, this separation should speed up queries, because you have the
> > resources of multiple machines handling one query.
> >
> > That phrase "sized appropriately" is very important.  Your initial
> > message indicated that you have 

Re: SolrCloud: Setting/finding node names for deleting replicas

2016-01-08 Thread Jeff Wartes

Honestly, I have no idea which is "old". The solr source itself uses slice 
pretty consistently, so I stuck with that when I started the project last year. 
And logically, a shard being an instance of a slice makes sense to me. But one 
significant place where the word shard is exposed is the default names of the 
slices, so it’s a mixed bag.


See here:
  https://github.com/whitepages/solrcloud_manager#terminology






On 1/8/16, 2:34 PM, "Robert Brown"  wrote:

>Thanks for the pointer Jeff,
>
>For SolrCloud it turned out to be...
>
>replica=xxx
>
>btw, for your app, isn't "slice" old notation?
>
>
>
>
>On 08/01/16 22:05, Jeff Wartes wrote:
>>
>> I’m pretty sure you could change the name when you ADDREPLICA using a 
>> core.name property. I don’t know if you can when you initially create the 
>> collection though.
>>
>> The CLUSTERSTATUS command will tell you the core names: 
>> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api18
>>
>> That said, this tool might make things easier.
>> https://github.com/whitepages/solrcloud_manager
>>
>>
>> # shows cluster status, including core names:
>> java -jar solrcloud_manager-assembly-1.4.0.jar -z zk0.example.com:2181/myapp
>>
>>
>> # deletes a replica by node/collection/shard (figures out the core name 
>> under the hood)
>> java -jar solrcloud_manager-assembly-1.4.0.jar deletereplica -z 
>> zk0.example.com:2181/myapp -c collection1 --node node1.example.com --slice 
>> shard2
>>
>>
>> I mention this tool every now and then on this list because I like it, but 
>> I’m the author, so take that with a pretty big grain of salt. Feedback is 
>> very welcome.
>>
>>
>>
>>
>>
>>
>>
>> On 1/8/16, 1:18 PM, "Robert Brown"  wrote:
>>
>>> Hi,
>>>
>>> I'm having trouble identifying a replica to delete...
>>>
>>> I've created a 3-shard cluster, all 3 created on a single host, then
>>> added a replica for shard2 onto another host, no problem so far.
>>>
>>> Now I want to delete the original shard, but got this error when trying
>>> a *replica* param value I thought would work...
>>>
>>> shard2/uk available replicas are core_node1,core_node4
>>>
>>> I can't find any mention of core_node1 or core_node4 via the admin UI,
>>> how would I know/find the name of each one?
>>>
>>> Is it possible to set these names explicitly myself for easier maintenance?
>>>
>>> Many thanks for any guidance,
>>> Rob
>>>
>


Re: Performance of stats=true&stats.field={!cardinality=1.0}fl

2016-01-08 Thread Toke Eskildsen
On Wed, 2016-01-06 at 12:39 +0530, Modassar Ather wrote:
> q=fl1:net*&fl=fl&rows=50&stats=true&stats.field={!cardinality=1.0}fl
> is returning cardinality around 15 million. It is taking around 4 minutes.

Is this a single shard or multiple?

Anyway, you might have better luck trying the 'unique' request in JSON
faceting:
https://cwiki.apache.org/confluence/display/solr/Faceted+Search
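
A minimal sketch of that request, with the field names from your mail (the
collection name is a placeholder; POST avoids URL-escaping):

curl "http://localhost:8983/solr/collection/select" \
  --data-urlencode "q=fl1:net*" \
  --data-urlencode "rows=0" \
  --data-urlencode "json.facet={fl_count:'unique(fl)'}"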

- Toke Eskildsen, State and University Library, Denmark




Re: SolrCloud: Setting/finding node names for deleting replicas

2016-01-08 Thread Jeff Wartes


I’m pretty sure you could change the name when you ADDREPLICA using a core.name 
property. I don’t know if you can when you initially create the collection 
though.

The CLUSTERSTATUS command will tell you the core names: 
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api18
 

That said, this tool might make things easier.
https://github.com/whitepages/solrcloud_manager


# shows cluster status, including core names:
java -jar solrcloud_manager-assembly-1.4.0.jar -z zk0.example.com:2181/myapp


# deletes a replica by node/collection/shard (figures out the core name under 
the hood)
java -jar solrcloud_manager-assembly-1.4.0.jar deletereplica -z 
zk0.example.com:2181/myapp -c collection1 --node node1.example.com --slice 
shard2


I mention this tool every now and then on this list because I like it, but I’m 
the author, so take that with a pretty big grain of salt. Feedback is very 
welcome.







On 1/8/16, 1:18 PM, "Robert Brown"  wrote:

>Hi,
>
>I'm having trouble identifying a replica to delete...
>
>I've created a 3-shard cluster, all 3 created on a single host, then 
>added a replica for shard2 onto another host, no problem so far.
>
>Now I want to delete the original shard, but got this error when trying 
>a *replica* param value I thought would work...
>
>shard2/uk available replicas are core_node1,core_node4
>
>I can't find any mention of core_node1 or core_node4 via the admin UI, 
>how would I know/find the name of each one?
>
>Is it possible to set these names explicitly myself for easier maintenance?
>
>Many thanks for any guidance,
>Rob
>


Re: SolrCloud: Setting/finding node names for deleting replicas

2016-01-08 Thread Robert Brown

Thanks for the pointer Jeff,

For SolrCloud it turned out to be...

replica=xxx
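
For reference, the complete Collections API call takes roughly this shape
(host, collection, shard, and replica names are placeholders matching this
thread):

http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=uk&shard=shard2&replica=core_node1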

btw, for your app, isn't "slice" old notation?




On 08/01/16 22:05, Jeff Wartes wrote:


I’m pretty sure you could change the name when you ADDREPLICA using a core.name 
property. I don’t know if you can when you initially create the collection 
though.

The CLUSTERSTATUS command will tell you the core names: 
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api18

That said, this tool might make things easier.
https://github.com/whitepages/solrcloud_manager


# shows cluster status, including core names:
java -jar solrcloud_manager-assembly-1.4.0.jar -z zk0.example.com:2181/myapp


# deletes a replica by node/collection/shard (figures out the core name under 
the hood)
java -jar solrcloud_manager-assembly-1.4.0.jar deletereplica -z 
zk0.example.com:2181/myapp -c collection1 --node node1.example.com --slice 
shard2


I mention this tool every now and then on this list because I like it, but I’m 
the author, so take that with a pretty big grain of salt. Feedback is very 
welcome.







On 1/8/16, 1:18 PM, "Robert Brown"  wrote:


Hi,

I'm having trouble identifying a replica to delete...

I've created a 3-shard cluster, all 3 created on a single host, then
added a replica for shard2 onto another host, no problem so far.

Now I want to delete the original shard, but got this error when trying
a *replica* param value I thought would work...

shard2/uk available replicas are core_node1,core_node4

I can't find any mention of core_node1 or core_node4 via the admin UI,
how would I know/find the name of each one?

Is it possible to set these names explicitly myself for easier maintenance?

Many thanks for any guidance,
Rob





Re: solrcloud - How to delete a doc at a specific shard

2016-01-08 Thread elvis鱼人
The Solr version is 5.2.0.
The problem is different shards holding the same ID;
the document router is compositeId.
And if I do this:
../collection/update?commit=true&stream.body=<delete><id>idhere</id></delete>
then this ID goes missing in the whole SolrCloud.





Specifying a different txn log directory

2016-01-08 Thread KNitin
Hi,

How do I specify a different directory for transaction logs? I tried using
the updateLog entry in solrconfig.xml and reloaded the collection, but that
does not seem to work.

Is there another setting I need to change?
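
For reference, the entry I tried has this general shape (the path is an
example):

<updateLog>
  <str name="dir">/mnt/fast-disk/tlogs</str>
</updateLog>

In the stock configs the directory comes from a system property
(<str name="dir">${solr.ulog.dir:}</str>), so starting Solr with
-Dsolr.ulog.dir=/mnt/fast-disk/tlogs may be another option.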

Thanks
Nitin


Re: Performance of stats=true&stats.field={!cardinality=1.0}fl

2016-01-08 Thread Modassar Ather
Hi,

Any input would be helpful.

Thanks,
Modassar

On Wed, Jan 6, 2016 at 12:39 PM, Modassar Ather 
wrote:

> Hi,
>
>
> q=fl1:net*&fl=fl&rows=50&stats=true&stats.field={!cardinality=1.0}fl
> is returning cardinality around 15 million. It is taking around 4 minutes.
> Similar response time is seen with different queries which yields high
> cardinality. Kindly note that the cardinality=1.0 is the desired goal.
> Here in the above example the fl1 is a text field whereas fl is a docValues
> enabled, non-stored, non-indexed field.
> Kindly let me know if such response time is expected or I am missing
> something about this feature in my query.
>
> Thanks,
> Modassar
>


date difference faceting

2016-01-08 Thread David Santamauro


Hi,

I have two date fields, d_a and d_b, both of type solr.TrieDateField, 
that represent different events associated with a particular document. 
The interval between these dates is relevant for corner-case statistics. 
The interval is calculated as the difference: sub(d_b,d_a) and I've been 
able to


  stats=true&stats.field={!func}sub(d_b,d_a)

What I ultimately would like to report is the interval represented as a 
range, which could be seen as facet.query


(pseudo code)
  facet.query=sub(d_b,d_a)[ * TO 8640 ] // day
  facet.query=sub(d_b,d_a)[ 8641 TO 60480 ] // week
  facet.query=sub(d_b,d_a)[ 60481 TO 259200 ] // month
etc.

Aside from actually indexing the difference in a separate field, is 
there something obvious I'm missing? I'm on SOLR 5.2 in cloud mode.


thanks
David


Re: Solr search and index rate optimization

2016-01-08 Thread Toke Eskildsen
On Fri, 2016-01-08 at 10:55 +0500, Zap Org wrote:
> I wanted to ask: I need to index every 15 min with a hard commit
> (real-time records) and currently have 5 ZooKeeper instances and 2 Solr
> instances on one machine serving 200 users with 32GB RAM. I want to
> serve more than 10,000 users, so what should my machine specs be and what
> should my architecture be for that serve rate along with the index rate?

It depends on your system and if we were forced to guess, our guess
would be very loose.


Fortunately you do have a running system with real queries: Make a copy
on two similar machines (you will probably need more hardware anyway)
and simulate growing traffic, measuring response times at appropriate
points: 200 users, 500, 1000, 2000 etc.

If you are very lucky, your current system scales all the way. If not,
you should have enough data to make an educated guess of the amount of
machines you need. You should have at least 3 measuring points to
extrapolate from, as scaling is not always linear.
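
A crude sketch of such a sweep (this assumes Apache Bench and one
representative query; host, collection, and query are placeholders):

for c in 200 500 1000 2000; do
  ab -n 5000 -c $c "http://localhost:8983/solr/collection/select?q=field:term" \
    | grep "Time per request"
done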

- Toke Eskildsen, State and University Library, Denmark




Re: Manage schema.xml via Solrj?

2016-01-08 Thread Shawn Heisey
On 1/8/2016 6:30 AM, Bob Lawson wrote:
> Thanks for the replies.  The problem I'm trying to solve is to automate
> whatever steps I can in configuring Solr for our customer.  Rather than an
> admin have to edit schema.xml, I thought it would be easier and less
> error-prone to do it programmatically.  But I'm a novice, so if there is a
> better, more standard way, please let me know.  Thanks!!!

I personally find editing the schema.xml to be the best option, but I
have not actually used the Schema API.  At the point in my deployment
where I was making frequent schema edits (mostly on 1.4 versions, with
some of it on 3.x versions), the API did not exist.

The information about this API in the reference guide looks pretty nice.

> PS:  What do you mean by "XY problem"?

This is summarized here:

https://home.apache.org/~hossman/#xyproblem

Thanks,
Shawn



Re: Performance of stats=true&stats.field={!cardinality=1.0}fl

2016-01-08 Thread Modassar Ather
Hi Toke,

> Is this a single shard or multiple?
It is a 12-shard cluster without replicas, with around 90+ GB on each shard.

Thanks for sharing the link. I will look into that.

Regards,
Modassar

On Fri, Jan 8, 2016 at 4:28 PM, Toke Eskildsen 
wrote:

> On Wed, 2016-01-06 at 12:39 +0530, Modassar Ather wrote:
> >
> q=fl1:net*&fl=fl&rows=50&stats=true&stats.field={!cardinality=1.0}fl
> > is returning cardinality around 15 million. It is taking around 4
> minutes.
>
> Is this a single shard or multiple?
>
> Anyway, you might have better luck trying the 'unique' request in JSON
> faceting:
> https://cwiki.apache.org/confluence/display/solr/Faceted+Search
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>


Re: Solr UIMA Custom Annotator PEAR file installation on Linux

2016-01-08 Thread Tommaso Teofili
Hi,

do you mean you want to use a PEAR to provide the Annotator for the Solr
UIMA UpdateProcessor?

Can you please detail a bit more your needs?

Regards,
Tommaso

2016-01-08 1:57 GMT+01:00 techqnq :

> I implemented a custom annotator and generated the PEAR file.
> Windows has the PEAR installer utility, but how do I do this from the command
> line, or what other options are there on Linux?
>
>
>
>


Re: Manage schema.xml via Solrj?

2016-01-08 Thread Bob Lawson
Thanks for the replies.  The problem I'm trying to solve is to automate
whatever steps I can in configuring Solr for our customer.  Rather than an
admin have to edit schema.xml, I thought it would be easier and less
error-prone to do it programmatically.  But I'm a novice, so if there is a
better, more standard way, please let me know.  Thanks!!!

PS:  What do you mean by "XY problem"?

On Thu, Jan 7, 2016 at 11:20 PM, Erick Erickson 
wrote:

> I'd ask first what the high-level problem you're trying to solve is, this
> could be an XY problem.
>
> That said, there's the Schema API you can use, see:
> https://cwiki.apache.org/confluence/display/solr/Schema+API
>
> You can access it from the SolrJ library, see
> SchemaRequest.java. For examples of using this, see:
> SchemaTest.java
>
> to _get_ the Solr source code to see these, see:
> https://wiki.apache.org/solr/HowToContribute
>
> Best,
> Erick
>
> On Thu, Jan 7, 2016 at 7:01 PM, Binoy Dalal 
> wrote:
> > I am not sure about solrj but you can use any XML parsing library to
> > achieve this.
> > Take a look here:
> > http://www.tutorialspoint.com/java_xml/java_xml_parsers.htm
> >
> > On Fri, 8 Jan 2016, 08:06 Bob Lawson  wrote:
> >
> >> I want to programmatically make changes to schema.xml using java to do
> >> it.  Should I use Solrj to do this or is there a better way?  Can I use
> >> Solrj to make the rest calls that make up the schema API?  Whatever the
> >> answer, can anyone point me to an example showing how to do it?  Thanks!
> >>
> >> --
> > Regards,
> > Binoy Dalal
>


Re: SOLR replicas performance

2016-01-08 Thread Luca Quarello
Hi Erick,
I used solr5.3.1 and I sincerely expected response times with replica
configuration near to response times without replica configuration.

Do you agree with me?

I read here
http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html
that "Queries do not need to be routed to leaders; they can be handled by
any replica in a shard. Leaders are only needed for handling update
requests. "

I haven't found this behaviour. In my case CONF2 and CONF3 have all replicas
on VM2 but analyzing core utilization during a request is 100% on both
machines. Why?

Best,
Luca


*Luca Quarello*

M:+39 347 018 3855

luca.quare...@xeffe.it



XEFFE s.r.l

C.so Giovanni Lanza 72, 10131 Torino

T: +39 011 660 5039

F: +39 011 198 26822

www.xeffe.it

On Tue, Jan 5, 2016 at 5:08 PM, Erick Erickson 
wrote:

> What version of Solr? Prior to 5.2 the replicas were doing lots of
> unnecessary work/being blocked, see:
>
> https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/
>
> Best,
> Erick
>
> On Tue, Jan 5, 2016 at 6:09 AM, Matteo Grolla 
> wrote:
> > Hi Luca,
> >   not sure if I understood well. Your question is
> > "Why are index times on a solr cloud collecton with 2 replicas higher
> than
> > on solr cloud with 1 replica" right?
> > Well with 2 replicas all docs have to be deparately indexed in 2 places
> and
> > solr has to confirm that both indexing went well.
> > Indexing times are lower on a solrcloud collection with 2 shards (just
> one
> > replica, the leader, per shard) because docs are indexed just once and
> the
> > load is spread on 2 servers instead of one
> >
> > 2015-12-30 2:03 GMT+01:00 Luca Quarello :
> >
> >> Hi,
> >>
> >> I have a 260M-document index (90GB) with this structure:
> >>
> >>
> >>  >> multiValued="false" termVectors="false" termPositions="false"
> >> termOffsets="false" />
> >>
> >>>> multiValued="false"/>
> >>
> >>>> stored="true" multiValued="false"/>
> >>
> >>>> multiValued="false"/>
> >>
> >>stored="true"
> >> multiValued="false"/>
> >>
> >>>> multiValued="false"/>
> >>
> >>>> multiValued="false"/>
> >>
> >>>> multiValued="false"/>
> >>
> >>
> >>
> >>>> multiValued="true"/>
> >>
> >>   
> >>
> >>   
> >>
> >>   
> >>
> >>   
> >>
> >>   
> >>
> >>   
> >>
> >>   
> >>
> >>
> >> where the fragment field contains XML messages.
> >>
> >> There is a search function that provides the messages satisfying a
> >> search criterion.
> >>
> >>
> >> TARGET:
> >>
> >> To find the best configuration to optimize the response time of a two
> solr
> >> instances cloud with 2 VM with 8 core and 32 GB
> >>
> >>
> >> TEST RESULTS:
> >>
> >>
> >> 1. Configurations:
> >>    - the better configuration without replicas:
> >>      CONF1: 16 shards of 17M documents (8 per VM)
> >>    - configurations with replica:
> >>      CONF2: 8 shards of 35M documents with replication factor of 1
> >>      CONF3: 16 shards of 35M documents with replication factor of 1
> >>
> >> 2. Executed tests:
> >>    - sequential requests
> >>    - 5 parallel requests
> >>    - 10 parallel requests
> >>    - 20 parallel requests
> >>    in two scenarios: during an indexing phase and not
> >>
> >> Calls are: http://localhost:8983/solr/sepa/select?
> >> q=+fragment%3A*AAA*+&fq=marked%3AT&fq=-fragmentContentType
> >> %3ABULK&start=0&rows=100&sort=creationTimestamp+desc%2Cid+asc
> >>
> >> 3. Test results
> >>
> >> All the tests have shown an I/O utilization of 100MB/s during
> >> loading of data into the disk cache, disk cache utilization of 20GB and
> >> core utilization of 100% (all 8 cores)
> >>
> >>    - No indexing (time average and maximum time)
> >>      CONF1: sequential 4,1 / 6,9; 5 parallel 15,6 / 19,1;
> >>             10 parallel 23,6 / 30,2; 20 parallel 48 / 52,2
> >>      CONF2: sequential 12,3 / 17,4; 5 parallel 32,5 / 34,2;
> >>             10 parallel 45,4 / 49; 20 parallel 64,6 / 74
> >>      CONF3: sequential 6,9 / 9,9; 5 parallel 33,2 / 37,5;
> >>             10 parallel 46 / 51; 20 parallel 68 / 83
> >>
> >>    - Indexing (in the Solr admin console, is it possible to view the
> >> total throughput? I find it only relative to a single shard)
> >>      CONF1: sequential 7,7 / 9,5; 5 parallel 26,8 / 28,4;

Re: SOLR replicas performance

2016-01-08 Thread Luca Quarello
Hi Erick,
I used solr5.3.1 and I sincerely expected response times with replica
configuration near to response times without replica configuration.

Do you agree with me?

I read here
http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html that
"Queries do not need to be routed to leaders; they can be handled by any
replica in a shard. Leaders are only needed for handling update requests. "

I haven't found this behaviour. In my case CONF2 and CONF3 have all replicas
on VM2 but analyzing core utilization during a request is 100% on both
machines. Why?

Best,
Luca

On Tue, Jan 5, 2016 at 5:08 PM, Erick Erickson 
wrote:

> What version of Solr? Prior to 5.2 the replicas were doing lots of
> unnecessary work/being blocked, see:
>
> https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/
>
> Best,
> Erick
>
> On Tue, Jan 5, 2016 at 6:09 AM, Matteo Grolla 
> wrote:
> > Hi Luca,
> >   not sure if I understood well. Your question is
> > "Why are index times on a solr cloud collecton with 2 replicas higher
> than
> > on solr cloud with 1 replica" right?
> > Well with 2 replicas all docs have to be deparately indexed in 2 places
> and
> > solr has to confirm that both indexing went well.
> > Indexing times are lower on a solrcloud collection with 2 shards (just
> one
> > replica, the leader, per shard) because docs are indexed just once and
> the
> > load is spread on 2 servers instead of one
> >
> > 2015-12-30 2:03 GMT+01:00 Luca Quarello :
> >
> >> Hi,
> >>
> >> I have a 260M-document index (90GB) with this structure:
> >>
> >>
> >>  >> multiValued="false" termVectors="false" termPositions="false"
> >> termOffsets="false" />
> >>
> >>>> multiValued="false"/>
> >>
> >>>> stored="true" multiValued="false"/>
> >>
> >>>> multiValued="false"/>
> >>
> >>stored="true"
> >> multiValued="false"/>
> >>
> >>>> multiValued="false"/>
> >>
> >>>> multiValued="false"/>
> >>
> >>>> multiValued="false"/>
> >>
> >>
> >>
> >>>> multiValued="true"/>
> >>
> >>   
> >>
> >>   
> >>
> >>   
> >>
> >>   
> >>
> >>   
> >>
> >>   
> >>
> >>   
> >>
> >>
> >> where the fragment field contains XML messages.
> >>
> >> There is a search function that provides the messages satisfying a
> >> search criterion.
> >>
> >>
> >> TARGET:
> >>
> >> To find the best configuration to optimize the response time of a two
> solr
> >> instances cloud with 2 VM with 8 core and 32 GB
> >>
> >>
> >> TEST RESULTS:
> >>
> >>
> >> 1. Configurations:
> >>    - the better configuration without replicas:
> >>      CONF1: 16 shards of 17M documents (8 per VM)
> >>    - configurations with replica:
> >>      CONF2: 8 shards of 35M documents with replication factor of 1
> >>      CONF3: 16 shards of 35M documents with replication factor of 1
> >>
> >> 2. Executed tests:
> >>    - sequential requests
> >>    - 5 parallel requests
> >>    - 10 parallel requests
> >>    - 20 parallel requests
> >>    in two scenarios: during an indexing phase and not
> >>
> >> Calls are: http://localhost:8983/solr/sepa/select?
> >> q=+fragment%3A*AAA*+&fq=marked%3AT&fq=-fragmentContentType
> >> %3ABULK&start=0&rows=100&sort=creationTimestamp+desc%2Cid+asc
> >>
> >> 3. Test results
> >>
> >> All the tests have shown an I/O utilization of 100MB/s during
> >> loading of data into the disk cache, disk cache utilization of 20GB and
> >> core utilization of 100% (all 8 cores)
> >>
> >>    - No indexing (time average and maximum time)
> >>      CONF1: sequential 4,1 / 6,9; 5 parallel 15,6 / 19,1;
> >>             10 parallel 23,6 / 30,2; 20 parallel 48 / 52,2
> >>      CONF2: sequential 12,3 / 17,4; 5 parallel 32,5 / 34,2;
> >>             10 parallel 45,4 / 49; 20 parallel 64,6 / 74
> >>      CONF3: sequential 6,9 / 9,9; 5 parallel 33,2 / 37,5;
> >>             10 parallel 46 / 51; 20 parallel 68 / 83
> >>
> >>    - Indexing (in the Solr admin console, is it possible to view the
> >> total throughput? I find it only relative to a single shard)
> >>      CONF1: sequential 7,7 / 9,5; 5 parallel 26,8 / 28,4;
> >>             10 parallel 31,8 / 37,8; 20 parallel 42 / 52,5
> >>      CONF2: sequential 12,3 / 19

RE: Manage schema.xml via Solrj?

2016-01-08 Thread Davis, Daniel (NIH/NLM) [C]
Bob,

XY problem means that you are presenting the imagined solution without 
presenting the problem to solve.   In other words, you are presenting X (solve 
for X), without a full statement of the equation to be solved for X.

My guess at your problem is the same as my problem - editing Solr configuration 
(schema and solrconfig.xml) as files is very flexible and Agile compared to a 
form based solution, but that comes with the downside that anyone can "crash" a 
Solr collection by editing the schema wrong.   This goes beyond just XML syntax 
checking, obviously.But only Solr is the authority on what a good schema 
(and other configuration) should look like.

I'm working on a tool that can provide a bit of "smoke testing" on a Solr 
configuration directory.   The workflow I envision is like this:

1. DEVELOPER, TEAM LEAD, or SOLR ADMIN MAKE CHANGES TO CONFIGURATION DIRECTORY

 In the beginning, they may need to make lots of changes.   Eventually, 
they are only making small changes, but we don't want those
 small changes to crash anything.

2. DEVELOPER, TEAM LEAD, or SOLR ADMIN TRIGGER CONTINUOUS INTEGRATION

 When they push or merge to a git branch,  that may trigger a CI workflow.  
 The workflow works like this:

 2a.  Run the "smoke test" tool to (a) create a temporary configset in 
Zookeeper, (b) create a temporary collection in SolrCloud, and (c) do simple 
indexing.
 2b.  Use zkCli.sh and solr.sh to update the actual configset and 
collection in SolrCloud.

3. ITERATE

 This can happen again and again with a "staging", "QA", "Production" set 
of branches.Other checks can be put into the CI workflow as well.
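
     For step 2b, a sketch with the stock 5.x scripts (the ZK address, paths,
and names are placeholders):

     server/scripts/cloud-scripts/zkcli.sh -zkhost zk1.example.com:2181 \
         -cmd upconfig -confdir /path/to/configset/conf -confname myconf
     curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"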

So, along the way to having this vision (of my solution), I also considered the 
advantage of schemaless systems.   I don't want to throw stones, but I think 
schemaless is mostly a marketing term for a couple of reasons:

 - I do Linked Data/RDF - it is different from SQL, but not schemaless.   If 
your "vocabulary" is badly designed, then your users will have problems.
 - ElasticSearch is not really schemaless.   Any ElasticSearch conference is 
filled with tracks/sessions on how to get your "field mappings" right, and what 
happens if you don't (too big indexes, need to re-index to fix stuff, etc.)
 - IBM Watson Explorer is not really schemaless - your update document has to 
specify the type and treatment of each field, or your XSLT must transform your 
document into a structure that does so.

Many of us have also seen what happens with non-denormalized SQL or fully 
normalized SQL.   "Schemafull" ought to be a marketing term as well.

-Original Message-
From: Bob Lawson [mailto:bwlawson...@gmail.com] 
Sent: Friday, January 08, 2016 8:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Manage schema.xml via Solrj?

Thanks for the replies.  The problem I'm trying to solve is to automate 
whatever steps I can in configuring Solr for our customer.  Rather than an 
admin have to edit schema.xml, I thought it would be easier and less 
error-prone to do it programmatically.  But I'm a novice, so if there is a 
better, more standard way, please let me know.  Thanks!!!

PS:  What do you mean by "XY problem"?

On Thu, Jan 7, 2016 at 11:20 PM, Erick Erickson 
wrote:

> I'd ask first what the high-level problem you're trying to solve is, 
> this could be an XY problem.
>
> That said, there's the Schema API you can use, see:
> https://cwiki.apache.org/confluence/display/solr/Schema+API
>
> You can access it from the SolrJ library, see SchemaRequest.java. For 
> examples of using this, see:
> SchemaTest.java
>
> to _get_ the Solr source code to see these, see:
> https://wiki.apache.org/solr/HowToContribute
>
> Best,
> Erick
>
> On Thu, Jan 7, 2016 at 7:01 PM, Binoy Dalal 
> wrote:
> > I am not sure about solrj but you can use any XML parsing library to 
> > achieve this.
> > Take a look here:
> > http://www.tutorialspoint.com/java_xml/java_xml_parsers.htm
> >
> > On Fri, 8 Jan 2016, 08:06 Bob Lawson  wrote:
> >
> >> I want to programmatically make changes to schema.xml using java to 
> >> do it.  Should I use Solrj to do this or is there a better way?  
> >> Can I use Solrj to make the rest calls that make up the schema API?  
> >> Whatever the answer, can anyone point me to an example showing how to do 
> >> it?  Thanks!
> >>
> >> --
> > Regards,
> > Binoy Dalal
>


Re: SOLR replicas performance

2016-01-08 Thread Luca Quarello
Hi Matteo,
there are two questions:

   - "Why are response times on a SolrCloud collection with 1 replica
   higher than on SolrCloud without replicas?"

   Configuration1: SolrCloud with two 8-core VMs, each with 8
shards of 17M docs
   Configuration2: SolrCloud with two 8-core VMs, each with 8
shards of 17M docs (8 masters and 8 replicas)

I registered worse response times for the replica configuration (conf2) when:

   - Scenario1: I run queries without inserting records into the index
   - Scenario2: I run queries while inserting records into the index

I expect similar response times in Scenario1 and better response times for
configuration2 in Scenario2.

Is it correct?

Thanks,
Luca

On Fri, Jan 8, 2016 at 3:56 PM, Luca Quarello 
wrote:

> Hi Erick,
> I used solr5.3.1 and I sincerely expected response times with replica
> configuration near to response times without replica configuration.
>
> Do you agree with me?
>
> I read here
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html 
> that
> "Queries do not need to be routed to leaders; they can be handled by any
> replica in a shard. Leaders are only needed for handling update requests.
>  "
>
> I haven't found this behaviour. In my case CONF2 and CONF3 have all replicas
> on VM2 but analyzing core utilization during a request is 100% on both
> machines. Why?
>
> Best,
> Luca
>
> On Tue, Jan 5, 2016 at 5:08 PM, Erick Erickson 
> wrote:
>
>> What version of Solr? Prior to 5.2 the replicas were doing lots of
>> unnecessary work/being blocked, see:
>>
>> https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/
>>
>> Best,
>> Erick
>>
>> On Tue, Jan 5, 2016 at 6:09 AM, Matteo Grolla 
>> wrote:
>> > Hi Luca,
>> >   not sure if I understood well. Your question is
>> > "Why are index times on a solr cloud collecton with 2 replicas higher
>> than
>> > on solr cloud with 1 replica" right?
>> > Well with 2 replicas all docs have to be deparately indexed in 2 places
>> and
>> > solr has to confirm that both indexing went well.
>> > Indexing times are lower on a solrcloud collection with 2 shards (just
>> one
>> > replica, the leader, per shard) because docs are indexed just once and
>> the
>> > load is spread on 2 servers instead of one
>> >
>> > 2015-12-30 2:03 GMT+01:00 Luca Quarello :
>> >
>> >> Hi,
>> >>
>> >> I have a 260M-document index (90GB) with this structure:
>> >>
>> >>
>> >> > >> multiValued="false" termVectors="false" termPositions="false"
>> >> termOffsets="false" />
>> >>
>> >>   > >> multiValued="false"/>
>> >>
>> >>   > >> stored="true" multiValued="false"/>
>> >>
>> >>   > >> multiValued="false"/>
>> >>
>> >>   > stored="true"
>> >> multiValued="false"/>
>> >>
>> >>   > >> multiValued="false"/>
>> >>
>> >>   > >> multiValued="false"/>
>> >>
>> >>   > >> multiValued="false"/>
>> >>
>> >>
>> >>
>> >>   > >> multiValued="true"/>
>> >>
>> >>   
>> >>
>> >>   
>> >>
>> >>   
>> >>
>> >>   
>> >>
>> >>   
>> >>
>> >>   
>> >>
>> >>   
>> >>
>> >>
>> >> where the fragment field contains XML messages.
>> >>
>> >> There is a search function that provides the messages satisfying a
>> >> search criterion.
>> >>
>> >>
>> >> TARGET:
>> >>
>> >> To find the best configuration to optimize the response time of a two
>> solr
>> >> instances cloud with 2 VM with 8 core and 32 GB
>> >>
>> >>
>> >> TEST RESULTS:
>> >>
>> >>
>> >> 1. Configurations:
>> >>    - the better configuration without replicas:
>> >>      CONF1: 16 shards of 17M documents (8 per VM)
>> >>    - configurations with replica:
>> >>      CONF2: 8 shards of 35M documents with replication factor of 1
>> >>      CONF3: 16 shards of 35M documents with replication factor of 1
>> >>
>> >> 2. Executed tests:
>> >>    - sequential requests
>> >>    - 5 parallel requests
>> >>    - 10 parallel requests
>> >>    - 20 parallel requests
>> >>    in two scenarios: during an indexing phase and not
>> >>
>> >> Calls are: http://localhost:8983/solr/sepa/select?
>> >> q=+fragment%3A*AAA*+&fq=marked%3AT&fq=-fragmentContentType
>> >> %3ABULK&start=0&rows=100&sort=creationTimestamp+desc%2Cid+asc
>> >>
>> >> 3. Test results
>> >>
>> >> All the tests have shown an I/O utilization of 100MB/s during
>> >> loading of data into the disk cache, disk cache utilization of 20GB and
>> >> core utilization of 100% (all 8 cores)
>> >>
>> >>    - No indexing (time average and maximum time)
>> >>      CONF1: sequential 4,1 / 6,9; 5 parallel 15,6 / 19,1;
>> >>             10 parallel 23,6 / 30,2; 20 parallel 48 / 52,2
>> >>      CONF2

Re: Solr UIMA Custom Annotator PEAR file installation on Linux

2016-01-08 Thread techqnq
Yes, I want to use a PEAR file to provide my custom annotator for the Solr
UIMA UpdateProcessor.

Basically, I have written a custom annotator to capture a certain type of
data from "content" and copy it over to another Solr field. I generated the
PEAR file using the Eclipse UIMA plugins. All well till now. Now I want to
use this PEAR file on my Solr server to provide this annotator for the SOLR
UIMA UpdateProcessor.





Re: Manage schema.xml via Solrj?

2016-01-08 Thread GW
Bob,

Not sure why you would want to do this. You can set up Solr to guess the
schema. It creates a file called managed-schema as an override. This is
the case with 5.3; I came across it by accident setting it up the first time
and I was a little annoyed, but it made for a quick setup. Your programming
would still need to recognise the new doc structure and use that new document
structure. The only problem is it's a bit generic in the guesswork, and I
did not spend much time testing it out, so I am not really versed in
operating it. I got myself back to schema.xml ASAP. My thoughts are you are
looking at a lot of work for little gain.

Best,

GW



On 7 January 2016 at 21:36, Bob Lawson  wrote:

> I want to programmatically make changes to schema.xml using java to do
> it.  Should I use Solrj to do this or is there a better way?  Can I use
> Solrj to make the rest calls that make up the schema API?  Whatever the
> answer, can anyone point me to an example showing how to do it?  Thanks!
>
>


Re: Manage schema.xml via Solrj?

2016-01-08 Thread Erick Erickson
First, Daniel nailed the XY problem, but this isn't that...

You're correct that hand-editing the schema file is error-prone.
The managed schema API is your friend here. There are
several commercial front-ends that already do this.

The managed schema API is all just HTTP, so there's nothing
precluding a Java program from interpreting a form and sending
off the proper HTTP requests to modify the schema.

The SolrJ client library has some sugar around this, there's no
reason you can't use that as it's just a jar (and a dependency on
a logging jar).
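
A minimal sketch of that SolrJ route (this assumes a 5.x SolrJ with the
schema request classes on the classpath; the URL and field attributes are
placeholders):

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;
import org.apache.solr.client.solrj.response.schema.SchemaResponse;

public class AddFieldExample {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
    // the attributes map mirrors the JSON body of the add-field command
    Map<String, Object> field = new LinkedHashMap<>();
    field.put("name", "price");
    field.put("type", "tfloat");
    field.put("stored", true);
    SchemaResponse.UpdateResponse rsp = new SchemaRequest.AddField(field).process(client);
    System.out.println("add-field status: " + rsp.getStatus());
    client.close();
  }
}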

For SolrCloud it's a little different. You need to make sure your
changes get to Zookeeper, which the schema API will handle
for you.

One thing that's a bit confusing is "managed schema" and
"schemaless". They both use the same underlying mechanism
to modify the schema.xml file. With "managed schema" you do
what you're talking about, have some process where you make
specific modifications with the schema API to a controlled
schema file.

"schemaless" automatically tries to guess what the schema
_should_ be and uses the managed schema API to implement
those guesses.

GW:
Schema guessing is a great way to get things started, but virtually
every organization I work with takes explicit control of the schema.
They do this for three reasons:
1> the assumptions in managed schema create indexes that can be
made much smaller by judicious options on the fields.
2> the search cases require careful analysis chains.
3> the guesses are wrong. I.e., if the first number encountered in a
field is, say, 3, the guessing says "Oh, this is an int field". The
next doc has 3.4... you'll get a parsing error and fail to index the doc.


Best,
Erick

On Fri, Jan 8, 2016 at 7:38 AM, GW  wrote:
> Bob,
>
> Not sure why you would want to do this. You can set up Solr to guess the
> schema. It creates a file called manage_schema.xml for an override. This is
> the case with 5.3 I came across it by accident setting it up the first time
> and I was a little annoyed but it made for a quick setup. Your programming
> would still need to realise the new doc structure and use that new document
> structure. The only problem is it's a bit generic in the guess work and I
> did not spend much time testing it out so I am not really versed in
> operating it. I got myself mack to schema.xml ASAP. My thoughts are you are
> looking at a lot of work for little gain.
>
> Best,
>
> GW
>
>
>
> On 7 January 2016 at 21:36, Bob Lawson  wrote:
>
>> I want to programmatically make changes to schema.xml using java to do
>> it.  Should I use Solrj to do this or is there a better way?  Can I use
>> Solrj to make the rest calls that make up the schema API?  Whatever the
>> answer, can anyone point me to an example showing how to do it?  Thanks!
>>
>>


Re: solrcloud - How to delete a doc at a specific shard

2016-01-08 Thread Erick Erickson
This simply shouldn't be the case if by "duplicate" you mean it has
the same id (i.e. the field defined as the uniqueKey in schema.xml).
If you do have docs in different shards with the same ID, then
something is very strange about your setup.

What version of Solr BTW?

Assuming you mean "same content but different IDs" then you can delete
by ID either through SolrJ or on the URL
.../collection/update?commit=true&stream.body=<delete><id>idhere</id></delete>
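
The SolrJ route is roughly this (a sketch; the zk address and collection name
are placeholders). Note it routes by the uniqueKey, so with compositeId the
delete goes to whichever shard the hash says owns the doc:

import org.apache.solr.client.solrj.impl.CloudSolrClient;

try (CloudSolrClient client = new CloudSolrClient("zk1.example.com:2181")) {
  client.setDefaultCollection("collection");
  client.deleteById("idhere");
  client.commit();
}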

Best,
Erick

On Fri, Jan 8, 2016 at 12:52 AM, elvis鱼人  wrote:
> My SolrCloud has 3 shards and 2 replicas, and one shard's docs are
> duplicated. The document router is compositeId.
> Who can help me?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solrcloud-How-to-delete-a-doc-at-a-specific-shard-tp4249354.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Kerberos ticket not renewing when storing index on Kerberized HDFS

2016-01-08 Thread Andrew Bumstead
Hello,

I have Solr Cloud configured to store its index files on a Kerberized HDFS
(I followed documentation at
https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS), and
have been able to index some documents with the files being written to the
HDFS as expected. However, it appears that some time after starting, Solr
is unable to connect to HDFS as it no longer has a valid Kerberos TGT. The
time-frame of this occurring is consistent with my default Kerberos ticket
lifetime of 24 hours, so it appears as though Solr is not renewing its
Kerberos ticket upon expiry. A restart of Solr resolves the issue again for
24 hours.

Is there any configuration I can add to make Solr automatically renew its
ticket, or is this an issue with Solr?
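
For reference, the Running-Solr-on-HDFS page linked above documents keytab
settings for the HdfsDirectoryFactory; when Solr logs in from a keytab
rather than a kinit'd ticket cache, the Hadoop client can re-login on its
own after the TGT expires. A sketch of the relevant solrconfig.xml block,
with placeholder paths and principal:

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://sandbox.hortonworks.com:8020/solr</str>
  <bool name="solr.hdfs.security.kerberos.enabled">true</bool>
  <str name="solr.hdfs.security.kerberos.keytabfile">/etc/security/keytabs/solr.service.keytab</str>
  <str name="solr.hdfs.security.kerberos.principal">solr/sandbox.hortonworks.com@EXAMPLE.COM</str>
</directoryFactory>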

The following is the stack trace I am getting in Solr.

java.io.IOException: Failed on local exception: java.io.IOException:
Couldn't setup connection for solr/sandbox.hortonworks@hortonworks.com
to sandbox.hortonworks.com/10.0.2.15:8020; Host Details : local host is: "
sandbox.hortonworks.com/10.0.2.15"; destination host is: "
sandbox.hortonworks.com":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy10.renewLease(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy11.renewLease(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:879)
at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
at
org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Couldn't setup connection for solr/
sandbox.hortonworks@hortonworks.com to
sandbox.hortonworks.com/10.0.2.15:8020
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:672)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at
org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643)
at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730)
at
org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
at org.apache.hadoop.ipc.Client.call(Client.java:1438)
... 16 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused
by GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]
at
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at
org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
at
org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:553)
at
org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:368)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:722)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:718)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717)
... 19 more
Caused by: GSSException: No valid credentials provided (Mechanism level:
Failed to find any Kerberos tgt)
at
sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at
sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
at
sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at
sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
at
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at

Re: date difference faceting

2016-01-08 Thread Erick Erickson
I'm going to side-step your primary question and say that it's nearly
always best to do your calculations up-front during indexing to make
queries more efficient and thus serve more requests on the same
hardware. This assumes that the stat you're interested in is
predictable of course...
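
For example, a quick SolrJ sketch of computing the interval once at index
time (d_a/d_b match the thread; the interval_secs field is an assumed
addition to the schema):

import java.util.Date;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexIntervalExample {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zkhost:2181");
        client.setDefaultCollection("collection");

        Date dA = new Date(1452211200000L); // example event A
        Date dB = new Date(1452470400000L); // example event B

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("d_a", dA);
        doc.addField("d_b", dB);
        // Store the difference in seconds; range facets then become trivial,
        // e.g. facet.query=interval_secs:[0 TO 86400] for "within a day".
        doc.addField("interval_secs", (dB.getTime() - dA.getTime()) / 1000L);

        client.add(doc);
        client.commit();
        client.close();
    }
}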

Best,
Erick

On Fri, Jan 8, 2016 at 2:23 AM, David Santamauro
 wrote:
>
> Hi,
>
> I have two date fields, d_a and d_b, both of type solr.TrieDateField, that
> represent different events associated with a particular document. The
> interval between these dates is relevant for corner-case statistics. The
> interval is calculated as the difference: sub(d_b,d_a) and I've been able to
>
>   stats=true&stats.field={!func}sub(d_b,d_a)
>
> What I ultimately would like to report is the interval represented as a
> range, which could be seen as facet.query
>
> (pseudo code)
>   facet.query=sub(d_b,d_a)[ * TO 8640 ] // day
>   facet.query=sub(d_b,d_a)[ 8641 TO 60480 ] // week
>   facet.query=sub(d_b,d_a)[ 60481 TO 259200 ] // month
> etc.
>
> Aside from actually indexing the difference in a separate field, is there
> something obvious I'm missing? I'm on SOLR 5.2 in cloud mode.
>
> thanks
> David


Re: SOLR replicas performance

2016-01-08 Thread Shawn Heisey
On 1/8/2016 7:55 AM, Luca Quarello wrote:
> I used Solr 5.3.1 and honestly expected response times with the replica
> configuration to be close to the response times without replicas.
> 
> Do you agree with me?
> 
> I read here
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html
> that "Queries do not need to be routed to leaders; they can be handled by
> any replica in a shard. Leaders are only needed for handling update
> requests. "
> 
> I haven't observed this behaviour. In my case CONF2 and CONF3 have all
> replicas on VM2, yet analyzing core utilization during a request shows
> 100% on both machines. Why?

Indexing is a little bit slower with replication -- the update must
happen on all replicas.

If your index is sharded (which I believe you did indicate in your
initial message), you may find that all replicas get used even for
queries.  It is entirely possible that some of the shard subqueries will
be processed on one replica and some of them will be processed on other
replicas.  I do not know if this commonly happens, but I would not be
surprised if it does.  If the machines are sized appropriately for the
index, this separation should speed up queries, because you have the
resources of multiple machines handling one query.

That phrase "sized appropriately" is very important.  Your initial
message indicated that you have a 90GB index, and that you are running
in virtual machines.  Typically VMs have fairly small memory sizes.  It
is very possible that you simply don't have enough memory in the VM for
good performance with an index that large.  With 90GB of index data on
one machine, I would hope for at least 64GB of RAM, and I would prefer
to have 128GB.  If there is more than 90GB of data on one machine, then
even more memory would be needed.

Thanks,
Shawn



Re: Solr search and index rate optimization

2016-01-08 Thread Erick Erickson
Here's a longer form of Toke's answer:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

BTW, on the surface, having 5 ZK nodes isn't doing you any real good.
ZooKeeper isn't really involved in serving queries or handling
updates; its purpose is to hold the state of the cluster (nodes up,
recovering, down, etc.) and to notify Solr listeners when that state
changes. There's no good reason to have 5 for a small cluster, and by
"small" I mean fewer than 100s of nodes.

Best,
Erick

On Fri, Jan 8, 2016 at 2:40 AM, Toke Eskildsen  wrote:
> On Fri, 2016-01-08 at 10:55 +0500, Zap Org wrote:
>> I wanted to ask: I need to index every 15 min with a hard commit
>> (real-time records) and currently have 5 ZooKeeper instances and 2 Solr
>> instances on one machine serving 200 users with 32GB RAM. I want to
>> serve more than 10,000 users, so what should my machine specs be, and
>> what should my architecture be for that serve rate along with the index
>> rate?
>
> It depends on your system and if we were forced to guess, our guess
> would be very loose.
>
>
> Fortunately you do have a running system with real queries: Make a copy
> on two similar machines (you will probably need more hardware anyway)
> and simulate growing traffic, measuring response times at appropriate
> points: 200 users, 500, 1000, 2000 etc.
>
> If you are very lucky, your current system scales all the way. If not,
> you should have enough data to make an educated guess at the number of
> machines you need. You should have at least 3 measuring points to
> extrapolate from, as scaling is not always linear.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
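
To make Toke's ramp-up concrete, a minimal SolrJ sketch that fires a fixed
query from N concurrent users and reports mean latency (URL, collection,
and query are placeholders; run it with N = 200, 500, 1000, ...):

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class LoadTestSketch {
    public static void main(String[] args) throws Exception {
        final int users = Integer.parseInt(args[0]);
        final int queriesPerUser = 50;
        // HttpSolrClient is thread-safe, so one instance is shared.
        final HttpSolrClient client =
                new HttpSolrClient("http://localhost:8983/solr/mycollection");
        final ConcurrentLinkedQueue<Long> latencies =
                new ConcurrentLinkedQueue<Long>();
        final CountDownLatch done = new CountDownLatch(users);
        ExecutorService pool = Executors.newFixedThreadPool(users);

        for (int u = 0; u < users; u++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        for (int q = 0; q < queriesPerUser; q++) {
                            long t0 = System.nanoTime();
                            client.query(new SolrQuery("*:*").setRows(10));
                            latencies.add((System.nanoTime() - t0) / 1000000L);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    } finally {
                        done.countDown();
                    }
                }
            });
        }
        done.await();
        pool.shutdown();

        long sum = 0;
        for (long l : latencies) sum += l;
        System.out.println("users=" + users + " mean latency="
                + (sum / Math.max(1, latencies.size())) + " ms");
        client.close();
    }
}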