Re: Getting to grips with auto-scaling

2020-06-09 Thread Tom Evans
Hi Radu

Thanks for the reply - I'm starting to look that way myself, creating
a different collection for each set of data; that way I can more easily
control the scaling of each collection, eg increase the replication
factor on those that will be queried more. I was looking at Category
Routed Alias, but that seems to have quite a few gotchas:

* Can't restrict the collections queried - even specifying the exact
collection to query, eg "collections=items__CRA__2020" (which exists),
returns no results. Even querying the underlying collection directly and
specifying its name returns no results. I only get results with
collections=items__CRA - it's as if the underlying collection thinks its
name really is "items__CRA" rather than "items__CRA__2020".
* Some problems with indexing to a new category - I get errors the
first time a category is encountered.

Looks like it might have to be manually set up and managed collections
and aliases for now.
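
For the record, by manual setup I just mean plain collections API calls,
something like this (collection names, configset and replica counts are
illustrative only, untested):

# one collection per year, replica counts chosen per collection
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=items_2020&collection.configName=items&numShards=1&tlogReplicas=4&nrtReplicas=0'
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=items_2017&collection.configName=items&numShards=1&tlogReplicas=2&nrtReplicas=0'

# one alias covering all years for "search everything" queries
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=items&collections=items_2020,items_2019,items_2018,items_2017'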

Cheers

Tom

On Mon, Jun 8, 2020 at 12:43 PM Radu Gheorghe
 wrote:
>
> Hi Tom,
>
> To your last two questions, I'd like to venture an alternative design: have
> dedicated "hot" and "warm" nodes. That is, 2020 + lists will go to the hot
> tier, and 2019, 2018, 2017 + lists go to the warm tier.
>
> Then you can scale the hot tier based on your query load. For the warm
> tier, I assume there will be less need for scaling, and if it is, I guess
> it's less important for shards of each index to be perfectly balanced (so a
> simple "make sure cores are evenly distributed" should be enough).
>
> Granted, this design isn't as flexible as the one you suggested, but it's
> simpler. So simple that I've seen it done without autoscaling (just a few
> scripts from when you add nodes in each tier).
>
> Best regards,
> Radu
>
> https://sematext.com
>
> Fri, 5 Jun 2020, 21:59 Tom Evans  wrote:
>
> > Hi
> >
> > I'm trying to get a handle on the newer auto-scaling features in Solr.
> > We're in the process of upgrading an older SolrCloud cluster from 5.5
> > to 8.5, and re-architect it slightly to improve performance and
> > automate operations.
> >
> > If I boil it down slightly, currently we have two collections, "items"
> > and "lists". Both collections have just one shard. We publish new data
> > to "items" once each day, and our users search and do analysis on
> > them, whilst "lists" contains NRT user-specified collections of ids
> > from items, which we join to from "items" in order to allow them to
> > restrict their searches/analysis to just docs in their curated lists.
> >
> > Most of our searches have specific date ranges in them, usually only
> > from the last 3 years or so, but sometimes we need to do searches
> > across all the data. With the new setup, we want to:
> >
> > * shard by date (year) to make the hottest data available in smaller shards
> > * have more nodes with these shards than we do of the older data.
> > * be able to add/remove nodes predictably based upon our clients
> > (predictable) query load
> > * use TLOG for "items" and NRT for "lists", to avoid unnecessary
> > indexing load for "items" and have NRT for "lists".
> > * spread cores across two AZ
> >
> > With that in mind, I came up with a bunch of simplified rules for
> > testing, with just 4 shards for "items":
> >
> > * "lists" collection has one NRT replica on each node
> > * "items" collection shard 2020 has one TLOG replica on each node
> > * "items" collection shard 2019 has one TLOG replica on 75% of nodes
> > * "items" collection shards 2018 and 2017 each have one TLOG replica
> > on 50% of nodes
> > * all shards have at least 2 replicas if number of nodes > 1
> > * no node should have 2 replicas of the same shard
> > * number of cores should be balanced across nodes
> >
> > Eg, with 1 node, I want to see this topology:
> > A: items: 2020, 2019, 2018, 2017 + lists
> >
> > with 2 nodes:
> > A: items: 2020, 2019, 2018, 2017 + lists
> > B: items: 2020, 2019, 2018, 2017 + lists
> >
> > and if I add two more nodes:
> > A: items: 2020, 2019, 2018 + lists
> > B: items: 2020, 2019, 2017 + lists
> > C: items: 2020, 2019, 2017 + lists
> > D: items: 2020, 2018 + lists
> >
> > To the questions:
> >
> > * The type of replica created when nodeAdded is triggered can't be set
> > per collection. Either everything gets NRT or everything gets TLOG.
> > Even if I specify nrtReplicas=0 when creating a collection, nodeAdded
> > will add NRT replicas if configured that way.

Indexing error when using Category Routed Alias

2020-06-09 Thread Tom Evans
Hi all

1. Set up a simple one-node SolrCloud test setup using docker-compose,
solr:8.5.2, zookeeper:3.5.8.
2. Upload a configset
3. Create two collections, one standard collection, one CRA, both
using the same configset

legacy:
action=CREATE=products_old=products=true=1=-1

CRA:

{
  "create-alias": {
"name": "products_20200609",
"router": {
  "name": "category",
  "field": "date_published.year",
  "maxCardinality": 30,
  "mustMatch": "(199[6-9]|20[0,1,2][0-9])"
},
"create-collection": {
  "config": "products",
  "numShards": 1,
  "nrtReplicas": 1,
  "tlogReplicas": 0,
  "maxShardsPerNode": 1,
  "autoAddReplicas": true
}
  }
}

Post a small selection of docs in JSON format using curl to non-CRA
collection -> OK

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' 
> -d@/resources/product-json/products-12381742.json 
> http://solr:8983/solr/products_old/update/json/docs
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.6M  100    71  100 11.6M      5   950k  0:00:14  0:00:12  0:00:02  687k
{
  "responseHeader":{
"rf":1,
"status":0,
"QTime":12541}}

The same documents, sent to the CRA -> boom

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' 
> -d@/resources/product-json/products-12381742.json 
> http://solr:8983/solr/products_20200609/update/json/docs
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.6M  100   888  100 11.6M    366  4913k  0:00:02  0:00:02 --:--:-- 4914k
{
  "responseHeader":{
"status":400,
"QTime":2422},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.common.SolrException",
  
"error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException",
  
"root-error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException"],
"msg":"Async exception during distributed update: Error from
server at 
http://10.20.36.130:8983/solr/products_20200609__CRA__2005_shard1_replica_n1/:
null\n\n\n\nrequest:
http://10.20.36.130:8983/solr/products_20200609__CRA__2005_shard1_replica_n1/\nRemote
error message: Cannot parse provided JSON: JSON Parse Error:
char=\u0002,position=0 AFTER='\u0002'
BEFORE='2update.contentType0applicat'",
"code":400}}

Repeating the request again to the CRA -> OK

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' 
> -d@/resources/product-json/products-12381742.json 
> http://solr:8983/solr/products_20200609/update/json/docs
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.6M  100    71  100 11.6M      6  1041k  0:00:11  0:00:11 --:--:--  706k
{
  "responseHeader":{
"rf":1,
"status":0,
"QTime":11446}}

It seems to be related to when a new collection needs to be created by
the CRA.
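
Until I work out what's going on, I'll probably just retry the batch once
when it fails - a rough, untested sketch using the same file and collection
as above:

post_batch() {
  curl -sf -H 'Content-Type: application/json' \
    -d @/resources/product-json/products-12381742.json \
    http://solr:8983/solr/products_20200609/update/json/docs
}
# -f makes curl fail on the 400, so we sleep and try once more
post_batch || { sleep 5; post_batch; }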

The relevant logs:

2020-06-09 02:12:56.107 INFO
(OverseerThreadFactory-9-thread-3-processing-n:10.20.36.130:8983_solr)
[   ] o.a.s.c.a.c.CreateCollectionCmd Create collection
products_20200609__CRA__2005
2020-06-09 02:12:56.232 INFO
(OverseerStateUpdate-72169202568593409-10.20.36.130:8983_solr-n_00)[
  ] o.a.s.c.o.SliceMutator createReplica() {
  "operation":"ADDREPLICA",
  "collection":"products_20200609__CRA__2005",
  "shard":"shard1",
  "core":"products_20200609__CRA__2005_shard1_replica_n1",
  "state":"down",
  "base_url":"http://10.20.36.130:8983/solr;,
  "node_name":"10.20.36.130:8983_solr",
  "type":"NRT",
  "waitForFinalState":"false"}
2020-06-09 02:12:56.444 INFO  (qtp90045638-25) [
x:products_20200609__CRA__2005_shard1_replica_n1]
o.a.s.h.a.CoreAdminOperation core create command
qt=/admin/cores=core_node2=products=true=products_20200609__CRA__2005_shard1_replica_n1=CREATE=1=products_20200609__CRA__2005=shard1=javabin=2=NRT
2020-06-09 02:12:56.476 INFO  (qtp90045638-25)
[c:products_20200609__CRA__2005 s:shard1 r:core_node2
x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.c.SolrConfig
Using Lucene MatchVersion: 8.5.1
2020-06-09 02:12:56.512 INFO  (qtp90045638-25)
[c:products_20200609__CRA__2005 s:shard1 r:core_node2
x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.s.IndexSchema
[products_20200609__CRA__2005_shard1_replica_n1] Schema name=variants
2020-06-09 02:12:56.543 INFO  (qtp90045638-25)
[c:products_20200609__CRA__2005 s:shard1 r:core_node2
x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.r.RestManager
Registered ManagedResource impl
org.apache.solr.rest.schema.analysis.ManagedSynonymFilterFactory$SynonymManager
for path /schema/analysis/synonyms/default
2020-06-09 02:12:56.543 INFO  

Getting to grips with auto-scaling

2020-06-05 Thread Tom Evans
Hi

I'm trying to get a handle on the newer auto-scaling features in Solr.
We're in the process of upgrading an older SolrCloud cluster from 5.5
to 8.5, and re-architect it slightly to improve performance and
automate operations.

If I boil it down slightly, currently we have two collections, "items"
and "lists". Both collections have just one shard. We publish new data
to "items" once each day, and our users search and do analysis on
them, whilst "lists" contains NRT user-specified collections of ids
from items, which we join to from "items" in order to allow them to
restrict their searches/analysis to just docs in their curated lists.

Most of our searches have specific date ranges in them, usually only
from the last 3 years or so, but sometimes we need to do searches
across all the data. With the new setup, we want to:

* shard by date (year) to make the hottest data available in smaller shards
* have more nodes with these shards than we do of the older data.
* be able to add/remove nodes predictably based upon our clients
(predictable) query load
* use TLOG for "items" and NRT for "lists", to avoid unnecessary
indexing load for "items" and have NRT for "lists".
* spread cores across two AZ

With that in mind, I came up with a bunch of simplified rules for
testing, with just 4 shards for "items":

* "lists" collection has one NRT replica on each node
* "items" collection shard 2020 has one TLOG replica on each node
* "items" collection shard 2019 has one TLOG replica on 75% of nodes
* "items" collection shards 2018 and 2017 each have one TLOG replica
on 50% of nodes
* all shards have at least 2 replicas if number of nodes > 1
* no node should have 2 replicas of the same shard
* number of cores should be balanced across nodes

Eg, with 1 node, I want to see this topology:
A: items: 2020, 2019, 2018, 2017 + lists

with 2 nodes:
A: items: 2020, 2019, 2018, 2017 + lists
B: items: 2020, 2019, 2018, 2017 + lists

and if I add two more nodes:
A: items: 2020, 2019, 2018 + lists
B: items: 2020, 2019, 2017 + lists
C: items: 2020, 2019, 2017 + lists
D: items: 2020, 2018 + lists

To the questions:

* The type of replica created when nodeAdded is triggered can't be set
per collection. Either everything gets NRT or everything gets TLOG.
Even if I specify nrtReplicas=0 when creating a collection, nodeAdded
will add NRT replicas if configured that way.
* I'm having difficulty expressing these rules in terms of a policy -
I can't seem to figure out a way to specify the number of replicas for
a shard based upon the total number of nodes (a rough sketch of the
kind of policy I've been trying is below, after these questions).
* Is this beyond the current scope of autoscaling triggers/policies?
Should I instead use the trigger with a custom plugin action (or to
trigger a web hook) to be a bit more intelligent?
* Am I wasting my time trying to ensure there are more replicas of the
hotter shards than the colder shards? It seems to add a lot of
complexity - should I instead just accept that the cold shards aren't
getting queried much, so they won't be using up cache space that the
hot shards will be using? Disk space is pretty cheap after all (total
size for "items" + "lists" is under 60GB).

Cheers

Tom


Re: Provide suggestion on indexing performance

2017-09-13 Thread Tom Evans
On Tue, Sep 12, 2017 at 4:06 AM, Aman Tandon  wrote:
> Hi,
>
> We want to know about the indexing performance in the below mentioned
> scenarios; consider a total of 10 string fields and a total of 10 million
> documents.
>
> 1) indexed=true, stored=true
> 2) indexed=true, docValues=true
>
> Which one should we prefer in terms of indexing performance, please share
> your experience.
>
> With regards,
> Aman Tandon

Your question doesn't make much sense. You turn on stored when you
need to retrieve the original contents of the fields after searching,
and you use docvalues to speed up faceting, sorting and grouping.
Using docvalues to retrieve values during search is more expensive
than simply using stored values, so if your primary aim is retrieving
stored values, use stored=true.

Secondly, the only way to answer performance questions for your schema
and data is to try it out. Generate 10 million docs, store them in a
file (eg as CSV), and then use the post tool to try different schema
and query options.
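
For example, with one collection using stored=true and another using
docValues=true for the fields in question (collection and file names
here are just placeholders):

time bin/post -c perftest_stored docs.csv
time bin/post -c perftest_docvalues docs.csv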

Cheers

Tom


Re: Solr returning same object in different page

2017-09-13 Thread Tom Evans
On Tue, Sep 12, 2017 at 7:42 PM, ruby  wrote:
> I'm running into an issue where an object is appearing twice when we are
> paging. My query gives documents a boost based on field values. The first
> query returns 50 objects. The second query is exactly the same as the first,
> except it gets the next 50 objects. We are noticing that a few objects which
> were returned before are being returned again in the second page. Is this a
> known issue with Solr?

Are you using paging (page=N) or deep paging (cursorMark=*)? Do you
have a deterministic sort order (IE, not simply by score)?
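
If it helps, a deep-paging request looks something like this (the
collection name and "id" as the uniqueKey field are assumptions on my
part):

curl 'http://localhost:8983/solr/yourcollection/select?q=*:*&rows=50&sort=score+desc,id+asc&cursorMark=*'
# then pass the returned nextCursorMark value as cursorMark for the next page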

Cheers

Tom


Re: Get results in multiple orders (multiple boosts)

2017-08-18 Thread Tom Evans
On Fri, Aug 18, 2017 at 8:21 AM, Luca Dall'Osto
 wrote:
>
> Yes, of course, and excuse me for the misunderstanding.
>
>
> In my scenario I have to display a list with hundreds of documents.
> A user can show these documents in a particular order; this order is decided
> by the user in a settings view.
>
>
> Order levels are for example:
> 1) Order by category, as most important.
> 2) Order by source, as second level.
> 3) Order by date (ascending or descending).
> 4) Order by title (ascending or descending).
>
>
> For category order, in the settings view, the user has a box with a list of
> all the categories available to him/her.
> The user drags elements of the list to set them in their favorite order.
> Same thing for sources.
>

Solr can only sort by indexed fields; it needs to be able to compare
one document to another document, and the only information available
at that point are the indexed fields.

This would be untenable in your scenario, because you cannot add a
per-user category sort-order field to every document for every user.

If this custom sorting is a hard requirement, the only feasible
solution I see is to write a custom sorting plugin, that provides a
function that you can sort on. This blog post describes how this can
be achieved:

https://medium.com/culture-wavelabs/sorting-based-on-a-custom-function-in-solr-c94ddae99a12

I would imagine that you would need one sort function, maybe called
usersortorder(), to which you would provide the user's preferred sort
ordering (which you would retrieve from wherever you store such
information) and the field that you want sorted. It would look
something like this:

usersortorder("category_id", "3,5,1,7,2,12,14,58") DESC,
usersortorder("source_id", "5,2,1,4,3") DESC, date DESC, title DESC

Cheers

Tom


Re: setup solrcloud from scratch vie web-ui

2017-05-17 Thread Tom Evans
On Wed, May 17, 2017 at 6:28 AM, Thomas Porschberg
 wrote:
> Hi,
>
> I did not manipulating the data dir. What I did was:
>
> 1. Downloaded solr-6.5.1.zip
> 2. ensured no solr process is running
> 3. unzipped solr-6.5.1.zip to ~/solr_new2/solr-6.5.1
> 3. started an external zookeeper
> 4. copied a conf directory from a working non-cloudsolr (6.5.1) to
>~/solr_new2/solr-6.5.1 so that I have ~/solr_new2/solr-6.5.1/conf
>   (see http://randspringer.de/solrcloud_test/my.zip for content)

..in which you've manipulated the dataDir! :)

The problem (I think) is that you have set a fixed data dir, and when
Solr attempts to create a second core (for whatever reason, in your
case it looks like you are adding a shard), Solr puts it exactly where
you have told it to, in the same directory as the previous one. It
finds the lock and blows up, because each core needs to be in a
separate directory, but you've instructed Solr to put them in the same
one.

Start with the solrconfig from the basic_configs configset that ships
with Solr and add the special things that your installation needs. I am
not massively surprised that your non-cloud config does not work in
cloud mode. When we moved to SolrCloud, we rewrote solrconfig.xml and
schema.xml from scratch, starting from basic_configs, adding anything
particular that we needed from our old config, checking every
difference we had from the stock config and noting why, and ensuring
that our field types use the same names for the same types as
basic_configs wherever possible.

I only say all that because to fix this issue is a single thing, but
you should spend the time comparing configs because this will not be
the only issue. Anyway, to fix this problem, in your solrconfig.xml
you have:

  <dataDir>data</dataDir>

It should be

  <dataDir>${solr.data.dir:}</dataDir>

Which is still in your config, you've just got it commented out :)

Cheers

Tom


Re: to handle expired documents: collection alias or delete by id query

2017-03-24 Thread Tom Evans
On Thu, Mar 23, 2017 at 6:10 AM, Derek Poh  wrote:
> Hi
>
> I have collections of products. I am doing indexing 3-4 times daily.
> Every day there are products that expire and I need to remove them from
> these collections daily.
>
> I can think of 2 ways to do this.
> 1. using a collection alias to switch between a main and temp collection.
> - clear and index the temp collection
> - create alias to temp collection.
> - clear and index the main collection.
> - create alias to main collection.
>
> this way requires additional collections.
>

Another way of doing this is to have a moving alias (not constantly
clearing the "temp" collection). If you reindex daily, your index
would be called "products_yyyymmdd" with an alias to "products". The
advantage of this is that you can roll back to a previous version of
the index if there are problems, and each index is guaranteed to be
freshly created with no artifacts.

The biggest consideration for me would be how long indexing your full
corpus takes you. If you can do it in a small period of time, then
full indexes would be preferable. If it takes a very long time,
deleting is preferable.

If you are doing a cloud setup, full indexes are even more appealing.
You can create the new collection on a single node (even if sharded;
just place each shard on the same node). This would only place the
indexing cost on that one node, whilst other nodes would be unaffected
by indexing degrading regular query response time. You also don't have
to distribute the documents around the cluster. There is no
distributed indexing in Solr, each replica has to index each document
again, even if it is not the leader.

Once indexing is complete, you can expand the collection by adding
replicas of that shard on other nodes - perhaps even removing it from
the node that did the indexing. We have a node that solely does
indexing, before the collection is queried for anything it is added to
the querying nodes.

You can do this manually, or you can automate it using the collections API.
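
For example, something along these lines (node, collection and replica
names are illustrative, untested):

# add a replica of the freshly indexed shard to a query node...
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=products_20170324&shard=shard1&node=querynode1:8983_solr'
# ...then drop it from the indexing node once the new replica is active
curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=products_20170324&shard=shard1&replica=core_node1'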

Cheers

Tom


Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Tom Evans
Hi Mike

Looks like you are trying to get a list of the distinct item ids in a
result set, ordered by the most frequent item ids?

Can you use collapsing qparser for this instead? Should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Every document with the same item_id would need to be on the same
shard for this to work, and I'm not sure whether you can actually get
the count of collapsed documents, if that is necessary for you.
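
Roughly this (untested sketch, reusing your item_id field; the
collection name is a placeholder):

curl 'http://localhost:8983/solr/yourcollection/select?q=*:*&fq={!collapse+field=item_id}&expand=true&expand.rows=5'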


Another option might be to use hyperloglog function - hll() - instead
of unique(), which should give slightly better performance.

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael
 wrote:
> Hi all,
>
> I'm converting my legacy facets to JSON facets and am seeing much better 
> performance, especially with high cardinality facet fields. However, the one 
> issue I can't seem to resolve is excessive memory usage (and OOM errors) when 
> trying to simulate the effect of "group.facet" to sort facets according to a 
> grouping field.
>
> My situation, slightly simplified is:
>
> Solr 4.6.1
>
>   *   Doc set: ~200,000 docs
>   *   Grouping by item_id, an indexed, stored, single value string field with 
> ~50,000 unique values, ~4 docs per item
>   *   Faceting by person_id, an indexed, stored, multi-value string field 
> with ~50,000 values (w/ a very skewed distribution)
>   *   No docValues fields
>
> Each document here is a description of an item, and there are several 
> descriptions per item in multiple languages.
>
> With legacy facets I use group.field=item_id and group.facet=true, which 
> gives me facet counts with the number of items rather than descriptions, and 
> correctly sorted by descending item count.
>
> With JSON facets I'm doing the equivalent like so:
>
> json.facet={
> "people": {
> "type": "terms",
> "field": "person_id",
> "facet": {
> "grouped_count": "unique(item_id)"
> },
> "sort": "grouped_count desc"
> }
> }
>
> This works, and is somewhat faster than legacy faceting, but it also produces 
> a massive spike in memory usage when (and only when) the sort parameter is 
> set to the aggregate field. A server that runs happily with a 512MB heap OOMs 
> unless I give it a 4GB heap. With sort set to (the default) "count desc" 
> there is no memory usage spike.
>
> I would be curious if anyone has experienced this kind of memory usage when 
> sorting JSON facets by stats and if there’s anything I can do to mitigate it. 
> I’ve tried reindexing with docValues enabled on the relevant fields and it 
> seems to make no difference in this respect.
>
> Many thanks,
> ~Mike


Re: Interval Facets with JSON

2017-02-10 Thread Tom Evans
On Wed, Feb 8, 2017 at 11:26 PM, deniz <denizdurmu...@gmail.com> wrote:
> Tom Evans-2 wrote
>> I don't think there is such a thing as an interval JSON facet.
>> Whereabouts in the documentation are you seeing an "interval" as JSON
>> facet type?
>>
>>
>> You want a range facet surely?
>>
>> One thing with range facets is that the gap is fixed size. You can
>> actually do your example however:
>>
> >> json.facet={height_facet:{type:range, gap:20, start:160, end:190,
> >> hardend:true, field:height}}
>>
>> If you do require arbitrary bucket sizes, you will need to do it by
>> specifying query facets instead, I believe.
>>
>> Cheers
>>
>> Tom
>
>
> nothing other than
> https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-IntervalFaceting
> for documentation on intervals...  i am ok with range queries as well but
> intervals would fit better because of different sizes...

That documentation is not for JSON facets though. You can't pick and
choose features from the old facet system and use them in JSON facets
unless they are mentioned in the JSON facet documentation:

https://cwiki.apache.org/confluence/display/solr/JSON+Request+API

and (not official documentation)

http://yonik.com/json-facet-api/

Cheers

Tom


Re: Interval Facets with JSON

2017-02-08 Thread Tom Evans
On Tue, Feb 7, 2017 at 8:54 AM, deniz  wrote:
> Hello,
>
> I am trying to run JSON facets with on interval query as follows:
>
>
> "json.facet":{"height_facet":{"interval":{"field":"height","set":["[160,180]","[180,190]"]}}}
>
> And related field is  stored="true" />
>
> But I keep seeing errors like:
>
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Unknown
> facet or stat. key=height_facet type=interval args={field=height,
> set=[[160,180], [180,190]]} , path=/facet
>

I don't think there is such a thing as an interval JSON facet.
Whereabouts in the documentation are you seeing an "interval" as JSON
facet type?


You want a range facet surely?

One thing with range facets is that the gap is fixed size. You can
actually do your example however:

json.facet={height_facet:{type:range, gap:20, start:160, end:190,
hardend:true, field:height}}

If you do require arbitrary bucket sizes, you will need to do it by
specifying query facets instead, I believe.
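
Something like this, reusing your field and buckets (an untested sketch):

json.facet={
  h160_180: {type:query, q:"height:[160 TO 180]"},
  h180_190: {type:query, q:"height:[180 TO 190]"}
}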

Cheers

Tom


Re: Upgrade SOLR version - facets perfomance regression

2017-01-31 Thread Tom Evans
On Tue, Jan 31, 2017 at 5:49 AM, SOLR4189  wrote:
> But I can't run Json Facet API. I checked on SOLR-5.4.1.
> If I write:
> localhost:9001/solr/Test1_shard1_replica1/myHandler/q=*:*=5=*=json=true=someField
> It works fine. But if I write:
> localhost:9001/solr/Test1_shard1_replica1/myHandler/q=*:*=5=*=json={field:someField}
> It doesn't work.
> Are you sure that it is built-in? If it is built-in, why can't I find an
> explanation about it in the reference guide?
> Thank you for your help.

You do have to follow the correct syntax:

  json.facet={name_of_facet_in_output:{type:terms, field:name_of_field}}

It is documented in confluence:

https://cwiki.apache.org/confluence/display/solr/Faceted+Search

Also by yonik:

http://yonik.com/json-facet-api/

Cheers

Tom



Re: Concat Fields in JSON Facet

2017-01-17 Thread Tom Evans
On Mon, Jan 16, 2017 at 2:58 PM, Zheng Lin Edwin Yeo
 wrote:
> Hi,
>
> I have been using JSON Facet, but I am facing some constraints in
> displaying the field.
>
> For example, I have 2 fields, itemId and itemName. However, when I do the
> JSON Facet, I can only get it to show one of them in the output, and I
> could not get it to show both together.
> I will like to show both the ID and Name together, so that it will be more
> meaningful and easier for user to understand, without having to refer to
> another table to determine the match between the ID and Name.

I don't understand what you mean. If you have these three documents in
your index, what data do you want in the facet?

[
  {itemId: 1, itemName: "Apple"},
  {itemId: 2, itemName: "Android"},
  {itemId: 3, itemName: "Android"},
]

Cheers

Tom


Re: Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question

2016-12-15 Thread Tom Evans
On Thu, Dec 15, 2016 at 12:37 PM, GW  wrote:
> While my client is all PHP it does not use a solr client. I wanted to stay
> with he latest Solt Cloud and the PHP clients all seemed to have some kind
> of issue being unaware of newer Solr Cloud versions. The client makes pure
> REST calls with Curl. It is stateful through local storage. There is no
> persistent connection. There are no cookies and PHP work is not sticky so
> it is designed for round robin on both the internal network.
>
> I'm thinking we have a different idea of persistent. To me something like
> MySQL can be persistent, ie a fifo queue for requests. The stack can be
> always on/connected on something like a heap storage.
>
> I never thought about the impact of a solr node crashing with PHP on top.
> Many thanks!
>
> Was thinking of running a conga line (Ricci & Luci projects) and shutting
> down and replacing failed nodes. Never done this with Solr. I don't see any
> reasons why it would not work.
>
> ** When you say an array of connections per host. It would still require an
> internal DNS because hosts files don't round robin. perhaps this is handled
> in the Python client??


The best Solr clients will take the URIs of the Zookeeper servers;
they do not make queries via Zookeeper, but will read the current
cluster status from zookeeper in order to determine which solr node to
actually connect to, taking in to account what nodes are alive, and
the state of particular shards.

SolrJ (Java) will do this, as will pysolr (python), I'm not aware of a
PHP client that is ZK aware.

If you don't have a ZK aware client, there are several options:

1) Make your favourite client ZK aware, like in [1]
2) Use round robin DNS to distribute requests amongst the cluster.
3) Use a hardware or software load balancer in front of the cluster.
4) Use shared state to store the names of active nodes*

All apart from 1) have significant downsides:

2) Has no concept of a node being down. Down nodes should not cause
query failures, the requests should go elsewhere in the cluster.
Requires updating DNS to add or remove nodes.
3) Can detect "down" nodes. Has no idea about the state of the
cluster/shards (usually).
4) Basically duplicates what ZooKeeper does, but less effectively -
doesn't know cluster state, down nodes, nodes that are up but with
unhealthy replicas...

>
> You have given me some good clarification. I think lol. I know I can spin
> out WWW servers based on load. I'm not sure how shit will fly spinning up
> additional solr nodes. I'm not sure what happens if you spin up an empty
> solr node and what will happen with replication, shards and load cost of
> spinning an instance. I'm facing some experimentation me thinks. This will
> be a manual process at first, for sure
>
> I guess I could put the solr connect requests in my clients into a try
> loop, looking for successful connections by name before any action.

In SolrCloud mode, you can spin up/shut down nodes as you like.
Depending on how you have configured your collections, new replicas
may be automatically created on the new node, or the node will simply
become part of the cluster but empty, ready for you to assign new
replicas to it using the Collections API.

You can also use what are called "snitches" to define rules for how
you want replicas/shards allocated amongst the nodes, eg to avoid
placing all the replicas for a shard in the same rack.

Cheers

Tom

[1] 
https://github.com/django-haystack/pysolr/commit/366f14d75d2de33884334ff7d00f6b19e04e8bbf


Re: Using DIH FileListEntityProcessor with SolrCloud

2016-12-06 Thread Tom Evans
On Fri, Dec 2, 2016 at 4:36 PM, Chris Rogers
 wrote:
> Hi all,
>
> A question regarding using the DIH FileListEntityProcessor with SolrCloud 
> (solr 6.3.0, zookeeper 3.4.8).
>
> I get that the config in SolrCloud lives on the Zookeeper node (a different 
> server from the solr nodes in my setup).
>
> With this in mind, where is the baseDir attribute in the 
> FileListEntityProcessor config relative to? I’m seeing the config in the Solr 
> GUI, and I’ve tried setting it as an absolute path on my Zookeeper server, 
> but this doesn’t seem to work… any ideas how this should be setup?
>
> My DIH config is below:
>
> 
>   
>   
> 
>  fileName=".*xml"
> newerThan="'NOW-5YEARS'"
> recursive="true"
> rootEntity="false"
> dataSource="null"
> baseDir="/home/bodl-zoo-svc/files/">
>
>   
>
>  forEach="/TEI" url="${f.fileAbsolutePath}" 
> transformer="RegexTransformer" >
>  xpath="/TEI/teiHeader/fileDesc/titleStmt/title"/>
>  xpath="/TEI/teiHeader/fileDesc/publicationStmt/publisher"/>
>  xpath="/TEI/teiHeader/fileDesc/sourceDesc/msDesc/msIdentifier/altIdentifier/idno"/>
>   
>
> 
>
>   
> 
>
>
> This same script worked as expected on a single solr node (i.e. not in 
> SolrCloud mode).
>
> Thanks,
> Chris
>

Hey Chris

We hit the same problem moving from non-cloud to cloud, we had a
collection that loaded its DIH config from various XML files listing
the DB queries to run. We wrote a simple DataSource plugin function to
load the config from Zookeeper instead of local disk to avoid having
to distribute those config files around the cluster.

https://issues.apache.org/jira/browse/SOLR-8557

Cheers

Tom


Re: insert lat/lon from jpeg into solr

2016-12-01 Thread Tom Evans
On Wed, Nov 30, 2016 at 1:36 PM, win harrington
 wrote:
> I have jpeg files with latitude and longitude in separate fields. When I run
> the post tool, it stores the lat/lon in separate fields.
> For geospatial search, Solr wants them combined into one field with the
> format 'latitude,longitude'.
> How can I combine lat+lon into one field?
>

Build the field up using an UpdateRequestProcessorChain, something like this:

  <updateRequestProcessorChain name="composite-latlon">
    <processor class="solr.CloneFieldUpdateProcessorFactory">
      <str name="source">latitude</str>
      <str name="dest">latlon</str>
    </processor>
    <processor class="solr.CloneFieldUpdateProcessorFactory">
      <str name="source">longitude</str>
      <str name="dest">latlon</str>
    </processor>
    <processor class="solr.ConcatFieldUpdateProcessorFactory">
      <str name="fieldName">latlon</str>
      <str name="delimiter">,</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

and then point the update handler at the chain:

  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">composite-latlon</str>
    </lst>
  </requestHandler>

Cheers

Tom


Re: Import from S3

2016-11-25 Thread Tom Evans
On Fri, Nov 25, 2016 at 7:23 AM, Aniket Khare  wrote:
> You can use Solr DIH for indexing csv data into solr.
> https://wiki.apache.org/solr/DataImportHandler
>

Seems overkill when you can simply post CSV data to the UpdateHandler,
using either the post tool:

https://cwiki.apache.org/confluence/display/solr/Post+Tool

Or by doing it manually however you wish:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-CSVFormattedIndexUpdates
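
For example, streaming straight from S3 into the update handler (the
bucket and collection names are assumptions, untested):

aws s3 cp s3://your-bucket/data.csv - | \
  curl 'http://localhost:8983/solr/yourcollection/update?commit=true' \
    -H 'Content-Type: text/csv' --data-binary @-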

Cheers

Tom


Re: Query formulation help

2016-10-26 Thread Tom Evans
On Wed, Oct 26, 2016 at 4:00 PM, Prasanna S. Dhakephalkar
 wrote:
> Hi,
>
> Thanks for reply, I did
>
> "q": "cost:[2 TO (2+5000)]"
>
> Got
>
>   "error": {
> "msg": "org.apache.solr.search.SyntaxError: Cannot parse 'cost:[2 to 
> (2+5000)]': Encountered \"  \"(2+5000) \"\" at line 1, 
> column 18.\nWas expecting one of:\n\"]\" ...\n\"}\" ...\n",
>   }
>
> I want solr to do the addition.
> I tried
> "q": "cost:[2 TO (2+5000)]"
> "q": "cost:[2 TO sum(2,5000)]"
>
> It has not worked. I am missing something; I do not know what - maybe how to
> invoke functions.
>
> Regards,
>
> Prasanna.

Sorry, I was unclear - do the maths before constructing the query!

You might be able to do this with function queries, but why bother? If
the number is fixed, then fix it in the query, if it varies then there
must be some code executing on your client that can be used to do a
simple addition.
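
ie something like this (placeholder numbers and collection name,
untested), with the addition done before the request is built:

GIVEN=20000
curl 'http://localhost:8983/solr/yourcollection/select' \
  --data-urlencode "q=cost:[$GIVEN TO $((GIVEN + 5000))]"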

Cheers

Tom


Re: Query formulation help

2016-10-26 Thread Tom Evans
On Wed, Oct 26, 2016 at 8:03 AM, Prasanna S. Dhakephalkar
 wrote:
> Hi,
>
>
>
> May be very rudimentary question
>
>
>
> There is a integer field in a core : "cost"
>
> Need to build a query that will return documents where 0  <
> "cost"-given_number  <  500
>

cost:[given_number TO (500+given_number)]


Re: OOM Error

2016-10-26 Thread Tom Evans
On Wed, Oct 26, 2016 at 4:53 AM, Shawn Heisey  wrote:
> On 10/25/2016 8:03 PM, Susheel Kumar wrote:
>> Agree, Pushkar.  I had docValues for sorting / faceting fields from
>> begining (since I setup Solr 6.0).  So good on that side. I am going to
>> analyze the queries to find any potential issue. Two questions which I am
>> puzzling with
>>
>> a) Should the below JVM parameter be included for Prod to get heap dump
>>
>> "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/the/dump"
>
> A heap dump can take a very long time to complete, and there may not be
> enough memory in the machine to start another instance of Solr until the
> first one has finished the heap dump.  Also, I do not know whether Java
> would release the listening port before the heap dump finishes.  If not,
> then a new instance would not be able to start immediately.
>
> If a different heap dump file is created each time, that might lead to
> problems with disk space after repeated dumps.  I don't know how the
> option works.
>
>> b) Currently OOM script just kills the Solr instance. Shouldn't it be
>> enhanced to wait and restart Solr instance
>
> As long as there is a problem causing OOMs, it seems rather pointless to
> start Solr right back up, as another OOM is likely.  The safest thing to
> do is kill Solr (since its operation would be unpredictable after OOM)
> and let the admin sort the problem out.
>

Occasionally our cloud nodes can OOM, when particularly complex
faceting is performed. The current OOM management can be exceedingly
annoying; a user will make a too complex analysis request, bringing
down one server, taking it out of the balancer. The user gets fed up
at no response, so reloads the page, re-submitting the analysis and
bringing down the next server in the cluster.

Lather, rinse, repeat - and then you get to have a meeting to discuss
why we invest so much in HA infrastructure that can be made non-HA by
one user with a complex query. In those meetings it is much harder to
justify not restarting.

Cheers

Tom


Re: indexing - offline

2016-10-20 Thread Tom Evans
On Thu, Oct 20, 2016 at 5:38 PM, Rallavagu  wrote:
> Solr 5.4.1 cloud with embedded jetty
>
> Looking for some ideas around offline indexing where an independent node
> will be indexed offline (not in the cloud) and added to the cloud to become
> leader so other cloud nodes will get replicated. Wonder if this is possible
> without interrupting the live service. Thanks.

How we do this, to reindex collection "foo":

1) First, collection "foo" should be an alias to the real collection,
eg "foo_1" aliased to "foo"
2) Have a node "node_i" in the cluster that is used for indexing. It
doesn't hold any shards of any collections
3) Use collections API to create collection "foo_2", with however many
shards required, but all placed on "node_i"
4) Index "foo_2" with new data with DIH or direct indexing to "node_1".
5) Use collections API to expand "foo_2" to all the nodes/replicas
that it should be on
6) Remove "foo_2" from "node_i"
7) Verify contents of "foo_2" are correct
8) Use collections API to change alias for "foo" to "foo_2"
9) Remove "foo_1" collection once happy

This avoids indexing overwhelming the performance of the cluster (or
any nodes in the cluster that receive queries), and can be performed
with zero downtime or config changes on the clients.
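
For example, steps 3 and 8 with the collections API (node name, config
name and shard count are illustrative, untested):

# step 3: create the new collection entirely on the indexing node
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=foo_2&collection.configName=foo_conf&numShards=2&replicationFactor=1&createNodeSet=node_i:8983_solr'

# step 8: repoint the alias once foo_2 has been expanded and verified
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=foo&collections=foo_2'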

Cheers

Tom


min()/max() on date fields using JSON facets

2016-07-25 Thread Tom Evans
Hi all

I'm trying to replace a use of the stats module with JSON facets in
order to calculate the min/max date range of documents in a query. For
the same search, "stats.field=date_published" returns this:

{u'date_published': {u'count': 86760,
 u'max': u'2016-07-13T00:00:00Z',
 u'mean': u'2013-12-11T07:09:17.676Z',
 u'min': u'2011-01-04T00:00:00Z',
 u'missing': 0,
 u'stddev': 50006856043.410477,
 u'sum': u'3814570-11-06T00:00:00Z',
 u'sumOfSquares': 1.670619719649826e+29}}

For the equivalent JSON facet - "{'date.max': 'max(date_published)',
'date.min': 'min(date_published)'}" - I'm returned this:

{u'count': 86760, u'date.max': 146836800.0, u'date.min': 129409920.0}

What do these numbers represent - I'm guessing it is milliseconds
since epoch? In UTC?
Is there any way to control the output format or TZ?
Is there any benefit in using JSON facets to determine this, or should
I just continue using stats?

Cheers

Tom


Re: Node not recovering, leader elections not occuring

2016-07-19 Thread Tom Evans
On the nodes that have the replica in a recovering state we now see:

19-07-2016 16:18:28 ERROR RecoveryStrategy:159 - Error while trying to
recover. core=lookups_shard1_replica8:org.apache.solr.common.SolrException:
No registered leader was found after waiting for 4000ms , collection:
lookups slice: shard1
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:607)
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:593)
at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

19-07-2016 16:18:28 INFO  RecoveryStrategy:444 - Replay not started,
or was not successful... still buffering updates.
19-07-2016 16:18:28 ERROR RecoveryStrategy:481 - Recovery failed -
trying again... (164)
19-07-2016 16:18:28 INFO  RecoveryStrategy:503 - Wait [12.0] seconds
before trying to recover again (attempt=165)


This is with the "leader that is not the leader" shut down.

Issuing a FORCELEADER via collections API doesn't in fact force a
leader election to occur.

Is there any other way to prompt Solr to have an election?

Cheers

Tom

On Tue, Jul 19, 2016 at 5:10 PM, Tom Evans <tevans...@googlemail.com> wrote:
> There are 11 collections, each only has one shard, and each node has
> 10 replicas (9 collections are on every node, 2 are just on one node).
> We're not seeing any OOM errors on restart.
>
> I think we're being patient waiting for the leader election to occur.
> We stopped the troublesome "leader that is not the leader" server
> about 15-20 minutes ago, but we still have not had a leader election.
>
> Cheers
>
> Tom
>
> On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson <erickerick...@gmail.com> 
> wrote:
>> How many replicas per Solr JVM? And do you
>> see any OOM errors when you bounce a server?
>> And how patient are you being, because it can
>> take 3 minutes for a leaderless shard to decide
>> it needs to elect a leader.
>>
>> See SOLR-7280 and SOLR-7191 for the case
>> where lots of replicas are in the same JVM,
>> the tell-tale symptom is errors in the log as you
>> bring Solr up saying something like
>> "OutOfMemory error unable to create native thread"
>>
>> SOLR-7280 has patches for 6x and 7x, with a 5x one
>> being added momentarily.
>>
>> Best,
>> Erick
>>
>> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans <tevans...@googlemail.com> wrote:
>>> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
>>> of the collections on it marked as "Recovering" or "Recovery Failed".
>>> It attempts to recover from the leader, but the leader responds with:
>>>
>>> Error while trying to recover.
>>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:3/solr: We are not the
>>> leader
>>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>> at 
>>> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>>> at 
>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at 
>>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: 
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:3/solr: We are not the
>>> leader
>>>

Re: Node not recovering, leader elections not occuring

2016-07-19 Thread Tom Evans
There are 11 collections, each only has one shard, and each node has
10 replicas (9 collections are on every node, 2 are just on one node).
We're not seeing any OOM errors on restart.

I think we're being patient waiting for the leader election to occur.
We stopped the troublesome "leader that is not the leader" server
about 15-20 minutes ago, but we still have not had a leader election.

Cheers

Tom

On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> How many replicas per Solr JVM? And do you
> see any OOM errors when you bounce a server?
> And how patient are you being, because it can
> take 3 minutes for a leaderless shard to decide
> it needs to elect a leader.
>
> See SOLR-7280 and SOLR-7191 for the case
> where lots of replicas are in the same JVM,
> the tell-tale symptom is errors in the log as you
> bring Solr up saying something like
> "OutOfMemory error unable to create native thread"
>
> SOLR-7280 has patches for 6x and 7x, with a 5x one
> being added momentarily.
>
> Best,
> Erick
>
> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans <tevans...@googlemail.com> wrote:
>> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
>> of the collections on it marked as "Recovering" or "Recovery Failed".
>> It attempts to recover from the leader, but the leader responds with:
>>
>> Error while trying to recover.
>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://172.31.1.171:3/solr: We are not the
>> leader
>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>> at 
>> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>> at 
>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at 
>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: 
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://172.31.1.171:3/solr: We are not the
>> leader
>> at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
>> at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
>> at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
>> ... 5 more
>>
>> and recovery never occurs.
>>
>> Each collection in this state has plenty (10+) of active replicas, but
>> stopping the server that is marked as the leader doesn't trigger a
>> leader election amongst these replicas.
>>
>> REBALANCELEADERS did nothing.
>> FORCELEADER complains that there is already a leader.
>> FORCELEADER with the purported leader stopped took 45 seconds,
>> reported status of "0" (and no other message) and kept the down node
>> as the leader (!)
>> Deleting the failed collection from the failed node and re-adding it
>> has the same "Leader said I'm not the leader" error message.
>>
>> Any other ideas?
>>
>> Cheers
>>
>> Tom


Node not recovering, leader elections not occuring

2016-07-19 Thread Tom Evans
Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
of the collections on it marked as "Recovering" or "Recovery Failed".
It attempts to recover from the leader, but the leader responds with:

Error while trying to recover.
core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://172.31.1.171:3/solr: We are not the
leader
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://172.31.1.171:3/solr: We are not the
leader
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
... 5 more

and recovery never occurs.

Each collection in this state has plenty (10+) of active replicas, but
stopping the server that is marked as the leader doesn't trigger a
leader election amongst these replicas.

REBALANCELEADERS did nothing.
FORCELEADER complains that there is already a leader.
FORCELEADER with the purported leader stopped took 45 seconds,
reported status of "0" (and no other message) and kept the down node
as the leader (!)
Deleting the failed collection from the failed node and re-adding it
has the same "Leader said I'm not the leader" error message.

Any other ideas?

Cheers

Tom


Strange highlighting on search

2016-06-16 Thread Tom Evans
Hi all

I'm investigating a bug where by every term in the highlighted field
gets marked for highlighting instead of just the words that match the
fulltext portion of the query. This is on Solr 5.5.0, but I didn't see
any bug fixes related to highlighting in 5.5.1 or 6.0 release notes.

The query that affects it is where we have a not clause on a specific
field (not the fulltext field) and also only include documents where
that field has a value:

q: cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *]
AND -ingredient_tag_id:(35223)

This returns the correct results, but the highlighting has matched
every word in the results (see below for debugQuery output). If I
change the query to put the exclusion in to an fq, the highlighting is
correct again (and the results are correct):

q: cosmetics_packaging_fulltext:(Mist)
fq: {!cache=false} ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)

Is there any way I can make the query and highlighting work as
expected as part of q?

Is there any downside to putting the exclusion part in the fq in terms
of performance? We don't use score at all for our results, we always
order by other parameters.

Cheers

Tom

Query with strange highlighting:

{
  "responseHeader":{
"status":0,
"QTime":314,
"params":{
  "q":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
  "hl":"true",
  "hl.simple.post":"",
  "indent":"true",
  "fl":"id,product",
  "hl.fragsize":"0",
  "hl.fl":"product",
  "rows":"5",
  "wt":"json",
  "debugQuery":"true",
  "hl.simple.pre":""}},
  "response":{"numFound":10132,"start":0,"docs":[
  {
"id":"2403841-1498608",
"product":"Mist"},
  {
"id":"2410603-1502577",
"product":"Mist"},
  {
"id":"5988531-3882415",
"product":"Ao + Mist"},
  {
"id":"6020805-3904203",
"product":"UV Mist Cushion SPF 50+ PA+++"},
  {
"id":"2617977-1629335",
"product":"Ultra Radiance Facial Re-Hydrating Mist"}]
  },
  "highlighting":{
"2403841-1498608":{
  "product":["Mist"]},
"2410603-1502577":{
  "product":["Mist"]},
"5988531-3882415":{
  "product":["Ao + Mist"]},
"6020805-3904203":{
  "product":["UV Mist Cushion
SPF 50+ PA+++"]},
"2617977-1629335":{
  "product":["Ultra Radiance Facial
Re-Hydrating Mist"]}},
  "debug":{
"rawquerystring":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
"querystring":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
"parsedquery":"+cosmetics_packaging_fulltext:mist
+ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223",
"parsedquery_toString":"+cosmetics_packaging_fulltext:mist
+ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223",
"explain":{
  "2403841-1498608":"\n40.082462 = sum of:\n  39.92971 =
weight(cosmetics_packaging_fulltext:mist in 13983)
[ClassicSimilarity], result of:\n39.92971 =
score(doc=13983,freq=39.0), product of:\n  0.9882648 =
queryWeight, product of:\n6.469795 = idf(docFreq=22502,
maxDocs=5342472)\n0.15275055 = queryNorm\n  40.40386 =
fieldWeight in 13983, product of:\n6.244998 = tf(freq=39.0),
with freq of:\n  39.0 = termFreq=39.0\n6.469795 =
idf(docFreq=22502, maxDocs=5342472)\n1.0 =
fieldNorm(doc=13983)\n  0.15275055 = ingredient_tag_id:[0 TO *],
product of:\n1.0 = boost\n0.15275055 = queryNorm\n",
  "2410603-1502577":"\n40.082462 = sum of:\n  39.92971 =
weight(cosmetics_packaging_fulltext:mist in 14023)
[ClassicSimilarity], result of:\n39.92971 =
score(doc=14023,freq=39.0), product of:\n  0.9882648 =
queryWeight, product of:\n6.469795 = idf(docFreq=22502,
maxDocs=5342472)\n0.15275055 = queryNorm\n  40.40386 =
fieldWeight in 14023, product of:\n6.244998 = tf(freq=39.0),
with freq of:\n  39.0 = termFreq=39.0\n6.469795 =
idf(docFreq=22502, maxDocs=5342472)\n1.0 =
fieldNorm(doc=14023)\n  0.15275055 = ingredient_tag_id:[0 TO *],
product of:\n1.0 = boost\n0.15275055 = queryNorm\n",
  "5988531-3882415":"\n37.435104 = sum of:\n  37.282352 =
weight(cosmetics_packaging_fulltext:mist in 1062788)
[ClassicSimilarity], result of:\n37.282352 =
score(doc=1062788,freq=34.0), product of:\n  0.9882648 =
queryWeight, product of:\n6.469795 = idf(docFreq=22502,
maxDocs=5342472)\n0.15275055 = queryNorm\n  37.725063 =
fieldWeight in 1062788, product of:\n5.8309517 =
tf(freq=34.0), with freq of:\n  34.0 = termFreq=34.0\n
6.469795 = idf(docFreq=22502, maxDocs=5342472)\n1.0 =
fieldNorm(doc=1062788)\n  0.15275055 = ingredient_tag_id:[0 TO *],
product of:\n1.0 = boost\n0.15275055 = queryNorm\n",
  "6020805-3904203":"\n30.816679 = sum of:\n  30.663929 =

Re: result grouping in sharded index

2016-06-15 Thread Tom Evans
Do you have to group, or can you collapse instead?

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Cheers

Tom

On Tue, Jun 14, 2016 at 4:57 PM, Jay Potharaju  wrote:
> Any suggestions on how to handle result grouping in sharded index?
>
>
> On Mon, Jun 13, 2016 at 1:15 PM, Jay Potharaju 
> wrote:
>
>> Hi,
>> I am working on a functionality that would require me to group documents
>> by a id field. I read that the ngroups feature would not work in a sharded
>> index.
>> Can someone recommend how to handle this in a sharded index?
>>
>>
>> Solr Version: 5.5
>>
>>
>> https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats
>>
>> --
>> Thanks
>> Jay
>>
>>
>
>
>
> --
> Thanks
> Jay Potharaju


Re: Import html data in mysql and map schemas using onlySolrCELL+TIKA+DIH [scottchu]

2016-05-24 Thread Tom Evans
On Tue, May 24, 2016 at 3:06 PM, Scott Chu  wrote:
> p.s. There's really a lot of extensive, worthy stuff in Solr. If the
> project team can provide some "dictionary" of it, it would be a "Santa Claus"
> for us solr users. Ha! Just a X'mas wish! Sigh! I know it's not quite
> possible. I really would like to study it all, one item after another, to
> learn about all of it. However, Internet IT moves too fast to have time to
> digest all of the great stuff in Solr.

The reference guide is both extensive and also broadly informative.
Start from the top page and browse away!

https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide

Handy to keep the glossary handy for any terms that you don't recognise:

https://cwiki.apache.org/confluence/display/solr/Solr+Glossary

Cheers

Tom


Re: SolrCloud increase replication factor

2016-05-23 Thread Tom Evans
On Mon, May 23, 2016 at 10:37 AM, Hendrik Haddorp
 wrote:
> Hi,
>
> I have a SolrCloud 6.0 setup and created my collection with a
> replication factor of 1. Now I want to increase the replication factor
> but would like the replicas for the same shard to be on different nodes,
> so that my collection does not fail when one node fails. I tried two
> approaches so far:
>
> 1) When I use the collections API with the MODIFYCOLLECTION action [1] I
> can set the replication factor but that did not result in the creation
> of additional replicas. The Solr Admin UI showed that my replication
> factor changed but otherwise nothing happened. A reload of the
> collection did also result in no change.
>
> 2) Using the ADDREPLICA action [2] from the collections API I have to
> add the replicas to the shard individually, which is a bit more
> complicated but otherwise worked. During testing this did however at
> least once result in the replica being created on the same node. My
> collection was split in 4 shards and for 2 of them all replicas ended up
> on the same node.
>
> So is the only option to create the replicas manually and also pick the
> nodes manually or is the perceived behavior wrong?
>
> regards,
> Hendrik
>
> [1]
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-modifycoll
> [2]
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica


With ADDREPLICA, you can specify the node to create the replica on. If
you are using a script to increase/remove replicas, you can simply
incorporate the logic you desire in to your script - you can also use
CLUSTERSTATUS to get a list of nodes/collections/shards etc in order
to inform the logic in the script. This is the approach we took; we
have a fabric script to add/remove extra nodes to/from the cluster, and
it works well.
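
For example, the relevant API calls look roughly like this (collection,
shard and node names made up):

  # where is everything right now?
  curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json'

  # add a replica of shard1 on a specific node
  curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=10.1.2.3:8983_solr'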

The alternative is to put the logic into Solr itself, using what Solr
calls a "snitch" to define the rules on where replicas are created.
The snitch is specified at collection creation time, or you can use
MODIFYCOLLECTION to set it after the fact. See this wiki page for
details:

https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement
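
For example (rule syntax as documented on that page), a collection that
never puts two replicas of the same shard on one node can be created
with something like:

  .../admin/collections?action=CREATE&name=mycoll&numShards=4
    &replicationFactor=2&rule=shard:*,replica:<2,node:*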

Cheers

Tom


Re: Creating a collection with 1 shard gives a weird range

2016-05-17 Thread Tom Evans
On Tue, May 17, 2016 at 9:40 AM, John Smith  wrote:
> I'm trying to create a collection starting with only one shard
> (numShards=1) using a compositeID router. The purpose is to start small
> and begin splitting shards when the index grows larger. The shard
> created gets a weird range value: 80000000-7fffffff, which doesn't look
> effective. Indeed, if a try to import some documents using a DIH, none
> gets added.
>
> If I create the same collection with 2 shards, the ranges seem more
> logical (0-7fffffff & 80000000-ffffffff). In this case documents are
> indexed correctly.
>
> Is this behavior by design, i.e. is a minimum of 2 shards required? If
> not, how can I create a working collection with a single shard?
>
> This is Solr-6.0.0 in cloud mode with zookeeper-3.4.8.
>

I believe this is as designed, see this email from Shawn:

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3c570d0a03.5010...@elyograg.org%3E

Cheers

Tom


Re: Indexing 700 docs per second

2016-04-19 Thread Tom Evans
On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson  wrote:
> Hi,
>
> I have a requirement to index (mainly updation) 700 docs per second.
> Suppose I have a 128GB RAM, 32 CPU machine, with each doc size around 260
> bytes (6 fields out of which only 2 will undergo updation at the above
> rate). This collection has around 122Million docs and that count is pretty
> much a constant.
>
> 1. Can I manage this updation rate with a non-sharded ie single Solr
> instance set up?
> 2. Also is atomic update or a full update (the whole doc) of the changed
> records the better approach in this case.
>
> Could some one please share their views/ experience?

Try it and see - everyone's data/schemas are different and can affect
indexing speed. It certainly sounds achievable enough - presumably you
can at least produce the documents at that rate?

Cheers

Tom


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Tom Evans
On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
 wrote:
> Thanks all - very helpful.
>
> @Shawn - your reply implies that even if I'm hitting the URL for a single
> endpoint via HTTP - the "balancing" will still occur across the Solr Cloud
> (I understand the caveat about that single endpoint being a potential point
> of failure).  I just want to verify that I'm interpreting your response
> correctly...
>
> (I have been asked to provide IT with a comprehensive list of options prior
> to a design discussion - which is why I'm trying to get clear about the
> various options)
>
> In a nutshell, I think I understand the following:
>
> a. Even if hitting a single URL, the Solr Cloud will "balance" across all
> available nodes for searching
>   Caveat: That single URL represents a potential single point of
> failure and this should be taken into account
>
> b. SolrJ's CloudSolrClient API provides the ability to distribute load --
> based on Zookeeper's "knowledge" of all available Solr instances.
>   Note: This is more robust than "a" due to the fact that it
> eliminates the "single point of failure"
>
> c.  Use of a load balancer hitting all known Solr instances will be fine -
> although the search requests may not run on the Solr instance the load
> balancer targeted - due to "a" above.
>
> Corrections or refinements welcomed...

With option a), although queries will be distributed across the
cluster, all queries will be going through that single node. Not only
is that a single point of failure, but you risk saturating the
inter-node network traffic, possibly resulting in lower QPS and higher
latency on your queries.

With option b), as well as SolrJ, recent versions of pysolr have a
ZK-aware SolrCloud client that behaves in a similar way.

With option c), you can use the preferLocalShards parameter so that
shards that are local to the queried node are used in preference to
remote shards. Depending on your shard/cluster topology, this can
increase performance if you are returning large amounts of data - many
or large fields, or many documents.
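
That is just an extra request parameter, e.g. (collection name made up):

  http://node1:8983/solr/mycoll/select?q=...&preferLocalShards=true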

Cheers

Tom


Re: Anticipated Solr 5.5.1 release date

2016-04-15 Thread Tom Evans
Awesome, thanks :)

On Fri, Apr 15, 2016 at 4:19 PM, Anshum Gupta <ans...@anshumgupta.net> wrote:
> Hi Tom,
>
> I plan on getting a release candidate out for vote by Monday. If all goes
> well, it'd be about a week from then for the official release.
>
> On Fri, Apr 15, 2016 at 6:52 AM, Tom Evans <tevans...@googlemail.com> wrote:
>
>> Hi all
>>
>> We're currently using Solr 5.5.0 and converting our regular old style
>> facets into JSON facets, and are running in to SOLR-8155 and
>> SOLR-8835. I can see these have already been back-ported to 5.5.x
>> branch, does anyone know when 5.5.1 may be released?
>>
>> We don't particularly want to move to Solr 6, as we have only just
>> finished validating 5.5.0 with our original queries!
>>
>> Cheers
>>
>> Tom
>>
>
>
>
> --
> Anshum Gupta


Anticipated Solr 5.5.1 release date

2016-04-15 Thread Tom Evans
Hi all

We're currently using Solr 5.5.0 and converting our regular old style
facets into JSON facets, and are running in to SOLR-8155 and
SOLR-8835. I can see these have already been back-ported to 5.5.x
branch, does anyone know when 5.5.1 may be released?

We don't particularly want to move to Solr 6, as we have only just
finished validating 5.5.0 with our original queries!

Cheers

Tom


SolrCloud no leader for collection

2016-04-05 Thread Tom Evans
Hi all, I have an 8 node SolrCloud 5.5 cluster with 11 collections,
most of them in a 1 shard x 8 replicas configuration. We have 5 ZK
nodes.

During the night, we attempted to reindex one of the larger
collections. We reindex by pushing json docs to the update handler
from a number of processes. It seemed this overwhelmed the servers,
and caused all of the collections to fail and end up in either a down
or a recovering state, often with no leader.

Restarting and rebooting the servers brought a lot of the collections
back online, but we are left with a few collections for which all the
nodes hosting those replicas are up, but the replica reports as either
"active" or "down", and with no leader.

Trying to force a leader election has no effect, it keeps choosing a
leader that is in "down" state. Removing all the nodes that are in
"down" state and forcing a leader election also has no effect.


Any ideas? The only viable option I see is to create a new collection,
index it and then remove the old collection and alias it in.
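
For reference, that collection-swap approach is just a couple of
Collections API calls - roughly (names made up):

  # build a fresh collection against the existing configset
  .../admin/collections?action=CREATE&name=big_v2&numShards=1
    &replicationFactor=8&collection.configName=big_conf

  # reindex into big_v2, then point the alias the app queries at it
  .../admin/collections?action=CREATEALIAS&name=big&collections=big_v2

  # finally drop the broken collection
  .../admin/collections?action=DELETE&name=big_v1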

Cheers

Tom


Re: Creating new cluster with existing config in zookeeper

2016-03-23 Thread Tom Evans
On Wed, Mar 23, 2016 at 3:43 PM, Robert Brown  wrote:
> So I setup a new solr server to point to my existing ZK configs.
>
> When going to the admin UI on this new server I can see the shards/replica's
> of the existing collection, and can even query it, even tho this new server
> has no cores on it itself.
>
> Is this all expected behaviour?
>
> Is there any performance gain with what I have at this precise stage?  The
> extra server certainly makes it appear i could balance more load/requests,
> but I guess the queries are just being forwarded on to the servers with the
> actual data?
>
> Am I correct in thinking I can now create a new collection on this host, and
> begin to build up a new cluster?  and they won't interfere with each other
> at all?
>
> Also, that I'll be able to see both collections when using the admin UI
> Cloud page on any of the servers in either collection?
>

I'm confused slightly:

SolrCloud is a (singular) cluster of servers, storing all of its state
and configuration underneath a single zookeeper path. The cluster
contains collections. Collections are tied to a particular config set
within the cluster. Collections are made up of 1 or more shards. Each
shard is a core, and there are 1 or more replicas of each core.

You can add more servers to the cluster, and then create a new
collection with the same config as an existing collection, but it is
still part of the same cluster. Of course, you could think of a set of
servers within a cluster as a "logical" cluster if it just serves
a particular collection, but "cluster" to me would be all of the servers
within the same zookeeper tree, because that is where cluster state is
maintained.
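
Creating that second collection against a configset already in
zookeeper is just, for example (names made up):

  .../admin/collections?action=CREATE&name=newcoll&numShards=1
    &replicationFactor=2&collection.configName=existing_conf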

Cheers

Tom


Re: Re: Paging and cursorMark

2016-03-23 Thread Tom Evans
On Wed, Mar 23, 2016 at 12:21 PM, Vanlerberghe, Luc
 wrote:
> I worked on something similar a couple of years ago, but didn’t continue work 
> on it in the end.
>
> I've included the text of my original mail.
> If you're interested, I could try to find the sources I was working on at the 
> time
>
> Luc
>

Thanks both Luc and Steve. I'm not sure if we will have time to deploy
patched versions of things to production - time is always the enemy :(
- and we're not a Java shop, so there is a non-trivial time investment
in just building replacement jars, let alone getting them integrated
into our RPMs - but I'll definitely try it out on my dev server.

The change seems excessively complex imo, but maybe I'm not seeing the
use cases for skip.

To my mind, calculating a nextCursorMark is cheap and only relies on
having a strict sort ordering, which is also cheap to check. If that
condition is met, you should get a nextCursorMark in your response
regardless of whether you specified a cursorMark in the request, to
allow you to efficiently get the next page.
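
i.e. the normal cursor flow is something like this (the sort must
include the uniqueKey as a tie-breaker; the mark value is illustrative):

  # first page
  q=...&sort=score desc,id asc&rows=10&cursorMark=*
  # response contains "nextCursorMark":"AoEp..."

  # next page
  q=...&sort=score desc,id asc&rows=10&cursorMark=AoEp...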

This would still leave slightly pathological performance if you skip
to page N and then iterate back to page 0, which Luc's idea of a
previousCursorMark could solve. cursorMark is easy to implement: you
can ignore docs which sort lower than that mark. Can you do the same
with previousCursorMark? Would it not require keeping a buffer of
"rows" documents, and stopping when a document which sorts higher than
the supplied mark appears? That seems more complex, but maybe I'm not
understanding the internals correctly.

Fortunately for us, 90% of our users prefer infinite scroll, and 97%
of them never go beyond page 3.

Cheers

Tom


Paging and cursorMark

2016-03-22 Thread Tom Evans
Hi all

With Solr 5.5.0, we're trying to improve our paging performance. When
we are delivering results using infinite scrolling, cursorMark is
perfectly fine - one page is followed by the next. However, we also
offer traditional paging of results, and this is where it gets a
little tricky.

Say we have 10 results per page, and a user wants to jump from page 1
to page 20, and then wants to view page 21, there doesn't seem to be a
simple way to get the nextCursorMark. We can make an inefficient
request for page 20 (start=190, rows=10), but we cannot give that
request a cursorMark=* as it contains start=190.

Consequently, if the user clicks to page 21, we have to continue along
using start=200, as we have no cursorMark. The only way I can see to
get a cursorMark at that point is to omit the start=200, and instead
say rows=210, and ignore the first 200 results on the client side.
Obviously, this gets more and more inefficient the deeper we page - I
know that internally to Solr, using start=200&rows=10 has to do the
same work as rows=210, but less data is sent over the wire to the
client.

As I understand it, the cursorMark is a hash of the sort values of the
last document returned, so I don't really see why it is forbidden to
specify start=190&rows=10&cursorMark=* - why is it not possible to
calculate the nextCursorMark from the last document returned?

I was also thinking a possible temporary workaround would be to
request start=190&rows=10, note the last document returned, and then
make a subsequent query for q=id:"<last doc's id>"&rows=1&cursorMark=*.
This seems to work, but means an extra Solr query for no real reason.
Is there any other problem to doing this?

Is there some other simple trick I am missing that we can use to get
both the page of results we want and a nextCursorMark for the
subsequent page?

Cheers

Tom


Re: Ping handler in SolrCloud mode

2016-03-19 Thread Tom Evans
On Wed, Mar 16, 2016 at 4:10 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 3/16/2016 8:14 AM, Tom Evans wrote:
>> The problem occurs when we attempt to query a node to see if products
>> or items is active on that node. The balancer (haproxy) requests the
>> ping handler for the appropriate collection, however all the nodes
>> return OK for all the collections(!)
>>
>> Eg, on node01, it has replicas for products and skus, but the ping
>> handler for /solr/items/admin/ping returns 200!
>
> This returns OK because as long as one replica for every shard in
> "items" is available somewhere in the cloud, you can make a request for
> "items" on that node and it will work.  Or at least it *should* work,
> and if it's not working, that's a bug.  I remember that one of the older
> 4.x versions *did* have a bug where queries for a collection would only
> work if the node actually contained shards for that collection.

Sorry, this is Solr 5.5, I should have said.

Yes, we can absolutely make a request of "items", and it will work
correctly. However, we are making requests of "skus" that join to
"products", and the query is routed to a node which has only "skus"
and "items", and the request fails because joins can only work over
local replicas.

To fix this, we now have two additional balancers:

solr: has all the nodes, all nodes are valid backends
solr-items: has all the nodes in the cluster, but nodes are only valid
backends if it has "items" and "skus" replicas.
solr-products: has all the nodes in the cluster, but nodes are only
valid backends if it has "products" and "skus" replicas

(I'm simplifying things a bit, there are another 6 collections that
are on all nodes, hence the main balancer.)

The new balancers need a cheap way of checking what nodes are valid,
and ideally I'd like that check to not involve a query with a join
clause!

Cheers

Tom


Re: Ping handler in SolrCloud mode

2016-03-19 Thread Tom Evans
On Wed, Mar 16, 2016 at 2:14 PM, Tom Evans <tevans...@googlemail.com> wrote:
> Hi all
>
> [ .. ]
>
> The option I'm trying now is to make two ping handler for skus that
> join to one of items/products, which should fail on the servers which
> do not support it, but I am concerned that this is a little
> heavyweight for a status check to see whether we can direct requests
> at this server or not.

This worked; I would still be interested in a lighter-weight approach
that doesn't involve joins to see if a given collection has a shard on
this server. I suspect that might require a custom ping handler plugin
however.
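
For reference, the join-flavoured ping handler can be declared along
these lines in the solrconfig.xml of the skus collection (handler name
and field names are illustrative):

  <requestHandler name="/admin/ping-products" class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="q">{!join from=id to=product_id fromIndex=products}*:*</str>
    </lst>
  </requestHandler>

The fromIndex lookup only resolves against a local replica, which is
what makes it fail on nodes that don't host "products".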

Cheers

Tom


Ping handler in SolrCloud mode

2016-03-19 Thread Tom Evans
Hi all

I have a cloud setup with 8 nodes and 3 collections, products, items
and skus. All collections have just one shard, products has 6
replicas, items has 2 replicas, skus has 8 replicas. No node has both
products and items, all nodes have skus

Some of our queries join from sku to either products or items. If the
query is directed at a node without the appropriate shard on them, we
obviously get an error, so we have separate balancers for products and
items.

The problem occurs when we attempt to query a node to see if products
or items is active on that node. The balancer (haproxy) requests the
ping handler for the appropriate collection, however all the nodes
return OK for all the collections(!)

Eg, on node01, it has replicas for products and skus, but the ping
handler for /solr/items/admin/ping returns 200!

This means that as far as the balancer is concerned, node01 is a valid
destination for item queries, and inevitably it blows up as soon as
such a query is made to it.

As I understand it, this is because the URL we are checking is for the
collection ("items") rather than a specific core
("items_shard1_replica1")

Is there a way to make the ping handler only check local shards? I
have tried with distrib=false (and related parameters set to false), but it still
returns a 200.

The option I'm trying now is to make two ping handler for skus that
join to one of items/products, which should fail on the servers which
do not support it, but I am concerned that this is a little
heavyweight for a status check to see whether we can direct requests
at this server or not.

Cheers

Tom


mergeFactor/maxMergeDocs is deprecated

2016-03-03 Thread Tom Evans
Hi all

Updating to Solr 5.5.0, and getting these messages in our error log:

Beginning with Solr 5.5, <mergeFactor> is deprecated, configure it on
the relevant <mergePolicyFactory> instead.

Beginning with Solr 5.5, <maxMergeDocs> is deprecated, configure it on
the relevant <mergePolicyFactory> instead.

However, mergeFactor is only mentioned in a commented-out section of
our solrconfig.xml files, and maxMergeDocs is not mentioned at all.

> $ ack -B 1 -A 1 '<mergeFactor'
> [only the commented-out example block in solrconfig.xml matches]
> $ ack --all maxMergeDocs
> $

Any ideas?
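
For reference, if we did want to set those options in 5.5+, my
understanding is they now hang off the merge policy factory, something
like (values illustrative):

  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicyFactory>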

Cheers

Tom


Re: Separating cores from Solr home

2016-03-03 Thread Tom Evans
Hmm, I've worked around this by setting the directory where the
indexes should live to be the actual solr home, and symlinking the
files from the current release into that directory, but it feels icky.
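
i.e. roughly this, using the paths from the original mail (the exact
set of symlinked files will vary):

  SOLR_HOME=/data/project/indexes
  ln -s /data/project/solr/releases/current/solr.xml /data/project/indexes/solr.xml
  ln -s /data/project/solr/releases/current/lib      /data/project/indexes/lib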

Any better ideas?

Cheers

Tom

On Thu, Mar 3, 2016 at 11:12 AM, Tom Evans <tevans...@googlemail.com> wrote:
> Hi all
>
> I'm struggling to configure solr cloud to put the index files and
> core.properties in the correct places in SolrCloud 5.5. Let me explain
> what I am trying to achieve:
>
> * solr is installed in /opt/solr
> * the user who runs solr only has read only access to that tree
> * the solr home files - custom libraries, log4j.properties, solr.in.sh
> and solr.xml - live in /data/project/solr/releases/, which
> is then the target of a symlink /data/project/solr/releases/current
> * releasing a new version of the solr home (eg adding/changing
> libraries, changing logging options) is done by checking out a fresh
> copy of the solr home, switching the symlink and restarting solr
> * the solr core.properties and any data live in /data/project/indexes,
> so they are preserved when new solr home is released
>
> Setting core specific dataDir with absolute paths in solrconfig.xml
> only gets me part of the way, as the core.properties for each shard is
> created inside the solr home.
>
> This is obviously no good, as when releasing a new version of the solr
> home, they will no longer be in the current solr home.
>
> Cheers
>
> Tom


Separating cores from Solr home

2016-03-03 Thread Tom Evans
Hi all

I'm struggling to configure solr cloud to put the index files and
core.properties in the correct places in SolrCloud 5.5. Let me explain
what I am trying to achieve:

* solr is installed in /opt/solr
* the user who runs solr only has read only access to that tree
* the solr home files - custom libraries, log4j.properties, solr.in.sh
and solr.xml - live in /data/project/solr/releases/, which
is then the target of a symlink /data/project/solr/releases/current
* releasing a new version of the solr home (eg adding/changing
libraries, changing logging options) is done by checking out a fresh
copy of the solr home, switching the symlink and restarting solr
* the solr core.properties and any data live in /data/project/indexes,
so they are preserved when new solr home is released

Setting core specific dataDir with absolute paths in solrconfig.xml
only gets me part of the way, as the core.properties for each shard is
created inside the solr home.

This is obviously no good, as when releasing a new version of the solr
home, they will no longer be in the current solr home.

Cheers

Tom


Re: docValues error

2016-02-29 Thread Tom Evans
On Mon, Feb 29, 2016 at 11:43 AM, David Santamauro
 wrote:
> You will have noticed below, the field definition does not contain
> multiValues=true

What version of the schema are you using? In pre 1.1 schemas,
multiValued="true" is the default if it is omitted.
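
(The version is the attribute on the root element of schema.xml, e.g.
something like:

  <schema name="example" version="1.5">

anything below 1.1 gets the old multiValued default.)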

Cheers

Tom


Re: Json faceting, aggregate numeric field by day?

2016-02-11 Thread Tom Evans
On Wed, Feb 10, 2016 at 12:13 PM, Markus Jelsma
 wrote:
> Hi Tom - thanks. But judging from the article and SOLR-6348 faceting stats 
> over ranges is not yet supported. More specifically, SOLR-6352 is what we 
> would need.
>
> [1]: https://issues.apache.org/jira/browse/SOLR-6348
> [2]: https://issues.apache.org/jira/browse/SOLR-6352
>
> Thanks anyway, at least we found the tickets :)
>

No problem - as I was reading this I was thinking "But wait, I *know*
we do this ourselves for average price vs month published". In fact, I
was forgetting that we index the ranges that we will want to facet
over as part of the document - so a document with a date_published of
"2010-03-29T00:00:00Z" also has a date_published.month of "201003"
(and a bunch of other ranges that we want to facet by). The frontend
then converts those fields into the appropriate values for display.

This might be an acceptable solution for you guys too, depending on
how many ranges you require, and how much larger it would make your
index.
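
For what it's worth, once the extra field exists the JSON Facet API can
do the aggregation in one request - something along these lines (our
field names; swap in your own):

  json.facet={
    months: {
      type: terms,
      field: "date_published.month",
      limit: 36,
      facet: { avg_price: "avg(price)" }
    }
  }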

Cheers

Tom


Re: Json faceting, aggregate numeric field by day?

2016-02-10 Thread Tom Evans
On Wed, Feb 10, 2016 at 10:21 AM, Markus Jelsma
 wrote:
> Hi - if we assume the following simple documents:
>
> <doc>
>   <field name="date">2015-01-01T00:00:00Z</field>
>   <field name="value">2</field>
> </doc>
> <doc>
>   <field name="date">2015-01-01T00:00:00Z</field>
>   <field name="value">4</field>
> </doc>
> <doc>
>   <field name="date">2015-01-02T00:00:00Z</field>
>   <field name="value">3</field>
> </doc>
> <doc>
>   <field name="date">2015-01-02T00:00:00Z</field>
>   <field name="value">7</field>
> </doc>
>
> Can i get a daily average for the field 'value' by day? e.g.
>
> <lst name="daily_average">
>   <double name="2015-01-01">3.0</double>
>   <double name="2015-01-02">5.0</double>
> </lst>
>
> Reading the documentation, i don't think i can, or i am missing it 
> completely. But i just want to be sure.

Yes, you can facet by day, and use the stats component to calculate
the mean average. This blog post explains it:

https://lucidworks.com/blog/2015/01/29/you-got-stats-in-my-facets/
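
With the fields from your example (a "date" field and a "value" field,
names assumed), the request would be something along the lines of:

  q=*:*&rows=0
    &stats=true
    &stats.field={!tag=vstats}value
    &facet=true
    &facet.pivot={!stats=vstats}date

which should give a mean for "value" under each distinct date.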

Cheers

Tom


fq in SolrCloud

2016-02-05 Thread Tom Evans
I have a small question about fq in cloud mode that I couldn't find an
explanation for in confluence. If I specify a query with an fq, where
is that cached, is it just on the nodes/replicas that process that
specific query, or will it exist on all replicas?

We have a sub type of queries that specify an expensive join condition
that we specify in the fq, so that subsequent requests with the same
fq won't have to do the same expensive query, and was wondering
whether we needed to ensure that the query goes to the same node when
we move to cloud.

Cheers

Tom


Re: Shard allocation across nodes

2016-02-02 Thread Tom Evans
Thank you both, those are exactly what I was looking for!

If I'm reading it right, if I specify a "-Dvmhost=foo" when starting
SolrCloud, and then specify a snitch rule like this when creating the
collection:

  sysprop.vmhost:*,replica:<2

then this would ensure that on each vmhost there is at most one
replica. I'm assuming that a shard leader and a replica are both
treated as replicas in this scenario.
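
A sketch of what that looks like end to end (host name and collection
details made up):

  # in solr.in.sh on each node
  SOLR_OPTS="$SOLR_OPTS -Dvmhost=host01"

  # at collection creation time
  .../admin/collections?action=CREATE&name=mycoll&numShards=4
    &replicationFactor=2&rule=sysprop.vmhost:*,replica:<2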

Thanks

Tom

On Mon, Feb 1, 2016 at 8:34 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> See the createNodeset and node parameters for the Collections API CREATE and
> ADDREPLICA commands, respectively. That's more a manual process, there's
> nothing OOB but Jeff's suggestion is sound.
>
> Best,
> Erick
>
>
>
> On Mon, Feb 1, 2016 at 11:00 AM, Jeff Wartes <jwar...@whitepages.com> wrote:
>>
>> You could write your own snitch: 
>> https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement
>>
>> Or, it would be more annoying, but you can always add/remove replicas 
>> manually and juggle things yourself after you create the initial collection.
>>
>>
>>
>>
>> On 2/1/16, 8:42 AM, "Tom Evans" <tevans...@googlemail.com> wrote:
>>
>>>Hi all
>>>
>>>We're setting up a solr cloud cluster, and unfortunately some of our
>>>VMs may be physically located on the same VM host. Is there a way of
>>>ensuring that all copies of a shard are not located on the same
>>>physical server?
>>>
>>>If they do end up in that state, is there a way of rebalancing them?
>>>
>>>Cheers
>>>
>>>Tom


Shard allocation across nodes

2016-02-01 Thread Tom Evans
Hi all

We're setting up a solr cloud cluster, and unfortunately some of our
VMs may be physically located on the same VM host. Is there a way of
ensuring that all copies of a shard are not located on the same
physical server?

If they do end up in that state, is there a way of rebalancing them?

Cheers

Tom


SolrCloud, DIH, and XPathEntityProcessor

2016-01-12 Thread Tom Evans
Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having
some problems with a DIH config that attempts to load an XML file and
iterate through the nodes in that file: it tries to load the file from
disk instead of from zookeeper.

<entity dataSource="lookup_conf"
        rootEntity="false"
        name="lookups"
        processor="XPathEntityProcessor"
        url="lookup_conf.xml"
        forEach="/lookups/lookup">

The file exists in zookeeper, adjacent to the data_import.conf in the
lookups_config conf folder.

The exception:

2016-01-12 12:59:47.852 ERROR (Thread-44) [c:lookups s:shard1
r:core_node6 x:lookups_shard1_replica2] o.a.s.h.d.DataImporter Full
Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.io.FileNotFoundException: Could not
find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:417)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:481)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:462)
Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.io.FileNotFoundException: Could not
find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.io.FileNotFoundException: Could not
find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:62)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:287)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:202)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
... 5 more
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException:
Could not find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
at 
org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:127)
at 
org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:86)
at 
org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:48)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:284)
... 10 more
Caused by: java.io.FileNotFoundException: Could not find file:
lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
at 
org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:123)
... 13 more


Any hints gratefully accepted

Cheers

Tom


Re: SolrCloud, DIH, and XPathEntityProcessor

2016-01-12 Thread Tom Evans
On Tue, Jan 12, 2016 at 3:00 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 1/12/2016 7:45 AM, Tom Evans wrote:
>> That makes no sense whatsoever. DIH loads the data_import.conf from ZK
>> just fine, or is that provided to DIH from another module that does
>> know about ZK?
>
> This is accomplished indirectly through a resource loader in the
> SolrCore object that is responsible for config files.  Also, the
> dataimport handler is created by the main Solr code which then hands the
> configuration to the dataimport module.  DIH itself does not know about
> zookeeper.

ZkPropertiesWriter seems to know a little..

>
>> Either way, it is entirely sub-optimal to have SolrCloud store "all"
>> its configuration in ZK, but still require manually storing and
>> updating files on specific nodes in order to influence DIH. If a
>> server is mistakenly not updated, or manually modified locally on
>> disk, that node would start indexing documents differently than other
>> replicas, which sounds dangerous and scary!
>
> The entity processor you are using accesses files through a Java
> interface for mounted filesystems.  As already mentioned, it does not
> know about zookeeper.
>
>> If there is not a ZkFileDataSource, it shouldn't be too tricky to add
>> one... I'll see how much I dislike having config files on the host...
>
> Creating your own DIH class would be the only solution available right now.
>
> I don't know how useful this would be in practice.  Without special
> config in multiple places, Zookeeper limits the size of the files it
> contains to 1MB.  It is not designed to deal with a large amount of data
> at once.

This is not large amounts of data, it is a 5kb XML file containing
configuration of what tables to query for what fields and how to map
them in to the document.

>
> You could submit a feature request in Jira, but unless you supply a
> complete patch that survives the review process, I do not know how
> likely an implementation would be.

We've already started on an implementation, based around FileDataSource
and using SolrZkClient, which we will deploy as an additional library
while that process is ongoing or if it doesn't survive review.

Cheers

Tom


Re: SolrCloud, DIH, and XPathEntityProcessor

2016-01-12 Thread Tom Evans
On Tue, Jan 12, 2016 at 2:32 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 1/12/2016 6:05 AM, Tom Evans wrote:
>> Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having
>> some problems with a DIH config that attempts to load an XML file and
>> iterate through the nodes in that file, it trys to load the file from
>> disk instead of from zookeeper.
>>
>> <entity dataSource="lookup_conf"
>> rootEntity="false"
>> name="lookups"
>> processor="XPathEntityProcessor"
>> url="lookup_conf.xml"
>> forEach="/lookups/lookup">
>>
>> The file exists in zookeeper, adjacent to the data_import.conf in the
>> lookups_config conf folder.
>
> SolrCloud puts all the *config* for Solr into zookeeper, and adds a new
> abstraction for indexes (the collection), but other parts of Solr like
> DIH are not really affected.  The entity processors in DIH cannot
> retrieve data from zookeeper.  They do not know how.

That makes no sense whatsoever. DIH loads the data_import.conf from ZK
just fine, or is that provided to DIH from another module that does
know about ZK?

Either way, it is entirely sub-optimal to have SolrCloud store "all"
its configuration in ZK, but still require manually storing and
updating files on specific nodes in order to influence DIH. If a
server is mistakenly not updated, or manually modified locally on
disk, that node would start indexing documents differently than other
replicas, which sounds dangerous and scary!

If there is not a ZkFileDataSource, it shouldn't be too tricky to add
one... I'll see how much I dislike having config files on the host...

Cheers

Tom


Re: Defining SOLR nested fields

2015-12-14 Thread Tom Evans
On Sun, Dec 13, 2015 at 6:40 PM, santosh sidnal
 wrote:
> Hi All,
>
> I want to define nested fileds in SOLR using schema.xml. we are using Apache
> Solr 4.7.0.
>
> i see some links which says how to do, but not sure how can i do it in
> schema.xml
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
>
>
> any help over here is appreciable.
>

With nested documents, it is better to not think of them as
"children", but as related documents. All the documents in your index
will follow exactly the same schema, whether they are "children" or
"parents", and the nested aspect of a document simply allows you to
restrict your queries based upon that relationship.

Solr is extremely efficient dealing with sparse documents (docs with
only a few fields defined), so one way is to define all your fields
for "parent" and "child" in the schema, and only use the appropriate
ones in the right document. Another way is to use a schema-less
structure, although I'm not a fan of that for error checking reasons.
You can also define a suffix or prefix for fields that you use as part
of your methodology, so that you know what domain it belongs in, but
that would just be for your benefit, Solr would not complain if you
put a "child" field in a parent or vice-versa.
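
For illustration, in recent versions an indexed block and a block-join
query look something like this (doc_type and the other field names are
made up):

  [{ "id": "p1", "doc_type": "parent", "title": "Parent A",
     "_childDocuments_": [
       { "id": "c1", "doc_type": "child", "title": "Child A1" }
     ]
  }]

  q={!parent which="doc_type:parent"}title:"Child A1"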

Cheers

Tom

PS:

I would not use Solr 4.7 for this. Nested docs are a new-ish feature,
you may encounter bugs that have been fixed in later versions, and
performance has certainly been improved in later versions. Faceting on
a specific domain (eg, on children or parents) is only supported by
the JSON facet API, which was added in 5.2, and the current stable
version of Solr is 5.4.


Moving to SolrCloud, specifying dataDir correctly

2015-12-14 Thread Tom Evans
Hi all

We're currently in the process of migrating our distributed search
running on 5.0 to SolrCloud running on 5.4, and setting up a test
cluster for performance testing etc.

We have several cores/collections, and in each core's solrconfig.xml,
we were specifying an empty <dataDir>, and specifying the same
core.baseDataDir in core.properties.

When I tried this in SolrCloud mode, specifying
"-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine
for the first collection, but then the second collection tried to use
the same directory to store its index, which obviously failed. I fixed
this by changing solrconfig.xml in each collection to specify a
specific directory, like so:

  <dataDir>${solr.data.dir:}products</dataDir>

Looking back after the weekend, I'm not a big fan of this. Is there a
way to add a core.properties to ZK, or a way to specify
core.baseDatadir on the command line, or just a better way of handling
this that I'm not aware of?

Cheers

Tom


Re: Moving to SolrCloud, specifying dataDir correctly

2015-12-14 Thread Tom Evans
On Mon, Dec 14, 2015 at 1:22 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 12/14/2015 10:49 AM, Tom Evans wrote:
>> When I tried this in SolrCloud mode, specifying
>> "-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine
>> for the first collection, but then the second collection tried to use
>> the same directory to store its index, which obviously failed. I fixed
>> this by changing solrconfig.xml in each collection to specify a
>> specific directory, like so:
>>
>>   <dataDir>${solr.data.dir:}products</dataDir>
>>
>> Looking back after the weekend, I'm not a big fan of this. Is there a
>> way to add a core.properties to ZK, or a way to specify
>> core.baseDatadir on the command line, or just a better way of handling
>> this that I'm not aware of?
>
> Since you're running SolrCloud, just let Solr handle the dataDir, don't
> try to override it.  It will default to "data" relative to the
> instanceDir.  Each instanceDir is likely to be in the solr home.
>
> With SolrCloud, your cores will not contain a "conf" directory (unless
> you create it manually), therefore the on-disk locations will be *only*
> data, there's not really any need to have separate locations for
> instanceDir and dataDir.  All active configuration information for
> SolrCloud is in zookeeper.
>

That makes sense, but I guess I was asking the wrong question :)

We have our SSDs mounted on /data/solr, which is where our indexes
should go, but our solr install is on /opt/solr, with the default solr
home in /opt/solr/server/solr. How do we change where the indexes get
put so they end up on the fast storage?

Cheers

Tom


Re: Best way to track cumulative GC pauses in Solr

2015-11-16 Thread Tom Evans
On Fri, Nov 13, 2015 at 4:50 PM, Walter Underwood  wrote:
> Also, what GC settings are you using? We may be able to make some suggestions.
>
> Cumulative GC pauses aren’t very interesting to me. I’m more interested in 
> the longest ones, 90th percentile, 95th, etc.
>

Any advice would be great, but what I'm primarily interested in is how
people are monitoring these statistics in real time, for all time, on
production servers. Eg, for looking at the disk or RAM usage of one of
my servers, I can look at the historical usage in the last week, last
month, last year and so on.

I need to get these stats in to the same monitoring tools as we use
for monitoring every other vital aspect of our servers. Looking at log
files can be useful, but I don't want to keep arbitrarily large log
files on our servers, nor extract data from them, I want to record it
for posterity in one system that understands sampling.

We already use and maintain our own munin systems, so I'm not
interested in paid-for equivalents of munin - regardless of how simple
to set up they are, they don't integrate with our other performance
monitoring stats, and I would never get budget anyway.

So really:

1) Is it OK to turn JMX monitoring on on production systems? The
comments in solr.in.sh suggest not.

2) What JMX beans and attributes should I be using to monitor GC
pauses, particularly maximum length of a single pause in a period, and
the total length of pauses in that period?
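
For what it's worth, the standard hotspot GC MBeans expose cumulative
counters that munin can diff between polls - the object names depend on
the collectors in use, e.g.:

  java.lang:type=GarbageCollector,name=ParNew              CollectionCount, CollectionTime
  java.lang:type=GarbageCollector,name=ConcurrentMarkSweep CollectionCount, CollectionTime

CollectionTime is cumulative milliseconds spent in that collector, so
the delta between two polls gives total GC time for the period; the
longest individual pause isn't exposed there and still has to come from
the GC log.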

Cheers

Tom


Best way to track cumulative GC pauses in Solr

2015-11-13 Thread Tom Evans
Hi all

We have some issues with our Solr servers spending too much time
paused doing GC. From turning on gc debug, and extracting numbers from
the GC log, we're getting an idea of just how much of a problem.

I'm currently doing this in a hacky, inefficient way:

grep -h 'Total time for which application threads were stopped:' solr_gc* \
| awk '($11 > 0.3) { print $1, $11 }' \
| sed 's#:.*:##' \
| sort -n \
| sum_by_date.py

(Yes, I really am using sed, grep and awk all in one line. Just wrong :)

The "sum_by_date.py" program simply adds up all the values with the
same first column, and remembers the largest value seen. This is
giving me the cumulative GC time for extended pauses (over 0.5s), and
the maximum pause seen in a given time period (hourly), eg:

2015-11-13T11 119.124037 2.203569
2015-11-13T12 184.683309 3.156565
2015-11-13T13 65.934526 1.978202
2015-11-13T14 63.970378 1.411700


This is fine for seeing that we have a problem. However, really I need
to get this in to our monitoring systems - we use munin. I'm
struggling to work out the best way to extract this information for
our monitoring systems, and I think this might be my naivety about
Java, and working out what should be logged.

I've turned on JMX debugging, and looking at the different beans
available using jconsole, but I'm drowning in information. What would
be the best thing to monitor?

Ideally, like the stats above, I'd like to know the cumulative time
spent paused in GC since the last poll, and the longest GC pause that
we see. munin polls every 5 minutes, are there suitable counters
exposed by JMX that it could extract?

Thanks in advance

Tom


Re: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id

2015-11-02 Thread Tom Evans
On Mon, Nov 2, 2015 at 1:38 PM, fabigol  wrote:
> Thank
> All works.
> I have 2 last questions:
> How can i put 0 by defaults " clean" during a indexation?
>
> To conclure, i wand to understand:
>
>
> Requests: 7 (1/s), Fetched: 452447 (45245/s), Skipped: 0, Processed: 17433
> (1743/s)
>
> What is the "requests"?
> What is 'Fetched"?
> What is "Processed"?
>
> Thank again for your answer
>

Depends upon how DIH is configured - different things return different
numbers. For a SqlEntityProcessor, "Requests" is the number of SQL
queries, "Fetched" is the number of rows read from those queries, and
"Processed" is the number of documents processed by SOLR.

> For the second question, i try:
> 
> false
> 
>
> and
> true
> false
>

Putting things in "invariants" overrides whatever is passed for that
parameter in the request parameters. By putting clean=false in
"invariants", you are making it impossible to clean + index as part of
DIH, because "clean" is always false.
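
For reference, the usual way to make clean default to false while still
allowing clean=true on the URL is to put it in "defaults" rather than
"invariants" - roughly:

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <str name="clean">false</str>
    </lst>
  </requestHandler>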

Cheers

Tom


Re: Checking of Solr Memory and Disk usage

2015-04-24 Thread Tom Evans
On Fri, Apr 24, 2015 at 8:31 AM, Zheng Lin Edwin Yeo
edwinye...@gmail.com wrote:
 Hi,

 So has anyone knows what is the issue with the Heap Memory Usage reading
 showing the value -1. Should I open an issue in Jira?

I have solr 4.8.1 and solr 5.0.0 servers, on the solr 4.8.1 servers
the core statistics have values for heap memory, on the solr 5.0.0
ones I also see the value -1. This is with CentOS 6/Java 1.7 OpenJDK
on both versions.

I don't see this issue in the fixed bugs in 5.1.0, but I only looked
at the headlines of the tickets..

http://lucene.apache.org/solr/5_1_0/changes/Changes.html#v5.1.0.bug_fixes

Cheers

Tom


Re: Confusing SOLR 5 memory usage

2015-04-21 Thread Tom Evans
I do apologise for wasting anyone's time on this, the PEBKAC (my
keyboard and chair unfortunately). When adding the new server to
haproxy, I updated the label for the balancer entry to the new server,
but left the host name the same, so the server that wasn't using any
RAM... wasn't getting any requests.

Again, sorry!

Tom

On Tue, Apr 21, 2015 at 11:54 AM, Tom Evans tevans...@googlemail.com wrote:
 We monitor them with munin, so I have charts if attachments are
 acceptable? Having said that, they have only been running for a day
 with this memory allocation..

 Describing them, the master consistently has 8GB used for apps, the
 8GB used in cache, whilst the slave consistently only uses ~1.5GB for
 apps, 14GB used in cache.

 We are trying to use our SOLR servers to do a lot more facet queries,
 previously we were mainly doing searches, and the
 SolrPerformanceProblems wiki page mentions that faceting (amongst
 others) require a lot of JVM heap, so I'm confused why it is not using
 the heap we've allocated on one server, whilst it is on the other
 server. Perhaps our master server needs even more heap?

 Also, my infra guy is wondering why I asked him to add more memory to
 the slave server, if it is just in cache, although I did try to
 explain that ideally, I'd have even more in cache - we have about 35GB
 of index data.

 Cheers

 Tom

 On Tue, Apr 21, 2015 at 11:25 AM, Markus Jelsma
 markus.jel...@openindex.io wrote:
 Hi - what do you see if you monitor memory over time? You should see a 
 typical saw tooth.
 Markus

 -Original message-
 From:Tom Evans tevans...@googlemail.com
 Sent: Tuesday 21st April 2015 12:22
 To: solr-user@lucene.apache.org
 Subject: Confusing SOLR 5 memory usage

 Hi all

 I have two SOLR 5 servers, one is the master and one is the slave.
 They both have 12 cores, fully replicated and giving identical results
 when querying them. The only difference between configuration on the
 two servers is that one is set to slave from the other - identical
 core configs and solr.in.sh.

 They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are
 setting the heap size identically:

 SOLR_JAVA_MEM=-Xms512m -Xmx7168m

 The two servers are balanced behind haproxy, and identical numbers and
 types of queries flow to both servers. Indexing only happens once a
 day.

 When viewing the memory usage of the servers, the master server's JVM
 has 8.8GB RSS, but the slave only has 1.2GB RSS.

 Can someone hit me with the cluebat please? :)

 Cheers

 Tom



Re: Confusing SOLR 5 memory usage

2015-04-21 Thread Tom Evans
We monitor them with munin, so I have charts if attachments are
acceptable? Having said that, they have only been running for a day
with this memory allocation..

Describing them, the master consistently has 8GB used for apps, the
8GB used in cache, whilst the slave consistently only uses ~1.5GB for
apps, 14GB used in cache.

We are trying to use our SOLR servers to do a lot more facet queries,
previously we were mainly doing searches, and the
SolrPerformanceProblems wiki page mentions that faceting (amongst
others) require a lot of JVM heap, so I'm confused why it is not using
the heap we've allocated on one server, whilst it is on the other
server. Perhaps our master server needs even more heap?

Also, my infra guy is wondering why I asked him to add more memory to
the slave server, if it is just in cache, although I did try to
explain that ideally, I'd have even more in cache - we have about 35GB
of index data.

Cheers

Tom

On Tue, Apr 21, 2015 at 11:25 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Hi - what do you see if you monitor memory over time? You should see a 
 typical saw tooth.
 Markus

 -Original message-
 From:Tom Evans tevans...@googlemail.com
 Sent: Tuesday 21st April 2015 12:22
 To: solr-user@lucene.apache.org
 Subject: Confusing SOLR 5 memory usage

 Hi all

 I have two SOLR 5 servers, one is the master and one is the slave.
 They both have 12 cores, fully replicated and giving identical results
 when querying them. The only difference between configuration on the
 two servers is that one is set to slave from the other - identical
 core configs and solr.in.sh.

 They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are
 setting the heap size identically:

 SOLR_JAVA_MEM=-Xms512m -Xmx7168m

 The two servers are balanced behind haproxy, and identical numbers and
 types of queries flow to both servers. Indexing only happens once a
 day.

 When viewing the memory usage of the servers, the master server's JVM
 has 8.8GB RSS, but the slave only has 1.2GB RSS.

 Can someone hit me with the cluebat please? :)

 Cheers

 Tom



Confusing SOLR 5 memory usage

2015-04-21 Thread Tom Evans
Hi all

I have two SOLR 5 servers, one is the master and one is the slave.
They both have 12 cores, fully replicated and giving identical results
when querying them. The only difference between configuration on the
two servers is that one is set to slave from the other - identical
core configs and solr.in.sh.

They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are
setting the heap size identically:

SOLR_JAVA_MEM=-Xms512m -Xmx7168m

The two servers are balanced behind haproxy, and identical numbers and
types of queries flow to both servers. Indexing only happens once a
day.

When viewing the memory usage of the servers, the master server's JVM
has 8.8GB RSS, but the slave only has 1.2GB RSS.

Can someone hit me with the cluebat please? :)

Cheers

Tom


Re: Setting up SOLR 5 from an RPM

2015-03-25 Thread Tom Evans
On Wed, Mar 25, 2015 at 2:40 PM, Shawn Heisey apa...@elyograg.org wrote:
 I think you will only need to change the ownership of the solr home and
 the location where the .war file is extracted, which by default is
 server/solr-webapp.  The user must be able to *read* the program data,
 but should not need to write to it. If you are using the start script
 included with Solr 5 and one of the examples, I believe the logging
 destination will also be located under the solr home, but you should
 make sure that's the case.


Thanks Shawn, this sort of makes sense. The thing which I cannot seem
to do is change the location where the war file is extracted. I think
this is probably because, as of solr 5, I am not supposed to know or
be aware that there is a war file, or that the war file is hosted in
jetty, which makes it tricky to specify the jetty temporary directory.

Our use case is that we want to create a single system image that
would be usable for several projects, each project would check out its
solr home and run solr as their own user (possibly on the same
server). Eg, /data/projectA being a solr home for one project,
/data/projectB being a solr home for another project, both running
solr from the same location.

Also, on a dev server, I want to install solr once, and each member of
my team run it from that single location. Because they cannot change
the temporary directory, and they cannot all own server/solr-webapp,
this does not work and they must each have their own copy of the solr
install.

I think the way we will go for this is, in production, to run all our
solr instances as the solr user, who will own the files in /opt/solr,
and have their solr home directory wherever they choose. In dev, we
will just do something...

Cheers

Tom


Re: Setting up SOLR 5 from an RPM

2015-03-25 Thread Tom Evans
On Tue, Mar 24, 2015 at 4:00 PM, Tom Evans tevans...@googlemail.com wrote:
 Hi all

 We're migrating to SOLR 5 (from 4.8), and our infrastructure guys
 would prefer we installed SOLR from an RPM rather than extracting the
 tarball where we need it. They are creating the RPM file themselves,
 and it installs an init.d script and the equivalent of the tarball to
 /opt/solr.

 We're having problems running SOLR from the installed files, as SOLR
 wants to (I think) extract the WAR file and create various temporary
 files below /opt/solr/server.

From the SOLR 5 reference guide, section Managing SOLR, sub-section
Taking SOLR to production, it seems changing the ownership of the
installed files to the user that will run SOLR is an explicit
requirement if you do not wish to run as root.

It would be better if this was not required. With most applications
you do not normally require permission to modify the installed files
in order to run the application, eg I do not need write permission to
/usr/share/vim to run vim, it is a shame I need write permission to
/opt/solr to run solr.

Cheers

Tom


Setting up SOLR 5 from an RPM

2015-03-24 Thread Tom Evans
Hi all

We're migrating to SOLR 5 (from 4.8), and our infrastructure guys
would prefer we installed SOLR from an RPM rather than extracting the
tarball where we need it. They are creating the RPM file themselves,
and it installs an init.d script and the equivalent of the tarball to
/opt/solr.

We're having problems running SOLR from the installed files, as SOLR
wants to (I think) extract the WAR file and create various temporary
files below /opt/solr/server.

We currently have this structure:

/data/solr - root directory of our solr instance
/data/solr/{logs,run} - log/run directories
/data/solr/cores - configuration for our cores and solr.in.sh
/opt/solr - the RPM installed solr 5

The user running solr can modify anything under /data/solr, but
nothing under /opt/solr.

Is this sort of configuration supported? Am I missing some variable in
our solr.in.sh that sets where temporary files can be extracted? We
currently set:

SOLR_PID_DIR=/data/solr/run
SOLR_HOME=/data/solr/cores
SOLR_LOGS_DIR=/data/solr/logs


Cheers

Tom


Determining which field caused a document to not be imported

2014-10-03 Thread Tom Evans
Hi all

I recently rewrote our SOLR 4.8 dataimport to read from a set of
denormalised DB tables, in an attempt to increase full indexing speed.
When I tried it out however, indexing broke telling me that
java.lang.Long cannot be cast to java.lang.Integer (full stack
below, with the document elided). From googling, this tends to be some
field that is being selected out as a long, where it should probably
be cast as a string.

Unfortunately, our documents have some 400+ fields and over 100
entities; is there another way to determine which field could not be
cast from Long to Integer other than disabling each integer field in
turn?

Cheers

Tom


Exception while processing: variant document :
SolrInputDocument(fields: [(removed)]):
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.ClassCastException: java.lang.Long cannot be cast to
java.lang.Integer
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:63)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:246)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:477)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:331)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:239)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:464)
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast
to java.lang.Integer
at java.lang.Integer.compareTo(Integer.java:52)
at java.util.TreeMap.getEntry(TreeMap.java:346)
at java.util.TreeMap.get(TreeMap.java:273)
at 
org.apache.solr.handler.dataimport.SortedMapBackedCache.iterator(SortedMapBackedCache.java:147)
at 
org.apache.solr.handler.dataimport.DIHCacheSupport.getIdCacheData(DIHCacheSupport.java:179)
at 
org.apache.solr.handler.dataimport.DIHCacheSupport.getCacheData(DIHCacheSupport.java:145)
at 
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:129)
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
... 10 more


Re: Determining which field caused a document to not be imported

2014-10-03 Thread Tom Evans
On Fri, Oct 3, 2014 at 2:24 PM, Shawn Heisey apa...@elyograg.org wrote:
 Can you give us the entire stacktrace, with complete details from any
 caused by sections?  Also, is this 4.8.0 or 4.8.1?


Thanks Shawn, this is SOLR 4.8.1 and here is the full traceback from the log:

95191 [Thread-21] INFO
org.apache.solr.update.processor.LogUpdateProcessor  – [products]
webapp=/products path=/dataimport-from-denorm
params={id=2148732&optimize=false&clean=false&indent=true&commit=true&verbose=false&command=full-import&debug=false&wt=json}
status=0 QTime=32 {} 0 32
95199 [Thread-21] ERROR
org.apache.solr.handler.dataimport.DataImporter  – Full Import
failed:java.lang.RuntimeException: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:278)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:464)
Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:418)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:331)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:239)
... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:63)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:246)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:477)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416)
... 5 more
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
at java.lang.Long.compareTo(Long.java:50)
at java.util.TreeMap.getEntry(TreeMap.java:346)
at java.util.TreeMap.get(TreeMap.java:273)
at org.apache.solr.handler.dataimport.SortedMapBackedCache.iterator(SortedMapBackedCache.java:147)
at org.apache.solr.handler.dataimport.DIHCacheSupport.getIdCacheData(DIHCacheSupport.java:179)
at org.apache.solr.handler.dataimport.DIHCacheSupport.getCacheData(DIHCacheSupport.java:145)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:129)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
... 10 more

95199 [Thread-21] INFO  org.apache.solr.update.UpdateHandler  – start rollback{}

I've now tracked it down to a single entity: it selects some content
out of the database and then uses that data to look up other fields
from sub-entities that have SortedMapBackedCache caching in use, but
I'm still not sure how to fix it.
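
As far as I can tell, SortedMapBackedCache just keeps the sub-entity
rows in a TreeMap keyed on the cacheKey column, so the failure can be
reproduced with nothing but the JDK. A minimal sketch (the key value
and country name here are made up, and I'm assuming the cache really
is just a TreeMap underneath):

import java.util.TreeMap;

public class CacheKeyMismatch {
    public static void main(String[] args) {
        // Suppose the sub-entity's xid column comes back from JDBC as
        // java.lang.Integer, so the cache ends up keyed on Integers...
        TreeMap<Object, String> cache = new TreeMap<>();
        cache.put(Integer.valueOf(44), "United Kingdom");

        // ...while the parent entity's country_id comes back as a
        // java.lang.Long. TreeMap.get() casts the lookup key to Comparable
        // and ends up calling Long.compareTo(Integer), which throws
        // "java.lang.ClassCastException: java.lang.Integer cannot be cast
        // to java.lang.Long" - the same error as in the trace above.
        Object lookup = Long.valueOf(44);
        System.out.println(cache.get(lookup));
    }
}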

E.g. the original entity selects out country_id, which is then used
by this entity:

<entity dataSource="products" name="country_lookup" query="
  SELECT
    lk_country.id AS xid,
    IF(LENGTH(english), CAST(english AS CHAR), description) AS country
  FROM lk_country
  INNER JOIN nl_strings ON lk_country.description_sid=nl_strings.id"
  cacheKey="xid"
  cacheLookup="product.country_id"
  cacheImpl="SortedMapBackedCache">
  <field column="country" name="country"/>
</entity>

I tried converting the selected data to SIGNED INTEGER, e.g.
CONVERT(country_id, SIGNED INTEGER) AS country_id, but this did not
have the desired effect.

The source database is MySQL, and the source column for country_id is
`country_id` smallint(6) NOT NULL default '0'.

Again, I'm not 100% sure that it is even the country field that
causes this - there are several SortedMapBackedCache sub-entities, but
they are all analogous to this one.

Thanks in advance

Tom


Re: Determining which field caused a document to not be imported

2014-10-03 Thread Tom Evans
On Fri, Oct 3, 2014 at 3:13 PM, Tom Evans tevans...@googlemail.com wrote:
> I tried converting the selected data to SIGNED INTEGER, e.g.
> CONVERT(country_id, SIGNED INTEGER) AS country_id, but this did not
> have the desired effect.

However, changing them to be cast to CHAR changed the error message to
"java.lang.Integer cannot be cast to java.lang.String".

I guess this is saying that the type of the map key must match the
type of the key coming from the parent entity (which is logical), so
my question is: what SQL type do I need to select out to get a
java.lang.Integer, to match what the map is expecting?
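
As far as I can tell from the Connector/J type mappings, a smallint(6)
column arrives as java.lang.Integer, whereas CONVERT(..., SIGNED
INTEGER) produces a BIGINT and therefore arrives as java.lang.Long -
which might explain why the CONVERT attempt just moved the problem
around rather than fixing it. A quick standalone check along these
lines shows what class each column actually comes back as (just a
sketch: the JDBC URL, credentials and table name below are
placeholders, not our real ones):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ColumnTypeCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details - point this at the DIH source database.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/example_db", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT country_id,"
                     + " CONVERT(country_id, SIGNED INTEGER) AS converted"
                     + " FROM example_table LIMIT 1")) {
            if (rs.next()) {
                // Expected: java.lang.Integer for the smallint column,
                // java.lang.Long for the CONVERT(..., SIGNED) expression.
                System.out.println("country_id -> "
                        + rs.getObject("country_id").getClass().getName());
                System.out.println("converted  -> "
                        + rs.getObject("converted").getClass().getName());
            }
        }
    }
}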

Cheers

Tom


Re: Determining which field caused a document to not be imported

2014-10-03 Thread Tom Evans
On Fri, Oct 3, 2014 at 3:24 PM, Tom Evans tevans...@googlemail.com wrote:
> On Fri, Oct 3, 2014 at 3:13 PM, Tom Evans tevans...@googlemail.com wrote:
>> I tried converting the selected data to SIGNED INTEGER, e.g.
>> CONVERT(country_id, SIGNED INTEGER) AS country_id, but this did not
>> have the desired effect.
>
> However, changing them to be cast to CHAR changed the error message to
> "java.lang.Integer cannot be cast to java.lang.String".
>
> I guess this is saying that the type of the map key must match the
> type of the key coming from the parent entity (which is logical), so
> my question is: what SQL type do I need to select out to get a
> java.lang.Integer, to match what the map is expecting?


I rewrote the query for the map, which was doing strange casts itself
(integer-to-integer casts). This meant that the values from the parent
query were the same type as those in the map query, and no funky casts
were needed anywhere.

However, I still don't have a way to determine which field is failing
when indexing fails like this, and it would be neat to have a way to
do so for future debugging.
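
For what it's worth, the kind of check I'd have liked is roughly this -
nothing Solr-specific, just a hypothetical helper that compares the
class of a lookup value against a sample key already in the cache map,
so a mismatch gets reported by entity name instead of surfacing as a
bare ClassCastException inside TreeMap:

import java.util.Map;
import java.util.TreeMap;

public final class CacheKeyCheck {

    private CacheKeyCheck() {
    }

    // Report a key-type mismatch between a parent entity's lookup value
    // and the keys already stored in a cache map, naming the entity.
    public static void check(String entityName, Object lookupValue, Map<Object, ?> cache) {
        if (lookupValue == null || cache.isEmpty()) {
            return;
        }
        Object sampleKey = cache.keySet().iterator().next();
        if (sampleKey != null && !lookupValue.getClass().equals(sampleKey.getClass())) {
            System.err.printf("Entity '%s': lookup key is %s but cached keys are %s%n",
                    entityName,
                    lookupValue.getClass().getName(),
                    sampleKey.getClass().getName());
        }
    }

    public static void main(String[] args) {
        // Example values only: an Integer-keyed cache probed with a Long.
        Map<Object, String> cache = new TreeMap<>();
        cache.put(Integer.valueOf(44), "United Kingdom");
        check("country_lookup", Long.valueOf(44), cache);
    }
}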

Cheers

Tom