Re: Getting to grips with auto-scaling
Hi Radu,

Thanks for the reply - I'm starting to look that way myself: create a different collection for each set of data, so that I can more easily control the scaling on each collection, e.g. to increase the replication factor on those that will be queried more.

I was looking at Category Routed Aliases, but they seem to have quite a few gotchas:

* Can't restrict the collections queried - even specifying the exact collection to query, e.g. "collections=items__CRA__2020" (which exists), returns no results. Even querying the underlying collection directly by its name returns no results. I only get results with collections=items__CRA - it's as if the underlying collection thinks its name really is "items__CRA" rather than "items__CRA__2020".
* Some problems with indexing to a new category - I get errors the first time a category is encountered.

Looks like it will be manually set-up and managed collections and aliases for now.

Cheers

Tom

On Mon, Jun 8, 2020 at 12:43 PM Radu Gheorghe wrote:
>
> Hi Tom,
>
> To your last two questions, I'd like to venture an alternative design: have
> dedicated "hot" and "warm" nodes. That is, 2020+lists will go to the hot
> tier, and 2019, 2018, 2017+lists go to the warm tier.
>
> Then you can scale the hot tier based on your query load. For the warm
> tier, I assume there will be less need for scaling, and if there is, I guess
> it's less important for shards of each index to be perfectly balanced (so a
> simple "make sure cores are evenly distributed" should be enough).
>
> Granted, this design isn't as flexible as the one you suggested, but it's
> simpler. So simple that I've seen it done without autoscaling (just a few
> scripts to run when you add nodes in each tier).
>
> Best regards,
> Radu
>
> https://sematext.com
>
> On Fri, 5 Jun 2020 at 21:59, Tom Evans wrote:
>
> > Hi
> >
> > I'm trying to get a handle on the newer auto-scaling features in Solr.
> > We're in the process of upgrading an older SolrCloud cluster from 5.5
> > to 8.5, and re-architecting it slightly to improve performance and
> > automate operations.
> >
> > If I boil it down slightly, currently we have two collections, "items"
> > and "lists". Both collections have just one shard. We publish new data
> > to "items" once each day, and our users search and do analysis on
> > them, whilst "lists" contains NRT user-specified collections of ids
> > from items, which we join to from "items" in order to allow them to
> > restrict their searches/analysis to just docs in their curated lists.
> >
> > Most of our searches have specific date ranges in them, usually only
> > from the last 3 years or so, but sometimes we need to do searches
> > across all the data. With the new setup, we want to:
> >
> > * shard by date (year) to make the hottest data available in smaller shards
> > * have more nodes with these shards than we do of the older data
> > * be able to add/remove nodes predictably based upon our clients'
> > (predictable) query load
> > * use TLOG for "items" and NRT for "lists", to avoid unnecessary
> > indexing load for "items" and have NRT for "lists"
> > * spread cores across two AZs
> >
> > With that in mind, I came up with a bunch of simplified rules for
> > testing, with just 4 shards for "items":
> >
> > * "lists" collection has one NRT replica on each node
> > * "items" collection shard 2020 has one TLOG replica on each node
> > * "items" collection shard 2019 has one TLOG replica on 75% of nodes
> > * "items" collection shards 2018 and 2017 each have one TLOG replica
> > on 50% of nodes
> > * all shards have at least 2 replicas if number of nodes > 1
> > * no node should have 2 replicas of the same shard
> > * number of cores should be balanced across nodes
> >
> > E.g., with 1 node, I want to see this topology:
> > A: items: 2020, 2019, 2018, 2017 + lists
> >
> > with 2 nodes:
> > A: items: 2020, 2019, 2018, 2017 + lists
> > B: items: 2020, 2019, 2018, 2017 + lists
> >
> > and if I add two more nodes:
> > A: items: 2020, 2019, 2018 + lists
> > B: items: 2020, 2019, 2017 + lists
> > C: items: 2020, 2019, 2017 + lists
> > D: items: 2020, 2018 + lists
> >
> > To the questions:
> >
> > * The type of replica created when nodeAdded is triggered can't be set
> > per collection. Either everything gets NRT or everything gets TLOG.
> > Even if I specify nrtReplicas=0 when creating a collection, nodeAdded
> > will add NRT replicas if configured that way.
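For the manually-managed route mentioned above, a rough sketch in Python of creating per-year collections and a combined alias via the Collections API (the URL, configset and collection names are illustrative, not taken from the thread):

import requests

SOLR = "http://localhost:8983/solr/admin/collections"
YEARS = (2017, 2018, 2019, 2020)

# one collection per year, so each year's replication can be scaled independently
for year in YEARS:
    requests.get(SOLR, params={
        "action": "CREATE",
        "name": "items_%d" % year,
        "collection.configName": "items",  # assumed configset name
        "numShards": "1",
    }).raise_for_status()

# one alias spanning every year, for "search everything" queries
requests.get(SOLR, params={
    "action": "CREATEALIAS",
    "name": "items",
    "collections": ",".join("items_%d" % y for y in YEARS),
}).raise_for_status()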
Indexing error when using Category Routed Alias
Hi all

1. Set up a simple 1-node SolrCloud test setup using docker-compose, solr:8.5.2, zookeeper:3.5.8.
2. Upload a configset.
3. Create two collections, one standard collection, one CRA, both using the same configset.

legacy:

action=CREATE&name=products_old&collection.configName=products&autoAddReplicas=true&numShards=1&maxShardsPerNode=-1

CRA:

{
  "create-alias": {
    "name": "products_20200609",
    "router": {
      "name": "category",
      "field": "date_published.year",
      "maxCardinality": 30,
      "mustMatch": "(199[6-9]|20[0,1,2][0-9])"
    },
    "create-collection": {
      "config": "products",
      "numShards": 1,
      "nrtReplicas": 1,
      "tlogReplicas": 0,
      "maxShardsPerNode": 1,
      "autoAddReplicas": true
    }
  }
}

Post a small selection of docs in JSON format using curl to the non-CRA collection -> OK

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' \
>     -d@/resources/product-json/products-12381742.json \
>     http://solr:8983/solr/products_old/update/json/docs

{
  "responseHeader":{
    "rf":1,
    "status":0,
    "QTime":12541}}

The same documents, sent to the CRA -> boom

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' \
>     -d@/resources/product-json/products-12381742.json \
>     http://solr:8983/solr/products_20200609/update/json/docs

{
  "responseHeader":{
    "status":400,
    "QTime":2422},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException",
      "error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException",
      "root-error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException"],
    "msg":"Async exception during distributed update: Error from server at http://10.20.36.130:8983/solr/products_20200609__CRA__2005_shard1_replica_n1/: null\n\n\n\nrequest: http://10.20.36.130:8983/solr/products_20200609__CRA__2005_shard1_replica_n1/\nRemote error message: Cannot parse provided JSON: JSON Parse Error: char=\u0002,position=0 AFTER='\u0002' BEFORE='2update.contentType0applicat'",
    "code":400}}

Repeating the request again to the CRA -> OK

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' \
>     -d@/resources/product-json/products-12381742.json \
>     http://solr:8983/solr/products_20200609/update/json/docs

{
  "responseHeader":{
    "rf":1,
    "status":0,
    "QTime":11446}}

It seems to be related to when a new collection needs to be created by the CRA.
The relevant logs:

2020-06-09 02:12:56.107 INFO (OverseerThreadFactory-9-thread-3-processing-n:10.20.36.130:8983_solr) [ ] o.a.s.c.a.c.CreateCollectionCmd Create collection products_20200609__CRA__2005

2020-06-09 02:12:56.232 INFO (OverseerStateUpdate-72169202568593409-10.20.36.130:8983_solr-n_00) [ ] o.a.s.c.o.SliceMutator createReplica() {
  "operation":"ADDREPLICA",
  "collection":"products_20200609__CRA__2005",
  "shard":"shard1",
  "core":"products_20200609__CRA__2005_shard1_replica_n1",
  "state":"down",
  "base_url":"http://10.20.36.130:8983/solr",
  "node_name":"10.20.36.130:8983_solr",
  "type":"NRT",
  "waitForFinalState":"false"}

2020-06-09 02:12:56.444 INFO (qtp90045638-25) [ x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&coreNodeName=core_node2&collection.configName=products&newCollection=true&name=products_20200609__CRA__2005_shard1_replica_n1&action=CREATE&numShards=1&collection=products_20200609__CRA__2005&shard=shard1&wt=javabin&version=2&replicaType=NRT

2020-06-09 02:12:56.476 INFO (qtp90045638-25) [c:products_20200609__CRA__2005 s:shard1 r:core_node2 x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.c.SolrConfig Using Lucene MatchVersion: 8.5.1

2020-06-09 02:12:56.512 INFO (qtp90045638-25) [c:products_20200609__CRA__2005 s:shard1 r:core_node2 x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.s.IndexSchema [products_20200609__CRA__2005_shard1_replica_n1] Schema name=variants

2020-06-09 02:12:56.543 INFO (qtp90045638-25) [c:products_20200609__CRA__2005 s:shard1 r:core_node2 x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.r.RestManager Registered ManagedResource impl org.apache.solr.rest.schema.analysis.ManagedSynonymFilterFactory$SynonymManager for path /schema/analysis/synonyms/default

2020-06-09 02:12:56.543 INFO
Getting to grips with auto-scaling
Hi

I'm trying to get a handle on the newer auto-scaling features in Solr. We're in the process of upgrading an older SolrCloud cluster from 5.5 to 8.5, and re-architecting it slightly to improve performance and automate operations.

If I boil it down slightly, currently we have two collections, "items" and "lists". Both collections have just one shard. We publish new data to "items" once each day, and our users search and do analysis on them, whilst "lists" contains NRT user-specified collections of ids from items, which we join to from "items" in order to allow them to restrict their searches/analysis to just docs in their curated lists.

Most of our searches have specific date ranges in them, usually only from the last 3 years or so, but sometimes we need to do searches across all the data. With the new setup, we want to:

* shard by date (year) to make the hottest data available in smaller shards
* have more nodes with these shards than we do of the older data
* be able to add/remove nodes predictably based upon our clients' (predictable) query load
* use TLOG for "items" and NRT for "lists", to avoid unnecessary indexing load for "items" and have NRT for "lists"
* spread cores across two AZs

With that in mind, I came up with a bunch of simplified rules for testing, with just 4 shards for "items":

* "lists" collection has one NRT replica on each node
* "items" collection shard 2020 has one TLOG replica on each node
* "items" collection shard 2019 has one TLOG replica on 75% of nodes
* "items" collection shards 2018 and 2017 each have one TLOG replica on 50% of nodes
* all shards have at least 2 replicas if number of nodes > 1
* no node should have 2 replicas of the same shard
* number of cores should be balanced across nodes

E.g., with 1 node, I want to see this topology:

A: items: 2020, 2019, 2018, 2017 + lists

with 2 nodes:

A: items: 2020, 2019, 2018, 2017 + lists
B: items: 2020, 2019, 2018, 2017 + lists

and if I add two more nodes:

A: items: 2020, 2019, 2018 + lists
B: items: 2020, 2019, 2017 + lists
C: items: 2020, 2019, 2017 + lists
D: items: 2020, 2018 + lists

To the questions:

* The type of replica created when nodeAdded is triggered can't be set per collection. Either everything gets NRT or everything gets TLOG. Even if I specify nrtReplicas=0 when creating a collection, nodeAdded will add NRT replicas if configured that way.
* I'm having difficulty expressing these rules in terms of a policy - I can't seem to figure out a way to specify the number of replicas for a shard based upon the total number of nodes.
* Is this beyond the current scope of autoscaling triggers/policies? Should I instead use the trigger with a custom plugin action (or to trigger a web hook) to be a bit more intelligent?
* Am I wasting my time trying to ensure there are more replicas of the hotter shards than the colder shards? It seems to add a lot of complexity - should I instead just accept that the colder shards aren't getting queried much, so won't be using up cache space that the hot shards will be using? Disk space is pretty cheap after all (total size for "items" + "lists" is under 60GB).

Cheers

Tom
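For context on the first question, a nodeAdded trigger is registered through the autoscaling API; a minimal sketch in Python, with URL and trigger settings illustrative only. Note that preferredOperation and the resulting replica type apply cluster-wide rather than per collection, which is exactly the limitation raised above:

import json
import requests

payload = {
    "set-trigger": {
        "name": "node_added_trigger",
        "event": "nodeAdded",
        "waitFor": "5s",
        # applies to every collection alike - there is no per-collection setting
        "preferredOperation": "ADDREPLICA",
    }
}
resp = requests.post(
    "http://localhost:8983/solr/admin/autoscaling",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
)
print(resp.json())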
Re: Provide suggestion on indexing performance
On Tue, Sep 12, 2017 at 4:06 AM, Aman Tandon wrote:
> Hi,
>
> We want to know about the indexing performance in the below mentioned
> scenarios; consider a total of 10 string fields and a total of
> 10 million documents.
>
> 1) indexed=true, stored=true
> 2) indexed=true, docValues=true
>
> Which one should we prefer in terms of indexing performance? Please share
> your experience.
>
> With regards,
> Aman Tandon

Your question doesn't make much sense. You turn on stored when you need to retrieve the original contents of the fields after searching, and you use docValues to speed up faceting, sorting and grouping. Using docValues to retrieve values during search is more expensive than simply using stored values, so if your primary aim is retrieving stored values, use stored=true.

Secondly, the only way to answer performance questions for your schema and data is to try it out. Generate 10 million docs, store them in a file (e.g. as CSV), and then use the post tool to try different schema and query options.

Cheers

Tom
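A minimal sketch of that experiment in Python (field names, values, doc count and URL are placeholders; trimmed to 100k docs so it runs quickly):

import csv
import random
import requests

# generate sample docs as CSV (schema/field names assumed for illustration)
with open("docs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id"] + ["field_%d" % i for i in range(10)])
    for n in range(100000):
        writer.writerow([n] + ["value_%d" % random.randint(0, 99) for _ in range(10)])

# post the CSV to the update handler and commit
with open("docs.csv", "rb") as f:
    requests.post(
        "http://localhost:8983/solr/test/update",
        params={"commit": "true"},
        headers={"Content-Type": "text/csv"},
        data=f,
    ).raise_for_status()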
Re: Solr returning same object in different page
On Tue, Sep 12, 2017 at 7:42 PM, ruby wrote:
> I'm running into an issue where an object appears twice when we are
> paging. My query gives documents a boost based on field values. The first
> query returns 50 objects. The second query is exactly the same as the first,
> except it gets the next 50 objects. We are noticing that a few objects which
> were returned before are being returned again on the second page. Is this a
> known issue with Solr?

Are you using paging (start=N) or deep paging (cursorMark=*)? Do you have a deterministic sort order (i.e., not simply by score)?

Cheers

Tom
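For reference, a minimal deep-paging loop with a deterministic tie-breaker sort, sketched in Python (URL, collection and query are placeholders; cursorMark requires the sort to include the uniqueKey field):

import requests

url = "http://localhost:8983/solr/mycollection/select"
params = {
    "q": "*:*",
    "rows": 50,
    "sort": "score desc, id asc",  # the uniqueKey tie-breaker makes ordering deterministic
    "cursorMark": "*",
    "wt": "json",
}
while True:
    data = requests.get(url, params=params).json()
    for doc in data["response"]["docs"]:
        pass  # process each doc here
    next_cursor = data["nextCursorMark"]
    if next_cursor == params["cursorMark"]:
        break  # cursor did not advance: no more results
    params["cursorMark"] = next_cursor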
Re: Get results in multiple orders (multiple boosts)
On Fri, Aug 18, 2017 at 8:21 AM, Luca Dall'Osto wrote:
>
> Yes, of course, and excuse me for the misunderstanding.
>
> In my scenario I have to display a list with hundreds of documents.
> A user can show these documents in a particular order; this order is decided
> by the user in a settings view.
>
> Order levels are, for example:
> 1) Order by category, as most important.
> 2) Order by source, as second level.
> 3) Order by date (ascending or descending).
> 4) Order by title (ascending or descending).
>
> For category order, in the settings view, the user has a box with a list of
> all categories available for him/her.
> The user drags elements of the list to set their favourite order.
> Same thing for sources.

Solr can only sort by indexed fields; it needs to be able to compare one document to another document, and the only information available at that point is the indexed fields. This would be untenable in your scenario, because you cannot add a category.<user>.sort_order field to every document for every user.

If this custom sorting is a hard requirement, the only feasible solution I see is to write a custom sorting plugin that provides a function that you can sort on. This blog post describes how this can be achieved:

https://medium.com/culture-wavelabs/sorting-based-on-a-custom-function-in-solr-c94ddae99a12

I would imagine that you would need one sort function, maybe called usersortorder(), to which you would provide the user's preferred sort ordering (which you would retrieve from wherever you store such information) and the field that you want sorted. It would look something like this:

usersortorder("category_id", "3,5,1,7,2,12,14,58") DESC,
usersortorder("source_id", "5,2,1,4,3") DESC,
date DESC,
title DESC

Cheers

Tom
Re: setup solrcloud from scratch vie web-ui
On Wed, May 17, 2017 at 6:28 AM, Thomas Porschberg wrote:
> Hi,
>
> I did not manipulate the data dir. What I did was:
>
> 1. Downloaded solr-6.5.1.zip
> 2. Ensured no solr process is running
> 3. Unzipped solr-6.5.1.zip to ~/solr_new2/solr-6.5.1
> 4. Started an external zookeeper
> 5. Copied a conf directory from a working non-cloud solr (6.5.1) to
>    ~/solr_new2/solr-6.5.1 so that I have ~/solr_new2/solr-6.5.1/conf
>    (see http://randspringer.de/solrcloud_test/my.zip for content)

..in which you've manipulated the dataDir! :)

The problem (I think) is that you have set a fixed data dir, and when Solr attempts to create a second core (for whatever reason; in your case it looks like you are adding a shard), Solr puts it exactly where you have told it to - in the same directory as the previous one. It finds the lock and blows up, because each core needs to be in a separate directory, but you've instructed Solr to put them in the same one.

Start with the solrconfig from the basic_configs configset that ships with Solr and add the special things that your installation needs. I am not massively surprised that your non-cloud config does not work in cloud mode. When we moved to SolrCloud, we rewrote solrconfig.xml and schema.xml from scratch, starting from basic_configs and adding anything particular that we needed from our old config, checking every difference that we had from the stock config and noting why, and ensuring that our field types used the same names for the same types as basic_configs wherever possible.

I only say all that because fixing this issue is a single thing, but you should spend the time comparing configs, because this will not be the only issue.

Anyway, to fix this problem, in your solrconfig.xml you have:

<dataDir>data</dataDir>

It should be:

<dataDir>${solr.data.dir:}</dataDir>

which is still in your config - you've just got it commented out :)

Cheers

Tom
Re: to handle expired documents: collection alias or delete by id query
On Thu, Mar 23, 2017 at 6:10 AM, Derek Poh wrote:
> Hi
>
> I have collections of products. I am indexing 3-4 times daily.
> Every day there are products that expire and I need to remove them from
> these collections daily.
>
> I can think of 2 ways to do this.
> 1. Using a collection alias to switch between a main and temp collection:
> - clear and index the temp collection
> - create alias to temp collection
> - clear and index the main collection
> - create alias to main collection
>
> This way requires additional collections.

Another way of doing this is to have a moving alias (not constantly clearing the "temp" collection). If you reindex daily, your index would be called "products_yyyymmdd", with an alias to "products". The advantage of this is that you can roll back to a previous version of the index if there are problems, and each index is guaranteed to be freshly created with no artifacts.

The biggest consideration for me would be how long indexing your full corpus takes. If you can do it in a small period of time, then full indexes are preferable. If it takes a very long time, deleting is preferable.

If you are doing a cloud setup, full indexes are even more appealing. You can create the new collection on a single node (even if sharded; just place each shard on the same node). This places the indexing cost on that one node only, whilst the other nodes' regular query response times are unaffected by indexing. You also don't have to distribute the documents around the cluster - there is no distributed indexing in Solr; each replica has to index each document again, even if it is not the leader. Once indexing is complete, you can expand the collection by adding replicas of that shard on other nodes - perhaps even removing it from the node that did the indexing. We have a node that solely does indexing; before the collection is queried for anything, it is added to the querying nodes. You can do this manually, or you can automate it using the collections API.

Cheers

Tom
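A sketch of that moving-alias flow in Python (URL, configset and collection names are illustrative):

import datetime
import requests

SOLR = "http://localhost:8983/solr/admin/collections"
new_collection = "products_" + datetime.date.today().strftime("%Y%m%d")

# create today's collection from the shared configset
requests.get(SOLR, params={
    "action": "CREATE",
    "name": new_collection,
    "collection.configName": "products",  # assumed configset name
    "numShards": "1",
}).raise_for_status()

# ... index into new_collection and verify it ...

# atomically repoint the alias; queries against "products" switch instantly
requests.get(SOLR, params={
    "action": "CREATEALIAS",
    "name": "products",
    "collections": new_collection,
}).raise_for_status()

# roll back, if needed, by pointing the alias at yesterday's collection again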
Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...
Hi Mike

It looks like you are trying to get a list of the distinct item ids in a result set, ordered by the most frequent item ids? Can you use the collapsing qparser for this instead? It should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Every document with the same item_id would need to be on the same shard for this to work, and I'm not sure whether you can actually get the count of collapsed documents, if that is necessary for you.

Another option might be to use the hyperloglog function - hll() - instead of unique(), which should give slightly better performance.

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael wrote:
> Hi all,
>
> I'm converting my legacy facets to JSON facets and am seeing much better
> performance, especially with high cardinality facet fields. However, the one
> issue I can't seem to resolve is excessive memory usage (and OOM errors) when
> trying to simulate the effect of "group.facet" to sort facets according to a
> grouping field.
>
> My situation, slightly simplified, is:
>
> Solr 4.6.1
>
> * Doc set: ~200,000 docs
> * Grouping by item_id, an indexed, stored, single-value string field with
>   ~50,000 unique values, ~4 docs per item
> * Faceting by person_id, an indexed, stored, multi-value string field
>   with ~50,000 values (w/ a very skewed distribution)
> * No docValues fields
>
> Each document here is a description of an item, and there are several
> descriptions per item in multiple languages.
>
> With legacy facets I use group.field=item_id and group.facet=true, which
> gives me facet counts with the number of items rather than descriptions, and
> correctly sorted by descending item count.
>
> With JSON facets I'm doing the equivalent like so:
>
> json.facet={
>   "people": {
>     "type": "terms",
>     "field": "person_id",
>     "facet": {
>       "grouped_count": "unique(item_id)"
>     },
>     "sort": "grouped_count desc"
>   }
> }
>
> This works, and is somewhat faster than legacy faceting, but it also produces
> a massive spike in memory usage when (and only when) the sort parameter is
> set to the aggregate field. A server that runs happily with a 512MB heap OOMs
> unless I give it a 4GB heap. With sort set to (the default) "count desc"
> there is no memory usage spike.
>
> I would be curious if anyone has experienced this kind of memory usage when
> sorting JSON facets by stats and if there's anything I can do to mitigate it.
> I've tried reindexing with docValues enabled on the relevant fields and it
> seems to make no difference in this respect.
>
> Many thanks,
> ~Mike
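A sketch of the collapsing approach in Python (URL, collection and query are placeholders; expand=true returns a per-group numFound, which may answer the collapsed-count question above):

import requests

params = {
    "q": "some query",
    "fq": "{!collapse field=item_id}",  # one representative document per item_id
    "expand": "true",       # also return the collapsed groups...
    "expand.rows": "0",     # ...but only their numFound, not the member docs
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycollection/select", params=params)
data = resp.json()
# data["expanded"] maps each item_id to a result set carrying a per-group numFound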
Re: Interval Facets with JSON
On Wed, Feb 8, 2017 at 11:26 PM, deniz <denizdurmu...@gmail.com> wrote:
> Tom Evans-2 wrote
>> I don't think there is such a thing as an interval JSON facet.
>> Whereabouts in the documentation are you seeing an "interval" as JSON
>> facet type?
>>
>> You want a range facet surely?
>>
>> One thing with range facets is that the gap is fixed size. You can
>> actually do your example however:
>>
>> json.facet={height_facet:{type:range, gap:20, start:160, end:190,
>> hardend:true, field:height}}
>>
>> If you do require arbitrary bucket sizes, you will need to do it by
>> specifying query facets instead, I believe.
>>
>> Cheers
>>
>> Tom
>
> nothing other than
> https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-IntervalFaceting
> for documentation on intervals... i am ok with range queries as well but
> intervals would fit better because of different sizes...

That documentation is not for JSON facets though. You can't pick and choose features from the old facet system and use them in JSON facets unless they are mentioned in the JSON facet documentation:

https://cwiki.apache.org/confluence/display/solr/JSON+Request+API

and (not official documentation):

http://yonik.com/json-facet-api/

Cheers

Tom
Re: Interval Facets with JSON
On Tue, Feb 7, 2017 at 8:54 AM, deniz wrote:
> Hello,
>
> I am trying to run JSON facets with an interval query as follows:
>
> "json.facet":{"height_facet":{"interval":{"field":"height","set":["[160,180]","[180,190]"]}}}
>
> And the related field is <field name="height" indexed="true" stored="true" />
>
> But I keep seeing errors like:
>
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Unknown
> facet or stat. key=height_facet type=interval args={field=height,
> set=[[160,180], [180,190]]} , path=/facet

I don't think there is such a thing as an interval JSON facet. Whereabouts in the documentation are you seeing an "interval" as a JSON facet type?

You want a range facet surely?

One thing with range facets is that the gap is fixed size. You can actually do your example however:

json.facet={height_facet:{type:range, gap:20, start:160, end:190, hardend:true, field:height}}

If you do require arbitrary bucket sizes, you will need to do it by specifying query facets instead, I believe.

Cheers

Tom
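For the arbitrary bucket sizes mentioned at the end, query facets sketched in Python (URL and collection are placeholders; the first upper bound is made exclusive with `}` so the buckets don't overlap at 180):

import json
import requests

facets = {
    "h160_180": {"type": "query", "q": "height:[160 TO 180}"},
    "h180_190": {"type": "query", "q": "height:[180 TO 190]"},
}
resp = requests.get(
    "http://localhost:8983/solr/mycollection/select",
    params={"q": "*:*", "rows": "0", "json.facet": json.dumps(facets)},
)
print(resp.json()["facets"])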
Re: Upgrade SOLR version - facets perfomance regression
On Tue, Jan 31, 2017 at 5:49 AM, SOLR4189 wrote:
> But I can't run the JSON Facet API. I checked on SOLR-5.4.1.
> If I write:
>
> localhost:9001/solr/Test1_shard1_replica1/myHandler?q=*:*&rows=5&fl=*&wt=json&facet=true&facet.field=someField
>
> It works fine. But if I write:
>
> localhost:9001/solr/Test1_shard1_replica1/myHandler?q=*:*&rows=5&fl=*&wt=json&json.facet={field:someField}
>
> It doesn't work.
> Are you sure that it is built in? If it is built in, why can't I find an
> explanation about it in the reference guide?
> Thank you for your help.

You do have to follow the correct syntax:

json.facet={name_of_facet_in_output:{type:terms, field:name_of_field}}

It is documented in confluence:

https://cwiki.apache.org/confluence/display/solr/Faceted+Search

Also by yonik:

http://yonik.com/json-facet-api/

Cheers

Tom
Re: Concat Fields in JSON Facet
On Mon, Jan 16, 2017 at 2:58 PM, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> I have been using JSON Facet, but I am facing some constraints in
> displaying the field.
>
> For example, I have 2 fields, itemId and itemName. However, when I do the
> JSON Facet, I can only get it to show one of them in the output, and I
> could not get it to show both together.
> I would like to show both the ID and Name together, so that it will be more
> meaningful and easier for the user to understand, without having to refer to
> another table to determine the match between the ID and Name.

I don't understand what you mean. If you have these three documents in your index, what data do you want in the facet?

[
  {itemId: 1, itemName: "Apple"},
  {itemId: 2, itemName: "Android"},
  {itemId: 3, itemName: "Android"},
]

Cheers

Tom
Re: Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question
On Thu, Dec 15, 2016 at 12:37 PM, GW wrote:
> While my client is all PHP it does not use a solr client. I wanted to stay
> with the latest Solr Cloud, and the PHP clients all seemed to have some kind
> of issue being unaware of newer Solr Cloud versions. The client makes pure
> REST calls with curl. It is stateful through local storage. There is no
> persistent connection. There are no cookies and PHP work is not sticky, so
> it is designed for round robin on both the internal network.
>
> I'm thinking we have a different idea of persistent. To me something like
> MySQL can be persistent, i.e. a FIFO queue for requests. The stack can be
> always on/connected on something like a heap storage.
>
> I never thought about the impact of a solr node crashing with PHP on top.
> Many thanks!
>
> Was thinking of running a conga line (ricci & luci projects) and shutting
> down and replacing failed nodes. Never done this with Solr. I don't see any
> reasons why it would not work.
>
> ** When you say an array of connections per host - it would still require an
> internal DNS because hosts files don't round robin. Perhaps this is handled
> in the Python client??

The best Solr clients will take the URIs of the Zookeeper servers; they do not make queries via Zookeeper, but will read the current cluster status from Zookeeper in order to determine which solr node to actually connect to, taking in to account which nodes are alive and the state of particular shards. SolrJ (Java) will do this, as will pysolr (python); I'm not aware of a PHP client that is ZK aware.

If you don't have a ZK aware client, there are several options:

1) Make your favourite client ZK aware, like in [1]
2) Use round robin DNS to distribute requests amongst the cluster.
3) Use a hardware or software load balancer in front of the cluster.
4) Use shared state to store the names of active nodes.

All apart from 1) have significant downsides:

2) Has no concept of a node being down. Down nodes should not cause query failures; the requests should go elsewhere in the cluster. Requires updating DNS to add or remove nodes.
3) Can detect "down" nodes, but has no idea about the state of the cluster/shards (usually).
4) Basically duplicates what ZooKeeper does, but less effectively - it doesn't know cluster state, down nodes, or nodes that are up but with unhealthy replicas...

> You have given me some good clarification. I think, lol. I know I can spin
> out WWW servers based on load. I'm not sure how shit will fly spinning up
> additional solr nodes. I'm not sure what happens if you spin up an empty
> solr node, and what will happen with replication, shards and the load cost of
> spinning up an instance. I'm facing some experimentation, methinks. This will
> be a manual process at first, for sure.
>
> I guess I could put the solr connect requests in my clients into a try
> loop, looking for successful connections by name before any action.

In SolrCloud mode, you can spin up/shut down nodes as you like. Depending on how you have configured your collections, new replicas may be automatically created on the new node, or the node will simply become part of the cluster but empty, ready for you to assign new replicas to it using the Collections API. You can also use what are called "snitches" to define rules for how you want replicas/shards allocated amongst the nodes, e.g. to avoid placing all the replicas for a shard in the same rack.

Cheers

Tom

[1] https://github.com/django-haystack/pysolr/commit/366f14d75d2de33884334ff7d00f6b19e04e8bbf
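For reference, the ZK-aware pysolr client mentioned above looks roughly like this (hosts and collection name are placeholders; pysolr's SolrCloud support needs the kazoo package installed):

import pysolr

# reads live cluster state from ZooKeeper and routes queries to healthy nodes
zk = pysolr.ZooKeeper("zk1:2181,zk2:2181,zk3:2181")
solr = pysolr.SolrCloud(zk, "mycollection")

for doc in solr.search("*:*", rows=10):
    print(doc)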
Re: Using DIH FileListEntityProcessor with SolrCloud
On Fri, Dec 2, 2016 at 4:36 PM, Chris Rogers wrote:
> Hi all,
>
> A question regarding using the DIH FileListEntityProcessor with SolrCloud
> (solr 6.3.0, zookeeper 3.4.8).
>
> I get that the config in SolrCloud lives on the Zookeeper node (a different
> server from the solr nodes in my setup).
>
> With this in mind, where is the baseDir attribute in the
> FileListEntityProcessor config relative to? I'm seeing the config in the Solr
> GUI, and I've tried setting it as an absolute path on my Zookeeper server,
> but this doesn't seem to work... any ideas how this should be set up?
>
> My DIH config is below:
>
> <dataConfig>
>   <document>
>     <entity name="f" processor="FileListEntityProcessor"
>             fileName=".*xml"
>             newerThan="'NOW-5YEARS'"
>             recursive="true"
>             rootEntity="false"
>             dataSource="null"
>             baseDir="/home/bodl-zoo-svc/files/">
>       <entity name="tei" processor="XPathEntityProcessor"
>               forEach="/TEI" url="${f.fileAbsolutePath}"
>               transformer="RegexTransformer">
>         <field column="title" xpath="/TEI/teiHeader/fileDesc/titleStmt/title"/>
>         <field column="publisher" xpath="/TEI/teiHeader/fileDesc/publicationStmt/publisher"/>
>         <field column="idno" xpath="/TEI/teiHeader/fileDesc/sourceDesc/msDesc/msIdentifier/altIdentifier/idno"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> This same script worked as expected on a single solr node (i.e. not in
> SolrCloud mode).
>
> Thanks,
> Chris

Hey Chris

We hit the same problem moving from non-cloud to cloud; we had a collection that loaded its DIH config from various XML files listing the DB queries to run. We wrote a simple DataSource plugin to load the config from Zookeeper instead of local disk, to avoid having to distribute those config files around the cluster.

https://issues.apache.org/jira/browse/SOLR-8557

Cheers

Tom
Re: insert lat/lon from jpeg into solr
On Wed, Nov 30, 2016 at 1:36 PM, win harrington wrote:
> I have jpeg files with latitude and longitude in separate fields. When I run
> the post tool, it stores the lat/lon in separate fields.
> For geospatial search, Solr wants them combined into one field with the
> format 'latitude,longitude'.
> How can I combine lat+lon into one field?

Build the field up using an UpdateRequestProcessorChain, something like this:

<updateRequestProcessorChain name="composite-latlon">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">latitude</str>
    <str name="dest">latlon</str>
  </processor>
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">longitude</str>
    <str name="dest">latlon</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">latlon</str>
    <str name="delimiter">,</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

and then index with update.chain=composite-latlon.

Cheers

Tom
Re: Import from S3
On Fri, Nov 25, 2016 at 7:23 AM, Aniket Khare wrote:
> You can use Solr DIH for indexing csv data into solr.
> https://wiki.apache.org/solr/DataImportHandler

Seems overkill when you can simply post CSV data to the UpdateHandler, using either the post tool:

https://cwiki.apache.org/confluence/display/solr/Post+Tool

Or by doing it manually however you wish:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-CSVFormattedIndexUpdates

Cheers

Tom
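Doing it manually is only a few lines from Python; a sketch (URL, collection and filename are placeholders; assumes the CSV's first row holds field names matching the schema):

import requests

with open("data.csv", "rb") as f:
    resp = requests.post(
        "http://localhost:8983/solr/mycollection/update",
        params={"commit": "true", "separator": ",", "header": "true"},
        headers={"Content-Type": "text/csv"},
        data=f,
    )
print(resp.json())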
Re: Query formulation help
On Wed, Oct 26, 2016 at 4:00 PM, Prasanna S. Dhakephalkar wrote:
> Hi,
>
> Thanks for the reply. I did
>
> "q": "cost:[2 TO (2+5000)]"
>
> and got
>
> "error": {
>   "msg": "org.apache.solr.search.SyntaxError: Cannot parse 'cost:[2 to
> (2+5000)]': Encountered \" \"(2+5000) \"\" at line 1,
> column 18.\nWas expecting one of:\n\"]\" ...\n\"}\" ...\n",
> }
>
> I want solr to do the addition.
> I tried
>
> "q": "cost:[2 TO (2+5000)]"
> "q": "cost:[2 TO sum(2,5000)]"
>
> It has not worked. I am missing something; I do not know what - maybe how
> to invoke functions.
>
> Regards,
>
> Prasanna.

Sorry, I was unclear - do the maths before constructing the query! You might be able to do this with function queries, but why bother? If the number is fixed, then fix it in the query; if it varies, then there must be some code executing on your client that can be used to do a simple addition.

Cheers

Tom
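I.e., a client-side sketch in Python (URL and collection are placeholders):

import requests

given_number = 2
window = 5000

# do the arithmetic client-side; Solr only ever sees literal bounds
query = "cost:[%d TO %d]" % (given_number, given_number + window)
resp = requests.get(
    "http://localhost:8983/solr/mycollection/select",
    params={"q": query, "wt": "json"},
)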
Re: Query formulation help
On Wed, Oct 26, 2016 at 8:03 AM, Prasanna S. Dhakephalkar wrote:
> Hi,
>
> Maybe a very rudimentary question.
>
> There is an integer field in a core: "cost".
>
> I need to build a query that will return documents where
> 0 < "cost" - given_number < 500

cost:[given_number TO (500+given_number)]
Re: OOM Error
On Wed, Oct 26, 2016 at 4:53 AM, Shawn Heisey wrote:
> On 10/25/2016 8:03 PM, Susheel Kumar wrote:
>> Agree, Pushkar. I had docValues for sorting / faceting fields from the
>> beginning (since I set up Solr 6.0). So good on that side. I am going to
>> analyze the queries to find any potential issue. Two questions which I am
>> puzzling with:
>>
>> a) Should the below JVM parameter be included for Prod to get a heap dump?
>>
>> "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/the/dump"
>
> A heap dump can take a very long time to complete, and there may not be
> enough memory in the machine to start another instance of Solr until the
> first one has finished the heap dump. Also, I do not know whether Java
> would release the listening port before the heap dump finishes. If not,
> then a new instance would not be able to start immediately.
>
> If a different heap dump file is created each time, that might lead to
> problems with disk space after repeated dumps. I don't know how the
> option works.
>
>> b) Currently the OOM script just kills the Solr instance. Shouldn't it be
>> enhanced to wait and restart the Solr instance?
>
> As long as there is a problem causing OOMs, it seems rather pointless to
> start Solr right back up, as another OOM is likely. The safest thing to
> do is kill Solr (since its operation would be unpredictable after OOM)
> and let the admin sort the problem out.

Occasionally our cloud nodes can OOM, when particularly complex faceting is performed. The current OOM management can be exceedingly annoying: a user will make a too-complex analysis request, bringing down one server and taking it out of the balancer. The user gets fed up at no response, so reloads the page, re-submitting the analysis and bringing down the next server in the cluster. Lather, rinse, repeat - and then you get to have a meeting to discuss why we invest so much in HA infrastructure that can be made non-HA by one user with a complex query. In those meetings it is much harder to justify not restarting.

Cheers

Tom
Re: indexing - offline
On Thu, Oct 20, 2016 at 5:38 PM, Rallavagu wrote:
> Solr 5.4.1 cloud with embedded jetty
>
> Looking for some ideas around offline indexing, where an independent node
> will be indexed offline (not in the cloud) and added to the cloud to become
> leader, so other cloud nodes will get replicated. Wonder if this is possible
> without interrupting the live service. Thanks.

How we do this, to reindex collection "foo":

1) First, collection "foo" should be an alias to the real collection, e.g. "foo_1" aliased to "foo"
2) Have a node "node_i" in the cluster that is used for indexing. It doesn't hold any shards of any collections
3) Use the collections API to create collection "foo_2", with however many shards are required, but all placed on "node_i"
4) Index "foo_2" with new data, with DIH or direct indexing to "node_i"
5) Use the collections API to expand "foo_2" to all the nodes/replicas that it should be on
6) Remove "foo_2" from "node_i"
7) Verify the contents of "foo_2" are correct
8) Use the collections API to change the alias for "foo" to "foo_2"
9) Remove the "foo_1" collection once happy

This avoids indexing overwhelming the performance of the cluster (or any nodes in the cluster that receive queries), and can be performed with zero downtime or config changes on the clients.

Cheers

Tom
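A sketch of the collections API calls for steps 3 and 8, in Python (URL, node name, shard count and configset are illustrative):

import requests

SOLR = "http://localhost:8983/solr/admin/collections"

# step 3: create "foo_2" with all shards placed on the indexing node
requests.get(SOLR, params={
    "action": "CREATE",
    "name": "foo_2",
    "numShards": "4",
    "collection.configName": "foo_conf",   # assumed configset name
    "createNodeSet": "node_i:8983_solr",   # keep indexing load off the query nodes
}).raise_for_status()

# ... steps 4-7: index, expand with ADDREPLICA, verify ...

# step 8: atomically switch the alias to the new collection
requests.get(SOLR, params={
    "action": "CREATEALIAS",
    "name": "foo",
    "collections": "foo_2",
}).raise_for_status()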
min()/max() on date fields using JSON facets
Hi all

I'm trying to replace a use of the stats module with JSON facets in order to calculate the min/max date range of documents in a query. For the same search, "stats.field=date_published" returns this:

{u'date_published': {u'count': 86760,
                     u'max': u'2016-07-13T00:00:00Z',
                     u'mean': u'2013-12-11T07:09:17.676Z',
                     u'min': u'2011-01-04T00:00:00Z',
                     u'missing': 0,
                     u'stddev': 50006856043.410477,
                     u'sum': u'3814570-11-06T00:00:00Z',
                     u'sumOfSquares': 1.670619719649826e+29}}

For the equivalent JSON facet - "{'date.max': 'max(date_published)', 'date.min': 'min(date_published)'}" - I'm returned this:

{u'count': 86760, u'date.max': 146836800.0, u'date.min': 129409920.0}

What do these numbers represent - I'm guessing milliseconds since epoch? In UTC? Is there any way to control the output format or TZ? Is there any benefit in using JSON facets to determine this, or should I just continue using stats?

Cheers

Tom
Re: Node not recovering, leader elections not occurring
On the nodes that have the replica in a recovering state we now see:

19-07-2016 16:18:28 ERROR RecoveryStrategy:159 - Error while trying to recover. core=lookups_shard1_replica8:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms, collection: lookups slice: shard1
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:607)
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:593)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
19-07-2016 16:18:28 INFO RecoveryStrategy:444 - Replay not started, or was not successful... still buffering updates.
19-07-2016 16:18:28 ERROR RecoveryStrategy:481 - Recovery failed - trying again... (164)
19-07-2016 16:18:28 INFO RecoveryStrategy:503 - Wait [12.0] seconds before trying to recover again (attempt=165)

This is with the "leader that is not the leader" shut down. Issuing a FORCELEADER via the collections API doesn't in fact force a leader election to occur.

Is there any other way to prompt Solr to have an election?

Cheers

Tom

On Tue, Jul 19, 2016 at 5:10 PM, Tom Evans <tevans...@googlemail.com> wrote:
> There are 11 collections, each only has one shard, and each node has
> 10 replicas (9 collections are on every node, 2 are just on one node).
> We're not seeing any OOM errors on restart.
>
> I think we're being patient waiting for the leader election to occur.
> We stopped the troublesome "leader that is not the leader" server
> about 15-20 minutes ago, but we still have not had a leader election.
>
> Cheers
>
> Tom
>
> On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> How many replicas per Solr JVM? And do you
>> see any OOM errors when you bounce a server?
>> And how patient are you being, because it can
>> take 3 minutes for a leaderless shard to decide
>> it needs to elect a leader.
>>
>> See SOLR-7280 and SOLR-7191 for the case
>> where lots of replicas are in the same JVM,
>> the tell-tale symptom is errors in the log as you
>> bring Solr up saying something like
>> "OutOfMemory error unable to create native thread"
>>
>> SOLR-7280 has patches for 6x and 7x, with a 5x one
>> being added momentarily.
>>
>> Best,
>> Erick
>>
>> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans <tevans...@googlemail.com> wrote:
>>> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
>>> of the collections on it marked as "Recovering" or "Recovery Failed".
>>> It attempts to recover from the leader, but the leader responds with:
>>>
>>> Error while trying to recover.
>>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:3/solr: We are not the
>>> leader
>>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>> at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>>> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by:
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:3/solr: We are not the
>>> leader
>>>
Re: Node not recovering, leader elections not occurring
There are 11 collections, each only has one shard, and each node has 10 replicas (9 collections are on every node, 2 are just on one node). We're not seeing any OOM errors on restart.

I think we're being patient waiting for the leader election to occur. We stopped the troublesome "leader that is not the leader" server about 15-20 minutes ago, but we still have not had a leader election.

Cheers

Tom

On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> How many replicas per Solr JVM? And do you
> see any OOM errors when you bounce a server?
> And how patient are you being, because it can
> take 3 minutes for a leaderless shard to decide
> it needs to elect a leader.
>
> See SOLR-7280 and SOLR-7191 for the case
> where lots of replicas are in the same JVM,
> the tell-tale symptom is errors in the log as you
> bring Solr up saying something like
> "OutOfMemory error unable to create native thread"
>
> SOLR-7280 has patches for 6x and 7x, with a 5x one
> being added momentarily.
>
> Best,
> Erick
>
> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans <tevans...@googlemail.com> wrote:
>> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
>> of the collections on it marked as "Recovering" or "Recovery Failed".
>> It attempts to recover from the leader, but the leader responds with:
>>
>> Error while trying to recover.
>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://172.31.1.171:3/solr: We are not the
>> leader
>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>> at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://172.31.1.171:3/solr: We are not the
>> leader
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
>> ... 5 more
>>
>> and recovery never occurs.
>>
>> Each collection in this state has plenty (10+) of active replicas, but
>> stopping the server that is marked as the leader doesn't trigger a
>> leader election amongst these replicas.
>>
>> REBALANCELEADERS did nothing.
>> FORCELEADER complains that there is already a leader.
>> FORCELEADER with the purported leader stopped took 45 seconds,
>> reported status of "0" (and no other message) and kept the down node
>> as the leader (!)
>> Deleting the failed collection from the failed node and re-adding it
>> has the same "Leader said I'm not the leader" error message.
>>
>> Any other ideas?
>>
>> Cheers
>>
>> Tom
Node not recovering, leader elections not occurring
Hi all - problem with a SolrCloud 5.5.0, we have a node that has most of the collections on it marked as "Recovering" or "Recovery Failed". It attempts to recover from the leader, but the leader responds with:

Error while trying to recover.
core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://172.31.1.171:3/solr: We are not the leader
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://172.31.1.171:3/solr: We are not the leader
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
... 5 more

and recovery never occurs.

Each collection in this state has plenty (10+) of active replicas, but stopping the server that is marked as the leader doesn't trigger a leader election amongst these replicas.

REBALANCELEADERS did nothing.
FORCELEADER complains that there is already a leader.
FORCELEADER with the purported leader stopped took 45 seconds, reported status of "0" (and no other message) and kept the down node as the leader (!)
Deleting the failed collection from the failed node and re-adding it has the same "Leader said I'm not the leader" error message.

Any other ideas?

Cheers

Tom
Strange highlighting on search
Hi all

I'm investigating a bug whereby every term in the highlighted field gets marked for highlighting, instead of just the words that match the fulltext portion of the query. This is on Solr 5.5.0, but I didn't see any bug fixes related to highlighting in the 5.5.1 or 6.0 release notes.

The query that triggers it is one where we have a NOT clause on a specific field (not the fulltext field) and also only include documents where that field has a value:

q: cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)

This returns the correct results, but the highlighting has matched every word in the results (see below for debugQuery output). If I change the query to put the exclusion in to an fq, the highlighting is correct again (and the results are correct):

q: cosmetics_packaging_fulltext:(Mist)
fq: {!cache=false} ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)

Is there any way I can make the query and highlighting work as expected as part of q? Is there any downside to putting the exclusion part in the fq in terms of performance? We don't use score at all for our results, we always order by other parameters.

Cheers

Tom

Query with strange highlighting:

{
  "responseHeader":{
    "status":0,
    "QTime":314,
    "params":{
      "q":"cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
      "hl":"true",
      "hl.simple.post":"</em>",
      "indent":"true",
      "fl":"id,product",
      "hl.fragsize":"0",
      "hl.fl":"product",
      "rows":"5",
      "wt":"json",
      "debugQuery":"true",
      "hl.simple.pre":"<em>"}},
  "response":{"numFound":10132,"start":0,"docs":[
      {
        "id":"2403841-1498608",
        "product":"Mist"},
      {
        "id":"2410603-1502577",
        "product":"Mist"},
      {
        "id":"5988531-3882415",
        "product":"Ao + Mist"},
      {
        "id":"6020805-3904203",
        "product":"UV Mist Cushion SPF 50+ PA+++"},
      {
        "id":"2617977-1629335",
        "product":"Ultra Radiance Facial Re-Hydrating Mist"}]
  },
  "highlighting":{
    "2403841-1498608":{
      "product":["<em>Mist</em>"]},
    "2410603-1502577":{
      "product":["<em>Mist</em>"]},
    "5988531-3882415":{
      "product":["<em>Ao</em> + <em>Mist</em>"]},
    "6020805-3904203":{
      "product":["<em>UV</em> <em>Mist</em> <em>Cushion</em> <em>SPF</em> <em>50+</em> <em>PA+++</em>"]},
    "2617977-1629335":{
      "product":["<em>Ultra</em> <em>Radiance</em> <em>Facial</em> <em>Re-Hydrating</em> <em>Mist</em>"]}},
  "debug":{
    "rawquerystring":"cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
    "querystring":"cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
    "parsedquery":"+cosmetics_packaging_fulltext:mist +ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223",
    "parsedquery_toString":"+cosmetics_packaging_fulltext:mist +ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223",
    "explain":{
      "2403841-1498608":"\n40.082462 = sum of:\n 39.92971 = weight(cosmetics_packaging_fulltext:mist in 13983) [ClassicSimilarity], result of:\n39.92971 = score(doc=13983,freq=39.0), product of:\n 0.9882648 = queryWeight, product of:\n6.469795 = idf(docFreq=22502, maxDocs=5342472)\n0.15275055 = queryNorm\n 40.40386 = fieldWeight in 13983, product of:\n6.244998 = tf(freq=39.0), with freq of:\n 39.0 = termFreq=39.0\n6.469795 = idf(docFreq=22502, maxDocs=5342472)\n1.0 = fieldNorm(doc=13983)\n 0.15275055 = ingredient_tag_id:[0 TO *], product of:\n1.0 = boost\n0.15275055 = queryNorm\n",
      "2410603-1502577":"\n40.082462 = sum of:\n 39.92971 = weight(cosmetics_packaging_fulltext:mist in 14023) [ClassicSimilarity], result of:\n39.92971 = score(doc=14023,freq=39.0), product of:\n 0.9882648 = queryWeight, product of:\n6.469795 = idf(docFreq=22502, maxDocs=5342472)\n0.15275055 = queryNorm\n 40.40386 =
fieldWeight in 14023, product of:\n6.244998 = tf(freq=39.0), with freq of:\n 39.0 = termFreq=39.0\n6.469795 = idf(docFreq=22502, maxDocs=5342472)\n1.0 = fieldNorm(doc=14023)\n 0.15275055 = ingredient_tag_id:[0 TO *], product of:\n1.0 = boost\n0.15275055 = queryNorm\n", "5988531-3882415":"\n37.435104 = sum of:\n 37.282352 = weight(cosmetics_packaging_fulltext:mist in 1062788) [ClassicSimilarity], result of:\n37.282352 = score(doc=1062788,freq=34.0), product of:\n 0.9882648 = queryWeight, product of:\n6.469795 = idf(docFreq=22502, maxDocs=5342472)\n0.15275055 = queryNorm\n 37.725063 = fieldWeight in 1062788, product of:\n5.8309517 = tf(freq=34.0), with freq of:\n 34.0 = termFreq=34.0\n 6.469795 = idf(docFreq=22502, maxDocs=5342472)\n1.0 = fieldNorm(doc=1062788)\n 0.15275055 = ingredient_tag_id:[0 TO *], product of:\n1.0 = boost\n0.15275055 = queryNorm\n", "6020805-3904203":"\n30.816679 = sum of:\n 30.663929 =
Re: result grouping in sharded index
Do you have to group, or can you collapse instead?

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Cheers

Tom

On Tue, Jun 14, 2016 at 4:57 PM, Jay Potharaju wrote:
> Any suggestions on how to handle result grouping in a sharded index?
>
> On Mon, Jun 13, 2016 at 1:15 PM, Jay Potharaju wrote:
>
>> Hi,
>> I am working on functionality that would require me to group documents
>> by an id field. I read that the ngroups feature does not work in a sharded
>> index.
>> Can someone recommend how to handle this in a sharded index?
>>
>> Solr Version: 5.5
>>
>> https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats
>>
>> --
>> Thanks
>> Jay
>
> --
> Thanks
> Jay Potharaju
Re: Import html data in mysql and map schemas using only Solr CELL+TIKA+DIH [scottchu]
On Tue, May 24, 2016 at 3:06 PM, Scott Chu wrote:
> p.s. There are really many, many extensive, worthwhile things in Solr. If
> the project team could provide some "dictionary" of them, it would be a
> "Santa Claus" for us solr users. Ha! Just a X'mas wish! Sigh! I know it's
> quite impossible. I'd really like to study them one after another, to learn
> about all of them. However, Internet IT moves too fast to leave time to
> digest all of the great stuff in Solr.

The reference guide is both extensive and broadly informative. Start from the top page and browse away!

https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide

It's handy to keep the glossary nearby for any terms that you don't recognise:

https://cwiki.apache.org/confluence/display/solr/Solr+Glossary

Cheers

Tom
Re: SolrCloud increase replication factor
On Mon, May 23, 2016 at 10:37 AM, Hendrik Haddorp wrote:
> Hi,
>
> I have a SolrCloud 6.0 setup and created my collection with a
> replication factor of 1. Now I want to increase the replication factor,
> but would like the replicas for the same shard to be on different nodes,
> so that my collection does not fail when one node fails. I tried two
> approaches so far:
>
> 1) When I use the collections API with the MODIFYCOLLECTION action [1] I
> can set the replication factor, but that did not result in the creation
> of additional replicas. The Solr Admin UI showed that my replication
> factor changed but otherwise nothing happened. A reload of the
> collection also resulted in no change.
>
> 2) Using the ADDREPLICA action [2] from the collections API I have to
> add the replicas to the shard individually, which is a bit more
> complicated but otherwise worked. During testing this did however at
> least once result in the replica being created on the same node. My
> collection was split into 4 shards, and for 2 of them all replicas ended
> up on the same node.
>
> So is the only option to create the replicas manually and also pick the
> nodes manually, or is the perceived behaviour wrong?
>
> regards,
> Hendrik
>
> [1] https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-modifycoll
> [2] https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica

With ADDREPLICA, you can specify the node to create the replica on. If you are using a script to increase/remove replicas, you can simply incorporate the logic you desire in to your script - you can also use CLUSTERSTATUS to get a list of nodes/collections/shards etc. in order to inform the logic in the script. This is the approach we took; we have a fabric script to add/remove extra nodes to/from the cluster, and it works well.

The alternative is to put the logic in to Solr itself, using what Solr calls a "snitch" to define the rules on where replicas are created. The snitch is specified at collection creation time, or you can use MODIFYCOLLECTION to set it after the fact. See this wiki page for details:

https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement

Cheers

Tom
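A sketch of that scripted approach in Python (URL and collection name are placeholders): read CLUSTERSTATUS, then ADDREPLICA onto a node that doesn't already hold a copy of the shard:

import requests

SOLR = "http://localhost:8983/solr/admin/collections"
COLLECTION = "mycollection"

status = requests.get(SOLR, params={"action": "CLUSTERSTATUS", "wt": "json"}).json()
cluster = status["cluster"]
live_nodes = set(cluster["live_nodes"])
shards = cluster["collections"][COLLECTION]["shards"]

for shard_name, shard in shards.items():
    used = {replica["node_name"] for replica in shard["replicas"].values()}
    free = live_nodes - used
    if free:
        # place the new replica on a node with no existing copy of this shard
        requests.get(SOLR, params={
            "action": "ADDREPLICA",
            "collection": COLLECTION,
            "shard": shard_name,
            "node": sorted(free)[0],
        }).raise_for_status()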
Re: Creating a collection with 1 shard gives a weird range
On Tue, May 17, 2016 at 9:40 AM, John Smith wrote: > I'm trying to create a collection starting with only one shard > (numShards=1) using a compositeID router. The purpose is to start small > and begin splitting shards when the index grows larger. The shard > created gets a weird range value: 80000000-7fffffff, which doesn't look > effective. Indeed, if I try to import some documents using a DIH, none > gets added. > > If I create the same collection with 2 shards, the ranges seem more > logical (0-7fffffff & 80000000-ffffffff). In this case documents are > indexed correctly. > > Is this behavior by design, i.e. is a minimum of 2 shards required? If > not, how can I create a working collection with a single shard? > > This is Solr-6.0.0 in cloud mode with zookeeper-3.4.8. > I believe this is as designed, see this email from Shawn: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3c570d0a03.5010...@elyograg.org%3E Cheers Tom
Re: Indexing 700 docs per second
On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson wrote: > Hi, > > I have a requirement to index (mainly updation) 700 docs per second. > Suppose I have a 128GB RAM, 32 CPU machine, with each doc size around 260 > bytes (6 fields out of which only 2 will undergo updation at the above > rate). This collection has around 122 million docs and that count is pretty > much a constant. > > 1. Can I manage this updation rate with a non-sharded ie single Solr > instance set up? > 2. Also is atomic update or a full update (the whole doc) of the changed > records the better approach in this case. > > Could some one please share their views/ experience? Try it and see - everyone's data/schemas are different and can affect indexing speed. It certainly sounds achievable enough - presumably you can at least produce the documents at that rate? Cheers Tom
Re: Verifying - SOLR Cloud replaces load balancer?
On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff wrote: > Thanks all - very helpful. > > @Shawn - your reply implies that even if I'm hitting the URL for a single > endpoint via HTTP - the "balancing" will still occur across the Solr Cloud > (I understand the caveat about that single endpoint being a potential point > of failure). I just want to verify that I'm interpreting your response > correctly... > > (I have been asked to provide IT with a comprehensive list of options prior > to a design discussion - which is why I'm trying to get clear about the > various options) > > In a nutshell, I think I understand the following: > > a. Even if hitting a single URL, the Solr Cloud will "balance" across all > available nodes for searching > Caveat: That single URL represents a potential single point of > failure and this should be taken into account > > b. SolrJ's CloudSolrClient API provides the ability to distribute load -- > based on Zookeeper's "knowledge" of all available Solr instances. > Note: This is more robust than "a" due to the fact that it > eliminates the "single point of failure" > > c. Use of a load balancer hitting all known Solr instances will be fine - > although the search requests may not run on the Solr instance the load > balancer targeted - due to "a" above. > > Corrections or refinements welcomed... With option a), although queries will be distributed across the cluster, all queries will be going through that single node. Not only is that a single point of failure, but you risk saturating the inter-node network traffic, possibly resulting in lower QPS and higher latency on your queries. With option b), as well as SolrJ, recent versions of pysolr have a ZK-aware SolrCloud client that behaves in a similar way. With option c), you can use the preferLocalShards parameter so that shards that are local to the queried node are used in preference to distributed shards. Depending on your shard/cluster topology, this can increase performance if you are returning large amounts of data - many or large fields, or many documents. Cheers Tom
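For reference, preferLocalShards is just a request parameter, e.g. (sketch; names invented):

curl 'http://solr01:8983/solr/items/select?q=*:*&preferLocalShards=true'

It can also be set in the defaults of a request handler in solrconfig.xml so clients don't need to pass it.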
Re: Anticipated Solr 5.5.1 release date
Awesome, thanks :) On Fri, Apr 15, 2016 at 4:19 PM, Anshum Gupta <ans...@anshumgupta.net> wrote: > Hi Tom, > > I plan on getting a release candidate out for vote by Monday. If all goes > well, it'd be about a week from then for the official release. > > On Fri, Apr 15, 2016 at 6:52 AM, Tom Evans <tevans...@googlemail.com> wrote: > >> Hi all >> >> We're currently using Solr 5.5.0 and converting our regular old style >> facets into JSON facets, and are running in to SOLR-8155 and >> SOLR-8835. I can see these have already been back-ported to 5.5.x >> branch, does anyone know when 5.5.1 may be released? >> >> We don't particularly want to move to Solr 6, as we have only just >> finished validating 5.5.0 with our original queries! >> >> Cheers >> >> Tom >> > > > > -- > Anshum Gupta
Anticipated Solr 5.5.1 release date
Hi all We're currently using Solr 5.5.0 and converting our regular old style facets into JSON facets, and are running in to SOLR-8155 and SOLR-8835. I can see these have already been back-ported to 5.5.x branch, does anyone know when 5.5.1 may be released? We don't particularly want to move to Solr 6, as we have only just finished validating 5.5.0 with our original queries! Cheers Tom
SolrCloud no leader for collection
Hi all, I have an 8 node SolrCloud 5.5 cluster with 11 collections, most of them in a 1 shard x 8 replicas configuration. We have 5 ZK nodes. During the night, we attempted to reindex one of the larger collections. We reindex by pushing json docs to the update handler from a number of processes. It seemed this overwhelmed the servers, and caused all of the collections to fail and end up in either a down or a recovering state, often with no leader. Restarting and rebooting the servers brought a lot of the collections back online, but we are left with a few collections for which all the nodes hosting those replicas are up, but the replicas report as either "active" or "down", and with no leader. Trying to force a leader election has no effect; it keeps choosing a leader that is in "down" state. Removing all the nodes that are in "down" state and forcing a leader election also has no effect. Any ideas? The only viable option I see is to create a new collection, index it and then remove the old collection and alias it in. Cheers Tom
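In case it helps anyone searching later: the forced election mentioned above was attempted with the FORCELEADER collections API action, along these lines (sketch; names invented):

curl 'http://solr01:8983/solr/admin/collections?action=FORCELEADER&collection=mycoll&shard=shard1'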
Re: Creating new cluster with existing config in zookeeper
On Wed, Mar 23, 2016 at 3:43 PM, Robert Brown wrote: > So I setup a new solr server to point to my existing ZK configs. > > When going to the admin UI on this new server I can see the shards/replicas > of the existing collection, and can even query it, even though this new server > has no cores on it itself. > > Is this all expected behaviour? > > Is there any performance gain with what I have at this precise stage? The > extra server certainly makes it appear i could balance more load/requests, > but I guess the queries are just being forwarded on to the servers with the > actual data? > > Am I correct in thinking I can now create a new collection on this host, and > begin to build up a new cluster? and they won't interfere with each other > at all? > > Also, that I'll be able to see both collections when using the admin UI > Cloud page on any of the servers in either collection? > I'm confused slightly: SolrCloud is a (singular) cluster of servers, storing all of its state and configuration underneath a single zookeeper path. The cluster contains collections. Collections are tied to a particular config set within the cluster. Collections are made up of 1 or more shards. Each shard is a core, and there are 1 or more replicas of each core. You can add more servers to the cluster, and then create a new collection with the same config as an existing collection, but it is still part of the same cluster. Of course, you could think of a set of servers within a cluster as a "logical" cluster if it just serves a particular collection, but "cluster" to me would be all of the servers within the same zookeeper tree, because that is where cluster state is maintained. Cheers Tom
Re: Re: Paging and cursorMark
On Wed, Mar 23, 2016 at 12:21 PM, Vanlerberghe, Luc wrote: > I worked on something similar a couple of years ago, but didn’t continue work > on it in the end. > > I've included the text of my original mail. > If you're interested, I could try to find the sources I was working on at the > time > > Luc > Thanks both Luc and Steve. I'm not sure if we will have time to deploy patched versions of things to production - time is always the enemy :( , and we're not a Java shop so there is non-trivial time investment in just building replacement jars, let alone getting that integrated in to our RPMs - but I'll definitely try it out on my dev server. The change seems excessively complex imo, but maybe I'm not seeing the use cases for skip. To my mind, calculating a nextCursorMark is cheap and only relies on having a strict sort ordering, which is also cheap to check. If that condition is met, you should get a nextCursorMark in your response regardless of whether you specified a cursorMark in the request, to allow you to efficiently get the next page. This would still leave slightly pathological performance if you skip to page N, and then iterate back to page 0, which Luc's idea of a previousCursorMark can solve. cursorMark is easy to implement: you can ignore docs which sort lower than that mark. Can you do the same with previousCursorMark? Would it not require keeping a buffer of "rows" documents, and stopping when a document which sorts higher than the supplied mark appears? That seems more complex, but maybe I'm not understanding the internals correctly. Fortunately for us, 90% of our users prefer infinite scroll, and 97% of them never go beyond page 3. Cheers Tom
Paging and cursorMark
Hi all With Solr 5.5.0, we're trying to improve our paging performance. When we are delivering results using infinite scrolling, cursorMark is perfectly fine - one page is followed by the next. However, we also offer traditional paging of results, and this is where it gets a little tricky. Say we have 10 results per page, and a user jumps from page 1 to page 20 and then wants to view page 21; there doesn't seem to be a simple way to get the nextCursorMark. We can make an inefficient request for page 20 (start=190, rows=10), but we cannot give that request a cursorMark=* as it contains start=190. Consequently, if the user clicks to page 21, we have to continue along using start=200, as we have no cursorMark. The only way I can see to get a cursorMark at that point is to omit the start=200, and instead say rows=210, and ignore the first 200 results on the client side. Obviously, this gets more and more inefficient the deeper we page - I know that internally to Solr, using start=200&rows=10 has to do the same work as rows=210, but less data is sent over the wire to the client. As I understand it, the cursorMark is a hash of the sort values of the last document returned, so I don't really see why it is forbidden to specify start=190&rows=10&cursorMark=* - why is it not possible to calculate the nextCursorMark from the last document returned? I was also thinking a possible temporary workaround would be to request start=190&rows=10, note the last document returned, and then make a subsequent query for q=id:"<last id>"&rows=1&cursorMark=*. This seems to work, but means an extra Solr query for no real reason. Is there any other problem to doing this? Is there some other simple trick I am missing that we can use to get both the page of results we want and a nextCursorMark for the subsequent page? Cheers Tom
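To make the workaround concrete, the extra query is something like this (untested sketch; the id value is hypothetical, and this trick only works when sorting on fields rather than score, with exactly the same sort as the paged query including a uniqueKey tiebreaker):

curl 'http://solr01:8983/solr/items/select?q=id:"DOC12345"&rows=1&sort=date_published+desc,id+asc&cursorMark=*'

The nextCursorMark in that response can then be used for the page 21 request.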
Re: Ping handler in SolrCloud mode
On Wed, Mar 16, 2016 at 4:10 PM, Shawn Heisey <apa...@elyograg.org> wrote: > On 3/16/2016 8:14 AM, Tom Evans wrote: >> The problem occurs when we attempt to query a node to see if products >> or items is active on that node. The balancer (haproxy) requests the >> ping handler for the appropriate collection, however all the nodes >> return OK for all the collections(!) >> >> Eg, on node01, it has replicas for products and skus, but the ping >> handler for /solr/items/admin/ping returns 200! > > This returns OK because as long as one replica for every shard in > "items" is available somewhere in the cloud, you can make a request for > "items" on that node and it will work. Or at least it *should* work, > and if it's not working, that's a bug. I remember that one of the older > 4.x versions *did* have a bug where queries for a collection would only > work if the node actually contained shards for that collection. Sorry, this is Solr 5.5, I should have said. Yes, we can absolutely make a request of "items", and it will work correctly. However, we are making requests of "skus" that join to "products", and the query is routed to a node which has only "skus" and "items", and the request fails because joins can only work over local replicas. To fix this, we now have two additional balancers: solr: has all the nodes, all nodes are valid backends solr-items: has all the nodes in the cluster, but nodes are only valid backends if they have "items" and "skus" replicas. solr-products: has all the nodes in the cluster, but nodes are only valid backends if they have "products" and "skus" replicas (I'm simplifying things a bit; there are another 6 collections that are on all nodes, hence the main balancer.) The new balancers need a cheap way of checking which nodes are valid, and ideally I'd like that check to not involve a query with a join clause! Cheers Tom
Re: Ping handler in SolrCloud mode
On Wed, Mar 16, 2016 at 2:14 PM, Tom Evans <tevans...@googlemail.com> wrote: > Hi all > > [ .. ] > > The option I'm trying now is to make two ping handlers for skus that > join to one of items/products, which should fail on the servers which > do not support it, but I am concerned that this is a little > heavyweight for a status check to see whether we can direct requests > at this server or not. This worked. I would still be interested in a lighter-weight approach that doesn't involve joins to check whether a given collection has a shard on this server. I suspect that might require a custom ping handler plugin, however. Cheers Tom
Ping handler in SolrCloud mode
Hi all I have a cloud setup with 8 nodes and 3 collections, products, items and skus. All collections have just one shard, products has 6 replicas, items has 2 replicas, skus has 8 replicas. No node has both products and items, all nodes have skus. Some of our queries join from skus to either products or items. If the query is directed at a node without the appropriate shard on them, we obviously get an error, so we have separate balancers for products and items. The problem occurs when we attempt to query a node to see if products or items is active on that node. The balancer (haproxy) requests the ping handler for the appropriate collection, however all the nodes return OK for all the collections(!) Eg, on node01, it has replicas for products and skus, but the ping handler for /solr/items/admin/ping returns 200! This means that as far as the balancer is concerned, node01 is a valid destination for item queries, and inevitably it blows up as soon as such a query is made to it. As I understand it, this is because the URL we are checking is for the collection ("items") rather than a specific core ("items_shard1_replica1"). Is there a way to make the ping handler only check local shards? I have tried with distrib=false=false, but it still returns a 200. The option I'm trying now is to make two ping handlers for skus that join to one of items/products, which should fail on the servers which do not support it, but I am concerned that this is a little heavyweight for a status check to see whether we can direct requests at this server or not. Cheers Tom
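For the record, a join-based ping handler looks something like this in solrconfig.xml (sketch; the join fields are invented):

<requestHandler name="/admin/ping-products" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">{!join from=sku_id to=id fromIndex=products}*:*</str>
    <str name="rows">0</str>
  </lst>
</requestHandler>

On a node without a local "products" replica the join fails, the handler returns an error, and the balancer marks that backend down.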
mergeFactor/maxMergeDocs is deprecated
Hi all Updating to Solr 5.5.0, and getting these messages in our error log: Beginning with Solr 5.5, <mergeFactor> is deprecated, configure it on the relevant <mergePolicyFactory> instead. Beginning with Solr 5.5, <maxMergeDocs> is deprecated, configure it on the relevant <mergePolicyFactory> instead. However, mergeFactor is only mentioned in commented-out sections of our solrconfig.xml files, and maxMergeDocs is not mentioned at all: $ ack -B 1 -A 1 'mergeFactor' (matches only inside comments) $ ack --all maxMergeDocs (no output) Any ideas? Cheers Tom
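(For anyone who actually does want to set these values: the replacement is the <mergePolicyFactory> element inside <indexConfig>, something like this sketch:

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicyFactory>

But as above, our configs don't set mergeFactor or maxMergeDocs anywhere, hence the confusion.)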
Re: Separating cores from Solr home
Hmm, I've worked around this by setting the directory where the indexes should live to be the actual solr home, and symlinking the files from the current release into that directory, but it feels icky. Any better ideas? Cheers Tom On Thu, Mar 3, 2016 at 11:12 AM, Tom Evans <tevans...@googlemail.com> wrote: > Hi all > > I'm struggling to configure solr cloud to put the index files and > core.properties in the correct places in SolrCloud 5.5. Let me explain > what I am trying to achieve: > > * solr is installed in /opt/solr > * the user who runs solr only has read only access to that tree > * the solr home files - custom libraries, log4j.properties, solr.in.sh > and solr.xml - live in /data/project/solr/releases/, which > is then the target of a symlink /data/project/solr/releases/current > * releasing a new version of the solr home (eg adding/changing > libraries, changing logging options) is done by checking out a fresh > copy of the solr home, switching the symlink and restarting solr > * the solr core.properties and any data live in /data/project/indexes, > so they are preserved when new solr home is released > > Setting core specific dataDir with absolute paths in solrconfig.xml > only gets me part of the way, as the core.properties for each shard is > created inside the solr home. > > This is obviously no good, as when releasing a new version of the solr > home, they will no longer be in the current solr home. > > Cheers > > Tom
Separating cores from Solr home
Hi all I'm struggling to configure solr cloud to put the index files and core.properties in the correct places in SolrCloud 5.5. Let me explain what I am trying to achieve: * solr is installed in /opt/solr * the user who runs solr only has read only access to that tree * the solr home files - custom libraries, log4j.properties, solr.in.sh and solr.xml - live in /data/project/solr/releases/, which is then the target of a symlink /data/project/solr/releases/current * releasing a new version of the solr home (eg adding/changing libraries, changing logging options) is done by checking out a fresh copy of the solr home, switching the symlink and restarting solr * the solr core.properties and any data live in /data/project/indexes, so they are preserved when new solr home is released Setting core specific dataDir with absolute paths in solrconfig.xml only gets me part of the way, as the core.properties for each shard is created inside the solr home. This is obviously no good, as when releasing a new version of the solr home, they will no longer be in the current solr home. Cheers Tom
Re: docValues error
On Mon, Feb 29, 2016 at 11:43 AM, David Santamauro wrote: > You will have noticed below, the field definition does not contain > multiValues=true What version of the schema are you using? In pre-1.1 schemas, multiValued="true" is the default if it is omitted. Cheers Tom
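(For reference, the version is an attribute on the root element of schema.xml, e.g.:

<schema name="example" version="1.5">

From version 1.1 onward, multiValued defaults to false; before that it defaults to true.)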
Re: Json faceting, aggregate numeric field by day?
On Wed, Feb 10, 2016 at 12:13 PM, Markus Jelsma wrote: > Hi Tom - thanks. But judging from the article and SOLR-6348 faceting stats > over ranges is not yet supported. More specifically, SOLR-6352 is what we > would need. > > [1]: https://issues.apache.org/jira/browse/SOLR-6348 > [2]: https://issues.apache.org/jira/browse/SOLR-6352 > > Thanks anyway, at least we found the tickets :) > No problem - as I was reading this I was thinking "But wait, I *know* we do this ourselves for average price vs month published". In fact, I was forgetting that we index the ranges that we will want to facet over as part of the document - so a document with a date_published of "2010-03-29T00:00:00Z" also has a date_published.month of "201003" (and a bunch of other ranges that we want to facet by). The frontend then converts those fields into the appropriate values for display. This might be an acceptable solution for you guys too, depending on how many ranges you require, and how much larger it would make your index. Cheers Tom
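A sketch of what that looks like in practice (our field names; yours would differ). Each document is indexed with the pre-bucketed values alongside the raw date:

{ "id": "123", "date_published": "2010-03-29T00:00:00Z", "date_published.month": "201003", "price": 9.99 }

and the per-bucket averages come from tagging a stats field into a pivot facet (curl -g turns off curl's brace globbing):

curl -g 'http://solr01:8983/solr/items/select?q=*:*&rows=0&stats=true&stats.field={!tag=pr}price&facet=true&facet.pivot={!stats=pr}date_published.month'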
Re: Json faceting, aggregate numeric field by day?
On Wed, Feb 10, 2016 at 10:21 AM, Markus Jelsma wrote: > Hi - if we assume the following simple documents: > > <doc><field name="date">2015-01-01T00:00:00Z</field><field name="value">2</field></doc> > <doc><field name="date">2015-01-01T00:00:00Z</field><field name="value">4</field></doc> > <doc><field name="date">2015-01-02T00:00:00Z</field><field name="value">3</field></doc> > <doc><field name="date">2015-01-02T00:00:00Z</field><field name="value">7</field></doc> > > Can i get a daily average for the field 'value' by day? e.g. > > 2015-01-01: 3.0 > 2015-01-02: 5.0 > > Reading the documentation, i don't think i can, or i am missing it > completely. But i just want to be sure. Yes, you can facet by day, and use the stats component to calculate the mean. This blog post explains it: https://lucidworks.com/blog/2015/01/29/you-got-stats-in-my-facets/ Cheers Tom
fq in SolrCloud
I have a small question about fq in cloud mode that I couldn't find an explanation for in Confluence. If I specify a query with an fq, where is that cached - just on the nodes/replicas that process that specific query, or will it exist on all replicas? We have a subset of queries that specify an expensive join condition in the fq, so that subsequent requests with the same fq won't have to redo the same expensive query, and I was wondering whether we need to ensure that those queries go to the same node when we move to cloud. Cheers Tom
Re: Shard allocation across nodes
Thank you both, those are exactly what I was looking for! If I'm reading it right, if I specify a "-Dvmhost=foo" when starting SolrCloud, and then specify a snitch rule like this when creating the collection: sysprop.vmhost:*,replica:<2 then this would ensure that on each vmhost there is at most one replica. I'm assuming that a shard leader and a replica are both treated as replicas in this scenario. Thanks Tom On Mon, Feb 1, 2016 at 8:34 PM, Erick Erickson <erickerick...@gmail.com> wrote: > See the createNodeSet and node parameters for the Collections API CREATE and > ADDREPLICA commands, respectively. That's more a manual process, there's > nothing OOB but Jeff's suggestion is sound. > > Best, > Erick > > > > On Mon, Feb 1, 2016 at 11:00 AM, Jeff Wartes <jwar...@whitepages.com> wrote: >> >> You could write your own snitch: >> https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement >> >> Or, it would be more annoying, but you can always add/remove replicas >> manually and juggle things yourself after you create the initial collection. >> >> >> >> >> On 2/1/16, 8:42 AM, "Tom Evans" <tevans...@googlemail.com> wrote: >> >>>Hi all >>> >>>We're setting up a solr cloud cluster, and unfortunately some of our >>>VMs may be physically located on the same VM host. Is there a way of >>>ensuring that all copies of a shard are not located on the same >>>physical server? >>> >>>If they do end up in that state, is there a way of rebalancing them? >>> >>>Cheers >>> >>>Tom
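Concretely (untested sketch, names invented), each node gets started with the system property set:

bin/solr start -c -z zk1:2181 -a "-Dvmhost=vmhost01"

and the rule is attached at creation time (%3C is a url-encoded "<"):

curl 'http://solr01:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=4&replicationFactor=2&rule=sysprop.vmhost:*,replica:%3C2'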
Shard allocation across nodes
Hi all We're setting up a solr cloud cluster, and unfortunately some of our VMs may be physically located on the same VM host. Is there a way of ensuring that all copies of a shard are not located on the same physical server? If they do end up in that state, is there a way of rebalancing them? Cheers Tom
SolrCloud, DIH, and XPathEntityProcessor
Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having some problems with a DIH config that attempts to load an XML file and iterate through the nodes in that file: it tries to load the file from disk instead of from zookeeper. The entity config is: <entity dataSource="lookup_conf" rootEntity="false" name="lookups" processor="XPathEntityProcessor" url="lookup_conf.xml" forEach="/lookups/lookup"> The file exists in zookeeper, adjacent to the data_import.conf in the lookups_config conf folder. The exception: 2016-01-12 12:59:47.852 ERROR (Thread-44) [c:lookups s:shard1 r:core_node6 x:lookups_shard1_replica2] o.a.s.h.d.DataImporter Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:417) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:481) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:462) Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233) ... 3 more Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml) at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:62) at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:287) at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225) at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:202) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415) ... 5 more Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml) at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:127) at org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:86) at org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:48) at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:284) ... 10 more Caused by: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml) at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:123) ... 13 more Any hints gratefully accepted Cheers Tom
Re: SolrCloud, DIH, and XPathEntityProcessor
On Tue, Jan 12, 2016 at 3:00 PM, Shawn Heisey <apa...@elyograg.org> wrote: > On 1/12/2016 7:45 AM, Tom Evans wrote: >> That makes no sense whatsoever. DIH loads the data_import.conf from ZK >> just fine, or is that provided to DIH from another module that does >> know about ZK? > > This is accomplished indirectly through a resource loader in the > SolrCore object that is responsible for config files. Also, the > dataimport handler is created by the main Solr code which then hands the > configuration to the dataimport module. DIH itself does not know about > zookeeper. ZkPropertiesWriter seems to know a little... > >> Either way, it is entirely sub-optimal to have SolrCloud store "all" >> its configuration in ZK, but still require manually storing and >> updating files on specific nodes in order to influence DIH. If a >> server is mistakenly not updated, or manually modified locally on >> disk, that node would start indexing documents differently than other >> replicas, which sounds dangerous and scary! > > The entity processor you are using accesses files through a Java > interface for mounted filesystems. As already mentioned, it does not > know about zookeeper. > >> If there is not a ZkFileDataSource, it shouldn't be too tricky to add >> one... I'll see how much I dislike having config files on the host... > > Creating your own DIH class would be the only solution available right now. > > I don't know how useful this would be in practice. Without special > config in multiple places, Zookeeper limits the size of the files it > contains to 1MB. It is not designed to deal with a large amount of data > at once. This is not a large amount of data - it is a 5kB XML file containing configuration of which tables to query for which fields and how to map them into the document. > > You could submit a feature request in Jira, but unless you supply a > complete patch that survives the review process, I do not know how > likely an implementation would be. We've already started an implementation, based around FileDataSource and using SolrZkClient, which we will deploy as an additional library while the feature request is in progress, or if it doesn't survive review. Cheers Tom
Re: SolrCloud, DIH, and XPathEntityProcessor
On Tue, Jan 12, 2016 at 2:32 PM, Shawn Heisey <apa...@elyograg.org> wrote: > On 1/12/2016 6:05 AM, Tom Evans wrote: >> Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having >> some problems with a DIH config that attempts to load an XML file and >> iterate through the nodes in that file: it tries to load the file from >> disk instead of from zookeeper. >> >> <entity dataSource="lookup_conf" rootEntity="false" name="lookups" processor="XPathEntityProcessor" url="lookup_conf.xml" forEach="/lookups/lookup"> >> >> The file exists in zookeeper, adjacent to the data_import.conf in the >> lookups_config conf folder. > > SolrCloud puts all the *config* for Solr into zookeeper, and adds a new > abstraction for indexes (the collection), but other parts of Solr like > DIH are not really affected. The entity processors in DIH cannot > retrieve data from zookeeper. They do not know how. That makes no sense whatsoever. DIH loads the data_import.conf from ZK just fine, or is that provided to DIH from another module that does know about ZK? Either way, it is entirely sub-optimal to have SolrCloud store "all" its configuration in ZK, but still require manually storing and updating files on specific nodes in order to influence DIH. If a server is mistakenly not updated, or manually modified locally on disk, that node would start indexing documents differently than other replicas, which sounds dangerous and scary! If there is not a ZkFileDataSource, it shouldn't be too tricky to add one... I'll see how much I dislike having config files on the host... Cheers Tom
Re: Defining SOLR nested fields
On Sun, Dec 13, 2015 at 6:40 PM, santosh sidnal wrote: > Hi All, > > I want to define nested fields in SOLR using schema.xml. we are using Apache > Solr 4.7.0. > > i see some links which say how to do it, but not sure how i can do it in > schema.xml > https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers > > > any help over here is appreciated. > With nested documents, it is better to not think of them as "children", but as related documents. All the documents in your index will follow exactly the same schema, whether they are "children" or "parents", and the nested aspect of a document simply allows you to restrict your queries based upon that relationship. Solr is extremely efficient dealing with sparse documents (docs with only a few fields defined), so one way is to define all your fields for "parent" and "child" in the schema, and only use the appropriate ones in the right document. Another way is to use a schema-less structure, although I'm not a fan of that for error checking reasons. You can also define a suffix or prefix for fields that you use as part of your methodology, so that you know what domain a field belongs in, but that would just be for your benefit; Solr would not complain if you put a "child" field in a parent or vice-versa. Cheers Tom PS: I would not use Solr 4.7 for this. Nested docs are a new-ish feature, you may encounter bugs that have been fixed in later versions, and performance has certainly been improved in later versions. Faceting on a specific domain (eg, on children or parents) is only supported by the JSON facet API, which was added in 5.2, and the current stable version of Solr is 5.4.
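A minimal sketch of the query side, assuming a doc_type field that marks parent documents (all names invented). Parents matching a child constraint:

q={!parent which="doc_type:parent"}colour:red

and children of matching parents:

q={!child of="doc_type:parent"}brand:acme

Note that which/of must identify the complete set of parent documents in the index, not a per-query filter.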
Moving to SolrCloud, specifying dataDir correctly
Hi all We're currently in the process of migrating our distributed search running on 5.0 to SolrCloud running on 5.4, and setting up a test cluster for performance testing etc. We have several cores/collections, and in each core's solrconfig.xml, we were specifying an empty <dataDir/>, and specifying the same core.baseDataDir in core.properties. When I tried this in SolrCloud mode, specifying "-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine for the first collection, but then the second collection tried to use the same directory to store its index, which obviously failed. I fixed this by changing solrconfig.xml in each collection to specify a specific directory, like so: <dataDir>${solr.data.dir:}products</dataDir> Looking back after the weekend, I'm not a big fan of this. Is there a way to add a core.properties to ZK, or a way to specify core.baseDataDir on the command line, or just a better way of handling this that I'm not aware of? Cheers Tom
Re: Moving to SolrCloud, specifying dataDir correctly
On Mon, Dec 14, 2015 at 1:22 PM, Shawn Heisey <apa...@elyograg.org> wrote: > On 12/14/2015 10:49 AM, Tom Evans wrote: >> When I tried this in SolrCloud mode, specifying >> "-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine >> for the first collection, but then the second collection tried to use >> the same directory to store its index, which obviously failed. I fixed >> this by changing solrconfig.xml in each collection to specify a >> specific directory, like so: >> >> <dataDir>${solr.data.dir:}products</dataDir> >> >> Looking back after the weekend, I'm not a big fan of this. Is there a >> way to add a core.properties to ZK, or a way to specify >> core.baseDataDir on the command line, or just a better way of handling >> this that I'm not aware of? > > Since you're running SolrCloud, just let Solr handle the dataDir, don't > try to override it. It will default to "data" relative to the > instanceDir. Each instanceDir is likely to be in the solr home. > > With SolrCloud, your cores will not contain a "conf" directory (unless > you create it manually), therefore the on-disk locations will be *only* > data, there's not really any need to have separate locations for > instanceDir and dataDir. All active configuration information for > SolrCloud is in zookeeper. > That makes sense, but I guess I was asking the wrong question :) We have our SSDs mounted on /data/solr, which is where our indexes should go, but our solr install is on /opt/solr, with the default solr home in /opt/solr/server/solr. How do we change where the indexes get put so they end up on the fast storage? Cheers Tom
Re: Best way to track cumulative GC pauses in Solr
On Fri, Nov 13, 2015 at 4:50 PM, Walter Underwood wrote: > Also, what GC settings are you using? We may be able to make some suggestions. > > Cumulative GC pauses aren’t very interesting to me. I’m more interested in > the longest ones, 90th percentile, 95th, etc. > Any advice would be great, but what I'm primarily interested in is how people are monitoring these statistics in real time, for all time, on production servers. Eg, for looking at the disk or RAM usage of one of my servers, I can look at the historical usage in the last week, last month, last year and so on. I need to get these stats in to the same monitoring tools as we use for monitoring every other vital aspect of our servers. Looking at log files can be useful, but I don't want to keep arbitrarily large log files on our servers, nor extract data from them; I want to record it for posterity in one system that understands sampling. We already use and maintain our own munin systems, so I'm not interested in paid-for equivalents of munin - regardless of how simple to set up they are, they don't integrate with our other performance monitoring stats, and I would never get budget anyway. So really: 1) Is it OK to turn on JMX monitoring on production systems? The comments in solr.in.sh suggest not. 2) What JMX beans and attributes should I be using to monitor GC pauses, particularly maximum length of a single pause in a period, and the total length of pauses in that period? Cheers Tom
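On question 2, the standard beans to look at are java.lang:type=GarbageCollector,name=* - each collector exposes cumulative CollectionCount and CollectionTime (in ms) attributes, so a munin plugin only needs to diff successive samples. A non-JMX alternative that works from a shell (sketch; substitute the Solr pid):

jstat -gc <pid> 5000

The GCT column is cumulative GC time in seconds, so again the delta between polls gives the time spent in GC.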
Best way to track cumulative GC pauses in Solr
Hi all We have some issues with our Solr servers spending too much time paused doing GC. From turning on gc debug, and extracting numbers from the GC log, we're getting an idea of just how much of a problem it is. I'm currently doing this in a hacky, inefficient way: grep -h 'Total time for which application threads were stopped:' solr_gc* \ | awk '($11 > 0.3) { print $1, $11 }' \ | sed 's#:.*:##' \ | sort -n \ | sum_by_date.py (Yes, I really am using sed, grep and awk all in one line. Just wrong :) The "sum_by_date.py" program simply adds up all the values with the same first column, and remembers the largest value seen. This is giving me the cumulative GC time for extended pauses (over 0.3s), and the maximum pause seen in a given time period (hourly), eg: 2015-11-13T11 119.124037 2.203569 2015-11-13T12 184.683309 3.156565 2015-11-13T13 65.934526 1.978202 2015-11-13T14 63.970378 1.411700 This is fine for seeing that we have a problem. However, really I need to get this into our monitoring systems - we use munin. I'm struggling to work out the best way to extract this information for our monitoring systems, and I think this might be my naivety about Java, and working out what should be logged. I've turned on JMX debugging, and looking at the different beans available using jconsole, but I'm drowning in information. What would be the best thing to monitor? Ideally, like the stats above, I'd like to know the cumulative time spent paused in GC since the last poll, and the longest GC pause that we see. munin polls every 5 minutes, are there suitable counters exposed by JMX that it could extract? Thanks in advance Tom
Re: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id
On Mon, Nov 2, 2015 at 1:38 PM, fabigol wrote: > Thanks. > All works. > I have 2 last questions: > How can I set clean=false by default during an indexation? > > To conclude, I want to understand: > > > Requests: 7 (1/s), Fetched: 452447 (45245/s), Skipped: 0, Processed: 17433 > (1743/s) > > What is "Requests"? > What is "Fetched"? > What is "Processed"? > > Thanks again for your answer > Depends upon how DIH is configured - different things return different numbers. For a SqlEntityProcessor, "Requests" is the number of SQL queries, "Fetched" is the number of rows read from those queries, and "Processed" is the number of documents processed by Solr. > For the second question, i try: > > <str name="clean">false</str> (in defaults) > > and > > <str name="commit">true</str> > <str name="clean">false</str> (in invariants) > Putting things in "invariants" overrides whatever is passed for that parameter in the request parameters. By putting "false" in invariants, you are making it impossible to clean + index as part of DIH, because "clean" is always false. Cheers Tom
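To spell out the first question: put clean=false in "defaults" (not invariants) on the DIH handler in solrconfig.xml, which makes it the default while still allowing clean=true to be passed explicitly for a full rebuild. A sketch:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="clean">false</str>
  </lst>
</requestHandler>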
Re: Checking of Solr Memory and Disk usage
On Fri, Apr 24, 2015 at 8:31 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, So does anyone know what the issue is with the Heap Memory Usage reading showing the value -1? Should I open an issue in Jira? I have solr 4.8.1 and solr 5.0.0 servers; on the solr 4.8.1 servers the core statistics have values for heap memory, on the solr 5.0.0 ones I also see the value -1. This is with CentOS 6/Java 1.7 OpenJDK on both versions. I don't see this issue in the fixed bugs in 5.1.0, but I only looked at the headlines of the tickets... http://lucene.apache.org/solr/5_1_0/changes/Changes.html#v5.1.0.bug_fixes Cheers Tom
Re: Confusing SOLR 5 memory usage
I do apologise for wasting anyone's time on this, the PEBKAC (my keyboard and chair unfortunately). When adding the new server to haproxy, I updated the label for the balancer entry to the new server, but left the host name the same, so the server that wasn't using any RAM... wasn't getting any requests. Again, sorry! Tom On Tue, Apr 21, 2015 at 11:54 AM, Tom Evans tevans...@googlemail.com wrote: We monitor them with munin, so I have charts if attachments are acceptable? Having said that, they have only been running for a day with this memory allocation.. Describing them, the master consistently has 8GB used for apps, the 8GB used in cache, whilst the slave consistently only uses ~1.5GB for apps, 14GB used in cache. We are trying to use our SOLR servers to do a lot more facet queries, previously we were mainly doing searches, and the SolrPerformanceProblems wiki page mentions that faceting (amongst others) require a lot of JVM heap, so I'm confused why it is not using the heap we've allocated on one server, whilst it is on the other server. Perhaps our master server needs even more heap? Also, my infra guy is wondering why I asked him to add more memory to the slave server, if it is just in cache, although I did try to explain that ideally, I'd have even more in cache - we have about 35GB of index data. Cheers Tom On Tue, Apr 21, 2015 at 11:25 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - what do you see if you monitor memory over time? You should see a typical saw tooth. Markus -Original message- From:Tom Evans tevans...@googlemail.com Sent: Tuesday 21st April 2015 12:22 To: solr-user@lucene.apache.org Subject: Confusing SOLR 5 memory usage Hi all I have two SOLR 5 servers, one is the master and one is the slave. They both have 12 cores, fully replicated and giving identical results when querying them. The only difference between configuration on the two servers is that one is set to slave from the other - identical core configs and solr.in.sh. They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are setting the heap size identically: SOLR_JAVA_MEM=-Xms512m -Xmx7168m The two servers are balanced behind haproxy, and identical numbers and types of queries flow to both servers. Indexing only happens once a day. When viewing the memory usage of the servers, the master server's JVM has 8.8GB RSS, but the slave only has 1.2GB RSS. Can someone hit me with the cluebat please? :) Cheers Tom
Re: Confusing SOLR 5 memory usage
We monitor them with munin, so I have charts if attachments are acceptable? Having said that, they have only been running for a day with this memory allocation... Describing them, the master consistently has 8GB used for apps and 8GB used in cache, whilst the slave consistently uses only ~1.5GB for apps, with 14GB used in cache. We are trying to use our SOLR servers to do a lot more facet queries; previously we were mainly doing searches, and the SolrPerformanceProblems wiki page mentions that faceting (amongst other things) requires a lot of JVM heap, so I'm confused why it is not using the heap we've allocated on one server, whilst it is on the other server. Perhaps our master server needs even more heap? Also, my infra guy is wondering why I asked him to add more memory to the slave server, if it is just in cache, although I did try to explain that ideally, I'd have even more in cache - we have about 35GB of index data. Cheers Tom On Tue, Apr 21, 2015 at 11:25 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - what do you see if you monitor memory over time? You should see a typical saw tooth. Markus -Original message- From:Tom Evans tevans...@googlemail.com Sent: Tuesday 21st April 2015 12:22 To: solr-user@lucene.apache.org Subject: Confusing SOLR 5 memory usage Hi all I have two SOLR 5 servers, one is the master and one is the slave. They both have 12 cores, fully replicated and giving identical results when querying them. The only difference between configuration on the two servers is that one is set to slave from the other - identical core configs and solr.in.sh. They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are setting the heap size identically: SOLR_JAVA_MEM="-Xms512m -Xmx7168m" The two servers are balanced behind haproxy, and identical numbers and types of queries flow to both servers. Indexing only happens once a day. When viewing the memory usage of the servers, the master server's JVM has 8.8GB RSS, but the slave only has 1.2GB RSS. Can someone hit me with the cluebat please? :) Cheers Tom
Confusing SOLR 5 memory usage
Hi all I have two SOLR 5 servers, one is the master and one is the slave. They both have 12 cores, fully replicated and giving identical results when querying them. The only difference between configuration on the two servers is that one is set to slave from the other - identical core configs and solr.in.sh. They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are setting the heap size identically: SOLR_JAVA_MEM="-Xms512m -Xmx7168m" The two servers are balanced behind haproxy, and identical numbers and types of queries flow to both servers. Indexing only happens once a day. When viewing the memory usage of the servers, the master server's JVM has 8.8GB RSS, but the slave only has 1.2GB RSS. Can someone hit me with the cluebat please? :) Cheers Tom
Re: Setting up SOLR 5 from an RPM
On Wed, Mar 25, 2015 at 2:40 PM, Shawn Heisey apa...@elyograg.org wrote: I think you will only need to change the ownership of the solr home and the location where the .war file is extracted, which by default is server/solr-webapp. The user must be able to *read* the program data, but should not need to write to it. If you are using the start script included with Solr 5 and one of the examples, I believe the logging destination will also be located under the solr home, but you should make sure that's the case. Thanks Shawn, this sort of makes sense. The thing which I cannot seem to do is change the location where the war file is extracted. I think this is probably because, as of solr 5, I am not supposed to know or be aware that there is a war file, or that the war file is hosted in jetty, which makes it tricky to specify the jetty temporary directory. Our use case is that we want to create a single system image that would be usable for several projects, each project would check out its solr home and run solr as their own user (possibly on the same server). Eg, /data/projectA being a solr home for one project, /data/projectB being a solr home for another project, both running solr from the same location. Also, on a dev server, I want to install solr once, and each member of my team run it from that single location. Because they cannot change the temporary directory, and they cannot all own server/solr-webapp, this does not work and they must each have their own copy of the solr install. I think the way we will go for this is in production to run all our solr instance as the solr user, who will own the files in /opt/solr, and have their solr home directory wherever they choose. In dev, we will just do something... Cheers Tom
Re: Setting up SOLR 5 from an RPM
On Tue, Mar 24, 2015 at 4:00 PM, Tom Evans tevans...@googlemail.com wrote: Hi all We're migrating to SOLR 5 (from 4.8), and our infrastructure guys would prefer we installed SOLR from an RPM rather than extracting the tarball where we need it. They are creating the RPM file themselves, and it installs an init.d script and the equivalent of the tarball to /opt/solr. We're having problems running SOLR from the installed files, as SOLR wants to (I think) extract the WAR file and create various temporary files below /opt/solr/server. From the SOLR 5 reference guide, section Managing SOLR, sub-section Taking SOLR to production, it seems changing the ownership of the installed files to the user that will run SOLR is an explicit requirement if you do not wish to run as root. It would be better if this was not required. With most applications you do not normally require permission to modify the installed files in order to run the application, eg I do not need write permission to /usr/share/vim to run vim, it is a shame I need write permission to /opt/solr to run solr. Cheers Tom
Setting up SOLR 5 from an RPM
Hi all We're migrating to SOLR 5 (from 4.8), and our infrastructure guys would prefer we installed SOLR from an RPM rather than extracting the tarball where we need it. They are creating the RPM file themselves, and it installs an init.d script and the equivalent of the tarball to /opt/solr. We're having problems running SOLR from the installed files, as SOLR wants to (I think) extract the WAR file and create various temporary files below /opt/solr/server. We currently have this structure: /data/solr - root directory of our solr instance /data/solr/{logs,run} - log/run directories /data/solr/cores - configuration for our cores and solr.in.sh /opt/solr - the RPM installed solr 5 The user running solr can modify anything under /data/solr, but nothing under /opt/solr. Is this sort of configuration supported? Am I missing some variable in our solr.in.sh that sets where temporary files can be extracted? We currently set: SOLR_PID_DIR=/data/solr/run SOLR_HOME=/data/solr/cores SOLR_LOGS_DIR=/data/solr/logs Cheers Tom
Determining which field caused a document to not be imported
Hi all I recently rewrote our SOLR 4.8 dataimport to read from a set of denormalised DB tables, in an attempt to increase full indexing speed. When I tried it out however, indexing broke telling me that java.lang.Long cannot be cast to java.lang.Integer (full stack below, with the document elided). From googling, this tends to be some field that is being selected out as a long, where it should probably be cast as a string. Unfortunately, our documents have some 400+ fields and over 100 entities; is there another way to determine which field could not be cast from Long to Integer other than disabling each integer field in turn? Cheers Tom Exception while processing: variant document : SolrInputDocument(fields: [(removed)]): org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:63) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:246) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:477) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:331) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:239) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:464) Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer at java.lang.Integer.compareTo(Integer.java:52) at java.util.TreeMap.getEntry(TreeMap.java:346) at java.util.TreeMap.get(TreeMap.java:273) at org.apache.solr.handler.dataimport.SortedMapBackedCache.iterator(SortedMapBackedCache.java:147) at org.apache.solr.handler.dataimport.DIHCacheSupport.getIdCacheData(DIHCacheSupport.java:179) at org.apache.solr.handler.dataimport.DIHCacheSupport.getCacheData(DIHCacheSupport.java:145) at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:129) at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243) ... 10 more
Re: Determining which field caused a document to not be imported
On Fri, Oct 3, 2014 at 2:24 PM, Shawn Heisey apa...@elyograg.org wrote: Can you give us the entire stacktrace, with complete details from any caused by sections? Also, is this 4.8.0 or 4.8.1? Thanks Shawn, this is SOLR 4.8.1 and here is the full traceback from the log: 95191 [Thread-21] INFO org.apache.solr.update.processor.LogUpdateProcessor – [products] webapp=/products path=/dataimport-from-denorm params={id=2148732optimize=falseclean=falseindent=truecommit=trueverbose=falsecommand=full-importdebug=falsewt=json} status=0 QTime=32 {} 0 32 95199 [Thread-21] ERROR org.apache.solr.handler.dataimport.DataImporter – Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:278) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:464) Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:418) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:331) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:239) ... 3 more Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:63) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:246) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:477) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:503) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416) ... 5 more Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long at java.lang.Long.compareTo(Long.java:50) at java.util.TreeMap.getEntry(TreeMap.java:346) at java.util.TreeMap.get(TreeMap.java:273) at org.apache.solr.handler.dataimport.SortedMapBackedCache.iterator(SortedMapBackedCache.java:147) at org.apache.solr.handler.dataimport.DIHCacheSupport.getIdCacheData(DIHCacheSupport.java:179) at org.apache.solr.handler.dataimport.DIHCacheSupport.getCacheData(DIHCacheSupport.java:145) at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:129) at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243) ... 10 more 95199 [Thread-21] INFO org.apache.solr.update.UpdateHandler – start rollback{} I've tracked it down to a single entity now that selects some content out of the database and then looks up other fields using that data from sub-entities that have SortedMapBackedCache caching in use, but I'm still not sure how to fix it. 
Eg, the original entity selects out country_id, which is then used by this entity: <entity dataSource="products" name="country_lookup" query="SELECT lk_country.id AS xid, IF(LENGTH(english), CAST(english AS CHAR), description) AS country FROM lk_country INNER JOIN nl_strings ON lk_country.description_sid=nl_strings.id" cacheKey="xid" cacheLookup="product.country_id" cacheImpl="SortedMapBackedCache"> <field column="country" name="country"/> </entity> I tried converting the selected data to SIGNED INTEGER, eg CONVERT(country_id, SIGNED INTEGER) AS country_id, but this did not have the desired effect. The source database is mysql, the source column for country_id is `country_id` smallint(6) NOT NULL default '0'. Again, I'm not 100% sure that it is even the country field that causes this; there are several SortedMapBackedCache sub-entities (but they are all analogous to this one). Thanks in advance Tom
Re: Determining which field caused a document to not be imported
On Fri, Oct 3, 2014 at 3:13 PM, Tom Evans tevans...@googlemail.com wrote: I tried converting the selected data to SIGNED INTEGER, eg CONVERT(country_id, SIGNED INTEGER) AS country_id, but this did not have the desired effect. However, changing them to be cast to CHAR changed the error message - java.lang.Integer cannot be cast to java.lang.String. I guess this is saying that the type of the map key must match the type of the key coming from the parent entity (which is logical), so I guess my question is - what SQL type do I need to select out to get a java.lang.Integer, to match what the map is expecting? Cheers Tom
Re: Determining which field caused a document to not be imported
On Fri, Oct 3, 2014 at 3:24 PM, Tom Evans tevans...@googlemail.com wrote: On Fri, Oct 3, 2014 at 3:13 PM, Tom Evans tevans...@googlemail.com wrote: I tried converting the selected data to SIGNED INTEGER, eg CONVERT(country_id, SIGNED INTEGER) AS country_id, but this did not have the desired effect. However, changing them to be cast to CHAR changed the error message - java.lang.Integer cannot be cast to java.lang.String. I guess this is saying that the type of the map key must match the type of the key coming from the parent entity (which is logical), so I guess my question is - what SQL type do I need to select out to get a java.lang.Integer, to match what the map is expecting? I rewrote the query for the map, which was doing strange casts itself (integer to integer casts). This then meant that the values from the parent query were the same type as those in the map query, and no funky casts are required anywhere. However, I still don't have a way to determine which field is failing when indexing fails like this; it would be neat to have one for future debugging. Cheers Tom
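For the archives, the shape of the fix (sketch; assumes lk_country.id is a plain INT column, matching product.country_id): select both sides of the cache relationship without casts, e.g.

query="SELECT lk_country.id AS xid, ... FROM lk_country ..."

so that cacheKey and cacheLookup both arrive as java.lang.Integer, and SortedMapBackedCache's TreeMap compares like with like.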