Is Solr ready for nested document importing and querying?
Hi, I'm using Solr and I'm starting to index my database. I work for a book seller, and we have a lot of different publications (i.e., different editions from different publishers) of the same book, so I was wondering if it would be wise to model this schema using a hierarchical approach (with nested docs). For example:

    {
      "title": "The Hobbit",
      "author": "J. R. R. Tolkien",
      "publications": [
        { "isbn": 9780007591855, "price": 0.99, "pages": 200 },
        { "isbn": 9780007497904, "price": 4.00, "pages": 230 }
      ]
    }

And another question: how can I achieve this with the data-import-handler? I found this: https://issues.apache.org/jira/browse/SOLR-5147 (I'm using Solr 5.3) and I was able to index the data, but I cannot retrieve the publication values inside a book. What do you think? Or is it better to forget about nested documents and go back to the old-fashioned denormalized approach? Thanks. []'s Rafael
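For illustration, a minimal SolrJ sketch of what nested-document indexing and a block-join query can look like (the field names, including the content_type parent marker, and the core URL are assumptions, not Rafael's actual schema):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class NestedBooks {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/books");

            // Parent document: the work itself
            SolrInputDocument book = new SolrInputDocument();
            book.addField("id", "hobbit");
            book.addField("content_type", "book"); // marks parents for block joins
            book.addField("title", "The Hobbit");

            // Child document: one concrete publication
            SolrInputDocument pub = new SolrInputDocument();
            pub.addField("id", "hobbit-9780007591855");
            pub.addField("isbn", "9780007591855");
            pub.addField("price", 0.99);
            book.addChildDocument(pub);

            client.add(book);
            client.commit();

            // Parents whose children match, with the children attached via the
            // [child] doc transformer (available since Solr 4.9)
            SolrQuery q = new SolrQuery("{!parent which=\"content_type:book\"}price:[* TO 1.0]");
            q.setFields("*", "[child parentFilter=content_type:book]");
            System.out.println(client.query(q).getResults());
            client.close();
        }
    }

The [child] transformer is one answer to "I cannot retrieve the publication values inside a book": without it, Solr returns the matching parents flat, with no children attached.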
Data Import Handler use of JNDI decayed
NLM tends to be rather security conscious. Nothing appears terribly wrong, but the Solr layout doesn't include Jetty's start.ini or jetty.xml, so it will have to be done the detailed way - https://wiki.eclipse.org/Jetty/Feature/JNDI#Detailed_Setup Once I've figured it out, I'll request wiki edit permissions to add it in. Dan Davis, Systems/Applications Architect (Contractor), Office of Computer and Communications Systems, National Library of Medicine, NIH
Re: Search opening hours
So thanks to the tireless efforts of David Smiley and the devs at Vivid Solutions (not to mention the various contributors that help power Solr and Lucene), spatial search is awesome, efficient and easy. The biggest roadblock I've run into is not having the JTS (Java Topology Suite) JAR where Solr can find it. It doesn't ship with Solr OOB, so you have to either add it to one of the dynamic directories or bundle it with the WAR (pre-5.0, I think). The link above has most of what you need to index data and issue queries. I'd also suggest the sections on spatial search in Solr in Action (Grainger, Potter) - they add a few more use cases that I've found interesting. Finally, the aging wiki has some good info too: http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4

Basically, indexing spatial data is as easy as anything else: define the field in the schema, create the data and push it in. Now the data in this case are boxes or polygons (effectively the same here) and come in a specific format known as WKT, or Well-Known Text: https://en.wikipedia.org/wiki/Well-known_text. I'd say unless you're aiming at an advanced use case, set the max dist error on the field config a little higher than normal - precision isn't really a requirement here, and good unit tests would alert you to any unforeseen issues. Then for the query side of the world you just ask for point inclusion like:

    q=polygon:"Contains(POINT(my_long my_lat))"

Please note that WKT puts longitude before latitude because it follows Cartesian geometry conventions (so X=longitude and Y=latitude). Can't tell you how many times my brain hurt thanks to this idiom combined with janky client logic :) Anyway, that's about it - let me know if you have any other questions.

On Wed, Aug 26, 2015 at 1:56 PM, O. Klein kl...@octoweb.nl wrote: Darren, This sounds like the solution I'm looking for. Especially a nice fix for the Sunday-Monday problem. Never worked with spatial search before, so any pointers are welcome. Will start working on this solution. -- Darren
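A SolrJ sketch of the same point-in-polygon query (the field name hours_poly and the core URL are assumptions; it presumes a location_rpt-style field type backed by JTS, as described above):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class PointInPolygon {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/stores");
            // WKT order is X Y, i.e. longitude before latitude
            SolrQuery q = new SolrQuery("hours_poly:\"Contains(POINT(-73.98 40.75))\"");
            System.out.println(client.query(q).getResults().getNumFound() + " docs contain the point");
            client.close();
        }
    }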
Re: StrDocValues
Hello Jamie, Check here: https://github.com/apache/lucene-solr/blob/7f721a1f9323a85ce2b5b35e12b4788c31271b69/lucene/sandbox/src/java/org/apache/lucene/search/DocValuesRangeQuery.java#L185 Note that the SortedSet variant works there even if the actual field is multiValued=false. On Wed, Aug 26, 2015 at 8:48 PM, Jamie Johnson jej2...@gmail.com wrote: Are there any example implementations showing how StrDocValues works? I am not sure if this is the right place or not, but I was thinking about having some document-level doc value that I'd like to read in a function query to affect whether the document is returned or not. Am I barking up the right tree looking at this, or is there another method to support this? -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
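For a feel of the underlying primitive, a minimal Lucene 5.x-era sketch (the field name and value are hypothetical) of reading a per-document string doc value - the kind of check a custom function query or collector would build on:

    import java.io.IOException;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.SortedDocValues;
    import org.apache.lucene.util.BytesRef;

    public class DocValuesPeek {
        // Called per segment, e.g. from inside a ValueSource or custom collector
        static boolean isVisible(LeafReader reader, int docId) throws IOException {
            SortedDocValues dv = reader.getSortedDocValues("auth_token"); // hypothetical field
            if (dv == null) {
                return false; // this segment has no values for the field
            }
            BytesRef value = dv.get(docId); // Lucene 5.x API; Lucene 7+ uses advanceExact(docId)
            return "public".equals(value.utf8ToString());
        }
    }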
Re: Connect and sync two solr server
From the description, this is straightforward SolrCloud where you have replicas on the separate machines, see: https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud A different way of accomplishing this would be the master/slave style, see: https://cwiki.apache.org/confluence/display/solr/Index+Replication Best, Erick On Wed, Aug 26, 2015 at 6:55 AM, shahper shahper.ja...@techblue.co.uk wrote: Hi, I want to connect two SolrCloud servers and sync their indexes to each other, so that if any server is down we can work with the other, and whenever I update or add to the index on any server the other also gets updated. shahper
StrDocValues
Are there any example implementations showing how StrDocValues works? I am not sure if this is the right place or not, but I was thinking about having some document-level doc value that I'd like to read in a function query to affect whether the document is returned or not. Am I barking up the right tree looking at this, or is there another method to support this?
Securing Solr 5.3 with Basic Authentication
With version 5.3, Solr has full-featured authentication and authorization plugins that use Basic authentication and "permission rules" which are completely driven from ZooKeeper. So I have tried that, without success, following the info in https://cwiki.apache.org/confluence/display/solr/Securing+Solr and http://lucidworks.com/blog/securing-solr-basic-auth-permission-rules. These are the steps I followed:

1) Set up a ZooKeeper ensemble (3 nodes).

2) Uploaded the file security.json to ZooKeeper, using this command:

    zkcli.bat -zkhost localhost:2181 -cmd putfile /security.json security.json

Content of the file security.json:

    {
      "authentication": {
        "class": "solr.BasicAuthPlugin",
        "credentials": {
          "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
        }
      },
      "authorization": {
        "class": "solr.RuleBasedAuthorizationPlugin",
        "user-role": { "solr": "admin" },
        "permissions": [{ "name": "security-edit", "role": "admin" }]
      }
    }

I also tried with this security.json content:

    {"authentication":{"class":"solr.BasicAuthPlugin"},"authorization":{"class":"solr.RuleBasedAuthorizationPlugin"}}

3) Started Solr 5.3.0 in cloud mode (with 'bootstrap'), using this command:

    ./solr start -c -z localhost:2181,localhost:2182,localhost:2183 -s ../server/solrcloud_test -Dbootstrap_confdir=../server/solrcloud_test/configsets/basic_configs/conf -Dcollection.configName=c_test_cfg -f

However, I can access http://localhost:8983/solr directly and the browser doesn't ask me for credentials. In the Solr Admin UI I can see /security.json (with the correct content) and even c_test_cfg under /configs. I can see this in the log when Solr starts:

    955 INFO (main) [ ] o.a.s.c.CoreContainer Security conf doesn't exist. Skipping setup for authorization module.
    955 INFO (main) [ ] o.a.s.c.CoreContainer No authentication plugin used.

Can anybody tell me what I'm doing wrong?
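Once the plugins do load, a SolrJ sketch of authenticating a request (per-request credentials landed in SolrJ around the 5.3 timeframe, so check your exact version; the collection name is hypothetical, and solr/SolrRocks are the well-known defaults from the documentation above):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class BasicAuthQuery {
        public static void main(String[] args) throws Exception {
            CloudSolrClient client = new CloudSolrClient("localhost:2181,localhost:2182,localhost:2183");
            client.setDefaultCollection("c_test"); // hypothetical collection name
            QueryRequest req = new QueryRequest(new SolrQuery("*:*"));
            req.setBasicAuthCredentials("solr", "SolrRocks"); // default credentials from the docs
            QueryResponse rsp = req.process(client);
            System.out.println(rsp.getResults().getNumFound());
            client.close();
        }
    }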
Re: Search opening hours
Darren, This sounds like the solution I'm looking for. Especially a nice fix for the Sunday-Monday problem. Never worked with spatial search before, so any pointers are welcome. Will start working on this solution.
IOException, ConnectionTimeout Error while searching
Hello, I indexed 2 million documents, and after indexing completed I tried searching. It throws an IOException and a connection timeout error:

    error:{
      msg: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://192.168.1.25:8983/solr/col_ner_shard1_replica1,
      trace: org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://192.168.1.25:8983/solr/col_ner_shard1_replica1
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:337)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
        at
Re: Behavior of grouping on a field with same value spread across shards.
That should be the case. Best, Erick On Tue, Aug 25, 2015 at 8:55 PM, Modassar Ather modather1...@gmail.com wrote: Thanks Erick, I saw the link. So is it that the grouping functionality works fine in distributed search except for the two cases mentioned in the link? Regards, Modassar On Tue, Aug 25, 2015 at 10:40 PM, Erick Erickson erickerick...@gmail.com wrote: That's not really the case. Perhaps you're confusing group.ngroups and group.facet with just grouping? See the ref guide: https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats Best, Erick On Tue, Aug 25, 2015 at 4:51 AM, Modassar Ather modather1...@gmail.com wrote: Hi, As per my understanding, to group on a field, all documents with the same value in the field have to be in the same shard. Can we group by a field where documents with the same value in that field are distributed across shards? Please let me know the limitations, missing features, or performance issues for such fields. Thanks, Modassar
Solr 5.2.1 versus Solr 4.7.0 performance
Hello, We have benchmarked a set of queries on Solr 4.7.0 and 5.2.1 (with the same data and the same solrconfig.xml) and saw better query performance on Solr 4.7.0 (5-15% better than 5.2.1, with an exception of a 100% improvement for one of the queries), using the same JVM (IBM 1.7) and JVM params. The index size is ~500G, spread over 64 shards, with replication factor 2. Do you know of any config / setup change for Solr 5.2.1 that can improve the performance? Any idea what causes this behavior? Thank you, Esther
Re: Tokenizers and DelimitedPayloadTokenFilterFactory
Sure, I think it's fine to raise a JIRA, especially if you can include a patch, even a preliminary one to solicit feedback... which I'll leave to people who are more familiar with that code... I'm not sure how generally useful this would be, and if it comes at a cost to normal searching there's sure to be lively discussion. Best, Erick

On Tue, Aug 25, 2015 at 7:50 PM, Jamie Johnson jej2...@gmail.com wrote: Looks like I have something basic working for Trie fields. I am doing exactly what I said in my previous email, so good news there. I think this is a big step, as there are only a few field types left that I need to support, those being date (should be similar to Trie) and spatial fields, which at a glance looked like they provide a way to supply the token stream through an extension. Definitely need to look more though. All of this said, is this really the right way to get payloads into these types of fields? Should a JIRA feature request be added for this?

On Aug 25, 2015 8:13 PM, Jamie Johnson jej2...@gmail.com wrote: Right, I had assumed (obviously here is my problem) that I'd be able to specify payloads for the field regardless of the field type. Looking at TrieField, that is certainly non-trivial. After a bit of digging it appears that if I wanted to do something here I'd need to build a new TrieField, override createField and provide a Field that would return something like NumericTokenStream but also provide the payloads. Like you said, sounds interesting to say the least... Were payloads not really intended to be used for these types of fields from a Lucene perspective?

On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson erickerick...@gmail.com wrote: Well, you're going down a path that hasn't been trodden before ;). If you can treat your primitive types as text types you might get some traction, but that makes a lot of operations like numeric comparison difficult. Hmmm. Another idea from left field: for single-valued types, what about a sidecar field that has the auth token? And even for a multiValued field, two parallel fields are guaranteed to maintain order, so perhaps you could do something here. Yes, I'm waving my hands a LOT here. I suspect that trying to have a custom type that incorporates payloads for, say, trie fields will be interesting to say the least. Numeric types are packed to save storage etc., so it'll be an adventure... Best, Erick

On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson jej2...@gmail.com wrote: We were originally using this approach, i.e. run things through the KeywordTokenizer - DelimitedPayloadFilter - WordDelimiterFilter. Again this works fine for text, though I had wanted to use the StandardTokenizer in the chain. Is there an equivalent filter that does what the StandardTokenizer does? All of this said, this doesn't address the issue of the primitive field types, which at this point is the bigger issue. Given this use case, should there be another way to provide payloads? My current thinking is that I will need to provide custom implementations for all of the field types I would like to support payloads on, which will essentially be copies of the standard versions with some extra sugar to read/write the payloads (I don't see a way to wrap/delegate these at this point because AttributeSource has the attribute retrieval related methods as final, so I can't simply wrap another tokenizer and return my added attributes + the wrapped attributes).

I know my use case is a bit strange, but I had not expected to need to do this given that Lucene/Solr supports payloads on these field types, they just aren't exposed. As always I appreciate any ideas if I'm barking up the wrong tree here. On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma markus.jel...@openindex.io wrote: Well, if I remember correctly (I have no testing facility at hand), WordDelimiterFilter maintains payloads on emitted sub terms. So if you use a KeywordTokenizer, input 'some text^PAYLOAD', and have a DelimitedPayloadFilter, the entire string gets a payload. You can then split that string up again into individual tokens. It is possible to abuse WordDelimiterFilter for this because it has a types parameter that you can use to split on whitespace if its input is not trimmed. Otherwise you can use any other character instead of a space as your input. This is a crazy idea, but it might work. -Original message- From: Jamie Johnson jej2...@gmail.com Sent: Tuesday 25th August 2015 19:37 To: solr-user@lucene.apache.org Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory To be clear, we are using payloads as a way to attach authorizations to individual tokens within Solr. The payloads are normal Solr payloads, though we are not using floats, we are using the identity payload encoder
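A minimal Lucene 5.x-era sketch of the first two stages Markus describes (KeywordTokenizer feeding a DelimitedPayloadTokenFilter with the identity encoder; the WordDelimiterFilter re-splitting step would follow in the same chain):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.KeywordTokenizer;
    import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
    import org.apache.lucene.analysis.payloads.IdentityEncoder;

    public class PayloadAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // The whole field value becomes one token, e.g. "some text^PAYLOAD"
            Tokenizer source = new KeywordTokenizer();
            // Everything after '^' is attached to the token as its payload, byte-for-byte
            TokenStream sink = new DelimitedPayloadTokenFilter(source, '^', new IdentityEncoder());
            return new TokenStreamComponents(source, sink);
        }
    }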
Re: how to index document with multiple words (phrases) and words permutation?
Simon, Thanks a lot, that is a great tool. I am trying to use it. Great solution.
Re: Search opening hours
Sorry - didn't finish my thought. I need to address querying :) So, using the above to define what's in the index, your queries for a day/time become a CONTAINS operation against the field. Let's say that the field is defined as a location_rpt using JTS and its spatial factory (which supports polygons) - oh, and it would need to be multi-valued. Querying the field would require first translating "now" or "in an hour" or "Monday at 9am" to a geocode, then hitting the index with a CONTAINS request per the docs: https://cwiki.apache.org/confluence/display/solr/Spatial+Search

On Wed, Aug 26, 2015 at 11:23 AM, Darren Spehr darre...@gmail.com wrote: Sure - and sorry for its density. I reread it and thought the same ;) So imagine a polygon of say 1/2 mile width (I made that up) that stretches around the equator. Let's call this a week's timeline and subdivide it into 7 blocks, one for each day. For the sake of simplicity assume it's a line (which, if I recall, is supported in Solr as an infinitely small polygon) starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for Sunday at 11:59 PM. By subdivide you can think of it either radially or by longitude, but you have 360 degrees to divide into 7 days, which means that every hour is represented by a range of roughly 2.143 degrees (360/7/24). These regions represent each day and hour (or less), and the region boundaries represent midnight for the day before. Now for indexing - your open hours then become a combination of these subdivisions. If you're open 24x7 then the whole polygon is indexed. If you're only open on Monday from 9-5 then only the polygon between (0,-160.7) and (0,-143.57) is indexed. With careful attention to detail you can index any combination of times this way. So now the varsity question is how to do this with a fluctuating calendar? I think this example can be extended to include searching against any given day of the week in a year, or years. Just imagine a translation layer that adjusts the latitude N or S by some amount to represent which day in which year you're looking for. Make sense?

On Wed, Aug 26, 2015 at 10:50 AM, Upayavira u...@odoko.co.uk wrote: "delightfully dense" = really intriguing, but I couldn't quite understand it - really hoping for more info On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote: Darren, That was delightfully dense. Do you think you could unpack it a bit more? Possibly some sample (pseudo) queries? Upayavira On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote: If you wanted to try a spatial approach that blended times like above, you could try a polygon of minimum width that spans the globe - this is literally using spatial search (geocodes) against time. So in this scenario you logically subdivide the polygon into 7 distinct regions (for days) and then within this you can define, like a timeline, what open and closed means. The problem of 3AM is taken care of because of its continuous nature - i.e. one day is adjacent to the next, with Sunday and Monday backing up to each other. Just a thought. On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote: On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote: Those options don't fix my problem with closing times the next morning, or is there a way to do this? Use the spatial model, and a time window of a week. There are 10,080 minutes in a week, so you could use that as your scale.
Assuming the week starts at 00:00 Monday morning, you might index Monday 9:00-23:00 as 540:1380. Tuesday 9am-Wednesday 1am would be 1980:2940. You convert your NOW time into minutes since Monday 00:00 and do a spatial search within that time. If it is now Monday, 11:23am, that would be 11*60+23=683, so you would do a search for 683:683. If you have a shop that is open over Sunday night into Monday, you just list it as open until Sunday 23:59 and open again Monday 00:00. Would that do it? Upayavira -- Darren
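A small sketch of Darren's week-to-longitude mapping under the stated assumptions (the week spans 360 degrees of longitude starting at Monday 00:00 = -180):

    public class WeekToLongitude {
        static final double DEG_PER_HOUR = 360.0 / (7 * 24); // roughly 2.143

        // day: 0 = Monday ... 6 = Sunday; hour may be fractional
        static double longitudeFor(int day, double hour) {
            return -180.0 + (day * 24 + hour) * DEG_PER_HOUR;
        }

        public static void main(String[] args) {
            // Monday 9:00-17:00 -> about -160.71 to -143.57, matching the example above
            System.out.printf("%.2f to %.2f%n", longitudeFor(0, 9), longitudeFor(0, 17));
        }
    }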
Re: Exact substring search with ngrams
The analysis tab does not support multi-valued fields. It only analyses a single field value.

On Wed, Aug 26, 2015, at 05:05 PM, Erick Erickson wrote: bq: "my dog has fleas" I wouldn't want some variant of "og ha" to match. Here's where the mysterious positionIncrementGap comes in. If you make this field multiValued, and index it like this:

    <doc>
      <field name="blah">my dog</field>
      <field name="blah">has fleas</field>
    </doc>

or equivalently in SolrJ just

    doc.addField("blah", "my dog");
    doc.addField("blah", "has fleas");

then the position of "dog" will be 2 and the position of "has" will be 102, assuming the positionIncrementGap is the default 100. N.B. I'm not sure whether you'll see this in the admin/analysis page or not. Anyway, now your example won't match across the two parts unless you specify a slop up in the 101 range. Best, Erick

On Wed, Aug 26, 2015 at 2:19 AM, Christian Ramseyer r...@networkz.ch wrote: On 26/08/15 00:24, Erick Erickson wrote: Hmmm, this sounds like a nonsensical question, but what do you mean by arbitrary substring? Because if your substrings consist of whole _tokens_, then ngramming is totally unnecessary (and gets in the way). Phrase queries with no slop fulfill this requirement. But let's assume you need to match within tokens, i.e. if the doc contains "my dog has fleas", you need to match input like "as fle"; in this case ngramming is an option. Yeah, the "as fle" thing is exactly what I want to achieve. You have substantially different index and query time chains. The result is that the offsets for all the grams at index time are the same - in the quick experiment I tried, all were 1. But at query time, each gram had an incremented position. I'd start by using the query time analysis chain for indexing also. Next, I'd try enclosing multiple words in double quotes at query time and go from there. What you have now is an anti-pattern, in that having substantially different index and query time analysis chains is not something that's likely to be very predictable unless you know _exactly_ what the consequences are. The admin/analysis page is your friend; in this case check the verbose checkbox to see what I mean. Hmm, interesting. I had the additional \R tokenizer in the index chain because the document can be multiple lines (but the search text is always a single line), and if the document was "my dog has fleas" I wouldn't want some variant of "og ha" to match, but I didn't realize it didn't give me any positions like you noticed. I'll try to experiment some more, thanks for the hints! Chris

Best, Erick On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer r...@networkz.ch wrote: Hi, I'm trying to build an index for technical documents that basically works like grep, i.e. the user gives an arbitrary substring somewhere in a line of a document and the exact matches will be returned. I specifically want no stemming etc. and keep all whitespace, parentheses etc. because they might be significant. The only normalization is that the search should be case-insensitive. I tried to achieve this by tokenizing on line breaks, and then building trigrams of the individual lines:

    <fieldType name="configtext_trigram" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="\R" group="-1"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Then in the search, I use the edismax parser with mm=100%, so given the documents

    {"id":"test1","content":" encryption 10.0.100.22 description "}
    {"id":"test2","content":" 10.100.0.22 description "}

and the query content:encryption, this will turn into

    parsedquery_toString: +((content:enc content:ncr content:cry content:ryp content:ypt content:pti content:tio content:ion)~8)

and return only the first document. All fine and dandy. But I have a problem with possible false positives. If the search is e.g. content:.100.22 then the generated query will be

    parsedquery_toString: +((content:.10 content:100 content:00. content:0.2 content:.22)~5)

and because all of these tokens are also generated for document test2, in the proximity of 5, both documents will wrongly be returned. So somehow I'd need to express the query content:.10 content:100 content:00. content:0.2 content:.22 with *the tokens exactly in this order and nothing in between*. Is this somehow possible, maybe by using the termvectors/termpositions stuff? Or am I trying to do something that's fundamentally impossible? Other good ideas how to
Re: re:New Solr installation fails to create core
Hi Scott, How about, having logged in as a privileged user, running create_core as the solr user? Something like this on a Red Hat env: sudo -u solr ./bin/solr create_core -c demo KR Henry
Re: Exact substring search with ngrams
bq: "my dog has fleas" I wouldn't want some variant of "og ha" to match. Here's where the mysterious positionIncrementGap comes in. If you make this field multiValued, and index it like this:

    <doc>
      <field name="blah">my dog</field>
      <field name="blah">has fleas</field>
    </doc>

or equivalently in SolrJ just

    doc.addField("blah", "my dog");
    doc.addField("blah", "has fleas");

then the position of "dog" will be 2 and the position of "has" will be 102, assuming the positionIncrementGap is the default 100. N.B. I'm not sure whether you'll see this in the admin/analysis page or not. Anyway, now your example won't match across the two parts unless you specify a slop up in the 101 range. Best, Erick

On Wed, Aug 26, 2015 at 2:19 AM, Christian Ramseyer r...@networkz.ch wrote: On 26/08/15 00:24, Erick Erickson wrote: Hmmm, this sounds like a nonsensical question, but what do you mean by arbitrary substring? Because if your substrings consist of whole _tokens_, then ngramming is totally unnecessary (and gets in the way). Phrase queries with no slop fulfill this requirement. But let's assume you need to match within tokens, i.e. if the doc contains "my dog has fleas", you need to match input like "as fle"; in this case ngramming is an option. Yeah, the "as fle" thing is exactly what I want to achieve. You have substantially different index and query time chains. The result is that the offsets for all the grams at index time are the same - in the quick experiment I tried, all were 1. But at query time, each gram had an incremented position. I'd start by using the query time analysis chain for indexing also. Next, I'd try enclosing multiple words in double quotes at query time and go from there. What you have now is an anti-pattern, in that having substantially different index and query time analysis chains is not something that's likely to be very predictable unless you know _exactly_ what the consequences are. The admin/analysis page is your friend; in this case check the verbose checkbox to see what I mean. Hmm, interesting. I had the additional \R tokenizer in the index chain because the document can be multiple lines (but the search text is always a single line), and if the document was "my dog has fleas" I wouldn't want some variant of "og ha" to match, but I didn't realize it didn't give me any positions like you noticed. I'll try to experiment some more, thanks for the hints! Chris Best, Erick

On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer r...@networkz.ch wrote: Hi, I'm trying to build an index for technical documents that basically works like grep, i.e. the user gives an arbitrary substring somewhere in a line of a document and the exact matches will be returned. I specifically want no stemming etc. and keep all whitespace, parentheses etc. because they might be significant. The only normalization is that the search should be case-insensitive.
I tried to achieve this by tokenizing on line breaks, and then building trigrams of the individual lines:

    <fieldType name="configtext_trigram" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="\R" group="-1"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Then in the search, I use the edismax parser with mm=100%, so given the documents

    {"id":"test1","content":" encryption 10.0.100.22 description "}
    {"id":"test2","content":" 10.100.0.22 description "}

and the query content:encryption, this will turn into

    parsedquery_toString: +((content:enc content:ncr content:cry content:ryp content:ypt content:pti content:tio content:ion)~8)

and return only the first document. All fine and dandy. But I have a problem with possible false positives. If the search is e.g. content:.100.22 then the generated query will be

    parsedquery_toString: +((content:.10 content:100 content:00. content:0.2 content:.22)~5)

and because all of these tokens are also generated for document test2, in the proximity of 5, both documents will wrongly be returned. So somehow I'd need to express the query content:.10 content:100 content:00. content:0.2 content:.22 with *the tokens exactly in this order and nothing in between*. Is this somehow possible, maybe by using the termvectors/termpositions stuff? Or am I trying to do something that's fundamentally impossible? Any other good ideas how to achieve this kind of behaviour? Thanks Christian
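A SolrJ sketch of Erick's double-quoting suggestion (the core URL is an assumption; it presumes the index chain has been switched to the query-time NGramTokenizer chain as he recommends, so that grams get consecutive positions on both sides and the quoted input becomes a phrase query over adjacent grams):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ExactSubstringQuery {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/configs");
            // Inside quotes, the analyzer emits .10 100 00. 0.2 .22 at consecutive
            // positions, and the resulting phrase query requires exactly that sequence
            SolrQuery q = new SolrQuery("content:\".100.22\"");
            System.out.println(client.query(q).getResults());
            client.close();
        }
    }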
Re: New Solr installation fails to create collection/core
Deviantcode, did you look at the referenced JIRA: https://issues.apache.org/jira/browse/SOLR-7826 Or is that irrelevant? Best, Erick On Wed, Aug 26, 2015 at 1:58 AM, deviantcode hnoclel...@gmail.com wrote: I ran into this exact problem trying out the latest Solr [5.3.0]. @Scott, how did you fix it? KR Henry
Re: StrDocValues
I think I found it. {!boost..} gave me what I was looking for, and then a custom collector filtered out anything that I didn't want to show. On Wed, Aug 26, 2015 at 1:48 PM, Jamie Johnson jej2...@gmail.com wrote: Are there any example implementations showing how StrDocValues works? I am not sure if this is the right place or not, but I was thinking about having some document-level doc value that I'd like to read in a function query to affect whether the document is returned or not. Am I barking up the right tree looking at this, or is there another method to support this?
Re: StrDocValues
I don't see it explicitly mentioned, but does the boost only get applied to the final documents/score that matched the provided query, or is it called for each field that matched? I'm assuming only once per document that matched the main query - is that right? On Wed, Aug 26, 2015 at 5:35 PM, Jamie Johnson jej2...@gmail.com wrote: I think I found it. {!boost..} gave me what I was looking for, and then a custom collector filtered out anything that I didn't want to show. On Wed, Aug 26, 2015 at 1:48 PM, Jamie Johnson jej2...@gmail.com wrote: Are there any example implementations showing how StrDocValues works? I am not sure if this is the right place or not, but I was thinking about having some document-level doc value that I'd like to read in a function query to affect whether the document is returned or not. Am I barking up the right tree looking at this, or is there another method to support this?
Re: find documents based on specific term frequency
: Is there a way to search for documents that have a word appearing more : than a certain number of times? For example, I want to find documents : that only have more than 10 instances of the word genetics … Try...

    q=text:genetics&fq={!frange+incl=false+l=10}termfreq('text','genetics')

Note: the q=text:genetics isn't necessary -- you could do any query and then filter on the numeric function range of the termfreq() function, or use that {!frange} as your main query (in which case all matching docs will have identical scores). I just included that in the example to show how you can search & sort by the normal style scoring (which takes into account full TF-IDF and length normalization) while filtering on the TF using a function query. You can also request the termfreq() as a pseudo field for each doc in the results, and parameterize the details to eliminate redundancy in the request params...

    ...&fq={!frange+incl=false+l=10+v=$tf}&fl=*,$tf&tf=termfreq('text','genetics')

is the same as...

    ...&fq={!frange+incl=false+l=10}termfreq('text','genetics')&fl=*,termfreq('text','genetics')

A big caveat to this, however, is that the termfreq function operates on the *RAW* underlying term values -- no query time analyzer is used -- so if you do stemming or lowercasing in your index analyzer, you have to pass the stemmed/lowercased values to the function. (Although I just filed SOLR-7981, since it occurs to me we can make this automatic in the future with a new function argument.) https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser https://cwiki.apache.org/confluence/display/solr/Function+Queries -Hoss http://www.lucidworks.com/
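A SolrJ sketch of Hoss's filter approach (the core URL is an assumption; note his caveat about raw terms if the index analyzer stems or lowercases):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class TermFreqFilter {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/docs");
            SolrQuery q = new SolrQuery("text:genetics"); // normal scored query
            q.addFilterQuery("{!frange incl=false l=10}termfreq('text','genetics')"); // tf > 10
            q.setFields("*", "termfreq('text','genetics')"); // expose the tf per doc
            System.out.println(client.query(q).getResults());
            client.close();
        }
    }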
find documents based on specific term frequency
Hi there, We have an index built on Solr 5.0. We received a user question: "Is there a way to search for documents that have a word appearing more than a certain number of times? For example, I want to find documents that only have more than 10 instances of the word genetics …" I'm not sure if it's possible to do this with Solr. Does anyone know? Rebecca Tang Applications Developer, UCSF CKM Industry Documents Digital Libraries E: rebecca.t...@ucsf.edu
Re: best way for adding a new field to all indexed documents...
Sadly, it's always a problem: http://searchivarius.org/blog/how-rename-fields-solr On Wed, Aug 26, 2015 at 11:20 AM, Roxana Danger roxana.dan...@reedonline.co.uk wrote: Hello, I have an index created with Solr, and I would like to add a new field to all the documents of the index. I suppose I could a) use an updateRequestHandler, or b) create another index, importing the data from the initial index plus the data for my new field. Which would be the best approach? Will the background processing re-index the documents? Thank you very much in advance, Roxana -- Roxana Danger | Data Scientist Dragon Court, 27-29 Macklin Street, London, WC2B 5LX Tel: 020 7067 4568 reed.co.uk - The UK's #1 job site. http://www.reed.co.uk/ -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Search opening hours
Have a look at the links that Alexandre mentioned - a somewhat non-obvious style of solution, because you'd probably not think of spatial features while dealing with opening times, but it's worth having a look. -Stefan On Wednesday, August 26, 2015 at 10:16 AM, O. Klein wrote: Thank you for responding. Yonik's solution is what I had in mind. Was hoping for something more elegant, as he said, but it will work. The thing I haven't figured out is how to deal with closing times early the next morning. So it's 22:00 now and opening hours are 20:00 to 03:00. Can this be done with either or both approaches?
Re: Tokenizers and DelimitedPayloadTokenFilterFactory
Thanks again Erick. I created https://issues.apache.org/jira/browse/SOLR-7975, though I didn't attach a patch because my current implementation is not generally useful right now; it meets my use case but likely would not meet others'. I will try to look at generalizing this to allow something custom to be plugged in. On Aug 26, 2015 2:46 AM, Erick Erickson erickerick...@gmail.com wrote: Sure, I think it's fine to raise a JIRA, especially if you can include a patch, even a preliminary one to solicit feedback... which I'll leave to people who are more familiar with that code... I'm not sure how generally useful this would be, and if it comes at a cost to normal searching there's sure to be lively discussion. Best, Erick

On Tue, Aug 25, 2015 at 7:50 PM, Jamie Johnson jej2...@gmail.com wrote: Looks like I have something basic working for Trie fields. I am doing exactly what I said in my previous email, so good news there. I think this is a big step, as there are only a few field types left that I need to support, those being date (should be similar to Trie) and spatial fields, which at a glance looked like they provide a way to supply the token stream through an extension. Definitely need to look more though. All of this said, is this really the right way to get payloads into these types of fields? Should a JIRA feature request be added for this? On Aug 25, 2015 8:13 PM, Jamie Johnson jej2...@gmail.com wrote: Right, I had assumed (obviously here is my problem) that I'd be able to specify payloads for the field regardless of the field type. Looking at TrieField, that is certainly non-trivial. After a bit of digging it appears that if I wanted to do something here I'd need to build a new TrieField, override createField and provide a Field that would return something like NumericTokenStream but also provide the payloads. Like you said, sounds interesting to say the least... Were payloads not really intended to be used for these types of fields from a Lucene perspective?

On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson erickerick...@gmail.com wrote: Well, you're going down a path that hasn't been trodden before ;). If you can treat your primitive types as text types you might get some traction, but that makes a lot of operations like numeric comparison difficult. Hmmm. Another idea from left field: for single-valued types, what about a sidecar field that has the auth token? And even for a multiValued field, two parallel fields are guaranteed to maintain order, so perhaps you could do something here. Yes, I'm waving my hands a LOT here. I suspect that trying to have a custom type that incorporates payloads for, say, trie fields will be interesting to say the least. Numeric types are packed to save storage etc., so it'll be an adventure... Best, Erick

On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson jej2...@gmail.com wrote: We were originally using this approach, i.e. run things through the KeywordTokenizer - DelimitedPayloadFilter - WordDelimiterFilter. Again this works fine for text, though I had wanted to use the StandardTokenizer in the chain. Is there an equivalent filter that does what the StandardTokenizer does? All of this said, this doesn't address the issue of the primitive field types, which at this point is the bigger issue. Given this use case, should there be another way to provide payloads?

My current thinking is that I will need to provide custom implementations for all of the field types I would like to support payloads on, which will essentially be copies of the standard versions with some extra sugar to read/write the payloads (I don't see a way to wrap/delegate these at this point because AttributeSource has the attribute retrieval related methods as final, so I can't simply wrap another tokenizer and return my added attributes + the wrapped attributes). I know my use case is a bit strange, but I had not expected to need to do this given that Lucene/Solr supports payloads on these field types, they just aren't exposed. As always I appreciate any ideas if I'm barking up the wrong tree here. On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma markus.jel...@openindex.io wrote: Well, if I remember correctly (I have no testing facility at hand), WordDelimiterFilter maintains payloads on emitted sub terms. So if you use a KeywordTokenizer, input 'some text^PAYLOAD', and have a DelimitedPayloadFilter, the entire string gets a payload. You can then split that string up again into individual tokens. It is possible to abuse WordDelimiterFilter for this because it has a types parameter that you can use to split on whitespace if its input is not trimmed. Otherwise you can use any other character instead of a space as your input. This is a crazy idea,
Re: Search opening hours
On Tue, Aug 25, 2015, at 10:54 PM, Yonik Seeley wrote: On Tue, Aug 25, 2015 at 5:02 PM, O. Klein kl...@octoweb.nl wrote: I'm trying to find the best way to search for stores that are open NOW. It's probably not the *best* way, but assuming it's currently 4:10pm, you could do +open:[* TO 1610] +close:[1610 TO *] And to account for days of the week have different fields for each day openM, closeM, openT, closeT, etc... not super elegant, but seems to get the job done. So, the basic question is what does now mean? If it is 5:29pm and a shop closes at 5:30pm, does that count as open? If you want to query a single time within a range, then Yonik's approach will work (although I'd use open0 to open6 for the days of the week). If you want to find a range within another range, then use what Alexandre suggested - spatial search functionality. For example, you could say, is the shop open for 10 minutes either side of now. Of course, you could use spatial for a time within a range, and it might be a little more elegant because you can use a multivalued field to specify the open/close ranges for your store. Upayavira
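A SolrJ sketch of Yonik's per-day-fields variant (the field names openM/closeM come from his example; 1610 is 4:10pm encoded as HHmm, and the core URL is an assumption):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class OpenNowQuery {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/stores");
            int now = 1610; // 4:10pm as HHmm
            SolrQuery q = new SolrQuery("+openM:[* TO " + now + "] +closeM:[" + now + " TO *]");
            System.out.println(client.query(q).getResults().getNumFound() + " stores open now");
            client.close();
        }
    }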
Re: Please answer my question on StackOverflow ... Best approach to guarantee commits in SOLR
On 25/08/2015 13:21, Simer P wrote: http://stackoverflow.com/questions/32138845/what-is-the-best-approach-to-guarantee-commits-in-apache-solr *Question:* How can I get guaranteed commits with Apache Solr, where persisting data to disk and visibility are both equally important? *Background:* We have a website which requires high-end search functionality for machine learning and also requires guaranteed commits for financial transactions. We just want to use Solr as our only datastore to keep things simple and *do not* want to use another database on the side. I can't seem to find any answer to this question. The simplest solution for a financial transaction seems to be to periodically query Solr for the record after it has been persisted, but this can have a longer wait time; or is there a better solution? Can anyone please suggest a solution for achieving guaranteed commits with Solr? Firstly, if you're asking here, you're likely to be answered here, not on Stack Overflow. A search engine is not a database. Although both Solr and Elasticsearch are often used as primary stores with varying degrees of success, they are, after all, search engines and designed for that use. Cheers Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Behavior of grouping on a field with same value spread across shards.
Thanks Erick. On Wed, Aug 26, 2015 at 12:11 PM, Erick Erickson erickerick...@gmail.com wrote: That should be the case. Best, Erick On Tue, Aug 25, 2015 at 8:55 PM, Modassar Ather modather1...@gmail.com wrote: Thanks Erick, I saw the link. So is it that the grouping functionality works fine in distributed search except for the two cases mentioned in the link? Regards, Modassar On Tue, Aug 25, 2015 at 10:40 PM, Erick Erickson erickerick...@gmail.com wrote: That's not really the case. Perhaps you're confusing group.ngroups and group.facet with just grouping? See the ref guide: https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats Best, Erick On Tue, Aug 25, 2015 at 4:51 AM, Modassar Ather modather1...@gmail.com wrote: Hi, As per my understanding, to group on a field, all documents with the same value in the field have to be in the same shard. Can we group by a field where documents with the same value in that field are distributed across shards? Please let me know the limitations, missing features, or performance issues for such fields. Thanks, Modassar
best way for adding a new field to all indexed documents...
Hello, I have an index created with Solr, and I would like to add a new field to all the documents of the index. I suppose I could a) use an updateRequestHandler, or b) create another index, importing the data from the initial index plus the data for my new field. Which would be the best approach? Will the background processing re-index the documents? Thank you very much in advance, Roxana -- Roxana Danger | Data Scientist Dragon Court, 27-29 Macklin Street, London, WC2B 5LX Tel: 020 7067 4568 reed.co.uk - The UK's #1 job site. http://www.reed.co.uk/
Re: New Solr installation fails to create collection/core
I ran into this exact problem trying out the latest Solr [5.3.0]. @Scott, how did you fix it? KR Henry
Re: Hash of solr documents
Yes, it's an XY problem :) We are making the first tests to split our shard (Solr 5.1). The problem we have is this: the number of documents indexed in the new shards is lower than in the original one (19814 and 19653 vs. 61100), and always the same. We have no idea why Solr is doing this. A problem with some documents, with the segment? A long time after we changed from standalone Solr to SolrCloud, we found that the router parameter in clusterstate.json was incorrect: we wanted compositeId but it was set to explicit. The solution was deleting clusterstate.json and restarting Solr. And we are thinking that maybe the problem with the SPLIT is related to that: some documents are stored with the hash value and others not, and SPLIT needs that to distribute them. But I know that this likely has nothing to do with the SPLIT problem; it's only an idea. This is the log; all seems to be normal:

    INFO - 2015-08-26 09:13:47.654; org.apache.solr.handler.admin.CoreAdminHandler; Invoked split action for core: buscon
    INFO - 2015-08-26 09:13:47.656; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
    INFO - 2015-08-26 09:13:47.656; org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes. Skipping IW.commit.
    INFO - 2015-08-26 09:13:47.657; org.apache.solr.core.SolrCore; SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
    INFO - 2015-08-26 09:13:47.657; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
    INFO - 2015-08-26 09:13:47.658; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partitions=2 segments=1
    INFO - 2015-08-26 09:13:47.922; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #0 partitionCount=2 range=0-3fff
    INFO - 2015-08-26 09:13:47.922; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #0 partitionCount=2 range=0-3fff segment #0 segmentCount=1
    INFO - 2015-08-26 09:22:19.533; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #1 partitionCount=2 range=4000-7fff
    INFO - 2015-08-26 09:22:19.536; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #1 partitionCount=2 range=4000-7fff segment #0 segmentCount=1
    INFO - 2015-08-26 09:30:44.141; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={targetCore=buscon_shard2_0_replica1&targetCore=buscon_shard2_1_replica1&action=SPLIT&core=buscon&wt=javabin&qt=/admin/cores&version=2} status=0 QTime=1016486
    INFO - 2015-08-26 09:30:44.387; org.apache.solr.handler.admin.CoreAdminHandler; Applying buffered updates on core: buscon_shard2_0_replica1
    INFO - 2015-08-26 09:30:44.387; org.apache.solr.handler.admin.CoreAdminHandler; No buffered updates available. core=buscon_shard2_0_replica1
    INFO - 2015-08-26 09:30:44.388; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={name=buscon_shard2_0_replica1&action=REQUESTAPPLYUPDATES&wt=javabin&qt=/admin/cores&version=2} status=0 QTime=2
    INFO - 2015-08-26 09:30:44.441; org.apache.solr.handler.admin.CoreAdminHandler; Applying buffered updates on core: buscon_shard2_1_replica1
    INFO - 2015-08-26 09:30:44.441; org.apache.solr.handler.admin.CoreAdminHandler; No buffered updates available. core=buscon_shard2_1_replica1
    INFO - 2015-08-26 09:30:44.441; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={name=buscon_shard2_1_replica1&action=REQUESTAPPLYUPDATES&wt=javabin&qt=/admin/cores&version=2} status=0 QTime=0
    INFO - 2015-08-26 09:30:44.743; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4)

Thanks, David

From: Anshum Gupta ans...@anshumgupta.net To: solr-user@lucene.apache.org Date: 26/08/2015 10:27 Subject: Re: Hash of solr documents Hi David, The route key itself is indexed, but not the hash value. Why do you need to know and display the hash value? This seems like an XY problem to me: http://people.apache.org/~hossman/#xyproblem On Wed, Aug 26, 2015 at 1:17 AM, david.dav...@correo.aeat.es wrote: Hi, I have read in a post on the Internet that the hash SolrCloud calculates over the key field, to send each document to a different shard, is indexed. Is this true? If true, is there any way to show this hash for each document? Thanks, David -- Anshum Gupta
Re: Exact substring search with ngrams
On 26/08/15 00:24, Erick Erickson wrote: Hmmm, this sounds like a nonsensical question, but what do you mean by arbitrary substring? Because if your substrings consist of whole _tokens_, then ngramming is totally unnecessary (and gets in the way). Phrase queries with no slop fulfill this requirement. But let's assume you need to match within tokens, i.e. if the doc contains "my dog has fleas", you need to match input like "as fle"; in this case ngramming is an option. Yeah, the "as fle" thing is exactly what I want to achieve. You have substantially different index and query time chains. The result is that the offsets for all the grams at index time are the same - in the quick experiment I tried, all were 1. But at query time, each gram had an incremented position. I'd start by using the query time analysis chain for indexing also. Next, I'd try enclosing multiple words in double quotes at query time and go from there. What you have now is an anti-pattern, in that having substantially different index and query time analysis chains is not something that's likely to be very predictable unless you know _exactly_ what the consequences are. The admin/analysis page is your friend; in this case check the verbose checkbox to see what I mean. Hmm, interesting. I had the additional \R tokenizer in the index chain because the document can be multiple lines (but the search text is always a single line), and if the document was "my dog has fleas" I wouldn't want some variant of "og ha" to match, but I didn't realize it didn't give me any positions like you noticed. I'll try to experiment some more, thanks for the hints! Chris Best, Erick

On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer r...@networkz.ch wrote: Hi, I'm trying to build an index for technical documents that basically works like grep, i.e. the user gives an arbitrary substring somewhere in a line of a document and the exact matches will be returned. I specifically want no stemming etc. and keep all whitespace, parentheses etc. because they might be significant. The only normalization is that the search should be case-insensitive. I tried to achieve this by tokenizing on line breaks, and then building trigrams of the individual lines:

    <fieldType name="configtext_trigram" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="\R" group="-1"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="3"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Then in the search, I use the edismax parser with mm=100%, so given the documents

    {"id":"test1","content":" encryption 10.0.100.22 description "}
    {"id":"test2","content":" 10.100.0.22 description "}

and the query content:encryption, this will turn into

    parsedquery_toString: +((content:enc content:ncr content:cry content:ryp content:ypt content:pti content:tio content:ion)~8)

and return only the first document. All fine and dandy. But I have a problem with possible false positives. If the search is e.g. content:.100.22 then the generated query will be

    parsedquery_toString: +((content:.10 content:100 content:00. content:0.2 content:.22)~5)

and because all of these tokens are also generated for document test2, in the proximity of 5, both documents will wrongly be returned. So somehow I'd need to express the query content:.10 content:100 content:00. content:0.2 content:.22 with *the tokens exactly in this order and nothing in between*.
Is this somehow possible, maybe by using the termvectors/termpositions stuff? Or am I trying to do something that's fundamentally impossible? Any other good ideas how to achieve this kind of behaviour? Thanks Christian
Re: Search opening hours
Thank you for responding. Yonik's solution is what I had in mind. Was hoping for something more elegant, as he said, but it will work. The thing I haven't figured out is how to deal with closing times early the next morning. So it's 22:00 now and opening hours are 20:00 to 03:00. Can this be done with either or both approaches?
Hash of solr documents
Hi, I have read in a post on the Internet that the hash SolrCloud calculates over the key field, to send each document to a different shard, is indexed. Is this true? If true, is there any way to show this hash for each document? Thanks, David
Re: Search opening hours
Those options don't fix my problem with closing times the next morning, or is there a way to do this?
Re: Search opening hours
On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote: Those options don't fix my problem with closing times the next morning, or is there a way to do this? Use the spatial model, and a time window of a week. There are 10,080 minutes in a week, so you could use that as your scale. Assuming the week starts at 00:00 Monday morning, you might index Monday 9:00-23:00 as 540:1380. Tuesday 9am-Wednesday 1am would be 1980:2940. You convert your NOW time into minutes since Monday 00:00 and do a spatial search within that time. If it is now Monday, 11:23am, that would be 11*60+23=683, so you would do a search for 683:683. If you have a shop that is open over Sunday night into Monday, you just list it as open until Sunday 23:59 and open again Monday 00:00. Would that do it? Upayavira
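A small sketch of the NOW-to-minutes conversion Upayavira describes (Monday 00:00 as the origin of the 10,080-minute week):

    public class MinuteOfWeek {
        // 0 = Monday 00:00; maximum is 10079 = Sunday 23:59
        static int minuteOfWeek(java.time.LocalDateTime t) {
            return (t.getDayOfWeek().getValue() - 1) * 24 * 60 + t.getHour() * 60 + t.getMinute();
        }

        public static void main(String[] args) {
            // Monday, Aug 24 2015, 11:23am -> 683, matching the example above
            System.out.println(minuteOfWeek(java.time.LocalDateTime.of(2015, 8, 24, 11, 23)));
        }
    }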
Re: Hash of solr documents
Hi David, The route key itself is indexed, but not the hash value. Why do you need to know and display the hash value? This seems like an XY problem to me: http://people.apache.org/~hossman/#xyproblem On Wed, Aug 26, 2015 at 1:17 AM, david.dav...@correo.aeat.es wrote: Hi, I have read in one post in the Internet that the hash Solr Cloud calculates over the key field to send each document to a different shard is indexed. Is this true? If true, is there any way to show this hash for each document? Thanks, David -- Anshum Gupta
Re: Solr performance is slow with just 1GB of data indexed
On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote: Now I've tried increasing carrot.fragSize to 75 and carrot.summarySnippets to 2, and setting carrot.produceSummary to true. With this setting, I'm mostly able to get the cluster results back within 2 to 3 seconds when I set rows=200. I'm still checking whether the cluster labels are ok, but in theory do you think this is a suitable setting to attempt to improve the clustering results and at the same time improve the performance? I don't know - the quality/performance point, as well as which knobs to tweak, is extremely dependent on your corpus and your hardware. A person with a better understanding of Carrot might be able to do better sanity checking, but I am not at all at that level. Relatedly, it seems to me that the question of how to tweak the clustering has little to do with Solr and a lot to do with Carrot (assuming here that Carrot is the bottleneck). You might have more success asking in a Carrot forum? - Toke Eskildsen, State and University Library, Denmark
Re: splitting shards on 4.7.2 with custom plugins
Hi, So I got the shards to split, but they are very unbalanced. There are 7204922 total docs in the original collection:

shard1_0 numdocs 3661699
shard1_1 numdocs 3543132
shard2_0 numdocs 0
shard2_1 numdocs 0

Any ideas? This is what I had to do to get this to split with the custom libs. I got shard1 to split successfully, and it created replicas on the other servers in the cloud for the new shards. This is the gist of it: when you split a shard, Solr creates 2 new cores, and when creating a core it uses the solr/solr.xml settings for the classpath etc. This is why searches etc. work fine and can find the OPA plugins, but when we called SPLITSHARD it could not. I had to move the custom jars outside of the collection directory and add this to solr/solr.xml on the 4 nodes (info here: https://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond):

<solr>
  <str name="sharedLib">${sharedLib:../lib}</str>
</solr>

When you restart, you can see the jars being loaded from the new location in the log:

INFO - 2015-08-25 23:40:52.297; org.apache.solr.core.CoreContainer; loading shared library: /opt/solr/solr-4.7.2/solr01/solr/../lib
INFO - 2015-08-25 23:40:52.298; org.apache.solr.core.SolrResourceLoader; Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/commons-pool-1.6.jar' to classloader
INFO - 2015-08-25 23:40:52.298; org.apache.solr.core.SolrResourceLoader; Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/query-processing-language-0.2-SNAPSHOT.jar' to classloader
INFO - 2015-08-25 23:40:52.299; org.apache.solr.core.SolrResourceLoader; Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-continuation-8.1.10.v20130312.jar' to classloader
INFO - 2015-08-25 23:40:52.301; org.apache.solr.core.SolrResourceLoader; Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/groovy-all-2.0.4.jar' to classloader
INFO - 2015-08-25 23:40:52.302; org.apache.solr.core.SolrResourceLoader; Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/qpl-solr472-0.2-SNAPSHOT.jar' to classloader
INFO - 2015-08-25 23:40:52.302; org.apache.solr.core.SolrResourceLoader; Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-jmx-8.1.10.v20130312.jar' to classloader
INFO - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader; Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-deploy-8.1.10.v20130312.jar' to classloader
INFO - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader; Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/ext/' to classloader
INFO - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader; Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-xml-8.1.10.v20130312.jar' to classloader

I then ran the split and checked on it in the morning:

http://dj01.aws.narasearch.us:8981/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1

It succeeded and created replicas.
ls /opt/solr/solr-4.7.2/solr0*/solr/

/opt/solr/solr-4.7.2/solr01/solr/:
bin  collection1  collection1_shard1_0_replica1  collection1_shard1_1_replica1  README.txt  solr.xml  zoo.cfg

/opt/solr/solr-4.7.2/solr02/solr/:
bin  collection1  README.txt  solr.xml  zoo.cfg

/opt/solr/solr-4.7.2/solr03/solr/:
bin  collection1  collection1_shard1_0_replica2  README.txt  solr.xml  zoo.cfg

/opt/solr/solr-4.7.2/solr04/solr/:
bin  collection1  collection1_shard1_1_replica2  README.txt  solr.xml  zoo.cfg

And it actually distributed the data:

[root@dj01 solr]# du -sh *
4.0K  bin
41G   collection1
18G   collection1_shard1_0_replica1
16G   collection1_shard1_1_replica1
4.0K  README.txt
4.0K  solr.xml
4.0K  zoo.cfg
[root@dj01 solr]# du -sh /opt/solr/solr-4.7.2/solr04/solr/collection1_shard1_1_replica2
16G   /opt/solr/solr-4.7.2/solr04/solr/collection1_shard1_1_replica2
[root@dj01 solr]# du -sh /opt/solr/solr-4.7.2/solr03/solr/collection1_shard1_0_replica2
18G   /opt/solr/solr-4.7.2/solr03/solr/collection1_shard1_0_replica2

Jeff Courtade M: 240.507.6116

On Aug 25, 2015 11:09 PM, Anshum Gupta ans...@anshumgupta.net wrote:
Can you elaborate a bit more on the setup: what do the custom plugins do, and what error do you get? It seems like a classloader/classpath issue to me, which doesn't really relate to shard splitting.

On Tue, Aug 25, 2015 at 7:59 PM, Jeff Courtade courtadej...@gmail.com wrote:
I am getting failures when trying to split shards on Solr 4.7.2 with custom plugins. It fails regularly; it cannot find the jar files for the plugins when creating the new cores/shards. Ideas? -- Thanks, Jeff Courtade M: 240.507.6116

-- Anshum Gupta
Re: Search opening hours
Darren, That was delightfully dense. Do you think you could unpack it a bit more? Possibly with some sample (pseudo) queries? Upayavira
Re: Search opening hours
delightfully dense = really intriguing, but I couldn't quite understand it - really hoping for more info.
Re: Search opening hours
If you wanted to try a spatial approach that blended times like above, you could try a polygon of minimum width that spans the globe - this is literally using spatial search (geocodes) against time. In this scenario you logically subdivide the polygon into 7 distinct regions (for days), and within each of those you can define, like a timeline, what open and closed mean. The problem of 3AM is taken care of by its continuous nature - i.e. one day is adjacent to the next, with Sunday and Monday backing up to each other. Just a thought. -- Darren
Re: Solr performance is slow with just 1GB of data indexed
Hi Toke, Thank you for the link. I'm using Solr 5.2.1, but I think the bundled Carrot2 will be a slightly older version, as I'm using the latest carrot2-workbench-3.10.3, which was only released recently. I've changed all the settings like fragSize and desiredClusterCountBase to be the same on both sides, and I'm now able to get very similar cluster results. Now I've tried increasing carrot.fragSize to 75 and carrot.summarySnippets to 2, and setting carrot.produceSummary to true. With these settings I'm mostly able to get the cluster results back within 2 to 3 seconds when I set rows=200. I'm still checking whether the cluster labels are OK, but in theory do you think these are suitable settings for improving the clustering results and the performance at the same time? Regards, Edwin

On 26 August 2015 at 13:58, Toke Eskildsen t...@statsbiblioteket.dk wrote:
On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote:
I'm currently trying out the Carrot2 Workbench and getting it to call Solr, to see how they did the clustering. Although it still takes some time to do the clustering, the results of the clusters are much better than mine. I think it's probably due to different settings, like fragSize and desiredClusterCountBase?

Either that, or the Carrot2 bundled with Solr is an older version.

By the way, the link in the clustering example is not working, as it says 'Page Not Found'. That is because it is too long for a single line. Try copy-pasting it: https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration - Toke Eskildsen, State and University Library, Denmark
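For reference, the carrot.* settings discussed here can be passed per request. A minimal SolrJ sketch; the handler name /clustering, the URL, and the collection name are assumptions based on the stock Solr example config, not from this thread:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    // Sketch of a clustering request with the parameters from this thread.
    public class ClusteringQuery {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("some query");
            q.setRequestHandler("/clustering"); // assumed handler name
            q.setRows(200);
            q.set("carrot.produceSummary", true);
            q.set("carrot.fragSize", 75);
            q.set("carrot.summarySnippets", 2);
            QueryResponse rsp = solr.query(q);
            // The clustering component adds a "clusters" section to the response.
            System.out.println(rsp.getResponse().get("clusters"));
            solr.close();
        }
    }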
Re: splitting shards on 4.7.2 with custom plugins
I'm looking at the clusterstate.json to see why it is doing this. I really don't understand it, though...

{"collection1":{
  "shards":{
    "shard1":{
      "range":"80000000-ffffffff",
      "state":"active",
      "replicas":{
        "core_node1":{
          "state":"active",
          "base_url":"http://10.135.2.153:8981/solr",
          "core":"collection1",
          "node_name":"10.135.2.153:8981_solr",
          "leader":"true"},
        "core_node10":{
          "state":"active",
          "base_url":"http://10.135.2.153:8982/solr",
          "core":"collection1",
          "node_name":"10.135.2.153:8982_solr"}}},
    "shard2":{
      "range":"0-7fffffff",
      "state":"inactive",
      "replicas":{
        "core_node9":{
          "state":"active",
          "base_url":"http://10.135.2.153:8984/solr",
          "core":"collection1",
          "node_name":"10.135.2.153:8984_solr",
          "leader":"true"},
        "core_node11":{
          "state":"active",
          "base_url":"http://10.135.2.153:8983/solr",
          "core":"collection1",
          "node_name":"10.135.2.153:8983_solr"}}},
    "shard1_1":{
      "range":null,
      "state":"active",
      "parent":null,
      "replicas":{
        "core_node6":{
          "state":"active",
          "base_url":"http://10.135.2.153:8981/solr",
          "core":"collection1_shard1_1_replica1",
          "node_name":"10.135.2.153:8981_solr",
          "leader":"true"},
        "core_node8":{
          "state":"active",
          "base_url":"http://10.135.2.153:8984/solr",
          "core":"collection1_shard1_1_replica2",
          "node_name":"10.135.2.153:8984_solr"}}},
    "shard1_0":{
      "range":null,
      "state":"active",
      "parent":null,
      "replicas":{
        "core_node5":{
          "state":"active",
          "base_url":"http://10.135.2.153:8981/solr",
          "core":"collection1_shard1_0_replica1",
          "node_name":"10.135.2.153:8981_solr",
          "leader":"true"},
        "core_node7":{
          "state":"active",
          "base_url":"http://10.135.2.153:8983/solr",
          "core":"collection1_shard1_0_replica2",
          "node_name":"10.135.2.153:8983_solr"}}},
    "shard2_0":{
      "range":"0-3fffffff",
      "state":"active",
      "replicas":{
        "core_node13":{
          "state":"active",
          "base_url":"http://10.135.2.153:8984/solr",
          "core":"collection1_shard2_0_replica1",
          "node_name":"10.135.2.153:8984_solr",
          "leader":"true"},
        "core_node14":{
          "state":"active",
          "base_url":"http://10.135.2.153:8982/solr",
          "core":"collection1_shard2_0_replica2",
          "node_name":"10.135.2.153:8982_solr"}}},
    "shard2_1":{
      "range":"40000000-7fffffff",
      "state":"active",
      "replicas":{
        "core_node12":{
          "state":"active",
          "base_url":"http://10.135.2.153:8984/solr",
          "core":"collection1_shard2_1_replica1",
          "node_name":"10.135.2.153:8984_solr",
          "leader":"true"},
        "core_node15":{
          "state":"active",
          "base_url":"http://10.135.2.153:8981/solr",
          "core":"collection1_shard2_1_replica2",
          "node_name":"10.135.2.153:8981_solr"}}}},
  "maxShardsPerNode":"1",
  "router":{"name":"compositeId"},
  "replicationFactor":"1",
  "autoCreated":"true"}}

-- Thanks, Jeff Courtade M: 240.507.6116
Re: Solr performance is slow with just 1GB of data indexed
Thanks for your recommendation, Toke. Will try asking in the Carrot2 forum. Regards, Edwin
Re: Search opening hours
Sure - and sorry for its density; I reread it and thought the same ;) So imagine a polygon of, say, 1/2 mile width (I made that up) that stretches around the equator. Let's call this a week's timeline and subdivide it into 7 blocks, one for each day. For the sake of simplicity assume it's a line (which, if I recall, is supported in Solr as an infinitely small polygon) starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for Sunday at 11:59 PM. By subdivide you can think of it either radially or by longitude, but you have 360 degrees to divide into 7 days, which means that every hour is represented by a range of roughly 2.143 degrees (360/7/24). These regions represent each day and hour (or less), and the region boundaries represent midnight of the adjacent days. Now for indexing - your open hours become a combination of these subdivisions. If you're open 24x7 then the whole polygon is indexed. If you're only open on Monday from 9-5 then only the segment between (0,-160.71) and (0,-143.57) is indexed. With careful attention to detail you can index any combination of times this way. So now the varsity question is how to do this with a fluctuating calendar? I think this example can be extended to searching against any given day of the week in a year, or across years: imagine a translation layer that adjusts the latitude N or S by some amount to represent which day in which year you're looking for. Make sense? -- Darren
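To make the degree arithmetic concrete, a small Java sketch of the mapping Darren describes. The field name open_hours and the query string are assumptions for illustration; note the WKT puts longitude first (X=lon, Y=lat), as mentioned earlier in this thread:

    // Map the week onto the equator: Monday 00:00 at -180, Sunday 24:00 at +180.
    public class TimelineGeo {
        static final double DEG_PER_HOUR = 360.0 / (7 * 24); // ~2.143 degrees

        // day: 0 = Monday .. 6 = Sunday; hour may be fractional (9.5 = 9:30)
        static double toLongitude(int day, double hour) {
            return -180.0 + (day * 24 + hour) * DEG_PER_HOUR;
        }

        public static void main(String[] args) {
            // Index "open Monday 9:00-17:00" as a thin shape along the equator:
            String indexedWkt = String.format("LINESTRING(%f 0, %f 0)",
                    toLongitude(0, 9), toLongitude(0, 17)); // ~ -160.71 to -143.57
            // Query "is it open Monday 11:23?" as point intersection:
            String fq = String.format("{!field f=open_hours}Intersects(POINT(%f 0))",
                    toLongitude(0, 11 + 23 / 60.0));
            System.out.println(indexedWkt);
            System.out.println(fq);
        }
    }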
Connect and sync two Solr servers
Hi, I want to connect two SolrCloud servers and sync their indexes to each other, so that if either server is down we can work with the other, and whenever I update or add documents on one server the other also gets updated. shahper
Re: StrDocValues
On Wed, Aug 26, 2015 at 6:20 PM, Jamie Johnson jej2...@gmail.com wrote:
I don't see it explicitly mentioned, but does the boost only get applied to the final documents/score that matched the provided query, or is it called for each field that matched? I'm assuming only once per document that matched the main query - is that right?

Correct. -Yonik
Re: Solr 5.2.1 versus Solr 4.7.0 performance
On 8/26/2015 1:11 AM, Esther Goldbraich wrote:
We have benchmarked a set of queries on Solr 4.7.0 and 5.2.1 (with the same data and the same solrconfig.xml) and saw better query performance on Solr 4.7.0 (5-15% better than 5.2.1, with the exception of a 100% improvement for one of the queries), using the same JVM (IBM 1.7) and JVM params. The index's size is ~500G, spread over 64 shards, with replication factor 2. Do you know about any config/setup change for Solr 5.2.1 that could improve the performance? Any idea what causes this behavior?

I have little experience comparing the performance of different versions, but I have a general sense that OS disk caching becomes increasingly important to Solr's performance as time goes on. What this means in real terms is that if you have enough memory for adequate OS disk caching, using a later version of Solr will probably yield better performance, but if you don't have enough memory, you might actually see *worse* performance.

A question that might become important later, but doesn't really affect the immediate things I'm thinking about: what GC tuning options are you using? How much RAM do you have in each machine, and how big is Solr's heap? How much index data actually lives on each server? Be sure to count all replicas on each machine. https://wiki.apache.org/solr/SolrPerformanceProblems#RAM Thanks, Shawn
Re: Lucene/Solr 5.0 and custom FieldCache implementation
Sorry to poke this again, but I'm not following the last comment on how I could go about extending SolrIndexSearcher and having the extension used. Is there an example of this? Again, thanks. Jamie

On Aug 25, 2015 7:18 AM, Jamie Johnson jej2...@gmail.com wrote:
I had seen this as well. If I overrode this by extending SolrIndexSearcher, how would I have my extension used? I didn't see a way that it could be plugged in.

On Aug 25, 2015 7:15 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:
On Tue, Aug 25, 2015 at 2:03 PM, Jamie Johnson jej2...@gmail.com wrote:
Thanks Mikhail. If I'm reading the SimpleFacets class correctly, it delegates to DocValuesFacets when the facet method is FC - what used to be FieldCache, I believe. DocValuesFacets either uses DocValues or builds them using the UninvertingReader.

Ah, got it. Thanks for reminding me of the details. It seems like even docValues=true doesn't help with your custom implementation.

I am not seeing a clean extension point to add a custom UninvertingReader to Solr; would the only way be to copy the FacetComponent and SimpleFacets and modify them as needed?

Sadly, yes. There is no proper extension point. Also, consider overriding SolrIndexSearcher.wrapReader(SolrCore, DirectoryReader), where the particular UninvertingReader is created - there you can pass in your own one, which refers to a custom FieldCache.

On Aug 25, 2015 12:42 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:
Hello Jamie, I don't understand how it could choose DocValuesFacets (which occurs on docValues=true fields) but then switch to UninvertingReader/FieldCache, which implies docValues=false. If you can provide more details, that would be great. Besides that, I suppose you can only implement and inject your own UninvertingReader; I don't think there is an extension point for this. It's too specific a requirement.

On Tue, Aug 25, 2015 at 3:50 AM, Jamie Johnson jej2...@gmail.com wrote:
As mentioned in a previous email, I have a need to provide security controls at the term level. I know that Lucene/Solr doesn't support this, so I had baked something onto a 4.x baseline that was sufficient for my use cases. I am now looking to move that implementation to 5.x and am running into an issue around faceting. Previously we were able to provide a custom cache implementation that would create separate cache entries given a particular set of security controls, but in Solr 5 some faceting is delegated to DocValuesFacets, which delegates to UninvertingReader in my case (we are not storing DocValues). The issue I am running into is that before 5.x I had the ability to influence the FieldCache used at the Solr level, so as to also include a security token in the key; each cache entry was then scoped to a particular level. With the current implementation the FieldCache seems to be an internal detail that I can't influence in any way. Is this correct? I had noticed this Jira ticket: https://issues.apache.org/jira/browse/LUCENE-5427 - is there any movement on it? Is there another way to influence the information that is put into these caches? As always, thanks in advance for any suggestions. -Jamie

-- Sincerely yours, Mikhail Khludnev, Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
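To make Mikhail's suggestion a little more concrete, here is a rough sketch of the reader-wrapping idea using Lucene's UninvertingReader, which is roughly what Solr 5.x does internally when it builds its searcher. Since, as noted above, there is no supported extension point, treat this as fork/patch-level illustration rather than a real Solr hook; the field names and Type choices are hypothetical:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.uninverting.UninvertingReader;
    import org.apache.lucene.uninverting.UninvertingReader.Type;

    // Sketch: wrap the DirectoryReader with an UninvertingReader whose
    // mapping you control. A security-aware variant could substitute its
    // own UninvertingReader subclass keyed by the caller's auth token.
    public class CustomUninverting {
        static DirectoryReader wrap(DirectoryReader in) throws IOException {
            Map<String, Type> mapping = new HashMap<>();
            mapping.put("category", Type.SORTED);  // hypothetical facet field
            mapping.put("price", Type.INTEGER);    // hypothetical numeric field
            return UninvertingReader.wrap(in, mapping);
        }
    }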