Re: Tokenizers and DelimitedPayloadTokenFilterFactory
To be clear, we are using payloads as a way to attach authorizations to individual tokens within Solr. The payloads are normal Solr payloads, though we are not using floats; we are using the identity payload encoder (org.apache.lucene.analysis.payloads.IdentityEncoder), which allows storing a byte[] of our choosing in the payload field. This works great for text, but now that I'm indexing more than just text I need a way to specify the payload on the other field types. Does that make more sense?

On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson erickerick...@gmail.com wrote: This really sounds like an XY problem, or when you say payload it's not the Solr payload. Solr payloads are a float value that you can attach to individual terms to influence the scoring. Attaching the _same_ payload to all terms in a field is much the same thing as boosting on any matches in the field at query time, or boosting on the field at index time (the latter assuming that different docs would have different boosts). So can you back up a bit and tell us what you're trying to accomplish? Maybe we can be sure we're both talking about the same thing ;) Best, Erick

On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com wrote: I would like to specify a particular payload for all tokens emitted from a tokenizer, but don't see a clear way to do this. Ideally I could specify that something like the DelimitedPayloadTokenFilter be run on the entire field and then standard analysis be done on the rest of the field, so in the case that I had the text "this is a test\Foo" I would like to create the tokens this, is, a, test, each with a payload of Foo. From what I'm seeing, though, only test gets the payload. Is there any way to accomplish this, or will I need to implement a custom tokenizer?
Re: testing with EmbeddedSolrServer
Hello, I'm trying to guess what you are doing; it's not clear so far. I found http://stackoverflow.com/questions/11951695/embedded-solr-dih My conclusion: if you play with DIH and EmbeddedSolrServer, you'd better avoid the third beast; you don't need to bother with the test framework as well. I guess that main() is over while DIH runs in a background thread. You need to loop the status command until the import is over, or add the synchronous=true parameter to the full-import command; it should switch to synchronous mode: https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DataImportHandler.java#L199 Take care

On Tue, Aug 25, 2015 at 4:41 PM, Moen Endre endre.m...@imr.no wrote: Is there an example of integration testing with EmbeddedSolrServer that loads data from a data import handler, then queries the data? I've tried doing this based on org.apache.solr.client.solrj.embedded.TestEmbeddedSolrServerConstructors, but no data is being imported. Here is the test class I've tried: https://gist.github.com/emoen/5d0a28df91c4c1127238 I've also tried writing a test by extending AbstractSolrTestCase but haven't got this working. I've documented some of the log output here: http://stackoverflow.com/questions/32052642/solrcorestate-already-closed-with-unit-test-using-embeddedsolrserver-v-5-2-1 Should I extend AbstractSolrTestCase or SolrTestCaseJ4 when writing tests? Cheers Endre

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
attribute based recommender with solr
Hey guys, I wanted to create a simple, attribute-based food recommender with Solr. The user makes his choice concerning ingredients, cooking time, difficulty and so on. It is based on a SQL database where the recipes are stored. So, for example, if the user likes tomatoes, then the recipes with tomatoes should be boosted and ranked better. Sounds easy, but the sources I found are pretty shallow and theoretical; they don't really help. Maybe someone has done this before and is willing to help me :D I would also be happy about some good sources; I have already done research for hours. -- View this message in context: http://lucene.472066.n3.nabble.com/attribute-based-recommender-with-solr-tp4225186.html Sent from the Solr - User mailing list archive at Nabble.com.
Tokenizers and DelimitedPayloadTokenFilterFactory
I would like to specify a particular payload for all tokens emitted from a tokenizer, but don't see a clear way to do this. Ideally I could specify that something like the DelimitedPayloadTokenFilter be run on the entire field and then standard analysis be done on the rest of the field, so in the case that I had the text "this is a test\Foo" I would like to create the tokens this, is, a, test, each with a payload of Foo. From what I'm seeing, though, only test gets the payload. Is there any way to accomplish this, or will I need to implement a custom tokenizer?
Re: Tokenizers and DelimitedPayloadTokenFilterFactory
This really sounds like an XY problem, or when you say payload it's not the Solr payload. Solr payloads are a float value that you can attach to individual terms to influence the scoring. Attaching the _same_ payload to all terms in a field is much the same thing as boosting on any matches in the field at query time, or boosting on the field at index time (the latter assuming that different docs would have different boosts). So can you back up a bit and tell us what you're trying to accomplish? Maybe we can be sure we're both talking about the same thing ;) Best, Erick

On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com wrote: I would like to specify a particular payload for all tokens emitted from a tokenizer, but don't see a clear way to do this. Ideally I could specify that something like the DelimitedPayloadTokenFilter be run on the entire field and then standard analysis be done on the rest of the field, so in the case that I had the text "this is a test\Foo" I would like to create the tokens this, is, a, test, each with a payload of Foo. From what I'm seeing, though, only test gets the payload. Is there any way to accomplish this, or will I need to implement a custom tokenizer?
Re: Tokenizers and DelimitedPayloadTokenFilterFactory
Oh my. What fun! bq: I need a way to specify the payload on the other field types. Not to my knowledge. The payload mechanism is built on the capability of having a filter in the analysis chain, and there's no analysis chain with primitive types (string, numeric and the like). Hmmm. Totally off the top of my head, but I wonder if you could use a binary type and customize all the reading to spoof whatever primitive types you wanted while respecting your auth tokens? Best, Erick

On Tue, Aug 25, 2015 at 10:37 AM, Jamie Johnson jej2...@gmail.com wrote: To be clear, we are using payloads as a way to attach authorizations to individual tokens within Solr. The payloads are normal Solr payloads, though we are not using floats; we are using the identity payload encoder (org.apache.lucene.analysis.payloads.IdentityEncoder), which allows storing a byte[] of our choosing in the payload field. This works great for text, but now that I'm indexing more than just text I need a way to specify the payload on the other field types. Does that make more sense?

On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson erickerick...@gmail.com wrote: This really sounds like an XY problem, or when you say payload it's not the Solr payload. Solr payloads are a float value that you can attach to individual terms to influence the scoring. Attaching the _same_ payload to all terms in a field is much the same thing as boosting on any matches in the field at query time, or boosting on the field at index time (the latter assuming that different docs would have different boosts). So can you back up a bit and tell us what you're trying to accomplish? Maybe we can be sure we're both talking about the same thing ;) Best, Erick

On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com wrote: I would like to specify a particular payload for all tokens emitted from a tokenizer, but don't see a clear way to do this. Ideally I could specify that something like the DelimitedPayloadTokenFilter be run on the entire field and then standard analysis be done on the rest of the field, so in the case that I had the text "this is a test\Foo" I would like to create the tokens this, is, a, test, each with a payload of Foo. From what I'm seeing, though, only test gets the payload. Is there any way to accomplish this, or will I need to implement a custom tokenizer?
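For reference, the behavior Jamie describes follows from where the DelimitedPayloadTokenFilter sits in a conventional analysis chain. A hedged sketch (the field type name is an assumption, not from the thread): because tokenization happens before the payload filter runs, only the token that itself contains the delimiter receives a payload.

```xml
<!-- Hedged sketch of a conventional delimited-payload chain. With a
     whitespace tokenizer, "this is a test\Foo" is split into four tokens
     first, so only "test\Foo" carries the delimiter and thus gets the
     payload; "this", "is", "a" do not. -->
<fieldType name="delimited_payloads" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="\" encoder="identity"/>
  </analyzer>
</fieldType>
```

The encoder="identity" setting corresponds to the IdentityEncoder mentioned in the thread, storing the raw bytes after the delimiter as the payload.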
Unknown query parser 'terms' with TermsComponent defined
Hi, we've encountered a strange situation; I'm hoping someone might be able to shed some light. We're using Solr 4.9 deployed in Tomcat 7. We build a query that has these params:

'params'={
  'fl'='id',
  'sort'='system_create_dtsi asc',
  'indent'='true',
  'start'='0',
  'q'='_query_:{!raw f=has_model_ssim}Batch AND ({!terms f=id}ft849m81z)',
  'qt'='standard',
  'wt'='ruby',
  'rows'=['1', '1000']}},

And it responds with an error message:

'error'={
  'msg'='Unknown query parser \'terms\'',
  'code'=400}}

The terms component is defined in solrconfig.xml:

<searchComponent name="termsComponent" class="solr.TermsComponent"/>
<requestHandler name="/terms" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
  </lst>
  <arr name="components">
    <str>termsComponent</str>
  </arr>
</requestHandler>

And the standard request handler is defined:

<requestHandler name="standard" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="defType">lucene</str>
  </lst>
</requestHandler>

In case it's useful, we have <luceneMatchVersion>4.9</luceneMatchVersion>. Why would we be getting the Unknown query parser \'terms\' error? Thanks, Tricia
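The confusion here is easy to hit because two unrelated features share the name "terms": the TermsComponent (a search component behind the /terms handler) and the {!terms} query parser (TermsQParserPlugin), which is looked up by name at query time. If I recall correctly, TermsQParserPlugin first shipped in Solr 4.10, so it is simply absent in 4.9 regardless of the component config. A hedged sketch of the distinction:

```xml
<!-- Hedged sketch: the two different "terms" features in solrconfig.xml.
     The searchComponent below is what the poster's config defines; it has
     no bearing on the {!terms ...} query syntax. -->
<searchComponent name="termsComponent" class="solr.TermsComponent"/>

<!-- A query parser is bound to a name like this. In 4.9 the
     TermsQParserPlugin class itself does not exist (it appeared in 4.10,
     if memory serves), so the practical fix is upgrading rather than
     registration; built-in parsers need no explicit registration in
     versions that ship them. -->
<queryParser name="terms" class="org.apache.solr.search.TermsQParserPlugin"/>
```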
Re: how to prevent uuid-field changing in /update query?
UUIDUpdateProcessorFactory: an update processor that adds a newly generated UUID value to any document being added that does not already have a value in the specified field. See: http://lucene.apache.org/solr/5_2_1/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html -- Jack Krupansky

On Tue, Aug 25, 2015 at 4:22 AM, CrazyDiamond crazy_diam...@mail.ru wrote: I have a UUID field. It is not set as unique, but nevertheless I want it not to be changed every time I call /update. It might be because I added a request handler with the name /update which contains a UUID update chain. But if I don't do this, I have no UUID at all. Maybe I can configure the UUID update chain to set the UUID only if it is blank? -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113.html Sent from the Solr - User mailing list archive at Nabble.com.
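Jack's pointer can be wired up roughly like this in solrconfig.xml. This is a hedged sketch (the chain name and the field name my_uuid are assumptions, not from the thread). Note the factory only generates a value when the incoming document lacks one, so re-sending a document that includes its existing UUID preserves it, while re-sending the document without the field still gets a fresh UUID.

```xml
<!-- Sketch, assuming a schema field named "my_uuid". The processor fills
     the field only when the incoming document carries no value for it. -->
<updateRequestProcessorChain name="uuid-if-blank">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">my_uuid</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- Attach the chain to the /update handler -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uuid-if-blank</str>
  </lst>
</requestHandler>
```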
Re: Using copyField with dynamicField
Zach, as an alternative to copyField, you might want to consider the CloneFieldUpdateProcessorFactory: http://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html It supports specification of field names with regular expressions, exclusion of specific fields that otherwise match the regex, etc. Much more flexible than copyField, in my opinion. Regards, Scott

On Mon, Aug 24, 2015 at 10:39 PM, Erick Erickson erickerick...@gmail.com wrote: What is reported in the Solr log? That's usually much more informative. Best, Erick

On Mon, Aug 24, 2015 at 5:26 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: It should work (at first glance); copyField does support wildcards. Do you have a field called text? Also, your field name and field type text have the same name; not sure that is the best idea. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 24 August 2015 at 17:27, Zach Thompson z...@duckduckgo.com wrote: Hi all, is it possible to use copyField with dynamicField? I was trying to do the following,

<dynamicField name="*_text" type="text" indexed="true" stored="true"/>
<copyField source="*_text" dest="text" maxChars="100"/>

and getting a 400 error on trying to copy the first dynamic field. Without the copyField the fields seem to load ok. -- Zach Thompson z...@duckduckgo.com
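Scott's suggestion looks roughly like this in solrconfig.xml; a hedged sketch following the pattern in the linked javadoc (the chain name and the excluded field name are assumptions for illustration):

```xml
<!-- Clones every field matching *_text into "text" at update time.
     Unlike copyField, the source supports regexes and exclusions. -->
<updateRequestProcessorChain name="clone-text">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <lst name="source">
      <str name="fieldRegex">.*_text</str>
      <lst name="exclude">
        <str name="fieldName">private_text</str>  <!-- illustrative exclusion -->
      </lst>
    </lst>
    <str name="dest">text</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```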
RE: User Authentication
We use CAS as well, and are also not using ZooKeeper/SolrCloud. We may move to SolrCloud after getting our current very basic setup into production. We'll definitely take a look at the rule-based authorization plugin and see how we can leverage that.

-----Original Message----- From: LeZotte, Tom [mailto:tom.lezo...@vanderbilt.edu] Sent: Monday, August 24, 2015 4:37 PM To: solr-user@lucene.apache.org Subject: Re: User Authentication

Bosco, we use CAS for user authentication; not sure if we have Kerberos working anywhere. Also we are not using ZooKeeper, because we are only running one server currently. Thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830

On Aug 24, 2015, at 3:12 PM, Don Bosco Durai bo...@apache.org wrote: Just curious, is Kerberos an option for you? If so, mostly all three of your use cases will be addressed. Bosco

On 8/24/15, 12:18 PM, Steven White swhite4...@gmail.com wrote: Hi Noble, is everything in the link you provided applicable to Solr 5.2.1? Thanks Steve

On Mon, Aug 24, 2015 at 2:20 PM, Noble Paul noble.p...@gmail.com wrote: Did you manage to look at the reference guide? https://cwiki.apache.org/confluence/display/solr/Securing+Solr

On Mon, Aug 24, 2015 at 9:23 PM, LeZotte, Tom tom.lezo...@vanderbilt.edu wrote: Alex, I got a super secret release of Solr 5.3.1, wasn't supposed to say anything. Yes, I'm running 5.2.1; I will check out the release notes for 5.3. I was looking for three types of user authentication, I guess: 1. the admin console; 2. user auth for each core (and select and update) on a server; 3. HTML interface access (example: ajax-solr https://github.com/evolvingweb/ajax-solr) Thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830

On Aug 24, 2015, at 10:05 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Thanks for the email from the future. It is good to start to prepare for 5.3.1 now that 5.3 is nearly out. Joking aside (and assuming Solr 5.2.1), what exactly are you trying to achieve? Solr should not actually be exposed to the users directly; it should be hiding in a backend only visible to your middleware. If you are looking for an HTML interface that talks directly to Solr after authentication, that's not the right way to set it up. That said, some security features are being rolled out and you should definitely check the release notes for 5.3. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 24 August 2015 at 10:01, LeZotte, Tom tom.lezo...@vanderbilt.edu wrote: Hi Solr community, I have been trying to add user authentication to our Solr 5.3.1 RedHat install. I've found some examples on user authentication on the Jetty side, but they have failed. Does anyone have a step-by-step example on authentication for the admin screen? And a core? Thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830

-- - Noble Paul
Re: Please answer my question on StackOverflow ... Best approach to guarantee commits in SOLR
You could also look at an integrated product such as DataStax Enterprise, which fully integrates the Cassandra database and Solr: you execute your database transactions in Cassandra and then DSE Search automatically indexes the data in the embedded version of Solr. See: http://www.datastax.com/products/datastax-enterprise-search About the only downside is that it is a proprietary product and the integration is not open source. -- Jack Krupansky

On Tue, Aug 25, 2015 at 10:15 AM, Upayavira u...@odoko.co.uk wrote:

On Tue, Aug 25, 2015, at 01:21 PM, Simer P wrote: http://stackoverflow.com/questions/32138845/what-is-the-best-approach-to-guarantee-commits-in-apache-solr *Question:* How can I get guaranteed commits with Apache Solr where persisting data to disk and visibility are both equally important? *Background:* We have a website which requires high-end search functionality for machine learning and also requires guaranteed commits for financial transactions. We want to use Solr as our only datastore to keep things simple and *do not* want to use another database on the side. I can't seem to find any answer to this question. The simplest solution for a financial transaction seems to be to periodically query Solr for the record after it has been persisted, but this can have longer wait times. Is there a better solution? Can anyone please suggest a solution for achieving guaranteed commits with Solr?

Be sure you are not trying to use the wrong tool for the job. Solr does not offer per-transaction guarantees. It is heavily optimised around high read/low write situations (i.e. more reads than writes). If you commit to disk too often, the implementation will be very inefficient (it will create lots of segments that need to be merged, and caches will become ineffective). Also, when you issue a commit, it commits all pending documents, regardless of who posted them to Solr. These do not sound like things that suit your application.

There remains the possibility (even if extremely uncommon/unlikely) that a transaction could be lost were a server to die/lose power in the few seconds between a post and a subsequent commit. Personally, I'd use a more traditional database for the data, then also post it to Solr for fast search/faceting/etc. as needed. But then, perhaps there's more to your use case than I have so far understood. Upayavira
Re: Please answer my question on StackOverflow ... Best approach to guarantee commits in SOLR
On Tue, Aug 25, 2015, at 01:21 PM, Simer P wrote: http://stackoverflow.com/questions/32138845/what-is-the-best-approach-to-guarantee-commits-in-apache-solr *Question:* How can I get guaranteed commits with Apache Solr where persisting data to disk and visibility are both equally important? *Background:* We have a website which requires high-end search functionality for machine learning and also requires guaranteed commits for financial transactions. We want to use Solr as our only datastore to keep things simple and *do not* want to use another database on the side. I can't seem to find any answer to this question. The simplest solution for a financial transaction seems to be to periodically query Solr for the record after it has been persisted, but this can have longer wait times. Is there a better solution? Can anyone please suggest a solution for achieving guaranteed commits with Solr?

Be sure you are not trying to use the wrong tool for the job. Solr does not offer per-transaction guarantees. It is heavily optimised around high read/low write situations (i.e. more reads than writes). If you commit to disk too often, the implementation will be very inefficient (it will create lots of segments that need to be merged, and caches will become ineffective). Also, when you issue a commit, it commits all pending documents, regardless of who posted them to Solr. These do not sound like things that suit your application. There remains the possibility (even if extremely uncommon/unlikely) that a transaction could be lost were a server to die/lose power in the few seconds between a post and a subsequent commit. Personally, I'd use a more traditional database for the data, then also post it to Solr for fast search/faceting/etc. as needed. But then, perhaps there's more to your use case than I have so far understood. Upayavira
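The durability-versus-visibility distinction Upayavira draws maps onto Solr's hard and soft commits: a hard commit with openSearcher=false flushes segments to disk without reopening searchers, while a soft commit makes documents searchable without flushing, and the update log plus hard-commit interval bounds how much a crash could lose. A hedged solrconfig.xml sketch (the interval values are illustrative, not from the thread):

```xml
<!-- Illustrative values only. Hard commit: durability (flush to disk
     without opening a new searcher). -->
<autoCommit>
  <maxTime>15000</maxTime>          <!-- flush at most every 15 seconds -->
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit: visibility (new searcher, no flush to disk). -->
<autoSoftCommit>
  <maxTime>2000</maxTime>           <!-- docs become searchable within ~2s -->
</autoSoftCommit>
```

Even so, this only narrows the loss window; it does not turn Solr into a transactional store, which is the thread's point.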
Re: how to index document with multiple words (phrases) and words permutation?
What you want to do is basically named entity recognition. We have a quite similar use case (medical/scientific documents; we need to look for disease names, drug names, MeSH terms, etc.). Take a look at David Smiley's Solr Text Tagger (https://github.com/OpenSextant/SolrTextTagger), which we've been using with some success for this task. Best -Simon

On Mon, Aug 24, 2015 at 2:13 PM, afrooz afr.rahm...@gmail.com wrote: Thanks Erick, I will explain the detailed scenario so you might give me a solution: I want to annotate a medical document based only on a medical dictionary. I don't need to annotate non-medical words of the document at all. The medical dictionary contains terms of multiple words, and these terms all together have a specific medical meaning. For example "back pain": "back" and "pain" are two separate words, but together they have another meaning. These terms might be used in different orders in a sentence but all with the same meaning, e.g. "breast cancer" or "cancer in breast" should be considered the same... We have terms of even more than six words. So the question is that I have a document with around 700 words and I need to annotate this document based on a medical terminology of around 3 million records. Any idea how to do this? -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-index-document-with-multiple-words-phrases-and-words-permutation-tp4224919p4224970.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Behavior of grouping on a field with same value spread across shards.
That's not really the case. Perhaps you're confusing group.ngroups and group.facet with just grouping? See the ref guide: https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats Best, Erick

On Tue, Aug 25, 2015 at 4:51 AM, Modassar Ather modather1...@gmail.com wrote: Hi, as per my understanding, to group on a field all documents with the same value in the field have to be in the same shard. Can we group by a field where the documents with the same value in that field are distributed across shards? Please let me know the limitations, unavailable features, or performance issues for such fields. Thanks, Modassar
Re: Query timeAllowed and its behavior.
On 8/25/2015 3:18 AM, Modassar Ather wrote: Kindly help me understand the query timeAllowed attribute. The following is set in solrconfig.xml:

<int name="timeAllowed">30</int>

Does this setting stop the query from running after timeAllowed is reached? If not, is there a way to stop it, as it will occupy resources in the background for no benefit.

That is certainly the *goal* of timeAllowed ... but mostly it serves as a way to try to offer a guarantee that a query will not take longer than a certain amount of time, so your user application will receive a response, which might be an error or negative response, within that stated timeframe. Multithreaded programming is tricky in the best circumstances; if you introduce the idea of killing threads into the mix, it becomes REALLY complicated. I would not be very surprised to learn that parts of the query which run in parallel, such as the filter queries, continue to run in the background and populate caches even if the user query has been aborted because of timeAllowed. You could open a feature request issue in Jira, but I suspect that aborting *everything* for timeAllowed is a really hard problem that nobody wants to tackle. If you can figure out how to solve it, your patch will be reviewed and possibly committed. Thanks, Shawn
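For context, timeAllowed is expressed in milliseconds, so the value of 30 in the poster's config is 30 ms, which may be lower than intended. A hedged sketch of where the setting usually lives; note also that, if memory serves, when the budget expires mid-collection Solr returns what it has gathered and sets partialResults=true in the responseHeader, so clients should check that flag rather than assume a complete result set.

```xml
<!-- Illustrative: timeAllowed as a default on a search handler. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <int name="timeAllowed">30</int> <!-- milliseconds, as in the poster's config -->
  </lst>
</requestHandler>
```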
Re: splitting shards on 4.7.2 with custom plugins
Can you elaborate a bit more on the setup: what do the custom plugins do, and what error do you get? It seems like a classloader/classpath issue to me, which doesn't really relate to shard splitting.

On Tue, Aug 25, 2015 at 7:59 PM, Jeff Courtade courtadej...@gmail.com wrote: I am getting failures when trying to split shards on Solr 4.7.2 with custom plugins. It fails regularly; it cannot find the jar files for the plugins when creating the new cores/shards. Ideas? -- Thanks, Jeff Courtade M: 240.507.6116

-- Anshum Gupta
Re: Tokenizers and DelimitedPayloadTokenFilterFactory
Looks like I have something basic working for Trie fields. I am doing exactly what I said in my previous email, so good news there. I think this is a big step, as there are only a few field types left that I need to support, those being date (should be similar to Trie) and spatial fields, which at a glance looked like they provide a way to supply the token stream through an extension; definitely need to look more though. All of this said, is this really the right way to get payloads into these types of fields? Should a Jira feature request be added for this?

On Aug 25, 2015 8:13 PM, Jamie Johnson jej2...@gmail.com wrote: Right, I had assumed (obviously here is my problem) that I'd be able to specify payloads for the field regardless of the field type. Looking at TrieField, that is certainly non-trivial. After a bit of digging it appears that if I wanted to do something here I'd need to build a new TrieField, override createField, and provide a Field that would return something like NumericTokenStream but also provide the payloads. Like you said, sounds interesting to say the least... Were payloads not really intended to be used for these types of fields from a Lucene perspective?

On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson erickerick...@gmail.com wrote: Well, you're going down a path that hasn't been trodden before ;). If you can treat your primitive types as text types you might get some traction, but that makes a lot of operations like numeric comparison difficult. Hmmm, another idea from left field: for single-valued types, what about a sidecar field that has the auth token? And even for a multiValued field, two parallel fields are guaranteed to maintain order, so perhaps you could do something here. Yes, I'm waving my hands a LOT here. I suspect that trying to have a custom type that incorporates payloads for, say, trie fields will be interesting to say the least. Numeric types are packed to save storage etc., so it'll be an adventure... Best, Erick

On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson jej2...@gmail.com wrote: We were originally using this approach, i.e. run things through the KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter. Again this works fine for text, though I had wanted to use the StandardTokenizer in the chain. Is there an equivalent filter that does what the StandardTokenizer does? All of this said, this doesn't address the issue of the primitive field types, which at this point is the bigger issue. Given this use case, should there be another way to provide payloads? My current thinking is that I will need to provide custom implementations for all of the field types I would like to support payloads on, which will essentially be copies of the standard versions with some extra sugar to read/write the payloads (I don't see a way to wrap/delegate these at this point because AttributeSource has the attribute-retrieval methods as final, so I can't simply wrap another tokenizer and return my added attributes plus the wrapped attributes). I know my use case is a bit strange, but I had not expected to need to do this given that Lucene/Solr supports payloads on these field types; they just aren't exposed. As always, I appreciate any ideas if I'm barking up the wrong tree here.

On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma markus.jel...@openindex.io wrote: Well, if I remember correctly (I have no testing facility at hand) WordDelimiterFilter maintains payloads on emitted sub-terms. So if you use a KeywordTokenizer, input 'some text^PAYLOAD', and have a DelimitedPayloadFilter, the entire string gets a payload. You can then split that string up again into individual tokens. It is possible to abuse WordDelimiterFilter for this because it has a types parameter that you can use to split on whitespace if its input is not trimmed. Otherwise you can use any other character instead of a space in your input. This is a crazy idea, but it might work.

-----Original message----- From: Jamie Johnson jej2...@gmail.com Sent: Tuesday 25th August 2015 19:37 To: solr-user@lucene.apache.org Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory

To be clear, we are using payloads as a way to attach authorizations to individual tokens within Solr. The payloads are normal Solr payloads, though we are not using floats; we are using the identity payload encoder (org.apache.lucene.analysis.payloads.IdentityEncoder), which allows storing a byte[] of our choosing in the payload field. This works great for text, but now that I'm indexing more than just text I need a way to specify the payload on the other field types. Does that make more sense?

On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson erickerick...@gmail.com wrote: This really sounds like an XY problem. Or when you use
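Markus's whole-field trick can be sketched as a schema fieldType. This is a hedged sketch (the type name, delimiter choice, and filter options are assumptions), and it inherits his own caveat: whether WordDelimiterFilter preserves payloads on the sub-terms it emits is from memory, so verify on the analysis screen before relying on it.

```xml
<!-- Hedged sketch of the KeywordTokenizer -> DelimitedPayloadTokenFilter ->
     WordDelimiterFilter chain described above. The whole field value stays
     one token, takes its payload from everything after "^", and is then
     split back into word tokens. -->
<fieldType name="payload_whole_field" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="^" encoder="identity"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="0" preserveOriginal="0"/>
  </analyzer>
</fieldType>
```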
splitting shards on 4.7.2 with custom plugins
I am getting failures when trying to split shards on Solr 4.7.2 with custom plugins. It fails regularly; it cannot find the jar files for the plugins when creating the new cores/shards. Ideas? -- Thanks, Jeff Courtade M: 240.507.6116
Re: Solr performance is slow with just 1GB of data indexed
On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote: I'm currently trying out the Carrot2 Workbench and getting it to call Solr to see how they did the clustering. Although it still takes some time to do the clustering, the results of the clusters are much better than mine. I think it's probably due to different settings like fragSize and desiredClusterCountBase?

Either that, or the Carrot2 bundled with Solr is an older version.

By the way, the link in the clustering example, https://cwiki.apache.org/confluence/display/solr/Result, is not working; it says 'Page Not Found'.

That is because it is too long for a single line. Try copy-pasting it: https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration

- Toke Eskildsen, State and University Library, Denmark
Re: Behavior of grouping on a field with same value spread across shards.
Thanks Erick, I saw the link. So is it that the grouping functionality works fine in distributed search except for the two cases mentioned in the link? Regards, Modassar

On Tue, Aug 25, 2015 at 10:40 PM, Erick Erickson erickerick...@gmail.com wrote: That's not really the case. Perhaps you're confusing group.ngroups and group.facet with just grouping? See the ref guide: https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats Best, Erick

On Tue, Aug 25, 2015 at 4:51 AM, Modassar Ather modather1...@gmail.com wrote: Hi, as per my understanding, to group on a field all documents with the same value in the field have to be in the same shard. Can we group by a field where the documents with the same value in that field are distributed across shards? Please let me know the limitations, unavailable features, or performance issues for such fields. Thanks, Modassar
Re: Solr performance is slow with just 1GB of data indexed
On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote: I would like to confirm: when I set rows=100, does it mean that it only builds the clusters based on the first 100 records returned by the search, and if I have 1000 records that match the search, all the remaining 900 records will not be considered for clustering?

That is correct. It is not stated very clearly, but it follows from reading the comments in the third example at https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration

If that is the case, the result of the clustering may not be so accurate, as there is a possibility that the first 100 records have a large amount of similarity, while the subsequent 900 records have differences that could have an impact on the cluster result.

Such is the nature of on-the-fly clustering. The clustering aims to be as representative of your search result as possible. Assigning more weight to the higher-scoring documents (in this case all the weight, as those beyond the top 100 are not even considered) does this. If that does not fit your expectations, maybe you need something else? Plain faceting perhaps? Or maybe enrichment of the documents with some sort of entity extraction?

- Toke Eskildsen, State and University Library, Denmark
how to prevent uuid-field changing in /update query?
I have a uuid field. It is not set as unique, but nevertheless I want it not to be changed every time I call /update. It might be because I added a request handler with the name /update which contains the uuid update chain. But if I do not do this I have no uuid at all. Maybe I can configure the uuid update-chain to set the uuid only if it is blank? -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: User Authentication
You might have to use 5.3 when it is publicly available. It supports Basic Auth. But based on my understanding of the authentication/authorization framework implemented in 5.2, you need to use Solr Cloud/Zookeeper for configuring the plugins. Noble, Anshum or Ishan can confirm it. They are the original authors of these features. Thanks Bosco On 8/24/15, 2:30 PM, Steven White swhite4...@gmail.com wrote: For my project, Kerberos is not a requirement. What I need is: 1) Basic Auth to Solr server (at all access levels) 2) SSL support My setup is not using ZK, it's a single core. Steve On Mon, Aug 24, 2015 at 4:12 PM, Don Bosco Durai bo...@apache.org wrote: Just curious, is Kerberos an option for you? If so, mostly all your 3 use cases will be addressed. Bosco On 8/24/15, 12:18 PM, Steven White swhite4...@gmail.com wrote: Hi Noble, Is everything in the link you provided applicable to Solr 5.2.1? Thanks Steve On Mon, Aug 24, 2015 at 2:20 PM, Noble Paul noble.p...@gmail.com wrote: did you manage to look at the reference guide? https://cwiki.apache.org/confluence/display/solr/Securing+Solr On Mon, Aug 24, 2015 at 9:23 PM, LeZotte, Tom tom.lezo...@vanderbilt.edu wrote: Alex I got a super secret release of Solr 5.3.1, wasn't supposed to say anything. Yes I'm running 5.2.1, I will check out the release notes for 5.3. I was looking for three types of user authentication, I guess. 1. the Admin Console 2. User auth for each Core (and select and update) on a server. 3. HTML interface access (example: ajax-solr https://github.com/evolvingweb/ajax-solr) Thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830 On Aug 24, 2015, at 10:05 AM, Alexandre Rafalovitch arafa...@gmail.com mailto:arafa...@gmail.com wrote: Thanks for the email from the future. It is good to start to prepare for 5.3.1 now that 5.3 is nearly out. Joking aside (and assuming Solr 5.2.1), what exactly are you trying to achieve? Solr should not actually be exposed to the users directly.
It should be hiding in a backend only visible to your middleware. If you are looking for an HTML interface that talks directly to Solr after authentication, that's not the right way to set it up. That said, some security features are being rolled out and you should definitely check the release notes for 5.3. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 24 August 2015 at 10:01, LeZotte, Tom tom.lezo...@vanderbilt.edu wrote: Hi Solr Community I have been trying to add user authentication to our Solr 5.3.1 RedHat install. I've found some examples on user authentication on the Jetty side. But they have failed. Does anyone have a step by step example on authentication for the admin screen? And a core? Thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830 -- - Noble Paul
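For reference, the 5.3 Basic Auth setup mentioned above is driven by a security.json uploaded to ZooKeeper, which is why it requires SolrCloud. A minimal sketch, using the stock example credentials from the documentation (user solr, password SolrRocks) rather than anything from this thread:

```json
{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
    }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "user-role": { "solr": "admin" },
    "permissions": [ { "name": "security-edit", "role": "admin" } ]
  }
}
```

The credentials value is a salted SHA-256 hash plus salt, not the plain password.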
Re: Unknown query parser 'terms' with TermsComponent defined
1) The terms Query Parser (TermsQParser) has nothing to do with the TermsComponent (the first is for querying many distinct terms, the latter is for requesting info about low level terms in your index) https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser https://cwiki.apache.org/confluence/display/solr/The+Terms+Component 2) TermsQParser (which is what you are trying to use with the {!terms... query syntax) was not added to Solr until 4.10 3) based on your example query, I'm pretty sure what you want is the TermQParser: term (singular, no s) ... https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermQueryParser {!term f=id}ft849m81z : We've encountered a strange situation, I'm hoping someone might be able to : shed some light. We're using Solr 4.9 deployed in Tomcat 7. ... : 'q'='_query_:{!raw f=has_model_ssim}Batch AND ({!terms f=id}ft849m81z)', ... : 'msg'='Unknown query parser \'terms\'', : 'code'=400}} ... : The terms component is defined in solrconfig.xml: : : <searchComponent name="termsComponent" class="solr.TermsComponent"/> -Hoss http://www.lucidworks.com/
RE: Tokenizers and DelimitedPayloadTokenFilterFactory
Well, if I remember correctly (I have no testing facility at hand) WordDelimiterFilter maintains payloads on emitted sub terms. So if you use a KeywordTokenizer, input 'some text^PAYLOAD', and have a DelimitedPayloadFilter, the entire string gets a payload. You can then split that string up again into individual tokens. It is possible to abuse WordDelimiterFilter for this because it has a types parameter that you can use to split on whitespace if its input is not trimmed. Otherwise you can use any other character instead of a space as your input. This is a crazy idea, but it might work. -Original message- From: Jamie Johnson jej2...@gmail.com Sent: Tuesday 25th August 2015 19:37 To: solr-user@lucene.apache.org Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory To be clear, we are using payloads as a way to attach authorizations to individual tokens within Solr. The payloads are normal Solr Payloads though we are not using floats; we are using the identity payload encoder (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for storing a byte[] of our choosing in the payload field. This works great for text, but now that I'm indexing more than just text I need a way to specify the payload on the other field types. Does that make more sense? On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson erickerick...@gmail.com wrote: This really sounds like an XY problem. Or when you use payload it's not the Solr payload. So Solr Payloads are a float value that you can attach to individual terms to influence the scoring. Attaching the _same_ payload to all terms in a field is much the same thing as boosting on any matches in the field at query time or boosting on the field at index time (this latter assuming that different docs would have different boosts).
So can you back up a bit and tell us what you're trying to accomplish? Maybe we can be sure we're both talking about the same thing ;) Best, Erick On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com wrote: I would like to specify a particular payload for all tokens emitted from a tokenizer, but don't see a clear way to do this. Ideally I could specify that something like the DelimitedPayloadTokenFilter be run on the entire field and then standard analysis be done on the rest of the field, so in the case that I had the following text this is a test\Foo I would like to create the tokens this, is, a, test each with a payload of Foo. From what I'm seeing though, only test gets the payload. Is there any way to accomplish this or will I need to implement a custom tokenizer?
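Markus's suggestion above could look roughly like this in schema.xml. This is an untested sketch: the field type name and the types file are made up, and whether WordDelimiterFilter can actually be coaxed into splitting on whitespace via its types file would need verifying on the admin/analysis page.

```xml
<!-- Sketch only: KeywordTokenizer keeps the whole value as one token,
     DelimitedPayloadTokenFilter peels "...\Foo" off the end and attaches
     it as a payload, and WordDelimiterFilter then re-splits the remaining
     text, with each sub-term inheriting the payload. -->
<fieldType name="payload_text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="\" encoder="identity"/>
    <!-- hypothetical types file marking whitespace as a split character -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" splitOnCaseChange="0"
            types="payload-split-types.txt"/>
  </analyzer>
</fieldType>
```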
Query timeAllowed and its behavior.
Hi, Kindly help me understand the query timeAllowed attribute. The following is set in solrconfig.xml: <int name="timeAllowed">30</int> Does this setting stop the query from running after timeAllowed is reached? If not, is there a way to stop it, as it will occupy resources in the background for no benefit. Thanks, Modassar
Re: Search opening hours
On Tue, Aug 25, 2015 at 5:02 PM, O. Klein kl...@octoweb.nl wrote: I'm trying to find the best way to search for stores that are open NOW. It's probably not the *best* way, but assuming it's currently 4:10pm, you could do +open:[* TO 1610] +close:[1610 TO *] And to account for days of the week, have different fields for each day: openM, closeM, openT, closeT, etc. Not super elegant, but it seems to get the job done. -Yonik
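Yonik's scheme can be sketched in a few lines of client code. The per-day field suffixes (openM/closeM, openT/closeT, ...) and the HHMM integer encoding are assumptions carried over from his example, not a fixed convention:

```python
from datetime import datetime

# Hypothetical field-name suffixes per weekday; Monday is 0 in datetime.weekday().
DAY_SUFFIX = ["M", "T", "W", "Th", "F", "Sa", "Su"]

def open_now_filter(now: datetime) -> str:
    """Build a Solr filter query matching stores open at `now`,
    assuming open/close times are indexed as HHMM integers per day."""
    suffix = DAY_SUFFIX[now.weekday()]
    hhmm = now.hour * 100 + now.minute
    return f"+open{suffix}:[* TO {hhmm}] +close{suffix}:[{hhmm} TO *]"

# Tuesday 16:10 -> "+openT:[* TO 1610] +closeT:[1610 TO *]"
print(open_now_filter(datetime(2015, 8, 25, 16, 10)))
```

One caveat this sketch ignores: stores that close after midnight need either a wrap-around second interval or close times beyond 2400.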
Exact substring search with ngrams
Hi, I'm trying to build an index for technical documents that basically works like grep, i.e. the user gives an arbitrary substring somewhere in a line of a document and the exact matches will be returned. I specifically want no stemming etc. and keep all whitespace, parentheses etc. because they might be significant. The only normalization is that the search should be case-insensitive. I tried to achieve this by tokenizing on line breaks, and then building trigrams of the individual lines: <fieldType name="configtext_trigram" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.PatternTokenizerFactory" pattern="\R" group="-1"/> <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="3"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> Then in the search, I use the edismax parser with mm=100%, so given the documents {id:test1,content:" encryption 10.0.100.22 description "} {id:test2,content:" 10.100.0.22 description "} and the query content:encryption, this will turn into parsedquery_toString: +((content:enc content:ncr content:cry content:ryp content:ypt content:pti content:tio content:ion)~8), and return only the first document. All fine and dandy. But I have a problem with possible false positives. If the search is e.g. content:.100.22 then the generated query will be parsedquery_toString: +((content:.10 content:100 content:00. content:0.2 content:.22)~5), and because all of the tokens are also generated for document test2 in the proximity of 5, both documents will wrongly be returned. So somehow I'd need to express the query content:.10 content:100 content:00. content:0.2 content:.22 with *the tokens exactly in this order and nothing in between*. Is this somehow possible, maybe by using the termvectors/termpositions stuff? Or am I trying to do something that's fundamentally impossible?
Any other good ideas on how to achieve this kind of behaviour? Thanks, Christian
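The false positive can be reproduced outside Solr. The sketch below is illustrative only (it is not how Lucene evaluates the query): it shows that requiring the trigrams to be consecutive and in order, i.e. the behaviour of a slop-0 phrase query over the grams, separates the two documents where the unordered proximity-5 match does not.

```python
def trigrams(s: str):
    """Case-folded character trigrams, as the NGram analyzers would emit."""
    s = s.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

def matches_unordered(doc: str, query: str) -> bool:
    """Roughly what mm=100% requires: every query gram occurs somewhere."""
    dg = set(trigrams(doc))
    return all(g in dg for g in trigrams(query))

def matches_ordered(doc: str, query: str) -> bool:
    """Query grams must appear consecutively and in order (slop-0 phrase)."""
    dg, qg = trigrams(doc), trigrams(query)
    return any(dg[i:i + len(qg)] == qg for i in range(len(dg) - len(qg) + 1))

doc1 = " encryption 10.0.100.22 description "
doc2 = " 10.100.0.22 description "

# ".100.22" is a literal substring of doc1 only, yet every one of its
# trigrams also occurs in doc2 -- hence the false positive.
print(matches_unordered(doc2, ".100.22"))  # True  (wrong hit)
print(matches_ordered(doc2, ".100.22"))    # False (correct)
print(matches_ordered(doc1, ".100.22"))    # True  (correct)
```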
Re: Exact substring search with ngrams
Hmmm, this sounds like a nonsensical question, but what do you mean by arbitrary substring? Because if your substrings consist of whole _tokens_, then ngramming is totally unnecessary (and gets in the way). Phrase queries with no slop fulfill this requirement. But let's assume you need to match within tokens, i.e. if the doc contains my dog has fleas, you need to match input like as fle; in this case ngramming is an option. You have substantially different index and query time chains. The result is that the offsets for all the grams at index time are the same; in the quick experiment I tried, all were 1. But at query time, each gram had an incremented position. I'd start by using the query time analysis chain for indexing also. Next, I'd try enclosing multiple words in double quotes at query time and go from there. What you have now is an anti-pattern in that having substantially different index and query time analysis chains is not something that's likely to be very predictable unless you know _exactly_ what the consequences are. The admin/analysis page is your friend; in this case check the verbose checkbox to see what I mean. Best, Erick On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer r...@networkz.ch wrote: Hi, I'm trying to build an index for technical documents that basically works like grep, i.e. the user gives an arbitrary substring somewhere in a line of a document and the exact matches will be returned. I specifically want no stemming etc. and keep all whitespace, parentheses etc. because they might be significant. The only normalization is that the search should be case-insensitive.
I tried to achieve this by tokenizing on line breaks, and then building trigrams of the individual lines: <fieldType name="configtext_trigram" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.PatternTokenizerFactory" pattern="\R" group="-1"/> <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="3"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> Then in the search, I use the edismax parser with mm=100%, so given the documents {id:test1,content:" encryption 10.0.100.22 description "} {id:test2,content:" 10.100.0.22 description "} and the query content:encryption, this will turn into parsedquery_toString: +((content:enc content:ncr content:cry content:ryp content:ypt content:pti content:tio content:ion)~8), and return only the first document. All fine and dandy. But I have a problem with possible false positives. If the search is e.g. content:.100.22 then the generated query will be parsedquery_toString: +((content:.10 content:100 content:00. content:0.2 content:.22)~5), and because all of the tokens are also generated for document test2 in the proximity of 5, both documents will wrongly be returned. So somehow I'd need to express the query content:.10 content:100 content:00. content:0.2 content:.22 with *the tokens exactly in this order and nothing in between*. Is this somehow possible, maybe by using the termvectors/termpositions stuff? Or am I trying to do something that's fundamentally impossible? Any other good ideas on how to achieve this kind of behaviour? Thanks, Christian
Re: Tokenizers and DelimitedPayloadTokenFilterFactory
Well, you're going down a path that hasn't been trodden before ;). If you can treat your primitive types as text types you might get some traction, but that makes a lot of operations like numeric comparison difficult. Hmmm, another idea from left field: for single-valued types, what about a sidecar field that has the auth token? And even for a multiValued field, two parallel fields are guaranteed to maintain order, so perhaps you could do something here. Yes, I'm waving my hands a LOT here. I suspect that trying to have a custom type that incorporates payloads for, say, trie fields will be interesting to say the least. Numeric types are packed to save storage etc., so it'll be an adventure. Best, Erick On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson jej2...@gmail.com wrote: We were originally using this approach, i.e. run things through the KeywordTokenizer - DelimitedPayloadFilter - WordDelimiterFilter. Again this works fine for text, though I had wanted to use the StandardTokenizer in the chain. Is there an equivalent filter that does what the StandardTokenizer does? All of this said, this doesn't address the issue of the primitive field types, which at this point is the bigger issue. Given this use case, should there be another way to provide payloads? My current thinking is that I will need to provide custom implementations for all of the field types I would like to support payloads on, which will essentially be copies of the standard versions with some extra sugar to read/write the payloads (I don't see a way to wrap/delegate these at this point because AttributeSource has the attribute retrieval related methods as final, so I can't simply wrap another tokenizer and return my added attributes + the wrapped attributes). I know my use case is a bit strange, but I had not expected to need to do this given that Lucene/Solr supports payloads on these field types, they just aren't exposed. As always, I appreciate any ideas if I'm barking up the wrong tree here.
On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma markus.jel...@openindex.io wrote: Well, if i remember correctly (i have no testing facility at hand) WordDelimiterFilter maintains payloads on emitted sub terms. So if you use a KeywordTokenizer, input 'some text^PAYLOAD', and have a DelimitedPayloadFilter, the entire string gets a payload. You can then split that string up again in individual tokens. It is possible to abuse WordDelimiterFilter for it because it has a types parameter that you can use to split it on whitespace if its input is not trimmed. Otherwise you can use any other character instead of a space as your input. This is a crazy idea, but it might work. -Original message- From:Jamie Johnson jej2...@gmail.com Sent: Tuesday 25th August 2015 19:37 To: solr-user@lucene.apache.org Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory To be clear, we are using payloads as a way to attach authorizations to individual tokens within Solr. The payloads are normal Solr Payloads though we are not using floats, we are using the identity payload encoder (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for storing a byte[] of our choosing into the payload field. This works great for text, but now that I'm indexing more than just text I need a way to specify the payload on the other field types. Does that make more sense? On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson erickerick...@gmail.com wrote: This really sounds like an XY problem. Or when you use payload it's not the Solr payload. So Solr Payloads are a float value that you can attach to individual terms to influence the scoring. Attaching the _same_ payload to all terms in a field is much the same thing as boosting on any matches in the field at query time or boosting on the field at index time (this latter assuming that different docs would have different boosts). 
So can you back up a bit and tell us what you're trying to accomplish? Maybe we can be sure we're both talking about the same thing ;) Best, Erick On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com wrote: I would like to specify a particular payload for all tokens emitted from a tokenizer, but don't see a clear way to do this. Ideally I could specify that something like the DelimitedPayloadTokenFilter be run on the entire field and then standard analysis be done on the rest of the field, so in the case that I had the following text this is a test\Foo I would like to create the tokens this, is, a, test each with a payload of Foo. From what I'm seeing though, only test gets the payload. Is there any way to accomplish this or will I need to implement a custom tokenizer?
ANNOUNCE: Apache Solr Reference Guide for Solr 5.3 released
The Lucene PMC is pleased to announce the release of the Solr Reference Guide for Solr 5.3. This 577 page PDF is the definitive guide for using Apache Solr and can be downloaded from: https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/ If you have
Re: Unknown query parser 'terms' with TermsComponent defined
Thanks Hoss! It's obvious what the problem(s) are when you lay it all out that way. On Tue, Aug 25, 2015 at 12:14 PM, Chris Hostetter hossman_luc...@fucit.org wrote: 1) The terms Query Parser (TermsQParser) has nothing to do with the TermsComponent (the first is for querying many distinct terms, the latter is for requesting info about low level terms in your index) https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser https://cwiki.apache.org/confluence/display/solr/The+Terms+Component 2) TermsQParser (which is what you are trying to use with the {!terms... query syntax) was not added to Solr until 4.10 3) based on your example query, I'm pretty sure what you want is the TermQParser: term (singular, no s) ... https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermQueryParser {!term f=id}ft849m81z : We've encountered a strange situation, I'm hoping someone might be able to : shed some light. We're using Solr 4.9 deployed in Tomcat 7. ... : 'q'='_query_:{!raw f=has_model_ssim}Batch AND ({!terms f=id}ft849m81z)', ... : 'msg'='Unknown query parser \'terms\'', : 'code'=400}} ... : The terms component is defined in solrconfig.xml: : : <searchComponent name="termsComponent" class="solr.TermsComponent"/> -Hoss http://www.lucidworks.com/
RE: Bot protection (CAPTCHA)
So, usually, the middleware is the answer, just like with a database. With applications backed by database systems, there is usually an application server tier, and then a database tier. There may be a web server tier in front of the application server tier. The search engine and database belong in the same tier. Suppose your search needs the title and some other information to be displayed with search results - store these in the search engine. Suppose your detailed pages need lots of additional fields - maybe you can keep those in your database and retrieve them only as needed for click-through. -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Tuesday, August 25, 2015 9:40 AM To: solr-user solr-user@lucene.apache.org Subject: Re: Bot protection (CAPTCHA) The standard answer is that exposing the API is a REALLY bad idea. To start with, you can issue the delete commands through the API. And they can be escaped in multiple different ways. Plus, you have the admin UI there as well, to manipulate the cores and to see the configuration files for them. So, usually, the middleware is the answer, just like with a database. The most recent (5.3!) version of Solr added some authentication, but that's still not something you could use from a public web page, as that would imply hard-coding a password. You could possibly make the index read-only, lock down the filesystem, etc. But that's a lot of effort and logistics. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 August 2015 at 09:29, Dmitry Savenko d...@dsavenko.com wrote: Hello, I plan to expose the Solr search REST API to the world, so it can be called from my web page directly, without an additional server layer. I'm concerned about bots, so I plan to add a CAPTCHA to my page. Surely, I'd like to do it with as little effort as possible. Does Solr provide CAPTCHA support out of the box or via some plugins?
I've searched the docs and haven't found any mentions of it. Or, maybe, exposing the API is an extremely bad idea, and I should have a middle layer on the server side? Any help would be much appreciated! Best regards, Dmitry.
Search opening hours
I'm trying to find the best way to search for stores that are open NOW. I have day of week, open and closing times. I've seen some examples, but not an exact fit. What is the best way to tackle this? Thank you for any suggestions you have to offer. -- View this message in context: http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query timeAllowed and its behavior.
timeAllowed applies to the time taken by the collector in each shard (TimeLimitingCollector). Once timeAllowed is exceeded the collector terminates early, returning any partial results it has and freeing the resources it was using. From Solr 5.0, timeAllowed also applies to the query expansion phase and SolrClient request retry. From: solr-user@lucene.apache.org At: Aug 25 2015 10:18:07 Subject: Re: Query timeAllowed and its behavior. Hi, Kindly help me understand the query timeAllowed attribute. The following is set in solrconfig.xml: <int name="timeAllowed">30</int> Does this setting stop the query from running after timeAllowed is reached? If not, is there a way to stop it, as it will occupy resources in the background for no benefit. Thanks, Modassar
Re: Lucene/Solr 5.0 and custom FieldCache implementation
On Tue, Aug 25, 2015 at 2:03 PM, Jamie Johnson jej2...@gmail.com wrote: Thanks Mikhail. If I'm reading the SimpleFacets class correctly, it delegates to DocValuesFacets when the facet method is FC, what used to be FieldCache I believe. DocValuesFacets either uses DocValues or builds them using the UninvertingReader. Ah.. got it. Thanks for reminding me of these details. It seems like even docValues=true doesn't help with your custom implementation. I am not seeing a clean extension point to add a custom UninvertingReader to Solr; would the only way be to copy the FacetComponent and SimpleFacets and modify as needed? Sadly, yes. There is no proper extension point. Also, consider overriding SolrIndexSearcher.wrapReader(SolrCore, DirectoryReader), where the particular UninvertingReader is created; there you can pass your own one, which refers to the custom FieldCache. On Aug 25, 2015 12:42 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello Jamie, I don't understand how it could choose DocValuesFacets (it occurs on docValues=true) field, but then switches to UninvertingReader/FieldCache which means docValues=false. If you can provide more details it would be great. Beside that, I suppose you can only implement and inject your own UninvertingReader; I don't think there is an extension point for this. It's too specific a requirement. On Tue, Aug 25, 2015 at 3:50 AM, Jamie Johnson jej2...@gmail.com wrote: as mentioned in a previous email I have a need to provide security controls at the term level. I know that Lucene/Solr doesn't support this so I had baked something onto a 4.x baseline that was sufficient for my use cases. I am now looking to move that implementation to 5.x and am running into an issue around faceting.
Previously we were able to provide a custom cache implementation that would create separate cache entries given a particular set of security controls, but in Solr 5 some faceting is delegated to DocValuesFacets, which delegates to UninvertingReader in my case (we are not storing DocValues). The issue I am running into is that before 5.x I had the ability to influence the FieldCache that was used at the Solr level to also include a security token in the key, so each cache entry was scoped to a particular level. With the current implementation the FieldCache seems to be an internal detail that I can't influence in any way. Is this correct? I had noticed this Jira ticket https://issues.apache.org/jira/browse/LUCENE-5427, is there any movement on this? Is there another way to influence the information that is put into these caches? As always thanks in advance for any suggestions. -Jamie -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Lucene/Solr 5.0 and custom FieldCache implementation
I had seen this as well. If I overrode this by extending SolrIndexSearcher, how do I have my extension used? I didn't see a way that it could be plugged in. On Aug 25, 2015 7:15 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Tue, Aug 25, 2015 at 2:03 PM, Jamie Johnson jej2...@gmail.com wrote: Thanks Mikhail. If I'm reading the SimpleFacets class correctly, it delegates to DocValuesFacets when the facet method is FC, what used to be FieldCache I believe. DocValuesFacets either uses DocValues or builds them using the UninvertingReader. Ah.. got it. Thanks for reminding me of these details. It seems like even docValues=true doesn't help with your custom implementation. I am not seeing a clean extension point to add a custom UninvertingReader to Solr; would the only way be to copy the FacetComponent and SimpleFacets and modify as needed? Sadly, yes. There is no proper extension point. Also, consider overriding SolrIndexSearcher.wrapReader(SolrCore, DirectoryReader), where the particular UninvertingReader is created; there you can pass your own one, which refers to the custom FieldCache. On Aug 25, 2015 12:42 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello Jamie, I don't understand how it could choose DocValuesFacets (it occurs on docValues=true) field, but then switches to UninvertingReader/FieldCache which means docValues=false. If you can provide more details it would be great. Beside that, I suppose you can only implement and inject your own UninvertingReader; I don't think there is an extension point for this. It's too specific a requirement. On Tue, Aug 25, 2015 at 3:50 AM, Jamie Johnson jej2...@gmail.com wrote: as mentioned in a previous email I have a need to provide security controls at the term level. I know that Lucene/Solr doesn't support this so I had baked something onto a 4.x baseline that was sufficient for my use cases. I am now looking to move that implementation to 5.x and am running into an issue around faceting.
Previously we were able to provide a custom cache implementation that would create separate cache entries given a particular set of security controls, but in Solr 5 some faceting is delegated to DocValuesFacets, which delegates to UninvertingReader in my case (we are not storing DocValues). The issue I am running into is that before 5.x I had the ability to influence the FieldCache that was used at the Solr level to also include a security token in the key, so each cache entry was scoped to a particular level. With the current implementation the FieldCache seems to be an internal detail that I can't influence in any way. Is this correct? I had noticed this Jira ticket https://issues.apache.org/jira/browse/LUCENE-5427, is there any movement on this? Is there another way to influence the information that is put into these caches? As always thanks in advance for any suggestions. -Jamie -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: how to prevent uuid-field changing in /update query?
It sounds like you need to control when the uuid is and is not created; it just feels like you'd get better mileage doing this outside of Solr. On Aug 25, 2015 7:49 AM, CrazyDiamond crazy_diam...@mail.ru wrote: Why not generate the uuid client side on the initial save and reuse this on updates? I can't do this because I have delta-import queries which also should be able to assign a uuid when it is needed -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113p4225137.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to prevent uuid-field changing in /update query?
It sounds like you need to control when the uuid is and is not created; it just feels like you'd get better mileage doing this outside of Solr. Can I simply insert a condition (blank or not) in the uuid update-chain? -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113p4225141.html Sent from the Solr - User mailing list archive at Nabble.com.
Behavior of grouping on a field with same value spread across shards.
Hi, As per my understanding, to group on a field, all documents with the same value in the field have to be in the same shard. Can we group by a field where the documents with the same value in that field are distributed across shards? Please let me know what limitations, missing features, or performance issues apply to such fields. Thanks, Modassar
Re: how to prevent uuid-field changing in /update query?
Why not generate the uuid client side on the initial save and reuse this on updates? On Aug 25, 2015 4:22 AM, CrazyDiamond crazy_diam...@mail.ru wrote: I have a uuid field. It is not set as unique, but nevertheless I want it not to be changed every time I call /update. It might be because I added a request handler with the name /update which contains the uuid update chain. But if I do not do this I have no uuid at all. Maybe I can configure the uuid update-chain to set the uuid only if it is blank? -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query timeAllowed and its behavior.
Thanks for your response, Jonathon. Please correct me if I am wrong on the following points. - The query actually ceases to run once timeAllowed is reached and releases all the resources. - Query expansion is stopped and the query is terminated from execution, releasing all the resources. Thanks, Modassar On Tue, Aug 25, 2015 at 4:46 PM, Jonathon Marks (BLOOMBERG/ LONDON) jmark...@bloomberg.net wrote: timeAllowed applies to the time taken by the collector in each shard (TimeLimitingCollector). Once timeAllowed is exceeded the collector terminates early, returning any partial results it has and freeing the resources it was using. From Solr 5.0, timeAllowed also applies to the query expansion phase and SolrClient request retry. From: solr-user@lucene.apache.org At: Aug 25 2015 10:18:07 Subject: Re: Query timeAllowed and its behavior. Hi, Kindly help me understand the query timeAllowed attribute. The following is set in solrconfig.xml: <int name="timeAllowed">30</int> Does this setting stop the query from running after timeAllowed is reached? If not, is there a way to stop it, as it will occupy resources in the background for no benefit. Thanks, Modassar
Re: how to prevent uuid-field changing in /update query?
Why not generate the uuid client side on the initial save and reuse this on updates? I can't do this because I have delta-import queries which also should be able to assign a uuid when it is needed. -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113p4225137.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to prevent uuid-field changing in /update query?
I am honestly not familiar enough to say. Best to try it On Aug 25, 2015 7:59 AM, CrazyDiamond crazy_diam...@mail.ru wrote: It sounds like you need to control when the uuid is and is not created, just feels like you'd get better mileage doing this outside of solr Can I simply insert a condition(blank or not ) in uuid update-chain? -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-prevent-uuid-field-changing-in-update-query-tp4225113p4225141.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Lucene/Solr 5.0 and custom FieldCache implementation
Thanks Mikhail. If I'm reading the SimpleFacets class correctly, it delegates to DocValuesFacets when the facet method is FC, which is what used to be FieldCache I believe. DocValuesFacets either uses DocValues or builds them using the UninvertingReader. I am not seeing a clean extension point to add a custom UninvertingReader to Solr; would the only way be to copy the FacetComponent and SimpleFacets and modify as needed? On Aug 25, 2015 12:42 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello Jamie, I don't understand how it could choose DocValuesFacets (it occurs on docValues=true fields), but then switch to UninvertingReader/FieldCache, which implies docValues=false. If you can provide more details it would be great. Beside that, I suppose you can only implement and inject your own UninvertingReader; I don't think there is an extension point for this. It's too specific a requirement. On Tue, Aug 25, 2015 at 3:50 AM, Jamie Johnson jej2...@gmail.com wrote: as mentioned in a previous email I have a need to provide security controls at the term level. I know that Lucene/Solr doesn't support this so I had baked something onto a 4.x baseline that was sufficient for my use cases. I am now looking to move that implementation to 5.x and am running into an issue around faceting. Previously we were able to provide a custom cache implementation that would create separate cache entries given a particular set of security controls, but in Solr 5 some faceting is delegated to DocValuesFacets which delegates to UninvertingReader in my case (we are not storing DocValues). The issue I am running into is that before 5.x I had the ability to influence the FieldCache that was used at the Solr level to also include a security token into the key so each cache entry was scoped to a particular level. With the current implementation the FieldCache seems to be an internal detail that I can't influence in any way. Is this correct? 
I had noticed this Jira ticket https://issues.apache.org/jira/browse/LUCENE-5427, is there any movement on this? Is there another way to influence the information that is put into these caches? As always thanks in advance for any suggestions. -Jamie -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Tokenizers and DelimitedPayloadTokenFilterFactory
Right, I had assumed (obviously here is my problem) that I'd be able to specify payloads for the field regardless of the field type. Looking at TrieField that is certainly non-trivial. After a bit of digging it appears that if I wanted to do something here I'd need to build a new TrieField, override createField and provide a Field that would return something like NumericTokenStream but also provide the payloads. Like you said sounds interesting to say the least... Were payloads not really intended to be used for these types of fields from a Lucene perspective? On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson erickerick...@gmail.com wrote: Well, you're going down a path that hasn't been trodden before ;). If you can treat your primitive types as text types you might get some traction, but that makes a lot of operations like numeric comparison difficult. H. another idea from left field. For single-valued types, what about a sidecar field that has the auth token? And even for a multiValued field, two parallel fields are guaranteed to maintain order so perhaps you could do something here. Yes, I'm waving my hands a LOT here. I suspect that trying to have a custom type that incorporates payloads for, say, trie fields will be interesting to say the least. Numeric types are packed to save storage etc. so it'll be an adventure.. Best, Erick On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson jej2...@gmail.com wrote: We were originally using this approach, i.e. run things through the KeywordTokenizer - DelimitedPayloadFilter - WordDelimiterFilter. Again this works fine for text, though I had wanted to use the StandardTokenizer in the chain. Is there an equivalent filter that does what the StandardTokenizer does? All of this said this doesn't address the issue of the primitive field types, which at this point is the bigger issue. Given this use case should there be another way to provide payloads? 
My current thinking is that I will need to provide custom implementations for all of the field types I would like to support payloads on which will essentially be copies of the standard versions with some extra sugar to read/write the payloads (I don't see a way to wrap/delegate these at this point because AttributeSource has the attribute retrieval related methods as final so I can't simply wrap another tokenizer and return my added attributes + the wrapped attributes). I know my use case is a bit strange, but I had not expected to need to do this given that Lucene/Solr supports payloads on these field types, they just aren't exposed. As always I appreciate any ideas if I'm barking up the wrong tree here. On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma markus.jel...@openindex.io wrote: Well, if i remember correctly (i have no testing facility at hand) WordDelimiterFilter maintains payloads on emitted sub terms. So if you use a KeywordTokenizer, input 'some text^PAYLOAD', and have a DelimitedPayloadFilter, the entire string gets a payload. You can then split that string up again in individual tokens. It is possible to abuse WordDelimiterFilter for it because it has a types parameter that you can use to split it on whitespace if its input is not trimmed. Otherwise you can use any other character instead of a space as your input. This is a crazy idea, but it might work. -Original message- From:Jamie Johnson jej2...@gmail.com Sent: Tuesday 25th August 2015 19:37 To: solr-user@lucene.apache.org Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory To be clear, we are using payloads as a way to attach authorizations to individual tokens within Solr. The payloads are normal Solr Payloads though we are not using floats, we are using the identity payload encoder (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for storing a byte[] of our choosing into the payload field. 
This works great for text, but now that I'm indexing more than just text I need a way to specify the payload on the other field types. Does that make more sense? On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson erickerick...@gmail.com wrote: This really sounds like an XY problem. Or when you use payload it's not the Solr payload. So Solr Payloads are a float value that you can attach to individual terms to influence the scoring. Attaching the _same_ payload to all terms in a field is much the same thing as boosting on any matches in the field at query time or boosting on the field at index time (this latter assuming that different docs would have different boosts). So can you back up a bit and tell us what you're trying to accomplish maybe we can be sure we're both talking about the same thing ;) Best, Erick On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com wrote:
Re: how to prevent uuid-field changing in /update query?
: updates? i can't do this because i have delta-import queries which also : should be able to assign uuid when it is needed You really need to give us a full and complete picture of what exactly you are currently doing, what's working, what's not working, and when it's not working what is it doing and how is that different from what you expect. example: you mentioned you have a requesthandler with name /update which contains a uuid update chain (presumably you mean the processor) but you haven't shown us your configs, or any of your logs, so we can see how exactly it's configured, or if/how it's being used. If UUIDUpdateProcessorFactory is in place, then it should only generate a new UUID if the document doesn't already have one -- if you are using DIH to add documents to the index, and the uuid you are using/generating isn't also the uniqueKey field, then the UUIDUpdateProcessorFactory doesn't have any way of magically knowing when a new document is actually a replacement for an old document. (If you are using Atomic Updates, then registering UUIDUpdateProcessorFactory *after* the DistributedUpdateProcessorFactory can help -- but that doesn't sound like it's relevant if you are using DIH delta updates) Please review this page and give us *all* the details about your current setup, your goal, and the specific problem you are facing... https://wiki.apache.org/solr/UsingMailingLists -Hoss http://www.lucidworks.com/
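For what it's worth, the ordering Hoss describes for the atomic-update case would look roughly like this in solrconfig.xml (a sketch; the chain name and field name are invented):

```xml
<updateRequestProcessorChain name="uuid-after-distrib">
  <!-- placing the UUID processor after the distributed processor means it runs
       after the existing document (if any) has been merged in, so it only
       generates a uuid when the document doesn't already carry one -->
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">my_uuid</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```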
RE: Spellcheck / Suggestions : Append custom dictionary to SOLR default index
Max, If you know the entire list of words you want to spellcheck against, you can use FileBasedSpellChecker. See http://wiki.apache.org/solr/FileBasedSpellChecker . If, however, you have a field you want to spellcheck against but also want additional words added, consider using a copy of the field for spellcheck purposes, and then index the additional terms to that field. You may be able to accomplish this easily, for instance, by using index-time synonyms in the analysis chain for the spellcheck field. Or you could just append them to any document (more than once if you want to boost the term frequency). Keep in mind that while this will work fine for regular word-by-word spell suggestions, collations are not going to work well with these approaches. James Dyer Ingram Content Group -Original Message- From: Max Chadwick [mailto:mpchadw...@gmail.com] Sent: Monday, August 24, 2015 9:43 PM To: solr-user@lucene.apache.org Subject: Spellcheck / Suggestions : Append custom dictionary to SOLR default index Is there a way to append a set of words to the out-of-box Solr index when using the spellcheck / suggestions feature?
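A minimal FileBasedSpellChecker configuration along the lines James describes might look like this (a sketch; the file and directory names are invented):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <!-- plain text file, one word per line -->
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
</searchComponent>
```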
Re: Search opening hours
Have you seen: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3c1354991310424-4025359.p...@n3.nabble.com%3E https://wiki.apache.org/solr/SpatialForTimeDurations https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/ Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 August 2015 at 17:02, O. Klein kl...@octoweb.nl wrote: I'm trying to find the best way to search for stores that are open NOW. I have day of week, open and closing times. I've seen some examples, but not an exact fit. What is the best way to tackle this? Thank you for any suggestions you have to offer. -- View this message in context: http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250.html Sent from the Solr - User mailing list archive at Nabble.com.
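The "spatial for time durations" trick in those links can be sketched roughly as follows (a sketch only, not tested; field and type names are invented, and the exact worldBounds syntax varies by Solr version): number the minutes of the week 0..10080, index each open interval as a point (openMinute, closeMinute), and ask for documents whose interval contains "now".

```xml
<!-- x = opening minute-of-week, y = closing minute-of-week -->
<fieldType name="weekHours" class="solr.SpatialRecursivePrefixTreeFieldType"
           geo="false" worldBounds="ENVELOPE(0, 10080, 10080, 0)"
           maxDistErr="1" distErrPct="0"/>
<field name="open_hours" type="weekHours" indexed="true" stored="true"/>
```

A store open at minute T then satisfies openMinute <= T <= closeMinute, which is the rectangle x:[0,T], y:[T,10080], e.g. something like fq={!field f=open_hours}Intersects(ENVELOPE(0, T, 10080, T)) with T filled in by the client.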
Re: Tokenizers and DelimitedPayloadTokenFilterFactory
We were originally using this approach, i.e. run things through the KeywordTokenizer - DelimitedPayloadFilter - WordDelimiterFilter. Again this works fine for text, though I had wanted to use the StandardTokenizer in the chain. Is there an equivalent filter that does what the StandardTokenizer does? All of this said this doesn't address the issue of the primitive field types, which at this point is the bigger issue. Given this use case should there be another way to provide payloads? My current thinking is that I will need to provide custom implementations for all of the field types I would like to support payloads on which will essentially be copies of the standard versions with some extra sugar to read/write the payloads (I don't see a way to wrap/delegate these at this point because AttributeSource has the attribute retrieval related methods as final so I can't simply wrap another tokenizer and return my added attributes + the wrapped attributes). I know my use case is a bit strange, but I had not expected to need to do this given that Lucene/Solr supports payloads on these field types, they just aren't exposed. As always I appreciate any ideas if I'm barking up the wrong tree here. On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma markus.jel...@openindex.io wrote: Well, if i remember correctly (i have no testing facility at hand) WordDelimiterFilter maintains payloads on emitted sub terms. So if you use a KeywordTokenizer, input 'some text^PAYLOAD', and have a DelimitedPayloadFilter, the entire string gets a payload. You can then split that string up again in individual tokens. It is possible to abuse WordDelimiterFilter for it because it has a types parameter that you can use to split it on whitespace if its input is not trimmed. Otherwise you can use any other character instead of a space as your input. This is a crazy idea, but it might work. 
-Original message- From:Jamie Johnson jej2...@gmail.com Sent: Tuesday 25th August 2015 19:37 To: solr-user@lucene.apache.org Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory To be clear, we are using payloads as a way to attach authorizations to individual tokens within Solr. The payloads are normal Solr Payloads though we are not using floats, we are using the identity payload encoder (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for storing a byte[] of our choosing into the payload field. This works great for text, but now that I'm indexing more than just text I need a way to specify the payload on the other field types. Does that make more sense? On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson erickerick...@gmail.com wrote: This really sounds like an XY problem. Or when you use payload it's not the Solr payload. So Solr Payloads are a float value that you can attach to individual terms to influence the scoring. Attaching the _same_ payload to all terms in a field is much the same thing as boosting on any matches in the field at query time or boosting on the field at index time (this latter assuming that different docs would have different boosts). So can you back up a bit and tell us what you're trying to accomplish maybe we can be sure we're both talking about the same thing ;) Best, Erick On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson jej2...@gmail.com wrote: I would like to specify a particular payload for all tokens emitted from a tokenizer, but don't see a clear way to do this. Ideally I could specify that something like the DelimitedPayloadTokenFilter be run on the entire field and then standard analysis be done on the rest of the field, so in the case that I had the following text this is a test\Foo I would like to create tokens this, is, a, test each with a payload of Foo. From what I'm seeing though only test gets the payload. Is there any way to accomplish this or will I need to implement a custom tokenizer?
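For reference, the first half of the chain discussed above might be wired up like this in schema.xml (a sketch; the backslash delimiter matches the "test\Foo" example in the thread, and splitting the payloaded string back into individual word tokens afterwards is the fiddly part Markus describes):

```xml
<fieldType name="payloaded_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- the whole field value becomes one token, so the payload applies to all of it -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- identity encoder stores the raw bytes after the delimiter as the payload -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="\" encoder="identity"/>
  </analyzer>
</fieldType>
```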
Please answer my question on StackOverflow ... Best approach to guarantee commits in SOLR
http://stackoverflow.com/questions/32138845/what-is-the-best-approach-to-guarantee-commits-in-apache-solr . *Question:* How can I get guaranteed commits with Apache SOLR where persisting data to disk and visibility are both equally important? *Background:* We have a website which requires high-end search functionality for machine learning and also requires guaranteed commits for financial transactions. We just want to use SOLR as our only datastore to keep things simple and *do not* want to use another database on the side. I can't seem to find any answer to this question. The simplest solution for a financial transaction seems to be to periodically query SOLR for the record after it has been persisted, but this can mean a longer wait time; is there a better solution? Can anyone please suggest a solution for achieving guaranteed commits with SOLR?
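Not an answer to the question itself, but for context: in Solr, durability and visibility are controlled separately in solrconfig.xml. The transaction log plus hard commits persist data to disk, while soft commits make it searchable. A typical sketch (the timing values here are invented examples):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- transaction log: updates survive a crash once acknowledged -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <!-- hard commit: flush segments to disk, but don't open a new searcher -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: make recent updates visible to searches -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```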
Re: Performance gain with setting !cache=false in the query for complex queries
Hi Erick, Up to now, all the tests were based on randomly generated requests. In reality, many requests will get executed more than twice since this is to support the advertising project. On the other hand, new queries could be generated daily. So some of the filter queries will be used frequently for a period of time, and will not be used afterwards. I will take your advice to analyze the real queries once the project is in production. Thank you very much! -- View this message in context: http://lucene.472066.n3.nabble.com/Performance-gain-with-setting-cache-false-in-the-query-for-complex-queries-tp4224931p4225147.html Sent from the Solr - User mailing list archive at Nabble.com.
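For anyone following along, the cache=false setting under discussion is a local param on the filter query, optionally combined with cost to control the order in which filters execute (the values below are examples only):

```
fq={!cache=false cost=100}price:[10 TO *]
```

Higher-cost non-cached filters run last, over the smallest candidate set, which is why this helps for one-off filter queries that would otherwise pollute the filter cache.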
Bot protection (CAPTCHA)
Hello, I plan to expose Solr search REST API to the world, so it can be called from my web page directly, without additional server layer. I'm concerned about bots, so I plan to add CAPTCHA to my page. Surely, I'd like to do it with as little effort as possible. Does Solr provide CAPTCHA support out of the box or via some plugins? I've searched the docs and haven't found any mentions of it. Or, maybe, exposing the API is an extremely bad idea, and I should have a middle layer on the server side? Any help would be much appreciated! Best regards, Dmitry.
Re: Bot protection (CAPTCHA)
The standard answer is that exposing the API is a REALLY bad idea. To start with, you can issue delete commands through the API, and they can be escaped in multiple different ways. Plus, you have the admin UI there as well, both to manipulate the cores and to see their configuration files. So, usually, the middleware is the answer, just like with a database. The most recent (5.3!) version of Solr added some authentication, but that's still not something you could use from a public web page, as that would imply hard-coding the password. You could possibly make the index read-only, lock down the filesystem, etc. But that's a lot of effort and logistics. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 25 August 2015 at 09:29, Dmitry Savenko d...@dsavenko.com wrote: Hello, I plan to expose Solr search REST API to the world, so it can be called from my web page directly, without additional server layer. I'm concerned about bots, so I plan to add CAPTCHA to my page. Surely, I'd like to do it with as little effort as possible. Does Solr provide CAPTCHA support out of the box or via some plugins? I've searched the docs and haven't found any mentions of it. Or, maybe, exposing the API is an extremely bad idea, and I should have a middle layer on the server side? Any help would be much appreciated! Best regards, Dmitry.
testing with EmbeddedSolrServer
Is there an example of integration-testing with EmbeddedSolrServer that loads data from a DataImportHandler - then queries the data? I've tried doing this based on org.apache.solr.client.solrj.embedded.TestEmbeddedSolrServerConstructors. But no data is being imported. Here is the test class I've tried: https://gist.github.com/emoen/5d0a28df91c4c1127238 I've also tried writing a test by extending AbstractSolrTestCase - but haven't got this working. I've documented some of the log output here: http://stackoverflow.com/questions/32052642/solrcorestate-already-closed-with-unit-test-using-embeddedsolrserver-v-5-2-1 Should I extend AbstractSolrTestCase or SolrTestCaseJ4 when writing tests? Cheers Endre
CloudSolrClient does not distribute suggest.build=true
When using the new Suggester component (with AnalyzingInfixSuggester) in Solr trunk with solrj, the suggest.build command seems to be executed only on one of the solr cloud nodes. I had to add shards.qt=/suggest and shards=host1:port2/solr/mycollection,host2:port2/solr/mycollection... to distribute the build command to all nodes. Given that we are using SolrCloud, I would have expected the build command to behave like a cloud update and be sent to all nodes without the need to specify shards and shards.qt. Thanks. Arcadius.
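Arcadius's workaround, spelled out as a request (host, port, and collection names are placeholders; line breaks added for readability only):

```
http://host1:8983/solr/mycollection/suggest?suggest=true&suggest.build=true
    &shards.qt=/suggest
    &shards=host1:8983/solr/mycollection,host2:8983/solr/mycollection
```

The shards param enumerates every node that should run the build, and shards.qt points the distributed sub-requests at the /suggest handler instead of the default /select.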
Re: Solr performance is slow with just 1GB of data indexed
Hi Toke, Thank you for your reply. I'm currently trying out the Carrot2 Workbench and getting it to call Solr to see how they did the clustering. Although it still takes some time to do the clustering, the results of the clustering are much better than mine. I think it's probably due to different settings like fragSize and desiredClusterCountBase? By the way, the link on the clustering example https://cwiki.apache.org/confluence/display/solr/Result is not working as it says 'Page Not Found'. Regards, Edwin On 25 August 2015 at 15:29, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote: Would like to confirm, when I set rows=100, does it mean that it only builds the cluster based on the first 100 records that are returned by the search, and if I have 1000 records that match the search, all the remaining 900 records will not be considered for clustering? That is correct. It is not stated very clearly, but it follows from reading the comments in the third example at https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration As if that is the case, the result of the cluster may not be so accurate as there is a possibility that the first 100 records might have a large amount of similarities in the records, while the subsequent 900 records have differences that could have impact on the cluster result. Such is the nature of on-the-fly clustering. The clustering aims to be as representative of your search result as possible. Assigning more weight to the higher scoring documents (in this case: All the weight, as those beyond the top-100 are not even considered) does this. If that does not fit your expectations, maybe you need something else? Plain faceting perhaps? Or maybe enrichment of the documents with some sort of entity extraction? - Toke Eskildsen, State and University Library, Denmark