Multilingual text analysis
Hello,

Some of the analyzers that can be applied to a text field depend on the language of the text and can be configured for a specific language. In my case, the text fields can be in many different languages, but each document also includes a field containing the language of its text fields. Is it possible to configure the analyzers to use the appropriate language for each document, based on the value of the language field?

Thanks,
Juan
Re: Multilingual text analysis
Thank you both, Paul and Lee, for your answers. Luckily, in my case there is no problem knowing the language at index time, nor do we really have to worry about the language of the query, as users can specify the language they are interested in. So I guess our solution will be to use different optional fields, one for each language, and that should be good enough. I had just wondered whether it was possible to parametrize the analyzers based on the value of a field. I think this would be a very elegant solution for many needs. Maybe it could be a possible improvement for future versions of Solr :)

Paul, what do you mean when you say it would make sense to start a page at the Solr website?

Thanks again,
Juan

On 02/06/2011 at 16:06, Paul Libbrecht wrote:

Juan,

An easy way in Solr, I think, is indeed to use different fields at index time and expand on multiple fields at query time. I believe using field-name wildcards allows you to specify a different analyzer per language doing this. There have been long discussions on the java-u...@lucene.apache.org mailing list about the best design for multilingual indexing and searching. One of the key arguments was whether you were able to reliably detect the language of a query; this is generally very hard. It would make sense to start a page at the Solr website...

paul

On 2 June 2011 at 12:52, lee carroll wrote:

Juan,

I don't think so. You can try indexing fields like myfield_en, myfield_fr, myfield_xx if you know what language you are dealing with at index and query time. You can also have separate cores for your documents, one for each language, if you don't want to complicate your schema. Again, you will need to know the language at index and query time.

On 2 June 2011 at 08:57, Juan Antonio Farré Basurte juan.fa...@reviewpro.com wrote:

[...]
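For what it's worth, the "one field per language" approach suggested above can be sketched on the client side as follows. This is only a minimal illustration: the field naming scheme (text_en, text_fr, ...), the language list, and the helper names are all assumptions, not anything defined by Solr, and the query builder does no escaping of the user's text.

```python
# Hypothetical naming scheme: one text field per language, e.g. text_en, text_fr.
# Each of these fields is assumed to be declared in the schema with its own
# language-specific analyzer.
SUPPORTED_LANGUAGES = {"en", "fr", "es"}  # assumption: languages set up in the schema


def localized_field(base, language):
    """Return the per-language field name, e.g. ('text', 'fr') -> 'text_fr'."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"no analyzer configured for language {language!r}")
    return f"{base}_{language}"


def to_index_document(doc):
    """Move the generic 'text' value into the field matching doc['language']."""
    indexed = {key: value for key, value in doc.items() if key != "text"}
    indexed[localized_field("text", doc["language"])] = doc["text"]
    return indexed


def build_query(user_text, language):
    """Build a phrase query against the field for the language the user chose.

    Note: user_text is not escaped here; a real client would have to do that.
    """
    return f'{localized_field("text", language)}:"{user_text}"'


if __name__ == "__main__":
    print(to_index_document({"id": 1, "language": "fr", "text": "bonjour le monde"}))
    print(build_query("hello world", "en"))
```

Since both the indexer and the user know the language, the same helper picks the field in both directions, so no language detection is needed.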
Re: Facet Query
Are you talking about a facet query or a facet field? If it's a facet query, I don't understand what's going on. If it's a facet field: if you are interested in a fixed set of words, filter the query to only those words and you'll get counts only for them. If you just need to filter out common words, I don't remember exactly how it works, but when you declare the text field (or its type) you can specify a processor that does exactly that: it removes common words from the indexed field, so you shouldn't get counts for them, because they simply aren't there. Sorry if my information is inexact. I haven't had to deal with this feature yet.

On 27/05/2011 at 09:51, Jasneet Sabharwal wrote:

Hi,

When I do a facet query on my data, it shows me a list of all the words present in my database with their counts. Is it possible to leave out common words like a, an, the, http and so on, and only get the counts of the terms we need, like microsoft, ipad, solr, etc.?

--
Thanks and regards,
Jasneet Sabharwal
frange vs TrieRange
Hello,

I have to perform range queries against a date field. It is a TrieDateField, and I'm already using it for sorting, so there will already be an entry in the FieldCache for it.

According to:
http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/
frange queries are typically faster than normal range queries when there are many terms between the endpoints (though they can be slower if fewer than 5% of the terms fall between the endpoints). The cost of this speedup is the memory associated with a FieldCache entry for the field. In my case, there is no additional memory overhead, as that entry already exists. The post also states that TrieRange queries have the best space/speed tradeoff.

Now my question is: if I have no memory overhead, then I only care about the relative speed of frange and trie. The good speed/space tradeoff of trie is not the measure I need in this case, just a comparison of pure speed. Does anybody know if there is data about this? Any clue on whether to choose frange or trie in this case?

Thanks,
Juan
Re: Nested grouping/field collapsing
I've run into the same issue. As far as I know, the only solution is to create a copy field which combines the values of both fields and facet on this field. If one of the fields has a set of distinct values known in advance and its cardinality c is not too big, it isn't a great problem: you can get by with c queries, one per value.

On 27/05/2011 at 15:03, Martijn Laarman wrote:

Hi,

I was wondering if this issue had already been raised. We currently have a use case where nested field collapsing would be really helpful, i.e. collapse on field X, then collapse on field Y within the groups returned by field X. The current behavior of specifying multiple fields seems to be returning multiple result sets. Has this already been requested as a feature? Does anybody know of a workaround?

Many thanks,
Martijn
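To make the combined-field workaround a bit more concrete, here is a small sketch of the idea in plain Python, outside Solr. It assumes the combined value is built by the client before indexing (a separator is needed, and I'm assuming one that never occurs in the values); the field names and the helper functions are hypothetical.

```python
from collections import defaultdict

SEPARATOR = "|"  # assumption: a character that never occurs in either field value


def composite_key(doc, field_x, field_y):
    """Build the value of the hypothetical combined field, e.g. 'acme|red'."""
    return f"{doc[field_x]}{SEPARATOR}{doc[field_y]}"


def nested_counts(docs, field_x, field_y):
    """Count documents per combined value, then unfold into {x: {y: count}}."""
    flat = defaultdict(int)
    for doc in docs:
        flat[composite_key(doc, field_x, field_y)] += 1

    nested = defaultdict(dict)
    for key, count in flat.items():
        x, y = key.split(SEPARATOR, 1)
        nested[x][y] = count
    return dict(nested)


if __name__ == "__main__":
    docs = [
        {"brand": "acme", "color": "red"},
        {"brand": "acme", "color": "red"},
        {"brand": "acme", "color": "blue"},
        {"brand": "zeta", "color": "red"},
    ]
    print(nested_counts(docs, "brand", "color"))
```

The flat counts play the role of the facet on the combined field; splitting the key afterwards recovers the nesting of Y within X.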
Re: solr 3.1 without slf4j-jdk14-1.5.5.jar
If I'm not wrong, SolrJ uses slf4j for logging. slf4j-api.jar provides the API, but is not capable of doing the actual logging by itself. To be able to log, it needs an actual implementation, usually a binding to some other logging library. slf4j-jdk14 is the binding that uses the logging API in the JDK (since v 1.4) to do the actual logging. SolrJ needs slf4j-api and one binding. You have to choose one and can exclude the jars for the other bindings. The options are:

* slf4j-log4j12: binding to the log4j library, version 1.2. Delegates logging to log4j.
* slf4j-jdk14: binding to the JDK logging library (in JDK v 1.4 or greater). Delegates logging to the JDK.
* slf4j-nop: a dummy implementation that silently discards all log messages.
* slf4j-simple: an implementation in itself, which logs messages to System.err (only messages of level INFO or higher).
* slf4j-jcl: binding for the Jakarta Commons Logging library. Delegates logging to JCL.

A dependency on jcl-over-slf4j is also documented. This is quite the opposite of slf4j-jcl. While the latter implements the slf4j API by delegating logging to JCL, the former implements the JCL API by delegating logging to slf4j. I don't really think that SolrJ is using this (not sure). I believe that SolrJ uses slf4j. Needing jcl-over-slf4j would mean that some code in SolrJ does not use the slf4j API but the JCL API, and also needs an implementation for it. If you take a look at the Maven repositories, there is no such dependency for SolrJ, so I guess it's not really needed.

I hope I managed to explain it clearly.

Cheers,
Juan

On 26/05/2011 at 16:36, antonio wrote:

Reading the wiki, to use SolrJ I must use this lib:

From /lib
* slf4j-jdk14-1.5.5.jar

But there is no directory called lib, and no jar called slf4j-jdk14-1.5.5.jar. Is it necessary? Where can I get it?
Re: FieldCache
The fieldCache stores one entry for each field that is used for sorting, or for field faceting when you use the fieldCache (fc) method. Before Solr 1.4, the method for field faceting was the enum method, which executes a filter query for each unique value of the field and stores it in the filterCache. Since Solr 1.4, the default method is fc, except for boolean fields, which use the enum method by default. So you should have one entry in the fieldCache for each field that you use either for sorting or for field faceting with the fc facet method. Does that match what you see? I don't know of a way to configure the size of the fieldCache. I don't know how much memory each entry consumes, either. Sorry not to be of further help.

Cheers

On 26/05/2011 at 16:50, Jean-Sebastien Vachon wrote:

10 unique terms on 1.5M documents, each with 50+ fields? I don't think so ;) What I mean is controlling its size like the other caches. There are currently no options in solrconfig.xml to control this cache. Is Solr/Lucene managing this all by itself? It could be that my understanding of the FieldCache is wrong. I thought this was the main cache for Lucene. Is that right?

Thanks for your feedback

-Original Message-
From: pravesh [mailto:suyalprav...@yahoo.com]
Sent: May-26-11 2:58 AM
To: solr-user@lucene.apache.org
Subject: Re: FieldCache

This is because you may have only 10 unique terms in your indexed field. BTW, what do you mean by controlling the FieldCache?
Re: filter cache and negative filter query
: query that in fact returns the negative results. As a simple example,
: I believe that, for a boolean field, -field:true is exactly the same as
: +field:false, but the former is a negative query and the latter is a

that's not strictly true in all cases...

* if the field is multivalued=true, a doc may contain both false and true in field, in which case it would match +field:false but it would not match -field:true
* if the field is multivalued=false and required=false, a doc may not contain any value, in which case it would match -field:true but it would not match +field:false

You're totally right. But it was just an example. I just didn't think about specifying that the field is single-valued and required.

I did some testing yesterday on how filters are cached, using the admin interface. I noticed that if I perform a facet.query on a boolean field, testing it to be true or false, it always seems to add two entries to the query cache. Maybe it also adds an entry to test for the absence of the value? And if I perform a facet.field on the same boolean field, three new entries are inserted into the filter cache. Maybe one for true, one for false and one for absence? I really don't know exactly what it is doing, but at first sight it doesn't look like very optimal behaviour...

I'm testing on the LucidWorks 1.4.1 version of Solr, using the boolean field inStock of its example schema, with its example data.
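The two counterexamples above are easy to check with a toy model. The sketch below is not Solr code: it simply represents the values a document holds in the boolean field as a Python set and evaluates both filters against it (the function names are made up for the illustration).

```python
def matches_positive_false(values):
    """+field:false matches when False is among the document's values."""
    return False in values


def matches_negative_true(values):
    """-field:true matches when True is NOT among the document's values."""
    return True not in values


if __name__ == "__main__":
    cases = {
        "single value true": {True},
        "single value false": {False},
        "multivalued, both values": {True, False},
        "no value at all": set(),
    }
    for label, values in cases.items():
        print(label, matches_positive_false(values), matches_negative_true(values))
```

The two filters agree only on the first two cases, that is, when the field is single-valued and always present.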
Re: Highlighting does not work when using !boost as a nested query
Hi,

The query is generated dynamically and can be more or less complex depending on different parameters. I'm also not free to give many details of our implementation, but I'll give you the minimal query string that fails and the relevant pieces of the config.

The query string is:

    /select?q=+id:12345^0.01 +_query_:"{!boost b=$dateboost v=$qq deftype=dismax}"&dateboost=recip(ms(NOW/DAY,published_date),3.16e-11,1,1)&qq=user_text&qf=text1^2 text2&pf=text1^2 text2&tie=0.1&q.alt=*:*&hl=true&hl.fl=text1 text2&hl.mergeContiguous=true

where id is an int and text1 and text2 are of type text.

hl.fl has proven to be necessary whenever I use dismax in an inner query. Otherwise, only text2 (the default field) is highlighted, and not both fields appearing in qf. For example, q={!dismax v=$qq}... does not require hl.fl to highlight both text1 and text2, whereas q=+_query_:"{!dismax v=$qq}"... only highlights text2, unless I specify hl.fl.

The given query is probably not minimal, in the sense that some of the dismax-related parameters can be omitted and the query still fails. But the one given always fails (and adding more complexity to it does not make it work, quite obviously). Unfortunately, hl.requireFieldMatch=false does not help.

The request handler config is the following:

    <requestHandler name="standard" class="solr.SearchHandler" default="true">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
      </lst>
    </requestHandler>

The highlighter config is the following:

    <highlighting>
      <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
        <lst name="defaults">
          <int name="hl.fragsize">100</int>
        </lst>
      </fragmenter>
      <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
        <lst name="defaults">
          <int name="hl.fragsize">70</int>
          <float name="hl.regex.slop">0.5</float>
          <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
        </lst>
      </fragmenter>
      <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
        <lst name="defaults">
          <str name="hl.simple.pre"><![CDATA[<em>]]></str>
          <str name="hl.simple.post"><![CDATA[</em>]]></str>
        </lst>
      </formatter>
    </highlighting>

If there's any other information that could be useful, just ask.

Thank you very much for your help,
Juan

On 16/05/2011 at 23:18, Chris Hostetter wrote:

: As I said in my previous message, if I issue:
: q=+field1:range +field2:value +_query_:"{!dismax v=$qq}"
: highlighting works. I've just discovered the problem is not just with {!boost...}. If I just add a bf parameter to the previous query, highlighting also fails.
: Anybody knows what can be happening? I'm really stuck on this problem...

Just a hunch, but I suspect the problem has to do with the highlighter (or maybe it's the fragment generator?) trying to determine matches from query types it doesn't understand.

I thought there was a query param you could use to tell the highlighter to use an alternate query string (that would be simpler) instead of the real query ... but I'm not seeing it in the docs.

hl.requireFieldMatch=false might also help (not sure).

In general it would probably be helpful for folks if you could post the *entire* request you are making (full query string and all request params) along with the solrconfig.xml sections that show how your request handler and highlighter are configured.

-Hoss
Re: filter cache and negative filter query
lookups to work with an arbitrary query, you would either need to change the cache structure from Query=>DocSet to a mapping of Query=>[DocSet,inversionBit] and store the same cache value with two keys -- both the positive and the negative; or you keep the

Well, I don't know how it's working right now, but I guess that, as the positive version is being stored, when you look up a negative query you already have a similar lookup problem: either you store two keys for the same value, or you just transform the negative query into a positive canonical one before looking it up. The same could be done in this case, with the difference that, yes, you need to store an inversion bit too. The double lookup option sounds worse, though benchmarking should be done to know for sure.

Would this optimization influence only memory usage, or are smaller sets also faster to intersect, for example? Well, in any case, saving memory allows the freed memory to be used to speed up the application, for example with bigger caches.
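To make the suggestion a little more tangible, here is a toy sketch of a cache that looks queries up under a canonical positive key and stores whichever of the two sets (matches or complement) is smaller, together with an inversion bit. This is purely an illustration of the idea discussed above, not how Solr's filterCache is implemented; the class, the string-based query representation and the "-" prefix convention are all assumptions made for the example.

```python
class CompactFilterCache:
    """Toy cache: canonical positive query -> (doc ids, inverted flag)."""

    def __init__(self, all_docs):
        self.all_docs = frozenset(all_docs)
        self._entries = {}

    @staticmethod
    def _canonical(query):
        """Split '-field:value' into ('field:value', True); positives pass through."""
        if query.startswith("-"):
            return query[1:], True
        return query, False

    def put(self, query, matching_docs):
        key, negated = self._canonical(query)
        matching = frozenset(matching_docs)
        positive = self.all_docs - matching if negated else matching
        complement = self.all_docs - positive
        # Keep the smaller set; the flag records that it is the complement.
        if len(complement) < len(positive):
            self._entries[key] = (complement, True)
        else:
            self._entries[key] = (positive, False)

    def get(self, query):
        key, negated = self._canonical(query)
        if key not in self._entries:
            return None
        stored, inverted = self._entries[key]
        # The stored set already answers the query when both flags agree.
        return stored if inverted == negated else self.all_docs - stored


if __name__ == "__main__":
    cache = CompactFilterCache(range(10))
    cache.put("-has_field:x", {0})          # only one doc lacks the field
    print(sorted(cache.get("has_field:x")))  # complement is computed on demand
```

A single lookup under the canonical key serves both the positive and the negative form, at the price of one set subtraction when the flags differ.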
Re: Highlighting does not work when using !boost as a nested query
By the way, I was wrong when I said that using bf instead of !boost did not work either. I probably hit more than one problem at the same time when I first tested that. I've retested now and this works:

    /select?q=+id:12345^0.01 +_query_:"{!dismax v=$qq}"&bf=recip(ms(NOW/DAY,published_date),3.16e-11,1,1)&qq=user_text&qf=text1^2 text2&pf=text1^2 text2&tie=0.1&q.alt=*:*&hl=true&hl.fl=text1 text2&hl.mergeContiguous=true

But I don't get the multiplicative boost I'd like to use...

On 19/05/2011 at 11:31, Juan Antonio Farré Basurte wrote:

Hi,

The query is generated dynamically and can be more or less complex depending on different parameters. [...]
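For readers wondering what the date boost in these queries evaluates to: Solr's recip(x,m,a,b) function computes a/(m*x+b), so with m=3.16e-11, a=1 and b=1 the value is 1 for a brand-new document and roughly 0.5 for a one-year-old one. The sketch below just does that arithmetic and contrasts the two ways of combining it with the text score. The scoring functions are deliberately simplified assumptions (real Lucene scores involve normalization factors that are ignored here).

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Solr's recip(x,m,a,b) function query: a / (m*x + b)."""
    return a / (m * x + b)


def date_boost(age_ms):
    """The boost used in the thread: recip(ms(NOW/DAY,published_date),3.16e-11,1,1)."""
    return recip(age_ms, 3.16e-11, 1, 1)


def multiplicative_score(text_score, age_ms):
    """Simplified view of {!boost b=...}: text score times the function value."""
    return text_score * date_boost(age_ms)


def additive_score(text_score, age_ms):
    """Simplified view of dismax bf: the function value is added to the text score."""
    return text_score + date_boost(age_ms)


if __name__ == "__main__":
    for days in (0, 30, 365, 3650):
        age = days * MS_PER_DAY
        print(days, date_boost(age), multiplicative_score(2.0, age), additive_score(2.0, age))
```

With the additive form the effect of the date depends on how large the text score happens to be, while the multiplicative form scales every score by the same age-dependent factor, which is presumably why the multiplicative boost is preferred here.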
Re: filter cache and negative filter query
Mmm... I had wondered whether Solr reused filters this way (not keeping both the positive and the negative versions) and I'm glad to see it does indeed reuse them. What I don't like is that it systematically uses the positive version. Sometimes the negative version will give far fewer results (for example, in some cases I filter by documents not having a given field, and there are very few of them). I think it would be much better if Solr performed exactly the query requested and, if more than 50% of the documents match the query, just stored the negated one. I think (knowing almost nothing about how things are implemented) this shouldn't be a problem. Is there any place where one can post a suggestion for improvement? :)

Anyway, it would be very useful to know exactly how the current versions work (I think the info in the message I'm answering is about version 1.1 and could have changed), because, knowing that, one can sometimes manage to write, for example, a positive query that in fact returns the negative results. As a simple example, I believe that, for a boolean field, -field:true is exactly the same as +field:false, but the former is a negative query and the latter is a positive one. So knowing the exact behaviour of Solr can help you write optimized filters when you know that one version will give far fewer hits than the other.

On 18/05/2011 at 00:26, Yonik Seeley wrote:

On Tue, May 17, 2011 at 6:17 PM, Markus Jelsma markus.jel...@openindex.io wrote:

I'm not sure. The filter cache uses your filter as a key and a negation is a different key. You can check this easily in a controlled environment by issuing these queries and watching the filter cache statistics.

Gotta hate crossing emails ;-) Anyway, this goes back to Solr 1.1:

5. SOLR-80: Negative queries are now allowed everywhere. Negative queries are generated and cached as their positive counterpart, speeding generation and generally resulting in smaller sets to cache. Set intersections in SolrIndexSearcher are more efficient, starting with the smallest positive set, subtracting all negative sets, then intersecting with all other positive sets. (yonik)

-Yonik

If I have a query with a filter query such as q=art&fq=history and then run a second query q=art&fq=-history, will Solr realize that it can use the cached results of the previous filter query history (in the filter cache), or will it not realize this and have to actually run a second filter query against the index for "not history"?

Tom
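The intersection order described in the SOLR-80 note can be sketched with plain Python sets. This is only my reading of that changelog entry, not the actual SolrIndexSearcher code; the function name and the fallback to "all documents" when there is no positive filter are assumptions for the example.

```python
def apply_filters(positive_sets, negative_sets, all_docs):
    """Combine cached filter sets in the order described in the SOLR-80 note.

    positive_sets: doc-id sets of the positive filters.
    negative_sets: doc-id sets cached for the *positive counterpart* of each
                   negative filter (e.g. the set for 'history' when the
                   filter is '-history').
    """
    if positive_sets:
        ordered = sorted(positive_sets, key=len)
        result = set(ordered[0])       # start with the smallest positive set
        remaining = ordered[1:]
    else:
        result = set(all_docs)         # assumption: pure negative filters start from all docs
        remaining = []

    for negative in negative_sets:     # subtract every negative set
        result -= negative

    for positive in remaining:         # intersect with the other positive sets
        result &= positive
    return result


if __name__ == "__main__":
    art = {1, 2, 3, 4, 5}
    history = {3, 4, 9}
    print(sorted(apply_filters([art], [history], range(10))))
```

In Tom's example, the set cached for fq=history is reused as the negative set for fq=-history, so no second pass over the index is needed for the negated filter.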
Re: TrieIntField for short values
Hi,

Thanks for your answer. I am doing range queries on this field, yes, that's why I cared about how this whole trie thing works :) If I use precisionStep=0, would it be equivalent to using, say, a SortableIntField? Could you explain, for example, the difference in how it would work with precisionStep=0 versus precisionStep=Integer.MAX_VALUE? Maybe that way I could get an idea of how it works. I've read as much information as I've been able to find, but I didn't get a clear idea.

Thanks a lot,
Juan

On Sun, 15-05-2011 at 11:01 -0400, Erick Erickson wrote:

Are you doing range queries on this field? Range queries are where Trie shines, so worrying about precision step if you're NOT intending to do range queries is a waste, just use precisionStep=0. In fact, with only 1,000 values, I'd just go with precisionStep=0 (which is the int field).

Best
Erick

On Thu, May 12, 2011 at 11:15 AM, Juan Antonio Farré Basurte juan.fa...@reviewpro.com wrote:

Hello,

I'm quite a beginner in Solr and have many doubts while trying to learn how everything works. [...]
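As a rough intuition for what precisionStep controls, the sketch below lists the terms that a trie-encoded field would index for a single value: the full-precision value plus progressively coarser prefixes obtained by dropping precisionStep low-order bits at a time. It is a simplification under stated assumptions: it only handles non-negative values, ignores how Lucene actually encodes the terms, and treats a step of 0 as "full-precision term only", following Erick's remark that precisionStep=0 behaves like a plain int field.

```python
def trie_terms(value, precision_step, bits=32):
    """Return the (shift, prefix) pairs indexed for one non-negative value."""
    if value < 0 or value >= 1 << bits:
        raise ValueError("value does not fit in the given number of bits")
    if precision_step <= 0:
        # Assumption: 0 disables the extra, lower-precision terms.
        precision_step = bits
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]


if __name__ == "__main__":
    for step in (0, 2, 8, 2**31 - 1):
        terms = trie_terms(700, step)
        print(step, len(terms), terms[:4])
```

A smaller step means more terms per value (a bigger index) but lets a range query be covered with fewer, coarser terms; a very large step indexes a single term per value, so a range query has to enumerate every distinct value in the range, which is cheap when there are at most about 1,000 of them.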
TrieIntField for short values
Hello,

I'm quite a beginner in Solr and have many doubts while trying to learn how everything works. I have only a slight idea of how TrieFields work. The thing is, I have an integer value that will always be in the range 0-1000. A short field would be enough for this, but there is no TrieShortField (not even a SortableShortField). So I used a TrieIntField.

My question is what a suitable value for precisionStep would be in this case. If the field had only 1000 distinct values, but they were more or less uniformly distributed over the 32-bit int range, a big precisionStep would probably be suitable. But as my values are in the range 0 to 1000, I think (without much knowledge) that a low precisionStep should be more adequate. For example, 2.

Can anybody please help me find a good configuration for this type? And, if possible, can anybody explain in a brief and intuitive way what the differences and tradeoffs are between choosing smaller or bigger precisionSteps?

Thanks a lot,
Juan
Highlighting does not work when using !boost as a nested query
Hi,

I need to boost newer documents in my dismax queries. From what I've read in the wiki, it's best to use a multiplicative boost. The only way I know to do this with the dismax (not edismax) query parser is via a {!boost b=$dateboost v=$qq defType=dismax} query.

To make things more complicated, I also need to add some restrictions to the query (by date range, by field value...) that don't fit as filter queries, as they have a huge number of possible unique values. Hence, I added them to the main query in a form such as:

    q=+field1:range +field2:value +_query_:"{!boost b=$dateboost v=$qq defType=dismax}"

And then I add hl=true as a top-level parameter. The result is that the response includes some empty values in the highlighting list and nothing else:

    <lst name="highlighting">
      <lst/> <lst/> <lst/> <lst/> <lst/> <lst/> <lst/> <lst/> <lst/> <lst/>
    </lst>

Using just q={!boost b=$dateboost v=$qq defType=dismax} works well. Using something like q=+field1:range +field2:value +_query_:"{!dismax v=$qq}" also works. But when I try to use dismax inside boost inside a nested query, highlighting stops working.

Am I doing anything wrong? Do you know of any workaround? Should I file a bug somewhere? Is there another way of specifying a multiplicative boost (without using edismax)?

Thanks,
Juan
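Since requests like this one combine local params, parameter dereferencing ($qq, $dateboost) and several top-level parameters, it can help to build them programmatically so that each parameter is kept separate and URL-encoded properly. Below is a minimal sketch using only the Python standard library. The base URL is a hypothetical local instance, the field names are the placeholders used in this thread, and the helper names are made up for the example.

```python
from urllib.parse import parse_qs, urlencode

BASE_URL = "http://localhost:8983/solr"  # assumption: a local Solr instance


def build_select_params(user_text):
    """Request parameters for a dismax query wrapped in a multiplicative boost."""
    return {
        # Placeholder clauses from the thread; the nested query is quoted.
        "q": '+field1:range +field2:value '
             '+_query_:"{!boost b=$dateboost v=$qq defType=dismax}"',
        "dateboost": "recip(ms(NOW/DAY,published_date),3.16e-11,1,1)",
        "qq": user_text,  # dereferenced by v=$qq
        "qf": "text1^2 text2",
        "hl": "true",
        "hl.fl": "text1 text2",
    }


def build_select_url(user_text, base_url=BASE_URL):
    """Return the full /select URL with every parameter URL-encoded."""
    return f"{base_url}/select?{urlencode(build_select_params(user_text))}"


def decode_params(url):
    """Decode the query string again, mainly useful for checking the encoding."""
    return {key: values[0] for key, values in parse_qs(url.split("?", 1)[1]).items()}


if __name__ == "__main__":
    url = build_select_url("solr & lucene")
    print(url)
    print(decode_params(url)["qq"])
```

Passing the user's text through $qq rather than pasting it into q means characters such as & or quotes in the input cannot break the structure of the main query.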