RE: Reverse sort facet query [SOLR-1672]
Date: Sun, 3 Jan 2010 22:18:33 -0800
From: hossman_luc...@fucit.org
To: solr-user@lucene.apache.org
Subject: RE: Reverse sort facet query [SOLR-1672]

: Yes, I thought about adding some 'new syntax', but I opted for a separate 'facet.sortorder' parameter,
: mainly because I'm not familiar enough with the codebase to know what effect this might have on
: backward compatibility. It would be easy enough to modify the patch I created to do it this way.

it shouldn't really affect anything -- it wouldn't really be new syntax, just extending the existing sort param syntax to apply to the facet.sort param. The only back compat concern is making sure we continue to support true/false as aliases, and having the default order match the current behavior if asc/desc aren't specified.

-Hoss

Yes, agreed. The current patch doesn't touch the backward-compatible true/false aliasing, and any move to adding a new attribute can keep all that intact. I've been using the current patch extensively in our testing, and it's working well.

The only caveat is that the reverse sort results don't include 0-count facets (see notes in SOLR-1672), so reverse sort results start with the first count=1. This could be confusing, as there could well be many facets whose count is 0, and it might be expected that these be returned first.

From my admittedly cursory look into the codebase regarding this, I believe patching to include 0 counts could open a can of worms in terms of backward compatibility and performance, as 0 counts look to be skipped (by default). I could be wrong, and you may know better how changes to SimpleFacets/UnInvertedField would affect performance and compatibility. If there is indeed a performance optimization in the facet counting iteration, it would, imo, be preferable to have the optimization rather than the 0-counts.

Would you like me to go ahead and amend the patch (w/o 0-counts) to define a new 'sort' parameter? For naming, I would propose an extension of FacetParams.FACET_SORT_COUNT à la:

    public static final String FACET_SORT_COUNT_REVERSE = "count.reverse";

I can then easily modify the patch to detect/use this value to invoke the new behaviour.

Comments? Suggestions?

Thanks,
Peter
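A minimal sketch of the parsing Hoss describes, assuming facet.sort were extended to accept an optional asc/desc suffix. The class FacetSortSpec and its parse method are hypothetical names for illustration only; they are not part of Solr or of the SOLR-1672 patch. The sketch only shows that the legacy true/false aliases and the current default orders can be kept intact.

```java
import java.util.Locale;

/** Hypothetical holder for a parsed facet.sort value such as "count asc". */
class FacetSortSpec {
    final String key;        // "count" or "index"
    final boolean ascending; // sort direction

    FacetSortSpec(String key, boolean ascending) {
        this.key = key;
        this.ascending = ascending;
    }

    static FacetSortSpec parse(String raw) {
        String[] parts = raw.trim().toLowerCase(Locale.ROOT).split("\\s+");
        String key = parts[0];

        // Legacy boolean aliases: true meant "count", false meant "index".
        if (key.equals("true")) {
            key = "count";
        } else if (key.equals("false")) {
            key = "index";
        }
        if (!key.equals("count") && !key.equals("index")) {
            throw new IllegalArgumentException("Unknown facet.sort key: " + raw);
        }

        // Defaults match existing behaviour: count is highest-first,
        // index order is ascending.
        boolean ascending = key.equals("index");

        if (parts.length > 1) {
            if (parts[1].equals("asc")) {
                ascending = true;
            } else if (parts[1].equals("desc")) {
                ascending = false;
            } else {
                throw new IllegalArgumentException("Unknown sort order: " + raw);
            }
        }
        return new FacetSortSpec(key, ascending);
    }
}
```

With this shape, "count asc" would request rarest-first, while "true", "count" and "count desc" all resolve to the existing highest-count-first behaviour.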
RE: Reverse sort facet query [SOLR-1672]
> in Solr 1.4 the boolean syntax was deprecated in favor of keywords that are more meaningful...
> http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort
> ... count and index replaced true and false

Yes, I thought about adding some 'new syntax', but I opted for a separate 'facet.sortorder' parameter, mainly because I'm not familiar enough with the codebase to know what effect this might have on backward compatibility. It would be easy enough to modify the patch I created to do it this way. [see SOLR-1672]

Thanks,
Peter

Date: Thu, 24 Dec 2009 22:24:25 -0800
From: hossman_luc...@fucit.org
To: solr-user@lucene.apache.org
Subject: RE: Reverse sort facet query

: I'll have a look at SimpleFacets.java to look at patching it. I should
: think the sorting bit will be relatively straightforward. The tricky bit
: is how to submit the request via the query interface - there's only a
: boolean

in Solr 1.4 the boolean syntax was deprecated in favor of keywords that are more meaningful...

http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort

... count and index replaced true and false

we could always start supporting count desc and count asc (with count as an alias for count desc)

: The reverse facet query is for when you want to know which event (or
: group of event types) has happened the least

got it, thanks.

-Hoss
RE: Reverse sort facet query
Hello,

Thanks very much for your answer. I'll have a look at SimpleFacets.java to look at patching it. I should think the sorting bit will be relatively straightforward. The tricky bit is how to submit the request via the query interface: there's only a boolean for facet sorting, so it would probably require a new parameter to maintain backward compatibility [e.g. facet.reversesort=true]. If you have any thoughts on how you would like to see such functionality integrated into a query, let me know. When I have something working, I'll probably have to ask you the best way to submit a patch for this.

The use case is pretty straightforward, really. In my case, the index is collecting/storing network events (logs, firewall events, Win event logs etc.). The reverse facet query is for when you want to know which event (or group of event types) has happened the least over a given period of time.

As a simple example: let's say you want to look at who has been logging in to a secure server over the past week, and this server is normally accessed by only a handful of users. But you don't want to know the 'typical' users that have logged in; you want to know who has only logged in once, at say 3 o'clock in the morning on Wednesday. Hmmm, why's he/she doing that? Here, a 'rare' query will show you the atypical behaviour.

Capacity planning and performance monitoring is another example, where you might want to know which machines have produced the fewest errors or the least traffic. Outside of networking, stock control would be another example: 'what items are we about to run out of?'

Thanks,
Peter

Date: Tue, 15 Dec 2009 13:12:44 -0800
From: hossman_luc...@fucit.org
To: solr-user@lucene.apache.org
Subject: Re: Reverse sort facet query

: Does anyone know of a good way to perform a reverse-sorted facet query (i.e. rarest first)?

I'm fairly confident that code doesn't exist at the moment.

If i remember correctly, it would be fairly simple to implement if you'd like to submit a patch: when sorting by count a simple bounded priority queue is used, so we'd just have to change the comparator. If you're interested in working on a patch it should be in SimpleFacets.java. I think the queue is called BoundedTreeSet

(that's a pretty novel request actually ... i don't remember anyone else ever asking for anything like this before .. can you describe your use case a bit -- i'm curious as to how/when you would use this data)

-Hoss
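The idea Hoss outlines (a bounded queue whose ordering is decided by a comparator) can be sketched in plain Java as follows. This is not Solr's BoundedTreeSet or the code in SimpleFacets.java; the class BoundedFacetQueue and its members are hypothetical, and a java.util.TreeSet stands in for the real queue. It only illustrates that swapping the comparator is enough to turn "most common first" into "rarest first".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Standalone illustration of a size-bounded, comparator-ordered facet queue. */
class BoundedFacetQueue {

    // Existing behaviour: highest count first, ties broken by term.
    static final Comparator<Map.Entry<String, Integer>> MOST_COMMON_FIRST = (a, b) -> {
        int c = Integer.compare(b.getValue(), a.getValue());
        return c != 0 ? c : a.getKey().compareTo(b.getKey());
    };

    // Reverse sort: lowest count first, ties broken by term.
    static final Comparator<Map.Entry<String, Integer>> RAREST_FIRST = (a, b) -> {
        int c = Integer.compare(a.getValue(), b.getValue());
        return c != 0 ? c : a.getKey().compareTo(b.getKey());
    };

    /** Returns at most 'limit' terms, ordered according to 'order'. */
    static List<String> top(Map<String, Integer> counts, int limit,
                            Comparator<Map.Entry<String, Integer>> order) {
        TreeSet<Map.Entry<String, Integer>> queue = new TreeSet<>(order);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            queue.add(entry);
            if (queue.size() > limit) {
                queue.pollLast(); // drop the entry that sorts last, keeping the bound
            }
        }
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : queue) {
            terms.add(entry.getKey());
        }
        return terms;
    }
}
```

Because the queue never holds more than 'limit' entries, memory stays bounded however many distinct terms the field has, which is the property that makes the comparator swap attractive compared with fetching the full list.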
Reverse sort facet query
Hello Forum,

I've had a search in the mail archives and on the 'net, but I'm sure I wouldn't be the first to have a requirement for this: does anyone know of a good way to perform a reverse-sorted facet query (i.e. rarest first)?

As you know, facet.sort toggles between sorting on count or field name, but there's no built-in method for reverse count.

One way I've found to do this is to set facet.limit=-1 (and facet.mincount) to get the entire list, then take the 'bottom 5' to get a 'rare' list. This works, but it's not great for very large lists.

Does anyone know of a better way?

Many thanks,
Peter
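For reference, the client-side workaround described above amounts to something like the following sketch. It assumes the full facet list has already been fetched (facet.limit=-1) and is ordered most-common-first; the class and method names are hypothetical. The cost Peter mentions is visible here: the whole list has to be transferred and held before its tail can be taken.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative helper: take the 'bottom n' of an already count-sorted facet list. */
class BottomOfFacetList {

    /** Given terms sorted most-common-first, returns the last n terms, rarest first. */
    static <T> List<T> rarest(List<T> sortedByCountDesc, int n) {
        int size = sortedByCountDesc.size();
        int keep = Math.max(0, Math.min(n, size)); // clamp n to the list size
        List<T> tail = new ArrayList<>(sortedByCountDesc.subList(size - keep, size));
        Collections.reverse(tail); // flip so the rarest term comes first
        return tail;
    }
}
```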
RE: Facet query with special characters
Hi,

Thanks for your help and answers. I believe I have isolated the issue, and yes, it was 'schema/write'-related.

Basically, the issue was this: all indexing is performed via SolrJ objects (to an EmbeddedSolrServer instance), and this was ported over from 'raw' Lucene Java indexing code. When I moved over to SolrJ, I hadn't realized that the schema.xml file would then affect all writes for the given type. Once I sorted out my schema properly and reindexed, queries started behaving as expected.

Thank you very much for your excellent insight. I'm quite new to Solr, so it's really great to have an expert show me the error of my ways. I had only recently discovered the power of debugQuery=true - awesomely good!

Many thanks again,
Peter

Date: Tue, 8 Dec 2009 09:35:31 -0800
From: hossman_luc...@fucit.org
To: solr-user@lucene.apache.org
Subject: RE: Facet query with special characters

: Note that I am (supposed to be) indexing/searching without analysis
: tokenization (if that's the correct term) - i.e. field values like
: 'pds-comp.domain' shouldn't be (and I believe aren't) broken up as in
: 'pds', 'comp' 'domain' etc. (e.g. using the 'text_ws' fieldtype).
...
: What would be your opinion on the best way to index/analyze/not-analyze such fields?

a whitespace tokenizer is probably the best bet, but in order to be certain what's going on, you would need to look at a few things (and if you wanted help from other people, you would need to post those things) that i mentioned before

: check your analysis configuration for this fieldtype, in particular look
: at what debugQuery produces for your parsed query, and look at what
: analysis.jsp says it will do at query time with the input string
: pds-comp.domain ... because it sounds like you have a disconnect between
: how the text is indexed and how it is searched. adding a * to your

...so what does your schema look like, what is the output from debugQuery, what is the output from analysis.jsp, etc...

-Hoss
RE: Facet query with special characters
Hello Hoss,

Many thanks for your answer. That's very interesting. So, are you saying this is an issue on the index side, rather than the query side?

Note that I am (supposed to be) indexing/searching without analysis tokenization (if that's the correct term) - i.e. field values like 'pds-comp.domain' shouldn't be (and I believe aren't) broken up as in 'pds', 'comp', 'domain' etc. (e.g. using the 'text_ws' fieldtype).

What would be your opinion on the best way to index/analyze/not-analyze such fields?

Thanks!
Peter

Date: Mon, 7 Dec 2009 15:30:47 -0800
From: hossman_luc...@fucit.org
To: solr-user@lucene.apache.org
Subject: Re: Facet query with special characters

: When performing a facet query where part of the value portion has a
: special character (a minus sign in this case), the query returns zero
: results unless I put a wildcard (*) at the end.

check your analysis configuration for this fieldtype, in particular look at what debugQuery produces for your parsed query, and look at what analysis.jsp says it will do at query time with the input string pds-comp.domain ... because it sounds like you have a disconnect between how the text is indexed and how it is searched. adding a * to your input query forces it to make a WildcardQuery which doesn't use analysis, so you get a match on the literal token.

in short: i suspect your problem has nothing to do with query string escaping, and everything to do with field tokenization.

-Hoss
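The kind of index/query disconnect Hoss suspects can be illustrated without Solr at all. In the sketch below, two simple string splitters stand in for real analyzers (they are assumptions for illustration, not Lucene tokenizers, and which side was misconfigured in this thread is not claimed): one keeps 'pds-comp.domain' whole, the other breaks it on punctuation. A term lookup fails when the two sides disagree, while a prefix match against the literal indexed token still succeeds, which mirrors why adding * appeared to fix the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy model of an analysis mismatch between indexing and querying. */
class TokenMismatchDemo {

    /** Stand-in for a whitespace tokenizer: splits on whitespace only. */
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Stand-in for an analyzer that lowercases and splits on punctuation. */
    static List<String> punctuationTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** An analyzed term query matches only if every query token was indexed. */
    static boolean termMatch(List<String> indexedTokens, List<String> queryTokens) {
        return indexedTokens.containsAll(queryTokens);
    }

    /** A wildcard/prefix query skips analysis and compares against literal tokens. */
    static boolean prefixMatch(List<String> indexedTokens, String prefix) {
        for (String token : indexedTokens) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

In this toy model, indexing with whitespaceTokens and querying with punctuationTokens gives no term match for pds-comp.domain, whereas prefixMatch on the same input does match; using the same splitter on both sides restores the term match.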
RE: Embedded for write, HTTP for read - cache aging
Hi Erik,

Thanks for your answer. Yes, I've done an /update to the http server, which certainly works as far as the 'reading' goes. This sends the update to the back-end index though, which essentially defeats the purpose of having the embedded instance do the write (writes are always local but reads might be remote, so the goal is super-fast writes at the potential cost of slower reads).

Maybe the http server can be set as 'read-only' (redirected /update handler) so that it doesn't hit the back-end indexer, but still tells it to check the index on the next read?

The main performance bottleneck isn't Solr itself, but the HTTP wrapping/transmission. At low traffic rates, it really makes no difference at all. But when you get into thousands of writes/sec, the http wrapping and transmission becomes more and more significant as the traffic rate rises. On average, we've seen a ~3-8% efficiency increase at very high rates (using a typical Windows TCP stack). This might not seem like much, but at really high input rates, it does make a difference.

The EmbeddedSolr instance itself wraps each request into an XML request, so I believe the performance of the EmbeddedSolr instance could be increased if it handled requests without any wrapping at all (NamedList).

Thanks,
Peter

From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Embedded for write, HTTP for read - cache aging
Date: Mon, 7 Dec 2009 05:49:01 +0100

On Dec 5, 2009, at 12:56 PM, Peter 4U wrote:

> Does anyone know of a way to tell an http SolrServer to reload its back-end index (mark cache as dirty) periodically?

Send a <commit/> to the HTTP SolrServer.

> I have a scenario where an EmbeddedSolrServer is used for writing (for fast indexing), and a CommonsHttpSolrServer for reading (for remote access).

I'm curious, how much faster is it in your situation?

Erik
Embedded for write, HTTP for read - cache aging
Hello,

Does anyone know of a way to tell an http SolrServer to reload its back-end index (mark cache as dirty) periodically?

I have a scenario where an EmbeddedSolrServer is used for writing (for fast indexing), and a CommonsHttpSolrServer for reading (for remote access). If the http server is used for writing, reading clients pick up any updates, as the /update has gone 'through' the http server. For very high indexing rates, I'd rather not have to build an http request for every write (or group of writes), since the writer is always on the same machine as the index.

Any help on this is much appreciated.

Thanks,
Peter
Question: Write to Solr but not via http, and still store date_format
Hi Solr team,

Has anyone been able to write to Solr, keeping things like 'date_format', but indexing directly, rather than via http?

I've been indexing using Lucene Java, and this works well and is very fast, except that any data indexed this way doesn't store date_format et al information (date.format results always return 0). I like indexing directly into Lucene, rather than via http requests, as it is much faster, particularly at very high input rates.

Anyone encountered this and managed to solve it?

Many thanks,
peter
Answer: RE: Question: Write to Solr but not via http, and still store date_format
Oops, of course the answer was staring me in the face! -- Use the EmbeddedSolrServer, rather than the CommonsHttpSolrServer.

Live and learn. Live. and learn.

Thanks,
Peter

From: pete...@hotmail.com
To: solr-user@lucene.apache.org
Subject: Question: Write to Solr but not via http, and still store date_format
Date: Fri, 4 Dec 2009 20:09:19 +

> Hi Solr team,
>
> Has anyone been able to write to Solr, keeping things like 'date_format', but indexing directly, rather than via http?
>
> I've been indexing using Lucene Java, and this works well and is very fast, except that any data indexed this way doesn't store date_format et al information (date.format results always return 0). I like indexing directly into Lucene, rather than via http requests, as it is much faster, particularly at very high input rates.
>
> Anyone encountered this and managed to solve it?
>
> Many thanks,
> peter
Facet query with special characters
Hello,

I've encountered some strange behaviour in Solr facet querying, and I've not been able to find anything on this on the web. Perhaps someone can shed some light on this?

The problem: when performing a facet query where part of the value portion has a special character (a minus sign in this case), the query returns zero results unless I put a wildcard (*) at the end.

Here is my query.

This produces zero 'numFound':

http://localhost:8983/solr/select/?wt=xml&indent=on&rows=20&q=((signature:3083 AND host:pds-comp.domain)) AND _time:[091119124039 TO 091203124039]&facet=true&facet.field=host&facet.field=sourcetype&facet.field=user&facet.field=signature

This produces 28 'numFound':

http://localhost:8983/solr/select/?wt=xml&indent=on&rows=20&q=((signature:3083 AND host:pds-comp.domain*)) AND _time:[091119124039 TO 091203124039]&facet=true&facet.field=host&facet.field=sourcetype&facet.field=user&facet.field=signature

(Note: all hit results are for <host>pds-comp.domain</host> - there are no other characters in the resulting field values)

I've tried escaping the minus sign in various ways, encoding etc., but nothing seems to work.

Can anyone help?

Many thanks,
Peter
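A request like the ones above can be assembled programmatically so that each parameter is separated by & and the q value is URL-encoded. The sketch below uses only java.net.URLEncoder; the class SolrSelectUrl and its build method are hypothetical helpers, not SolrJ API. Note that this only addresses transport-level encoding and would not by itself change how the field value is analyzed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Illustrative builder for a Solr select URL with encoded parameters. */
class SolrSelectUrl {

    /** Joins key/value pairs into a query string; repeated keys are allowed. */
    static String build(String base, String... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("Expected key/value pairs");
        }
        StringBuilder url = new StringBuilder(base).append('?');
        for (int i = 0; i < keyValues.length; i += 2) {
            if (i > 0) {
                url.append('&');
            }
            url.append(encode(keyValues[i])).append('=').append(encode(keyValues[i + 1]));
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Passing "facet.field" several times with different values reproduces the repeated facet.field parameters used in the queries above.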
RE: Facet query with special characters
Hello Solr Forum,

I believe I have found a solution (workaround?) for performing an explicit (non-wildcarded) field query with values that contain special (escaped) characters.

Instead of:

    field:value-with-escape-chars

change this to:

    field:[value-with-escape-chars TO value-with-escape-chars]

(Note that for SolrJ, use QueryParser.escape(), to ultimately turn this into: field:[\value\-with\-escape\-chars\ TO \value\-with\-escape\-chars\])

If the value being queried has no special characters (e.g. host:localhost), the above is not necessary, which leads me to believe this is more of a workaround than the 'supported way'. Please do correct me/clarify if you know differently, or know of a better/more efficient method.

In early tests with 200,000+ hits, there appears to be no performance hit for using the range form. I'm not sure whether this affects performance for millions of hits.

Thanks,
Peter

From: pete...@hotmail.com
To: solr-user@lucene.apache.org
Subject: Facet query with special characters
Date: Thu, 3 Dec 2009 13:29:45 +

> Hello,
>
> I've encountered some strange behaviour in Solr facet querying, and I've not been able to find anything on this on the web. Perhaps someone can shed some light on this?
>
> The problem: when performing a facet query where part of the value portion has a special character (a minus sign in this case), the query returns zero results unless I put a wildcard (*) at the end.
>
> Here is my query.
>
> This produces zero 'numFound':
>
> http://localhost:8983/solr/select/?wt=xml&indent=on&rows=20&q=((signature:3083 AND host:pds-comp.domain)) AND _time:[091119124039 TO 091203124039]&facet=true&facet.field=host&facet.field=sourcetype&facet.field=user&facet.field=signature
>
> This produces 28 'numFound':
>
> http://localhost:8983/solr/select/?wt=xml&indent=on&rows=20&q=((signature:3083 AND host:pds-comp.domain*)) AND _time:[091119124039 TO 091203124039]&facet=true&facet.field=host&facet.field=sourcetype&facet.field=user&facet.field=signature
>
> (Note: all hit results are for <host>pds-comp.domain</host> - there are no other characters in the resulting field values)
>
> I've tried escaping the minus sign in various ways, encoding etc., but nothing seems to work.
>
> Can anyone help?
>
> Many thanks,
> Peter
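The range-form workaround described in this message can be sketched as a small string builder. To keep the example self-contained, a hand-written escape method stands in for Lucene's QueryParser.escape(); the character list it uses is an assumption based on the query syntax's documented special characters, and the class ExactMatchRangeQuery is a hypothetical name. In real code, the library method would be the safer choice.

```java
/** Illustrative builder for the field:[value TO value] workaround. */
class ExactMatchRangeQuery {

    // Assumed set of query-syntax special characters: \ + - ! ( ) : ^ [ ] " { } ~ * ? | &
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    /** Backslash-escapes every special character in the value. */
    static String escape(String value) {
        StringBuilder escaped = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    /** Builds a range query whose lower and upper bounds are the same value. */
    static String exactRange(String field, String value) {
        String v = escape(value);
        return field + ":[" + v + " TO " + v + "]";
    }
}
```

For example, exactRange("host", "pds-comp.domain") yields host:[pds\-comp.domain TO pds\-comp.domain]. As the earlier replies in this thread suggest, a workaround like this may be masking an analysis mismatch rather than fixing it.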