Re: Wildcard in FL parameter not working with Solr 4.10.0
This may have been introduced by changes made to solve https://issues.apache.org/jira/browse/SOLR-5968. I created https://issues.apache.org/jira/browse/SOLR-6501 to track the new bug.

On Tue, Sep 9, 2014 at 4:53 PM, Mike Hugo m...@piragua.com wrote:
Wildcard in FL parameter not working with Solr 4.10.0
Hello,

With Solr 4.7 we had some queries that return dynamic fields by passing in a fl=*_exact parameter; this is not working for us after upgrading to Solr 4.10.0. It appears to be a problem only when requesting wildcarded fields via SolrJ.

With Solr 4.10.0, I downloaded the binary and set up the example:

cd example
java -jar start.jar
java -jar post.jar solr.xml monitor.xml

In a browser, if I request

http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true&fl=*d

all is well with the world:

{"responseHeader":{"status":0,"QTime":1,"params":{"fl":"*d","indent":"true","q":"*:*","wt":"json"}},"response":{"numFound":2,"start":0,"docs":[{"id":"SOLR1000"},{"id":"3007WFP"}]}}

However, if I do the same query with SolrJ (Groovy script):

@Grab(group = 'org.apache.solr', module = 'solr-solrj', version = '4.10.0')
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer

HttpSolrServer solrServer = new HttpSolrServer("http://localhost:8983/solr/collection1")
SolrQuery q = new SolrQuery("*:*")
q.setFields("*d")
println solrServer.query(q)

no fields are returned:

{responseHeader={status=0,QTime=0,params={fl=*d,q=*:*,wt=javabin,version=2}},response={numFound=2,start=0,docs=[SolrDocument{}, SolrDocument{}]}}

Any ideas as to why wildcarded fl fields are not returned when using SolrJ?

Thanks,

Mike
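Since the JSON request in the browser returns the wildcarded fields correctly, one possible stopgap until the bug is fixed - an untested sketch, not a confirmed workaround - might be to switch the SolrJ client off the javabin format, which is where the problem appears to live:

@Grab(group = 'org.apache.solr', module = 'solr-solrj', version = '4.10.0')
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.client.solrj.impl.XMLResponseParser

HttpSolrServer solrServer = new HttpSolrServer("http://localhost:8983/solr/collection1")
solrServer.setParser(new XMLResponseParser())   // request wt=xml instead of wt=javabin
SolrQuery q = new SolrQuery("*:*")
q.setFields("*d")
println solrServer.query(q)

The XML response parser is slower than javabin, so this is only a bridge until 4.10.x ships a fix.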
Deep paging in parallel with solr cloud - OutOfMemory
Hello,

We recently upgraded to Solr Cloud 4.7 (went from a single node Solr 4.0 instance to a 3 node Solr 4.7 cluster). Part of our application does an automated traversal of all documents that match a specific query. It does this by iterating through results by setting the start and rows parameters, starting with start=0 and rows=1000, then start=1000, rows=1000, start=2000, rows=1000, etc. We do this in parallel with multiple workers on multiple nodes. It's easy to chunk up the work to be done by figuring out how many total results there are and then creating 'chunks' (0-1000, 1000-2000, 2000-3000) and sending each chunk to a worker in a pool of multi-threaded workers.

This worked well for us with a single server. However, upon upgrading to Solr Cloud, we've found that this quickly (within the first 4 or 5 requests) causes an OutOfMemory error on the coordinating node that receives the query. I don't fully understand what's going on here, but it looks like the coordinating node receives the query and sends it to the shard requested. For example, given:

shards=shard3&sort=id+asc&start=4000&q=*:*&rows=1000

the coordinating node sends this query to shard3:

NOW=1395086719189&shard.url=http://shard3_url_goes_here:8080/solr/collection1/&fl=id&sort=id+asc&start=0&q=*:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=5000

Notice the rows parameter is 5000 (start + rows). If the coordinator node is able to process the result set (which works for the first few pages; after that it will quickly run out of memory), it eventually issues this request back to shard3:

NOW=1395086719189&shard.url=http://10.128.215.226:8080/extera-search/gemindex/&start=4000&ids=a..bunch...(1000)..of..doc..ids..go..here&q=*:*&distrib=false&wt=javabin&isShard=true&version=2&rows=1000

and then finally returns the response to the client.

One possible workaround: we've found that if we issue non-distributed requests to specific shards, we get performance along the same lines that we did before. E.g. issue a query with shards=shard3&distrib=false directly to the url of the shard3 instance, rather than going through the cloud solr server solrj API.

The other workaround is to adapt to use the new cursorMark functionality. I've manually tried a few requests and it is pretty efficient, and doesn't result in the OOM errors on the coordinating node. However, I've only done this in a single-threaded manner. I'm wondering if there would be a way to get cursor marks for an entire result set at a given page interval, so that they could then be fed to the pool of parallel workers to get the results in parallel rather than single threaded. Is there a way to do this so we could process the results in parallel? Any other possible solutions?

Thanks in advance.

Mike
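For reference, a single-threaded cursorMark traversal through SolrJ might look roughly like this Groovy sketch (untested; the collection URL and query are placeholders, and cursorMark requires a sort that ends in the unique key field):

@Grab(group = 'org.apache.solr', module = 'solr-solrj', version = '4.7.0')
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.params.CursorMarkParams

HttpSolrServer solrServer = new HttpSolrServer("http://localhost:8983/solr/collection1")
SolrQuery q = new SolrQuery("*:*")
q.setRows(1000)
q.addSort("id", SolrQuery.ORDER.asc)   // cursorMark needs a total order ending in the unique key

String cursorMark = CursorMarkParams.CURSOR_MARK_START
while (true) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
    def rsp = solrServer.query(q)
    rsp.results.each { doc ->
        // process each document here
    }
    String next = rsp.nextCursorMark
    if (next == cursorMark) break   // an unchanged mark means the whole set has been seen
    cursorMark = next
}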
Re: Deep paging in parallel with solr cloud - OutOfMemory
I should add each node has 16GB of RAM, 8GB of which is allocated to the JVM. Each node has about 200k docs and happily uses only about 3 or 4GB of RAM during normal operation. It's only during this deep pagination that we have seen OOM errors.

On Mon, Mar 17, 2014 at 3:14 PM, Mike Hugo m...@piragua.com wrote:
Re: Deep paging in parallel with solr cloud - OutOfMemory
Thanks Steve,

That certainly looks like it could be the culprit. Any word on a release date for 4.7.1? Days? Weeks? Months?

Mike

On Mon, Mar 17, 2014 at 3:31 PM, Steve Rowe sar...@gmail.com wrote:

Hi Mike,

The OOM you're seeing is likely a result of the bug described in (and fixed by a commit under) SOLR-5875: https://issues.apache.org/jira/browse/SOLR-5875. If you can build from source, it would be great if you could confirm the fix addresses the issue you're facing. This fix will be part of a to-be-released Solr 4.7.1.

Steve

On Mar 17, 2014, at 4:14 PM, Mike Hugo m...@piragua.com wrote:
Re: Deep paging in parallel with solr cloud - OutOfMemory
Thanks!

On Mon, Mar 17, 2014 at 3:47 PM, Steve Rowe sar...@gmail.com wrote:

Mike,

Days. I plan on making a 4.7.1 release candidate a week from today, and assuming nobody finds any problems with the RC, it will be released roughly four days thereafter (three days for voting + one day for release propagation to the Apache mirrors): i.e., next Friday-ish.

Steve

On Mar 17, 2014, at 4:40 PM, Mike Hugo m...@piragua.com wrote:
Re: Deep paging in parallel with solr cloud - OutOfMemory
Cursor mark definitely seems like the way to go. If I can get it to work in parallel then that's an additional bonus.

On Mon, Mar 17, 2014 at 5:41 PM, Greg Pendlebury greg.pendleb...@gmail.com wrote:

Shouldn't all deep pagination against a cluster use the new cursor mark feature instead of 'start' and 'rows'? 4 or 5 requests still seems a very low limit to be running into OOM issues though, so perhaps it is both issues combined?

Ta,
Greg

On 18 March 2014 07:49, Mike Hugo m...@piragua.com wrote:
Re: Deep paging in parallel with solr cloud - OutOfMemory
Greg and I are talking about the same type of parallel. We do the same thing - if I know there are 10,000 results, we can chunk that up across multiple worker threads up front without having to page through the results. We know there are 10 chunks of 1,000, so we can have one thread process 0-1000 while another thread starts on 1000-2000 at the same time.

The only idea I've had so far is that you could have a single thread up front iterate through the entire result set, perhaps asking for 'null' from the fl param (to make the response more lightweight), and record all the next cursorMark tokens - then just fire those off to the workers as you get them. Depending on the amount of processing being done for each response it might give you some optimizations from being multi-threaded... or maybe the overhead of calculating the cursorMarks isn't worth the effort. Haven't tried it either way yet.

Mike

On Mon, Mar 17, 2014 at 6:54 PM, Greg Pendlebury greg.pendleb...@gmail.com wrote:

Sorry, I meant one thread requesting records 1 - 1000, whilst the next thread requests 1001 - 2000 from the same ordered result set. We've observed several of our customers trying to harvest our data with multi-threaded scripts that work like this. I thought it would not work using cursor marks... but: A) I could be wrong, and B) I could be talking about parallel in a different way to Mike.

Ta,
Greg

On 18 March 2014 10:24, Yonik Seeley yo...@heliosearch.com wrote:

On Mon, Mar 17, 2014 at 7:14 PM, Greg Pendlebury greg.pendleb...@gmail.com wrote: My suspicion is that it won't work in parallel

Deep paging with cursorMark does work with distributed search (assuming that's what you meant by parallel... querying sub-shards in parallel?).

-Yonik
http://heliosearch.org - solve Solr GC pauses with off-heap filters and fieldcache
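A rough Groovy sketch of that idea (untested; the URL, query, and pool size are placeholders): one cheap scouting pass records the cursorMark for each page and hands it to a worker pool, and each worker re-runs the query at its mark with the full field list. Each page is fetched twice (once cheaply, once in full), and the final submit may fetch an empty page; both are harmless for a sketch.

import java.util.concurrent.Executors
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.params.CursorMarkParams

def solrUrl = "http://localhost:8983/solr/collection1"   // placeholder
def pool = Executors.newFixedThreadPool(8)                // placeholder pool size

HttpSolrServer scout = new HttpSolrServer(solrUrl)
SolrQuery light = new SolrQuery("*:*")
light.setRows(1000)
light.setFields("id")                      // keep the scouting pass lightweight
light.addSort("id", SolrQuery.ORDER.asc)

String mark = CursorMarkParams.CURSOR_MARK_START
while (true) {
    String pageMark = mark                 // capture this page's mark for the worker
    pool.submit {
        HttpSolrServer worker = new HttpSolrServer(solrUrl)
        SolrQuery full = new SolrQuery("*:*")
        full.setRows(1000)
        full.addSort("id", SolrQuery.ORDER.asc)
        full.set(CursorMarkParams.CURSOR_MARK_PARAM, pageMark)
        def page = worker.query(full)
        // process page.results here
    }
    light.set(CursorMarkParams.CURSOR_MARK_PARAM, mark)
    String next = scout.query(light).nextCursorMark
    if (next == mark) break
    mark = next
}
pool.shutdown()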
Change replication factor
After a collection has been created in SolrCloud, is there a way to modify the replication factor? Say I start with a few nodes in the cluster and have a replication factor of 2. Over time, as the index grows and we add more nodes to the cluster, can I increase the replication factor to 3?

Thanks!

Mike
Re: Change replication factor
Thanks Mark!

Mike

On Wed, Mar 12, 2014 at 12:43 PM, Mark Miller markrmil...@gmail.com wrote:

You can simply create a new SolrCore with the same collection and shard id as the collection and shard you want to add a replica to. There is also an addReplica command coming to the collections API. Or perhaps it's in 4.7, I don't know; this JIRA issue is a little confusing as it's still open, though it looks like stuff has been committed: https://issues.apache.org/jira/browse/SOLR-5130

-- Mark Miller
about.me/markrmiller

On March 12, 2014 at 10:40:15 AM, Mike Hugo (m...@piragua.com) wrote:
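A hedged Groovy/SolrJ sketch of Mark's first suggestion (untested; the host, core, collection, and shard names are placeholders, and the setCollection/setShardId setters on CoreAdminRequest.Create are assumed to be present in the 4.x SolrJ API):

import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.client.solrj.request.CoreAdminRequest

// Point at the node that should host the new replica (placeholder URL).
HttpSolrServer newNode = new HttpSolrServer("http://new-node:8983/solr")

CoreAdminRequest.Create create = new CoreAdminRequest.Create()
create.setCoreName("collection1_shard1_replica3")   // placeholder core name
create.setCollection("collection1")                 // same collection...
create.setShardId("shard1")                         // ...and shard id to gain a replica
create.process(newNode)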
Re: Expanding sets of words
I'll buy that book :)

Does this work with multi-word terms?

(common lisp or assembly language) (programming or coding or development)

I tried:

{!surround}(common lisp OR assembly language) W (programming)

but that returns a parse error. Putting quotes around the multi-word terms parses but returns 0 results:

{!surround}("common lisp" OR "assembly language") W (programming)

On Tue, May 21, 2013 at 8:32 AM, Jack Krupansky j...@basetechnology.com wrote:

I'll make sure to include that specific example in the new Solr book.

-- Jack Krupansky
Re: Expanding sets of words
Fantastic! Thanks for following up - this is great.

Mike

On Tue, May 21, 2013 at 11:17 PM, Jack Krupansky j...@basetechnology.com wrote:

Ah... and the answer is:

curl "http://localhost:8983/solr/select/?q=(assembly+W+language+OR+scala)+W+programming&df=features&defType=surround&indent=true"

IOW, any quoted phrase like "a b c d" can be written in surround as a W b W c W d. Presto! I'll make sure that example is in the book as well.

-- Jack Krupansky

-----Original Message----- From: Jack Krupansky
Sent: Tuesday, May 21, 2013 11:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Expanding sets of words

Hmmm... I did a quick test and quoted phrase wasn't working for me either. Oh well. But... it should work for the LucidWorks Search query parser!

-- Jack Krupansky
Expanding sets of words
Is there a way to query for combinations of two sets of words? For example, if I had

(java or groovy or scala) (programming or coding or development)

Is there a query parser that, at query time, would expand that into combinations like

java programming
groovy programming
scala programming
java coding
java development

etc etc etc

Thanks!

Mike
Re: Expanding sets of words
Fantastic! Thanks!

On Mon, May 20, 2013 at 11:21 PM, Jack Krupansky j...@basetechnology.com wrote:

Yes, with the Solr surround query parser:

q=(java OR groovy OR scala) W (programming OR coding OR development)

BUT... there is the caveat that the surround query parser does no analysis. So, maybe you need "Java OR java" etc. Or, if you know that the index is lower case. Try this dataset:

curl "http://localhost:8983/solr/collection1/update?commit=true" -H 'Content-type:application/csv' -d '
id,features
doc-1,java coding
doc-2,java programming
doc-3,java development
doc-4,groovy coding
doc-5,groovy programming
doc-6,groovy development
doc-7,scala coding
doc-8,scala programming
doc-9,scala development
doc-10,c coding
doc-11,c programming
doc-12,c development
doc-13,java language
doc-14,groovy language
doc-15,scala language'

And try these commands:

curl "http://localhost:8983/solr/select/?q=(java+OR+scala)+W+programming&df=features&defType=surround&indent=true"

curl "http://localhost:8983/solr/select/?q=(java+OR+scala)+W+(programming+OR+coding)&df=features&defType=surround&indent=true"

curl "http://localhost:8983/solr/select/?q=(java+OR+groovy+OR+scala)+W+(programming+OR+coding+OR+development)&df=features&defType=surround&indent=true"

The LucidWorks Search query parser also supports NEAR, BEFORE, and AFTER operators, in conjunction with OR and - to generate span queries:

q=(java OR groovy OR scala) BEFORE:0 (programming OR coding OR development)

-- Jack Krupansky
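For completeness, the same surround query issued through SolrJ would presumably look like this (a Groovy sketch; the defType/df parameters mirror the curl examples above, and the collection URL is a placeholder):

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer

HttpSolrServer solrServer = new HttpSolrServer("http://localhost:8983/solr/collection1")
SolrQuery q = new SolrQuery("(java OR groovy OR scala) W (programming OR coding OR development)")
q.set("defType", "surround")   // route to the surround query parser
q.set("df", "features")        // default field, as in the curl examples
println solrServer.query(q)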
ConcurrentUpdateSolrServer flush on size of documents rather than queue size
Does anyone know if a version of ConcurrentUpdateSolrServer exists that would use the size in memory of the queue to decide when to send documents to the Solr server?

For example, if I set up a ConcurrentUpdateSolrServer with 4 threads and a batch size of 200, that works if my documents are small. But if I am building up documents that have a lot of text, I have run into an OutOfMemory exception in my process that builds the docs. The document sizes are variable. What I'd like to be able to do is submit documents to the Solr server when the size of the queue reaches (or is greater than) 200MB or something like that - so rather than specifying the number of documents to put in the queue, I'd specify the size in MB to build up before submitting. Does something like this exist already?

Thanks,

Mike
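As a rough Groovy sketch of the idea described above (not an existing Solr class; the size estimate and the 200MB threshold are illustrative), the batching could be done client-side:

import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.SolrInputDocument

HttpSolrServer solrServer = new HttpSolrServer("http://localhost:8983/solr/collection1")
List<SolrInputDocument> batch = []
long batchBytes = 0
final long MAX_BATCH_BYTES = 200L * 1024 * 1024   // flush at roughly 200MB

// Crude size estimate: total string length of all field values.
def estimateBytes = { SolrInputDocument doc ->
    long n = 0
    doc.getFieldNames().each { name ->
        doc.getFieldValues(name).each { v -> n += v.toString().length() }
    }
    n
}

def addDoc = { SolrInputDocument doc ->
    batch << doc
    batchBytes += estimateBytes(doc)
    if (batchBytes >= MAX_BATCH_BYTES) {
        solrServer.add(batch)     // send the accumulated batch
        batch.clear()
        batchBytes = 0
    }
}

This loses the multi-threaded pipelining of ConcurrentUpdateSolrServer, but it caps client-side memory regardless of how variable the document sizes are.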
Re: always getting distinct count of -1 in luke response (solr4 snapshot)
Explicitly running an optimize on the index via the admin screens solved this problem - the correct counts are now being returned.

On Tue, May 22, 2012 at 4:33 PM, Mike Hugo m...@piragua.com wrote:
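For reference, the same optimize can be issued from SolrJ (a one-liner sketch; the core URL is taken from the Luke request in the message below):

import org.apache.solr.client.solrj.impl.HttpSolrServer

// Equivalent of the admin-screen optimize.
new HttpSolrServer("http://localhost:8080/solr/core1").optimize()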
always getting distinct count of -1 in luke response (solr4 snapshot)
We're testing a snapshot of Solr4 and I'm looking at some of the responses from the Luke request handler. Everything looks good so far, with the exception of the distinct attribute which (in Solr3) shows me the distinct number of terms for a given field. Given the request below, I'm consistently getting a response back with a value in the distinct field of -1. Is there something different I need to do to get back the actual distinct count?

Thanks!

Mike

http://localhost:8080/solr/core1/admin/luke?wt=json&fl=label&numTerms=1

fields: {
  label: {
    type: "text_general",
    schema: "IT-M--",
    index: "(unstored field)",
    docs: 63887,
    distinct: -1,
    topTerms: [
Re: Size of suggest dictionary
Thanks Em!

What if we use a threshold value in the suggest configuration, like

<float name="threshold">0.005</float>

I assume the dictionary size will then be smaller than the total number of distinct terms - is there any way to determine what that size is?

Thanks,

Mike

On Wednesday, February 15, 2012 at 4:39 PM, Em wrote:

Hello Mike,

have a look at Solr's Schema Browser. Click on FIELDS, select label and have a look at the number of distinct (term-)values.

Regards,
Em

On 15.02.2012 23:07, Mike Hugo wrote:
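As a possible starting point (a sketch, not a definitive answer): the Luke request handler reports per-field docs and distinct term counts, which at least gives an upper bound on the dictionary size before threshold pruning. The core path and field name here are placeholders:

// Fetch per-field statistics (docs, distinct terms) from the Luke handler.
def url = "http://localhost:8983/solr/admin/luke?wt=json&fl=label&numTerms=0"
println new URL(url).text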
Size of suggest dictionary
Hello,

We're building an auto suggest component based on the label field of documents. Is there a way to see how many terms are in the dictionary, or how much memory it's taking up? I looked on the statistics page but didn't find anything obvious.

Thanks in advance,

Mike

ps - here's the config:

<searchComponent name="suggestlabel" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggestlabel</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">label</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<requestHandler name="suggestlabel" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggestlabel</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="components">
    <str>suggestlabel</str>
  </arr>
</requestHandler>
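For context, the threshold option asked about in the reply above would presumably sit alongside the other spellchecker options; a sketch (the 0.005 value is illustrative):

<lst name="spellchecker">
  <str name="name">suggestlabel</str>
  <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
  <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
  <str name="field">label</str>
  <!-- prune terms appearing in less than 0.5% of documents -->
  <float name="threshold">0.005</float>
  <str name="buildOnOptimize">true</str>
</lst>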
Re: Solr Join query with fq not correctly filtering results?
Thanks Yonik!! The join functionality is proving extremely useful for us in a specific use case - we're really looking forward to join and other cool features coming in Solr4!!

Mike

On Wed, Feb 1, 2012 at 3:30 PM, Yonik Seeley yo...@lucidimagination.com wrote:

Thanks for your persistence in tracking this down Mike! I'm going to start looking into this now...

-Yonik
lucidimagination.com

On Thu, Jan 26, 2012 at 11:06 PM, Mike Hugo m...@piragua.com wrote:
Re: Solr Join query with fq not correctly filtering results?
I've been looking into this a bit further and am trying to figure out why the FQ isn't getting applied. Can anyone point me to a good spot in the code to start looking at how FQ parameters are applied to query results in Solr4?

Thanks,

Mike

On Thu, Jan 26, 2012 at 10:06 PM, Mike Hugo m...@piragua.com wrote:
Solr Join query with fq not correctly filtering results?
Hello,

I'm trying out the Solr JOIN query functionality on trunk. I have the latest checkout, revision #1236272 - I did the following steps to get the example up and running:

cd solr
ant example
java -jar start.jar
cd exampledocs
java -jar post.jar *.xml

Then I tried a few of the sample queries on the wiki page http://wiki.apache.org/solr/Join. In particular, this is one that I'm interested in: "Find all manufacturer docs named belkin, then join them against (product) docs and filter that list to only products with a price less than 12 dollars":

http://localhost:8983/solr/select?q={!join+from=id+to=manu_id_s}compName_s:Belkin&fq=price:[*+TO+12]

However, when I run that query, I get two results: one with a price of 19.95 and another with a price of 11.5. Because of the filter query, I'm only expecting to see one result - the one with a price of 11.5.

I was also able to replicate this in a unit test added to org.apache.solr.TestJoin:

@Test
public void testJoin_withFilterQuery() throws Exception {
  assertU(add(doc("id", "1", "name", "john", "title", "Director", "dept_s", "Engineering")));
  assertU(add(doc("id", "2", "name", "mark", "title", "VP", "dept_s", "Marketing")));
  assertU(add(doc("id", "3", "name", "nancy", "title", "MTS", "dept_s", "Sales")));
  assertU(add(doc("id", "4", "name", "dave", "title", "MTS", "dept_s", "Support", "dept_s", "Engineering")));
  assertU(add(doc("id", "5", "name", "tina", "title", "VP", "dept_s", "Engineering")));
  assertU(add(doc("id", "10", "dept_id_s", "Engineering", "text", "These guys develop stuff")));
  assertU(add(doc("id", "11", "dept_id_s", "Marketing", "text", "These guys make you look good")));
  assertU(add(doc("id", "12", "dept_id_s", "Sales", "text", "These guys sell stuff")));
  assertU(add(doc("id", "13", "dept_id_s", "Support", "text", "These guys help customers")));
  assertU(commit());

  // This works as expected - the correct number of results are found.
  // Find people that develop stuff.
  assertJQ(req("q", "{!join from=dept_id_s to=dept_s}text:develop", "fl", "id"),
      "/response=={'numFound':3,'start':0,'docs':[{'id':'1'},{'id':'4'},{'id':'5'}]}");

  // This fails - the response finds all three people; it should only find John.
  // expected = /response=={"numFound":1,"start":0,"docs":[{"id":"1"}]}
  // response = {"responseHeader":{"status":0,"QTime":4},
  //   "response":{"numFound":3,"start":0,"docs":[{"id":"1"},{"id":"4"},{"id":"5"}]}}
  // Find people that develop stuff - but limit via filter query to a name of john.
  assertJQ(req("q", "{!join from=dept_id_s to=dept_s}text:develop", "fl", "id", "fq", "name:john"),
      "/response=={'numFound':1,'start':0,'docs':[{'id':'1'}]}");
}

Interestingly, I know this worked at some point. I had a snapshot build in my ivy cache from 10/2/2011 and it was working with that build: maven_artifacts/org/apache/solr/solr/4.0-SNAPSHOT/solr-4.0-20111002.161157-1.pom

Mike
Re: Solr Join query with fq not correctly filtering results?
I created issue https://issues.apache.org/jira/browse/SOLR-3062 for this problem.

I was able to track it down to something in this commit - http://svn.apache.org/viewvc?view=revision&revision=1188624 (LUCENE-1536: Filters can now be applied down-low, if their DocIdSet implements a new bits() method, returning all documents in a random access way) - before that commit the join / fq functionality works as expected / documented on the wiki page. After that commit it's broken.

Any assistance is greatly appreciated!

Thanks,

Mike

On Thu, Jan 26, 2012 at 11:04 AM, Mike Hugo m...@piragua.com wrote:
Re: HTMLStripCharFilterFactory not working in Solr4?
Thanks guys! I'll grab the latest build from the solr4 jenkins server when those commits get picked up and try it out. Thanks for the quick turnaround!

Mike

On Wed, Jan 25, 2012 at 11:01 AM, Steven A Rowe sar...@syr.edu wrote:

Hi Mike,

Yonik committed a fix to Solr trunk - your test on LUCENE-3721 succeeds for me now. (On Solr trunk, *all* CharFilters have been non-functional since LUCENE-3396 was committed in r1175297 on 25 Sept 2011, until Yonik's fix today in r1235810; Solr 3.x was not affected - CharFilters have been working there all along.)

Steve

-----Original Message----- From: Mike Hugo [mailto:m...@piragua.com]
Sent: Tuesday, January 24, 2012 3:56 PM
To: solr-user@lucene.apache.org
Subject: Re: HTMLStripCharFilterFactory not working in Solr4?

Thanks for the responses everyone.

Steve, the test method you provided also works for me. However, when I try a more end-to-end test with the HTMLStripCharFilterFactory configured for a field, I am still having the same problem. I attached a failing unit test and configuration to the following issue in JIRA: https://issues.apache.org/jira/browse/LUCENE-3721

I appreciate all the prompt responses! Looking forward to finding the root cause of this guy :) If there's something I'm doing incorrectly in the configuration, please let me know!

Mike

On Tue, Jan 24, 2012 at 1:57 PM, Steven A Rowe sar...@syr.edu wrote:

Hi Mike,

When I add the following test to TestHTMLStripCharFilterFactory.java on Solr trunk, it passes:

public void testNumericCharacterEntities() throws Exception {
  final String text = "Bose&#174; &#8482;";  // |Bose® ™|
  HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
  htmlStripFactory.init(Collections.<String,String>emptyMap());
  CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
  StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
  stdTokFactory.init(DEFAULT_VERSION_PARAM);
  Tokenizer stream = stdTokFactory.create(charStream);
  assertTokenStreamContents(stream, new String[] { "Bose" });
}

What's happening: first, htmlStripFactory converts &#174; to ® and &#8482; to ™. Then stdTokFactory declines to tokenize ® and ™, because they belong to the Unicode general category "Symbol, Other", and so are not included in any of the output tokens. StandardTokenizer uses the Word Break rules from UAX#29 (http://unicode.org/reports/tr29/) to find token boundaries, and then outputs only alphanumeric tokens. See the JFlex grammar for details: http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup

The behavior you're seeing is not consistent with the above test.

Steve
HTMLStripCharFilterFactory not working in Solr4?
We recently updated to the latest build of Solr4 and everything is working really well so far! There is one case that is not working the same way it was in Solr 3.4 - we strip out certain HTML constructs (like trademark and registered, for example) in a field as defined below. It was working in Solr 3.4 with the configuration shown here, but is not working the same way in Solr4.

The label field is defined as type="text_general":

  <field name="label" type="text_general" indexed="true" stored="false" required="false" multiValued="true"/>

Here's the type definition for the text_general field:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

In Solr 3.4, that configuration was completely stripping HTML constructs out of the indexed field, which is exactly what we wanted. If, for example, we then do a facet on the label field, like in the test below, we're getting some terms in the response that we would not like to be there.

  // test case (groovy)
  void specialHtmlConstructsGetStripped() {
      SolrInputDocument inputDocument = new SolrInputDocument()
      inputDocument.addField('label', 'Bose&#174; &#8482;')
      solrServer.add(inputDocument)
      solrServer.commit()

      QueryResponse response = solrServer.query(new SolrQuery('bose'))
      assert 1 == response.results.numFound

      SolrQuery facetQuery = new SolrQuery('bose')
      facetQuery.facet = true
      facetQuery.set(FacetParams.FACET_FIELD, 'label')
      facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
      response = solrServer.query(facetQuery)

      FacetField ff = response.facetFields.find { it.name == 'label' }
      List suggestResponse = []
      for (FacetField.Count facetField in ff?.values) {
          suggestResponse << facetField.name
      }
      assert suggestResponse == ['bose']
  }

With the upgrade to Solr4, the assertion fails; the suggested response contains 174 and 8482 as terms. Test output is:

  Assertion failed:
  assert suggestResponse == ['bose']
         |               |
         |               false
         [174, 8482, bose]

I just tried again using the latest build from today, namely https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/, and we're still getting the failing assertion. Is there a different way to configure the HTMLStripCharFilterFactory in Solr4?

Thanks in advance for any tips!

Mike
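One detail worth flagging in the schema above: the charFilter element is declared after the tokenizer. A CharFilter by definition rewrites the character stream before tokenization, and Solr's shipped examples conventionally declare charFilters first; whether the schema loader tolerates the reversed declaration order is an assumption worth ruling out rather than something this thread settles. The conventional layout, for comparison (identical classes and stopwords file; ordering is the only change):

  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>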
Re: HTMLStripCharFilterFactory not working in Solr4?
Thanks for the response Yonik,

Interestingly enough, changing to the LegacyHTMLStripCharFilterFactory does NOT solve the problem - in fact, I get the same result. I can see that the LegacyHTMLStripCharFilterFactory is being applied at startup:

  Jan 24, 2012 1:25:29 PM org.apache.solr.util.plugin.AbstractPluginLoader load
  INFO: created : org.apache.solr.analysis.LegacyHTMLStripCharFilterFactory

However, I'm still getting the same assertion error. Any thoughts?

Mike

On Tue, Jan 24, 2012 at 12:40 PM, Yonik Seeley yo...@lucidimagination.com wrote:

You can use LegacyHTMLStripCharFilterFactory to get the previous behavior. See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.

-Yonik
http://www.lucidimagination.com

On Tue, Jan 24, 2012 at 1:34 PM, Mike Hugo m...@piragua.com wrote:

We recently updated to the latest build of Solr4 and everything is working really well so far! There is one case that is not working the same way it was in Solr 3.4 - we strip out certain HTML constructs (like trademark and registered, for example) in a field as defined below. It was working in Solr 3.4 with the configuration shown here, but is not working the same way in Solr4.

The label field is defined as type="text_general":

  <field name="label" type="text_general" indexed="true" stored="false" required="false" multiValued="true"/>

Here's the type definition for the text_general field:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

In Solr 3.4, that configuration was completely stripping HTML constructs out of the indexed field, which is exactly what we wanted. If, for example, we then do a facet on the label field, like in the test below, we're getting some terms in the response that we would not like to be there.

  // test case (groovy)
  void specialHtmlConstructsGetStripped() {
      SolrInputDocument inputDocument = new SolrInputDocument()
      inputDocument.addField('label', 'Bose&#174; &#8482;')
      solrServer.add(inputDocument)
      solrServer.commit()

      QueryResponse response = solrServer.query(new SolrQuery('bose'))
      assert 1 == response.results.numFound

      SolrQuery facetQuery = new SolrQuery('bose')
      facetQuery.facet = true
      facetQuery.set(FacetParams.FACET_FIELD, 'label')
      facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
      response = solrServer.query(facetQuery)

      FacetField ff = response.facetFields.find { it.name == 'label' }
      List suggestResponse = []
      for (FacetField.Count facetField in ff?.values) {
          suggestResponse << facetField.name
      }
      assert suggestResponse == ['bose']
  }

With the upgrade to Solr4, the assertion fails; the suggested response contains 174 and 8482 as terms. Test output is:

  Assertion failed:
  assert suggestResponse == ['bose']
         |               |
         |               false
         [174, 8482, bose]

I just tried again using the latest build from today, namely https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/, and we're still getting the failing assertion. Is there a different way to configure the HTMLStripCharFilterFactory in Solr4?

Thanks in advance for any tips!

Mike
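Concretely, Yonik's workaround is a one-class swap in each analyzer's charFilter declaration; a sketch of the index analyzer from the schema above with the legacy factory in place (the query analyzer changes the same way, everything else is untouched):

  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <charFilter class="solr.LegacyHTMLStripCharFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

The AbstractPluginLoader log line Mike quotes shows this factory loading, so the swap itself took effect; the unchanged assertion failure is consistent with the trunk-wide CharFilter regression (LUCENE-3396) that Steve identifies elsewhere in this thread.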
Re: HTMLStripCharFilterFactory not working in Solr4?
Thanks for the responses everyone. Steve, the test method you provided also works for me. However, when I try a more end-to-end test with the HTMLStripCharFilterFactory configured for a field, I am still having the same problem. I attached a failing unit test and configuration to the following issue in JIRA: https://issues.apache.org/jira/browse/LUCENE-3721

I appreciate all the prompt responses! Looking forward to finding the root cause of this guy :) If there's something I'm doing incorrectly in the configuration, please let me know!

Mike

On Tue, Jan 24, 2012 at 1:57 PM, Steven A Rowe sar...@syr.edu wrote:

Hi Mike,

When I add the following test to TestHTMLStripCharFilterFactory.java on Solr trunk, it passes:

  public void testNumericCharacterEntities() throws Exception {
    final String text = "Bose&#174; &#8482;"; // |Bose® ™|
    HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
    htmlStripFactory.init(Collections.<String,String>emptyMap());
    CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
    StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
    stdTokFactory.init(DEFAULT_VERSION_PARAM);
    Tokenizer stream = stdTokFactory.create(charStream);
    assertTokenStreamContents(stream, new String[] { "Bose" });
  }

What's happening: first, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™". Then stdTokFactory declines to tokenize "®" and "™" because they belong to the Unicode general category "Symbol, Other", and so they are not included in any of the output tokens. StandardTokenizer uses the Word Break rules from UAX #29 <http://unicode.org/reports/tr29/> to find token boundaries, and then outputs only alphanumeric tokens. See the JFlex grammar for details: http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup

The behavior you're seeing is not consistent with the above test.

Steve

-----Original Message-----
From: Mike Hugo [mailto:m...@piragua.com]
Sent: Tuesday, January 24, 2012 1:34 PM
To: solr-user@lucene.apache.org
Subject: HTMLStripCharFilterFactory not working in Solr4?

We recently updated to the latest build of Solr4 and everything is working really well so far! There is one case that is not working the same way it was in Solr 3.4 - we strip out certain HTML constructs (like trademark and registered, for example) in a field as defined below. It was working in Solr 3.4 with the configuration shown here, but is not working the same way in Solr4.

The label field is defined as type="text_general":

  <field name="label" type="text_general" indexed="true" stored="false" required="false" multiValued="true"/>

Here's the type definition for the text_general field:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

In Solr 3.4, that configuration was completely stripping HTML constructs out of the indexed field, which is exactly what we wanted.
If, for example, we then do a facet on the label field, like in the test below, we're getting some terms in the response that we would not like to be there.

  // test case (groovy)
  void specialHtmlConstructsGetStripped() {
      SolrInputDocument inputDocument = new SolrInputDocument()
      inputDocument.addField('label', 'Bose&#174; &#8482;')
      solrServer.add(inputDocument)
      solrServer.commit()

      QueryResponse response = solrServer.query(new SolrQuery('bose'))
      assert 1 == response.results.numFound

      SolrQuery facetQuery = new SolrQuery('bose')
      facetQuery.facet = true
      facetQuery.set(FacetParams.FACET_FIELD, 'label')
      facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
      response = solrServer.query(facetQuery)

      FacetField ff = response.facetFields.find { it.name == 'label' }
      List suggestResponse = []
      for (FacetField.Count facetField in ff?.values) {
          suggestResponse << facetField.name
      }
      assert suggestResponse == ['bose']
  }

With the upgrade to Solr4, the assertion fails; the suggested response contains 174 and 8482 as terms.
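A plausible mechanical reading of that failure (a sketch in the style of Steve's unit test, not something verified in this thread): if the configured CharFilter never actually runs - the effect of the trunk regression (LUCENE-3396) that Steve identifies above - then StandardTokenizer sees the raw text "Bose&#174; &#8482;" and the entity digits survive as numeric tokens, which is exactly the 174 and 8482 showing up in the facet:

  public void testEntitiesWithoutCharFilter() throws Exception {
    // Hypothetical illustration: feed the raw text straight to the tokenizer,
    // as if HTMLStripCharFilterFactory were configured but never applied.
    final String text = "Bose&#174; &#8482;";
    StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
    stdTokFactory.init(DEFAULT_VERSION_PARAM);
    Tokenizer stream = stdTokFactory.create(new StringReader(text));
    // "&", "#", and ";" are punctuation and are dropped by the Word Break
    // rules; the digit runs come through as standalone numeric tokens.
    assertTokenStreamContents(stream, new String[] { "Bose", "174", "8482" });
  }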