solr admin form query (full interface) - unknown handler: select
Hi solr users and solr dev guys,

I just wanted to point out that the admin form in solr 3.6 seems to have a bug in the 'full interface' link off 'Make a Query'... I couldn't find any mention of this on markmail under solr-user so I thought I'd bring it up. I just upgraded from solr 1.4, so I don't know if this was an issue in previous 3.x versions of solr.

The full interface query form throws an error 'unknown handler: /select', where it appears that there is no trailing '/' character in the URL. The qt parameter seems to cause problems also. Form-generated URL:

http://my server url:8983/solr/1/select?indent=on&version=2.2&qt=%2Fselect&q=gps&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=

If I manually fix the URL and remove the qt parameter then it works, of course:

http://my server url:8983/solr/1/select/?indent=on&version=2.2&q=gps&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=

I just wanted to mention this for the benefit of others.

Thanks,
Robi
RE: solr admin form query (full interface) - unknown handler: select
Hi Jack,

That is interesting! I hadn't realized, but I guess mine varies slightly from the example. I show my version below. It is like this because I basically merged my 1.4 schema and solr configs with the example 3.6 configs (btw everything else is working fine):

<requestDispatcher handleSelect="true">
  <httpCaching never304="true">
    <!-- a bunch of comments -->
  </httpCaching>
</requestDispatcher>

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="fl">*,score</str>
  </lst>
</requestHandler>

Thanks
Robi

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Monday, May 07, 2012 11:12 AM
To: solr-user@lucene.apache.org
Subject: Re: solr admin form query (full interface) - unknown handler: select

I don't see that problem with a fresh, unmodified 3.6 using example. The qt parameter doesn't show up in the query URL unless I modify the Request Handler to something other than /select. Here's the query URL I get:

http://localhost:8983/solr/select?indent=on&version=2.2&q=gps&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=

Have you modified your solrconfig, for example requestDispatcher, handleSelect, the /select request handler, etc.?

-- Jack Krupansky

-----Original Message-----
From: Robert Petersen
Sent: Monday, May 07, 2012 1:44 PM
To: solr-user@lucene.apache.org
Subject: solr admin form query (full interface) - unknown handler: select

Hi solr users and solr dev guys, I just wanted to point out that the admin form in solr 3.6 seems to have a bug in the 'full interface' link off 'Make a Query'... I couldn't find any mention of this on markmail under solr-user so I thought I'd bring it up. I just upgraded from solr 1.4, so I don't know if this was an issue in previous 3.x versions of solr. The full interface query form throws an error 'unknown handler: /select', where it appears that there is no trailing '/' character in the URL.
The qt parameter seems to cause problems also. Form-generated URL:

http://my server url:8983/solr/1/select?indent=on&version=2.2&qt=%2Fselect&q=gps&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=

If I manually fix the URL and remove the qt parameter then it works, of course:

http://my server url:8983/solr/1/select/?indent=on&version=2.2&q=gps&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=

I just wanted to mention this for the benefit of others. Thanks, Robi
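[Editor's note: a hedged sketch of the config this thread circles around. In 3.x, with handleSelect="true", the dispatcher resolves qt=/select against registered handler names, so a handler literally named "/select" has to exist; a merged 1.4 config that only defines name="standard" would explain the 'unknown handler: /select' error. This is an assumption based on the stock 3.6 example solrconfig.xml, not a confirmed fix for the setup above:]

```xml
<!-- solrconfig.xml: a handler whose name matches the qt=/select the admin form sends -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="fl">*,score</str>
  </lst>
</requestHandler>
```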
RE: need some help with a multicore config of solr3.6.0+tomcat7. mine reports: Severe errors in solr configuration.
I don't know if this will help, but I usually add a dataDir element to each core's solrconfig.xml to point at a local data folder for the core, like this:

<!-- Used to specify an alternate directory to hold all index data
     other than the default ./data under the Solr home. If replication
     is in use, this should match the replication configuration. -->
<dataDir>${solr.data.dir:./solr/core0/data}</dataDir>

-----Original Message-----
From: loc...@mm.st [mailto:loc...@mm.st]
Sent: Wednesday, May 02, 2012 1:06 PM
To: solr-user@lucene.apache.org
Subject: need some help with a multicore config of solr3.6.0+tomcat7. mine reports: Severe errors in solr configuration.

i've installed tomcat7 and solr 3.6.0 on linux/64. i'm trying to get a single webapp + multicore setup working. my efforts have gone off the rails :-/ i suspect i've followed too many of the wrong examples. i'd appreciate some help/direction getting this working.

so far, i've configured:

grep /etc/tomcat7/server.xml -A2 -B2
  Java AJP Connector: /docs/config/ajp.html
  APR (HTTP/AJP) Connector: /docs/apr.html
  Define a non-SSL HTTP/1.1 Connector on port -->
  <Connector port= protocol=HTTP/1.1 connectionTimeout=2 redirectPort=8443 />
  <!-- Connector executor=tomcatThreadPool port= protocol=HTTP/1.1 connectionTimeout=2 redirectPort=8443 / -->

cat /etc/tomcat7/Catalina/localhost/solr.xml
<Context docBase="/srv/tomcat7/webapps/solr.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/srv/www/solrbase" override="true" />
</Context>

after tomcat restart:

ps ax | grep tomcat
6129 pts/4 Sl 0:06 /etc/alternatives/jre/bin/java -classpath :/usr/share/tomcat7/bin/bootstrap.jar:/usr/share/tomcat7/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar -Dcatalina.base=/usr/share/tomcat7 -Dcatalina.home=/usr/share/tomcat7 -Djava.endorsed.dirs= -Djava.io.tmpdir=/var/cache/tomcat7/temp -Djava.util.logging.config.file=/usr/share/tomcat7/conf/logging.properties
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager org.apache.catalina.startup.Bootstrap start

if i nav to http://127.0.0.1: i see as expected:

Server Information
Tomcat Version: Apache Tomcat/7.0.26
JVM Version: 1.7.0_147-icedtea-b147
JVM Vendor: Oracle Corporation
OS Name: Linux
OS Version: 3.1.10-1.9-desktop
OS Architecture: amd64

now, i'm trying to set up multicore properly. i configured:

cat /srv/www/solrbase/solr.xml
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="false">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>

then:

mkdir -p /srv/www/solrbase/{core0,core1}
cp -a /srv/www/solrbase/conf /srv/www/solrbase/core0/
cp -a /srv/www/solrbase/conf /srv/www/solrbase/core1/

if i nav to http://localhost:/solr/core0 i get:

HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change:
<abortOnConfigurationError>false</abortOnConfigurationError>
in solr.xml

org.apache.solr.common.SolrException: No cores were created, please check the logs for errors
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:172)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
    at org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
    at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
    at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
    at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:103)
    at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4638)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
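[Editor's note: concretely, the dataDir suggestion above would go into each core's own solrconfig.xml so the two cores don't fight over the same ./data default. The paths below are assumptions based on the directory layout quoted in this thread:]

```xml
<!-- /srv/www/solrbase/core0/conf/solrconfig.xml -->
<dataDir>${solr.data.dir:/srv/www/solrbase/core0/data}</dataDir>

<!-- /srv/www/solrbase/core1/conf/solrconfig.xml -->
<dataDir>${solr.data.dir:/srv/www/solrbase/core1/data}</dataDir>
```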
solr broke a pipe
Anyone have any clues about this exception? It happened during the course of normal indexing. This is new to me (we're running solr 3.6 on tomcat 6 / Red Hat RHEL) and we've been running smoothly for some time now until this showed up:

Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Apache Tomcat Version 6.0.20
java.runtime.version = 1.6.0_25-b06
java.vm.name = Java HotSpot(TM) 64-Bit Server VM

May 2, 2012 4:07:48 PM org.apache.solr.handler.ReplicationHandler$FileStream write
WARNING: Exception while writing response for params: indexversion=1276893500358&file=_1uca.frq&command=filecontent&checksum=true&wt=filestream
ClientAbortException: java.net.SocketException: Broken pipe
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:358)
    at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:354)
    at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:381)
    at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:370)
    at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:89)
    at org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:87)
    at org.apache.solr.handler.ReplicationHandler$FileStream.write(ReplicationHandler.java:1076)
    at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:936)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:345)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(Unknown Source)
    at java.net.SocketOutputStream.write(Unknown Source)
    at org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:740)
    at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:434)
    at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:349)
    at org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite(InternalOutputBuffer.java:764)
    at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:126)
    at org.apache.coyote.http11.InternalOutputBuffer.doWrite(InternalOutputBuffer.java:573)
    at org.apache.coyote.Response.doWrite(Response.java:560)
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:353)
    ... 21 more
what's best to use for monitoring solr 3.6 farm on redhat/tomcat
Hello solr users,

Is there any lightweight tool of choice for monitoring multiple solr boxes for memory consumption, heap usage, and other statistics? We have a pretty large farm of RHEL servers running solr now. Up until migrating from 1.4 to 3.6 we were running the Lucid gaze component on each box for these stats, but it doesn't function under solr 3.x, and it was cumbersome anyway since we had to hit each box separately. What do the rest of you use to keep tabs on your servers?

We're running solr 3.6 in tomcat on RHEL:
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Apache Tomcat Version 6.0.20
java.runtime.version = 1.6.0_25-b06
java.vm.name = Java HotSpot(TM) 64-Bit Server VM

Thanks,
Robert (Robi) Petersen
Senior Software Engineer
Site Search Specialist
RE: what's best to use for monitoring solr 3.6 farm on redhat/tomcat
Wow, that looks like just what the doctor ordered! Thanks Otis

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: Tuesday, April 17, 2012 1:29 PM
To: solr-user@lucene.apache.org
Subject: Re: what's best to use for monitoring solr 3.6 farm on redhat/tomcat

Hi Robert,

Have a look at SPM for Solr: http://sematext.com/spm/solr-performance-monitoring/index.html

It has all Solr metrics, works with 3.*, has a bunch of system metrics, filtering, alerting, email subscriptions, no loss of granularity, and you can use it to monitor other types of systems (e.g. HBase, ElasticSearch, Sensei...) and, starting with the next versions, pretty much any Java app (not necessarily a webapp).

Otis
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

From: Robert Petersen rober...@buy.com
To: solr-user@lucene.apache.org
Sent: Tuesday, April 17, 2012 12:02 PM
Subject: what's best to use for monitoring solr 3.6 farm on redhat/tomcat

Hello solr users,

Is there any lightweight tool of choice for monitoring multiple solr boxes for memory consumption, heap usage, and other statistics? We have a pretty large farm of RHEL servers running solr now, and up until migrating from 1.4 to 3.6 we were running the lucid gaze component on each box for these stats... this doesn't function under solr 3.x, and it was cumbersome anyway as we had to hit each box separately. What do the rest of you guys use to keep tabs on your servers?

We're running solr 3.6 in tomcat on RHEL:
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Apache Tomcat Version 6.0.20
java.runtime.version = 1.6.0_25-b06
java.vm.name = Java HotSpot(TM) 64-Bit Server VM

Thanks,
Robert (Robi) Petersen
Senior Software Engineer
Site Search Specialist
RE: [ANNOUNCE] Apache Solr 3.6 released
I think this page needs updating... it says it's not out yet. https://wiki.apache.org/solr/Solr3.6

-----Original Message-----
From: Robert Muir [mailto:rm...@apache.org]
Sent: Thursday, April 12, 2012 1:33 PM
To: d...@lucene.apache.org; solr-user@lucene.apache.org; Lucene mailing list; announce
Subject: [ANNOUNCE] Apache Solr 3.6 released

12 April 2012, Apache Solr™ 3.6.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 3.6.0. Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html (see note below). See the CHANGES.txt file included with the release for a full list of details.

Solr 3.6.0 Release Highlights:

* New SolrJ client connector using Apache Http Components http client (SOLR-2020)
* Many analyzer factories are now multi term query aware, allowing for things like field type aware lowercasing when building prefix/wildcard queries. (SOLR-2438)
* New Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation. (SOLR-3056)
* Range Faceting (Dates & Numbers) is now supported in distributed search (SOLR-1709)
* HTMLStripCharFilter has been completely re-implemented, fixing many bugs and greatly improving the performance (LUCENE-3690)
* StreamingUpdateSolrServer now supports the javabin format (SOLR-1565)
* New LFU Cache option for use in Solr's internal caches.
(SOLR-2906)
* Memory performance improvements to all FST based suggesters (SOLR-2888)
* New WFSTLookupFactory suggester supports finer-grained ranking for suggestions. (LUCENE-3714)
* New options for configuring the amount of concurrency used in distributed searches (SOLR-3221)
* Many bug fixes

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.

Happy searching,
Lucene/Solr developers
RE: how to correctly facet clothing multiple sizes and colors?
Well yes, but in my experience people generally search for something particular... then select colors and sizes thereafter.

-----Original Message-----
From: danjfoley [mailto:d...@micamedia.com]
Sent: Monday, April 09, 2012 4:18 PM
To: solr-user@lucene.apache.org
Subject: Re: how to correctly facet clothing multiple sizes and colors?

The problem with that approach is that if you selected, say, large and red, you'd get back all the products with large and red as variants, not the products with red in the large size, as would be expected.

Sent from my phone

----- Reply message -----
From: Andrew Harvey [via Lucene] ml-node+s472066n3898049...@n3.nabble.com
Date: Mon, Apr 9, 2012 5:21 pm
Subject: how to correctly facet clothing multiple sizes and colors?
To: danjfoley d...@micamedia.com

What we do in our application is exactly what Robert described. We index Products, not variants. The variant data (colour, size etc.) is denormalised into the product document at index time. We then facet on the variant attributes and get product count instead of variant count.

What you're seeing are correct results. You are indexing 6 documents, as you said before. You actually only want to index one document with multi-valued fields.

Hope that's somehow helpful,
Andrew

On 10/04/2012, at 3:01, Robert Petersen rober...@buy.com wrote:

You *could* do it by making one and only one solr document for each clothing item, then just have the front end render all the sizes and colors available for that item as size/color pickers on the product page. You can add all the colors and sizes to the one document in the index so they are searchable also, but the caveat is that they won't show up as a facet. This is just one simple approach.

-----Original Message-----
From: danjfoley [mailto:d...@micamedia.com]
Sent: Saturday, April 07, 2012 7:04 PM
To: solr-user@lucene.apache.org
Subject: how to correctly facet clothing multiple sizes and colors?

I've been searching for a solution to my issue, and this seems to come closest to it.
But not exactly. I am indexing clothing. Each article of clothing comes in many sizes and colors, and can belong to any number of categories. For example, take the following. I add 6 documents to solr as follows:

product, color, size, category
shirt A, red, small, valentines day
shirt A, red, large, valentines day
shirt A, blue, small, valentines day
shirt A, blue, large, valentines day
shirt A, green, small, valentines day
shirt A, green, large, valentines day

I'd like my facet counts to return as follows:

color: red (1), blue (1), green (1)
size: small (1), large (1)
category: valentines day (1)

But they come back like this:

color: red (2), blue (2), green (2)
size: small (2), large (2)
category: valentines day (6)

I see the group.facet parameter in version 4.0 does exactly this. However, how can I make this happen now? There are all sorts of ecommerce systems out there that facet exactly how I'm asking. I thought solr is supposed to be the very best, fastest search system, yet it doesn't seem to be able to facet correctly for items with multiple values? Am I indexing my data wrong? How can I make this happen?

--
View this message in context: http://lucene.472066.n3.nabble.com/how-to-correctly-facet-clothing-multiple-sizes-and-colors-tp3893747p3893747.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: how to correctly facet clothing multiple sizes and colors?
You *could* do it by making one and only one solr document for each clothing item, then just have the front end render all the sizes and colors available for that item as size/color pickers on the product page. You can add all the colors and sizes to the one document in the index so they are searchable also, but the caveat is that they won't show up as a facet. This is just one simple approach.

-----Original Message-----
From: danjfoley [mailto:d...@micamedia.com]
Sent: Saturday, April 07, 2012 7:04 PM
To: solr-user@lucene.apache.org
Subject: how to correctly facet clothing multiple sizes and colors?

I've been searching for a solution to my issue, and this seems to come closest to it. But not exactly. I am indexing clothing. Each article of clothing comes in many sizes and colors, and can belong to any number of categories. For example, take the following. I add 6 documents to solr as follows:

product, color, size, category
shirt A, red, small, valentines day
shirt A, red, large, valentines day
shirt A, blue, small, valentines day
shirt A, blue, large, valentines day
shirt A, green, small, valentines day
shirt A, green, large, valentines day

I'd like my facet counts to return as follows:

color: red (1), blue (1), green (1)
size: small (1), large (1)
category: valentines day (1)

But they come back like this:

color: red (2), blue (2), green (2)
size: small (2), large (2)
category: valentines day (6)

I see the group.facet parameter in version 4.0 does exactly this. However, how can I make this happen now? There are all sorts of ecommerce systems out there that facet exactly how I'm asking. I thought solr is supposed to be the very best, fastest search system, yet it doesn't seem to be able to facet correctly for items with multiple values? Am I indexing my data wrong? How can I make this happen?
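[Editor's note: for reference, the group.facet behaviour danjfoley is after (Solr 4.0) would be requested roughly like this. The host, core name, and query are hypothetical, not taken from his schema; only the parameter names come from Solr's grouping feature:]

```shell
# Group variant documents by product and make facet counts count groups,
# not individual variant docs (Solr 4.0's group.facet=true).
url="http://localhost:8983/solr/products/select"
params="q=shirt&group=true&group.field=product&group.facet=true"
params="$params&facet=true&facet.field=color&facet.field=size"
echo "$url?$params"
# curl "$url?$params"   # run against a live Solr 4.x instance
```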
RE: upgrade solr from 1.4 to 3.5 not working
Note that I am trying to upgrade from the Lucid Imagination distribution of Solr 1.4; dunno if that makes a difference. We have an existing index of 11 million documents which I am trying to preserve in the upgrade process.

-----Original Message-----
From: Robert Petersen [mailto:rober...@buy.com]
Sent: Thursday, April 05, 2012 2:21 PM
To: solr-user@lucene.apache.org
Subject: upgrade solr from 1.4 to 3.5 not working

Hi folks, I'm a little stumped here. I have an existing solr 1.4 setup which is well configured. I want to upgrade to the latest solr release, and after reading release notes, the wiki, etc., I concluded the correct path would be to not change any config items and just replace the solr.war file in tomcat's webapps folder with the new one, then start tomcat back up. This worked fine; solr came up. The problem is that on the solr info page it still says that I am running solr 1.4, even after several restarts and even a server reboot. Am I missing something? Info says this even though there is no solr 1.4 war file anywhere under the tomcat root:

Solr Specification Version: 1.4.0.2009.12.10.10.34.34
Solr Implementation Version: 1.4 exported - sam - 2009-12-10 10:34:34
Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 exported - 2009-12-10 10:32:14
Current Time: Thu Apr 05 12:56:12 PDT 2012
Server Start Time: Thu Apr 05 12:52:25 PDT 2012

Any help would be appreciated.
Thanks
Robi
RE: upgrade solr from 1.4 to 3.5 not working
OK, I found in the tomcat documentation that I not only have to drop the war file into webapps but also have to delete the expanded version of the war that tomcat makes. Now tomcat doesn't find the velocity response writer, which I seem to recall seeing some note about. I'll try to find that again. Thanks for the help? Oh well...

-----Original Message-----
From: Robert Petersen [mailto:rober...@buy.com]
Sent: Friday, April 06, 2012 8:27 AM
To: solr-user@lucene.apache.org
Subject: RE: upgrade solr from 1.4 to 3.5 not working

Note that I am trying to upgrade from the Lucid Imagination distribution of Solr 1.4; dunno if that makes a difference. We have an existing index of 11 million documents which I am trying to preserve in the upgrade process.

-----Original Message-----
From: Robert Petersen [mailto:rober...@buy.com]
Sent: Thursday, April 05, 2012 2:21 PM
To: solr-user@lucene.apache.org
Subject: upgrade solr from 1.4 to 3.5 not working

Hi folks, I'm a little stumped here. I have an existing solr 1.4 setup which is well configured. I want to upgrade to the latest solr release, and after reading release notes, the wiki, etc., I concluded the correct path would be to not change any config items and just replace the solr.war file in tomcat's webapps folder with the new one, then start tomcat back up. This worked fine; solr came up. The problem is that on the solr info page it still says that I am running solr 1.4, even after several restarts and even a server reboot. Am I missing something? Info says this even though there is no solr 1.4 war file anywhere under the tomcat root:

Solr Specification Version: 1.4.0.2009.12.10.10.34.34
Solr Implementation Version: 1.4 exported - sam - 2009-12-10 10:34:34
Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 exported - 2009-12-10 10:32:14
Current Time: Thu Apr 05 12:56:12 PDT 2012
Server Start Time: Thu Apr 05 12:52:25 PDT 2012

Any help would be appreciated.
Thanks
Robi
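[Editor's note: the redeploy sequence Robi describes can be sketched like this. It is simulated against a scratch directory so the steps are checkable as written; on a real box TOMCAT would be your Tomcat home, and you would stop Tomcat before the deletes and start it afterwards so it re-expands the new war:]

```shell
# Simulated redeploy: replace the war AND remove Tomcat's expanded copy.
TOMCAT=$(mktemp -d)                                # stands in for your Tomcat home
mkdir -p "$TOMCAT/webapps/solr" "$TOMCAT/work/Catalina/localhost/solr"
echo old > "$TOMCAT/webapps/solr.war"              # stands in for the old 1.4 war

rm -rf "$TOMCAT/webapps/solr"                      # the expanded webapp Tomcat made
rm -rf "$TOMCAT/work/Catalina/localhost/solr"      # Tomcat's compiled work dir
echo new > "$TOMCAT/webapps/solr.war"              # stands in for copying the 3.5 war
cat "$TOMCAT/webapps/solr.war"                     # prints: new
```

Without the two deletes, Tomcat keeps serving the previously expanded webapp, which would explain the info page still reporting 1.4.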
upgrade solr from 1.4 to 3.5 not working
Hi folks, I'm a little stumped here. I have an existing solr 1.4 setup which is well configured. I want to upgrade to the latest solr release, and after reading release notes, the wiki, etc., I concluded the correct path would be to not change any config items and just replace the solr.war file in tomcat's webapps folder with the new one, then start tomcat back up. This worked fine; solr came up. The problem is that on the solr info page it still says that I am running solr 1.4, even after several restarts and even a server reboot. Am I missing something? Info says this even though there is no solr 1.4 war file anywhere under the tomcat root:

Solr Specification Version: 1.4.0.2009.12.10.10.34.34
Solr Implementation Version: 1.4 exported - sam - 2009-12-10 10:34:34
Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 exported - 2009-12-10 10:32:14
Current Time: Thu Apr 05 12:56:12 PDT 2012
Server Start Time: Thu Apr 05 12:52:25 PDT 2012

Any help would be appreciated.
Thanks
Robi
RE: Core overhead
I am running eight cores; each core serves up different types of searches, so there is no overlap in their function. Some cores have millions of documents. My search times are quite fast. I don't see any real slowdown from multiple cores, but you just have to have enough memory for them; memory simply has to be big enough to hold what you are loading. Try it out, but make sure that the functionality you are actually looking for isn't sharding instead of multiple cores... http://wiki.apache.org/solr/DistributedSearch

-----Original Message-----
From: Yury Kats [mailto:yuryk...@yahoo.com]
Sent: Thursday, December 15, 2011 10:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 1:07 PM, Robert Stewart wrote:
> I think overall memory usage would be close to the same.

Is this really so? I suspect that the consumed memory is in direct proportion to the number of terms in the index. I also suspect that if I divided 1 core with N terms into 10 smaller cores, each smaller core would have much more than N/10 terms. Let's say I'm indexing English texts; it's likely that all smaller cores would have almost the same number of terms, close to the original N. Not so?
RE: Core overhead
Sure, that is possible, but doesn't that defeat the purpose of sharding? Why distribute across one machine? Just keep all in one index in that case, is my thought there...

-----Original Message-----
From: Yury Kats [mailto:yuryk...@yahoo.com]
Sent: Thursday, December 15, 2011 11:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 1:41 PM, Robert Petersen wrote:
> loading. Try it out, but make sure that the functionality you are
> actually looking for isn't sharding instead of multiple cores...

Yes, but the way to achieve sharding is to have multiple cores. The question then becomes: how many cores (shards)?
RE: Core overhead
I see there is a lot of discussion about micro-sharding; I'll have to read it. I'm on an older version of solr and just use a master index replicating out to a farm of slaves. It always seemed to me, when I read about it, like sharding causes a lot of background traffic, but I never tried it out. Thanks for the heads up on that topic... :)

-----Original Message-----
From: Yury Kats [mailto:yuryk...@yahoo.com]
Sent: Thursday, December 15, 2011 2:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 4:46 PM, Robert Petersen wrote:
> Sure that is possible, but doesn't that defeat the purpose of sharding?
> Why distribute across one machine? Just keep all in one index in that
> case is my thought there...

To be able to scale w/o re-indexing. Also often referred to as micro-sharding.
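[Editor's note: for readers following the core-vs-shard distinction in this thread, a distributed query in 3.x just names its shards in the request. The host and core names below are hypothetical; only the shards parameter syntax comes from Solr's DistributedSearch wiki page:]

```shell
# One logical query fanned out across two shard cores on different hosts.
shards="solr1:8983/solr/core0,solr2:8983/solr/core0"
echo "http://solr1:8983/solr/core0/select?q=gps&shards=$shards"
# curl the printed URL against live instances; any shard can coordinate.
```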
RE: Questions about Solr's security
Me too!

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Tuesday, November 01, 2011 1:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions about Solr's security

I once had to deal with a severe performance problem caused by a bot that was requesting results starting at 5000. We disallowed requests over a certain number of pages in the front end to fix it.

wunder

On Nov 1, 2011, at 12:57 PM, Erik Hatcher wrote:

Be aware that even /select could have some harmful effects; see https://issues.apache.org/jira/browse/SOLR-2854 (addressed on trunk). Even disregarding that issue, /select is a potential gateway to any request handler defined, via /select?qt=/req_handler. Again, in general it's not a good idea to expose Solr to anything but a controlled app server.

Erik

On Nov 1, 2011, at 15:51, Alireza Salimi wrote:

What if we just expose '/select' paths - by firewalls and load balancers - and also use SSL and HTTP basic or digest access control?

On Tue, Nov 1, 2011 at 2:20 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: I was wondering if it's a good idea to expose Solr to the outside world,
: so that our clients running on smart phones will be able to use Solr.

As a general rule of thumb, i would say that it is not a good idea to expose solr directly to the public internet. There are exceptions to this rule -- AOL hosted some live solr instances of the Sarah Palin emails for HuffPo -- but it is definitely an expert-level type thing for people who are so familiar with solr they know exactly what to lock down to make it safe for typical users: put an application between your untrusted users and solr and only let that application generate safe, well-formed requests to Solr... https://wiki.apache.org/solr/SolrSecurity

-Hoss

--
Alireza Salimi
Java EE Developer

--
Walter Underwood
Venture
Asst. Scoutmaster Troop 14, Palo Alto, CA
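[Editor's note: Walter's front-end guard amounts to something like the check below. The threshold and variable names are made up for illustration; the point is simply to reject deep-paging requests before they ever reach Solr:]

```shell
# Reject deep-paging requests in the front end before proxying to Solr.
MAX_START=1000
start=5000                       # e.g. parsed from the incoming request's query string
if [ "$start" -gt "$MAX_START" ]; then
  echo "rejected: start=$start exceeds $MAX_START"
else
  echo "forwarding to solr with start=$start"
fi
# prints: rejected: start=5000 exceeds 1000
```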
difference between analysis output and searches
Why is it that I can see in the analysis admin page an obvious match between terms, yet sometimes they don't come back in searches? Debug output on the searches indicates a non-match, yet the analysis page shows an obvious match. I don't get it.
i don't get why this says non-match
It looks to me like everything matches down the line but the top level says otherQuery is a non-match... I don't get it?

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">77</int>
    <lst name="params">
      <str name="explainOther">SyncMaster</str>
      <str name="fl">*,score</str>
      <str name="debugQuery">on</str>
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">+syncmaster -SyncMaster</str>
      <str name="hl.fl"/>
      <str name="qt">standard</str>
      <str name="wt">standard</str>
      <str name="fq"/>
      <str name="rows">41</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="26" start="0" maxScore="1.6049292">...</result>
  <lst name="debug">
    <str name="rawquerystring">+syncmaster -SyncMaster</str>
    <str name="querystring">+syncmaster -SyncMaster</str>
    <str name="parsedquery">+moreWords:syncmaster -MultiPhraseQuery(moreWords:"sync (master syncmaster)")</str>
    <str name="parsedquery_toString">+moreWords:syncmaster -moreWords:"sync (master syncmaster)"</str>
    <str name="otherQuery">SyncMaster</str>
    <lst name="explainOther">
      <str name="209730998">
0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  1.4043131 = (MATCH) fieldWeight(moreWords:syncmaster in 46710), product of:
    1.4142135 = tf(termFreq(moreWords:syncmaster)=2)
    9.078851 = idf(docFreq=41, maxDocs=135472)
    0.109375 = fieldNorm(field=moreWords, doc=46710)
  0.0 = match on prohibited clause (moreWords:"sync (master syncmaster)")
    9.393997 = (MATCH) weight(moreWords:"sync (master syncmaster)" in 46710), product of:
      2.5863855 = queryWeight(moreWords:"sync (master syncmaster)"), product of:
        23.481407 = idf(moreWords:"sync (master syncmaster)")
        0.1101461 = queryNorm
      3.6320949 = (MATCH) fieldWeight(moreWords:"sync (master syncmaster)" in 46710), product of:
        1.4142135 = tf(phraseFreq=2.0)
        23.481407 = idf(moreWords:"sync (master syncmaster)")
        0.109375 = fieldNorm(field=moreWords, doc=46710)
      </str>
    </lst>
  </lst>
</response>
RE: Trouble configuring multicore / accessing admin page
Just go to localhost:8983 (or whatever other port you are using) and request the Solr root path to see all the cores available on the box. In your example this should give you a core list: http://solrhost:8080/solr/

-----Original Message-----
From: Joshua Miller [mailto:jos...@itsecureadmin.com]
Sent: Wednesday, September 28, 2011 1:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Trouble configuring multicore / accessing admin page

On Sep 28, 2011, at 1:03 PM, Shawn Heisey wrote:
On 9/28/2011 1:40 PM, Joshua Miller wrote:
I am trying to get Solr working with multiple cores and have a problem accessing the admin page once I configure multiple cores.

Problem: When accessing the admin page via http://solrhost:8080/solr/admin, I get a 404, "missing core name in path."

Question: when using the multicore option, is the standard admin page still available?

When you enable multiple cores, the URL syntax becomes a little different. On 1.4.1 and 3.2.0, I ran into a problem where the trailing / is required on this URL, but that problem seems to be fixed in 3.4.0: http://host:port/solr/corename/admin/

If you put defaultCoreName="somecore" into the <cores> tag in solr.xml, the original /solr/admin URL should work as well. I just tried it on Solr 3.4.0 and it does work. According to the wiki, it should work in 1.4 as well. I don't have a 1.4.1 server any more, so I can't verify that. http://wiki.apache.org/solr/CoreAdmin#cores

Hi Shawn,

Thanks for the quick response. I can't get any of those combinations to work. I've added defaultCoreName="core0" into the solr.xml, restarted, and tried the following combinations:

http://host:port/solr/admin
http://host:port/solr/admin/
http://host:port/solr/core0/admin/
... (and many others)

I'm stuck on 1.4.1, at least temporarily, as I'm taking over an application from another resource and need to get it up and running before modifying anything, so any help here would be greatly appreciated.
Thanks, Josh Miller Open Source Solutions Architect (425) 737-2590 http://itsecureadmin.com/
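The defaultCoreName setup being discussed can be sketched in solr.xml roughly as follows; this is a minimal sketch, the core names are placeholders, and (as noted in the thread) the attribute works on 3.4.0 but could not be verified on 1.4.1:

```xml
<!-- solr.xml sketch: defaultCoreName lets URLs without a core name in
     the path (e.g. /solr/admin) resolve to core0. Core names and
     instanceDir values here are placeholders. -->
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="core0">
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>
```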
synonyms vs replacements
Hello all, which is better? Say you add an index-time synonym between nunchuck and nunchuk: then both words will be in the document and both will be searchable. I can get the same exact behavior by putting an index-time replacement of nunchuck => nunchuk and a search-time replacement of the same. I figure the replacement strategy keeps the index size slightly smaller by having only the one term in the index, but the synonym strategy only requires you to update the master, not the slave farm, and requires slightly less work for the searchers during a user query. Are there any other considerations I should be aware of? Thanks. BTW, nunchuk is the correct spelling. :)
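The two strategies being compared can be sketched with SynonymFilterFactory; this is a sketch only, assuming the stock filter and a synonyms.txt alongside the schema:

```xml
<!-- schema.xml analyzer sketch. In synonyms.txt:
       nunchuck, nunchuk        (expansion: both terms are indexed)
       nunchuck => nunchuk      (replacement: only nunchuk is indexed)
     The expansion form is used with expand="true" at index time only;
     the replacement form is typically applied at both index and query
     time so user queries get rewritten the same way. -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
```

With the replacement form, changing a mapping requires re-running it at query time too, which is why the synonym (expansion) strategy only needs a reindex on the master.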
RE: please help explaining debug output
That didn't help. Seems like another case where I should get matches but don't, and this time it is only for some documents; others with similar content match just fine. The debug output 'explain other' section for a non-matching document seems to say the term frequency is 0 for my problematic term, although I know it is in the content. I ended up making a synonym to do what the analysis stack *should* be doing: splitting LaserJet on case changes. I.e., putting "LaserJet, laser jet" in synonyms at index time makes this work. I don't know why, though.

Question: Does this debug output mean it is matching the terms but the term frequency vector is returning 0 for the frequency of this term? I.e., does this mean the term is in the doc but not in the tf array?

0.0 = no match on required clause (moreWords:"laser jet")
  0.0 = weight(moreWords:"laser jet" in 32497), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    0.0 = fieldWeight(moreWords:"laser jet" in 32497), product of:
      0.0 = tf(phraseFreq=0.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.078125 = fieldNorm(field=moreWords, doc=32497)

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, July 25, 2011 3:28 PM
To: solr-user@lucene.apache.org
Subject: Re: please help explaining debug output

Hmmm, I can't find a convenient 1.4.0 to download, but re-indexing is a good idea since this seems like it *should* work.

Erick

On Mon, Jul 25, 2011 at 5:32 PM, Robert Petersen rober...@buy.com wrote:
I'm still on Solr 1.4.0 and the analysis page looks like they should match, and other products with the same content do in fact match. I'm reindexing the non-matching ones to rule that out.
-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, July 25, 2011 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: please help explaining debug output

Hmmm, I'm assuming that moreWords is your default text field, yes? But it works for me (tm), using 1.4.1. What version of Solr are you on? Also, take a glance at the admin/analysis page, that might help... Gotta run

Erick

On Mon, Jul 25, 2011 at 4:52 PM, Robert Petersen rober...@buy.com wrote:
Sorry, to clarify: a search for P1102W matches all three docs, but a search for "p1102w LaserJet" only matches the second two. Someone asked me a question while I was typing and I got distracted; apologies for any confusion.

-----Original Message-----
From: Robert Petersen [mailto:rober...@buy.com]
Sent: Monday, July 25, 2011 1:42 PM
To: solr-user@lucene.apache.org
Subject: please help explaining debug output

I have three documents with the following product titles in a text field called moreWords, with an analysis stack matching the Solr example text field definition.

1. HP LaserJet P1102W Monochrome Laser Printer
http://www.buy.com/prod/hp-laserjet-p1102w-monochrome-laser-printer/q/loc/101/213824965.html

2. HP CE285A (85A) Remanufactured Black Toner Cartridge for LaserJet M1212nf, P1102, P1102W Series
http://www.buy.com/prod/hp-ce285a-85a-remanufactured-black-toner-cartridge-for-laserjet/q/loc/101/217145536.html

3. Black HP CE285A Toner Cartridge For LaserJet P1102W, LaserJet M1130, LaserJet M1132, LaserJet M1210
http://www.buy.com/prod/black-hp-ce285a-toner-cartridge-for-laserjet-p1102w-laserjet-m1130/q/loc/101/222045267.html

A search for P1102W matches (2) and (3), but not (1) above. Can someone explain the debug output? It looks like I am getting a non-match on (1) because the term frequency is zero. Am I reading that right? If so, how could that be? The searched terms appear equivalently in all three docs. I don't get it.
<lst name="debug">
  <str name="rawquerystring">p1102w LaserJet</str>
  <str name="querystring">p1102w LaserJet</str>
  <str name="parsedquery">+PhraseQuery(moreWords:"p 1102 w") +PhraseQuery(moreWords:"laser jet")</str>
  <str name="parsedquery_toString">+moreWords:"p 1102 w" +moreWords:"laser jet"</str>
  <lst name="explain">
    <str name="222045267">
3.64852 = (MATCH) sum of:
  2.4758534 = weight(moreWords:"p 1102 w" in 6667236), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    3.1121879 = fieldWeight(moreWords:"p 1102 w" in 6667236), product of:
      1.7320508 = tf(phraseFreq=3.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.09375 = fieldNorm(field=moreWords, doc=6667236)
  1.1726664 = weight(moreWords:"laser jet" in 6667236), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    1.9353869 = fieldWeight(moreWords:"laser jet" in 6667236), product
please help explaining debug output
I have three documents with the following product titles in a text field called moreWords, with an analysis stack matching the Solr example text field definition.

1. HP LaserJet P1102W Monochrome Laser Printer
http://www.buy.com/prod/hp-laserjet-p1102w-monochrome-laser-printer/q/loc/101/213824965.html

2. HP CE285A (85A) Remanufactured Black Toner Cartridge for LaserJet M1212nf, P1102, P1102W Series
http://www.buy.com/prod/hp-ce285a-85a-remanufactured-black-toner-cartridge-for-laserjet/q/loc/101/217145536.html

3. Black HP CE285A Toner Cartridge For LaserJet P1102W, LaserJet M1130, LaserJet M1132, LaserJet M1210
http://www.buy.com/prod/black-hp-ce285a-toner-cartridge-for-laserjet-p1102w-laserjet-m1130/q/loc/101/222045267.html

A search for P1102W matches (2) and (3), but not (1) above. Can someone explain the debug output? It looks like I am getting a non-match on (1) because the term frequency is zero. Am I reading that right? If so, how could that be? The searched terms appear equivalently in all three docs. I don't get it.
<lst name="debug">
  <str name="rawquerystring">p1102w LaserJet</str>
  <str name="querystring">p1102w LaserJet</str>
  <str name="parsedquery">+PhraseQuery(moreWords:"p 1102 w") +PhraseQuery(moreWords:"laser jet")</str>
  <str name="parsedquery_toString">+moreWords:"p 1102 w" +moreWords:"laser jet"</str>
  <lst name="explain">
    <str name="222045267">
3.64852 = (MATCH) sum of:
  2.4758534 = weight(moreWords:"p 1102 w" in 6667236), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    3.1121879 = fieldWeight(moreWords:"p 1102 w" in 6667236), product of:
      1.7320508 = tf(phraseFreq=3.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.09375 = fieldNorm(field=moreWords, doc=6667236)
  1.1726664 = weight(moreWords:"laser jet" in 6667236), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    1.9353869 = fieldWeight(moreWords:"laser jet" in 6667236), product of:
      1.4142135 = tf(phraseFreq=2.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.09375 = fieldNorm(field=moreWords, doc=6667236)
    </str>
    <str name="222045265">
2.8656518 = (MATCH) sum of:
  1.4294347 = weight(moreWords:"p 1102 w" in 6684158), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    1.7968225 = fieldWeight(moreWords:"p 1102 w" in 6684158), product of:
      1.0 = tf(phraseFreq=1.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.09375 = fieldNorm(field=moreWords, doc=6684158)
  1.4362172 = weight(moreWords:"laser jet" in 6684158), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    2.3703551 = fieldWeight(moreWords:"laser jet" in 6684158), product of:
      1.7320508 = tf(phraseFreq=3.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.09375 = fieldNorm(field=moreWords, doc=6684158)
    </str>
  </lst>
  <str name="otherQuery">sku:213824965</str>
  <lst name="explainOther">
    <str name="213824965">
0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  1.1911955 = weight(moreWords:"p 1102 w" in 32497), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    1.4973521 = fieldWeight(moreWords:"p 1102 w" in 32497), product of:
      1.0 = tf(phraseFreq=1.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.078125 = fieldNorm(field=moreWords, doc=32497)
  0.0 = no match on required clause (moreWords:"laser jet")
    0.0 = weight(moreWords:"laser jet" in 32497), product of:
      0.60590804 = queryWeight(moreWords:"laser jet"), product of:
        14.597603 = idf(moreWords: laser=26731 jet=12685)
        0.041507367 = queryNorm
      0.0 = fieldWeight(moreWords:"laser jet" in 32497), product of:
        0.0 = tf(phraseFreq=0.0)
        14.597603 = idf(moreWords: laser=26731 jet=12685)
        0.078125 = fieldNorm(field=moreWords, doc=32497)
    </str>
  </lst>
</lst>
RE: please help explaining debug output
Sorry, to clarify: a search for P1102W matches all three docs, but a search for "p1102w LaserJet" only matches the second two. Someone asked me a question while I was typing and I got distracted; apologies for any confusion.

-----Original Message-----
From: Robert Petersen [mailto:rober...@buy.com]
Sent: Monday, July 25, 2011 1:42 PM
To: solr-user@lucene.apache.org
Subject: please help explaining debug output

I have three documents with the following product titles in a text field called moreWords, with an analysis stack matching the Solr example text field definition.

1. HP LaserJet P1102W Monochrome Laser Printer
http://www.buy.com/prod/hp-laserjet-p1102w-monochrome-laser-printer/q/loc/101/213824965.html

2. HP CE285A (85A) Remanufactured Black Toner Cartridge for LaserJet M1212nf, P1102, P1102W Series
http://www.buy.com/prod/hp-ce285a-85a-remanufactured-black-toner-cartridge-for-laserjet/q/loc/101/217145536.html

3. Black HP CE285A Toner Cartridge For LaserJet P1102W, LaserJet M1130, LaserJet M1132, LaserJet M1210
http://www.buy.com/prod/black-hp-ce285a-toner-cartridge-for-laserjet-p1102w-laserjet-m1130/q/loc/101/222045267.html

A search for P1102W matches (2) and (3), but not (1) above. Can someone explain the debug output? It looks like I am getting a non-match on (1) because the term frequency is zero. Am I reading that right? If so, how could that be? The searched terms appear equivalently in all three docs. I don't get it.
<lst name="debug">
  <str name="rawquerystring">p1102w LaserJet</str>
  <str name="querystring">p1102w LaserJet</str>
  <str name="parsedquery">+PhraseQuery(moreWords:"p 1102 w") +PhraseQuery(moreWords:"laser jet")</str>
  <str name="parsedquery_toString">+moreWords:"p 1102 w" +moreWords:"laser jet"</str>
  <lst name="explain">
    <str name="222045267">
3.64852 = (MATCH) sum of:
  2.4758534 = weight(moreWords:"p 1102 w" in 6667236), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    3.1121879 = fieldWeight(moreWords:"p 1102 w" in 6667236), product of:
      1.7320508 = tf(phraseFreq=3.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.09375 = fieldNorm(field=moreWords, doc=6667236)
  1.1726664 = weight(moreWords:"laser jet" in 6667236), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    1.9353869 = fieldWeight(moreWords:"laser jet" in 6667236), product of:
      1.4142135 = tf(phraseFreq=2.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.09375 = fieldNorm(field=moreWords, doc=6667236)
    </str>
    <str name="222045265">
2.8656518 = (MATCH) sum of:
  1.4294347 = weight(moreWords:"p 1102 w" in 6684158), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    1.7968225 = fieldWeight(moreWords:"p 1102 w" in 6684158), product of:
      1.0 = tf(phraseFreq=1.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.09375 = fieldNorm(field=moreWords, doc=6684158)
  1.4362172 = weight(moreWords:"laser jet" in 6684158), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    2.3703551 = fieldWeight(moreWords:"laser jet" in 6684158), product of:
      1.7320508 = tf(phraseFreq=3.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.09375 = fieldNorm(field=moreWords, doc=6684158)
    </str>
  </lst>
  <str name="otherQuery">sku:213824965</str>
  <lst name="explainOther">
    <str name="213824965">
0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  1.1911955 = weight(moreWords:"p 1102 w" in 32497), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    1.4973521 = fieldWeight(moreWords:"p 1102 w" in 32497), product of:
      1.0 = tf(phraseFreq=1.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.078125 = fieldNorm(field=moreWords, doc=32497)
  0.0 = no match on required clause (moreWords:"laser jet")
    0.0 = weight(moreWords:"laser jet" in 32497), product of:
      0.60590804 = queryWeight(moreWords:"laser jet"), product of:
        14.597603 = idf(moreWords: laser=26731 jet=12685)
        0.041507367 = queryNorm
      0.0 = fieldWeight(moreWords:"laser jet" in 32497), product of:
        0.0 = tf(phraseFreq=0.0)
        14.597603 = idf(moreWords: laser=26731 jet=12685)
        0.078125 = fieldNorm(field=moreWords, doc=32497)
    </str>
  </lst>
</lst>
RE: please help explaining debug output
I'm still on Solr 1.4.0 and the analysis page looks like they should match, and other products with the same content do in fact match. I'm reindexing the non-matching ones to rule that out.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, July 25, 2011 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: please help explaining debug output

Hmmm, I'm assuming that moreWords is your default text field, yes? But it works for me (tm), using 1.4.1. What version of Solr are you on? Also, take a glance at the admin/analysis page, that might help... Gotta run

Erick

On Mon, Jul 25, 2011 at 4:52 PM, Robert Petersen rober...@buy.com wrote:
Sorry, to clarify: a search for P1102W matches all three docs, but a search for "p1102w LaserJet" only matches the second two. Someone asked me a question while I was typing and I got distracted; apologies for any confusion.

-----Original Message-----
From: Robert Petersen [mailto:rober...@buy.com]
Sent: Monday, July 25, 2011 1:42 PM
To: solr-user@lucene.apache.org
Subject: please help explaining debug output

I have three documents with the following product titles in a text field called moreWords, with an analysis stack matching the Solr example text field definition.

1. HP LaserJet P1102W Monochrome Laser Printer
http://www.buy.com/prod/hp-laserjet-p1102w-monochrome-laser-printer/q/loc/101/213824965.html

2. HP CE285A (85A) Remanufactured Black Toner Cartridge for LaserJet M1212nf, P1102, P1102W Series
http://www.buy.com/prod/hp-ce285a-85a-remanufactured-black-toner-cartridge-for-laserjet/q/loc/101/217145536.html

3. Black HP CE285A Toner Cartridge For LaserJet P1102W, LaserJet M1130, LaserJet M1132, LaserJet M1210
http://www.buy.com/prod/black-hp-ce285a-toner-cartridge-for-laserjet-p1102w-laserjet-m1130/q/loc/101/222045267.html

A search for P1102W matches (2) and (3), but not (1) above. Can someone explain the debug output? It looks like I am getting a non-match on (1) because the term frequency is zero?
Am I reading that right? If so, how could that be? The searched terms appear equivalently in all three docs. I don't get it.

<lst name="debug">
  <str name="rawquerystring">p1102w LaserJet</str>
  <str name="querystring">p1102w LaserJet</str>
  <str name="parsedquery">+PhraseQuery(moreWords:"p 1102 w") +PhraseQuery(moreWords:"laser jet")</str>
  <str name="parsedquery_toString">+moreWords:"p 1102 w" +moreWords:"laser jet"</str>
  <lst name="explain">
    <str name="222045267">
3.64852 = (MATCH) sum of:
  2.4758534 = weight(moreWords:"p 1102 w" in 6667236), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    3.1121879 = fieldWeight(moreWords:"p 1102 w" in 6667236), product of:
      1.7320508 = tf(phraseFreq=3.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.09375 = fieldNorm(field=moreWords, doc=6667236)
  1.1726664 = weight(moreWords:"laser jet" in 6667236), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    1.9353869 = fieldWeight(moreWords:"laser jet" in 6667236), product of:
      1.4142135 = tf(phraseFreq=2.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.09375 = fieldNorm(field=moreWords, doc=6667236)
    </str>
    <str name="222045265">
2.8656518 = (MATCH) sum of:
  1.4294347 = weight(moreWords:"p 1102 w" in 6684158), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    1.7968225 = fieldWeight(moreWords:"p 1102 w" in 6684158), product of:
      1.0 = tf(phraseFreq=1.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.09375 = fieldNorm(field=moreWords, doc=6684158)
  1.4362172 = weight(moreWords:"laser jet" in 6684158), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    2.3703551 = fieldWeight(moreWords:"laser jet" in 6684158), product of:
      1.7320508 = tf(phraseFreq=3.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.09375 = fieldNorm(field=moreWords, doc=6684158)
    </str>
  </lst>
  <str name="otherQuery">sku:213824965</str>
  <lst name="explainOther">
    <str name="213824965">
0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  1.1911955 = weight(moreWords:"p 1102 w" in 32497), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    1.4973521 = fieldWeight(moreWords:"p 1102 w" in 32497), product of:
      1.0 = tf(phraseFreq=1.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.078125 = fieldNorm
RE: Solr 3.3: Exception in thread Lucene Merge Thread #1
Says it is caused by a Java out-of-memory error, no?

-----Original Message-----
From: mdz-munich [mailto:sebastian.lu...@bsb-muenchen.de]
Sent: Wednesday, July 20, 2011 9:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 3.3: Exception in thread Lucene Merge Thread #1

Here we go ... This time we tried to use the old LogByteSizeMergePolicy and SerialMergeScheduler:

<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy"/>
<mergeScheduler class="org.apache.lucene.index.SerialMergeScheduler"/>

We did this before, just to be sure ... ~300 documents:

SEVERE: java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:782)
        at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:264)
        at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:216)
        at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:129)
        at org.apache.lucene.index.SegmentCoreReaders.openDocStores(SegmentCoreReaders.java:244)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:116)
        at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:702)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4192)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3859)
        at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:37)
        at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2714)
        at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2709)
        at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2705)
        at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3509)
        at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1850)
        at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1814)
        at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1778)
        at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:143)
        at org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:183)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:416)
        at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:98)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:240)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:164)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:563)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:403)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:301)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:162)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:140)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:309)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:897)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
        at java.lang.Thread.run(Thread.java:736)
Caused by: java.lang.OutOfMemoryError: Map failed
        at sun.nio.ch.FileChannelImpl.map0(Native Method)
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:779)
        ... 44 more

20.07.2011 18:07:30 org.apache.solr.core.SolrCore execute
INFO: [core.digi20] webapp=/solr path=/update params={} status=500 QTime=12302
20.07.2011 18:07:30 org.apache.solr.common.SolrException log
SEVERE:
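The "Map failed" OutOfMemoryError above comes from MMapDirectory: the JVM ran out of virtual address space for memory-mapped index files, which is common on 32-bit JVMs or processes with a capped address space. One hedged workaround, assuming a Solr 3.x solrconfig.xml where the directory implementation is pluggable, is to switch away from mmap entirely (running a 64-bit JVM with ample address space is the other usual fix):

```xml
<!-- solrconfig.xml sketch: use a read()-based directory instead of
     mmap. This avoids mapping index files into the process address
     space, trading some read speed for stability when virtual memory
     is limited. -->
<directoryFactory name="DirectoryFactory" class="solr.NIOFSDirectoryFactory"/>
```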
RE: Analysis page output vs. actually getting search matches, a discrepancy?
Thanks Erick. Unfortunately I'm stemming the same on both sides, similar to the Solr example settings for the text type field. The default search field is moreWords, as I want, yes. I don't have this problem for any other mfg names at all in our index of almost 10 MM product docs, and this shows that it should match, in my best estimation. Note: LucidKStemFilterFactory does not take 'Sterling' down to 'Sterl' in indexing nor searching; it stays as 'Sterling'.

I have given up on this. I've decided it is just an unexplainable anomaly, and have solved it by inserting a LucidKStemFilterFactory and just modifying that word to its searchable form before hitting the WhitespaceTokenizerFactory, which is kind of hackish but solves my problem at least. This seller only has a couple hundred cheap products on our site, so I have bigger fish to fry at this point. I've wasted too much time trying to chase this down.

Cheers all
Robi

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, July 18, 2011 5:33 PM
To: solr-user@lucene.apache.org
Subject: Re: Analysis page output vs. actually getting search matches, a discrepancy?

Hmmm, is there any chance that you're stemming one place and not the other? And I infer from your output that your default search field is moreWords, is that true and expected? You might use Luke or the TermsComponent to see what's actually in the index. I'm going to guess that you'll find "sterl" but not "sterling" as an indexed term and your problem is stemming, but that's a shot in the dark.

Best
Erick

On Mon, Jul 18, 2011 at 5:37 PM, Robert Petersen rober...@buy.com wrote:
OK, I did what Hoss said; it only confirms I don't get a match when I should and that the query parser is doing the expected. Here are the details for one test sku. My analysis page output is shown in my email starting this thread, and here is my query debug output. This absolutely should match but doesn't.
Both the indexing side and the query side are splitting on case changes. This actually isn't a problem for any of our other content; for instance there is no issue searching for 'VideoSecu', and their products come up fine in our searches regardless of casing in the query. Only SterlingTek's products seem to be causing us issues.

Indexed content has camel case, stored in the text field 'moreWords': SterlingTek's NB-2LH 2 Pack Batteries + Charger Combo for Canon DC301
Search term not matching, with camel case: SterlingTek's
Search term matching, if no case changes: Sterlingtek's

Indexing:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>

Searching:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>

Thanks

http://ssdevrh01.buy.com:8983/solr/1/select?indent=on&version=2.2&q=SterlingTek%27s&fq=&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=sku%3A216473417&hl=on&hl.fl=&echoHandler=true

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">4</int>
  <str name="handler">org.apache.solr.handler.component.SearchHandler</str>
  <lst name="params">
    <str name="explainOther">sku:216473417</str>
    <str name="indent">on</str>
    <str name="echoHandler">true</str>
    <str name="hl.fl"/>
    <str name="wt">standard</str>
    <str name="hl">on</str>
    <str name="rows">1</str>
    <str name="version">2.2</str>
    <str name="fl">*,score</str>
    <str name="debugQuery">on</str>
    <str name="start">0</str>
    <str name="q">SterlingTek's</str>
    <str name="qt">standard</str>
    <str name="fq"/>
  </lst>
</lst>
<result name="response" numFound="0" start="0" maxScore="0.0"/>
<lst name="highlighting"/>
<lst name="debug">
  <str name="rawquerystring">SterlingTek's</str>
  <str name="querystring">SterlingTek's</str>
  <str name="parsedquery">PhraseQuery(moreWords:"sterling tek")</str>
  <str name="parsedquery_toString">moreWords:"sterling tek"</str>
  <lst name="explain"/>
  <str name="otherQuery">sku:216473417</str>
  <lst name="explainOther">
    <str name="216473417">
0.0 = fieldWeight(moreWords:"sterling tek" in 76351), product of:
  0.0 = tf(phraseFreq=0.0)
  19.502613 = idf(moreWords: sterling=1 tek=72)
  0.15625 = fieldNorm(field=moreWords, doc=76351)
    </str>
  </lst>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str/>
  </arr>
</lst>
</response>

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Friday, July 15, 2011 4:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Analysis page output vs. actually getting search matches, a discrepancy?

: Subject: Analysis page output vs. actually getting search matches,
: a discrepancy?

99% of the time when people ask questions like this, it's because of confusion about how/when QueryParsing comes into play (as opposed to analysis
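Note that the index-side and query-side WordDelimiterFilterFactory settings in the thread differ (catenateWords="1" vs "0"), so the index and query sides can produce different token streams for camel-case terms. A hedged sketch of one common mitigation, keeping the unsplit original token alongside the split parts on both sides; this follows the thread's attribute values with only preserveOriginal changed, and is not confirmed to fix the poster's specific anomaly:

```xml
<!-- Sketch only: preserveOriginal="1" emits the original token
     ("SterlingTek") in addition to the case-split parts
     ("Sterling", "Tek"), so an exact camel-case query term can
     still match even when phrase positions of the parts differ. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1" preserveOriginal="1"/>
```

Changing these attributes requires a full reindex for the index-side analyzer to take effect.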
RE: Analysis page output vs. actually getting search matches, a discrepancy?
Um, sorry for any confusion. I meant to say I solved my issue by inserting a charFilter before the WhitespaceTokenizerFactory to convert my problem word to a searchable form. I had a cut-n-paste malfunction below. Thanks guys.

-----Original Message-----
From: Robert Petersen [mailto:rober...@buy.com]
Sent: Tuesday, July 19, 2011 11:06 AM
To: solr-user@lucene.apache.org
Subject: RE: Analysis page output vs. actually getting search matches, a discrepancy?

Thanks Erick. Unfortunately I'm stemming the same on both sides, similar to the Solr example settings for the text type field. The default search field is moreWords, as I want, yes. I don't have this problem for any other mfg names at all in our index of almost 10 MM product docs, and this shows that it should match, in my best estimation. Note: LucidKStemFilterFactory does not take 'Sterling' down to 'Sterl' in indexing nor searching; it stays as 'Sterling'.

I have given up on this. I've decided it is just an unexplainable anomaly, and have solved it by inserting a LucidKStemFilterFactory and just modifying that word to its searchable form before hitting the WhitespaceTokenizerFactory, which is kind of hackish but solves my problem at least. This seller only has a couple hundred cheap products on our site, so I have bigger fish to fry at this point. I've wasted too much time trying to chase this down.

Cheers all
Robi

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, July 18, 2011 5:33 PM
To: solr-user@lucene.apache.org
Subject: Re: Analysis page output vs. actually getting search matches, a discrepancy?

Hmmm, is there any chance that you're stemming one place and not the other? And I infer from your output that your default search field is moreWords, is that true and expected?
You might use Luke or the TermsComponent to see what's actually in the index. I'm going to guess that you'll find 'sterl' but not 'sterling' as an indexed term and your problem is stemming, but that's a shot in the dark.

Best
Erick

On Mon, Jul 18, 2011 at 5:37 PM, Robert Petersen rober...@buy.com wrote:
OK I did what Hoss said, it only confirms I don't get a match when I should and that the query parser is doing the expected. Here are the details for one test sku. My analysis page output is shown in my email starting this thread and here is my query debug output. This absolutely should match but doesn't. Both the indexing side and the query side are splitting on case changes. This actually isn't a problem for any of our other content, for instance there is no issue searching for 'VideoSecu'. Their products come up fine in our searches regardless of casing in the query. Only SterlingTek's products seem to be causing us issues.

Indexed content has camel case, stored in the text field 'moreWords': SterlingTek's NB-2LH 2 Pack Batteries + Charger Combo for Canon DC301
Search term not matching with camel case: SterlingTek's
Search term matching if no case changes: Sterlingtek's

Indexing:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>

Searching:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>

Thanks

http://ssdevrh01.buy.com:8983/solr/1/select?indent=on&version=2.2&q=SterlingTek%27s&fq=&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=sku%3A216473417&hl=on&hl.fl=&echoHandler=true

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <str name="handler">org.apache.solr.handler.component.SearchHandler</str>
    <lst name="params">
      <str name="explainOther">sku:216473417</str>
      <str name="indent">on</str>
      <str name="echoHandler">true</str>
      <str name="hl.fl"/>
      <str name="wt">standard</str>
      <str name="hl">on</str>
      <str name="rows">1</str>
      <str name="version">2.2</str>
      <str name="fl">*,score</str>
      <str name="debugQuery">on</str>
      <str name="start">0</str>
      <str name="q">SterlingTek's</str>
      <str name="qt">standard</str>
      <str name="fq"/>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0" maxScore="0.0"/>
  <lst name="highlighting"/>
  <lst name="debug">
    <str name="rawquerystring">SterlingTek's</str>
    <str name="querystring">SterlingTek's</str>
    <str name="parsedquery">PhraseQuery(moreWords:"sterling tek")</str>
    <str name="parsedquery_toString">moreWords:"sterling tek"</str>
    <lst name="explain"/>
    <str name="otherQuery">sku:216473417</str>
    <lst name="explainOther">
      <str name="216473417">
0.0 = fieldWeight(moreWords:"sterling tek" in 76351), product of:
  0.0 = tf(phraseFreq=0.0)
  19.502613 = idf(moreWords: sterling=1 tek=72)
  0.15625 = fieldNorm(field=moreWords, doc=76351)
      </str>
    </lst>
    <str name="QParser">LuceneQParser</str>
    <arr name="filter_queries">
      <str/>
    </arr>
  </lst>
</response>

-Original
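The charFilter workaround mentioned at the top of this thread (rewriting the problem word before the tokenizer sees it) could be sketched in schema.xml roughly as follows. This is only an illustration: the thread does not show the actual charFilter used, and the pattern/replacement values here are assumptions.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Hypothetical mapping: rewrite the troublesome camel-case term on the raw
         character stream, before WhitespaceTokenizerFactory and the rest of the chain -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="SterlingTek" replacement="Sterlingtek"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because a charFilter runs before tokenization, both the index and query sides see the normalized spelling, which is why this sidesteps the splitOnCaseChange behavior for just that one word.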
RE: Analysis page output vs. actually getting search matches, a discrepency?
OK I did what Hoss said, it only confirms I don't get a match when I should and that the query parser is doing the expected. Here are the details for one test sku. My analysis page output is shown in my email starting this thread and here is my query debug output. This absolutely should match but doesn't. Both the indexing side and the query side are splitting on case changes. This actually isn't a problem for any of our other content, for instance there is no issue searching for 'VideoSecu'. Their products come up fine in our searches regardless of casing in the query. Only SterlingTek's products seem to be causing us issues.

Indexed content has camel case, stored in the text field 'moreWords': SterlingTek's NB-2LH 2 Pack Batteries + Charger Combo for Canon DC301
Search term not matching with camel case: SterlingTek's
Search term matching if no case changes: Sterlingtek's

Indexing:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>

Searching:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>

Thanks

http://ssdevrh01.buy.com:8983/solr/1/select?indent=on&version=2.2&q=SterlingTek%27s&fq=&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=sku%3A216473417&hl=on&hl.fl=&echoHandler=true

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <str name="handler">org.apache.solr.handler.component.SearchHandler</str>
    <lst name="params">
      <str name="explainOther">sku:216473417</str>
      <str name="indent">on</str>
      <str name="echoHandler">true</str>
      <str name="hl.fl"/>
      <str name="wt">standard</str>
      <str name="hl">on</str>
      <str name="rows">1</str>
      <str name="version">2.2</str>
      <str name="fl">*,score</str>
      <str name="debugQuery">on</str>
      <str name="start">0</str>
      <str name="q">SterlingTek's</str>
      <str name="qt">standard</str>
      <str name="fq"/>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0" maxScore="0.0"/>
  <lst name="highlighting"/>
  <lst name="debug">
    <str name="rawquerystring">SterlingTek's</str>
    <str name="querystring">SterlingTek's</str>
    <str name="parsedquery">PhraseQuery(moreWords:"sterling tek")</str>
    <str name="parsedquery_toString">moreWords:"sterling tek"</str>
    <lst name="explain"/>
    <str name="otherQuery">sku:216473417</str>
    <lst name="explainOther">
      <str name="216473417">
0.0 = fieldWeight(moreWords:"sterling tek" in 76351), product of:
  0.0 = tf(phraseFreq=0.0)
  19.502613 = idf(moreWords: sterling=1 tek=72)
  0.15625 = fieldNorm(field=moreWords, doc=76351)
      </str>
    </lst>
    <str name="QParser">LuceneQParser</str>
    <arr name="filter_queries">
      <str/>
    </arr>
  </lst>
</response>

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Friday, July 15, 2011 4:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Analysis page output vs. actually getting search matches, a discrepency?

: Subject: Analysis page output vs. actually getting search matches,
: a discrepency?

99% of the time when people ask questions like this, it's because of confusion about how/when QueryParsing comes into play (as opposed to analysis) -- analysis.jsp only shows you part of the equation, it doesn't know what query parser you are using. You mentioned that you aren't getting matches when you expect them, and you provided the analysis.jsp output, but you didn't mention anything about the request you are making, the query parser used, etc. It would be good to know the full query URL, along with the debugQuery output showing the final query toString info. If that info doesn't clear up the discrepancy, you should also take a look at the explainOther info for the doc that you expect to match that isn't -- if you still aren't sure what's going on, post all of that info to solr-user and folks can probably help you make sense of it. (All that said: in some instances this type of problem is simply that someone changed the schema and didn't reindex everything, so the indexed terms don't really match what you think they do.)

-Hoss
RE: ' invisible ' words
Read my thread RE: Analysis page output vs. actually getting search matches, a discrepancy? and see if it is not somewhat like your problem... even if not, there might be something to help as to how to figure out what is going on in your case...

-Original Message-
From: deniz [mailto:denizdurmu...@gmail.com]
Sent: Sunday, July 17, 2011 6:24 PM
To: solr-user@lucene.apache.org
Subject: RE: ' invisible ' words

Hi Jagdish, thank you very much for the tool that you have sent... It is really useful for this problem... After using the tool, I just got interesting results... for some words, when I use the tool, it returns the matched docs; on the other hand when I use the solr admin page to make a search I can't get any matches... with the same words... now I am more confused and honestly have no idea about what to do... anyone has ever faced such a problem?

-
Zeki ama calismiyor... Calissa yapar...

--
View this message in context: http://lucene.472066.n3.nabble.com/invisible-words-tp3158060p3177907.html
Sent from the Solr - User mailing list archive at Nabble.com.
Analysis page output vs. actually getting search matches, a discrepency?
I have a problem searching for one mfg name (out of our 10mm product titles); it is indexed in a text type field having about the same analyzer settings as the solr example text field definition, and most everything works fine, but we found this one example which I cannot get a direct hit on. In the Field Analysis page, it sure looks like it would *have* to match, but sadly during searches it just doesn't. I can get it to match by turning off 'split on case change', but that breaks many other searches like 'appleTV' which need to split on case change to match 'apple tv' in our content! If I search for SterlingTek's anything I get zero results. If I change the casing to Sterlingtek's in my query, I get all the results. If I turn off 'split on case change' then the first gets results also. See verbose analysis output to see actual filter settings; I put non-verbose first for easier reading (hope the tables don't get lost during posting to this group), but the analysis shows complete matchup, that is what I don't get:

Field Analysis
Field value (Index), verbose output, highlight matches: SterlingTek's NB-2LH
Field value (Query), verbose output: SterlingTek's NB-2LH

Index Analyzer
SterlingTek's NB-2LH
SterlingTek's NB-2LH
SterlingTek's NB-2LH
Sterling Tek NB 2 LH SterlingTek
sterling tek nb 2 lh sterlingtek
sterling tek nb 2 lh sterlingtek
sterling tek nb 2 lh sterlingtek

Note every field is highlighted in the last line above meaning all have a match, right???
Query Analyzer
SterlingTek's NB-2LH
SterlingTek's NB-2LH
SterlingTek's NB-2LH
Sterling Tek NB 2 LH
sterling tek nb 2 lh
sterling tek nb 2 lh
sterling tek nb 2 lh

VERBOSE OUTPUT FOLLOWS:

Index Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  term position: 1 2
  term text: SterlingTek's NB-2LH
  term type: word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.SynonymFilterFactory {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
  term position: 1 2
  term text: SterlingTek's NB-2LH
  term type: word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  term position: 1 2
  term text: SterlingTek's NB-2LH
  term type: word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
  term position: 1 2 3 4 5
  term text: Sterling Tek NB 2 LH SterlingTek
  term type: word word word word word word
  source start,end: 0,8 8,11 14,16 17,18 18,20 0,11

org.apache.solr.analysis.LowerCaseFilterFactory {}
  term position: 1 2 3 4 5
  term text: sterling tek nb 2 lh sterlingtek
  term type: word word word word word word
  source start,end: 0,8 8,11 14,16 17,18 18,20 0,11

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
  term position: 1 2 3 4 5
  term text: sterling tek nb 2 lh sterlingtek
  term type: word word word word word word
  source start,end: 0,8 8,11 14,16 17,18 18,20 0,11

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  term position: 1 2 3 4 5
  term text: sterling tek nb 2 lh sterlingtek
  term type: word word word word word word
  source start,end: 0,8 8,11 14,16 17,18 18,20 0,11

Query Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  term position: 1 2
  term text: SterlingTek's NB-2LH
  term type: word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.SynonymFilterFactory {synonyms=query_synonyms.txt, expand=true, ignoreCase=true}
  term position: 1 2
  term text: SterlingTek's NB-2LH
  term type: word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  term position: 1 2
  term text: SterlingTek's NB-2LH
  term type: word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, splitOnCaseChange=1, generateNumberParts=1, catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
  term position: 1 2 3 4 5
  term text: Sterling Tek NB 2 LH
  term type: word word word word word
  source start,end: 0,8 8,11 14,16 17,18 18,20
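Erick's suggestion earlier in the thread to check what actually made it into the index (via Luke or the TermsComponent) can be done with a config like the following, which mirrors the stock Solr 3.x example solrconfig.xml; the handler name is arbitrary.

```xml
<!-- Register the TermsComponent and a handler that exposes it -->
<searchComponent name="terms" class="solr.TermsComponent"/>

<requestHandler name="/terms" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>
```

A request such as /solr/terms?terms.fl=moreWords&terms.prefix=sterl would then list the indexed terms in moreWords starting with "sterl", which settles whether 'sterling', 'sterlingtek', or a stemmed form is really in the index.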
RE: Analysis page output vs. actually getting search matches, a discrepency?
Hi Chris,

Well to start from the bottom of your list there, I restrict my testing to one sku while continuously reindexing the sku after every indexer side change, and reload the core every time also. I just search from the admin page using the word in question and the exact match on the sku field (the unique one) like this:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">6</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">SterlingTek's NB-2LH sku:216473417</str>
      <str name="bbb">a</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
</response>

I will have to find out more about query parsers before I can answer the rest. Will reply to that later... and it's Friday after all! :)

Thanks

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Friday, July 15, 2011 4:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Analysis page output vs. actually getting search matches, a discrepency?

: Subject: Analysis page output vs. actually getting search matches,
: a discrepency?

99% of the time when people ask questions like this, it's because of confusion about how/when QueryParsing comes into play (as opposed to analysis) -- analysis.jsp only shows you part of the equation, it doesn't know what query parser you are using. You mentioned that you aren't getting matches when you expect them, and you provided the analysis.jsp output, but you didn't mention anything about the request you are making, the query parser used, etc. It would be good to know the full query URL, along with the debugQuery output showing the final query toString info. If that info doesn't clear up the discrepancy, you should also take a look at the explainOther info for the doc that you expect to match that isn't -- if you still aren't sure what's going on, post all of that info to solr-user and folks can probably help you make sense of it.
(all that said: in some instances this type of problem is simply that someone changed the schema and didn't reindex everything, so the indexed terms don't really match what you think they do) -Hoss
RE: Feature: skipping caches and info about cache use
Why, I'm just wondering? For a case where you know the next query would not be possible to be already in the cache because it is so different from the norm? Just for timing information for instrumentation used for tuning (ie so you can compare cached response times vs non-cached response times)? -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Friday, June 03, 2011 10:02 AM To: solr-user@lucene.apache.org Subject: Feature: skipping caches and info about cache use Hi, Is it just me, or would others like things like: * The ability to tell Solr (by passing some URL param?) to skip one or more of its caches and get data from the index * An additional attrib in the Solr response that shows whether the query came from the cache or not * Maybe something else along these lines? Or maybe some of this is already there and I just don't know about it? :) Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
RE: Anyway to know changed documents?
...and it works really well!!! :)

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Wednesday, June 01, 2011 5:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Anyway to know changed documents?

On 6/1/2011 6:12 AM, pravesh wrote:
SOLR wiki will provide help on this. You might be interested in pure Java based replication too. I'm not sure whether SOLR operational will have this feature (synch'ing only changed segments). You might need to change configuration in solrconfig.xml

Yes, this feature is there in the Java/HTTP based replication since Solr 1.4
RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping
Don't manually group by author from your results, the list will always be incomplete... use faceting instead to show the authors of the books you have found in your search. http://wiki.apache.org/solr/SolrFacetingOverview

-Original Message-
From: beccax [mailto:bec...@gmail.com]
Sent: Wednesday, June 01, 2011 11:56 AM
To: solr-user@lucene.apache.org
Subject: Newbie question: how to deal with different # of search results per page due to pagination then grouping

Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page. (Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B

--
View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-different-of-search-results-per-page-due-to-pagination-then-grouping-tp3012168p3012168.html
Sent from the Solr - User mailing list archive at Nabble.com.
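A sketch of what the faceted request could look like. The facet parameters are standard Solr params; the field name "author" is an assumption about the poster's schema.

```
/select?q=search+keywords&rows=10&facet=true&facet.field=author&facet.mincount=1&facet.limit=25
```

facet.mincount=1 matches the requirement of showing only authors with one or more matching documents, and each facet value comes back with its document count, so the per-author counts are free.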
RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping
I think facet.offset allows facet paging nicely by letting you index into the list of facet values. It is working for me... http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, June 01, 2011 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping There's no great way to do that. One approach would be using facets, but that will just get you the author names (as stored in fields), and not the documents under it. If you really only want to show the author names, facets could work. One issue with facets though is Solr won't tell you the total number of facet values for your query, so it's tricky to provide next/prev paging through them. There is also a 'field collapsing' feature that I think is not in a released Solr, but may be in the Solr repo. I'm not sure it will quite do what you want either though, although it's related and worth a look. http://wiki.apache.org/solr/FieldCollapsing Another vaguely related thing that is also not yet in a released Solr, is a 'join' function. That could possibly be used to do what you want, although it'd be tricky too. https://issues.apache.org/jira/browse/SOLR-2272 Jonathan On 6/1/2011 2:56 PM, beccax wrote: Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page. (Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? 
I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B

--
View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-different-of-search-results-per-page-due-to-pagination-then-grouping-tp3012168p3012168.html
Sent from the Solr - User mailing list archive at Nabble.com.
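The facet.offset paging described in the reply above can be driven directly from the page number. These are standard Solr facet parameters; the query and field name are illustrative.

```
page 1:  q=keywords&rows=0&facet=true&facet.field=author&facet.limit=25&facet.offset=0
page 2:  q=keywords&rows=0&facet=true&facet.field=author&facet.limit=25&facet.offset=25
page 3:  q=keywords&rows=0&facet=true&facet.field=author&facet.limit=25&facet.offset=50
```

rows=0 suppresses the document list, so each request returns only the slice of 25 author facet values for that page.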
RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping
Yes that is exactly the issue... we're thinking just maybe always have a next button and if you go too far you just get zero results. User gets what the user asks for, and so user could simply back up if desired to where the facet still has values. Could also detect an empty facet results on the front end. You can also only expand one facet only to allow paging only the facet pane and not the whole page using an ajax call. -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, June 01, 2011 2:30 PM To: solr-user@lucene.apache.org Cc: Robert Petersen Subject: Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping How do you know whether to provide a 'next' button, or whether you are the end of your facet list? On 6/1/2011 4:47 PM, Robert Petersen wrote: I think facet.offset allows facet paging nicely by letting you index into the list of facet values. It is working for me... http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, June 01, 2011 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping There's no great way to do that. One approach would be using facets, but that will just get you the author names (as stored in fields), and not the documents under it. If you really only want to show the author names, facets could work. One issue with facets though is Solr won't tell you the total number of facet values for your query, so it's tricky to provide next/prev paging through them. There is also a 'field collapsing' feature that I think is not in a released Solr, but may be in the Solr repo. I'm not sure it will quite do what you want either though, although it's related and worth a look. 
http://wiki.apache.org/solr/FieldCollapsing Another vaguely related thing that is also not yet in a released Solr, is a 'join' function. That could possibly be used to do what you want, although it'd be tricky too. https://issues.apache.org/jira/browse/SOLR-2272 Jonathan On 6/1/2011 2:56 PM, beccax wrote: Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page. (Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121 68p3012168.html Sent from the Solr - User mailing list archive at Nabble.com.
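One common answer to Jonathan's "how do you know whether to provide a 'next' button" question, not from this thread but standard facet-paging practice, is to over-fetch by one value:

```
facet.limit=26&facet.offset=25   (display only 25 of the returned values)
```

If a 26th value comes back there is a next page; if fewer than 26 come back, the current page is the last one. This avoids ever sending the user to an empty page.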
RE: How to index and query C# as whole term?
I have always just converted terms like 'C#' or 'C++' into 'csharp' and 'cplusplus' before indexing them and similarly converted those terms if someone searched on them. That always has worked just fine for me... :) -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Monday, May 16, 2011 8:28 AM To: solr-user@lucene.apache.org Subject: Re: How to index and query C# as whole term? I don't think you'd want to use the string type here. String type is almost never appropriate for a field you want to actually search on (it is appropriate for fields to facet on). But you may want to use Text type with different analyzers selected. You probably want Text type so the value is still split into different tokens on word boundaries; you just don't want an analyzer set that removes punctuation. On 5/16/2011 10:46 AM, Gora Mohanty wrote: On Mon, May 16, 2011 at 7:05 PM, Gnanakumargna...@zoniac.com wrote: Hi, I'm using Apache Solr v3.1. How do I configure/allow Solr to both index and query the term c# as a whole word/term? From Analysis page, I could see that the term c# is being reduced/converted into just c by solr.WordDelimiterFilterFactory. [...] Yes, as you have discovered the analyzers for the field type in question will affect the values indexed. To index c# exactly as is, you can use the string type, instead of the text type. However, what you probably want some filters to be applied, e.g., LowerCaseFilterFactory. Take a look at the definition of the fieldType text in schema.xml, define a new field type that has only the tokenizers and analyzers that you need, and use that type for your field. This Wiki page should be helpful: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Regards, Gora
RE: How to index and query C# as whole term?
Sorry I am also using a synonyms.txt for this in the analysis stack. I was not clear, sorry for any confusion. I am not doing it outside of Solr but on the way into the index it is converted... :) -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, May 16, 2011 8:51 AM To: solr-user@lucene.apache.org Subject: Re: How to index and query C# as whole term? Before indexing so outside Solr? Using the SynonymFilter would be easier i guess. On Monday 16 May 2011 17:44:24 Robert Petersen wrote: I have always just converted terms like 'C#' or 'C++' into 'csharp' and 'cplusplus' before indexing them and similarly converted those terms if someone searched on them. That always has worked just fine for me... :) -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Monday, May 16, 2011 8:28 AM To: solr-user@lucene.apache.org Subject: Re: How to index and query C# as whole term? I don't think you'd want to use the string type here. String type is almost never appropriate for a field you want to actually search on (it is appropriate for fields to facet on). But you may want to use Text type with different analyzers selected. You probably want Text type so the value is still split into different tokens on word boundaries; you just don't want an analyzer set that removes punctuation. On 5/16/2011 10:46 AM, Gora Mohanty wrote: On Mon, May 16, 2011 at 7:05 PM, Gnanakumargna...@zoniac.com wrote: Hi, I'm using Apache Solr v3.1. How do I configure/allow Solr to both index and query the term c# as a whole word/term? From Analysis page, I could see that the term c# is being reduced/converted into just c by solr.WordDelimiterFilterFactory. [...] Yes, as you have discovered the analyzers for the field type in question will affect the values indexed. To index c# exactly as is, you can use the string type, instead of the text type. However, what you probably want some filters to be applied, e.g., LowerCaseFilterFactory. 
Take a look at the definition of the fieldType text in schema.xml, define a new field type that has only the tokenizers and analyzers that you need, and use that type for your field. This Wiki page should be helpful: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Regards, Gora -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
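The synonyms.txt approach Robert describes works because WhitespaceTokenizerFactory leaves 'c#' and 'c++' intact, so a SynonymFilterFactory placed before the WordDelimiterFilterFactory can rewrite them before the punctuation gets stripped. The replacement spellings here are illustrative:

```
c# => csharp
c++ => cplusplus
```

The same mapping has to be applied in both the index and the query analyzer chains so that the two sides agree on the rewritten terms.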
RE: Synonym Filter disable at query time
Very nice! Good job! :)

-Original Message-
From: mtraynham [mailto:mtrayn...@digitalsmiths.com]
Sent: Tuesday, May 10, 2011 9:44 AM
To: solr-user@lucene.apache.org
Subject: RE: Synonym Filter disable at query time

Just a heads up on a solution. copyField wasn't needed, but a new fieldType and a non-indexed, non-stored field was added. Within a new Synonym processor that executes right before the AnalyzerQueryNodeProcessor, I was able to modify the field name for each node to point at the new field. Therefore I could build out the necessary synonym values from the tokenizer and then reassign them all back to the original field with whatever boosts they needed. This allowed me to retain the original value match, to keep its boost at 1 and then boost the synonyms according to a user specified boost value. Works perfectly. Thanks again for the help.

--
View this message in context: http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-tp2919876p2923775.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Synonym Filter disable at query time
Just make another field using copyField, with a field type that does not apply synonyms to the text, and then search either the field with or without synonyms from the front end... that will be your selector. :)

-Original Message-
From: mtraynham [mailto:mtrayn...@digitalsmiths.com]
Sent: Monday, May 09, 2011 11:17 AM
To: solr-user@lucene.apache.org
Subject: Synonym Filter disable at query time

I would like to be able to disable the synonym filter during runtime based on a query parameter, say 'synonyms=true' or 'synonyms=false'. Is there a way within the AnalyzerQueryNodeProcessor or QParser that I can remove the SynonymFilter from the AnalyzerAttributes? It seems that the Analyzer has a hashmap for its 'analyzers' but I cannot find the declaration of this item. Am I going about this wrong is also another question I had...

--
View this message in context: http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-tp2919876p2919876.html
Sent from the Solr - User mailing list archive at Nabble.com.
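The copyField suggestion above might look like this in schema.xml. All field and type names here are hypothetical; the point is only that the two field types differ in whether their analyzer chain contains a SynonymFilterFactory.

```xml
<!-- Hypothetical: text_syn applies SynonymFilterFactory, text_nosyn omits it -->
<field name="title"       type="text_syn"   indexed="true" stored="true"/>
<field name="title_nosyn" type="text_nosyn" indexed="true" stored="false"/>
<copyField source="title" dest="title_nosyn"/>
```

The front end then queries title when the synonyms=true parameter is passed and title_nosyn when it is false, which is the "selector" Robert describes.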
RE: Synonym Filter disable at query time
I was thinking search both and boost the non-synonym field perhaps?

-Original Message-
From: mtraynham [mailto:mtrayn...@digitalsmiths.com]
Sent: Monday, May 09, 2011 1:20 PM
To: solr-user@lucene.apache.org
Subject: RE: Synonym Filter disable at query time

Awesome thanks! Also, you wouldn't happen to have any insight on boosting synonyms lower than the original query after they were stemmed, would you? Say if I had synonyms turned on: The TokenStream is setup to do Synonyms - StopFilter - LowerCaseFilter - SnowballPorter. Say I search for Thomas, synonyms produces Thomas, Tom, Tommy. The SnowballPorter produces Tom, Tommi, Thoma. Is there a way to know Thoma would match the original term, so it could be boosted higher?

--
View this message in context: http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-tp2919876p2920342.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Synonym Filter disable at query time
Yay! :)

-Original Message-
From: mtraynham [mailto:mtrayn...@digitalsmiths.com]
Sent: Monday, May 09, 2011 1:59 PM
To: solr-user@lucene.apache.org
Subject: RE: Synonym Filter disable at query time

Actually now that I think about it, with copy fields I can just single out the Synonym reader and boost from an earlier processor. Thanks again though, that solved a lot of headache!

--
View this message in context: http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-tp2919876p2920510.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: stemming for English
From what I have seen, adding a second field with the same terms as the first does *not* double your index size at all. -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: Tuesday, May 03, 2011 4:06 AM To: solr-user@lucene.apache.org Subject: Re: stemming for English Yes, Ludovic. Thus effectively we get index doubled. Given the volume of data we store, we very carefully consider such cases, where the doubling of index is must. Dmitry On Tue, May 3, 2011 at 1:08 PM, lboutros boutr...@gmail.com wrote: Dmitry, I don't know any way to keep both stemming and consistent wildcard support in the same field. To me, you have to create 2 different fields. Ludovic. 2011/5/3 Dmitry Kan [via Lucene] ml-node+2893628-993677979-383...@n3.nabble.com Hi Ludovic, That's an option we had before we decided to go for a full-blown support of wildcards. Do you know of a way to keep both stemming and consistent wildcard support in the same field?` Dmitry - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/stemming-for-English-tp2893599p2893652.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Dmitry Kan
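The second field in the two-field approach discussed above would typically use an analyzer chain with the stemmer simply left out, so wildcard queries run against the raw tokens. A sketch, with the type name assumed:

```xml
<fieldType name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- no stemming filter here, so a wildcard like sterl* matches the
         tokens exactly as they were indexed -->
  </analyzer>
</fieldType>
```

The stemmed field then serves normal queries, and the unstemmed copy serves wildcard queries with consistent behavior.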
RE: boost fields which have value
I believe the sortMissingLast fieldType attribute is what you want:

<fieldType ... sortMissingLast="true" ... />

http://wiki.apache.org/solr/SchemaXml

-Original Message-
From: Zoltán Altfatter [mailto:altfatt...@gmail.com]
Sent: Thursday, April 28, 2011 6:11 AM
To: solr-user@lucene.apache.org
Subject: boost fields which have value

Hi, How can I achieve that documents which don't have field1 and field2 filled in are returned at the end of the search result? I have tried with the *bf* parameter, which seems to work but just with one field. Is there any function query which I can use in the bf value to boost two fields? Thank you. Regards, Zoltan
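A fuller version of that fieldType line; with sortMissingLast=true, documents that have no value in a field of this type sort after all documents that do have one, for both ascending and descending sorts. The type and field names here are illustrative.

```xml
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<field name="field1" type="string" indexed="true" stored="true"/>
```

Sorting on field1 then pushes the documents missing it to the end of the results, which is the behavior Zoltan asked for, without needing a boost function.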
RE: SynonymFilterFactory case changes
Yes I did, but that's cool because it is useful to make the final determination explicit here on the group for the benefit of other users. :)

Thanks
Robi

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, April 26, 2011 5:10 PM
To: solr-user@lucene.apache.org
Subject: Re: SynonymFilterFactory case changes

Ahhh, I mis-read your post.. First, it's not the SynonymFilterFactory that's lowercasing anything. The ignoreCase=true affects the matching, not the output. The output is probably lowercased because you have it that way in the synonyms.txt file. At least that's what I just saw using the analysis page from the Solr admin page. So yes, if you want the WDF to do anything on tokens put into the input stream by SynonymFilterFactory, you need to make the replacement be the accurate case. But I think you already figured all that out

Best
Erick

On Tue, Apr 26, 2011 at 7:19 PM, Robert Petersen rober...@buy.com wrote:
But in this case lowercase is after WDF. The question is that when you get a hit in the SynonymFilter on a synonym, and where the entries in the synonyms.txt file are all in lower case, do I need to add the case-changing versions to make WDF work on case changes? Because it appears the synonym text is replaced verbatim by what is in the txt file and so that defeats the WDF filter. In fact, adding the case-changing versions of this term to the synonyms.txt file makes this use case work. (yay)

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, April 26, 2011 3:39 PM
To: solr-user@lucene.apache.org
Subject: Re: SynonymFilterFactory case changes

Yes, order does matter. You're right, putting, say, lowercase in front of WordDelimiter... will mess up the operations of WDFF. The admin/analysis page is *extremely* useful for understanding what happens in the analysis of input. Make sure to check the verbose checkbox.
Best Erick On Tue, Apr 26, 2011 at 5:10 PM, Robert Petersen rober...@buy.com wrote: So if there is a hit in the synonym filter factory, do I need to put in the various case changes for a term so that the following WordDelimiterFilter can do its 'split on case changes' work? Here we see SynonymFilterFactory makes all terms lowercase because this is what is in my synonyms.txt file and I have ignoreCase=true: macafee, mcafee
Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 1 | term text McAfee | term type word | start,end 0,6 | payload
org.apache.solr.analysis.SynonymFilterFactory {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
term position 1 | term text macafee mcafee | term type word word | start,end 0,6 0,6 | payload
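A sketch of what the case-preserving entries could look like in synonyms.txt (the exact variants listed are illustrative, not taken from the thread):

```text
# Hypothetical synonyms.txt entries: listing the case variants explicitly
# so the replacement tokens keep their case and WordDelimiterFilterFactory
# can still split on case changes downstream.
macafee, mcafee, MacAfee, McAfee
```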
RE: term position question from analyzer stack for WordDelimiterFilterFactory
OK this is even more weird... everything is working much better except for one thing: I was testing use cases with our top query terms to make sure the query settings below wouldn't break any existing behavior, and got this most unusual result. The analyzer stack completely eliminated the word McAfee from the query terms! I'm like huh? Here is the analyzer page output for that search term:
Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 1 | term text McAfee | term type word | start,end 0,6 | payload
org.apache.solr.analysis.SynonymFilterFactory {synonyms=query_synonyms.txt, expand=true, ignoreCase=true}
term position 1 | term text McAfee | term type word | start,end 0,6 | payload
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
term position 1 | term text McAfee | term type word | start,end 0,6 | payload
org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, catenateNumbers=0}
term position | term text | term type | start,end | payload (all empty: no tokens survive)
org.apache.solr.analysis.LowerCaseFilterFactory {}
term position | term text | term type | start,end | payload
com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
term position | term text | term type | start,end | payload
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position | term text | term type | start,end | payload
-Original Message- From: Robert Petersen [mailto:rober...@buy.com] Sent: Monday, April 25, 2011 11:27 AM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: RE: term position question from analyzer stack for WordDelimiterFilterFactory Aha! I knew something must be awry, but when I looked at the analysis page output, well it sure looked like it should match. :) OK here is the query-side WDF that finally works; I just turned everything off. 
(yay) First I tried just completely removing WDF from the query-side analyzer stack, but that didn't work. So anyway I suppose I should turn off the catenateAll plus the preserveOriginal settings, reindex, and see if I still get a match, huh? (PS thank you very much for the help!!!) <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="0" /> -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Monday, April 25, 2011 9:24 AM To: solr-user@lucene.apache.org Subject: Re: term position question from analyzer stack for WordDelimiterFilterFactory On Mon, Apr 25, 2011 at 12:15 PM, Robert Petersen rober...@buy.com wrote: The search and index analyzer stack are the same. Ahhh, they should not be! Using both generate and catenate in WDF at query time is a no-no. Same reason you can't have multi-word synonyms at query time: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory I'd recommend going back to the WDF settings in the solr example server as a starting point. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
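For reference, the example-server approach Yonik recommends keeps the two stacks asymmetric: catenate at index time, but not at query time. A sketch of that pattern (the attribute values here illustrate the idea and are not copied from the example config):

```xml
<!-- Sketch: asymmetric WordDelimiterFilterFactory settings. The index
     side catenates split parts ("AppleTV" also indexes "appletv"); the
     query side only generates parts, avoiding the multi-position tokens
     that break query-time matching. -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" catenateAll="0"
          splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"
          splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```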
SynonymFilterFactory case changes
So if there is a hit in the synonym filter factory, do I need to put in the various case changes for a term so that the following WordDelimiterFilter can do its 'split on case changes' work? Here we see SynonymFilterFactory makes all terms lowercase because this is what is in my synonyms.txt file and I have ignoreCase=true: macafee, mcafee
Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 1 | term text McAfee | term type word | start,end 0,6 | payload
org.apache.solr.analysis.SynonymFilterFactory {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
term position 1 | term text macafee mcafee | term type word word | start,end 0,6 0,6 | payload
RE: term position question from analyzer stack for WordDelimiterFilterFactory
Yeah I am about to try turning them on one at a time and see what happens. I had a meeting so couldn't do it yet... (darn those meetings) (lol) -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Tuesday, April 26, 2011 2:37 PM To: solr-user@lucene.apache.org Subject: Re: term position question from analyzer stack for WordDelimiterFilterFactory Hi Robert, I'm no WDFF expert, but all these zeros look suspicious: org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, catenateNumbers=0} A quick visit to http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory makes me think you want: splitOnCaseChange=1 (if you want Mc Afee for some reason?) generateWordParts=1 (if you want Mc Afee for some reason?) preserveOriginal=1 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Robert Petersen rober...@buy.com To: solr-user@lucene.apache.org; yo...@lucidimagination.com Sent: Tue, April 26, 2011 4:39:49 PM Subject: RE: term position question from analyzer stack for WordDelimiterFilterFactory OK this is even more weird... everything is working much better except for one thing: I was testing use cases with our top query terms to make sure the query settings below wouldn't break any existing behavior, and got this most unusual result. The analyzer stack completely eliminated the word McAfee from the query terms! I'm like huh? 
Here is the analyzer page output for that search term: Query Analyzer org.apache.solr.analysis.WhitespaceTokenizerFactory {} term position 1 term text McAfee term type word source start,end 0,6 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=query_synonyms.txt, expand=true, ignoreCase=true} term position 1 term text McAfee term type word source start,end 0,6 payload org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true} term position 1 term text McAfee term type word source start,end 0,6 payload org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, catenateNumbers=0} term position term text term type source start,end payload org.apache.solr.analysis.LowerCaseFilterFactory {} term position term text term type source start,end payload com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt} term position term text term type source start,end payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position term text term type source start,end payload -Original Message- From: Robert Petersen [mailto:rober...@buy.com] Sent: Monday, April 25, 2011 11:27 AM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: RE: term position question from analyzer stack for WordDelimiterFilterFactory Aha! I knew something must be awry, but when I looked at the analysis page output, well it sure looked like it should match. :) OK here is the query side WDF that finally works, I just turned everything off. (yay) First I tried just completely removeing WDF from the query side analyzer stack but that didn't work. So anyway I suppose I should turn off the catenate all plus the preserve original settings, reindex, and see if I still get a match huh? (PS thank you very much for the help!!!) 
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="0" /> -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Monday, April 25, 2011 9:24 AM To: solr-user@lucene.apache.org Subject: Re: term position question from analyzer stack for WordDelimiterFilterFactory On Mon, Apr 25, 2011 at 12:15 PM, Robert Petersen rober...@buy.com wrote: The search and index analyzer stack are the same. Ahhh, they should not be! Using both generate and catenate in WDF at query time is a no-no. Same reason you can't have multi-word synonyms at query time: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory I'd recommend going back to the WDF settings in the solr example server as a starting point. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
RE: SynonymFilterFactory case changes
But in this case lowercase is after WDF. The question is: when you get a hit in the SynonymFilter on a synonym, and the entries in the synonyms.txt file are all lower case, do I need to add the case-changing versions to make WDF work on case changes? It appears the synonym text is replaced verbatim by what is in the txt file, and that defeats the WDF filter. In fact, adding the case-changing versions of this term to the synonyms.txt file makes this use case work. (yay) -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, April 26, 2011 3:39 PM To: solr-user@lucene.apache.org Subject: Re: SynonymFilterFactory case changes Yes, order does matter. You're right, putting, say, lowercase in front of WordDelimiter... will mess up the operations of WDFF. The admin/analysis page is *extremely* useful for understanding what happens in the analysis of input. Make sure to check the verbose checkbox. Best Erick On Tue, Apr 26, 2011 at 5:10 PM, Robert Petersen rober...@buy.com wrote: So if there is a hit in the synonym filter factory, do I need to put in the various case changes for a term so that the following WordDelimiterFilter can do its 'split on case changes' work? Here we see SynonymFilterFactory makes all terms lowercase because this is what is in my synonyms.txt file and I have ignoreCase=true: macafee, mcafee
Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 1 | term text McAfee | term type word | start,end 0,6 | payload
org.apache.solr.analysis.SynonymFilterFactory {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
term position 1 | term text macafee mcafee | term type word word | start,end 0,6 0,6 | payload
RE: term position question from analyzer stack for WordDelimiterFilterFactory
Sorry, that was supposed to be just another way to say the same thing... OK, here is my current situation. Even with preserveOriginal and catenateAll set, I am still getting an even odder result. I set up sku=218078624 with title="Beanbag AppleTV Friction Dash Mount for GPS" and index it in dev. The search and index analyzer stacks are the same. When I do this search in the solr admin page I get zero results: sku:218078624 title:AppleTV, but when I do this search I get one result: sku:218078624 title:appletv. This is the opposite of what was happening before I added the preserveOriginal setting. In the analysis page I plug in that title and term, and it looks to me like it should match... which is why I started asking about term positions and such. I don't understand why I don't get a hit in both cases. It is so weird. -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Friday, April 22, 2011 5:55 PM To: Robert Petersen Cc: solr-user@lucene.apache.org Subject: Re: term position question from analyzer stack for WordDelimiterFilterFactory On Fri, Apr 22, 2011 at 8:24 PM, Robert Petersen rober...@buy.com wrote: I can repeatedly demonstrate this in my dev environment, where I get entirely different results searching for AppleTV vs. appletv You originally said "I cannot get a match between AppleTV on the indexing side and appletv on the search side." Getting different numbers of results, or different results, is slightly different. For example, if there were a document with "Apple TV" in it, then a query of AppleTV would match that doc, but a query of appletv would not. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
RE: term position question from analyzer stack for WordDelimiterFilterFactory
Aha! I knew something must be awry, but when I looked at the analysis page output, well it sure looked like it should match. :) OK here is the query-side WDF that finally works; I just turned everything off. (yay) First I tried just completely removing WDF from the query-side analyzer stack, but that didn't work. So anyway I suppose I should turn off the catenateAll plus the preserveOriginal settings, reindex, and see if I still get a match, huh? (PS thank you very much for the help!!!) <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="0" /> -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Monday, April 25, 2011 9:24 AM To: solr-user@lucene.apache.org Subject: Re: term position question from analyzer stack for WordDelimiterFilterFactory On Mon, Apr 25, 2011 at 12:15 PM, Robert Petersen rober...@buy.com wrote: The search and index analyzer stack are the same. Ahhh, they should not be! Using both generate and catenate in WDF at query time is a no-no. Same reason you can't have multi-word synonyms at query time: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory I'd recommend going back to the WDF settings in the solr example server as a starting point. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
RE: term position question from analyzer stack for WordDelimiterFilterFactory
I can repeatedly demonstrate this in my dev environment, where I get entirely different results searching for AppleTV vs. appletv, and I really just don't get it. I set up a specific sku in dev with AppleTV in its title to experiment with. What can I provide to help diagnose? I need to make this work... thanks for the help! -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Thursday, April 21, 2011 5:54 PM To: solr-user@lucene.apache.org Subject: Re: term position question from analyzer stack for WordDelimiterFilterFactory On Thu, Apr 21, 2011 at 8:06 PM, Robert Petersen rober...@buy.com wrote: So if I don't put preserveOriginal=1 in my WordDelimiterFilterFactory settings I cannot get a match between AppleTV on the indexing side and appletv on the search side. Hmmm, that shouldn't be the case. The text field in the solr example config doesn't use preserveOriginal, and AppleTV is indexed as appl, tv/appletv. And a search for appletv does match fine. Perhaps on the search side there is actually a phrase query like "big appletv"? One workaround for that is to add a little slop... "big appletv"~1 -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
RE: stemming filter analyzers, any favorites?
Adding another field with another stemmer and searching both??? Wow, never thought of doing that. I guess that doesn't really double the size of your index though, because all the terms are almost the same, right? Let me look into that. I'll raise the other issue in a separate thread, and thanks. -Original Message- From: Em [mailto:mailformailingli...@yahoo.de] Sent: Thursday, April 21, 2011 1:55 AM To: solr-user@lucene.apache.org Subject: RE: stemming filter analyzers, any favorites? Hi Robert, we often ran into the same issue with stemmers. This is why we created more than one field, each field with a different stemmer. It adds some overhead but worked quite well. Regarding your off-topic question: Look at the debugging output of your searches. Sometimes you configure your tools, especially the WDF, wrongly, and the query parser creates an unexpected result which leads to unmatched but still relevant documents. Please show us your debugging output and the field definition so that we can provide you some help! Regards, Em Robert Petersen-3 wrote: I have been doing that, and for the Bags example the trailing 's' is not being removed by the Kstemmer, so if indexing the word bags and searching on bag you get no matches. Why wouldn't the trailing 's' get stemmed off? Kstemmer is dictionary based, so bags isn't in the dictionary? That trailing 's' should always be dropped, no? That seems like it would be better; we don't want to make synonyms for basic use cases like this. I fear I will have to return to the Porter stemmer. Are there other better ones is my main question. Off-topic secondary question: sometimes I am puzzled by the output of the analysis page. It seems like there should be a match, but I don't get the results during a search that I'd expect... 
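A sketch of the multi-stemmer-field approach Em describes (the field and type names here are hypothetical): the source text is copied into two fields, each analyzed with a different stemmer, and queries search both.

```xml
<!-- Hypothetical schema.xml fragment: one stored source field copied
     into two unstored, differently stemmed fields. text_kstem and
     text_porter would be field types whose analyzers differ only in
     the stemming filter. -->
<field name="title"        type="string"      indexed="false" stored="true"/>
<field name="title_kstem"  type="text_kstem"  indexed="true"  stored="false"/>
<field name="title_porter" type="text_porter" indexed="true"  stored="false"/>
<copyField source="title" dest="title_kstem"/>
<copyField source="title" dest="title_porter"/>
```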
Like in the case where the WordDelimiterFilterFactory splits up a term into a bunch of terms before the K-stemmer is applied: sometimes if the matching term is in position two of the final analysis, but the searcher had the partial term alone and thereby in position 1 in the analysis stack, then when searching there wasn't a match. Am I reading this correctly? Is that right, or should that match and I am misreading my analysis output? Thanks! Robi PS I have a category named Bags and am catching flak for it not coming up in a search for bag. hah PPS the term is not in protwords.txt com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt} term position 1 | term text bags | term type word | start,end 0,4 | payload -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, April 20, 2011 10:55 AM To: solr-user@lucene.apache.org Subject: Re: stemming filter analyzers, any favorites? You can get a better sense of exactly what transformations occur when if you look at the analysis page (be sure to check the verbose checkbox). I'm surprised that bags doesn't match bag; what does the analysis page say? Best Erick On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen rober...@buy.com wrote: Stemming filter analyzers... anyone have any favorites for particular search domains? Just wondering what people are using. I'm using Lucid K Stemmer and having issues. Seems like it misses a lot of common stems. We went to that because of excessively loose matches with solr.PorterStemFilterFactory. I understand K Stemmer is a dictionary-based stemmer. Seems to me like it is missing a lot of common stem reductions. I.e., Bags does not match Bag in our searches. 
Here is my analyzer stack:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms
RE: stemming filter analyzers, any favorites?
Nice! Thanks! -Original Message- From: Em [mailto:mailformailingli...@yahoo.de] Sent: Thursday, April 21, 2011 9:23 AM To: solr-user@lucene.apache.org Subject: RE: stemming filter analyzers, any favorites? As far as I know Lucene does not store an inverted index per field, so no, it would not double the size of the index. However, it could influence the score a little bit. For example: if both stemmers reduce schools to school and you are searching for all schools in america, the term school has more weight in the resulting score, since it definitely occurs in two fields which consist of nearly the same value. To reduce this effect you could write your own query parser which creates a DisjunctionMaxQuery consisting of two boolean queries and a tie-break of 0, so only the better-scoring stemmed field contributes to the total score of your document. Regards, Em Robert Petersen-3 wrote: Adding another field with another stemmer and searching both??? Wow, never thought of doing that. I guess that doesn't really double the size of your index though, because all the terms are almost the same, right? Let me look into that. I'll raise the other issue in a separate thread, and thanks. -Original Message- From: Em [mailto:mailformailingli...@yahoo.de] Sent: Thursday, April 21, 2011 1:55 AM To: solr-user@lucene.apache.org Subject: RE: stemming filter analyzers, any favorites? Hi Robert, we often ran into the same issue with stemmers. This is why we created more than one field, each field with a different stemmer. It adds some overhead but worked quite well. Regarding your off-topic question: Look at the debugging output of your searches. Sometimes you configure your tools, especially the WDF, wrongly, and the query parser creates an unexpected result which leads to unmatched but still relevant documents. Please show us your debugging output and the field definition so that we can provide you some help! 
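For reference, Solr's stock dismax handler can approximate what Em describes without writing a custom query parser: qf lists both stemmed fields, and tie=0 means only the best-scoring field contributes per term. A sketch (the field names are hypothetical):

```text
# Hypothetical dismax request over two differently stemmed fields;
# with tie=0, only the highest-scoring of the two fields contributes
# to each term's score.
http://localhost:8983/solr/select?defType=dismax&tie=0&qf=title_kstem+title_porter&q=all+schools+in+america
```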
Regards, Em Robert Petersen-3 wrote: I have been doing that, and for Bags example the trailing 's' is not being removed by the Kstemmer so if indexing the word bags and searching on bag you get no matches. Why wouldn't the trailing 's' get stemmed off? Kstemmer is dictionary based so bags isn't in the dictionary? That trailing 's' should always be dropped no? That seems like it would be better, we don't want to make synonyms for basic use cases like this. I fear I will have to return to the Porter stemmer. Are there other better ones is my main question. Off topic secondary question: sometimes I am puzzled by the output of the analysis page. It seems like there should be a match, but I don't get the results during a search that I'd expect... Like in the case if the WordDelimiterFilterFactory splits up a term into a bunch of terms before the K-stemmer is applied, sometimes if the matching term is in position two of the final analysis but the searcher had the partial term just alone and so thereby in position 1 in the analysis stack then when searching there wasn't a match. Am I reading this correctly? Is that right or should that match and I am misreading my analysis output? Thanks! Robi PS I have a category named Bags and am catching flack for it not coming up in a search for bag. hah PPS the term is not in protwords.txt com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt} term position1 term textbags term typeword source start,end 0,4 payload -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, April 20, 2011 10:55 AM To: solr-user@lucene.apache.org Subject: Re: stemming filter analyzers, any favorites? You can get a better sense of exactly what tranformations occur when if you look at the analysis page (be sure to check the verbose checkbox). I'm surprised that bags doesn't match bag, what does the analysis page say? 
Best Erick On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen lt;rober...@buy.comgt; wrote: Stemming filter analyzers... anyone have any favorites for particular search domains? Just wondering what people are using. I'm using Lucid K Stemmer and having issues. Seems like it misses a lot of common stems. We went to that because of excessively loose matches on the solr.PorterStemFilterFactory I understand K Stemmer is a dictionary based stemmer. Seems to me like it is missing a lot of common stem reductions. Ie Bags does not match Bag in our searches. Here is my analyzer stack: fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=true
term position question from analyzer stack for WordDelimiterFilterFactory
So if I don't put preserveOriginal=1 in my WordDelimiterFilterFactory settings, I cannot get a match between AppleTV on the indexing side and appletv on the search side. Without that setting, the all-lowercase version of AppleTV is in term position two due to the catenateWords=1 or the catenateAll=1 settings. I am surprised. How does term position affect searching? Here is my analysis with preserveOriginal=1 to make the lower case occur in both term positions 1 and 2:
Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 1 | term text AppleTV | term type word | start,end 0,7 | payload
org.apache.solr.analysis.SynonymFilterFactory {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
term position 1 | term text AppleTV | term type word | start,end 0,7 | payload
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
term position 1 | term text AppleTV | term type word | start,end 0,7 | payload
org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, catenateNumbers=1}
term position 1 2 | term text AppleTV TV Apple AppleTV | term type word word word word | start,end 0,7 5,7 0,5 0,7 | payload
org.apache.solr.analysis.LowerCaseFilterFactory {}
term position 1 2 | term text appletv tv apple appletv | term type word word word word | start,end 0,7 5,7 0,5 0,7 | payload
com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
term position 1 2 | term text appletv tv apple appletv | term type word word word word | start,end 0,7 5,7 0,5 0,7 | payload
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position 1 2 | term text appletv tv apple appletv | term type word word word word | start,end 0,7 5,7 0,5 0,7 | payload
Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 1 | term text appletv | term type word | start,end 0,7 | payload
org.apache.solr.analysis.SynonymFilterFactory {synonyms=query_synonyms.txt, expand=true, ignoreCase=true}
term position 1 | term text appletv | term type word | start,end 0,7 | payload
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
term position 1 | term text appletv | term type word | start,end 0,7 | payload
org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, catenateNumbers=1}
term position 1 | term text appletv | term type word | start,end 0,7 | payload
org.apache.solr.analysis.LowerCaseFilterFactory {}
term position 1 | term text appletv | term type word | start,end 0,7 | payload
com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
term position 1 | term text appletv | term type word | start,end 0,7 | payload
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position 1 | term text appletv | term type word | start,end 0,7 | payload
stemming filter analyzers, any favorites?
Stemming filter analyzers... anyone have any favorites for particular search domains? Just wondering what people are using. I'm using Lucid K Stemmer and having issues. Seems like it misses a lot of common stems. We went to that because of excessively loose matches with solr.PorterStemFilterFactory. I understand K Stemmer is a dictionary-based stemmer. Seems to me like it is missing a lot of common stem reductions. I.e., Bags does not match Bag in our searches. Here is my analyzer stack:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
RE: stemming filter analyzers, any favorites?
I have been doing that, and for the Bags example the trailing 's' is not being removed by the Kstemmer, so if indexing the word bags and searching on bag you get no matches. Why wouldn't the trailing 's' get stemmed off? Kstemmer is dictionary based, so bags isn't in the dictionary? That trailing 's' should always be dropped, no? That seems like it would be better; we don't want to make synonyms for basic use cases like this. I fear I will have to return to the Porter stemmer. Are there other better ones is my main question. Off-topic secondary question: sometimes I am puzzled by the output of the analysis page. It seems like there should be a match, but I don't get the results during a search that I'd expect... Like in the case where the WordDelimiterFilterFactory splits up a term into a bunch of terms before the K-stemmer is applied: sometimes if the matching term is in position two of the final analysis, but the searcher had the partial term alone and thereby in position 1 in the analysis stack, then when searching there wasn't a match. Am I reading this correctly? Is that right, or should that match and I am misreading my analysis output? Thanks! Robi PS I have a category named Bags and am catching flak for it not coming up in a search for bag. hah PPS the term is not in protwords.txt com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt} term position 1 | term text bags | term type word | start,end 0,4 | payload -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, April 20, 2011 10:55 AM To: solr-user@lucene.apache.org Subject: Re: stemming filter analyzers, any favorites? You can get a better sense of exactly what transformations occur when if you look at the analysis page (be sure to check the verbose checkbox). I'm surprised that bags doesn't match bag; what does the analysis page say? 
Best
Erick

On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen rober...@buy.com wrote:
Stemming filter analyzers... anyone have any favorites for particular search domains? Just wondering what people are using. I'm using the Lucid K Stemmer and having issues. Seems like it misses a lot of common stems. We went to it because of excessively loose matches on the solr.PorterStemFilterFactory. I understand K Stemmer is a dictionary based stemmer, but it seems to me like it is missing a lot of common stem reductions, i.e. Bags does not match Bag in our searches. Here is my analyzer stack:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
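[Editor's note] The behavior discussed in this thread can be sketched in a few lines: a rule-based stemmer like Porter strips suffixes mechanically, while a dictionary-driven stemmer like KStem only rewrites words its lexicon maps to a different headword. This is an illustration only; the tiny lexicon below is hypothetical, not KStem's actual dictionary.

```python
def rule_based_stem(word):
    """Naive Porter-style rules: mechanically strip plural suffixes."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

# A dictionary-driven stemmer passes through any word its lexicon does
# not map to a different stem -- which is how 'bags' can survive intact
# while 'counseling' is reduced to 'counsel'.
LEXICON = {"counseling": "counsel", "running": "run"}

def dictionary_stem(word, lexicon=LEXICON):
    return lexicon.get(word, word)

print(rule_based_stem("bags"))        # rule fires: strips the 's'
print(dictionary_stem("bags"))        # no lexicon entry: unchanged
print(dictionary_stem("counseling"))  # lexicon entry: reduced
```

This is why the "dictionary based so bags isn't in the dictionary" guess in the thread is plausible: whether a given surface form stems at all depends entirely on what the lexicon says about it, not on any general suffix rule.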
RE: what happens to docsPending if stop solr before commit
Oh woe is me... lol NP good to know. I'll get them on the next go 'round. :) Thanks for the answer! -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, April 06, 2011 6:05 AM To: solr-user@lucene.apache.org Subject: Re: what happens to docsPending if stop solr before commit They're lost, never to be seen again. You'll have to reindex them. Best Erick On Tue, Apr 5, 2011 at 4:25 PM, Robert Petersen rober...@buy.com wrote: Hello fellow enthusiastic solr users, I tried to find the answer to this simple question online, but failed. I was wondering about this, what happens to uncommitted docsPending if I stop solr and then restart solr? Are they lost? Are they still there but still uncommitted? Do they get committed at startup? I noticed after a restart my 250K pending doc count went to 0 is what got me wondering. TIA! Robi
RE: what happens to docsPending if stop solr before commit
Really? Great! I was wondering if there was some cleanup cycle like that which would occur upon shutdown. That sounds like much more logical behavior! -Original Message- From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] Sent: Wednesday, April 06, 2011 4:03 PM To: solr-user@lucene.apache.org Subject: Re: what happens to docsPending if stop solr before commit (11/04/06 5:25), Robert Petersen wrote: I tried to find the answer to this simple question online, but failed. I was wondering about this, what happens to uncommitted docsPending if I stop solr and then restart solr? Are they lost? Are they still there but still uncommitted? Do they get committed at startup? I noticed after a restart my 250K pending doc count went to 0 is what got me wondering. Robi, Usually they are never lost, but they are committed. When you stop Solr, servlet container (Jetty) calls servlets/filters destroy() methods. This causes closing all SolrCores. Then SolrCore.close() calls UpdateHandler.close(). It calls SolrIndexWriter.close(). Then pending docs are flushed, then committed. Koji -- http://www.rondhuit.com/en/
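[Editor's note] Koji's explanation of the shutdown path (SolrCore.close() -> UpdateHandler.close() -> SolrIndexWriter.close(), which flushes and commits pending docs) can be modeled with a toy writer. This is a sketch of the semantics only, not Solr code: a clean close commits the buffer, while a hard kill (close never called) would lose whatever was still pending.

```python
class ToyIndexWriter:
    """Toy model of flush-on-close shutdown semantics."""

    def __init__(self):
        self.committed = []   # the durable index
        self.pending = []     # uncommitted docsPending

    def add(self, doc):
        self.pending.append(doc)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending.clear()

    def close(self):
        # Mirrors Koji's description: closing the writer flushes
        # pending docs, then commits them.
        self.commit()

w = ToyIndexWriter()
w.add({"id": 1})
w.add({"id": 2})
w.close()                 # clean shutdown: pending docs become durable
print(len(w.committed))   # 2
```

A process kill would skip `close()` entirely, leaving `pending` unflushed, which matches the "lost, never to be seen again" outcome Erick describes for an unclean shutdown.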
what happens to docsPending if stop solr before commit
Hello fellow enthusiastic solr users, I tried to find the answer to this simple question online, but failed. I was wondering about this, what happens to uncommitted docsPending if I stop solr and then restart solr? Are they lost? Are they still there but still uncommitted? Do they get committed at startup? I noticed after a restart my 250K pending doc count went to 0 is what got me wondering. TIA! Robi
RE: FW: no results searching for stadium seating chairs
Thanks for the input! We've discussed using synonyms to help here. We have product managers who are supposed to add keywords on to skus also which our indexer will automatically consume. Getting them to do that is a different matter! haha -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Tuesday, March 29, 2011 11:19 AM To: solr-user@lucene.apache.org Subject: Re: FW: no results searching for stadium seating chairs It seems unlikely you are going to find something that stems everything exactly how you want it, and nothing how you don't want it. This is very domain dependent, as you've discovered. I doubt there's even such a thing as the way everyone doing a 'retail product title search' would want it, it's going to vary. You could use the synonym feature to make your own stemming dictionary, tell it to stem seating to seat. Of course, that's also very expensive in terms of your time, to create your own custom dictionary. But you're going to have to live with one of the compromises, software cant' do magic! For particular titles, you could also, in your own metadata control, add alternate titles that you want it to match on, before it even gets indexed. On 3/29/2011 1:43 PM, Robert Petersen wrote: For retail product title search, would there be a better stemmer to use? We wanted a less aggressive stemmer, but I would expect the term seating to stem. I have found several other words which end in ing and do not get stemmed. Amongst our product lines are four million books with all kinds of crazy titles, like the following oddity! Here counseling stems and unknowing doesn't: 1. 
The Cloud of Unknowing and the Book of Privy Counseling Buy New: $29.95 $18.30 3 New and Used from $18.30 -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, March 29, 2011 10:27 AM To: solr-user@lucene.apache.org Cc: Robert Petersen Subject: Re: FW: no results searching for stadium seating chairs On Tue, Mar 29, 2011 at 1:17 PM, Robert Petersenrober...@buy.com wrote: Very interestingly, LucidKStemFilterFactory is stemming 'ing's differently for different words. The word 'seating' doesn't lose the 'ing' but the word 'counseling' does! Can anyone explain the difference here? protwords.txt is empty btw. KStem is dictionary driven, so seating is probably in the dictionary. I guess the author decided that seating and seat were sufficiently different. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
RE: FW: no results searching for stadium seating chairs
Wow that sounds rad! -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Wednesday, March 30, 2011 9:39 AM To: solr-user@lucene.apache.org Subject: Re: FW: no results searching for stadium seating chairs There are some new features in 3.1 to make it easier to tune this stuff, especially: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_1/solr/src/java/org/apache/solr/analysis/StemmerOverrideFilterFactory.java This takes a tab separate list of words-stems, and sets a flag to any downstream stemmer to not mess with any of your mappings (thus the name: StemmerOverrideFilter). So the idea is you pick a stemmer thats close to what you want, then you put this filter before it to tune it to your needs. On Wed, Mar 30, 2011 at 12:05 PM, Robert Petersen rober...@buy.com wrote: Thanks for the input! We've discussed using synonyms to help here. We have product managers who are supposed to add keywords on to skus also which our indexer will automatically consume. Getting them to do that is a different matter! haha -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Tuesday, March 29, 2011 11:19 AM To: solr-user@lucene.apache.org Subject: Re: FW: no results searching for stadium seating chairs It seems unlikely you are going to find something that stems everything exactly how you want it, and nothing how you don't want it. This is very domain dependent, as you've discovered. I doubt there's even such a thing as the way everyone doing a 'retail product title search' would want it, it's going to vary. You could use the synonym feature to make your own stemming dictionary, tell it to stem seating to seat. Of course, that's also very expensive in terms of your time, to create your own custom dictionary. But you're going to have to live with one of the compromises, software cant' do magic! 
For particular titles, you could also, in your own metadata control, add alternate titles that you want it to match on, before it even gets indexed. On 3/29/2011 1:43 PM, Robert Petersen wrote: For retail product title search, would there be a better stemmer to use? We wanted a less aggressive stemmer, but I would expect the term seating to stem. I have found several other words which end in ing and do not get stemmed. Amongst our product lines are four million books with all kinds of crazy titles, like the following oddity! Here counseling stems and unknowing doesn't: 1. The Cloud of Unknowing and the Book of Privy Counseling Buy New: $29.95 $18.30 3 New and Used from $18.30 -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, March 29, 2011 10:27 AM To: solr-user@lucene.apache.org Cc: Robert Petersen Subject: Re: FW: no results searching for stadium seating chairs On Tue, Mar 29, 2011 at 1:17 PM, Robert Petersenrober...@buy.com wrote: Very interestingly, LucidKStemFilterFactory is stemming 'ing's differently for different words. The word 'seating' doesn't lose the 'ing' but the word 'counseling' does! Can anyone explain the difference here? protwords.txt is empty btw. KStem is dictionary driven, so seating is probably in the dictionary. I guess the author decided that seating and seat were sufficiently different. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
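[Editor's note] Based on Robert Muir's description above, the StemmerOverrideFilter might be wired into an analyzer something like the following. This is a hedged sketch: the filename stemdict.txt and its entries are made up for illustration, and the exact attribute set should be checked against the 3.1 docs. The dictionary file holds tab-separated word/stem pairs, and the filter flags its mappings so downstream stemmers leave them alone.

```xml
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- stemdict.txt (hypothetical) pins e.g. "seating<TAB>seat" and
       protects those tokens from the stemmer below -->
  <filter class="solr.StemmerOverrideFilterFactory"
          dictionary="stemdict.txt" ignoreCase="true"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```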
FW: no results searching for stadium seating chairs
Very interestingly, LucidKStemFilterFactory is stemming 'ing's differently for different words. The word 'seating' doesn't lose the 'ing' but the word 'counseling' does! Can anyone explain the difference here? protwords.txt is empty btw.

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
term position: 1, 2
term text: privy, counsel
term type: word, word
source start,end: 0,5 6,16

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
term position: 1
term text: seating
term type: word
source start,end: 0,7
RE: FW: no results searching for stadium seating chairs
For retail product title search, would there be a better stemmer to use? We wanted a less aggressive stemmer, but I would expect the term seating to stem. I have found several other words which end in ing and do not get stemmed. Amongst our product lines are four million books with all kinds of crazy titles, like the following oddity! Here counseling stems and unknowing doesn't: 1. The Cloud of Unknowing and the Book of Privy Counseling Buy New: $29.95 $18.30 3 New and Used from $18.30 -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, March 29, 2011 10:27 AM To: solr-user@lucene.apache.org Cc: Robert Petersen Subject: Re: FW: no results searching for stadium seating chairs On Tue, Mar 29, 2011 at 1:17 PM, Robert Petersen rober...@buy.com wrote: Very interestingly, LucidKStemFilterFactory is stemming 'ing's differently for different words. The word 'seating' doesn't lose the 'ing' but the word 'counseling' does! Can anyone explain the difference here? protwords.txt is empty btw. KStem is dictionary driven, so seating is probably in the dictionary. I guess the author decided that seating and seat were sufficiently different. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
RE: Different options for autocomplete/autosuggestion
I take raw user search term data, 'collapse' it into a form where I have only unique terms, per store, ordered by frequency of searches over some time period. The suggestions are then grouped and presented with store breakouts. That sounds kind of like what this page is talking about here, but I could be using the wrong terminology: http://wiki.apache.org/solr/FieldCollapsing -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Tuesday, March 15, 2011 9:00 PM To: solr-user@lucene.apache.org Subject: Re: Different options for autocomplete/autosuggestion Hi, I actually don't follow how field collapsing helps with autocompletion...? Over at http://search-lucene.com we eat our own autocomplete dog food: http://sematext.com/products/autocomplete/index.html . Tasty stuff. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Kai Schlamp schl...@gmx.de To: solr-user@lucene.apache.org Sent: Mon, March 14, 2011 11:52:48 PM Subject: Re: Different options for autocomplete/autosuggestion @Robert: That sounds interesting and very flexible, but also like a lot of work. This approach also doesn't seem to allow querying Solr directly by using Ajax ... one of the big benefits in my opinion when using Solr. @Bill: There are some things I don't like about the Suggester component. It doesn't seem to allow infix searches (at least it is not mentioned in the Wiki or elsewhere). It also uses a separate index that has to be rebuild independently of the main index. And it doesn't support any filter queries. The Lucid Imagination blog also describes a further autosuggest approach (http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popu lar-queries-using-edgengrams/). The disadvantage here is that the source documents must have distinct fields (resp. the dih selects must provide distinct data). 
Otherwise duplications would come up in the Solr query result, cause of the document nature of Solr. In my opinion field collapsing seems to be most promising for a full featured autosuggestion solution. Unfortunately it is not available for Solr 1.4.x or 3.x (I tried patching those branches several times without success). 2011/3/15 Bill Bell billnb...@gmail.com: http://lucidworks.lucidimagination.com/display/LWEUG/Spell+Checking+and+ Aut omatic+Completion+of+User+Queries For Auto-Complete, find the following section in the solrconfig.xml file for the collection: !-- Auto-Complete component -- searchComponent name=autocomplete class=solr.SpellCheckComponent lst name=spellchecker str name=nameautocomplete/str str name=classnameorg.apache.solr.spelling.suggest.Suggester/str str name=lookupImplorg.apache.solr.spelling.suggest.jaspell.JaspellLookup /s tr str name=fieldautocomplete/str str name=buildOnCommittrue/str !-- str name=sourceLocationamerican-english/str -- /lst On 3/14/11 8:16 PM, Andy angelf...@yahoo.com wrote: Can you provide more details? Or a link? --- On Mon, 3/14/11, Bill Bell billnb...@gmail.com wrote: See how Lucid Enterprise does it... A bit differently. On 3/14/11 12:14 AM, Kai Schlamp kai.schl...@googlemail.com wrote: Hi. There seems to be several options for implementing an autocomplete/autosuggestions feature with Solr. I am trying to summarize those possibilities together with their advantages and disadvantages. It would be really nice to read some of your opinions. * Using N-Gram filter + text field query + available in stable 1.4.x + results can be boosted + sorted by best matches - may return duplicate results * Facets + available in stable 1.4.x + no duplicate entries - sorted by count - may need an extra N-Gram field for infix queries * Terms + available in stable 1.4.x + infix query by using regex in 3.x - only prefix query in 1.4.x - regexp may be slow (just a guess) * Suggestions ? Did not try that yet. Does it allow infix queries? 
* Field Collapsing + no duplications - only available in 4.x branch ? Does it work together with highlighting? That would be a big plus. What are your experiences regarding autocomplete/autosuggestion with Solr? Any additions, suggestions or corrections? What do you prefer? Kai -- Dr. med. Kai Schlamp Am Fort Elisabeth 17 55131 Mainz Germany Phone +49-177-7402778 Email: schl...@gmx.de
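[Editor's note] The edge-n-gram option at the top of Kai's list can be sketched in a few lines: index every leading prefix of each suggestion term, so autocompletion becomes an exact lookup on whatever the user has typed so far. Illustrative only; in Solr this indexing step is what an EdgeNGram filter does for you.

```python
from collections import defaultdict

def build_prefix_index(terms):
    """Map every leading prefix of each term to the terms it matches."""
    index = defaultdict(set)
    for term in terms:
        for i in range(1, len(term) + 1):
            index[term[:i].lower()].add(term)
    return index

index = build_prefix_index(["seating", "seat", "stadium seating"])
print(sorted(index["sea"]))   # terms whose prefix is 'sea'
```

Using a set per prefix also shows why the "may return duplicate results" caveat applies at the document level in Solr: the n-grams themselves dedupe per term, but multiple documents sharing a term each come back as a hit.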
i don't get why my index didn't grow more...
OK, I have a 30 gb index where there are lots of sparsely populated int fields, one title field, and one catchall field with the title and everything else we want as keywords. I figure the catchall is the biggest field in our documents, which as I mentioned are otherwise composed of a variety of int fields and a title. So my puzzlement is this: my biggest field is copied into a double metaphone field, and now I added another copyField to also copy the catchall field into a newly created soundex field, for an experiment to compare the effectiveness of the two. I expected the index to grow by at least 25% to 30%, but it barely grew at all. Can someone explain this to me? Thanks! J
RE: Different options for autocomplete/autosuggestion
I like field collapsing because that way my suggestions gives phrase results (ie the suggestion starts with what the user has typed so far) and thus I limit suggestions to be in the order of the words typed. I think that looks better for our retail oriented site. I populate the index with previous user queries. I just put wildcards on the end of the collapsed version what the user has typed so far. It is very fast. I make suggestions for every keystroke as a user types in his query on our site. Hope that helps. -Original Message- From: Kai Schlamp [mailto:kai.schl...@googlemail.com] Sent: Sunday, March 13, 2011 11:14 PM To: solr-user@lucene.apache.org Subject: Different options for autocomplete/autosuggestion Hi. There seems to be several options for implementing an autocomplete/autosuggestions feature with Solr. I am trying to summarize those possibilities together with their advantages and disadvantages. It would be really nice to read some of your opinions. * Using N-Gram filter + text field query + available in stable 1.4.x + results can be boosted + sorted by best matches - may return duplicate results * Facets + available in stable 1.4.x + no duplicate entries - sorted by count - may need an extra N-Gram field for infix queries * Terms + available in stable 1.4.x + infix query by using regex in 3.x - only prefix query in 1.4.x - regexp may be slow (just a guess) * Suggestions ? Did not try that yet. Does it allow infix queries? * Field Collapsing + no duplications - only available in 4.x branch ? Does it work together with highlighting? That would be a big plus. What are your experiences regarding autocomplete/autosuggestion with Solr? Any additions, suggestions or corrections? What do you prefer? Kai
RE: Different options for autocomplete/autosuggestion
I am doing this very differently. We are on solr 1.4.0 and I accomplish the collapsing in my wrapper layer. I have written a layer of code around SOLR, an indexer on one end and a search service wrapping solrs on the other end. I manually collapse the field in my code. I keep both a collapsed and uncollapsed version of the phrase in my index, the uncollapsed is the only one stored for retrieval btw. I do this on both ends so I have complete control here... works well! Different than a patch of course tho. -Original Message- From: kai.schl...@googlemail.com [mailto:kai.schl...@googlemail.com] On Behalf Of Kai Schlamp Sent: Monday, March 14, 2011 2:12 PM To: solr-user@lucene.apache.org Subject: Re: Different options for autocomplete/autosuggestion Robert, thanks for your answer. What Solr version do you use? 4.0? As mentioned in my other post here I tried to patch 1.4 for using field collapsing, but couldn't get it to work (compiled fine, but collapsed parameters seems to be completely ignored). 2011/3/14 Robert Petersen rober...@buy.com: I like field collapsing because that way my suggestions gives phrase results (ie the suggestion starts with what the user has typed so far) and thus I limit suggestions to be in the order of the words typed. I think that looks better for our retail oriented site. I populate the index with previous user queries. I just put wildcards on the end of the collapsed version what the user has typed so far. It is very fast. I make suggestions for every keystroke as a user types in his query on our site. Hope that helps. -Original Message- From: Kai Schlamp [mailto:kai.schl...@googlemail.com] Sent: Sunday, March 13, 2011 11:14 PM To: solr-user@lucene.apache.org Subject: Different options for autocomplete/autosuggestion Hi. There seems to be several options for implementing an autocomplete/autosuggestions feature with Solr. I am trying to summarize those possibilities together with their advantages and disadvantages. 
It would be really nice to read some of your opinions. * Using N-Gram filter + text field query + available in stable 1.4.x + results can be boosted + sorted by best matches - may return duplicate results * Facets + available in stable 1.4.x + no duplicate entries - sorted by count - may need an extra N-Gram field for infix queries * Terms + available in stable 1.4.x + infix query by using regex in 3.x - only prefix query in 1.4.x - regexp may be slow (just a guess) * Suggestions ? Did not try that yet. Does it allow infix queries? * Field Collapsing + no duplications - only available in 4.x branch ? Does it work together with highlighting? That would be a big plus. What are your experiences regarding autocomplete/autosuggestion with Solr? Any additions, suggestions or corrections? What do you prefer? Kai -- Dr. med. Kai Schlamp Am Fort Elisabeth 17 55131 Mainz Germany Phone +49-177-7402778 Email: schl...@gmx.de
RE: Different options for autocomplete/autosuggestion
Note that due to the 'raw' nature of my source data I also have to heavily filter my data before collapsing it also. I don't want to suggest garbage phrases just because a lot of people searched on them. We store auxiliary data in the index for filtering on to perform the grouping. -Original Message- From: Robert Petersen [mailto:rober...@buy.com] Sent: Monday, March 14, 2011 4:25 PM To: solr-user@lucene.apache.org Subject: RE: Different options for autocomplete/autosuggestion I am doing this very differently. We are on solr 1.4.0 and I accomplish the collapsing in my wrapper layer. I have written a layer of code around SOLR, an indexer on one end and a search service wrapping solrs on the other end. I manually collapse the field in my code. I keep both a collapsed and uncollapsed version of the phrase in my index, the uncollapsed is the only one stored for retrieval btw. I do this on both ends so I have complete control here... works well! Different than a patch of course tho. -Original Message- From: kai.schl...@googlemail.com [mailto:kai.schl...@googlemail.com] On Behalf Of Kai Schlamp Sent: Monday, March 14, 2011 2:12 PM To: solr-user@lucene.apache.org Subject: Re: Different options for autocomplete/autosuggestion Robert, thanks for your answer. What Solr version do you use? 4.0? As mentioned in my other post here I tried to patch 1.4 for using field collapsing, but couldn't get it to work (compiled fine, but collapsed parameters seems to be completely ignored). 2011/3/14 Robert Petersen rober...@buy.com: I like field collapsing because that way my suggestions gives phrase results (ie the suggestion starts with what the user has typed so far) and thus I limit suggestions to be in the order of the words typed. I think that looks better for our retail oriented site. I populate the index with previous user queries. I just put wildcards on the end of the collapsed version what the user has typed so far. It is very fast. 
I make suggestions for every keystroke as a user types in his query on our site. Hope that helps. -Original Message- From: Kai Schlamp [mailto:kai.schl...@googlemail.com] Sent: Sunday, March 13, 2011 11:14 PM To: solr-user@lucene.apache.org Subject: Different options for autocomplete/autosuggestion Hi. There seems to be several options for implementing an autocomplete/autosuggestions feature with Solr. I am trying to summarize those possibilities together with their advantages and disadvantages. It would be really nice to read some of your opinions. * Using N-Gram filter + text field query + available in stable 1.4.x + results can be boosted + sorted by best matches - may return duplicate results * Facets + available in stable 1.4.x + no duplicate entries - sorted by count - may need an extra N-Gram field for infix queries * Terms + available in stable 1.4.x + infix query by using regex in 3.x - only prefix query in 1.4.x - regexp may be slow (just a guess) * Suggestions ? Did not try that yet. Does it allow infix queries? * Field Collapsing + no duplications - only available in 4.x branch ? Does it work together with highlighting? That would be a big plus. What are your experiences regarding autocomplete/autosuggestion with Solr? Any additions, suggestions or corrections? What do you prefer? Kai -- Dr. med. Kai Schlamp Am Fort Elisabeth 17 55131 Mainz Germany Phone +49-177-7402778 Email: schl...@gmx.de
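[Editor's note] The manual "collapse in the wrapper layer" approach Robert describes in this thread can be sketched as follows: keep a normalized (collapsed) key alongside the stored original (uncollapsed) phrase, count frequency over the raw query log, and suggest by prefix match on the collapsed key, most-frequent first. The normalization rule here (lowercase, squeeze whitespace) is an assumption; his actual collapsing logic isn't shown in the thread.

```python
from collections import Counter

def collapse(phrase):
    """Assumed normalization: lowercase and squeeze whitespace."""
    return " ".join(phrase.lower().split())

def build_suggester(raw_queries):
    counts = Counter()
    display = {}                       # collapsed key -> stored original
    for q in raw_queries:
        key = collapse(q)
        counts[key] += 1
        display.setdefault(key, q.strip())   # keep uncollapsed form for retrieval
    return counts, display

def suggest(prefix, counts, display, limit=5):
    key = collapse(prefix)
    hits = [(c, k) for k, c in counts.items() if k.startswith(key)]
    hits.sort(key=lambda t: (-t[0], t[1]))   # frequency desc, then alpha
    return [display[k] for _, k in hits[:limit]]

counts, display = build_suggester(
    ["office  chair", "office chair", "Office Chairs", "offset printing"]
)
print(suggest("office", counts, display))
```

The prefix match on the collapsed key is what makes suggestions "start with what the user has typed so far", and ordering by frequency is the "ordered by frequency of searches" step from the earlier autocomplete thread.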
RE: True master-master fail-over without data gaps
If you have a wrapper, like an indexer app which prepares solr docs and sends them into solr, then it is simple. The wrapper is your 'tee' and it can send docs to both (or N) masters. -Original Message- From: Michael Sokolov [mailto:soko...@ifactory.com] Sent: Wednesday, March 09, 2011 4:14 AM To: solr-user@lucene.apache.org Cc: Jonathan Rochkind Subject: Re: True master-master fail-over without data gaps Yes, I think this should be pushed upstream - insert a tee in the document stream so that all documents go to both masters. Then use a load balancer to make requests of the masters. The tee itself then becomes a possible single point of failure, but you didn't say anything about the architecture of the document feed. Is that also fault-tolerant? -Mike On 3/9/2011 1:06 AM, Jonathan Rochkind wrote: I'd honestly think about buffer the incoming documents in some store that's actually made for fail-over persistence reliability, maybe CouchDB or something. And then that's taking care of not losing anything, and the problem becomes how we make sure that our solr master indexes are kept in sync with the actual persistent store; which I'm still not sure about, but I'm thinking it's a simpler problem. The right tool for the right job, that kind of failover persistence is not solr's specialty. From: Otis Gospodnetic [otis_gospodne...@yahoo.com] Sent: Tuesday, March 08, 2011 11:45 PM To: solr-user@lucene.apache.org Subject: True master-master fail-over without data gaps Hello, What are some common or good ways to handle indexing (master) fail-over? Imagine you have a continuous stream of incoming documents that you have to index without losing any of them (or with losing as few of them as possible). How do you set up you masters? In other words, you can't just have 2 masters where the secondary is the Repeater (or Slave) of the primary master and replicates the index periodically: you need to have 2 masters that are in sync at all times! How do you achieve that? 
* Do you just put N masters behind a LB VIP, configure them both to point to the index on some shared storage (e.g. SAN), and count on the LB to fail-over to the secondary master when the primary becomes unreachable? If so, how do you deal with index locks? You use the Native lock and count on it disappearing when the primary master goes down? That means you count on the whole JVM process dying, which may not be the case... * Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2 masters with 2 separate indices in sync, while making sure you write to only 1 of them via LB VIP or otherwise? * Or ... This thread is on a similar topic, but is inconclusive: http://search-lucene.com/m/aOsyN15f1qd1 Here is another similar thread, but this one doesn't cover how 2 masters are kept in sync at all times: http://search-lucene.com/m/aOsyN15f1qd1 Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
RE: True master-master fail-over without data gaps
Currently I use an application connected to a queue containing incoming data which my indexer app turns into solr docs. I log everything to a log table and have never had an issue with losing anything. I can trace incoming docs exactly, and keep timing data in there also. If I added a second solr url for a second master and resent the same doc to master02 that I sent to master01, I would expect near 100% synchronization. The problem here is how to get the slave farm to start replicating from the second master if and when the first goes down. I can only see that as being a manual operation, repointing the slaves to master02 and restarting or reloading them etc... -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 09, 2011 8:52 AM To: solr-user@lucene.apache.org Subject: Re: True master-master fail-over without data gaps Hi, - Original Message From: Robert Petersen rober...@buy.com To: solr-user@lucene.apache.org Sent: Wed, March 9, 2011 11:40:56 AM Subject: RE: True master-master fail-over without data gaps If you have a wrapper, like an indexer app which prepares solr docs and sends them into solr, then it is simple. The wrapper is your 'tee' and it can send docs to both (or N) masters. Doesn't this make it too easy for 2 masters to get out of sync even if the problem is not with them? e.g. something happens in this tee component and it indexes a doc to master A, but not master B. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -Original Message- From: Michael Sokolov [mailto:soko...@ifactory.com] Sent: Wednesday, March 09, 2011 4:14 AM To: solr-user@lucene.apache.org Cc: Jonathan Rochkind Subject: Re: True master-master fail-over without data gaps Yes, I think this should be pushed upstream - insert a tee in the document stream so that all documents go to both masters. Then use a load balancer to make requests of the masters. 
The tee itself then becomes a possible single point of failure, but you didn't say anything about the architecture of the document feed. Is that also fault-tolerant? -Mike On 3/9/2011 1:06 AM, Jonathan Rochkind wrote: I'd honestly think about buffer the incoming documents in some store that's actually made for fail-over persistence reliability, maybe CouchDB or something. And then that's taking care of not losing anything, and the problem becomes how we make sure that our solr master indexes are kept in sync with the actual persistent store; which I'm still not sure about, but I'm thinking it's a simpler problem. The right tool for the right job, that kind of failover persistence is not solr's specialty. From: Otis Gospodnetic [otis_gospodne...@yahoo.com] Sent: Tuesday, March 08, 2011 11:45 PM To: solr-user@lucene.apache.org Subject: True master-master fail-over without data gaps Hello, What are some common or good ways to handle indexing (master) fail-over? Imagine you have a continuous stream of incoming documents that you have to index without losing any of them (or with losing as few of them as possible). How do you set up you masters? In other words, you can't just have 2 masters where the secondary is the Repeater (or Slave) of the primary master and replicates the index periodically: you need to have 2 masters that are in sync at all times! How do you achieve that? * Do you just put N masters behind a LB VIP, configure them both to point to the index on some shared storage (e.g. SAN), and count on the LB to fail-over to the secondary master when the primary becomes unreachable? If so, how do you deal with index locks? You use the Native lock and count on it disappearing when the primary master goes down? That means you count on the whole JVM process dying, which may not be the case... * Or do you use tools like DRBD, Corosync, Pacemaker, etc. 
to keep 2 masters with 2 separate indices in sync, while making sure you write to only 1 of them via LB VIP or otherwise? * Or ... This thread is on a similar topic, but is inconclusive: http://search-lucene.com/m/aOsyN15f1qd1 Here is another similar thread, but this one doesn't cover how 2 masters are kept in sync at all times: http://search-lucene.com/m/aOsyN15f1qd1 Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
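[Editor's note] The "tee" idea from this thread reduces to a small loop in the indexing wrapper: send each document to every master and report per-master failures so the caller can reconcile later. This is a sketch only; the senders here are plain callables standing in for HTTP posts to real Solr masters, and the error handling is deliberately minimal.

```python
def tee_index(doc, masters):
    """Send doc to all masters; return names of masters that failed."""
    failed = []
    for name, send in masters.items():
        try:
            send(doc)
        except Exception:
            # This is the out-of-sync risk Otis raises: one master got
            # the doc and the other didn't. Record it for reconciliation.
            failed.append(name)
    return failed

# Two in-memory stand-ins for master01/master02.
received = {"master01": [], "master02": []}
masters = {
    "master01": received["master01"].append,
    "master02": received["master02"].append,
}

assert tee_index({"id": 1}, masters) == []   # both masters got the doc
```

Returning the failed list (rather than swallowing errors) is the key design point: the tee itself cannot guarantee sync, so it must at least surface which master missed which document.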
RE: True master-master fail-over without data gaps
...but the index resides on disk doesn't it??? lol -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 09, 2011 9:06 AM To: solr-user@lucene.apache.org Subject: Re: True master-master fail-over without data gaps Hi, - Original Message I'd honestly think about buffer the incoming documents in some store that's actually made for fail-over persistence reliability, maybe CouchDB or something. And then that's taking care of not losing anything, and the problem becomes how we make sure that our solr master indexes are kept in sync with the actual persistent store; which I'm still not sure about, but I'm thinking it's a simpler problem. The right tool for the right job, that kind of failover persistence is not solr's specialty. But check this! In some cases one is not allowed to save content to disk (think copyrights). I'm not making this up - we actually have a customer with this cannot save to disk (but can index) requirement. So buffering to disk is not an option, and buffering in memory is not practical because of the input document rate and their size. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Otis Gospodnetic [otis_gospodne...@yahoo.com] Sent: Tuesday, March 08, 2011 11:45 PM To: solr-user@lucene.apache.org Subject: True master-master fail-over without data gaps Hello, What are some common or good ways to handle indexing (master) fail-over? Imagine you have a continuous stream of incoming documents that you have to index without losing any of them (or with losing as few of them as possible). How do you set up you masters? In other words, you can't just have 2 masters where the secondary is the Repeater (or Slave) of the primary master and replicates the index periodically: you need to have 2 masters that are in sync at all times! How do you achieve that? 
* Do you just put N masters behind a LB VIP, configure them both to point to the index on some shared storage (e.g. SAN), and count on the LB to fail-over to the secondary master when the primary becomes unreachable? If so, how do you deal with index locks? You use the Native lock and count on it disappearing when the primary master goes down? That means you count on the whole JVM process dying, which may not be the case... * Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2 masters with 2 separate indices in sync, while making sure you write to only 1 of them via LB VIP or otherwise? * Or ... This thread is on a similar topic, but is inconclusive: http://search-lucene.com/m/aOsyN15f1qd1 Here is another similar thread, but this one doesn't cover how 2 masters are kept in sync at all times: http://search-lucene.com/m/aOsyN15f1qd1 Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
RE: True master-master fail-over without data gaps
I guess you could put a LB between slaves and masters, never thought of that! :) -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 09, 2011 9:10 AM To: solr-user@lucene.apache.org Subject: Re: True master-master fail-over without data gaps Hi, - Original Message Currently I use an application connected to a queue containing incoming data which my indexer app turns into solr docs. I log everything to a log table and have never had an issue with losing anything. Yeah, if everything goes through some storage that can be polled (either a DB or a durable JMS Topic or some such), then N masters could connect to it, not miss anything, and be more or less in near real-time sync. I can trace incoming docs exactly, and keep timing data in there also. If I added a second solr url for a second master and resent the same doc to master02 that I sent to master01, I would expect near 100% synchronization. The problem here is how to get the slave farm to start replicating from the second master if and when the first goes down. I can only see that as being a manual operation, repointing the slaves to master02 and restarting or reloading them etc... Actually, you can configure a LB to handle that, so that's less of a problem, I think. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, March 09, 2011 8:52 AM To: solr-user@lucene.apache.org Subject: Re: True master-master fail-over without data gaps Hi, - Original Message From: Robert Petersen rober...@buy.com To: solr-user@lucene.apache.org Sent: Wed, March 9, 2011 11:40:56 AM Subject: RE: True master-master fail-over without data gaps If you have a wrapper, like an indexer app which prepares solr docs and sends them into solr, then it is simple. The wrapper is your 'tee' and it can send docs to both (or N) masters. 
Doesn't this make it too easy for 2 masters to get out of sync even if the problem is not with them? E.g. something happens in this tee component and it indexes a doc to master A, but not master B.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

-Original Message-
From: Michael Sokolov [mailto:soko...@ifactory.com]
Sent: Wednesday, March 09, 2011 4:14 AM
To: solr-user@lucene.apache.org
Cc: Jonathan Rochkind
Subject: Re: True master-master fail-over without data gaps

Yes, I think this should be pushed upstream - insert a tee in the document stream so that all documents go to both masters. Then use a load balancer to make requests of the masters. The tee itself then becomes a possible single point of failure, but you didn't say anything about the architecture of the document feed. Is that also fault-tolerant?

-Mike

On 3/9/2011 1:06 AM, Jonathan Rochkind wrote:

I'd honestly think about buffering the incoming documents in some store that's actually made for fail-over persistence and reliability, maybe CouchDB or something. And then that takes care of not losing anything, and the problem becomes how we make sure that our Solr master indexes are kept in sync with the actual persistent store; which I'm still not sure about, but I'm thinking it's a simpler problem. The right tool for the right job: that kind of failover persistence is not Solr's specialty.

From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
Sent: Tuesday, March 08, 2011 11:45 PM
To: solr-user@lucene.apache.org
Subject: True master-master fail-over without data gaps

Hello,

What are some common or good ways to handle indexing (master) fail-over? Imagine you have a continuous stream of incoming documents that you have to index without losing any of them (or losing as few of them as possible). How do you set up your masters?
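Sokolov's "tee" is essentially a loop over per-master senders. A minimal sketch under assumptions (the sender callables are hypothetical, e.g. thin wrappers around each master's /update endpoint, and are not Solr API): it returns per-master outcomes so the caller can detect and replay the doc-reached-A-but-not-B case Otis worries about.

```python
def tee_index(doc, senders):
    """Send one document to every master (the 'tee' in the document stream).

    `senders` is a list of (name, callable) pairs, one per master.
    Returns {master_name: succeeded} so the caller can queue retries for
    any master that missed the doc, instead of silently drifting out of sync.
    """
    results = {}
    for name, send in senders:
        try:
            send(doc)
            results[name] = True
        except Exception:
            results[name] = False  # needs retry/replay to restore sync
    return results
```

The tee itself then needs its own retry queue (or a durable upstream store) to be more than a best-effort fan-out.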
In other words, you can't just have 2 masters where the secondary is the Repeater (or Slave) of the primary master and replicates the index periodically: you need to have 2 masters that are in sync at all times! How do you achieve that? * Do you just put N masters behind a LB VIP, configure them both to point to the index on some shared storage (e.g. SAN), and count on the LB to fail-over to the secondary master when the primary becomes unreachable? If so, how do you deal with index locks? You use the Native lock and count on it disappearing when the primary master goes down? That means you count on the whole JVM process dying, which may not be the case... * Or do you use tools like DRBD
RE: True master-master fail-over without data gaps (choosing CA in CAP)
Can't you skip the SAN and keep the indexes locally? Then you would have two redundant copies of the index and no lock issues. Also, can't master02 just be a slave to master01 (in the master farm and separate from the slave farm) until such time as master01 fails? Then master02 would start receiving the new documents with an index complete up to the last replication at least, and the other slaves would be directed by the LB to poll master02 also...

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: Wednesday, March 09, 2011 9:47 AM
To: solr-user@lucene.apache.org
Subject: Re: True master-master fail-over without data gaps (choosing CA in CAP)

Hi,

- Original Message
From: Walter Underwood wun...@wunderwood.org

On Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote: You mean it's not possible to have 2 masters that are in nearly real-time sync? How about with DRBD? I know people use DRBD to keep 2 Hadoop NNs (their edit logs) in sync to avoid the current NN SPOF, for example, so I'm thinking this could be doable with Solr masters, too, no?

If you add fault-tolerant, you run into the CAP Theorem. Consistency, availability, partition: choose two. You cannot have it all.

Right, so I'll take Consistency and Availability, and I'll put my 2 masters in the same rack (which has redundant switches, power supply, etc.) and thus minimize/avoid partitioning. Assuming the above actually works, I think my Q remains: how do you set up 2 Solr masters so they are in near real-time sync? DRBD?

But here is maybe a simpler scenario that more people may be considering: imagine 2 masters on 2 different servers in 1 rack, pointing to the same index on shared storage (SAN) that also happens to live in the same rack. The 2 Solr masters are behind 1 LB VIP that the indexer talks to.
The VIP is configured so that all requests always get routed to the primary master (because only 1 master can be modifying an index at a time), except when this primary is down, in which case the requests are sent to the secondary master. So in this case my Q is around automation of this, around Lucene index locks, around the need for manual intervention, and such. Concretely, if you have these 2 master instances, the primary master has the Lucene index lock in the index dir. When the secondary master needs to take over (i.e., when it starts receiving documents via LB), it needs to be able to write to that same index. But what if that lock is still around? One could use the Native lock to make the lock disappear if the primary master's JVM exited unexpectedly, and in that case everything *should* work and be completely transparent, right? That is, the secondary will start getting new docs, it will use its IndexWriter to write to that same shared index, which won't be locked for writes because the lock is gone, and everyone will be happy. Did I miss something important here? Assuming the above is correct, what if the lock is *not* gone because the primary master's JVM is actually not dead, although maybe unresponsive, so LB thinks the primary master is dead. Then the LB will route indexing requests to the secondary master, which will attempt to write to the index, but be denied because of the lock. So a human needs to jump in, remove the lock, and manually reindex failed docs if the upstream component doesn't buffer docs that failed to get indexed and doesn't retry indexing them automatically. Is this correct or is there a way to avoid humans here? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
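The takeover decision Otis walks through can be sketched as a check the fail-over automation might run. This is a sketch under assumptions, not Solr or Lucene API: it assumes a filesystem lock file named write.lock in the shared index dir, and that with native locking the OS releases the lock when the JVM process dies (the "transparent" case), so only a live-but-unresponsive primary blocks takeover.

```python
import os

def can_take_over(index_dir, primary_process_alive):
    """Decide whether the secondary master can safely start writing.

    Returns False only in the problem case from the thread: the lock file
    is present AND the primary JVM is still alive (zombie/unresponsive),
    where a human must remove the lock and replay failed docs.
    """
    lock_present = os.path.exists(os.path.join(index_dir, "write.lock"))
    if not lock_present:
        return True   # no lock: transparent takeover
    if not primary_process_alive:
        return True   # native lock released when the primary JVM exited
    return False      # zombie primary still holds the lock: manual intervention
```

The hard part, as noted, is that the LB's notion of "dead" (unresponsive) and the OS's notion (process exited) can disagree, which is exactly the case this check flags.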
RE: Memory use during merges (OOM)
Hello we occasionally bump into the OOM issue during merging after propagation too, and from the discussion below I guess we are doing thousands of 'false deletions' by unique id to make sure certain documents are *not* in the index. Could anyone explain why that is bad? I didn't really understand the conclusion below. -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Thursday, December 16, 2010 2:51 AM To: solr-user@lucene.apache.org Subject: Re: Memory use during merges (OOM) RAM usage for merging is tricky. First off, merging must hold open a SegmentReader for each segment being merged. However, it's not necessarily a full segment reader; for example, merging doesn't need the terms index nor norms. But it will load deleted docs. But, if you are doing deletions (or updateDocument, which is just a delete + add under-the-hood), then this will force the terms index of the segment readers to be loaded, thus consuming more RAM. Furthermore, if the deletions you (by Term/Query) do in fact result in deleted documents (ie they were not false deletions), then the merging allocates an int[maxDoc()] for each SegmentReader that has deletions. Finally, if you have multiple merges running at once (see CSM.setMaxMergeCount) that means RAM for each currently running merge is tied up. So I think the gist is... the RAM usage will be in proportion to the net size of the merge (mergeFactor + how big each merged segment is), how many merges you allow concurrently, and whether you do false or true deletions. If you are doing false deletions (calling .updateDocument when in fact the Term you are replacing cannot exist) it'd be best if possible to change the app to not call .updateDocument if you know the Term doesn't exist. Mike On Wed, Dec 15, 2010 at 6:52 PM, Burton-West, Tom tburt...@umich.edu wrote: Hello all, Are there any general guidelines for determining the main factors in memory use during merges? 
We recently changed our indexing configuration to speed up indexing, but in the process of doing a very large merge we are running out of memory. Below is a list of the changes and part of the indexwriter log. The changes increased the indexing throughput by almost an order of magnitude (from about 600 documents per hour to about 6000 documents per hour; our documents are about 800K). We are trying to determine which of the changes to tweak to avoid the OOM but still keep the benefit of the increased indexing throughput. Is it likely that the change to ramBufferSizeMB is the culprit, or could it be the mergeFactor change from 10 to 20? Is there any obvious relationship between ramBufferSizeMB and the memory consumed by Solr? Are there rules of thumb for the memory needed in terms of the number or size of segments? Our largest segments prior to the failed merge attempt were between 5GB and 30GB. The memory allocated to the Solr/tomcat JVM is 10GB.

Tom Burton-West

Changes to indexing configuration:
mergeScheduler before: serialMergeScheduler after: concurrentMergeScheduler
mergeFactor before: 10 after: 20
ramBufferSizeMB before: 32 after: 320

Excerpt from indexWriter.log:
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: findMerges: 40 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: level 7.23609 to 7.98609: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: 0 to 20: add this merge
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: level 5.44878 to 6.19878: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: 20 to 40: add this merge
...
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: applyDeletes
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0 deleted docIDs and 0 deleted queries on 40 segments.
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; http-8091-Processor70]: hit exception flushing deletes
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; http-8091-Processor70]: hit OutOfMemoryError inside updateDocument

tom
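Mike's advice (don't call updateDocument/delete when the term cannot exist) can be applied at the indexer-app level by tracking which unique ids were actually sent. A minimal sketch with a hypothetical client wrapper, not Solr or Lucene API:

```python
class DeleteFilter:
    """Track ids known to be in the index so 'false deletions' (deletes by
    ids that were never indexed) are skipped instead of sent, avoiding the
    terms-index loading Mike describes."""

    def __init__(self, client):
        self.client = client      # anything with .add(doc) / .delete(doc_id)
        self.known_ids = set()

    def add(self, doc):
        self.known_ids.add(doc["id"])
        self.client.add(doc)

    def delete(self, doc_id):
        if doc_id not in self.known_ids:
            return False          # false deletion: skip it entirely
        self.client.delete(doc_id)
        self.known_ids.discard(doc_id)
        return True
```

In practice the id set would need to be persisted (or replaced with a cheap existence check) so it survives indexer restarts; the point is only that the filtering happens before Solr sees the delete.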
RE: Memory use during merges (OOM)
Thanks Mike! When you say 'terms index of the segment readers', are you referring to the term vectors? In our case our index of 8 million docs holds pretty 'skinny' docs containing searchable product titles and keywords, with the rest of the doc only holding ids for faceting upon. Docs typically contain only unique terms within each doc, with a lot of overlap of the terms across categories of docs (all similar products), so I'm thinking our unique-term count is low relative to the size of our index. The way we spin out deletes and adds should keep the terms loaded all the time.

It seems like once every couple of weeks a propagation happens which kills the slave farm with OOMs. We are bumping the heap up a couple gigs every time this happens and hoping it goes away at this point. That is why I jumped into this discussion; sorry for butting in like that. You guys are discussing very interesting settings I had not considered before.

Rob

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Thursday, December 16, 2010 10:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Memory use during merges (OOM)

It's not that it's bad, it's just that Lucene must do extra work to check if these deletes are real or not, and that extra work requires loading the terms index, which will consume additional RAM. For most apps, though, the terms index is relatively small and so this isn't really an issue. But if your terms index is large, this can explain the added RAM usage. One workaround for a large terms index is to set the terms index divisor that IndexWriter should use whenever it loads a terms index (this is IndexWriter.setReaderTermsIndexDivisor).

Mike

On Thu, Dec 16, 2010 at 12:17 PM, Robert Petersen rober...@buy.com wrote:

Hello, we occasionally bump into the OOM issue during merging after propagation too, and from the discussion below I guess we are doing thousands of 'false deletions' by unique id to make sure certain documents are *not* in the index.
Could anyone explain why that is bad? I didn't really understand the conclusion below. -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Thursday, December 16, 2010 2:51 AM To: solr-user@lucene.apache.org Subject: Re: Memory use during merges (OOM) RAM usage for merging is tricky. First off, merging must hold open a SegmentReader for each segment being merged. However, it's not necessarily a full segment reader; for example, merging doesn't need the terms index nor norms. But it will load deleted docs. But, if you are doing deletions (or updateDocument, which is just a delete + add under-the-hood), then this will force the terms index of the segment readers to be loaded, thus consuming more RAM. Furthermore, if the deletions you (by Term/Query) do in fact result in deleted documents (ie they were not false deletions), then the merging allocates an int[maxDoc()] for each SegmentReader that has deletions. Finally, if you have multiple merges running at once (see CSM.setMaxMergeCount) that means RAM for each currently running merge is tied up. So I think the gist is... the RAM usage will be in proportion to the net size of the merge (mergeFactor + how big each merged segment is), how many merges you allow concurrently, and whether you do false or true deletions. If you are doing false deletions (calling .updateDocument when in fact the Term you are replacing cannot exist) it'd be best if possible to change the app to not call .updateDocument if you know the Term doesn't exist. Mike On Wed, Dec 15, 2010 at 6:52 PM, Burton-West, Tom tburt...@umich.edu wrote: Hello all, Are there any general guidelines for determining the main factors in memory use during merges? We recently changed our indexing configuration to speed up indexing but in the process of doing a very large merge we are running out of memory. Below is a list of the changes and part of the indexwriter log. 
The changes increased the indexing though-put by almost an order of magnitude. (about 600 documents per hour to about 6000 documents per hour. Our documents are about 800K) We are trying to determine which of the changes to tweak to avoid the OOM, but still keep the benefit of the increased indexing throughput Is it likely that the changes to ramBufferSizeMB are the culprit or could it be the mergeFactor change from 10-20? Is there any obvious relationship between ramBufferSizeMB and the memory consumed by Solr? Are there rules of thumb for the memory needed in terms of the number or size of segments? Our largest segments prior to the failed merge attempt were between 5GB and 30GB. The memory allocated to the Solr/tomcat JVM is 10GB. Tom Burton-West - Changes to indexing configuration: mergeScheduler before: serialMergeScheduler after
RE: entire farm fails at the same time with OOM issues
It has typically been when query traffic was lowest! We are at 12 GB heap, so I will try to bump it to 14 GB. We have 64GB main memory installed now. Here are our settings; do these look OK?

export JAVA_OPTS="-Xmx12228m -Xms12228m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Tuesday, November 30, 2010 6:44 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen rober...@buy.com wrote: My question is this. Why in the world would all of my slaves, after running fine for some days, suddenly all at the exact same minute experience OOM heap errors and go dead?

If there is no change in query traffic when this happens, then it's due to what the index looks like. My guess is a large index merge happened, which means that when the searchers re-open on the new index, it requires more memory than normal (much less can be shared with the previous index). I'd try bumping the heap a little bit, and then optimizing once a day during off-peak hours. If you still get OOM errors, bump the heap a little more.

-Yonik
http://www.lucidimagination.com
RE: entire farm fails at the same time with OOM issues
Good idea. Our farm is behind Akamai so that should be ok to do. -Original Message- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Wednesday, December 01, 2010 12:21 PM To: solr-user@lucene.apache.org Subject: Re: entire farm fails at the same time with OOM issues also try to minimize maxWarming searchers to 1(?) or 2. And decrease cache usage (especially autowarming) if possible at all. But again: only if it doesn't affect performance ... Regards, Peter. On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersenrober...@buy.com wrote: My question is this. Why in the world would all of my slaves, after running fine for some days, suddenly all at the exact same minute experience OOM heap errors and go dead? If there is no change in query traffic when this happens, then it's due to what the index looks like. My guess is a large index merge happened, which means that when the searchers re-open on the new index, it requires more memory than normal (much less can be shared with the previous index). I'd try bumping the heap a little bit, and then optimizing once a day during off-peak hours. If you still get OOM errors, bump the heap a little more. -Yonik http://www.lucidimagination.com
shutdown.sh does not kill the tomcat process running solr?
Greetings, we're wondering why we can issue the command to shut down tomcat/solr but the process remains visible in memory (via the top command), and we have to manually kill the PID for it to release its memory before we can (re)start tomcat/solr. Anybody have any ideas?

The process is using 12+ GB of main memory typically but can go up to 40 GB on the master where we index. We have 64GB main memory on these servers. I set the heap at 12 GB and use the concurrent garbage collector too. That raises another question: top can show only 20 GB free out of 64, but the tomcat/solr process only shows it's using half of that. What is using the rest? The numbers don't add up...

Environment: Lucid Imagination distro of Solr 1.4 on Tomcat
Platform: RHEL with Sun JRE 1.6.0_18
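The manual "kill the PID" step can be scripted as an escalating kill. A sketch under assumptions: it takes the PID as an argument (e.g. read from Tomcat's pidfile) and is not part of any stock shutdown.sh.

```shell
# Try a graceful SIGTERM first; fall back to SIGKILL after a timeout.
# Returns 0 if the process exited on its own, 1 if it had to be force-killed.
kill_wait() {
  pid="$1"; timeout="${2:-30}"
  kill "$pid" 2>/dev/null
  i=0
  while kill -0 "$pid" 2>/dev/null; do
    i=$((i + 1))
    [ "$i" -ge "$timeout" ] && { kill -9 "$pid" 2>/dev/null; return 1; }
    sleep 1
  done
  return 0
}
```

Worth noting: if TERM is routinely ignored, the real question is what the JVM is stuck on (e.g. non-daemon threads or a hung shutdown hook), which a thread dump (kill -3) taken before the force-kill can answer.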
entire farm fails at the same time with OOM issues
Greetings, we are running one master and four slaves of our multicore solr setup. We just served searches for our catalog of 8 million products with this farm during Black Friday and Cyber Monday, our busiest days of the year, and the servers did not break a sweat! Index size is about 28GB.

However, twice now recently, during a time of low load, we have had a fire drill where I have seen tomcat/solr fail and become unresponsive after some OOM heap errors. Solr wouldn't even serve up its admin pages. I've had to go in and manually knock tomcat out of memory and then restart it. These solr slaves are load balanced, and the load balancers always probe the solr slaves, so if they stop serving up searches they are automatically removed from the load balancer. When all four fail at the same time we have an issue!

My question is this: why in the world would all of my slaves, after running fine for some days, suddenly all at the exact same minute experience OOM heap errors and go dead? The load balancer kicks them all out at the same time each time. Each slave only talks to the master and not to each other, but the master shows no errors in the logs at all. Something must be triggering this, though. The only other odd thing I saw in the logs was that after the first OOM errors were recorded, the slaves started occasionally not being able to get to the master. This behavior makes me a little nervous... =:-o eek!

Environment: Lucid Imagination distro of Solr 1.4 on Tomcat
Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with 64GB memory etc etc
RE: entire farm fails at the same time with OOM issues
What would I do with the heap dump though? Run one of those java heap analyzers looking for memory leaks or something? I have no experience with those. I saw there was a bug fix in solr 1.4.1 for a 100-byte memory leak occurring on each commit, but it would take thousands of commits to make that add up to anything, right?

-Original Message-
From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Tuesday, November 30, 2010 3:12 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=<path to where you want the file to go>, so then you have something to look at versus a Gedankenexperiment :)

-- Ken

On Nov 30, 2010, at 3:04pm, Robert Petersen wrote:

Greetings, we are running one master and four slaves of our multicore solr setup. We just served searches for our catalog of 8 million products with this farm during Black Friday and Cyber Monday, our busiest days of the year, and the servers did not break a sweat! Index size is about 28GB. However, twice now recently during a time of low load we have had a fire drill where I have seen tomcat/solr fail and become unresponsive after some OOM heap errors. Solr wouldn't even serve up its admin pages. I've had to go in and manually knock tomcat out of memory and then restart it. These solr slaves are load balanced and the load balancers always probe the solr slaves so if they stop serving up searches they are automatically removed from the load balancer. When all four fail at the same time we have an issue! My question is this. Why in the world would all of my slaves, after running fine for some days, suddenly all at the exact same minute experience OOM heap errors and go dead? The load balancer kicks them all out at the same time each time. Each slave only talks to the master and not to each other, but the master shows no errors in the logs at all.
Something must be triggering this though. The only other odd thing I saw in the logs was after the first OOM errors were recorded, the slaves started occasionally not being able to get to the master. This behavior makes me a little nervous...=:-o eek! Environment: Lucid Imagination distro of Solr 1.4 on Tomcat Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with 64GB memory etc etc http://ken-blog.krugler.org +1 530-265-2225 -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
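Ken's heap-dump flags can be combined with the GC settings quoted earlier in the thread. A sketch only; the heap size and dump path are placeholders to adapt, and CATALINA_OPTS is one conventional place Tomcat picks such options up.

```shell
# Hypothetical Tomcat launch options: fixed heap, CMS collector, plus a
# heap dump on OOM so the next fire drill leaves evidence to analyze.
export CATALINA_OPTS="-Xmx12288m -Xms12288m \
  -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/tomcat/heapdumps"
```

Make sure the dump path has room: a dump from a 12 GB heap is on the order of the heap size.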
RE: Adding new field after data is already indexed
1) Just put the new field in the schema and stop/start Solr. Documents in the index will not have the field until you reindex them, but it won't hurt anything. 2) Just turning off their handlers in solrconfig is all I think that takes.

-Original Message-
From: gauravshetti [mailto:gaurav.she...@tcs.com]
Sent: Monday, November 08, 2010 5:21 AM
To: solr-user@lucene.apache.org
Subject: Adding new field after data is already indexed

Hi,

I had a few questions regarding Solr. Say my schema file looks like:

<field name="folder_id" type="long" indexed="true" stored="true"/>
<field name="indexed" type="boolean" indexed="true" stored="true"/>

and I index data on the basis of these fields. Now, in case I need to add a new field, is there a way I can add the field without corrupting the previous data? Is there any feature which adds a new field with a default value to the existing records?

2) Is there any security mechanism/authorization check to restrict URLs like /admin and /update to only a few users?

--
View this message in context: http://lucene.472066.n3.nabble.com/Adding-new-field-after-data-is-already-indexed-tp1862575p1862575.html
Sent from the Solr - User mailing list archive at Nabble.com.
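On the default-value part of the question: schema.xml fields accept a default attribute, but note it applies only to documents indexed after the change; existing records still lack the field until reindexed, matching answer 1 above. A sketch with a hypothetical field name:

<!-- Hypothetical new field: docs indexed after this change that omit
     "on_sale" get the default; previously indexed docs are unaffected. -->
<field name="on_sale" type="boolean" indexed="true" stored="true" default="false"/>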
RE: phrase query with autosuggest (SOLR-1316)
My simple but effective solution to that problem was to replace the white spaces in the items you index for autosuggest with some special character; then your wildcarding will work with the whole phrase as you desire.

Index: mike_shaffer
Query: mike_sha*

-Original Message-
From: mike anderson [mailto:saidthero...@gmail.com]
Sent: Wednesday, October 06, 2010 7:33 AM
To: solr-user@lucene.apache.org
Subject: phrase query with autosuggest (SOLR-1316)

It seemed like SOLR-1316 was a little too long to continue the conversation. Is there support for quotes indicating a phrase query? For example, my autosuggest query for "mike sha" ought to return "mike shaffer", "mike sharp", etc. Instead I get suggestions for "mike" and for "sha", resulting in a collated result "mike r meyer shaw".

Cheers,
Mike
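The underscore trick above can be illustrated end to end in plain Python (this models the transform, not Solr analyzer config): apply the same whitespace-joining at index time and query time, then prefix-match, which is what the wildcard mike_sha* does against the untokenized field.

```python
def phrase_key(phrase, joiner="_"):
    # Index-time transform: collapse whitespace and join the words so the
    # whole phrase behaves as one token for prefix/wildcard matching.
    return joiner.join(phrase.lower().split())

def suggest(prefix, indexed_phrases):
    # Query-time: same transform, then prefix match across whole phrases,
    # so "mike sha" matches "mike shaffer" rather than "mike" and "sha"
    # separately.
    key = phrase_key(prefix)
    return [p for p in indexed_phrases if phrase_key(p).startswith(key)]
```

The original phrase, with spaces, still gets stored in a separate field for display, exactly as described above.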
RE: Do commits block updates in SOLR 1.4?
So you are saying we definitely do not need to pause ADD activity on other threads while we send the COMMIT? And the same goes for AUTOCOMMIT, right? We are using SOLR 1.4 now; we were on 1.3 previously. We pretty much just assumed pausing ADDs during COMMITs was required by SOLR when we designed our indexing system, mostly due to our experience with an older and different search engine.

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Thursday, September 02, 2010 6:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Do commits block updates in SOLR 1.4?

Yes, indexing is synchronized during commits. You can call commit all you want, and index docs, and commit will finish and then indexing will restart. Previous Solr releases did this also; how far back is your existing Solr?

On Thu, Sep 2, 2010 at 1:11 PM, Robert Petersen rober...@buy.com wrote:

Hello, sorry to bother, but does anyone know the answer to this? This is the closest thing I can find on the subject: http://lucene.472066.n3.nabble.com/Autocommit-blocking-adds-AutoCommit-Speedup-td498465.html

-Original Message-
From: Robert Petersen [mailto:rober...@buy.com]
Sent: Wednesday, September 01, 2010 11:35 AM
To: solr-user@lucene.apache.org
Subject: Do commits block updates in SOLR 1.4?

I can't seem to find a definitive answer. I have ten threads doing my indexing, and I block all the threads when one is ready to do a commit so no adds are done until the commit finishes. Is this still required in SOLR 1.4, or could I take it out? I tried testing this on a separate small index where I set autocommit in solrconfig and seem to have no issues just continuously adding documents from multiple threads to it despite its commit activity. I'd like to do the same in my big main index; is it safe? Also, is there any difference in behavior between autocommits and explicit commits in this regard?

--
Lance Norskog
goks...@gmail.com
RE: Do commits block updates in SOLR 1.4?
Thanks guys! I will be quite happy to remove the unnecessary complexity from our code. -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Friday, September 03, 2010 10:28 AM To: solr-user@lucene.apache.org Subject: Re: Do commits block updates in SOLR 1.4? Solr handles all of this concurrency for you - it's actually even a little too aggressive about that these days, as Lucene has changed a lot - but yes - you can add while committing and commit while adding - Solr will block itself as needed. - Mark On 9/3/10 1:27 PM, Robert Petersen wrote: So you are saying we definitely do not need to pause ADD activity on other threads while we send the COMMIT? And the same goes with AUTOCOMMIT right? We are using SOLR 1.4 now. We were on 1.3 previously. We pretty much just assumed pausing ADDs during COMMITs was required by SOLR when we designed our indexing system, mostly due to our experience with an older and different search engine. -Original Message- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Thursday, September 02, 2010 6:10 PM To: solr-user@lucene.apache.org Subject: Re: Do commits block updates in SOLR 1.4? Yes, indexing synchronized during commits. You can call commit all you want, and index docs, and commit will finish and then indexing will restart. Previous Solr release did this also; how far back is your existing Solr? On Thu, Sep 2, 2010 at 1:11 PM, Robert Petersen rober...@buy.com wrote: Hello sorry to bother but does anyone know the answer to this? This is the closest thing I can find on the subject: http://lucene.472066.n3.nabble.com/Autocommit-blocking-adds-AutoCommit-S peedup-td498465.html -Original Message- From: Robert Petersen [mailto:rober...@buy.com] Sent: Wednesday, September 01, 2010 11:35 AM To: solr-user@lucene.apache.org Subject: Do commits block updates in SOLR 1.4? I can't seem to find a definitive answer. 
I have ten threads doing my indexing and I block all the threads when one is ready to do a commit so no adds are done until the commit finishes. Is this still required in SOLR 1.4 or could I take it out? I tried testing this on a separate small index where I set autocommit in solrconfig and seem to have no issues just continuously adding documents from multiple threads to it despite its commit activity. I'd like to do the same in my big main index, is it safe? Also, is there any difference in behavior between autocommits and explicit commits in this regard?
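Mark's point (Solr blocks itself as needed, so client threads never have to pause for a commit) can be illustrated with a toy model where adds and commits share one lock. A sketch of the concurrency pattern only, not Solr internals:

```python
import threading

class ToySolr:
    """Toy model of Solr's internal synchronization: add() and commit()
    take the same lock, so calling threads need no external coordination."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []   # added but not yet committed
        self._visible = []   # committed, i.e. visible to searchers

    def add(self, doc):
        with self._lock:
            self._pending.append(doc)

    def commit(self):
        with self._lock:
            self._visible.extend(self._pending)
            self._pending.clear()
```

With this model, ten add threads plus a thread calling commit() in a loop interleave safely and no documents are lost, which is the behavior the "remove the pausing" advice in this thread relies on.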
RE: Do commits block updates in SOLR 1.4?
Hello, sorry to bother, but does anyone know the answer to this? This is the closest thing I can find on the subject: http://lucene.472066.n3.nabble.com/Autocommit-blocking-adds-AutoCommit-Speedup-td498465.html -Original Message- From: Robert Petersen [mailto:rober...@buy.com] Sent: Wednesday, September 01, 2010 11:35 AM To: solr-user@lucene.apache.org Subject: Do commits block updates in SOLR 1.4? I can't seem to find a definitive answer. I have ten threads doing my indexing and I block all the threads when one is ready to do a commit so no adds are done until the commit finishes. Is this still required in SOLR 1.4 or could I take it out? I tried testing this on a separate small index where I set autocommit in solrconfig and seem to have no issues just continuously adding documents from multiple threads to it despite its commit activity. I'd like to do the same in my big main index; is it safe? Also, is there any difference in behavior between autocommits and explicit commits in this regard?
Do commits block updates in SOLR 1.4?
I can't seem to find a definitive answer. I have ten threads doing my indexing and I block all the threads when one is ready to do a commit so no adds are done until the commit finishes. Is this still required in SOLR 1.4 or could I take it out? I tried testing this on a separate small index where I set autocommit in solrconfig and seem to have no issues just continuously adding documents from multiple threads to it despite its commit activity. I'd like to do the same in my big main index, is it safe? Also, is there any difference in behavior between autocommits and explicit commits in this regard?
RE: Auto Suggest
I do this by replacing the spaces with a '%' in a separate search field which is neither parsed nor tokenized, and then you can wildcard across the whole phrase like you want and the spaces don't mess you up. Just store the original phrase with spaces in a separate field for returning to the front end for display. -Original Message- From: Jazz Globe [mailto:jazzgl...@hotmail.com] Sent: Wednesday, September 01, 2010 7:33 AM To: solr-user@lucene.apache.org Subject: Auto Suggest Hallo How would one implement a multiple-term auto-suggest feature in Solr that is filter sensitive? For example, a user enters : mp3 and solr might suggest: - mp3 player - mp3 nano - mp3 sony and then the user starts the second word : mp3 n and that narrows it down to: - mp3 nano I had a quick look at the Terms Component. I suppose it just returns term totals for the entire index and cannot be used with a filter or query? Thanks Johan
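A sketch of the transformation Robert describes. This is not Solr code — `to_suggest_term` and `suggest` are illustrative stand-ins; in Solr the matching step would be a prefix/wildcard query against the untokenized, space-collapsed field, with the original phrase stored separately for display.

```python
def to_suggest_term(phrase):
    # index-time: collapse the phrase into one "term", '%' standing in for spaces
    return phrase.lower().replace(" ", "%")

# store both forms: the mangled term for matching, the original for display
catalog = ["mp3 player", "mp3 nano", "mp3 sony", "dvd player"]
index = [(to_suggest_term(p), p) for p in catalog]

def suggest(prefix):
    # query-time: mangle the user's input the same way, then prefix-match;
    # this is what e.g. 'mp3%n*' does against the single-term field in Solr
    q = to_suggest_term(prefix)
    return [display for term, display in index if term.startswith(q)]

print(suggest("mp3"))    # ['mp3 player', 'mp3 nano', 'mp3 sony']
print(suggest("mp3 n"))  # ['mp3 nano']
```

Because the whole phrase is one term, the space in "mp3 n" no longer splits the query, which is exactly the problem this trick works around.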
RE: Auto Suggest
We don't have that many, just a hundred thousand, and solr response times (since the index's docs are small and not complex) are logged as typically 1 ms if not 0 ms. It's funny but sometimes it is so fast no milliseconds have elapsed. Incredible if you ask me... :) Once you get SOLR to consider the whole phrase as just one big term, the wildcard is very fast. -Original Message- From: Eric Grobler [mailto:impalah...@googlemail.com] Sent: Wednesday, September 01, 2010 12:35 PM To: solr-user@lucene.apache.org Subject: Re: Auto Suggest Hi Robert, Interesting approach, how many documents do you have in Solr? I have about 2 million and I just wonder if it might be a bit slow. Regards Johan On Wed, Sep 1, 2010 at 7:38 PM, Robert Petersen rober...@buy.com wrote: I do this by replacing the spaces with a '%' in a separate search field which is not parsed nor tokenized and then you can wildcard across the whole phrase like you want and the spaces don't mess you up. Just store the original phrase with spaces in a separate field for returning to the front end for display. -Original Message- From: Jazz Globe [mailto:jazzgl...@hotmail.com] Sent: Wednesday, September 01, 2010 7:33 AM To: solr-user@lucene.apache.org Subject: Auto Suggest Hallo How would one implement a multiple term auto-suggest feature in Solr that is filter sensitive? For example, a user enters : mp3 and solr might suggest: - mp3 player - mp3 nano - mp3 sony and then the user starts the second word : mp3 n and that narrows it down to: - mp3 nano I had a quick look at the Terms Component. I suppose it just returns term totals for the entire index and cannot be used with a filter or query? Thanks Johan
It seems like using a wildcard causes lowercase filter to not do the lowercasing?
I have a field with a lowercase filter on both the search and index sides, and searching in this field works fine with uppercase or lowercase terms, except if I wildcard! So searching for 'gps' or 'GPS' returns the same result set, but searching for 'gps*' returns results as expected and searching for 'GPS*' returns nothing. It seems the asterisk blocks the lowercase filter operation and then no matches occur because the index is all lowercased. This is a very simple index with very simple docs, and the field is defined like this in the schema:

<field name="phraseNoSpaces" type="alphaOnlySort" indexed="true" stored="false" required="true"/>
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
RE: It seems like using a wildcard causes lowercase filter to not do the lowercasing?
Aha, I overlooked that. Thank you. -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: Monday, August 09, 2010 1:28 PM To: solr-user@lucene.apache.org Subject: Re: It seems like using a wildcard causes lowercase filter to not do the lowercasing? I have a field with lowercase filter on search and index sides, and searching in this field works fine with uppercase or lowercase terms, except if I wildcard! So searching for 'gps' or 'GPS' returns the same result set, but searching for 'gps*' returns results as expected and searching for 'GPS*' returns nothing. It seems the asterisk blocks the lower case filter operation and then no matches occur because the index is all lowercased. Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer [1] [1]http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
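Since wildcard, prefix, and fuzzy queries bypass the analyzer, a common workaround is to normalize the term on the client before building the query string. A minimal sketch — the field name comes from the schema in this thread, but the helper itself is made up for illustration and is deliberately not a full Solr query escaper:

```python
def build_prefix_query(field, user_term):
    # wildcard terms are not analyzed, so apply the same normalization the
    # index side does (lowercase + trim) before appending the '*'
    term = user_term.strip().lower()
    return f"{field}:{term}*"

print(build_prefix_query("phraseNoSpaces", "GPS"))  # phraseNoSpaces:gps*
```

With this, 'GPS' and 'gps' produce the same prefix query, matching the all-lowercase index.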
does this indicate a commit happened for every add?
I'm adding lots of small docs with several threads to solr and the adds start fast but then slow down. I didn't do any explicit commits and autocommit is turned off, but the logs show lots of commit activity on this core, and restarting this solr core logged the below. Where did all these commits come from, the exact same number as my adds? I'm stumped...

Jul 27, 2010 10:07:17 AM org.apache.solr.update.DirectUpdateHandler2 close
INFO: closed DirectUpdateHandler2{commits=456389,autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=456393,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
RE: CommonsHttpSolrServer add document hangs
Maybe solr is busy doing a commit or optimize? -Original Message- From: Max Lynch [mailto:ihas...@gmail.com] Sent: Monday, July 12, 2010 9:59 AM To: solr-user@lucene.apache.org Subject: CommonsHttpSolrServer add document hangs Hey guys, I'm using Solr 1.4.1 and I've been having some problems lately with code that adds documents through a CommonsHttpSolrServer. It seems that randomly the call to theserver.add() will hang. I am currently running my code in a single thread, but I noticed this would happen in multi-threaded code as well. The jar version of commons-httpclient is 3.1. I got a thread dump of the process, and one thread seems to be waiting on the org.apache.commons.httpclient.MultiThreadedHttpConnectionManager as shown below. All other threads are in a RUNNABLE state (besides the Finalizer daemon).

[java] Full thread dump Java HotSpot(TM) 64-Bit Server VM (16.3-b01 mixed mode):
[java]
[java] MultiThreadedHttpConnectionManager cleanup daemon prio=10 tid=0x7f441051c800 nid=0x527c in Object.wait() [0x7f4417e2f000]
[java]    java.lang.Thread.State: WAITING (on object monitor)
[java] at java.lang.Object.wait(Native Method)
[java] - waiting on 0x7f443ae5b290 (a java.lang.ref.ReferenceQueue$Lock)
[java] at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
[java] - locked 0x7f443ae5b290 (a java.lang.ref.ReferenceQueue$Lock)
[java] at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
[java] at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ReferenceQueueThread.run(MultiThreadedHttpConnectionManager.java:1122)

Any ideas? Thanks.
RE: CommonsHttpSolrServer add document hangs
You could try a master/slave setup using replication, perhaps; then the slave serves searches, and indexing commits on the master won't hang up searches, at least... Here is the description: http://wiki.apache.org/solr/SolrReplication -Original Message- From: Max Lynch [mailto:ihas...@gmail.com] Sent: Monday, July 12, 2010 11:57 AM To: solr-user@lucene.apache.org Subject: Re: CommonsHttpSolrServer add document hangs Thanks Robert, My script did start going again, but it was waiting for about half an hour, which seems a bit excessive to me. Is there some tuning I can do on the solr end to optimize for my use case, which is very heavy on commits and very light on searches (I do most of my searches on the raw Lucene index in the background)? Thanks. On Mon, Jul 12, 2010 at 12:06 PM, Robert Petersen rober...@buy.com wrote: Maybe solr is busy doing a commit or optimize? -Original Message- From: Max Lynch [mailto:ihas...@gmail.com] Sent: Monday, July 12, 2010 9:59 AM To: solr-user@lucene.apache.org Subject: CommonsHttpSolrServer add document hangs Hey guys, I'm using Solr 1.4.1 and I've been having some problems lately with code that adds documents through a CommonsHttpSolrServer. It seems that randomly the call to theserver.add() will hang. I am currently running my code in a single thread, but I noticed this would happen in multi-threaded code as well. The jar version of commons-httpclient is 3.1. I got a thread dump of the process, and one thread seems to be waiting on the org.apache.commons.httpclient.MultiThreadedHttpConnectionManager as shown below. All other threads are in a RUNNABLE state (besides the Finalizer daemon).
[java] Full thread dump Java HotSpot(TM) 64-Bit Server VM (16.3-b01 mixed mode):
[java]
[java] MultiThreadedHttpConnectionManager cleanup daemon prio=10 tid=0x7f441051c800 nid=0x527c in Object.wait() [0x7f4417e2f000]
[java]    java.lang.Thread.State: WAITING (on object monitor)
[java] at java.lang.Object.wait(Native Method)
[java] - waiting on 0x7f443ae5b290 (a java.lang.ref.ReferenceQueue$Lock)
[java] at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
[java] - locked 0x7f443ae5b290 (a java.lang.ref.ReferenceQueue$Lock)
[java] at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
[java] at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ReferenceQueueThread.run(MultiThreadedHttpConnectionManager.java:1122)

Any ideas? Thanks.
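Beyond the master/slave suggestion, one defensive measure when a server may stall during a long commit or optimize is a client-side timeout (SolrJ's CommonsHttpSolrServer exposes connection and read timeout setters for this). The raw-socket sketch below is not Solr code; it just demonstrates the effect: a read with no timeout blocks indefinitely against a silent server, while one with a timeout fails fast so the caller can retry.

```python
import socket
import threading
import time

def silent_server(ports, ready):
    # accepts a connection but never replies - a stand-in for a Solr
    # instance stalled in a long commit or optimize
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    ports.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    time.sleep(30)                      # hold the connection open, say nothing

ports, ready = [], threading.Event()
threading.Thread(target=silent_server, args=(ports, ready), daemon=True).start()
ready.wait()

cli = socket.socket()
cli.settimeout(1.0)                     # without this, recv() below blocks forever
cli.connect(("127.0.0.1", ports[0]))
try:
    cli.recv(1024)
    timed_out = False
except socket.timeout:
    timed_out = True
print(timed_out)  # True
```

A half-hour hang like the one reported is what the no-timeout path looks like in practice; whether to fail fast or wait out the commit is an application decision.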
GC tuning - heap size autoranging
Is this a true statement??? This seems to contradict other statements regarding setting the heap size I have seen here...

"Default Heap Size: If not otherwise set on the command line, the initial and maximum heap sizes are calculated based on the amount of memory on the machine. The proportion of memory to use for the heap is controlled by the command line options DefaultInitialRAMFraction and DefaultMaxRAMFraction, as shown in the table below. (In the table, memory represents the amount of memory on the machine.)"

Pasted from http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#available_collectors.selecting
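The statement is true, and it doesn't contradict the advice elsewhere in this digest to set -Xms/-Xmx explicitly: the fractions only supply the ergonomic defaults, and explicit flags override them. A sketch of the arithmetic from the quoted table, assuming the documented default fractions (initial = memory/64, max = memory/4) and a 1GB cap on the default max heap — the cap varies by platform in the actual table, so treat it as an assumption here:

```python
def default_heap_sizes(machine_memory_mb,
                       initial_fraction=64, max_fraction=4,
                       max_cap_mb=1024):
    # ergonomic defaults per the Java SE 6 GC tuning guide:
    #   initial heap = memory / DefaultInitialRAMFraction (default 64)
    #   max heap     = memory / DefaultMaxRAMFraction (default 4), capped
    initial = machine_memory_mb // initial_fraction
    maximum = min(machine_memory_mb // max_fraction, max_cap_mb)
    return initial, maximum

# a 16GB machine like the one described later in this digest
print(default_heap_sizes(16384))  # (256, 1024)
```

A default max heap of ~1GB on a 16GB box is exactly why Solr installations set -Xmx explicitly rather than relying on the autoranging.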
RE: OOM on uninvert field request
Hey so after adding those GC options, I was able to incrementally push my max (and min) memory settings up, and when we got to max=min=12GB we started looking much better! One slave handles all the load with no OOMs at all! I'm watching the live tomcat log using 'tail'. Next I will convert that field type to (trie) int and reindex. I'll have to start a new index from scratch with a field type change like that, so I'll have to delete the old one first on our master... It takes us a couple days to index 15 million products (some are sets so the final index size is only 8 million) so I don't want to do *that* too often, as the slaves will be quite stale by the time it's done! :) Thanks for the help! -Original Message- From: Robert Petersen [mailto:rober...@buy.com] Sent: Wednesday, June 30, 2010 9:49 AM To: solr-user@lucene.apache.org Subject: RE: OOM on uninvert field request At and above 4GB we get those GC errors though! Should I switch to something like this?

Recommended Options: To use i-cms in Java SE 6, use the following command line options:
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Caused by: java.lang.RuntimeException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
	at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:418)
	at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:467)
	at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
	... 11 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

-Original Message- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Tuesday, June 29, 2010 8:42 PM To: solr-user@lucene.apache.org Subject: Re: OOM on uninvert field request Yes, it is better to use ints for ids than strings. Also, the Trie int fields have a compressed format that may cut the storage needs even more.
8M * 4 bytes = 32MB; times a few hundred, we'll say 300, is about 9.6GB of IDs. I don't know how these fields are stored, but if they are separate objects we've blown up even further (per-object overheads are surprising). 4G is probably not enough for what you want. If you watch the total memory with 'top' and hit it with different queries, you will get a stronger sense of how much memory your use cases need. On Tue, Jun 29, 2010 at 4:32 PM, Robert Petersen rober...@buy.com wrote: Hello, I am trying to find the right max and min settings for Java 1.6 on a 20GB index with 8 million docs, running the 1.6_018 JVM with solr 1.4, and currently have Java set to an even 4GB (export JAVA_OPTS=-Xmx4096m -Xms4096m) for both min and max, which is doing pretty well but occasionally still getting the below OOM errors. We're running on dual quad core xeons with 16GB memory installed. Is the memSize mentioned in the INFO for the uninvert in bytes? I.e., does memSize=29604020 mean 29MB? We have a few hundred of these fields and they contain ints used as IDs, and so I guess they could eat all the memory to uninvert them all after we apply load and enough queries are performed. Does the field type matter; would int be better than string if these are lookup ids sparsely populated across the index? BTW these are used for faceting and filtering only.
<dynamicField name="*_contentAttributeToken" type="string" indexed="true" multiValued="true" stored="true" required="false"/>

Jun 29, 2010 3:54:50 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field {field=768_contentAttributeToken,memSize=29604014,tindexSize=50,time=1841,phase1=1824,nTerms=1,bigTerms=0,termInstances=18,uses=0}
Jun 29, 2010 3:54:52 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field {field=749_contentAttributeToken,memSize=29604020,tindexSize=56,time=1847,phase1=1829,nTerms=143,bigTerms=0,termInstances=951,uses=0}
Jun 29, 2010 3:54:59 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.OutOfMemoryError: Java heap space
	at org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:191)
	at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
	at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
	at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:250)
	at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:283)
	at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:166)
-- Lance Norskog goks...@gmail.com
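Lance's back-of-the-envelope estimate, written out. The 4 bytes/doc/field figure and the 300-field count are his assumptions, not measured values, and per-object JVM overhead would push the real total higher:

```python
docs = 8_000_000          # docs in the index
bytes_per_doc = 4         # rough per-doc cost of one uninverted field
fields = 300              # "a few hundred" facet fields

per_field_mb = docs * bytes_per_doc / 1e6   # per-field cost in MB
total_gb = per_field_mb * fields / 1000     # all fields, before overhead
print(per_field_mb, total_gb)  # 32.0 9.6
```

So with every field uninverted at once, the working set comfortably exceeds a 4GB heap, which matches the OOMs reported above and the later success at 12GB.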
tomcat solr logs
Sorry if this is at all off topic. Our solr log files need grooming, and we would also like to analyze them, perhaps pulling various data points into a DB table. Is there a preferred app for doing log file analysis and/or an easy way to delete the old log files?
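For the deletion half of the question, a cron-able sketch. The retention period is a placeholder, and the usual system-level alternative is logrotate; the demo below runs against a throwaway directory rather than a real Tomcat log directory:

```python
import os
import tempfile
import time

def purge_old_logs(log_dir, max_age_days, now=None):
    """Delete files in log_dir older than max_age_days; return deleted names."""
    now = now if now is not None else time.time()
    cutoff = now - max_age_days * 86400
    deleted = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            deleted.append(name)
    return deleted

# demo on a temp dir: one month-old log, one fresh log
with tempfile.TemporaryDirectory() as d:
    for name, age_days in [("old.log", 30), ("fresh.log", 1)]:
        p = os.path.join(d, name)
        open(p, "w").close()
        stamp = time.time() - age_days * 86400
        os.utime(p, (stamp, stamp))
    removed = purge_old_logs(d, 14)
print(removed)  # ['old.log']
```

Point it at the Tomcat/Solr log directory for your layout (e.g. wherever catalina.out and the solr logs land) and schedule it daily.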
RE: OOM on uninvert field request
Most of these hundreds of facet fields have tens of values, but a couple have thousands. Is thousands of different values too many to do enum, or is that still ok? If so I could apply it carte blanche to the whole field... -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Wednesday, June 30, 2010 1:38 PM To: solr-user@lucene.apache.org Subject: Re: OOM on uninvert field request On Tue, Jun 29, 2010 at 7:32 PM, Robert Petersen rober...@buy.com wrote: Hello, I am trying to find the right max and min settings for Java 1.6 on a 20GB index with 8 million docs, running the 1.6_018 JVM with solr 1.4, and currently have Java set to an even 4GB (export JAVA_OPTS=-Xmx4096m -Xms4096m) for both min and max, which is doing pretty well but occasionally still getting the below OOM errors. We're running on dual quad core xeons with 16GB memory installed. Is the memSize mentioned in the INFO for the uninvert in bytes? Does memSize=29604020 mean 29MB? Yes. We have a few hundred of these fields and they contain ints used as IDs, and so I guess they could eat all the memory to uninvert them all after we apply load and enough queries are performed. Does the field type matter; would int be better than string if these are lookup ids sparsely populated across the index? No, using UnInvertedField faceting, the fieldType won't matter much at all for the space it takes up. The key here is that it looks like the number of unique terms in these fields is low - you would probably do much better with facet.method=enum (which iterates over terms rather than documents). -Yonik http://www.lucidimagination.com
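Solr's per-field overrides make it easy to apply Yonik's suggestion selectively rather than carte blanche: set facet.method=enum as the request default and override with f.<field>.facet.method=fc for the handful of high-cardinality fields. A sketch of building such a query string — the first facet field is from the logs in this thread, while big_field is a made-up name standing in for a field with thousands of values:

```python
from urllib.parse import urlencode

params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.method", "enum"),  # blanket default for the low-cardinality fields
    ("facet.field", "768_contentAttributeToken"),
    ("facet.field", "big_field"),
    # per-field override for a field with thousands of values
    ("f.big_field.facet.method", "fc"),
]
query_string = urlencode(params)
print(query_string)
```

Repeated `facet.field` tuples become repeated parameters, which is how Solr expects multi-valued request params to arrive.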
OOM on uninvert field request
Hello, I am trying to find the right max and min settings for Java 1.6 on a 20GB index with 8 million docs, running the 1.6_018 JVM with solr 1.4, and currently have Java set to an even 4GB (export JAVA_OPTS=-Xmx4096m -Xms4096m) for both min and max, which is doing pretty well but occasionally still getting the below OOM errors. We're running on dual quad core xeons with 16GB memory installed. Is the memSize mentioned in the INFO for the uninvert in bytes? I.e., does memSize=29604020 mean 29MB? We have a few hundred of these fields and they contain ints used as IDs, and so I guess they could eat all the memory to uninvert them all after we apply load and enough queries are performed. Does the field type matter; would int be better than string if these are lookup ids sparsely populated across the index? BTW these are used for faceting and filtering only.

<dynamicField name="*_contentAttributeToken" type="string" indexed="true" multiValued="true" stored="true" required="false"/>

Jun 29, 2010 3:54:50 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field {field=768_contentAttributeToken,memSize=29604014,tindexSize=50,time=1841,phase1=1824,nTerms=1,bigTerms=0,termInstances=18,uses=0}
Jun 29, 2010 3:54:52 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field {field=749_contentAttributeToken,memSize=29604020,tindexSize=56,time=1847,phase1=1829,nTerms=143,bigTerms=0,termInstances=951,uses=0}
Jun 29, 2010 3:54:59 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.OutOfMemoryError: Java heap space
	at org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:191)
	at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
	at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
	at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:250)
	at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:283)
	at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:166)
RE: 99.9% uptime requirement
Here is another idea. With solr multicore you can dynamically spin up extra cores and bring them online. I'm not sure how well this would work for us, since we have hard-coded the names of the cores we are hitting in our config files. -Original Message- From: Brian Klippel [mailto:br...@theport.com] Sent: Thursday, August 06, 2009 8:38 AM To: solr-user@lucene.apache.org Subject: RE: 99.9% uptime requirement You could create a new working core, then call the swap command once it is ready. Then remove the work core and delete the appropriate index folder at your convenience. -Original Message- From: Robert Petersen [mailto:rober...@buy.com] Sent: Wednesday, August 05, 2009 6:41 PM To: solr-user@lucene.apache.org Subject: RE: 99.9% uptime requirement Maintenance Questions: In a two-slave, one-master setup where the two slaves are behind load balancers, what happens if I have to restart solr? If I have to restart solr, say for a schema update where I have added a new field, then what is the recommended procedure? If I can guarantee no commits or optimizes happen on the master during the schema update, so no new snapshots become available, then can I safely leave rsyncd enabled? When I stop and start a slave server, should I first pull it out of the load balancer's list, or will solr gracefully release connections as it shuts down so no searches are lost? What do you guys do to push out updates? Thanks for any thoughts, Robi -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Tuesday, August 04, 2009 8:57 AM To: solr-user@lucene.apache.org Subject: Re: 99.9% uptime requirement Right. You don't get to 99.9% by assuming that an 8-hour outage is OK. Design for continuous uptime, with plans for how long it takes to patch around a single point of failure. For example, if your load balancer is a single point of failure, make sure that you can redirect the front end servers to a single Solr server in much less than 8 hours. Also, think about your SLA.
Can the search index be more than 8 hours stale? How quickly do you need to be able to replace a failed indexing server? You might be able to run indexing locally on each search server if they are lightly loaded. wunder On Aug 4, 2009, at 7:11 AM, Norberto Meijome wrote: On Mon, 3 Aug 2009 13:15:44 -0700 Robert Petersen rober...@buy.com wrote: Thanks all, I figured there would be more talk about daemontools if there were really a need. I appreciate the input and for starters we'll put two slaves behind a load balancer and grow it from there. Robert, not taking away from daemon tools, but daemon tools won't help you if your whole server goes down. don't put all your eggs in one basket - several servers, load balancer (hardware load balancers x 2, haproxy, etc) and sure, use daemon tools to keep your services running within each server... B _ {Beto|Norberto|Numard} Meijome Why do you sit there looking like an envelope without any address on it? Mark Twain I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
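The swap Brian describes is done through Solr's CoreAdmin handler (action=SWAP), which atomically exchanges two cores' names, so the application can keep pointing at the same hard-coded core name. A sketch of building that request URL — host, port, and the core names here are placeholders:

```python
from urllib.parse import urlencode

def core_swap_url(host, port, live_core, rebuild_core):
    # CoreAdmin SWAP exchanges the two cores' names atomically, so the app
    # can keep hitting 'live' while the rebuilt index slides in underneath
    params = urlencode({"action": "SWAP", "core": live_core, "other": rebuild_core})
    return f"http://{host}:{port}/solr/admin/cores?{params}"

print(core_swap_url("localhost", 8983, "live", "rebuild"))
# http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=rebuild
```

After the swap, the old index lives under the "rebuild" name and can be unloaded and deleted at your convenience, as Brian notes.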