help: DIH and Multivalue does not work
Hi, Solr gurus: I am totally new to Solr and hope somebody can help me with this. I have a multivalue field channel in my schema:

<field name="channel" type="string" indexed="true" stored="true" multivalue="true"/>

and the DIH config:

<entity name="autocomplete" query="select * from view_autocomplete" transformer="RegexTransformer">
  <field column="channel" sourceColcolumn="channels" splitBy=","/>
</entity>

The DIH full-import is OK with other fields, but channel has nothing (Solr query: q=*:*). If I remove the <field column="channel" sourceColcolumn="channels" splitBy=","/> line, or make my DB return a channel column, channel has the correct data, i.e. a,b,c, but the problem is that a,b,c is then no longer multivalued but a single value. I am using PostgreSQL 9 and the SQL is pretty simple: select ..., 'a,b,c'::text AS channels from table-xxx; I did try out Solr 1.4.1 and a 3.1 snapshot from svn, both having the same problem. Thanks.
Re: help: DIH and Multivalue does not work
There are several typos: multivalue should be multiValued="true", and sourceColcolumn should be sourceColName. By the way, after you correct those and restart Tomcat, you can debug on /admin/dataimport.jsp

--- On Thu, 12/23/10, lun zhong zhong...@gmail.com wrote: [...]
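With those two attribute names corrected, the schema and DIH snippets from the original post should look something like this (field and query names taken from that post):

```xml
<!-- schema.xml: note multiValued, with a capital V -->
<field name="channel" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- data-config.xml: note sourceColName -->
<entity name="autocomplete" query="select * from view_autocomplete"
        transformer="RegexTransformer">
  <field column="channel" sourceColName="channels" splitBy=","/>
</entity>
```

The RegexTransformer then splits the comma-separated channels column from the database into multiple values of the multiValued channel field.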
Re: help: DIH and Multivalue does not work
Yup, it is working, thanks so much.

On Thu, Dec 23, 2010 at 4:55 PM, Ahmet Arslan iori...@yahoo.com wrote: [...]
Re: Solr index directory '/solr/data/index' doesn't exist. Creating new index... on Geronimo
Just to share with the Solr community that the problem has been resolved in a simple way: move solr/data/index out of /opt/dev/config. The root cause is permissions: it seems Geronimo doesn't allow write permission to /opt/dev/config and its sub-folders. Cheers, Bac Hoang

On 12/22/2010 6:25 PM, Bac Hoang wrote:

Hello Erick, could you kindly give a hand with my problem? Any ideas, hints, suggestions are highly appreciated. Many thanks.

1. The problem: Solr index directory '/solr/data/index' doesn't exist. Creating new index...

2. Some other info:
- using the Solr example 1.4.1
- Geronimo 2.1.6
- solr home: /opt/dev/config/solr
- dataDir: /opt/dev/config/solr/data/index. I set the read and write rights on each and every folder, from opt, dev ... to the last one, index (just to be sure ;) )
- lockType:
  - single/simple: Cannot create directory: /solr/data/index at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:397)
  - native: Cannot create directory: /solr/data/index at org.apache.lucene.store.NativeFSLockFactory.acquireTestLock
- the Geronimo log:
===
2010-12-22 15:13:03,001 INFO [SupportedModesServiceImpl] Portlet mode 'edit' not found for portletId: '/console-base.WARModules!874780194|0'
2010-12-22 15:13:03,001 INFO [SupportedModesServiceImpl] Portlet mode 'help' not found for portletId: '/console-base.WARModules!874780194|0'
2010-12-22 15:13:07,941 INFO [DirectoryMonitor] Hot deployer notified that an artifact was removed: default/solr2/1293005281314/war
2010-12-22 15:13:09,148 INFO [SupportedModesServiceImpl] Portlet mode 'edit' not found for portletId: '/console-base.WARModules!874780194|0'
2010-12-22 15:13:09,148 INFO [SupportedModesServiceImpl] Portlet mode 'help' not found for portletId: '/console-base.WARModules!874780194|0'
2010-12-22 15:13:14,139 INFO [SupportedModesServiceImpl] Portlet mode 'edit' not found for portletId: '/plugin.Deployment!227983155|0'
2010-12-22 15:13:18,795 WARN [TomcatModuleBuilder] Web application . does not contain a WEB-INF/geronimo-web.xml deployment plan. This may or may not be a problem, depending on whether you have things like resource references that need to be resolved. You can also give the deployer a separate deployment plan file on the command line.
2010-12-22 15:13:19,040 INFO [SolrResourceLoader] Using JNDI solr.home: /opt/dev/config/solr
2010-12-22 15:13:19,040 INFO [SolrResourceLoader] Solr home set to '/opt/dev/config/solr/'
2010-12-22 15:13:19,051 INFO [SolrDispatchFilter] SolrDispatchFilter.init()
2010-12-22 15:13:19,462 INFO [IndexSchema] default search field is text
2010-12-22 15:13:19,463 INFO [IndexSchema] query parser default operator is OR
2010-12-22 15:13:19,464 INFO [IndexSchema] unique key field: id
2010-12-22 15:13:19,490 INFO [JmxMonitoredMap] JMX monitoring is enabled. Adding Solr mbeans to JMX Server: com.sun.jmx.mbeanserver.jmxmbeanser...@144752d
2010-12-22 15:13:19,525 INFO [SolrCore] Added SolrEventListener: org.apache.solr.core.QuerySenderListener{queries=[]}
2010-12-22 15:13:19,525 INFO [SolrCore] Added SolrEventListener: org.apache.solr.core.QuerySenderListener{queries=[{q=solr rocks,start=0,rows=10}, {q=static firstSearcher warming query from solrconfig.xml}]}
2010-12-22 15:13:19,533 WARN [SolrCore] Solr index directory '/solr/data/index' doesn't exist. Creating new index...
2010-12-22 15:13:19,599 ERROR [SolrDispatchFilter] Could not start SOLR. Check solr/home property
java.lang.RuntimeException: java.io.IOException: Cannot create directory: /solr/data/index
at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:397)
at org.apache.solr.core.SolrCore.init(SolrCore.java:545)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
...
2010-12-22 15:13:19,601 INFO [SolrDispatchFilter] SolrDispatchFilter.init() done
2010-12-22 15:13:19,601 INFO [SolrServlet] SolrServlet.init()
2010-12-22 15:13:19,602 INFO [SolrResourceLoader] Using JNDI solr.home: /opt/dev/config/solr
2010-12-22 15:13:19,602 INFO [SolrServlet] SolrServlet.init() done
2010-12-22 15:13:19,606 INFO [SolrResourceLoader] Using JNDI solr.home: /opt/dev/config/solr
2010-12-22 15:13:19,606 INFO [SolrUpdateServlet] SolrUpdateServlet.init() done
2010-12-22 15:13:19,721 INFO [SupportedModesServiceImpl] Portlet mode 'edit' not found for portletId: '/plugin.Deployment!227983155|0'
===
With regards, Bac Hoang
error in html???
Hi All, I am able to get the response in the success case in JSON format by stating wt=json in the query. But in case of any errors, I am getting HTML format. 1) Is there any specific reason for getting it in HTML format? 2) Can't we get the error result in JSON format? Regards, satya
Re: Configuration option for disableReplication
Hi, We're running a cloud-based cluster of servers and it's not that easy to get a list of the current slaves. Since my problem is only around the restart/redeployment of the master, it seems an unnecessary complication to have to start interacting with slaves as part of the scripts that do this. As you say, there seems to be a proliferation of features you can enable and disable for the replication handler. Setting enabled=false for the master turns off all the features relating to the instance being a master. This is slightly different to calling the 'disablereplication' command, which simply causes the 'indexversion' command to return 0, which effectively stops the slaves from knowing if there is a new version and hence trying to replicate it. I'm not entirely clear whether this distinction is actually a useful one; combining them would be a fairly reasonable refactoring of the update handler, and would probably have an effect on backwards compatibility. Having the replicateAfter parameter set to just 'commit' (i.e. not on startup) has a similar effect to the 'disablereplication' command until you do the first commit after startup. So this is a workable solution for me, as the process that pushes updates and commits to the index can also check and swap the cores before it does any work. However, it feels like a bit of a tenuous way of disabling replication, particularly as there is an explicit mechanism for doing so; it's just not configurable on startup. I have a patch; I was looking for a bit of feedback as to whether I should submit it. Thanks, Francis

On 22 December 2010 21:30, Upayavira u...@odoko.co.uk wrote: I've just done a bit of playing here, because I've spent a lot of time reading the SolrReplication wiki page[1], and have often wondered how some features interact.
Unfortunately, if you specify <str name="enable">false</str> in your replication request handler for your master, you cannot re-enable it with a call to /solr/replication?command=enablereplication. Therefore, it would seem your best bet is to call /solr/replication?command=disablepolling on all of your slaves prior to upgrading. Then, when you're sure everything is right, call /solr/replication?command=enablepolling on each slave, and you should be good to go. I tried this, watching the request log on my master, and the incoming replication requests did actually stop due to the disablepolling command, so you should be fine with this approach. Does this get you to where you want to be? Upayavira

On Wed, 22 Dec 2010 17:10 +, Francis Rhys-Jones francis.rhys-jo...@guardian.co.uk wrote: Hi, I am looking into using a multi-core configuration to allow us to fully rebuild our index while still applying updates. I have two cores, main-core and rebuild-core. I push the whole dataset into the rebuild core, during which time I can happily keep pushing updates into the main-core. Once the rebuild is complete, I swap the cores and delete *:* from the rebuild core. This works fine; however, there are a couple of edge cases: On server restart, Solr needs to remember which core has been swapped in to be the main core. This can be solved by adding the persistent=true attribute to the Solr config; however, this does require the solr.xml to be writeable. While deploying a new version of our application we overwrite the solr.xml, as the new version could potentially have legitimate changes to the solr.xml that need to be rolled out, again leaving the cores out of sync.
My proposed solution is to have the indexing process do some sanity checking at the start of each run, and swap in the correct core if necessary. This works; however, there is the potential for the slaves to start replicating the empty index before the correct index is swapped in. To get round this problem I would like to have replication disabled on startup. Removing replicateAfter=startup has this effect, but it would be more future-proof to be able to specify a default for the replicationEnabled field (see SOLR-1175) in the ReplicationHandler, stopping replication until I explicitly turn it on. The change looks fairly simple. --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
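For reference, the workaround described above — replicating only after commits, never on startup — is just a matter of what goes in the master's replication handler config. A sketch (the handler name and confFiles list are illustrative, not from the original messages):

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- "startup" deliberately omitted: slaves see no new index version
         until the first explicit commit after the cores are swapped -->
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>
```

The default-disabled flag proposed in SOLR-1175 would be a config-level equivalent of issuing /solr/replication?command=disablereplication at startup.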
Item categorization problem.
Hi all, I am using Solr in my web application for search purposes. However, I am having a problem with the default behaviour of the Solr search. From my understanding, if I query for a keyword, let's say Laptop, preference is given to result rows having more occurrences of the search keyword Laptop in the field name. This, however, produces undesirable scenarios, for example: 1. I index an item A with name value Sony Laptop. 2. I index another item B with name value Laptop bags for laptops. 3. I search for the keyword Laptop. According to the default behaviour, precedence would be given to item B since the keyword appears more times in the name field for that item. In my schema, I have another field by the name of Category and, for example's sake, let's assume that my application supports only two categories: computers and accessories. Now, what I require is a mechanism to assign correct categories to the items during item indexing so that this field can be used to better filter the search results; item A would belong to the Computers category and item B would belong to the Accessories category. So then, searching for Laptop would only look for items in the Computers category and return item A only. I would like to point out here that setting the category field manually is not an option since the data might be in the vicinity of thousands of records. I am not asking for an in-depth algorithm; just a high-level design would be sufficient to set me in the right direction. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Item-catagorization-problem-tp2136415p2136415.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Configuration option for disableReplication
Having played with it, I can see that it would be extremely useful to be able to disable replication in the solrconfig.xml, and then enable it with a URL. So, as to your patch, I'd say yes, submit it. But do try to make it backwards compatible. It'll make it much more likely to get accepted. Upayavira

On Thu, 23 Dec 2010 12:12 +, Francis Rhys-Jones francis.rhys-jo...@guardian.co.uk wrote: [...]
Solr 1.4.1 stats component count not matching facet count for multi valued field
Hi, I have a facet field called option, which may be multi-valued, and a weight field, which is single-valued. When I use the Solr 1.4.1 stats component with a facet field, i.e. q=*:*&version=2.2&stats=true&stats.field=weight&stats.facet=option I get conflicting results for the stats count result <long name="count">1</long> when compared with the faceting counts obtained by q=*:*&version=2.2&facet=true&facet.field=option I would expect the same count from either method. This happens if multiple values are stored in the option field. It seems that for multiple values only the last entered value is being considered in the stats component? What am I doing wrong here? Thanks, Johannes
Total number of groups after collapsing
Hi, I have been using collapsing in my application. I have a requirement of finding the number of groups matching some filter criteria, something like a COUNT(DISTINCT columnName). The only solution I can currently think of is using the query: q=*:*&rows=Integer.MAX_VALUE&start=0&fl=score&collapse.field=abc&collapse.threshold=1&collapse.type=normal I get the number of groups from 'numFound', but this seems like a bad solution in terms of performance. Is there a cleaner way? Thanks, Samarth
Using remote Nutch Server to crawl, then merging results into local index
I want to use Solr to index two types of documents: - local documents in Drupal (ca. 10M) - a large number of web sites to be crawled through Nutch (ca. 100M) Our data center does not have the necessary bandwidth to crawl all the external sites, and we want to use a hosting provider to do the crawling for us, but we want the actual serving of results to happen locally. It seems it would probably be easiest to delegate all the indexing to a remote server and replicate those indexes to a slave in our data center using built-in Solr replication, but then the indexing of our internal sites would have to happen remotely too, which I would like to avoid. I think Hadoop/MapReduce would be overkill for this scenario, so what other options are there? I was considering: - using Solr merge to merge the Drupal and Nutch indexes - having Nutch post the crawled results to the local Solr index Any suggestions would be highly appreciated. Dietrich Schmidt http://www.linkedin.com/in/dietrichschmidt
Custom match scoring
Hi, I'm implementing a search that has peculiar scoring rules that, as far as I can see, aren't supported natively. The rules are like this: given a set of tokens, the final score would be the sum of the scores of all tokens, but each token can only be scored by its best match over a set of fields that it might match. I.e. restaurant food (2 tokens) must match Category^10 Name^5 Description; the token restaurant might match documents on all fields, but it must only be given the score of the Category match; the token food also counts for the score, again with its best match on any of the indicated fields. Can anyone guide me towards a solution, or to an extension point where I can capture only the best match for a given field? Thanks in advance. -- Nelson Branco
Re: full text search in multiple fields
Correct! Thanks again, it now works! :) -- View this message in context: http://lucene.472066.n3.nabble.com/full-text-search-in-multiple-fields-tp1888328p2137284.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Using remote Nutch Server to crawl, then merging results into local index
Hi, In order to crawl and index your web sites, maybe you can have a look at www.crawl-anywhere.com. It includes a web crawler, a document processing pipeline and a Solr indexer. Dominique

On 23/12/10 16:27, Dietrich wrote: [...]
Re: DIH for taxonomy faceting in Lucid webcast
SolrJ is often used when DIH doesn't do what you wish. Using SolrJ is really quite easy, but you're doing the DB queries yourself, often with the appropriate jdbc driver. Within DIH, the transformers, as Chris says, *might* work for you. Best Erick On Wed, Dec 22, 2010 at 6:16 PM, Andy angelf...@yahoo.com wrote: --- On Wed, 12/22/10, Chris Hostetter hossman_luc...@fucit.org wrote: : 2) Once I have the fully spelled out category path such as : NonFic/Science, how do I turn that into 0/NonFic : 1/NonFic/Science using the DIH? I don't have any specific suggestions for you -- i've never tried it in DIH myself. the ScriptTransformer might be able to help you out, but i'm not sure. Thanks Chris. What did you use to generate those encodings if not DIH?
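As a rough sketch of the ScriptTransformer route mentioned above (untested; the entity, column, and field names here are hypothetical), a script in data-config.xml could expand a path like NonFic/Science into the 0/NonFic and 1/NonFic/Science tokens at import time:

```xml
<dataConfig>
  <script><![CDATA[
    function makeFacetPaths(row) {
      var path = row.get('category');          // e.g. "NonFic/Science"
      if (path != null) {
        var parts = String(path).split('/');
        var tokens = new java.util.ArrayList();
        var prefix = '';
        for (var i = 0; i < parts.length; i++) {
          prefix = (i == 0) ? parts[i] : prefix + '/' + parts[i];
          tokens.add(i + '/' + prefix);        // "0/NonFic", "1/NonFic/Science"
        }
        row.put('category_path', tokens);      // a multiValued field in schema.xml
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="book" query="select * from books"
            transformer="script:makeFacetPaths"/>
  </document>
</dataConfig>
```

The ArrayList makes category_path multi-valued, so a document facets under every level of its hierarchy.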
Re: error in html???
What HTML format? Solr responds in XML, not HTML. Any HTML has to be created somewhere in the chain. Your browser may not be set up to render XML, so you could be seeing problems because of that. If this is off-base, could you explain your issue in a bit more detail? Best, Erick

On Thu, Dec 23, 2010 at 6:30 AM, satya swaroop satya.yada...@gmail.com wrote: [...]
Re: Item categorization problem.
What you're asking for appears to me to be auto-categorization, and there's nothing built into Solr to do this. Somehow you need to analyze the documents at index time and add the proper categories, but I have no clue how. This is especially hard with short fields, since most auto-categorization algorithms try to do some statistical analysis of the document to figure this out. Best, Erick

On Thu, Dec 23, 2010 at 8:12 AM, Hasnain hasn...@hotmail.com wrote: [...]
Re: Custom match scoring
Hmmm, have you looked at dismax? If I'm reading your message correctly, it sounds like this may already be there. Of course, I've missed the point of messages before. Best, Erick

On Thu, Dec 23, 2010 at 10:29 AM, Nelson Branco nelson-bra...@telecom.pt wrote: [...]
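For what it's worth, dismax's per-term behaviour does line up with the rules described: each query term becomes a DisjunctionMaxQuery over the listed fields, which (with the tie parameter at 0) scores the term by its single best-matching field only, and the per-term scores are then summed. A minimal handler sketch, assuming the field names and boosts from the original message:

```xml
<requestHandler name="/bestmatch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- each query term is scored against its best of these fields only -->
    <str name="qf">Category^10 Name^5 Description</str>
    <!-- tie=0.0: pure "max" — lower-scoring field matches contribute nothing -->
    <str name="tie">0.0</str>
  </lst>
</requestHandler>
```

Raising tie above 0 would blend in the lower-scoring field matches, which is exactly what the original scoring rules say should not happen.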
Re: Item precedence search problem
On Wed, Dec 22, 2010 at 3:53 PM, Hasnain hasn...@hotmail.com wrote: [...] In my schema, i have another field by the name of Category and, for example's sake, let's assume that my application supports only two categories: computers and accessories. Now, what i require is a mechanism to assign correct categories to the items during item indexing so that this field can be used to better filter the search results. Continuing from the example in my original post, item A would belong to Computer category and item B would belong to Accessories category. So then, searching for Laptop would only look for items in the Computers category and return item A only. I would like to point out here that setting the category field manually is not an option since the data might be in the vicinity of thousands of records. I am not asking for an in-depth algorithm. Just a high level design would be sufficient to set me in the right direction. [...] How do you do your indexing? You would need to have the indexer decide on what the proper category for a document should be, and add that value to the category field. Depending on your requirements, it might be possible to use synonyms in Solr to arrive at something like this. Other than that, Solr has no mechanism to automatically assign a category. You could possibly look at things like Apache Mahout to help you here. Regards, Gora
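One hedged sketch of the synonym idea above (the synonym file, terms, and field names here are made up for illustration): copy the name field into a category field whose index-time analyzer maps known product words onto category tokens.

```xml
<!-- schema.xml sketch -->
<fieldType name="category_from_name" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- category-synonyms.txt (hypothetical contents):
           bag, bags, sleeve, case => accessories
           laptop, notebook, desktop => computers  -->
    <filter class="solr.SynonymFilterFactory" synonyms="category-synonyms.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>

<field name="category" type="category_from_name" indexed="true" stored="false" multiValued="true"/>
<copyField source="name" dest="category"/>
```

This is deliberately crude — a name like "Laptop bags" would map to both categories — so it only sets a direction; proper auto-categorization (e.g. via Apache Mahout, as suggested) is more robust.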
Re: error in html???
Errors like HTTP Status 500 - null java.lang.NullPointerException at java.io.StringReader.<init>(StringReader.java:50) ... are returned in HTML. I use Nginx to detect the HTTP error code and return a JSON-encoded body with the appropriate content type. Maybe it could be done in the servlet container, but I never tried it.
Re: Using remote Nutch Server to crawl, then merging results into local index
Merging the indexes seems problematical. It's easy enough to code, but I'm not sure it would produce the results you want. And it supposes that your schemas are identical (or at least compatible) between the crawled data and your local data, which I wonder about... Instead, I'd think about cores. Cores can be thought of as virtual Solr indexes accessible by a single Solr instance. I'd guess that your requirements for handling the crawled data are different enough from the local documents that this might be what you want to do anyway. Federating these would probably involve two queries and some kind of manual integration of them, though. Best, Erick

On Thu, Dec 23, 2010 at 10:27 AM, Dietrich diet...@gmail.com wrote: [...]
Re: Item catagorization problem.
Doesn't indexing/analyzing do this to some degree anyway? Not sure of the algorithm, but something like: how often, how near the top, how many different forms, subject or object of a sentence. That has to have some relevance to what category something is in. The simplest extension to that would be something like a 'sub-vocabulary' cross listing: if such and such words were high relevance, then the subject is about this or that. The smartest categorizer is your users, though. So the best way to make that list is to keep track of how close to the top of the search results a user responded, what the words were, and how many search attempts it took. That's what Netflix does. Their goal is to have users get something in the top three off the first search attempt. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, December 23, 2010 10:00:05 AM Subject: Re: Item catagorization problem. What you're asking for appears to me to be auto-categorization, and there's nothing built into Solr to do this. Somehow you need to analyze the documents at index time and add the proper categories, but I have no clue how. This is especially hard with short fields, since most auto-categorization algorithms try to do some statistical analysis of the document to figure this out. Best Erick On Thu, Dec 23, 2010 at 8:12 AM, Hasnain hasn...@hotmail.com wrote: Hi all, I am using Solr in my web application for search purposes. However, I am having a problem with the default behaviour of the Solr search.
From my understanding, if I query for a keyword, let's say Laptop, preference is given to result rows having more occurrences of the search keyword Laptop in the field name. This, however, is producing undesirable scenarios, for example: 1. I index an item A with name value Sony Laptop. 2. I index another item B with name value Laptop bags for laptops. 3. I search for the keyword Laptop. According to the default behaviour, precedence would be given to item B, since the keyword appears more times in the name field for that item. In my schema, I have another field by the name of Category and, for example's sake, let's assume that my application supports only two categories: computers and accessories. Now, what I require is a mechanism to assign correct categories to the items during item indexing, so that this field can be used to better filter the search results; item A would belong to the Computers category and item B would belong to the Accessories category. So then, searching for Laptop would only look for items in the Computers category and return item A only. I would like to point out here that setting the category field manually is not an option, since the data might be in the vicinity of thousands of records. I am not asking for an in-depth algorithm; just a high-level design would be sufficient to set me in the right direction. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Item-catagorization-problem-tp2136415p2136415.html Sent from the Solr - User mailing list archive at Nabble.com.
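A high-level sketch of the kind of rule-based categorizer being asked about (the category names, vocabularies, and priority rule are hypothetical, not any Solr feature): assign the category at index time from keyword vocabularies, with accessory terms taking priority so that "Laptop bags" lands in Accessories even though it mentions laptops more often.

```python
# Hypothetical keyword vocabularies; in practice these would be curated per
# application, or derived from user click data as suggested in the thread.
ACCESSORY_TERMS = {"bag", "bags", "case", "cases", "sleeve", "charger"}
COMPUTER_TERMS = {"laptop", "laptops", "notebook", "desktop", "pc"}


def categorize(name):
    """Assign a category to an item name before indexing it into Solr."""
    tokens = {t.strip(".,").lower() for t in name.split()}
    # Accessory terms win: "Laptop bags for laptops" is a bag, not a laptop.
    if tokens & ACCESSORY_TERMS:
        return "Accessories"
    if tokens & COMPUTER_TERMS:
        return "Computers"
    return None
```

The resulting field can then be used as a filter query (e.g. fq=category:Computers) so that a search for Laptop only matches items in that category.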
Re: full text search in multiple fields
Sorry to bother you again, but it still doesn't seem to work all the time... This (what you solved earlier) works: q=title_search:Pappegay&defType=lucene&fl=id,title But for another location, whose value in the DB is de tuinkamer: when I query the id of that location (q=id:431&fl=id,title), the location is found, so it IS indexed... But this query DOESN'T work: q=title_search:tuinkamer*&defType=lucene&fl=id,title And this one DOES: q=title_search:tuin*&defType=lucene&fl=id,title For me this is unexpected... what can it be? -- View this message in context: http://lucene.472066.n3.nabble.com/full-text-search-in-multiple-fields-tp1888328p2137983.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: full text search in multiple fields
But for another location, whose value in the DB is de tuinkamer: when I query the id of that location (q=id:431&fl=id,title), the location is found, so it IS indexed... But this query DOESN'T work: q=title_search:tuinkamer*&defType=lucene&fl=id,title And this one DOES: q=title_search:tuin*&defType=lucene&fl=id,title For me this is unexpected... what can it be? As you can verify from /solr/admin/analysis.jsp, tuinkamer is reduced to tuinkam by EnglishPorterFilterFactory. So it is expected/normal that q=title_search:tuinkamer* won't return that document. Remember that tuinkamer* is not analyzed before being tested against what is indexed. That said, if you plan on using wildcards, remove EnglishPorterFilterFactory from your analyzers.
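Following the advice above, a sketch of a field type with the stemmer removed (the type name and analyzer chain are illustrative; adapt them to the existing schema):

```xml
<!-- A text field type without EnglishPorterFilterFactory, so an indexed
     token like "tuinkamer" stays intact and tuinkamer* can match it. -->
<fieldType name="text_nostem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After changing the analyzer, the field must be reindexed for the new tokens to take effect.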
Re: synonyms database
Hi ramzesua, Synonym lists will often be application specific and will of course be language specific. Given this, I don't think you can talk about a generic Solr synonym list; it just won't be very helpful in lots of cases. What are you hoping to achieve with your synonyms for your app? On 23 December 2010 11:50, ramzesua michaelnaza...@gmail.com wrote: Hi all. Where can I get a synonyms database for Solr? -- View this message in context: http://lucene.472066.n3.nabble.com/synonyms-database-tp2136076p2136076.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: full text search in multiple fields
@iorixxx: removing that line did solve the problem, thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/full-text-search-in-multiple-fields-tp1888328p2138629.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Spellcheker automatically tokenizes on period marks
Is it possible that the spellcheck query can be configured to stop tokenizing on period marks through a parameter, rather than through the analyzer? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Spellcheker-automatically-tokenizes-on-period-marks-tp2131844p2138753.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 1.4.1 stats component count not matching facet count for multi valued field
: I have a facet field called option which may be multi-valued and : a weight field which is single-valued. : : When I use the Solr 1.4.1 stats component with a facet field, i.e. ... : I get conflicting results for the stats count result a jira search for solr stats multivalued would have given you... https://issues.apache.org/jira/browse/SOLR-1782 -Hoss
Re: DIH for taxonomy faceting in Lucid webcast
The DIH lets you code in JavaScript; you can do anything. On 12/23/10, Erick Erickson erickerick...@gmail.com wrote: SolrJ is often used when DIH doesn't do what you wish. Using SolrJ is really quite easy, but you're doing the DB queries yourself, often with the appropriate JDBC driver. Within DIH, the transformers, as Chris says, *might* work for you. Best Erick On Wed, Dec 22, 2010 at 6:16 PM, Andy angelf...@yahoo.com wrote: --- On Wed, 12/22/10, Chris Hostetter hossman_luc...@fucit.org wrote: : 2) Once I have the fully spelled out category path such as : NonFic/Science, how do I turn that into 0/NonFic : 1/NonFic/Science using the DIH? I don't have any specific suggestions for you -- I've never tried it in DIH myself. The ScriptTransformer might be able to help you out, but I'm not sure. Thanks Chris. What did you use to generate those encodings if not DIH? -- Lance Norskog goks...@gmail.com
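A sketch of what that could look like with DIH's script support (the entity, column, and function names are hypothetical): a JavaScript transformer that expands a category path such as NonFic/Science into depth-prefixed tokens like 0/NonFic and 1/NonFic/Science for taxonomy faceting.

```xml
<dataConfig>
  <!-- Rhino JavaScript runs inside the JVM; java.util classes are available. -->
  <script><![CDATA[
    function addCategoryLevels(row) {
      var path = row.get('category');            // e.g. "NonFic/Science"
      if (path != null) {
        var parts = String(path).split('/');
        var levels = new java.util.ArrayList();
        var prefix = '';
        for (var i = 0; i < parts.length; i++) {
          prefix = (i == 0) ? parts[i] : prefix + '/' + parts[i];
          levels.add(i + '/' + prefix);          // "0/NonFic", "1/NonFic/Science"
        }
        row.put('category_path', levels);        // multivalued field
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="books" query="select id, category from books"
            transformer="script:addCategoryLevels"/>
  </document>
</dataConfig>
```

Returning a List from the transformer populates a multiValued field, which is what hierarchical faceting needs.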
Problem of results ordering
When I search guitar center 94305, it gives the results: guitar center, guitar center Hollywood, guitar center 94305, guitar center 94305 location. But I want the results to be like this: guitar center 94305, guitar center 94305 location, guitar center, guitar center Hollywood. How can I make the results that match all keywords come first? Or how can I reduce the weight of a word that appears a second or more time? Thanks Ruixiang
Re: Problem of results ordering
What does your query look like? Especially, what is the output when you append &debugQuery=on? You can examine the scoring at the end of the response to gain more insight. Best Erick On Thu, Dec 23, 2010 at 8:34 PM, Ruixiang Zhang rxzh...@gmail.com wrote: When I search guitar center 94305, it gives the results: guitar center, guitar center Hollywood, guitar center 94305, guitar center 94305 location. But I want the results to be like this: guitar center 94305, guitar center 94305 location, guitar center, guitar center Hollywood. How can I make the results that match all keywords come first? Or how can I reduce the weight of a word that appears a second or more time? Thanks Ruixiang
Re: Problem of results ordering
Try boosting 94305, as in guitar center 94305^10. On Fri, Dec 24, 2010 at 9:23 AM, Erick Erickson [via Lucene] wrote: What does your query look like? Especially, what is the output when you append &debugQuery=on? You can examine the scoring at the end of the response to gain more insight. Best Erick On Thu, Dec 23, 2010 at 8:34 PM, Ruixiang Zhang wrote: When I search guitar center 94305, it gives the results: guitar center, guitar center Hollywood, guitar center 94305, guitar center 94305 location. But I want the results to be like this: guitar center 94305, guitar center 94305 location, guitar center, guitar center Hollywood. How can I make the results that match all keywords come first? Or how can I reduce the weight of a word that appears a second or more time? Thanks Ruixiang -- Kumar Anurag -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-of-results-ordering-tp2139314p2139978.html Sent from the Solr - User mailing list archive at Nabble.com.
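For reference, the suggestion above as a full request against the standard handler (host, port, and core path are assumptions), with debugQuery enabled as Erick advised so the per-document scoring can be inspected in the response:

```text
http://localhost:8983/solr/select?q=guitar+center+94305^10&debugQuery=on
```

In standard Lucene query syntax, 94305^10 boosts only that term, pulling documents that contain it toward the top.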
RE: Solr 1.4.1 stats component count not matching facet count for multi valued field
Interesting, the wiki page on StatsComponent says multi-valued fields may be slow, and may use lots of memory. http://wiki.apache.org/solr/StatsComponent Apparently it should also warn that multi-valued fields may not work at all? I'm going to add that with a link to the JIRA ticket. From: Chris Hostetter [hossman_luc...@fucit.org] Sent: Thursday, December 23, 2010 7:22 PM To: solr-user@lucene.apache.org Subject: Re: Solr 1.4.1 stats component count not matching facet count for multi valued field : I have a facet field called option which may be multi-valued and : a weight field which is single-valued. : : When I use the Solr 1.4.1 stats component with a facet field, i.e. ... : I get conflicting results for the stats count result a jira search for solr stats multivalued would have given you... https://issues.apache.org/jira/browse/SOLR-1782 -Hoss
RE: Solr 1.4.1 stats component count not matching facet count for multi valued field
: Interesting, the wiki page on StatsComponent says multi-valued fields : may be slow, and may use lots of memory. : http://wiki.apache.org/solr/StatsComponent *stats* over multivalued fields work, but use lots of memory -- that bug only hits you when you compute stats over any field faceted by a multivalued field. -Hoss
RE: Solr 1.4.1 stats component count not matching facet count for multi valued field
Aha! Thanks, sorry, I'll clarify on my wiki edit. From: Chris Hostetter [hossman_luc...@fucit.org] Sent: Friday, December 24, 2010 12:11 AM To: solr-user@lucene.apache.org Subject: RE: Solr 1.4.1 stats component count not matching facet count for multi valued field : Interesting, the wiki page on StatsComponent says multi-valued fields : may be slow , and may use lots of memory. : http://wiki.apache.org/solr/StatsComponent *stats* over multivalued fields work, but use lots of memory -- that bug only hits you when you compute stats over any field, that are faceted by a multivalued field. -Hoss
Re: error in html???
Hi Erick, Every result comes in XML format. But when we get any error, like HTTP 500 or HTTP 400, we get it in HTML format. My question is: can't we turn that HTML error response into JSON (or XML)? Regards, satya
Map failed at getSearcher
Hi all, I have created a new index (using the Solr trunk version from 17th December, running on Windows 7, Tomcat 6, 64-bit JVM) with around 1.1 billion documents (index size around 550GB, mergeFactor=20). After the (csv) import I committed the data and got this error: HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. - java.lang.RuntimeException: java.io.IOException: Map failed at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1095) at org.apache.solr.core.SolrCore.<init>(SolrCore.java:587) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:660) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:412) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:294) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86) at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295) at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422) at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115) at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4001) at org.apache.catalina.core.StandardContext.start(StandardContext.java:4651) at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791) at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:546) at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:637) at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:563) at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:498) at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1277) at
org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:321) at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119) at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053) at org.apache.catalina.core.StandardHost.start(StandardHost.java:785) at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045) at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:445) at org.apache.catalina.core.StandardService.start(StandardService.java:519) at org.apache.catalina.core.StandardServer.start(StandardServer.java:710) at org.apache.catalina.startup.Catalina.start(Catalina.java:581) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289) at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414) Caused by: java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(Unknown Source) at org.apache.lucene.store.MMapDirectory$MultiMMapIndexInput.<init>(MMapDirectory.java:327) at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:209) at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:68) at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:208) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:529) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:504) at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:123) at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:91) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:623) at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:86) at org.apache.lucene.index.IndexReader.open(IndexReader.java:437) at
org.apache.lucene.index.IndexReader.open(IndexReader.java:316) at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1084) ... 33 more Caused by: java.lang.OutOfMemoryError: Map failed at sun.nio.ch.FileChannelImpl.map0(Native Method) ... 48 more I can see that the error goes down into Lucene and Java, but I don't have a clue what I should do... Any suggestions? Thanks and merry Christmas :) Rok
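Map failed from FileChannelImpl.map usually means the process ran out of virtual address space while memory-mapping the (550GB) index. As a sketch of one thing to try (an assumption, not a confirmed fix from this thread: it presumes your trunk build supports pluggable directory factories in solrconfig.xml, so verify the class name against your version), switch away from MMapDirectory:

```xml
<!-- solrconfig.xml: use a non-mmap directory implementation (assumption:
     this factory is available in your build; check before relying on it). -->
<directoryFactory name="DirectoryFactory" class="solr.SimpleFSDirectoryFactory"/>
```

A plain-file directory reads through normal file I/O instead of mapping segment files into the address space, trading some speed for robustness with very large indexes on Windows.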
Re: Total number of groups after collapsing
Hi, I figured out a better way of doing it. The following query would be a better option: q=*:*&start=2147483647&rows=0&collapse=true&collapse.field=abc&collapse.threshold=1 Thanks, Samarth On Thu, Dec 23, 2010 at 8:57 PM, samarth s samarth.s.seksa...@gmail.com wrote: Hi, I have been using collapsing in my application. I have a requirement of finding the number of groups matching some filter criteria. Something like a COUNT(DISTINCT columnName). The only solution I can currently think of is using the query: q=*:*&rows=Integer.MAX_VALUE&start=0&fl=score&collapse.field=abc&collapse.threshold=1&collapse.type=normal I get the number of groups from 'numFound', but this seems like a bad solution in terms of performance. Is there a cleaner way? Thanks, Samarth