Re: Solr replication and spellcheck data
This is not supported by the Java replication handler yet, but is planned for later: https://issues.apache.org/jira/browse/SOLR-866

On Wed, Jul 29, 2009 at 4:04 AM, Ian Sugar iansu...@gmail.com wrote: Hi, I would like to make use of the new replication mechanism [1] to set up a master-slaves configuration, but from quick reading and searching around I can't seem to find a way to replicate the spelling index in addition to the main search index. (We use the spellcheck component.) Is there a way to do it, or would we have to go the cron/script/rsync way [2]? Any pointers appreciated. I probably missed something! Ian [1] http://wiki.apache.org/solr/SolrReplication [2] http://wiki.apache.org/solr/CollectionDistribution -- Noble Paul | Principal Engineer | AOL | http://aol.com
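For reference, a minimal sketch of the master-side config from the SolrReplication wiki page; note that it covers the main index only, and confFiles ships configuration files, not the auxiliary spellcheck index (file names are illustrative):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <!-- configuration files to ship to slaves; not an index directory -->
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>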
Re: query in solr lucene
I tried using AND, but it even returned doc 3, which was not required. Hence my problem still persists... regards, Sushan

At 06:59 AM 7/29/2009, Avlesh Singh wrote: "No, phrase query would match docs 2 and 3. Sushan only wants doc 2 as I read it." Sorry, my bad. I did not read properly before replying. Cheers Avlesh

On Wed, Jul 29, 2009 at 3:23 AM, Erick Erickson erickerick...@gmail.com wrote: No, a phrase query would match docs 2 and 3. Sushan only wants doc 2 as I read it. You might have some joy with KeywordAnalyzer, which does not break the incoming stream up into tokens. You have to be careful, though, because it also won't fold case, so 'Hello' would not match 'hello'. Best Erick

On Tue, Jul 28, 2009 at 11:11 AM, Avlesh Singh avl...@gmail.com wrote: You should perform a PhraseQuery on the required field. Meaning, http://your-solr-host:port/your-core-path/select?q=fieldName:"Hello how are you sushan" would work for you. Cheers Avlesh

2009/7/28 Gérard Dupont ger.dup...@gmail.com: Hi Sushan, I'm not an expert on Solr, just a beginner, but it appears to me that you may have a default 'OR' combination of keywords, which would explain this behavior. Try modifying the configuration for an 'AND' combination. cheers

On Tue, Jul 28, 2009 at 16:49, Sushan Rungta s...@clickindia.com wrote: I am extremely sorry for responding late, as I was ill these past few days. My problem is explained below with an example. I have three documents with the following contents: 1. "Hello how are you" 2. "Hello how are you sushan" 3. "Hello how are you sushan. I am fine." When I search for the query "Hello how are you sushan", I should only get document 2 in my result. I hope this gives you all a better insight into my problem. regards, Sushan Rungta -- Gérard Dupont Information Processing Control and Cognition (IPCC) - EADS DS http://weblab-project.org Document Learning team - LITIS Laboratory
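A sketch of the kind of field type Erick alludes to, using KeywordTokenizerFactory (the Solr counterpart of Lucene's KeywordAnalyzer) so the whole field value stays one token, with a lowercase filter added to address the case-folding caveat (the type name is illustrative):

<fieldType name="exact_text" class="solr.TextField">
  <analyzer>
    <!-- the entire field value becomes a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- fold case so 'Hello' matches 'hello' -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>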
Re: highlighting performance
Hey Matt: I have been facing the same issue. I have a text field that I highlight along with other fields (maybe 10 other fields). But if I enable highlighting on this text field, which contains a large number of characters/words (more than 100,000 characters), highlighting performance suffers. Queries return in about 15-20 seconds with this field enabled in highlighting, compared to less than a second without it. I did try termVectors="true", but I did not see any performance gain either. Just wondering if you were able to solve your issue or tweak the performance in any other way. BTW, I use Solr 1.3. ~Ravi

goodieboy wrote: Thanks Otis. I added termVectors="true" for those fields, but there isn't a noticeable difference. So, just to be a little more clear, the dynamic fields I'm adding... there might be hundreds. Do you see this as a problem? Thanks, Matt

On Fri, May 15, 2009 at 7:48 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Matt, I believe indexing those fields that you will use for highlighting with term vectors enabled will make things faster (and your index a bit bigger). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----- From: Matt Mitchell goodie...@gmail.com To: solr-user@lucene.apache.org Sent: Friday, May 15, 2009 5:08:23 PM Subject: highlighting performance Hi, I'm experimenting with highlighting and am noticing a big drop in performance with my setup. I have documents that use quite a few dynamic fields (20-30). The fields are multiValued stored/indexed text fields, each with a few paragraphs worth of text. My hl.fl param is set to *_t. What kinds of things can I tweak to make this faster? Is it because I'm highlighting so many different fields? Thanks, Matt

Quoted from: http://www.nabble.com/highlighting-performance-tp23567323p23713406.html
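For reference, enabling term vectors on a field in schema.xml looks like this (the field name is illustrative); positions and offsets are what highlighters can exploit to avoid re-analyzing the stored text:

<field name="body_t" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>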
Re: Is there a multi-shard optimize message?
On Wed, Jul 29, 2009 at 2:48 AM, Phillip Farber pfar...@umich.edu wrote: Normally to optimize an index you POST <optimize/> to /solr/update. Is there any way to POST an optimize message to one instance and have it propagate to all shards, sort of like a select? /solr-shard-1/select?q=dog...&shards=shard-1,shard-2

No, you'll need to send optimize to each host separately. -- Regards, Shalin Shekhar Mangar.
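A quick way to do that from a shell, assuming two shard hosts on the default port (hostnames are illustrative):

for host in solr-shard-1 solr-shard-2; do
  curl http://$host:8983/solr/update -H "Content-Type: text/xml" --data-binary '<optimize/>'
done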
refering/alias other Solr documents
Hi all: Is there anything in Solr that will allow documents to refer to each other? In other words, if a search for "abc" matches document 1, I should be able to return document 2 even though document 2 does not have any fields matching "abc". Here is the scenario with some more details. Solr version: 1.3. Scenario: 1) Solr document 1 with some field title=abc, and Solr document 2 with its own data. 2) User searches for "abc" and gets document 1, as it matches on the title field. Expected results: when the user searches for "abc" he should also get document 2 along with document 1. I understand one way of doing this is to make sure document 2 has all the contents of document 1. But this introduces the issue of keeping the two documents (and hence their Solr index entries) in sync with each other. I think I am looking for a mechanism like this: document 1 refers to document 2 and document 3; hence, whenever document 1 is part of the search results, document 2 and document 3 will also be returned as search results. I may be totally off on this expectation, but I am trying to solve a "contains" problem where, let's say, a book (represented as document 1 in Solr) contains chapters (represented by documents 2, 3, 4...) in Solr. I hope this is not too confusing ;) TIA ~Ravi Gidwani
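One client-side workaround, sketched with SolrJ 1.3 (the refers_to field and URL are illustrative assumptions, not an existing Solr feature): store the ids of related documents in each document, then issue a follow-up query for them.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class RelatedDocs {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // first pass: the user's query
    QueryResponse rsp = server.query(new SolrQuery("title:abc"));
    for (SolrDocument doc : rsp.getResults()) {
      String id = (String) doc.getFieldValue("id");
      // second pass: fetch documents that declare this id in a refers_to field
      QueryResponse related = server.query(new SolrQuery("refers_to:" + id));
      System.out.println(id + " has " + related.getResults().getNumFound() + " related docs");
    }
  }
}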
Boosting ('bq') on multi-valued fields
Hey, I have a field defined as such:

<field name="site_id" type="string" indexed="true" stored="false" multiValued="true"/>

with the string type defined as:

<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

When I try using some query-time boost parameters via bq on values of this field, it seems to behave strangely for documents that actually have multiple values: if I boost a particular value (site_id:5^1.1), it seems like all the cases where this field is populated with multiple values (i.e. a document with field value 5|6) do not get boosted at all. I verified this using debugQuery=true&explainOther=doc_id:document_with_multiple_values. Is this a known issue/bug? Any workarounds? (I'm using a nightly Solr build from a few months back.) Thanks, -Chak
Re: update some index documents after indexing process is done with DIH
On Tue, Jul 28, 2009 at 5:17 PM, Marc Sturlese marc.sturl...@gmail.com wrote: That really sounds like the best way to reach my goal. How could I invoke a listener from the newSearcher? Would it be something like:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">solr</str><str name="start">0</str><str name="rows">10</str></lst>
    <lst><str name="q">rocks</str><str name="start">0</str><str name="rows">10</str></lst>
    <lst><str name="q">static newSearcher warming query from solrconfig.xml</str></lst>
  </arr>
</listener>
<listener event="newSearcher" class="solr.MyCustomListener"/>

And MyCustomListener would be the class that opens the reader:

RefCounted<SolrIndexSearcher> searchHolder = null;
try {
  searchHolder = dataImporter.getCore().getSearcher();
  IndexReader reader = searchHolder.get().getReader();
  // Here I iterate over the reader doing document modifications
} catch (Exception ex) {
  LOG.info("error");
} finally {
  if (searchHolder != null) searchHolder.decref();
}

you may not be able to access the DIH API from a newSearcher event. But the API would give you the searcher directly as a method parameter.

Finally, to access documents and add fields to some of them, I have thought of using the SolrDocument classes. Can you please point me to where something similar is done in the Solr source (I mean creation of SolrDocuments and conversion of them to proper Lucene documents)? Does this way of reaching the goal make sense? Thanks in advance

Noble Paul നോബിള് नोब्ळ्-2 wrote: when a core is reloaded the event fired is firstSearcher. newSearcher is fired when a commit happens

On Tue, Jul 28, 2009 at 4:19 PM, Marc Sturlese marc.sturl...@gmail.com wrote: Ok, but if I handle it in a newSearcher listener it will be executed every time I reload a core, won't it? The thing is that I want to use an IndexReader to load into a HashMap some doc fields of the index and, depending on the values of some field docs, modify other docs. It's very memory consuming (I have tested it with a simple Lucene script). That's why I wanted to do it just after the indexing process. My ideal case would be to do it in the commit function of DirectUpdateHandler2.java, just before writer.optimize(cmd.maxOptimizeSegments) is executed. But I don't want to mess with that code... so I am trying to find the best way to do this as a plugin instead of a hack, if possible. Thanks in advance

Noble Paul നോബിള് नोब्ळ्-2 wrote: It is best handled as a 'newSearcher' listener in solrconfig.xml. onImportEnd is invoked before committing

On Tue, Jul 28, 2009 at 3:13 PM, Marc Sturlese marc.sturl...@gmail.com wrote: Hey there, I would like to be able to do something like: after the indexing process is done with DIH, I would like to open an IndexReader, iterate over all docs, modify some of them depending on others, and delete some others. I can easily do this coding directly against Lucene, but would like to know if there's a way to do it with Solr using the SolrDocument or SolrInputDocument classes. I have thought of using SolrJ or the DIH listener onImportEnd, but am not sure if I can get an IndexReader there. Any advice? Thanks in advance
-- Noble Paul | Principal Engineer | AOL | http://aol.com
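A minimal sketch of such a custom listener, assuming the SolrEventListener interface as it exists in Solr 1.3/1.4 (the class name is illustrative); as noted above, the searcher arrives as a method parameter, so the DIH API is not needed:

import org.apache.lucene.index.IndexReader;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.search.SolrIndexSearcher;

public class MyCustomListener implements SolrEventListener {
  public void init(NamedList args) {}

  public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) {
    // the new searcher is handed in directly; no RefCounted bookkeeping needed here
    IndexReader reader = newSearcher.getReader();
    // iterate over the reader and inspect documents here
  }

  public void postCommit() {}
}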
Re: FieldCollapsing: Two response elements returned?
I've applied the latest collapse-field-related patch (patch-3) and it doesn't work. Does anyone know how I can get only the collapsed response?

29-jul-2009 11:05:21 org.apache.solr.common.SolrException log
GRAVE: java.lang.ClassCastException: org.apache.solr.handler.component.CollapseComponent cannot be cast to org.apache.solr.request.SolrRequestHandler
at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:150)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:539)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:381)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:241)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:115)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3800)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4450)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:526)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:987)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:909)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:495)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1206)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:314)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:722)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:583)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)

2009/7/28 Marc Sturlese marc.sturl...@gmail.com: That's probably because you are using both the CollapseComponent and the QueryComponent. I think the two or three latest patches allow full replacement of the QueryComponent. You should just replace:

<searchComponent name="query" class="org.apache.solr.handler.component.QueryComponent"/>

with:

<searchComponent name="query" class="org.apache.solr.handler.component.CollapseComponent"/>

This will sort out your problem and make response times faster.

Jay Hill wrote: I'm doing some testing with field collapsing, and early results look good. One thing seems odd to me, however.
I would expect to get back one block of results, but I get two - the first one contains the collapsed results, the second one contains the full non-collapsed results:

<result name="response" numFound="11" start="0"> ... </result>
<result name="response" numFound="62" start="0"> ... </result>

This seems somewhat confusing. Is this intended or is this a bug? Thanks, -Jay

-- Lici
solr/home in web.xml relative to web server home
Hi all, the environment variable (env-entry) in web.xml to configure solr/home is relative to the web server's working directory. I find this unusual, as all the servlet paths are relative to the web application's directory (the webapp context, that is). So I specified solr/home relative to the web app dir as well, at first. I think it makes deployment in an unknown environment, or in different environments using a simple war, more complex than it needs to be. If a webapp-relative path inside the war file could be used, the configuration of Solr (and cores) could be included in the war file completely, with no outside dependency - except, of course, for the data directory, if that is to go someplace else. (In my case, I want to deliver the Solr web application including a custom entity processor, which is why I want to include the Solr war as part of my release cycle. It is easier to deliver that to the system administration than to provide them with partial packages they have to install into an already installed war, imho.) Am I the only one who has run into this? Thanks for any input! Chantal -- Chantal Ackermann
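For reference, the env-entry in question looks like this in Solr's web.xml (the path value is illustrative):

<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-value>/path/to/solr/home</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>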
Re: highlighting performance
Just an FYI, Lucene 2.9 has FastVectorHighlighter: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/search/vectorhighlight/package-summary.html

Features:
* fast for large docs
* support N-gram fields
* support phrase-unit highlighting with slops
* need Java 1.5
* highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
* take into account query boost to score fragments
* support colored highlight tags
* pluggable FragListBuilder
* pluggable FragmentsBuilder

Unfortunately, Solr hasn't incorporated it yet: https://issues.apache.org/jira/browse/SOLR-1268

Koji

ravi.gidwani wrote: Hey Matt: I have been facing the same issue. I have a text field that I highlight along with other fields (maybe 10 other fields). But if I enable highlighting on this text field, which contains a large number of characters/words (more than 100,000 characters), highlighting performance suffers. Queries return in about 15-20 seconds with this field enabled in highlighting, compared to less than a second without it. I did try termVectors="true", but I did not see any performance gain either. Just wondering if you were able to solve your issue or tweak the performance in any other way. BTW, I use Solr 1.3. ~Ravi
Re: debugQuery=true issue
Hi, thanks for your response. I'm still developing, so the schema is still in flux, which I guess explains it. Oh, and regarding the NPE: I updated my checkout and recompiled and now it's gone, so I guess it was fixed somewhere between revisions 787997 and 798482. Regards, gwk

Robert Petersen wrote: I had something similar happen where an optimize fixed an odd sorting/scoring problem. As I understand it, the optimize will clear out index 'lint' from old schemas/documents and thus could affect result scores, since all the term vectors (or something similar) are refreshed, etc.
Re: HTTP Status 500 - java.lang.RuntimeException: Can't find resource 'solrconfig.xml'
As Solr said in the log, it couldn't find solrconfig.xml in the classpath, solr.solr.home, or the cwd. My guess is that the relative path you set for solr.solr.home was incorrect. Why don't you try:

solr.solr.home=/home/huenzhao/search/tomcat6/bin/solr

instead of:

solr.solr.home=home/huenzhao/search/tomcat6/bin/solr

Koji

huenzhao wrote: Hi all, I used Ubuntu 8.10 as the Solr server OS, and set solr.solr.home=home/huenzhao/search/tomcat6/bin/solr. When I run Tomcat (the same Tomcat and Solr running on Windows XP have no problem), I get this error:

HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in null - java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in classpath or 'home/huenzhao/search/tomcat6/bin/solr/conf/', cwd=/home/huenzhao/search/tomcat6/bin
at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:194)
at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:162)
at org.apache.solr.core.Config.<init>(Config.java:100)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:113)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:70)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3696)
at ……

Does anybody know what to do? enzhao...@gmail.com
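One common way to set an absolute home for Tomcat is via JAVA_OPTS, e.g. in catalina.sh or an init script (the path is the one from this thread):

export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/home/huenzhao/search/tomcat6/bin/solr"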
Re: solr/home in web.xml relative to web server home
On Wed, Jul 29, 2009 at 2:42 PM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote: Hi all, the environment variable (env-entry) in web.xml to configure solr/home is relative to the web server's working directory. I find this unusual, as all the servlet paths are relative to the web application's directory (the webapp context, that is). So I specified solr/home relative to the web app dir as well, at first. I think it makes deployment in an unknown environment, or in different environments using a simple war, more complex than it needs to be. If a webapp-relative path inside the war file could be used, the configuration of Solr (and cores) could be included in the war file completely, with no outside dependency - except, of course, for the data directory, if that is to go someplace else. (In my case, I want to deliver the Solr web application including a custom entity processor, which is why I want to include the Solr war as part of my release cycle. It is easier to deliver that to the system administration than to provide them with partial packages they have to install into an already installed war, imho.)

You don't need to create a custom war for that. You can package the EntityProcessor into a separate jar and add it to the solr_home/lib directory. -- Regards, Shalin Shekhar Mangar.
Relevant results with DisMaxRequestHandler
Hello, I have noticed several strange behaviors with queries. I would like to share an example with you, so maybe you can explain to me what is going wrong. Using the following query:

http://localhost:8983/solr/others/select/?debugQuery=true&q=anna%20lewis&rows=20&start=0&fl=*&qt=dismax

I get back around 100 results. Here are the first two:

<doc>
  <str name="id">Person:151</str>
  <str name="name_s">Victoria Davisson</str>
</doc>
<doc>
  <str name="id">Person:37</str>
  <str name="name_s">Anna Lewis</str>
</doc>

And the related debug output:

57.998047 = (MATCH) sum of:
  0.048290744 = (MATCH) sum of:
    0.024546575 = (MATCH) max plus 0.01 times others of:
      0.024546575 = (MATCH) weight(text:anna^0.5 in 64288), product of:
        0.027395602 = queryWeight(text:anna^0.5), product of:
          0.5 = boost
          5.734427 = idf(docFreq=564, numDocs=30400)
          0.009554783 = queryNorm
        0.8960042 = (MATCH) fieldWeight(text:anna in 64288), product of:
          1.0 = tf(termFreq(text:anna)=1)
          5.734427 = idf(docFreq=564, numDocs=30400)
          0.15625 = fieldNorm(field=text, doc=64288)
    0.02374417 = (MATCH) max plus 0.01 times others of:
      0.02374417 = (MATCH) weight(text:lewi^0.5 in 64288), product of:
        0.026944114 = queryWeight(text:lewi^0.5), product of:
          0.5 = boost
          5.6399217 = idf(docFreq=620, numDocs=30400)
          0.009554783 = queryNorm
        0.88123775 = (MATCH) fieldWeight(text:lewi in 64288), product of:
          1.0 = tf(termFreq(text:lewi)=1)
          5.6399217 = idf(docFreq=620, numDocs=30400)
          0.15625 = fieldNorm(field=text, doc=64288)
  57.949757 = (MATCH) FunctionQuery(ord(name_s)), product of:
    1213.0 = ord(name_s)=1213
    5.0 = boost
    0.009554783 = queryNorm

5.006892 = (MATCH) sum of:
  0.038405567 = (MATCH) sum of:
    0.021955125 = (MATCH) max plus 0.01 times others of:
      0.021955125 = (MATCH) weight(text:anna^0.5 in 62632), product of:
        0.027395602 = queryWeight(text:anna^0.5), product of:
          0.5 = boost
          5.734427 = idf(docFreq=564, numDocs=30400)
          0.009554783 = queryNorm
        0.80141056 = (MATCH) fieldWeight(text:anna in 62632), product of:
          2.236068 = tf(termFreq(text:anna)=5)
          5.734427 = idf(docFreq=564, numDocs=30400)
          0.0625 = fieldNorm(field=text, doc=62632)
    0.016450444 = (MATCH) max plus 0.01 times others of:
      0.016450444 = (MATCH) weight(text:lewi^0.5 in 62632), product of:
        0.026944114 = queryWeight(text:lewi^0.5), product of:
          0.5 = boost
          5.6399217 = idf(docFreq=620, numDocs=30400)
          0.009554783 = queryNorm
        0.61053944 = (MATCH) fieldWeight(text:lewi in 62632), product of:
          1.7320508 = tf(termFreq(text:lewi)=3)
          5.6399217 = idf(docFreq=620, numDocs=30400)
          0.0625 = fieldNorm(field=text, doc=62632)
  4.968487 = (MATCH) FunctionQuery(ord(name_s)), product of:
    104.0 = ord(name_s)=104
    5.0 = boost
    0.009554783 = queryNorm

I'm using a simple boost function:

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">text^0.5 name_s^5.0</str>
    <str name="pf">name_s^5.0</str>
    <str name="bf">name_s^5.0</str>
  </lst>
</requestHandler>

Can anyone explain to me why the first result is on top (the query is 'anna lewis') with a huge weight, even though nothing in it is related (the weight seems to come from the name_s field)? A second, general question: is it possible to boost a field if the query matches the content of a field exactly? Thank you! Vincent
Re: facet.prefix question
Licinio Fernández Maurelo wrote: I'm trying to do some filtering on the count list retrieved by Solr when doing a faceting query. I'm wondering how I can use facet.prefix to get something like this:

Query: facet.field=foo&facet.prefix=A OR B

Response:

<lst name="facet_fields">
  <lst name="foo">
    <int name="A">12560</int>
    <int name="A*">5440</int>
    <int name="B**">2357</int>
    ...
  </lst>
</lst>

How can I achieve this behaviour? Best regards

You cannot set a query as the facet.prefix parameter. facet.prefix should be a prefix *string* of terms in the index, and you can set only one at a time. So I think you need to send two requests to get what you want:

...facet.field=foo&facet.prefix=A
...facet.field=foo&facet.prefix=B

Koji
Question about formatting the results returned from Solr
Hi all, not sure how good my title is, but here is a (hopefully) better explanation of what I mean. I am indexing a set of articles from a DB. Each article has an author. The author is saved in the DB as an author ID, which is a number. There is another table in the DB with more relevant information about the author. Basically it has columns like: id, firstname, lastname, email, userid. I set up the DIH so that it returns the userid, and it works fine:

<arr name="author">
  <str>jdoe</str>
  <str>msmith</str>
</arr>

Would it be possible to return all of the information about the author (first name, ...) as a subset of the results above? Here is what I mean:

<arr name="author">
  <arr name="jdoe">
    <str name="firstName">John</str>
    <str name="lastName">Doe</str>
    <str name="email">j...@doe.com</str>
  </arr>
  ...
</arr>

Something similar to that, at least... Not sure how descriptive I was, but any pointers would be highly appreciated. Cheers
Getting Tika to work in Solr 1.4 nightly
I am working with a Solr 1.4 nightly and am running it on a Windows machine. Solr is running from the example folder that was installed from the zip file. The only alteration I have made to this default installation is to add a simple Word document to the exampledocs folder. I am trying to get Tika to work in Solr. When I run tika-0.3.jar against a Word document, it outputs to the screen in XML format. I am not able to get Solr to run Tika and index the information in the sample Word document. I have looked at the following resources: the Solr mailing list archive (although I could have missed something here); the documentation and Getting Started pages on the Apache Tika website; and an article called Content Extraction with Tika at this website:

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika

This article talks about using curl. Is curl necessary, or does Solr have something already configured to do the same as curl? I have modified the solrconfig.xml file to include the request handler for the ExtractingRequestHandler. I used the modification that was commented out in the solrconfig.xml file. Here it is for reference:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="ext.map.Last-Modified">last_modified</str>
    <bool name="ext.ignore.und.fl">true</bool>
  </lst>
</requestHandler>

Is there some modification to this code that I need to make? Can someone please direct me to a source that can help me get this to work? Kevin Miller
Re: FieldCollapsing: Two response elements returned?
My last mail is wrong. Sorry.

On 29 July 2009 at 11:10, Licinio Fernández Maurelo licinio.fernan...@gmail.com wrote: I've applied the latest collapse-field-related patch (patch-3) and it doesn't work. Does anyone know how I can get only the collapsed response? ...

-- Lici
Re: Relevant results with DisMaxRequestHandler
On Jul 29, 2009, at 6:55 AM, Vincent Pérès wrote: Using the following query: http://localhost:8983/solr/others/select/?debugQuery=true&q=anna%20lewis&rows=20&start=0&fl=*&qt=dismax I get back around 100 results. ... Can anyone explain to me why the first result is on top (the query is 'anna lewis') with a huge weight, even though nothing in it is related (the weight seems to come from the name_s field)?

The ord function perhaps isn't doing what you want. It is returning the term position, and thus it appears "Anna Lewis" is the 104th name_s value in your index lexicographically. And of course "Victoria Davisson" is much further down, at the 1213th position. Maybe you want rord instead? But probably not...

A second general question... is it possible to boost a field if the query matches the content of a field exactly?
You can set dismax's qs (query slop) factor, which will boost documents where the user's terms are closer together (within the number of term positions specified). Erik
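For reference, a sketch of how the slop parameters sit in a dismax handler config (values are illustrative): ps applies to the automatic pf phrase boost, while qs applies to explicit phrase queries the user types.

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">text^0.5 name_s^5.0</str>
    <str name="pf">name_s^5.0</str>
    <!-- phrase slop for the pf boost -->
    <str name="ps">0</str>
    <!-- slop applied to phrase queries the user types -->
    <str name="qs">3</str>
  </lst>
</requestHandler>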
RE: Boosting ('bq') on multi-valued fields
Hey, I have a field defined as such: <field name="site_id" type="string" indexed="true" stored="false" multiValued="true"/> ... If I boost a particular value (site_id:5^1.1), it seems like all the cases where this field is populated with multiple values (i.e. a document with field value 5|6) do not get boosted at all. ... Is this a known issue/bug? Any workarounds?

There is no tokenization on 'string' fields, so a query for 5 does not match a doc with a value of 5|6 for this field. You could try using field type 'text' for this and see what you get. You may need to customize it to use the StandardAnalyzer or WordDelimiterFilterFactory to get the right behavior. Using the analysis tool in the Solr admin UI to experiment will probably be helpful. -Ken
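A sketch of a text field type along the lines Ken suggests, where WordDelimiterFilterFactory would split a raw value like 5|6 on the punctuation into the separate tokens 5 and 6 (the type name is illustrative):

<fieldType name="text_delim" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splits on non-alphanumeric characters such as '|' -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
  </analyzer>
</fieldType>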
Re: update some index documents after indexing process is done with DIH
From the newSearcher(..) method of a custom event listener that extends AbstractSolrEventListener, I can access the SolrIndexSearcher and all core properties, but I can't get a SolrIndexWriter. Do you know how I can get a SolrIndexWriter from there? That way I would be able to modify the documents (I need to modify them depending on the values of other documents, which is why I can't do it with a DIH delta-import). Thanks in advance
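Since Solr 1.3/1.4 has no partial document update, the usual route is to re-add the whole modified document through the update interface rather than obtaining the internal writer. A minimal SolrJ sketch (URL and field names are illustrative):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ReAddExample {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "123");               // same unique key overwrites the old document
    doc.addField("my_field", "new value");   // the modified field
    server.add(doc);
    server.commit();
  }
}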
RE: search suggest
To do a proper search-suggest feature, you have to index all the queries your system gets and search it with wildcards for matches on what the user has typed so far, for each user keystroke in the search box... usually with some timer logic to wait for a small hesitation in their typing.

-Original Message- From: Jack Bates [mailto:ms...@freezone.co.uk] Sent: Tuesday, July 28, 2009 10:54 AM To: solr-user@lucene.apache.org Subject: search suggest

How can I use Solr to make search suggestions? I'm thinking Google-style suggestions, which suggest more refined queries - vs. Freebase-style suggestions, which suggest top hits. I've been looking at the query params, http://wiki.apache.org/solr/StandardRequestHandler - and searching for "solr suggest" - but haven't figured out how to get search suggestions from Solr.
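One commonly used alternative to wildcard queries for this is facet.prefix against a field of indexed queries (the field name is illustrative). As the user types "ip", for instance:

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=query_text&facet.prefix=ip&facet.limit=10

Each returned facet value is a candidate suggestion, and its count can be used for ranking.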
Wildcard and boosting
Hey now! I do index-time boosting for my fields and just discovered that when searching with a trailing wildcard, the boosting is ignored. Will my boosting work with a wildcard if I do it at query time? And if so, is there a big performance difference? Is there some other method I can use to preserve my boosting? I do not need highlighting. Thanks, Jon Helgi
RE: refering/alias other Solr documents
Hi Ravi, this may help: http://wiki.apache.org/solr/HierarchicalFaceting

Steve

-Original Message- From: ravi.gidwani [mailto:ravi.gidw...@gmail.com] Sent: Wednesday, July 29, 2009 3:24 AM To: solr-user@lucene.apache.org Subject: refering/alias other Solr documents
Re: Getting Tika to work in Solr 1.4 nightly
Hi Kevin, the parameter names have changed in the latest Solr 1.4 builds... please see http://wiki.apache.org/solr/ExtractingRequestHandler

-Yonik
http://www.lucidimagination.com

On Wed, Jul 29, 2009 at 10:17 AM, Kevin Miller kevin.mil...@oktax.state.ok.us wrote: I am working with a Solr 1.4 nightly and am running it on a Windows machine. ... Can someone please direct me to a source that can help me get this to work?
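For reference, a minimal curl invocation against the extraction handler using the current parameter names documented on that wiki page (the file name and id are illustrative):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@sample.doc"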
Multi select faceting
Hi, we're using Lucid Imagination's LucidWorks Solr 1.3, and we have a requirement to implement multiple-select faceting, where the facet values show up as checkboxes and, despite checked options, all of the options continue to persist with counts. The best example I found is the search on Lucid Imagination's site: http://www.lucidimagination.com/search/ It appears the Solr 1.4 release has support for doing this with filter tagging (http://wiki.apache.org/solr/SimpleFacetParameters#head-f277d409b221b407d9c5430f552bf40ee6185c4c), but I was wondering if there is another way to accomplish this in 1.3? Mike
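For reference, the Solr 1.4 syntax referenced above tags a filter and excludes it when computing that facet (the field and tag names are illustrative):

...&fq={!tag=siteTag}site_id:5&facet=true&facet.field={!ex=siteTag}site_id

With the filter excluded, site_id counts are computed as if the checkbox filter were not applied, which is what keeps all options visible with counts.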
query and analyzers
Hi, what analyzer, tokenizer, or filter factory would I need to use to get wildcard matching to match where: Value: XYZ123, Query: XYZ1*? I have been messing with solr.WordDelimiterFilterFactory splitOnNumerics and preserveOriginal in both the index analyzer and the query analyzer. I also noticed it is different when I use quotes in the query - a phrase search. Unfortunately, I'm missing something, as I can't get it to work. Tim
Re: query and analyzers
What analyzer, tokenizer, or filter factory would I need to use to get wildcard matching to match where: Value: XYZ123, Query: XYZ1*?

StandardAnalyzer, WhitespaceAnalyzer.

I have been messing with solr.WordDelimiterFilterFactory splitOnNumerics and preserveOriginal in both the analyzer and the query. I also noticed it is different when I use quotes in the query - a phrase search. Unfortunately, I'm missing something, as I can't get it to work.

But I think your problem is not the analyzer. I guess there is a lowercase filter in your analyzer, and wildcard queries are not analyzed. Try querying xyz1*
Re: query in solr lucene
You may index your data using a delimiter, like $my-field-content$. While searching, perform a phrase query with the leading and trailing $ appended to the query string. Cheers Avlesh

On Wed, Jul 29, 2009 at 12:04 PM, Sushan Rungta s...@clickindia.com wrote: I tried using AND, but it even provided me doc 3 which was not required. Hence my problem still persists... regards, Sushan
Re: search suggest
Autosuggest is something that would be very useful to build into Solr, as many search projects require it. I'd recommend indexing relevant terms/phrases into a ternary search tree, which is compact and performant. Using a wildcard query will likely not be as fast as a ternary tree, and I'm not sure how phrases would be handled? http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysi

It would be good to separate out the TernaryTree from analysis/compound into Lucene core, or into its own contrib. Also see http://issues.apache.org/jira/browse/LUCENE-625 which improves relevancy using click-through rates. I'll open an issue in Solr to get this one going.

On Wed, Jul 29, 2009 at 9:12 AM, Robert Petersen rober...@buy.com wrote: To do a proper search-suggest feature, you have to index all the queries your system gets and search it with wildcards for matches on what the user has typed so far... 
Re: search suggest
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/compound/hyphenation/TernaryTree.html

On Wed, Jul 29, 2009 at 12:08 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Autosuggest is something that would be very useful to build into Solr, as many search projects require it. ...
Visualizing Semantic Journal Space (large scale) using full-text
I thought the Lucene and Solr communities would find this interesting: my collaborators and I have used LuSql, Lucene, and Semantic Vectors to visualize a semantic journal space (kind of like 'Maps of Science') for a large (5.7 million articles) journal article collection, using only the full text (no metadata). For more info and a howto: http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html Glen Newton
RE: query and analyzers
This was the definition I was last working with (I've been playing with setting the various parameters).

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
            generateNumberParts="0" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="0" splitOnNumerics="0"
            preserveOriginal="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
            generateNumberParts="0" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="0" splitOnNumerics="0"
            preserveOriginal="1"/>
  </analyzer>
</fieldType>

-Original Message- From: AHMET ARSLAN [mailto:iori...@yahoo.com] Sent: Wednesday, July 29, 2009 11:55 AM To: solr-user@lucene.apache.org Subject: Re: query and analyzers

What analyzer, tokenizer, filter factory would I need to use to get wildcard matching to match where: Value: XYZ123 Query: XYZ1*

StandardAnalyzer, WhitespaceAnalyzer.

I have been messing with solr.WordDelimiterFilterFactory splitOnNumerics and preserveOriginal in both the analyzer and the query. I also noticed it is different when I use quotes in the query - phrase search. Unfortunately, I'm missing something, as I can't get it to work.

But I think your problem is not the analyzer. I guess there is a lowercase filter in your analyzer, and wildcard queries are not analyzed. Try querying xyz1*
RE: query and analyzers
In order to match (query) XYZ1* to (document) XYZ123 you do not need WordDelimiterFilterFactory. You need a tokenizer that recognizes XYZ123 as one token, and WhitespaceTokenizer is one of them. As I can see from the fieldType named text_ws, you want to use WhitespaceTokenizerFactory, and there is no LowercaseFilter in it, so there is no problem there. Just remove the WordDelimiterFilterFactory (from both the query and index analyzers) and it should work. Ahmet
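For reference, the stripped-down field type Ahmet describes would look something like this (a sketch in the same schema.xml style; with the filter gone, the index and query analyzers are identical, so a single analyzer element suffices):

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>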
RE: query and analyzers
That did it, thanks! I thought that was how it should work, but I guess I somehow got out of sync at one point, which led me to dive deeper into it than I needed to. -Original Message- From: AHMET ARSLAN [mailto:iori...@yahoo.com] Sent: Wednesday, July 29, 2009 12:52 PM To: solr-user@lucene.apache.org Subject: RE: query and analyzers In order to match (query) XYZ1* to (document) XYZ123 you do not need WordDelimiterFilterFactory. ...
Re: search suggest
Also watch out that you have a good stopwords list, otherwise the suggestions won't be helpful for the user.

Jack Bates wrote: How can I use Solr to make search suggestions? I'm thinking Google-style suggestions, which suggest more refined queries - vs. Freebase-style suggestions, which suggest top hits. ...

-- manuel aldana ald...@gmx.de software-engineering blog: http://www.aldana-online.de
RE: search suggest
Simple-minded autosuggest can just not tokenize the phrases at all, so the wildcards simply complete whatever the user has typed so far, including spaces. Upon encountering a space, though, autosuggest should wait to make more suggestions until the user has typed at least a couple of letters of the next word. That is the way I did it last time, using a different search engine. It'd sure be kewl if this became a core feature of Solr! I like the idea of the tree approach, sounds much faster. The root is the fewest letters to start suggestions and the leaves are the full phrases?

-Original Message- From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] Sent: Wednesday, July 29, 2009 12:09 PM To: solr-user@lucene.apache.org Subject: Re: search suggest

Autosuggest is something that would be very useful to build into Solr, as many search projects require it. I'd recommend indexing relevant terms/phrases into a Ternary Search Tree, which is compact and performant. ...
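A short SolrJ sketch of the untokenized approach Robert describes (the field name "suggest", the server URL, and the sample input are made up for illustration): past user queries are indexed into a string field, and each keystroke runs a prefix query against it, with spaces escaped so the whole phrase stays one term.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SimpleSuggest {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // The user has typed "ipod na" so far; escape the space so the
        // wildcard applies to the whole untokenized phrase.
        String typedSoFar = "ipod na";
        String escaped = typedSoFar.replace(" ", "\\ ");

        SolrQuery query = new SolrQuery("suggest:" + escaped + "*");
        query.setRows(10); // top 10 completions
        System.out.println(server.query(query).getResults());
    }
}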
Re: Indexing TIKA extracted text. Are there some issues?
Sure. The java command I use with TIKA to extract text from a URL is:

java -jar tika-0.3-standalone.jar -t $url

I have also attached the screenshots of the web page, the post documents produced in the two different ways (Perl & Tika) for that web page, and the screenshots of the search result for a string contained in that web page. The index in each case contains just this one URL. To keep everything else identical, I used the same instance for creating the index in each case. First I posted the Tika document, checked for the results, emptied the index, posted the Perl document, and checked the results.

Debug query for Tika:

<str name="parsedquery">+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0 | title:高通公司展现了海量的优质多媒体内容能^2.0 | content_china:高通 通公 公司 司展 展现 现了 了海 海量 量的 的优 优质 质多 多媒 媒体 体内 内容 容能)~0.01) ()</str>

Debug query for Perl:

<str name="parsedquery">+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0 | title:高通公司展现了海量的优质多媒体内容能^2.0 | content_china:高通 通公 公司 司展 展现 现了 了海 海量 量的 的优 优质 质多 多媒 媒体 体内 内容 容能)~0.01) ()</str>

The screenshots http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx
Perl extracted doc http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml
Tika extracted doc http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml

Grant Ingersoll-6 wrote: Hmm, looks very much like an encoding problem. Can you post a sample showing it, along with the commands you invoked? Thanks, Grant

On Jul 28, 2009, at 6:14 PM, ashokc wrote: I am finding that the search results based on indexing Tika-extracted text are very different from results based on indexing the text extracted via other means. This shows up, for example, with a Chinese web site that I am trying to index. I created the documents (for posting to SOLR) in two ways. The source text of the web pages is full of HTML entities like &#12345; with some English characters mixed in.

(a) Simple text extraction from the page source by a Perl script. The resulting content field looks like

<field name="content_china">Who We Are &#20844;&#21496;&#21382;&#21490; &#24744;&#30340;&#25104;&#21151;&#26696;&#20363; &#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376; Innovation &#21019; etc...</field>

I posted these documents to a SOLR instance.

(b) Used Tika (command line). The resulting content field looks like

<field name="content_china">Who We Are Ã¥ ŒÂ¸à ¥ÂŽÂ†Ã¥Â² 您的æˆÂ功æ¡ ˆä¾‹ 领导团队 业务部门  Innovation à ¥Â etc...</field>

I posted these documents to a different instance.

When I search the first instance for a string (that I copied and pasted from the web site) I find a number of hits, including the page from which I copied the string. But when I do the same on the instance with Tika-extracted text, I get nothing. Has anyone seen this? I believe it may have to do with encoding. In both cases the posted documents were utf-8 compliant. Thanks for your insights. - ashok -- View this message in context: http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html Sent from the Solr - User mailing list archive at Nabble.com.
-- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search -- View this message in context: http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: search suggest
Here's a good article on Ternary Trees: http://www.ddj.com/windows/184410528

I looked at the one in Lucene; I don't understand why the find method only returns a char/int?

On Wed, Jul 29, 2009 at 2:33 PM, Robert Petersen rober...@buy.com wrote: Simple-minded autosuggest can just not tokenize the phrases at all, so the wildcards simply complete whatever the user has typed so far, including spaces. ...
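For anyone wanting to experiment, here is a minimal ternary search tree sketch along the lines of that article, including the prefix-completion traversal that a find method returning only an index would not give you. It assumes non-empty strings and is illustrative; it is not the Lucene TernaryTree class discussed above.

import java.util.ArrayList;
import java.util.List;

public class TernarySearchTree {
    private static class Node {
        char ch;
        Node lo, eq, hi;
        boolean wordEnd; // true if the path root..here spells a stored phrase
    }
    private Node root;

    public void insert(String s) { root = insert(root, s, 0); }

    private Node insert(Node n, String s, int i) {
        char c = s.charAt(i);
        if (n == null) { n = new Node(); n.ch = c; }
        if (c < n.ch)                     n.lo = insert(n.lo, s, i);
        else if (c > n.ch)                n.hi = insert(n.hi, s, i);
        else if (i < s.length() - 1)      n.eq = insert(n.eq, s, i + 1);
        else                              n.wordEnd = true;
        return n;
    }

    /** Collect stored phrases that start with the given (non-empty) prefix. */
    public List<String> suggest(String prefix) {
        List<String> out = new ArrayList<String>();
        Node n = find(root, prefix, 0);
        if (n == null) return out;
        if (n.wordEnd) out.add(prefix);
        collect(n.eq, new StringBuilder(prefix), out);
        return out;
    }

    private Node find(Node n, String s, int i) {
        if (n == null) return null;
        char c = s.charAt(i);
        if (c < n.ch) return find(n.lo, s, i);
        if (c > n.ch) return find(n.hi, s, i);
        if (i == s.length() - 1) return n;
        return find(n.eq, s, i + 1);
    }

    // In-order walk: lo, then this character (and its continuations), then hi,
    // so suggestions come out in sorted order.
    private void collect(Node n, StringBuilder prefix, List<String> out) {
        if (n == null) return;
        collect(n.lo, prefix, out);
        prefix.append(n.ch);
        if (n.wordEnd) out.add(prefix.toString());
        collect(n.eq, prefix, out);
        prefix.deleteCharAt(prefix.length() - 1);
        collect(n.hi, prefix, out);
    }

    public static void main(String[] args) {
        TernarySearchTree tree = new TernarySearchTree();
        tree.insert("ipod");
        tree.insert("ipod nano");
        tree.insert("ipad");
        System.out.println(tree.suggest("ipod")); // [ipod, ipod nano]
    }
}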
Re: Indexing TIKA extracted text. Are there some issues?
It appears there is an encoding problem. In the screenshot I can see the title is mangled, and if I open up the URL in IE or Firefox, both browsers think it is iso-8859-1. I think this is why (from the W3C validator):

Character Encoding mismatch! The character encoding specified in the HTTP header (iso-8859-1) is different from the value in the meta element (utf-8). I will use the value from the HTTP header (iso-8859-1) for this validation.

On Wed, Jul 29, 2009 at 6:02 PM, ashokc ash...@qualcomm.com wrote: Sure. The java command I use with TIKA to extract text from a URL is: java -jar tika-0.3-standalone.jar -t $url ...
-- Robert Muir rcm...@gmail.com
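Before fixing the server configuration, you can confirm the header/meta mismatch the validator reports with a few lines of Java (a quick diagnostic sketch; the URL is passed as an argument):

import java.net.URL;
import java.net.URLConnection;

public class CharsetCheck {
    public static void main(String[] args) throws Exception {
        // Print the content type the HTTP header advertises; this is the
        // charset Tika and the W3C validator will trust over the page's
        // meta element.
        URLConnection conn = new URL(args[0]).openConnection();
        System.out.println(conn.getContentType()); // e.g. text/html; charset=iso-8859-1
    }
}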
Re: Indexing TIKA extracted text. Are there some issues?
Could very well be... I will rectify it and try again. Thanks - ashok

Robert Muir wrote: It appears there is an encoding problem. In the screenshot I can see the title is mangled, and if I open up the URL in IE or Firefox, both browsers think it is iso-8859-1. ...

-- View this message in context: http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24729595.html Sent from the Solr - User mailing list archive at Nabble.com.
deleteById always returning OK
Is it expected behaviour that deleteById will always return OK as a status, regardless of whether the id was matched? I have a unit test:

// set up the test data
engine.index(12345, s1, d1);
engine.index(54321, s2, d2);
engine.index(23453, s3, d3);
// ...

@Test
public void testRemove() throws Exception {
    assertEquals(engine.size(), 3);
    assertTrue(engine.remove(12345));
    assertEquals(engine.size(), 2);
    // XXX, it returns true
    assertFalse(engine.remove(23523352));

Engine is my wrapper around Solr. The remove method looks like this:

private static final int RESPONSE_STATUS_OK = 0;
private SolrServer server;

public boolean remove(final Integer titleInstanceId) throws IOException {
    try {
        server.deleteById(String.valueOf(titleInstanceId));
        final UpdateResponse updateResponse = server.commit(true, true);
        // XXX It's always OK
        return (updateResponse.getStatus() == RESPONSE_STATUS_OK);

Any ideas what's going wrong? Is there a different way to test for the id not having been there, other than an additional search? Thanks Reuben
Re: THIS WEEK: PNW Hadoop, HBase / Apache Cloud Stack Users' Meeting, Wed Jul 29th, Seattle
Don't forget this is tonight! Excited to see everyone there.

On Tue, Jul 28, 2009 at 11:25 AM, Bradford Stephens bradfordsteph...@gmail.com wrote: Hey everyone, SLIGHT change of plans. A few people have asked me to move to a place with air conditioning, since the temperature's in the 90s this week. So, here we go: Big Time Brewing Company, 4133 University Way NE, Seattle, WA 98105. Call me at 904-415-3009 if you have any questions.

On Mon, Jul 27, 2009 at 12:16 PM, Bradford Stephens bradfordsteph...@gmail.com wrote: Hello again! Yes, I know some of us are still recovering from OSCON. It's time for another delicious meetup to chat about Hadoop, HBase, Solr, Lucene, and more! UW is quite a pain for us to access until August, so we're changing the venue to one pretty close: Piccolo's Pizza, 5301 Roosevelt Way NE (between 53rd St and 55th St), 6:45pm - 8:30 (or when we get bored)! As usual, people are more than welcome to give talks, whether they're long-format or lightning. I'd also really like to start thinking about hackathons; perhaps we could have one next month? I'll be talking about HBase .20 and the possibility of low-latency HBase analytics. I'd be very excited to hear what people are up to! Contact me if there's any questions: 904-415-3009. Cheers, Bradford

-- http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
Re: Wildcard and boosting
I just updated to the nightly build (I was using 1.2) and this does not seem to be an issue anymore.

2009/7/29 Jón Helgi Jónsson jonjons...@gmail.com: Hey now! I do index-time boosting for my fields and just discovered that when searching with a trailing wildcard the boosting is ignored. Will my boosting work with a wildcard if I do it at query time? And if so, is there a lot of performance difference? Is there some other method I can use to preserve my boosting? I do not need highlighting. Thanks, Jon Helgi
Re: deleteById always returning OK
Reuben Firmin wrote: Is it expected behaviour that deleteById will always return OK as a status, regardless of whether the id was matched?

It is expected behaviour, as Solr always returns 0 unless an error occurs while processing a request (query, update, ...). So you don't need to check the status; you'll get an exception if something goes wrong, otherwise the request succeeded. And you cannot know whether the id was matched. The only thing you can try is to send a query q=id:value&rows=0 and check numFound in the response before sending deleteById.

Koji
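A SolrJ sketch of Koji's probe-then-delete idea, following Reuben's naming (Engine, titleInstanceId); everything else is illustrative. Note that the probe and the delete are not atomic, so a concurrent writer could still change the index between the two calls.

import java.io.IOException;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;

public class Engine {
    private SolrServer server;

    // Probe first, then delete: the status alone can't tell you whether
    // the id existed, but numFound can.
    public boolean remove(final Integer titleInstanceId)
            throws IOException, SolrServerException {
        final String id = String.valueOf(titleInstanceId);
        final SolrQuery probe = new SolrQuery("id:" + id);
        probe.setRows(0); // only the count is needed, not the documents
        final boolean existed =
                server.query(probe).getResults().getNumFound() > 0;
        if (existed) {
            server.deleteById(id);
            server.commit(true, true);
        }
        return existed;
    }
}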
RE: Boosting ('bq') on multi-valued fields
Hey Ken, thanks for your reply. When I wrote '5|6' I meant that this is a multiValued field with two values, '5' and '6', rather than the literal string '5|6' (under any tokenizer). Does your reply still hold? That is, are multiValued fields dependent on the notion of tokenization to such a degree that I can't use the str type with them meaningfully? If so, it seems weird to me that I should be able to define a str multiValued field to begin with.. -Chak

Ensdorf Ken wrote: Hey, I have a field defined as such:

<field name="site_id" type="string" indexed="true" stored="false" multiValued="true"/>

with the string type defined as:

<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

When I try using some query-time boost parameters via bq on values of this field, it seems to behave strangely for documents that actually have multiple values: if I boost a particular value (site_id:5^1.1), it seems like all the cases where this field is populated with multiple values (i.e. a document with field value 5|6) do not get boosted at all. I verified this using debugQuery and explainOther=doc_id:document_with_multiple_values. Is this a known issue/bug? Any workarounds? (I'm using a nightly Solr build from a few months back.)

There is no tokenization on 'string' fields, so a query for 5 does not match a doc with a value of 5|6 for this field. You could try using field type 'text' for this and see what you get. You may need to customize it to use the StandardAnalyzer or WordDelimiterFilterFactory to get the right behavior. Using the analysis tool in the Solr admin UI to experiment will probably be helpful. -Ken

-- View this message in context: http://www.nabble.com/Boosting-%28%27bq%27%29-on-multi-valued-fields-tp24713905p24730981.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is there a multi-shard optimize message?
: Normally to optimize an index you POST <optimize/> to /solr/update. Is
: there any way to POST an optimize message to one instance and have it
: propagate to all shards, sort of like the select?
:
: /solr-shard-1/select?q=dog...&shards=shard-1,shard2

No, you'll need to send optimize to each host separately.

And for the record: it would be relatively straightforward to implement something like this (just like distributed search) ... but it has very little value. Clients doing indexing operations have to send add/delete commands directly to the individual shards, so they have to send the commit/optimize commands directly to them as well. If/when someone writes a distributed indexing handler, making it support distributed optimize/commit will be fairly trivial.

-Hoss
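A small SolrJ sketch of the loop Hoss describes (the shard URLs are made up; list whichever hosts you index to):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OptimizeAllShards {
    public static void main(String[] args) throws Exception {
        String[] shards = {
            "http://solr-shard-1:8983/solr",
            "http://solr-shard-2:8983/solr"
        };
        for (String url : shards) {
            SolrServer shard = new CommonsHttpSolrServer(url);
            shard.optimize(); // same as POSTing <optimize/> to that shard's /update
        }
    }
}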
Re: update some index documents after indexing process is done with DIH
If you make your EventListener implement SolrCoreAware, you can get hold of the core in inform(). Use that to get hold of the SolrIndexWriter.

On Wed, Jul 29, 2009 at 9:20 PM, Marc Sturlese marc.sturl...@gmail.com wrote: From the newSearcher(..) of a CustomEventListener which extends AbstractSolrEventListener I can access the SolrIndexSearcher and all core properties, but I can't get a SolrIndexWriter. Do you know how I can get a SolrIndexWriter from there? That way I would be able to modify the documents (I need to modify them depending on the values of other documents; that's why I can't do it with a DIH delta-import). Thanks in advance

Noble Paul നോബിള് नोब्ळ्-2 wrote: On Tue, Jul 28, 2009 at 5:17 PM, Marc Sturlese marc.sturl...@gmail.com wrote: That really sounds like the best way to reach my goal. How would I invoke a listener from the newSearcher? Would it be something like:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">solr</str>
      <str name="start">0</str>
      <str name="rows">10</str>
    </lst>
    <lst>
      <str name="q">rocks</str>
      <str name="start">0</str>
      <str name="rows">10</str>
    </lst>
    <lst><str name="q">static newSearcher warming query from solrconfig.xml</str></lst>
  </arr>
</listener>
<listener event="newSearcher" class="solr.MyCustomListener"/>

And MyCustomListener would be the class that opens the reader:

RefCounted<SolrIndexSearcher> searchHolder = null;
try {
    searchHolder = dataImporter.getCore().getSearcher();
    IndexReader reader = searchHolder.get().getReader();
    // Here I iterate over the reader doing document modifications
} catch (Exception ex) {
    LOG.info("error", ex);
} finally {
    if (searchHolder != null) searchHolder.decref();
}

You may not be able to access the DIH API from a newSearcher event. But the API would give you the searcher directly as a method parameter.

Finally, to access documents and add fields to some of them, I have thought of using the SolrDocument classes. Can you please point me to where something similar is done in the Solr source (I mean the creation of SolrDocuments and their conversion to proper Lucene documents)? Does this way of reaching the goal make sense? Thanks in advance

Noble Paul നോബിള് नोब्ळ्-2 wrote: when a core is reloaded the event fired is firstSearcher. newSearcher is fired when a commit happens

On Tue, Jul 28, 2009 at 4:19 PM, Marc Sturlese marc.sturl...@gmail.com wrote: Ok, but if I handle it in a newSearcher listener it will be executed every time I reload a core, won't it? The thing is that I want to use an IndexReader to load some doc fields of the index into a HashMap and, depending on the values of some field docs, modify other docs. It's very memory consuming (I have tested it with a simple Lucene script). That's why I wanted to do it just after the indexing process. My ideal case would be to do it in the commit function of DirectUpdateHandler2.java, just before writer.optimize(cmd.maxOptimizeSegments) is executed. But I don't want to mess with that code... so I'm trying to find the best way to do it as a plugin instead of a hack, if possible. Thanks in advance

Noble Paul നോബിള് नोब्ळ्-2 wrote: It is best handled as a 'newSearcher' listener in solrconfig.xml. onImportEnd is invoked before committing

On Tue, Jul 28, 2009 at 3:13 PM, Marc Sturlese marc.sturl...@gmail.com wrote: Hey there, I would like to be able to do something like: after the indexing process is done with DIH, I would like to open an IndexReader, iterate over all docs, modify some of them depending on others, and delete some others. I can easily do this coding directly against Lucene, but I would like to know if there's a way to do it with Solr using the SolrDocument or SolrInputDocument classes. I have thought of using a SolrJ or DIH listener (onImportEnd) but am not sure if I can get an IndexReader in there. Any advice? Thanks in advance

-- View this message in context: http://www.nabble.com/update-some-index-documents-after-indexing-process-is-done-with-DIH-tp24695947p24695947.html Sent from the Solr - User mailing list archive at Nabble.com.

-- - Noble Paul | Principal Engineer| AOL | http://aol.com
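To round the thread off, here is a hedged sketch of the SolrCoreAware listener Noble suggests. The class name and the body of newSearcher() are illustrative; SolrEventListener and SolrCoreAware are the real Solr interfaces, but whether the update handler exposes enough for your in-place modifications is something to verify against your Solr version.

import org.apache.lucene.index.IndexReader;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.UpdateHandler;
import org.apache.solr.util.plugin.SolrCoreAware;

public class MyCustomListener implements SolrEventListener, SolrCoreAware {

    private SolrCore core;

    public void init(NamedList args) {}

    // Called once at startup, after init() but before any events fire;
    // this is where the core (and, through it, the writer side) arrives.
    public void inform(SolrCore core) {
        this.core = core;
    }

    public void postCommit() {}

    public void newSearcher(SolrIndexSearcher newSearcher,
                            SolrIndexSearcher currentSearcher) {
        // Read side: iterate the freshly opened index.
        IndexReader reader = newSearcher.getReader();

        // Write side: the update handler wraps the SolrIndexWriter that
        // Marc was after (DirectUpdateHandler2 in a default config).
        UpdateHandler updateHandler = core.getUpdateHandler();

        // ... inspect documents via the reader, then issue add/delete
        // commands through the update handler as needed ...
    }
}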