a bug of solr distributed search
in QueryComponent.mergeIds. It removes documents whose uniqueKey duplicates that of another document. The current implementation keeps the first one encountered:

    String prevShard = uniqueDoc.put(id, srsp.getShard());
    if (prevShard != null) {
      // duplicate detected
      numFound--;
      collapseList.remove(id + "");
      docs.set(i, null);  // remove it
      // For now, just always use the first encountered since we can't currently
      // remove the previous one added to the priority queue.  If we switched
      // to the Java5 PriorityQueue, this would be easier.
      continue;
      // make which duplicate is used deterministic based on shard
      // if (prevShard.compareTo(srsp.shard) >= 0) {
      //   TODO: remove previous from priority queue
      //   continue;
      // }
    }

It iterates over the ShardResponses with for (ShardResponse srsp : sreq.responses), but the order of sreq.responses may vary between requests -- shard1's result and shard2's result may swap positions. So when a uniqueKey (such as a url) occurs in both shard1 and shard2, which one is used is unpredictable. But the scores of these two docs differ because of different idf, so the same query can return different results. One possible solution is to sort the ShardResponse list by shard name.
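The proposed fix can be simulated outside Solr. This stand-alone sketch uses simplified stand-ins for Solr's shard responses (plain string pairs, not the real classes) to show that ordering responses by shard name before merging makes the surviving duplicate deterministic regardless of arrival order:

```java
import java.util.*;

public class DeterministicMerge {
    // Each entry is {shardName, uniqueKey}, in the order responses happened to arrive.
    // Returns uniqueKey -> shard whose copy survives the merge.
    static Map<String, String> merge(List<String[]> hits) {
        // the proposed fix: process responses in shard-name order, not arrival order
        List<String[]> ordered = new ArrayList<>(hits);
        ordered.sort(Comparator.comparing(h -> h[0]));
        Map<String, String> uniqueDoc = new HashMap<>();
        for (String[] h : ordered) {
            uniqueDoc.putIfAbsent(h[1], h[0]); // "first encountered" is now deterministic
        }
        return uniqueDoc;
    }

    public static void main(String[] args) {
        // the same two hits for uniqueKey "url1", arriving in opposite orders
        List<String[]> runA = List.of(new String[]{"shard2", "url1"}, new String[]{"shard1", "url1"});
        List<String[]> runB = List.of(new String[]{"shard1", "url1"}, new String[]{"shard2", "url1"});
        System.out.println(merge(runA).get("url1")); // shard1
        System.out.println(merge(runB).get("url1")); // shard1 -- identical either way
    }
}
```

With arrival-order processing, the two runs would keep copies from different shards (and therefore different scores); after sorting, both runs keep shard1's copy.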
Re: Are there any known issues that may cause the index sync between the master/slave to become abnormal?
Hi! Are there any known issues that may cause the index sync between the master/slave to become abnormal? What do you mean here? Corrupt indices? Please describe your problems in more detail. And is there any API to call to force a sync of the index between the master and slave, or to force deletion of the old index on the slave? Syncing can be done via HTTP: http://wiki.apache.org/solr/SolrReplication Regards, Peter.
Re:Re: Are there any known issues that may cause the index sync between the master/slave to become abnormal?
Hi Peter, Thanks for your response. I will check http://wiki.apache.org/solr/SolrReplication first. I mean the slave node did not delete the old index, which finally caused the disk usage to grow too large on the slave node. I am thinking of manually forcing the slave node to refresh the index. Regards, James. Hi! Are there any known issues that may cause the index sync between the master/slave to become abnormal? What do you mean here? Corrupt indices? Please describe your problems in more detail. And is there any API to call to force a sync of the index between the master and slave, or to force deletion of the old index on the slave? Syncing can be done via HTTP: http://wiki.apache.org/solr/SolrReplication Regards, Peter.
Re: Are there any known issues that may cause the index sync between the master/slave to become abnormal?
Hi James, triggering an optimize (on the slave) helped us to shrink the disk usage of the slaves. But I think the slaves will clean it up automatically on the next replication (if you don't mind the double-size index). Regards, Peter. Hi Peter, Thanks for your response. I will check http://wiki.apache.org/solr/SolrReplication first. I mean the slave node did not delete the old index, which finally caused the disk usage to grow too large on the slave node. I am thinking of manually forcing the slave node to refresh the index. Regards, James. Hi! Are there any known issues that may cause the index sync between the master/slave to become abnormal? What do you mean here? Corrupt indices? Please describe your problems in more detail. And is there any API to call to force a sync of the index between the master and slave, or to force deletion of the old index on the slave? Syncing can be done via HTTP: http://wiki.apache.org/solr/SolrReplication Regards, Peter.
Re: a bug of solr distributed search
Li Li, this is the intended behaviour, not a bug. Otherwise you could get back the same record several times in one response, which may not be intended by the user. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983675.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
But users will think there is something wrong with it when they issue the same query but get different results. 2010/7/21 MitchK mitc...@web.de: Li Li, this is the intended behaviour, not a bug. Otherwise you could get back the same record several times in one response, which may not be intended by the user. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983675.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: set field with value 0 to the end
An Integer field can also be empty. I think you have set required=true; if so, remove required=true, and you can leave the field without data at indexing time. -- View this message in context: http://lucene.472066.n3.nabble.com/set-field-with-value-0-to-the-end-tp980580p983728.html Sent from the Solr - User mailing list archive at Nabble.com.
nested query and number of matched records
Hello community, I have a situation where I know that some types of documents contain very extensive information and other types give more general information. Since I don't know whether a user searches for general or extensive information (and I don't want to ask him when he uses the default search), I want to give him a response like this: 10 documents of type "short", and 1 document of type "extensive", if there is one. An example query would look like this: q={!dismax fq=type:short}my cool query OR {!dismax fq=type:extensive}my cool query The problem with this one is that I can not specify retrieving up to 10 short documents and at most one extensive one. I think this will not work, and if I want to create such a search I need to do two different queries. But before I waste performance, I wanted to ask. Thank you! Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p983756.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
Ah, okay. I understand your problem. Why should doc x be at position 1 when searching for the first time, but at position 8 when I search a second time - right? I am not sure, but I think you can't prevent this without custom coding or without making a document's occurrence unique. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983771.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: nested query and number of matched records
Oh,... I just see, there is no direct question ;-). How can I specify the number of returned documents in the desired way *within* one request? - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p983773.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
Yes. This will make users think our search engine has some bug. From the comments in the code, more remains to be done:

    if (prevShard != null) {
      // For now, just always use the first encountered since we can't currently
      // remove the previous one added to the priority queue.  If we switched
      // to the Java5 PriorityQueue, this would be easier.
      continue;
      // make which duplicate is used deterministic based on shard
      // if (prevShard.compareTo(srsp.shard) >= 0) {
      //   TODO: remove previous from priority queue
      //   continue;
      // }
    }

2010/7/21 MitchK mitc...@web.de: Ah, okay. I understand your problem. Why should doc x be at position 1 when searching for the first time, and when I search for the 2nd time it occurs at position 8 - right? I am not sure, but I think you can't prevent this without custom coding or making a document's occurrence unique. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983771.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: a bug of solr distributed search
I don't know much about the code. Maybe you can tell me which file you are referring to? However, from the comments one can see that the problem is known, but it was decided to let it happen because of requirements on the Java version. - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983880.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: LocalSolr distance in km?
Hi, What resource are you using for LocalSolr? Using the SpatialTierQParser, you can choose between km or miles: http://blog.jteam.nl/2009/08/03/geo-location-search-with-solr-and-lucene/ Or, if you are using the LocalSolrQueryComponent (http://www.gissearch.com/localsolr) and you can't choose between the two units, you can use the radius parameter together with the km-to-mile conversion (1 kilometer = 0.621371192 mile), e.g., http://...select?qt=geo&lat=xx.xx&long=yy.yy&q=*:*&radius=0.621371192 HTH -S On Jul 21, 2010, at 6:14 AM, Chamnap Chhorn wrote: Hi, I want to do a geo query with LocalSolr. However, it seems it supports only miles when calculating distances. Is there a quick way to use this search component with Solr using km instead? The other thing is, I want to calculate distances starting from 500 meters up. How could I do this? -- Chhorn Chamnap http://chamnapchhorn.blogspot.com/
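The unit conversion is mechanical enough to wrap in a tiny helper (a hypothetical class for illustration, not part of LocalSolr) that turns a kilometer distance into the value to pass in the mile-based radius parameter:

```java
public class RadiusUnits {
    static final double MILES_PER_KM = 0.621371192; // 1 km = 0.621371192 mi

    // km -> value to pass in a mile-based radius parameter
    static double kmToMiles(double km) { return km * MILES_PER_KM; }

    public static void main(String[] args) {
        System.out.println(kmToMiles(1.0)); // radius for a 1 km search
        System.out.println(kmToMiles(0.5)); // radius for the 500 m lower bound mentioned above
    }
}
```

So the example URL's radius=0.621371192 corresponds to a 1 km search radius, and roughly 0.31 would correspond to 500 meters.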
Re: nested query and number of matched records
I think Solr does not provide anything like what you want. -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p983938.html Sent from the Solr - User mailing list archive at Nabble.com.
solrconfig.xml and xinclude
I am trying to export some config options common to all cores into a single file, which would be included using xinclude. The only problem is how to include the children of a given node. common_solrconfig.xml looks like this:

    <?xml version="1.0" encoding="UTF-8" ?>
    <config>
      <lib dir="/solr/lib" />
    </config>

solrconfig.xml looks like this:

    <?xml version="1.0" encoding="UTF-8" ?>
    <config>
      <!-- xinclude here -->
    </config>

Now, all of the following attempts have failed:

    <xi:include href="/solr/common_solrconfig.xml" xmlns:xi="http://www.w3.org/2001/XInclude"></xi:include>
    <xi:include href="/solr/common_solrconfig.xml" xpointer="config/*" xmlns:xi="http://www.w3.org/2001/XInclude"></xi:include>
    <xi:include href="/solr/common_solrconfig.xml" xpointer="xpointer(config/*)" xmlns:xi="http://www.w3.org/2001/XInclude"></xi:include>
    <xi:include href="/solr/common_solrconfig.xml" xpointer="element(config/*)" xmlns:xi="http://www.w3.org/2001/XInclude"></xi:include>

-- View this message in context: http://lucene.472066.n3.nabble.com/solrconfig-xml-and-xinclude-tp984058p984058.html Sent from the Solr - User mailing list archive at Nabble.com.
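XInclude behavior can be reproduced outside Solr with the JDK's own parser. This stand-alone sketch (with hypothetical file contents mirroring the ones above) shows a whole-document include working, and also why it is not enough on its own: the included file's root <config> element comes along too, which is exactly what the xpointer attempts above try to avoid. xpointer support varies by parser, so a whole-document include is the only case sketched here:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.nio.file.*;

public class XIncludeDemo {
    // Parses a solrconfig.xml that xi:includes a common file and returns how
    // many <lib> elements made it into the merged document.
    static int includedLibCount() throws Exception {
        Path dir = Files.createTempDirectory("xi");
        Files.writeString(dir.resolve("common_solrconfig.xml"),
            "<config><lib dir=\"/solr/lib\"/></config>");
        Path main = dir.resolve("solrconfig.xml");
        Files.writeString(main,
            "<config xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
          + "<xi:include href=\"common_solrconfig.xml\"/></config>");

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);  // both flags are required for XInclude
        dbf.setXIncludeAware(true);
        Document doc = dbf.newDocumentBuilder().parse(main.toFile());
        return doc.getElementsByTagName("lib").getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(includedLibCount());
    }
}
```

After the include, the <lib> element is present, but it sits inside a nested <config> element (the included file's root), so Solr would still not see it where it expects.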
Re: Load cores without restarting/reloading Solr
Hi Peter We are using the packaged Ubuntu Server (10.04 LTS) versions of Tomcat6 and Solr1.4 and running a single instance of Solr with multiple cores. Regards Andrew On 20 July 2010 19:47, Peter Karich peat...@yahoo.de wrote: Hi Andrew, the whole Tomcat shouldn't fail on restart if only one core fails. We are using the setup described here: http://wiki.apache.org/solr/SolrTomcat With the help of several different Tomcat Context xml files (under conf/Catalina/localhost/) the cores should be independent webapps: a different data directory (+config) and even a different Solr version is possible. Or are you using the same setup? Regards, Peter. Hi Sorry, it wasn't very clear, was it? Yes, I use a 'template' core that isn't used and create a copy of it on the command line. I then edit newcore/conf/solrconfig.xml and set the data path, add data-import sections etc., and then I edit solr.home/solr.xml and add the core name directory to that. I then go to the Tomcat manager/html and reload Solr. The problem I get is that if I have broken something in the new core, Solr (correctly) doesn't reload and the other cores then aren't working. I don't need replication just yet but I will be looking into that eventually. Regards Andrew On 20 July 2010 10:32, Peter Karich peat...@yahoo.de wrote: Hi Andrew, I didn't correctly understand what you are trying to do with 'copying'? Just use one core as a template, or use it to replicate data? You can reload only one application via: http://localhost/manager/html/reload?path=/yourapp (if you do this often you need to increase the PermGen space) You can replicate a core: http://wiki.apache.org/solr/SolrReplication Regards, Peter. Hi We have a few cores set up for separate sites and one of these is in use constantly. When I add a new core, I currently copy one of the other cores and rename it, change the conf etc. and then reload Solr via the Tomcat manager.
However, if something goes wrong then the other cores stop working until I have resolved the problem. My questions are: 1) Is using a separate core for different sites the correct method? 2) Is there a way of creating a core and starting it without having to reload Solr or restart tomcat? 3) I've looked at the Solr Cores CREATE handler but from what I gather, I need to create the core folder and edit the solr.xml first before loading the core with action=CREATE. Is that correct? Regards Andrew
Re: nested query and number of matched records
Parallel calls: simultaneously query for type:short rows=10 and type:extensive rows=1 and merge your results. This would also let you separate your short docs from your extensive docs into different Solr instances if you wished... depending on your document architecture, this could speed up one or the other. -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p984280.html Sent from the Solr - User mailing list archive at Nabble.com.
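The client-side merge of the two parallel responses can be sketched with plain collections standing in for SolrJ result lists (the doc ids below are made up for illustration):

```java
import java.util.*;

public class TwoQueryMerge {
    // Pretend these came back from two parallel requests:
    //   q=...&fq=type:short&rows=10   and   q=...&fq=type:extensive&rows=1
    static List<String> merge(List<String> shortDocs, List<String> extensiveDocs) {
        List<String> merged = new ArrayList<>(shortDocs.subList(0, Math.min(10, shortDocs.size())));
        if (!extensiveDocs.isEmpty()) merged.add(extensiveDocs.get(0)); // at most one extensive doc
        return merged;
    }

    public static void main(String[] args) {
        List<String> shorts = List.of("short1", "short2", "short3");
        List<String> ext = List.of("extensive1");
        System.out.println(merge(shorts, ext)); // [short1, short2, short3, extensive1]
        System.out.println(merge(shorts, List.of())); // no extensive match: shorts only
    }
}
```

In a real client the two requests would be issued concurrently (e.g. on two threads) and this merge applied once both responses arrive.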
Re: nested query and number of matched records
Sure, Solr supports this: use facets on the field type; add to your regular query: facet=true&facet.field=type see http://wiki.apache.org/solr/SimpleFacetParameters On Wed, 2010-07-21 at 15:48 +0200, kenf_nc wrote: parallel calls. simultaneously query for type:short rows=10 and type:extensive rows=1 and merge your results. This would also let you separate your short docs from your extensive docs into different solr instances if you wished...depending on your document architecture this could speed up one or the other.
Re: nested query and number of matched records
That just gives a count of documents by type. The use-case, I believe, is to return from a search, 10 documents of type 'short' and 1 document of type 'extensive'. -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p984539.html Sent from the Solr - User mailing list archive at Nabble.com.
faceted search with job title
Hi, I am currently using Nutch to crawl some job pages from job boards. They are in my Solr index now. I want to do faceted search on the job titles. How? The job titles can be in any location of the page, e.g. title, header, content... If I use an IndexFilter in Nutch to search the content for job titles, there are hundreds of thousands of job titles; I can't hard code them all. Do you have a better idea? I think I need the job title in a separate field in the index to make it work with Solr faceted search, am I right? Thanks.
RE: faceted search with job title
You'd probably need to do some post processing on the pages and set up rules for each website to grab that specific bit of data. You could load the html into an xml parser, then use xpath to grab content from a particular tag with a class or id, based on the particular website -Original Message- From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] Sent: 21 July 2010 16:38 To: solr-user@lucene.apache.org Subject: faceted search with job title Hi, I am currently using nutch to crawl some job pages from job boards. They are in my solr index now. I want to do faceted search with the job titles. How? The job titles can be in any locations of the page, e.g. title, header, content... If I use indexfilter in Nutch to search the content for job title, there are hundred of thousands of job titles, I can't hard code them all. Do you have a better idea? I think I need the job title in a separate field in the index to make it work with solr faceted search, am I right? Thanks.
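The post-processing suggested here can be sketched with the JDK's own XML tooling, assuming the page has already been tidied into well-formed XML (e.g. via TagSoup); the class name and the sample per-site rule below are hypothetical:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class TitleExtractor {
    // Extract text at a site-specific XPath; the page must already be
    // well-formed XML (run real HTML through a tidier such as TagSoup first).
    static String extract(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1 class=\"jobtitle\">Senior Java Developer</h1></body></html>";
        // the XPath rule would be configured per job board, e.g. keyed by hostname
        System.out.println(extract(page, "//h1[@class='jobtitle']"));
    }
}
```

The extracted string would then be written into a dedicated job_title field at indexing time, which is what makes it usable for faceting.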
RE: Securing Solr 1.4 in a glassfish container AS NEW THREAD
Some further information -- I tried indexing a batch of PDFs with the client and Solr CELL, setting the credentials in the httpclient. For some reason, after successfully indexing several hundred files I start getting a SolrException: Unauthorized and an info message (for every subsequent file): INFO basic authentication scheme selected org.apache.commons.httpclient.HttpMethodDirector processWWWAuthChallenge INFO Failure authenticating with BASIC 'realm'@host:port I increased the session timeout in web.xml with no change. I'm looking through the httpclient authentication now. -Jon -Original Message- From: Sharp, Jonathan Sent: Friday, July 16, 2010 8:59 AM To: 'solr-user@lucene.apache.org' Subject: RE: Securing Solr 1.4 in a glassfish container AS NEW THREAD Hi Bilgin, Thanks for the snippet -- that helps a lot. -Jon -Original Message- From: Bilgin Ibryam [mailto:bibr...@gmail.com] Sent: Friday, July 16, 2010 1:31 AM To: solr-user@lucene.apache.org Subject: Re: Securing Solr 1.4 in a glassfish container AS NEW THREAD Hi Jon, SolrJ (CommonsHttpSolrServer) internally uses the Apache http client to connect to Solr. You can check there for some documentation. I secured Solr also with the BASIC auth-method and use the following snippet to access it from SolrJ:

    // set username and password
    ((CommonsHttpSolrServer) server).getHttpClient()
        .getParams().setAuthenticationPreemptive(true);
    Credentials defaultcreds = new UsernamePasswordCredentials("username", "secret");
    ((CommonsHttpSolrServer) server).getHttpClient()
        .getState().setCredentials(new AuthScope("localhost", 80, AuthScope.ANY_REALM), defaultcreds);

HTH Bilgin Ibryam On Fri, Jul 16, 2010 at 2:35 AM, Sharp, Jonathan jsh...@coh.org wrote: Hi All, I am considering securing Solr with basic auth in glassfish using the container, by adding to web.xml and adding a sun-web.xml file to the distributed WAR as below.
If using SolrJ to index files, how can I provide the credentials for authentication to the http-client (or can someone point me in the direction of the right documentation to do that, or that will help me make the appropriate modifications)? Also any comment on the below is appreciated. Add this to web.xml ---

    <login-config>
      <auth-method>BASIC</auth-method>
      <realm-name>SomeRealm</realm-name>
    </login-config>
    <security-constraint>
      <web-resource-collection>
        <web-resource-name>Admin Pages</web-resource-name>
        <url-pattern>/admin</url-pattern>
        <url-pattern>/admin/*</url-pattern>
        <http-method>GET</http-method><http-method>POST</http-method><http-method>PUT</http-method><http-method>TRACE</http-method><http-method>HEAD</http-method><http-method>OPTIONS</http-method><http-method>DELETE</http-method>
      </web-resource-collection>
      <auth-constraint>
        <role-name>SomeAdminRole</role-name>
      </auth-constraint>
    </security-constraint>
    <security-constraint>
      <web-resource-collection>
        <web-resource-name>Update Servlet</web-resource-name>
        <url-pattern>/update/*</url-pattern>
        <http-method>GET</http-method><http-method>POST</http-method><http-method>PUT</http-method><http-method>TRACE</http-method><http-method>HEAD</http-method><http-method>OPTIONS</http-method><http-method>DELETE</http-method>
      </web-resource-collection>
      <auth-constraint>
        <role-name>SomeUpdateRole</role-name>
      </auth-constraint>
    </security-constraint>
    <security-constraint>
      <web-resource-collection>
        <web-resource-name>Select Servlet</web-resource-name>
        <url-pattern>/select/*</url-pattern>
        <http-method>GET</http-method><http-method>POST</http-method><http-method>PUT</http-method><http-method>TRACE</http-method><http-method>HEAD</http-method><http-method>OPTIONS</http-method><http-method>DELETE</http-method>
      </web-resource-collection>
      <auth-constraint>
        <role-name>SomeSearchRole</role-name>
      </auth-constraint>
    </security-constraint>

--- Also add this as sun-web.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application Server 9.0 Servlet 2.5//EN" "http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">
    <sun-web-app error-url="">
      <context-root>/Solr</context-root>
      <jsp-config>
        <property name="keepgenerated" value="true">
          <description>Keep a copy of the generated servlet class' java code.</description>
        </property>
      </jsp-config>
      <security-role-mapping>
        <role-name>SomeAdminRole</role-name>
        <group-name>SomeAdminGroup</group-name>
      </security-role-mapping>
      <security-role-mapping>
        <role-name>SomeUpdateRole</role-name>
        <group-name>SomeUpdateGroup</group-name>
      </security-role-mapping>
      <security-role-mapping>
        <role-name>SomeSearchRole</role-name>
        <group-name>SomeSearchGroup</group-name>
      </security-role-mapping>
    </sun-web-app>

-- -Jon
Re: a bug of solr distributed search
How about sorting over the score? Would that be possible? On Jul 21, 2010, at 12:13 AM, Li Li wrote: in QueryComponent.mergeIds. It removes documents whose uniqueKey duplicates that of another document. The current implementation keeps the first one encountered:

    String prevShard = uniqueDoc.put(id, srsp.getShard());
    if (prevShard != null) {
      // duplicate detected
      numFound--;
      collapseList.remove(id + "");
      docs.set(i, null);  // remove it
      // For now, just always use the first encountered since we can't currently
      // remove the previous one added to the priority queue.  If we switched
      // to the Java5 PriorityQueue, this would be easier.
      continue;
      // make which duplicate is used deterministic based on shard
      // if (prevShard.compareTo(srsp.shard) >= 0) {
      //   TODO: remove previous from priority queue
      //   continue;
      // }
    }

It iterates over the ShardResponses with for (ShardResponse srsp : sreq.responses), but the order of sreq.responses may vary between requests -- shard1's result and shard2's result may swap positions. So when a uniqueKey (such as a url) occurs in both shard1 and shard2, which one is used is unpredictable. But the scores of these two docs differ because of different idf, so the same query can return different results. One possible solution is to sort the ShardResponse list by shard name.
Re: faceted search with job title
mmm... there must be a better way... each job board has a different format. If there are constantly new job boards being crawled, I don't think I can manually look for the specific sequence of tags that leads to the job title. Most of them don't even have a class or id. There is no guarantee that the job title will be in the title tag, or a header tag. Something else can be in the title. Should I do this in a class that extends IndexFilter in Nutch? Thanks. From: Dave Searle dave.sea...@magicalia.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Wed, July 21, 2010 8:42:55 AM Subject: RE: faceted search with job title You'd probably need to do some post processing on the pages and set up rules for each website to grab that specific bit of data. You could load the html into an xml parser, then use xpath to grab content from a particular tag with a class or id, based on the particular website -Original Message- From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] Sent: 21 July 2010 16:38 To: solr-user@lucene.apache.org Subject: faceted search with job title Hi, I am currently using nutch to crawl some job pages from job boards. They are in my solr index now. I want to do faceted search with the job titles. How? The job titles can be in any locations of the page, e.g. title, header, content... If I use indexfilter in Nutch to search the content for job title, there are hundreds of thousands of job titles, I can't hard code them all. Do you have a better idea? I think I need the job title in a separate field in the index to make it work with solr faceted search, am I right? Thanks.
Re: a bug of solr distributed search
It already was sorted by score. The problem here is the following: shard_A and shard_B both contain doc_X. If you are querying for something, doc_X could have a score of 1.0 at shard_A and a score of 12.0 at shard_B. You can never be sure which doc Solr sees first. In the bad case, Solr sees doc_X first at shard_A and ignores it at shard_B. That means that the doc may occur at page 10 in pagination, although it *should* occur at page 1 or 2. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p984743.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: nested query and number of matched records
Thank you three for your feedback! Chantal, unfortunately kenf is right. Faceting won't work in this special case. parallel calls. Yes, this will be the solution. However, this would lead to a second HTTP request, and I hoped to be able to avoid it. Chantal Ackermann wrote: Sure SOLR supports this: use facets on the field type: add to your regular query: facet=true&facet.field=type see http://wiki.apache.org/solr/SimpleFacetParameters On Wed, 2010-07-21 at 15:48 +0200, kenf_nc wrote: parallel calls. simultaneously query for type:short rows=10 and type:extensive rows=1 and merge your results. This would also let you separate your short docs from your extensive docs into different solr instances if you wished...depending on your document architecture this could speed up one or the other. -- View this message in context: http://lucene.472066.n3.nabble.com/nested-query-and-number-of-matched-records-tp983756p984750.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: faceted search with job title
Yeah you should definitely just setup a custom parser for each site.. should be easy to extract title using groovy's xml parsing along with tagsoup for sloppy html. If you can't find the pattern for each site leading to the job title how can you expect solr to? Humans have the advantage here :P -Kallin Nagelberg -Original Message- From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] Sent: Wednesday, July 21, 2010 12:20 PM To: solr-user@lucene.apache.org Cc: dave.sea...@magicalia.com Subject: Re: faceted search with job title mmm...there must be better way...each job board has different format. If there are constantly new job boards being crawled, I don't think I can manually look for specific sequence of tags that leads to job title. Most of them don't even have class or id. There is no guarantee that the job title will be in the title tag, or header tag. Something else can be in the title. Should I do this in a class that extends IndexFilter in Nutch? Thanks. From: Dave Searle dave.sea...@magicalia.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Wed, July 21, 2010 8:42:55 AM Subject: RE: faceted search with job title You'd probably need to do some post processing on the pages and set up rules for each website to grab that specific bit of data. You could load the html into an xml parser, then use xpath to grab content from a particular tag with a class or id, based on the particular website -Original Message- From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] Sent: 21 July 2010 16:38 To: solr-user@lucene.apache.org Subject: faceted search with job title Hi, I am currently using nutch to crawl some job pages from job boards. They are in my solr index now. I want to do faceted search with the job titles. How? The job titles can be in any locations of the page, e.g. title, header, content... 
If I use indexfilter in Nutch to search the content for job title, there are hundred of thousands of job titles, I can't hard code them all. Do you have a better idea? I think I need the job title in a separate field in the index to make it work with solr faceted search, am I right? Thanks.
Re: help finding illegal chars in XML doc
Hi Chris, Thanks for your reply. I could not find any mention of that in the log files. By the way, I only have _MM_DD.request.log files in my directory. Do I have to enable any specific log or level to catch those errors? On Sun, Jul 18, 2010 at 3:45 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : SimplePostTool: FATAL: Solr returned an error: : Illegal_character_CTRLCHAR_code_27__at_rowcol_unknownsource_37022847 : : I've tried to track where this problem is located without luck. check your solr logs, it will contain the unmunged version of the error message (the version of jetty used in the 1.4.1 example setup seems to think all punctuation should be removed from error messages) complete with the row/column of your XML message that had the problem (it's either 3,7022847; or 370,22847; or 3702,2847; etc...) -Hoss
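Since the unmunged message reports CTRL-CHAR code 27 together with a row/column, a small stand-alone scanner (an independent helper, not part of Solr or SimplePostTool) can locate such characters in a document before posting it:

```java
public class CtrlCharFinder {
    // Returns "row,col" (1-based) of the first character that is illegal in
    // XML 1.0, or null if none. Tab, LF and CR are the only legal chars below 0x20.
    static String firstIllegal(String s) {
        int row = 1, col = 1;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') return row + "," + col;
            if (c == '\n') { row++; col = 1; } else { col++; }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstIllegal("<doc>ok</doc>"));          // null
        System.out.println(firstIllegal("<doc>a\nb\u001Bc</doc>")); // ESC (code 27) found on row 2
    }
}
```

For large files, the same loop would be applied to a buffered character stream instead of an in-memory string.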
boosting particular field values
I'm using dismax request handler, solr 1.4. I would like to boost the weight of certain fields according to their values... this appears to work: bq=category:electronics^5.5 However, I think this boosting only affects sorting the results that have already matched? So if I only get 10 rows back, I might not get any records back that are category electronics. If I get 100 rows, I can see that bq is working. However, I only want to get 10 rows. How does one affect the kinds of results that are matched to begin with? bq is the wrong thing to use, right? Thanks for any help, Justin
RE: boosting particular field values
function queries match all documents http://wiki.apache.org/solr/FunctionQuery#Using_FunctionQuery -Original message- From: Justin Lolofie jta...@gmail.com Sent: Wed 21-07-2010 20:24 To: solr-user@lucene.apache.org; Subject: boosting particular field values I'm using dismax request handler, solr 1.4. I would like to boost the weight of certain fields according to their values... this appears to work: bq=category:electronics^5.5 However, I think this boosting only affects sorting the results that have already matched? So if I only get 10 rows back, I might not get any records back that are category electronics. If I get 100 rows, I can see that bq is working. However, I only want to get 10 rows. How does one affect the kinds of results that are matched to begin with? bq is the wrong thing to use, right? Thanks for any help, Justin
Re: Solr searching performance issues, using large documents
From the mailing list archive, Koji wrote: 1. Provide another field for highlighting and use copyField to copy plainText to the highlighting field. and Lance wrote: http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html If you want to highlight field X, doing the termOffsets/termPositions/termVectors will make highlighting that field faster. You should make a separate field and apply these options to that field. Now: doing a copyField adds a value to a multiValued field. For a text field, you get a multi-valued text field. You should only copy one value to the highlighted field, so just copyField the document to your special field. To enforce this, I would add multiValued=false to that field, just to avoid mistakes. So, all_text should be indexed without the term* attributes, and should not be stored. Then your document is stored in a separate field that you use for highlighting, which has the term* attributes. I've been experimenting with this, and here's what I've tried:

    <field name="body" type="text_pl" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
    <field name="body_all" type="text_pl" indexed="false" stored="true" multiValued="true" />
    <copyField source="body" dest="body_all"/>

... but it's still very slow (10+ seconds). Why is it better to have two fields (one indexed but not stored, and the other not indexed but stored) rather than just one field that's both indexed and stored? From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors: If you aren't always using all the stored fields, then enabling lazy field loading can be a huge boon, especially if compressed fields are used. What does this mean? How do you load a field lazily? Thanks for your time, guys - this has started to become frustrating, since it works so well, but is very slow! -Pete On Jul 20, 2010, at 5:36 PM, Peter Spam wrote: Data set: About 4,000 log files (will eventually grow to millions). Average log file is 850k.
Largest log file (so far) is about 70MB. Problem: When I search for common terms, the query time goes from under 2-3 seconds to about 60 seconds. TermVectors etc are enabled. When I disable highlighting, performance improves a lot, but is still slow for some queries (7 seconds). Thanks in advance for any ideas! -Peter - 4GB RAM server % java -Xms2048M -Xmx3072M -jar start.jar - schema.xml changes: fieldType name=text_pl class=solr.TextField analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.WordDelimiterFilterFactory generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=0/ /analyzer /fieldType ... field name=body type=text_pl indexed=true stored=true multiValued=false termVectors=true termPositions=true termOffsets=true / field name=timestamp type=date indexed=true stored=true default=NOW multiValued=false/ field name=version type=string indexed=true stored=true multiValued=false/ field name=device type=string indexed=true stored=true multiValued=false/ field name=filename type=string indexed=true stored=true multiValued=false/ field name=filesize type=long indexed=true stored=true multiValued=false/ field name=pversion type=int indexed=true stored=true multiValued=false/ field name=first2md5 type=string indexed=false stored=true multiValued=false/ field name=ckey type=string indexed=true stored=true multiValued=false/ ... 
dynamicField name=* type=ignored multiValued=true / defaultSearchFieldbody/defaultSearchField solrQueryParser defaultOperator=AND/ - solrconfig.xml changes: maxFieldLength2147483647/maxFieldLength ramBufferSizeMB128/ramBufferSizeMB - The query: rowStr = rows=10 facet = facet=truefacet.limit=10facet.field=devicefacet.field=ckeyfacet.field=version fields = fl=id,score,filename,version,device,first2md5,filesize,ckey termvectors = tv=trueqt=tvrhtv.all=true hl = hl=truehl.fl=bodyhl.snippets=1hl.fragsize=400 regexv = (?m)^.*\n.*\n.*$ hl_regex = hl.regex.pattern= + CGI::escape(regexv) + hl.regex.slop=1hl.fragmenter=regexhl.regex.maxAnalyzedChars=2147483647hl.maxAnalyzedChars=2147483647 justq = 'q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/,
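A note on the lazy-loading question at the end of this thread: in Solr it is a single switch in solrconfig.xml. With many large stored fields and only a few returned per query, it defers reading the unused fields from disk; whether it actually helps depends on how many stored fields each request asks for via fl=.

```xml
<!-- solrconfig.xml: defer reading stored fields until they are actually
     requested (e.g. listed in the fl= parameter) instead of loading them
     all whenever a document is fetched -->
<enableLazyFieldLoading>true</enableLazyFieldLoading>
```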
Re: faceted search with job title
I don't see how it can be done without writing SAX or DOM code for each job board; it is not maintainable if there are a lot of new job boards being crawled. Maybe I should use regex matching? Then I just need to substitute the regex pattern for each job board without writing any new SAX or DOM code. But is a regex pattern flexible enough for all job boards? Thanks. From: Nagelberg, Kallin knagelb...@globeandmail.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Wed, July 21, 2010 10:39:32 AM Subject: RE: faceted search with job title Yeah you should definitely just setup a custom parser for each site.. should be easy to extract title using groovy's xml parsing along with tagsoup for sloppy html. If you can't find the pattern for each site leading to the job title how can you expect solr to? Humans have the advantage here :P -Kallin Nagelberg -Original Message- From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] Sent: Wednesday, July 21, 2010 12:20 PM To: solr-user@lucene.apache.org Cc: dave.sea...@magicalia.com Subject: Re: faceted search with job title mmm...there must be better way...each job board has different format. If there are constantly new job boards being crawled, I don't think I can manually look for specific sequence of tags that leads to job title. Most of them don't even have class or id. There is no guarantee that the job title will be in the title tag, or header tag. Something else can be in the title. Should I do this in a class that extends IndexFilter in Nutch? Thanks. From: Dave Searle dave.sea...@magicalia.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Wed, July 21, 2010 8:42:55 AM Subject: RE: faceted search with job title You'd probably need to do some post processing on the pages and set up rules for each website to grab that specific bit of data.
You could load the html into an xml parser, then use xpath to grab content from a particular tag with a class or id, based on the particular website -Original Message- From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] Sent: 21 July 2010 16:38 To: solr-user@lucene.apache.org Subject: faceted search with job title Hi, I am currently using nutch to crawl some job pages from job boards. They are in my solr index now. I want to do faceted search with the job titles. How? The job titles can be in any locations of the page, e.g. title, header, content... If I use indexfilter in Nutch to search the content for job title, there are hundred of thousands of job titles, I can't hard code them all. Do you have a better idea? I think I need the job title in a separate field in the index to make it work with solr faceted search, am I right? Thanks.
Dismax query response field number
Hi, It seems that not all fields are returned from the query response when I use DISMAX? Only the first 10?? Any idea? Here is my solrconfig: requestHandler name=dismax class=solr.SearchHandler lst name=defaults str name=defTypedismax/str str name=echoParamsexplicit/str str name=fl*/str float name=tie0.01/float str name=qf text^0.5 content^1.1 title^1.5 /str str name=pf text^0.2 content^1.1 title^1.5 /str str name=bf recip(price,1,1000,1000)^0.3 /str str name=mm 2<-1 5<-2 6<90% /str int name=ps100/int str name=q.alt*:*/str !-- example highlighter config, enable per-query with hl=true -- str name=hl.fltext features name/str !-- for this field, we want no fragmenting, just highlighting -- str name=f.name.hl.fragsize0/str !-- instructs Solr to return the field itself if no query terms are found -- str name=f.name.hl.alternateFieldname/str str name=f.text.hl.fragmenterregex/str !-- defined below -- /lst /requestHandler
Re: boosting particular field values
I might have misunderstood, but I think I can't do string literals in function queries, right? myfield:something^3.0 I tried it anyway using Solr 1.4; it doesn't seem to work. On Wed, Jul 21, 2010 at 1:48 PM, Markus Jelsma markus.jel...@buyways.nl wrote: function queries match all documents http://wiki.apache.org/solr/FunctionQuery#Using_FunctionQuery -Original message- From: Justin Lolofie jta...@gmail.com Sent: Wed 21-07-2010 20:24 To: solr-user@lucene.apache.org; Subject: boosting particular field values I'm using the dismax request handler, Solr 1.4. I would like to boost the weight of certain fields according to their values... this appears to work: bq=category:electronics^5.5 However, I think this boosting only affects sorting of the results that have already matched? So if I only get 10 rows back, I might not get any records back that are category electronics. If I get 100 rows, I can see that bq is working. However, I only want to get 10 rows. How does one affect the kinds of results that are matched to begin with? bq is the wrong thing to use, right? Thanks for any help, Justin
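For what Justin is after — affecting which documents match rather than how the matches rank — a filter query (fq) is the usual tool, since bq only adds to the score of documents the main query already matched. A sketch; the query and field values below are illustrative:

```
# Restrict matches to electronics (hard filter, no effect on scoring):
q=ipod&fq=category:electronics&rows=10

# By contrast, boost electronics but still match everything relevant:
q=ipod&bq=category:electronics^5.5&rows=10
```

With fq, the first 10 rows are guaranteed to be electronics; with bq, they merely tend to be, depending on how the boost compares to the rest of the score.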
Count hits per document?
If I search for foo, I get back a list of documents. Any way to get a per-document hit count? Thanks! -Pete
Re: Using hl.regex.pattern to print complete lines
Still not working ... any ideas? -Pete On Jul 14, 2010, at 11:56 AM, Peter Spam wrote: Any other thoughts, Chris? I've been messing with this a bit, and can't seem to get (?m)^.*$ to do what I want. 1) I don't care how many characters it returns, I'd like entire lines all the time 2) I just want it to always return 3 lines: the line before, the actual line, and the line after. 3) This should be like grep -C1 Thanks for your time! -Pete On Jul 9, 2010, at 12:08 AM, Peter Spam wrote: Ah, this makes sense. I've changed my regex to (?m)^.*$, and it works better, but I still get fragments before and after some returns. Thanks for the hint! -Pete On Jul 8, 2010, at 6:27 PM, Chris Hostetter wrote: : If you can use the latest branch_3x or trunk, hl.fragListBuilder=single : is available that is for getting entire field contents with search terms : highlighted. To use it, set hl.useFastVectorHighlighter to true. He doesn't want the entire field -- his stored field values contain multi-line strings (using newline characters) and he wants to make fragments per line (ie: bounded by newline characters, or the start/end of the entire field value) Peter: i haven't looked at the code, but i expect that the problem is that the java regex engine isn't being used in a way that makes ^ and $ match any line boundary -- they are probably only matching the start/end of the field (and . is probably only matching non-newline characters) java regexes support embedded flags (ie: (?xyz)your regex) so you might try that (i don't remember what the correct modifier flag is for the multiline mode off the top of my head) -Hoss
Re: faceted search with job title
You could grab your xpath rules from a db too. This is what I did for a price scraping app I did a while ago. New sites were added with a set of rules using a web UI. You could certainly use regex of course, but IMO that's more complex than writing a simple xpath. Using JavaScript or some dom traversal code, you could quite easily create a point-and-click tool to generate rules very simply and quickly. On 21 Jul 2010, at 23:10, Savannah Beckett savannah_becket...@yahoo.com wrote: And I will have to recompile the dom or sax code each time I add a job board for crawling. A regex pattern is only a string which can be stored in a text file or db, and retrieved based on the job board. What do you think? From: Nagelberg, Kallin knagelb...@globeandmail.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Wed, July 21, 2010 10:39:32 AM Subject: RE: faceted search with job title Yeah you should definitely just setup a custom parser for each site.. should be easy to extract title using groovy's xml parsing along with tagsoup for sloppy html. If you can't find the pattern for each site leading to the job title how can you expect solr to? Humans have the advantage here :P -Kallin Nagelberg -Original Message- From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] Sent: Wednesday, July 21, 2010 12:20 PM To: solr-user@lucene.apache.org Cc: dave.sea...@magicalia.com Subject: Re: faceted search with job title mmm...there must be better way...each job board has different format. If there are constantly new job boards being crawled, I don't think I can manually look for specific sequence of tags that leads to job title. Most of them don't even have class or id. There is no guarantee that the job title will be in the title tag, or header tag. Something else can be in the title. Should I do this in a class that extends IndexFilter in Nutch? Thanks.
From: Dave Searle dave.sea...@magicalia.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Wed, July 21, 2010 8:42:55 AM Subject: RE: faceted search with job title You'd probably need to do some post processing on the pages and set up rules for each website to grab that specific bit of data. You could load the html into an xml parser, then use xpath to grab content from a particular tag with a class or id, based on the particular website -Original Message- From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] Sent: 21 July 2010 16:38 To: solr-user@lucene.apache.org Subject: faceted search with job title Hi, I am currently using nutch to crawl some job pages from job boards. They are in my solr index now. I want to do faceted search with the job titles. How? The job titles can be in any locations of the page, e.g. title, header, content... If I use indexfilter in Nutch to search the content for job title, there are hundred of thousands of job titles, I can't hard code them all. Do you have a better idea? I think I need the job title in a separate field in the index to make it work with solr faceted search, am I right? Thanks.
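The per-site XPath rule idea from this thread can be sketched with the JDK's built-in parser and XPath engine. The page snippet and the "jobtitle" class below are made-up examples, and real job-board HTML would first need cleaning with something like TagSoup to make it well-formed; the rule string is exactly the kind of per-site value that could live in a database or text file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class TitleExtractor {
    // Apply a per-site XPath rule to a well-formed page and return the
    // text of the first matching node.
    public static String extract(String page, String rule) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8)));
        return XPathFactory.newInstance().newXPath().evaluate(rule, doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical cleaned-up job-board snippet; "jobtitle" is invented.
        String page = "<html><body><h1 class=\"jobtitle\">Senior Java Developer</h1></body></html>";
        System.out.println(extract(page, "//h1[@class='jobtitle']"));
        // prints: Senior Java Developer
    }
}
```

Swapping job boards then means changing the rule string, not recompiling parser code, which was the maintainability concern raised above.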
Clustering results limit?
Hi, I am attempting to cluster a query. It kinda works, but where my (regular) query returns 500 results, the clusters only show 1-10 hits each (5 clusters). Never more than 10 docs, and I know it's not right. What could be happening here? It should be showing dozens of documents per cluster. thanks, Darren
Re: Dismax query response field number
Fields or documents? It will return all of the fields that are 'stored'. The default number of documents to return is 10. Returning all of the documents is very slow, so you have to request that with the rows= parameter. On Wed, Jul 21, 2010 at 3:32 PM, scr...@asia.com wrote: Hi, It seems that not all field are returned from query response when i use DISMAX? Only first 10?? Any idea? Here is my solrconfig: requestHandler name=dismax class=solr.SearchHandler lst name=defaults str name=defTypedismax/str str name=echoParamsexplicit/str str name=fl*/str float name=tie0.01/float str name=qf text^0.5 content^1.1 title^1.5 /str str name=pf text^0.2 content^1.1 title^1.5 /str str name=bf recip(price,1,1000,1000)^0.3 /str str name=mm 2<-1 5<-2 6<90% /str int name=ps100/int str name=q.alt*:*/str !-- example highlighter config, enable per-query with hl=true -- str name=hl.fltext features name/str !-- for this field, we want no fragmenting, just highlighting -- str name=f.name.hl.fragsize0/str !-- instructs Solr to return the field itself if no query terms are found -- str name=f.name.hl.alternateFieldname/str str name=f.text.hl.fragmenterregex/str !-- defined below -- /lst /requestHandler -- Lance Norskog goks...@gmail.com
Re: Count hits per document?
You have to store the termvectors when you index, and then retrieve them when you do a query. Highlighting does exactly this; the easy way to do this is to ask for highlighting and search for the highlighted words, and count them. On Wed, Jul 21, 2010 at 4:21 PM, Peter Spam ps...@mac.com wrote: If I search for foo, I get back a list of documents. Any way to get a per-document hit count? Thanks! -Pete -- Lance Norskog goks...@gmail.com
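Counting the highlights client-side is straightforward once the snippets come back. A sketch assuming Solr's default `<em>` highlight markers (the pre/post markers are configurable via hl.simple.pre and hl.simple.post, so the pattern would change accordingly):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HitCounter {
    // Count highlighted terms in a snippet returned by Solr's highlighter,
    // which wraps each matched term in <em>...</em> by default.
    public static int countHits(String snippet) {
        Matcher m = Pattern.compile("<em>").matcher(snippet);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String snippet = "the <em>foo</em> bar and another <em>foo</em> here";
        System.out.println(countHits(snippet)); // prints: 2
    }
}
```

For a true per-document total (not just within the returned fragment), hl.maxAnalyzedChars and hl.fragsize would need to be large enough that the snippet covers the whole field.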
Re: Using hl.regex.pattern to print complete lines
Java regex might be different from all other regex, so writing a test program and experimenting is the only way. Once you decide that this expression really is what you want, and that it does not achieve what you expect, you might have found a bug in highlighting. Lucene/Solr highlighting has always been a difficult area, and might not do everything right. On Wed, Jul 21, 2010 at 4:20 PM, Peter Spam ps...@mac.com wrote: Still not working ... any ideas? -Pete On Jul 14, 2010, at 11:56 AM, Peter Spam wrote: Any other thoughts, Chris? I've been messing with this a bit, and can't seem to get (?m)^.*$ to do what I want. 1) I don't care how many characters it returns, I'd like entire lines all the time 2) I just want it to always return 3 lines: the line before, the actual line, and the line after. 3) This should be like grep -C1 Thanks for your time! -Pete On Jul 9, 2010, at 12:08 AM, Peter Spam wrote: Ah, this makes sense. I've changed my regex to (?m)^.*$, and it works better, but I still get fragments before and after some returns. Thanks for the hint! -Pete On Jul 8, 2010, at 6:27 PM, Chris Hostetter wrote: : If you can use the latest branch_3x or trunk, hl.fragListBuilder=single : is available that is for getting entire field contents with search terms : highlighted. To use it, set hl.useFastVectorHighlighter to true. He doesn't want the entire field -- his stored field values contain multi-line strings (using newline characters) and he wants to make fragments per line (ie: bounded by newline characters, or the start/end of the entire field value) Peter: i haven't looked at the code, but i expect that the problem is that the java regex engine isn't being used in a way that makes ^ and $ match any line boundary -- they are probably only matching the start/end of the field (and . 
is probably only matching non-newline characters) java regexes support embedded flags (ie: (?xyz)your regex) so you might try that (i don't remember what the correct modifier flag is for the multiline mode off the top of my head) -Hoss -- Lance Norskog goks...@gmail.com
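A minimal test program of the kind Lance suggests, showing what the `(?m)` embedded flag actually changes in Java's regex engine:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineDemo {
    // With the embedded (?m) flag (equivalent to Pattern.MULTILINE),
    // ^ and $ match at every line boundary instead of only at the start
    // and end of the whole input, so ^.*$ captures one line per match.
    public static List<String> lines(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("(?m)^.*$").matcher(text);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lines("line one\nline two\nline three"));
        // prints: [line one, line two, line three]
        // Without (?m), "^.*$" finds no match in this input at all:
        // '.' does not cross '\n', and $ only matches at the very end.
    }
}
```

This confirms Hoss's diagnosis: the pattern itself is fine once multiline mode is on, so if hl.regex.pattern still misbehaves with `(?m)`, the problem is in how the highlighter applies the regex, not in the expression.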
how to change the default path of Solr Tomcat
Hi everyone, I really need your help. This is the default address that I got from Solr: http://172.16.17.126:8983/solr/ The question is how to change that path to: http://172.16.17.126:8983/search/ Please, I really need your help. Thanks a lot in advance
Re: a bug of solr distributed search
I think what Siva means is that when there are docs with the same url, keep the doc whose score is larger. This is the right solution. But it shows a problem of distributed search without a common idf: a doc will get different scores in different shards. 2010/7/22 MitchK mitc...@web.de: It already was sorted by score. The problem here is the following: Shard_A and shard_B contain doc_X and doc_X. If you are querying for something, doc_X could have a score of 1.0 at shard_A and a score of 12.0 at shard_B. You can never be sure which doc Solr sees first. In the bad case, Solr sees doc_X first at shard_A and ignores it at shard_B. That means that the doc maybe would occur at page 10 in pagination, although it *should* occur at page 1 or 2. Kind regards, - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p984743.html Sent from the Solr - User mailing list archive at Nabble.com.
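The deterministic-merge idea from this thread (process shard responses in sorted-name order, and keep the higher-scoring copy of a duplicated uniqueKey) can be sketched outside of Solr's QueryComponent. `Hit` here is a made-up stand-in for a shard response entry, not Solr's actual class:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DedupMerge {
    // Minimal stand-in for one shard hit: (uniqueKey, score, shard name).
    record Hit(String id, float score, String shard) {}

    // Merge hits from several shards. When the same uniqueKey appears in
    // more than one shard, keep the higher-scoring copy; iterating shards
    // in sorted-name order (TreeMap) also makes score ties deterministic,
    // unlike iterating sreq.responses in arrival order.
    public static Map<String, Hit> merge(Map<String, List<Hit>> byShard) {
        Map<String, Hit> merged = new HashMap<>();
        for (String shard : new TreeMap<>(byShard).keySet()) {
            for (Hit h : byShard.get(shard)) {
                Hit prev = merged.get(h.id());
                if (prev == null || h.score() > prev.score()) merged.put(h.id(), h);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<Hit>> byShard = new HashMap<>();
        byShard.put("shard2", List.of(new Hit("http://a", 12.0f, "shard2")));
        byShard.put("shard1", List.of(new Hit("http://a", 1.0f, "shard1"),
                                      new Hit("http://b", 3.0f, "shard1")));
        System.out.println(merge(byShard).get("http://a").shard());
        // prints: shard2  (the higher-scoring duplicate wins)
    }
}
```

As Mitch notes, this only makes the choice deterministic; the underlying score discrepancy remains unless idf is computed globally across shards.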
Re: how to change the default path of Solr Tomcat
Your environment may be different, but this is how I did it. (Apache Tomcat on Windows 2008) go to \program files\apache...\Tomcat\conf\catalina\localhost rename solr.xml to search.xml recycle Tomcat service -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-change-the-default-path-of-Solr-Tomcat-tp985881p985937.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to change the default path of Solr Tomcat
Firstly, I really appreciate your response to my question, Ken. I'm using Tomcat on Linux Debian. I can't find the solr.xml in \program files\apache...\Tomcat\conf\catalina\localhost; there are only 2 files in the localhost folder: host-manager.xml and manager.xml. Any solutions? On 7/22/2010 10:41 AM, kenf_nc wrote: Your environment may be different, but this is how I did it. (Apache Tomcat on Windows 2008) go to \program files\apache...\Tomcat\conf\catalina\localhost rename solr.xml to search.xml recycle Tomcat service
Re: how to change the default path of Solr Tomcat
It seems like you are using the default server (Jetty on port 8983); it also looks like you are trying to run it with the command java -jar start.jar. If so, then under the same directory there is another directory called webapps. Go in there, rename solr.war to search.war, bounce the server, and you should be good to go! Eben wrote: firstly, I really appreciate your respond to my question Ken I'm using Tomcat on Linux Debian I can't find the solr.xml in \program files\apache...\Tomcat\conf\catalina\localhost there are only 2 files in localhost folder: host-manager.xml and manager.xml any solutions? On 7/22/2010 10:41 AM, kenf_nc wrote: Your environment may be different, but this is how I did it. (Apache Tomcat on Windows 2008) go to \program files\apache...\Tomcat\conf\catalina\localhost rename solr.xml to search.xml recycle Tomcat service
Re: how to change the default path of Solr Tomcat
Check: /var/lib/tomcat5.5/conf/Catalina/localhost/ Are you using Tomcat on a custom port (the default tomcat port is 8080)? Check your ports ($ sudo netstat -nlp) Maybe try searching the file system for the solr.xml file? $ sudo find / -name solr.xml Hope this helps. K On Wed, Jul 21, 2010 at 8:22 PM, Eben e...@tokobagus.com wrote: firstly, I really appreciate your respond to my question Ken I'm using Tomcat on Linux Debian I can't find the solr.xml in \program files\apache...\Tomcat\conf\catalina\localhost there are only 2 files in localhost folder: host-manager.xml and manager.xml any solutions? On 7/22/2010 10:41 AM, kenf_nc wrote: Your environment may be different, but this is how I did it. (Apache Tomcat on Windows 2008) go to \program files\apache...\Tomcat\conf\catalina\localhost rename solr.xml to search.xml recycle Tomcat service
Re: set field with value 0 to the end
Why are you using default=0? It's optional; remove it from the field definition. -- View this message in context: http://lucene.472066.n3.nabble.com/set-field-with-value-0-to-the-end-tp980580p986115.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to change the default path of Solr Tomcat
Hi Wong, I'm using Default server (Jetty with port 8983) Girish solution already solve my problem thanks for your response Wong :) On 7/22/2010 11:57 AM, K Wong wrote: Check: /var/lib/tomcat5.5/conf/Catalina/localhost/ Are you using Tomcat on a custom port (the default tomcat port is 8080)? Check your ports ($ sudo netstat -nlp) Maybe try searching the file system for the solr.xml file? $ sudo find / -name solr.xml Hope this helps. K On Wed, Jul 21, 2010 at 8:22 PM, Ebene...@tokobagus.com wrote: firstly, I really appreciate your respond to my question Ken I'm using Tomcat on Linux Debian I can't find the solr.xml in \program files\apache...\Tomcat\conf\catalina\localhost there are only 2 files in localhost folder: host-manager.xml and manager.xml any solutions? On 7/22/2010 10:41 AM, kenf_nc wrote: Your environment may be different, but this is how I did it. (Apache Tomcat on Windows 2008) go to \program files\apache...\Tomcat\conf\catalina\localhost rename solr.xml to search.xml recycle Tomcat service
facet.query with facet.date
Hello, I need to create two date facets displaying counts of a particular field's values. With normal facets, this can be done with facet.query, but this parameter is not available to facet.date. Is this possible? I'd really prefer to avoid performing two queries. Thanks William -- View this message in context: http://lucene.472066.n3.nabble.com/facet-query-with-facet-date-tp986206p986206.html Sent from the Solr - User mailing list archive at Nabble.com.