Cached document view in solr
Hi, Nutch search results provide a link for viewing the cached copy of a document. It fetches the raw content from the segments based on the document id (cached.jsp). Is it possible to have similar functionality in Solr? What can be done to achieve this? Any pointers. Thanks, Ram DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
Re: How to set User.dir or CWD for Solr during Tomcat startup
On 07.01.2010 at 00:07, Turner, Robbin J wrote: I've been doing a bunch of googling and haven't seen if there is a parameter to set within Tomcat other than the solr/home which is set up in the solr.xml under $CATALINA_HOME/conf/Catalina/localhost/. Hi. We set this in solr.xml:

<Context docBase="/opt/solr-tomcat/apache-tomcat-6.0.20/webapps/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/opt/solr-tomcat/solr" override="true"/>
</Context>

http://wiki.apache.org/solr/SolrTomcat#Simple_Example_Install hope this helps. olivier -- Olivier Dobberkau . . . . . . . . . . . . . . Je TYPO3, desto d.k.d
Is there other way than sorting by date?
Hi guys, I am getting started with Solr. When I search a collection of data, I care about both the document score (relevance to the user's query words) and the document's publishTime (which is another field in each document). If I simply sort the matching documents by the publishTime field, then the score is not considered. How should I handle this? Maybe I should use the publishTime field as another search field, and compute a composite score together with the relevance score? Any hints, thanks very much! -- 梅旺生
Re: DisMaxRequestHandler bf configuration
It wouldn't be q.alt though, just q, in the config file. q.alt is typically *:*; it's the fallback query when no q is provided. Though, thinking about it, q.alt would work here, but I'd use q personally. On Jan 6, 2010, at 9:45 PM, Andy wrote: Let me make sure I understand you. I'd get my regular query from haystack as qq=foo rather than q=foo. Then, within the dismax section of solrconfig, I put:

<str name="q.alt">{!boost b=$popularityboost v=$qq}</str>
<str name="popularityboost">log(popularity)</str>

Is that what you meant? --- On Wed, 1/6/10, Yonik Seeley yo...@lucidimagination.com wrote: From: Yonik Seeley yo...@lucidimagination.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Wednesday, January 6, 2010, 8:42 PM On Wed, Jan 6, 2010 at 8:24 PM, Andy angelf...@yahoo.com wrote: I meant: can I do it with dismax without modifying every single query? I'm accessing Solr through haystack and all queries are generated by haystack. I'd much rather not have to go under haystack to modify the generated queries. Hence I'm trying to find a way to boost every query by default. If you can get haystack to pass through the user query as something like qq, then yes - just use something like the last link I showed at http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents and set defaults for everything except qq. -Yonik http://www.lucidimagination.com --- On Wed, 1/6/10, Yonik Seeley yo...@lucidimagination.com wrote: From: Yonik Seeley yo...@lucidimagination.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Wednesday, January 6, 2010, 7:48 PM On Wed, Jan 6, 2010 at 7:43 PM, Andy angelf...@yahoo.com wrote: So if I want to configure Solr to turn every query q=foo into q={!boost b=log(popularity)}foo, dismax wouldn't work but edismax would? You can do it with dismax; it's just that the syntax is slightly more convoluted.
Check out the section on boosting newer documents: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
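Putting Yonik's suggestion together, the handler configuration would look roughly like the following. This is only a sketch assembled from the wiki link above: the handler name and qf value are illustrative, while qq, popularityboost, and log(popularity) come from the thread itself.

```xml
<!-- solrconfig.xml: the application sends the raw user query as qq=...;
     the configured q wraps it in a popularity boost -->
<requestHandler name="/boosted" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">{!boost b=$popularityboost v=$qq defType=dismax}</str>
    <str name="popularityboost">log(popularity)</str>
    <str name="qf">text</str>
  </lst>
</requestHandler>
```

With this in place, a request like /boosted?qq=foo is equivalent to q={!boost b=log(popularity)}foo without the client ever seeing the boost syntax.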
RE: Cached document view in solr
Nutch search results provide a link for viewing the cached copy of a document. It fetches the raw content from the segments based on the document id (cached.jsp). Is it possible to have similar functionality in Solr? What can be done to achieve this? Any pointers. I could retrieve the content using the text field ('fl=text'), so content can be retrieved. But it is the parsed text, with font formatting lost. Can the original content be stored in any field as-is? Thanks, Ram
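On the last question (keeping the original, unparsed content), the usual approach is a separate field that is stored but never analyzed or indexed, and to send the raw markup to that field at index time; Solr returns stored values verbatim. A sketch, with an illustrative field name:

```xml
<!-- schema.xml: holds the raw document for a "cached copy" view.
     stored="true" so it can be returned with fl=raw_content;
     indexed="false" so it is never tokenized or searched -->
<field name="raw_content" type="string" indexed="false" stored="true"/>
```

Note that Solr only stores what the client sends, so the indexing pipeline has to supply the original content alongside the parsed text.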
Re: Is there other way than sorting by date?
2010/1/7 Wangsheng Mei hairr...@gmail.com Hi guys, I am getting started with Solr. When I search a collection of data, I care about both the document score (relevance to the user's query words) and the document's publishTime (which is another field in each document). If I simply sort the matching documents by the publishTime field, then the score is not considered. How should I handle this? Maybe I should use the publishTime field as another search field, and compute a composite score together with the relevance score? Any hints, thanks very much! Perhaps this can help? http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents -- Regards, Shalin Shekhar Mangar.
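The recipe on that wiki page boils down to an additive boost function on the date field. With the publishTime field from the question it would look roughly like this (the constants are the FAQ's one-year half-life example; tune them to taste):

```xml
<!-- solrconfig.xml, inside the dismax handler's defaults:
     newer documents get a higher additive boost, so relevance
     and recency are combined instead of hard-sorting by date -->
<str name="bf">recip(ms(NOW,publishTime),3.16e-11,1,1)</str>
```

The same function can also be passed per-request as a bf parameter rather than baked into the config.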
Re: replication -- missing field data file
Actually it does not. BTW, FYI, backup is just for taking periodic backups; it is not necessary for the ReplicationHandler to work. On Thu, Jan 7, 2010 at 2:37 AM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: How can you tell when the backup is done? -Original Message- From: noble.p...@gmail.com [mailto:noble.p...@gmail.com] On Behalf Of Noble Paul Sent: Wednesday, January 06, 2010 12:23 PM To: solr-user Subject: Re: replication -- missing field data file the index dir is named 'index'; others will be stored as 'index.<date-as-number>' On Wed, Jan 6, 2010 at 10:31 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: How can you differentiate between the backup and the normal index files? -Original Message- From: noble.p...@gmail.com [mailto:noble.p...@gmail.com] On Behalf Of Noble Paul Sent: Wednesday, January 06, 2010 11:52 AM To: solr-user Subject: Re: replication -- missing field data file On Wed, Jan 6, 2010 at 9:49 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: I set up replication between 2 cores on one master and 2 cores on one slave. Before doing this the master was working without issues, and I stopped all indexing on the master. Now that replication has synced the index files, an .fdt file is suddenly missing on both the master and the slave. Pretty much every operation (core reload, commit, add document) fails with an error like the one posted below. How could this happen? How can one recover from such an error? Is there any way to regenerate the .fdt file without re-indexing everything? This brings me to a question about backups. If I run the replication?command=backup command, where is this backup stored? I've tried this a few times and get an OK response from the machine, but I don't see the backup generated anywhere. The backup is done asynchronously, so it always gives an OK response immediately. The backup is created in the data dir itself. Thanks, Gio.
org.apache.solr.common.SolrException: Error handling 'reload' action
    at org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:412)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:142)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:298)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: Y:\solrData\FilingsCore2\index\_a0r.fdt (The system cannot find the file specified)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:579)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:425)
    at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:486)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:409)
    ... 18 more
Caused by: java.io.FileNotFoundException: Y:\solrData\FilingsCore2\index\_a0r.fdt (The system cannot find the file specified)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(Unknown Source)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:78)
    at
Meaning of this error: Failure to meet condition(s) of required/prohibited clause(s)???
Hi All, I have a document indexed in Solr, which is as follows:

<doc>
  <str name="id">P-E-HE-Philips-32PFL5409-98-Black-32</str>
  <arr name="keywords">
    <str>Philips</str>
    <str>LCD TVs</str>
  </arr>
  <str name="title">Philips 32PFL5409-98 32 LCD TV with Pixel Plus HD (Black, 32)</str>
</doc>

Now when I search for "lcd tvs", I don't see the above doc in the search results. On doing explainOther, I got the following output:

P-E-HE-Philips-32PFL5409-98-Black-32: 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  0.0 = no match on required clause (((subKeywords:lcd^0.1 | keywords:lcd^0.5 | defaultKeywords:lcd | contributors:lcd^0.5 | title:lcd) (subKeywords:televis^0.1 | keywords:tvs^0.5 | defaultKeywords:tvs | contributors:tvs^0.5 | (title:televis title:tv title:tvs)))~1)
    0.0 = (NON-MATCH) Failure to match minimum number of optional clauses: 1
  0.91647065 = (MATCH) max of:
    0.91647065 = (MATCH) weight(keywords:lcd tvs^0.5 in 40), product of:
      0.13178125 = queryWeight(keywords:lcd tvs^0.5), product of:
        0.5 = boost
        11.127175 = idf(docFreq=34, maxDocs=875476)
        0.023686381 = queryNorm
      6.9544845 = (MATCH) fieldWeight(keywords:lcd tvs in 40), product of:
        1.0 = tf(termFreq(keywords:lcd tvs)=1)
        11.127175 = idf(docFreq=34, maxDocs=875476)
        0.625 = fieldNorm(field=keywords, doc=40)

I am not sure what it means, and whether I can tweak it or not. Please note, this score was higher than that of the results which did show up. Regards, Gunjan -- View this message in context: http://old.nabble.com/Meaning-of-this-error%3A-Failure-to-meet-condition%28s%29-of-required-prohibited-clause%28s%29tp27058008p27058008.html Sent from the Solr - User mailing list archive at Nabble.com.
solr updateCSV
I am trying to use Solr's CSV updater to index data. I am trying to handle a .dat format consisting of a field separator, a text qualifier and a line separator, for example:

field1 <field-separator> field2 <field-separator> <line-separator>
<text-qualifier>value for field 1<text-qualifier> <field-separator> <text-qualifier>value for field 2<text-qualifier> <field-separator> <line-separator>

Can we specify the text qualifier and line separator as well? I have tested that we can specify a separator and it works well. -- Nipen Mark
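This went unanswered in the thread, but for reference: alongside separator, the CSV update handler does accept an encapsulator parameter (the text qualifier) and an escape parameter; the record separator, however, is always a newline, so a custom line separator is not supported. A sketch of building such a request; the host, file path and field names are made up for illustration:

```python
# Sketch: constructing an update/csv request with a tab field separator
# and a double-quote text qualifier.  Server URL, stream.file path and
# fieldnames are hypothetical.
from urllib.parse import urlencode

params = {
    "commit": "true",
    "separator": "\t",        # field separator (tab)
    "encapsulator": '"',      # text qualifier
    "stream.file": "/data/input.dat",
    "fieldnames": "field1,field2",
}
url = "http://localhost:8983/solr/update/csv?" + urlencode(params)
print(url)
```

The resulting URL can be fetched with curl exactly like the examples elsewhere on this list.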
Re: Is there other way than sorting by date?
This is exactly what I need, I really appreciate it. 2010/1/7 Shalin Shekhar Mangar shalinman...@gmail.com 2010/1/7 Wangsheng Mei hairr...@gmail.com Hi guys, I am getting started with Solr. When I search a collection of data, I care about both the document score (relevance to the user's query words) and the document's publishTime (which is another field in each document). If I simply sort the matching documents by the publishTime field, then the score is not considered. How should I handle this? Maybe I should use the publishTime field as another search field, and compute a composite score together with the relevance score? Any hints, thanks very much! Perhaps this can help? http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents -- Regards, Shalin Shekhar Mangar. -- 梅旺生
Field highlighting
Hi, I'm trying to highlight short text values. The field they come from has a type shared with other fields. I have highlighting working on other fields, but not on this one. Why?
Re: Meaning of this error: Failure to meet condition(s) of required/prohibited clause(s)???
How are these fields defined in your schema.xml? Note that String types are indexed without tokenization, so if your fields are defined with a String field type, that may be part of your problem (try the text type if so). If this is irrelevant, please show us the relevant parts of your schema and the query you're submitting. Erick On Thu, Jan 7, 2010 at 6:17 AM, gunjan_versata gunjanga...@gmail.com wrote: Hi All, I have a document indexed in Solr, which is as follows:

<doc>
  <str name="id">P-E-HE-Philips-32PFL5409-98-Black-32</str>
  <arr name="keywords">
    <str>Philips</str>
    <str>LCD TVs</str>
  </arr>
  <str name="title">Philips 32PFL5409-98 32 LCD TV with Pixel Plus HD (Black, 32)</str>
</doc>

Now when I search for "lcd tvs", I don't see the above doc in the search results. On doing explainOther, I got the following output:

P-E-HE-Philips-32PFL5409-98-Black-32: 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  0.0 = no match on required clause (((subKeywords:lcd^0.1 | keywords:lcd^0.5 | defaultKeywords:lcd | contributors:lcd^0.5 | title:lcd) (subKeywords:televis^0.1 | keywords:tvs^0.5 | defaultKeywords:tvs | contributors:tvs^0.5 | (title:televis title:tv title:tvs)))~1)
    0.0 = (NON-MATCH) Failure to match minimum number of optional clauses: 1
  0.91647065 = (MATCH) max of:
    0.91647065 = (MATCH) weight(keywords:lcd tvs^0.5 in 40), product of:
      0.13178125 = queryWeight(keywords:lcd tvs^0.5), product of:
        0.5 = boost
        11.127175 = idf(docFreq=34, maxDocs=875476)
        0.023686381 = queryNorm
      6.9544845 = (MATCH) fieldWeight(keywords:lcd tvs in 40), product of:
        1.0 = tf(termFreq(keywords:lcd tvs)=1)
        11.127175 = idf(docFreq=34, maxDocs=875476)
        0.625 = fieldNorm(field=keywords, doc=40)

I am not sure what it means, and whether I can tweak it or not. Please note, this score was higher than that of the results which did show up.
Regards, Gunjan
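If Erick's diagnosis applies (the explain output does show "keywords:lcd tvs" surviving as a single untokenized term, which is what a String field produces), the fix would be to switch the field to a tokenized type. A sketch only; these attribute values are guesses, not Gunjan's actual schema:

```xml
<!-- schema.xml: "text" is the analyzed type from the example schema,
     so "LCD TVs" indexes as the separate terms lcd and tvs, which the
     per-word dismax query can then match -->
<field name="keywords" type="text" indexed="true" stored="true" multiValued="true"/>
```

A full re-index would be needed after such a change, since existing terms were written with the old type.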
Re: Field highlighting
It's really hard to provide any response with so little information. Could you show us the difference between a field that works and one that doesn't? Especially the relevant schema.xml entries and the query that fails to highlight. Erick On Thu, Jan 7, 2010 at 7:47 AM, Xavier Schepler xavier.schep...@sciences-po.fr wrote: Hi, I'm trying to highlight short text values. The field they come from has a type shared with other fields. I have highlighting working on other fields, but not on this one. Why?
Re: Field highlighting
Erick Erickson wrote: It's really hard to provide any response with so little information. Could you show us the difference between a field that works and one that doesn't? Especially the relevant schema.xml entries and the query that fails to highlight. Erick On Thu, Jan 7, 2010 at 7:47 AM, Xavier Schepler xavier.schep...@sciences-po.fr wrote: Hi, I'm trying to highlight short text values. The field they come from has a type shared with other fields. I have highlighting working on other fields, but not on this one. Why? Thanks for your response. Here are some extracts from my schema.xml:

<fieldtype name="textFr" class="solr.TextField">
  <analyzer>
    <!-- remove meaningless stopwords -->
    <filter class="solr.StopFilterFactory" words="french-stopwords.txt" ignoreCase="true"/>
    <!-- split into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- strip accents -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <!-- strip the dots at the end of acronyms -->
    <filter class="solr.StandardFilterFactory"/>
    <!-- lowercase -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stemming with the Porter filter -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    <!-- synonyms -->
    <filter class="solr.SynonymFilterFactory" synonyms="test-synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldtype>

Here's a field on which highlighting works:

<field name="questionsLabelsFr" required="false" type="textFr" multiValued="true" indexed="true" stored="true" compressed="false" omitNorms="false" termVectors="true" termPositions="true" termOffsets="true"/>

Here's the field on which it doesn't:

<field name="modalitiesLabelsFr" required="false" type="textFr" multiValued="true" indexed="true" stored="true" compressed="false" omitNorms="false" termVectors="true" termPositions="true" termOffsets="true"/>

They are kinda the same. But modalitiesLabelsFr contains mostly short strings like: Côtes-d'Armor, Creuse, Dordogne, Doubs, Drôme, Eure, Eure-et-Loir, Finistère. When matches are found in them, I get a list like this, with no text:

<lst name="highlighting">
  <lst name="dbbd3642-db1d-4b35-9280-11582523903d"/>
  <lst name="f1d8be2d-1070-4111-b16e-94d16c8c0bc6"/>
</lst>

The name attribute is the uid of the document. I tried several values for hl.fragsize (0, 1, 2, ...) with no success at all.
Combining frange with other query parameters?
Hey, I'm doing a query which involves using an frange in the filter query, and I was wondering if there is a way of combining the frange with other parameters. Something like ({!frange l=x u=y}do stuff) AND field:param, but obviously this doesn't work. Is there a way of doing this? Oliver
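There was no reply in the archive, but two standard ways exist for this; both are sketches using placeholder names (x, y, field, myfunc). Either put the frange in its own fq parameter, since multiple filter queries are implicitly intersected with the main query, or nest the frange inside a lucene query via the _query_ hook:

```
q=field:param&fq={!frange l=x u=y}myfunc(otherfield)

q=field:param AND _query_:"{!frange l=x u=y}myfunc(otherfield)"
```

The fq form is usually preferable: it keeps the range test out of scoring and makes it independently cacheable.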
Re: Strange Behavior When Using CSVRequestHandler
Erick - thanks very much, all of this makes sense. But the one thing I still find puzzling is the fact that re-adding the file a second, third, fourth, etc. time causes numDocs to increase, and ALWAYS by the same amount (141,645). Any ideas as to what could cause that? Dan Erick Erickson wrote: I think the root of your problem is that unique fields should NOT be multivalued. See http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=(unique)|(key) In this case, since you're tokenizing, your query field is implicitly multi-valued; I don't know what the behavior will be. But there's another problem: all the filters in your analyzer definition will mess up the correspondence between the Unix uniq count and numDocs even if you got by the above. I.e. StopFilter would make the lines "a problem" and "the problem" identical. WordDelimiter would do all kinds of interesting things. LowerCaseFilter would make "Myproblem" and "myproblem" identical. RemoveDuplicatesFilter would make "interesting interesting" and "interesting" identical. You could define a second field, make *that* one unique and NOT analyze it in any way. You could hash your sentences and define the hash as your unique key. You could... HTH, Erick On Wed, Jan 6, 2010 at 1:06 PM, danben dan...@gmail.com wrote: The problem: Not all of the documents that I expect to be indexed are showing up in the index.
The background: I start off with an empty index based on a schema with a single field named 'query', marked as unique and using the following analyzer:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

My input is a utf-8 encoded file with one sentence per line. Its total size is about 60MB. I would like each line of the file to correspond to a single document in the Solr index. If I print the number of unique lines in the file (using cat <file> | sort | uniq | wc -l), I get a little over 2M. Printing the total number of lines in the file gives me around 2.7M. I use the following to start indexing:

curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/home/gkropitz/querystage2map/file1&stream.contentType=text/plain;charset=utf-8&fieldnames=query&escape=\'

When this command completes, I see numDocs is approximately 470k (which is what I find strange) and maxDocs is approximately 890k (which is fine since I know I have around 700k duplicates). Even more confusing is that if I run this exact command a second time without performing any other operations, numDocs goes up to around 610k, and a third time brings it up to about 750k. Can anyone tell me what might cause Solr not to index everything in my input file the first time, and why it would be able to index new documents the second and third times?
I also have this line in solrconfig.xml, if it matters:

<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>

Thanks, Dan
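Erick's "hash your sentences" suggestion can be sketched as follows: derive a deterministic id from each line, make that id the uniqueKey (as a plain, unanalyzed string field), and leave the analyzed 'query' field out of uniqueness entirely. Field names and sample sentences here are illustrative, not Dan's data:

```python
# Sketch: a stable per-line document id, so the analysis chain on the
# 'query' field no longer interferes with duplicate detection.
import hashlib

def line_key(line: str) -> str:
    """Deterministic id for one input line."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()

# Each CSV row would then carry (id, query); 'id' becomes the uniqueKey
# in schema.xml, typed as string so it is never tokenized.
rows = [(line_key(s), s) for s in ["first sentence", "first sentence", "second"]]
unique_ids = {k for k, _ in rows}
print(len(unique_ids))  # duplicate lines collapse to the same id
```

With this scheme, numDocs after indexing should equal the Unix `sort | uniq | wc -l` count exactly, since deduplication no longer depends on analyzed tokens.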
Re: High Availability
I've tried having two servers set up to replicate each other, and it is not a pretty thing. It seems that Solr doesn't really check whether the version # on the master is newer than the version # on the slave before deciding to replicate; it only looks to see if it's different. As a result, what ends up happening is this:

1. Both servers at the same revision, say revision 100
2. Update Master 1 to revision 101
3. Master 2 starts a pull of revision 101
4. Master 1 sees Master 2 has a different revision and starts a pull of revision 100

See where it's going? Eventually, both servers seem to end up back at revision 100, and my updates get lost. My sequencing might be a little out of whack here, but nonetheless having two servers set up as slaves to each other does not work properly. I would think, though, that a small code change to check whether the revision # has increased before pulling the files would solve the issue. In the meantime, my plan is to:

1. Set up two index update servers as masters behind an F5 load balancer, with a VIP in an active/passive configuration.
2. Set up N search servers as slaves behind an F5 load balancer, with a VIP in a round-robin configuration. Replication would be from the masters' VIP, instead of any one particular master.
3. The index update servers would have a handler that would do delta updates every so often to keep both servers in sync with the database (I'm only indexing a complex database here, which doesn't lend itself well to SQL querying on the fly).

Ideally, I'd love to be able to force the master servers to update if either one of them switches from passive to active state, but am not sure how to accomplish that. mattin...@yahoo.com Once you start down the dark path, forever will it dominate your destiny.
Consume you it will - Yoda - Original Message From: r...@intelcompute.com r...@intelcompute.com To: solr-user@lucene.apache.org Sent: Mon, January 4, 2010 11:37:22 AM Subject: Re: High Availability Even when Master 1 is alive again, it shouldn't get the floating IP until Master 2 actually fails. So you'd ideally want them replicating to each other, but since only one will be updated/live at a time, it shouldn't cause an issue with clobbering data (?). Just a suggestion though, not done it myself on Solr, only with DB servers. On Mon 04/01/10 16:28, Matthew Inger mattin...@yahoo.com wrote: So, when the masters switch back, does that mean we have to force a full delta update, correct? Once you start down the dark path, forever will it dominate your destiny. Consume you it will - Yoda - Original Message From: To: Sent: Mon, January 4, 2010 11:17:40 AM Subject: Re: High Availability Have you looked into a basic floating IP setup? Have the master also replicate to another hot-spare master. Any downtime during an outage of the 'live' master would be minimal as the hot-spare takes up the floating IP. On Mon 04/01/10 16:13, Matthew Inger wrote: I'm kind of stuck and looking for suggestions for high-availability options. I've figured out without much trouble how to get master-slave replication working. This eliminates any single point of failure in terms of the application's searching capability. I would set up a master which would create the index, and several slaves to act as the search servers, and put them behind a load balancer to distribute the requests. This would ensure that if a slave node goes down, requests would continue to get serviced by the other nodes that are still up. The problem I have is that my particular application also has the capability to trigger index updates from the user interface. This means that the master now becomes a single point of failure for the user interface.
The basic idea of the app is that there are multiple Oracle instances contributing to a single document. The volume and organization of the data (database links, normalization, etc.) prevent any sort of fast querying of the documents via SQL. The solution is to build a Lucene index (via Solr) and use that for searching. When updates are made in the UI, we also send the updates directly to the Solr server (we don't want to wait some arbitrary interval for a delta query to run). So you can see the problem here: if the master is down, the sending of the updates to the master Solr server will fail, causing an application exception. I have tried configuring multiple Solr servers which are each set up as both master and slave to the other, but they keep clobbering each other's index updates and rolling back each other's delta updates. It seems that the replication doesn't take the generation # into account and check that the generation it's fetching is newer than the generation it already has before it applies it. I thought of maybe
Re: No Analyzer, tokenizer or stemmer works at Solr
Erik, you mean everything is okay, but I do not see it? "Internally, for searching, the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." If I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? If yes, I have got another problem: I don't want to waste any disk space. Does copyField store the same data twice? I mean: I have got originalField and copiedField. originalField gets indexed with text_analyzer and copiedField with a stemmer. Does this mean I am storing the original data twice in public form, and once analyzed per analyzer? Or does Solr store the original input only once and make a reference to the public data of originalField? Thank you, Mitch Erik Hatcher-4 wrote: Mitch, again, I think you're misunderstanding what analysis does. You must be expecting (we think, though you've not provided exact duplication steps to be sure) that the value you get back from Solr is the analyzer-processed output. It's not; it's exactly what you provide. Internally, for searching, the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone. There's some thinking going on about implementing it such that analyzed output is stored. You can, however, use the analysis request handler componentry to get analyzed stuff back as you see it in analysis.jsp, on a per-document or per-field text basis, if you're looking to leverage the analyzer output in that fashion from a client. Erik On Jan 7, 2010, at 1:21 AM, MitchK wrote: Hello Erick, thank you for answering. I can do whatever I want - Solr does nothing. For example: if I use the textgen fieldtype, which is predefined, nothing happens to the text. Even the StopFilter is not working - no stopword from stopwords.txt was replaced.
I think that this only affects the index, because if I query for "for" it returns nothing, which is quite correct, due to the work of the StopFilter. Everything works fine on analysis.jsp, but not in reality. If you have any test-case data you want me to add, please tell me and I will show you the saved data afterwards. Thank you. Mitch Erick Erickson wrote: "Well, I have noticed that Solr isn't using ANY analyzer." How do you know this? Because it's highly unlikely that Solr is completely broken on that level. Erick On Wed, Jan 6, 2010 at 3:48 PM, MitchK mitc...@web.de wrote: I have tested a lot, and all the time I thought I had set wrong options for my custom analyzer. Well, I have noticed that Solr isn't using ANY analyzer, filter or stemmer. It seems like it only stores the original input. I am using the example configuration of the current Solr 1.4 release. What's wrong? Thank you!
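On Mitch's disk-space question: stored and indexed are independent per field, so the usual pattern is to store the text once on the source field and mark the copyField target stored="false". The copy then costs index (inverted-term) space only, and the original text is not stored twice. A sketch; the field and type names are illustrative, not from an actual schema:

```xml
<!-- schema.xml: original text stored once, retrievable via fl -->
<field name="originalField" type="text_analyzer" indexed="true" stored="true"/>
<!-- stored="false": only the stemmed inverted index is kept for the copy -->
<field name="copiedField" type="text_stemmed" indexed="true" stored="false"/>
<copyField source="originalField" dest="copiedField"/>
```

Searches can hit copiedField while display always reads originalField, so nothing is duplicated on the stored side.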
RE: replication -- missing field data file
Right, but if you want to take periodic backups and ship them to tape or some DR site, you need to be able to tell when the backup is actually complete. It seems very strange to me that you can actually track the replication progress on a slave, but you can't track the backup progress on a master. To me that suggests that the only reliable way of performing backups is to set up replication to some slave without a regular polling interval. Then force a poll, wait for the sync to complete, and ship the slave's index to redundant storage. Seems like a pretty backwards way of doing things... -Original Message- From: noble.p...@gmail.com [mailto:noble.p...@gmail.com] On Behalf Of Noble Paul Sent: Thursday, January 07, 2010 5:56 AM To: solr-user Subject: Re: replication -- missing field data file Actually it does not. BTW, FYI, backup is just to take periodic backups; it is not necessary for the ReplicationHandler to work. On Thu, Jan 7, 2010 at 2:37 AM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: How can you tell when the backup is done? -Original Message- From: noble.p...@gmail.com [mailto:noble.p...@gmail.com] On Behalf Of Noble Paul Sent: Wednesday, January 06, 2010 12:23 PM To: solr-user Subject: Re: replication -- missing field data file The index dir has the name "index"; others will be stored as index<date-as-number> On Wed, Jan 6, 2010 at 10:31 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: How can you differentiate between the backup and the normal index files? -Original Message- From: noble.p...@gmail.com [mailto:noble.p...@gmail.com] On Behalf Of Noble Paul Sent: Wednesday, January 06, 2010 11:52 AM To: solr-user Subject: Re: replication -- missing field data file On Wed, Jan 6, 2010 at 9:49 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: I set up replication between 2 cores on one master and 2 cores on one slave. 
Before doing this the master was working without issues, and I stopped all indexing on the master. Now that replication has synced the index files, an .fdt file is suddenly missing on both the master and the slave. Pretty much every operation (core reload, commit, add document) fails with an error like the one posted below. How could this happen? How can one recover from such an error? Is there any way to regenerate the .fdt file without re-indexing everything? This brings me to a question about backups. If I run the replication?command=backup command, where is this backup stored? I've tried this a few times and get an OK response from the machine, but I don't see the backup generated anywhere. The backup is done asynchronously, so it always gives an OK response immediately. The backup is created in the data dir itself. Thanks, Gio.

org.apache.solr.common.SolrException: Error handling 'reload' action
    at org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:412)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:142)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:298)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: Y:\solrData\FilingsCore2\index\_a0r.fdt (The system cannot find the file specified)
    at
ontology support
hello, i'm trying to use an ontology (homegrown :) ) to support the search, i.e. I'd like my search engine to report search results for "barack obama" even if I look for "president". I see there's some support in the Nutch API (org.apache.nutch.ontology), so (if it does what I'm looking for) I'm wondering whether something like that comes with Solr too. Any ideas? Claudio -- Claudio Martella Digital Technologies Unit Research Development - Analyst TIS innovation park Via Siemens 19 | Siemensstr. 19 39100 Bolzano | 39100 Bozen Tel. +39 0471 068 123 Fax +39 0471 068 129 claudio.marte...@tis.bz.it http://www.tis.bz.it
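For simple cases, Solr itself can approximate this kind of expansion with a synonym filter rather than a full ontology. A minimal sketch (the field type name and the synonyms file contents are illustrative, not from the thread):

```xml
<!-- schema.xml: index-time synonym expansion (illustrative names) -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms.txt could contain a line like:
         president, barack obama -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Multi-word synonyms such as "barack obama" are generally safer applied at index time, as shown here, since query-time expansion of multi-word entries interacts poorly with the query parser's tokenization.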
Re: No Analyzer, tokenizer or stemmer works at Solr
On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean everything is okay, but I do not see it? "Internally for searching the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." So if I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. Indexed fields and stored fields are different. Solr shows stored fields in the results (however, facets are based on indexed fields). Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably Luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. You have control over what is stored and what is indexed -- how that is configured is up to you. ryan
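The indexed/stored distinction Ryan describes is controlled per field in schema.xml. A hedged sketch (the field names are made up for illustration):

```xml
<!-- indexed = searchable (analyzed); stored = returned verbatim in results -->
<field name="title"    type="text"   indexed="true"  stored="true"/>  <!-- search and display -->
<field name="body_idx" type="text"   indexed="true"  stored="false"/> <!-- search only -->
<field name="body_raw" type="string" indexed="false" stored="true"/>  <!-- display only -->
```

Disk usage follows directly from these flags: a value is written to the stored-fields files only when stored="true", independently of how it is analyzed for the index.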
Re: Strange Behavior When Using CSVRequestHandler
It puzzles me too. I don't know the internals of that code well enough to speculate, but once you're into undefined behavior, I have great faith in *many* inexplicable things happening. Erick On Thu, Jan 7, 2010 at 9:45 AM, danben dan...@gmail.com wrote: Erick - thanks very much, all of this makes sense. But the one thing I still find puzzling is the fact that re-adding the file a second, third, fourth etc. time causes numDocs to increase, and ALWAYS by the same amount (141,645). Any ideas as to what could cause that? Dan Erick Erickson wrote: I think the root of your problem is that unique fields should NOT be multivalued. See http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=(unique)|(key) In this case, since you're tokenizing, your query field is implicitly multi-valued; I don't know what the behavior will be. But there's another problem: all the filters in your analyzer definition will mess up the correspondence between the Unix uniq and numDocs even if you got by the above. I.e., StopFilter would make the lines "a problem" and "the problem" identical. WordDelimiter would do all kinds of interesting things. LowerCaseFilter would make "Myproblem" and "myproblem" identical. RemoveDuplicatesFilter would make "interesting interesting" and "interesting" identical. You could define a second field, make *that* one unique and NOT analyze it in any way... You could hash your sentences and define the hash as your unique key. You could HTH Erick On Wed, Jan 6, 2010 at 1:06 PM, danben dan...@gmail.com wrote: The problem: Not all of the documents that I expect to be indexed are showing up in the index. 
The background: I start off with an empty index based on a schema with a single field named 'query', marked as unique and using the following analyzer:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

My input is a utf-8 encoded file with one sentence per line. Its total size is about 60MB. I would like each line of the file to correspond to a single document in the solr index. If I print the number of unique lines in the file (using cat <file> | sort | uniq | wc -l), I get a little over 2M. Printing the total number of lines in the file gives me around 2.7M. I use the following to start indexing:

curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/home/gkropitz/querystage2map/file1&stream.contentType=text/plain;charset=utf-8&fieldnames=query&escape=\'

When this command completes, I see numDocs is approximately 470k (which is what I find strange) and maxDocs is approximately 890k (which is fine since I know I have around 700k duplicates). Even more confusing is that if I run this exact command a second time without performing any other operations, numDocs goes up to around 610k, and a third time brings it up to about 750k. Can anyone tell me what might cause Solr not to index everything in my input file the first time, and why it would be able to index new documents the second and third times? 
I also have this line in solrconfig.xml, if it matters: <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" /> Thanks, Dan -- View this message in context: http://old.nabble.com/Strange-Behavior-When-Using-CSVRequestHandler-tp27026926p27026926.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Strange-Behavior-When-Using-CSVRequestHandler-%28Solr-1.4%29-tp27026926p27061086.html Sent from the Solr - User mailing list archive at Nabble.com.
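Erick's suggestion to hash each sentence into the unique key could be sketched as follows (the class and method names are made up, and MD5 is just one reasonable digest choice):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SentenceKey {
    // Derive a stable uniqueKey for a sentence so that exact duplicates
    // collapse in the index without analyzing the key field itself.
    public static String md5Hex(String sentence) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(sentence.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b)); // lowercase hex, two chars per byte
        }
        return sb.toString();
    }
}
```

The hash would go into a non-analyzed string field declared as the uniqueKey, while the raw sentence goes into the analyzed query field; that way the dedup count matches the Unix uniq count regardless of what the analyzer chain does.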
Re: No Analyzer, tokenizer or stemmer works at Solr
Thank you, Ryan. I will have a look at Lucene's material and Luke. I think I got it. :) Sometimes there will be the need to return on the one hand the stored value and on the other hand the indexed version of the value. How can I fulfill such needs? Doing copyField on indexed-only fields? ryantxu wrote: On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean everything is okay, but I do not see it? "Internally for searching the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." So if I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. Indexed fields and stored fields are different. Solr shows stored fields in the results (however, facets are based on indexed fields). Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably Luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. You have control over what is stored and what is indexed -- how that is configured is up to you. ryan -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html Sent from the Solr - User mailing list archive at Nabble.com.
SolrJ and query parameters
Hi there, I'm trying to understand how the query syntax specified on the Solr Wiki (http://wiki.apache.org/solr/SolrQuerySyntax) fits in with the usage of the SolrJ class SolrQuery. There are not too many examples of usage to be found. For example, say I wanted to replicate the following query using SolrQuery: q={!lucene q.op=AND df=text}myfield:foo +bar -baz How would I do it so that q.op was set to OR instead of AND? There is no method I can see on SolrQuery to set q.op, only a query string, which in this case is presumably the text "+bar -baz", as the rest can be specified by calling set methods on SolrQuery. Thanks in advance for any help. Jon
Re: SolrJ and query parameters
--- On Thu, 1/7/10, Jon Poulton jon.poul...@vyre.com wrote: From: Jon Poulton jon.poul...@vyre.com Subject: SolrJ and query parameters To: 'solr-user@lucene.apache.org' solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 7:25 PM Hi there, I'm trying to understand how the query syntax specified on the Solr Wiki (http://wiki.apache.org/solr/SolrQuerySyntax) fits in with the usage of the SolrJ class SolrQuery. There are not too many examples of usage to be found. For example, say I wanted to replicate the following query using SolrQuery: q={!lucene q.op=AND df=text}myfield:foo +bar -baz The whole string is the value of the parameter q: SolrQuery.setQuery("{!lucene q.op=AND df=text}myfield:foo +bar -baz"); How would I do it so that q.op was set to OR instead of AND? There is no method I can see on SolrQuery to set q.op, only a query string, which in this case is presumably the text "+bar -baz", as the rest can be specified by calling set methods on SolrQuery. if you want to set q.op to AND you can use SolrQuery.set(QueryParsing.OP, "AND"); hope this helps.
Sharding and Index Update
All, I have two indices - one has 23M documents and the other has less than 1000. The small index is for real-time updates. Does updating the small index (with commit) hurt the overall performance? (We cannot update the big 23M index in real time because of heavy traffic and its size). Thanks, Jae Joo
RE: SolrJ and query parameters
Thanks for the reply. Using SolrQuery.setQuery("{!lucene q.op=AND df=text}myfield:foo +bar -baz"); would make more sense if it were not for the other methods available on SolrQuery. For example, there is a setFields(String..) method. So what happens if I call setFields("title", "description") after having set the query to the above value? What do I end up with? Something like this: {!lucene q.op=AND df=text}title:(foo +bar -baz) description:(foo +bar -baz) I'm still having trouble understanding how the class is intended to be used. Cheers Jon -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: 07 January 2010 17:38 To: solr-user@lucene.apache.org Subject: Re: SolrJ and query parameters --- On Thu, 1/7/10, Jon Poulton jon.poul...@vyre.com wrote: From: Jon Poulton jon.poul...@vyre.com Subject: SolrJ and query parameters To: 'solr-user@lucene.apache.org' solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 7:25 PM Hi there, I'm trying to understand how the query syntax specified on the Solr Wiki (http://wiki.apache.org/solr/SolrQuerySyntax) fits in with the usage of the SolrJ class SolrQuery. There are not too many examples of usage to be found. For example, say I wanted to replicate the following query using SolrQuery: q={!lucene q.op=AND df=text}myfield:foo +bar -baz The whole string is the value of the parameter q: SolrQuery.setQuery("{!lucene q.op=AND df=text}myfield:foo +bar -baz"); How would I do it so that q.op was set to OR instead of AND? There is no method I can see on SolrQuery to set q.op, only a query string, which in this case is presumably the text "+bar -baz", as the rest can be specified by calling set methods on SolrQuery. if you want to set q.op to AND you can use SolrQuery.set(QueryParsing.OP, "AND"); hope this helps.
RE: SolrJ and query parameters
I've also just noticed that QueryParsing is not in the SolrJ API; it's in one of the other Solr jar dependencies. I'm beginning to think that maybe the best approach is to write a query string generator which can generate strings of the form: q={!lucene q.op=AND df=text}myfield:foo +bar -baz Then just set this on a SolrQuery instance and send it over the wire. It's not the kind of string you'd want an end user to have to type out. -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: 07 January 2010 17:38 To: solr-user@lucene.apache.org Subject: Re: SolrJ and query parameters --- On Thu, 1/7/10, Jon Poulton jon.poul...@vyre.com wrote: From: Jon Poulton jon.poul...@vyre.com Subject: SolrJ and query parameters To: 'solr-user@lucene.apache.org' solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 7:25 PM Hi there, I'm trying to understand how the query syntax specified on the Solr Wiki (http://wiki.apache.org/solr/SolrQuerySyntax) fits in with the usage of the SolrJ class SolrQuery. There are not too many examples of usage to be found. For example, say I wanted to replicate the following query using SolrQuery: q={!lucene q.op=AND df=text}myfield:foo +bar -baz The whole string is the value of the parameter q: SolrQuery.setQuery("{!lucene q.op=AND df=text}myfield:foo +bar -baz"); How would I do it so that q.op was set to OR instead of AND? There is no method I can see on SolrQuery to set q.op, only a query string, which in this case is presumably the text "+bar -baz", as the rest can be specified by calling set methods on SolrQuery. if you want to set q.op to AND you can use SolrQuery.set(QueryParsing.OP, "AND"); hope this helps.
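Jon's query-string-generator idea might look something like this (a hedged sketch; the class is hypothetical and not part of SolrJ, and it does no escaping of parameter values):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds "{!parser k1=v1 k2=v2}userQuery" strings for the q parameter.
public class LocalParamsQuery {
    private final String parser;
    // LinkedHashMap keeps the params in insertion order for stable output.
    private final Map<String, String> params = new LinkedHashMap<>();

    public LocalParamsQuery(String parser) {
        this.parser = parser;
    }

    public LocalParamsQuery set(String key, String value) {
        params.put(key, value);
        return this; // fluent style, so calls can be chained
    }

    public String render(String userQuery) {
        StringBuilder sb = new StringBuilder("{!").append(parser);
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(' ').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.append('}').append(userQuery).toString();
    }
}
```

It could then be used as, e.g., solrQuery.setQuery(new LocalParamsQuery("lucene").set("q.op", "OR").set("df", "text").render("myfield:foo +bar -baz")) to answer the original q.op=OR question.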
RE: SolrJ and query parameters
Using SolrQuery.setQuery("{!lucene q.op=AND df=text}myfield:foo +bar -baz"); would make more sense if it were not for the other methods available on SolrQuery. For example, there is a setFields(String..) method. So what happens if I call setFields("title", "description") after having set the query to the above value? What do I end up with? Something like this: {!lucene q.op=AND df=text}title:(foo +bar -baz) description:(foo +bar -baz) No. setFields is equivalent to fl=title,description. It determines which fields will be returned in the result. I'm still having trouble understanding how the class is intended to be used. SolrQuery extends ModifiableSolrParams. If you look at its source code you can understand. For example, the setQuery method invokes this.set(CommonParams.Q, query); You can set anything in the search URL with this class. key=value is equal to SolrQuery.set(key, value). There are some multivalued keys like fq and facet.field; in those cases you can use the add() method.
Re: No Analyzer, tokenizer or stemmer works at Solr
On Jan 7, 2010, at 12:11 PM, MitchK wrote: Thank you, Ryan. I will have a look at Lucene's material and Luke. I think I got it. :) Sometimes there will be the need to return on the one hand the stored value and on the other hand the indexed version of the value. How can I fulfill such needs? Doing copyField on indexed-only fields? see erik's response on 'analysis request handler' ryantxu wrote: On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean everything is okay, but I do not see it? "Internally for searching the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." So if I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. Indexed fields and stored fields are different. Solr shows stored fields in the results (however, facets are based on indexed fields). Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably Luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. You have control over what is stored and what is indexed -- how that is configured is up to you. ryan -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: No Analyzer, tokenizer or stemmer works at Solr
What is your use case for responding sometimes with the indexed value? Other than reconstructing a field that hasn't been stored, I can't think of one. I still think you're missing the point. Indexing and storing are orthogonal operations that have (almost) nothing to do with each other, for all that they happen at the same time on the same field. You never search against the stored data in a field. You *always* search against the indexed data. Contrariwise, you never display the indexed form to the user; you *always* show the stored data (unless you come up with a really interesting use case). Step back and consider what happens when you index data: it gets broken up all kinds of ways. Stop words are removed, case may change, etc. It makes no sense to then display this data to a user. Would you really like to have, say, a movie title "The Good, The Bad, and The Ugly"? Remove stopwords and punctuation, lowercase, and you index three tokens: good, bad, ugly. Even if you reconstruct this field, the user would see "good bad ugly". Bad, very bad. Yet I want to display the original title to the user in response to searching on "ugly", so I need the original, unanalyzed data. Perhaps it would help to think of it this way. 1) take some data and index it in f1 but do NOT store it in f1. Store it in f2 but do NOT index it in f2. 2) take that same data, index AND store it in f3. 1) is almost entirely equivalent to 2) in terms of index resources. Practically though, 1) is harder to use, because you have to remember to use f1 for searching and f2 for getting the raw data. HTH Erick On Thu, Jan 7, 2010 at 12:11 PM, MitchK mitc...@web.de wrote: Thank you, Ryan. I will have a look at Lucene's material and Luke. I think I got it. :) Sometimes there will be the need to return on the one hand the stored value and on the other hand the indexed version of the value. How can I fulfill such needs? Doing copyField on indexed-only fields? 
ryantxu wrote: On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean everything is okay, but I do not see it? "Internally for searching the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." So if I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. Indexed fields and stored fields are different. Solr shows stored fields in the results (however, facets are based on indexed fields). Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably Luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. You have control over what is stored and what is indexed -- how that is configured is up to you. ryan -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html Sent from the Solr - User mailing list archive at Nabble.com.
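Erick's two options, written out as schema.xml entries (a sketch using his field names; the types and the copyField direction are assumptions):

```xml
<!-- option 1): index in f1 only, store in f2 only -->
<field name="f1" type="text"   indexed="true"  stored="false"/>
<field name="f2" type="string" indexed="false" stored="true"/>
<copyField source="f2" dest="f1"/>

<!-- option 2): one field that is both indexed and stored -->
<field name="f3" type="text" indexed="true" stored="true"/>
```

Option 2) is the usual choice: clients search and retrieve the same field name, and the on-disk cost is essentially the same as the two-field split.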
Re: SolrJ and query parameters
On Jan 7, 2010, at 1:05 PM, Jon Poulton wrote: I've also just noticed that QueryParsing is not in the SolrJ API; it's in one of the other Solr jar dependencies. I'm beginning to think that maybe the best approach is to write a query string generator which can generate strings of the form: q={!lucene q.op=AND df=text}myfield:foo +bar -baz Then just set this on a SolrQuery instance and send it over the wire. It's not the kind of string you'd want an end user to have to type out. Yes, if you need to manipulate the local params, that seems like a good approach. Solrj was written before the local params syntax was introduced. A patch that adds LocalParams support to solrj would be welcome :) ryan
Re: No Analyzer, tokenizer or stemmer works at Solr
The difference between stored and indexed is clear now. You are right, if you are responding only to normal users. Use case: You have a stored field "The good, the bad and the ugly". And you have a really fantastic analyzer, which does some magic to this movie title. Let's say the analyzer translates the title into MD5 or into another abstract expression. Instead of doing the same magical function on the client's side again and again, the client only needs to take the prepared data from your response. Another use case could be: Imagine you have got two categories, cheap and expensive, and your document has a title, a label, an owner and a price field. Imagine you would analyze, index and store them like you normally do, and afterwards you want to set whether the document belongs to the expensive item group or not. If the price for the item is higher than $500, it belongs to the expensive ones; otherwise not. I think this would be a job for a special analyzer - and this only makes sense if I also store the analyzed data. I think information retrieval is a really interesting use case. Erick Erickson wrote: What is your use case for responding sometimes with the indexed value? Other than reconstructing a field that hasn't been stored, I can't think of one. I still think you're missing the point. Indexing and storing are orthogonal operations that have (almost) nothing to do with each other, for all that they happen at the same time on the same field. You never search against the stored data in a field. You *always* search against the indexed data. Contrariwise, you never display the indexed form to the user; you *always* show the stored data (unless you come up with a really interesting use case). Step back and consider what happens when you index data: it gets broken up all kinds of ways. Stop words are removed, case may change, etc. It makes no sense to then display this data to a user. 
Would you really like to have, say, a movie title "The Good, The Bad, and The Ugly"? Remove stopwords and punctuation, lowercase, and you index three tokens: good, bad, ugly. Even if you reconstruct this field, the user would see "good bad ugly". Bad, very bad. Yet I want to display the original title to the user in response to searching on "ugly", so I need the original, unanalyzed data. Perhaps it would help to think of it this way. 1) take some data and index it in f1 but do NOT store it in f1. Store it in f2 but do NOT index it in f2. 2) take that same data, index AND store it in f3. 1) is almost entirely equivalent to 2) in terms of index resources. Practically though, 1) is harder to use, because you have to remember to use f1 for searching and f2 for getting the raw data. HTH Erick On Thu, Jan 7, 2010 at 12:11 PM, MitchK mitc...@web.de wrote: Thank you, Ryan. I will have a look at Lucene's material and Luke. I think I got it. :) Sometimes there will be the need to return on the one hand the stored value and on the other hand the indexed version of the value. How can I fulfill such needs? Doing copyField on indexed-only fields? ryantxu wrote: On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean everything is okay, but I do not see it? "Internally for searching the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." So if I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. Indexed fields and stored fields are different. Solr shows stored fields in the results (however, facets are based on indexed fields). Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably Luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. 
You have control over what is stored and what is indexed -- how that is configured is up to you. ryan -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27065305.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Basic sentence parsing with the regex highlighter fragmenter
On Wed, Jan 6, 2010 at 4:30 PM, Erick Erickson erickerick...@gmail.com wrote: "Hmmm, I'll have to defer to the highlighter experts here" I've looked at the source code for the highlighter, and I think I know what's going on. I haven't had time to play with this yet, so I could be wrong, but this is my impression. The highlighter builds a highlighted fragment by reading tokens in and appending their contents to a string buffer. Now, every time a token is appended to a fragment, it adds the whitespace between the previous token and the current token (this isn't strictly whitespace, but really anything that was removed from the source text by the tokenizer, like punctuation etc.). I believe what is happening in my case is that the leading "." is the whitespace between the last token (of the previous fragment) and the first token of the current fragment. And, of course, the trailing punctuation is being cut off because the fragment builder doesn't APPEND whitespace after the last token, it just prepends this whitespace. You can see the code that does this, from Highlighter#getBestTextFragments (line 233 in Lucene 3.0.0), here: http://gist.github.com/271515 If I do what I said in my second email (add preserveOriginal=1 to the WordDelimiterFilter), things work because the ending punctuation is stored with the token, and just the real whitespace is prepended by this code. I'm not sure what the solution is, but currently I'm just trimming the leading punctuation plus a space off on the client side, and leaving the sentence terminator-less. -- Caleb Land
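The preserveOriginal workaround Caleb mentions is a one-attribute change to the filter definition; a sketch, assuming the surrounding analyzer chain and the other attribute values shown here:

```xml
<!-- keep the original token (with its punctuation) alongside the split parts -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1"
        splitOnCaseChange="1" preserveOriginal="1"/>
```

With preserveOriginal="1" the unmodified token is emitted in addition to the delimiter-split tokens, so trailing punctuation survives into the highlighted fragment, at the cost of a somewhat larger index.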
Corrupted Index
Hi all, Our application uses solrj to communicate with our solr servers. We started a fresh index yesterday after upping the maxFieldLength setting in solrconfig. Our task indexes content in batches and all appeared to be well until noonish today, when after 40k docs I started seeing errors. I've placed three stack traces below: the first occurred once and was the initial error; the second occurred a few times before the third started occurring on each request. I'd really appreciate any insight into what could have caused this, a missing file and then a corrupt index. If you know we'll have to nuke the entire index and start over, I'd like to know that too; oddly enough, searches against the index appear to be working. Thanks! Jake #1 January 7, 2010 12:10:06 PM CST Caught error; TaskWrapper block 1 January 7, 2010 12:10:07 PM CST solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update January 7, 2010 12:10:07 PM CST solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update org.benetech.exception.WrappedException org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) org.apache.solr.client.solrj.request.AbstractUpdateRequest#process(105) org.apache.solr.client.solrj.SolrServer#commit(86) org.apache.solr.client.solrj.SolrServer#commit(75) org.bookshare.search.solr.SolrSearchServerWrapper#add(63) org.bookshare.search.solr.SolrSearchEngine#index(232) 
org.bookshare.service.task.SearchEngineIndexingTask#initialInstanceLoad(95) org.bookshare.service.task.SearchEngineIndexingTask#run(53) org.bookshare.service.scheduler.TaskWrapper#run(233) java.util.TimerThread#mainLoop(512) java.util.TimerThread#run(462) Caused by: solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update org.apache.solr.common.SolrException org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) org.apache.solr.client.solrj.request.AbstractUpdateRequest#process(105) org.apache.solr.client.solrj.SolrServer#commit(86) org.apache.solr.client.solrj.SolrServer#commit(75) org.bookshare.search.solr.SolrSearchServerWrapper#add(63) org.bookshare.search.solr.SolrSearchEngine#index(232) org.bookshare.service.task.SearchEngineIndexingTask#initialInstanceLoad(95) org.bookshare.service.task.SearchEngineIndexingTask#run(53) org.bookshare.service.scheduler.TaskWrapper#run(233) java.util.TimerThread#mainLoop(512) java.util.TimerThread#run(462) #2 January 7, 2010 12:10:10 PM CST Caught error; TaskWrapper block 1 January 7, 2010 12:10:10 PM CST org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 request: /core0/update org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 request: /core0/update January 7, 2010 12:10:10 PM CST org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 request: /core0/update org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 request: /core0/update org.benetech.exception.WrappedException org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) org.apache.solr.client.solrj.request.AbstractUpdateRequest#process(105)
Re: Corrupted Index
what version of solr are you running?

On Jan 7, 2010, at 3:08 PM, Jake Brownell wrote:
[quoted message trimmed; see the original "Corrupted Index" post above]
RE: Corrupted Index
Yes, that would be helpful to include, sorry: the official 1.4.

-----Original Message-----
From: Ryan McKinley [mailto:ryan...@gmail.com]
Sent: Thursday, January 07, 2010 2:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Corrupted Index

what version of solr are you running?

On Jan 7, 2010, at 3:08 PM, Jake Brownell wrote:
[quoted message trimmed; see the original "Corrupted Index" post above]
RE: How to set User.dir or CWD for Solr during Tomcat startup
That's just setting the solr/home environment, not the user.dir variable; I already have that set. But when I go to the solr/admin page, the top shows Solr Admin (schemaname), the hostname, and cwd=/root, SolrHome=/opt/solr. How do I get cwd to be /opt/solr instead of /root? robbin

-----Original Message-----
From: Olivier Dobberkau [mailto:olivier.dobber...@dkd.de]
Sent: Thursday, January 07, 2010 3:28 AM
To: solr-user@lucene.apache.org
Subject: Re: How to set User.dir or CWD for Solr during Tomcat startup
[quoted message trimmed; see Olivier's reply above]
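[For what it's worth, the cwd shown on the admin page is just the working directory of the Tomcat JVM, i.e. wherever you were when Tomcat was started (here, /root). One way to control it is to cd before launching Tomcat, e.g. in a wrapper or init script. A sketch; all paths below are assumptions, adjust to your install:

```shell
#!/bin/sh
# Start Tomcat from a fixed directory so Solr's admin page reports
# cwd=/opt/solr-tomcat/solr rather than the operator's home directory.
CATALINA_HOME=/opt/solr-tomcat/apache-tomcat-6.0.20
cd /opt/solr-tomcat/solr    # this becomes user.dir for the JVM
exec "$CATALINA_HOME/bin/startup.sh"
```

Setting -Duser.dir in JAVA_OPTS is generally unreliable, since the JVM's actual working directory is fixed at process start.]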
Re: Corrupted Index
If you need to fix the index and maybe lose some data (in bad segments), check Lucene's CheckIndex (cmd-line app) Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message ----
From: Jake Brownell ja...@benetech.org
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Thu, January 7, 2010 3:08:55 PM
Subject: Corrupted Index
[quoted message trimmed; see the original "Corrupted Index" post above]
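[The CheckIndex invocation looks roughly like this; the jar name and index path are assumptions for a Solr 1.4 / Lucene 2.9 install. Note that -fix drops unreadable segments and their documents, so stop Solr and back up the index directory first:

```shell
# Inspect the index (read-only):
java -cp lucene-core-2.9.1.jar org.apache.lucene.index.CheckIndex \
     solr-home/core0/data/index

# Repair by removing bad segments (LOSES the documents in those segments):
java -cp lucene-core-2.9.1.jar org.apache.lucene.index.CheckIndex \
     solr-home/core0/data/index -fix
```
]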
Re: Sharding and Index Update
Won't hurt the performance - that *is* why people use the BIG+small core trick. :) Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Jae Joo jae...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, January 7, 2010 12:40:16 PM Subject: Sharding and Index Update All, I have two indices - one has 23M documents and the other has fewer than 1000. The small index is for real-time updates. Does updating the small index (with a commit) hurt overall performance? (We cannot update the big 23M index in real time because of heavy traffic and its size.) Thanks, Jae Joo
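[With the BIG+small core setup, the two cores are usually searched as one logical index via Solr's distributed-search shards parameter. A sketch; host and core names are assumptions:

```shell
# Query the big (rarely committed) and small (frequently committed)
# cores together as a single logical index:
curl 'http://localhost:8983/solr/big/select?q=foo&shards=localhost:8983/solr/big,localhost:8983/solr/small'
```
]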
Re: ontology support
Claudio, Check out Solr synonym support: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Claudio Martella claudio.marte...@tis.bz.it To: solr-user@lucene.apache.org Sent: Thu, January 7, 2010 11:17:54 AM Subject: ontology support hello, I'm trying to use an ontology (homegrown :) ) to support search, i.e. I'd like my search engine to return results for "barack obama" even if I search for "president". I see there's some support in the Nutch API (org.apache.nutch.ontology), so (if it does what I'm looking for) I'm wondering whether something like that comes with Solr too. any ideas? Claudio -- Claudio Martella Digital Technologies Unit Research Development - Analyst TIS innovation park Via Siemens 19 | Siemensstr. 19 39100 Bolzano | 39100 Bozen Tel. +39 0471 068 123 Fax +39 0471 068 129 claudio.marte...@tis.bz.it http://www.tis.bz.it
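[A minimal sketch of wiring the synonym filter into schema.xml, assuming a synonyms.txt in the core's conf directory; the field type name and entries are illustrative:

```xml
<!-- Expand query-side synonyms so a search for "president" also
     matches documents containing "barack obama". -->
<fieldType name="text_syn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A synonyms.txt line would then read `president, barack obama`. Note that multi-word synonyms like this are generally safer applied at index time than at query time.]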
Re: High Availability
Your setup with the master behind a LB VIP looks right. I don't think replication in Solr was meant to be bidirectional. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Matthew Inger mattin...@yahoo.com To: solr-user@lucene.apache.org; r...@intelcompute.com Sent: Thu, January 7, 2010 10:45:20 AM Subject: Re: High Availability I've tried having two servers set up to replicate each other, and it is not a pretty thing. It seems that SOLR doesn't really check whether the version # on the master is newer than the version # on the slave before deciding to replicate; it only checks whether it's different. As a result, what ends up happening is this: 1. Both servers at the same revision, say revision 100 2. Update Master 1 to revision 101 3. Master 2 starts a pull of revision 101 4. Master 1 sees Master 2 has a different revision and starts a pull of revision 100 See where it's going? Eventually, both servers seem to end up back at revision 100 and my updates get lost. My sequencing might be a little out of whack here, but nonetheless, having two servers set up as slaves to each other does not work properly. I would think, though, that a small code change to check whether the revision # has increased before pulling the file would solve the issue. In the meantime, my plan is to: 1. Set up two index update servers as masters behind an F5 load balancer with a VIP in an active/passive configuration. 2. Set up N search servers as slaves behind an F5 load balancer with a VIP in a round-robin configuration. Replication would be from the masters' VIP, instead of any one particular master. 3. Index update servers would have a handler that would do delta updates every so often to keep both servers in sync with the database (I'm only indexing a complex database here, which doesn't lend itself well to SQL querying on the fly).
Ideally, I'd love to be able to force the master servers to update if either one of them switches from passive to active state, but I'm not sure how to accomplish that. mattin...@yahoo.com Once you start down the dark path, forever will it dominate your destiny. Consume you it will - Yoda - Original Message From: r...@intelcompute.com To: solr-user@lucene.apache.org Sent: Mon, January 4, 2010 11:37:22 AM Subject: Re: High Availability Even when Master 1 is alive again, it shouldn't get the floating IP until Master 2 actually fails. So you'd ideally want them replicating to each other, but since only one will be updated/live at a time, it shouldn't cause an issue with clobbering data (?). Just a suggestion though, not done it myself on Solr, only with DB servers. On Mon 04/01/10 16:28 , Matthew Inger wrote: So, when the masters switch back, that means we have to force a full delta update, correct? Once you start down the dark path, forever will it dominate your destiny. Consume you it will - Yoda - Original Message From: To: Sent: Mon, January 4, 2010 11:17:40 AM Subject: Re: High Availability Have you looked into a basic floating IP setup? Have the master also replicate to another hot-spare master. Any downtime during an outage of the 'live' master would be minimal as the hot-spare takes up the floating IP. On Mon 04/01/10 16:13 , Matthew Inger wrote: I'm kind of stuck and looking for suggestions for high availability options. I've figured out without much trouble how to get the master-slave replication working. This eliminates any single points of failure in the application in terms of the application's searching capability. I would set up a master which would create the index and several slaves to act as the search servers, and put them behind a load balancer to distribute the requests. This would ensure that if a slave node goes down, requests would continue to get serviced by the other nodes that are still up.
The problem I have is that my particular application also has the capability to trigger index updates from the user interface. This means that the master now becomes a single point of failure for the user interface. The basic idea of the app is that there are multiple oracle instances contributing to a single document. The volume and organization of the data (database links, normalization, etc...) prevents any sort of fast querying via SQL to do querying of the documents. The solution is to build a lucene index (via solr), and use that for searching. When updates are made in the UI, we will also send the updates directly to the solr server as well (we don't want to wait some arbitrary interval for a delta query to run). So you can see the problem here is that if the master is down, the sending of the updates to
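[For the slaves-replicate-from-a-VIP part of the plan, Solr 1.4's ReplicationHandler takes a masterUrl, which can point at the load balancer rather than a concrete host. A sketch; host names and the poll interval are assumptions:

```xml
<!-- solrconfig.xml on each slave: poll the masters' VIP, not a specific
     master host, so a master failover is transparent to the slaves. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-vip.example.com:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```
]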
Re: No Analyzer, tokenizer or stemmer works at Solr
Well, I'd approach either of these use cases by simply performing my computations on the input and storing the result in another (non-indexed, unless I wanted to search it) field. This wouldn't happen in the Analyzer, but in the code that populates the document fields. Which is a much cleaner solution IMO than creating some sort of "index this but store that" capability. The purpose of analysis is to produce *searchable* tokens, after all. But we're getting into angels dancing on pins here. Do you actually have a use case you're trying to implement, or is this mostly theoretical? Erick On Thu, Jan 7, 2010 at 2:08 PM, MitchK mitc...@web.de wrote: The difference between stored and indexed is clear now. You are right, if you are responding only to normal users. Use case: You've got a stored field "The good, the bad and the ugly". And you've got a really fantastic analyzer, which does some magic to this movie title. Let's say the analyzer translates the title into md5 or some other abstract expression. Instead of applying the same magic function on the client's side again and again, the client only needs to take the prepared data from your response. Another use case could be: Imagine you have two categories, cheap and expensive, and your document has a title-, a label-, an owner- and a price-field. Imagine you analyze, index and store them like you normally do, and afterwards you want to set whether the document belongs to the expensive item group or not. If the price for the item is higher than $500, it belongs to the expensive ones; otherwise not. I think this would be a job for a special analyzer - and it only makes sense if I also store the analyzed data. I think information retrieval is a really interesting use case. Erick Erickson wrote: What is your use case for responding sometimes with the indexed value? Other than reconstructing a field that hasn't been stored, I can't think of one. I still think you're missing the point.
Indexing and storing are orthogonal operations that have (almost) nothing to do with each other, for all that they happen at the same time on the same field. You never search against the stored data in a field. You *always* search against the indexed data. Contrariwise, you never display the indexed form to the user; you *always* show the stored data (unless you come up with a really interesting use case). Step back and consider what happens when you index data: it gets broken up all kinds of ways. Stop words are removed, case may change, etc, etc, etc. It makes no sense to then display this data for a user. Would you really like to have, say, a movie title "The Good, The Bad, and The Ugly"? Remove stopwords and punctuation, lowercase, and you index three tokens: good, bad, ugly. Even if you reconstruct this field, the user would see "good bad ugly". Bad, very bad. Yet I want to display the original title to the user in response to searching on "ugly", so I need the original, unanalyzed data. Perhaps it would help to think of it this way: (1) take some data and index it in f1 but do NOT store it in f1; store it in f2 but do NOT index it in f2. (2) take that same data and index AND store it in f3. (1) is almost entirely equivalent to (2) in terms of index resources. Practically, though, (1) is harder to use, because you have to remember to use f1 for searching and f2 for getting the raw data. HTH Erick On Thu, Jan 7, 2010 at 12:11 PM, MitchK mitc...@web.de wrote: Thank you, Ryan. I will have a look at Lucene's material and Luke. I think I got it. :) Sometimes there will be a need to return both the stored value and the indexed version of the value. How can I fulfill such needs? Doing copyField on indexed-only fields? ryantxu wrote: On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean, everything is okay, but I do not see it? Internally, for searching, the analysis takes place and writes to the index in an inverted fashion, but the stored data is left alone.
If I use an analyzer, Solr stores its output in two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. indexed fields and stored fields are different. Solr results show stored fields in the results (however facets are based on indexed fields) Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. You have control over what is stored and what is indexed -- how that is configured is up to you. ryan
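[Erick's (1)-vs-(2) comparison corresponds to schema.xml field definitions like these; a sketch with hypothetical field names:

```xml
<!-- (1) split the data: f1 is searchable only, f2 is retrievable only -->
<field name="f1" type="text"   indexed="true"  stored="false"/>
<field name="f2" type="string" indexed="false" stored="true"/>

<!-- (2) one field that is both searched and returned -->
<field name="f3" type="text" indexed="true" stored="true"/>

<!-- Mitch's copyField idea: keep the raw value in 'title' and an
     analyzed, search-only copy in 'title_analyzed' -->
<copyField source="title" dest="title_analyzed"/>
```
]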
Re: Basic sentence parsing with the regex highlighter fragmenter
Regular expressions won't work well for sentence boundary detection. If you want something free, you could plug in OpenNLP or GATE. Or LingPipe, but that's not free. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Caleb Land caleb.l...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, January 5, 2010 2:05:18 PM Subject: Basic sentence parsing with the regex highlighter fragmenter Hello, I'm using Solr 1.4, and I'm trying to get the regex fragmenter to parse basic sentences, and I'm running into a problem. I'm using the default regex specified in the example solr configuration: [-\w ,/\n\']{20,200} But I am using a larger fragment size (140) with a slop of 1.0. Given the passage: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue vitae, molestie quis nunc. When I search for Nulla (the first word of the second sentence) and grab the first highlighted snippet, this is what I get: . Nulla a neque a ipsum accumsan iaculis at id lacus As you can see, there's a leading period from the previous sentence and the period from the current sentence is missing. I understand this regex isn't that advanced, but I've tried everything I can think of, regex-wise, to get this to work, and I always end up with this problem. For example, I've tried: \w[^.!?]{0,200}[.!?] Which seems like it should include the ending punctuation, but it doesn't, so I think I'm missing something. Does anybody know a regex that works? -- Caleb Land
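[For what it's worth, applying Caleb's second regex directly in Java does capture the sentence terminator, which suggests the trimming he saw happens inside the regex fragmenter rather than in the regex itself. A small standalone demonstration, not Solr's actual fragmenter code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFragmentDemo {
    // Collect every non-overlapping match of the regex in the text,
    // roughly what a regex-based fragmenter would treat as fragments.
    public static List<String> fragments(String text, String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
                    + "Nulla a neque a ipsum accumsan iaculis at id lacus.";
        // Caleb's regex: a word char, up to 200 non-terminators, then a terminator.
        for (String f : fragments(text, "\\w[^.!?]{0,200}[.!?]")) {
            System.out.println("[" + f + "]");
        }
    }
}
```

Run directly, each printed fragment ends with its own period and carries no leading punctuation, unlike the snippets Solr's fragmenter produced.]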
Re: Removing facets which frequency match the result count
Hi, Either I don't understand this or this doesn't make much sense. Are you saying you want to show only facet values whose counts == # of hits? If so, what would be the value of showing facets -- they wouldn't be narrowing down the result set. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: joeMcElroy pho...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, January 5, 2010 5:25:18 AM Subject: Removing facets which frequency match the result count Is there any way to specify to solr only to bring back facet filter options where the frequency is less than the total results found? I found facets which match the result count are not helpful to the user, and produce noise within the UI to filter results. I can obviously do this within the view but would be better if solr dealt with this logic. Cheers! Joe -- View this message in context: http://old.nabble.com/Removing-facets-which-frequency-match-the-result-count-tp27026359p27026359.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Any way to modify result ranking using an integer field?
Not sure if this was answered. Yes, you can set the default params/values for a request handler in solrconfig.xml. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Andy angelf...@yahoo.com To: solr-user@lucene.apache.org Sent: Mon, January 4, 2010 4:56:14 PM Subject: Re: Any way to modify result ranking using an integer field? Thank you Ahmet. Is there any way I can configure Solr to always use {!boost b=log(popularity)} as the default for all queries? I'm using Solr through django-haystack, so all the Solr queries are actually generated by haystack. It'd be much cleaner if I could configure Solr to always use BoostQParserPlugin for all queries instead of manually modifying every single query generated by haystack. --- On Mon, 1/4/10, Ahmet Arslan wrote: From: Ahmet Arslan Subject: Re: Any way to modify result ranking using an integer field? To: solr-user@lucene.apache.org Date: Monday, January 4, 2010, 2:33 PM Thanks Ahmet. Do I need to do anything to enable BoostQParserPlugin in Solr, or is it already enabled? I just confirmed that it is already enabled. You can see the effect of it by appending debugQuery=on to your search url.
Re: SOLR Performance Tuning: Pagination
Peter - Aaron just commented on a recent Solr issue (reading large result sets) and mentioned his patch. So far he has 2 x +1 from Grant and me to stick his patch in JIRA. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Peter Wolanin peter.wola...@acquia.com To: solr-user@lucene.apache.org Sent: Sun, January 3, 2010 3:37:01 PM Subject: Re: SOLR Performance Tuning: Pagination At the NOVA Apache Lucene/Solr Meetup last May, one of the speakers from Near Infinity (Aaron McCurry I think) mentioned that he had a patch for lucene that enabled unlimited depth memory-efficient paging. Is anyone in contact with him? -Peter On Thu, Dec 24, 2009 at 11:27 AM, Grant Ingersoll wrote: On Dec 24, 2009, at 11:09 AM, Fuad Efendi wrote: I used pagination for a while till found this... I have filtered query ID:[* TO *] returning 20 millions results (no faceting), and pagination always seemed to be fast. However, fast only with low values for start=12345. Queries like start=28838540 take 40-60 seconds, and even cause OutOfMemoryException. Yeah, deep pagination in Lucene/Solr can be problematic due to the Priority Queue management. See http://issues.apache.org/jira/browse/LUCENE-2127 and the linked discussion on java-dev. I use highlight, faceting on nontokenized Country field, standard handler. It even seems to be a bug... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
NPE when sharding multiple layers
Hi all, I've got an index split across 28 cores -- 4 cores on each of 7 boxes (multiple cores per box in order to use more of its CPUs.) When I configure a toplevel core to fan out to all 28 index cores, it works, but is slower than I'd have expected: Toplevel core == all 28 index cores In case it is the aggregation of 28 shards that is slow, I wanted to try 2 layers of sharding. I changed the toplevel core to shard to 1 midlevel core per box, which in turn shards to the 4 index cores on localhost: Toplevel core == 7 midlevel cores, 1 per box == 4 localhost index cores If I search for *:*, this works. If I search for an actual field:value, the midlevel cores throw an NPE. I am configuring toplevel and midlevel cores' shards= parameters via default values in their solrconfigs, so my request URL just looks like host/solr/toplevel/select/q=field:value. Is this a known bug, or am I just doing something wrong? Thanks in advance! - Michael PS: The NPE, which is thrown by the midlevel cores: Jan 7, 2010 4:01:02 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardDoc.java:210) at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:134) at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:255) at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:114) at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:156) at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:141) at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:445) at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:298) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) 
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454) at java.lang.Thread.run(Thread.java:619)
Re: NPE when sharding multiple layers
On Thu, Jan 7, 2010 at 4:17 PM, Michael solrco...@gmail.com wrote: I wanted to try 2 layers of sharding. Distrib search was written with multi-level in mind, but it's not supported yet. -Yonik http://www.lucidimagination.com
Re: NPE when sharding multiple layers
Thanks, Yonik. Does "not supported" mean "we can't guarantee whether it will work or not", or "you may be able to figure it out on your own"? Apparently I am able to get *some* queries through, just not those that pass through the fieldtype that I really need (a complex analyzer). When I search for foo:value where foo is a field whose analyzer uses StandardTokenizer, LowerCaseFilter, WordDelimiterFilter, and TrimFilter, I *don't* get an NPE. Thanks, Michael On Thu, Jan 7, 2010 at 4:25 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Jan 7, 2010 at 4:17 PM, Michael solrco...@gmail.com wrote: I wanted to try 2 layers of sharding. Distrib search was written with multi-level in mind, but it's not supported yet. -Yonik http://www.lucidimagination.com
Re: Using IDF to find Collocations and SIPs...?
Christopher, It's not Lucene or Solr, but have a look at http://www.sematext.com/products/key-phrase-extractor/index.html There is an unofficial demo for it (uses Reuters news feeds with 2 1-week long windows for SIPs): http://www.sematext.com/demo/kpe/i.html (it looks like the CollateFilter option on the left is kaput, so ignore it -- though that filter is actually quite useful and without it you may see some phrase overlap) Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Subscriptions sub.scripti...@metaheuristica.com To: solr-user@lucene.apache.org Sent: Sun, December 27, 2009 9:43:56 PM Subject: Using IDF to find Collocations and SIPs...? I am trying to write a query analyzer to pull: 1. Common phrases (also known as Collocations) within a query 2. Highly unusual phrases (also known as Statistically Improbable Phrases, or SIPs) within a query The Collocations would be similar to facets, except I am also trying to get multi-word phrases as well as single terms. So I suppose I could write something that does a chained query off the facet query, looking for words in proximity. Conceptually (as I understand it) this should just be a question of using the IDF (inverse document frequency, i.e. the measure of how often the term appears across the index). * Has anyone tried to write an analyzer that looks for the words that typically occur within a given proximity of another word? The highly unusual phrases, on the other hand, require getting a handle on the IDF, which at present only appears to be available via the explain function of debugging. * Has anyone written something to go directly after the IDF score only? * If I do have to go down the path of writing this from scratch, is the org.apache.lucene.search.Similarity class the one to leverage? Most grateful for any feedback or insights, Christopher
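For anyone who wants to poke at the IDF idea directly instead of scraping it out of debugQuery explains, the quantity itself is simple to compute from document frequencies. A toy Python sketch (the formula below mirrors Lucene's classic DefaultSimilarity idf, 1 + log(numDocs/(docFreq+1)); treat that as an assumption and check your actual Similarity implementation):

```python
import math
from collections import Counter

def idf(num_docs: int, doc_freq: int) -> float:
    # Classic Lucene-style IDF (assumed formulation; variants exist):
    # rarer terms score higher -- the raw material for spotting SIPs.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

docs = [
    "solr is a search server",
    "lucene is a search library",
    "a rare improbable phrase appears here",
]

# document frequency: number of docs each term occurs in
df = Counter(term for doc in docs for term in set(doc.split()))

scores = {term: idf(len(docs), df[term]) for term in df}
print(sorted(scores, key=scores.get, reverse=True)[:3])
```

Collecting these scores per term (and per n-gram, for phrases) is the kind of statistic the Similarity class exposes at scoring time.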
Re: Any way to modify result ranking using an integer field?
Right. But my understanding is that the handler default setting in solrconfig doesn't take the parameter {!boost}, it only takes the parameter bf, which adds the function query instead of multiplying it. Seems like the only way to have a default for the {!boost} parameter is to use edismax, which won't be available till 1.5. --- On Thu, 1/7/10, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: From: Otis Gospodnetic otis_gospodne...@yahoo.com Subject: Re: Any way to modify result ranking using an integer field? To: solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 4:07 PM Not sure if this was answered. Yes, you can set the default params/values for a request handler in solrconfig.xml. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Andy angelf...@yahoo.com To: solr-user@lucene.apache.org Sent: Mon, January 4, 2010 4:56:14 PM Subject: Re: Any way to modify result ranking using an integer field? Thank you Ahmet. Is there any way I can configure Solr to always use {!boost b=log(popularity)} as the default for all queries? I'm using Solr through django-haystack, so all the Solr queries are actually generated by haystack. It'd be much cleaner if I could configure Solr to always use BoostQParserPlugin for all queries instead of manually modifying every single query generated by haystack. --- On Mon, 1/4/10, Ahmet Arslan wrote: From: Ahmet Arslan Subject: Re: Any way to modify result ranking using an integer field? To: solr-user@lucene.apache.org Date: Monday, January 4, 2010, 2:33 PM Thanks Ahmet. Do I need to do anything to enable BoostQParserPlugin in Solr, or is it already enabled? I just confirmed that it is already enabled. You can see the effect of it by appending debugQuery=on to your search url.
Re: NPE when sharding multiple layers
On Thu, Jan 7, 2010 at 4:33 PM, Michael solrco...@gmail.com wrote: Does not supported mean we can't guarantee whether it will work or not, or you may be able to figure it out on your own? Not implemented, and not expected to work. For example, some info such as sortFieldValues would need to be merged and returned as is done for leaf requests. There are probably other little things like that, but I can't list them off the top of my head. -Yonik http://www.lucidimagination.com
Re: Indexing content on Windows file shares?
Matt: http://sharehound.sourceforge.net/ Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Matt Wilkie matt.wil...@gov.yk.ca To: solr-user@lucene.apache.org Sent: Thu, December 10, 2009 3:06:38 PM Subject: Indexing content on Windows file shares? Hello, I'm new to Solr, I know nothing about it other than it's been touted in a couple of places as a possible competitor to Google Search Appliance, which is what brought me here. I'm looking for a search engine which can index files on windows shares and websites, and, hopefully, integrate with Active Directory to ensure results are not returned to users who don't have access to those files(s). Can Solr do this? If so where is the documentation for it? Reconnaisance searches of the mailing list and wiki have not turned up anything, so far. thanks, -- matt wilkie Geomatics Analyst Information Management and Technology Yukon Department of Environment 10 Burns Road * Whitehorse, Yukon * Y1A 4Y9 867-667-8133 Tel * 867-393-7003 Fax http://environmentyukon.gov.yk.ca/geomatics/
Re: Adaptive search?
Shalin, - Original Message From: Shalin Shekhar Mangar shalinman...@gmail.com To: solr-user@lucene.apache.org Sent: Wed, December 23, 2009 2:45:21 AM Subject: Re: Adaptive search? On Wed, Dec 23, 2009 at 4:09 AM, Lance Norskog wrote: Nice! Siddhant: Another problem to watch out for is the feedback problem: someone clicks on a link and it automatically becomes more interesting, so someone else clicks, and it gets even more interesting... So you need some kind of suppression. For example, as individual clicks get older, you can push them down. Or you can put a cap on the number of clicks used to rank the query. We use clicks/views instead of just clicks to avoid this problem. Doesn't a click imply a view? You click to view. I must be missing something... Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
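To unpack the clicks/views point: a click does imply a view, but not the reverse -- "views" here counts impressions (how often the result was shown in a results list), so dividing clicks by impressions keeps heavily-shown results from snowballing. A minimal sketch of the idea (the smoothing constants are purely illustrative, not from anyone's actual system):

```python
def popularity(clicks: int, impressions: int,
               prior: float = 0.1, smoothing: int = 10) -> float:
    # Smoothed click-through rate: items with few impressions stay near
    # the prior instead of jumping to 0.0 or 1.0 on a single click.
    return (clicks + prior * smoothing) / (impressions + smoothing)

# A result shown 1000 times with 50 clicks...
popular = popularity(50, 1000)
# ...vs. one shown only 5 times with 3 clicks: smoothing keeps thin
# evidence from producing an extreme score.
fresh = popularity(3, 5)
print(popular, fresh)
```

Time decay (aging out old clicks, as Lance suggests) can then be layered on top of the same ratio.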
Re: Replicating multiple cores
For me, the Java replication is nice because it's much easier to set up and has fewer moving pieces (vs. rsync server, scripts config file, event hook, external shell scripts). Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Jason Rutherglen jason.rutherg...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, December 8, 2009 7:44:07 PM Subject: Re: Replicating multiple cores Yes. I'd highly recommend using the Java replication though. Is there a reason? I understand it's new etc, however I think one issue with it is it's somewhat non-native access to the filesystem. Can you illustrate a real world advantage other than the enhanced admin screens? On Mon, Dec 7, 2009 at 11:13 PM, Shalin Shekhar Mangar wrote: On Tue, Dec 8, 2009 at 11:48 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: If I've got multiple cores on a server, I guess I need multiple rsyncd's running (if using the shell scripts)? Yes. I'd highly recommend using the Java replication though. -- Regards, Shalin Shekhar Mangar.
Re: Retrieving large num of docs
Strange. Did you ever figure out the source of the performance difference? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Raghuveer Kancherla raghuveer.kanche...@aplopio.com To: solr-user@lucene.apache.org Sent: Sat, December 5, 2009 12:05:49 PM Subject: Re: Retrieving large num of docs Hi Otis, I think my experiments are not conclusive about reduction in search time. I was playing around with various configurations to reduce the time to retrieve documents from Solr. I am sure that after changing the two multi-valued text fields from stored to un-stored, retrieval time (query time + time to load the stored fields) became much faster. I was expecting the enableLazyFieldLoading setting in solrconfig to take care of this but apparently it is not working as expected. Out of curiosity, I removed these 2 fields from the index (this time I am not even indexing them) and my search time got better (10 times better). However, I am still trying to isolate the reason for the search time reduction. It may be either because of 2 fewer fields to search in, or because of the reduction in size of the index, or maybe something else. I am not sure if enableLazyFieldLoading has any part in explaining this. - Raghu On Fri, Dec 4, 2009 at 3:07 AM, Otis Gospodnetic wrote: Hm, hm, interesting. I was looking into something like this the other day (BIG indexed+stored text fields). After seeing enableLazyFieldLoading=true in solrconfig and after seeing fl didn't include those big fields, I thought, hm, so Lucene/Solr will not be pulling those large fields from disk, OK. You are saying that this may not be true based on your experiment? And what I'm calling "your experiment" means that you reindexed the same data, but without the 2 multi-valued text fields...and that was the only change you made and got ca. 10x search performance improvement? Sorry for repeating your words, just trying to confirm and understand. 
Thanks, Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Raghuveer Kancherla To: solr-user@lucene.apache.org Sent: Thu, December 3, 2009 8:43:16 AM Subject: Re: Retrieving large num of docs Hi Hoss, I was experimenting with various queries to solve this problem and in one such test I remember that requesting only the ID did not change the retrieval time. To be sure, I tested it again using the curl command today and it confirms my previous observation. Also, the enableLazyFieldLoading setting is set to true in my solrconfig. Another general observation (off topic) is that having a moderately large multi-valued text field (~200 entries) in the index seems to slow down the search significantly. I removed the 2 multi-valued text fields from my index and my search got ~10 times faster. :) - Raghu On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter wrote: : I think I solved the problem of retrieving 300 docs per request for now. The : problem was that I was storing 2 moderately large multivalued text fields : though I was not retrieving them during search time. I reindexed all my : data without storing these fields. Now the response time (time for Solr to : return the http response) is very close to the QTime Solr is showing in the Hmmm two comments: 1) the example URL from your previous mail... : http://localhost:1212/solr/select/?rows=300q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29start=0wt=python ...doesn't match your earlier statement that you are only returning the id field (there is no fl param in that URL) ... are you certain you weren't returning those large stored fields in the response? 2) assuming you were actually using an fl param to limit the fields, make sure you have this setting in your solrconfig.xml: <enableLazyFieldLoading>true</enableLazyFieldLoading> ...that should make it pretty fast to return only a few fields of each document, even if you do have some jumbo stored fields that aren't being returned. -Hoss
Re: Removing facets which frequency match the result count
Hi there, If I have two documents with a field indexing a taxonomy path, for example doc1: bags/handbags/clutch doc2: bags/handbags/beach and that field tokenizes on the forward slash, the facets produced will be: bags(2), handbags(2), beach(1), clutch(1). If I select clutch, the facets returned by Solr will be handbags(1) and bags(1). I would like to have no facets returned. Therefore I want facets only to be returned if the facet frequency is smaller than the total results found. This will return a more helpful selection of facets for the user to then refine their search. In this example the user would not want to select 'bags' when they have selected handbags, as it will not help the user in their search. We can remove these facets within the view but I was asking if there is a more elegant way to do this in Solr. Otis Gospodnetic wrote: Hi, Either I don't understand this or this doesn't make much sense. Are you saying you want to show only facet values whose counts == # of hits? If so, what would be the value of showing facets -- they wouldn't be narrowing down the result set. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: joeMcElroy pho...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, January 5, 2010 5:25:18 AM Subject: Removing facets which frequency match the result count Is there any way to specify to solr only to bring back facet filter options where the frequency is less than the total results found? I found facets which match the result count are not helpful to the user, and produce noise within the UI to filter results. I can obviously do this within the view but would be better if solr dealt with this logic. Cheers! Joe -- View this message in context: http://old.nabble.com/Removing-facets-which-frequency-match-the-result-count-tp27026359p27026359.html Sent from the Solr - User mailing list archive at Nabble.com. 
-- View this message in context: http://old.nabble.com/Removing-facets-which-frequency-match-the-result-count-tp27026359p27068209.html Sent from the Solr - User mailing list archive at Nabble.com.
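Until Solr grows such an option, the filtering Joe describes is cheap to do client-side over the facet counts in the response. A sketch (this assumes you have already parsed one facet field from Solr's response into a dict of value -> count; the names are illustrative):

```python
def useful_facets(facet_counts: dict, num_found: int) -> dict:
    # Keep only facet values that would actually narrow the result set,
    # i.e. whose count is strictly less than the total number of hits
    # (and nonzero). Values with count == num_found are pure noise.
    return {value: count
            for value, count in facet_counts.items()
            if 0 < count < num_found}

# After selecting "clutch" (1 hit), the ancestor facets bags/handbags
# have count == numFound and drop out entirely:
facets = {"bags": 1, "handbags": 1, "clutch": 1}
print(useful_facets(facets, num_found=1))
```

With two hits under bags(2)/handbags(2) and one each for clutch(1)/beach(1), only the leaf values survive the filter, which matches the behavior Joe wants.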
What is the process to build LucidWorks Solr?
I am using LucidWorks Solr v1.4 and I would like to compile in a search component; however, it does not seem like a very straightforward process. The ant script in the solr directory is that of the stock Solr installation, which does not compile out of the box. Has anyone been able to successfully compile LucidWorks Solr?
Exact matches without field copying?
Hi, I think the "how can I perform both exact and non-exact (no stemming involved) searches?" question is a pretty common FAQ, but it looks like we don't have an answer for it on the Wiki. The advice is typically to copy a field and apply different analysis to it (one stemmed, the other not stemmed), and then search on the appropriate field. Is there a better way of doing this? * CASE 1: index-time stemming input word: house => indexed as token: hous exact-match desired (non-stemmed query): house => house => no match --- so if you want exact matches, you can't stem at index-time stemmed query: house => hous => match * CASE 2: no index-time stemming input word: house => indexed as token: house exact-match desired (non-stemmed query): house => house => match stemmed query: house => hous => no match --- so if you don't stem at index-time, non-exact matching stops working Thanks, Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
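The two cases can be made concrete with a toy stemmer standing in for the field's analysis chain (a sketch; `stem` below is a stand-in for whatever stemmer the analyzer applies, though Porter does in fact reduce "house" to "hous"):

```python
def stem(word: str) -> str:
    # Toy stand-in for a real stemmer; like Porter, it maps "house" -> "hous".
    return word[:-1] if word.endswith("e") else word

# CASE 1: index-time stemming -- the exact form is gone from the index.
indexed = stem("house")            # "hous"
assert indexed != "house"          # exact (unstemmed) query misses
assert indexed == stem("house")    # stemmed query matches

# CASE 2: no index-time stemming -- the stemmed form never matches.
indexed = "house"
assert indexed == "house"          # exact query matches
assert indexed != stem("house")    # stemmed query misses
```

Since a single indexed token can only be on one side of this asymmetry, the copy-field approach (one stemmed field, one not) is what resolves it.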
Re: Custom Analyzer/Tokenizer works but results were not saved
Analysis is called when creating the indexed data for content, but not when storing the content. copyField copies one field's raw values to another field for storage. The source and target fields can be of any type. copyField does not analyse the source data and then feed it to another field's analyzer stack. You will have to copy the raw data and add your analyzer to the existing analyzer stack for the target field. On Tue, Jan 5, 2010 at 2:56 PM, MitchK mitc...@web.de wrote: Hello community, I wrote another mail today, but I think something went wrong (I can't find my post in the mailing list) - if not, I am sorry for the double post - I am using a mailing list for the first time. I have created a custom analyzer, which consists of a LowerCaseTokenizer, a StopFilter and a custom TestFilter. The analyzer works as expected when I test it with analysis.jsp. However, it does not work when I try to index or query real data via post.jar. I use the analyzer for a testField. This testField gets its value via copyField from the nameField's value. I am speculating that Solr only copies the value without analyzing it afterwards. Here is some xml from my schema: The field: <copyField source="name" dest="test"/> <field name="test" type="testAnalyzer" indexed="true" stored="true"/> The analyzer: <fieldType name="testAnalyzer" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.LowerCaseTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> <filter class="my.package.solr.analysis.TestFilterFactory" mode="Main"/> </analyzer> </fieldType> How can I force Solr to use my analyzer the way it does when I test it with analysis.jsp? Restart and re-indexing does not solve the problem. Hopefully you can help me with this. Thank you! 
Kind regards from Germany Mitch -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27026739.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
Re: Tokenizing problem with numbers in query
Hi, Did you re-start tomcat and re-index your collection? Yes. Do you want to search inside alphanumeric strings? Or are you interested only in prefix queries? Can you give us more examples, like target documents and queries. Searching inside would be required, yes. If the above example worked, I would already be glad. Bernd
Re: Cached document view in solr
If you index the raw document, that is what is returned by the search. The analyzers create separate data that is stored in various files, but is only used in searching. Searching, facets, and sorting use this analyzed output, but search returns pull the original. On Thu, Jan 7, 2010 at 2:28 AM, Ramchandra Phadake ramchandra_phad...@persistent.co.in wrote: nutch search results provide a link for getting the cached document copy. It fetches the raw content from segments based on document id. {cached.jsp} Is it possible to have similar functionality in solr, what can be done to achieve this? Any pointers. I could retrieve the content using the text field ('fl=text'), so content can be retrieved. But it's parsed text, with font formatting lost. Can the original content be stored in any field as is? Thanks, Ram -- Lance Norskog goks...@gmail.com
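On the last question -- yes, the original content can be stored as-is: a stored-but-not-indexed field keeps the raw value untouched, since (as Lance notes above) analysis never alters what is stored. A sketch of such a field for schema.xml (the field name is illustrative; you would send the raw document text into this field yourself at indexing time):

```xml
<!-- Stored but not indexed: returned verbatim in search results,
     never passed through an analyzer, never searchable itself. -->
<field name="raw_content" type="string" indexed="false" stored="true"/>
```

Requesting fl=raw_content would then return the unparsed original, giving a cached-copy view similar to Nutch's cached.jsp.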
Re: SOLR Performance Tuning: Pagination
Great - this issue? https://issues.apache.org/jira/browse/LUCENE-2127 Sounds like it would be a real win for lucene. -Peter On Thu, Jan 7, 2010 at 4:12 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Peter - Aaron just commented on a recent Solr issue (reading large result sets) and mentioned his patch. So far he has 2 x +1 from Grant and me to stick his patch in JIRA. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Peter Wolanin peter.wola...@acquia.com To: solr-user@lucene.apache.org Sent: Sun, January 3, 2010 3:37:01 PM Subject: Re: SOLR Performance Tuning: Pagination At the NOVA Apache Lucene/Solr Meetup last May, one of the speakers from Near Infinity (Aaron McCurry I think) mentioned that he had a patch for lucene that enabled unlimited depth memory-efficient paging. Is anyone in contact with him? -Peter On Thu, Dec 24, 2009 at 11:27 AM, Grant Ingersoll wrote: On Dec 24, 2009, at 11:09 AM, Fuad Efendi wrote: I used pagination for a while till found this... I have filtered query ID:[* TO *] returning 20 millions results (no faceting), and pagination always seemed to be fast. However, fast only with low values for start=12345. Queries like start=28838540 take 40-60 seconds, and even cause OutOfMemoryException. Yeah, deep pagination in Lucene/Solr can be problematic due to the Priority Queue management. See http://issues.apache.org/jira/browse/LUCENE-2127 and the linked discussion on java-dev. I use highlight, faceting on nontokenized Country field, standard handler. It even seems to be a bug... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. 
peter.wola...@acquia.com
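The cost Grant points to is easy to demonstrate: with the priority-queue approach, serving page start=N means tracking the top N+rows documents, so work and memory grow with the offset rather than the page size. A toy sketch of the idea (not Solr's actual code; the heap size is the point):

```python
import heapq

def page(scores, start, rows):
    # To return the page at `start`, a bounded heap of size start+rows
    # must be maintained and sorted -- this is why start=28838540 hurts:
    # cost scales with the offset, not with the number of rows returned.
    heap_size = start + rows
    top = heapq.nlargest(heap_size, scores)
    return top[start:start + rows]

scores = [0.1, 0.9, 0.5, 0.7, 0.3]
print(page(scores, start=1, rows=2))
```

A deep offset therefore forces a huge heap (and a huge final sort), which is what the unlimited-depth paging patch discussed above aims to avoid.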
Re: Solr 1.4 - stats page slow
I recently noticed the same sort of thing. The attached screenshot shows the transition on a search server when we updated from a Solr 1.4 dev build (revision 779609 from 2009-05-28) to the Solr 1.4.0 released code. Every 3 hours we have a cron task to log some of the data from the stats.jsp page from each core (about 100 cores, most of which are small indexes). You can see there is a dramatic spiking of the load after the update - I think due to added reporting on that page such as from the lucene field cache. Is this amount of load expected? -Peter On Thu, Dec 24, 2009 at 12:23 PM, Jay Hill jayallenh...@gmail.com wrote: Also, what is your heap size and the amount of RAM on the machine? I've also noticed that, when watching memory usage through JConsole or YourKit while loading the stats page, the memory usage spikes dramatically - are you seeing this as well? -Jay On Thu, Dec 24, 2009 at 9:12 AM, Jay Hill jayallenh...@gmail.com wrote: I've noticed this as well, usually when working with a large field cache. I haven't done in-depth analysis of this yet, but it seems like when the stats page is trying to pull data from a large field cache it takes quite a long time. Are you doing a lot of sorting? If so, what are the field types of the fields you're sorting on? How large is the index both in document count and file size? Another approach to get data from the Solr instance would be to use JMX. And I've been working on a request handler (started by Erik Hatcher) that will provide the same information as the stats page, but a little more efficiently. I may try to put up a patch with this soon. -Jay On Wed, Dec 23, 2009 at 6:43 AM, Stephen Weiss swe...@stylesight.comwrote: We've been using Solr 1.4 for a few days now and one slight downside we've noticed is the stats page comes up very slowly for some reason - sometimes more than 10 seconds. We call this programmatically to retrieve the last commit date so that we can keep users from committing too frequently. 
This means some of our administration pages are now taking a long time to load. Is there anything we should be doing to ensure that this page comes up quickly? I see some notes on this back in October but it looks like that update should already be applied by now. Or, better yet, is there now a better way to just retrieve the last commit date from Solr without pulling all of the statistics? Thanks in advance. -- Steve -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: DisMaxRequestHandler bf configuration
Thanks. Can I use the standard request handler for this purpose? So something like: <requestHandler name="standard" class="solr.StandardRequestHandler"> <lst name="defaults"> <str name="q">{!boost b=$popularityboost v=$qq}</str> <str name="popularityboost">log(popularity)</str> </lst> </requestHandler> Or do I still need the dismax handler? --- On Thu, 1/7/10, Erik Hatcher erik.hatc...@gmail.com wrote: From: Erik Hatcher erik.hatc...@gmail.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 4:56 AM it wouldn't be q.alt though, just q, in the config file. q.alt is typically *:*, it's the fall back query when no q is provided. though, in thinking about it, q.alt would work here, but i'd use q personally. On Jan 6, 2010, at 9:45 PM, Andy wrote: Let me make sure I understand you. I'd get my regular query from haystack as qq=foo rather than q=foo. Then I put in solrconfig within the dismax section: <str name="q.alt">{!boost b=$popularityboost v=$qq}</str> <str name="popularityboost">log(popularity)</str> Is that what you meant? --- On Wed, 1/6/10, Yonik Seeley yo...@lucidimagination.com wrote: From: Yonik Seeley yo...@lucidimagination.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Wednesday, January 6, 2010, 8:42 PM On Wed, Jan 6, 2010 at 8:24 PM, Andy angelf...@yahoo.com wrote: I meant can I do it with dismax without modifying every single query? I'm accessing Solr through haystack and all queries are generated by haystack. I'd much rather not have to go under haystack to modify the generated queries. Hence I'm trying to find a way to boost every query by default. If you can get haystack to pass through the user query as something like qq, then yes - just use something like the last link I showed at http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents and set defaults for everything except qq. 
-Yonik http://www.lucidimagination.com --- On Wed, 1/6/10, Yonik Seeley yo...@lucidimagination.com wrote: From: Yonik Seeley yo...@lucidimagination.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Wednesday, January 6, 2010, 7:48 PM On Wed, Jan 6, 2010 at 7:43 PM, Andy angelf...@yahoo.com wrote: So if I want to configure Solr to turn every query q=foo into q={!boost b=log(popularity)}foo, dismax wouldn't work but edismax would? You can do it with dismax it's just that the syntax is slightly more convoluted. Check out the section on boosting newer documents: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
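Yonik's suggestion above — fix everything in the handler defaults except the raw user query, which arrives as qq — might look like the following solrconfig.xml sketch. The handler name and the qf fields are assumptions for illustration; the {!boost} pattern itself comes from the SolrRelevancyFAQ link in the thread:

```xml
<!-- Hypothetical handler: the client sends only qq=<user query>;
     the q default wraps it in a boost by log(popularity). -->
<requestHandler name="/boosted" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">title^2 body</str>
    <str name="q">{!boost b=log(popularity) v=$qq}</str>
  </lst>
</requestHandler>
```

A request then looks like /boosted?qq=foo, with no q parameter sent by the client.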
Re: Solr 1.4 - stats page slow
I'd love to see the screenshot, but it didn't come through - got stripped by ML manager. Maybe upload it somewhere? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Peter Wolanin peter.wola...@acquia.com To: solr-user@lucene.apache.org Sent: Thu, January 7, 2010 9:32:26 PM Subject: Re: Solr 1.4 - stats page slow I recently noticed the same sort of thing. The attached screenshot shows the transition on a search server when we updated from a Solr 1.4 dev build (revision 779609 from 2009-05-28) to the Solr 1.4.0 released code. Every 3 hours we have a cron task to log some of the data from the stats.jsp page from each core (about 100 cores, most of which are small indexes). You can see there is a dramatic spiking of the load after the update - I think due to added reporting on that page such as from the lucene field cache. Is this amount of load expected? -Peter On Thu, Dec 24, 2009 at 12:23 PM, Jay Hill wrote: Also, what is your heap size and the amount of RAM on the machine? I've also noticed that, when watching memory usage through JConsole or YourKit while loading the stats page, the memory usage spikes dramatically - are you seeing this as well? -Jay On Thu, Dec 24, 2009 at 9:12 AM, Jay Hill wrote: I've noticed this as well, usually when working with a large field cache. I haven't done in-depth analysis of this yet, but it seems like when the stats page is trying to pull data from a large field cache it takes quite a long time. Are you doing a lot of sorting? If so, what are the field types of the fields you're sorting on? How large is the index both in document count and file size? Another approach to get data from the Solr instance would be to use JMX. And I've been working on a request handler (started by Erik Hatcher) that will provide the same information as the stats page, but a little more efficiently. I may try to put up a patch with this soon. 
-Jay On Wed, Dec 23, 2009 at 6:43 AM, Stephen Weiss wrote: We've been using Solr 1.4 for a few days now and one slight downside we've noticed is the stats page comes up very slowly for some reason - sometimes more than 10 seconds. We call this programmatically to retrieve the last commit date so that we can keep users from committing too frequently. This means some of our administration pages are now taking a long time to load. Is there anything we should be doing to ensure that this page comes up quickly? I see some notes on this back in October but it looks like that update should already be applied by now. Or, better yet, is there now a better way to just retrieve the last commit date from Solr without pulling all of the statistics? Thanks in advance. -- Steve -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: SOLR Performance Tuning: Pagination
Si si, that issue. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Peter Wolanin peter.wola...@acquia.com To: solr-user@lucene.apache.org Sent: Thu, January 7, 2010 9:27:04 PM Subject: Re: SOLR Performance Tuning: Pagination Great - this issue? https://issues.apache.org/jira/browse/LUCENE-2127 Sounds like it would be a real win for lucene. -Peter On Thu, Jan 7, 2010 at 4:12 PM, Otis Gospodnetic wrote: Peter - Aaron just commented on a recent Solr issue (reading large result sets) and mentioned his patch. So far he has 2 x +1 from Grant and me to stick his patch in JIRA. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Peter Wolanin To: solr-user@lucene.apache.org Sent: Sun, January 3, 2010 3:37:01 PM Subject: Re: SOLR Performance Tuning: Pagination At the NOVA Apache Lucene/Solr Meetup last May, one of the speakers from Near Infinity (Aaron McCurry I think) mentioned that he had a patch for lucene that enabled unlimited depth memory-efficient paging. Is anyone in contact with him? -Peter On Thu, Dec 24, 2009 at 11:27 AM, Grant Ingersoll wrote: On Dec 24, 2009, at 11:09 AM, Fuad Efendi wrote: I used pagination for a while till found this... I have filtered query ID:[* TO *] returning 20 millions results (no faceting), and pagination always seemed to be fast. However, fast only with low values for start=12345. Queries like start=28838540 take 40-60 seconds, and even cause OutOfMemoryException. Yeah, deep pagination in Lucene/Solr can be problematic due to the Priority Queue management. See http://issues.apache.org/jira/browse/LUCENE-2127 and the linked discussion on java-dev. I use highlight, faceting on nontokenized Country field, standard handler. It even seems to be a bug... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. 
http://www.tokenizer.ca/ Data Mining, Vertical Search -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: DisMaxRequestHandler bf configuration
On Jan 7, 2010, at 9:51 PM, Andy wrote: Thanks. Can I use the standard request handler for this purpose? So something like: Yes, but... <requestHandler name="standard" class="solr.StandardRequestHandler"> <lst name="defaults"> <str name="q">{!boost b=$popularityboost v=$qq}popularityboost=log(popularity)</str> </lst> </requestHandler> Or do I still need the dismax handler? popularityboost needs to be a separate str parameter. Erik
Re: DisMaxRequestHandler bf configuration
Oh I see. Is popularityboost the name of the parameter? <requestHandler name="standard" class="solr.StandardRequestHandler"> <lst name="defaults"> <str name="q">{!boost b=$popularityboost v=$qq}</str> <str name="popularityboost">log(popularity)</str> </lst> </requestHandler> --- On Thu, 1/7/10, Erik Hatcher erik.hatc...@gmail.com wrote: From: Erik Hatcher erik.hatc...@gmail.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 9:57 PM On Jan 7, 2010, at 9:51 PM, Andy wrote: Thanks. Can I use the standard request handler for this purpose? So something like: Yes, but... <requestHandler name="standard" class="solr.StandardRequestHandler"> <lst name="defaults"> <str name="q">{!boost b=$popularityboost v=$qq}popularityboost=log(popularity)</str> </lst> </requestHandler> Or do I still need the dismax handler? popularityboost needs to be a separate str parameter. Erik
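Putting Erik's correction together: the boost function goes in its own str parameter, which the local-params reference $popularityboost then resolves. As well-formed XML, this looks like:

```xml
<requestHandler name="standard" class="solr.StandardRequestHandler">
  <lst name="defaults">
    <!-- $popularityboost and $qq are dereferenced as request/default parameters -->
    <str name="q">{!boost b=$popularityboost v=$qq}</str>
    <str name="popularityboost">log(popularity)</str>
  </lst>
</requestHandler>
```

The user query is then passed as qq=foo, and the popularity boost applies to every request without the client having to change its queries.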
Re: solr updateCSV
http://www.lucidimagination.com/search/s:wiki?q=update+csv You can set the field names on the URL or as the first line. On Thu, Jan 7, 2010 at 3:48 AM, Mark N nipen.m...@gmail.com wrote: I am trying to use Solr's CSV updater to index the data. I am trying to specify a .dat format consisting of a field separator, a text qualifier and a line separator, for example: field1 [field separator] field2 [field separator] [line separator] [text qualifier]value for field 1[text qualifier] [field separator] [text qualifier]value for field 2[text qualifier] [line separator] Can we specify the text qualifier and line separator as well? I have tested that we can specify a separator and it works well. -- Nipen Mark -- Lance Norskog goks...@gmail.com
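For reference, the field separator and the text qualifier (called the encapsulator in Solr's CSV update parameters) can be set either on the update URL or as handler defaults. A sketch assuming Solr 1.4's CSVRequestHandler; the separator values here are examples, and as far as I know the line separator itself is not configurable — rows are expected to be newline-separated:

```xml
<requestHandler name="/update/csv" class="solr.CSVRequestHandler">
  <lst name="defaults">
    <str name="separator">;</str>              <!-- field separator -->
    <str name="encapsulator">"</str>           <!-- text qualifier -->
    <str name="fieldnames">field1,field2</str> <!-- or put the names on the first line -->
  </lst>
</requestHandler>
```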
How to Split Index file.
Hi, I would like to split the existing index into two indexes (the inverse of the merge-index function). My index directory is around 20 GB, with 10 million documents. -Kalidoss.m ** DISCLAIMER ** Information contained and transmitted by this E-MAIL is proprietary to Sify Limited and is intended for use only by the individual or entity to which it is addressed, and may contain information that is privileged, confidential or exempt from disclosure under applicable law. If this is a forwarded message, the content of this E-MAIL may not have been sent with the authority of the Company. If you are not the intended recipient, an agent of the intended recipient or a person responsible for delivering the information to the named recipient, you are notified that any use, distribution, transmission, printing, copying or dissemination of this information in any way or in any manner is strictly prohibited. If you have received this communication in error, please delete this mail and notify us immediately at ad...@sifycorp.com
Understanding the query parser
I am using Solr 1.3. I have an index with a field called name. It is of type text (the unmodified, stock text field from Solr). My query field:foo-bar is parsed as the phrase query field:"foo bar". I was rather expecting it to be parsed as field:(foo bar), i.e. field:foo field:bar. Is there an expectation mismatch? Can I make it work as I expect it to? Cheers Avlesh
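What is likely happening, assuming the stock 1.3 text type: its WordDelimiterFilterFactory splits foo-bar into the tokens foo and bar at query time, and when a single query term analyzes into multiple tokens the Lucene query parser emits a phrase query. One workaround is to pre-split on the client side and send field:(foo bar); another is a field type whose analyzer does not split on hyphens, e.g. this hypothetical sketch:

```xml
<!-- Hypothetical type: whitespace tokenization only, no WordDelimiterFilter,
     so foo-bar stays a single token instead of becoming a phrase query. -->
<fieldType name="text_nosplit" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that with text_nosplit the indexed tokens change too, so the field would need to be reindexed with the same analyzer on both sides.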
Re: Meaning of this error: Failure to meet condition(s) of required/prohibited clause(s)???
I have defined a field type in schema.xml: <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.TrimFilterFactory"/> <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/> </analyzer> </fieldType> <field name="keywords" type="lowercase" indexed="true" stored="true" multiValued="true"/> <field name="defaultKeywords" type="lowercase" indexed="true" stored="true" multiValued="true"/> <field name="subKeywords" type="textTight" indexed="true" stored="true" multiValued="true"/> The other fields don't have "lcd tvs" in them. And the handler used is: <requestHandler name="product" class="solr.DisMaxRequestHandler"> <lst name="defaults"> <str name="echoParams">explicit</str> <str name="qf">title^0.7 contributors^0.2 keywords^0.5 defaultKeywords^1 subKeywords^0.1</str> <str name="pf">keywords^0.5 defaultKeywords^1</str> <str name="mm">50%</str> </lst> </requestHandler> -Regards, Gunjan Erick Erickson wrote: How are these fields defined in your schema.xml? Note that String types are indexed without tokenization, so if str is defined as a String field type, that may be part of your problem (try the text type if so). If this is irrelevant, please show us the relevant parts of your schema and the query you're submitting... Erick On Thu, Jan 7, 2010 at 6:17 AM, gunjan_versata gunjanga...@gmail.com wrote: Hi All, I have a document indexed in Solr which is as follows: <doc> <str name="id">P-E-HE-Philips-32PFL5409-98-Black-32</str> <arr name="keywords"> <str>Philips</str> <str>LCD TVs</str> </arr> <str name="title">Philips 32PFL5409-98 32 LCD TV with Pixel Plus HD (Black, 32)</str> </doc> Now when I search for "lcd tvs", I don't see the above doc in the search results. On doing explainOther, I got the following output: 
P-E-HE-Philips-32PFL5409-98-Black-32: 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s) 0.0 = no match on required clause (((subKeywords:lcd^0.1 | keywords:lcd^0.5 | defaultKeywords:lcd | contributors:lcd^0.5 | title:lcd) (subKeywords:televis^0.1 | keywords:tvs^0.5 | defaultKeywords:tvs | contributors:tvs^0.5 | (title:televis title:tv title:tvs)))~1) 0.0 = (NON-MATCH) Failure to match minimum number of optional clauses: 1 0.91647065 = (MATCH) max of: 0.91647065 = (MATCH) weight(keywords:lcd tvs^0.5 in 40), product of: 0.13178125 = queryWeight(keywords:lcd tvs^0.5), product of: 0.5 = boost 11.127175 = idf(docFreq=34, maxDocs=875476) 0.023686381 = queryNorm 6.9544845 = (MATCH) fieldWeight(keywords:lcd tvs in 40), product of: 1.0 = tf(termFreq(keywords:lcd tvs)=1) 11.127175 = idf(docFreq=34, maxDocs=875476) 0.625 = fieldNorm(field=keywords, doc=40) I am not sure what this means, and whether I can tweak it or not. Please note, this score was higher than that of the results which did show up... Regards, Gunjan
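A likely reading of that debug output: the lowercase type uses KeywordTokenizerFactory, so "LCD TVs" is indexed as the single token lcd tvs, while dismax analyzes each whitespace-separated query word on its own — the per-word clauses keywords:lcd and keywords:tvs can therefore never match, and only the pf phrase clause does. One hedged fix is to make the qf copy of the field tokenized, e.g. this sketch (the type name is hypothetical; reindexing would be required):

```xml
<!-- Hypothetical sketch: tokenize keywords so individual query words can
     match via qf; keep a KeywordTokenizer variant for pf if whole-phrase
     matching is still wanted. -->
<fieldType name="text_keywords" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
<field name="keywords" type="text_keywords" indexed="true" stored="true" multiValued="true"/>
```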