RE: Best use of wildcard searches
Hello, I'm in exactly the same situation as you. I've got some structured subjects (as subjects:main subject/sub subject/sub sub subject) and want to search them as literals from a given level (subjects:main subject/*). As you know, subjects:main subject/* doesn't work (but it should, shouldn't it?), so what I've done is to replace the space character with the wildcard '?' in my query, like this: subjects:main?subject/*. It works, even if it isn't very elegant. Is it a great loss, performance-wise? Could somebody tell? Sure, a better solution would be appreciated.

Kind regards,
Pierre-Yves Landron

From: [EMAIL PROTECTED]
Subject: Re: Best use of wildcard searches
Date: Wed, 8 Aug 2007 14:59:36 -0700
To: solr-user@lucene.apache.org

OK. So a followup question..

?q=department_exact:Apparel%3EMen's%20Apparel*&fq=country_code:US&fq=brand_exact:adidas&fq=hibernated:true

returns 0 results. Note the %20 in there for the space character.

?q=department_exact:Apparel%3EMen's*&fq=country_code:US&fq=brand_exact:adidas&fq=hibernated:true

returns several, and the only change is that I've truncated Men's Apparel* to be Men's*. (example department_exacts from this result set below..)

Apparel>Men's Apparel>Sweatshirts>Hooded
Apparel>Men's Apparel>Shirts>Tank Top>Workout
Apparel>Men's Apparel>Sweatshirts
Apparel>Men's Apparel>Sweatshirts
Apparel>Men's Apparel>Jackets>Windbreaker

Any ideas why Apparel>Men's* would work, but Apparel>Men's Apparel* would not?

++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++

On Aug 8, 2007, at 2:42 PM, Yonik Seeley wrote:

On 8/8/07, Matthew Runo [EMAIL PROTECTED] wrote: I've been using the standard query handler to do searches like q=department_exact:Foo>Bar>Baz>Qux. Now, let's assume I have lots of records, with various department trees...

1. Foo>Bar>Baz>Qux
2. Foo>Bar>Baz>Put
3. Foo>Bar>Something With Spaces>Elese
4. Foo>TotalyDifferentTree

I'd like to get all the products at various levels, and all the levels below. I have a tokenized department field, and a copyField department_exact. I've been doing searches on the department_exact field, thinking I could do this.. q=department_exact:"Foo>Bar*"

A "*" inside quotes is literal. Try q=department_exact:Foo>Bar*
Or if ">" is a reserved character, escape it with "\": q=department_exact:Foo\>Bar*
If Bar is unique (only under Foo), you could use a copyField to copy it to a regex tokenizer to split on ">" and then do a simple search on Bar.

-Yonik
Re: Best use of wildcard searches
I just saw an e-mail from Yonik suggesting escaping the space. I know so little about Solr that all I can do is parrot Yonik...

Erick

On 8/8/07, Matthew Runo [EMAIL PROTECTED] wrote: [snip]
Too many open files
<result status="1">java.io.FileNotFoundException: /usr/local/bin/apache-solr/enr/solr/data/index/_16ik.tii (Too many open files)</result>

When I'm importing, this is the error I get. I know it's vague and obscure. Can someone suggest where to start? I'll buy a bag of M&Ms (not peanut) for anyone who can help me solve this*

*limit one bag per successful solution for a total maximum of 1 bag to be given
question: how to divide the indexing into separate domains
Hi! Say I have 300 CSV files that I need to index. Each one holds millions of lines (each line is a few fields separated by commas). Each CSV file represents a different domain of data (e.g. file1 is computers, file2 is flowers, etc.). There is no indication of the domain ID in the data inside the CSV file.

When I search, I would like to specify the ID of a specific domain, and I want Solr to search only in this domain - to save time and reduce the number of matches. So during indexing I need to specify the domain ID of the CSV file being indexed. How do I do it?

Thanks

p.s. I wish I could index like this:
curl http://localhost:8080/solr/update/csv?stream.file=test.csv&fieldnames=field1,field2&f.domain.value=98765
(where 98765 is the domain ID for this specific CSV file)
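One way to get the effect of the wished-for f.domain.value is to rewrite each CSV on the way in, appending a constant domain column, and then point the CSV handler at the rewritten file. A minimal Python 2 sketch (the port, file names, and the value 98765 come from the example above; a "domain" field is assumed to be declared in schema.xml):

import csv
import urllib

def add_domain_column(src, dst, domain_id):
    # Append a constant domain value to every row of the CSV.
    reader = csv.reader(open(src, 'rb'))
    writer = csv.writer(open(dst, 'wb'))
    for row in reader:
        writer.writerow(row + [domain_id])

add_domain_column('test.csv', '/tmp/test_with_domain.csv', '98765')

# stream.file makes Solr read the file locally on the server;
# fieldnames now includes the extra last column.
params = urllib.urlencode({
    'stream.file': '/tmp/test_with_domain.csv',
    'fieldnames': 'field1,field2,domain',
})
print urllib.urlopen('http://localhost:8080/solr/update/csv?' + params).read()

At query time, a filter query such as fq=domain:98765 then restricts a search to one domain.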
RE: Too many open files
You're a gentleman and a scholar. I will donate the M&Ms to myself :). Can you tell me from this snippet of my solrconfig.xml what I might tweak to make this more betterer?

-KH

<indexDefaults>
  <!-- Values here affect all index writers and act as a default unless overridden. -->
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
</indexDefaults>
RE: Too many open files
You could try committing updates more frequently, or maybe optimising the index beforehand (and even during!). I imagine you could also change the Solr config, if you have access to it, to tweak indexing (or index creation) parameters - http://wiki.apache.org/solr/SolrConfigXml should be of use to you here.

In the unlikely event I qualify for the M&Ms, I hereby donate them back to you for giving to someone else!

Jon

-----Original Message-----
From: Kevin Holmes [mailto:[EMAIL PROTECTED]]
Sent: 09 August 2007 15:23
To: solr-user@lucene.apache.org
Subject: Too many open files

[snip]
Any clever ideas to inject into solr? Without http?
I inherited an existing (working) Solr indexing script that runs like this:

Python script queries the MySQL DB, then calls bash script.
Bash script performs a curl POST submit to Solr.

We're injecting about 1000 records/minute (constantly), frequently pushing the edge of our CPU / RAM limitations. I'm in the process of building a Perl script to use DBI and lwp::simple::post that will perform this all from a single script (instead of 3).

Two specific questions:
1: Does anyone have a clever (or better) way to perform this process efficiently?
2: Is there a way to inject into Solr without using POST / curl / http?

Admittedly, I'm no Solr expert - I'm starting from someone else's setup, trying to reverse-engineer my way out. Any input would be greatly appreciated.
Re: Any clever ideas to inject into solr? Without http?
Condensing the loader into a single executable sounds right if you have performance problems. ;-) You could also try adding multiple docs in a single post if you notice your problems are with TCP setup time, though if you're doing localhost connections that should be minimal.

If you're already local to the Solr server, you might check out the CSV slurper: http://wiki.apache.org/solr/UpdateCSV It's a little specialized.

And then there's of course the question of: are you doing full re-indexing or incremental indexing of changes?

--cw

On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote: [snip]
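For reference, "multiple docs in a single post" just means one <add> envelope carrying many <doc> elements. A minimal Python 2 sketch (the field names and port are invented for illustration):

from xml.sax.saxutils import escape
import urllib2

def add_xml(records):
    # Build one <add> containing a <doc> per record.
    docs = []
    for r in records:
        fields = "".join('<field name="%s">%s</field>' % (k, escape(str(v)))
                         for k, v in r.items())
        docs.append("<doc>%s</doc>" % fields)
    return "<add>%s</add>" % "".join(docs)

records = [{"id": 1, "name": "first"}, {"id": 2, "name": "second"}]
req = urllib2.Request("http://localhost:8983/solr/update", add_xml(records),
                      {"Content-Type": "text/xml; charset=utf-8"})
print urllib2.urlopen(req).read()

One POST per batch instead of one per record amortizes both the HTTP round trip and the request parsing.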
RE: Any clever ideas to inject into solr? Without http?
What we're looking for is a way to inject *without* using curl, or wget, or any other http-based communication. We'd like for the HTTP daemon to only handle search requests, not indexing requests on top of them. Plus, I have to believe there's a faster way to get documents into solr/lucene than using curl.

_
david whalen
senior applications developer
eNR Services, Inc.
[EMAIL PROTECTED]
203-849-7240

-----Original Message-----
From: Clay Webster [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 09, 2007 11:43 AM
To: solr-user@lucene.apache.org
Subject: Re: Any clever ideas to inject into solr? Without http?

[snip]
Re: Any clever ideas to inject into solr? Without http?
(re)building the index separately (i.e. on a different computer) and then replacing the active index may be an option.

David Whalen wrote: [snip]
Re: Any clever ideas to inject into solr? Without http?
On Aug 9, 2007, at 11:12 AM, Kevin Holmes wrote:
2: Is there a way to inject into solr without using POST / curl / http?

Check http://wiki.apache.org/solr/EmbeddedSolr

There are examples in Java and Cocoa to use the DirectSolrConnection class, querying and updating Solr w/o a web server. It uses JNI in the Cocoa case.

-b
Re: Any clever ideas to inject into solr? Without http?
If it's a contention between search and indexing, separate them via a query-slave and an index-master.

--cw

On 8/9/07, David Whalen [EMAIL PROTECTED] wrote: [snip]
Re: Any clever ideas to inject into solr? Without http?
On 8/9/07, David Whalen [EMAIL PROTECTED] wrote:
Plus, I have to believe there's a faster way to get documents into solr/lucene than using curl

One issue with HTTP is latency. You can get around that by adding multiple documents per request, or by using multiple threads concurrently. You can also bypass HTTP by using something like the CSV loader (very lightweight) and specifying a local file (via the stream.file parameter): http://wiki.apache.org/solr/UpdateCSV

I doubt you will see much of a difference between reading locally vs streaming over HTTP, but it might be interesting to see the exact overhead.

-Yonik
RE: Too many open files
If you check out the documentation for mergeFactor, you'll find that adjusting it downward can lower the number of open files. Just remember that it is a speed tradeoff, and only lower it as much as you need to to stop getting the too-many-files errors. See this section: http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html#indexing_speed

Thanks,
Stu

-----Original Message-----
From: Ard Schrijvers [EMAIL PROTECTED]
Sent: Thu, August 9, 2007 10:52 am
To: solr-user@lucene.apache.org
Subject: RE: Too many open files

Hello,

useCompoundFile set to true should avoid the problem. You could also try to set maximum open files higher, something like (I assume Linux): ulimit -n 8192

Ard

[snip]
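As rough arithmetic (approximate figures, not from this thread): a non-compound Lucene segment is stored as roughly 7-8 files (.fnm, .fdt, .fdx, .frq, .prx, .tis, .tii, plus norms), and with mergeFactor=10 an index can accumulate on the order of ten segments per level of the merge hierarchy, so a moderately large index can easily hold dozens of segments - a few hundred open files before counting descriptors held by open searchers. That is why each of the three fixes works: useCompoundFile collapses the ~8 files to about 1-2 per segment, a lower mergeFactor caps the segment count sooner, and ulimit -n simply raises the ceiling.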
Re: Any clever ideas to inject into solr? Without http?
On 8/9/07, Siegfried Goeschl [EMAIL PROTECTED] wrote:
+) my colleague just finished a database import service running within the servlet container to avoid writing out the data to the file system and transmitting it over HTTP.

Most people doing this read data out of the database and construct the XML in-memory for sending to Solr... one definitely doesn't want to write intermediate stuff to the filesystem (unless perhaps it's a CSV dump).

+) I think there was some discussion regarding a generic database importer but nothing I'm aware of

Absolutely a needed feature... it's in the queue: https://issues.apache.org/jira/browse/SOLR-103

But there will always be more complex cases, pulling from multiple data sources, doing some merging and munging, etc. The easiest way to handle many of those would probably be via a scripting language that does the app-specific merging+munging and then uses a Solr client (which constructs in-memory CSV or XML and sends to Solr).

-Yonik
always fails to update the first time after I restart the server
Hi, I noticed the first index update after I restart my JBoss server always fails with the exception below. Any update after that works fine. Does anyone know what the problem is? The Solr version I'm using is Solr 1.2.

Thanks
Xuesong

2007-08-09 11:41:44,559 ERROR [STDERR] Aug 9, 2007 11:41:44 AM org.apache.solr.core.SolrException log
SEVERE: java.io.IOException: Underlying input stream returned zero bytes
        at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:415)
        at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183)
        at java.io.InputStreamReader.read(InputStreamReader.java:167)
        at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2972)
        at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
        at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
        at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
        at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
        at org.xmlpull.mxp1.MXParser.nextTag(MXParser.java:1078)
        at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:111)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
        at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:175)
        at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:74)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
        at org.jboss.web.tomcat.tc5.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:156)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
        at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:199)
        at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:282)
        at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:767)
        at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:697)
        at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:889)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:595)
2007-08-09 11:41:44,590 ERROR [STDERR] Aug 9, 2007 11:41:44 AM org.apache.solr.core.SolrCore execute
INFO: /update/ 0 78
2007-08-09 11:41:44,590 ERROR [STDERR] Aug 9, 2007 11:41:44 AM org.apache.solr.core.SolrException log
SEVERE: java.io.IOException: Underlying input stream returned zero bytes
        at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:415)
        at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183)
        at java.io.InputStreamReader.read(InputStreamReader.java:167)
        at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2972)
        at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
        at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
        at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
        at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
        at org.xmlpull.mxp1.MXParser.nextTag(MXParser.java:1078)
        at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:111)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
Synonym questions
Hi - Just looking at synonyms, and had a couple of questions.

1) For some of my synonyms, it seems to make sense to simply replace the original word with the other (e.g. theatre => theater, so searches for either will find either). For others, I want to add an alternate term while preserving the original (e.g. cirque => circus, so searches for circus find Cirque du Soleil, but searches for cirque only match cirque, not circus). I was thinking that the best way to do this was with two different synonym filters. The replace filter would be used both at index and query time, the other only at index time. Does doing this using two synonym filters make sense?

section from my schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_replace.txt" ignoreCase="true" expand="false" includeOrig="false"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_add.txt" ignoreCase="true" expand="false" includeOrig="true"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_replace.txt" ignoreCase="true" expand="false" includeOrig="false"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldType>

2) For this to work, I need to use includeOrig. It appears that includeOrig is hard coded to be false in SynonymFilterFactory. Is there any reason for this? It's pretty easy to change (diff below), any reason this should not be supported?

Thanks,
Tom

Diffing vs. my local copy of 1.2, but it appears to be the same in HEAD.

--- src/java/org/apache/solr/analysis/SynonymFilterFactory.java
+++ src/java/org/apache/solr/analysis/SynonymFilterFactory.java (working copy)
@@ -37,6 +37,7 @@
     ignoreCase = getBoolean("ignoreCase",false);
     expand = getBoolean("expand",true);
+    includeOrig = getBoolean("includeOrig",false);

     if (synonyms != null) {
       List<String> wlist=null;
@@ -57,8 +58,9 @@
   private SynonymMap synMap;
   private boolean ignoreCase;
   private boolean expand;
+  private boolean includeOrig;

-  private static void parseRules(List<String> rules, SynonymMap map, String mappingSep, String synSep, boolean ignoreCase, boolean expansion) {
+  private void parseRules(List<String> rules, SynonymMap map, String mappingSep, String synSep, boolean ignoreCase, boolean expansion) {
     int count=0;
     for (String rule : rules) {
       // To use regexes, we need an expression that specifies an odd number of chars.
@@ -88,7 +90,6 @@
       }
     }
-    boolean includeOrig=false;
     for (List<String> fromToks : source) {
       count++;
       for (List<String> toToks : target) {
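For concreteness, the two synonym files in a setup like this might contain entries such as the following (the file names come from the schema above; the entries themselves are only illustrative). The "=>" syntax in SynonymFilterFactory defines a one-way mapping:

synonyms_replace.txt (used at index and query time; the original token is dropped):
theatre => theater

synonyms_add.txt (index time only; with includeOrig=true the original token is kept):
cirque => circus

With this arrangement a document containing "cirque" is indexed under both cirque and circus, so a query for circus finds it, while a query for cirque still matches only documents that actually contained cirque.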
Re: Too many open files
On 9-Aug-07, at 7:52 AM, Ard Schrijvers wrote:
ulimit -n 8192

Unless you have an old, creaky box, I highly recommend simply upping your filedesc cap.

-Mike
Re: Best use of wildcard searches
On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
Hmm.. I just tried the following three queries...

/?q=department_exact:Apparel>Men's?Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas... (no results)
/?q=department_exact:Apparel>Men's\ Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas... (no results)
/?q=Apparel>Men's\ Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas... (results)

I know that the string I'm searching for is stored in department_exact (copyField) and department (original field).

What's the department_exact fieldType look like?

-Yonik
Re: Best use of wildcard searches
Here you go.. I thought that string wasn't munged, so I used that...

<field name="department" type="text" indexed="true" stored="true"/>
<field name="department_exact" type="string" indexed="true" stored="true"/>
<copyField source="department" dest="department_exact"/>

++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++

On Aug 9, 2007, at 12:26 PM, Yonik Seeley wrote: [snip]
Re: Best use of wildcard searches
On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
Here you go.. I thought that string wasn't munged, so I used that...

<field name="department" type="text" indexed="true" stored="true"/>
<field name="department_exact" type="string" indexed="true" stored="true"/>
<copyField source="department" dest="department_exact"/>

Hmmm, that looks ok. You re-indexed since department_exact was added? If so, could you show the exact XML response containing a document with department_exact in it, and then a prefix query on department_exact that doesn't return that document (with debugQuery=on)?

-Yonik
RE: Any clever ideas to inject into solr? Without http?
Is this a native feature, or do we need to get creative with scp from one server to the other?

If it's a contention between search and indexing, separate them via a query-slave and an index-master.
--cw
Re: Any clever ideas to inject into solr? Without http?
On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote:
Python script queries the mysql DB then calls bash script
Bash script performs a curl POST submit to solr

For the most up-to-date Solr client for Python, check out https://issues.apache.org/jira/browse/SOLR-216

-Yonik
Re: Best use of wildcard searches
Yes, we've reindexed several times. Here are four sample result sets..

1 - ?q=department_exact:Apparel>Men's?Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas
2 - ?q=department_exact:Apparel>Men's\ Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas
3 - ?q=department_exact:Apparel>Men's Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas
4 - ?q=Apparel>Men's Apparel>Jackets*&fq=country_code:US&fq=brand_exact:adidas

1 is the only one that has any data now.. very strange that it'd change when I changed nothing in the index. But at any rate, shouldn't ? and \ give the same results? Also attached is my schema.xml.

++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">2</int><lst name="params"><str name="q">department_exact:Apparel&gt;Men's?Apparel&gt;Jackets*</str><arr name="fq"><str>country_code:US</str><str>brand_exact:adidas</str></arr></lst></lst>
<result name="response" numFound="35" start="0">
<doc><str name="brand">adidas</str><str name="brand_exact">adidas</str><int name="brand_id">1</int><arr name="country_code"><str>US</str></arr><date name="create_date">2007-08-05T01:08:26Z</date><str name="department">Apparel&gt;Men's Apparel&gt;Jackets&gt;Windbreaker</str><str name="department_exact">Apparel&gt;Men's Apparel&gt;Jackets&gt;Windbreaker</str><str name="description">&lt;ul&gt;&lt;li&gt;The Essential 3-Stripes Track Top is a ClimaLite&amp;reg; full-zip jacket with front pockets.&lt;li&gt;Applied 3-Stripes on sleeves.&lt;li&gt;3-D inserts at chest and back.&lt;li&gt;Heat-transfer brandmark on left chest.&lt;li&gt;100% Polyester.&lt;/ul&gt;</str><str name="gender">Mens</str><str name="hibernated">true</str><bool name="inStock">true</bool><str name="name">Essential 3-Stripes Track Top</str><int name="popularity">20018</int><float name="price">55.95</float><int name="product_id">7280433</int><arr name="product_type"><str>Jacket</str><str>Top</str><str>Apparel</str></arr><arr name="product_type_exact"><str>Jacket</str><str>Top</str><str>Apparel</str></arr><str name="product_url">/n/p/p/7280433/c/151.html</str><arr name="size"><str>SM</str><str>2XL</str><str>Youth 4.5</str><str>Youth 5</str><str>Youth 6</str><str>Youth 7</str><str>Youth 3.5</str><str>Youth 2.5</str><str>Youth 6.5</str><str>Youth 2</str><str>MD</str><str>LG</str><str>Youth 3</str><str>Youth 5.5</str><str>Youth 4</str></arr><arr name="size_exact"><str>SM</str><str>2XL</str><str>Youth 4.5</str><str>Youth 5</str><str>Youth 6</str><str>Youth 7</str><str>Youth 3.5</str><str>Youth 2.5</str><str>Youth 6.5</str><str>Youth 2</str><str>MD</str><str>LG</str><str>Youth 3</str><str>Youth 5.5</str><str>Youth 4</str></arr><int name="style_id">333625</int><str name="thumbnail_url">/images/728/7280433/6220-333625-t.jpg</str><arr name="width"><str>Regular</str></arr></doc>
<doc><str name="brand">adidas</str><str name="brand_exact">adidas</str><int name="brand_id">1</int><arr name="country_code"><str>US</str></arr><date name="create_date">2007-08-05T01:08:26Z</date><str name="department">Apparel&gt;Men's Apparel&gt;Jackets&gt;Windbreaker</str><str name="department_exact">Apparel&gt;Men's Apparel&gt;Jackets&gt;Windbreaker</str><str name="description">&lt;ul&gt;&lt;li&gt;The Essential 3-Stripes Track Top is a ClimaLite&amp;reg; full-zip jacket with front pockets.&lt;li&gt;Applied 3-Stripes on sleeves.&lt;li&gt;3-D inserts at chest and back.&lt;li&gt;Heat-transfer brandmark on left chest.&lt;li&gt;100% Polyester.&lt;/ul&gt;</str><str name="gender">Mens</str><str name="hibernated">true</str><bool name="inStock">true</bool><str name="name">Essential 3-Stripes Track Top</str><int name="popularity">20018</int><float name="price">55.95</float><int name="product_id">7280433</int><arr name="product_type"><str>Jacket</str><str>Top</str><str>Apparel</str></arr><arr name="product_type_exact"><str>Jacket</str><str>Top</str><str>Apparel</str></arr><str name="product_url">/n/p/p/7280433/c/6758.html</str><arr name="size"><str>SM</str><str>2XL</str><str>Youth 4.5</str><str>Youth 5</str><str>Youth 6</str><str>Youth 7</str><str>Youth 3.5</str><str>Youth 2.5</str><str>Youth 6.5</str><str>Youth 2</str><str>MD</str><str>LG</str><str>Youth 3</str><str>Youth 5.5</str><str>Youth 4</str></arr><arr name="size_exact"><str>SM</str><str>2XL</str><str>Youth 4.5</str><str>Youth 5</str><str>Youth 6</str><str>Youth 7</str><str>Youth 3.5</str><str>Youth 2.5</str><str>Youth 6.5</str><str>Youth 2</str><str>MD</str><str>LG</str><str>Youth 3</str><str>Youth 5.5</str><str>Youth 4</str></arr><int name="style_id">333629</int><str name="thumbnail_url">/images/728/7280433/6220-333629-t.jpg</str><arr name="width"><str>Regular</str></arr></doc>
<doc><str name="brand">adidas</str><str name="brand_exact">adidas</str><int name="brand_id">1</int><arr name="country_code"><str>US</str></arr><date name="create_date">2007-08-05T01:08:26Z</date><str name="department">Apparel&gt;Men's Apparel&gt;Jackets&gt;Windbreaker</str><str name="department_exact">Apparel&gt;Men's Apparel&gt;Jackets&gt;Windbreaker</str><str name="description">&lt;ul&gt;&lt;li&gt;The Essential 3-Stripes Track Top is a ClimaLite&amp;reg; full-zip jacket with front pockets.&lt;li&gt;Applied 3-Stripes on sleeves.&lt;li&gt;3-D inserts at chest and back.&lt;li&gt;Heat-transfer brandmark on left chest.&lt;li&gt;100% Polyester.&lt;/ul&gt;</str><str name="gender">Mens</str><str name="hibernated">true</str><bool name="inStock">true</bool><str name="name">Essential 3-Stripes Track Top</str><int name="popularity">20018</int><float name="price">55.95</float>
Re: Best use of wildcard searches
On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
Yes, we've reindexed several times. [snip] 1 is the only one that has any data now.. very strange that it'd change when I changed nothing in the index. But at any rate, shouldn't ? and \ give the same results?

They translate to different queries. But can I see the XML output for 1 and 2 with debugQuery=on&indent=on appended?

-Yonik
Re: Best use of wildcard searches
On 8/9/07, Yonik Seeley [EMAIL PROTECTED] wrote:
They translate to different queries. But can I see the XML output for 1 and 2 with debugQuery=on&indent=on appended?

Or perhaps wt=python would be less confusing, seeing that there are '>' chars in there that would otherwise be escaped.

-Yonik
Re: Best use of wildcard searches
Sure thing! Here's 1, and 2. 1 - just a space. 2 - a \ .

++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++

On Aug 9, 2007, at 1:14 PM, Yonik Seeley wrote: [snip]
Re: Best use of wildcard searches
Hm, I don't see any attachments, I'm forwarding them to you directly. Would anyone else like to see them?

++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++

On Aug 9, 2007, at 1:20 PM, Matthew Runo wrote: [snip]
Re: Best use of wildcard searches
Feel free to run some queries yourself. We opened the firewall for this box...

http://66.209.92.171:8080/solr/select/?q=department_exact:Apparel%3EMen's\%20Apparel%3EJackets*&fq=country_code:US&fq=brand_exact:adidas&wt=python

++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++

On Aug 9, 2007, at 1:21 PM, Matthew Runo wrote: [snip]
Re: Best use of wildcard searches
On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
Feel free to run some queries yourself. We opened the firewall for this box...
http://66.209.92.171:8080/solr/select/?q=department_exact:Apparel%3EMen's\%20Apparel%3EJackets*&fq=country_code:US&fq=brand_exact:adidas&wt=python

OK, so this query is returning results, right? So what query isn't returning any results (but should) now?

-Yonik
Re: Best use of wildcard searches
http://66.209.92.171:8080/solr/select/?q=department_exact:Apparel%3EMen's%20Apparel%3EJackets*&fq=country_code:US&fq=brand_exact:adidas&wt=python

The same exact query, with... wait.. Wow. I'm making myself look like an idiot. I swear that these queries didn't work the first time I ran them... But now \ and ? give the same results, as would be expected, while an unescaped space returns nothing. I'm sorry for wasting your time, but I do appreciate the help!

++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++

On Aug 9, 2007, at 1:35 PM, Yonik Seeley wrote: [snip]
Re: Best use of wildcard searches
On 8/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
But now \ and ? give the same results, as would be expected, while an unescaped space returns nothing. I'm sorry for wasting your time, but I do appreciate the help!

lol - these things can happen when you get too many levels of escaping needed. Hopefully we can improve the situation in the future to get rid of the query parser escaping for certain queries such as prefix and term.

-Yonik
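The takeaway on the escaping layers (a sketch, not from the original thread): the Lucene query parser needs the space backslash-escaped inside the prefix term, and the HTTP layer then needs the whole q parameter URL-encoded. In Python 2, building query 2 above might look like:

import urllib

field = "department_exact"
prefix = "Apparel>Men's Apparel>Jackets"  # raw stored-value prefix

# Backslash-escape the space for the query parser ('>' is not a query
# parser special character, so it can be left alone).
escaped = prefix.replace(" ", "\\ ")
q = "%s:%s*" % (field, escaped)

url = "http://localhost:8080/solr/select/?q=" + urllib.quote(q) + "&wt=python"
print url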
Creating a document blurb when nothing is returned from highlight feature
Hi all, I'd like to provide a blurb of documents matching a search in the case when there is no text highlighted. I assumed that perhaps the highlighter would give me back the first few words in a document if this occurred, but it doesn't. My conundrum is that I'd rather not grab the whole document body field because some of them are large. Is there some way I can request from Lucene the first N words or lines from a field? Thanks. Ben
Re: Creating a document blurb when nothing is returned from highlight feature
On 8/9/07, Benjamin Higgins [EMAIL PROTECTED] wrote: [snip] Is there some way I can request from Lucene the first N words or lines from a field?

Hmmm, that does sound like a reasonable thing, and I guess it belongs in/with the highlighter functionality? Could you open a JIRA issue to track this?

-Yonik
Re: Creating a document blurb when nothing is returned from highlight feature
On 9-Aug-07, at 2:10 PM, Benjamin Higgins wrote: [snip] Is there some way I can request from Lucene the first N words or lines from a field?

The way I deal with this is that I modified the highlighter fragment scorer to return a positive (but low) score for the first few fragments of a doc. This will work, but tends not to provide great summaries and will definitely still fetch and process the entire doc contents.

The better way to do this is to generate a better general summary yourself and store it in a separate field; this can be used if no highlighting is generated (or, capability in Solr to automatically substitute a field in the case of no highlighting would be cool). I might even implement this if there is sufficient interest :). Unfortunately, the highlighter does not know (and really has no way of knowing) what parts of a doc matched, so it would still have to try highlighting first.

Note that you can control the CPU usage for long fields by setting hl.maxAnalyzedChars (will be in the next release).

best,
-Mike
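A minimal sketch of the separate-summary-field idea at index time (Python 2; the field names body/summary and the 50-word cutoff are invented for illustration, not from Mike's setup):

def make_blurb(text, max_words=50):
    # Naive fallback summary: the first N whitespace-delimited words.
    words = text.split()
    blurb = " ".join(words[:max_words])
    if len(words) > max_words:
        blurb += " ..."
    return blurb

full_text = open("doc.txt").read()
doc_fields = {
    "id": "42",
    "body": full_text,
    "summary": make_blurb(full_text),  # shown when highlighting returns nothing
}

The application then falls back to the stored summary field whenever the highlighter returns no snippet for a hit.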
Returning a list of matching words
This may be obvious but I can't get my head straight. Is there a way to return a list of matching words that a record got matched against? For instance:

record_a: ruby, solr, mysql, rails
record_b: solr, java

Then ?q=solr+OR+rails would return the matched words for the records:

record_a: solr, rails
record_b: solr

I'm not looking into using the highlight feature for that.

Thanks,

--
Thiago Jackiw
Multivalued fields and the 'copyField' operator
I'm adding a field to be the source of the spellcheck database. Since that is its only job, it has raw text lower-cased, de-Latin1'd, and de-duplicated. Since it is only for the spellcheck DB, it does not need to keep duplicates.

I specified it as multiValued=false and used copyField from a few other fields to populate it. The analyzer promptly blew up, claiming that I was putting multiple values in a single-valued field. I changed it to multiValued=true, but now it keeps separate copies of the different fields, which usually overlap.

Would it make sense for multiple copyField operations to work with a single-valued field? Since single-valued fields are a new feature in Solr, I assume these little corner cases have not come to light before. I defer to The Wisdom Of The Elder Council.

Thanks,
Lance
RE: Creating a document blurb when nothing is returned from highlight feature
Thanks Mike. I didn't think of creating a blurb beforehand, but that's a great solution. I'll probably do that. Yonik, I can still add a JIRA issue if you'd like, though.

Ben

-----Original Message-----
From: Mike Klaas [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 09, 2007 2:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Creating a document blurb when nothing is returned from highlight feature

[snip]
Is it possible to know where in the field highlighted text comes from?
Hi again, It'd be nice to know what the starting line number is for highlighted snippets. I imagine others might find it useful to know the starting byte offset. Is there an easy way to add this in? I'm not afraid of hacking the source if it's not too involved. Thanks. Ben
tomcat and solr multiple instances
Hi,

I have built 2 Solr instances - one is example and the other is ca_companies. The ca_companies Solr instance is working fine, but example is not working... In the admin page, /solr/admin, for the example instance, it shows that:

Cwd=/rpt/src/apache-solr-1.2.0/ca_companies/solr/conf -- this should be Cwd=/rpt/src/apache-solr-1.2.0/example/solr/conf
SolrHome=/rpt/src/apache-solr-1.2.0/example/solr/

Anyone know why? If I run Jetty for the example instance, it is working well...

Thanks,
Jae Joo
EmbeddedSolr and optimize
http://wiki.apache.org/solr/EmbeddedSolr

Following the example on connecting to the index directly without using HTTP, I tried to optimize by passing the true flag to the CommitUpdateCommand. When optimizing an index with Lucene directly, it doubles the size of the index temporarily and then deletes the old segments that were optimized. Instead, what happened was the old segments were still there. Calling optimize a second time did remove the old segments.

In Lucene it's usually:

writer.optimize();
writer.close();

So what method call do I need to make afterwards so I don't have to call optimize a second time with the Solr API?

public void index() {
    // do stuff
    while (loop) {
        // add millions of documents and commit at intervals
    }
    optimize();  // optimize to reduce file handles
    optimize();  // clean up old segments which still existed
    // WHAT SHOULD BE HERE INSTEAD OF ANOTHER OPTIMIZE?
}

public void commit() throws IOException {
    commit(false);
}

public void optimize() throws IOException {
    logger.info("Optimizing an index temporarily doubles size of index, "
        + "but reduces number of files");
    commit(true);
}

private static void commit(boolean optimize) throws IOException {
    UpdateHandler updateHandler = core.getUpdateHandler();
    CommitUpdateCommand commitcmd = new CommitUpdateCommand(optimize);
    updateHandler.commit(commitcmd);
}

Paul Sundling
RE: tomcat and solr multiple instances
Here are the Catalina/localhost/ files.

For example instance:

<Context docBase="/rpt/src/apache-solr-1.2.0/dist/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/rpt/src/apache-solr-1.2.0/example/solr" override="true" />
</Context>

For ca_companies instance:

<Context docBase="/rpt/src/apache-solr-1.2.0/dist/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/rpt/src/apache-solr-1.2.0/ca_companies/solr" override="true" />
</Context>

URLs:
http://host:8080/solr/admin -- pointing at the example instance (problem...)
http://host:8080/solr_ca/admin -- pointing at the ca_companies instance (it is working)

-----Original Message-----
From: Jae Joo [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 09, 2007 5:45 PM
To: solr-user@lucene.apache.org
Subject: tomcat and solr multiple instances

[snip]
Re: Returning a list of matching words
On 8/9/07, Thiago Jackiw [EMAIL PROTECTED] wrote:
This may be obvious but I can't get my head straight. Is there a way to return a list of matching words that a record got matched against?

Unfortunately no... Lucene doesn't provide that capability with standard queries. You could do it (slower) with additional queries of course:

q=solr OR rails&rows=5         // retrieves the top docs
q=solr OR rails&fq=solr&fl=id  // see which top docs match solr
q=solr OR rails&fq=rails&fl=id // see which top docs match rails

-Yonik
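A sketch of that three-query approach in Python 2 (the URL, field names, and word list are assumed from the example; wt=python makes the response eval()-able, which is fine for a quick local experiment):

import urllib

base = "http://localhost:8983/solr/select?wt=python&"

def query(params):
    # The python response writer returns a dict literal we can eval.
    return eval(urllib.urlopen(base + urllib.urlencode(params)).read())

top = query({"q": "solr OR rails", "rows": 5, "fl": "id"})
ids = [doc["id"] for doc in top["response"]["docs"]]

matched = {}
for word in ["solr", "rails"]:
    # Re-run the query filtered to one word; docs that survive matched it.
    rsp = query({"q": "solr OR rails", "fq": word, "fl": "id", "rows": 5})
    for doc in rsp["response"]["docs"]:
        matched.setdefault(doc["id"], []).append(word)

print matched  # e.g. {'record_a': ['solr', 'rails'], 'record_b': ['solr']}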
RE: Any clever ideas to inject into solr? Without http?
Jython is a Python interpreter implemented in Java. (I have a lot of Python code.)

Total throughput in the servlet is very sensitive to the total number of servlet sockets available vs. the number of CPUs. The different analyzers have very different performance. You might leave some data in the DB, instead of storing it all in the index.

Underlying this all, you have a sneaky network performance problem. Your successive posts do not reuse a TCP socket. Obvious: re-opening a new socket each post takes time. Not obvious: your server has sockets building up in TIME_WAIT state. (This means the sockets are shutting down. Having both ends agree to close the connection is metaphysically difficult. The TCP/IP spec even has a bug in this area.) Sockets building up can cause TCP resources to run low or run out. Your kernel configuration may be weak in this area.

Lance

-----Original Message-----
From: Kevin Holmes [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 09, 2007 8:13 AM
To: solr-user@lucene.apache.org
Subject: Any clever ideas to inject into solr? Without http?

[snip]
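The socket-reuse point is straightforward to act on from the loader script. A minimal Python 2 sketch (host, port, and the placeholder batch are assumptions, not from the original scripts) that keeps one HTTP/1.1 connection open for all posts:

import httplib

conn = httplib.HTTPConnection("localhost", 8983)  # one TCP socket, reused
headers = {"Content-Type": "text/xml; charset=utf-8"}

batches = ["<add><doc><field name='id'>1</field></doc></add>"]  # placeholder
for body in batches:
    conn.request("POST", "/solr/update", body, headers)
    resp = conn.getresponse()
    resp.read()  # drain the response so the connection can be reused
conn.close()

Each POST rides the same connection, so there is one handshake and one TIME_WAIT for the whole run instead of one per record.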
Re: Multivalued fields and the 'copyField' operator
On 8/9/07, Lance Norskog [EMAIL PROTECTED] wrote: I'm adding a field to be the source of the spellcheck database. Since that is its only job, it has raw text lower-cased, de-Latin1'd, and de-duplicated. Since it is only for the spellcheck DB, it does not need to keep duplicates. Duplicate token values (words) or duplicate field values? Could you give some examples? -Yonik
Re: tomcat and solr multiple instances
The current working directory (Cwd) is the directory from which you started the Tomcat server and is not dependent on the Solr instance configurations. So as long as SolrHome is correct for each Solr instance, you shouldn't have a problem.

cheers,
Piete

On 10/08/07, Jae Joo [EMAIL PROTECTED] wrote: [snip]
Re: Creating a document blurb when nothing is returned from highlight feature
It should probably be configurable: (1) return nothing if no match, (2) substitute with an alternate field, (3) return first sentence or N number of tokens.

-Sean

Yonik Seeley wrote on 8/9/2007, 5:50 PM:

On 8/9/07, Benjamin Higgins [EMAIL PROTECTED] wrote:
Thanks Mike. I didn't think of creating a blurb beforehand, but that's a great solution. I'll probably do that. Yonik, I can still add a JIRA issue if you'd like, though.

Always 10 different ways to tackle the same problem in the search space, and that's why it's great to have a lot of people around for different ideas/approaches. I do think opening a JIRA issue would be worth it, even if Mike's approach yields superior results. It seems like a reasonable expectation to always get something back as a document summary without having to create a specific field for that.

-Yonik
Re: Any clever ideas to inject into solr? Without http?
On Thu, 9 Aug 2007 15:23:03 -0700, Lance Norskog [EMAIL PROTECTED] wrote:
Underlying this all, you have a sneaky network performance problem. Your successive posts do not reuse a TCP socket. [snip]

Good point. And putting my pedantic hat on here, it may not necessarily be 'kernel configuration', but network stack - not sure what OS the OP is using.

B

_
{Beto|Norberto|Numard} Meijome

All parts should go together without forcing. You must remember that the parts you are reassembling were disassembled by you. Therefore, if you can't get them together again, there must be a reason. By all means, do not use hammer. IBM maintenance manual, 1975

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
RE: Multivalued fields and the 'copyField' operator
If we have a field spellcheck_db, and have two copyField lines for it:

<fieldType name="spellcheck" ... /> <!-- basically the text type without stemming -->

<field name="title" type="string" />
<field name="description" type="string" />
<field name="spellcheck_db" multiValued="false" type="spellcheck" indexed="true" stored="false" required="true" />

<copyField source="title" dest="spellcheck_db" />
<copyField source="description" dest="spellcheck_db" />

All I want to do is make a pile of words as input to the spellcheck feature. If I index with this, the spellcheck analyzer class complains that I'm putting two values in a multiValued=false field. Since I have to make it multiValued, the same word in successive values is not collapsed into one mention of the word.

I suppose this is an 'out' case, and not worth any major internal rework.

Thanks for your time,
Lance

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley
Sent: Thursday, August 09, 2007 5:28 PM
To: solr-user@lucene.apache.org
Subject: Re: Multivalued fields and the 'copyField' operator

[snip]
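One client-side workaround (a sketch of an alternative, not something copyField does for you): build the spellcheck_db value yourself by concatenating and de-duplicating the source fields before posting the document, so the field can stay single-valued. Field names follow the schema above; the example values are made up:

def spellcheck_value(*field_values):
    seen = []  # de-duplicated words, original order preserved
    for value in field_values:
        for word in value.lower().split():
            if word not in seen:
                seen.append(word)
    return " ".join(seen)

title = "Cirque du Soleil"
description = "Tickets for Cirque du Soleil in Las Vegas"
doc = {
    "title": title,
    "description": description,
    "spellcheck_db": spellcheck_value(title, description),
}
print doc["spellcheck_db"]  # "cirque du soleil tickets for in las vegas"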
Re: [newbie] how to debug the schema?
Good day,

danc86 of #lucene gave me the answer - I was not storing the fields :-)

Thanks,
Franz

On 8/9/07, Ryan McKinley [EMAIL PROTECTED] wrote:
[QUESTION] What could be the problem? Or what else can I do to debug this problem?

In general 'luke' is a great tool to figure out what may be happening in the index. (assuming you are running 1.2) Check your schema fields from: http://localhost:8983/solr/admin/luke?show=schema

Using http://www.getopt.org/luke/ is also very useful.
RE: Best use of wildcard searches
Maybe there's a different way, in which path-like values like this are treated explicitly. I use a similar approach to Matthew at www.colfes.com, where all pages are generated from Lucene searches according to filters on a couple of hierarchical categories ('spaces'), i.e. subject and organisational unit. From that experience, a few things occur to me here:

1. The structure of any particular category/space is not immediately derivable from data, so unless we're Google or doing something RDF-like they're something you define up front. For this reason, and because it makes internationalisation easier, I feel you should model this kind of standing data independently of its representation. So instead of searching for Departments>Men's Apparel>Jackets, I index (and search for) a String /departments/mensapparel/jackets/, and use a simple standing-data mapping to resolve each of the nodes along the path to a human-readable form when necessary. In my case, the values for any particular resource (e.g. a news article) are defined by CMS users from drop-downs.

2. In my Lucene library, I redundantly indexed paths like /departments/mensapparel/jackets/ into successive fragments, together with the whole path value:

/departments
/departments/mensapparel
/departments/mensapparel/jackets
/departments/mensapparel/jackets/

using my own PathAnalyzer (extends Analyzer, of course), which makes it very fast to query on path fragments: all goods anywhere in the men's apparel section - query on /departments/mensapparel; all goods categorised as exactly in the men's apparel section - query on /departments/mensapparel/. I implemented all queries like this as filters, and cached the filter definitions. I guess Solr's query optimisation and filter caching do all this out of the box, so it may end up being just as fast to use the kind of PrefixQuery suggested in this thread.

3. However, I can post/attach/donate PathAnalyzer if anyone thinks it might still be useful. I started off calling it HierarchyValueAnalyzer, then TreeNodePathAnalyzer, but now that it's PathAnalyzer I can't help thinking it might have lots of applications.

Jon

-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED]]
Sent: 09 August 2007 21:50
To: solr-user@lucene.apache.org
Subject: Re: Best use of wildcard searches

[snip]
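The fragment expansion Jon describes is simple to reproduce outside a custom Analyzer too. A minimal Python 2 sketch over his example path (the trailing-slash convention marks the exact level, as in point 2):

def path_fragments(path):
    # "/a/b/c/" -> ["/a", "/a/b", "/a/b/c", "/a/b/c/"]
    parts = [p for p in path.split("/") if p]
    frags = ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]
    if path.endswith("/"):
        frags.append(frags[-1] + "/")  # exact-level marker
    return frags

print path_fragments("/departments/mensapparel/jackets/")
# ['/departments', '/departments/mensapparel',
#  '/departments/mensapparel/jackets', '/departments/mensapparel/jackets/']

Indexing every fragment as its own token turns "everything under men's apparel" into a single cheap term query instead of a prefix query.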