Re: combining xml and nutch index in solr
Hi, thanks. That's exactly what I want. As far as I know we cannot update a Solr index with partial values: the index record is not updated in place, it gets recreated. So I'm not sure how the solrindex command will work here.

--
View this message in context: http://lucene.472066.n3.nabble.com/combining-xml-and-nutch-index-in-solr-tp3209911p3218125.html
Sent from the Solr - User mailing list archive at Nabble.com.
xpath expression not working
Hi, I have an XML doc which I would like to index using the XPathEntityProcessor:

<add>
  <doc><id>1</id><details>xyz</details></doc>
  <doc><id>2</id><details>xyz2</details></doc>
</add>

If I want to load only the document with id=2, how would that work? I tried an XPath expression that works in XPath tools but not in Solr:

<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <entity name="f" processor="FileListEntityProcessor" baseDir="c:\temp"
            fileName="promotions.xml" recursive="false" rootEntity="false" dataSource="null">
      <entity name="x" processor="XPathEntityProcessor" forEach="/add/doc"
              url="${f.fileAbsolutePath}" pk="id">
        <field column="id" xpath="/add/doc/[id=2]/id" />
      </entity>
    </entity>
  </document>
</dataConfig>

Any help on how I can do this?

--
View this message in context: http://lucene.472066.n3.nabble.com/xpath-expression-not-working-tp3218133p3218133.html
Sent from the Solr - User mailing list archive at Nabble.com.
SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.ICUTokenizerFactory'
I am using Solr 3.3 on a Windows box. I want to use solr.ICUTokenizerFactory in my schema.xml and added the fieldType name="text_icu" as per the URL http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory

I also added the following files to my apache-solr-3.3.0\example\lib folder:

lucene-icu-3.3.0.jar
lucene-smartcn-3.3.0.jar
icu4j-4_8.jar
lucene-stempel-3.3.0.jar

When I start my Solr server from the apache-solr-3.3.0\example folder (java -jar start.jar) I get the following errors:

SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.ICUTokenizerFactory'
SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer filter list
SEVERE: org.apache.solr.common.SolrException: Unknown fieldtype 'text_icu' specified on field subject

I tried adding various other jar files to the lib folder but it does not help. What am I doing wrong?

Satish
Solr 3.3 crashes after ~18 hours?
Hello folks, I'm using the latest stable Solr release, 3.3, and I'm encountering a strange phenomenon with it. After about 19 hours it just crashes, but I can't find anything in the logs: no exceptions, no warnings, no suspicious info entries. I have an index job running from 6am to 8pm every 10 minutes, with a commit after each job. An optimize job runs twice a day, at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong, or where to look for further debug info? Regards and thank you, alex
Re: Solr 3.3 crashes after ~18 hours?
Any JAVA_OPTS set? Do not use -XX:+OptimizeStringConcat or -XX:+AggressiveOpts flags. Am 02.08.2011 12:01, schrieb alexander sulz: Hello folks, I'm using the latest stable Solr release - 3.3 and I encounter strange phenomena with it. After about 19 hours it just crashes, but I can't find anything in the logs, no exceptions, no warnings, no suspicious info entries.. I have an index-job running from 6am to 8pm every 10 minutes. After each job there is a commit. An optimize-job is done twice a day at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong or where to look for further debug info? regards and thank you alex
Re: Solr 3.3 crashes after ~18 hours?
Nope, none :/ Am 02.08.2011 12:33, schrieb Bernd Fehling: Any JAVA_OPTS set? Do not use -XX:+OptimizeStringConcat or -XX:+AggressiveOpts flags. Am 02.08.2011 12:01, schrieb alexander sulz: Hello folks, I'm using the latest stable Solr release - 3.3 and I encounter strange phenomena with it. After about 19 hours it just crashes, but I can't find anything in the logs, no exceptions, no warnings, no suspicious info entries.. I have an index-job running from 6am to 8pm every 10 minutes. After each job there is a commit. An optimize-job is done twice a day at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong or where to look for further debug info? regards and thank you alex
performance crossover between single index and sharding
Is there any knowledge on this list about the performance crossover between a single index and sharding, i.e. when to change from a single index to sharding? E.g. "if index size is larger than 150GB and the number of docs is more than 25 million, then it is better to change from a single index to two shards", or something like this. Sure, Solr might even handle 50 million docs, but performance goes down, and a sharded system with distributed search will be faster than a single index, or not? Is a single index always faster than sharding? Regards Bernd
Re: Solr 3.3 crashes after ~18 hours?
Strange, anything out of the ordinary in the syslog? On Tuesday 02 August 2011 12:01:35 alexander sulz wrote: Hello folks, I'm using the latest stable Solr release - 3.3 and I encounter strange phenomena with it. After about 19 hours it just crashes, but I can't find anything in the logs, no exceptions, no warnings, no suspicious info entries.. I have an index-job running from 6am to 8pm every 10 minutes. After each job there is a commit. An optimize-job is done twice a day at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong or where to look for further debug info? regards and thank you alex -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: Solr 3.3 crashes after ~18 hours?
What do you mean by "it just crashes"? Does the process stop executing? Does it take too long to respond, which might result in lots of 503s in your application? Does the system run out of resources? Are you indexing and serving from the same server? It happened once with us that Solr was performing a commit and then an optimize while the load from the app server was at its peak. This caused slow responses from the search server, which caused requests to stack up at the app server, causing 503s. Could you check whether you have a similar syndrome?

*Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny

On Tue, Aug 2, 2011 at 15:31, alexander sulz a.s...@digiconcept.net wrote: Hello folks, I'm using the latest stable Solr release - 3.3 and I encounter strange phenomena with it. After about 19 hours it just crashes, but I can't find anything in the logs, no exceptions, no warnings, no suspicious info entries.. I have an index-job running from 6am to 8pm every 10 minutes. After each job there is a commit. An optimize-job is done twice a day at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong or where to look for further debug info? regards and thank you alex
RE: changing the root directory where solrCloud stores info inside zookeeper File system
Thanks a lot Mark. Since my SolrCloud code was old, I tried downloading and building the newest code from https://svn.apache.org/repos/asf/lucene/dev/trunk/

I am using Tomcat 6. I manually created the sc sub-directory in my ZooKeeper ensemble file system, and I used this connection string to my ZK ensemble: zook1:2181/sc,zook2:2181/sc,zook3:2181/sc

But I still get the same problem. Here is the entire catalina.out log with the exception:

Using CATALINA_BASE: /opt/tomcat6
Using CATALINA_HOME: /opt/tomcat6
Using CATALINA_TMPDIR: /opt/tomcat6/temp
Using JRE_HOME: /usr/java/default/
Using CLASSPATH: /opt/tomcat6/bin/bootstrap.jar
Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory (errno = 12).
Aug 2, 2011 4:28:46 AM org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/jdk1.6.0_21/jre/lib/amd64/server:/usr/java/jdk1.6.0_21/jre/lib/amd64:/usr/java/jdk1.6.0_21/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
Aug 2, 2011 4:28:46 AM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8983
Aug 2, 2011 4:28:46 AM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8080
Aug 2, 2011 4:28:46 AM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 448 ms
Aug 2, 2011 4:28:46 AM org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
Aug 2, 2011 4:28:46 AM org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.29
Aug 2, 2011 4:28:46 AM org.apache.catalina.startup.HostConfig deployDescriptor
INFO: Deploying configuration descriptor solr1.xml
Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader locateSolrHome
INFO: Using JNDI solr.home: /home/tomcat/solrCloud1
Aug 2, 2011 4:28:46 AM
org.apache.solr.core.SolrResourceLoader init INFO: Solr home set to '/home/tomcat/solrCloud1/' Aug 2, 2011 4:28:46 AM org.apache.solr.servlet.SolrDispatchFilter init INFO: SolrDispatchFilter.init() Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: Using JNDI solr.home: /home/tomcat/solrCloud1 Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer$Initializer initialize INFO: looking for solr.xml: /home/tomcat/solrCloud1/solr.xml Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer init INFO: New CoreContainer 853527367 Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: Using JNDI solr.home: /home/tomcat/solrCloud1 Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader init INFO: Solr home set to '/home/tomcat/solrCloud1/' Aug 2, 2011 4:28:46 AM org.apache.solr.cloud.SolrZkServerProps getProperties INFO: Reading configuration from: /home/tomcat/solrCloud1/zoo.cfg Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer initZooKeeper INFO: Zookeeper client=zook1:2181/sc,zook2:2181/sc,zook3:2181/sc Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:zookeeper.version=3.3.1-942149, built on 05/07/2010 17:14 GMT Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:host.name=ob1079.nydc1.outbrain.com Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.version=1.6.0_21 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.vendor=Sun Microsystems Inc. 
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.home=/usr/java/jdk1.6.0_21/jre Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.class.path=/opt/tomcat6/bin/bootstrap.jar Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.library.path=/usr/java/jdk1.6.0_21/jre/lib/amd64/server:/ usr/java/jdk1.6.0_21/jre/lib/amd64:/usr/java/jdk1.6.0_21/jre/../lib/amd64: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.io.tmpdir=/opt/tomcat6/temp Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.compiler=NA Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:os.name=Linux Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:os.arch=amd64 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:os.version=2.6.18-194.8.1.el5 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:user.name=tomcat Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:user.home=/home/tomcat Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client
Re: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.ICUTokenizerFactory'
Did you add the analysis-extras jar itself? That's what has this factory.

On Tue, Aug 2, 2011 at 5:03 AM, Satish Talim satish.ta...@gmail.com wrote: I am using Solr 3.3 on a Windows box. I want to use the solr.ICUTokenizerFactory in my schema.xml and added the fieldType name=text_icu as per the URL - http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory I also added the following files to my apache-solr-3.3.0\example\lib folder: lucene-icu-3.3.0.jar lucene-smartcn-3.3.0.jar icu4j-4_8.jar lucene-stempel-3.3.0.jar When I start my Solr server from apache-solr-3.3.0\example folder: java -jar start.jar I get the following errors: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.ICUTokenizerFactory' SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer filter list SEVERE: org.apache.solr.common.SolrException: Unknown fieldtype 'text_icu' specified on field subject I tried adding various other jar files to the lib folder but it does not help. What am I doing wrong? Satish

--
lucidimagination.com
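In Solr 3.x the ICUTokenizerFactory class lives in the analysis-extras contrib, so besides the ICU and Lucene jars you also need that contrib jar on the classpath, e.g. via lib directives in solrconfig.xml. The paths below assume the stock 3.3 example layout and are illustrative; adjust them to your install:

```xml
<!-- solrconfig.xml: load the contrib that contains solr.ICUTokenizerFactory,
     plus its bundled lucene/icu dependencies; dirs are relative to the core's instanceDir -->
<lib path="../../dist/apache-solr-analysis-extras-3.3.0.jar" />
<lib dir="../../contrib/analysis-extras/lib" />
<lib dir="../../contrib/analysis-extras/lucene-libs" />
```

Alternatively, copying the same jars into example/lib (as the original post attempted with the other jars) should also work.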
indexing taking very long time
Hi, we have a requirement where we index all the messages of a thread; a thread may have attachments too. We add these to Solr for indexing and searching, in order to apply a few business rules. A user may have many threads (100k or so), and each thread may have 10-20 messages. We are finding that it takes 30 minutes to index all the threads. After we run optimize, indexing is faster. The question is: how frequently should this optimize be called, and when? Please note that we follow a batch commit strategy (commit is called after every 10k threads); we are not calling commit after every doc. Secondly, how can we use multithreading from the Solr perspective in order to improve JVM and resource utilization? Thanks Naveen
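The batch-commit strategy Naveen describes, combined with client-side multithreading, can be sketched as follows. This is only an illustration of the pattern: send_batch is a hypothetical stand-in for the actual HTTP POST of documents to Solr's /update handler, and the batch size and worker count are arbitrary tuning knobs, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 1000  # docs per update request; far fewer commits than one per doc

def send_batch(batch):
    # Hypothetical stand-in for posting an <add> of these docs to /update.
    # A real client would serialize the docs and issue the HTTP request here.
    return len(batch)

def index_all(docs, workers=4):
    # Slice the document stream into batches and post them from several threads.
    batches = [docs[i:i + BATCH_SIZE] for i in range(0, len(docs), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sent = sum(pool.map(send_batch, batches))
    # Issue a single commit here (or one every N batches), never per document.
    return sent
```

The key points are that Solr's update handler accepts concurrent requests, and that commits (and especially optimizes) are expensive, so they belong at the end of a batch run rather than inside it.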
DIH + signature
Hi, I'm using Solr 3.3 and want to add a signature field to Solr, to later be able to deduplicate search results using field collapsing. I'm using DIH to fill Solr. Extract from solrconfig.xml:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">ctcontent</str>
    <str name="signatureClass">solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

In the schema.xml there is:

<field name="signature" type="string" indexed="true" stored="true" multiValued="false" />

and

<field name="ctcontent" type="text_nl_splitting" indexed="true" stored="true" termVectors="on" termPositions="on" termOffsets="on"/>

When I run a full-import, however, the signature field remains empty. Any insight on what I'm doing wrong would be greatly appreciated! Kind regards, Jo

--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-signature-tp3218813p3218813.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 3.3 crashes after ~18 hours?
Monitor your memory usage. I used to encounter a problem like this where nothing was in the logs and the process was just gone. It turned out my system was out of memory, and swap got used up because of another process, which then forced the kernel to start killing off processes. Google "OOM linux" and you will find plenty of other programs and people with a similar problem. Cameron

On Aug 2, 2011 6:02 AM, alexander sulz a.s...@digiconcept.net wrote: Hello folks, I'm using the latest stable Solr release - 3.3 and I encounter strange phenomena with it. After about 19 hours it just crashes, but I can't find anything in the logs, no exceptions, no warnings, no suspicious info entries.. I have an index-job running from 6am to 8pm every 10 minutes. After each job there is a commit. An optimize-job is done twice a day at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong or where to look for further debug info? regards and thank you alex
Re: Solr 3.3 crashes after ~18 hours?
Assuming you are running on Linux, you might want to check /var/log/messages too (the location might vary); I think the kernel logs forced process terminations there. I recall that the kernel usually picks the process consuming the most memory, though there may be other factors involved too. François

On Aug 2, 2011, at 9:04 AM, wakemaster 39 wrote: Monitor your memory usage. I use to encounter a problem like this before where nothing was in the logs and the process was just gone. Turned out my system was out odd memory and swap got used up because of another process which then forced the kernel to start killing off processes. Google OOM linux and you will find plenty of other programs and people with a similar problem. Cameron On Aug 2, 2011 6:02 AM, alexander sulz a.s...@digiconcept.net wrote: Hello folks, I'm using the latest stable Solr release - 3.3 and I encounter strange phenomena with it. After about 19 hours it just crashes, but I can't find anything in the logs, no exceptions, no warnings, no suspicious info entries.. I have an index-job running from 6am to 8pm every 10 minutes. After each job there is a commit. An optimize-job is done twice a day at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong or where to look for further debug info? regards and thank you alex
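A quick way to confirm the OOM-killer theory is to grep the kernel log on the crashed box. The log path varies by distro (/var/log/messages, /var/log/syslog, or dmesg output), and the sample log line below is fabricated purely so the command has something to match:

```shell
# Simulated /var/log/messages excerpt; on a real box grep the actual log file.
cat > /tmp/messages.sample <<'EOF'
Aug  2 07:55:01 host kernel: Out of memory: Killed process 4242 (java) total-vm:4194304kB
EOF
grep -i "killed process" /tmp/messages.sample
```

If a line like this names your Solr JVM, the crash is the kernel reclaiming memory, not a Solr bug, and the fix is to reduce heap sizes or add RAM/swap.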
Re: Different options for autocomplete/autosuggestion
You have to tell us more about what "not right" means. Please review: http://wiki.apache.org/solr/UsingMailingLists Best Erick

On Wed, Jul 27, 2011 at 6:12 AM, scorpking lehoank1...@gmail.com wrote:

Hi Bell, I used autocomplete in Solr 3.1, like this:

<searchComponent name="autocomplete" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">autocomplete</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
    <str name="field">autocomplete</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

and I followed this URL http://solr.pl/en/2010/11/15/solr-and-autocomplete-part-2/ to index my data, and had a problem: with one word it works very well, but when I type two or more words the results returned are not right. I don't know why. Can anyone explain this problem? Thanks for your help.

--
View this message in context: http://lucene.472066.n3.nabble.com/Different-options-for-autocomplete-autosuggestion-tp2678899p3203032.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Master-slave master failover without data loss
Not OOB. You say that the index updates, but if the data hasn't been committed, it isn't really in the index. After the commit (which varies time-wise depending on merges etc.) the next replication from the slave should get the new index, regardless of whether the master has gone down or not. One way to handle this issue is to re-index data from some time before the master went down, relying on the uniqueKey to replace any duplicate documents Best Erick On Wed, Jul 27, 2011 at 10:43 AM, Nagendraprasad nagu.nutalap...@gmail.com wrote: Suppose master goes down immediately after the index updates, while the updates haven't been replicated to the slaves, data loss seems to happen. Does Solr have any mechanism to deal with that? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Master-slave-master-failover-without-data-loss-tp3203644p3203644.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH + signature
Follow-up on this issue: I eventually found the problem. The naming scheme changed from Solr 3.2 onwards. The line as it stands in the documentation:

<str name="update.processor">dedupe</str>

should now be:

<str name="update.chain">dedupe</str>

https://issues.apache.org/jira/browse/SOLR-2105

--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-signature-tp3218813p3218979.html
Sent from the Solr - User mailing list archive at Nabble.com.
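Put together with the handler from the original post, the fixed /dataimport declaration would look like this (same handler and chain names as in the thread; only the parameter name changes):

```xml
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <!-- was update.processor before the SOLR-2105 rename in 3.2 -->
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>
```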
Re: xpath expression not working
Hi abhayd, XPathEntityProcessor only supports a subset of XPath, e.g. div[@id=2] but not [id=2]. Take a look at https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose

I solved this problem by using XSLT as a preprocessor (with full XPath support). The drawback is a performance cost: see http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html

Best regards Karsten

-------- Original message --------
Date: Mon, 1 Aug 2011 23:21:45 -0700 (PDT)
From: abhayd ajdabhol...@hotmail.com
To: solr-user@lucene.apache.org
Subject: xpath expression not working

Hi, I have an XML doc which I would like to index using the XPathEntityProcessor:

<add>
  <doc><id>1</id><details>xyz</details></doc>
  <doc><id>2</id><details>xyz2</details></doc>
</add>

If I want to load only the document with id=2, how would that work? I tried an XPath expression that works in XPath tools but not in Solr:

<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <entity name="f" processor="FileListEntityProcessor" baseDir="c:\temp"
            fileName="promotions.xml" recursive="false" rootEntity="false" dataSource="null">
      <entity name="x" processor="XPathEntityProcessor" forEach="/add/doc"
              url="${f.fileAbsolutePath}" pk="id">
        <field column="id" xpath="/add/doc/[id=2]/id" />
      </entity>
    </entity>
  </document>
</dataConfig>

Any help on how I can do this?

--
View this message in context: http://lucene.472066.n3.nabble.com/xpath-expression-not-working-tp3218133p3218133.html
Sent from the Solr - User mailing list archive at Nabble.com.
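A sketch of the XSLT route Karsten describes: XPathEntityProcessor accepts an xsl attribute, so a stylesheet can pre-filter the records before the (limited) streaming XPath sees them. The stylesheet file name and its contents below are illustrative assumptions, not part of the original thread:

```xml
<!-- data-config.xml: a hypothetical filter.xsl strips every <doc> except id=2,
     so the limited forEach XPath only ever sees the wanted record -->
<entity name="x" processor="XPathEntityProcessor"
        url="${f.fileAbsolutePath}"
        xsl="c:\temp\filter.xsl"
        forEach="/add/doc" pk="id">
  <field column="id" xpath="/add/doc/id" />
</entity>
```

As Karsten notes, running the whole file through XSLT costs performance, so for large inputs filtering upstream (before DIH) may be preferable.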
Re: Store complete XML record (DIH XPathEntityProcessor)
Hi g, hi Chantal, I had the same problem. You can use XPathEntityProcessor, but you have to insert an XSL stylesheet. The drawback is a performance cost: see http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html

Best regards Karsten

-------- Original message --------
Date: Mon, 1 Aug 2011 12:17:45 +0200
From: Chantal Ackermann chantal.ackerm...@btelligent.de
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Subject: Re: Store complete XML record (DIH XPathEntityProcessor)

Hi g, ok, I understand your problem now. (Sorry for answering this late.) I don't think PlainTextEntityProcessor can help you; it does not take a regex. LineEntityProcessor does, but your record elements probably do not each come on their own line, and you wouldn't want to depend on that anyway. I guess you would be best off writing your own entity processor, maybe by extending XPath EP if that gives you some advantage. You can of course also implement your own importer using SolrJ and your favourite XML parser framework, or any other programming language. If you are looking for a config-only solution, I'm not sure that there is one. Someone else might be able to comment on that? Cheers, Chantal

On Thu, 2011-07-28 at 19:17 +0200, solruser@9913 wrote: Thanks Chantal, I am OK with the second call and I already tried using that. Unfortunately it reads the whole file into a field. My file is as in the example below:

<xml>
  <record> ... </record>
  <record> ... </record>
  <record> ... </record>
</xml>

Now the XPath does the 'for each /record' part. For each record I also need to store the raw log in there. If I use the PlainTextEntityProcessor then it gives me the whole file (from <xml> to </xml>) and not each <record>...</record>. Am I using the PlainTextEntityProcessor wrong?
Thanks, g

--
View this message in context: http://lucene.472066.n3.nabble.com/Store-complete-XML-record-DIH-XPathEntityProcessor-tp3205524p3207203.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Matching queries on a per-element basis against a multivalued field
You have a few choices: 1) flatten your field structure - like your undesirable example, but wouldn't you want to have the document identifier as a field value also? 2) use phrase queries to make sure the key/value pairs are adjacent 3) use a join query That's all I can think of -Mike On 08/01/2011 08:08 PM, Suk-Hyun Cho wrote: I'm sure someone asked this before, but I couldn't find a previous post regarding this. The problem: Let's say that I have a multivalued field called myFriends that tokenizes on whitespaces. Basically, I'm treating it like a List of Lists (attributes of friends): Document A: myFriends = [ isCool=true SOME_JUNK_HERE gender=male bloodType=A ] Document B: myFriends = [ isCool=true SOME_JUNK_HERE gender=female bloodType=O, isCool=false SOME_JUNK_HERE gender=male bloodType=AB ] Now, let's say that I want to search for all the cool male friends I have. Naively, I can query q=myFriends:isCool=true+AND+myFriends:gender=male. However, this returns documents A and B, because the two criteria are tested against the entire collection, rather than against individual elements. I could work around this by not tokenizing on whitespaces and using wildcards: q=myFriends:isCool=true\ *\ gender=male but this becomes painful when the query becomes more complex. What if I wanted to find cool friends who are either type A or type O? I could do q=myFriends:(isCool=true\ *\ bloodType=A+OR+isCool=true\ *\ bloodType=O). And you can see that the number of criteria will just explode as queries get more complex. There are other methods that I've considered, such as duplicating documents for every friend, like so: Document A1: myFriend = [ isCool=true, gender=male, bloodType=A ] Document B1: myFriend = [ isCool=true, gender=female, bloodType=O ] Document B2: myFriend = [ isCool=false, gender=male, bloodType=AB ] but this would be less than desirable. 
I would like to hear any other ideas for solving this problem, but going back to the original question: is there a way to match multiple criteria on a per-element basis rather than against the entire multivalued field?

--
View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3217432.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: performance crossover between single index and sharding
On 8/2/2011 4:44 AM, Bernd Fehling wrote: Is there any knowledge on this list about the performance crossover between a single index and sharding and when to change from a single index to sharding? E.g. if index size is larger than 150GB and num of docs is more than 25 mio. then it is better to change from single index to sharding and have two shards. Or something like this... Sure, solr might even handle 50 mio. docs but performance is going down and a sharded system with distributed search will be faster than a single index, or not? The answer I've always seen here boils down to it depends on a large number of variables unique to every situation. The nature of your data will affect things, like the number of fields, number of unique terms per field, etc. If you have really complicated queries, that will slow things down. Probably the greatest limiting factor is memory. Having enough free memory to fit the entire index into the operating system's disk cache is the best thing you can do for performance. This is memory over and above whatever you give to your Java heap. If you have a 150GB index and you can afford machines with at least 192GB of RAM, a single index would perform very well, once it is warmed up. Performance on a cold index would not be very good. In a sharded scenario, you want to try and size each machine so that its piece fits into RAM. Next would be disk I/O. Any data that won't fit in the disk cache must be retrieved from disk, which is typically the weakest link in the chain. If you can put your index on solid state disks, that's almost as good as having the index entirely in memory. Performance on a cold index with SSD would be incredible. Having a lot of high speed CPU available will help, but not as much as memory and I/O. Index rebuild time is another consideration that might lead you to go distributed, as long as your data source can keep up with multiple readers. My own index is too big to fit in RAM, even sharded. 
Each of the six large shards is getting close to 19GB. Each machine has 14GB of RAM (it's a virtual environment with three large shards per physical host) and has 3GB allocated to Java. I am in the process of upgrading the memory, at which point it will fit, but our growth will exceed the maximum server memory again in the next year or so. I have plans to eliminate the virtualization and have three shards in cores on each server. I know this isn't really what you were looking for, but there are no simple answers to your question. Thanks, Shawn
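Shawn's sizing reasoning reduces to simple arithmetic: does the index data on each host fit in the RAM left over for the OS disk cache after the JVM heaps take their share? A deliberately simplistic helper, with the numbers taken from the post (19GB per shard, three shards per 14GB host, 3GB heap each; and the hypothetical 150GB index on a 192GB machine with an assumed 8GB heap):

```python
def host_fits(index_on_host_gb, ram_gb, total_heap_gb):
    """True if this host's slice of the index fits in the OS disk cache,
    i.e. total RAM minus the memory claimed by the JVM heap(s)."""
    disk_cache_gb = ram_gb - total_heap_gb
    return index_on_host_gb <= disk_cache_gb

# Shawn's virtual hosts: 3 shards * ~19GB = 57GB of index,
# 14GB RAM - 3 * 3GB heap = 5GB of cache -> does not fit.
# The 150GB-index / 192GB-RAM example from the post (heap assumed 8GB) does fit.
```

The model ignores everything else competing for the cache (OS, other processes, merge activity), so treat a "fits" answer as a necessary condition, not a sufficient one.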
How to cut off hits with score below threshold?
Hello, If one wanted to cut off hits whose score is below some threshold (I know, I know, one doesn't typically want to do this), what are the most elegant options? I can think of 2 options, but I wonder if there are better choices: 1) custom Collector (problem: one can't specify a custom Collector via an API, so one would have to modify Solr source code) 2) custom SearchComponent that filters hits with score threshold (problem: if hits are removed from results then too few hits will be returned to the client, so one has to either request more rows from Solr or re-request more hits or do both to avoid this problem) Is there something better one can do? Thanks, Otis Sematext is hiring Search Engineers -- http://sematext.com/about/jobs.html
Re: Matching queries on a per-element basis against a multivalued field
Hi Suk-Hyun Cho, if myFriend is the unit of retrieval, you should use it as the Lucene document, with the fields isCool, gender, bloodType, and so on. If you really want to put all myFriends into one field, as in your

myFriends = [ isCool=true SOME_JUNK_HERE gender=female bloodType=O, isCool=false SOME_JUNK_HERE gender=male bloodType=AB ]

example, you can use SpanQueries: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/ With SpanNotQuery you can search for all "isCool true" and "gender male" where no other isCool occurs between the two phrases. Best regards Karsten

P.S. see in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-td3217432.html
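To see concretely why the naive AND over-matches, and what per-element semantics should return instead, here is a small illustration of the two matching rules (plain Python modeling the semantics with substring checks, not Solr internals):

```python
def matches_whole_field(values, criteria):
    # Naive multivalued-field semantics: each criterion may be satisfied
    # by a DIFFERENT element of the field.
    return all(any(c in v for v in values) for c in criteria)

def matches_per_element(values, criteria):
    # Desired semantics: a single element must satisfy ALL criteria at once.
    return any(all(c in v for c in criteria) for v in values)

# Document B from the thread: a cool female friend and an uncool male friend.
doc_b = ["isCool=true SOME_JUNK_HERE gender=female bloodType=O",
         "isCool=false SOME_JUNK_HERE gender=male bloodType=AB"]
criteria = ["isCool=true", "gender=male"]
```

Document B matches the whole-field rule (one friend is cool, another is male) but has no single cool male friend, which is exactly the false positive the original post describes; SpanQueries, join queries, or flattening into per-friend documents are ways to get the second rule.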
RE: Spell Check
The most likely problem is forgetting to specify spellcheck.build=true on the first query since the last restart. This builds the spellcheck dictionary used by the IndexBasedSpellChecker. You should put this in a warming query, or alternatively specify build-on-commit or build-on-optimize. It also looks like <str name="queryAnalyzerFieldType">textSpell</str> should probably be <str name="queryAnalyzerFieldType">textSpellPhrase</str>. Finally, if you've done a build and changing the query analyzer field type doesn't help, then you have to wonder if "dizeagar" exists somewhere in your data. If the keyword exists in the spelling dictionary, Solr's spellchecker will not try to correct it. See https://issues.apache.org/jira/browse/SOLR-2585 for a potential solution to this problem.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: tamanjit.bin...@yahoo.co.in [mailto:tamanjit.bin...@yahoo.co.in]
Sent: Tuesday, August 02, 2011 12:30 AM
To: solr-user@lucene.apache.org
Subject: Spell Check

Hi all, I'm facing some issues with Solr spellcheck. I had an index-based dictionary built. My changes to solrconfig.xml are:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="name">locSpell</str>
    <str name="field">locSpell</str>
    <str name="buildOnOptimize">true</str>
    <str name="spellcheckIndexDir">./spellchecker_loc_spell</str>
  </lst>
</searchComponent>

<requestHandler name="/spellCheckCompRH" class="solr.SearchHandler">
  <lst name="locSpell">
    <str name="echoParams">explicit</str>
    <str name="spellcheck.dictionary">locSpell</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

My dictionary was built into the folder spellchecker_loc_spell after an optimize.
Now my changes to schema.xml are as follows: New *fieldtype * fieldType name=textSpellPhrase class=solr.TextField positionIncrementGap=100 stored=false multiValued=true analyzer tokenizer class=solr.KeywordTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType my *fields*: field name=id type=integer indexed=true stored=true/ field name=locName type=string indexed=true stored=true/ field name=ct type=integer indexed=true stored=true/ field name=st type=integer indexed=true stored=true/ field name=ppd type=string indexed=true stored=true/ field name=ecd type=string indexed=true stored=true/ field name=city type=text indexed=true stored=true/ field name=state type=text indexed=true stored=true/ field name=locSpell type=textSpellPhrase indexed=true stored=false/ defaultSearchFieldlocName/defaultSearchField copyField source=locName dest=locSpell/ Now when I send the following command http://SolrIP/MagicBricks/Locality/spellCheckCompRH/?q=Dizeagarversion=2.2start=0rows=10indent=onspellcheck=truespellcheck.collate=truespellcheck.extendedResults=truespellcheck.count=3spellcheck.dictionary=locSpell I get the following result:: − response − lst name=responseHeader int name=status0/int int name=QTime1/int /lst result name=response numFound=0 start=0/ − lst name=spellcheck − lst name=suggestions bool name=correctlySpelledtrue/bool /lst /lst /response Which should not be the case as it is wrongly spelled. Could anyone help me out as to why am I getting this strange result that it is correctlySpelled=true when it is not. -- View this message in context: http://lucene.472066.n3.nabble.com/Spell-Check-tp3218037p3218037.html Sent from the Solr - User mailing list archive at Nabble.com.
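Pulling James Dyer's two fixes above together, a sketch of the corrected component (all names taken from the thread; angle brackets restored):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- analyze the query with the same field type as the dictionary field -->
  <str name="queryAnalyzerFieldType">textSpellPhrase</str>
  <lst name="spellchecker">
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="name">locSpell</str>
    <str name="field">locSpell</str>
    <str name="buildOnOptimize">true</str>
    <str name="spellcheckIndexDir">./spellchecker_loc_spell</str>
  </lst>
</searchComponent>
```

Then issue one build request after startup (or put it in a warming query), e.g. .../spellCheckCompRH/?q=Dizeagar&spellcheck=true&spellcheck.build=true&spellcheck.dictionary=locSpell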
Re: How to cut off hits with score below threshold?
Hi Otis, is this the same question as http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html ? If yes, perhaps something like (http://search-lucene.com/m/4AHNF17wIJW1/) q={!frange l=0.85}query($qq) qq=the original relevancy query will help? (BTW, I also would like to specify a custom Collector via the API in Solr, possibly worth an issue?) Best regards Karsten in context: http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-threshold-td3219064.html Original-Nachricht If one wanted to cut off hits whose score is below some threshold (I know, I know, one doesn't typically want to do this), what are the most elegant options?
Re: How to cut off hits with score below threshold?
Be careful with that approach as it will return score=1.0f for all documents (fl=*,score). This, however, doesn't affect the outcome of the frange. Feels like a bug though On Tuesday 02 August 2011 16:29:16 karsten-s...@gmx.de wrote: Hi Otis, is this the same question as http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html ? If yes, perhaps something like (http://search-lucene.com/m/4AHNF17wIJW1/) q={!frange l=0.85}query($qq) qq=the original relevancy query will help? (BTW, a also would like to specify a custom Collector via API in Solr, possible an issue?) Best regards Karsten in context: http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-thr eshold-td3219064.html Original-Nachricht If one wanted to cut off hits whose score is below some threshold (I know, I know, one doesn't typically want to do this), what are the most elegant options? -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
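Written out as request parameters, the frange cutoff from the thread looks like this (the 0.85 threshold is Karsten's example; qq carries the actual relevancy query):

```
q={!frange l=0.85}query($qq)
qq=the original relevancy query
fl=*,score
```

Note Markus's caveat: with this wrapping, the returned score field is the frange function value (1.0) rather than the original relevancy score.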
lucene/solr, raw indexing/searching
Hello, I am trying to get lucene and solr to agree on a completely Raw indexing method. I use lucene in my indexers that write to an index on disk, and solr to search those indexes that i create, as creating the indexes without solr is much much faster than using the solr server. are there settings for BOTH solr and lucene to use EXACTLY whats in the content as opposed to interpreting what it thinks im trying to do? My content is extremely specific and needs no interpretation or adjustment, indexing or searching, a text field. for example: 203.1 seems to be indexed as 2031. searching for 203.1 i can get to work correctly, but then it wont find whats indexed using 3.1's standard analyzer. if i have content that is : this is rev. 23.302 i need it indexed EXACTLY as it appears, this is rev. 23.302 I do not want any of solr or lucenes attempts to fix my content or my queries. rev. needs to stay rev. and not turn into rev, 23.302 needs to stay as such, and NOT turn into 23302. this is for BOTH indexing and searching. any hints? right now for indexing i have: Set nostopwords = new HashSet(); nostopwords.add(buahahahahahaha); Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords); writer = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED); writer.setUseCompoundFile(false) ; and for searching i have in my schema : fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType Thanks. Very much appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3219277.html Sent from the Solr - User mailing list archive at Nabble.com.
CoreContainer from CommonsHttpSolrServer
Hi everybody, I'm using Solr (with multiple cores) in a webapp and access the different cores using CommonsHttpSolrServer. As I would like to know which cores are configured and what their status is, I would like to get an instance of CoreContainer. The site http://wiki.apache.org/solr/CoreAdmin tells me how to interact with the CoreAdminHandler via my browser, but I would like to get the information provided by the STATUS action in my Java application. As CoreContainer provides appropriate methods, I need to get access to such an object. What's the best way to achieve that? Thanks in advance. Matthias -- View this message in context: http://lucene.472066.n3.nabble.com/CoreContainer-from-CommonsHttpSolrServer-tp3219299p3219299.html Sent from the Solr - User mailing list archive at Nabble.com.
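Since the webapp talks to Solr over HTTP anyway, the same STATUS information is available remotely without holding a CoreContainer; a sketch (host and port are assumptions, adjust to your setup):

```
http://localhost:8983/solr/admin/cores?action=STATUS
```

The response lists, per core, its name, instanceDir, dataDir, startTime and index statistics, which a Java client can fetch and parse like any other Solr response.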
Re: How to cut off hits with score below threshold?
I've created an issue to track this funny behaviour: https://issues.apache.org/jira/browse/SOLR-2689 On Tuesday 02 August 2011 16:46:18 Markus Jelsma wrote: Be careful with that approach as it will return score=1.0f for all documents (fl=*,score). This, however, doesn't affect the outcome of the frange. Feels like a bug though On Tuesday 02 August 2011 16:29:16 karsten-s...@gmx.de wrote: Hi Otis, is this the same question as http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html ? If yes, perhaps something like (http://search-lucene.com/m/4AHNF17wIJW1/) q={!frange l=0.85}query($qq) qq=the original relevancy query will help? (BTW, a also would like to specify a custom Collector via API in Solr, possible an issue?) Best regards Karsten in context: http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-t hr eshold-td3219064.html Original-Nachricht If one wanted to cut off hits whose score is below some threshold (I know, I know, one doesn't typically want to do this), what are the most elegant options? -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Jetty error message regarding EnvEntry in WebAppContext
Hi! I am trying to deploy Solr under Jetty 6.1.22-1ubuntu1 (installed the jetty and libjetty-extra-java packages). However, it seems as if I can't get the webapp configuration set right. With this configuration... Configure class=org.mortbay.jetty.webapp.WebAppContext ... *Call name=addEnvEntry* Arg/solr/home/Arg Arg type=java.lang.String/opt/exptbx-solr/solr/Arg Arg type=java.lang.Booleantrue/Arg /Call /Configure ... I get the error: 426 [main] WARN org.mortbay.log - Config error at Call name=addEnvEntryArg/solr/home/ArgArg type=java.lang.String/opt/exptbx-solr/solr/ArgArg type=java.lang.Booleantrue/Arg/Call 426 [main] ERROR org.mortbay.log - EXCEPTION java.lang.IllegalStateException: No Method: Call name=addEnvEntryArg/solr/home/ArgArg type=java.lang.String/opt/exptbx-solr/solr/ArgArg type=java.lang.Booleantrue/Arg/Call on class org.mortbay.jetty.webapp.WebAppContext With this configuration instead... Configure class=org.mortbay.jetty.webapp.WebAppContext ... *New class=org.mortbay.jetty.plus.naming.EnvEntry* Arg/solr/home/Arg Arg type=java.lang.String/opt/exptbx-solr/solr/Arg Arg type=java.lang.Booleantrue/Arg /New /Configure I get the following error: 438 [main] WARN org.mortbay.log - Config error at New class=org.mortbay.jetty.plus.naming.EnvEntryArg/solr/home/ArgArg type=java.lang.String/opt/exptbx-solr/solr/ArgArg type=java.lang.Booleantrue/Arg/New 438 [main] WARN org.mortbay.log - EXCEPTION java.lang.ClassNotFoundException: org.mortbay.jetty.plus.naming.EnvEntry Both examples are derived from http://wiki.apache.org/solr/SolrJetty - the second one being a user-contributed config. It seems that the second problem occurs since I'm not using Jetty Plus. Or at least I don't have the library in the path. Can anyone tell me how a working configuration for Jetty 6.1.22 would have to look like? Thanks! Marian
Re: Matching queries on a per-element basis against a multivalued field
Suk, You're hitting on a well-known limitation with Lucene, and the solutions are work-arounds that may be unacceptable depending on the specifics of your case. Solr 4.0 (trunk)'s support for Joins is definitely an up-and-coming option, as Mike pointed out. Karsten's suggestion of using an index just for friends is very good, although depending on the specifics of your actual needs it may not work or may not scale. Mike also pointed out phrase queries, which will work, but remember to add a proximity, e.g. "isCool=true gender=male"~50 You'll want to consider the positionIncrementGap setting in your schema. A limitation here is that your text analysis options are limited since all the data is in the same field. You're also limited to simple term search; no range queries. I took a different approach for an app I built. I indexed into separate fields (i.e. isCool, gender, bloodType) so that I could analyze each of them appropriately. But I did have to add a filter that basically collapsed all position offsets within a value to zero, effectively nullifying my ability to do a phrase query for a particular value. That was acceptable to me and it can be ameliorated with shingling. Then at search time I used Span queries and their unique ability to positionally query over more than one field. There were some edge conditions that were tricky to debug when I had a null value, but it was at least fixable with a sentinel-value kluge. ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3219352.html Sent from the Solr - User mailing list archive at Nabble.com.
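The positionIncrementGap caveat in the reply above can be sketched in schema.xml like this (illustrative names; the gap just has to exceed any slop you use, so a proximity query can never span two values of the multivalued field):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="myFriends" type="text" indexed="true" multiValued="true"/>
```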
Re: changing the root directory where solrCloud stores info inside zookeeper File system
Thanks A lot mark, Since My SolrCloud code was old I tried downloading and building the newest code from here https://svn.apache.org/repos/asf/lucene/dev/trunk/ I am using tomcat6 I manually created the sc sub-directory in my zooKeeper ensemble file-system I used this connection String to my ZK ensemble zook1:2181/sc,zook2:2181/sc,zook3:2181/sc but I still get the same problem here is the entire catalina.out log with the exception Using CATALINA_BASE: /opt/tomcat6 Using CATALINA_HOME: /opt/tomcat6 Using CATALINA_TMPDIR: /opt/tomcat6/temp Using JRE_HOME:/usr/java/default/ Using CLASSPATH: /opt/tomcat6/bin/bootstrap.jar Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory (errno = 12). Aug 2, 2011 4:28:46 AM org.apache.catalina.core.AprLifecycleListener init INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/jdk1.6.0_21/jre/lib/amd64/server:/usr/java/jdk1.6.0_21/jre/lib/amd64:/usr/java/jdk1.6.0_21/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib Aug 2, 2011 4:28:46 AM org.apache.coyote.http11.Http11Protocol init INFO: Initializing Coyote HTTP/1.1 on http-8983 Aug 2, 2011 4:28:46 AM org.apache.coyote.http11.Http11Protocol init INFO: Initializing Coyote HTTP/1.1 on http-8080 Aug 2, 2011 4:28:46 AM org.apache.catalina.startup.Catalina load INFO: Initialization processed in 448 ms Aug 2, 2011 4:28:46 AM org.apache.catalina.core.StandardService start INFO: Starting service Catalina Aug 2, 2011 4:28:46 AM org.apache.catalina.core.StandardEngine start INFO: Starting Servlet Engine: Apache Tomcat/6.0.29 Aug 2, 2011 4:28:46 AM org.apache.catalina.startup.HostConfig deployDescriptor INFO: Deploying configuration descriptor solr1.xml Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: Using JNDI solr.home: /home/tomcat/solrCloud1 Aug 2, 2011 4:28:46 AM 
org.apache.solr.core.SolrResourceLoader init INFO: Solr home set to '/home/tomcat/solrCloud1/' Aug 2, 2011 4:28:46 AM org.apache.solr.servlet.SolrDispatchFilter init INFO: SolrDispatchFilter.init() Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: Using JNDI solr.home: /home/tomcat/solrCloud1 Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer$Initializer initialize INFO: looking for solr.xml: /home/tomcat/solrCloud1/solr.xml Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer init INFO: New CoreContainer 853527367 Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: Using JNDI solr.home: /home/tomcat/solrCloud1 Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader init INFO: Solr home set to '/home/tomcat/solrCloud1/' Aug 2, 2011 4:28:46 AM org.apache.solr.cloud.SolrZkServerProps getProperties INFO: Reading configuration from: /home/tomcat/solrCloud1/zoo.cfg Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer initZooKeeper INFO: Zookeeper client=zook1:2181/sc,zook2:2181/sc,zook3:2181/sc Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:zookeeper.version=3.3.1-942149, built on 05/07/2010 17:14 GMT Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:host.name=ob1079.nydc1.outbrain.com Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.version=1.6.0_21 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.vendor=Sun Microsystems Inc. 
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.home=/usr/java/jdk1.6.0_21/jre Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.class.path=/opt/tomcat6/bin/bootstrap.jar Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.library.path=/usr/java/jdk1.6.0_21/jre/lib/amd64/server:/usr/java/jdk1.6.0_21/jre/lib/amd64:/usr/java/jdk1.6.0_21/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.io.tmpdir=/opt/tomcat6/temp Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:java.compiler=NA Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:os.name=Linux Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:os.arch=amd64 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:os.version=2.6.18-194.8.1.el5 Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:user.name=tomcat Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:user.home=/home/tomcat Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv INFO: Client environment:user.dir=/opt/tomcat6
Re: performance crossover between single index and sharding
That's a fantastic answer, Shawn. To more directly answer Bernd's question: Bernd, shard your data once you've done reasonable performance optimizations to your single-core index setup (see Chapter 9 of my book) and the query response time still isn't meeting your requirements in spite of this. Solr scales pretty darned well horizontally -- so as you shard your data more and more, the query responses will get faster. At some extreme point there will be diminishing returns and a performance decrease, but I wouldn't worry about that at all until you've got many terabytes -- I don't know how many, but don't worry about it. ~ David - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/performance-crossover-between-single-index-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: performance crossover between single index and sharding
Actually, I do worry about it. It would be marvelous if someone could provide some metrics for an index of many terabytes. [..] At some extreme point there will be diminishing returns and a performance decrease, but I wouldn't worry about that at all until you've got many terabytes -- I don't know how many but don't worry about it. ~ David - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/performance-crossover-between-single-in dex-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: lucene/solr, raw indexing/searching
dhastings, my recommendation for the approaches from both sides ... Lucene: try on a whitespace analyzer for size Analyzer an = new WhitespaceAnalyzer(Version.LUCENE_31); Solr: in your /index/solr/conf/schema.xml fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ ... /analyzer /fieldType -craig -Original Message- From: dhastings [mailto:dhasti...@wshein.com] Sent: Tuesday, 2 August 2011 10:14 PM To: solr-user@lucene.apache.org Subject: lucene/solr, raw indexing/searching Hello, I am trying to get lucene and solr to agree on a completely Raw indexing method. I use lucene in my indexers that write to an index on disk, and solr to search those indexes that i create, as creating the indexes without solr is much much faster than using the solr server. are there settings for BOTH solr and lucene to use EXACTLY whats in the content as opposed to interpreting what it thinks im trying to do? My content is extremely specific and needs no interpretation or adjustment, indexing or searching, a text field. for example: 203.1 seems to be indexed as 2031. searching for 203.1 i can get to work correctly, but then it wont find whats indexed using 3.1's standard analyzer. if i have content that is : this is rev. 23.302 i need it indexed EXACTLY as it appears, this is rev. 23.302 I do not want any of solr or lucenes attempts to fix my content or my queries. rev. needs to stay rev. and not turn into rev, 23.302 needs to stay as such, and NOT turn into 23302. this is for BOTH indexing and searching. any hints? 
right now for indexing i have: Set nostopwords = new HashSet(); nostopwords.add(buahahahahahaha); Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords); writer = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED); writer.setUseCompoundFile(false) ; and for searching i have in my schema : fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType Thanks. Very much appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219 277p3219277.html Sent from the Solr - User mailing list archive at Nabble.com.
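The effect of Craig's whitespace-only suggestion can be sanity-checked with nothing but the JDK: a pure whitespace split keeps rev. and 23.302 exactly as written. This sketch only mimics the tokenizer; it is not the Lucene class itself, but WhitespaceAnalyzer produces the same tokens for these inputs.

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceTokenDemo {
    // Mimics whitespace-only tokenization (what WhitespaceAnalyzer /
    // solr.WhitespaceTokenizerFactory produce): split on runs of
    // whitespace and keep every token exactly as written.
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // "rev." and "23.302" survive intact; StandardAnalyzer would
        // normalize them (e.g. "rev." to "rev") instead.
        System.out.println(tokenize("this is rev. 23.302"));
    }
}
```

Pairing this on the Lucene side with the Solr-side WhitespaceTokenizerFactory keeps both halves of the setup in agreement, which is the whole point of the original question.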
Re: lucene/solr, raw indexing/searching
In your solr schema.xml, are the fields you are using defined as text fields with analyzers? It sounds like you want no analysis at all, which probably means you don't want text fields either, you just want string fields. That will make it impossible to search for individual tokens though, searches will match only on complete matches of the value. I'm not quite sure how to do what you want, it depends on exactly what you want. What kind of searching do you expect to support? If you still do want tokenization, you'll still want some analysis... but I'm not quite sure how that corresponds to what you'd want to do on the lucene end. What you're trying to do is going to be inevitably confusing, I think. Which doesn't mean it's not possible. You might find it less confusing if you were willing to use Solr to index though, rather than straight lucene -- you could use Solr via the SolrJ java classes, rather than the HTTP interface. On 8/2/2011 11:14 AM, dhastings wrote: Hello, I am trying to get lucene and solr to agree on a completely Raw indexing method. I use lucene in my indexers that write to an index on disk, and solr to search those indexes that i create, as creating the indexes without solr is much much faster than using the solr server. are there settings for BOTH solr and lucene to use EXACTLY whats in the content as opposed to interpreting what it thinks im trying to do? My content is extremely specific and needs no interpretation or adjustment, indexing or searching, a text field. for example: 203.1 seems to be indexed as 2031. searching for 203.1 i can get to work correctly, but then it wont find whats indexed using 3.1's standard analyzer. if i have content that is : this is rev. 23.302 i need it indexed EXACTLY as it appears, this is rev. 23.302 I do not want any of solr or lucenes attempts to fix my content or my queries. rev. needs to stay rev. and not turn into rev, 23.302 needs to stay as such, and NOT turn into 23302. 
this is for BOTH indexing and searching. any hints? right now for indexing i have: Set nostopwords = new HashSet(); nostopwords.add(buahahahahahaha); Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords); writer = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED); writer.setUseCompoundFile(false) ; and for searching i have in my schema : fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType Thanks. Very much appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3219277.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Jetty error message regarding EnvEntry in WebAppContext
On 8/2/2011 11:42 AM, Marian Steinbach wrote: Can anyone tell me how a working configuration for Jetty 6.1.22 would have to look like? You know that Solr distro comes with a jetty with a Solr in it, right, as an example application? Even if you don't want to use it for some reason, that would probably be the best model to look at for a working jetty with solr. Or is the problem that you want a different version of jetty? As it happens, I just recently set up a jetty 6.1.26 for another project, not for solr. It was kind of a pain not being too familiar with java deployment or jetty. But I did get JDNI working, by following the jetty instructions here: http://docs.codehaus.org/display/JETTY/JNDI (It was a bit confusing to figure out what they were talking about not being familiar with jetty, but eventually I got it, and the instructions were correct.) But if I wanted to run Solr in jetty, I'd start with the jetty that is distributed with solr, rather than trying to build my own.
Re: Matching queries on a per-element basis against a multivalued field
I appreciate your replies and ideas. SpanQuery would work, and I'll look into this further. However, what about the original question? Is there no way to match documents on a per-element basis against a multivalued field? If not, would it perhaps make sense to create a feature request? Also, regarding the join support you guys have mentioned: is it only on a field within the same core, or is it across cores (as if cores are tables in a database)? Joining on cores would eliminate most of the issues I'm having. The examples I gave are simplified, but actually I have an entity A that has entity B that has entity C, and I'm flattening out queriable fields of B and C into the schema for A. This way, I can search for documents for the core A that match some criteria for A, B, and/or C. -- View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3219565.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Matching queries on a per-element basis against a multivalued field
On Aug 2, 2011, at 1:09 PM, Suk-Hyun Cho [via Lucene] wrote: I appreciate your replies and ideas. SpanQuery would work, and I'll look into this further. However, what about the original question? Is there no way to match documents on a per-element basis against a multivalued field? Correct; there is no way. Aside from Solr 4's Join feature, everything else suggested is a hack / work-around for a fundamental limitation. If not, would it perhaps make sense to create a feature request? You could but I wouldn't bother because its unlikely to get any traction as it's a fundamental issue with Lucene and at the Solr level there is a solution on the horizon. Also, regarding the join support you guys have mentioned: is it only on a field within the same core, or is it across cores (as if cores are tables in a database)? Joining on cores would eliminate most of the issues I'm having. The examples I gave are simplified, but actually I have an entity A that has entity B that has entity C, and I'm flattening out queriable fields of B and C into the schema for A. This way, I can search for documents for the core A that match some criteria for A, B, and/or C. The Join support works across cores. See the wiki and associated JIRA issue for it. ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3219638.html Sent from the Solr - User mailing list archive at Nabble.com.
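For reference, the cross-core join David mentions is expressed on trunk (4.0) roughly like this (core and field names here are illustrative, not from the thread):

```
q={!join fromIndex=coreB from=parentId to=id}someFieldOfB:value
```

This selects documents in the current core (your entity A) whose id matches the parentId of coreB documents satisfying the inner query.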
how to get row no. of current record
Hi, How do I know the row number of the current record? I.e.: suppose we have 10 million records indexed. Currently I am on the 5th record and the id of this record is XYZ00234; how do I know that the current record's row number is 5? thanks.. regards Ranveer
RE: performance crossover between single index and sharding
Hi Markus, Just as a data point for a very large sharded index, we have the full text of 9.3 million books with an index size of about 6+ TB spread over 12 shards on 4 machines. Each machine has 3 shards. The size of each shard ranges between 475GB and 550GB. We are definitely I/O bound. Our machines have 144GB of memory with about 16GB dedicated to the tomcat instance running the 3 Solr instances, which leaves about 120 GB (or 40GB per shard) for the OS disk cache. We release a new index every morning and then warm the caches with several thousand queries. I probably should add that our disk storage is a very high performance Isilon appliance that has over 500 drives and every block of every file is striped over no less than 14 different drives. (See blog for details *) We have a very low number of queries per second (0.3-2 qps) and our modest response time goal is to keep 99th percentile response time for our application (i.e. Solr + application) under 10 seconds. Our current performance statistics are: average response time 300 ms median response time 113 ms 90th percentile663 ms 95th percentile1,691 ms We had plans to do some performance testing to determine the optimum shard size and optimum number of shards per machine, but that has remained on the back burner for a long time as other higher priority items keep pushing it down on the todo list. We would be really interested to hear about the experiences of people who have so many shards that the overhead of distributing the queries, and consolidating/merging the responses becomes a serious issue. 
Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search * http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, August 02, 2011 12:33 PM To: solr-user@lucene.apache.org Subject: Re: performance crossover between single index and sharding Actually, i do worry about it. Would be marvelous if someone could provide some metrics for an index of many terabytes. [..] At some extreme point there will be diminishing returns and a performance decrease, but I wouldn't worry about that at all until you've got many terabytes -- I don't know how many but don't worry about it. ~ David - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/performance-crossover-between-single-in dex-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: performance crossover between single index and sharding
What's the reasoning behind having three shards on one machine, instead of just combining those into one shard? Just curious. I had been thinking the point of shards was to get them on different machines, and there'd be no reason to have multiple shards on one machine. On 8/2/2011 1:59 PM, Burton-West, Tom wrote: Hi Markus, Just as a data point for a very large sharded index, we have the full text of 9.3 million books with an index size of about 6+ TB spread over 12 shards on 4 machines. Each machine has 3 shards. The size of each shard ranges between 475GB and 550GB. We are definitely I/O bound. Our machines have 144GB of memory with about 16GB dedicated to the tomcat instance running the 3 Solr instances, which leaves about 120 GB (or 40GB per shard) for the OS disk cache. We release a new index every morning and then warm the caches with several thousand queries. I probably should add that our disk storage is a very high performance Isilon appliance that has over 500 drives and every block of every file is striped over no less than 14 different drives. (See blog for details *) We have a very low number of queries per second (0.3-2 qps) and our modest response time goal is to keep 99th percentile response time for our application (i.e. Solr + application) under 10 seconds. Our current performance statistics are: average response time 300 ms median response time 113 ms 90th percentile663 ms 95th percentile1,691 ms We had plans to do some performance testing to determine the optimum shard size and optimum number of shards per machine, but that has remained on the back burner for a long time as other higher priority items keep pushing it down on the todo list. We would be really interested to hear about the experiences of people who have so many shards that the overhead of distributing the queries, and consolidating/merging the responses becomes a serious issue. 
Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search * http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, August 02, 2011 12:33 PM To: solr-user@lucene.apache.org Subject: Re: performance crossover between single index and sharding Actually, i do worry about it. Would be marvelous if someone could provide some metrics for an index of many terabytes. [..] At some extreme point there will be diminishing returns and a performance decrease, but I wouldn't worry about that at all until you've got many terabytes -- I don't know how many but don't worry about it. ~ David - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/performance-crossover-between-single-in dex-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: performance crossover between single index and sharding
Hi Tom, Very interesting indeed! But I keep wondering why some engineers choose to store multiple shards of the same index on the same machine; there must be significant overhead. The only reason I can think of is ease of maintenance in moving shards to a separate physical machine. I know that rearranging the shard topology can be a real pain in a large existing cluster (e.g. consistent hashing is not consistent anymore and having to shuffle docs to their new shard), so is this the reason you chose this approach? Cheers, [...]
RE: performance crossover between single index and sharding
Hi Jonathan and Markus, Why 3 shards on one machine instead of one larger shard per machine? Good question! We made this architectural decision several years ago and I'm not remembering the rationale at the moment. I believe we originally made the decision due to some tests showing a sweet spot for I/O performance for shards with 500,000-600,000 documents, but those tests were made before we implemented CommonGrams and when we were still using attached storage. I think we also might have had concerns about Java OOM errors with a really large shard/index, but we now know that we can keep memory usage under control by tweaking the amount of the terms index that gets read into memory. We should probably do some tests and revisit the question. The reason we don't have 12 shards on 12 machines is that current performance is good enough that we can't justify buying 8 more machines :) Tom -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, August 02, 2011 2:12 PM To: solr-user@lucene.apache.org Subject: Re: performance crossover between single index and sharding [...]
Re: performance crossover between single index and sharding
With low qps and multi-core servers, I believe one reason to have multiple shards on one server is to provide better parallelism for a request, and thus reduce your response time. -- Ken On Aug 2, 2011, at 11:06am, Jonathan Rochkind wrote: What's the reasoning behind having three shards on one machine, instead of just combining those into one shard? Just curious. [...] -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions
Re: performance crossover between single index and sharding
On 8/2/2011 12:06 PM, Jonathan Rochkind wrote: What's the reasoning behind having three shards on one machine, instead of just combining those into one shard? Just curious. I had been thinking the point of shards was to get them on different machines, and there'd be no reason to have multiple shards on one machine. I'd be interested in hearing Tom's answer as well, but my answer boils down to the time it takes to do a full index rebuild and worries about performance. Because I'm in a virtualized environment, I effectively have three large shards on each machine even though they are logically separate. When I first got involved, we had a distributed EasyAsk index on 20 separate low-end physical servers. That evolved into basically the same solution with a smaller number of virtual machines, on a pair of very powerful physical hosts. On this system, doing a full rebuild took nearly two days and wasn't an atomic operation. The EasyAsk system (also based on Lucene) was unable to deal with more than about 4 million documents per machine (real or virtual). The only way to get acceptable performance was distributed search. The cost of providing redundancy was too high, so we didn't have any. When we first started implementing Solr, we assumed from our previous experience that we'd need distributed search, especially if query volume were to go up. For that reason, we continued our virtualization model, but with only seven shards - six large static shards and a smaller incremental shard to hold data less than a week old. This is where we are now, and performance is MUCH better than the old solution. The low shard count made redundancy affordable, so we now have that too. At the time Solr was first implemented, we could rebuild the entire index in about two hours and swap it into place all at once. Our index has grown enough since then that it takes a little less than three hours, which is still pretty quick for 60 million documents. 
I did try some early tests with a single large index. Performance was pretty decent once it got warmed up, but I was worried about how it would perform under a heavy load, and how it would cope with frequent updates. I never really got very far with testing those fears, because the full rebuild time was unacceptable - at least 8 hours. The source database can keep up with six DIH instances reindexing at once, which completes much quicker than a single machine grabbing the entire database. I may increase the number of shards after I remove virtualization, but I'll need to fix a few limitations in my build system. Thanks, Shawn
Re: Query on multi valued field
: The query is to get only those documents which have multiple elements for : that multivalued field. : : I.e., doc 2 and 3 should be returned from the above set.. The only way to do something like this is to add a field when you index your documents that contains the number of values, and then filter on that field using a range query. With an UpdateProcessor (or a ScriptTransformer in DIH) you can automate counting how many values there are -- but it has to be indexed to search/filter on it. -Hoss
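Hoss's suggestion can be sketched as a small pre-indexing step. This is only an illustration of the idea in Python (the field names `author` and `author_count` are made up); in practice you would compute the count in an UpdateProcessor or a DIH ScriptTransformer as he describes.

```python
def with_value_count(doc, field, count_field):
    """Return a copy of doc with count_field set to the number of
    values in the multivalued field (0 if the field is absent)."""
    values = doc.get(field, [])
    if not isinstance(values, list):
        values = [values]
    out = dict(doc)
    out[count_field] = len(values)
    return out

docs = [
    {"id": "1", "author": ["a"]},
    {"id": "2", "author": ["a", "b"]},
    {"id": "3", "author": ["a", "b", "c"]},
]
indexed = [with_value_count(d, "author", "author_count") for d in docs]

# Once author_count is indexed, a range filter such as
# fq=author_count:[2 TO *] selects only multi-valued documents.
multi = [d["id"] for d in indexed if d["author_count"] >= 2]
```

With the count field indexed, the equivalent Solr filter `fq=author_count:[2 TO *]` returns exactly docs 2 and 3 from the original example.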
Re: Why Slop doesn't match anything?
Hey dude, Sorry for the long absence. (Need to check my personal email more often o0) I am not using dismax. I didn't find the solution for the problem. I just did a full-import and the problem ended. Still odd. 2011/7/27 Gora Mohanty g...@mimirtech.com On Wed, Jul 27, 2011 at 8:38 PM, Alexander Ramos Jardim alexander.ramos.jar...@gmail.com wrote: Hello pals, Using solr 1.4.0. Trying to understand something. When I run the query *fieldA:nokia c3*, I get 5 results, all with nokia c3, as expected. But when I run fieldA:nokia c3~100, I don't get any results! As far as I understand, the ~100 should make my query bring back even more results, as not only documents with nokia c3 in their fieldA will be found; something like nokia blue c3 should match too. Right? [...] That does seem odd. You are not using the dismax query handler by any chance, are you? If so, then the query slop needs to be specified by adding qs=100 to the query. Regards, Gora -- Alexander Ramos Jardim
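For what it's worth, one common cause of this symptom (a guess -- the thread never pinpoints the cause) is missing phrase quotes: the slop suffix only applies to a quoted phrase, so `fieldA:nokia c3~100` parses as `fieldA:nokia` plus a separate clause `c3~100` against the default field, whereas `fieldA:"nokia c3"~100` is a single sloppy phrase query. A tiny helper makes the quoting hard to forget:

```python
def phrase_query(field, phrase, slop=0):
    """Build a Lucene/Solr phrase query string. The quotes are what
    makes ~N a phrase-slop operator rather than a fuzzy-term suffix."""
    q = f'{field}:"{phrase}"'
    return f"{q}~{slop}" if slop > 0 else q
```

So `phrase_query("fieldA", "nokia c3", 100)` yields `fieldA:"nokia c3"~100`, which would also match documents like "nokia blue c3".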
Re: Matching queries on a per-element basis against a multivalued field
Thanks. I saw the related jira issue but didn't follow closely enough to see the cross-core join being added later. Any idea/hint on when I can expect Solr 4 to be released? -- View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3220091.html Sent from the Solr - User mailing list archive at Nabble.com.
TikaEntityProcessor is filling logs
I want to use TikaEntityProcessor for URLs defined in the link field from the parent entity. This field can be empty as well. While the dataimport is working OK, the log is filling up with exceptions when link is null. Is there a way to prevent this?

<field column="id" xpath="/doc/id" />
<field column="text" xpath="/doc/text" />
<field column="link" xpath="/doc/link" />
<entity name="tika" processor="TikaEntityProcessor" url="${crawl.link}" dataSource="bin" onError="continue" format="text">
  <field column="text" />
</entity>

-- View this message in context: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-is-filling-logs-tp3220100p3220100.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Matching queries on a per-element basis against a multivalued field
My best guess (and it is just a guess) is between December and March. The root of Solr 4 that triggered the major version change is known as flexible indexing (or just flex for short amongst developers). The genesis of it was posted to JIRA as a patch on 18 November 2008 -- LUCENE-1458 (almost 3 years ago!). About a year later it was committed into a special flex branch that is probably gone now, and then around April/early-May 2010 it went into trunk, whereas the pre-flex code on trunk went to a newly formed 3x branch. That is ancient history now, and there are some amazing performance improvements tied to flex that haven't seen the light of day in an official release. It's a shame, really. So it's been so long that, well, after it dawns on everyone that the code is 3 friggin' years old without a release -- it's time to get on with the show. ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3220242.html Sent from the Solr - User mailing list archive at Nabble.com.
MultiSearcher/ParallelSearcher - searching over multiple cores?
Hi *, I searched the web for an answer to whether it is possible in Solr to make a query over several cores with all features (boosting, pagination, highlighting, and so on) out of the box. In Lucene it is possible with MultiSearcher/ParallelSearcher. I do not mean Distributed Search or merging several indexes together; I mean a search over several cores with different types (different search fields). It sounds quite difficult, so I think it is not a Solr out-of-the-box feature and I have to implement it by hand. Am I right? Thanks in advance, Ralf
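Implementing it by hand roughly means querying each core separately and re-sorting the combined hits, as in this sketch. Note the caveat: raw scores from cores with different schemas and term statistics are not directly comparable, which is one reason this is not an out-of-the-box feature.

```python
def merge_core_results(result_lists, rows=10):
    """Merge per-core result lists (each a list of (score, doc) tuples,
    already sorted by descending score) into one ranked page.
    Distributed search does this for you; doing it by hand forfeits
    globally consistent idf, accurate facet counts, etc."""
    merged = [hit for hits in result_lists for hit in hits]
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return [doc for _, doc in merged[:rows]]

# Hypothetical results from two cores, already scored and sorted:
core_a = [(0.9, "a1"), (0.4, "a2")]
core_b = [(0.7, "b1"), (0.1, "b2")]
page = merge_core_results([core_a, core_b], rows=3)
```

Pagination is the painful part of this approach: to serve page N you must fetch the top N pages from every core and merge, just as distributed search does internally.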
Re: Matching queries on a per-element basis against a multivalued field
Well, Lucid released LucidWorks Enterprise with Complete Apache Solr 4.x Release Integrated and tested with powerful enhancements. Whatever that means for Solr 4.0. On Tue, Aug 2, 2011 at 11:10 PM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: [...]
Re: Matching queries on a per-element basis against a multivalued field
LucidWorks Enterprise (which is more than Solr, and a modified Solr at that) isn't free; so you can't extract the Solr part of that package and use it unless you are willing to pay them. Lucid's Certified Solr, on the other hand, is free. But they have yet to bump that to trunk/4.x; it was only recently updated to 3.2. On Aug 2, 2011, at 5:26 PM, eks dev wrote: [...]
Re: Matching queries on a per-element basis against a multivalued field
Sure, I know... The point I was trying to make: if someone serious like Lucid is using Solr 4.x as a core technology for its own customers, the trunk could not be all that bad = release date not as far off as 2012 :) On Tue, Aug 2, 2011 at 11:33 PM, Smiley, David W. dsmi...@mitre.org wrote: [...]
Re: Matching queries on a per-element basis against a multivalued field
On Aug 2, 2011, at 5:47 PM, eks dev wrote: [...] Oh, the current trunk is most definitely *not* all that bad, as you say; that wasn't a point of discussion. Code coverage is excellent, testing is rather extensive, and many folks like me use it in production. But after nearly 3 years of waiting, I wouldn't hold your breath on it getting released w/i 6 months (before 2012). ~ David
Re: Solr with many indexes
We have a multi-tenant Solr deployment with a core for each user. Due to the limitations we are facing with the number of cores, lazy-loading (and associated warm-up times), we are researching consolidating several users into one core, with queries limited by a user-id field. My question is about autosuggest. 1. Are there ways we can limit the autosuggest to only documents with matching ids? 2. What other Solr operations like these need further consideration when merging multiple indices and limiting by a field? -- Vikram On Sat, Jan 22, 2011 at 4:02 PM, Erick Erickson erickerick...@gmail.com wrote: See below. On Wed, Jan 19, 2011 at 7:26 PM, Joscha Feth jos...@feth.com wrote: Hello Erick, Thanks for your answer! But I question why you *require* many different indexes. [...] including isolating one user's data from all others, [...] Yes, that's exactly what I am after - I need to make sure that indexes don't mix, as every user shall only be able to query his own data (index). Well, this can also be handled by simply appending the equivalent of +user:theuser to each query. This solution does have some interesting side effects though. In particular, if you autosuggest based on combined documents, users will see terms NOT in documents they own. And even using lots of cores can be made to work if you don't pre-warm newly-opened cores, assuming that the response time when using cold searchers is adequate. Could you explain that further or point me to some documentation? Are you talking about: http://wiki.apache.org/solr/CoreAdmin#UNLOAD? If yes, LOAD does not seem to be implemented, yet. Or has this something to do with http://wiki.apache.org/solr/SolrCaching#autowarmCount only? About what delay per X documents are we talking here if auto warming is disabled? Is there more documentation about this setting? It's the autoWarm parameter. When you open a core, the first few queries that run on it will pay some penalty for filling caches etc. 
If your cores are small enough, then this penalty may not be noticeable to your users, in which case you can just not bother autowarming (see firstSearcher, newSearcher). You might also be able to get away with having very small caches; it mostly depends on your usage patterns. If your pattern is that a user signs on, makes one search and signs off, there may not be much good in having large caches. On the other hand, if users sign on and search for hours continually, their experience may be enhanced by having significant caches. It all depends. Hope that helps Erick Kind regards, Joscha -- - Vikram
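Erick's `+user:theuser` trick extends to autosuggest when the suggestions are built from an ordinary indexed field: a facet.prefix query with an fq restricting results to one user returns only terms from that user's documents. A sketch of the request parameters follows; the field names `user_id` and `suggest` are hypothetical, and this uses plain faceting rather than Solr's Suggester component (which, as discussed here, cannot be filtered this way).

```python
from urllib.parse import urlencode

def suggest_params(prefix, user_id, field="suggest", rows=0):
    """Build Solr query parameters for a facet.prefix-based
    autosuggest restricted to one user's documents."""
    return urlencode({
        "q": "*:*",
        "fq": f"user_id:{user_id}",   # per-user restriction
        "rows": rows,                  # we only want the facet counts
        "facet": "true",
        "facet.field": field,
        "facet.prefix": prefix.lower(),
        "facet.mincount": 1,
    })

params = suggest_params("Nok", "u42")
```

Because facet counts are computed over the fq-filtered document set, a user never sees terms that occur only in other users' documents.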
Re: Matching queries on a per-element basis against a multivalued field
Thanks for the history and the current state of trunk, guys. It sounds like it's rather stable for serious use... in which case it's probably ready for a release, but let's not go back in circles. :) I'll give it a shot sometime. Thanks, again! -- View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3220449.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr with many indexes
Hello, From: Vikram Kumar vikrambku...@gmail.com We have a multi-tenant Solr deployment with a core for each user. Due to the limitations we are facing with number of cores, lazy-loading (and associated warm-up times), we are researching about consolidating several users into one core with queries limited by user-id field. My question is about autosuggest. 1. Are there ways we can limit the autosuggest to only documents with matching ids? Not sure about Solr's Suggester, but yes this and more is doable with Sematext's Autocomplete: http://sematext.com/products/autocomplete/index.html 2. What other SOLR operations like these which need further consideration when merging multiple indices and limiting by a field? Spellchecking is the first thing that comes to mind. Not sure what else... Otis [...]
SolrCloud: is there a programmatic way to create an ensemble
I have multiple SolrCloud instances, each running its own Zookeeper (Solr launched with -DzkRun). I would like to create an ensemble out of them. I know about the -DzkHost parameter, but can I achieve the same programmatically? Either with SolrJ or the REST API? Thanks, Yury
Re: I can't pass the unit test when compile from apache-solr-3.3.0-src
On 7/29/2011 5:26 PM, Chris Hostetter wrote: Can you please be specific... * which test(s) fail for you? * what are the failures? Any time a test fails, that info appears in the ant test output, and the full details for all tests are written to build/test-results. You can run ant test-reports from the solr directory to generate an HTML report of all the success/failure info. I am also having a consistent build failure with the 3.3 source. Some info from junit about the failure is below. If you want something different, I still have it in my session; let me know.

[junit] NOTE: reproduce with: ant test -Dtestcase=TestSqlEntityProcessorDelta -Dtestmethod=testNonWritablePersistFile -Dtests.seed=4609081405510352067:771607526385155597
[junit] NOTE: test params are: locale=ko_KR, timezone=Asia/Saigon
[junit] NOTE: all tests run in this JVM: [TestCachedSqlEntityProcessor, TestClobTransformer, TestContentStreamDataSource, TestDataConfig, TestDateFormatTransformer, TestDocBuilder, TestDocBuilder2, TestEntityProcessorBase, TestErrorHandling, TestEvaluatorBag, TestFieldReader, TestFileListEntityProcessor, TestJdbcDataSource, TestLineEntityProcessor, TestNumberFormatTransformer, TestPlainTextEntityProcessor, TestRegexTransformer, TestScriptTransformer, TestSqlEntityProcessor, TestSqlEntityProcessor2, TestSqlEntityProcessorDelta]
[junit] NOTE: Linux 2.6.18-238.12.1.el5.centos.plusxen amd64/Sun Microsystems Inc. 1.6.0_26 (64-bit)/cpus=3,threads=4,free=100917744,total=254148608

Here's what I did on the last run:

rm -rf lucene_solr_3_3
svn co https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_3 lucene_solr_3_3
cd lucene_solr_3_3/solr
ant clean test

Thanks, Shawn
Re: IMP: indexing taking very long time
Can somebody answer this? What should be the best strategy for optimize (when millions of messages are being indexed for a new registered user)? Thanks Naveen On Tue, Aug 2, 2011 at 5:36 PM, Naveen Gupta nkgiit...@gmail.com wrote: Hi We have a requirement where we are indexing all the messages of a thread; a thread may have attachments too. We are adding them to Solr for indexing and searching, and for applying a few business rules. For a user, we have many threads (almost 100k), and each thread may have 10-20 messages. Now what we are finding is that it is taking 30 mins to index the entire set of threads. When we run optimize, then it is faster. The question here is: how frequently should this optimize be called, and when? Please note that we are following a commit strategy (that is, commit is called after every 10k threads); we are not calling commit after every doc. Secondly, how can we use multi-threading from the Solr perspective in order to improve JVM and other utilization? Thanks Naveen
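The batching strategy Naveen describes (commit every N documents, never per document, optimize rarely) can be sketched as follows. Here `post` is a stand-in for whatever client actually talks to Solr (SolrJ, an HTTP client, etc.), so the example only demonstrates the call pattern, not a real API.

```python
def index_in_batches(docs, post, batch_size=10000):
    """Index docs in batches: commit once per batch, optimize once at
    the very end of the bulk load rather than per batch."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            post("add", batch)
            post("commit", None)
            batch = []
    if batch:                      # flush the final partial batch
        post("add", batch)
        post("commit", None)
    post("optimize", None)         # once, after the bulk load

# Record the call sequence with a fake client:
calls = []
index_in_batches(range(25), lambda op, payload: calls.append(op), batch_size=10)
```

Running 25 docs with batch_size=10 yields three add/commit pairs followed by a single optimize, which matches the usual advice: optimize is expensive (it rewrites the whole index), so reserve it for the end of a large load or an off-peak schedule.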
Re: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.ICUTokenizerFactory'
I copied the file apache-solr-analysis-extras-3.3.0.jar into solr's lib folder. Now the error is different - SEVERE: java.lang.NoClassDefFoundError: org/apache/solr/analysis/BaseTokenizerFactory Please help. Satish On Tue, Aug 2, 2011 at 5:23 PM, Robert Muir rcm...@gmail.com wrote: did you add the analysis-extras jar itself? thats what has this factory. On Tue, Aug 2, 2011 at 5:03 AM, Satish Talim satish.ta...@gmail.com wrote: I am using Solr 3.3 on a Windows box. I want to use the solr.ICUTokenizerFactory in my schema.xml and added the fieldType name=text_icu as per the URL - http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory I also added the following files to my apache-solr-3.3.0\example\lib folder: lucene-icu-3.3.0.jar lucene-smartcn-3.3.0.jar icu4j-4_8.jar lucene-stempel-3.3.0.jar When I start my Solr server from apache-solr-3.3.0\example folder: java -jar start.jar I get the following errors: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.ICUTokenizerFactory' SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer filter list SEVERE: org.apache.solr.common.SolrException: Unknown fieldtype 'text_icu' specified on field subject I tried adding various other jar files to the lib folder but it does not help. What am I doing wrong? Satish -- lucidimagination.com
Re: how to get row no. of current record
Any help?

On Tuesday 02 August 2011 11:22 PM, Ranveer wrote:

Hi,

How do I know the row number of the current record? For example, suppose we have 10 million records indexed, I am currently on the 5th record, and the id of this record is XYZ00234; how do I know that the current record's row number is 5?

thanks..
Re: how to get row no. of current record
Hi Ranveer,

I'm not really sure if you mean Lucene's docid (as that's the auto-increment id used internally). Why would you need that in the first place? I'd suggest you not expose that. Let me know in case you wanted something else. Also, perhaps you could explain the exact use case, and one of us can give you a better solution.

Hope that helps.

--
Anshum Gupta
http://ai-cafe.blogspot.com

On Tue, Aug 2, 2011 at 11:22 PM, Ranveer ranveer.s...@gmail.com wrote:

Hi,

How do I know the row number of the current record? For example, suppose we have 10 million records indexed, I am currently on the 5th record, and the id of this record is XYZ00234; how do I know that the current record's row number is 5?

thanks..

regards
Ranveer
Re: how to get row no. of current record
Hi Anshum,

Thanks for the reply. My requirement is to get results starting from the current id; for this I need to set the start row. I am looking for something like Jonty's post: http://lucene.472066.n3.nabble.com/previous-and-next-rows-of-current-record-td3187935.html

thanks
Ranveer

On Wednesday 03 August 2011 08:31 AM, Anshum wrote:

Hi Ranveer,

I'm not really sure if you mean Lucene's docid (as that's the auto-increment id used internally). Why would you need that in the first place? I'd suggest you not expose that. Let me know in case you wanted something else. Also, perhaps you could explain the exact use case, and one of us can give you a better solution.

Hope that helps.

--
Anshum Gupta
http://ai-cafe.blogspot.com

On Tue, Aug 2, 2011 at 11:22 PM, Ranveer ranveer.s...@gmail.com wrote:

Hi,

How do I know the row number of the current record? For example, suppose we have 10 million records indexed, I am currently on the 5th record, and the id of this record is XYZ00234; how do I know that the current record's row number is 5?

thanks..

regards
Ranveer
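One common way to fetch the previous and next records relative to a known id, without needing its absolute row number, is a pair of range queries sorted on the id field; a sketch, assuming id is a sortable string field and using the XYZ00234 value from the example above (in Solr 3.x both endpoints of a range must share the same inclusivity, so exclusive braces are used on both ends here):

```
# next record: first id strictly greater than the current one
q=id:{XYZ00234 TO *}&sort=id asc&rows=1

# previous record: first id strictly less than the current one
q=id:{* TO XYZ00234}&sort=id desc&rows=1
```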
PivotFaceting in solr 3.3
Hi All!

Can anyone tell me which patch I should apply to Solr 3.3 to enable pivot faceting?

Thanks in advance!
Isha Garg
Re: PivotFaceting in solr 3.3
From what I know, this is a feature in Solr 4.0, tracked as SOLR-792 in JIRA. Is this what you are looking for? https://issues.apache.org/jira/browse/SOLR-792

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny

On Wed, Aug 3, 2011 at 10:16, Isha Garg isha.g...@orkash.com wrote:

Hi All!

Can anyone tell me which patch I should apply to Solr 3.3 to enable pivot faceting?

Thanks in advance!
Isha Garg
Re: PivotFaceting in solr 3.3
Hi Pranav,

I know pivot faceting is a feature in Solr 4.0, but what I want to know is whether there is any patch that can make pivot faceting possible in Solr 3.3.

Thanks!
Isha

On Wednesday 03 August 2011 10:23 AM, Pranav Prakash wrote:

From what I know, this is a feature in Solr 4.0, tracked as SOLR-792 in JIRA. Is this what you are looking for? https://issues.apache.org/jira/browse/SOLR-792

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny

On Wed, Aug 3, 2011 at 10:16, Isha Garg isha.g...@orkash.com wrote:

Hi All!

Can anyone tell me which patch I should apply to Solr 3.3 to enable pivot faceting?

Thanks in advance!
Isha Garg
Re: Query on multi valued field
Thank you. This logic works for me. Thanks a lot.

Regards,
Rajani Maski

On Wed, Aug 3, 2011 at 1:21 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: The query is to get only those documents which have multiple elements
: for that multivalued field.
:
: I.e., doc 2 and 3 should be returned from the above set.

The only way to do something like this is to add a field when you index your documents that contains the number of values, and then filter on that field using a range query. With an UpdateProcessor (or a ScriptTransformer in DIH) you can automate counting how many values there are -- but it has to be indexed to search/filter on it.

-Hoss
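The ScriptTransformer approach Hoss mentions can be sketched in a DIH config roughly as follows; the entity, source column (features), and count field (featureCount) are hypothetical placeholders, not names from this thread:

```xml
<!-- data-config.xml: count the values of a multivalued column at index
     time so documents can later be filtered with a range query.
     'features' and 'featureCount' are illustrative names. -->
<dataConfig>
  <script><![CDATA[
    function addCount(row) {
      var values = row.get('features');  // a java.util.List, or null
      row.put('featureCount', values == null ? 0 : values.size());
      return row;
    }
  ]]></script>
  <document>
    <entity name="item" transformer="script:addCount" query="...">
      <!-- other fields as usual; featureCount is added by the script -->
    </entity>
  </document>
</dataConfig>
```

With featureCount indexed (for example as an int field), the original request reduces to a range filter such as fq=featureCount:[2 TO *].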