Reindex Solr Using Tomcat
Hi, I searched Google and the wiki to find out how I can force a full re-index of all of my content, and I came up with zilch. My goal is to be able to adjust the weight settings, re-index my entire database, and then search my site to view the results of my weight adjustments. I am using Tomcat 5.x and Solr 1.4.1. Weird that I couldn't find this info; I must have missed it. Anyone know where to find it? Eric
RE: Reindex Solr Using Tomcat
Ah, I am using the ApacheSolr module in Drupal and used nutch to insert the data into the Solr index. When I was using Jetty I could just delete the contents of the data directory over SSH and then restart the service, forcing a reindex. Currently, the ApacheSolr module for Drupal re-indexes 200 records per cron run, but that is too slow for me. During implementation and testing I would prefer to re-index the entire database, as I have over 400k records. I appreciate your help. My mind was searching for a CLI command that would just tell Solr to reindex the entire database and be done with it.

-----Original Message-----
From: Ken Stanley [mailto:doh...@gmail.com]
Sent: Thursday, November 18, 2010 12:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Reindex Solr Using Tomcat

Eric,

How you re-index Solr determines which method you wish to use. You can either use the UpdateHandler via a POST of an XML file [1], or you can use the DataImportHandler (DIH) [2]. There exist other means, but these two should be sufficient to get started. How did you import your initial index in the first place?

[1] http://wiki.apache.org/solr/UpdateXmlMessages
[2] http://wiki.apache.org/solr/DataImportHandler
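Solr 1.4 has no single "reindex everything" command; the usual approach is to clear the index and repost all documents from the source system. A minimal sketch of the delete message, assuming the stock XML update handler (send it with an HTTP POST, e.g. curl with Content-Type: text/xml, then send <commit/>, then re-run the import, such as nutch's solrindex step):

```xml
<delete><query>*:*</query></delete>
```

After the commit the index is empty, so whatever feeds Solr (DIH, nutch, or the Drupal module) simply repopulates it from scratch.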
RE: Spell Checker
Like a charm Dan, like a charm. I'm going to write this up and post it on Drupal. Thanks a ton! I have a much better idea of Solr and "Did You Mean" spell checking now.

-----Original Message-----
From: Dan Lynn [mailto:d...@danlynn.com]
Sent: Tuesday, November 16, 2010 5:21 PM
To: solr-user@lucene.apache.org
Subject: Re: Spell Checker

See interjected responses below.

On 11/16/2010 06:14 PM, Eric Martin wrote:
Thanks Dan! Few questions:
"Use a copyField to divert your main text fields to the spell field and then configure your spell checker to use the spell field to derive the spelling index."

Right. A copyField just copies data from one field to another during the indexing process. You can copy one field to n other fields without affecting the original.

This will still keep my current copyField for the same data, right? I don't need to rebuild, just reindex.
"After this, you'll need to query a spellcheck-enabled handler with spellcheck.build=true or enable spellchecker index builds during optimize."

If you are using the default solrconfig.xml, a requestHandler should already be set up for you (and you shouldn't need a dedicated one for production: you can just embed the spell checker component in your default handler). Just query the example like this:

http://localhost:8983/solr/spell?q=ANYTHINGHERE&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

Note the spellcheck.build=true parameter.

Cheers,
Dan
http://twitter.com/danklynn

Totally lost on that. I will buy a book here shortly.

-----Original Message-----
From: Dan Lynn [mailto:d...@danlynn.com]
Sent: Tuesday, November 16, 2010 5:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Spell Checker

I had to deal with spellchecking today a bit. Make sure you are performing the analysis step at index-time, as such:

schema.xml:

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
</fieldType>

<fields>
  ...
  <field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
</fields>

From http://wiki.apache.org/solr/SpellCheckingAnalysis: use a copyField to divert your main text fields to the spell field and then configure your spell checker to use the spell field to derive the spelling index. After this, you'll need to query a spellcheck-enabled handler with spellcheck.build=true or enable spellchecker index builds during optimize.

Hope this helps,
Dan Lynn
http://twitter.com/danklynn

On 11/16/2010 05:45 PM, Eric Martin wrote:
Hi (again). I am looking at the spell checker options:

http://wiki.apache.org/solr/SpellCheckerRequestHandler#Term_Source_Configuration
http://wiki.apache.org/solr/SpellCheckComponent#Use_in_the_Solr_Example

I am looking in my solrconfig.xml and I see one is already in use. I am kind of confused by this because the recommended spell checker is not the default in my Solr 1.4.1. I have read the documentation but am still fuzzy on what I should do. My site uses legal terms and, as you can see, some terms don't jive with the default spell checker, so I was hoping to map the spell checker to the body for referencing dictionary words. I am unclear what approach I should take and how to start the quest.

Can someone clarify what I should be doing here? Am I on the right track? Eric
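For reference, the handler being queried in this thread is defined in solrconfig.xml. A minimal sketch of embedding the spell checker component in a handler, modeled on the Solr 1.4 example config (the textSpell type and spell field follow the schema snippet quoted in this thread; buildOnOptimize is the "build during optimize" option Dan mentions):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<requestHandler name="/spell" class="solr.SearchHandler">
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```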
Spell Checker
Hi (again). I am looking at the spell checker options:

http://wiki.apache.org/solr/SpellCheckerRequestHandler#Term_Source_Configuration
http://wiki.apache.org/solr/SpellCheckComponent#Use_in_the_Solr_Example

I am looking in my solrconfig.xml and I see one is already in use. I am kind of confused by this because the recommended spell checker is not the default in my Solr 1.4.1. I have read the documentation but am still fuzzy on what I should do. My site uses legal terms and, as you can see, some terms don't jive with the default spell checker, so I was hoping to map the spell checker to the body for referencing dictionary words. I am unclear what approach I should take and how to start the quest. Can someone clarify what I should be doing here? Am I on the right track? Eric
RE: Spell Checker
Thanks Dan! Few questions:

"Use a copyField to divert your main text fields to the spell field and then configure your spell checker to use the spell field to derive the spelling index."

This will still keep my current copyField for the same data, right? I don't need to rebuild, just reindex.

"After this, you'll need to query a spellcheck-enabled handler with spellcheck.build=true or enable spellchecker index builds during optimize."

Totally lost on that. I will buy a book here shortly.

-----Original Message-----
From: Dan Lynn [mailto:d...@danlynn.com]
Sent: Tuesday, November 16, 2010 5:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Spell Checker

I had to deal with spellchecking today a bit. Make sure you are performing the analysis step at index-time, as such:

schema.xml:

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
</fieldType>

<fields>
  ...
  <field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
</fields>

From http://wiki.apache.org/solr/SpellCheckingAnalysis: use a copyField to divert your main text fields to the spell field and then configure your spell checker to use the spell field to derive the spelling index. After this, you'll need to query a spellcheck-enabled handler with spellcheck.build=true or enable spellchecker index builds during optimize.

Hope this helps,
Dan Lynn
http://twitter.com/danklynn

On 11/16/2010 05:45 PM, Eric Martin wrote:
Hi (again). I am looking at the spell checker options:

http://wiki.apache.org/solr/SpellCheckerRequestHandler#Term_Source_Configuration
http://wiki.apache.org/solr/SpellCheckComponent#Use_in_the_Solr_Example

I am looking in my solrconfig.xml and I see one is already in use. I am kind of confused by this because the recommended spell checker is not the default in my Solr 1.4.1. I have read the documentation but am still fuzzy on what I should do. My site uses legal terms and, as you can see, some terms don't jive with the default spell checker, so I was hoping to map the spell checker to the body for referencing dictionary words. I am unclear what approach I should take and how to start the quest. Can someone clarify what I should be doing here? Am I on the right track? Eric
RE: Spell Checker
Ah, I thought I was going nuts. Thanks for clarifying about the wiki.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, November 16, 2010 5:11 PM
To: solr-user@lucene.apache.org
Subject: Re: Spell Checker

"Hi (again). I am looking at the spell checker options:
http://wiki.apache.org/solr/SpellCheckerRequestHandler#Term_Source_Configuration
http://wiki.apache.org/solr/SpellCheckComponent#Use_in_the_Solr_Example
I am looking in my solrconfig.xml and I see one is already in use. I am kind of confused by this because the recommended spell checker is not the default in my Solr 1.4.1. I have read the documentation but am still fuzzy on what I should do."

Yes, the wiki on the request handler can be confusing indeed, as it discusses the spellchecker as a request handler instead of a component. Usually, people need the spellchecker just as a component in some request handler rather than a request handler designed specifically for spellchecking only. I'd forget about that wiki page and just follow the spellcheck component wiki: it describes not only the request handler but also the component, and it is maintained up to the most recent developments in trunk and branch 3.1.

"My site uses legal terms and, as you can see, some terms don't jive with the default spell checker, so I was hoping to map the spell checker to the body for referencing dictionary words. I am unclear what approach I should take and how to start the quest."

Map the spellchecker to the body of what? I assume the body of your document, where the `main content` is stored. In that case, you'd just follow the wiki on the component: create a spellchecking fieldType with proper analyzers (the example is all right) and define a spellchecking field that has the spellcheck fieldType as its type (again, like in the example). Then you'll need to configure the spellchecking component in your solrconfig. The example is, again, what you're looking for.

All you need to map your document's main body to the spellchecker is a copyField directive in your schema, which will copy your body field to the spellcheck field (which has the spellcheck fieldType). The example on the component wiki page should work. Many features have been added since 1.4.x, but the examples should work as expected.

"Can someone clarify what I should be doing here? Am I on the right track? Eric"
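The copyField directive Markus describes is a single line in schema.xml. A sketch, assuming the main content field is named `body` (substitute your actual field name) and the spellcheck field is the `spell` field from the example:

```xml
<copyField source="body" dest="spell"/>
```

After adding it, a full reindex is needed so the spell field gets populated.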
RE: Spell Checker
Hi: OK, I made the changes and have the spell checker set to build on optimize. So I guess now I just reindex. I have to run to class now, so I can't check it for another 30 minutes. Cheers!

-----Original Message-----
From: Dan Lynn [mailto:d...@danlynn.com]
Sent: Tuesday, November 16, 2010 5:21 PM
To: solr-user@lucene.apache.org
Subject: Re: Spell Checker

See interjected responses below.

On 11/16/2010 06:14 PM, Eric Martin wrote:
Thanks Dan! Few questions:
"Use a copyField to divert your main text fields to the spell field and then configure your spell checker to use the spell field to derive the spelling index."

Right. A copyField just copies data from one field to another during the indexing process. You can copy one field to n other fields without affecting the original.

This will still keep my current copyField for the same data, right? I don't need to rebuild, just reindex.
"After this, you'll need to query a spellcheck-enabled handler with spellcheck.build=true or enable spellchecker index builds during optimize."

If you are using the default solrconfig.xml, a requestHandler should already be set up for you (and you shouldn't need a dedicated one for production: you can just embed the spell checker component in your default handler). Just query the example like this:

http://localhost:8983/solr/spell?q=ANYTHINGHERE&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

Note the spellcheck.build=true parameter.

Cheers,
Dan
http://twitter.com/danklynn

Totally lost on that. I will buy a book here shortly.

-----Original Message-----
From: Dan Lynn [mailto:d...@danlynn.com]
Sent: Tuesday, November 16, 2010 5:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Spell Checker

I had to deal with spellchecking today a bit. Make sure you are performing the analysis step at index-time, as such:

schema.xml:

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
</fieldType>

<fields>
  ...
  <field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
</fields>

From http://wiki.apache.org/solr/SpellCheckingAnalysis: use a copyField to divert your main text fields to the spell field and then configure your spell checker to use the spell field to derive the spelling index. After this, you'll need to query a spellcheck-enabled handler with spellcheck.build=true or enable spellchecker index builds during optimize.

Hope this helps,
Dan Lynn
http://twitter.com/danklynn

On 11/16/2010 05:45 PM, Eric Martin wrote:
Hi (again). I am looking at the spell checker options:

http://wiki.apache.org/solr/SpellCheckerRequestHandler#Term_Source_Configuration
http://wiki.apache.org/solr/SpellCheckComponent#Use_in_the_Solr_Example

I am looking in my solrconfig.xml and I see one is already in use. I am kind of confused by this because the recommended spell checker is not the default in my Solr 1.4.1. I have read the documentation but am still fuzzy on what I should do. My site uses legal terms and, as you can see, some terms don't jive with the default spell checker, so I was hoping to map the spell checker to the body for referencing dictionary words. I am unclear what approach I should take and how to start the quest.

Can someone clarify what I should be doing here? Am I on the right track? Eric
Error When Switching to Tomcat
Hi, I have been using Jetty on my linux/apache webserver for about 3 weeks now. I decided that I should change to Tomcat after realizing I will be indexing a lot of URLs, and Jetty is good for small production sites, as noted in the wiki. I am running into this error:

org.apache.solr.common.SolrException: Schema Parsing Failed
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:656)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
    at org.apache.solr.core.SolrCore.<init>

My localhost/solr.xml:

<Context docBase="/tomcat/webapps/solr.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/tomcat/webapps/solr/" override="true"/>
</Context>

My solrconfig.xml:

<dataDir>${solr.data.dir:/tomcat/webapps/solr/conf}</dataDir>

I can get to the 8080 Tomcat default page just fine. I've gone over the wiki a couple of dozen times and verified that my solr.xml is configured correctly based on trial and error and reading the error logs. I just can't figure out where it is going wrong. I read there are three different ways to do this. Can someone help me out? I am using Solr 1.4.0 and Tomcat 5.5.30. Eric
RE: Error When Switching to Tomcat
Hi, Thank you! I got it working after you jarred my brain. Of course, the location of the solr instance is arbitrary/logical to tomcat. Sheesh, I feel kind of small now. Anyway, I was able to clearly see my mistake from your information. As with all help I get from here, I posted my fix/walkthrough for others to see here: http://drupal.org/node/716632. Thanks a bunch! You helped me and anyone else coming to the Drupal site for help with Tomcat and Solr :-)

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Sunday, November 14, 2010 2:23 AM
To: solr-user@lucene.apache.org
Subject: Re: Error When Switching to Tomcat

Move the solr.war file and the solr home directory somewhere else, outside the tomcat webapps directory, like /home/foo. Tomcat will generate webapps/solr automatically. This is what I use, under catalinaHome/conf/Catalina/localhost/solr.xml:

<Context docBase="/home/foo/apache-solr-1.4.0.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/home/foo/SolrHome" override="true"/>
</Context>

I also delete the <dataDir>...</dataDir> entry from solrconfig.xml, so that the data dir is created under the solr home directory.

http://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat

"I have been using Jetty on my linux/apache webserver for about 3 weeks now. I decided that I should change to Tomcat after realizing I will be indexing a lot of URLs, and Jetty is good for small production sites, as noted in the wiki.

I am running into this error:

org.apache.solr.common.SolrException: Schema Parsing Failed
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:656)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
    at org.apache.solr.core.SolrCore.<init>

My localhost/solr.xml:

<Context docBase="/tomcat/webapps/solr.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/tomcat/webapps/solr/" override="true"/>
</Context>

My solrconfig.xml:

<dataDir>${solr.data.dir:/tomcat/webapps/solr/conf}</dataDir>

I can get to the 8080 Tomcat default page just fine. I've gone over the wiki a couple of dozen times and verified that my solr.xml is configured correctly based on trial and error and reading the error logs. I just can't figure out where it is going wrong. I read there are three different ways to do this. Can someone help me out?"
Search Result Differences a Puzzle
Hi, I cannot find out how this is occurring: nolosearch.com/search/apachesolr_search/law. You can see that the John Paul Stevens result yields more description in the search result because of the keyword relevancy, whereas the other results just give you a snippet of the title based on keywords found. I am trying to figure out how to get a standard-size search result no matter what the relevancy is. While this type of result would be irrelevant to many search engines, it is completely practical in a legal setting, as a keyword is only as good as how it is being referenced in the sentence or paragraph. What a dilemma I have! I have been trying to figure out if it is the actual schema.xml file or the solrconfig.xml file, and for the life of me I can't find it referenced anywhere. I tried changing the fragsize to 200 instead of the default of around 70; that didn't do any damage at re-index. This problem is super critical to my search results. Like I said, as an attorney, the word is superfluous until it is attached to a long sentence or two that shows whether the keyword we searched for is relevant, let alone worthy of a click. That is why my titles are set to open in a new window: faster access, and if the result is crud, just close the window and get back to research. Eric
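Snippet size is controlled by the highlighting parameters rather than by schema.xml; hl.fragsize only takes effect on queries that request highlighting, which is why changing it appears to do nothing at re-index time. A sketch of pinning the values in a handler's defaults in solrconfig.xml (the field name `body` and the sizes are illustrative assumptions):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">body</str>
    <!-- approximate snippet size in characters; 0 returns the whole field -->
    <str name="hl.fragsize">200</str>
    <!-- maximum number of snippets per document -->
    <str name="hl.snippets">2</str>
  </lst>
</requestHandler>
```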
RE: importing from java
http://wiki.apache.org/solr/DIHQuickStart
http://wiki.apache.org/solr/DataImportHandlerFaq
http://wiki.apache.org/solr/DataImportHandler

-----Original Message-----
From: Tri Nguyen [mailto:tringuye...@yahoo.com]
Sent: Thursday, November 11, 2010 9:34 PM
To: solr-user@lucene.apache.org
Subject: Re: importing from java

Another question is, can I write my own DataImportHandler class?

Thanks,
Tri

From: Tri Nguyen tringuye...@yahoo.com
To: solr user solr-user@lucene.apache.org
Sent: Thu, November 11, 2010 7:01:25 PM
Subject: importing from java

Hi, I'm restricted to the following in regards to importing. I have access to a list (Iterator) of Java objects I need to import into Solr. Can I import the Java objects as part of Solr's data import interface (whenever an HTTP request tells Solr to do a dataimport, it'll call my Java class to get objects)? Before, I had direct read-only access to the db, specified the column mappings, and things were fine with the data import. But now I am restricted to using a .jar file that has an API to get the records in the database, and I need to publish these records from the db. I do see SolrJ, but SolrJ is separate from the Solr webapp. Can I write my own DataImportHandler? Thanks, Tri
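The DIH is pluggable, so a custom class can wrap the .jar's API and hand rows to Solr. A sketch of a data-config.xml naming a hypothetical processor (com.example.JarApiEntityProcessor is an assumed name; such a class would extend Solr's EntityProcessorBase, pull objects from the Iterator, and return one field map per record from nextRow()):

```xml
<dataConfig>
  <document>
    <!-- processor: fully qualified name of your own class, placed on Solr's classpath -->
    <entity name="records" processor="com.example.JarApiEntityProcessor">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>
```

Alternatively, since the .jar API is already plain Java, iterating the objects and posting them with SolrJ from a small standalone program is often simpler than wiring up a custom DIH class.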
RE: solr init.d script
Er, what flavor?

RHEL / CentOS:

#!/bin/sh
# Starts, stops, and restarts Apache Solr.
#
# chkconfig: 35 92 08
# description: Starts and stops Apache Solr

SOLR_DIR=/var/solr
JAVA_OPTIONS="-Xmx1024m -DSTOP.PORT=8079 -DSTOP.KEY=mustard -jar start.jar"
LOG_FILE=/var/log/solr.log
JAVA=/usr/bin/java

case $1 in
    start)
        echo "Starting Solr"
        cd $SOLR_DIR
        $JAVA $JAVA_OPTIONS 2> $LOG_FILE &
        ;;
    stop)
        echo "Stopping Solr"
        cd $SOLR_DIR
        $JAVA $JAVA_OPTIONS --stop
        ;;
    restart)
        $0 stop
        sleep 1
        $0 start
        ;;
    *)
        echo "Usage: $0 {start|stop|restart}" >&2
        exit 1
        ;;
esac

Debian:

http://xdeb.org/node/1213

Ubuntu steps:

1. Type the following command in a terminal to install the nano text editor:
   sudo apt-get install nano
2. Type the following command in a terminal to add a new script:
   sudo nano /etc/init.d/solr
3. The terminal will display a new page titled "GNU nano 2.0.x". Paste the script above into this window, with SOLR_DIR=/apache-solr-1.4.0/example and -DSTOP.KEY=stopkey. Note: you might have to replace /apache-solr-1.4.0/example with the appropriate directory name.
4. Press CTRL-X, type Y, and when asked for the file name to write, press ENTER. You're now back at the command line.
5. Type the following command to create all the links to the script:
   sudo update-rc.d solr defaults
6. Type the following command to make the script executable:
   sudo chmod a+rx /etc/init.d/solr
7. To test: reboot your Ubuntu server, wait until the reboot is complete, wait 2 minutes for Apache Solr to start up, then go to your website in a browser and try a Solr search.
-Original Message- From: Nikola Garafolic [mailto:nikola.garafo...@srce.hr] Sent: Monday, November 08, 2010 11:42 PM To: solr-user@lucene.apache.org Subject: solr init.d script Hi, Does anyone have some kind of init.d script for solr, that can start, stop and check solr status? -- Nikola Garafolic SRCE, Sveucilisni racunski centar tel: +385 1 6165 804 email: nikola.garafo...@srce.hr
RE: Removing irrelevant URLS
OK, thanks. I am using nutch and figuring out how to use urlfilters, unsuccessfully. Just thought there might be a way I could save some trouble this way. Thanks!

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Sunday, November 07, 2010 8:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Removing irrelevant URLS

You can always do a delete-by-query, but that pre-supposes you can form a query that would remove only those documents with URLs you want removed... Assuming you do this, an optimize would then physically remove the documents from your index (delete-by-query just marks the docs as deleted). Solr has nothing specifically for URLs; it's an engine rather than a web crawling app.

Best,
Erick

On Fri, Nov 5, 2010 at 4:33 PM, Eric Martin e...@makethembite.com wrote:

Hi, I have 100k URLs in my index. I specifically crawled sites relating to law. However, during my initial crawls I didn't specify urlfilters, so I am stuck with extrinsic and often irrelevant URLs like twitter, etc. Is there some way in Solr that I can run periodic URL cleanings to remove URLs and search string results? Or should I just dump my index and rebuild using the filter? I have looked on the Solr wiki and came across some candidates that look like what I am trying to accomplish, but am not sure. If anyone knows where I should be looking, I would appreciate it. Eric
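Erick's delete-by-query is an XML message posted to the update handler, followed by a commit and an optimize to physically reclaim the deleted documents. A sketch, assuming the stock nutch schema, which indexes the hostname in a `site` field (check your own schema's field names first):

```xml
<delete><query>site:twitter.com</query></delete>
```

Then post <commit/> and <optimize/>; repeat the delete for each unwanted host.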
Adding Carrot2
Hi, Solr and nutch have been working fine. I now want to integrate Carrot2. I followed this tutorial/quickstart:

http://www.lucidimagination.com/blog/2009/09/28/solrs-new-clustering-capabilities/

I didn't see anything to adjust in my schema, so I didn't do anything there. I did add the code to the solrconfig.xml, though. I am getting this when I start Solr now:

Command: java -Dsolr.clustering.enabled=true -jar start.jar

Nov 7, 2010 11:35:16 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [solrconfig.xml] requestHandler: missing mandatory attribute 'class'

Anyone run into issues with Carrot2? Eric
RE: Adding Carrot2
Yeah, I know; you have to download the libraries and copy them to your /lib inside of Solr. In Solr 1.4 the plugin is available but the libraries are not.

http://www.lucidimagination.com/blog/2009/09/28/solrs-new-clustering-capabilities/

I think there is something wrong with the schema and solrconfig (xml's) integration. Some documentation on Apache says it's already written into the xml and some says it's not. Searching the xml's in Solr, I find no reference to clustering. Now that I think about it, I copied over the solrconfig.xml and schema.xml with my Drupal/ApacheSolr xml's. I think I may have answered my own question as to why the clustering isn't running correctly. I will go get a copy of the default xml's, and if I find it there, I will try to merge them. Does this sound like I am on the right path now?

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Sunday, November 07, 2010 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Adding Carrot2

Carrot is already part of the Solr distributions: 1.4.1, 3.x, and the trunk.

On 11/7/10, Eric Martin e...@makethembite.com wrote:

Hi, Solr and nutch have been working fine. I now want to integrate Carrot2. I followed this tutorial/quickstart:

http://www.lucidimagination.com/blog/2009/09/28/solrs-new-clustering-capabilities/

I didn't see anything to adjust in my schema, so I didn't do anything there. I did add the code to the solrconfig.xml, though. I am getting this when I start Solr now:

Command: java -Dsolr.clustering.enabled=true -jar start.jar

Nov 7, 2010 11:35:16 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [solrconfig.xml] requestHandler: missing mandatory attribute 'class'

Anyone run into issues with Carrot2? Eric

--
Lance Norskog
goks...@gmail.com
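The "missing mandatory attribute 'class'" error means some <requestHandler> element in the merged solrconfig.xml lost its class attribute. For comparison, a sketch of the clustering wiring as it appears in the Solr 1.4 example solrconfig (names and classes follow that example; verify against your copy):

```xml
<searchComponent name="clustering" enable="${solr.clustering.enabled:false}"
                 class="org.apache.solr.handler.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" enable="${solr.clustering.enabled:false}"
                class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <str name="clustering.engine">default</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```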
Removing irrelevant URLS
Hi, I have 100k URLs in my index. I specifically crawled sites relating to law. However, during my initial crawls I didn't specify urlfilters, so I am stuck with extrinsic and often irrelevant URLs like twitter, etc. Is there some way in Solr that I can run periodic URL cleanings to remove URLs and search string results? Or should I just dump my index and rebuild using the filter? I have looked on the Solr wiki and came across some candidates that look like what I am trying to accomplish, but am not sure. If anyone knows where I should be looking, I would appreciate it. Eric
RE: Solr in virtual host as opposed to /lib
I was speaking about apache virtual hosts. I was concerned that there was increased processing time due to the solr and nutch instances being housed inside a virtual host as opposed to being dropped in the root of my distro. Thank you for the astute clarification.

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Monday, November 01, 2010 9:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib

I think you guys are talking about two different kinds of 'virtual hosts'. Lance is talking about CPU virtualization. Eric appears to be talking about apache virtual web hosts, although Eric hasn't told us how apache is involved in his setup in the first place, so it's unclear. Assuming you are using apache to reverse proxy to Solr, there is no reason I can think of that your front-end apache setup would affect CPU utilization by Solr, let alone by nutch.

Eric Martin wrote:
Oh. So I should take out the installations and move them to /some_dir as opposed to inside my virtual host of /home/my solr nutch is here/www?

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Sunday, October 31, 2010 7:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib

With virtual hosting you can give CPU and memory quotas to your different VMs. This allows you to control the Nutch-vs.-the-world problem. Unfortunately, you cannot allocate disk channel; with two I/O-bound apps, this is a problem.

On Sun, Oct 31, 2010 at 4:38 PM, Eric Martin e...@makethembite.com wrote:

Excellent information, thank you. Solr is acting just fine, then: I can connect to it with no issues, it indexes fine, and there didn't seem to be any complication with it. Now I can rule it out and go about solving what you pointed out, and I agree, to be a java/nutch issue. Nutch is a crawler I use to feed URLs into Solr for indexing; it is open source and found on apache.org. Thanks for your time.
-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Sunday, October 31, 2010 4:33 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

What servlet container are you putting your Solr in? Jetty? Tomcat? Something else? Are you fronting it with apache on top of that? (I think maybe you are, otherwise I'm not sure how the phrase 'virtual host' applies.)

In general, Solr of course doesn't care what directory it's in on disk, so long as the process running Solr has the necessary read/write permissions to the necessary directories (and if it doesn't, you'd usually find out right away with an error message). And clients of Solr don't care what directory it's in on disk either; they only care that they can get to it by connecting to a certain port at a certain hostname. In general, if they can't get to it on a certain port at a certain hostname, that's something you'd discover right away, not something that would be intermittent. But I'm not familiar with nutch; you may want to try connecting to the port you have Solr running on (the hostname/port you have told nutch to find Solr on?) yourself manually, and just make sure it is connectable.

I can't think of any reason that what directory you have Solr in could cause CPU utilization issues. I think it's got nothing to do with that. I am not familiar with nutch; if it's nutch that's taking 100% of your CPU, you might want to find some nutch experts to ask. Perhaps there's a nutch listserv? I am also not familiar with hadoop; you mention in passing that you're using hadoop too, so maybe that's an added complication, I don't know. One obvious reason nutch could be taking 100% CPU would be simply because you've asked it to do a lot of work quickly, and it's trying to.

One reason I have seen Solr take 100% of CPU and become unresponsive is when the Solr process gets caught up in terrible Java garbage collection.
If that's what's happening, then giving the Solr JVM a higher maximum heap size can sometimes help (although confusingly, I've seen people suggest that if you give the Solr JVM too MUCH heap it can also result in long GC pauses), and if you have a multi-core/multi-CPU machine, I've found the JVM argument -XX:+UseConcMarkSweepGC to be very helpful. Other than that, it sounds to me like you've got a nutch/hadoop issue, not a Solr issue. From: Eric Martin [e...@makethembite.com] Sent: Sunday, October 31, 2010 7:16 PM To: solr-user@lucene.apache.org Subject: RE: Solr in virtual host as opposed to /lib Hi, Thank you. This is more than idle curiosity. I am trying to debug an issue I am having with my installation and this is one step in verifying that I have a setup that does not consume resources. I am trying to debunk my internal myth that having Solr nad Nutch in a virtual host would be causing these issues
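The GC advice above can be turned into concrete startup flags. A minimal sketch for a Jetty-based Solr 1.4 launch follows; the heap sizes are illustrative assumptions, not recommendations, so tune them to your own machine and index:

```shell
# Sketch of a Solr 1.4 / Jetty launch with the flags discussed above.
# -Xms/-Xmx set the initial and maximum heap; raising -Xmx can relieve
# GC thrashing, but an oversized heap can lengthen individual pauses.
# -XX:+UseConcMarkSweepGC enables the concurrent mark-sweep collector,
# which shortens stop-the-world pauses on multi-core machines.
java -Xms1g -Xmx1g \
     -XX:+UseConcMarkSweepGC \
     -jar start.jar
```

You can confirm which flags a running instance actually picked up with `ps aux | grep java`, the same way the Nutch command line shows up in the CPU utilization listing elsewhere in this thread.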
RE: Solr in virtual host as opposed to /lib
I don't think you read the entire thread. I'm assuming you made a mistake.

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Monday, November 01, 2010 11:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib

: References: aanlktimvv5foc2b=gxo+xs1zwgps9o5t5jorwv3id...@mail.gmail.com
: aanlktim30aat8s0nxq_8utxcokv8myyabz8wtxeyl...@mail.gmail.com
: aanlktimpo9v_krgaxomd4hocqabibgzdhc+jhhgsq...@mail.gmail.com
: aanlktimdvaawj7=b7=pgu+rzm+nobvzdfh4o39nkp...@mail.gmail.com
: aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
: In-Reply-To: aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
: Subject: Solr in virtual host as opposed to /lib

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult.

See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

-Hoss
Solr in virtual host as opposed to /lib
Is there an issue running Solr in /home/lib as opposed to running it somewhere outside of the virtual hosts like /lib? Eric
RE: Solr in virtual host as opposed to /lib
fetcher.Fetcher - -activeThreads=50, spinWaiting=49, fetchQueues.totalSize=2500
2010-10-31 15:44:21,360 INFO fetcher.Fetcher - -activeThreads=50, spinWaiting=49, fetchQueues.totalSize=2500

Can anyone help me out? Did I miss something? Should I be using Tomcat? One interesting part of this is that when I try to change the Nutch settings "post url" and "urls by score" to 1, they stay at 10 no matter what I do.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Sunday, October 31, 2010 4:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib

Can you expand on your question? Are you having a problem? Is this idle curiosity? I have no idea how to respond when there is so little information.

Best,
Erick

On Sun, Oct 31, 2010 at 5:32 PM, Eric Martin e...@makethembite.com wrote:
Is there an issue running Solr in /home/lib as opposed to running it somewhere outside of the virtual hosts, like /lib?

Eric
RE: Solr in virtual host as opposed to /lib
Excellent information. Thank you. Solr is acting just fine then. I can connect to it with no issues, it indexes fine, and there didn't seem to be any complication with it. Now I can rule it out and go about solving what you pointed out, and I agree it is a Java/Nutch issue. Nutch is a crawler I use to feed URLs into Solr for indexing. Nutch is open source and found on apache.org. Thanks for your time.

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Sunday, October 31, 2010 4:33 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

What servlet container are you putting your Solr in? Jetty? Tomcat? Something else? Are you fronting it with Apache on top of that? (I think maybe you are; otherwise I'm not sure how the phrase 'virtual host' applies.)

In general, Solr of course doesn't care what directory it's in on disk, so long as the process running Solr has the necessary read/write permissions to the necessary directories (and if it doesn't, you'd usually find out right away with an error message). And clients to Solr don't care what directory it's in on disk either; they only care that they can connect to it on a certain port at a certain hostname. In general, if they can't reach it on a certain port at a certain hostname, that's something you'd discover right away, not something that would be intermittent. But I'm not familiar with Nutch; you may want to try connecting to the port you have Solr running on (the hostname/port you have told Nutch to find Solr on?) yourself manually, and just make sure it is connectable.

I can't think of any reason that what directory you have Solr in could cause CPU utilization issues. I think it's got nothing to do with that. I am not familiar with Nutch; if it's Nutch that's taking 100% of your CPU, you might want to find some Nutch experts to ask. Perhaps there's a Nutch listserv? I am also not familiar with Hadoop; you mention just in passing that you're using Hadoop too, so maybe that's an added complication, I don't know. One obvious reason Nutch could be taking 100% CPU would be simply because you've asked it to do a lot of work quickly, and it's trying to.

One reason I have seen Solr take 100% of CPU and become unresponsive is when the Solr process gets caught up in terrible Java garbage collection. If that's what's happening, then giving the Solr JVM a higher maximum heap size can sometimes help (although, confusingly, I've seen people suggest that if you give the Solr JVM too MUCH heap it can also result in long GC pauses), and if you have a multi-core/multi-CPU machine, I've found the JVM argument -XX:+UseConcMarkSweepGC to be very helpful. Other than that, it sounds to me like you've got a Nutch/Hadoop issue, not a Solr issue.

From: Eric Martin [e...@makethembite.com]
Sent: Sunday, October 31, 2010 7:16 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

Hi, Thank you. This is more than idle curiosity. I am trying to debug an issue I am having with my installation, and this is one step in verifying that I have a setup that does not consume resources. I am trying to debunk my internal myth that having Solr and Nutch in a virtual host would be causing these issues.

Here is the main issue that involves Nutch/Solr and Drupal:
/home/mootlaw/lib/solr
/home/mootlaw/lib/nutch
/home/mootlaw/www/Drupal site

I'm running a 1333 FSB dual-socket Xeon 5500 series @ 2.4 GHz, Enterprise Linux x86_64, 12 GB RAM. My Solr and Nutch are running. I am using Jetty for my Solr. My server is not rooted. Nutch is using 100% of my CPUs.
I see this in my CPU utilization in my WHM:

/usr/bin/java -Xmx1000m -Dhadoop.log.dir=/home/mootlaw/lib/nutch/logs -Dhadoop.log.file=hadoop.log -Djava.library.path=/home/mootlaw/lib/nutch/lib/native/Linux-amd64-64 -classpath /home/mootlaw/lib/nutch/conf:/usr/lib/tools.jar:/home/mootlaw/lib/nutch/build:/home/mootlaw/lib/nutch/build/test/classes:/home/mootlaw/lib/nutch/build/nutch-1.2.job:/home/mootlaw/lib/nutch/nutch-*.job:/home/mootlaw/lib/nutch/lib/apache-solr-core-1.4.0.jar:/home/mootlaw/lib/nutch/lib/apache-solr-solrj-1.4.0.jar:/home/mootlaw/lib/nutch/lib/commons-beanutils-1.8.0.jar:/home/mootlaw/lib/nutch/lib/commons-cli-1.2.jar:/home/mootlaw/lib/nutch/lib/commons-codec-1.3.jar:/home/mootlaw/lib/nutch/lib/commons-collections-3.2.1.jar:/home/mootlaw/lib/nutch/lib/commons-el-1.0.jar:/home/mootlaw/lib/nutch/lib/commons-httpclient-3.1.jar:/home/mootlaw/lib/nutch/lib/commons-io-1.4.jar:/home/mootlaw/lib/nutch/lib/commons-lang-2.1.jar:/home/mootlaw/lib/nutch/lib/commons-logging-1.0.4.jar:/home/mootlaw/lib/nutch/lib/commons-logging-api-1.0.4.jar:/home/mootlaw/lib/nutch/lib/commons-net-1.4.1.jar:/home/mootlaw/lib/nutch/lib/core-3.1.1.jar:/home/mootlaw/lib/nutch/lib/geronimo-stax-api_1.0_spec-1.0.1.jar:/home/mootlaw/lib/nutch/lib/hadoop-0.20.2-core.jar:/home/mootlaw/lib
RE: Solr in virtual host as opposed to /lib
Oh. So I should take out the installations and move them to /some_dir as opposed to inside my virtual host of /home/my solr nutch is here/www?

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Sunday, October 31, 2010 7:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib

With virtual hosting you can give CPU and memory quotas to your different VMs. This allows you to control the Nutch-vs.-the-world problem. Unfortunately, you cannot allocate disk channel; with two I/O-bound apps, this is a problem.

On Sun, Oct 31, 2010 at 4:38 PM, Eric Martin e...@makethembite.com wrote:
Excellent information. Thank you. Solr is acting just fine then. I can connect to it with no issues, it indexes fine, and there didn't seem to be any complication with it. Now I can rule it out and go about solving what you pointed out, and I agree it is a Java/Nutch issue. Nutch is a crawler I use to feed URLs into Solr for indexing. Nutch is open source and found on apache.org. Thanks for your time.
Basic Document Question
Hi everyone, I'm new, which won't be hard to figure out after I ask this question. I use Drupal/Solr/Nutch with this schema: http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup

Solr specific: How do I re-index for specific content only? I am starting a legal index specifically geared for law students and lawyers. I am crawling law-related sites, but I really don't want to index law firms, just the law content on places like:

http://www.ecasebriefs.com/blog/law/
http://www.lawnix.com/cases/cases-index/
http://www.oyez.org/
http://www.4lawnotes.com/
http://www.docstoc.com/documents/education/law-school/case-briefs
http://www.lawschoolcasebriefs.com/
http://dictionary.findlaw.com/

As I was saying, while crawling I get all kinds of extrinsic information put into the Solr index. How do I combat that? I am assuming (cough) that I can do this, but I am really at a loss as to where to start looking to get this done. I prefer to learn, and I definitely don't want to waste anyone's time.

Non-Solr specific: Does anyone here help with Nutch, or is this Solr only? I am sorry if I am asking elementary questions and am asking in the wrong place. I just need to be pointed to the right place. I'm sort of lost. (Imagine that.)

Thanks,
Eric
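For keeping law-firm pages out of the crawl in the first place, Nutch decides what to fetch via regex rules in its conf/regex-urlfilter.txt: each line is a + (include) or - (exclude) followed by a regex, and the first matching rule wins. A small Python sketch of that matching logic, with hypothetical patterns for a couple of the sites above, can help when drafting and sanity-checking your own rules:

```python
import re

# Rules mimic Nutch's conf/regex-urlfilter.txt semantics:
# (sign, regex) pairs tried in order; the FIRST match decides,
# and a URL matching no rule is excluded.
# These patterns are illustrative only, not a tested crawl config.
RULES = [
    ("+", r"^https?://www\.lawnix\.com/cases/"),    # keep case-index pages
    ("+", r"^https?://dictionary\.findlaw\.com/"),  # keep the legal dictionary
    ("-", r"."),                                    # exclude everything else
]

def accepts(url: str) -> bool:
    """Return True if the first rule matching `url` is an include (+) rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # Nutch's default when no rule matches
```

With rules like these, Nutch would fetch http://www.lawnix.com/cases/cases-index/ but skip an unlisted law-firm site, since the catch-all `-.` rule excludes anything the earlier + rules don't claim.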
RE: Does anyone notice this site?
This is not legal advice; take it as it is, just off the top of my head and what I know. I did not research this, but could if Solr wants me to.

From a marketing standpoint, probably. From a legal standpoint: they can do whatever they want with the name Solr so long as they maintain a distance between any trademarked name and the fundamental use of the trademark, unless there is a substantial connection between the trademarked name and recognition. Of course, that is to be determined by a few factors: length in business, trademarks carried, and whether or not the trademark holder makes a claim (not making a claim limits your recovery substantially and may even nullify it). They are also in South Africa, so throw in international law. Of course, you also have fair-use law, and this can get tricky. Here is an example: myspace.com and moremyspace.com. If moremyspace.com is used as a social networking site, then MySpace has a claim. If it is used as a social networking site in parody, then MySpace has no legal claim whatsoever. Another example is booble.com (not a work-safe link). That case lasted many years, and Google lost. Trademarks are a very tricky business and one that I will never practice.

Anyway, seeing as how they are making a search engine, they are using a lower-level FQDN, and they have not made a dent in the industry, it would be futile to do anything but send them an email laying claim to the name Solr. *If you do not send them a letter/email laying claim to Solr, you will lose your right to fight that battle with IANA, etc., or the ability to seek legal remedy.*

Eric
Law Student - Second Year

-----Original Message-----
From: scott chu [mailto:scott@udngroup.com]
Sent: Monday, October 25, 2010 9:55 AM
To: solr-user@lucene.apache.org
Subject: Does anyone notice this site?

I happened to bump into this site: http://www.solr.biz/ They said they are also developing a search engine? Is there any connection to the open source Solr?
Integrating Carrot2/Solr Default Example
Hello, and welcome to all. I am a very basic user with limited knowledge. I read the documentation, and I have an 'example' Solr installation working on my server. I have Drupal 6, and I have Drupal using Solr (apachesolr) as its default search engine. I have one document in the database that is searchable for testing purposes.

I would like to know: if I am using all default paths in my Solr installation, how do I enable Carrot2? Once enabled, how do I verify that it is clustering properly?

Carrot2 doc I read: http://download.carrot2.org/head/manual/index.html#chapter.application-suite
Solr clustering wiki I read: http://wiki.apache.org/solr/ClusteringComponent

I know this is really basic stuff, and I really appreciate the help. I fumbled my way through installing Solr on my own, setting up Drupal, etc. I am a former Natural V2 3270 programmer (basic flat-file OO) and have limited experience in PHP, Java, Jetty, etc. However, I can read code, decipher what it is doing, find a solution, and then implement it. I just really have no foundation for Carrot2/Solr yet. Any help, pointers, and "look here"s would be very much appreciated.

Eric Martin
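In Solr 1.4, Carrot2 clustering ships as a contrib that is wired up in solrconfig.xml. A rough sketch of the relevant sections, modeled on the example configuration the ClusteringComponent wiki page describes (the field names `title` and `body` below are placeholders, not Drupal's actual schema; point them at whatever fields your apachesolr schema really defines):

```xml
<!-- Sketch only, adapted from the Solr 1.4 example solrconfig.xml.
     Requires the clustering contrib jars on the classpath and, in the
     example setup, starting Solr with -Dsolr.clustering.enabled=true. -->
<searchComponent name="clustering"
                 enable="${solr.clustering.enabled:false}"
                 class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>
    <!-- Placeholder field names: replace with your schema's fields. -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">body</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```

To verify clustering is working, a query against the handler (for example http://localhost:8983/solr/clustering?q=*:*&rows=10) should return a `clusters` section after the normal results; with only one test document indexed, expect at most a single trivial cluster, so indexing a handful of documents first gives a more meaningful check.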