Re: Questions about upgrade to Nutch 1.3
Thanks for replying! I do still have a couple of questions:

Markus Jelsma <markus.jel...@openindex.io> wrote on 6/20/2011 11:34 AM:
> On Monday 20 June 2011 16:44:13 Chip Calhoun wrote:
> > Hi everyone, I'm a complete Nutch newbie. I installed Nutch 1.2 and Solr
> > 1.4.0 on my machine without any trouble. I've decided to try Nutch 1.3
> > as it's compatible with Solr 3.1.0, which includes Solritas. I hope you
> > can help with some problems I'm having.
>
> Solr 1.4.x has Velocity as a contrib.

Does it? Under 1.4.0 I could never get http://localhost:8983/solr/browse to work. I thought this was only added later.

> > I get an error saying solrurl is not set. This seems to be new to Nutch
> > 1.3. Where do I set this?
>
> According to the source you're using the crawl command.
> Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]

Thanks, I hadn't known about the solrURL argument at all. So would a valid usage be:

bin/nutch crawl urls -solr http://127.0.0.1:8983 -dir solrcrawl -depth 10 -topN 50

With the new solrURL argument, are there any steps I need to do after my crawl to get my content into Solr?

Thanks!
Re: Questions about upgrade to Nutch 1.3
Ahh, thanks again. Based on your advice, I'm going back to Nutch 1.2 / Solr 1.4 and adding the Velocity contrib. Once I get that working, I'll try Nutch 1.3 again.

When I try to use Velocity now, I get this message:

java.lang.RuntimeException: Can't find resource 'velocity.properties' in classpath or 'solr/conf/', cwd=C:\apache\apache-solr-1.4.0\example

This is despite velocity.properties very definitely being in my C:\apache\apache-solr-1.4.0\example\solr\conf directory. But I've veered completely into Solr territory now, so I guess that's off-topic.

Markus Jelsma <markus.jel...@openindex.io> wrote on 6/20/2011 12:43 PM:
> On Monday 20 June 2011 18:35:36 Chip Calhoun wrote:
> > Does it? Under 1.4.0 I could never get http://localhost:8983/solr/browse
> > to work. I thought this was only added later.
>
> Libs must be added manually from contrib, but it is shipped.
>
> > Thanks, I hadn't known about the solrURL argument at all. So would a
> > valid usage be:
> > bin/nutch crawl urls -solr http://127.0.0.1:8983 -dir solrcrawl -depth 10 -topN 50
> > With the new solrURL argument, are there any steps I need to do after my
> > crawl to get my content into Solr?
>
> I think so, but I don't use it. Please try.
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
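For later readers: with the 1.3 crawl command the -solr argument pushes documents to Solr as part of the crawl, so no separate step should be needed, but the Solr URL normally includes the /solr context path. A minimal sketch, assuming the stock Solr example server; run bin/nutch solrindex with no arguments to confirm the exact usage string in your release before trusting the second command:

bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -dir solrcrawl -depth 10 -topN 50

# to (re)index an existing crawl afterwards
bin/nutch solrindex http://127.0.0.1:8983/solr/ solrcrawl/crawldb solrcrawl/linkdb solrcrawl/segments/*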
Deploying the web application in Nutch 1.2
I'm a newbie trying to set up a Nutch 1.2 web app, because it seems a bit better suited to my smallish site than the Nutch 1.3 / Solr connection. I'm going through the tutorial at http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine , and I've hit the following instruction:

Deploy the Nutch web application as the ROOT context

I'm not sure what I'm meant to do here. I get the idea that I'm supposed to replace the current contents of $CATALINA_HOME/webapps/ROOT/ with something from my Nutch directory, but I don't know which part of my Nutch directory I'm supposed to move. Can someone please explain what I need to move?

Thanks,
Chip
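For anyone hitting the same step: the piece to move is the nutch-1.2.war that ships in the release directory, and deploying it "as the ROOT context" amounts to a rename. A minimal sketch, assuming a standard Tomcat layout:

rm -rf $CATALINA_HOME/webapps/ROOT
cp nutch-1.2.war $CATALINA_HOME/webapps/ROOT.war
# Tomcat expands ROOT.war into webapps/ROOT/ on startup, so the search
# page is then served at http://localhost:8080/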
RE: Deploying the web application in Nutch 1.2
You've gotten me very close to a breakthrough. I've started over, and I've found that if I don't make any edits to nutch-site.xml, I get a working Nutch web app; I have no index and all of my searches fail, but I have Nutch. When I add my crawl location to nutch-site.xml and restart Tomcat, that's when I start getting the 404 with the "The requested resource () is not available" message. Clearly I'm doing something wrong when I edit nutch-site.xml. I'm going to paste the entire contents of my nutch-site.xml. Where am I screwing this up? Thanks for your help on this.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>C:/Apache/apache-nutch-1.2/crawl</value>
  </property>
</configuration>

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Thursday, July 14, 2011 5:38 PM
To: user@nutch.apache.org
Subject: Re: Deploying the web application in Nutch 1.2

On Thu, Jul 14, 2011 at 8:01 PM, Chip Calhoun <ccalh...@aip.org> wrote:
> Thanks Lewis. I'm still having trouble. I've moved the war file to
> $CATALINA_HOME/webapps/nutch/ and unpacked it. I don't seem to have a
> catalina.sh file, so I've skipped that step.

From memory the catalina.sh file is used to start your Tomcat server instance... this has nothing to do with Nutch. Regardless of what kind of WAR files you have in your Tomcat webapps directory, starting your Tomcat server from the command line should be the same...

> And I've added the following to
> C:\Apache\Tomcat-5.5\webapps\nutch\WEB-INF\classes\nutch-site.xml :

As far as I can remember nutch-site.xml is already there; however, you need to specify various property values after it has been uploaded the first time. After rebooting Tomcat all of your property settings will be active.

> <property>
>   <name>searcher.dir</name>
>   <value>C:\Apache\apache-nutch-1.2\crawl</value>
>   <!-- There must be a crawl/index directory to run off -->
> </property>

Looks fine; however, please remove the <!-- ... --> comment as it is not required.

> However, when I go to http://localhost:8080/nutch/ I always get a 404
> with the message, "The requested resource () is not available." What am
> I missing?

As I said, the name of the WAR file needs to be identical to the webapp you specify in the Tomcat URL... can you confirm this? There should really be no problem starting up the Nutch web app if you follow the tutorial carefully.

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Thursday, July 14, 2011 5:40 AM
To: user@nutch.apache.org
Subject: Re: Deploying the web application in Nutch 1.2

Hi Chip,

Please see this tutorial for 1.2 administration [1]; many people have been using it recently and as far as I'm aware it is working perfectly. Please post back if you have any troubles.

[1] http://wiki.apache.org/nutch/NutchTutorial

--
Lewis
RE: Deploying the web application in Nutch 1.2
I'm definitely changing the file in my webapp. I can tell I'm doing that much right because it makes a noticeable change to the function of my web app; unfortunately, the change is that it seems to break everything. I've tried playing with the actual value for this, but with no success. In the tutorial's example, <value>/somewhere/crawl</value>, what is that relative to? Where would that hypothetical /somewhere/ directory be, relative to $CATALINA_HOME/webapps/? It feels like this is my problem, because I can't think of anything else it could be.

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Friday, July 15, 2011 3:19 PM
To: user@nutch.apache.org
Subject: Re: Deploying the web application in Nutch 1.2

Are you adding this to nutch-site.xml within your webapp, or just in your root Nutch installation? This needs to be included in your webapp version of nutch-site.xml. In my experience this was a small case of confusion at first.
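For what it's worth, searcher.dir expects a filesystem path to the directory holding the crawldb, linkdb, segments, and index that the crawl produced; it is not resolved relative to $CATALINA_HOME/webapps/, so an absolute path is the safe choice. A sketch, assuming the crawl lives under the Nutch install:

<property>
  <name>searcher.dir</name>
  <!-- absolute path; forward slashes also work on Windows -->
  <value>C:/Apache/apache-nutch-1.2/crawl</value>
</property>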
RE: Deploying the web application in Nutch 1.2
Success! I'm posting this not because I need further help, but in case someone with a similar issue finds this in the list archives.

First: I now know that if I make no changes to nutch-site.xml, Nutch will expect my crawl directory to be C:\Apache\Tomcat-5.5\crawl . So now I know that much.

Second: for some reason, when I add the searcher.dir property to nutch-site.xml it causes a "SEVERE: Error listenerStart" issue. The obvious solution for me is to just stop editing nutch-site.xml and live with /crawl/ being in my main Tomcat folder. Whatever's causing this listenerStart issue when I play with this on my own machine may very well not come up when I put this on the production server, so I'm not going to waste any time on it.

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Friday, July 15, 2011 3:32 PM
To: user@nutch.apache.org
Subject: Re: Deploying the web application in Nutch 1.2

As a resource it would be wise to have a look at the list archives for an exact answer to this. Take a look at your catalina.out logs for more verbose info on where the error is. It has been a while since I have configured this now; sorry I can't be of more help in giving a definite answer.
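A note for anyone who does want to chase the listenerStart error: it usually just means an exception was thrown while the webapp's context listener read its configuration (a stray character in the edited nutch-site.xml is enough), and Tomcat hides the underlying stack trace by default. The usual trick for surfacing it, sketched under the assumption that Tomcat 5.5's JULI logging is in use, is to drop a logging.properties into the webapp:

# C:\Apache\Tomcat-5.5\webapps\nutch\WEB-INF\classes\logging.properties
handlers = org.apache.juli.FileHandler, java.util.logging.ConsoleHandler
org.apache.juli.FileHandler.level = FINE
org.apache.juli.FileHandler.directory = ${catalina.base}/logs
org.apache.juli.FileHandler.prefix = nutch.
java.util.logging.ConsoleHandler.level = FINE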
Nutch not indexing full collection
Hi,

I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to crawl the entire thing. I'm probably missing something simple, so I hope somebody can help me.

My urls/nutch file contains a single URL: http://www.aip.org/history/ohilist/transcripts.html , which is an alphabetical listing of other pages. It looks like the indexer stops partway down this page, meaning that entries later in the alphabet aren't indexed.

My nutch-site.xml has the following content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>OHI Spider</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.</description>
  </property>
</configuration>

My regex-urlfilter.txt and crawl-urlfilter.txt both include the following, which should allow access to everything I want:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*aip.org/history/ohilist/

# skip everything else
-.

I've crawled with the following command:

runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 50

Note that since we don't have NutchBean anymore, I can't tell whether this is actually a Nutch problem or whether something is failing when I port to Solr. What am I missing?

Thanks,
Chip
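If the URL filters are suspected, they can be exercised directly: the regex filter class has a test main that reads URLs on stdin and echoes each back prefixed with + (accepted) or - (rejected). A sketch via the plugin runner, using a hypothetical transcript URL; the class name is the one used by the 1.3 urlfilter-regex plugin:

echo "http://www.aip.org/history/ohilist/4403.html" | runtime/local/bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter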
RE: Nutch not indexing full collection
I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, and I'm pretty sure that's the correct file. I run my commands from $NUTCH_HOME/ , which means all of my commands begin with runtime/local/bin/nutch... . That means my urls directory is $NUTCH_HOME/urls/ and my crawl directory ends up being $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/ and so forth), but it does seem to at least be getting my urlfilters from $NUTCH_HOME/runtime/local/conf/ .

I get no output when I try runtime/local/bin/nutch readdb -stats , so that's weird. I dimly recall there being a total index size value somewhere in Nutch or Solr which has to be increased, but I can no longer find any reference to it.

Chip

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Wednesday, July 20, 2011 10:06 AM
To: user@nutch.apache.org
Subject: Re: Nutch not indexing full collection

I'd have suspected db.max.outlinks.per.page, but you seem to have set it up correctly. Are you running Nutch in runtime/local? In which case you modified nutch-site.xml in runtime/local/conf, right?

nutch readdb -stats will give you the total number of pages known etc.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
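The empty readdb output has a mundane explanation: readdb takes the path to the crawldb as its first argument and prints only a usage message without it. With the crawl directory used above:

runtime/local/bin/nutch readdb crawl/crawldb -stats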
RE: Nutch not indexing full collection
I'm still having trouble. I've set a Windows environment variable, NUTCH_HOME, which for me is C:\Apache\nutch-1.3\runtime\local . I now have my urls and crawl directories in that C:\Apache\nutch-1.3\runtime\local folder. But I'm still not crawling files later in my urls list, and apparently I can't search for words or phrases toward the end of any of my documents. Am I misremembering that there was a total file size value somewhere in Nutch or Solr that needs to be increased?

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Wednesday, July 20, 2011 5:23 PM
To: user@nutch.apache.org
Subject: Re: Nutch not indexing full collection

Hi Chip,

I would try running your scripts after setting the environment variable $NUTCH_HOME to nutch/runtime/local

--
Lewis
RE: Nutch not indexing full collection
Thanks! This has solved half of my problem. I am now indexing material from every document I want. However, I'm still not indexing words from toward the end of longer documents, and I'm not sure what else I could be missing. The current contents of my nutch-site.xml are:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>OHI Spider</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.</description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>

And I'm still indexing with this command:

bin/nutch crawl urls -dir crawl -depth 15 -topN 50

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Wednesday, July 27, 2011 12:18 PM
To: user@nutch.apache.org
Subject: Re: Nutch not indexing full collection

Has this been solved? If your http.content.limit has not been increased in nutch-site.xml then you will not be able to store this data and index with Solr.
RE: Nutch not indexing full collection
That did it! For the convenience of anyone who finds this in the list archives later on, I'll paste what it took:

$NUTCH_HOME/runtime/local/conf/nutch-site.xml (full contents):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>OHI Spider</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.</description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content, in bytes.
    If this value is nonnegative (>=0), content longer than it will be
    truncated; otherwise, no truncation at all.</description>
  </property>
</configuration>

$NUTCH_HOME/runtime/local/conf/schema.xml and $SOLR_HOME/example/solr/conf/schema.xml:

Replace this:
<field name="content" type="text" stored="false" indexed="true"/>
With this:
<field name="content" type="text" stored="true" indexed="true"/>

$SOLR_HOME/example/solr/conf/solrconfig.xml:

Replace this:
<maxFieldLength>10000</maxFieldLength>
With this:
<maxFieldLength>2147483647</maxFieldLength>

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, August 01, 2011 3:45 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Nutch not indexing full collection

Nutch truncates content longer than configured and Solr truncates content exceeding max field length. Maybe check your limits.

> I'm still having trouble with this. In addition to the nutch-site.xml
> posted below, I have now modified my schema.xml (in both Nutch and Solr)
> to include the following important line:
>
> <field name="content" type="text" stored="true" indexed="true"/>
>
> Now, when I search, the full text of each document shows up under
> <str name="content">. I'm clearly getting everything. And yet, when I
> search for text toward the end of a long document, I still don't get that
> document in my search results. It sounds like this might be an issue with
> my Solr setup. Can anyone think of what I might be missing?
>
> Chip
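Note that the schema and solrconfig changes only take effect after a Solr restart and a full re-index. A smoke test for the fix, sketched with a hypothetical phrase taken from near the end of a long transcript:

curl "http://localhost:8983/solr/select?q=content:%22some+phrase+near+the+end%22&fl=url,title"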
Machine readable vs. human readable URLs.
Hi everyone, We'd like to use Nutch and Solr to replace an existing Verity search that's become a bit long in the tooth. In our Verity search, we have a hack which allows each document to have a machine-readable URL which is indexed (generally an xml document), and a human-readable URL which we actually send users to. Has anyone done the same with Nutch and Solr? Thanks, Chip
RE: Machine readable vs. human readable URLs.
Hi Julien,

Thanks, that's encouraging. I'm trying to make this work, and I'm definitely missing something; I hope I'm not too far off the mark. I've started with the instructions at http://wiki.apache.org/nutch/WritingPluginExample . If I understand this properly, the changes I needed to make were the following:

In Nutch:
- Paste the prescribed block of code into %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch to look for and run the urlmeta plugin.
- In %NUTCH_HOME%, run ant war.
- Add the tags to %NUTCH_HOME%\runtime\local\urls\nutch. A line in this file now looks like:
  http://www.aip.org/history/ead/20110369.xml\t humanURL=http://www.aip.org/history/ead/20110369.html

In Solr:
- Add my new tag to %SOLR_HOME%\example\solr\conf\schema.xml . The new line consists of:
  <field name="humanURL" type="string" stored="true" indexed="false"/>

I've redone the indexing, and my new field still doesn't show up in the search results. Can you tell where I'm going wrong?

Thanks,
Chip

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Friday, September 16, 2011 4:37 AM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

Hi Chip,

Should simply be a matter of creating a custom field with an IndexingFilter; you can then use it in any way you want on the SOLR side.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
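For reference, the "prescribed block" boils down to two properties; a sketch of how they might look for this field, assuming the stock Nutch 1.3 plugin list with urlmeta appended (match the plugin.includes value to your own setup):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlmeta|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <!-- comma-separated list of metadata keys the plugin should carry
       through to the index -->
  <name>urlmeta.tags</name>
  <value>humanURL</value>
</property>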
RE: Machine readable vs. human readable URLs.
Hi Lewis,

My probably-wrong understanding was that I'm supposed to add the tags for my new field to my list of seed URLs. So if I have a seed URL followed by \t humanURL=http://www.aip.org/history/ead/20110369.html , I get a new field called humanURL which is populated with the string I've specified for that specific URL. I may just be greatly misunderstanding how this plugin works.

I've checked my Nutch logs now, and it looks like nothing happened. The new field does at least show up in the Solr admin UI's schema, but clearly my problem is on the Nutch end of things.

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Monday, September 19, 2011 3:34 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

Hi Chip,

There is no need to run ant war; there is no war target in the build.xml file for Nutch >= 1.3. Can you explain more about adding 'the tags' to %NUTCH_HOME% etc.? Do you mean you've added your seed URLs? Have you had a look at any of your log output as to whether the urlmeta plugin is loaded and used when fetching? You should be able to get info on your schema, fields etc. within the Solr admin UI.
RE: Machine readable vs. human readable URLs.
I thought it seemed too good to be true. I understood the part about this picking up metadata from tags within the actual documents; that seems like a feature a lot of people would need. But I thought the whole point of the tab-delimited tags in my URLs file was that I could also inject tags that aren't in the source documents. That doesn't seem like it would be a standard feature, but it's what I need. Most of the pages I need to index aren't owned by us, and I won't always be able to get other sites to add an extra meta tag to their pages.

It looks like I might need to write my own plugin, which is a little daunting for me. Can anyone think of an existing plugin that injects metadata into indexed documents after the fact? It would be nice to have some existing code I could examine and learn from.

Thanks,
Chip

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Monday, September 19, 2011 4:56 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

In addition, it looks like you are misinterpreting how the urlmeta plugin works, Chip. It is designed to pick up additional meta tags with name and content values respectively, e.g.

<meta name="humanURL" content="blahblahblah" />

The plugin then gets this data, as well as any additional values added in the urlmeta.tags property within nutch-site.xml, and adds this to the index, which can then be queried. Does this make sense?

On Mon, Sep 19, 2011 at 9:10 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> Hi,
>
> Since the info is available thanks to the injection, you can use the
> url-meta plugin as-is and won't need to have a custom version. See
> https://issues.apache.org/jira/browse/NUTCH-855
>
> Apart from that, do not modify the content of \runtime\local\conf\ before
> re-compiling with ant, as this will be overwritten. Either modify
> $NUTCH/conf/nutch-site.xml or recompile THEN modify. As Lewis suggested,
> check the logs and see if the plugin is activated etc...
>
> J.
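As NUTCH-855 describes, the metadata rides along in the seed file itself; a sketch of the expected format, with <TAB> standing in for a literal tab character (a real tab must separate the URL from each key=value pair):

http://www.aip.org/history/ead/20110369.xml<TAB>humanURL=http://www.aip.org/history/ead/20110369.html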
RE: Machine readable vs. human readable URLs.
Hi Julien,

Thanks for clarifying this! I've got it working now. Instead of seeding with a proper tab-delimited file created in Excel, I had been wrong-headedly seeding it with a text file that just had tabs in it. They look the same, but it makes a difference.

Thanks!
Chip

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Monday, September 19, 2011 5:23 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

> In addition, it looks like you are misinterpreting how the urlmeta plugin
> works, Chip. It is designed to pick up additional meta tags with name and
> content values respectively, e.g.
> <meta name="humanURL" content="blahblahblah" />

Sorry Lewis, but it does not do that at all. See the link I gave earlier for a description of urlmeta. I agree that the name is misleading: it does not extract the content from the page but simply uses the crawldb metadata.

> The plugin then gets this data, as well as any additional values added in
> the urlmeta.tags property within nutch-site.xml, and adds this to the
> index, which can then be queried. Does this make sense?
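One way to verify the injection before involving Solr at all: readdb's -url mode prints the CrawlDatum for a single URL, and injected key=value pairs show up in its metadata block. A sketch, assuming the crawl layout used earlier in these threads:

runtime/local/bin/nutch readdb crawl/crawldb -url http://www.aip.org/history/ead/20110369.xml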
RE: Machine readable vs. human readable URLs.
For my own sake I wish I could think of a way in which it was unclear, but no; I just screwed up. I could maybe see reinforcing that the urls document has to be saved as a tab-delimited file, so a newbie like me won't look at the examples and think it is meant to be an ordinary text file. Otherwise, both the plugin and the documentation work great!

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Wednesday, September 21, 2011 3:05 AM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

Hi Chip,

Was there anything in particular you found misleading about the plugin example on the wiki? I am keen to make it as clear as possible.

Thank you
Lewis

On Tue, Sep 20, 2011 at 6:00 PM, Chip Calhoun <ccalh...@aip.org> wrote:
> Thanks for clarifying this! I've got it working now. Instead of seeding
> with a proper tab-delimited file created in Excel, I had been
> wrong-headedly seeding it with a text file that just had tabs in it. They
> look the same, but it makes a difference.
How can I figure out what my user-agent is?
I thought I understood how to set my user-agent, but after asking a few sites to add me to their robots.txt it looks like I'm missing something. My nutch-site.xml includes:

<property>
  <name>http.agent.name</name>
  <value>PHFAWS Spider</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>PHFAWS Spider,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should put the
  value of http.agent.name as the first agent name, and keep the default *
  at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
</property>

A friendly site created a robots.txt which includes the following:

User-agent: PHFAWS Spider
Disallow:

User-agent: *
Disallow: /

Why doesn't this work?

Thanks,
Chip
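To answer the subject line directly: the fetcher logs its fully composed agent string at startup (an "http.agent =" line of exactly this kind is quoted later in this thread), so the quickest check is to grep the Nutch log. The path below assumes the 1.3 local runtime's default log location:

grep "http.agent" runtime/local/logs/hadoop.log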
What could be blocking me, if not robots.txt?
Hi everyone,

I'm using Nutch to crawl a few friendly sites, and am having trouble with some of them. One site in particular has created an exception for me in its robots.txt, and yet I can't crawl any of its pages. I've tried copying the files I want to index (3 XML documents) to my own server and crawling that, and it works fine that way; so something is keeping me from indexing any files on this other site.

I compared the logs of my attempt to crawl the friendly site with my attempt to crawl my own site, and I've found few differences. Most differences come from the fact that my own site requires a crawlDelay, so there are many log sections along the lines of:

2011-09-29 10:57:37,529 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
2011-09-29 10:57:37,529 INFO fetcher.Fetcher - * queue: http://www.aip.org
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   maxThreads    = 1
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   inProgress    = 0
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   crawlDelay    = 5000
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   minCrawlDelay = 0
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   nextFetchTime = 1317308262122
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   now           = 1317308257529
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml

That strikes me as probably irrelevant, but I figured I should mention it. The main difference I see in the logs is that the crawl of my own site (the crawl that worked) has the following two lines which do not appear in the log of my failed crawl:

2011-09-29 10:57:50,497 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
2011-09-29 10:58:23,559 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature

Also, while my successful crawl has three lines like the following, my failed one only has two:

2011-09-29 10:58:44,824 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default

Can anyone think of something I might have missed?

Chip
RE: What could be blocking me, if not robots.txt?
I apologize, but I haven't found much Nutch documentation that deals with the user-agent and robots.txt. Why am I being blocked when the user-agent I'm sending matches the user-agent in that robots.txt? Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Friday, September 30, 2011 6:28 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: What could be blocking me, if not robots.txt?

I've been able to run the ParserChecker now, but I'm not sure how to understand the results. Here's what I got:

# bin/nutch org.apache.nutch.parse.ParserChecker http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
--- Url ---
http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
--- ParseData ---
Version: 5
Status: success(1,0)
Title:
Outlinks: 1
  outlink: toUrl: GR:32:A:128 anchor:
Content Metadata: ETag=1fa962a-56f20-485df79c50980 Date=Fri, 30 Sep 2011 19:54:14 GMT Content-Length=356128 Last-Modified=Wed, 05 May 2010 21:26:14 GMT Content-Type=text/xml Connection=close Accept-Ranges=bytes Server=Apache/2.2.3 (Red Hat)
Parse Metadata: Content-Type=application/xml

This means almost everything is good to go but...

Curl also retrieves this file, and yet I can't get my crawl to pick it up. Could it be an issue with robots.txt? The robots file for this site reads as follows:

User-agent: PHFAWS/Nutch-1.3
Disallow:

User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /

This is the problem.

That first user-agent is, as near as I can tell, what I'm sending. My log shows the following:

2011-09-30 15:54:17,712 INFO http.Http - http.agent = PHFAWS/Nutch-1.3 (American Institute of Physics: Physics History Finding Aids Web Site; http://www.aip.org/history/nbl/findingaids.html; ccalh...@aip.org)

Can anyone tell what I'm missing? Thanks. Chip

-----Original Message-----
From: Chip Calhoun [mailto:ccalh...@aip.org]
Sent: Thursday, September 29, 2011 4:12 PM
To: user@nutch.apache.org
Subject: RE: What could be blocking me, if not robots.txt?

Ah, sorry. I had already deleted the local copy from my server (aip.org) to avoid clutter. So yeah, that will definitely 404 now. Curl retrieves the whole file with no problems. I can't try the ParserChecker today as I'm stuck away from my own machine, but I will try it tomorrow. The fact that I can curl it at least tells me this is a problem I need to fix in Nutch. Chip

From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Thursday, September 29, 2011 1:01 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: What could be blocking me, if not robots.txt?

Oh, it's a 404. That makes sense.
RE: What could be blocking me, if not robots.txt?
Aha! That's done it. Thanks! Incidentally, I only asked them to add the /Nutch-1.3 because originally I had a user-agent of PHFAWS Spider and had them add PHFAWS Spider to their robots.txt, and it didn't work. It seems that at least some sites have trouble with a user-agent that's more than one word. And I only went with multiple words because the tutorial gives <value>My Nutch Spider</value> as an example. This might be something to warn people about in the documentation. Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, October 03, 2011 9:42 AM
To: user@nutch.apache.org
Subject: Re: What could be blocking me, if not robots.txt?

Oh i misread, your user agent is PHFAWS/Nutch-1.3? Are you sure that that's what is configured as your user agent name? If your name is PHFAWS then the robots.txt must list your name without /Nutch-1.3. Or maybe change the robots.txt to:

User-agent: PHFAWS/Nutch-1.3
Allow: /

On Monday 03 October 2011 15:31:46 Chip Calhoun wrote:
I apologize, but I haven't found much Nutch documentation that deals with the user-agent and robots.txt. Why am I being blocked when the user-agent I'm sending matches the user-agent in that robots.txt? Chip
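[The takeaway generalizes: keep http.agent.name to a single token so robots.txt matching is unambiguous. A minimal sketch of the single-word configuration, assuming the PHFAWS name used in this thread:

<property>
  <name>http.agent.name</name>
  <value>PHFAWS</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>PHFAWS,*</value>
</property>]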
Unable to parse large XML files.
Hi everyone, I've found that I'm unable to parse very large XML files. This doesn't seem to happen with other file formats. When I run any of the offending files through ParserChecker, I get something along the lines of:

# bin/nutch org.apache.nutch.parse.ParserChecker http://www.aip.org/history/ead/19990074.xml
--- Url ---
http://www.aip.org/history/ead/19990074.xml
--- ParseData ---
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

One thing which may or may not be relevant is that when I look XML files up in a browser, the http:// at the beginning tends to disappear. That seems relevant because it seems like it might defeat my file.content.limit, http.content.limit, and ftp.content.limit properties. Is there a way around this? Thanks, Chip
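[One quick sanity check before touching config: compare the document's size against the active content limit (the stock nutch-default.xml ships http.content.limit at 65536 bytes). A hedged one-liner in the spirit of the curl checks used elsewhere in these threads:

curl -sI http://www.aip.org/history/ead/19990074.xml | grep -i Content-Length]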
RE: Unable to parse large XML files.
Huh. It turns out my http.content.limit was fine, but I also needed a file.content.limit statement in nutch-site.xml to make this work. Thanks!

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, October 04, 2011 7:41 PM
To: user@nutch.apache.org
Subject: Re: Unable to parse large XML files.

Hi everyone, I've found that I'm unable to parse very large XML files. This doesn't seem to happen with other file formats. When I run any of the offending files through ParserChecker, I get something along the lines of:

# bin/nutch org.apache.nutch.parse.ParserChecker http://www.aip.org/history/ead/19990074.xml
--- Url ---
http://www.aip.org/history/ead/19990074.xml
--- ParseData ---
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

One thing which may or may not be relevant is that when I look XML files up in a browser the http:// at the beginning tends to disappear.

You're using some fancy new browser? Some seem to do that. Check your http.content.limit.

That seems relevant because it seems like it might defeat my file.content.limit, http.content.limit, and ftp.content.limit properties. Is there a way around this? Thanks, Chip
RE: Unable to parse large XML files.
Hrm. No, it turns out I was wrong; I'd misread an error message. I've got the following in my nutch-site.xml:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the http.content.limit setting.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Caution: classical ftp RFCs never define partial transfer and, in fact, some ftp servers out there do not handle client side forced close-down very well. Our implementation tries its best to handle such situations smoothly.</description>
</property>
Truncated content despite my content.limit settings.
Hi everyone, I'm having issues with truncated content on some pages, despite what I believe to be solid content.limit settings. One page I have an issue with: http://www.canisius.edu/archives/ruddick.asp

When I run a search in Solr, the content I get is limited to:

<str name="content">Canisius College - Ruddick Collection Canisius College Archives Return to Home Admissions Academics Athletics Student Life Alumni and Friends News and Events Welcome to Canisius Department Index Archives Special Collections Ruddick Collection Collection of Fr. James J. Ruddick, S.J., 1924-2007 Welcome to the Collection of Rev. James J. Ruddick, S.J. chronicling the</str>

Here's what I have in my nutch-site.xml, which looks sufficient to me:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the http.content.limit setting.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Caution: classical ftp RFCs never define partial transfer and, in fact, some ftp servers out there do not handle client side forced close-down very well. Our implementation tries its best to handle such situations smoothly.</description>
</property>

Can anyone see what I'm missing? Thanks. Chip
RE: Truncated content despite my content.limit settings.
With ParserChecker it's similarly truncated. Could it be the fact that it's a .asp page? The output is as follows:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.canisius.edu/archives/ruddick.asp
--- Url ---
http://www.canisius.edu/archives/ruddick.asp
--- ParseData ---
Version: 5
Status: success(1,0)
Title: Canisius College - Ruddick Collection
Outlinks: 20
  outlink: toUrl: http://www.canisius.edu/v2/SiteStyleClient.css anchor:
  outlink: toUrl: http://www.canisius.edu/v2/SiteStylePrint.css anchor:
  outlink: toUrl: http://www.google-analytics.com/urchin.js anchor:
  outlink: toUrl: http://www.canisius.edu/default.asp anchor: Return to Home
  outlink: toUrl: http://www.canisius.edu/admissions/rd/?PROP-PROADM anchor: Admissions
  outlink: toUrl: http://www.canisius.edu/academics/ anchor: Academics
  outlink: toUrl: http://www.gogriffs.com anchor: Athletics
  outlink: toUrl: http://www.canisius.edu/studentlife/ anchor: Student Life
  outlink: toUrl: http://www.canisius.edu/alumnifriends/ anchor: Alumni and Friends
  outlink: toUrl: http://www.canisius.edu/newsevents/ anchor: News and Events
  outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/ruddick_centerBanner.jpg anchor:
  outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/ruddick_HC.gif anchor:
  outlink: toUrl: http://www.canisius.edu/archives/mission.asp anchor: mission statement
  outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/mission_blue.gif anchor: mission statement
  outlink: toUrl: http://www.canisius.edu/archives/directory.asp anchor: archives directory
  outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/archives_gold.gif anchor: archives directory
  outlink: toUrl: http://www.canisius.edu/default.asp anchor: Welcome to Canisius
  outlink: toUrl: http://www.canisius.edu/about/departments.asp anchor: Department Index
  outlink: toUrl: http://www.canisius.edu/archives/default.asp anchor: Archives Special Collections
  outlink: toUrl: http://www.canisius.edu/images/userImages/libweb/Page_12509/Ruddick.jpg anchor:
Content Metadata: Cache-control=private Date=Tue, 18 Oct 2011 13:44:06 GMT Content-Length=10610 Set-Cookie=ASPSESSIONIDASSCBRRA=LNGICEKCBKDEAOFICKHLDHEL; path=/ Content-Type=text/html Connection=close X-Powered-By=ASP.NET Server=Microsoft-IIS/6.0
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
--- ParseText ---
Canisius College - Ruddick Collection Canisius College Archives Return to Home Admissions Academics Athletics Student Life Alumni and Friends News and Events Welcome to Canisius Department Index Archives Special Collections Ruddick Collection Collection of Fr. James J. Ruddick, S.J., 1924-2007 Welcome to the C

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, October 17, 2011 4:26 PM
To: user@nutch.apache.org
Subject: Re: Truncated content despite my content.limit settings.

What does parsechecker tell you?

nutch org.apache.nutch.parse.ParserChecker -dumpText URL

Keep in mind that your Solr may have a low value for max field length.

Hi everyone, I'm having issues with truncated content on some pages, despite what I believe to be solid content.limit settings.
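[On Markus's Solr-side caveat: in the Solr 1.4/3.x example config, indexing silently stops after maxFieldLength tokens per field, which truncates long content in exactly this way. A sketch of the solrconfig.xml setting to raise; the stock example ships with 10000:

<maxFieldLength>2147483647</maxFieldLength>]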
RE: Truncated content despite my content.limit settings.
Aha! It turns out that removing protocol-httpclient from my nutch-site.xml's plugin.includes value fixes this. If I'm remembering correctly, I only added it in the hope that it would fix something else that it didn't actually fix, so hopefully removing it won't break anything.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, October 18, 2011 9:58 AM
To: user@nutch.apache.org
Subject: Re: Truncated content despite my content.limit settings.

Strange! I parsed it yesterday as well with parse-tika and the Boilerpipe patch enabled and got a lot of output. Can you try a different parser? Your settings look fine, but are there any other exotic settings you use, or custom code?

On Tuesday 18 October 2011 15:53:26 Chip Calhoun wrote:
With ParserChecker it's similarly truncated. Could it be the fact that it's a .asp page?
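[For reference, a hedged sketch of what plugin.includes looks like with protocol-httpclient left out, assuming the stock Nutch 1.3 plugin set plus the urlmeta plugin from the earlier thread:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta</value>
  <!-- protocol-http instead of protocol-httpclient -->
</property>]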
Good workaround for timeout?
I'm getting a fairly persistent timeout on a particular page. Other, smaller pages in this folder do fine, but this one times out most of the time. When it fails, my ParserChecker results look like:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)

I've stuck with the default value of 10 in my nutch-default.xml's fetcher.threads.fetch value, and I've added the following to nutch-site.xml:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the http.content.limit setting.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Caution: classical ftp RFCs never define partial transfer and, in fact, some ftp servers out there do not handle client side forced close-down very well. Our implementation tries its best to handle such situations smoothly.</description>
</property>
<property>
  <name>http.timeout</name>
  <value>999</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

What else can I do? Thanks. Chip
RE: Good workaround for timeout?
If I'm reading the log correctly, it's the fetch:

2011-10-19 11:18:11,405 INFO fetcher.Fetcher - fetch of http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml failed with: java.net.SocketTimeoutException: Read timed out

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 19, 2011 11:08 AM
To: user@nutch.apache.org
Subject: Re: Good workaround for timeout?

What is timing out, the fetch or the parse?

I'm getting a fairly persistent timeout on a particular page. Other, smaller pages in this folder do fine, but this one times out most of the time.
RE: Good workaround for timeout?
I'm using protocol-http, but I removed protocol-httpclient after you pointed out in another thread that it's broken. Unfortunately I'm not sure which properties are used by what, and I'm not sure how to find out. I added some more stuff to nutch-site.xml (I'll paste it at the end), and it seems to be working so far; but since this has been an intermittent problem, I can't be sure whether I've really fixed it or whether I'm getting lucky.

<property>
  <name>http.timeout</name>
  <value>999</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>ftp.timeout</name>
  <value>99</value>
  <description>Default timeout for ftp client socket, in millisec. Please also see ftp.keep.connection below.</description>
</property>
<property>
  <name>ftp.server.timeout</name>
  <value>9</value>
  <description>An estimation of ftp server idle time, in millisec. Typically it is 120000 millisec for many ftp servers out there. Better be conservative here. Together with ftp.timeout, it is used to decide if we need to delete (annihilate) current ftp.client instance and force to start another ftp.client instance anew. This is necessary because a fetcher thread may not be able to obtain next request from queue in time (due to idleness) before our ftp client times out or remote server disconnects. Used only when ftp.keep.connection is true (please see below).</description>
</property>
<property>
  <name>parser.timeout</name>
  <value>300</value>
  <description>Timeout in seconds for the parsing of a document, otherwise treats it as an exception and moves on to the following documents. This parameter is applied to any Parser implementation. Set to -1 to deactivate, bearing in mind that this could cause the parsing to crash because of a very long or corrupted document.</description>
</property>

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 19, 2011 11:28 AM
To: user@nutch.apache.org
Subject: Re: Good workaround for timeout?

It is indeed. Tricky. Are you going through some proxy? Are you using protocol-http or httpclient? Are you sure the http.timeout value is actually used in lib-http?

If I'm reading the log correctly, it's the fetch:

2011-10-19 11:18:11,405 INFO fetcher.Fetcher - fetch of http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml failed with: java.net.SocketTimeoutException: Read timed out
Is there a workaround for https?
I've noticed the recent posts about trouble with protocol-httpclient, which to my understanding is needed for https URLs. Is there another way to handle these? ParserChecker gives me the following when I try one of these URLs. Thanks.

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText https://libwebspace.library.cmu.edu:4430/Research/Archives/ead/generated/shull.xml
Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
        at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:80)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:78)
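[The error itself points at the cause: with only protocol-http enabled, no plugin is registered for the https scheme. A hedged workaround sketch — protocol-httpclient registers both http and https, so enabling it (despite the problems discussed in the neighbouring threads) is one option:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <!-- protocol-httpclient handles http and https; protocol-http is http-only -->
</property>]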
RE: Good workaround for timeout?
I started out with a pretty high number in http.timeout, and I've increased it to the fairly ridiculous 999. Is there an upper limit at which it would stop working properly?

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 19, 2011 4:57 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Good workaround for timeout?

I'm using protocol-http, but I removed protocol-httpclient after you pointed out in another thread that it's broken. Unfortunately I'm not sure which properties are used by what, and I'm not sure how to find out. I added some more stuff to nutch-site.xml (I'll paste it at the end), and it seems to be working so far; but since this has been an intermittent problem, I can't be sure whether I've really fixed it or whether I'm getting lucky.

http.timeout is used in lib-http so it should work unless there's a bug around. Does the problem persist for that one URL if you increase this value to a more reasonable number, say 300?
RE: Good workaround for timeout?
Good to know! I was definitely exceeding that, so I've changed my properties.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, October 20, 2011 10:00 AM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Good workaround for timeout?

On Thursday 20 October 2011 15:56:01 Chip Calhoun wrote:
I started out with a pretty high number in http.timeout, and I've increased it to the fairly ridiculous 999. Is there an upper limit at which it would stop working properly?

It's interpreted as an Integer, so don't exceed Integer.MAX_VALUE. Don't know how hadoop will handle it, for sure.
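[A sketch of an http.timeout that respects the bound Markus describes; the 300000 (five minutes) is illustrative, not a value from this thread:

<property>
  <name>http.timeout</name>
  <!-- milliseconds; parsed as a Java int, so it must stay below Integer.MAX_VALUE (2147483647) -->
  <value>300000</value>
</property>]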
Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)
|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta</value>
<description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>
<property>
  <name>urlmeta.tags</name>
  <value>humanurl</value>
</property>
RE: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)
Increasing parser.timeout to 3600 got me what I needed. I only have a few files this huge, so I'll live with that.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 26, 2011 10:55 AM
To: user@nutch.apache.org
Subject: Re: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

The actual parse which is producing time outs happens early in the process. There are, to my knowledge, no Nutch settings to make this faster or change its behaviour; it's all about the parser implementation. Try increasing your parser.timeout setting.

On Wednesday 26 October 2011 16:45:33 Chip Calhoun wrote:
I've got a few very large (upwards of 3 MB) XML files I'm trying to index, and I'm having trouble. Previously I'd had trouble with the fetch; now that seems to be okay, but due to the size of the files the parse takes much too long. Is there a good way to optimize this that I'm missing? Is lengthy parsing of XML a known problem? I recognize that part of my problem is that I'm doing my testing from my aging desktop PC, and it will run faster when I move things to the server, but it's still slow. I do get the following weird message in my log when I run ParserChecker or the crawler:

2011-10-26 09:51:47,729 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
2011-10-26 10:06:40,639 WARN parse.ParseUtil - TIMEOUT parsing http://www.aip.org/history/ead/19990074.xml with org.apache.nutch.parse.tika.TikaParser@18355aa
2011-10-26 10:06:40,639 WARN parse.ParseUtil - Unable to successfully parse content http://www.aip.org/history/ead/19990074.xml of type application/xml

My ParserChecker results look like this:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.aip.org/history/ead/19990074.xml
--- Url ---
http://www.aip.org/history/ead/19990074.xml
--- ParseData ---
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:
--- ParseText ---

And here's everything that might be relevant in my nutch-site.xml; I've tried it both with and without the urlmeta plugin, and that doesn't make a difference:
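[Chip's fix, expressed as the usual nutch-site.xml override; 3600 is the value from his message, and the comment is paraphrased:

<property>
  <name>parser.timeout</name>
  <value>3600</value>
  <!-- seconds per document; raised well above the default for a handful of very large XML files -->
</property>]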
Trouble running solrindexer from Nutch 1.4
This is probably just down to my not waiting for a 1.4 tutorial, but here goes. I've always used the following two commands to run my crawl and then index to Solr:

# bin/nutch crawl urls -dir crawl -depth 1 -topN 50
# bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

In 1.3 that works great. But in 1.4, when I run solrindex I get this:

# bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-12-07 17:09:58
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/apache/apache-nutch-1.4/runtime/local/crawl/linkdb/crawl_fetch
Input path does not exist: file:/C:/apache/apache-nutch-1.4/runtime/local/crawl/linkdb/crawl_parse
Input path does not exist: file:/C:/apache/apache-nutch-1.4/runtime/local/crawl/linkdb/parse_data
Input path does not exist: file:/C:/apache/apache-nutch-1.4/runtime/local/crawl/linkdb/parse_text

Sure enough, those directories don't exist. But they didn't exist in 1.3 either. What am I missing? Thanks, Chip
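[The error gives a strong hint: crawl_fetch, crawl_parse, parse_data and parse_text are segment subdirectories, so SolrIndexer is scanning crawl/linkdb as if it were a segment. One hedged guess is that the 1.4 SolrIndexer expects the linkdb behind a -linkdb flag rather than positionally; the invocation below is an assumption, to be verified against the usage string printed by bin/nutch solrindex:

# bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*]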
Can't crawl a domain; can't figure out why.
I'm trying to crawl pages from a number of domains, and one of these domains has been giving me trouble. The really irritating thing is that it did work at least once, which led me to believe that I'd solved the problem. I can't think of anything at this point but to paste my log of a failed crawl and solrindex and hope that someone can think of anything I've overlooked. Does anything look strange here? Thanks, Chip

2011-12-19 16:31:01,010 WARN crawl.Crawl - solrUrl is not set, indexing will be skipped...
2011-12-19 16:31:01,404 INFO crawl.Crawl - crawl started in: mit-c-crawl
2011-12-19 16:31:01,420 INFO crawl.Crawl - rootUrlDir = mit-c-urls
2011-12-19 16:31:01,420 INFO crawl.Crawl - threads = 10
2011-12-19 16:31:01,420 INFO crawl.Crawl - depth = 1
2011-12-19 16:31:01,420 INFO crawl.Crawl - solrUrl=null
2011-12-19 16:31:01,420 INFO crawl.Crawl - topN = 50
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: starting at 2011-12-19 16:31:01
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: crawlDb: mit-c-crawl/crawldb
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: urlDir: mit-c-urls
2011-12-19 16:31:01,436 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2011-12-19 16:31:02,854 INFO plugin.PluginRepository - Plugins: looking in: C:\Apache\apache-nutch-1.4\runtime\local\plugins
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Plugins:
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   the nutch core extension points (nutch-extensionpoints)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Basic URL Normalizer (urlnormalizer-basic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Html Parse Plug-in (parse-html)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Basic Indexing Filter (index-basic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Http / Https Protocol Plug-in (protocol-httpclient)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   HTTP Framework (lib-http)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Regex URL Filter (urlfilter-regex)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Pass-through URL Normalizer (urlnormalizer-pass)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Http Protocol Plug-in (protocol-http)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Regex URL Normalizer (urlnormalizer-regex)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Tika Parser Plug-in (parse-tika)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   OPIC Scoring Plug-in (scoring-opic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   CyberNeko HTML Parser (lib-nekohtml)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Anchor Indexing Filter (index-anchor)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   URL Meta Indexing Filter (urlmeta)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Regex URL Filter Framework (lib-regex-filter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Extension-Points:
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-12-19 16:31:02,964 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2011-12-19 16:31:05,722 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2011-12-19 16:31:07,014 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-12-19 16:31:07,897 INFO crawl.Injector - Injector: finished at 2011-12-19 16:31:07, elapsed: 00:00:06
2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: starting at
RE: Can't crawl a domain; can't figure out why.
I just compared this against a similar crawl of a completely different domain which I know works, and you're right on both counts. The parser doesn't parse a file, and nothing is sent to the solrindexer. I tried a crawl with more documents and found that while I can get documents from mit.edu, I get absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 1.3 as well. I don't think we're dealing with truncated files. I'm willing to believe it's a parse error, but how could I tell? I've spoken with some helpful people from MIT, and they don't see a reason why this wouldn't work. Chip

-Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, December 19, 2011 5:01 PM To: user@nutch.apache.org Subject: Re: Can't crawl a domain; can't figure out why.

Nothing peculiar, looks like Nutch 1.4 right? But you also didn't mention the domain you can't crawl. libraries.mit.edu seems to work, although the indexer doesn't seem to send a document in and the parser doesn't mention parsing that file. Either the file throws a parse error or is truncated or ...
RE: Can't crawl a domain; can't figure out why.
://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription anchor: [http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
outlink: toUrl: http://libraries.mit.edu anchor: [http://libraries.mit.edu]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html]
outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »

Content Metadata: Date=Tue, 20 Dec 2011 21:30:50 GMT Content-Length=191500 Via=1.0 barracuda.acp.org:8080 (http_scan/4.0.2.6.19) Connection=close Content-Type=text/html Accept-Ranges=bytes X-Cache=MISS from barracuda.acp.org Server=Apache/2.2.3 (Red Hat)
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
-Original Message- From: alx...@aim.com [mailto:alx...@aim.com] Sent: Tuesday, December 20, 2011 2:15 PM To: user@nutch.apache.org Subject: Re: Can't crawl a domain; can't figure out why.

It seems that robots.txt in libraries.mit.edu has a lot of restrictions. Alex.
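A quick way to check Alex's suggestion is to pull the robots.txt directly; Nutch honors it through its protocol plugins, so any Disallow rules covering the archives pages would explain URLs being silently skipped. A minimal sketch (plain curl, nothing Nutch-specific assumed):

  # inspect the rules Nutch would obey for this host
  curl -s http://libraries.mit.edu/robots.txt | head -n 40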
Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)
We've found that the solrindex process chokes on the custom metadata fields I added to Nutch using the urlmeta plugin. A sample of the lengthy error messages:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/phfaws: ERROR: [doc=http://academics.wellesley.edu/lts/archives/3/3L_Astronomy.html] unknown field 'icosreposurl' at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)

As mentioned in my previous message, I've copied my Nutch schema.xml into my Solr's conf folder, but since my Solr instance didn't already have a schema.xml file, I'm not convinced it's being read. How do I set up my Solr to take these new fields? Chip
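One way to register such fields in a Solr 5.5 core that uses the managed schema is the Schema API, which edits the live schema over HTTP. A minimal sketch, assuming the core name from the error above; the field type and attributes are assumptions, and each custom urlmeta field would need its own add-field call:

  curl -X POST -H 'Content-type:application/json' \
    'http://localhost:8983/solr/phfaws/schema' -d '{
      "add-field": {
        "name": "icosreposurl",
        "type": "string",
        "indexed": true,
        "stored": true
      }
    }'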
Need help installing scoring-depth plugin
I'm upgrading from Nutch 1.4 to Nutch 1.12. I limit this crawl to my seeds, so my 1.4 command was: bin/nutch crawl phfaws -dir crawl -depth 1 -topN 5 My understanding is that the "crawl" command is deprecated, "-depth" went with it, and I need to install the scoring-depth plugin. I'm new to adding plugins. The instructions at https://wiki.apache.org/nutch/AboutPlugins give a sample command, but I don't know what the official PluginRepository for this plugin is, and the sample link for the HtmlParser plugin is dead. I'll appreciate any help. Thank you! Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740 301-209-3180 https://www.aip.org/history-programs/niels-bohr-library
RE: Need help installing scoring-depth plugin
Thank you Julien! That's exactly what I needed. Chip

-Original Message- From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] Sent: Tuesday, January 31, 2017 1:09 PM To: user@nutch.apache.org Subject: Re: Need help installing scoring-depth plugin

You don't need to install scoring-depth. It's just that the term 'depth' in the old crawl class has been replaced by 'rounds', which is more accurate. The equivalent of the command you used to call should be: bin/crawl phfaws crawl 1

The value for topN needs setting in the crawl script; see sizeFetchlist in [https://github.com/apache/nutch/blob/master/src/bin/crawl#L117]

HTH

Julien

-- *Open Source Solutions for Text Engineering* http://www.digitalpebble.com http://digitalpebble.blogspot.com/ #digitalpebble <http://twitter.com/digitalpebble>
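For reference, a sketch of the sizeFetchlist change Julien points to (the default multiplier and line number vary by Nutch version; the hard-coded value below is only a stand-in for the old -topN 5):

  # by default bin/crawl derives the per-round fetch list size from the slave count, e.g.:
  #   sizeFetchlist=`expr $numSlaves \* 50000`
  # a hypothetical fixed limit to mimic "-topN 5":
  sizeFetchlist=5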
Queries in new Solr version not finding results I'd expect
I'm testing a new setup, Solr 5.5.3 indexed from Nutch 1.12. I'm comparing it against my production instance, Solr 3.3.0 and Nutch 1.4. The new search misses some results that the old one got. If I search for the word "optician", the old results include a result which the new one misses. The document in question is indexed by the new Solr; I can find it using other search terms. The content field for this document, stored in the new Solr, does clearly include the word "optician". Why wouldn't it turn up? Where do I start looking? As an aside, thank you to everyone who's replied to my questions the past few weeks. I don't want to clog the listserv with a lot of short "thank you" posts, but I do appreciate it. Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740 301-209-3180 https://www.aip.org/history-programs/niels-bohr-library
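Two hypothetical checks that narrow this down (core and field names assumed from the earlier messages): query the field directly, and compare how each schema analyzes the content field, since a tokenizer or stemmer difference between the old and new field types is a common cause of "stored but not matched":

  # does a direct field query on the new core find the document?
  curl 'http://localhost:8983/solr/phfaws/select?q=content:optician&fl=id&wt=json'

If this also misses, the Analysis screen in the Solr admin UI shows token by token how "optician" is treated at index and query time under each field type.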
Failing to index from Nutch 1.12 to Solr 5.5.3
I'm switching to more recent Nutch/Solr, after years of using Nutch 1.4 and Solr 3.3.0. I get no results when I index into Solr. I can't tell where this breaks down. I use these commands:

cd /opt/apache-nutch-1.12/runtime/local
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.121.x86_64
export NUTCH_CONF_DIR=/opt/apache-nutch-1.12/runtime/local/conf/phfaws
bin/crawl urls/phfaws crawl/phfaws 1
bin/nutch solrindex http://localhost:8983/solr/phfaws/ crawl/phfaws/crawldb -linkdb crawl/phfaws/linkdb crawl/phfaws/segments/*

I believe that Nutch is crawling properly, but I do find that the crawl folders end up about 25% as large as what I produced with Nutch 1.4. I suspect that the problem is with the Nutch/Solr integration. My Solr core didn't create a schema.xml, instead having a managed schema. I've copied my Nutch local conf's schema.xml into Solr, but I haven't seen that I'm supposed to do anything more with that. Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740 301-209-3180 https://www.aip.org/history-programs/niels-bohr-library
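A copied schema.xml is ignored by default here: cores built from Solr 5.5's default configset use the managed schema. One possible approach (an assumption about the intended setup, not the only fix) is to switch the core to the classic schema so conf/schema.xml is actually read:

  <!-- in the core's solrconfig.xml: use conf/schema.xml instead of the managed schema -->
  <schemaFactory class="ClassicIndexSchemaFactory"/>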
No build.xml for Nutch 1.12
I'm upgrading to Nutch 1.12, and I have an extremely basic problem. I can't find a build.xml in apache-nutch-1.12-bin.zip, and therefore can't run ant. What am I missing? Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740 301-209-3180 https://www.aip.org/history-programs/niels-bohr-library
RE: No build.xml for Nutch 1.12
Markus, Thank you! Chip

-Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Wednesday, January 25, 2017 5:03 PM To: user@nutch.apache.org Subject: RE: No build.xml for Nutch 1.12

Hello,

A *-bin* file in ASF downloads is always a precompiled distribution. You are looking for the *-src* file, or neither, if you don't need to recompile or customize the sources.

Regards,
Markus
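The source route Markus describes, as a sketch (the download URL pattern is an assumption; "ant runtime" is the standard Nutch build target and writes the compiled runtime to runtime/local):

  wget https://archive.apache.org/dist/nutch/1.12/apache-nutch-1.12-src.zip
  unzip apache-nutch-1.12-src.zip
  cd apache-nutch-1.12
  ant runtime   # builds; bin/nutch and bin/crawl land in runtime/local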
RE: [MASSMAIL]Nutch not indexing all seed URLs
Thank you. The problem was right below that; I had the default "timeLimitFetch=180", and it stopped after 3 hours. I'll bump that up to something ridiculous and try again. Chip

-Original Message- From: Eyeris Rodriguez Rueda [mailto:eru...@uci.cu] Sent: Thursday, May 11, 2017 4:46 PM To: user@nutch.apache.org Subject: Re: [MASSMAIL]Nutch not indexing all seed URLs

Hi. Maybe one cause: have you seen the topN (fetchlist) parameter inside the bin/crawl script (line 117)?

sizeFetchlist=`expr $numSlaves \* 50`

This number could limit your url list. Also check your filters. Tell me if you have solved the problem.
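The fix described above, as a one-line sketch (the variable name comes straight from the message; the new value is arbitrary):

  # in bin/crawl: minutes allotted to each fetch phase; the default 180 is the 3-hour cutoff
  timeLimitFetch=1440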
Nutch not indexing all seed URLs
I'm using Nutch 1.12 to index a local site. To keep Nutch from indexing the uninteresting navigation pages on my site, I've made a URLs list of all the URLs I want crawled; the current list is 2522 URLs. However, the indexer stopped after just 1077 of these URLs. My generate.max.count is set to -1. What would cause my URLs to be skipped? Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740-3840 USA Tel: +1 301-209-3180 Email: ccalh...@aip.org https://www.aip.org/history-programs/niels-bohr-library
Re: Nutch fetching times out at 3 hours, not sure why.
Hi Sebastian, Yes, that explains it! Now I wish I'd pasted my crawl command in the first place. I'll leave it alone for now, but if it becomes an issue again I know where to check. Thank you. Chip

From: Sebastian Nagel <wastl.na...@googlemail.com> Sent: Monday, April 30, 2018 4:53:20 PM To: user@nutch.apache.org Subject: Re: Nutch fetching times out at 3 hours, not sure why.

Hi Chip, got it, you probably run bin/crawl which has the option:

--time-limit-fetch    Number of minutes allocated to the fetching [default: 180]

It's good to have a time limit, in case a single server responds too slowly. Best, Sebastian
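This also explains why fetcher.timelimit.mins=-1 in nutch-site.xml appeared to have no effect: the crawl script passes its own limit on the fetch command line, and command-line -D properties override the config files. Roughly (a paraphrase of the relevant bin/crawl line; exact flags vary by version):

  # each fetch step is launched with the script's limit, overriding nutch-site.xml
  __bin_nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch \
    "$CRAWL_PATH"/segments/$SEGMENT -noParsing -threads $numThreads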
RE: Nutch fetching times out at 3 hours, not sure why.
I'm still experimenting with this. I had been crawling with a depth of 1 because I don't need anything outside my URLs list, but I tried with a depth of 10. It went through a crawl loop that ended after 3 hours, then a second 3 hour crawl loop, then a third shorter loop. It still stopped 5 URLs short of crawling every URL in my list, though it crawled a few I hadn't included. Are these 3 hour loops standard for large crawls?
RE: Nutch fetching times out at 3 hours, not sure why.
Hi Sebastian, Thank you! Increasing my fetcher.threads.per.queue both fixed my crawl and saved me a lot of time. I'm still bewildered by the original problem, though. Both my fetcher.timelimit.mins and my fetcher.max.exceptions.per.queue are set to -1. I'll ignore it unless it causes a problem for my other cores. Chip

-Original Message- From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] Sent: Monday, April 30, 2018 12:21 PM To: user@nutch.apache.org Subject: Re: Nutch fetching times out at 3 hours, not sure why.

Hi, if you still see the log message

fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!

then it can be only
- fetcher.timelimit.mins
- fetcher.max.exceptions.per.queue

> I crawl a list of roughly 2600 URLs all on my local server

If this is the case you can crawl more aggressively, see fetcher.server.delay, or even fetch in parallel from your host, see fetcher.threads.per.queue. Best, Sebastian
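Sebastian's two knobs, as a sketch for conf/nutch-site.xml (both property names are real Nutch settings; the values are illustrative, and this aggressiveness is only reasonable because the crawled host is your own):

  <property>
    <name>fetcher.threads.per.queue</name>
    <value>5</value>
    <description>Fetch up to 5 URLs from the same host in parallel.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>0.5</value>
    <description>Seconds to wait between requests to the same server.</description>
  </property>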
Nutch fetching times out at 3 hours, not sure why.
I crawl a list of roughly 2600 URLs all on my local server, and I'm only crawling around 1000 of them. The fetcher quits after exactly 3 hours (give or take a few milliseconds) with this message in the log: 2018-04-13 15:50:48,885 INFO fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping! I've seen that 3 hours is the default in some Nutch installations, but I've got my fetcher.timelimit.mins set to -1. I'm sure I'm missing something obvious. Any thoughts would be greatly appreciated. Thank you. Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740-3840 USA Tel: +1 301-209-3180 Email: ccalh...@aip.org https://www.aip.org/history-programs/niels-bohr-library
RE: Nutch fetching times out at 3 hours, not sure why.
Hi Lewis, I'm using Nutch 1.12. Chip

-Original Message- From: lewis john mcgibbney [mailto:lewi...@apache.org] Sent: Wednesday, April 18, 2018 1:55 PM To: user@nutch.apache.org Subject: Re: Nutch fetching times out at 3 hours, not sure why.

Hi Chip, Which version of Nutch are you using?

-- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
RE: Nutch fetching times out at 3 hours, not sure why.
Hi Markus, I don't see an indication of the web server blocking me, though that sounds reasonable. Could there be a per-server limit in Nutch itself that we're overlooking, since this is all on the same server? Chip

-Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, April 17, 2018 3:58 PM To: user@nutch.apache.org Subject: RE: Nutch fetching times out at 3 hours, not sure why.

Hello Chip,

I have no clue where the three-hour limit could come from. Please take a further look at the last few minutes of the logs. The only thing I can think of is that a webserver would block you after some amount of requests per time window; that would be visible in the logs.

It is clear Nutch itself terminates the fetcher (the dropping line). That is only possible with an imposed time limit, or if you reached some number of exceptions (or one other variable I am forgetting).

Regards,
Markus
RE: Nutch fetching times out at 3 hours, not sure why.
I'm on 1.12, and mine also defaulted at -1. It does not fail at the same URL, or even at the same point in a URL's fetcher loop; it really seems to be time-based.

-Original Message- From: Sadiki Latty [mailto:sla...@uottawa.ca] Sent: Tuesday, April 17, 2018 1:43 PM To: user@nutch.apache.org Subject: RE: Nutch fetching times out at 3 hours, not sure why.

Which version are you running? That value is defaulted to -1 in my current version (1.14), so it shouldn't be something you needed to change. My crawls, by default, go for as much as 12 hours with little to no tweaking from the nutch-default. Something else is causing it. Is it always the same URL that it fails at?