Update: I get this after I'm done crawling:

Parsing http://www.brainpop.co.uk/
Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
        at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
        at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
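To see the real cause hiding behind the "job failed: name=parse" wrapper, it usually helps to look at the parser's own log output and then re-run only the parse step. A minimal sketch, assuming a default local Nutch 2.x runtime where logging goes to logs/hadoop.log (the ParserJob flags below are assumptions based on the 2.x usage string; run bin/nutch parse with no arguments to confirm them on your build):

    # show the underlying exception that ParserJob logged, not just the job wrapper
    grep -B 2 -A 20 "ParserJob" logs/hadoop.log | tail -n 60

    # re-run parsing for every fetched batch, forcing already-parsed pages to be parsed again
    bin/nutch parse -all -force

The stack trace above only tells you that the parse job returned a failure; the actual exception (bad HTML, a storage error, a missing parser plugin) is normally only visible in hadoop.log.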
On Sat, Aug 18, 2012 at 12:30 AM, Robert Irribarren <[email protected]> wrote:

> I actually didn't have it specified. I have now put this in nutch-site.xml; it
> looks like this:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>Balsa Crawler</value>
>   </property>
>
>   <property>
>     <name>db.ignore.external.links</name>
>     <value>false</value>
>     <description>If true, outlinks leading from a page to external hosts
>     will be ignored. This is an effective way to limit the crawl to include
>     only initially injected hosts, without creating complex URLFilters.
>     </description>
>   </property>
>
>   <property>
>     <name>storage.data.store.class</name>
>     <value>org.apache.gora.sql.store.SqlStore</value>
>     <description>The Gora DataStore class for storing and retrieving data.
>     Currently the following stores are available: ..
>     </description>
>   </property>
>
> </configuration>
>
>
> On Sat, Aug 18, 2012 at 12:15 AM, Stefan Scheffler <[email protected]> wrote:
>
>> Did you set db.ignore.external in *conf/nutch-site.xml*?
>> This prevents external links from being fetched.
>> Another problem could be that the robots.txt of the servers prevents the
>> crawler from fetching.
>> You can check this with *bin/nutch readdb*. There you can see whether the
>> sites were really fetched.
>> Regards,
>> Stefan
>>
>> On 18.08.2012 09:07, Robert Irribarren wrote:
>>
>>> I run this:
>>>
>>> nutch inject urls
>>> nutch generate
>>> bin/nutch crawl urls -depth 3 -topN 100
>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
>>> echo Crawling completed
>>> dir
>>>
>>> Then I see a lot of URLs being fetched during the crawl phase.
>>> When I run solrindex, it doesn't add all the URLs I see when it says
>>> "fetching":
>>>
>>> 54 URLs in 5 queues
>>> fetching http://www.tarpits.org/join-us
>>> fetching http://www.leonisadobemuseum.org/history-leonis.asp
>>> fetching http://az.wikipedia.org/wiki/Quercus_prinus
>>>
>>> It doesn't add the Wikipedia page nor the others.
>>>
>>> ADDITIONAL INFO: My regex-urlfilter.txt
>>>
>>> # skip file: ftp: and mailto: urls
>>> -^(file|ftp|mailto):
>>>
>>> # skip image and other suffixes we can't yet parse
>>> # for a more extensive coverage use the urlfilter-suffix plugin
>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>>
>>> # skip URLs containing certain characters as probable queries, etc.
>>> -[?*!@=]
>>>
>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>
>>> # accept anything else
>>> +.
>>> #################################################################
>>>
>>> ADDITIONAL INFO: Running on Solr 4.0 and Nutch 2.0
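On Stefan's readdb suggestion: checking whether the missing pages were ever fetched and parsed is a quick way to rule out the indexing step. A minimal sketch, assuming the Nutch 2.x WebTableReader behind bin/nutch readdb accepts -stats and -url (these flags are assumptions; run bin/nutch readdb with no arguments to see the exact options on your build):

    # overall counts per fetch status (fetched, unfetched, gone, ...)
    bin/nutch readdb -stats

    # full record for one of the pages that never shows up in Solr
    bin/nutch readdb -url http://az.wikipedia.org/wiki/Quercus_prinus

If the page shows up here as fetched and parsed but still never reaches Solr, the problem is on the solrindex side; if it is missing or unfetched, the URL filters, db.ignore.external.links, or the remote site's robots.txt are the more likely culprits.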

