And here is my hadoop.log

2012-08-18 08:30:13,069 INFO solr.SolrIndexerJob - SolrIndexerJob: starting
2012-08-18 08:30:13,658 INFO plugin.PluginRepository - Plugins: looking in: /usr/share/nutch/runtime/local/plugins
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Registered Plugins:
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Registered Extension-Points:
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Parse Filter (org.apache.nutch.parse.ParseFilter)
2012-08-18 08:30:13,867 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2012-08-18 08:30:13,867 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2012-08-18 08:30:13,867 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2012-08-18 08:30:13,867 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2012-08-18 08:30:13,881 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2012-08-18 08:30:13,883 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2012-08-18 08:30:13,883 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2012-08-18 08:30:14,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-08-18 08:30:15,960 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: content dest: content
2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: site dest: site
2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: title dest: title
2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: host dest: host
2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: segment dest: segment
2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: boost dest: boost
2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: digest dest: digest
2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2012-08-18 08:30:16,094 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2012-08-18 08:30:16,094 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2012-08-18 08:30:16,094 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2012-08-18 08:30:16,957 INFO solr.SolrWriter - Adding 36 documents
2012-08-18 08:30:19,859 INFO solr.SolrIndexerJob - SolrIndexerJob: done.
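
The number that matters in this log is the "Adding 36 documents" line from solr.SolrWriter. A minimal Python sketch for pulling that count out of a hadoop.log, so it can be set against the fetched count that readdb reports. The function name `count_indexed` and the two-line sample are only illustrative, and the regex assumes the SolrWriter message format seen in the log above.

```python
import re

# Matches lines such as "... INFO solr.SolrWriter - Adding 36 documents".
# The message format is taken from the log above; other Nutch versions may differ.
ADDING = re.compile(r"solr\.SolrWriter - Adding (\d+) documents")


def count_indexed(log_text: str) -> int:
    """Sum every document count that SolrWriter reports in the given log text."""
    return sum(int(n) for n in ADDING.findall(log_text))


# Two lines copied from the log above, used as a small sample.
sample = (
    "2012-08-18 08:30:16,957 INFO solr.SolrWriter - Adding 36 documents\n"
    "2012-08-18 08:30:19,859 INFO solr.SolrIndexerJob - SolrIndexerJob: done.\n"
)

print(count_indexed(sample))
```

For a real run, the file contents would be passed in instead of the sample, for example `count_indexed(open("hadoop.log").read())`, where the path is whatever your installation uses.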
On Sat, Aug 18, 2012 at 1:09 AM, Robert Irribarren <[email protected]> wrote:

> WebTable statistics start
> Statistics for WebTable:
> min score: 0.0
> status 2 (status_fetched): 1053
> jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats,
> counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce
> Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, MAP_INPUT_RECORDS=1234,
> REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=65418,
> COMMITTED_HEAP_BYTES=504635392, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1046,
> COMBINE_INPUT_RECORDS=4936, REDUCE_INPUT_RECORDS=12,
> REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0,
> REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4936},
> FileSystemCounters={FILE_BYTES_READ=878225, FILE_BYTES_WRITTEN=991145},
> File Output Format Counters ={BYTES_WRITTEN=375}}}}
> retry 0: 1233
> retry 1: 1
> TOTAL urls: 1234
> status 4 (status_redir_temp): 32
> status 5 (status_redir_perm): 47
> max score: 1.0
> status 34 (status_retry): 16
> status 3 (status_gone): 17
> status 0 (null): 69
> avg score: 0.01614992
> WebTable statistics: done
> min score: 0.0
> status 2 (status_fetched): 1053
> jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats,
> counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce
> Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, MAP_INPUT_RECORDS=1234,
> REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=65418,
> COMMITTED_HEAP_BYTES=504635392, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1046,
> COMBINE_INPUT_RECORDS=4936, REDUCE_INPUT_RECORDS=12,
> REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0,
> REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4936},
> FileSystemCounters={FILE_BYTES_READ=878225, FILE_BYTES_WRITTEN=991145},
> File Output Format Counters ={BYTES_WRITTEN=375}}}}
> retry 0: 1233
> retry 1: 1
> TOTAL urls: 1234
> status 4 (status_redir_temp): 32
> status 5 (status_redir_perm): 47
> max score: 1.0
> status 34 (status_retry): 16
> status 3 (status_gone): 17
> status 0 (null): 69
> avg score: 0.01614992
>
>
> This is what the db says, but it's not really what I see in my Solr. Perhaps
> I didn't set my Solr directory somewhere? Please help.
>
>
> On Sat, Aug 18, 2012 at 12:59 AM, Robert Irribarren
> <[email protected]> wrote:
>
>> Update: I get this after I'm done crawling
>>
>> Parsing http://www.brainpop.co.uk/
>> Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
>>         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
>>         at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
>>         at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
>>         at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
>>         at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
>>
>>
>>
>> On Sat, Aug 18, 2012 at 12:30 AM, Robert Irribarren <[email protected]> wrote:
>>
>>> I actually didn't have it specified. I have now put it in nutch-site.xml,
>>> which looks like this.
>>>
>>> <?xml version="1.0"?>
>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>
>>> <!-- Put site-specific property overrides in this file. -->
>>>
>>> <configuration>
>>> <property>
>>> <name>http.agent.name</name>
>>> <value>Balsa Crawler</value>
>>> </property>
>>>
>>> <property>
>>> <name>db.ignore.external.links</name>
>>> <value>false</value>
>>> <description>If true, outlinks leading from a page to external hosts
>>> will be ignored. This is an effective way to limit the crawl to include
>>> only initially injected hosts, without creating complex URLFilters.
>>> </description>
>>> </property>
>>>
>>> <property>
>>> <name>storage.data.store.class</name>
>>> <value>org.apache.gora.sql.store.SqlStore</value>
>>> <description>The Gora DataStore class for storing and retrieving data.
>>> Currently the following stores are available: ..
>>> </description>
>>> </property>
>>>
>>> </configuration>
>>>
>>>
>>>
>>> On Sat, Aug 18, 2012 at 12:15 AM, Stefan Scheffler <[email protected]> wrote:
>>>
>>>> Did you set db.ignore.external.links in *conf/nutch-site.xml*?
>>>> When it is true, external links are not fetched.
>>>> Another problem could be that the robots.txt of the servers prevents
>>>> the crawler from fetching.
>>>> You can check this with *bin/nutch readdb*. There you can see whether
>>>> the sites are really fetched.
>>>> Regards
>>>> Stefan
>>>>
>>>> On 18.08.2012 09:07, Robert Irribarren wrote:
>>>>
>>>>> I run this
>>>>> nutch inject urls
>>>>> nutch generate
>>>>> bin/nutch crawl urls -depth 3 -topN 100
>>>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
>>>>> echo Crawling completed
>>>>> dir
>>>>>
>>>>> Then I see a lot of URLs being fetched during the crawl phase.
>>>>> When I run the solrindex, it doesn't add all the URLs I see when it
>>>>> says fetching
>>>>>
>>>>> 54 URLs in 5 queues
>>>>> fetching http://www.tarpits.org/join-us
>>>>> fetching http://www.leonisadobemuseum.org/history-leonis.asp
>>>>> fetching http://az.wikipedia.org/wiki/Quercus_prinus
>>>>>
>>>>> It doesn't add wikipedia nor the others.
>>>>>
>>>>> ADDITIONAL INFO :
>>>>> My regex-urlfilter.txt
>>>>> # skip file: ftp: and mailto: urls
>>>>> -^(file|ftp|mailto):
>>>>>
>>>>> # skip image and other suffixes we can't yet parse
>>>>> # for a more extensive coverage use the urlfilter-suffix plugin
>>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>>>>
>>>>> # skip URLs containing certain characters as probable queries, etc.
>>>>> -[?*!@=]
>>>>>
>>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>>>
>>>>> # accept anything else
>>>>> +.
>>>>> #################################################################
>>>>>
>>>>> ADDITIONAL INFO : Running on solr 4.0 nutch 2.0
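
One way to rule out the URL filter as the reason the fetched pages are missing from Solr is to run the quoted rules against the three URLs from the fetch output. The Python sketch below is not Nutch code. It assumes that the regex filter walks the rules top to bottom and that the first pattern found anywhere in the URL decides, with `-` meaning reject and `+` meaning accept. The suffix rule is written without the `**` markers, on the assumption that those are mail-archive artifacts. The name `passes_filter` is made up for this example.

```python
import re

# Rules copied from the regex-urlfilter.txt quoted in this thread, in order.
# Each entry is (accept, pattern): False stands for a "-" rule, True for "+".
RULES = [
    (False, r"^(file|ftp|mailto):"),
    (False, r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS"
            r"|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM"
            r"|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    (False, r"[?*!@=]"),
    (False, r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    (True, r"."),
]
COMPILED = [(accept, re.compile(pattern)) for accept, pattern in RULES]


def passes_filter(url: str) -> bool:
    """Return the verdict of the first rule whose pattern is found in the URL.

    Assumption: first match wins and patterns use search (not full-match)
    semantics. A URL that no rule matches is treated as rejected.
    """
    for accept, pattern in COMPILED:
        if pattern.search(url):
            return accept
    return False


# The three URLs that appear in the fetch output quoted above.
for url in (
    "http://www.tarpits.org/join-us",
    "http://www.leonisadobemuseum.org/history-leonis.asp",
    "http://az.wikipedia.org/wiki/Quercus_prinus",
):
    print(url, passes_filter(url))
```

Under these assumptions all three URLs pass, which would point away from regex-urlfilter.txt and toward the failed parse job or the indexing step.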

