I fixed the errors, thanks.

On Sat, Aug 18, 2012 at 1:33 AM, Robert Irribarren <[email protected]> wrote:
> And here is my hadoop.log
>
> 2012-08-18 08:30:13,069 INFO solr.SolrIndexerJob - SolrIndexerJob: starting
> 2012-08-18 08:30:13,658 INFO plugin.PluginRepository - Plugins: looking in: /usr/share/nutch/runtime/local/plugins
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Registered Plugins:
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Html Parse Plug-in (parse-html)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Basic Indexing Filter (index-basic)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         HTTP Framework (lib-http)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Registered Extension-Points:
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Parse Filter (org.apache.nutch.parse.ParseFilter)
> 2012-08-18 08:30:13,867 INFO plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2012-08-18 08:30:13,867 INFO plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2012-08-18 08:30:13,867 INFO plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2012-08-18 08:30:13,867 INFO plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2012-08-18 08:30:13,881 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2012-08-18 08:30:13,883 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2012-08-18 08:30:13,883 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2012-08-18 08:30:14,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2012-08-18 08:30:15,960 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
> 2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: content dest: content
> 2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: site dest: site
> 2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: title dest: title
> 2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: host dest: host
> 2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: segment dest: segment
> 2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: boost dest: boost
> 2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: digest dest: digest
> 2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> 2012-08-18 08:30:16,094 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2012-08-18 08:30:16,094 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2012-08-18 08:30:16,094 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2012-08-18 08:30:16,957 INFO solr.SolrWriter - Adding 36 documents
> 2012-08-18 08:30:19,859 INFO solr.SolrIndexerJob - SolrIndexerJob: done.
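What stands out in the log above is SolrWriter adding only 36 documents against the 1053 pages the fetcher reported. A quick sanity check is to pull that count straight out of hadoop.log; here is a minimal sketch (the inline log line is a stand-in copied from the output above — in practice, pipe your real hadoop.log through the same sed expression):

```shell
# Extract the document count from the SolrWriter log line.
# The inline string below stands in for a real hadoop.log;
# in practice: sed -n '...' logs/hadoop.log
log='2012-08-18 08:30:16,957 INFO solr.SolrWriter - Adding 36 documents'
echo "$log" | sed -n 's/.*Adding \([0-9]*\) documents.*/\1/p'
# → 36
```

Comparing this number against the status_fetched count from readdb quickly shows whether it is the indexing stage or the fetch stage that is dropping documents.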
>
> On Sat, Aug 18, 2012 at 1:09 AM, Robert Irribarren <[email protected]> wrote:
>
>> WebTable statistics start
>> Statistics for WebTable:
>> min score: 0.0
>> status 2 (status_fetched): 1053
>> jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, MAP_INPUT_RECORDS=1234, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=65418, COMMITTED_HEAP_BYTES=504635392, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1046, COMBINE_INPUT_RECORDS=4936, REDUCE_INPUT_RECORDS=12, REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4936}, FileSystemCounters={FILE_BYTES_READ=878225, FILE_BYTES_WRITTEN=991145}, File Output Format Counters ={BYTES_WRITTEN=375}}}}
>> retry 0: 1233
>> retry 1: 1
>> TOTAL urls: 1234
>> status 4 (status_redir_temp): 32
>> status 5 (status_redir_perm): 47
>> max score: 1.0
>> status 34 (status_retry): 16
>> status 3 (status_gone): 17
>> status 0 (null): 69
>> avg score: 0.01614992
>> WebTable statistics: done
>> min score: 0.0
>> status 2 (status_fetched): 1053
>> jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, MAP_INPUT_RECORDS=1234, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=65418, COMMITTED_HEAP_BYTES=504635392, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1046, COMBINE_INPUT_RECORDS=4936, REDUCE_INPUT_RECORDS=12, REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4936}, FileSystemCounters={FILE_BYTES_READ=878225, FILE_BYTES_WRITTEN=991145}, File Output Format Counters ={BYTES_WRITTEN=375}}}}
>> retry 0: 1233
>> retry 1: 1
>> TOTAL urls: 1234
>> status 4 (status_redir_temp): 32
>> status 5 (status_redir_perm): 47
>> max score: 1.0
>> status 34 (status_retry): 16
>> status 3 (status_gone): 17
>> status 0 (null): 69
>> avg score: 0.01614992
>>
>> This is what the DB says, but it's not what I actually see in my Solr index. Perhaps I didn't set my Solr directory somewhere? Please help.
>>
>> On Sat, Aug 18, 2012 at 12:59 AM, Robert Irribarren <[email protected]> wrote:
>>
>>> Update: I get this after I'm done crawling:
>>>
>>> Parsing http://www.brainpop.co.uk/
>>> Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
>>>         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
>>>         at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
>>>         at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
>>>         at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
>>>         at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
>>>
>>> On Sat, Aug 18, 2012 at 12:30 AM, Robert Irribarren <[email protected]> wrote:
>>>
>>>> I actually didn't have it specified; I've put it in now, so the nutch-site.xml looks like this:
>>>>
>>>> <?xml version="1.0"?>
>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>>
>>>> <!-- Put site-specific property overrides in this file. -->
>>>>
>>>> <configuration>
>>>> <property>
>>>> <name>http.agent.name</name>
>>>> <value>Balsa Crawler</value>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>db.ignore.external.links</name>
>>>> <value>false</value>
>>>> <description>If true, outlinks leading from a page to external hosts
>>>> will be ignored. This is an effective way to limit the crawl to include
>>>> only initially injected hosts, without creating complex URLFilters.
>>>> </description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>storage.data.store.class</name>
>>>> <value>org.apache.gora.sql.store.SqlStore</value>
>>>> <description>The Gora DataStore class for storing and retrieving data.
>>>> Currently the following stores are available: ..
>>>> </description>
>>>> </property>
>>>>
>>>> </configuration>
>>>>
>>>> On Sat, Aug 18, 2012 at 12:15 AM, Stefan Scheffler <[email protected]> wrote:
>>>>
>>>>> Did you set db.ignore.external in conf/nutch-site.xml?
>>>>> That prevents external links from being fetched.
>>>>> Another possible problem is that the servers' robots.txt prevents the crawler from fetching.
>>>>> You can check this with bin/nutch readdb; there you can see whether the sites were really fetched.
>>>>> Regards,
>>>>> Stefan
>>>>>
>>>>> On 18.08.2012 09:07, Robert Irribarren wrote:
>>>>>
>>>>>> I run this:
>>>>>>
>>>>>> nutch inject urls
>>>>>> nutch generate
>>>>>> bin/nutch crawl urls -depth 3 -topN 100
>>>>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
>>>>>> echo Crawling completed
>>>>>> dir
>>>>>>
>>>>>> Then I see a lot of URLs being fetched during the crawl phase.
>>>>>> But when I run solrindex, it doesn't add all the URLs I saw being fetched:
>>>>>>
>>>>>> 54 URLs in 5 queues
>>>>>> fetching http://www.tarpits.org/join-us
>>>>>> fetching http://www.leonisadobemuseum.org/history-leonis.asp
>>>>>> fetching http://az.wikipedia.org/wiki/Quercus_prinus
>>>>>>
>>>>>> It doesn't add the Wikipedia page or the others.
>>>>>>
>>>>>> ADDITIONAL INFO:
>>>>>> My regex-urlfilter.txt:
>>>>>>
>>>>>> # skip file: ftp: and mailto: urls
>>>>>> -^(file|ftp|mailto):
>>>>>>
>>>>>> # skip image and other suffixes we can't yet parse
>>>>>> # for a more extensive coverage use the urlfilter-suffix plugin
>>>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>>>>>
>>>>>> # skip URLs containing certain characters as probable queries, etc.
>>>>>> -[?*!@=]
>>>>>>
>>>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>>>>
>>>>>> # accept anything else
>>>>>> +.
>>>>>> #################################################################
>>>>>>
>>>>>> ADDITIONAL INFO: Running on Solr 4.0 and Nutch 2.0
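The `-[?*!@=]` rule in a regex-urlfilter.txt like the one above is a common reason URLs silently disappear, so it is worth checking the missing URLs against it. A minimal sketch using grep with the same character class (the Wikipedia URL is taken from the fetch output quoted earlier):

```shell
# Does the query-character rule from regex-urlfilter.txt match this URL?
# No match means this particular rule is not what filtered it out.
url="http://az.wikipedia.org/wiki/Quercus_prinus"
echo "$url" | grep -E '[?*!@=]' || echo "not filtered"
# → not filtered
```

Here the URL passes, which suggests the missing documents are being dropped downstream of the URL filters — for example by the failed parse job reported earlier in the thread.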

