I fixed the errors, thanks.

On Sat, Aug 18, 2012 at 1:33 AM, Robert Irribarren <[email protected]> wrote:
> And here is my hadoop.log
>
> 2012-08-18 08:30:13,069 INFO solr.SolrIndexerJob - SolrIndexerJob: starting
> 2012-08-18 08:30:13,658 INFO plugin.PluginRepository - Plugins: looking in: /usr/share/nutch/runtime/local/plugins
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Registered Plugins:
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Html Parse Plug-in (parse-html)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Basic Indexing Filter (index-basic)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         HTTP Framework (lib-http)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository - Registered Extension-Points:
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2012-08-18 08:30:13,866 INFO plugin.PluginRepository -         Parse Filter (org.apache.nutch.parse.ParseFilter)
> 2012-08-18 08:30:13,867 INFO plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2012-08-18 08:30:13,867 INFO plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2012-08-18 08:30:13,867 INFO plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2012-08-18 08:30:13,867 INFO plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2012-08-18 08:30:13,881 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2012-08-18 08:30:13,883 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2012-08-18 08:30:13,883 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2012-08-18 08:30:14,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2012-08-18 08:30:15,960 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
> 2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: content dest: content
> 2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: site dest: site
> 2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: title dest: title
> 2012-08-18 08:30:16,091 INFO solr.SolrMappingReader - source: host dest: host
> 2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: segment dest: segment
> 2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: boost dest: boost
> 2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: digest dest: digest
> 2012-08-18 08:30:16,092 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> 2012-08-18 08:30:16,094 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2012-08-18 08:30:16,094 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2012-08-18 08:30:16,094 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2012-08-18 08:30:16,957 INFO solr.SolrWriter - Adding 36 documents
> 2012-08-18 08:30:19,859 INFO solr.SolrIndexerJob - SolrIndexerJob: done.
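What stands out in the log above is SolrWriter adding only 36 documents against the 1053 pages the fetcher reported. A quick sanity check is to pull that count straight out of hadoop.log; here is a minimal sketch (the inline log line is a stand-in copied from the output above — in practice, pipe your real hadoop.log through the same sed expression):

```shell
# Extract the document count from the SolrWriter log line.
# The inline string below stands in for a real hadoop.log;
# in practice: sed -n '...' logs/hadoop.log
log='2012-08-18 08:30:16,957 INFO solr.SolrWriter - Adding 36 documents'
echo "$log" | sed -n 's/.*Adding \([0-9]*\) documents.*/\1/p'
# → 36
```

Comparing this number against the status_fetched count from readdb quickly shows whether it is the indexing stage or the fetch stage that is dropping documents.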
>
> On Sat, Aug 18, 2012 at 1:09 AM, Robert Irribarren <[email protected]> wrote:
>
>> WebTable statistics start
>> Statistics for WebTable:
>> min score: 0.0
>> status 2 (status_fetched): 1053
>> jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, MAP_INPUT_RECORDS=1234, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=65418, COMMITTED_HEAP_BYTES=504635392, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1046, COMBINE_INPUT_RECORDS=4936, REDUCE_INPUT_RECORDS=12, REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4936}, FileSystemCounters={FILE_BYTES_READ=878225, FILE_BYTES_WRITTEN=991145}, File Output Format Counters ={BYTES_WRITTEN=375}}}}
>> retry 0: 1233
>> retry 1: 1
>> TOTAL urls: 1234
>> status 4 (status_redir_temp): 32
>> status 5 (status_redir_perm): 47
>> max score: 1.0
>> status 34 (status_retry): 16
>> status 3 (status_gone): 17
>> status 0 (null): 69
>> avg score: 0.01614992
>> WebTable statistics: done
>> min score: 0.0
>> status 2 (status_fetched): 1053
>> jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, MAP_INPUT_RECORDS=1234, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=65418, COMMITTED_HEAP_BYTES=504635392, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1046, COMBINE_INPUT_RECORDS=4936, REDUCE_INPUT_RECORDS=12, REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4936}, FileSystemCounters={FILE_BYTES_READ=878225, FILE_BYTES_WRITTEN=991145}, File Output Format Counters ={BYTES_WRITTEN=375}}}}
>> retry 0: 1233
>> retry 1: 1
>> TOTAL urls: 1234
>> status 4 (status_redir_temp): 32
>> status 5 (status_redir_perm): 47
>> max score: 1.0
>> status 34 (status_retry): 16
>> status 3 (status_gone): 17
>> status 0 (null): 69
>> avg score: 0.01614992
>>
>> This is what the DB says, but it's not what I actually see in my Solr index. Perhaps I didn't set my Solr directory somewhere? Please help.
>>
>> On Sat, Aug 18, 2012 at 12:59 AM, Robert Irribarren <[email protected]> wrote:
>>
>>> Update: I get this after I'm done crawling:
>>>
>>> Parsing http://www.brainpop.co.uk/
>>> Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
>>>         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
>>>         at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
>>>         at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
>>>         at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
>>>         at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
>>>
>>> On Sat, Aug 18, 2012 at 12:30 AM, Robert Irribarren <[email protected]> wrote:
>>>
>>>> I actually didn't have it specified; I've put it in now, so the nutch-site.xml looks like this:
>>>>
>>>> <?xml version="1.0"?>
>>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>>
>>>> <!-- Put site-specific property overrides in this file. -->
>>>>
>>>> <configuration>
>>>> <property>
>>>> <name>http.agent.name</name>
>>>> <value>Balsa Crawler</value>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>db.ignore.external.links</name>
>>>> <value>false</value>
>>>> <description>If true, outlinks leading from a page to external hosts
>>>> will be ignored. This is an effective way to limit the crawl to include
>>>> only initially injected hosts, without creating complex URLFilters.
>>>> </description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>storage.data.store.class</name>
>>>> <value>org.apache.gora.sql.store.SqlStore</value>
>>>> <description>The Gora DataStore class for storing and retrieving data.
>>>> Currently the following stores are available: ..
>>>> </description>
>>>> </property>
>>>>
>>>> </configuration>
>>>>
>>>> On Sat, Aug 18, 2012 at 12:15 AM, Stefan Scheffler <[email protected]> wrote:
>>>>
>>>>> Did you set db.ignore.external in conf/nutch-site.xml?
>>>>> That prevents external links from being fetched.
>>>>> Another possible problem is that the servers' robots.txt prevents the crawler from fetching.
>>>>> You can check this with bin/nutch readdb; there you can see whether the sites were really fetched.
>>>>> Regards,
>>>>> Stefan
>>>>>
>>>>> On 18.08.2012 09:07, Robert Irribarren wrote:
>>>>>
>>>>>> I run this:
>>>>>>
>>>>>> nutch inject urls
>>>>>> nutch generate
>>>>>> bin/nutch crawl urls -depth 3 -topN 100
>>>>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
>>>>>> echo Crawling completed
>>>>>> dir
>>>>>>
>>>>>> Then I see a lot of URLs being fetched during the crawl phase.
>>>>>> But when I run solrindex, it doesn't add all the URLs I saw being fetched:
>>>>>>
>>>>>> 54 URLs in 5 queues
>>>>>> fetching http://www.tarpits.org/join-us
>>>>>> fetching http://www.leonisadobemuseum.org/history-leonis.asp
>>>>>> fetching http://az.wikipedia.org/wiki/Quercus_prinus
>>>>>>
>>>>>> It doesn't add the Wikipedia page or the others.
>>>>>>
>>>>>> ADDITIONAL INFO:
>>>>>> My regex-urlfilter.txt:
>>>>>>
>>>>>> # skip file: ftp: and mailto: urls
>>>>>> -^(file|ftp|mailto):
>>>>>>
>>>>>> # skip image and other suffixes we can't yet parse
>>>>>> # for a more extensive coverage use the urlfilter-suffix plugin
>>>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>>>>>
>>>>>> # skip URLs containing certain characters as probable queries, etc.
>>>>>> -[?*!@=]
>>>>>>
>>>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>>>>
>>>>>> # accept anything else
>>>>>> +.
>>>>>> #################################################################
>>>>>>
>>>>>> ADDITIONAL INFO: Running on Solr 4.0 and Nutch 2.0
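The `-[?*!@=]` rule in a regex-urlfilter.txt like the one above is a common reason URLs silently disappear, so it is worth checking the missing URLs against it. A minimal sketch using grep with the same character class (the Wikipedia URL is taken from the fetch output quoted earlier):

```shell
# Does the query-character rule from regex-urlfilter.txt match this URL?
# No match means this particular rule is not what filtered it out.
url="http://az.wikipedia.org/wiki/Quercus_prinus"
echo "$url" | grep -E '[?*!@=]' || echo "not filtered"
# → not filtered
```

Here the URL passes, which suggests the missing documents are being dropped downstream of the URL filters — for example by the failed parse job reported earlier in the thread.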

