Update: I get this after I'm done crawling:

Parsing http://www.brainpop.co.uk/
Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
        at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
        at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
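To see the real cause hiding behind the "job failed: name=parse" wrapper, it usually helps to look at the parser's own log output and then re-run only the parse step. A minimal sketch, assuming a default local Nutch 2.x runtime where logging goes to logs/hadoop.log (the ParserJob flags below are assumptions based on the 2.x usage string; run bin/nutch parse with no arguments to confirm them on your build):

    # show the underlying exception that ParserJob logged, not just the job wrapper
    grep -B 2 -A 20 "ParserJob" logs/hadoop.log | tail -n 60

    # re-run parsing for every fetched batch, forcing already-parsed pages to be parsed again
    bin/nutch parse -all -force

The stack trace above only tells you that the parse job returned a failure; the actual exception (bad HTML, a storage error, a missing parser plugin) is normally only visible in hadoop.log.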
On Sat, Aug 18, 2012 at 12:30 AM, Robert Irribarren <[email protected]> wrote:

> I actually didn't have it specified. I have now put this in nutch-site.xml; it
> looks like this:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>Balsa Crawler</value>
>   </property>
>
>   <property>
>     <name>db.ignore.external.links</name>
>     <value>false</value>
>     <description>If true, outlinks leading from a page to external hosts
>     will be ignored. This is an effective way to limit the crawl to include
>     only initially injected hosts, without creating complex URLFilters.
>     </description>
>   </property>
>
>   <property>
>     <name>storage.data.store.class</name>
>     <value>org.apache.gora.sql.store.SqlStore</value>
>     <description>The Gora DataStore class for storing and retrieving data.
>     Currently the following stores are available: ..
>     </description>
>   </property>
>
> </configuration>
>
>
> On Sat, Aug 18, 2012 at 12:15 AM, Stefan Scheffler <[email protected]> wrote:
>
>> Did you set db.ignore.external in *conf/nutch-site.xml*?
>> This prevents external links from being fetched.
>> Another problem could be that the robots.txt of the servers prevents the
>> crawler from fetching.
>> You can check this with *bin/nutch readdb*. There you can see whether the
>> sites were really fetched.
>> Regards,
>> Stefan
>>
>> On 18.08.2012 09:07, Robert Irribarren wrote:
>>
>>> I run this:
>>>
>>> nutch inject urls
>>> nutch generate
>>> bin/nutch crawl urls -depth 3 -topN 100
>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
>>> echo Crawling completed
>>> dir
>>>
>>> Then I see a lot of URLs being fetched during the crawl phase.
>>> When I run solrindex, it doesn't add all the URLs I see when it says
>>> "fetching":
>>>
>>> 54 URLs in 5 queues
>>> fetching http://www.tarpits.org/join-us
>>> fetching http://www.leonisadobemuseum.org/history-leonis.asp
>>> fetching http://az.wikipedia.org/wiki/Quercus_prinus
>>>
>>> It doesn't add the Wikipedia page nor the others.
>>>
>>> ADDITIONAL INFO: My regex-urlfilter.txt
>>>
>>> # skip file: ftp: and mailto: urls
>>> -^(file|ftp|mailto):
>>>
>>> # skip image and other suffixes we can't yet parse
>>> # for a more extensive coverage use the urlfilter-suffix plugin
>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>>
>>> # skip URLs containing certain characters as probable queries, etc.
>>> -[?*!@=]
>>>
>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>
>>> # accept anything else
>>> +.
>>> #################################################################
>>>
>>> ADDITIONAL INFO: Running on Solr 4.0 and Nutch 2.0
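On Stefan's readdb suggestion: checking whether the missing pages were ever fetched and parsed is a quick way to rule out the indexing step. A minimal sketch, assuming the Nutch 2.x WebTableReader behind bin/nutch readdb accepts -stats and -url (these flags are assumptions; run bin/nutch readdb with no arguments to see the exact options on your build):

    # overall counts per fetch status (fetched, unfetched, gone, ...)
    bin/nutch readdb -stats

    # full record for one of the pages that never shows up in Solr
    bin/nutch readdb -url http://az.wikipedia.org/wiki/Quercus_prinus

If the page shows up here as fetched and parsed but still never reaches Solr, the problem is on the solrindex side; if it is missing or unfetched, the URL filters, db.ignore.external.links, or the remote site's robots.txt are the more likely culprits.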

