WebTable statistics start
Statistics for WebTable:
min score: 0.0
status 2 (status_fetched): 1053
jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats,
counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce
Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, MAP_INPUT_RECORDS=1234,
REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=65418,
COMMITTED_HEAP_BYTES=504635392, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1046,
COMBINE_INPUT_RECORDS=4936, REDUCE_INPUT_RECORDS=12,
REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0,
REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4936},
FileSystemCounters={FILE_BYTES_READ=878225, FILE_BYTES_WRITTEN=991145},
File Output Format Counters ={BYTES_WRITTEN=375}}}}
retry 0: 1233
retry 1: 1
TOTAL urls: 1234
status 4 (status_redir_temp): 32
status 5 (status_redir_perm): 47
max score: 1.0
status 34 (status_retry): 16
status 3 (status_gone): 17
status 0 (null): 69
avg score: 0.01614992
WebTable statistics: done
This is what the db says, but it's not really what I see in my Solr index. Perhaps
I didn't set my Solr directory somewhere? Please help.
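
A quick way to compare the two sides is to put the WebTable count next to the Solr document count. A rough sketch, assuming the Solr URL from the commands quoted below and a single default core:

# what Nutch thinks it has fetched (the same report as above)
bin/nutch readdb -stats

# how many documents are actually in the Solr index (look at numFound in the response)
curl "http://127.0.0.1:8983/solr/select?q=*:*&rows=0&wt=json"

If numFound is far below the 1053 fetched pages, the gap usually comes from pages that were fetched but never parsed or indexed rather than from a wrong Solr directory.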
On Sat, Aug 18, 2012 at 12:59 AM, Robert Irribarren <[email protected]> wrote:
> Update: I get this after I'm done crawling:
>
> Parsing http://www.brainpop.co.uk/
> Exception in thread "main" java.lang.RuntimeException: job failed:
> name=parse, jobid=job_local_0004
>         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
> at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
> at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
> at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
> at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
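>
> A failed parse job would also explain documents missing from Solr: pages that were fetched but never parsed are normally skipped at indexing time. The RuntimeException above only says the job failed; the underlying parser error usually shows up in the Hadoop log (a rough check, run from the Nutch runtime directory):
>
> # look at the end of the log right after the failed run
> tail -n 200 logs/hadoop.log
>
> # or pull out the most recent errors and exceptions
> grep -iE "error|exception" logs/hadoop.log | tail -n 40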
>
>
>
> On Sat, Aug 18, 2012 at 12:30 AM, Robert Irribarren
> <[email protected]>wrote:
>
>> I actually didn't have it specified. I've now put this in nutch-site.xml;
>> it looks like this:
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>> <property>
>> <name>http.agent.name</name>
>> <value>Balsa Crawler</value>
>> </property>
>>
>> <property>
>> <name>db.ignore.external.links</name>
>> <value>false</value>
>> <description>If true, outlinks leading from a page to external hosts
>> will be ignored. This is an effective way to limit the crawl to include
>> only initially injected hosts, without creating complex URLFilters.
>> </description>
>> </property>
>>
>> <property>
>> <name>storage.data.store.class</name>
>> <value>org.apache.gora.sql.store.SqlStore</value>
>> <description>The Gora DataStore class for storing and retrieving data.
>> Currently the following stores are available: ..
>> </description>
>> </property>
>>
>> </configuration>
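>>
>> Since storage.data.store.class points at SqlStore, it may also be worth confirming that conf/gora.properties points at the database you expect; a rough check (the gora.sqlstore.jdbc.* key names are quoted from memory for the gora-sql backend, so treat them as an assumption and compare with the file shipped in your Nutch 2.0 conf):
>>
>> # show the active (non-comment) Gora settings
>> grep -v '^#' conf/gora.properties
>> # expected to contain something along the lines of:
>> #   gora.sqlstore.jdbc.driver=...
>> #   gora.sqlstore.jdbc.url=...
>> #   gora.sqlstore.jdbc.user=...
>> #   gora.sqlstore.jdbc.password=...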
>>
>>
>>
>> On Sat, Aug 18, 2012 at 12:15 AM, Stefan Scheffler <
>> [email protected]> wrote:
>>
>>> Did you set db.ignore.external.links in *conf/nutch-site.xml*?
>>> If it is set to true, external links are not fetched.
>>> Another problem could be that the robots.txt of the servers prevents
>>> the crawler from fetching.
>>> You can check this with *bin/nutch readdb*; there you can see whether
>>> the sites were really fetched.
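>>>
>>> For example, on Nutch 2.x something like this (options from memory; running bin/nutch readdb with no arguments prints the exact usage):
>>>
>>> # per-status counts for the whole WebTable
>>> bin/nutch readdb -stats
>>>
>>> # status and metadata of one specific URL
>>> bin/nutch readdb -url http://az.wikipedia.org/wiki/Quercus_prinus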
>>> regards
>>> Stefan
>>>
>>> On 18.08.2012 09:07, Robert Irribarren wrote:
>>>
>>>> I run this:
>>>> nutch inject urls
>>>> nutch generate
>>>> bin/nutch crawl urls -depth 3 -topN 100
>>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
>>>> echo Crawling completed
>>>> dir
>>>>
>>>> Then I see a lot of URLs being fetched during the crawl phase.
>>>> When I run solrindex, it doesn't add all the URLs I see when it says
>>>> "fetching":
>>>>
>>>> 54 URLs in 5 queues
>>>> fetching http://www.tarpits.org/join-us
>>>> fetching http://www.leonisadobemuseum.org/history-leonis.asp
>>>> fetching http://az.wikipedia.org/wiki/Quercus_prinus
>>>>
>>>> It doesn't add the Wikipedia page or the others.
>>>>
>>>> ADDITIONAL INFO: my regex-urlfilter.txt:
>>>> # skip file: ftp: and mailto: urls
>>>> -^(file|ftp|mailto):
>>>>
>>>> # skip image and other suffixes we can't yet parse
>>>> # for a more extensive coverage use the urlfilter-suffix plugin
>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>>>
>>>> # skip URLs containing certain characters as probable queries, etc.
>>>> -[?*!@=]
>>>>
>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break
>>>> loops
>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>>
>>>> # accept anything else
>>>> +.
>>>> #################################################################
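>>>>
>>>> One way to rule the filters out is to push the sample URLs through the active filter chain; a sketch, assuming your build still ships org.apache.nutch.net.URLFilterChecker (it is there in the 1.x line, so check that the class exists in 2.0):
>>>>
>>>> # a leading "+" means the URL is accepted, "-" means some filter rejected it
>>>> echo "http://az.wikipedia.org/wiki/Quercus_prinus" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined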
>>>>
>>>> ADDITIONAL INFO: running Solr 4.0 and Nutch 2.0.
>>>>
>>>>
>>>
>>
>