Re: skipping invalid segments nutch 1.3

2011-07-21 Thread lewis john mcgibbney
Hi Leo,

From the times both the fetching and parsing took, I suspect that maybe
Nutch didn't actually fetch the URL, however this may not be the case as I
have nothing to benchmark it against. Unfortunately, on this occasion the URL
http://wiki.apache.org actually redirects to
http://wiki.apache.org/general/ so I'm going to post my log output from
the last URL you specified in an attempt to clear this one up. The following
confirms that your observations are accurate: not only does this produce
invalid segments, but nothing is fetched in the process.

Therefore the reason that we are getting the "skipping invalid segment"
message is that we are not actually fetching any content. My initial
thought was that your urlfilters were not set up properly, and I think
this is part of the problem.

Please follow the syntax very carefully and it will work for you, as
follows:

regex-urlfilter.txt
--

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# crawl URLs in the following domains.
+^http://([a-z0-9]*\.)*seek.com.au/

# accept anything else
#+.

seed file
--
http://www.seek.com.au

It sounds really trivial, but I think that the trailing '/' in your seed
file may have been making all of the difference.

Please try it, test with readdb and readseg, and comment back.
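
For example (using the paths from your earlier mail; this is only a sketch of
the checks I have in mind, so adjust to your setup):

bin/nutch readdb /home/llist/nutchData/crawl/crawldb -stats
bin/nutch readseg -list /home/llist/nutchData/crawl/segments/20110721122519

readdb -stats should show the db_fetched count going up after a successful
fetch/parse/updatedb cycle, and readseg -list should report non-zero FETCHED
and PARSED counts for the segment if content was actually retrieved.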

Sorry for the delayed posts on this one; I have not had much time to get to
it. Hope all goes to plan. Evidence can be seen below:

lewis@lewis-01:~/ASF/branch-1.4/runtime/local$ bin/nutch readdb crawldb
-stats
CrawlDb statistics start: crawldb
Statistics for CrawlDb: crawldb
TOTAL urls: 48
retry 0: 48
min score: 0.017
avg score: 0.041125
max score: 1.175
status 1 (db_unfetched): 47
status 2 (db_fetched): 1
CrawlDb statistics: done





On Thu, Jul 21, 2011 at 3:30 AM, Leo Subscriptions llsub...@zudiewiener.com
 wrote:

 Following are the suggested commands and the results. I left the redirect
 setting at 0, as 'crawl' works without any issues; the problem only occurs
 when running the individual commands.

 --- nutch-site.xml ---
 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <!-- Put site-specific property overrides in this file. -->

 <configuration>

 <property>
  <name>http.agent.name</name>
  <value>listers spider</value>
 </property>

 <property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
 </property>

 <property>
  <name>http.verbose</name>
  <value>true</value>
  <description>If true, HTTP will log more verbosely.</description>
 </property>

 </configuration>
 ---

 -- Individual commands and results --

 llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
 inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed/urls
 Injector: starting at 2011-07-21 12:24:52
 Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
 Injector: urlDir: /home/llist/nutchData/seed/urls
 Injector: Converting injected urls to crawl db entries.
 Injector: Merging injected urls into crawl db.
 Injector: finished at 2011-07-21 12:24:55, elapsed: 00:00:02


 llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
 generate /home/llist/nutchData/crawl/crawldb
 /home/llist/nutchData/crawl/segments -topN 100
 Generator: starting at 2011-07-21 12:25:16
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: topN: 100
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls for politeness.
 Generator: segment: /home/llist/nutchData/crawl/segments/20110721122519
 Generator: finished at 2011-07-21 12:25:20, elapsed: 00:00:03


 llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
 fetch /home/llist/nutchData/crawl/segments/20110721122519
 Fetcher: Your 'http.agent.name' value should be listed first in
 'http.robots.agents' property.
 Fetcher: starting at 2011-07-21 12:26:36
 Fetcher: segment: /home/llist/nutchData/crawl/segments/20110721122519
 Fetcher: threads: 10
 QueueFeeder finished: total 1 records + hit by time limit :0
 -finishing thread FetcherThread, activeThreads=1
 fetching http://wiki.apache.org/
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=1
 -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
 -finishing thread FetcherThread, activeThreads=0
 -activeThreads=0, 

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Sebastian Nagel

Hi Leo, hi Lewis,


From the times both the fetching and parsing took, I suspect that maybe
Nutch didn't actually fetch the URL,


This may be the reason: empty segments may break some of the crawler steps.

But if I'm not wrong, it looks like the updatedb command is not quite
correct:

 llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
 updatedb /home/llist/nutchData/crawl/crawldb
 -dir /home/llist/nutchData/crawl/segments/20110721122519
 CrawlDb update: starting at 2011-07-21 12:28:03
 CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
 CrawlDb update: segments:
 [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text,
 file:/home/llist/nutchData/crawl/segments/20110721122519/content,
 file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse,
 file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data,
 file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch,
 file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
 CrawlDb update: additions allowed: true

As with other commands that read segments, there are two ways to pass
segments as arguments: 1) all segments enumerated explicitly, or 2) via -dir,
the parent directory of all segments. See:

% $NUTCH_HOME/bin/nutch updatedb
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]

  crawldb        CrawlDb to update
  -dir segments  parent directory containing all segments to update from
  seg1 seg2 ...  list of segment names to update from

Try your updatedb command without -dir; it should work.
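
For example, either of these forms should be accepted (using the paths from
your mail; I haven't run them against your data, so treat this as a sketch):

$NUTCH_HOME/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/20110721122519

or, with -dir pointing at the parent directory that holds all the segments:

$NUTCH_HOME/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments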

Sebastian


Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Leo Subscriptions
Hi Lewis,

I will try your suggestion shortly, but I am still puzzled as to why the
crawl command works. Isn't it using the same filters, etc.?

Cheers,

Leo

On Thu, 2011-07-21 at 20:55 +0100, lewis john mcgibbney wrote:

 [...]

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Leo Subscriptions
Hi Lewis,

Following are the things I tried and the relevant sources and logs:


1. Ran 'crawl' without a trailing / in the URL (http://www.seek.com.au);
result OK.
2. Ran 'crawl' with a trailing / in the URL (http://www.seek.com.au/);
result OK.
3. Had a look at regex-urlfilter.txt; the relevant entries are as
follows:

--- regex-urlfilter.txt -
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
--
4. I think you are correct in that fetch does not actually fetch
anything (see the readseg check after the logs below). Following are the
relevant sections from hadoop.log: first the log from running 'crawl', then
the log from 'inject, generate, fetch'. The rest of the log up to the fetch
step is pretty much identical. One thing I did notice is that the QueueFeeder
reports 10 records for 'crawl' but only 1 record for 'fetch'.

- hadoop.log for 'crawl' ---

2011-07-22 10:02:27,226 INFO  crawl.Generator - Generator: finished at
2011-07-22 10:02:27, elapsed: 00:00:03
2011-07-22 10:02:27,227 WARN  fetcher.Fetcher - Fetcher: Your
'http.agent.name' value should be listed first in 'http.robots.agents'
property.
2011-07-22 10:02:27,228 INFO  fetcher.Fetcher - Fetcher: starting at
2011-07-22 10:02:27
2011-07-22 10:02:27,228 INFO  fetcher.Fetcher - Fetcher:
segment: /home/llist/nutchData/crawl/segments/20110722100225
2011-07-22 10:02:27,910 INFO  fetcher.Fetcher - Fetcher: threads: 10
2011-07-22 10:02:27,918 INFO  fetcher.Fetcher - QueueFeeder finished:
total 10 records + hit by time limit :0
2011-07-22 10:02:27,926 INFO  fetcher.Fetcher - fetching
http://www.seek.com.au/sales-jobs
2011-07-22 10:02:27,940 INFO  http.Http - http.proxy.host = null
2011-07-22 10:02:27,940 INFO  http.Http - http.proxy.port = 8080
2011-07-22 10:02:27,940 INFO  http.Http - http.timeout = 1
2011-07-22 10:02:27,940 INFO  http.Http - http.content.limit = 65536
2011-07-22 10:02:27,940 INFO  http.Http - http.agent = listers
spider/Nutch-1.3
2011-07-22 10:02:27,940 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2011-07-22 10:02:28,929 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=9, fetchQueues.totalSize=9
2011-07-22 10:02:29,929 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=9, fetchQueues.totalSize=9
2011-07-22 10:02:30,930 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:31,930 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:32,931 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:33,931 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:34,932 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:35,091 INFO  fetcher.Fetcher - fetching
http://www.seek.com.au/mining-resources-energy-jobs/
2011-07-22 10:02:35,933 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:36,933 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:37,933 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:38,934 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:39,934 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:40,363 INFO  fetcher.Fetcher - fetching
http://www.seek.com.au/marketing-communications-jobs/
2011-07-22 10:02:40,934 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=7


etc.

-

--- hadoop.log for 'fetch'
---
2011-07-22 10:14:37,645 INFO  crawl.Generator - Generator: finished at
2011-07-22 10:14:37, elapsed: 00:00:03
2011-07-22 10:16:46,088 WARN  fetcher.Fetcher - Fetcher: Your
'http.agent.name' value should be listed first in 'http.robots.agents'
property.
2011-07-22 10:16:46,089 INFO  fetcher.Fetcher - Fetcher: starting at
2011-07-22 10:16:46
2011-07-22 10:16:46,089 INFO  fetcher.Fetcher - Fetcher:
segment: /home/llist/nutchData/crawl/segments/20110722101436
2011-07-22 10:16:46,720 INFO  fetcher.Fetcher - Fetcher: threads: 10
2011-07-22 10:16:46,741 INFO  plugin.PluginRepository - Plugins: looking
in: /usr/share/nutch/runtime/local/plugins
2011-07-22 10:16:46,746 INFO  fetcher.Fetcher - QueueFeeder finished:
total 1 records + hit by time limit :0
2011-07-22 10:16:46,815 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]

---
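
If I am reading the readseg usage correctly, listing the segment should
confirm this directly (I have not run it yet, so it is only the check I
intend to do next):

/usr/share/nutch/runtime/local/bin/nutch readseg -list /home/llist/nutchData/crawl/segments/20110722101436

readseg -list should report the FETCHED and PARSED counts for the segment,
which would show whether anything was actually retrieved.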

Cheers,

Leo


On Fri, 2011-07-22 at 09:51 +1000, Leo Subscriptions wrote:

 Hi Lewis,
 
 Will try your suggestion shortly, 

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Leo Subscriptions
Hi Sebastian,

I think the problem is with the fetch not returning any results. I
checked your suggestion, but it did not work.

Cheers,

Leo

On Thu, 2011-07-21 at 22:16 +0200, Sebastian Nagel wrote:

 [...]




Re: Configuration issue: Custom parser not being recognised.

2011-07-21 Thread amrutbudi...@gmail.com
Found the issue! plugin.xml defined an extension id which didn't match the id
inside the mimeType="application/xhtml+xml" tag of parse-plugins.xml.

i.e.: the bold-highlighted parts below should match.
plugin.xml:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="food"
   name="Food Parser."
   version="1.0.0"
   provider-name="amrut">

   <runtime>
      <library name="food.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="com.amrut.parser.TDRParser"
      name="TDR Parser"
      point="org.apache.nutch.parse.Parser">

*
      <implementation id="com.amrut.parser.TDRParser"
         class="com.amrut.parser.TDRParser">
         <parameter name="contentType" value="application/xhtml+xml"/>
      </implementation>
*
   </extension>
</plugin>

parse-plugins.xml:

<?xml version="1.0" encoding="UTF-8"?>
<parse-plugins>

<mimeType name="application/xhtml+xml">
*   <plugin id="food" />*
</mimeType>


<aliases>
*   <alias name="food"
        extension-id="com.amrut.parser.TDRParser" />*
    <alias name="parse-tika"
        extension-id="org.apache.nutch.parse.tika.TikaParser" />
    <alias name="parse-ext" extension-id="ExtParser" />
    <alias name="parse-html"
        extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-js" extension-id="JSParser" />
    <alias name="feed"
        extension-id="org.apache.nutch.parse.feed.FeedParser" />
    <alias name="parse-swf"
        extension-id="org.apache.nutch.parse.swf.SWFParser" />
    <alias name="parse-zip"
        extension-id="org.apache.nutch.parse.zip.ZipParser" />
</aliases>
</parse-plugins>

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Configuration-issue-Custom-parser-not-being-recognised-tp3179819p3190290.html
Sent from the Nutch - User mailing list archive at Nabble.com.