Can you provide the stack trace from the logs?
Also, please share the value of the "plugin.includes" property from both
nutch-site.xml and nutch-default.xml (ideally the latter should not be
modified and changes should go into the former... but sometimes people
accidentally edit it).

Thanks,
Tejas Patil


On Sun, Feb 3, 2013 at 9:11 PM, Meiping Wang(Amelia) <
[email protected]> wrote:

>  Hey:
>
> solrUrl is not set, indexing will be skipped...
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 2
> solrUrl=null
> topN = 5
> Injector: starting at 2013-02-04 13:05:18
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 1
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2013-02-04 13:05:33, elapsed: 00:00:14
> Generator: starting at 2013-02-04 13:05:33
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20130204130541
> Generator: finished at 2013-02-04 13:05:48, elapsed: 00:00:15
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2013-02-04 13:05:48
> Fetcher: segment: crawl/segments/20130204130541
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://nutch.apache.org/ (queue crawl delay=5000ms)
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2013-02-04 13:05:58, elapsed: 00:00:10
> ParseSegment: starting at 2013-02-04 13:05:58
> ParseSegment: segment: crawl/segments/20130204130541
> *Error parsing: http://nutch.apache.org/: failed(2,200):
> org.apache.nutch.parse.ParseException: Unable to successfully parse
> content*
> Parsed (15ms): http://nutch.apache.org/
> ParseSegment: finished at 2013-02-04 13:06:05, elapsed: 00:00:07
> CrawlDb update: starting at 2013-02-04 13:06:05
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20130204130541]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2013-02-04 13:06:18, elapsed: 00:00:13
> Generator: starting at 2013-02-04 13:06:18
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting at 2013-02-04 13:06:25
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: internal links will be ignored.
> LinkDb: adding segment:
> file:/E:/SearchEngine/workspace/Nutch2.1/crawl/segments/20130204130541
> LinkDb: finished at 2013-02-04 13:06:32, elapsed: 00:00:07
> crawl finished: crawl
>
> After running Nutch 2.1 in Eclipse (on Windows), the problems shown in
> red above appeared. Can anybody point me in the right direction?
>
> Best Regards,
> Amelia (Meiping Wang)
>
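
A side note on the "solrUrl is not set, indexing will be skipped" line at the top of the log: the crawl was evidently started without a Solr URL. If indexing into Solr is wanted, the crawl command takes a -solr argument. A hedged sketch, assuming a default local Solr instance (the URL, directories, and depth/topN values below are placeholders matching the log, not a known-good configuration):

```shell
# Hypothetical invocation; adjust the seed dir, crawl dir, and Solr URL
# to your setup. Without -solr, Nutch crawls but skips the indexing step.
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 \
  -solr http://localhost:8983/solr/
```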
