I was able to solve my issue. I am not sure whether this was fixed in 1.7 or not, but
with Nutch 1.6, all I did was add "application/xml" to the contentType parameter in
plugins/parse-html/plugin.xml, i.e. <parameter name="contentType"
value="text/html|application/xhtml+xml|application/xml"/>. That fixed my
issue. Hopefully it helps someone with the same problem.
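For anyone applying the same change, the relevant fragment of plugins/parse-html/plugin.xml would look roughly like the sketch below. The surrounding extension/implementation elements are reproduced from memory of a Nutch 1.6 checkout and may differ slightly in yours; the only edit is appending application/xml to the pipe-separated contentType value:

```xml
<!-- plugins/parse-html/plugin.xml (fragment, attribute values may vary by version) -->
<extension id="org.apache.nutch.parse.html"
           name="HtmlParse"
           point="org.apache.nutch.parse.Parser">
  <implementation id="org.apache.nutch.parse.html.HtmlParser"
                  class="org.apache.nutch.parse.html.HtmlParser">
    <!-- application/xml appended so parse-html claims this content type -->
    <parameter name="contentType"
               value="text/html|application/xhtml+xml|application/xml"/>
    <parameter name="pathSuffix" value=""/>
  </implementation>
</extension>
```

Note that the ParserFactory warning in the logs below points at an alternative route: explicitly mapping the content type in conf/parse-plugins.xml, for example with an entry along these lines (plugin id shown is an assumption; pick whichever parser you want to handle it):

```xml
<!-- conf/parse-plugins.xml (fragment) -->
<mimeType name="application/xml">
  <plugin id="parse-tika"/>
</mimeType>
```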


On Fri, Aug 2, 2013 at 10:48 AM, A Laxmi <[email protected]> wrote:

> With Nutch 1.6, I could not crawl one particular site; it gives me
> the following error message during the parsing stage. I tried to google this
> issue, I tried changing parse.timeout to 3600, and I even tried setting it
> to -1, but none of that seems to make any difference.
> Please help.
>
>
> Error message: Error parsing http://www.####.com/ failed(2,0): XML parse
> error
>
> From the logs:
>
> 2013-08-02 10:12:03,446 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,465 INFO  http.Http - http.proxy.host = null
> 2013-08-02 10:12:03,466 INFO  http.Http - http.proxy.port = 8080
> 2013-08-02 10:12:03,466 INFO  http.Http - http.timeout = 240000
> 2013-08-02 10:12:03,466 INFO  http.Http - http.content.limit = -1
> 2013-08-02 10:12:03,466 INFO  http.Http - http.agent = Nutch
> Spider/Nutch-1.6
> 2013-08-02 10:12:03,466 INFO  http.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2013-08-02 10:12:03,466 INFO  http.Http - http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2013-08-02 10:12:03,472 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2013-08-02 10:12:03,473 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,476 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2013-08-02 10:12:03,489 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,489 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,610 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=3
> 2013-08-02 10:12:03,612 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=2
> 2013-08-02 10:12:03,619 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2013-08-02 10:12:03,611 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Fetcher: throughput
> threshold: -1
> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Fetcher: throughput
> threshold retries: 5
> 2013-08-02 10:12:03,638 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2013-08-02 10:12:04,598 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=0
> 2013-08-02 10:12:04,631 INFO  fetcher.Fetcher - -activeThreads=0,
> spinWaiting=0, fetchQueues.totalSize=0
> 2013-08-02 10:12:04,635 INFO  fetcher.Fetcher - -activeThreads=0
> 2013-08-02 10:12:09,293 INFO  fetcher.Fetcher - Fetcher: finished at
> 2013-08-02 10:12:09, elapsed: 00:00:07
> 2013-08-02 10:12:09,296 INFO  parse.ParseSegment - ParseSegment: starting
> at 2013-08-02 10:12:09
> 2013-08-02 10:12:09,296 INFO  parse.ParseSegment - ParseSegment: segment:
> crawl-0802-test-3/segments/20130802101154
> 2013-08-02 10:12:10,335 DEBUG util.ObjectCache - No object cache found for
> conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml,
> mapred-site.xml,
> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml,
> instantiating a new object cache
> 2013-08-02 10:12:10,362 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> application/xml, but they are not mapped to it  in the parse-plugins.xml
> file
> 2013-08-02 10:12:11,166 DEBUG parse.ParseUtil - Parsing [
> http://www.#####.com/] with
> [org.apache.nutch.parse.tika.TikaParser@4b3788e1]
> 2013-08-02 10:12:11,168 DEBUG tika.TikaParser - Using Tika parser
> org.apache.tika.parser.xml.DcXMLParser for mime-type application/xml
> 2013-08-02 10:12:11,232 ERROR tika.TikaParser - Error parsing
> http://www.####.com/
> org.apache.tika.exception.TikaException: XML parse error
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber:
> 144; The entity name must immediately follow the '&' in the entity
> reference.
>     at
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown
> Source)
>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown
> Source)
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>     at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>     at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
>     at
> org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown
> Source)
>     at
> org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown
> Source)
>     at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
> Source)
>     at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
> Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown
> Source)
>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
>     ... 8 more
> 2013-08-02 10:12:11,246 WARN  parse.ParseSegment - Error parsing:
> http://www.####.com/:
> failed(2,0): XML parse error
> 2013-08-02 10:12:11,256 INFO  crawl.SignatureFactory - Using Signature
> impl: org.apache.nutch.crawl.MD5Signature
> 2013-08-02 10:12:11,295 INFO  parse.ParseSegment - Parsed (50ms):
> http://www.####.com/
> 2013-08-02 10:12:12,701 DEBUG util.ObjectCache - No object cache found for
> conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml,
> mapred-site.xml,
> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml,
> instantiating a new object cache
> 2013-08-02 10:12:16,705 INFO  parse.ParseSegment - ParseSegment: finished
> at 2013-08-02 10:12:16, elapsed: 00:00:07
> 2013-08-02 10:12:16,709 INFO  crawl.CrawlDb - CrawlDb update: starting at
> 2013-08-02 10:12:16
> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: db:
> crawl-0802-test-3/crawldb
> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: segments:
> [crawl-0802-test-3/segments/20130802101154]
> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: additions
> allowed: true
> 2013-08-02 10:12:16,712 INFO  crawl.CrawlDb - CrawlDb update: URL
> normalizing: true
> 2013-08-02 10:12:16,712 INFO  crawl.CrawlDb - CrawlDb update: URL
> filtering: true
> 2013-08-02 10:12:16,713 INFO  crawl.CrawlDb - CrawlDb update: 404 purging:
> false
> 2013-08-02 10:12:16,713 INFO  crawl.CrawlDb - CrawlDb update: Merging
> segment data into db.
> 2013-08-02 10:12:17,579 DEBUG util.ObjectCache - No object cache found for
> conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml,
> mapred-site.xml,
> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0007.xml,
> instantiating a new object cache
> 2013-08-02 10:12:17,594 INFO  regex.RegexURLNormalizer - can't find rules
> for scope 'crawldb', using default
>
>