Thanks, Feng! Yes, I added that and it started working without any issue.
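For anyone who finds this thread later, the two changes discussed below amount to the following config fragments. The `contentType` parameter line is the one quoted in the thread; the `parse-plugins.xml` mapping is a sketch based on the structure of Nutch's stock file, so treat its exact element layout as an assumption rather than verbatim config:

```xml
<!-- plugins/parse-html/plugin.xml: add application/xml to the contentType
     list so parse-html also claims responses served with that mimeType -->
<parameter name="contentType"
           value="text/html|application/xhtml+xml|application/xml"/>

<!-- conf/parse-plugins.xml: alternatively, map the mimeType to parse-html
     (element structure assumed from Nutch's default parse-plugins.xml) -->
<mimeType name="application/xml">
    <plugin id="parse-html" />
</mimeType>
```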
On Sunday, August 4, 2013, feng lu <[email protected]> wrote:

> Hi Laxmi
>
> I see that the mimeType of http://www.####.com/ is application/xml, so the
> parse-html plugin thinks it cannot parse the content and parse-tika is used
> instead -- but the content is actually HTML. So I don't think this is a bug.
> You can also add a mimeType property for it in conf/parse-plugins.xml.
>
> On Sat, Aug 3, 2013 at 3:23 AM, A Laxmi <[email protected]> wrote:
>
>> All I did was add "application/xml" in plugins/parse-html/plugin.xml.
>>
>> On Fri, Aug 2, 2013 at 3:22 PM, A Laxmi <[email protected]> wrote:
>>
>>> I could solve my issue. I am not sure whether this was fixed in 1.7, but
>>> with Nutch 1.6 all I did was add "application/xml" in
>>> plugins/parse-html/plugin.xml:
>>>
>>>   <parameter name="contentType"
>>>              value="text/html|application/xhtml+xml|application/xml"/>
>>>
>>> That fixed my issue. Hopefully it helps someone with the same problem.
>>>
>>> On Fri, Aug 2, 2013 at 10:48 AM, A Laxmi <[email protected]> wrote:
>>>
>>>> With Nutch 1.6 I could not crawl one particular site; it gives me the
>>>> following error in the parsing stage. I tried to google this issue, and
>>>> I tried changing parse.timeout to 3600 and even to -1, but it doesn't
>>>> seem to make any difference. Please help.
>>>>
>>>> Error message: Error parsing http://www.####.com/ failed(2,0): XML parse error
>>>>
>>>> From the logs:
>>>>
>>>> 2013-08-02 10:12:03,446 INFO fetcher.Fetcher - Using queue mode : byHost
>>>> 2013-08-02 10:12:03,465 INFO http.Http - http.proxy.host = null
>>>> 2013-08-02 10:12:03,466 INFO http.Http - http.proxy.port = 8080
>>>> 2013-08-02 10:12:03,466 INFO http.Http - http.timeout = 240000
>>>> 2013-08-02 10:12:03,466 INFO http.Http - http.content.limit = -1
>>>> 2013-08-02 10:12:03,466 INFO http.Http - http.agent = Nutch Spider/Nutch-1.6
>>>> 2013-08-02 10:12:03,466 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>>>> 2013-08-02 10:12:03,466 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>>>> 2013-08-02 10:12:03,472 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>>>> 2013-08-02 10:12:03,473 INFO fetcher.Fetcher - Using queue mode : byHost
>>>> 2013-08-02 10:12:03,476 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>>>> 2013-08-02 10:12:03,489 INFO fetcher.Fetcher - Using queue mode : byHost
>>>> 2013-08-02 10:12:03,489 INFO fetcher.Fetcher - Using queue mode : byHost
>>>> 2013-08-02 10:12:03,610 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
>>>> 2013-08-02 10:12:03,612 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
>>>> 2013-08-02 10:12:03,619 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>>>> 2013-08-02 10:12:03,611 INFO fetcher.Fetcher - Using queue mode : byHost
>>>> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Using queue mode : byHost
>>>> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>>>> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
>>>> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
>>>> 2013-08-02 10:12:03,638 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>>>> 2013-08-02 10:12:04,598 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
>>>> 2013-08-02 10:12:04,631 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>>>> 2013-08-02 10:12:04,635 INFO fetcher.Fetcher - -activeThreads=0
>>>> 2013-08-02 10:12:09,293 INFO fetcher.Fetcher - Fetcher: finished at 2013-08-02 10:12:09, elapsed: 00:00:07
>>>> 2013-08-02 10:12:09,296 INFO parse.ParseSegment - ParseSegment: starting at 2013-08-02 10:12:09
>>>> 2013-08-02 10:12:09,296 INFO parse.ParseSegment - ParseSegment: segment: crawl-0802-test-3/segments/20130802101154
>>>> 2013-08-02 10:12:10,335 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
>>>> 2013-08-02 10:12:10,362 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Tik
>>>> 2013-08-02 10:12:11,168 DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.xml.DcXMLParser for mime-type application/xml
>>>> 2013-08-02 10:12:11,232 ERROR tika.TikaParser - Error parsing http://www.####.com/
>>>> org.apache.tika.exception.TikaException: XML parse error
>>>>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
>>>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
>>>>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
>>>>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>>>>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 144; The entity name must immediately follow the '&' in the entity reference.
>>>>     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
>>>>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>>>>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>>>>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>>>>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>>>>     at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>>>>     at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
>>>>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
>>>>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
>>>>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
>>>>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>>>>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>>>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>>>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>>>>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>>>>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>>>>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>>>>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
>>>>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
>>>>     ... 8 more
>>>> 2013-08-02 10:12:11,246 WARN parse.ParseSegment - Error parsing: http://www.####.com/: failed(2,0): XML parse error
>>>> 2013-08-02 10:12:11,256 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
>>>> 2013-08-02 10:12:11,295 INFO parse.ParseSegment - Parsed (50ms): http://www.####.com/
>>>> 2013-08-02 10:12:12,701 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
>>>> 2013-08-02 10:12:16,705 INFO parse.ParseSegment - ParseSegment: finished at 2013-08-02 10:12:16, elapsed: 00:00:07
>>>> 2013-08-02 10:12:16,709 INFO crawl.CrawlDb - CrawlDb update: starting at 2013-08-02 10:12:16
>>>> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: db: crawl-0802-test-3/crawldb
>>>> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: segments: [crawl-0802-test-3/segments/20130802101154]
>>>> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
>>>> 2013-08-02 10:12:16,712 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
>>>> 2013-08-02 10:12:16,712 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
>>>> 2013-08-02 10:12:16,713 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
>>>> 2013-08-02 10:12:16,713 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>>>> 2013-08-02 10:12:17,579 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0007.xml, instantiating a new object cache
>>>> 2013-08-02 10:12:17,594 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
>
> --
> Don't Grow Old, Grow Up... :-)
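The root cause visible in the stack trace is that Tika's DcXMLParser hands the page to a strict XML parser (Xerces), and real-world HTML is rarely well-formed XML -- for example, an unescaped '&' in a URL triggers exactly the "entity name must immediately follow the '&'" fatal error. A minimal illustration using Python's strict SAX parser as a stand-in for Xerces (the exact error message differs by parser, but the failure class is the same):

```python
import xml.sax

def is_well_formed_xml(data: bytes) -> bool:
    """Return True if `data` parses as well-formed XML, as a strict
    XML parser (like the one behind Tika's DcXMLParser) requires."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

# Typical HTML with an unescaped '&' in a query string is NOT well-formed
# XML -- the same class of failure as the Xerces error in the logs above.
html = b'<html><body><a href="page?a=1&b=2">link</a></body></html>'

print(is_well_formed_xml(html))                          # False
print(is_well_formed_xml(html.replace(b"&", b"&amp;")))  # True
```

This is why routing the content to parse-html (a lenient HTML parser) fixes the problem even though the server advertises application/xml.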

