Thanks, Feng! Yes, I was able to add that and it started working without any issue.
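For anyone hitting the same error: the change Feng suggests maps the troublesome MIME type onto parse-html in conf/parse-plugins.xml. A sketch of what that entry might look like, following the mimeType/plugin element style used in Nutch's stock parse-plugins.xml (surrounding entries vary by version):

```xml
<!-- conf/parse-plugins.xml: route application/xml responses that are
     actually HTML to parse-html instead of parse-tika. -->
<mimeType name="application/xml">
    <plugin id="parse-html" />
</mimeType>
```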

On Sunday, August 4, 2013, feng lu <[email protected]> wrote:
> Hi Laxmi
>
> I see that http://www.####.com/ is served with mimeType application/xml,
> so the parse-html plugin thinks it cannot parse that content and
> parse-tika is used instead. But the content is actually HTML, so I don't
> think this is a bug in Nutch. You can also add a mimeType property in
> conf/parse-plugins.xml.
>
>
>
> On Sat, Aug 3, 2013 at 3:23 AM, A Laxmi <[email protected]> wrote:
>
>> All I did was add "application/xml" in plugins/parse-html/plugin.xml.
>>
>>
>> On Fri, Aug 2, 2013 at 3:22 PM, A Laxmi <[email protected]> wrote:
>>
>> > I could solve my issue. I am not sure if this was fixed in 1.7 or not,
>> > but with Nutch 1.6, all I did was add "application/xml" in
>> > plugins/parse-html/plugin.xml -> <parameter name="contentType"
>> > value="text/html|application/xhtml+xml|application/xml" />. That fixed
>> > my issue. Hopefully it helps someone with the same problem.
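The edit above amounts to widening parse-html's contentType filter so the plugin also claims application/xml. A sketch of the relevant fragment, assuming the stock Nutch 1.6 plugins/parse-html/plugin.xml layout (the surrounding elements are reproduced from memory and may differ slightly in your copy):

```xml
<!-- plugins/parse-html/plugin.xml: add application/xml to the list of
     content types the HTML parser accepts. -->
<implementation id="org.apache.nutch.parse.html.HtmlParser"
                class="org.apache.nutch.parse.html.HtmlParser">
    <parameter name="contentType"
               value="text/html|application/xhtml+xml|application/xml"/>
    <parameter name="pathSuffix" value=""/>
</implementation>
```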
>> >
>> >
>> > On Fri, Aug 2, 2013 at 10:48 AM, A Laxmi <[email protected]> wrote:
>> >
>> >> With Nutch 1.6, I could not crawl one particular site; it gives me the
>> >> following error message in the parsing stage. I tried to google this
>> >> issue, I tried changing parse.timeout to 3600, and I even tried
>> >> changing it to -1, but it doesn't seem to make any difference.
>> >> Please help.
>> >>
>> >>
>> >> Error message: Error parsing http://www.####.com/ failed(2,0): XML parse error
>> >>
>> >> From the logs:
>> >>
>> >> 2013-08-02 10:12:03,446 INFO  fetcher.Fetcher - Using queue mode : byHost
>> >> 2013-08-02 10:12:03,465 INFO  http.Http - http.proxy.host = null
>> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.proxy.port = 8080
>> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.timeout = 240000
>> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.content.limit = -1
>> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.agent = Nutch Spider/Nutch-1.6
>> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>> >> 2013-08-02 10:12:03,472 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> >> 2013-08-02 10:12:03,473 INFO  fetcher.Fetcher - Using queue mode : byHost
>> >> 2013-08-02 10:12:03,476 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> >> 2013-08-02 10:12:03,489 INFO  fetcher.Fetcher - Using queue mode : byHost
>> >> 2013-08-02 10:12:03,489 INFO  fetcher.Fetcher - Using queue mode : byHost
>> >> 2013-08-02 10:12:03,610 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
>> >> 2013-08-02 10:12:03,612 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
>> >> 2013-08-02 10:12:03,619 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> >> 2013-08-02 10:12:03,611 INFO  fetcher.Fetcher - Using queue mode : byHost
>> >> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Using queue mode : byHost
>> >> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> >> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Fetcher: throughput threshold: -1
>> >> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Fetcher: throughput threshold retries: 5
>> >> 2013-08-02 10:12:03,638 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> >> 2013-08-02 10:12:04,598 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
>> >> 2013-08-02 10:12:04,631 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> >> 2013-08-02 10:12:04,635 INFO  fetcher.Fetcher - -activeThreads=0
>> >> 2013-08-02 10:12:09,293 INFO  fetcher.Fetcher - Fetcher: finished at 2013-08-02 10:12:09, elapsed: 00:00:07
>> >> 2013-08-02 10:12:09,296 INFO  parse.ParseSegment - ParseSegment: starting at 2013-08-02 10:12:09
>> >> 2013-08-02 10:12:09,296 INFO  parse.ParseSegment - ParseSegment: segment: crawl-0802-test-3/segments/20130802101154
>> >> 2013-08-02 10:12:10,335 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
>> >> 2013-08-02 10:12:10,362 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser]
>> >> 2013-08-02 10:12:11,168 DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.xml.DcXMLParser for mime-type application/xml
>> >> 2013-08-02 10:12:11,232 ERROR tika.TikaParser - Error parsing http://www.####.com/
>> >> org.apache.tika.exception.TikaException: XML parse error
>> >>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
>> >>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
>> >>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
>> >>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>> >>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>> >>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>> >>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>> >>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>> >>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> >> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 144; The entity name must immediately follow the '&' in the entity reference.
>> >>     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
>> >>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>> >>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>> >>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>> >>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>> >>     at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>> >>     at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
>> >>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
>> >>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
>> >>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
>> >>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>> >>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>> >>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>> >>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>> >>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>> >>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>> >>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>> >>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
>> >>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
>> >>     ... 8 more
>> >> 2013-08-02 10:12:11,246 WARN  parse.ParseSegment - Error parsing: http://www.####.com/: failed(2,0): XML parse error
>> >> 2013-08-02 10:12:11,256 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
>> >> 2013-08-02 10:12:11,295 INFO  parse.ParseSegment - Parsed (50ms): http://www.####.com/
>> >> 2013-08-02 10:12:12,701 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
>> >> 2013-08-02 10:12:16,705 INFO  parse.ParseSegment - ParseSegment: finished at 2013-08-02 10:12:16, elapsed: 00:00:07
>> >> 2013-08-02 10:12:16,709 INFO  crawl.CrawlDb - CrawlDb update: starting at 2013-08-02 10:12:16
>> >> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: db: crawl-0802-test-3/crawldb
>> >> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: segments: [crawl-0802-test-3/segments/20130802101154]
>> >> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: additions allowed: true
>> >> 2013-08-02 10:12:16,712 INFO  crawl.CrawlDb - CrawlDb update: URL normalizing: true
>> >> 2013-08-02 10:12:16,712 INFO  crawl.CrawlDb - CrawlDb update: URL filtering: true
>> >> 2013-08-02 10:12:16,713 INFO  crawl.CrawlDb - CrawlDb update: 404 purging: false
>> >> 2013-08-02 10:12:16,713 INFO  crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>> >> 2013-08-02 10:12:17,579 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0007.xml, instantiating a new object cache
>> >> 2013-08-02 10:12:17,594 INFO  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
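The Caused-by line in the log above is the classic symptom of HTML served as application/xml: markup that browsers and the HTML parser tolerate but that is not well-formed XML. A contrived illustration of the exact error reported ("The entity name must immediately follow the '&'"):

```xml
<!-- Not well-formed XML: a raw '&' in an attribute value must be
     escaped as &amp;, even though browsers accept it in HTML. -->
<a href="search?q=nutch&hl=en">fails in an XML parser</a>

<!-- Well-formed equivalent -->
<a href="search?q=nutch&amp;hl=en">parses cleanly</a>
```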
>> >>
>> >>
>> >
>>
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>
