Seems correct indeed. Please check the logs; they may tell you more.
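For reference, the mapping check discussed in this thread can be sketched as a small script. This is only a rough illustration, not part of Nutch: the helper names (`plugins_for_mime`, `is_included`, `check_mapping`) are made up here, the `<parse-plugins>` wrapper element is added so the quoted fragment parses as a complete document, and it assumes `plugin.includes` is matched against the whole plugin id. The XML and the `plugin.includes` value are copied from the messages quoted below.

```python
import re
import xml.etree.ElementTree as ET

# Fragment quoted in the thread, wrapped in a root element so that it is
# well-formed XML. In a real setup this would be read from the Nutch conf
# directory (the exact path is an assumption and is not used here).
PARSE_PLUGINS_XML = """
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html"/>
  </mimeType>
</parse-plugins>
"""

# plugin.includes value quoted in the original question.
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|parse-(html|xhtml|tika)|"
    "index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def plugins_for_mime(xml_text, mime_type):
    """Return the plugin ids mapped to mime_type, in document order."""
    root = ET.fromstring(xml_text)
    return [
        plugin.get("id")
        for mime in root.iter("mimeType")
        if mime.get("name") == mime_type
        for plugin in mime.iter("plugin")
    ]


def is_included(plugin_id, includes_regex):
    """Check a plugin id against the plugin.includes expression.

    Assumption: the expression has to match the entire plugin id.
    """
    return re.fullmatch(includes_regex, plugin_id) is not None


def check_mapping(xml_text, mime_type, includes_regex):
    """Map each plugin id configured for mime_type to whether it is enabled."""
    return {
        plugin_id: is_included(plugin_id, includes_regex)
        for plugin_id in plugins_for_mime(xml_text, mime_type)
    }


if __name__ == "__main__":
    result = check_mapping(PARSE_PLUGINS_XML, "text/html", PLUGIN_INCLUDES)
    for plugin_id, enabled in result.items():
        state = "enabled" if enabled else "NOT enabled in plugin.includes"
        print(f"text/html -> {plugin_id}: {state}")
```

A check like this only confirms that the configuration is consistent; it says nothing about whether the plugin loads or why parsing fails, which is why the logs are the next place to look.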
-----Original message-----
> From:Sudip Datta <[email protected]>
> Sent: Thu 12-Jul-2012 21:51
> To: Markus Jelsma <[email protected]>
> Cc: [email protected]
> Subject: Re: Error parsing html
>
> Hi Markus,
>
> Yes, they seem to be rightly mapped:
>
> parse-plugins.xml reads:
>
> <mimeType name="text/html">
>   <plugin id="parse-html"/>
> </mimeType>
>
> and tika's plugin.xml reads:
>
> <extension point="org.apache.nutch.parse.Parser"
>   id="org.apache.nutch.parse.tika" name="TikaParser">
>   <implementation id="org.apache.nutch.parse.tika.TikaParser"
>     class="org.apache.nutch.parse.tika.TikaParser">
>     <parameter name="contentType" value="*"/>
>   </implementation>
> </extension>
>
> This one
> http://stackoverflow.com/questions/8784656/nutch-unable-to-successfully-parse-content
> seems to have a similar problem but doesn't mention where in code he has an
> error.
>
> Thanks,
>
> --Sudip.
>
> On Fri, Jul 13, 2012 at 12:19 AM, Markus Jelsma
> <[email protected]> wrote:
>
> > strange, check if text/html is mapped to parse-tika or parse-html in
> > parse-plugins.xml. You may also want to check tika's plugin.xml, it must be
> > mapped to * or a regex of content types.
> >
> >
> > -----Original message-----
> > > From:Sudip Datta <[email protected]>
> > > Sent: Thu 12-Jul-2012 20:36
> > > To: [email protected]
> > > Subject: Re: Error parsing html
> > >
> > > Nopes. That didn't help. In fact, I had added that entry minutes before
> > > sending a mail to the group and after couple of hours of frustration in
> > > trying to get the parser to work.
> > >
> > > On Thu, Jul 12, 2012 at 11:40 PM, Lewis John Mcgibbney
> > > <[email protected]> wrote:
> > >
> > > > For starters there is no parse-xhtml plugin unless of course this is a
> > > > custom one you've written yourself.
> > > >
> > > > Unless this is the case then remove this from the plugin.includes
> > > > property and re-spin it
> > > >
> > > > hth
> > > >
> > > > On Thu, Jul 12, 2012 at 7:00 PM, Sudip Datta <[email protected]> wrote:
> > > > > Hi,
> > > > >
> > > > > I am using Nutch 1.4 and Solr. My crawls were working perfectly fine
> > > > > before I made some changes to the SolrWriter (which I believe has
> > > > > nothing to do with my problem). Since then, I am getting:
> > > > >
> > > > > WARN : org.apache.nutch.parse.ParseUtil - Unable to successfully parse content <webpage> of type text/html
> > > > > INFO : org.apache.nutch.parse.ParseSegment - Parsing: <webpage>
> > > > > WARN : org.apache.nutch.parse.ParseSegment - Error parsing: <webpage>: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
> > > > >
> > > > > for any <webpage> that I try to crawl!
> > > > >
> > > > > My nutch-site.xml file reads:
> > > > >
> > > > > <value>protocol-httpclient|urlfilter-regex|parse-(html|xhtml|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > > >
> > > > > What could be going wrong?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > --Sudip.
> > > > >
> > > >
> > > > --
> > > > Lewis

