Hi Lewis,

Thanks for your support. The URL I attempted to parse is
www.ab-advisory.com, a simple website running on Drupal 7 with only HTML
pages. In the past I also tried to crawl www.tripadvisor.com and
www.nytime.com, and I ended up with the same result.

kr, Arcondo


On Wed, Jan 9, 2013 at 12:22 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Arcondo,
>
> On Mon, Jan 7, 2013 at 10:12 PM, Arcondo Dasilva
> <[email protected]> wrote:
>
> > My question: why can't I use Tika to parse HTML instead of Neko? Is it
> > possible to get rid of Neko, or is it mandatory?
> >
>
> I would urge you to override the parsing logic in parse-plugins.xml [0]
> (by default, Nutch uses Tika to guess the MIME type before assigning the
> correct plugin).
> You can do this as follows:
>
> <mimeType name="text/html">
>   <plugin id="parse-tika" />
> </mimeType>
> <mimeType name="application/xhtml+xml">
>   <plugin id="parse-tika" />
> </mimeType>
>
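> Note that this mapping only takes effect if the parse-tika plugin is
> actually enabled. A minimal sketch of the corresponding override in
> conf/nutch-site.xml, assuming your plugin.includes resembles the stock
> value (the value shown is illustrative, not copied from your install; keep
> whatever other plugins you already list):
>
> ```xml
> <property>
>   <name>plugin.includes</name>
>   <!-- assumed example value; only the parse-(html|tika) part matters here -->
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
> </property>
> ```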
>
> Please note that you will have to rebuild Nutch from source once this is
> done.
>
>
> > The other weird thing with Neko is that when I dig into
> > nutch21/src/plugins/lib-nekohtml, there are only build, ivy and plugin.xml,
> > with no src folder containing Java classes, whereas the other plugins
> > have one. Is it important?
>
>
> Yes, it is important, but it is not the root of the problem.
>
> > How is it possible to build them if they aren't
> > present?
> >
>
> Because the legacy HTML parsing logic resides in parse-html, which in turn
> declares lib-nekohtml as a requirement; please see [1].
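>
> For illustration, the dependency is declared in the requires section of
> that plugin.xml. The fragment below is a sketch written from memory, so
> treat its exact contents as an assumption and check [1] for the real file:
>
> ```xml
> <requires>
>   <!-- assumed contents; verify against the file linked at [1] -->
>   <import plugin="nutch-extensionpoints"/>
>   <import plugin="lib-nekohtml"/>
> </requires>
> ```
>
> As far as I know, lib-nekohtml itself only packages the third-party
> NekoHTML jar, which would explain why it has no src folder of its own.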
>
> I hope this works around the problem; however, there certainly seems to be
> an underlying issue here. Can you send the URL you are attempting to parse?
>
> Thank you
>
> Lewis
>
> [0] http://svn.apache.org/repos/asf/nutch/trunk/conf/parse-plugins.xml
> [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/parse-html/plugin.xml
>
