Re: Problem implementing my own HtmlParseFilter

Matthias Naber Thu, 23 Jun 2011 16:08:51 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
 
Hey,

First of all, I'm using Nutch 1.3 stable.

The goal is to crawl a web app and then publish the data to Solr. I use
Nutch for the crawling and parsing part, so I implemented my own
parse filter. The only thing it does is extract a certain node from
the DOM and write its contents (node.textContents()) into a new
field. This field was added to the Solr schema and to the
Nutch-Solr mapping, and everything works quite well.
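
For reference, the extraction step described above can be sketched with the plain JDK DOM API. This is only a minimal illustration, not the actual plugin code: the class name NodeTextExtractor, the lookup by an "id" attribute, and the sample markup are all assumptions made for the example. In the standard org.w3c.dom API the accessor is getTextContent(). As far as I know, a Nutch 1.x HtmlParseFilter receives the parsed page as a DocumentFragment, which is also a Node, so a helper like this could be called from inside filter().

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: finds one element and returns its text.
public class NodeTextExtractor {

    /**
     * Depth-first search for the first element whose "id" attribute equals
     * the given id. Returns its trimmed text content, or null if not found.
     */
    static String extractTextById(Node node, String id) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // getAttribute returns "" when the attribute is absent
            if (id.equals(el.getAttribute("id"))) {
                return el.getTextContent().trim();
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = extractTextById(children.item(i), id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Convenience wrapper for trying the helper on a well-formed XML string. */
    static String extractFromXml(String xml, String id) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return extractTextById(doc, id);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        // Sample markup is made up for the demonstration.
        String xhtml = "<html><body><div id=\"summary\">Hello <b>Nutch</b></div></body></html>";
        System.out.println(extractFromXml(xhtml, "summary"));
    }
}
```

Returning null when the node is missing lets the caller skip adding the field instead of failing, so a page without the node would still be indexed, just without the extra field.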

Except that some URLs are not handled properly (that is, my parse
filter is not invoked). Those URLs do not differ from the ones that
work. The XHTML is valid; each page is a simple .(x)html document.
(The magic MIME type is something like application/xhtml+xml.)
These pages are parsed by parse-html but, as I said, my parse filter
is not invoked for a subset of them. There is no exception. The
document shows up in Solr, but without my custom field from above.

I wrapped my whole code in a try{...}catch(Throwable th) in case
something weird happens inside it, but that still doesn't do the
trick. And since the filter doesn't get called, there is not much to
log. No exceptions or errors at all :(
Does a parse filter have to be registered for a certain MIME type?

Regards,
mana

On 24.06.2011 00:39, lewis john mcgibbney wrote:
> Hi Mana,
>
> I think you would be best to provide details on the following.
>
> - What the htmlparsefilter plugin does.
> - Some log data displaying how it works with some URLs but not
>   with others, so we can see the nature of the URLs it is not
>   working with and vice versa.
> - Which version of Nutch you are using.
>
> Some comments on your indexing plugin: in my opinion it is much
> easier to create fields to be indexed if we write this into our
> mapping schema and our Solr implementation. My assumption is
> that you are not using Solr for indexing, and that this is why you
> are experiencing problems getting your fields to map to the index.
> Is it convenient to try Solr? Without access to the code for your
> plugin it is extremely hard to root out the problem you are
> experiencing.
>
> On Thu, Jun 23, 2011 at 12:16 PM, Matthias Naber <
> [email protected]> wrote:
>
> Hey,
>
> I'm new to the nutch project and just started to test some things.
> So I followed this example
> http://wiki.apache.org/nutch/WritingPluginExample and implemented
> my own HtmlParseFilter.
>
> My custom MyHtmlParseFilter works fine on most of the pages - but
> isn't called at all on others. (I also implemented an
> IndexingFilter that works just fine)
>
> The goal was to add a new field to the search index. For most of
> the pages my code is called and adds a custom field to the
> resulting search index documents. For a few pages, my code is
> ignored and I don't see this field in the index documents.
>
> To sum this up: my ParseFilter doesn't get called at all for only
> a few random pages ... why!?!
>
> I guess this may be related to the MIME type of the pages being
> parsed? Does anyone have an idea what may cause this?
>
> Regards, mana
>
> # I'm using nutch v.1.3 stable

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk4Dx0wACgkQzp84az+gLK0WTwCdFPLc0H9ULE1C+Yg1ZYZffzgv
d7oAn18bT3ekHlgtp/y9KVSSMt/mUbfS
=L06R
-----END PGP SIGNATURE-----
