Re: Problem implementing my own HtmlParseFilter

Markus Jelsma Thu, 23 Jun 2011 16:20:28 -0700

Can you provide steps to reproduce with a public or sample XHTML document? 
Received HTTP headers may be interesting as well (e.g. type, length, 
redirections).


> Hey,
> 
> First of all, I'm using Nutch v.1.3 stable.
> 
> The goal was to crawl a web app and then publish the data to Solr. For
> the crawling and parsing part I use Nutch. To that end I implemented my
> own ParseFilter. The only thing it does is extract a certain node
> from the DOM and write its contents (node.textContents()) into a new
> field. This field was added to the Solr schema and to the
> Nutch-Solr mapping, and everything works quite well.
> 
> Except for some URLs that are not properly handled (that is, my
> ParseFilter is not invoked). Those URLs do not differ from the ones that
> work. The XHTML is valid -- it's a simple .(x)html document
> (the magic MIME type is something like application/xhtml+xml).
> These pages are parsed by parse-html but, as I said, my ParseFilter
> is not invoked for a subset of the pages. There is no
> exception. The document shows up in Solr, but without my
> custom field from above.
> 
> I surrounded my whole code with a try{...}catch(Throwable th) in case
> something weird happens within my code, but this still doesn't do the
> trick. And since it doesn't get called, there is not much to log. No
> exceptions or errors at all :(
> Does a ParseFilter have to be registered for a certain MIME type?
> 
> Regards,
> mana
> 
> On 24.06.2011 00:39, lewis john mcgibbney wrote:
> > Hi Mana,
> > 
> > I think it would be best to provide details on the following:
> > 
> > - What the htmlparsefilter plugin does
> > - Some log data showing how it works with some URLs but not with
> >   others, so we can see the nature of the URLs it is not working
> >   with and vice versa
> > - Which version of Nutch you are using
> > 
> > Some comments on your indexing plugin: in my opinion it is much
> > easier to create fields to be indexed if we write them into our
> > mapping schema and into our Solr implementation. My assumption is
> > that you are not using Solr for indexing, and that this is why you
> > are experiencing problems getting your fields to map to the index.
> > Is it convenient to try Solr? Without access to the code for your
> > plugin it is extremely hard to root out the problem you are
> > experiencing.
> > 
> > On Thu, Jun 23, 2011 at 12:16 PM, Matthias Naber <
> > [email protected]> wrote:
> > 
> > Hey,
> > 
> > I'm new to the nutch project and just started to test some things.
> > So I followed this example
> > http://wiki.apache.org/nutch/WritingPluginExample and implemented
> > my own HtmlParseFilter.
> > 
> > My custom MyHtmlParseFilter works fine on most of the pages - but
> > isn't called at all on others. (I also implemented an
> > IndexingFilter that works just fine)
> > 
> > The goal was to add a new field to the search index. For most of
> > the pages my code is called and adds a custom field to the resulting
> > search index documents. For a few pages, my code is ignored and
> > I don't see this field in the index documents.
> > 
> > To sum this up: my ParseFilter doesn't get called at all for only
> > a few random pages ... why!?!
> > 
> > I guess this may be related to the MIME type of the pages being
> > parsed? Does anyone have an idea what may cause this?
> > 
> > Regards, mana
> > 
> > # I'm using nutch v.1.3 stable
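
For readers following the thread: the filter described above (find one node in the parsed DOM and copy its text into a field) can be illustrated without any Nutch classes. What follows is only a minimal sketch using the JDK's own DOM API, not the poster's code. The class name NodeTextExtractor, the lookup by an "id" attribute, and the sample id "summary" are all assumptions made for illustration. The Nutch wiring (the HtmlParseFilter interface, plugin.xml, and writing the value into the parse metadata) is deliberately left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: shows only the DOM lookup step.
public class NodeTextExtractor {

    /** Depth-first search for the first element whose "id" attribute equals id. */
    public static Element findById(Node node, String id) {
        if (node instanceof Element) {
            Element element = (Element) node;
            // getAttribute returns "" when the attribute is absent.
            if (id.equals(element.getAttribute("id"))) {
                return element;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            Element hit = findById(child, id);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** Returns the trimmed text of the matching element, or null if none matches. */
    public static String extractText(Node root, String id) {
        Element element = findById(root, id);
        return element == null ? null : element.getTextContent().trim();
    }

    /** Parses a well-formed XHTML string (no DOCTYPE, so no DTD is fetched). */
    public static Node parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div id=\"summary\"> Hello <b>world</b> </div></body></html>";
        System.out.println(extractText(parse(xhtml), "summary")); // prints "Hello world"
    }
}
```

Note that extractText returns null when the node is not found. In a real filter it may be worth logging that case separately, since a missing field in the index can mean either that the filter was never invoked or that it ran and found nothing to extract.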
