Re: indexing hierarchical data, schema design

lewis john mcgibbney Fri, 17 Jun 2011 15:20:15 -0700

Hi Jasimop,

Some initial thoughts of mine are the following:
>Is your situation identical to the example you provided, where every page you
crawl has the format
<html><body><paragraph>...</paragraph>...<paragraph>...</paragraph></body></html>?
If so, then it looks like you require some sort of plugin
similar to what Andrzej suggested here [1]. However, the HtmlParseFilter
plugin you implement will need to invoke an action after each </paragraph> to
initiate the indexing of that section.
>If, on the other hand, your web pages are not exactly as in your
example, e.g. there is lots of clutter around the textual content you need,
then you are looking solely at extending the functionality discussed here [2]
to include the indexing step discussed above.
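
To make the "action after </paragraph>" idea a little more concrete, here is a
rough sketch of the DOM walk such a filter might perform. It deliberately uses
only the JDK (org.w3c.dom) and no Nutch classes, so the class name
ParagraphSplitter and its methods are hypothetical and not part of Nutch. It
also assumes the page is well-formed enough for the JDK XML parser, which real
HTML often is not. In an actual HtmlParseFilter you would apply the same walk
to the DOM that Nutch hands to the plugin, and pass each collected section on
to the indexing step instead of returning a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class: collects the text of every
// <paragraph> element so that each one can be indexed as its own section.
public class ParagraphSplitter {

    // Depth-first walk; returns the paragraph texts in document order.
    static List<String> collectParagraphs(Node root) {
        List<String> sections = new ArrayList<>();
        walk(root, sections);
        return sections;
    }

    private static void walk(Node node, List<String> sections) {
        // Compare case-insensitively, since some HTML parsers report
        // element names in upper case.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "paragraph".equalsIgnoreCase(node.getNodeName())) {
            // This is the point "after </paragraph>": the whole section is known.
            sections.add(node.getTextContent().trim());
            return; // nested paragraphs are treated as part of their parent
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), sections);
        }
    }

    // Convenience for trying the walk on a well-formed page given as a string.
    static List<String> fromString(String page) {
        try {
            Node doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(page)));
            return collectParagraphs(doc);
        } catch (Exception e) {
            throw new RuntimeException("could not parse page", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><paragraph>first section</paragraph>"
                + "<paragraph>second section</paragraph></body></html>";
        for (String section : fromString(page)) {
            System.out.println(section);
        }
    }
}
```

Returning early once a paragraph is found keeps each section in one piece;
whether nested markup should instead be split further depends on your schema.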


I am sorry I cannot be of more help just now. I have yet to familiarise
myself with boilerpipe.


[1]
http://lucene.472066.n3.nabble.com/Can-Nutch-index-parse-targeted-sections-of-a-web-page-td1785541.html
[2] https://issues.apache.org/jira/browse/NUTCH-961

> Can anyone guide me into the right direction? Where should I start to
> search? Classes, wikis, homepages, books?
> Nutch does a great job for what I need it now, but I think it lacks a bit
> of documentation, especially when it comes to plugin development.
>

Yes, I do agree with you here to an extent. It would appear that fewer users
have been contributing their knowledge to the plugin development section of
our wiki. There appears to be a wealth of info relating to legacy stuff on the
wiki though! It would be nice to see more examples of good practice and use
cases in the future.


> How would a bare-bones plugin look like?
>

Maybe someone can elaborate on the above, or correct me if my advice is off
track.


-- 
*Lewis*
