Parse and index tags from crawled HTML documents

Simone Fonda Thu, 29 Sep 2011 08:26:31 -0700

Hello everybody,
we are trying to set up Nutch + Solr to crawl and index an entire website.


We would like to extract, index and store some information kept in the
documents' <meta> tags (those with certain fixed 'name' attributes) for
our own convenience.
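
To illustrate the kind of extraction we mean, here is a minimal sketch in
plain Python. It is not Nutch plugin code and says nothing about how Nutch
or Solr would do this; the names MetaTagExtractor and extract_meta, and the
sample 'author' / 'category' tags, are made up for the example.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect the 'content' of <meta> tags whose 'name' is in a fixed set.

    Hypothetical helper for illustration only, not part of Nutch.
    """

    def __init__(self, wanted_names):
        super().__init__()
        # Compare names case-insensitively.
        self.wanted_names = {n.lower() for n in wanted_names}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = (attr_map.get("name") or "").lower()
        if name in self.wanted_names:
            self.found[name] = attr_map.get("content") or ""


def extract_meta(html, wanted_names):
    """Return {name: content} for the requested <meta> names found in html."""
    parser = MetaTagExtractor(wanted_names)
    parser.feed(html)
    parser.close()
    return parser.found


# Assumed sample document; the tag names are invented for the example.
SAMPLE_HTML = """<html><head>
<title>Example page</title>
<meta name="author" content="Jane Doe">
<meta name="category" content="news" />
<meta name="viewport" content="width=device-width">
</head><body><p>Hello</p></body></html>"""

fields = extract_meta(SAMPLE_HTML, ["author", "category"])
print(fields)  # {'author': 'Jane Doe', 'category': 'news'}
```

Each key of the resulting dictionary is what we would like to end up as
an indexed and stored field in Solr, next to the existing 'title' field.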


We (mistakenly) tried the urlmeta plugin, but it seems that it only
propagates the values you enter in the initial seed URLs file, and
that's not what we need.

We tried to use the parse-metatags plugin found at
https://issues.apache.org/jira/browse/NUTCH-809 , but we couldn't get
Ant to build it properly: first 'ivy.xml' was missing, so we created one
by copying it from another plugin. Then we updated Ant to the latest
release, but the compiler still complained that it could not find some
libraries imported in the Java source files (some org.apache.lucene.*
stuff).

Since Nutch's 'title' field is something VERY similar to what we need
(in the end it just extracts the content of the <title>), we tried to
find out whether the parse-html plugin could somehow fit our needs, with
no success yet.

We tried to find more information on how to use an XPath-driven
approach, with no luck.



Has anybody ever indexed and stored the content of <meta> tags?

Thanks a lot,
Simone
