Hello everybody, we are trying to set up Nutch + Solr to crawl and index an entire website.
We would like to extract, index and store some of the information kept in the documents' <meta> tags (those with certain fixed 'name' attributes) for our own convenience.

We first (mistakenly) tried the urlmeta plugin, but it seems that it only propagates the values you enter in the initial seed URLs file, which is not what we need.

We then tried the parse-metatags plugin found at https://issues.apache.org/jira/browse/NUTCH-809 , but we couldn't get Ant to build it properly. First, 'ivy.xml' was missing, so we created one by copying it from another plugin. Then we updated Ant to the latest release, but the compiler still complained that it could not find some libraries imported in the Java source files (some org.apache.lucene.* stuff).

Since Nutch's 'title' field is VERY similar to what we need (in the end it just extracts the content of <title>), we tried to find out whether the parse-html plugin could fit our needs in some way, with no success yet. We also looked for more information on how to use an XPath-driven approach, with no luck.

Has anybody ever indexed and stored the content of <meta> tags?

Thanks a lot,
Simone
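
P.S. In case it helps to make the goal concrete, here is a minimal sketch in plain Java (JDK classes only, not Nutch plugin code) of the XPath-style extraction we have in mind. The class name MetaTagSketch, the method extractMeta and the sample page are made up for illustration. It also assumes the input is well-formed XHTML, which real crawled pages often are not, so inside Nutch something would have to hand us an already-parsed DOM instead.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only: none of these names come from Nutch.
class MetaTagSketch {

    // Returns name -> content for every <meta> whose 'name' is in 'wanted'.
    // Assumption: 'xhtml' is well-formed XML without a DOCTYPE.
    static Map<String, String> extractMeta(String xhtml, Set<String> wanted) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Select every <meta> element carrying both attributes.
            NodeList metas = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//meta[@name and @content]", doc, XPathConstants.NODESET);

            Map<String, String> found = new LinkedHashMap<>();
            for (int i = 0; i < metas.getLength(); i++) {
                Element meta = (Element) metas.item(i);
                // Compare names case-insensitively; 'wanted' must be lower case.
                String name = meta.getAttribute("name").toLowerCase(Locale.ROOT);
                if (wanted.contains(name)) {
                    found.put(name, meta.getAttribute("content"));
                }
            }
            return found;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title>"
                + "<meta name=\"author\" content=\"Simone\"/>"
                + "<meta name=\"keywords\" content=\"nutch, solr\"/>"
                + "<meta name=\"robots\" content=\"noindex\"/>"
                + "</head><body/></html>";
        // Only the names we care about end up in the result.
        System.out.println(extractMeta(page, Set.of("author", "keywords")));
    }
}
```

What we are missing is the Nutch side: where to hook something like this into parsing, and how to get the resulting values indexed and stored as Solr fields.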

