Hi,
I did a similar task two weeks ago.
I created a plugin and used a dynamic field in Solr.
In the plugin, I have a MetadataHtmlParser with a filter method that
specifies the pattern used to retrieve the tags with a regex:
Pattern tagPattern =
Pattern.compile("^<meta\\sname=\"([^\"]+)\"\\scontent=\"([^\"]+)\">$");
In my case, the meta tags are at the beginning of the HTML document, so I
only check the first 40 lines of each document.
If a line matches the pattern, I add a new meta tag to the tags hashmap:
Matcher m = tagPattern.matcher(line);
if (m.find()) {
  LOG.debug("Adding tag=" + m.group(1));
  tags.put(m.group(1), m.group(2));
}
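To make the scanning step concrete, here is a minimal standalone sketch of it, without any Nutch classes. The class name MetaTagScanner, the scan method and the MAX_LINES constant are hypothetical names chosen for illustration, not part of my plugin. Note that the pattern above expects the tag to end with `">`, while the tags in the question end with ` />`, so this sketch assumes a slightly relaxed pattern that accepts both endings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaTagScanner {
    // Assumption: only the first 40 lines are checked, as described above.
    private static final int MAX_LINES = 40;

    // Same idea as the pattern above, relaxed to also accept a " />" ending.
    private static final Pattern TAG_PATTERN = Pattern.compile(
        "^<meta\\sname=\"([^\"]+)\"\\scontent=\"([^\"]+)\"\\s*/?>$");

    // Returns name -> content for every meta tag that sits alone on one line.
    public static Map<String, String> scan(String html) {
        Map<String, String> tags = new LinkedHashMap<>();
        String[] lines = html.split("\\r?\\n");
        int limit = Math.min(lines.length, MAX_LINES);
        for (int i = 0; i < limit; i++) {
            // Trim so that indented tags still match the anchored pattern.
            Matcher m = TAG_PATTERN.matcher(lines[i].trim());
            if (m.find()) {
                tags.put(m.group(1), m.group(2));
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<html>\n<head>\n"
            + "  <meta name=\"TITLE\" content=\"Some title\" />\n"
            + "  <meta name=\"KEYWORDS\" content=\"Forum, help, build, stuff\" />\n"
            + "</head>\n</html>";
        System.out.println(scan(html));
    }
}
```

A line-based regex like this only works when each meta tag is on its own line with the name attribute before content; for less regular HTML, a DOM-based approach would be more robust.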
In the MetadataIndexFilter, I add the tags found by MetadataHtmlParser:
public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    CrawlDatum datum, Inlinks inlinks) throws IndexingException {
  // Collect the first value of every entry in the parse metadata.
  HashMap<String, String> tags = new HashMap<String, String>();
  String[] tagNames = parse.getData().getParseMeta().names();
  for (String name : tagNames) {
    String[] tagValues = parse.getData().getParseMeta().getValues(name);
    tags.put(name, tagValues[0]);
  }
  if (tags.isEmpty()) {
    return doc;
  }
  // Add the tags to the Nutch document; the properties of the field are
  // set in the addIndexBackendOptions method.
  Set<String> keys = tags.keySet();
  for (String tag : keys) {
    LOGGER.debug("Adding tag: [" + tag + "] for URL: " + url.toString());
    doc.add(tag + "_meta", tags.get(tag));
  }
  return doc;
}
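The naming step in this filter can be shown on its own, without Nutch on the classpath. This is only a sketch: the class MetaFieldNames and the method toIndexFields are hypothetical names, and a plain Map of String arrays stands in for the parse metadata. It shows that each tag name gets the "_meta" suffix and that only the first value of a multi-valued tag is kept.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetaFieldNames {
    // Suffix that the Solr dynamic field pattern is expected to match.
    private static final String SUFFIX = "_meta";

    // Maps tag name -> values to index field name -> first value.
    public static Map<String, String> toIndexFields(Map<String, String[]> parseMeta) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> entry : parseMeta.entrySet()) {
            String[] values = entry.getValue();
            // Skip tags without a value instead of failing on values[0].
            if (values == null || values.length == 0) {
                continue;
            }
            fields.put(entry.getKey() + SUFFIX, values[0]);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String[]> parseMeta = new LinkedHashMap<>();
        parseMeta.put("TITLE", new String[] {"Some title"});
        parseMeta.put("KEYWORDS", new String[] {"Forum, help, build, stuff"});
        System.out.println(toIndexFields(parseMeta));
    }
}
```

If you need every value of a repeated tag, you would add each value to the document separately and declare the Solr field as multi-valued.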
The Solr dynamic field can be declared like this:
<dynamicField name="*_meta" type="string" indexed="true" stored="true"/>
Add it to schema.xml, so that each meta tag found in a document goes into the
index as a separate field. For example:
Title_meta
Keywords_meta
Cavalaglio Davide
2010/5/21 Claus Daldorph Nielsen <[email protected]>
> Hi,
>
> I am new to Nutch and trying to get Nutch to index meta tags from html
> pages and store them for searching in Solr. The tags are on this form:
> <meta name="TITLE" content="Some title" />
> <meta name="KEYWORDS" content="Forum, help, build, stuff" />
>
> I would like to store the tags as two different fields in the index. I
> have tried the example explaining how to create a plugin but the example
> is for Nutch 0.9 and only helps me getting started.
>
> I think that I should look at :
>
> $NUTCH_HOME/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
>
> and find the line:
> HTMLMetaProcessor.getMetaTags(metaTags, root, base);
>
> But I'm not sure how to go on from here. Any help would be appreciated and
> you are welcome to inform me if you know of an existing plugin that will
> index the meta tags.
>
>
>
> Claus Daldorph Nielsen
>
> Theilgaard Mortensen a/s