Hi,
I have been trying to parse and index different portions of an HTML page
using Tika and Lucene. For example, I would like to index the text within
the Title, H1, H2 and A tags of an HTML page separately and apply a different
boost to each of them. I am using Tika for HTML parsing and creating a
Document object with the fields that need to be indexed. However, I could
not find anything in Tika that would let me index the tags I want out of
the box.
My code looks something like this:
    InputStream is = new FileInputStream(f);
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    // -1 disables the write limit on the body text
    ContentHandler handler = new BodyContentHandler(-1);
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, DefaultHtmlMapper.INSTANCE);
    try {
        parser.parse(is, handler, metadata, context);
    } finally {
        is.close();
    }
    Document doc = new Document();
    doc.add(new Field("contents", handler.toString(),
            Field.Store.NO, Field.Index.ANALYZED));
    for (String name : metadata.names()) {
        String value = metadata.get(name);
        if (textualMetadataFields.contains(name)) {
            doc.add(new Field("contents", value,
                    Field.Store.NO, Field.Index.ANALYZED));
        }
        // Field.Index has no YES constant; NOT_ANALYZED indexes the raw value as one term
        doc.add(new Field(name, value, Field.Store.YES, Field.Index.NOT_ANALYZED));
    }
Stepping through Tika's HTML parsing code, I found that the
org.apache.tika.parser.html.HtmlHandler class is what fills in the metadata
object. Do I need to write a custom HTML handler like HtmlHandler? Is there
a class in Tika that can extract the text within the HTML tags that one
specifies? Could you please provide code samples for any solutions that you
propose?
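To make the question concrete, here is a rough sketch of the kind of custom handler I have in mind. It is only a sketch under assumptions, not something taken from Tika: TagTextCollector is a hypothetical class name, it relies only on the plain SAX ContentHandler interface, and it assumes that Tika forwards the tags of interest as XHTML SAX events to whatever handler is passed to parse() (the configured HtmlMapper may drop or rename some of them). The demo in main() uses the JDK's own SAX parser on a small well-formed snippet so that it runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): collects the text found inside
// each requested tag into its own buffer, keyed by lower-cased tag name.
public class TagTextCollector extends DefaultHandler {
    private final Set<String> wanted;
    private final Map<String, StringBuilder> text = new LinkedHashMap<>();
    private String current; // tag currently being captured, or null

    public TagTextCollector(String... tags) {
        wanted = new HashSet<>(Arrays.asList(tags));
    }

    // Use the local name when the parser is namespace-aware, else the qName.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String n = name(localName, qName);
        // Only the outermost wanted tag is captured; nested wanted tags
        // contribute their text to the enclosing one.
        if (current == null && wanted.contains(n)) {
            current = n;
            StringBuilder sb = text.computeIfAbsent(n, k -> new StringBuilder());
            if (sb.length() > 0) sb.append(' '); // separate repeated occurrences
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) text.get(current).append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (name(localName, qName).equals(current)) current = null;
    }

    // Text collected for one tag, or "" if the tag never occurred.
    public String get(String tag) {
        StringBuilder sb = text.get(tag);
        return sb == null ? "" : sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>My page</title></head><body>"
                + "<h1>Intro</h1><p>skip me</p><h2>Part one</h2>"
                + "<a href=\"x\">a link</a><h2>Part two</h2></body></html>";
        TagTextCollector c = new TagTextCollector("title", "h1", "h2", "a");
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), c);
        System.out.println(c.get("h2")); // prints "Part one Part two"
    }
}
```

If this is a reasonable direction, I imagine passing such a collector to parser.parse(...) in place of (or, via org.apache.tika.sax.TeeContentHandler, alongside) the BodyContentHandler, and then adding one Lucene field per tag with its own boost. I have not verified which tags actually reach the handler with DefaultHtmlMapper.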
Thanks,
Amg