How to parse different portions of an HTML page using Tika & index using Lucene ?

amg qas Mon, 10 Jan 2011 11:02:39 -0800

Hi,

I have been trying to parse and index different portions of an HTML page
using Tika and Lucene. For example, I would like to index the text within the
Title, H1, H2 and A tags of an HTML page separately and give each of them a
different boost. I am using Tika for HTML parsing and creating a Document
object with the fields that need to be indexed. However, I could not find
anything in Tika that would let me index the tags I want out of the box.


My code looks something like this:

 InputStream is = new FileInputStream(f);
 Parser parser = new AutoDetectParser();
 ContentHandler handler = new BodyContentHandler(-1);  // -1 disables the write limit
 Metadata metadata = new Metadata();
 ParseContext context = new ParseContext();

 context.set(HtmlMapper.class, DefaultHtmlMapper.INSTANCE);

 try {
  parser.parse(is, handler, metadata, context);
 } finally {
  is.close();
 }

 Document doc = new Document();
 doc.add(new Field("contents", handler.toString(),
   Field.Store.NO, Field.Index.ANALYZED));

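 for (String name : metadata.names()) {
  String value = metadata.get(name);

  // Textual metadata is also searchable through the catch-all field
  if (textualMetadataFields.contains(name)) {
   doc.add(new Field("contents", value,
     Field.Store.NO, Field.Index.ANALYZED));
  }

  // Store each metadata value under its own field name, indexed as a
  // single token (assumes Field.Index.NOT_ANALYZED was intended)
  doc.add(new Field(name, value, Field.Store.YES, Field.Index.NOT_ANALYZED));
 }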

Stepping into Tika's HTML parsing code, I found that the
org.apache.tika.parser.html.HtmlHandler class is what fills in the metadata
object. Do I need to write a custom HTML handler like HtmlHandler? Is there a
class in Tika that can extract the text within the HTML tags that one
specifies? Can someone please provide code samples for any solutions you
propose?
Thanks,
Amg
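
For reference, here is a minimal sketch of what a custom handler of the kind
asked about might look like. It is not a Tika class: TagTextHandler and its
extract helper are hypothetical names made up for illustration. It relies only
on the standard SAX API and assumes the parser delivers XHTML-style
startElement/characters/endElement events for the tags of interest. To keep
the sketch self-contained, it is driven here by the JDK's own SAX parser on
well-formed XHTML rather than by Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text inside selected elements, keyed by tag name. */
class TagTextHandler extends DefaultHandler {
    private final Set<String> wanted;
    private final Map<String, List<String>> textByTag = new LinkedHashMap<>();
    private final StringBuilder buffer = new StringBuilder();
    private String current; // the wanted tag currently open, or null

    TagTextHandler(String... tags) {
        wanted = new HashSet<>(Arrays.asList(tags));
    }

    // Use the local name when the parser is namespace-aware, else the qualified name.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String tag = name(localName, qName);
        // Only the outermost wanted element is tracked; text of wanted elements
        // nested inside it is credited to the outer one.
        if (current == null && wanted.contains(tag)) {
            current = tag;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String tag = name(localName, qName);
        if (tag.equals(current)) {
            List<String> values = textByTag.get(tag);
            if (values == null) {
                values = new ArrayList<>();
                textByTag.put(tag, values);
            }
            values.add(buffer.toString().trim());
            current = null;
        }
    }

    Map<String, List<String>> getTextByTag() {
        return textByTag;
    }

    /** Demo driver: feeds well-formed XHTML through the JDK SAX parser. */
    static Map<String, List<String>> extract(String xhtml, String... tags) throws Exception {
        TagTextHandler handler = new TagTextHandler(tags);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getTextByTag();
    }
}
```

Since Tika parsers write their output as SAX events to whatever ContentHandler
they are given, a handler like this could presumably be passed to
parser.parse(is, handler, metadata, context) in place of BodyContentHandler.
Which elements actually reach the handler would depend on the HtmlMapper in
use. Each collected value could then be added to the Document as its own Field
(for example "title", "h1", "h2", "a"), with a per-field boost set through
Field.setBoost before the field is added.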
