Hi,

On Mon, Aug 17, 2009 at 7:29 AM, Dave Pawson<dave.paw...@gmail.com> wrote:
> New to tika, early user of Lucene.
> Particular interest in indexing and searching XML instances.
> I currently have about 800+ instances, with about 20 different schemas
> (XML based user documentation for Erlang) that I'm working with.
>
> Seeking guidance on how best to handle XML.
> E.g. How to get boost on certain elements,
> ignore other element content.
>
> Are there any developments in this area?

Tika currently just extracts the character content of XML documents
and drops most of the structural information, apart from any
Dublin Core metadata it finds.

If you want more control over how your XML documents are indexed, you
should consider parsing them directly, without Tika in between.
Alternatively, we could consider adding some generic parser options to
Tika, for example to turn specific elements of the input XML document
into links or <em/> elements in the resulting XHTML output for use by
the indexer.
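
As a rough sketch of the direct-parsing route, something like the
following SAX handler could split a document into named fields before
indexing. The class name, the "body" default field and the element
names used below (title, copyright) are all made up for illustration;
they are not part of Tika or Lucene, and your schemas will need their
own lists.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects XML text into named fields using plain SAX.
class XmlFieldExtractor extends DefaultHandler {
    private final Set<String> ownField;   // elements whose text gets a separate field
    private final Set<String> ignored;    // elements whose content is skipped entirely
    private final Map<String, StringBuilder> fields = new LinkedHashMap<>();
    private final Deque<String> fieldStack = new ArrayDeque<>();
    private int ignoreDepth = 0;          // > 0 while inside an ignored element

    private XmlFieldExtractor(Set<String> ownField, Set<String> ignored) {
        this.ownField = ownField;
        this.ignored = ignored;
    }

    // Parses the XML string and returns field name -> whitespace-normalized text.
    static Map<String, String> extract(String xml, Set<String> ownField, Set<String> ignored) {
        XmlFieldExtractor handler = new XmlFieldExtractor(ownField, ignored);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : handler.fields.entrySet()) {
            String text = e.getValue().toString().replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                result.put(e.getKey(), text);
            }
        }
        return result;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (ignoreDepth > 0 || ignored.contains(qName)) {
            ignoreDepth++;
        }
        // Text goes to the nearest enclosing "own field" element, else to "body".
        String parent = fieldStack.peek();
        String field = ownField.contains(qName) ? qName : (parent == null ? "body" : parent);
        fieldStack.push(field);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (ignoreDepth == 0) {
            append(fieldStack.peek(), new String(ch, start, length));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (ignoreDepth > 0) {
            ignoreDepth--;
        } else {
            append(fieldStack.peek(), " "); // keep words of adjacent elements apart
        }
        fieldStack.pop();
    }

    private void append(String field, String text) {
        fields.computeIfAbsent(field, k -> new StringBuilder()).append(text);
    }
}
```

Each entry of the returned map could then become a separate field in
the Lucene document, with a higher boost on the fields you care about
(title in this example), while ignored elements never reach the index.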

BR,

Jukka Zitting
