Hi,

On Mon, Aug 17, 2009 at 7:29 AM, Dave Pawson<dave.paw...@gmail.com> wrote:
> New to tika, early user of Lucene.
> Particular interest in indexing and searching XML instances.
> I currently have about 800+ instances, with about 20 different schemas
> (XML based user documentation for Erlang) that I'm working with.
>
> Seeking guidance on how best to handle XML.
> E.g. How to get boost on certain elements,
> ignore other element content.
>
> Are there any developments in this area?

Tika currently just pulls out the character content from XML documents, dropping most of the structural information except for some Dublin Core metadata, if found. If you want more control over how your XML documents are indexed, you should consider parsing them directly, without Tika in between.

Alternatively, we may want to consider adding some generic parser options to Tika, for example to turn specific elements in the input XML document into links or <em/> elements in the resulting XHTML output for use by the indexer.

BR,

Jukka Zitting
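
For illustration only, here is a minimal sketch of the "parse them directly" route mentioned above, using the SAX parser that ships with the JDK. It is not part of Tika or Lucene. The class name XmlFieldExtractor, the element names (title, body, history) and the boost values are all hypothetical; the real names would come from your schemas. Text is collected per wanted element, and ignored elements are skipped together with everything inside them. Feeding the resulting fields and boosts to the indexer is left out.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: gathers character content per element name.
class XmlFieldExtractor {

    // Returns element name -> concatenated text for every "wanted" element.
    // Content inside any "ignored" element (including nested children) is dropped.
    static Map<String, String> extract(String xml, Set<String> wanted, Set<String> ignored) {
        Map<String, String> fields = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private final Deque<String> open = new ArrayDeque<>(); // open elements, innermost first
            private int ignoreDepth = 0; // > 0 while inside an ignored subtree

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (ignoreDepth > 0 || ignored.contains(qName)) {
                    ignoreDepth++;
                }
                open.push(qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                open.pop();
                if (ignoreDepth > 0) {
                    ignoreDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (ignoreDepth > 0) {
                    return;
                }
                // Credit the text to the nearest enclosing wanted element, if any.
                for (String name : open) {
                    if (wanted.contains(name)) {
                        fields.merge(name, new String(ch, start, length), String::concat);
                        return;
                    }
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return fields;
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Lists</title>"
                + "<body>Appends <code>lists:append/2</code> here.</body>"
                + "<history>old stuff</history></doc>";
        // Assumed per-field weights; pass them to the indexer however it expects boosts.
        Map<String, Float> boosts = Map.of("title", 2.0f, "body", 1.0f);
        Map<String, String> fields = extract(xml, Set.of("title", "body"), Set.of("history"));
        for (Map.Entry<String, String> e : fields.entrySet()) {
            System.out.println(e.getKey() + " (boost " + boosts.get(e.getKey()) + "): " + e.getValue());
        }
    }
}
```

With around 20 schemas, one wanted/ignored/boost table per schema is probably easier to maintain than one handler class per schema, though that is a design guess rather than anything Tika or Lucene prescribes.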