Re: DIH using values from solrconfig.xml inside data-config.xml

Fergus McMenemie Wed, 04 Feb 2009 03:46:00 -0800

>: > The solr data field is populated properly. So I guess that bit works.
>: > I really wish I could use xpath="//para"
>
>: The limitation comes from streaming the XML instead of creating a DOM.
>: XPathRecordReader is a custom streaming XPath parser implementation and
>: streaming is easy only because we limit the syntax. You can use
>: PlainTextEntityProcessor which gives the XML as a string to a  custom
>: Transformer. This Transformer can create a DOM, run your XPath query and
>: populate the fields. It's more expensive but it is an option.
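
The DOM route described above might look roughly like this. It is only a sketch of the idea (take the whole document as a string, build a tree, run the query), written in Python rather than against the real DIH Transformer interface. The function name paras_from_string and the sample record are made up for illustration.

```python
import xml.etree.ElementTree as ET


def paras_from_string(xml_text):
    """Parse the whole document into memory, then query it.

    Hypothetical helper: it stands in for what a custom Transformer
    could do with the string it receives, and is not DIH code.
    """
    root = ET.fromstring(xml_text)  # builds the full tree in memory
    # ".//para" matches <para> descendants at any depth below the root
    return ["".join(p.itertext()).strip() for p in root.findall(".//para")]


sample = (
    "<record><title>T</title><body><para>one</para>"
    "<sec><para>two <b>bold</b></para></sec></body></record>"
)
print(paras_from_string(sample))
```

The cost is that every document is fully parsed into a tree before anything is extracted, which is the "more expensive" part.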
>
>Maybe it's just me, but it seems like i'm noticing that as DIH gets used 
>more, many people are noting that the XPath processing in DIH doesn't work 
>the way they expect because it's a custom XPath parser/engine designed for 
>streaming.  
>
>It seems like it would be helpful to have an alternate processor for 
>people who don't need the streaming support (ie: are dealing with small 
>enough docs that they can load the full DOM tree into memory) that would 
>use the default Java XPath engine (and have fewer caveats/surprises) ... i 
>would think it would probably even make sense for this new XPath processor 
>to be the one we suggest for new users, and only suggest the existing 
>(stream based) processor if they have really big xml docs to deal with.
>
>(In hindsight XPathEntityProcessor and XPathRecordReader should probably 
>have been named StreamingXPathEntityProcessor and 
>StreamingXPathRecordReader)
>
Four thoughts!


1) My use case involves a few million XML documents ranging in size
   from a few K to 500K. 95% of the documents are under 25KBytes, 
   5% of the documents are around 0.5Mbytes. So.. sod it, I think I
   need a streaming parser.

2) "streaming XPath parser"? I only half understand all this stuff,
   but, and this is based on the little bit of SAX stuff I have written,
   I would have thought that //para was trivial for any kind of
   streaming XML parser.

3) Much of the confusion may be arising because the DIH wiki page is
   not too clear on what is and is not allowed. We need better,
   more explicit examples. What seems to be allowed is:-
    <field column="streetname" xpath="/record/address/@c" /> 
    <field column="title"      xpath="/record/title" />
    <field column="date"       xpath="/record/da...@qualifier='pubDate']" />
   I will add these to the wiki. Just to be sure, I tested 
   xpath="//para". It does not work!

4) XML documents are either well structured, with good separation of
   data and presentation, in which case absolute xpaths work fine,
   or older documents (in my case text documents forced into XML
   format) with poor structure, where the data and presentation
   are all mixed up. I suspect that the addition of //para would
   cover many of the use cases, and what was left could be covered
   by a preceding XSLT transform.
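
To illustrate point 2, here is the kind of thing I mean by //para being easy with SAX. This is a minimal sketch using Python's xml.sax, not anything taken from XPathRecordReader. The class name ParaCollector and the helper collect_paras are invented for the example, and it assumes a nested <para> should simply be folded into its outermost <para>.

```python
import xml.sax


class ParaCollector(xml.sax.ContentHandler):
    """Collect the text of every <para> element, at any depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <para> elements currently open
        self.buffer = []  # text chunks of the outermost open <para>
        self.paras = []   # finished paragraphs, in document order

    def startElement(self, name, attrs):
        if name == "para":
            self.depth += 1

    def characters(self, content):
        # Keep text only while we are somewhere inside a <para>
        if self.depth > 0:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "para":
            self.depth -= 1
            if self.depth == 0:
                self.paras.append("".join(self.buffer).strip())
                self.buffer = []


def collect_paras(xml_text):
    """Stream through xml_text once and return the text of each <para>."""
    handler = ParaCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.paras


sample = (
    "<record><title>T</title><body><para>one</para>"
    "<sec><para>two <b>bold</b></para></sec></body></record>"
)
print(collect_paras(sample))
```

Nothing here needs the whole document in memory; the handler only tracks whether it is inside a <para>. Whether the same is straightforward inside the existing record reader is a separate question.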
-- 

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================
