Apologies in advance if this question has already been answered… I have scoured the docs, mail archives, and the web looking for an answer, with no luck. I am sure I am just being dense or missing something obvious… please point out my stupidity, as my head hurts from trying to get this working.
Solr 3.1, Java 1.6, Eclipse/Tomcat 7/Maven 2.x

Goal: to extract manufacturer names from a repeating list of keywords, each denoted by a Category (one of which is "Manufacturer"), and load them into a MsgKeywordMF field (see XML below).

I have XML files that I am loading via DIH. This is an abbreviated example of the XML data (each file has repeating "Report" items; each Report has repeating MsgSet, Msg, MsgList, etc. items). Notice the nested repeating groups, namely the MsgItems, within each document (Report):

    <Report>
      <ReportMeta>
        <ReportDate>02/22/2011</ReportDate>
        …
      </ReportMeta>
      <MsgSet>
        <Msg>
          <SourceDocID>http://someurl.com/path/to/doc</SourceDocID>
          …
          <DocumentText>........blah blah</DocumentText>
          <MsgList>
            <MsgItem>
              <MsgType>SomeType</MsgType>
              <Category>Location</Category>
              <Keyword>USA</Keyword>
            </MsgItem>
            <MsgItem>
              <MsgType>AnotherType</MsgType>
              <Category>Manufacturer</Category>
              <Keyword>Apple</Keyword>
            </MsgItem>
            …
          </MsgList>
        </Msg>
      </MsgSet>
    </Report>
    <Report> … </Report>
    <Report> … </Report>
    …

Here is my data-config.xml:

    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8" />
      <document>
        <entity name="fileload"
                rootEntity="false"
                processor="FileListEntityProcessor"
                fileName="^.*\.xml$"
                recursive="false"
                baseDir="/files/xml/">
          <entity name="report"
                  rootEntity="true"
                  pk="id"
                  url="${fileload.fileAbsolutePath}"
                  processor="XPathEntityProcessor"
                  forEach="/Report/MsgSet/Msg"
                  onError="skip"
                  transformer="DateFormatTransformer,RegexTransformer">
            <field column="DocumentText" xpath="/Report/MsgSet/Msg/DocumentText"/>
            <field column="id" xpath="/Report/MsgSet/Msg/SourceDocID"/>
            <field column="MsgCategory" xpath="/Report/MsgSet/Msg/MsgList/MsgItem/Category" />
            <field column="MsgKeyword" xpath="/Report/MsgSet/Msg/MsgList/MsgItem/Keyword" />
            <field column="MsgKeywordMF" xpath="/Report/MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword" />
            …
          </entity>
        </entity>
      </document>
    </dataConfig>

As seen in my config and sample data above, I am extracting the repeating "Keywords" into the MsgKeyword field. That part works. The part that does NOT work is extracting into a separate field just the keywords that have a "Category" of "Manufacturer":

    <field column="MsgKeywordMF" xpath="/Report/MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword" />

I have also tried:

    <field column="MsgKeywordMF" xpath="/Report/MsgSet/Msg/MsgList/MsgItem[@Category='Manufacturer']/Keyword" />

…after changing "Category" to an attribute of MsgItem (<MsgItem Category="Location">), but it too fails to match.

I have tested my XPath expression against my XML data file using various XPath evaluator tools, such as the one in Eclipse, and it matches perfectly… but I can't get it to match during import.

As far as I understand it, DIH does not support nested/correlated entities, at least not with XML data sources using nested entity tags. I've tried to nest entities, without success, because I can't "correlate" the nested entity with the parent. I think the way I'm trying should work, but no luck so far….

BTW, I can't easily change the XML format, although it is possible with some pain…

Any ideas?

TIA,
-- Eric
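
P.S. For anyone who wants to reproduce the standalone XPath check mentioned above, here is a minimal sketch (not the exact test run in Eclipse) that evaluates the same expression with the JDK's javax.xml.xpath engine. The class name XPathCheck, the helper manufacturerKeywords, and the trimmed inline copy of the sample data are made up for illustration; the elided parts of the real data are simply left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathCheck {

    // Trimmed copy of the sample data from the post: one Msg with two MsgItems.
    // (Hypothetical test fixture; the real files contain many more elements.)
    static final String SAMPLE_XML =
          "<Report><MsgSet><Msg>"
        + "<SourceDocID>http://someurl.com/path/to/doc</SourceDocID>"
        + "<MsgList>"
        + "<MsgItem><MsgType>SomeType</MsgType>"
        + "<Category>Location</Category><Keyword>USA</Keyword></MsgItem>"
        + "<MsgItem><MsgType>AnotherType</MsgType>"
        + "<Category>Manufacturer</Category><Keyword>Apple</Keyword></MsgItem>"
        + "</MsgList></Msg></MsgSet></Report>";

    // The same expression used for the MsgKeywordMF field in data-config.xml.
    static final String EXPR =
        "/Report/MsgSet/Msg/MsgList/MsgItem[Category='Manufacturer']/Keyword";

    // Returns the text of every Keyword whose sibling Category is "Manufacturer".
    static List<String> manufacturerKeywords(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            EXPR, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> keywords = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            keywords.add(nodes.item(i).getTextContent());
        }
        return keywords;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(manufacturerKeywords(SAMPLE_XML)); // prints [Apple]
    }
}
```

This only confirms that the expression is valid under a full XPath 1.0 engine; it says nothing about how DIH's XPathEntityProcessor evaluates the same string during import.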