[ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761124#action_12761124 ]
Fergus McMenemie commented on SOLR-1437:
----------------------------------------
I am quite pleased with it as far as it goes and think it would be good for
1.4. I have tested it against my test set of 3000 XML documents, replacing:
{code}
<field column="para1" name="text"
xpath="/record/sect1/para" flatten="true"/>
<field column="para2" name="text"
xpath="/record/list/listitem/para" flatten="true"/>
<field column="para32" name="text"
xpath="/record/address/para" flatten="true" />
<field column="para40" name="text"
xpath="/record/authoredBy/para" flatten="true" />
<field column="para43" name="text"
xpath="/record/dataGroup/address/para" flatten="true" />
<field column="para47" name="text"
xpath="/record/dataGroup/keyPersonnel/doubleList/first/para" flatten="true" />
<field column="para49" name="text"
xpath="/record/dataGroup/keyPersonnel/doubleList/second/para" flatten="true" />
<field column="para50" name="text"
xpath="/record/dataGroup/keyPersonnel/para" flatten="true" />
<field column="para51" name="text"
xpath="/record/dataGroup/para" flatten="true" />
<field column="para57" name="text"
xpath="/record/doubleList/first/para" flatten="true" />
<field column="para59" name="text"
xpath="/record/doubleList/second/para" flatten="true" />
<field column="para63" name="text"
xpath="/record/keyPersonnel/doubleList/first/para" flatten="true" />
<field column="para65" name="text"
xpath="/record/keyPersonnel/doubleList/second/para" flatten="true" />
<field column="para68" name="text"
xpath="/record/list/listItem/para" flatten="true" />
<field column="para75" name="text"
xpath="/record/mediaBlock/doubleList/first/para" flatten="true" />
<field column="para77" name="text"
xpath="/record/mediaBlock/doubleList/second/para" flatten="true" />
<field column="para172" name="text"
xpath="/record/noteGroup/note/para" flatten="true" />
<field column="para174" name="text"
xpath="/record/para" flatten="true" />
<field column="para179" name="text"
xpath="/record/relatedInfo/list/listItem/relatedArticle/para" flatten="true" />
<field column="para184" name="text"
xpath="/record/sect1/address/dataGroup/para" flatten="true" />
<field column="para185" name="text"
xpath="/record/sect1/address/para" flatten="true" />
<field column="para195" name="text"
xpath="/record/sect1/dataGroup/address/para" flatten="true" />
<field column="para199" name="text"
xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/first/para"
flatten="true" />
<field column="para201" name="text"
xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/second/para"
flatten="true" />
<field column="para202" name="text"
xpath="/record/sect1/dataGroup/keyPersonnel/para" flatten="true" />
<field column="para203" name="text"
xpath="/record/sect1/dataGroup/para" flatten="true" />
<field column="para208" name="text"
xpath="/record/sect1/doubleList/first/para" flatten="true" />
<field column="para212" name="text"
xpath="/record/sect1/doubleList/second/list/listItem/para" flatten="true" />
<field column="para213" name="text"
xpath="/record/sect1/doubleList/second/para" flatten="true" />
<field column="para217" name="text"
xpath="/record/sect1/keyPersonnel/doubleList/first/para" flatten="true" />
<field column="para219" name="text"
xpath="/record/sect1/keyPersonnel/doubleList/second/para" flatten="true" />
<field column="para220" name="text"
xpath="/record/sect1/keyPersonnel/para" flatten="true" />
<field column="para225" name="text"
xpath="/record/sect1/list/listItem/list/listItem/para" flatten="true" />
<field column="para226" name="text"
xpath="/record/sect1/list/listItem/para" flatten="true" />
<field column="para240" name="text"
xpath="/record/sect1/para" flatten="true" />
<field column="para244" name="text"
xpath="/record/sect1/sect2/doubleList/first/para" flatten="true" />
<field column="para246" name="text"
xpath="/record/sect1/sect2/doubleList/second/para" flatten="true" />
<field column="para251" name="text"
xpath="/record/sect1/sect2/list/listItem/list/listItem/para" flatten="true" />
<field column="para252" name="text"
xpath="/record/sect1/sect2/list/listItem/para" flatten="true" />
<field column="para258" name="text"
xpath="/record/sect1/sect2/noteGroup/note/para" flatten="true" />
<field column="para259" name="text"
xpath="/record/sect1/sect2/para" flatten="true" />
<field column="para265" name="text"
xpath="/record/sect1/sect2/sect3/list/listItem/list/listItem/para"
flatten="true" />
<field column="para266" name="text"
xpath="/record/sect1/sect2/sect3/list/listItem/para" flatten="true" />
<field column="para271" name="text"
xpath="/record/sect1/sect2/sect3/para" flatten="true" />
<field column="para275" name="text"
xpath="/record/sect1/sect2/sect3/sect4/list/listItem/para" flatten="true" />
<field column="para279" name="text"
xpath="/record/sect1/sect2/sect3/sect4/para" flatten="true" />
<field column="para284" name="text"
xpath="/record/sect1/sect2/sect3/sect4/sect5/para" flatten="true" />
<field column="para295" name="text"
xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/noteGroup/note/para"
flatten="true" />
<field column="para297" name="text"
xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/para"
flatten="true" />
<field column="para301" name="text"
xpath="/record/sect1/sect2/sect3/table/tgroup/thead/row/entry/para"
flatten="true" />
<field column="para312" name="text"
xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/list/listItem/para"
flatten="true" />
<field column="para315" name="text"
xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/note/para"
flatten="true" />
<field column="para316" name="text"
xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/para"
flatten="true" />
<field column="para318" name="text"
xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/para" flatten="true" />
<field column="para322" name="text"
xpath="/record/sect1/sect2/table/tgroup/thead/row/entry/para" flatten="true" />
<field column="para341" name="text"
xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/note/para"
flatten="true" />
<field column="para342" name="text"
xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/para"
flatten="true" />
<field column="para344" name="text"
xpath="/record/sect1/table/tgroup/tbody/row/entry/para" flatten="true" />
<field column="para348" name="text"
xpath="/record/sect1/table/tgroup/thead/row/entry/para" flatten="true" />
<field column="para371" name="text"
xpath="/record/table/tgroup/tbody/row/entry/noteGroup/note/para"
flatten="true" />
<field column="para373" name="text"
xpath="/record/table/tgroup/tbody/row/entry/para" flatten="true" />
<field column="para377" name="text"
xpath="/record/table/tgroup/thead/row/entry/para" flatten="true" />
{code}
with
{code}
<field column="text" xpath="//para"
flatten="true"/>
{code}
The indexes seemed equivalent and time to index was also equivalent.
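For readers without the patch to hand, the equivalence claimed above can be illustrated outside Solr. The following is only a minimal sketch using Python's standard library, not the XPathRecordReader implementation: the sample record, the `flatten` and `collect` helpers, and the three explicit paths are all hypothetical, and ElementTree spells `//para` as `.//para` when searching from the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; element names are borrowed from the paths above.
SAMPLE = """<record>
  <para>top <b>level</b></para>
  <sect1>
    <para>in sect1</para>
    <list><listItem><para>in list</para></listItem></list>
  </sect1>
</record>"""


def flatten(element):
    """Concatenate all text under element, roughly what flatten="true" asks for."""
    return "".join(element.itertext())


def collect(root, paths):
    """Return the flattened text of every element matched by any of the paths."""
    found = []
    for path in paths:
        found.extend(flatten(e) for e in root.findall(path))
    return found


root = ET.fromstring(SAMPLE)

# Explicit paths relative to <record>, mirroring the long field list.
explicit = collect(root, ["para", "sect1/para", "sect1/list/listItem/para"])

# One descendant search standing in for xpath="//para".
wildcard = collect(root, [".//para"])

# Both approaches pick up the same set of paragraphs.
print(sorted(explicit) == sorted(wildcard))
```

The sketch only demonstrates the matching behaviour on a toy document; it says nothing about the relative indexing speed of the two configurations in DIH.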
I have one concern that should be addressed before any 1.4 release. I still do
not understand the purpose of the HashSet childrenFound and putNulls. If it is
important, then I suspect that whatever is done to childNodes when an
end_element is parsed also needs to be done to descNodes; but I have a feeling
the whole lot may be unnecessary and can be removed. If it is required, we need
to explain it.
The last change I would like to see, which I am happy to leave to 1.5, involves
making sure emitted records do not contain tags from parent nodes unless they
are stipulated by "commonField".
> DIH: Enhance XPathRecordReader to deal with //tagname and other improvements.
> -----------------------------------------------------------------------------
>
> Key: SOLR-1437
> URL: https://issues.apache.org/jira/browse/SOLR-1437
> Project: Solr
> Issue Type: Improvement
> Components: contrib - DataImportHandler
> Affects Versions: 1.4
> Reporter: Fergus McMenemie
> Assignee: Noble Paul
> Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-1437.patch, SOLR-1437.patch
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> As per
> http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html
> it would be nice to be able to use expressions such as //tagname when
> parsing XML documents.