It looks like DcXMLParser emits a metadata field called "title" and also a 
second title as part of the Tika XML stream. solrcell adds that metadata field 
to the Solr document. If you also add "title" to the captures, solrcell adds 
the title from the XML stream as well, so the field ends up with two values.

JSON support has been released in morphlines-0.4.1 (which Flume trunk now 
depends on): 
http://cloudera.github.io/cdk/docs/0.4.1/cdk-morphlines/morphlinesReferenceGuide.html#readJson
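
As a rough sketch of how JSON fields might be mapped to Solr fields, assuming 
the extractJsonPaths command is available alongside readJson in your 
morphlines version (please check the reference guide above), and assuming 
hypothetical input such as {"docId": "123", "title": "Tika test document"} 
and a SOLR_LOCATOR variable defined elsewhere in the same config file:

```
morphlines : [
  {
    id : morphline1
    # package names are an assumption for the CDK-era releases
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]

    commands : [
      # parse each incoming JSON object into a record
      { readJson {} }

      # copy values from the JSON tree into record fields;
      # left side = Solr field name, right side = path into the JSON
      # (docId and title are hypothetical names for illustration)
      {
        extractJsonPaths {
          flatten : false
          paths : {
            id : /docId
            title : /title
          }
        }
      }

      # drop record fields that are not declared in schema.xml
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      # send the record to Solr
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```

Each path here yields one value per record, so a field declared with 
multiValued="false" should receive a single value as long as the JSON holds a 
scalar at that path.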

Note that Tika's XML handling in SolrCell doesn't really support XPath 
extraction. We have added proper support for reading, extracting and 
transforming XML and HTML with XPath, XQuery and XSLT on the current 
morphlines trunk (not yet released), similar to the way we already support 
JSON and Avro. This should make XML handling a lot more straightforward, and 
make the very limited SolrCell XML approach obsolete. Look for the new 
"xquery" and "xslt" commands in 
https://github.com/cloudera/cdk/blob/master/cdk-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence

Meanwhile, consider using these new commands, or use JSON or Avro, or write 
your own custom morphline commands that extract whatever you want from your XML 
data.

Wolfgang.

On Jul 22, 2013, at 9:18 AM, Flavio Pompermaier wrote:

> Hi to all,
> I'm trying to understand how to "master" Morphline configuration files in 
> order to put some data into Solr, but I'm facing some problems with 
> TestMorphlineSolrSink. This is what I did:
> 
> 1) Since I want to index the title of testXML.xml (i.e. "Tika test 
> document"), I commented out all the parsers except 
> org.apache.tika.parser.xml.DcXMLParser (which parses Dublin Core metadata)
> 2) In schema.xml I added the following field:
>     <field name="title" type="text_en" indexed="true" stored="true" 
> multiValued="false" />
> 
> But:
>  - If I don't add anything to fmap or capture, everything works fine, but I 
> don't understand why (who fills that field?). If instead I add title to 
> capture and/or title: title (or dc_title:title) to fmap, Solr complains that 
> 2 values are retrieved for 'title' (debugging the values, I see the title and 
> one empty value in the 'title' metadata array...).
> Thus, the problem is that everything works magically if the field is named 
> title, but if I change its name to something like doc_title there's no way to 
> make it non-multivalued.  Am I right? How can I fix this problem?
> - I'd like to manage JSON files. How can I map JSON fields to Solr fields? 
> Could someone give a simple example?
> 
> Best,
> Flavio
