There are many tests for this in the morphlines repo.

Wolfgang.

On Jul 22, 2013, at 11:43 AM, Flavio Pompermaier wrote:

> Thank you for the great support Wolfgang!
> Flume + Morphlines is undoubtedly an exciting road but it's taking me too
> much time :(
> Do you think you could add some more tests including readJson and the new
> xquery and xslt in trunk?
>
> Best,
> Flavio
>
> On Mon, Jul 22, 2013 at 8:12 PM, Wolfgang Hoschek wrote:
>
> Looks like the DcXMLParser spits out a metadata field called "title" and
> another title as part of the Tika XML stream. That metadata field is then
> added to the solr document by solrcell. If you add "title" to the captures
> the title from the XML stream gets added as well by solrcell.
>
> JSON support has been released in morphlines-0.4.1 (which flume trunk is now
> depending on):
> http://cloudera.github.io/cdk/docs/0.4.1/cdk-morphlines/morphlinesReferenceGuide.html#readJson
>
> Note that Tika XML doesn't really support/capture XPath extraction with
> SolrCell. We have added proper support for reading, extracting and
> transforming XML and HTML with XPath, XQuery and XSLT on the current
> morphlines trunk (not yet released), similar to the way we already support
> JSON and Avro. This should make XML handling a lot more straightforward, and
> make the very limited XML SolrCell approach obsolete. Look for the new
> "xquery" and "xslt" commands in
> https://github.com/cloudera/cdk/blob/master/cdk-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence
>
> Meanwhile, consider using these new commands, or use JSON or Avro, or write
> your own custom morphline commands that extract whatever you want from your
> XML data.
>
> Wolfgang.
>
> On Jul 22, 2013, at 9:18 AM, Flavio Pompermaier wrote:
>
> > Hi to all,
> > I'm trying to understand how to "master" Morphline configuration files in
> > order to put some data into Solr but I'm facing some problems with
> > TestMorphlineSolrSink.
> > This is what I did:
> >
> > 1) Since I want to index the title of the testXML.xml (i.e. "Tika test
> > document"), I commented out all the parsers except
> > org.apache.tika.parser.xml.DcXMLParser (which parses Dublin Core metadata)
> > 2) In schema.xml I added the following field:
> > <field name="title" type="text_en" indexed="true" stored="true"
> > multiValued="false" />
> >
> > But:
> > - If I don't add anything to fmap or capture everything works fine but I
> > don't understand why (who fills that field?). If instead I add title to
> > capture and/or title: title (or dc_title:title) to fmap, Solr complains
> > that 2 values are retrieved for 'title' (debugging the values I see the
> > title and one empty value in the 'title' metadata array...).
> > Thus, the problem is that everything works magically if the field is named
> > title, but if I change its name to something like doc_title there's no way
> > to make it non-multivalued. Am I right? How can I fix this problem?
> > - I'd like to manage JSON files. How can I map JSON fields to Solr fields?
> > Could someone give a simple example?
> >
> > Best,
> > Flavio
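
For reference, the JSON-to-Solr mapping asked about in the original message could look roughly like the morphline below. It is only a sketch, not taken from the morphlines test suite. It assumes that the extractJsonPaths, sanitizeUnknownSolrFields and loadSolr commands are available in the morphlines version in use (readJson is the command linked above). It also assumes that each incoming JSON object has top-level "id" and "title" keys, and that the Solr schema defines single-valued fields named id and doc_title. The JSON keys, the doc_title field name, and the collection and zkHost values are placeholders for illustration.

```
# Hypothetical Solr location; replace with your own collection and ZooKeeper.
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]

    commands : [
      # Parse the attachment body into one record per JSON object.
      { readJson {} }

      # Copy selected JSON values into record fields named after Solr fields.
      # Left side: Solr field name (assumed). Right side: path into the JSON.
      {
        extractJsonPaths {
          flatten : false
          paths : {
            id : /id
            doc_title : /title
          }
        }
      }

      # Drop record fields that the Solr schema does not know about.
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      # Send the record to Solr.
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```

Because each path is written to exactly one output field, a single-valued Solr field such as doc_title receives one value as long as the JSON path itself matches only one value.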
