Re: SolrCell help!

Wolfgang Hoschek Tue, 23 Jul 2013 01:23:50 -0700

Tests pass on Java 6 but fail on Java 7. I have filed 
https://issues.cloudera.org/browse/CDK-80 to track this. We'll fix it. 
Meanwhile, please try Java 6.


Wolfgang.

On Jul 23, 2013, at 12:51 AM, Flavio Pompermaier wrote:

> I tried to download the current trunk but it doesn't compile. For example, it 
> hangs on
> https://repository.cloudera.com/artifactory/cloudera-repos/com/twitter/parquet-avro/1.0.0-SNAPSHOT/maven-metadata.xml
> which doesn't exist anymore.
> 
> 
> On Mon, Jul 22, 2013 at 11:14 PM, Flavio Pompermaier <[email protected]> 
> wrote:
> You couldn't be more precise ;)
>  
> Thanks,
> Flavio
> 
> On Mon, Jul 22, 2013 at 11:02 PM, Wolfgang Hoschek <[email protected]> 
> wrote:
> Docs for the xquery and xslt morphline commands are here (look for "xquery"): 
> https://github.com/cloudera/cdk/blob/master/cdk-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence
> 
> Example morphlines for the new xquery and xslt commands are here: 
> https://github.com/cloudera/cdk/tree/master/cdk-morphlines/cdk-morphlines-saxon/src/test/resources/test-morphlines
> 
> Sample input data is here: 
> https://github.com/cloudera/cdk/tree/master/cdk-morphlines/cdk-morphlines-saxon/src/test/resources/test-documents
> 
> Unit tests are here: 
> https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-saxon/src/test/java/com/cloudera/cdk/morphline/saxon/SaxonMorphlineTest.java
> 
> Wolfgang.
> 
> On Jul 22, 2013, at 1:41 PM, Flavio Pompermaier wrote:
> 
> > Ok, I'll try to follow the code! Just one last thing: for morphline-json I 
> > managed to find the tests (in the cdk repository), but for the new xslt and 
> > xquery commands I can't find the test code. Could you give me a pointer?
> >
> > On Mon, Jul 22, 2013 at 9:21 PM, Wolfgang Hoschek <[email protected]> 
> > wrote:
> > There are many tests for this in the morphlines repo.
> >
> > Wolfgang.
> >
> > On Jul 22, 2013, at 11:43 AM, Flavio Pompermaier wrote:
> >
> > >
> > > Thank you for the great support Wolfgang!
> > > Flume + Morphlines is undoubtedly an exciting road but it's taking me too 
> > > much time :(
> > > Do you think you could add some more tests including readJson and the new 
> > > xquery and xslt in trunk?
> > >
> > > Best,
> > > Flavio
> > > On Mon, Jul 22, 2013 at 8:12 PM, Wolfgang Hoschek <[email protected]> 
> > > wrote:
> > > Looks like the DcXMLParser emits a metadata field called "title" and 
> > > another title as part of the Tika XML stream. That metadata field is then 
> > > added to the Solr document by SolrCell. If you add "title" to the 
> > > captures, the title from the XML stream gets added by SolrCell as well.
> > >
> > > JSON support has been released in morphlines-0.4.1 (which flume trunk is 
> > > now depending on): 
> > > http://cloudera.github.io/cdk/docs/0.4.1/cdk-morphlines/morphlinesReferenceGuide.html#readJson
> > >
> > > Note that Tika XML doesn't really support/capture XPath extraction with 
> > > SolrCell. We have added proper support for reading, extracting and 
> > > transforming XML and HTML with XPath, XQuery and XSLT on the current 
> > > morphlines trunk (not yet released), similar to the way we already 
> > > support JSON and Avro. This should make XML handling a lot more 
> > > straightforward, and make the very limited XML SolrCell approach 
> > > obsolete. Look for the new "xquery" and "xslt" commands in 
> > > https://github.com/cloudera/cdk/blob/master/cdk-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence
> > >
> > > Meanwhile, consider using these new commands, use JSON or Avro, or 
> > > write your own custom morphline commands that extract whatever you want 
> > > from your XML data.
> > >
> > > Wolfgang.
> > >
> > > On Jul 22, 2013, at 9:18 AM, Flavio Pompermaier wrote:
> > >
> > > > Hi to all,
> > > > I'm trying to understand how to "master" Morphline configuration files 
> > > > in order to put some data into Solr but I'm facing some problems with 
> > > > TestMorphlineSolrSink. This is what I have done:
> > > >
> > > > 1) Since I want to index the title of testXML.xml (i.e. "Tika test 
> > > > document"), I commented out all the parsers except 
> > > > org.apache.tika.parser.xml.DcXMLParser (which parses Dublin Core 
> > > > metadata)
> > > > 2) In schema.xml I added the following field:
> > > >     <field name="title" type="text_en" indexed="true" stored="true" 
> > > > multiValued="false" />
> > > >
> > > > But:
> > > >  - If I don't add anything to fmap or capture, everything works fine but 
> > > > I don't understand why (who fills that field?). If instead I add title 
> > > > to capture and/or title:title (or dc_title:title) to fmap, Solr 
> > > > complains that 2 values are retrieved for 'title' (debugging the values 
> > > > I see the title and one empty value in the 'title' metadata array...).
> > > > Thus, the problem is that everything works magically if the field is 
> > > > named title, but if I change its name to something like doc_title 
> > > > there's no way to make it non-multivalued.  Am I right? How can I fix 
> > > > this problem?
> > > > - I'd like to manage JSON files. How can I map JSON fields to Solr 
> > > > fields? Could someone give a simple example?
> > > >
> > > > Best,
> > > > Flavio
> > >
> > >
> > >
> >
> 
> 
> 
> 
>
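
The duplicate-title behaviour discussed in the thread can be pictured with a small toy model. This is only a sketch of the explanation given above (a metadata "title" is always added, and the XML stream's "title" is added as well once it is listed in the captures). It is not SolrCell or Tika code: the function names, the dictionary layout and the way fmap is applied are all assumptions made for illustration.

```python
# Toy model of the behaviour described in the thread. NOT SolrCell code:
# every function and parameter name below is hypothetical.

def build_document(metadata, xml_elements, captures, fmap=None):
    """Collect field values: metadata fields are always added, XML stream
    elements are added only when their name is listed in `captures`.
    `fmap` renames a source field to a target Solr field."""
    fmap = fmap or {}
    doc = {}
    for name, value in metadata.items():
        doc.setdefault(fmap.get(name, name), []).append(value)
    for name, value in xml_elements:
        if name in captures:
            doc.setdefault(fmap.get(name, name), []).append(value)
    return doc


def multivalued_violations(doc, single_valued_fields):
    """Return the single-valued fields that ended up with more than one value."""
    return sorted(f for f in single_valued_fields if len(doc.get(f, [])) > 1)


# Values as reported in the thread: one real title plus one empty value.
metadata = {"title": "Tika test document"}
xml_elements = [("title", "")]

no_capture = build_document(metadata, xml_elements, captures=set())
with_capture = build_document(metadata, xml_elements, captures={"title"})
renamed = build_document(
    metadata, xml_elements, captures={"title"}, fmap={"title": "doc_title"}
)

print(multivalued_violations(no_capture, {"title"}))
print(multivalued_violations(with_capture, {"title"}))
print(multivalued_violations(renamed, {"doc_title"}))
```

In this model only the run without captures stays single-valued. Renaming the field through fmap moves both values to the new name, which matches the observation that doc_title could not be made non-multivalued.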
