Re: SolrCell help!

Flavio Pompermaier Tue, 23 Jul 2013 00:53:00 -0700

I tried to download the current trunk, but it doesn't compile. For example,
it hangs on
https://repository.cloudera.com/artifactory/cloudera-repos/com/twitter/parquet-avro/1.0.0-SNAPSHOT/maven-metadata.xml
which doesn't exist anymore.



On Mon, Jul 22, 2013 at 11:14 PM, Flavio Pompermaier
<[email protected]> wrote:

> You couldn't be more precise ;)
>
> Thanks,
> Flavio
>
> On Mon, Jul 22, 2013 at 11:02 PM, Wolfgang Hoschek 
> <[email protected]> wrote:
>
>> Docs for the xquery and xslt morphline commands are here (look for
>> "xquery"):
>> https://github.com/cloudera/cdk/blob/master/cdk-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence
>>
>> Example morphlines for the new xquery and xslt commands are here:
>> https://github.com/cloudera/cdk/tree/master/cdk-morphlines/cdk-morphlines-saxon/src/test/resources/test-morphlines
>>
>> Sample input data is here:
>> https://github.com/cloudera/cdk/tree/master/cdk-morphlines/cdk-morphlines-saxon/src/test/resources/test-documents
>>
>> Unit tests are here:
>> https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-saxon/src/test/java/com/cloudera/cdk/morphline/saxon/SaxonMorphlineTest.java
>>
>> Wolfgang.
>>
>> On Jul 22, 2013, at 1:41 PM, Flavio Pompermaier wrote:
>>
>> > Ok, I'll try to follow the code! Just one last thing: for morphine-neon
>> I managed to find the tests (in the cdk repository), but for the new xslt and
>> xquery I'm not able to find the test code. Could you give me a pointer?
>> >
>> > On Mon, Jul 22, 2013 at 9:21 PM, Wolfgang Hoschek <
>> [email protected]> wrote:
>> > There are many tests for this in the morphlines repo.
>> >
>> > Wolfgang.
>> >
>> > On Jul 22, 2013, at 11:43 AM, Flavio Pompermaier wrote:
>> >
>> > >
>> > > Thank you for the great support Wolfgang!
>> > > Flume + Morphlines is undoubtedly an exciting road, but it's taking me
>> too much time :(
>> > > Do you think you could add some more tests including readJson and the
>> new xquery and xslt in trunk?
>> > >
>> > > Best,
>> > > Flavio
>> > > On Mon, Jul 22, 2013 at 8:12 PM, Wolfgang Hoschek <
>> [email protected]> wrote:
>> > > Looks like the DcXMLParser spits out a metadata field called "title"
>> and another title as part of the Tika XML stream. That metadata field is
>> then added to the Solr document by SolrCell. If you add "title" to the
>> captures, the title from the XML stream gets added as well by SolrCell.
>> > >
>> > > JSON support has been released in morphlines-0.4.1 (which flume trunk
>> is now depending on):
>> http://cloudera.github.io/cdk/docs/0.4.1/cdk-morphlines/morphlinesReferenceGuide.html#readJson
>> > >
>> > > Note that Tika XML doesn't really support/capture XPath extraction
>> with SolrCell. We have added proper support for reading, extracting and
>> transforming XML and HTML with XPath, XQuery and XSLT on the current
>> morphlines trunk (not yet released), similar to the way we already support
>> JSON and Avro. This should make XML handling a lot more straightforward,
>> and make the very limited XML SolrCell approach obsolete. Look for the new
>> "xquery" and "xslt" commands in
>> https://github.com/cloudera/cdk/blob/master/cdk-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence
>> > >
>> > > Meanwhile, consider using these new commands, or use JSON or Avro, or
>> write your own custom morphline commands that extract whatever you want
>> from your XML data.
>> > >
>> > > Wolfgang.
>> > >
>> > > On Jul 22, 2013, at 9:18 AM, Flavio Pompermaier wrote:
>> > >
>> > > > Hi to all,
>> > > > I'm trying to understand how to "master" Morphline configuration
>> files in order to put some data into Solr, but I'm facing some problems with
>> TestMorphlineSolrSink. This is what I did:
>> > > >
>> > > > 1) Since I want to index the title of testXML.xml (i.e. "Tika
>> test document"), I commented out all the parsers except
>> org.apache.tika.parser.xml.DcXMLParser (which parses Dublin Core metadata)
>> > > > 2) In schema.xml I added the following field:
>> > > >     <field name="title" type="text_en" indexed="true" stored="true"
>> multiValued="false" />
>> > > >
>> > > > But:
>> > > >  - If I don't add anything to fmap or capture, everything works fine,
>> but I don't understand why (who fills that field?). If instead I add title
>> to capture and/or title:title (or dc_title:title) to fmap, Solr
>> complains that 2 values are retrieved for 'title' (debugging the values I
>> see the title and one empty value in the 'title' metadata array...).
>> > > > Thus, the problem is that everything works magically if the field
>> is named title, but if I change its name to something like doc_title
>> there's no way to make it non-multivalued.  Am I right? How can I fix this
>> problem?
>> > > > - I'd like to manage JSON files. How can I map JSON fields to Solr
>> fields? Could someone give a simple example?
>> > > >
>> > > > Best,
>> > > > Flavio
>> > >
>> > >
>> > >
>> >
>>
>>
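
The original question asks how JSON fields can be mapped to Solr fields. The idea can be sketched in plain Python: parse the JSON, then copy selected paths into a flat document keyed by Solr field name. This is only an illustration of the mapping concept, not the morphline API or its configuration syntax; the function name extract_json_paths, the slash-separated path notation, the field names, and the sample record are all hypothetical. For the real thing, see the readJson section of the reference guide linked above.

```python
import json


def extract_json_paths(record, paths):
    """Copy selected values from a parsed JSON object into a flat dict.

    `paths` maps a target (Solr-style) field name to a slash-separated
    path into the JSON object, e.g. "/user/screen_name".
    Paths that are missing, or whose value is JSON null, are skipped.
    """
    doc = {}
    for field, path in paths.items():
        node = record
        for key in path.strip("/").split("/"):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = None  # path does not exist in this record
                break
        if node is not None:
            doc[field] = node
    return doc


# Hypothetical input record and field mapping, for illustration only.
raw = '{"id": "42", "user": {"screen_name": "flavio"}, "text": "hello"}'
paths = {
    "id": "/id",
    "user_screen_name": "/user/screen_name",
    "text": "/text",
    "lang": "/lang",  # not present in the record, so it is skipped
}

solr_doc = extract_json_paths(json.loads(raw), paths)
print(solr_doc)
```

Each target field receives at most one value here, which is what a field declared with multiValued="false" in schema.xml expects.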
