Ah. That's because the Tika processor does not support XPath extraction. You
need to nest one more level: an XPath entity inside the Tika entity.
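
Roughly what I have in mind, as an untested sketch rather than a drop-in
config. The data source name "fld", the inner entity name "content" and the
flatten="true" setting are my assumptions; the rest is taken from your
data-config.xml. The Tika entity only fetches the page and keeps the markup,
and the inner XPathEntityProcessor reads that markup from the Tika entity's
text column via a FieldReaderDataSource.

```xml
<dataConfig>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource"
              baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
  <!-- Assumption: "fld" is just an illustrative name for this data source -->
  <dataSource type="FieldReaderDataSource" name="fld"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor"
            url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="id" xpath="//id" />
      <field column="file" xpath="//file" />
      <field column="path" xpath="//path" />

      <!-- Level 1: Tika fetches the page; format="html" plus
           htmlMapper="identity" (SOLR-4530) should keep the div markup -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${rec.path}${rec.file}" dataSource="dataUrl"
              format="html" htmlMapper="identity">
        <field column="text" />

        <!-- Level 2: XPath runs over the HTML held in tika.text.
             The entity name "content" and flatten="true" are assumptions. -->
        <entity name="content" processor="XPathEntityProcessor"
                dataSource="fld" dataField="tika.text" forEach="/html">
          <field column="text_test" xpath="//div[@id='content']"
                 flatten="true" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Keep in mind that XPathEntityProcessor only understands a subset of XPath, so
the expression may need adjusting. Also, copyField is a schema.xml directive,
not something DIH reads from data-config.xml, so the copy from text_test to
text would have to be declared there.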

Regards,
      Alex
On 22 Aug 2013 13:34, "Andreas Owen" <a...@conx.ch> wrote:

> i can do it like this but then the content isn't copied to text. it's just
> in text_test
>
> <entity name="tika" processor="TikaEntityProcessor"
> url="${rec.path}${rec.file}" dataSource="dataUrl" >
>         <field column="text" name="text_test" />
>         <copyField source="text_test" dest="text" />
> </entity>
>
>
> On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:
>
> > i put it in the tika-entity as an attribute, but it doesn't change
> > anything. my bigger concern is why text_test isn't populated at all
> >
> > On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
> >
> >> Can you try SOLR-4530 switch:
> >> https://issues.apache.org/jira/browse/SOLR-4530
> >>
> >> Specifically, setting htmlMapper="identity" on the entity definition.
> >> This will tell Tika to send full HTML rather than a seriously stripped
> >> one.
> >>
> >> Regards,
> >> Alex.
> >>
> >> Personal website: http://www.outerthoughts.com/
> >> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> >> - Time is the quality of nature that keeps events from happening all at
> >> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> >> book)
> >>
> >>
> >> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen <a...@conx.ch> wrote:
> >>
> >>> i'm trying to index an html page and only use the div with the
> >>> id="content". unfortunately nothing is working within the tika-entity,
> >>> only the standard text (content) is populated.
> >>>
> >>>       do i have to use copyField for text_test to get the data?
> >>>       or is there a problem with the entity hierarchy?
> >>>       or is the xpath wrong, even though i've tried it without and just
> >>> using text?
> >>>       or should i use the updateextractor?
> >>>
> >>> data-config.xml:
> >>>
> >>> <dataConfig>
> >>>       <dataSource type="BinFileDataSource" name="data"/>
> >>>       <dataSource type="BinURLDataSource" name="dataUrl"/>
> >>>       <dataSource type="URLDataSource"
> >>> baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
> >>> <document>
> >>>       <entity name="rec" processor="XPathEntityProcessor"
> >>> url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
> >>>               <field column="title" xpath="//title" />
> >>>               <field column="id" xpath="//id" />
> >>>               <field column="file" xpath="//file" />
> >>>               <field column="path" xpath="//path" />
> >>>               <field column="url" xpath="//url" />
> >>>               <field column="Author" xpath="//author" />
> >>>
> >>>               <entity name="tika" processor="TikaEntityProcessor"
> >>> url="${rec.path}${rec.file}" dataSource="dataUrl" >
> >>>                       <!-- <copyField source="text" dest="text_test" />
> >>> -->
> >>>                       <field column="text_test"
> >>> xpath="//div[@id='content']" />
> >>>               </entity>
> >>>       </entity>
> >>> </document>
> >>> </dataConfig>
> >>>
> >>> docImporterUrl.xml:
> >>>
> >>> <?xml version="1.0" encoding="utf-8"?>
> >>> <docs>
> >>> <doc>
> >>>               <id>5</id>
> >>>               <author>tkb</author>
> >>>               <title>Startseite</title>
> >>>               <description>blabla ...</description>
> >>>               <file>http://localhost/tkb/internet/index.cfm</file>
> >>>               <url>http://localhost/tkb/internet/index.cfm/url</url>
> >>>               <path2>http\specialConf</path2>
> >>>       </doc>
> >>>       <doc>
> >>>               <id>6</id>
> >>>               <author>tkb</author>
> >>>               <title>Eigenheim</title>
> >>>               <description>Machen Sie sich erste Gedanken über den
> >>> Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar
> >>> ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um
> >>> den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in
> >>> finanzieller Hinsicht gelingt.</description>
> >>>               <file>
> >>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
> >>>               <url>
> >>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url</url>
> >>>       </doc>
> >>> </docs>
>
>
