Ah. That's because TikaEntityProcessor does not support XPath extraction on its fields. You need to nest one more entity level inside the Tika entity and do the XPath matching there.
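A rough, untested sketch of what the extra level could look like, assuming you let Tika emit markup (format="html" together with htmlMapper="identity") and then re-parse its text column with an inner XPathEntityProcessor reading through a FieldReaderDataSource. The names "fld" and "content" are made up for illustration, and forEach="/html" assumes the Tika output has an html root element:

```xml
<!-- Assumption: declared next to your other dataSource elements.
     "fld" is a hypothetical name. -->
<dataSource type="FieldReaderDataSource" name="fld"/>

<!-- Goes inside the existing "rec" entity, replacing the current tika entity. -->
<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.path}${rec.file}" dataSource="dataUrl"
        format="html" htmlMapper="identity">
    <!-- Tika's output lands in the "text" column -->
    <field column="text" name="text"/>

    <!-- Extra nesting level: parse tika.text as XML and apply the XPath here.
         "content" is a hypothetical entity name. -->
    <entity name="content" processor="XPathEntityProcessor"
            dataSource="fld" dataField="tika.text" forEach="/html">
        <!-- flatten="true" collects the text of all child nodes of the div -->
        <field column="text_test" xpath="//div[@id='content']" flatten="true"/>
    </entity>
</entity>
```

XPathEntityProcessor only understands a subset of XPath, so the expression may need to be rewritten as a full path from the root. Also, copyField is a schema.xml directive, so it will have no effect inside data-config.xml.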
Regards,
   Alex

On 22 Aug 2013 13:34, "Andreas Owen" <a...@conx.ch> wrote:

> i can do it like this but then the content isn't copied to text. it's just
> in text_test
>
> <entity name="tika" processor="TikaEntityProcessor"
>         url="${rec.path}${rec.file}" dataSource="dataUrl" >
>     <field column="text" name="text_test">
>     <copyField source="text_test" dest="text" />
> </entity>
>
> On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:
>
> > i put it in the tika-entity as attribute, but it doesn't change anything.
> > my bigger concern is why text_test isn't populated at all
> >
> > On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
> >
> >> Can you try SOLR-4530 switch:
> >> https://issues.apache.org/jira/browse/SOLR-4530
> >>
> >> Specifically, setting htmlMapper="identity" on the entity definition. This
> >> will tell Tika to send full HTML rather than a seriously stripped one.
> >>
> >> Regards,
> >>    Alex.
> >>
> >> Personal website: http://www.outerthoughts.com/
> >> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> >> - Time is the quality of nature that keeps events from happening all at
> >> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
> >>
> >> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen <a...@conx.ch> wrote:
> >>
> >>> i'm trying to index a html page and only user the div with the
> >>> id="content". unfortunately nothing is working within the tika-entity,
> >>> only the standard text (content) is populated.
> >>>
> >>> do i have to use copyField for test_text to get the data?
> >>> or is there a problem with the entity-hirarchy?
> >>> or is the xpath wrong, even though i've tried it without and just
> >>> using text?
> >>> or should i use the updateextractor?
> >>>
> >>> data-config.xml:
> >>>
> >>> <dataConfig>
> >>>     <dataSource type="BinFileDataSource" name="data"/>
> >>>     <dataSource type="BinURLDataSource" name="dataUrl"/>
> >>>     <dataSource type="URLDataSource"
> >>>                 baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
> >>>     <document>
> >>>         <entity name="rec" processor="XPathEntityProcessor"
> >>>                 url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
> >>>             <field column="title" xpath="//title" />
> >>>             <field column="id" xpath="//id" />
> >>>             <field column="file" xpath="//file" />
> >>>             <field column="path" xpath="//path" />
> >>>             <field column="url" xpath="//url" />
> >>>             <field column="Author" xpath="//author" />
> >>>
> >>>             <entity name="tika" processor="TikaEntityProcessor"
> >>>                     url="${rec.path}${rec.file}" dataSource="dataUrl" >
> >>>                 <!-- <copyField source="text" dest="text_test" /> -->
> >>>                 <field column="text_test" xpath="//div[@id='content']" />
> >>>             </entity>
> >>>         </entity>
> >>>     </document>
> >>> </dataConfig>
> >>>
> >>> docImporterUrl.xml:
> >>>
> >>> <?xml version="1.0" encoding="utf-8"?>
> >>> <docs>
> >>>     <doc>
> >>>         <id>5</id>
> >>>         <author>tkb</author>
> >>>         <title>Startseite</title>
> >>>         <description>blabla ...</description>
> >>>         <file>http://localhost/tkb/internet/index.cfm</file>
> >>>         <url>http://localhost/tkb/internet/index.cfm/url</url>
> >>>         <path2>http\specialConf</path2>
> >>>     </doc>
> >>>     <doc>
> >>>         <id>6</id>
> >>>         <author>tkb</author>
> >>>         <title>Eigenheim</title>
> >>>         <description>Machen Sie sich erste Gedanken über den
> >>> Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein
> >>> spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den
> >>> Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller
> >>> Hinsicht gelingt.</description>
> >>>         <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
> >>>         <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url</url>
> >>>     </doc>
> >>> </docs>