OK, but I'm not doing any path extraction, at least I don't think so. htmlMapper="identity" isn't preserving the HTML.
It's reading the content of the pages, but it's not putting it into both "text_test" and "text". It's only in "text_test"; the copyField isn't working.

data-config.xml:

<dataConfig>
    <dataSource type="BinFileDataSource" name="data"/>
    <dataSource type="BinURLDataSource" name="dataUrl"/>
    <dataSource type="URLDataSource" name="main"/>
    <document>
        <entity name="rec" processor="XPathEntityProcessor" url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main">
            <field column="title" xpath="//title" />
            <field column="id" xpath="//id" />
            <field column="file" xpath="//file" />
            <field column="path" xpath="//path" />
            <field column="url" xpath="//url" />
            <field column="Author" xpath="//author" />
            <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity" >
                <field column="text" name="text_test" />
                <copyField source="text_test" dest="text" />
                <!-- <field column="text_test" xpath="//div[@id='content']" /> -->
            </entity>
        </entity>
    </document>
</dataConfig>

On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote:

> Ah. That's because Tika processor does not support path extraction. You
> need to nest one more level.
>
> Regards,
> Alex
> On 22 Aug 2013 13:34, "Andreas Owen" <a...@conx.ch> wrote:
>
>> i can do it like this but then the content isn't copied to text. it's just
>> in text_test
>>
>> <entity name="tika" processor="TikaEntityProcessor"
>>     url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>     <field column="text" name="text_test" />
>>     <copyField source="text_test" dest="text" />
>> </entity>
>>
>>
>> On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:
>>
>>> i put it in the tika-entity as attribute, but it doesn't change anything.
>>> my bigger concern is why text_test isn't populated at all
>>>
>>> On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
>>>
>>>> Can you try SOLR-4530 switch:
>>>> https://issues.apache.org/jira/browse/SOLR-4530
>>>>
>>>> Specifically, setting htmlMapper="identity" on the entity definition. This
>>>> will tell Tika to send full HTML rather than a seriously stripped one.
>>>>
>>>> Regards,
>>>> Alex.
>>>>
>>>> Personal website: http://www.outerthoughts.com/
>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>> - Time is the quality of nature that keeps events from happening all at
>>>> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>>>>
>>>>
>>>> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen <a...@conx.ch> wrote:
>>>>
>>>>> i'm trying to index a html page and only use the div with the
>>>>> id="content". unfortunately nothing is working within the tika-entity, only
>>>>> the standard text (content) is populated.
>>>>>
>>>>> do i have to use copyField for text_test to get the data?
>>>>> or is there a problem with the entity hierarchy?
>>>>> or is the xpath wrong, even though i've tried it without and just
>>>>> using text?
>>>>> or should i use the updateextractor?
>>>>>
>>>>> data-config.xml:
>>>>>
>>>>> <dataConfig>
>>>>>     <dataSource type="BinFileDataSource" name="data"/>
>>>>>     <dataSource type="BinURLDataSource" name="dataUrl"/>
>>>>>     <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
>>>>>     <document>
>>>>>         <entity name="rec" processor="XPathEntityProcessor"
>>>>>             url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>>>>             <field column="title" xpath="//title" />
>>>>>             <field column="id" xpath="//id" />
>>>>>             <field column="file" xpath="//file" />
>>>>>             <field column="path" xpath="//path" />
>>>>>             <field column="url" xpath="//url" />
>>>>>             <field column="Author" xpath="//author" />
>>>>>
>>>>>             <entity name="tika" processor="TikaEntityProcessor"
>>>>>                 url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>>>>                 <!-- <copyField source="text" dest="text_test" /> -->
>>>>>                 <field column="text_test" xpath="//div[@id='content']" />
>>>>>             </entity>
>>>>>         </entity>
>>>>>     </document>
>>>>> </dataConfig>
>>>>>
>>>>> docImporterUrl.xml:
>>>>>
>>>>> <?xml version="1.0" encoding="utf-8"?>
>>>>> <docs>
>>>>>     <doc>
>>>>>         <id>5</id>
>>>>>         <author>tkb</author>
>>>>>         <title>Startseite</title>
>>>>>         <description>blabla ...</description>
>>>>>         <file>http://localhost/tkb/internet/index.cfm</file>
>>>>>         <url>http://localhost/tkb/internet/index.cfm/url</url>
>>>>>         <path2>http\specialConf</path2>
>>>>>     </doc>
>>>>>     <doc>
>>>>>         <id>6</id>
>>>>>         <author>tkb</author>
>>>>>         <title>Eigenheim</title>
>>>>>         <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
>>>>>         <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
>>>>>         <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url</url>
>>>>>     </doc>
>>>>> </docs>
>>
>>
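
On the copyField question in this thread: as far as I know, copyField is a schema.xml directive, and the DataImportHandler does not act on a <copyField> element placed inside data-config.xml. That would explain why only "text_test" gets populated. Below is a minimal sketch of the schema.xml side. The text_general type name (borrowed from the stock Solr example schema) and the indexed/stored/multiValued attributes are assumptions, not taken from the thread.

```xml
<!-- schema.xml (sketch): the field type and attributes below are assumptions -->
<field name="text_test" type="text_general" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- copyField is declared at the schema level, next to the field definitions,
     not inside an <entity> in data-config.xml -->
<copyField source="text_test" dest="text"/>
```

With something like this in place, the tika entity in data-config.xml would only need <field column="text" name="text_test" />, and the copy into "text" would happen at index time for every document that carries a "text_test" value.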