I put it in the tika entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.
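For reference, a sketch of what that change looks like, assuming the attribute goes directly on the TikaEntityProcessor entity and the rest of the quoted data-config.xml stays as it was:

```xml
<!-- Sketch only: the tika entity from the quoted config with the
     SOLR-4530 attribute added; nothing else is changed. -->
<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.path}${rec.file}" dataSource="dataUrl"
        htmlMapper="identity">
    <field column="text_test" xpath="//div[@id='content']" />
</entity>
```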
On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

> Can you try SOLR-4530 switch:
> https://issues.apache.org/jira/browse/SOLR-4530
>
> Specifically, setting htmlMapper="identity" on the entity definition. This
> will tell Tika to send full HTML rather than a seriously stripped one.
>
> Regards,
>    Alex.
>
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>
>
> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen <a...@conx.ch> wrote:
>
>> I'm trying to index an HTML page and only use the div with the
>> id="content". Unfortunately nothing is working within the tika entity; only
>> the standard text (content) is populated.
>>
>> Do I have to use copyField for test_text to get the data?
>> Or is there a problem with the entity hierarchy?
>> Or is the xpath wrong, even though I've tried it without and just
>> using text?
>> Or should I use the updateextractor?
>>
>> data-config.xml:
>>
>> <dataConfig>
>>     <dataSource type="BinFileDataSource" name="data"/>
>>     <dataSource type="BinURLDataSource" name="dataUrl"/>
>>     <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
>>     <document>
>>         <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>             <field column="title" xpath="//title" />
>>             <field column="id" xpath="//id" />
>>             <field column="file" xpath="//file" />
>>             <field column="path" xpath="//path" />
>>             <field column="url" xpath="//url" />
>>             <field column="Author" xpath="//author" />
>>
>>             <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>                 <!-- <copyField source="text" dest="text_test" /> -->
>>                 <field column="text_test" xpath="//div[@id='content']" />
>>             </entity>
>>         </entity>
>>     </document>
>> </dataConfig>
>>
>> docImporterUrl.xml:
>>
>> <?xml version="1.0" encoding="utf-8"?>
>> <docs>
>>     <doc>
>>         <id>5</id>
>>         <author>tkb</author>
>>         <title>Startseite</title>
>>         <description>blabla ...</description>
>>         <file>http://localhost/tkb/internet/index.cfm</file>
>>         <url>http://localhost/tkb/internet/index.cfm/url</url>
>>         <path2>http\specialConf</path2>
>>     </doc>
>>     <doc>
>>         <id>6</id>
>>         <author>tkb</author>
>>         <title>Eigenheim</title>
>>         <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
>>         <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
>>         <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url</url>
>>     </doc>
>> </docs>