i put it in the tika-entity as attribute, but it doesn't change anything. my 
bigger concern is why text_test isn't populated at all

On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

> Can you try SOLR-4530 switch:
> https://issues.apache.org/jira/browse/SOLR-4530
> 
> Specifically, setting htmlMapper="identity" on the entity definition. This
> will tell Tika to send full HTML rather than a seriously stripped one.
> 
> Regards,
> Alex.
> 
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
> 
> 
> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen <a...@conx.ch> wrote:
> 
>> i'm trying to index a html page and only user the div with the
>> id="content". unfortunately nothing is working within the tika-entity, only
>> the standard text (content) is populated.
>> 
>>        do i have to use copyField for test_text to get the data?
>>        or is there a problem with the entity-hirarchy?
>>        or is the xpath wrong, even though i've tried it without and just
>> using text?
>>        or should i use the updateextractor?
>> 
>> data-config.xml:
>> 
>> <dataConfig>
>>        <dataSource type="BinFileDataSource" name="data"/>
>>        <dataSource type="BinURLDataSource" name="dataUrl"/>
>>        <dataSource type="URLDataSource" baseUrl="
>> http://127.0.0.1/tkb/internet/"; name="main"/>
>> <document>
>>        <entity name="rec" processor="XPathEntityProcessor"
>> url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>                <field column="title" xpath="//title" />
>>                <field column="id" xpath="//id" />
>>                <field column="file" xpath="//file" />
>>                <field column="path" xpath="//path" />
>>                <field column="url" xpath="//url" />
>>                <field column="Author" xpath="//author" />
>> 
>>                <entity name="tika" processor="TikaEntityProcessor"
>> url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>                        <!-- <copyField source="text" dest="text_test" />
>> -->
>>                        <field column="text_test"
>> xpath="//div[@id='content']" />
>>                </entity>
>>        </entity>
>> </document>
>> </dataConfig>
>> 
>> docImporterUrl.xml:
>> 
>> <?xml version="1.0" encoding="utf-8"?>
>> <docs>
>> <doc>
>>                <id>5</id>
>>                <author>tkb</author>
>>                <title>Startseite</title>
>>                <description>blabla ...</description>
>>                <file>http://localhost/tkb/internet/index.cfm</file>
>>                <url>http://localhost/tkb/internet/index.cfm/url</url>
>>                <path2>http\specialConf</path2>
>>        </doc>
>>        <doc>
>>                <id>6</id>
>>                <author>tkb</author>
>>                <title>Eigenheim</title>
>>                <description>Machen Sie sich erste Gedanken über den
>> Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein
>> spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den
>> Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller
>> Hinsicht gelingt.</description>
>>                <file>
>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
>>                <url>
>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url</url>
>>        </doc>
>> </docs>

Reply via email to