ok but i'm not doing any path extraction, at least i don't think so.
htmlMapper=identity isn't preserving the html.
it's reading the content of the pages but it's not putting it into both
text_test and text. it's only in text_test, so the copyField isn't working.
data-config.xml:
dataConfig
i changed the following line (xpath):
<field column="text" xpath="//div[@id='content']" name="text_test" />
On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote:
Ah. That's because Tika processor does not support path extraction. You
need to nest one more level.
Regards,
Alex
On 22 Aug
i'm trying to index a html page and only use the div with the id=content.
unfortunately nothing is working within the tika-entity, only the standard text
(content) is populated.
do i have to use copyField for text_test to get the data?
or is there a problem with the
Can you try SOLR-4530 switch:
https://issues.apache.org/jira/browse/SOLR-4530
Specifically, setting htmlMapper=identity on the entity definition. This
will tell Tika to send full HTML rather than a seriously stripped one.
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
i put it in the tika-entity as an attribute, but it doesn't change anything. my
bigger concern is why text_test isn't populated at all.
On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
Can you try SOLR-4530 switch:
https://issues.apache.org/jira/browse/SOLR-4530
Specifically, setting
i can do it like this but then the content isn't copied to text. it's just in
text_test
<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.path}${rec.file}" dataSource="dataUrl">
  <field column="text" name="text_test" />
  <copyField source="text_test" dest="text" />
</entity>
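A note on the copyField line inside the entity above: copyField is a schema.xml directive rather than a data-config.xml element, which may be why nothing reaches text. A minimal sketch of where it would normally live, assuming both fields are declared in schema.xml under these names (the field types and attributes shown are assumptions, not taken from the actual schema):

```xml
<!-- schema.xml (sketch); the text_general type and the attributes are assumptions -->
<field name="text_test" type="text_general" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true" />

<!-- Copies the raw incoming value of text_test into text at index time -->
<copyField source="text_test" dest="text" />
```

With this in place, the entity only needs to populate text_test; the copy happens on the Solr side when the document is indexed.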
On 22. Aug
Ah. That's because Tika processor does not support path extraction. You
need to nest one more level.
Regards,
Alex
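One way "nesting one more level" might look in data-config.xml: the Tika entity fetches the page and keeps the markup, and a child XPathEntityProcessor reads that markup from the parent's column and pulls out the div. This is a sketch under several assumptions: the data source name "fld" and the entity name "detail" are hypothetical, format="html" is assumed to be needed alongside htmlMapper="identity", and the DIH XPath processor only supports a subset of XPath, so the expression may need adjusting.

```xml
<!-- Assumption: declared next to the existing dataUrl source; the name "fld" is hypothetical -->
<dataSource name="fld" type="FieldReaderDataSource" />

<!-- Outer level: Tika fetches the page and passes the markup through -->
<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.path}${rec.file}" dataSource="dataUrl"
        htmlMapper="identity" format="html">
  <field column="text" />

  <!-- Inner level: the XPath processor reads the tika.text column and extracts the div -->
  <entity name="detail" processor="XPathEntityProcessor"
          dataSource="fld" dataField="tika.text" forEach="/html">
    <field column="text_test" xpath="//div[@id='content']" />
  </entity>
</entity>
```

The fragment still belongs inside the existing outer entity that provides ${rec.path} and ${rec.file}; only the Tika level and the new child level are shown.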
On 22 Aug 2013 13:34, Andreas Owen a...@conx.ch wrote:
i can do it like this but then the content isn't copied to text. it's just
in text_test
entity name=tika