Re: dataimporter tika fields empty

2013-08-23 Thread Andreas Owen
ok, but i'm not doing any path extraction, at least i don't think so. htmlMapper=identity isn't preserving the html; it's reading the content of the pages, but it's not putting it into both text_test and text. it's only in text_test, so the copyField isn't working. data-config.xml: dataConfig

Re: dataimporter tika fields empty

2013-08-23 Thread Andreas Owen
i changed the following line (xpath): <field column="text" xpath="//div[@id='content']" name="text_test" /> On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote: Ah. That's because Tika processor does not support path extraction. You need to nest one more level. Regards, Alex On 22 Aug

dataimporter tika fields empty

2013-08-22 Thread Andreas Owen
i'm trying to index an html page and only use the div with the id=content. unfortunately nothing is working within the tika-entity; only the standard text (content) is populated. do i have to use copyField for test_text to get the data? or is there a problem with the

Re: dataimporter tika fields empty

2013-08-22 Thread Alexandre Rafalovitch
Can you try SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting htmlMapper=identity on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn:

Re: dataimporter tika fields empty

2013-08-22 Thread Andreas Owen
i put it in the tika-entity as an attribute, but it doesn't change anything. my bigger concern is why text_test isn't populated at all. On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote: Can you try SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530 Specifically, setting

Re: dataimporter tika fields empty

2013-08-22 Thread Andreas Owen
i can do it like this, but then the content isn't copied to text; it's just in text_test: entity name=tika processor=TikaEntityProcessor url=${rec.path}${rec.file} dataSource=dataUrl field column=text name=text_test copyField source=text_test dest=text / /entity On 22. Aug
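
A note on the copy step: in Solr, copyField is normally declared in schema.xml rather than inside a DataImportHandler entity, which may be why the content only shows up in text_test. A minimal sketch of the schema side follows. Only the field names text_test and text come from the thread; the field type text_general and the indexed/stored/multiValued settings are assumptions.

```xml
<!-- schema.xml (sketch). Field types and flags below are assumptions,
     not taken from the thread; only the field names are. -->
<field name="text_test" type="text_general" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true" />

<!-- copyField lives at the schema level, next to the field definitions -->
<copyField source="text_test" dest="text" />
```

With this in place, the entity in data-config.xml would only need the field mapping (column text to name text_test), and the copy to text would happen at index time.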

Re: dataimporter tika fields empty

2013-08-22 Thread Alexandre Rafalovitch
Ah. That's because Tika processor does not support path extraction. You need to nest one more level. Regards, Alex On 22 Aug 2013 13:34, Andreas Owen a...@conx.ch wrote: i can do it like this but then the content isn't copied to text. it's just in text_test entity name=tika
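
One possible reading of "nest one more level", sketched under assumptions: the outer Tika entity emits the page as HTML into its text column, and an inner XPathEntityProcessor entity reads that column through a FieldReaderDataSource and applies the xpath. The data source name fld, the inner entity name tika_html, format="html", and forEach="/html" are illustrative choices that do not appear in the thread. DIH's XPath support is a limited subset, so the expression may need adjusting.

```xml
<!-- data-config.xml fragment (sketch, not a verified configuration). -->

<!-- Declared next to the other dataSource elements; the name "fld" is an assumption. -->
<dataSource name="fld" type="FieldReaderDataSource" />

<!-- Placed inside the existing parent entity that provides ${rec.path} and ${rec.file}. -->
<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.path}${rec.file}" dataSource="dataUrl"
        format="html" htmlMapper="identity">
  <!-- Tika's output column; the inner entity reads it as tika.text -->
  <field column="text" />

  <!-- Inner level: parse the HTML emitted by Tika and extract one div -->
  <entity name="tika_html" processor="XPathEntityProcessor"
          dataSource="fld" dataField="tika.text" forEach="/html">
    <field column="text_test" xpath="//div[@id='content']" />
  </entity>
</entity>
```

If the div contains nested markup, adding flatten="true" to the inner field may be needed to collect the text of child elements. That is a guess about this page's structure rather than something stated in the thread.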