Re: dataimporter tika fields empty
OK, but I'm not doing any path extraction, at least I don't think so. htmlMapper=identity isn't preserving the HTML. The processor reads the content of the pages, but it doesn't put it into both text_test and text. The content is only in text_test, so the copyField isn't working.

data-config.xml:

    <dataConfig>
      <dataSource type="BinFileDataSource" name="data"/>
      <dataSource type="BinURLDataSource" name="dataUrl"/>
      <dataSource type="URLDataSource" name="main"/>
      <document>
        <entity name="rec" processor="XPathEntityProcessor" url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main">
          <field column="title" xpath="//title" />
          <field column="id" xpath="//id" />
          <field column="file" xpath="//file" />
          <field column="path" xpath="//path" />
          <field column="url" xpath="//url" />
          <field column="Author" xpath="//author" />
          <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity">
            <field column="text" name="text_test" />
            <copyField source="text_test" dest="text" />
            <!-- <field column="text_test" xpath="//div[@id='content']" /> -->
          </entity>
        </entity>
      </document>
    </dataConfig>
Re: dataimporter tika fields empty
I changed the following line (xpath):

    <field column="text" xpath="//div[@id='content']" name="text_test" />
Re: dataimporter tika fields empty
Can you try the SOLR-4530 switch: https://issues.apache.org/jira/browse/SOLR-4530

Specifically, setting htmlMapper=identity on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen a...@conx.ch wrote:

I'm trying to index an HTML page and only use the div with id=content. Unfortunately nothing is working within the tika entity; only the standard text (content) is populated. Do I have to use copyField for text_test to get the data? Or is there a problem with the entity hierarchy? Or is the xpath wrong, even though I've tried it without and just using text? Or should I use the updateextractor?

data-config.xml:

    <dataConfig>
      <dataSource type="BinFileDataSource" name="data"/>
      <dataSource type="BinURLDataSource" name="dataUrl"/>
      <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
      <document>
        <entity name="rec" processor="XPathEntityProcessor" url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
          <field column="title" xpath="//title" />
          <field column="id" xpath="//id" />
          <field column="file" xpath="//file" />
          <field column="path" xpath="//path" />
          <field column="url" xpath="//url" />
          <field column="Author" xpath="//author" />
          <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
            <!-- <copyField source="text" dest="text_test" /> -->
            <field column="text_test" xpath="//div[@id='content']" />
          </entity>
        </entity>
      </document>
    </dataConfig>

docImporterUrl.xml:

    <?xml version="1.0" encoding="utf-8"?>
    <docs>
      <doc>
        <id>5</id>
        <author>tkb</author>
        <title>Startseite</title>
        <description>blabla ...</description>
        <file>http://localhost/tkb/internet/index.cfm</file>
        <url>http://localhost/tkb/internet/index.cfm/url</url>
        <path2>http\specialConf</path2>
      </doc>
      <doc>
        <id>6</id>
        <author>tkb</author>
        <title>Eigenheim</title>
        <description>Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt.</description>
        <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
        <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url</url>
      </doc>
    </docs>
Re: dataimporter tika fields empty
I put it in the tika entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.
Re: dataimporter tika fields empty
I can do it like this, but then the content isn't copied to text. It's only in text_test:

    <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl">
      <field column="text" name="text_test" />
      <copyField source="text_test" dest="text" />
    </entity>
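A note on the copyField in the snippet above: in Solr, copyField is a schema.xml directive, not a DataImportHandler element, so placing it inside an entity in data-config.xml would not be expected to have any effect. A minimal sketch of the schema side follows. The field names text_test and text come from this thread; the field type text_general and the indexed/stored/multiValued settings are assumptions for illustration and need to match your own schema.

```xml
<!-- schema.xml fragment (not data-config.xml) -->
<!-- Assumption: text_general is a field type defined in this schema -->
<field name="text_test" type="text_general" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- Copies whatever is indexed into text_test into text as well -->
<copyField source="text_test" dest="text"/>
```

With that in place, the DIH entity would only need the field mapping from the Tika column text to text_test; the copy into text happens at index time.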
Re: dataimporter tika fields empty
Ah. That's because Tika processor does not support path extraction. You need to nest one more level.

Regards,
Alex
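The "nest one more level" suggestion could be sketched roughly as below: the Tika entity only produces the text column, and a child XPathEntityProcessor entity reads that column through a FieldReaderDataSource and applies the xpath. This is an untested fragment, not a confirmed fix from this thread. The names fld and content, the format="xml" setting, and the forEach value are assumptions; DIH's XPathEntityProcessor also supports only a subset of XPath, so whether //div[@id='content'] is accepted should be checked against your Solr version.

```xml
<!-- data-config.xml fragment; goes alongside the existing dataSource and "rec" entity definitions -->

<!-- Assumption: a FieldReaderDataSource named "fld" lets a child entity read a parent column -->
<dataSource type="FieldReaderDataSource" name="fld"/>

<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.path}${rec.file}" dataSource="dataUrl"
        onError="skip" htmlMapper="identity" format="xml">
  <!-- Tika emits the (X)HTML of the page in the "text" column -->
  <field column="text"/>

  <!-- Child entity: parses tika.text as XML and extracts the content div -->
  <entity name="content" processor="XPathEntityProcessor"
          dataSource="fld" dataField="tika.text" forEach="/html">
    <field column="text_test" xpath="//div[@id='content']"/>
  </entity>
</entity>
```

The design idea is that path extraction happens in a processor that supports it, while Tika is used only to fetch and normalize the page.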