Re: dataimporter tika fields empty

2013-08-23 Thread Andreas Owen
OK, but I'm not doing any path extraction, at least I don't think so.

htmlMapper="identity" isn't preserving the HTML.

It's reading the content of the pages, but it's not putting it into both
text_test and text. It's only in text_test; the copyField isn't working.

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor"
            url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc"
            dataSource="main">
      <field column="title" xpath="//title" />
      <field column="id" xpath="//id" />
      <field column="file" xpath="//file" />
      <field column="path" xpath="//path" />
      <field column="url" xpath="//url" />
      <field column="Author" xpath="//author" />

      <entity name="tika" processor="TikaEntityProcessor"
              url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip"
              htmlMapper="identity">
        <field column="text" name="text_test" />
        <copyField source="text_test" dest="text" />
        <!-- <field column="text_test" xpath="//div[@id='content']" /> -->
      </entity>
    </entity>
  </document>
</dataConfig>
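
A note on the copyField line above: as far as I know, copyField is a schema.xml directive and is not read by the DataImportHandler from data-config.xml, which would explain why only text_test is filled. A minimal sketch of the schema side, assuming both fields are declared there (the field type text_general and the indexed/stored/multiValued settings are assumptions, not taken from the actual schema):

```xml
<!-- schema.xml (sketch; type and attribute values are assumptions) -->
<field name="text_test" type="text_general" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- copies text_test into text whenever a document is indexed -->
<copyField source="text_test" dest="text"/>
```

With this in place the copyField element could be dropped from the tika entity, since the copy happens at index time for every document that has a text_test value.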



Re: dataimporter tika fields empty

2013-08-23 Thread Andreas Owen
I changed the following line (added the xpath):
<field column="text" xpath="//div[@id='content']" name="text_test" />

 
 



Re: dataimporter tika fields empty

2013-08-22 Thread Alexandre Rafalovitch
Can you try the SOLR-4530 switch:
https://issues.apache.org/jira/browse/SOLR-4530

Specifically, setting htmlMapper="identity" on the entity definition. This
will tell Tika to send the full HTML rather than a seriously stripped one.

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen a...@conx.ch wrote:

 I'm trying to index an HTML page and only use the div with
 id="content". Unfortunately nothing is working within the tika entity; only
 the standard text (content) is populated.

 Do I have to use copyField for text_test to get the data?
 Or is there a problem with the entity hierarchy?
 Or is the xpath wrong, even though I've tried it without and just
 using text?
 Or should I use the updateextractor?

 data-config.xml:

 <dataConfig>
   <dataSource type="BinFileDataSource" name="data"/>
   <dataSource type="BinURLDataSource" name="dataUrl"/>
   <dataSource type="URLDataSource" baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
   <document>
     <entity name="rec" processor="XPathEntityProcessor"
             url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
       <field column="title" xpath="//title" />
       <field column="id" xpath="//id" />
       <field column="file" xpath="//file" />
       <field column="path" xpath="//path" />
       <field column="url" xpath="//url" />
       <field column="Author" xpath="//author" />

       <entity name="tika" processor="TikaEntityProcessor"
               url="${rec.path}${rec.file}" dataSource="dataUrl">
         <!-- <copyField source="text" dest="text_test" /> -->
         <field column="text_test" xpath="//div[@id='content']" />
       </entity>
     </entity>
   </document>
 </dataConfig>

 docImportUrl.xml:

 <?xml version="1.0" encoding="utf-8"?>
 <docs>
   <doc>
     <id>5</id>
     <author>tkb</author>
     <title>Startseite</title>
     <description>blabla ...</description>
     <file>http://localhost/tkb/internet/index.cfm</file>
     <url>http://localhost/tkb/internet/index.cfm/url</url>
     <path2>http\specialConf</path2>
   </doc>
   <doc>
     <id>6</id>
     <author>tkb</author>
     <title>Eigenheim</title>
     <description>Machen Sie sich erste Gedanken über den
 Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein
 spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den
 Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller
 Hinsicht gelingt.</description>
     <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
     <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url</url>
   </doc>
 </docs>


Re: dataimporter tika fields empty

2013-08-22 Thread Andreas Owen
I put it in the tika entity as an attribute, but it doesn't change anything. My
bigger concern is why text_test isn't populated at all.




Re: dataimporter tika fields empty

2013-08-22 Thread Andreas Owen
I can do it like this, but then the content isn't copied to text. It's just in
text_test.

<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.path}${rec.file}" dataSource="dataUrl">
  <field column="text" name="text_test" />
  <copyField source="text_test" dest="text" />
</entity>





Re: dataimporter tika fields empty

2013-08-22 Thread Alexandre Rafalovitch
Ah. That's because the Tika processor does not support path extraction. You
need to nest one more level.

Regards,
  Alex
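
Roughly what nesting one more level could look like: the Tika entity emits XHTML into its text column, and an inner XPathEntityProcessor reads that column through a FieldReaderDataSource. This is an untested sketch. The data source name fieldReader, the inner entity name content, format="xml", forEach="/html" and flatten="true" are assumptions, and XPathEntityProcessor only supports a subset of XPath, so the expression may need adjusting.

```xml
<dataConfig>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" name="main"/>
  <!-- assumption: lets the innermost entity read a column of its parent row -->
  <dataSource type="FieldReaderDataSource" name="fieldReader"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor"
            url="http://127.0.0.1/tkb/internet/docImportUrl.xml"
            forEach="/docs/doc" dataSource="main">
      <field column="id" xpath="//id" />
      <field column="file" xpath="//file" />
      <field column="path" xpath="//path" />

      <!-- level 1: Tika fetches the page and keeps the markup -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${rec.path}${rec.file}" dataSource="dataUrl"
              format="xml" htmlMapper="identity" onError="skip">

        <!-- level 2: XPath runs over the text column produced by Tika -->
        <entity name="content" processor="XPathEntityProcessor"
                dataSource="fieldReader" dataField="tika.text"
                forEach="/html" onError="skip">
          <field column="text_test" xpath="//div[@id='content']" flatten="true" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
```

If the div contains nested markup, flatten="true" is what pulls the text of the child elements into the one field.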