No, that wouldn't work. You probably need a custom Transformer to extract the content of the right div. I do not know whether TikaEntityProcessor supports such a thing.
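To make the Transformer idea a bit more concrete, here is a rough, untested-against-Solr sketch of what such a class might look like. It assumes several things that are not confirmed anywhere in this thread: that the "text" column arrives as a well-formed XHTML string, that htmlMapper="identity" keeps the div and its id attribute, and that DIH will call a plain `transformRow(Map)` method on the class (please check the DataImportHandler documentation for the exact contract). The class name `DivContentTransformer` and the helper `extractDivText` are made up for illustration.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical transformer: replaces the "text" column with the text found
// inside <div id="content">. For real use it would be declared public in its
// own file; it is package-private here only to keep the sketch in one piece.
final class DivContentTransformer {

    private static final String COLUMN = "text";    // column name from data-config.xml
    private static final String DIV_ID = "content"; // id of the div to keep

    // Assumed entry point: a row goes in, the (possibly modified) row comes out.
    public Object transformRow(Map<String, Object> row) {
        Object value = row.get(COLUMN);
        if (value instanceof String) {
            String extracted = extractDivText((String) value, DIV_ID);
            if (extracted != null) {
                row.put(COLUMN, extracted);
            }
            // If nothing was extracted, the original value is left untouched.
        }
        return row;
    }

    // Returns the trimmed text content of the first div with the given id,
    // or null if the input cannot be parsed as XML or has no such div.
    // divId is spliced into the XPath, so it must not contain a single quote.
    static String extractDivText(String xhtml, String divId) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace-unaware parsing, so a plain "//div" step also matches
            // when the document declares a default XHTML namespace.
            factory.setNamespaceAware(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            Node div = (Node) xpath.evaluate(
                    "//div[@id='" + divId + "']", doc, XPathConstants.NODE);
            return div == null ? null : div.getTextContent().trim();
        } catch (Exception e) {
            // Malformed markup or parser failure: signal "nothing extracted".
            return null;
        }
    }
}
```

If this approach is usable at all, the class would presumably be referenced from the transformer attribute of the "tika" entity by its fully qualified name, in the same place where your config has the commented-out transformer="script:GenerateId". Whether rows produced by TikaEntityProcessor actually pass through a transformer is exactly the part I have not verified.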

On Wed, Sep 4, 2013 at 12:38 PM, Andreas Owen <a...@conx.ch> wrote:
> So could I just nest it in an XPathEntityProcessor to filter the html, or is
> there something like xpath for tika?
>
> <entity name="htm" processor="XPathEntityProcessor" url="${rec.file}"
>         forEach="/div[@id='content']" dataSource="main">
>   <entity name="tika" processor="TikaEntityProcessor"
>           url="${htm}" dataSource="dataUrl" onError="skip" htmlMapper="identity"
>           format="html" >
>     <field column="text" />
>   </entity>
> </entity>
>
> But now I don't know how to pass the text to tika. What do I put in url and
> datasource?
>
>
> On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote:
>
>> I don't know much about Tika but in the example data-config.xml that
>> you posted, the "xpath" attribute on the field "text" won't work
>> because the xpath attribute is used only by a XPathEntityProcessor.
>>
>> On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen <a...@conx.ch> wrote:
>>> I want tika to only index the content in <div id="content">...</div> for
>>> the field "text". Unfortunately it's indexing the whole page. Can't xpath do
>>> this?
>>>
>>> data-config.xml:
>>>
>>> <dataConfig>
>>>   <dataSource type="BinFileDataSource" name="data"/>
>>>   <dataSource type="BinURLDataSource" name="dataUrl"/>
>>>   <dataSource type="URLDataSource" name="main"/>
>>>   <document>
>>>     <entity name="rec" processor="XPathEntityProcessor"
>>>             url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc"
>>>             dataSource="main"> <!--transformer="script:GenerateId"-->
>>>       <field column="title" xpath="//title" />
>>>       <field column="id" xpath="//id" />
>>>       <field column="file" xpath="//file" />
>>>       <field column="path" xpath="//path" />
>>>       <field column="url" xpath="//url" />
>>>       <field column="Author" xpath="//author" />
>>>
>>>       <entity name="tika" processor="TikaEntityProcessor"
>>>               url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip"
>>>               htmlMapper="identity" format="html" >
>>>         <field column="text" xpath="//div[@id='content']" />
>>>
>>>       </entity>
>>>     </entity>
>>>   </document>
>>> </dataConfig>
>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>

--
Regards,
Shalin Shekhar Mangar.