I would suggest identifying a row that produces a bad URL and then running
it through the DIH admin interface (or from the command line) with "debug"
enabled. The url parameter going into Tika should appear in its transformed
form before the next entity starts. This works in a similar scenario for
me.
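
For the command-line route, here is a rough sketch of how the debug request
could be built. It assumes Solr is at http://localhost:8983/solr, the core
is named "collection1", and DIH is registered at the conventional
/dataimport handler path; all three are placeholders, so adjust them to
your setup. The helper name build_dih_debug_url is made up for this
example. "start" and "rows" narrow the run to the suspect row.

```python
from urllib.parse import urlencode


def build_dih_debug_url(base_url, core, start, rows=1):
    """Build a DIH full-import URL that runs a small window of rows in debug mode.

    base_url and core are assumptions about the local install; the handler
    path "/dataimport" is the conventional one and may differ in solrconfig.xml.
    """
    params = {
        "command": "full-import",
        "debug": "true",     # run in debug mode and return details in the response
        "verbose": "true",   # include per-entity detail, e.g. the resolved url
        "clean": "false",    # do not wipe the existing index
        "commit": "false",   # do not commit the test documents
        "start": start,      # zero-based offset of the first row to process
        "rows": rows,        # number of rows to process
    }
    return f"{base_url.rstrip('/')}/{core}/dataimport?{urlencode(params)}"


if __name__ == "__main__":
    # Row offset 41 is an arbitrary example value.
    print(build_dih_debug_url("http://localhost:8983/solr", "collection1", start=41))
```

Paste the printed URL into a browser or pass it to curl, then look in the
response for the url value that tika_content actually received.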

On Tue, Dec 2, 2014 at 1:19 PM, Teague James <teag...@insystechinc.com>
wrote:

> Hi all,
>
> I am using Solr 4.9.0 to index a DB with DIH. In the DB there is a URL
> field. In the DIH Tika uses that field to fetch and parse the documents.
> The URL from the field is valid and will download the document in the
> browser just fine. But Tika is getting HTTP response code 400. Any ideas
> why?
>
> ERROR
> BinURLDataSource
> java.io.IOException: Server returned HTTP response code: 400 for URL:
>
> EntityProcessorWrapper
> Exception in entity :
> tika_content:org.apache.solr.handler.dataimport.DataImportHandlerException:
> Exception in invoking url
>
> DIH
> <dataConfig>
>         <dataSource type="JdbcDataSource"
>               name="ds-1"
>               driver="net.sourceforge.jtds.jdbc.Driver"
>               url="jdbc:jtds:sqlserver://1.2.3.4/database;instance=INSTANCE;user=USER;password=PASSWORD" />
>
>         <dataSource type="BinURLDataSource" name="ds-2" />
>
>         <document>
>         <entity name="db_content" dataSource="ds-1"
> transformer="ClobTransformer, RegexTransformer"
>                 query="SELECT ContentID,
>                         DownloadURL
>                         FROM DATABASE.VIEW">
>                         <field column="ContentID" name="id" />
>                         <field column="DownloadURL" clob="true"
> name="DownloadURL" />
>
>                         <entity name="tika_content"
> processor="TikaEntityProcessor" url="${db_content.DownloadURL}"
> onError="continue" dataSource="ds-2">
>                                 <field column="TikaParsedContent" />
>                         </entity>
>
>         </entity>
>         </document>
> </dataConfig>
>
> SCHEMA - Fields
> <field name="DownloadURL" type="string" indexed="true" stored="true" />
> <field name="TikaParsedContent" type="text_general" indexed="true"
> stored="true" multiValued="true"/>
>
>
>
>
