I would suggest identifying a row that produces a bad URL and then running it through the DIH admin interface (or from the command line) with "debug" enabled. The url parameter passed to Tika should be shown in its transformed form before the next entity starts. This works for me in a similar scenario.
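For the command-line route, here is a rough sketch of how the debug request could be built. It assumes the handler is registered at /dataimport on a core named collection1 (both depend on your solrconfig.xml), and the helper name dih_debug_url is just something I made up for illustration. It uses the DIH request parameters debug, verbose, clean, commit, start and rows to run only a small slice of rows; clean=false and commit=false are meant to keep the run from wiping or committing to the existing index.

```python
from urllib.parse import urlencode


def dih_debug_url(base_url, start, rows=1):
    """Build a DIH full-import URL that runs in debug mode on a small slice of rows.

    base_url is the DIH handler endpoint (an assumption; check solrconfig.xml).
    start is the offset of the suspect row, rows is how many rows to process.
    """
    params = {
        "command": "full-import",
        "debug": "true",      # return debug output in the response
        "verbose": "true",    # include per-entity details (query, url, rows)
        "clean": "false",     # do not delete existing documents first
        "commit": "false",    # do not commit the test run
        "start": start,
        "rows": rows,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical host, core and row offset; adjust to your setup.
    print(dih_debug_url("http://localhost:8983/solr/collection1/dataimport", start=41))
```

Fetch the printed URL with curl or a browser and look at what the tika_content entity received for url in the verbose output.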
On Tue, Dec 2, 2014 at 1:19 PM, Teague James <teag...@insystechinc.com> wrote:
> Hi all,
>
> I am using Solr 4.9.0 to index a DB with DIH. In the DB there is a URL
> field. In the DIH Tika uses that field to fetch and parse the documents.
> The URL from the field is valid and will download the document in the
> browser just fine. But Tika is getting HTTP response code 400. Any ideas
> why?
>
> ERROR
> BinURLDataSource
> java.io.IOException: Server returned HTTP response code: 400 for URL:
>
> EntityProcessorWrapper
> Exception in entity :
> tika_content:org.apache.solr.handler.dataimport.DataImportHandlerException:
> Exception in invoking url
>
> DIH
> <dataConfig>
>   <dataSource type="JdbcDataSource"
>       name="ds-1"
>       driver="net.sourceforge.jtds.jdbc.Driver"
>       url="jdbc:jtds:sqlserver://1.2.3.4/database;instance=INSTANCE;user=USER;password=PASSWORD" />
>
>   <dataSource type="BinURLDataSource" name="ds-2" />
>
>   <document>
>     <entity name="db_content" dataSource="ds-1"
>         transformer="ClobTransformer, RegexTransformer"
>         query="SELECT ContentID,
>                       DownloadURL
>                FROM DATABASE.VIEW">
>       <field column="ContentID" name="id" />
>       <field column="DownloadURL" clob="true" name="DownloadURL" />
>
>       <entity name="tika_content"
>           processor="TikaEntityProcessor" url="${db_content.DownloadURL}"
>           onError="continue" dataSource="ds-2">
>         <field column="TikaParsedContent" />
>       </entity>
>
>     </entity>
>   </document>
> </dataConfig>
>
> SCHEMA - Fields
> <field name="DownloadURL" type="string" indexed="true" stored="true" />
> <field name="TikaParsedContent" type="text_general" indexed="true"
>     stored="true" multiValued="true"/>