Try removing that first epub from the directory and rerunning. If that run indexes 0 documents, then there is something unexpected about the remaining files and DIH is skipping them. If it indexes 1 document again, but a different one, then it is definitely something about the repeat logic.
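To compare the document count between runs, a small script along these lines may help. It is only a sketch: it assumes Solr is on the default port with the core name hn2 from the original message, that the core still has the stock /select handler, and the helper names (import_url, count_url, num_found, fetch, main) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumptions: default Solr port and the core name "hn2" from the
# original message; adjust both for your own setup.
SOLR = "http://localhost:8983/solr"
CORE = "hn2"


def import_url(solr=SOLR, core=CORE):
    """URL that asks the /dataimport handler to start a full import."""
    params = {"command": "full-import", "clean": "true",
              "commit": "true", "wt": "json"}
    return f"{solr}/{core}/dataimport?{urlencode(params)}"


def count_url(solr=SOLR, core=CORE):
    """URL for a match-all query that returns only the hit count."""
    params = {"q": "*:*", "rows": 0, "wt": "json"}
    return f"{solr}/{core}/select?{urlencode(params)}"


def num_found(body):
    """Extract the total hit count from a Solr JSON select response."""
    return json.loads(body)["response"]["numFound"]


def fetch(url):
    """GET a URL and return the response body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def main():
    """Start a full import, wait for the user, then print the doc count."""
    fetch(import_url())  # the import itself runs in the background
    input("Press Enter once the import has finished...")
    print("documents in index:", num_found(fetch(count_url())))
```

Calling main() once with all the files in place and once after removing the first epub should show whether the count stays at 1 or drops to 0.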
Also, try running with debug and verbose modes and see if something specific shows up.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

On 25 February 2015 at 11:14, Gary Taylor <g...@inovem.com> wrote:
> I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly
> add a Solr document for each epub file in my local directory.
>
> I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran "solr start" and
> then "solr create -c hn2" to create a new core.
>
> I want to index a load of epub files that I've got in a directory. So I
> created a data-import.xml (in solr\hn2\conf):
>
> <dataConfig>
>     <dataSource type="BinFileDataSource" name="bin" />
>     <document>
>         <entity name="files" dataSource="null" rootEntity="false"
>             processor="FileListEntityProcessor"
>             baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
>             onError="skip"
>             recursive="true">
>             <field column="fileAbsolutePath" name="id" />
>             <field column="fileSize" name="size" />
>             <field column="fileLastModified" name="lastModified" />
>
>             <entity name="documentImport" processor="TikaEntityProcessor"
>                 url="${files.fileAbsolutePath}" format="text"
>                 dataSource="bin" onError="skip">
>                 <field column="file" name="fileName"/>
>                 <field column="Author" name="author" meta="true"/>
>                 <field column="title" name="title" meta="true"/>
>                 <field column="text" name="content"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
>
> In my solrconfig.xml, I added a requestHandler entry to reference my
> data-import.xml:
>
> <requestHandler name="/dataimport"
>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>     <lst name="defaults">
>         <str name="config">data-import.xml</str>
>     </lst>
> </requestHandler>
>
> I renamed managed-schema to schema.xml, and ensured the following doc fields
> were set up:
>
> <field name="id" type="string" indexed="true" stored="true"
>     required="true" multiValued="false" />
> <field name="fileName" type="string" indexed="true" stored="true" />
> <field name="author" type="string" indexed="true" stored="true" />
> <field name="title" type="string" indexed="true" stored="true" />
>
> <field name="size" type="long" indexed="true" stored="true" />
> <field name="lastModified" type="date" indexed="true" stored="true" />
>
> <field name="content" type="text_en" indexed="false" stored="true"
>     multiValued="false"/>
> <field name="text" type="text_en" indexed="true" stored="false"
>     multiValued="true"/>
>
> <copyField source="content" dest="text"/>
>
> I copied all the jars from dist and contrib\* into server\solr\lib.
>
> Stopping and restarting Solr then creates a new managed-schema file and
> renames schema.xml to schema.xml.back
>
> All good so far.
>
> Now I go to the web admin for dataimport
> (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to
> execute a full import.
>
> But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed: 1",
> i.e. it only adds one document (the very first one) even though it has
> iterated over 58!
>
> No errors are reported in the logs.
>
> I can search on the contents of that first epub document, so it's extracting
> OK in Tika, but there's a problem somewhere in my config that's causing only
> 1 document to be indexed in Solr.
>
> Thanks for any assistance / pointers.
>
> Regards,
> Gary
>
> --
> Gary Taylor | www.inovem.com | www.kahootz.com
>
> INOVEM Ltd is registered in England and Wales No 4228932
> Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
> kahootz.com is a trading name of INOVEM Ltd.
>