Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

Alexandre Rafalovitch Wed, 25 Feb 2015 09:15:34 -0800

Try removing that first epub from the directory and rerunning. If you
now index 0 documents, then something about the remaining files is
unexpected and DIH is skipping them. If it indexes 1 document again,
but a different one, then the problem is definitely in the repeat logic.


Also, try running with debug and verbose modes and see if something
specific shows up.
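
For example, a minimal sketch of building such a request, assuming the
default port and the hn2 core from your message (dih_debug_url is just a
hypothetical helper name, not part of Solr). It only constructs the URL
with the DIH "debug" and "verbose" request parameters; open the result in
a browser or fetch it with any HTTP client:

```python
from urllib.parse import urlencode


def dih_debug_url(base="http://localhost:8983/solr", core="hn2"):
    """Build a DIH full-import URL with debug and verbose output enabled.

    The base URL and core name are assumptions taken from the original
    message; adjust them to match your installation.
    """
    params = {
        "command": "full-import",  # run a full import
        "debug": "true",           # return debug details in the response
        "verbose": "true",         # include per-entity, per-row details
    }
    return f"{base}/{core}/dataimport?{urlencode(params)}"


if __name__ == "__main__":
    print(dih_debug_url())
```

As far as I recall, documents are not committed automatically in debug
mode, so add commit=true if you also want the results to show up in
searches.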

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 11:14, Gary Taylor <g...@inovem.com> wrote:
> I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly
> add a Solr document for each epub file in my local directory.
>
> I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran "solr start" and
> then "solr create -c hn2" to create a new core.
>
> I want to index a load of epub files that I've got in a directory. So I
> created a data-import.xml (in solr\hn2\conf):
>
> <dataConfig>
>     <dataSource type="BinFileDataSource" name="bin" />
>     <document>
>         <entity name="files" dataSource="null" rootEntity="false"
>             processor="FileListEntityProcessor"
>             baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
>             onError="skip"
>             recursive="true">
>             <field column="fileAbsolutePath" name="id" />
>             <field column="fileSize" name="size" />
>             <field column="fileLastModified" name="lastModified" />
>
>             <entity name="documentImport" processor="TikaEntityProcessor"
>                 url="${files.fileAbsolutePath}" format="text"
>                 dataSource="bin" onError="skip">
>                 <field column="file" name="fileName"/>
>                 <field column="Author" name="author" meta="true"/>
>                 <field column="title" name="title" meta="true"/>
>                 <field column="text" name="content"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
>
> In my solrconfig.xml, I added a requestHandler entry to reference my
> data-import.xml:
>
>   <requestHandler name="/dataimport"
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>       <lst name="defaults">
>           <str name="config">data-import.xml</str>
>       </lst>
>   </requestHandler>
>
> I renamed managed-schema to schema.xml, and ensured the following doc fields
> were set up:
>
>       <field name="id" type="string" indexed="true" stored="true"
>           required="true" multiValued="false" />
>       <field name="fileName" type="string" indexed="true" stored="true" />
>       <field name="author" type="string" indexed="true" stored="true" />
>       <field name="title" type="string" indexed="true" stored="true" />
>
>       <field name="size" type="long" indexed="true" stored="true" />
>       <field name="lastModified" type="date" indexed="true" stored="true" />
>
>       <field name="content" type="text_en" indexed="false" stored="true"
>           multiValued="false"/>
>       <field name="text" type="text_en" indexed="true" stored="false"
>           multiValued="true"/>
>
>     <copyField source="content" dest="text"/>
>
> I copied all the jars from dist and contrib\* into server\solr\lib.
>
> Stopping and restarting Solr then creates a new managed-schema file and
> renames schema.xml to schema.xml.bak.
>
> All good so far.
>
> Now I go to the web admin for dataimport
> (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to
> execute a full import.
>
> But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed: 1",
> i.e. it only adds one document (the very first one) even though it has
> iterated over 58!
>
> No errors are reported in the logs.
>
> I can search on the contents of that first epub document, so it's extracting
> OK in Tika, but there's a problem somewhere in my config that's causing only
> 1 document to be indexed in Solr.
>
> Thanks for any assistance / pointers.
>
> Regards,
> Gary
>
> --
> Gary Taylor | www.inovem.com | www.kahootz.com
>
> INOVEM Ltd is registered in England and Wales No 4228932
> Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
> kahootz.com is a trading name of INOVEM Ltd.
>
