I fixed the problem by using the /dataimport request handler:
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config.xml</str>
  </lst>
</requestHandler>
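
For the handler above to load, the DataImportHandler jars generally have to be on the core's classpath. A minimal sketch of the solrconfig.xml directives, assuming the default directory layout of the Solr 5.3.1 distribution (the `solr.install.dir` default and the relative paths are assumptions; adjust them to your installation):

```xml
<!-- Assumed paths: relative to a core under server/solr/ in a stock Solr 5.x install -->
<!-- DataImportHandler and its extras (the extras jar provides TikaEntityProcessor) -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<!-- Tika and its parser dependencies, shipped with the extraction contrib -->
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
```

If the jars are missing, the core typically fails to start with a class-not-found error for the handler class.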
I configured tika-data-config.xml according to my needs to get the right
values:
<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            dataSource="null" rootEntity="false"
            baseDir="D:\Lucene\document"
            fileName=".*\.(DOC|PDF|pdf|doc|docx|ppt)"
            onError="skip"
            recursive="true">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <field column="file" name="fileName" />
Now there is no need to index from the command line with SimplePostTool; just
go to the Dataimport screen in the web admin and execute a full import.
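
The config pasted above is cut off after the file-level fields. A complete tika-data-config.xml might look like the sketch below. It assumes the nested TikaEntityProcessor entity from my earlier message (quoted further down) is kept, and that the schema defines the fields `id`, `size`, `lastModified`, `fileName`, `author`, `title` and `text`. The entity name `documentImport` and the choice of metadata fields are illustrative, not necessarily what is running here. The fileName pattern is grouped so that every extension alternative sits behind the dot.

```xml
<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <!-- Outer entity: walks the directory tree and emits one row per matching file -->
    <entity name="files" processor="FileListEntityProcessor"
            dataSource="null" rootEntity="false"
            baseDir="D:\Lucene\document"
            fileName=".*\.(DOC|PDF|pdf|doc|docx|ppt)"
            onError="skip"
            recursive="true">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <field column="file" name="fileName" />

      <!-- Inner entity (assumed): Tika parses each file found by the outer entity -->
      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text">
        <field column="Author" name="author" meta="true" />
        <field column="title" name="title" meta="true" />
        <field column="text" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Because the outer entity has rootEntity="false", each Solr document is built from the inner entity, so without it nothing would be indexed. A full import can also be started without the admin UI by requesting /dataimport?command=full-import on the core.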
2015-12-04 17:05 GMT+00:00 kostali hassan <[email protected]>:
> Thank you. That is why I chose to set the exact values using the Solarium
> PHP client, but a timeout stops the indexing after 30 seconds:
>
> $dir = new Folder($dossier);
> $files = $dir->find('.*\.*');
> foreach ($files as $file) {
>     $file = new File($dir->pwd() . DS . $file);
>
>     $query = $client->createExtract();
>     $query->setFile($file->pwd());
>     $query->setCommit(true);
>     $query->setOmitHeader(false);
>
>     $doc = $query->createDocument();
>     $doc->id = $file->pwd();
>     $doc->name = $file->name;
>     $doc->title = $file->name();
>
>     $query->setDocument($doc);
>
> 2015-12-04 16:50 GMT+00:00 Erik Hatcher <[email protected]>:
>
>> Kostali -
>>
>> See if the “Introspect rich document parsing and extraction” section of
>> http://lucidworks.com/blog/2015/08/04/solr-5-new-binpost-utility/
>> helps*. You’ll be able to see the output of /update/extract (aka Tika) and
>> adjust your mappings and configurations accordingly.
>>
>> * And apologies that bin/post isn’t Windows savvy at this point, but
>> you’ve got the hang of the Windows-compatible command-line it looks like.
>>
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com
>>
>>
>>
>> > On Dec 4, 2015, at 11:44 AM, kostali hassan <[email protected]> wrote:
>> >
>> > Thank you Erik, I followed your advice and took a look at the Apache
>> > Tika configuration. I have modified my /update/extract request handler:
>> >
>> > <requestHandler name="/update/extract"
>> >                 startup="lazy"
>> >                 class="solr.extraction.ExtractingRequestHandler" >
>> >   <lst name="defaults">
>> >     <str name="fmap.Last-Modified">last_modified</str>
>> >     <str name="uprefix">ignored_</str>
>> >
>> >     <!-- capture link hrefs but ignore div attributes -->
>> >     <str name="captureAttr">true</str>
>> >     <str name="fmap.a">links</str>
>> >     <str name="fmap.div">ignored_</str>
>> >   </lst>
>> >   <str name="tika.config">D:\solr\solr-5.3.1\server\solr\tika-data-config.xml</str>
>> > </requestHandler>
>> >
>> > and the Tika config:
>> >
>> > <dataConfig>
>> >   <dataSource type="BinFileDataSource" />
>> >   <document>
>> >     <entity name="files" processor="FileListEntityProcessor"
>> >             dataSource="null" rootEntity="false"
>> >             baseDir="D:\Lucene\document"
>> >             fileName=".*.(doc)|(pdf)|(docx)"
>> >             onError="skip"
>> >             recursive="true">
>> >       <field column="fileAbsolutePath" name="lux_uri" />
>> >       <field column="fileSize" name="size" />
>> >       <field column="fileLastModified" name="lastModified" />
>> >
>> >       <entity name="documentImport"
>> >               processor="TikaEntityProcessor"
>> >               url="${files.fileAbsolutePath}"
>> >               format="text">
>> >         <field column="file" name="fileName" meta="true"/>
>> >         <field column="Author" name="author" meta="true"/>
>> >         <field column="name" name="name" meta="true"/>
>> >         <field column="title" name="title" meta="true"/>
>> >         <field column="text" name="text"/>
>> >         <field column="custom:Testmeta" name="Testmeta" meta="true"/>
>> >         <field column="LastModifiedBy" name="LastModifiedBy" meta="true"/>
>> >       </entity>
>> >     </entity>
>> >   </document>
>> > </dataConfig>
>> >
>> > and schema.xml:
>> >
>> > <field name="Testmeta" type="text" indexed="true" stored="true" />
>> >
>> >
>> >
>> > But the problem is the same: the title of the indexed files is wrong
>> > for MS Word documents.
>>
>>
>