The document you tried to index has an "id" but not a "fake_id".  Because 
"fake_id" is your index uniqueKey, you have to include it in every document you 
index.  Your most likely fix for this is to use a Transformer to generate a 
"fake_id".  You might get away with changing this:

<field column="fake_id" name="fake_id" meta="true" />

to this:

<field column="id" name="fake_id" meta="true" />

This assumes, of course, that for these pdf documents the "fake_id" should 
always be the same as the "id".
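If that assumption doesn't hold, a TemplateTransformer is the usual way to 
generate the key. A sketch, based on your posted data-config (the choice of 
${f.fileAbsolutePath} as the template value is just an example; any expression 
that is unique per document will do):

```xml
<!-- Sketch only: declares TemplateTransformer on the tika entity and
     fills fake_id from the file path instead of PDF metadata. -->
<entity name="tika" processor="TikaEntityProcessor"
        url="${f.fileAbsolutePath}" format="text"
        transformer="TemplateTransformer">
  <field column="id" name="id" meta="true" />
  <!-- fake_id generated per file, so every document gets a uniqueKey -->
  <field column="fake_id" template="${f.fileAbsolutePath}" />
  <field column="model" name="model" meta="true" />
  <field column="text" name="biog" />
</entity>
```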

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: anarchos78 [mailto:rigasathanasio...@hotmail.com] 
Sent: Friday, May 11, 2012 12:32 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing data from pdf

I have included the extras and I am getting the following:
*From Solr:*
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2</int></lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">2</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-05-11 20:21:50</str>
<str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
<str name="Committed">2012-05-11 20:21:51</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Failed">1</str>
<str name="Time taken">0:0:1.284</str></lst>
<str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>


*The log file:*
org.apache.solr.handler.dataimport.SolrWriter upload
WARNING: Error creating document : SolrInputDocument[{id=id(1.0)={1},
biog=biog(1.0)={Dinos Michailidis
Dinos Michailidis (1355 or 1356 – 1418) was a medieval Egyptian writer and
mathematician born in a village in the Nile Delta. He is the author of
Subh al-a 'sha, a fourteen volume encyclopedia in Arabic, which included a
section on cryptology. This information was attributed to Taj ad-Din Ali
ibn ad-Duraihim ben Muhammad ath-Tha 'alibi al-Mausili who lived from 1312
to 1361, but whose writings on cryptology have been lost. The list of
ciphers in this work included both substitution and transposition, and for
the first time, a cipher with multiple substitutions for each plaintext
letter.
Also traced to Ibn al-Duraihim is an exposition on and worked example of
cryptanalysis, including the use of tables of letter frequencies and sets of
letters which can not occur together in one word. 


}, model=model(1.0)={patata}}]
org.apache.solr.common.SolrException: [doc=null] missing required field: fake_id
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:355)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
        at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:66)
        at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:293)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:723)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)

*The data-config.xml:*
<?xml version="1.0" encoding="utf-8"?>

<dataConfig>
  <dataSource type="BinFileDataSource" name="binary" />
  <document>
    <entity name="f" dataSource="binary" rootEntity="false"
            processor="FileListEntityProcessor" baseDir="/solr/solr/docu/"
            fileName=".*pdf" recursive="true">
      <entity name="tika" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text">
        <field column="id" name="id" meta="true" />
        <field column="fake_id" name="fake_id" meta="true" />
        <field column="model" name="model" meta="true" />
        <field column="text" name="biog" />
      </entity>
    </entity>
  </document>
</dataConfig>

*The schema.xml (fields):*

<fields>
  <field name="id" type="string" indexed="true" stored="true" />
  <field name="fake_id" type="string" indexed="true" stored="true" />
  <field name="model" type="text_en" indexed="true" stored="true" />
  <field name="firstname" type="text_en" indexed="true" stored="true" />
  <field name="lastname" type="text_en" indexed="true" stored="true" />
  <field name="title" type="text_en" indexed="true" stored="true" />
  <field name="biog" type="text_en" indexed="true" stored="true" />
</fields>

<uniqueKey>fake_id</uniqueKey>
<defaultSearchField>text</defaultSearchField>

What is going wrong now? I have included all the required fields in
schema.xml.
Thank you.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-data-from-pdf-tp3979876p3980571.html
Sent from the Solr - User mailing list archive at Nabble.com.