Hi,

I'm using ManifoldCF 1.2 with ElasticSearch 0.90.  I'm trying to index PDF 
files via the "Windows Shares" repository connector.  I have the 
elasticsearch-mapper-attachments plugin installed in ElasticSearch.

When I run the job on an empty index, a 'flat' schema is created:
{
  "pdf_docs_flat_schema" : {
    "pdf_docs" : {
      "properties" : {
        "_content_type" : {
          "type" : "string"
        },
        "_name" : {
          "type" : "string"
        },
        "allow_token_document" : {
          "type" : "string"
        },
        "allow_token_share" : {
          "type" : "string"
        },
        "deny_token_document" : {
          "type" : "string"
        },
        "deny_token_share" : {
          "type" : "string"
        },
        "file" : {
          "type" : "string"
        },
        "lastModified" : {
          "type" : "string"
        },
        "type" : {
          "type" : "string"
        }
      }
    }
  }
}

Notice that the _content_type, _name, file, and type fields are all properties 
of type "string".  As far as I can tell, the 'type' of "attachment" sent with 
the indexed file is just treated as a normal piece of metadata, and the 'file' 
field (which is sent as a base64-encoded string) is never processed as an 
attachment.

According to 
http://www.elasticsearch.org/guide/reference/mapping/attachment-type/ it seems 
that the connector should use a mapping command to set the 'file' property with 
a type of 'attachment', with "_content_type" and "_name" fields as subfields of 
the 'file' property.  Also, through testing I found that if you want the 
'date', 'title', 'author', and 'keywords' fields extracted from the document 
and saved, they need to be listed in the mapping too.  (Unfortunately, using a 
mapping changes the JSON for adding the document to the index: instead of 
sending the base64-encoded file as the value of the 'file' field, it has to be 
sent in the 'content' subfield of 'file'.)
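
To make the difference concrete, here is a rough sketch of the mapping and the 
document body I would expect to be needed, based on my reading of the 
attachment-type page linked above.  It only builds the JSON and does not talk 
to ElasticSearch.  The function names are made up for illustration, the 
"store" settings are my assumption, and the plugin documentation names the 
payload subfield "content"; none of this is what the ManifoldCF connector 
currently sends.

```python
import base64
import json


def attachment_mapping(type_name="pdf_docs", field="file"):
    """Build a mapping that declares `field` as an attachment.

    Assumption: the metadata subfields below are stored so they can be
    returned in search results; adjust to taste.
    """
    extracted = ("date", "title", "author", "keywords")
    fields = {name: {"store": "yes"} for name in extracted}
    return {
        type_name: {
            "properties": {
                field: {"type": "attachment", "fields": fields}
            }
        }
    }


def attachment_document(raw_bytes, name,
                        content_type="application/pdf", field="file"):
    """Build the document body in the nested form the attachment type expects.

    _content_type and _name become subfields of `field`, and the
    base64-encoded payload goes in the "content" subfield.
    """
    return {
        field: {
            "_content_type": content_type,
            "_name": name,
            "content": base64.b64encode(raw_bytes).decode("ascii"),
        }
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real PDF read from the share.
    print(json.dumps(attachment_mapping(), indent=2))
    print(json.dumps(attachment_document(b"%PDF-1.4 ...", "example.pdf"),
                     indent=2))
```

The mapping would have to be put in place before the first document is 
indexed; otherwise ElasticSearch infers the flat string mapping shown earlier.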

Am I missing something obvious here?  All I want is my documents properly 
indexed.
Is this something for the 'dev' mailing list instead?

Thanks,
Rick


