How to extract text content and index in elastic-search

Dileepa Jayakody Fri, 06 Oct 2017 05:35:54 -0700

Hi All,

I'm trying out a small demo with a file system repository connector and an
Elasticsearch output connector to extract and index spreadsheet documents.
I've also added the Tika transform connector to the job.


When I run the job, the documents get indexed in Elasticsearch, but the
content is indexed as binary.

See the indexed content in ES below. How can I extract the spreadsheet
content as text here?
Even for a plain text file, the content is indexed as binary.
Is there some configuration I need to set to get the text content
extracted and indexed in ES?

{
        "_index": "test",
        "_type": "generictype",
        "_id": "file:/home/dileepa/Documents/hackathon/test_data/MI%20-%20Project2%20-%20Estimation%20v1.0.xlsx",
        "_score": 1,
        "_source": {
          "stream_size": "101613",
          "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
          "stream_name": "MI - Project2 - Estimation v1.0.xlsx",
          "protected": "false",
          "resourceName": "MI - Project2 - Estimation v1.0.xlsx",
          "uri": "/home/dileepa/Documents/hackathon/test_data/MI - Project2 - Estimation v1.0.xlsx",
          "Content-Type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
          "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
          "allow_token_document": "__nosecurity__",
          "deny_token_document": "__nosecurity__",
          "allow_token_share": "__nosecurity__",
          "deny_token_share": "__nosecurity__",
          "allow_token_parent": "__nosecurity__",
          "deny_token_parent": "__nosecurity__",
          "file": {
            "_content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "_name": "MI - Project2 - Estimation v1.0.xlsx",
            "_content": "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJQXNzdW1wdGlvbnMgYW5kIHNjb3BlCUFkZGl0aW9uYWwgaJlYWxpMAkwCTAJ....."
        }
      }
    ]
  }
}
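
As a quick sanity check, the "_content" value above looks like Base64 text
rather than raw binary. The sketch below decodes only the first 28 characters
of that value (the rest is elided in the output above). The helper name
decode_prefix is hypothetical and used only for illustration. It assumes the
field is standard Base64 over UTF-8 text, which I have not confirmed from the
connector documentation.

```python
import base64

# First 28 characters of the "_content" value shown above.
# The remainder of the value is elided in the post, so only this prefix is used.
prefix = "RGV2ZWxvcG1lbnQgRXN0aW1hdGVz"


def decode_prefix(b64: str) -> str:
    """Decode a Base64 string (length must be a multiple of 4) as UTF-8 text.

    Hypothetical helper for illustration; assumes the payload is UTF-8.
    """
    return base64.b64decode(b64).decode("utf-8")


if __name__ == "__main__":
    print(decode_prefix(prefix))  # prints "Development Estimates"
```

If the decoded prefix is readable, as it is here, the text may already have
been extracted and only encoded on the way into the "file" field, so the open
question is how to get it indexed as plain text.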

Thanks,
Dileepa
