Hi All,
I'm trying out a small demo with a file system repository connector and an
Elasticsearch output connector to extract spreadsheet documents and index
them. I've also added a Tika transformation connector to the job.
When I run the job, the documents get indexed in Elasticsearch, but the
content is indexed as binary.
The indexed content in ES is shown below. How can I get the spreadsheet
content extracted as text here?
Even for a plain text file, the content is indexed as binary.
Is there some configuration I need to set so that the text content is
extracted and indexed in ES?
{
  "_index": "test",
  "_type": "generictype",
  "_id": "file:/home/dileepa/Documents/hackathon/test_data/MI%20-%20Project2%20-%20Estimation%20v1.0.xlsx",
  "_score": 1,
  "_source": {
    "stream_size": "101613",
    "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
    "stream_name": "MI - Project2 - Estimation v1.0.xlsx",
    "protected": "false",
    "resourceName": "MI - Project2 - Estimation v1.0.xlsx",
    "uri": "/home/dileepa/Documents/hackathon/test_data/MI - Project2 - Estimation v1.0.xlsx",
    "Content-Type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "allow_token_document": "__nosecurity__",
    "deny_token_document": "__nosecurity__",
    "allow_token_share": "__nosecurity__",
    "deny_token_share": "__nosecurity__",
    "allow_token_parent": "__nosecurity__",
    "deny_token_parent": "__nosecurity__",
    "file": {
      "_content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
      "_name": "MI - Project2 - Estimation v1.0.xlsx",
      "_content": "RGV2ZWxvcG1lbnQgRXN0aW1hdGVzCglTZWN0aW9uCUZlYXR1cmUJQXNzdW1wdGlvbnMgYW5kIHNjb3BlCUFkZGl0aW9uYWwgaJlYWxpMAkwCTAJ....."
    }
  }
]
}
}
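
As a quick check on the "_content" value above, here is a minimal Python
sketch that Base64-decodes only its first 28 characters (the rest is elided
in the paste, so it is not used). The helper name decode_content_prefix is
just illustrative, and treating the decoded bytes as UTF-8 is an assumption.

```python
import base64

# First 28 characters of the "_content" value shown above.
# The remainder is elided in the pasted hit, so only this prefix is decoded.
CONTENT_PREFIX = "RGV2ZWxvcG1lbnQgRXN0aW1hdGVz"


def decode_content_prefix(encoded: str) -> str:
    """Base64-decode a string and interpret the bytes as UTF-8 text.

    Assumption: the decoded bytes are UTF-8; this is not confirmed by
    anything in the indexed document itself.
    """
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    print(decode_content_prefix(CONTENT_PREFIX))
```

For this prefix the result is "Development Estimates", which reads like
extracted text and not raw .xlsx bytes. So the field may be holding
Base64-encoded text; I have only checked the prefix, not the whole value.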
Thanks,
Dileepa