1) Is Tika able to extract and parse the security of the document collected? Can it extract authorization on the file it parses? I guess Nutch can collect these but I have not seen evidence of that. We need this because we have to apply the security at the document level (not just at the index or repository level) because this is about content that should not be seen unless someone is authorized to do so.

Is this stored within each document, or is it stored external to the document? If internal, then depending on the file format, we may already be extracting it. If we aren't extracting it, please open a ticket with an example (dummy) file, and we'll add that. If external to the document, then no, this is not part of what Tika can do, but you could add it to the Metadata object before parsing the document, if that would be of any convenience in your workflow.
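For the external case, here is a minimal sketch of what "add it to the Metadata object" could look like for files on a shared drive. It uses only the JDK to read the file owner and POSIX permissions into a map. The key names (acl:owner, acl:posix-permissions) and the class name are made up for illustration; Tika does not define them. Repository-level permissions from Nuxeo or Drupal would have to come from those systems' own APIs, which this does not cover.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFileAttributeView;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExternalSecurityMetadata {

    // Hypothetical key names chosen for this sketch; they are not Tika constants.
    static final String OWNER_KEY = "acl:owner";
    static final String PERMISSIONS_KEY = "acl:posix-permissions";

    /** Collects security information that lives outside the document itself. */
    static Map<String, String> collect(Path file) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(OWNER_KEY, Files.getOwner(file).getName());

        // The POSIX view is null on file systems that do not support it (e.g. NTFS).
        PosixFileAttributeView view =
                Files.getFileAttributeView(file, PosixFileAttributeView.class);
        if (view != null) {
            fields.put(PERMISSIONS_KEY,
                    PosixFilePermissions.toString(view.readAttributes().permissions()));
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("tika-acl-demo", ".txt");
        try {
            Map<String, String> fields = collect(file);
            fields.forEach((key, value) -> System.out.println(key + " = " + value));
            // With Tika on the classpath, each entry could be copied into an
            // org.apache.tika.metadata.Metadata instance via metadata.set(key, value)
            // before that instance is handed to the parser.
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Whatever is set on the Metadata object this way can then be written to the index next to the extracted text, so that the search layer can filter on it at query time.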
From: Claude Garceau [mailto:[email protected]]
Sent: Wednesday, May 10, 2017 3:38 PM
To: [email protected]
Subject: French Language Detection with Tika

Let me tell you about my concept, the big picture of where I want to go.

We want to collect content from a document management system (Nuxeo), an intranet (Drupal) and files from the file system (shared drives) in order to make it retrievable by means of a search engine. All of these sources hold internal information for an internal audience; this is about unstructured content (documents and web pages).

We want to use Nutch as the crawler on these sources. Then Tika would extract and format the data and commit it to Elasticsearch (or Solr). We then index all of the content in Elasticsearch and make it available through a web application, probably an SPA built on Angular2. We would make API calls from this SPA to Elasticsearch to formulate queries and get results.

Now I have 2 questions:

1) Is Tika able to extract and parse the security of the document collected? Can it extract authorization on the file it parses? I guess Nutch can collect these but I have not seen evidence of that. We need this because we have to apply the security at the document level (not just at the index or repository level) because this is about content that should not be seen unless someone is authorized to do so.

2) Does Tika efficiently handle French language detection and extraction? This is a critical capability for my project.

I am currently performing a market survey based on functional and technical criteria to select the best tools that would fit my concept. So far ES, Tika and Nutch are well positioned!

I am not sure if I will have time to test the French ability in Tika. If you are able to refer me to someone or to a reference in that respect, I'll have a better degree of confidence in my recommendation.

Best Regards
