1) Is Tika able to extract and parse the security of the document collected? 
Can it extract authorization on the file it parses? I guess Nutch can 
collect these but I have not seen evidence of that. We need this because we 
have to apply the security at the document level (not just at the index or 
repository level) because this is about content that should not be seen unless 
someone is authorized to do so. 
Is this stored within each document, or is this stored external to the 
document?  If internal, depending on the file format, we may already be 
extracting it.  If we aren’t extracting it, please open a ticket with an 
example (dummy) file, and we’ll add that.
If external to the document, no, this is not part of what Tika can do, but you 
could add it to the Metadata object before parsing the document...if that would 
be of any convenience in your workflow.
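
For the external case, here is a minimal sketch of what "add it to the Metadata 
object" could look like for files on a shared drive. It only uses java.nio to 
read the owner and (where supported) the POSIX permissions into key/value 
pairs. The class name and the key names (acl.owner, acl.posix) are made up for 
illustration, and the Tika calls are shown only as comments, since this assumes 
you would copy the pairs into a Metadata instance yourself before parsing. 
Permissions held in Nuxeo or Drupal would have to come from those systems' own 
APIs instead.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.LinkedHashMap;
import java.util.Map;

final class ExternalAclCollector {

    // Hypothetical key names; pick whatever your index mapping expects.
    static final String OWNER_KEY = "acl.owner";
    static final String PERMISSIONS_KEY = "acl.posix";

    /** Reads permissions that live outside the file content (file system level). */
    static Map<String, String> collect(Path file) throws IOException {
        Map<String, String> acl = new LinkedHashMap<>();
        acl.put(OWNER_KEY, Files.getOwner(file).getName());

        // POSIX permissions are not available on every file system (e.g. NTFS).
        FileStore store = Files.getFileStore(file);
        if (store.supportsFileAttributeView("posix")) {
            acl.put(PERMISSIONS_KEY,
                    PosixFilePermissions.toString(Files.getPosixFilePermissions(file)));
        }
        return acl;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("acl-demo", ".txt");
        try {
            Map<String, String> acl = collect(file);
            // With Tika on the classpath, the pairs would be copied over before parsing:
            //   Metadata metadata = new Metadata();
            //   acl.forEach((k, v) -> metadata.set(k, v));
            //   parser.parse(stream, handler, metadata, new ParseContext());
            acl.forEach((k, v) -> System.out.println(k + " = " + v));
        } finally {
            Files.delete(file);
        }
    }
}
```

Whatever you set this way travels with the document's other metadata, so it can 
be indexed as an ordinary field and used as a filter at query time.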

From: Claude Garceau [mailto:[email protected]]
Sent: Wednesday, May 10, 2017 3:38 PM
To: [email protected]
Subject: French Language Detection with Tika


Let me tell you about my concept, the big picture of where I want to go.
 

We want to collect content from a document management system (Nuxeo), an 
intranet (Drupal) and files from a file system (shared drives) in order to make 
it retrievable by means of a search engine. All of these sources are internal 
information for an internal audience; this is about unstructured content 
(documents and web pages). We want to use Nutch as the crawler on these sources. 
Then Tika would extract and format the data and commit it to Elasticsearch (or 
Solr). We then index all of the content in Elasticsearch and make it available 
through a web application, probably an SPA built on Angular 2. We would make API 
calls from this SPA to Elasticsearch to formulate queries and get results.
 

Now I have 2 questions:  

1) Is Tika able to extract and parse the security of the document collected? 
Can it extract authorization on the file it parses? I guess Nutch can collect 
these but I have not seen evidence of that. We need this because we have to 
apply the security at the document level (not just at the index or repository 
level) because this is about content that should not be seen unless someone is 
authorized to do so. 

2) Does Tika handle French language detection and extraction efficiently? 
This is a critical capability for my project.

I am currently performing a market survey based on functional and technical 
criteria to select the best tools to fit my concept. So far ES, Tika 
and Nutch are well positioned! I am not sure if I will have time to test the 
French ability in Tika. If you are able to refer me to someone or to a reference 
on that subject, I'll have a better degree of confidence in my 
recommendation.

Best Regards
