Let me tell you about my concept, the big picture of where I want to go.

We want to collect content from a document management system (Nuxeo), an
intranet (Drupal) and files from the file system (shared drives) so that it
can be retrieved through a search engine. All of these sources hold
internal information for an internal audience, and all of it is
unstructured content (documents and web pages). We want to use Nutch as the
crawler on these sources. Tika would then extract and format the data and
commit it to Elasticsearch (or Solr). We would index all of the content in
Elasticsearch and make it available through a web application, probably an
SPA built on Angular 2. The SPA would make API calls to Elasticsearch to
formulate queries and get results.
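
To make the intended flow concrete, here is a minimal TypeScript sketch of
the kind of call the SPA would make to the Elasticsearch `_search` endpoint.
Nothing is built yet: the index name `content`, the fields `title` and
`body`, and the function names are assumptions for illustration only.

```typescript
// Hypothetical shape of the request body; "title" and "body" are assumed
// field names, not fields produced by any existing Nutch/Tika setup.
interface SearchRequest {
  query: { multi_match: { query: string; fields: string[] } };
  size: number;
}

// Builds a full-text query over the assumed fields.
function buildSearchRequest(userInput: string, size: number = 10): SearchRequest {
  return {
    query: { multi_match: { query: userInput, fields: ["title", "body"] } },
    size,
  };
}

// Sends the query to the assumed "content" index and returns the raw
// JSON response from Elasticsearch.
async function search(baseUrl: string, userInput: string): Promise<unknown> {
  const response = await fetch(`${baseUrl}/content/_search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildSearchRequest(userInput)),
  });
  if (!response.ok) {
    throw new Error(`Search failed: HTTP ${response.status}`);
  }
  return response.json();
}
```

For example, `search("http://localhost:9200", "rapport annuel")` would post
a `multi_match` query for "rapport annuel" (the host is also an assumption).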

Now I have 2 questions:

1) Is Tika able to extract and parse the security information of the
documents it collects? Can it extract the authorizations on the files it
parses? I guess Nutch can collect these, but I have not seen evidence of
that. We need this because we have to apply security at the document level
(not just at the index or repository level): this content should not be
seen by anyone who is not authorized to see it.
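
To illustrate what I mean by document-level security, here is a rough
TypeScript sketch of how a query could be restricted, assuming the
authorizations end up in the index. The field name `allowed_principals` and
the function name are hypothetical; whether Nutch or Tika can populate such
a field is exactly what I am asking.

```typescript
type Query = Record<string, unknown>;

// Wraps any Elasticsearch query in a bool query whose filter clause only
// keeps documents listing at least one of the caller's principals.
// "allowed_principals" is an assumed keyword field holding user and group
// identifiers collected at crawl time.
function withAclFilter(query: Query, principals: string[]): Query {
  return {
    bool: {
      must: [query],
      filter: [{ terms: { allowed_principals: principals } }],
    },
  };
}
```

A filter like this would presumably have to be added by a trusted
server-side component rather than by the SPA itself, since a browser client
could simply leave it out.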

2) Does Tika handle French language detection and extraction efficiently?
This is a critical capability for my project.

I am currently performing a market survey based on functional and technical
criteria to select the tools that best fit my concept. So far ES, Tika and
Nutch are well positioned! I am not sure I will have time to test Tika's
French capabilities. If you are able to refer me to someone or to a
reference on that subject, I will have a better degree of confidence in my
recommendation.

Best Regards
