Hi Claude,

I can’t speak to your first question, but I’ve been involved in language 
detection.

By “efficiency” I assume you mean false positive and false negative rates for 
French documents, yes?

Tika uses the “language-detector” project, which has a false negative rate 
of about 0.2% and a false positive rate of about 0.001%.

But this is on a clean EU dataset for 17 languages. If the document text is 
short, or contains multiple languages, or has markup left in it, then the 
results will be worse.
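
As a rough illustration of what those two rates mean, here is a minimal 
sketch. The function name and the sample data are made up for this example 
and are not taken from the language-detector project, and the definitions 
(misses over all French documents, false alarms over all non-French 
documents) are my assumption about how such rates are usually computed:

```python
def error_rates(pairs, target="fr"):
    """Return (false_negative_rate, false_positive_rate) for `target`.

    pairs: iterable of (true_language, detected_language) codes.
    Hypothetical helper for illustration only.
    """
    fn = fp = positives = negatives = 0
    for truth, detected in pairs:
        if truth == target:
            positives += 1
            if detected != target:
                fn += 1  # a French document that was missed
        else:
            negatives += 1
            if detected == target:
                fp += 1  # a non-French document labeled as French
    fn_rate = fn / positives if positives else 0.0
    fp_rate = fp / negatives if negatives else 0.0
    return fn_rate, fp_rate


# Made-up sample, not real measurements.
sample = [("fr", "fr"), ("fr", "fr"), ("fr", "fr"), ("fr", "it"),
          ("en", "en"), ("en", "fr"), ("de", "de"), ("es", "es")]
fn_rate, fp_rate = error_rates(sample)
print(f"false negatives: {fn_rate:.1%}, false positives: {fp_rate:.1%}")
```

Running something like this over a labeled sample of your own documents 
would tell you more than the published numbers, since your content is 
unlikely to be as clean as the EU dataset.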

— Ken

> On May 10, 2017, at 12:38pm, Claude Garceau <[email protected]> 
> wrote:
> 
> Let me tell you about my concept, the big picture of where I want to 
>  
> 
> We want to collect content from a document management system (Nuxeo), an 
> intranet (Drupal) and files from a file system (shared drives) so that it 
> can be retrieved through a search engine. All of these sources hold internal 
> information for an internal audience; this is about unstructured content 
> (documents and web pages). We want to use Nutch as the crawler on these 
> sources. Tika would then extract and format the data and commit it to 
> Elasticsearch (or Solr). We would then index all of the content in 
> Elasticsearch and make it available through a web application, probably an 
> SPA built on Angular 2. We would make API calls from this SPA to 
> Elasticsearch to formulate queries and get results.
>  
> 
> Now I have 2 questions:  
> 
> 1) Is Tika able to extract and parse the security information of the 
> documents collected? Can it extract the authorizations on the files it 
> parses? I guess Nutch can collect these, but I have not seen evidence of 
> that. We need this because we have to apply security at the document level 
> (not just at the index or repository level): this content should not be 
> seen unless someone is authorized to see it. 
> 
> 2) Does Tika handle French language detection and extraction efficiently? 
> This is a critical capability for my project.
> 
> I am currently performing a market survey based on functional and technical 
> criteria to select the best tools to fit my concept. So far ES, Tika 
> and Nutch are well positioned! I am not sure I will have time to test 
> Tika's French capabilities. If you can refer me to someone or to a 
> reference on that subject, I'll have a better degree of confidence in 
> my recommendation. 
> 
> Best Regards
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr


