Hi Claude,

> 1) Is Nutch an appropriate solution to collect documents and their
> metadata from the file system (shared drives)?

That works if the shared drives are mounted on the machine running Nutch. The
protocol-file plugin supports crawling file:/ URLs, much like crawling a web
server with directory listings enabled.
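
To illustrate the "directory listing" idea, here is a minimal sketch in plain
Java. It is not Nutch code: the class name FileUrlListing and the mount point
/mnt/shared are made up for the example. It only shows how the entries of a
mounted directory map to file: URLs that a crawler could follow as links.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Nutch or protocol-file.
public class FileUrlListing {

    /** Returns the file: URLs of the direct children of dir, sorted. */
    public static List<String> childUrls(Path dir) throws IOException {
        List<String> urls = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                // Path.toUri() yields a file: URI for the default file system.
                urls.add(entry.toUri().toString());
            }
        }
        Collections.sort(urls);
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Assumed mount point; pass the real one as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "/mnt/shared");
        for (String url : childUrls(root)) {
            System.out.println(url);
        }
    }
}
```

Each printed URL plays the role of a link in a directory listing:
subdirectories lead to further listings, files are the documents to fetch.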

> 2) Does Nutch have the ability to collect the permissions that are set on the
> NTFS Security tab of the directory tree or on the file?
>

Not out of the box. Permissions are quite specific to the platform / operating
system.
But it would be possible to extend the protocol-file plugin so that permissions
are attached as metadata.
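
As a rough sketch of what such an extension would have to do, the JDK can read
ACL entries through java.nio.file.attribute.AclFileAttributeView. The class
AclMetadata, the method readAcl and the metadata key "acl" are made up for
this example, and wiring the result into the plugin's metadata is left out.
The view is only available where the JVM's file system provider supports ACLs
(for example NTFS on Windows); elsewhere the sketch returns an empty map.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.AclEntry;
import java.nio.file.attribute.AclFileAttributeView;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
public class AclMetadata {

    /**
     * Reads the ACL of a file or directory and returns it as metadata.
     * Returns an empty map if the file system exposes no ACL view.
     */
    public static Map<String, List<String>> readAcl(Path file) throws IOException {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        AclFileAttributeView view =
                Files.getFileAttributeView(file, AclFileAttributeView.class);
        if (view == null) {
            return metadata; // ACLs not supported on this file system
        }
        List<String> values = new ArrayList<>();
        for (AclEntry entry : view.getAcl()) {
            // One value per entry: type (ALLOW/DENY), principal, permissions.
            values.add(entry.type() + ":" + entry.principal().getName()
                    + ":" + entry.permissions());
        }
        metadata.put("acl", values); // "acl" is an assumed field name
        return metadata;
    }
}
```

Note that this only sees ACLs the way the crawling machine sees them, so it
would need to run where the shares are accessible with their NTFS permissions
intact.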

Best,
Sebastian

On 05/12/2017 08:56 PM, Claude Garceau wrote:
> Here is the scope of my project.
> 
> We want to collect content from a document management system (Nuxeo), an
> intranet (Drupal) and files from the file system (shared drives) in order to
> make it retrievable by means of a search engine. All of these sources hold
> internal information for an internal audience; this is about unstructured
> content (documents and web pages). We want to use Nutch as the crawler on
> these sources. Then Tika would extract and format the data and commit it to
> Elasticsearch (or Solr).
> 
> 1) Is Nutch an appropriate solution to collect documents and their
> metadata from the file system (shared drives)?
> 
> 2) Does Nutch have the ability to collect the permissions that are set on the
> NTFS Security tab of the directory tree or on the file?
> 
