Hi Claude,

> 1) Is Nutch an appropriate solution to collect documents and their
> metadata from file system (shared drives) ?

That works if the shared drives have a mount point. The plugin
protocol-file supports crawling file:/ URLs, in a way similar to
a web server with directory listings enabled.

> 2) Does Nutch have the ability to collect the permissions that are set on
> the NTFS Security tab of the directory tree or on the file ?

Not out of the box. Permissions are quite specific to platforms /
operating systems. But it would be possible to extend the
protocol-file plugin so that permissions are attached as metadata.

Best,
Sebastian

On 05/12/2017 08:56 PM, Claude Garceau wrote:
> Here is the scope of my project.
>
> We want to collect content from a document management system (Nuxeo), an
> intranet (Drupal) and files from file system (shared drives) in order to
> be retrievable by means of a search engine. All of these sources are
> internal information for internal audience, this is about unstructured
> content (documents and web pages). We want to use Nutch as the crawler on
> these sources. Then Tika would extract and format the data and commit to
> Elasticsearch (or SolR).
>
> 1) Is Nutch an appropriate solution to collect documents and their
> metadata from file system (shared drives) ?
>
> 2) Does Nutch have the ability to collect the permissions that are set on
> the NTFS Security tab of the directory tree or on the file ?
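
To illustrate the point in the reply about attaching permissions as
metadata: the sketch below is not Nutch code, only plain Java (java.nio.file)
showing how the owner and, where the file system supports it, the ACL
entries of a file could be read and flattened into key/value pairs. The class
name FilePermissionSketch, the method permissionMetadata and the keys
"file.owner" / "file.acl" are made up for this example. How such values would
be wired into protocol-file is left open here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.AclEntry;
import java.nio.file.attribute.AclFileAttributeView;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FilePermissionSketch {

    /**
     * Collects the owner and, if available, the ACL entries of a file as flat
     * key/value metadata. The key names are hypothetical, not Nutch fields.
     */
    static Map<String, String> permissionMetadata(Path file) throws IOException {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("file.owner", Files.getOwner(file).getName());

        // The ACL view is typically available for NTFS on Windows;
        // getFileAttributeView returns null where it is not supported.
        AclFileAttributeView aclView =
                Files.getFileAttributeView(file, AclFileAttributeView.class);
        if (aclView != null) {
            StringJoiner acl = new StringJoiner(";");
            for (AclEntry entry : aclView.getAcl()) {
                // One item per entry: ALLOW/DENY, principal name, permission set
                acl.add(entry.type() + ":" + entry.principal().getName()
                        + ":" + entry.permissions());
            }
            meta.put("file.acl", acl.toString());
        }
        return meta;
    }

    public static void main(String[] args) throws IOException {
        // Path to inspect; defaults to the current directory
        Path path = Paths.get(args.length > 0 ? args[0] : ".");
        permissionMetadata(path).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

On a mounted share that does not expose ACLs to Java, only the owner entry
is produced, which is one reason permission handling stays platform specific.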

