On 25 February 2011 11:40, kaola <[email protected]> wrote:

>
> Hello everyone,
>
> I'm currently thinking of using Nutch in a new website project.
> My aim is to index files (HTML, TXT, PDF ...) stored on a filesystem (which
> Nutch can do), but some of the files may have meta-information stored in a
> separate file.
> Then, a web user may search the index containing those files.
>
> For example, the "technical_documentation.pdf" file may have a
> "technical_documentation.xml" file linked to it (for example, in the same
> folder), with this XML containing information such as
> "<type>documentation</type>" and so on.
>
> Is there any way to achieve this using Nutch? Is it able to combine
> information/content from two files into a single searchable item? Or maybe
> I'm not choosing the right tool to achieve this?
>

You can put the metadata for each URL in the seed file as tab-separated
key=value pairs, e.g.

http://slashdot.org/	blawg_corp=Geeknet
http://geek.com/	blawg_corp=Geeknet
http://engadget.com/	blawg_corp=Weblogs
http://gizmodo.com/	blawg_corp=Gawker

then configure the urlmeta plugin.

See https://issues.apache.org/jira/browse/NUTCH-855 for more details
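
For the sidecar XML case in the question, the seed file could be generated by a small script. The following is only a minimal sketch, not part of Nutch: the function names `sidecar_metadata` and `seed_lines` are made up for illustration, and it assumes that each sidecar sits next to its document with the same base name and a .xml extension, that the metadata are the direct children of the XML root element, and that file: URLs are acceptable to your protocol plugin.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def sidecar_metadata(xml_path):
    """Return {tag: text} for the direct children of the sidecar's root element.

    Assumption: metadata look like <type>documentation</type> directly under the root.
    """
    root = ET.parse(xml_path).getroot()
    meta = {}
    for child in root:
        # Collapse tabs/newlines so a value cannot break the tab-separated seed line.
        value = " ".join((child.text or "").split())
        if value:
            meta[child.tag] = value
    return meta


def seed_lines(doc_root):
    """Yield one seed line per document: URL, then tab-separated key=value pairs."""
    for path in sorted(Path(doc_root).rglob("*")):
        # Sidecar files are metadata carriers, not documents to be indexed.
        if not path.is_file() or path.suffix.lower() == ".xml":
            continue
        fields = [path.resolve().as_uri()]
        sidecar = path.with_suffix(".xml")  # hypothetical naming convention
        if sidecar.is_file():
            fields += [f"{key}={value}" for key, value in sidecar_metadata(sidecar).items()]
        yield "\t".join(fields)


if __name__ == "__main__":
    for line in seed_lines(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(line)
```

Redirect the output into a file in your seed directory and inject it as usual. The metadata keys (here "type") would then also need to be listed in the urlmeta configuration; as far as I recall that is the urlmeta.tags property, but please check it against your nutch-default.xml.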

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
