On 25 February 2011 11:40, kaola wrote:

> Hello everyone,
>
> I'm currently thinking of using Nutch in a new website project.
> My aim is to index files (HTML, TXT, PDF ...) stored on a filesystem
> (which Nutch can do), but some of the files may have meta-information
> stored in a separate file.
> A web user may then search the index containing those files.
>
> For example, the "technical_documentation.pdf" file may have a
> "technical_documentation.xml" file linked to it (for example in the same
> folder), this XML containing information like
> "<type>documentation</type>" and so on.
>
> Is there any way to achieve this using Nutch? Is it able to combine
> information/content from two files into a single searchable item? Or
> maybe I'm not choosing the right tool to achieve this?
You can put the metadata for each URL in the seed file, e.g.

http://slashdot.org/	blawg_corp=Geeknet
http://geek.com/	blawg_corp=Geeknet
http://engadget.com/	blawg_corp=Weblogs
http://gizmodo.com/	blawg_corp=Gawker

then configure the urlmeta plugin. See
https://issues.apache.org/jira/browse/NUTCH-855 for more details.

Julien

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
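For the sidecar-XML case in the question, the seed file could be generated rather than written by hand. The following is only a rough sketch in Python, not anything shipped with Nutch. It assumes each sidecar is a flat XML file whose root element has simple children such as <type>documentation</type>, that it has the same base name as the document, and that the injector reads seed lines as a URL followed by tab-separated key=value pairs. The function names (sidecar_metadata, seed_lines) and the file:// URL form are made up for illustration and may need adjusting for your protocol-file setup.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def sidecar_metadata(xml_path):
    """Return {tag: text} for the direct children of the sidecar's root element.

    Assumption: the sidecar is flat, e.g. <meta><type>documentation</type></meta>.
    """
    root = ET.parse(xml_path).getroot()
    meta = {}
    for child in root:
        text = (child.text or "").strip()
        # Skip empty values and values with tabs/newlines, which would break the seed line.
        if text and "\t" not in text and "\n" not in text:
            meta[child.tag] = text
    return meta


def seed_lines(root_dir):
    """Yield one seed line per document found under root_dir.

    Documents with a sibling <name>.xml get its fields appended as
    tab-separated key=value pairs; the .xml files themselves are not listed.
    """
    for doc in sorted(Path(root_dir).rglob("*")):
        if not doc.is_file() or doc.suffix.lower() == ".xml":
            continue
        fields = [doc.resolve().as_uri()]  # hypothetical choice: file:// URL of the document
        sidecar = doc.with_suffix(".xml")
        if sidecar.is_file():
            fields += [f"{key}={value}" for key, value in sidecar_metadata(sidecar).items()]
        yield "\t".join(fields)
```

Writing "\n".join(seed_lines("/path/to/docs")) to a file in the seed directory would give one line per document. The keys that appear there (here "type") would then presumably need to be listed in the urlmeta plugin's configuration so they are carried through to the index; check NUTCH-855 for the exact property.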

