Re: Index weightings of different types of text node...h1, h2 anchor etc..

2009-07-09 Thread Magnús Skúlason
Yes, that is correct. To do that, you could modify the parser to store the content of special tags in a separate field and give that field a higher boost. Best regards, Magnus On Thu, Jul 9, 2009 at 3:30 PM, Joel Halbert wrote: > Hi, Would I be correct in thinking that Nutch, when indexin
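
The idea in the reply above (a separate, more heavily boosted field for special tags) can be sketched in plain Java. This is only an illustration under stated assumptions: the class name, the regex-based heading extraction, the boost values and the toy term-count score are all hypothetical, and none of it is Nutch's parse-html plugin or Lucene's actual scoring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Nutch parser or Lucene scoring.
class HeadingFieldSketch {
    // Matches <h1>...</h1> and <h2>...</h2>, with optional attributes.
    private static final Pattern HEADING =
        Pattern.compile("<h([12])[^>]*>(.*?)</h\\1>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Assumed boost values, chosen only for the example.
    static final float HEADINGS_BOOST = 3.0f;
    static final float CONTENT_BOOST = 1.0f;

    // Text of the special tags, destined for the separate "headings" field.
    static String extractHeadings(String html) {
        List<String> parts = new ArrayList<>();
        Matcher m = HEADING.matcher(html);
        while (m.find()) {
            parts.add(TAG.matcher(m.group(2)).replaceAll("").trim());
        }
        return String.join(" ", parts);
    }

    // All visible text, destined for the ordinary "content" field.
    static String extractContent(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Case-insensitive count of whole-word occurrences of term in text.
    static int count(String text, String term) {
        int n = 0;
        for (String tok : text.toLowerCase().split("\\W+")) {
            if (tok.equals(term.toLowerCase())) {
                n++;
            }
        }
        return n;
    }

    // Toy score: per-field term counts weighted by the field boosts.
    static float score(String html, String term) {
        return HEADINGS_BOOST * count(extractHeadings(html), term)
             + CONTENT_BOOST * count(extractContent(html), term);
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Nutch Guide</h1>"
                    + "<p>Nutch crawls pages.</p></body></html>";
        System.out.println(extractHeadings(html));
        System.out.println(score(html, "nutch"));
    }
}
```

With a real index, the same effect would come from setting the boost on the extra field at index or query time rather than from a hand-written score.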

Re: Using Nutch to crawl PubMed

2009-07-21 Thread Magnús Skúlason
Hi, You can have Nutch crawl and index pretty much everything. For specific protocols and formats you only need to write custom protocol, parse and maybe even indexing plugins. The protocol plugin takes care of accessing the content. The parse plugin takes care of parsing the content, extracting

Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread Magnús Skúlason
Actually, it's quite easy to modify the parse-html filter to do this, that is, to save the HTML to a file or to some database. You could then configure it to skip all unnecessary plugins. I think it depends a lot on the other requirements you have whether using Nutch for this task is the right way to

Only indexing pages meeting certain criteria

2009-10-08 Thread Magnús Skúlason
Hi, I want Nutch to index only some of the documents that it crawls. I have tried what is suggested here: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg11649.html That is, in an IndexingFilter I check whether the document should be indexed, and if not I return null. When I th

Nutch indexer failing

2009-10-18 Thread Magnús Skúlason
Hi, I am getting the following exception when indexing (right after adding segments): Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /home/user/nutch/crawl/indexes already exists at org.apache.hadoop.mapred.OutputFormatBase.checkOutputSpecs(

Re: Nutch Developers needed for a new Search engine

2010-01-12 Thread Magnús Skúlason
Hi, I am interested in hearing more about this. I have one and a half years of experience with Nutch and Lucene and 7 years of experience with Java in total. Best regards, Magnus 2010/1/6 SC Interactive Global Media SRL > Happy New Year to all Developers. > > We are looking for nutch developers wit

Re: Crawling site, but only indexing certain pages

2010-02-24 Thread Magnús Skúlason
Hi, This is actually very easy: just create an indexing plugin, analyse the URL format, and return null from the indexing plugin if you don't want to index the page. Best regards, Magnus On Wed, Feb 24, 2010 at 6:09 PM, Steven Wichers wrote: > On some of the sites I want to index with nutch, there
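
The approach described in this reply (inspect the URL, return null to skip the page) might look roughly like the following. It is a minimal sketch in plain Java, not the real Nutch plugin API: the class name, the `Map`-based document stand-in and the URL pattern are assumptions made for illustration. In an actual plugin the same check would sit inside the indexing filter's own method.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for an indexing plugin; not the real Nutch API.
class UrlIndexGate {
    // Assumed URL format for pages worth indexing (illustrative only).
    private static final Pattern INDEXABLE =
        Pattern.compile("^https?://[^/]+/articles/\\d+$");

    // Returns the document unchanged when the URL matches,
    // or null to signal that the page should not be indexed.
    static Map<String, String> filter(Map<String, String> doc, String url) {
        if (url == null || !INDEXABLE.matcher(url).matches()) {
            return null;
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Example");
        // An article page is kept; a category page is dropped.
        System.out.println(filter(doc, "http://example.com/articles/42") != null);
        System.out.println(filter(doc, "http://example.com/category/news") != null);
    }
}
```

The crawl itself is unaffected by this check: pages that fail it are still fetched and their links followed, they are only left out of the index.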

Can't open a nutch 1.0 index with luke

2010-04-01 Thread Magnús Skúlason
Hi, I am getting the following exception when I try to open a nutch 1.0 (I am using the official release) index with Luke (0.9.9.1) java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151) at org.apache.lucene.store.Buff

Re: Can't open a nutch 1.0 index with luke

2010-04-01 Thread Magnús Skúlason
9:20 PM, Andrzej Bialecki wrote: > On 2010-04-01 21:09, Magnús Skúlason wrote: > > Hi, > > > > I am getting the following exception when I try to open a nutch 1.0 (I am > > using the official release) index with Luke (0.9.9.1) > > > > java.io.IO