Yes, that is correct. To do that, you could modify the parser to store the
content of special tags in a separate field and give that field a higher
boost.
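
To make the idea concrete, here is a minimal, self-contained sketch of the
extraction step only. It is not Nutch code: SpecialTagExtractor is a
hypothetical helper, the tag name is whatever you consider "special", and the
regex is a stand-in for the DOM walk a real parse plugin would do. The
returned strings are what you would put into the separate, higher-boosted
field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): collects the text inside one tag type. */
final class SpecialTagExtractor {
    private final Pattern pattern;

    SpecialTagExtractor(String tagName) {
        // Matches <tag ...>text</tag>, case-insensitively and across line breaks.
        String t = Pattern.quote(tagName);
        this.pattern = Pattern.compile(
            "<" + t + "(?:\\s[^>]*)?>(.*?)</" + t + "\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    /** Returns the inner text of every matching tag, with nested markup stripped. */
    List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            String text = m.group(1)
                .replaceAll("<[^>]+>", " ")   // drop nested tags such as <b>
                .replaceAll("\\s+", " ")      // collapse runs of whitespace
                .trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }
}
```

For example, new SpecialTagExtractor("h1").extract(html) would give you the
heading texts of a page. How the boost itself is applied depends on your
Nutch and Lucene versions, so I have left that part out.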
best regards,
Magnus
On Thu, Jul 9, 2009 at 3:30 PM, Joel Halbert wrote:
> Hi, Would I be correct in thinking that Nutch, when indexin
Hi,
You can have Nutch crawl and index pretty much everything; for specific
protocols and formats you only need to write custom protocol, parse and
maybe even indexing plugins.
The protocol plugin takes care of accessing the content. The parse plugin
takes care of parsing the content, extracting
Actually, it's quite easy to modify the parse-html filter to do this,
that is, to save the HTML to a file or to some database. You could then
configure it to skip all unnecessary plugins. I think it depends a lot on
the other requirements you have whether using Nutch for this task is the
right way to
Hi,
I want Nutch to index only some of the documents that it crawls. I have
tried what is suggested here:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg11649.html
That is, in an IndexingFilter I check whether the document should be
indexed, and if not I return null.
When I th
Hi,
I am getting the following exception when indexing (right after adding
segments):
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /home/user/nutch/crawl/indexes already exists
    at org.apache.hadoop.mapred.OutputFormatBase.checkOutputSpecs(
Hi,
I am interested in hearing more about this. I have one and a half years of
experience with Nutch and Lucene and seven years of experience with Java in
total.
best regards,
Magnus
2010/1/6 SC Interactive Global Media SRL
> Happy New Year to all Developers.
>
> We are looking for nutch developers wit
Hi,
This is actually very easy: just create an indexing plugin, analyse the URL
format, and return null from the indexing plugin if you don't want to index
it.
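
A rough, self-contained sketch of the decision logic, in case it helps. It
does not use the Nutch API: UrlIndexPolicy, the /articles/ rule and the
generic filter method are all hypothetical stand-ins. In a real plugin the
same check would sit inside your IndexingFilter, which would return the
document to keep it and null to skip it.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical policy class (not part of Nutch): decides by URL what gets indexed. */
final class UrlIndexPolicy {

    /** Assumed example rule: index only URLs whose path starts with /articles/. */
    static boolean shouldIndex(String url) {
        try {
            String path = new URI(url).getPath();
            return path != null && path.startsWith("/articles/");
        } catch (URISyntaxException e) {
            // A URL that cannot be parsed is skipped rather than indexed.
            return false;
        }
    }

    /** Stand-in for the filter method: returns the document to keep it, null to skip it. */
    static <D> D filter(D doc, String url) {
        return shouldIndex(url) ? doc : null;
    }
}
```

Replace the rule in shouldIndex with whatever pattern distinguishes the pages
you want from the ones you don't.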
best regards,
Magnus
On Wed, Feb 24, 2010 at 6:09 PM, Steven Wichers wrote:
> On some of the sites I want to index with nutch, there
Hi,
I am getting the following exception when I try to open a Nutch 1.0 (I am
using the official release) index with Luke (0.9.9.1):
java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
    at org.apache.lucene.store.Buff
9:20 PM, Andrzej Bialecki wrote:
> On 2010-04-01 21:09, Magnús Skúlason wrote:
> > Hi,
> >
> > I am getting the following exception when I try to open a nutch 1.0 (I am
> > using the official release) index with Luke (0.9.9.1)
> >
> > java.io.IO