update crawldb

2006-12-19 Thread Aïcha
Hello, I use the Prune tool to remove documents from segment indexes, but it does not remove pages and links from the WebDB. To keep the unwanted URLs from reappearing when new segments are created, it is advised to use our own net.nutch.net.URLFilter, or PruneDBTool (under construction...

Re: subcollections IT DOESN'T WORK!

2006-12-19 Thread kauu
Hi, I'm new to Nutch. I want to know what the subcollection plugin is used for. Where is the introduction? On 12/19/06, liv <[EMAIL PROTECTED]> wrote: I may be losing all and every credit ... it's still in the same state - reindex doesn't change the subcollection field! I did a REFET

Re: subcollections IT DOESN'T WORK!

2006-12-19 Thread liv
Look here: http://issues.apache.org/jira/browse/NUTCH-201?page=all Unfortunately it doesn't work as expected... yet. kauu wrote: > > Hi, I'm new to Nutch. I want to know what the subcollection plugin is used for. > Where is the introduction? >

Re: subcollections

2006-12-19 Thread liv
I checked the patch for subcollections (http://issues.apache.org/jira/browse/NUTCH-201), although I assumed it was included in the latest public release, 0.8.1. Compared to the current source code, it looks like the patch has an extra file (which doesn't exist in version 0.8.1) src/plugin/subcollect

How best to add "sponsored link" support..??

2006-12-19 Thread RP
Hi all, I've been tasked with looking into this and am not a coder. That said, Nutch is doing great, and the bean counters have asked me to look into adding sponsored link results. I'm wondering how best to add this. It would be nice to utilize the Nutch engine to come up with the pages

Re: How best to add "sponsored link" support..??

2006-12-19 Thread Jim Wilson
You may want to consider letting a third-party handle your sponsored links, unless of course you already have an infrastructure for handling everything you already mentioned as well as the following: * Advertiser registration * Advertiser purchase of keywords/page space * Calculation of impressio

Re: How best to add "sponsored link" support..??

2006-12-19 Thread Sean Dean
I might be totally off base with what you're asking to do, but take a look at this open source project: http://phpadsnew.com/two/. It's basically an advertising engine, built on PHP. Integration within any application is a breeze, and it supports external advertising such as Google Ads. Sean --

Re: How best to add "sponsored link" support..??

2006-12-19 Thread RP
Let me qualify this - ad banner rotation is dealt with - I'm looking for something that will use our Nutch engine to serve up relevant links from people who pay for that privilege. We do not want to serve up ads from someone else's system, i.e. the big G or Y, but use our own Nutch search resu

Re: How best to add "sponsored link" support..??

2006-12-19 Thread Sami Siren
Are you looking for something like the Google KeyMatch as described in [1], which was then more or less mimicked in the Nutch web2 module [2], and since also, at least as a lookalike, released in Google Code [3]? -- Sami Siren [1] http://www.google.com/enterprise/mini/end_user_features.html [2] http://svn.

Re: How best to add "sponsored link" support..??

2006-12-19 Thread RP
Thanks Sami, this is closer from an initial look - does this do anything on the backend (i.e. defining the data flags so we can get a match) as well, or do we need to build that..?? Sami Siren wrote: Are you looking for something like the google keymatch as described in [1] which was then mo
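
For readers following this thread, the "backend" in question can be as small as a table from paid keywords to links, consulted with the same query string that goes to the search engine. The following is a minimal sketch under that assumption. It is not the web2 keymatch code and it uses no Nutch API; the class name SponsoredLinks, its Link type and the example URL are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical keyword-to-sponsored-link table; not part of Nutch. */
class SponsoredLinks {
    /** One paid entry: the title to display and the URL it points to. */
    static final class Link {
        final String title;
        final String url;

        Link(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    private final Map<String, List<Link>> byKeyword = new HashMap<>();

    /** Registers a link under a keyword; keywords are stored lower-cased. */
    void add(String keyword, Link link) {
        String key = keyword.toLowerCase(Locale.ROOT);
        byKeyword.computeIfAbsent(key, k -> new ArrayList<>()).add(link);
    }

    /** Returns the links whose keyword equals any whitespace-separated query term. */
    List<Link> match(String query) {
        String[] terms = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        // A set keeps a link from being listed twice when a term is repeated.
        Set<Link> hits = new LinkedHashSet<>();
        for (String term : terms) {
            List<Link> links = byKeyword.get(term);
            if (links != null) {
                hits.addAll(links);
            }
        }
        return new ArrayList<>(hits);
    }
}
```

The search page would call match() with the user's query and render any hits above the regular results. Advertiser registration, billing and impression counting, as listed earlier in the thread, are outside this sketch.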

Re: large number of urls from Generator are not fetched?

2006-12-19 Thread Dennis Kubes
For anyone searching this thread in the future: one possible cause of this is that the Hadoop nodes are not time-synchronized with NTP or something similar. For example, if one or more of the slave nodes is a few minutes ahead of the others and an inject job is run on one of those nodes (and
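
As an illustration of the check this suggests, the sketch below flags nodes whose reported clock differs from a reference time by more than a tolerance. It assumes the clock readings have already been collected by some other means (how to gather them is not covered here), and the class and method names are hypothetical rather than anything from Hadoop or Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for spotting nodes with drifting clocks; not a Hadoop API. */
class ClockSkewCheck {
    /**
     * Returns, in name order, the nodes whose reported clock (epoch milliseconds)
     * differs from referenceMillis by more than toleranceMillis.
     */
    static List<String> skewedNodes(Map<String, Long> nodeClockMillis,
                                    long referenceMillis,
                                    long toleranceMillis) {
        List<String> skewed = new ArrayList<>();
        // TreeMap gives a stable, sorted iteration order for the report.
        for (Map.Entry<String, Long> e : new TreeMap<>(nodeClockMillis).entrySet()) {
            long drift = Math.abs(e.getValue() - referenceMillis);
            if (drift > toleranceMillis) {
                skewed.add(e.getKey());
            }
        }
        return skewed;
    }
}
```

Any node this reports is a candidate for NTP synchronization before the crawl jobs are rerun.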

Need help with deleteduplicates

2006-12-19 Thread sdeck
Hello, I am running Nutch 0.8 against Hadoop 0.4, just for reference. I want to add a delete-duplicates step based on a similarity algorithm, as opposed to the hash method that is currently in there. I would have to say I am pretty lost as to how the delete duplicates class is working. I would guess that
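
One common way to compare documents by similarity rather than by exact hash is to break each text into overlapping word sequences (shingles) and measure the Jaccard overlap of the two sets. The sketch below shows only that comparison, as a standalone illustration. It is not how Nutch's delete-duplicates job works, it does not hook into Nutch or Hadoop, and the class name, shingle size and threshold are assumptions chosen for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical near-duplicate test using word shingles and Jaccard similarity. */
class NearDuplicate {
    /** Builds the set of k-word shingles of a text (lower-cased, whitespace-tokenized). */
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Set<String> result = new HashSet<>();
        if (words.length < k) {
            // Text shorter than one shingle: treat the whole text as a single shingle.
            result.add(String.join(" ", words));
            return result;
        }
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    /** Jaccard similarity: size of the intersection divided by size of the union. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty sets are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    /** True when the shingle sets of the two texts overlap at or above the threshold. */
    static boolean isNearDuplicate(String x, String y, int k, double threshold) {
        return jaccard(shingles(x, k), shingles(y, k)) >= threshold;
    }
}
```

Comparing every pair of documents this way does not scale to a large index, so a real job would need some way to limit which documents are compared with each other; that part is left open here.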