update crawldb

2006-12-19 Thread Aïcha
hello,

I use the Prune tool to remove documents from segment indexes, but it does not
remove the pages and links from the WebDB.
To keep the unwanted URLs from reappearing when new segments are created, it
is advised to use your own net.nutch.net.URLFilter, or PruneDBTool (under
construction...).

PruneDBTool still seems to be under construction, and I don't understand how
to use my own net.nutch.net.URLFilter.
Could somebody please help me?

Thanks in advance.
Aïcha
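
For readers with the same question: a URL filter in Nutch is, as far as I know, a plugin class with a single method that returns the URL to keep it or null to drop it. Below is a minimal sketch of that idea. The UrlFilter interface is a local stand-in so the example compiles without Nutch on the classpath, and the class name and blocked prefix are hypothetical; a real plugin would implement Nutch's own URLFilter interface and be registered as a plugin.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for the shape of Nutch's URLFilter interface (an assumption
// made so that this sketch is self-contained).
interface UrlFilter {
    // Return the URL to keep it, or null to reject it.
    String filter(String urlString);
}

// Hypothetical filter that rejects every URL starting with an unwanted prefix.
final class PrefixBlockFilter implements UrlFilter {
    private final List<String> blockedPrefixes;

    PrefixBlockFilter(List<String> blockedPrefixes) {
        this.blockedPrefixes = new ArrayList<>(blockedPrefixes);
    }

    @Override
    public String filter(String urlString) {
        if (urlString == null) {
            return null;
        }
        for (String prefix : blockedPrefixes) {
            if (urlString.startsWith(prefix)) {
                return null; // rejected: the URL never reaches new segments
            }
        }
        return urlString; // accepted unchanged
    }
}
```

If the unwanted URLs follow a simple pattern, adding an exclusion line to the regular-expression URL filter configuration may be enough, with no custom class at all.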







Re: subcollections IT DOESN'T WORK!

2006-12-19 Thread kauu

Hi, I'm new to Nutch. What is the subcollection plugin used for,
and where can I find an introduction to it?


On 12/19/06, liv [EMAIL PROTECTED] wrote:

I may be losing any credit I had left... it's still in the same state:
reindexing doesn't change the subcollection field!

I did a REFETCH by mistake (before the reindex) and was happy to notice that
the subcollections had changed, but I assumed that was due to the reindex
alone.

However, I want to REINDEX only, and the subcollection field does not seem to
change in response to changes in the subcollections.xml file.

Any help in debugging would be greatly appreciated; I'm not well enough
acquainted with Java to pursue this by myself.

thanks


liv wrote:

 I have no idea why this happened; probably Luke was not re-reading the
 indexes? Very strange!

 Anyway, it works as it should: after a reindex the subcollection field is
 populated with the latest data.

 Please excuse my insistence and my clumsiness, and thanks for your
 answers.



 liv wrote:

 Unfortunately my Java knowledge is too poor to debug this one. However, I
 doubt that the subcollections.xml file inside nutch-xxx.job is the one
 being used, because the nutch-xxx.job file is old: it dates from the day
 I installed Nutch.


 Sami Siren-2 wrote:

 liv wrote:
 - I reindex the db: delete the indexes folder, then run the command:

 bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

 - then I inspect the resulting db with Luke again

 Unfortunately nothing has changed. Maybe I am missing something... Please
 tell me if you see anything wrong.

 If you did exactly those steps, then subcollections.xml is being read from
 inside the .job file. You need to rebuild the .job file to put the new
 file inside it.

 Simply run ant and rerun the indexing, and it should work as expected.

 --
  Sami Siren
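
To confirm which copy of the file the indexing job actually sees, one option is to read the entry straight out of the .job archive and compare it with the edited copy under conf/. The sketch below assumes the .job file is an ordinary zip/jar archive and that the entry sits at the archive root under the name subcollections.xml; the class name and any paths passed in are hypothetical and may differ in a given installation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class JobFileCheck {
    // Reads one entry from a zip-format archive; returns null if it is absent.
    static byte[] readEntry(Path archive, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray();
            }
        }
    }

    // True when the packaged entry has exactly the same bytes as the file on disk.
    static boolean sameAsPackaged(Path archive, String entryName, Path confFile)
            throws IOException {
        byte[] packaged = readEntry(archive, entryName);
        return packaged != null && Arrays.equals(packaged, Files.readAllBytes(confFile));
    }
}
```

If sameAsPackaged returns false for the job file and the edited configuration file, the archive still holds the old copy and needs to be rebuilt before reindexing.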








--
View this message in context:
http://www.nabble.com/subcollections-tf2821188.html#a7935139
Sent from the Nutch - User mailing list archive at Nabble.com.





--
www.babatu.com


Re: subcollections IT DOESN'T WORK!

2006-12-19 Thread liv

look here:
http://issues.apache.org/jira/browse/NUTCH-201?page=all

unfortunately it doesn't work as expected... yet


kauu wrote:
 
 hi ,  i'm new to nutch ,i want to know what's the useness of the
 subcollection plugin?
 where is the introduction?
 



Re: subcollections

2006-12-19 Thread liv

I checked the patch for subcollections
(http://issues.apache.org/jira/browse/NUTCH-201), although I had assumed it
was included in the latest public release, 0.8.1.

Compared to the current source code, the patch appears to contain an extra
file (which doesn't exist in version 0.8.1):

src/plugin/subcollection/src/java/org/apache/nutch/util/DomUtil.java 

Could this be the reason the subcollection is not being updated on re-indexing?


Sami Siren-2 wrote:
 
 liv wrote:
 - I reindex the db: delete folder indexes, run the command:
 
 bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
 
 - then I inspect the resulting db with luke again
 
 Unfortunately nothing has changed. Maybe I am missing something... Please
 tell me if you see anything wrong.
 
 If you did exactly those steps then what happens is that the
 subcollections.xml is read from inside the .job file. You need to
 rebuild the .job to put new file inside of it.
 
 simply do ant and rerun indexing and it should work as expected.
 
 --
  Sami Siren
 
 
 




How best to add sponsored link support..??

2006-12-19 Thread RP

Hi all,

I've been tasked with looking into this and am not a coder. That said,
Nutch is doing great, and the bean counters have asked me to look into
adding sponsored link results. I'm wondering how best to add this.


It would be nice to use the Nutch engine to come up with the pages,
rather than just doing a lookup on words and results in a flat file. But the
keyword data could change daily (or hourly) and would need to be
hand-entered (or automated) as people sign up, so a re-index is not really
an option. I'm not sure this would fly within the main Nutch segments
and index. I could see maybe a separate index, or possibly adding a
flag to the existing data, but I've not seen any easy-to-use tools to
change, update, or insert records into what is already there (yes, Luke
works on the index, but that does not touch the segment data, right?).

I don't want to change existing searched data, and I don't see an issue
with having duplicate results (sponsored up top and the existing entry down
below somewhere), though it would be more elegant to avoid that. I also
see issues with a simple flat-file lookup, as a multiple-word search is
best handled inside Nutch to score the results, rather than having to do
something similar for the sponsored results. I can see the need to
control the summary text displayed and also to pass through any codes in the
URL, which are currently being stripped during the main crawl/index
cycle. I also see issues with seriously customizing the internals, as
the changes would have to be maintained as Nutch itself is updated.


If anyone has looked at this and has at least some ideas on how best to
do it, let me know. I need to come up with a preliminary estimate
before I can engage and pay the coders to make this happen, so if there
are any easy or best-practice ways of doing this, any help or pointers
would be appreciated.


--
rp





Re: How best to add sponsored link support..??

2006-12-19 Thread RP
Let me qualify this: ad banner rotation is dealt with. I'm looking for
something that will use our Nutch engine to serve up relevant links from
people who pay for that privilege. We do not want to serve up ads from
someone else's system (i.e. the big G or Y), but to use our own Nutch search
results to serve up relevant paying links that we have sold and
maintain. In a simple relational SQL world we would add a flag and
another table with the links and scores, look that up, and pass it back
when needed. The problem with that is that we lose the whole multi-word
scoring capability in Nutch: e.g. "pizza beer Chicago" should serve up a
Chicago pizza ad first and beer ads further down, just as our search
results have relevancy (not a great example, but you get the idea).
Rewriting a scoring engine to do that in SQL seems like a waste when
Nutch already does it just fine.


So in a nutshell, we need to do what the big G and Y and others do when
serving up keyword-based sponsored links. My thought: automate the
build of a dummy page with the keywords bought, which would be indexed
and served up just like regular crawled and indexed pages, using the
scoring to rank them in terms of relevancy and placement. However, I have
not seen any snippets of code to do simple insert/update/delete operations
on a Nutch segment or index.


This is the idea-gathering phase. Think of a school/college search
engine with local paying advertisers: we want to serve those links up
to the searchers to help offset the cost of the service, and to serve up or
flag links that rank first because of payment, followed by normal search
link results.


rp
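
One way to picture the separate-index idea: keep the sponsored links in their own small store, keyed by the keywords each advertiser bought, and rank them per query by how many query terms they match. The sketch below is a toy term-overlap scorer in plain Java, not Nutch or Lucene scoring; the class names, the example URLs used with it, and the scoring rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical record: one paid link and the keywords bought for it.
final class SponsoredLink {
    final String url;
    final Set<String> keywords = new HashSet<>();

    SponsoredLink(String url, String... keywords) {
        this.url = url;
        for (String k : keywords) {
            this.keywords.add(k.toLowerCase(Locale.ROOT));
        }
    }
}

final class SponsoredLinkStore {
    private final List<SponsoredLink> links = new ArrayList<>();

    // Entries can be added at any time; the main crawl data is not touched.
    void add(SponsoredLink link) {
        links.add(link);
    }

    // Toy score: the number of distinct query terms found among the keywords.
    static int score(SponsoredLink link, String query) {
        String[] words = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Set<String> terms = new HashSet<>(Arrays.asList(words));
        terms.retainAll(link.keywords);
        return terms.size();
    }

    // URLs of links matching at least one query term, best match first.
    List<String> search(String query) {
        List<SponsoredLink> matches = new ArrayList<>();
        for (SponsoredLink link : links) {
            if (score(link, query) > 0) {
                matches.add(link);
            }
        }
        matches.sort((a, b) -> Integer.compare(score(b, query), score(a, query)));
        List<String> urls = new ArrayList<>();
        for (SponsoredLink m : matches) {
            urls.add(m.url);
        }
        return urls;
    }
}
```

With one entry bought for "pizza" and "chicago" and another for "beer", the query "pizza beer Chicago" matches two terms of the first and one of the second, so the pizza entry ranks ahead of the beer entry. A production version would more likely write these entries into a separate small index so that real relevance scoring applies.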

Sean Dean wrote:

I might be totally off base with what you're asking to do, but take a look at
this open source project: http://phpadsnew.com/two/.
 
It's basically an advertising engine, built on PHP. Integration into any application is a breeze, and it supports external advertising such as Google Ads.


Sean



  




Re: How best to add sponsored link support..??

2006-12-19 Thread Sami Siren

Are you looking for something like the Google KeyMatch feature described in [1],
which was then more or less mimicked in the Nutch web2 module [2],
and has since also been released, at least as a lookalike, on Google Code [3]?

--
Sami Siren

[1] http://www.google.com/enterprise/mini/end_user_features.html
[2] http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/web-keymatch/
[3] http://custom-keymatch-onebox.googlecode.com/svn/trunk/Keymatch.java
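
For readers unfamiliar with the idea, a keymatch is roughly a lookup table from purchased keywords to a promoted link that is shown above the normal results, independent of the crawl. Below is a minimal sketch, assuming a hit whenever any query term equals a registered keyword; the class and method names are hypothetical and are not taken from the web2 plugin or from the code linked above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class KeymatchTable {
    // keyword (lower case) -> promoted URL
    private final Map<String, String> promoted = new HashMap<>();

    // Registers or replaces the promoted URL for a keyword.
    void put(String keyword, String url) {
        promoted.put(keyword.toLowerCase(Locale.ROOT), url);
    }

    // Removes a keyword, for example when a sponsorship ends.
    void remove(String keyword) {
        promoted.remove(keyword.toLowerCase(Locale.ROOT));
    }

    // Promoted URLs for each query term that is a registered keyword,
    // in query order and without duplicates.
    List<String> lookup(String query) {
        List<String> hits = new ArrayList<>();
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            String url = promoted.get(term);
            if (url != null && !hits.contains(url)) {
                hits.add(url);
            }
        }
        return hits;
    }
}
```

Because the table lives outside the index, entries can be added or removed without a re-crawl or re-index; the trade-off is that matching is exact and unscored.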







Re: How best to add sponsored link support..??

2006-12-19 Thread RP

Thanks Sami,

This is closer from an initial look. Does this do anything on the
backend as well (i.e. defining the data flags so we can get a match), or
do we need to build that?


Sami Siren wrote:
Are you looking for something like the Google KeyMatch feature described in [1],
which was then more or less mimicked in the Nutch web2 module [2],
and has since also been released, at least as a lookalike, on Google Code [3]?

--
Sami Siren

[1] http://www.google.com/enterprise/mini/end_user_features.html
[2] http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/web-keymatch/
[3] http://custom-keymatch-onebox.googlecode.com/svn/trunk/Keymatch.java









--
rp




Need help with deleteduplicates

2006-12-19 Thread sdeck

Hello,
  I am running Nutch 0.8 against Hadoop 0.4, just for reference.
I want to add duplicate deletion based on a similarity algorithm, as opposed
to the hash method that is currently in there.
I would have to say I am pretty lost as to how the delete duplicates class
works.
I would guess that I need to implement a compareTo method, but I am not
really sure what to return. Also, when I do return something, where do I
implement the functionality to say "yes, these are dupes, so remove the
first one"?

Can anyone help out?
Thanks,
S
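
One common way to define "similar" is to break each document into overlapping word shingles and compare the resulting sets with Jaccard similarity, treating two documents as duplicates when the score reaches a threshold. The sketch below shows only that comparison and a keep-the-earliest policy in plain Java. It does not plug into Nutch's delete duplicates job, and the class name, shingle size, and threshold are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class NearDuplicates {
    // Builds the set of k-word shingles (overlapping word windows) of a text.
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    // Jaccard similarity: |A intersect B| / |A union B|; 0 when both sets are empty.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Indexes of documents to delete: each one is similar to an earlier kept document.
    static List<Integer> findDuplicates(List<String> docs, int k, double threshold) {
        List<Set<String>> kept = new ArrayList<>();
        List<Integer> toDelete = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            Set<String> current = shingles(docs.get(i), k);
            boolean duplicate = false;
            for (Set<String> other : kept) {
                if (jaccard(current, other) >= threshold) {
                    duplicate = true;
                    break;
                }
            }
            if (duplicate) {
                toDelete.add(i);
            } else {
                kept.add(current);
            }
        }
        return toDelete;
    }
}
```

Similarity is not transitive, so it does not map cleanly onto a compareTo ordering the way an exact hash does; that is why the sketch compares pairs directly. Which copy to keep is a policy choice (this version keeps the earliest), and the pairwise loop is quadratic, so a large collection would need some form of bucketing first.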
-- 
View this message in context: 
http://www.nabble.com/Need-help-with-deleteduplicates-tf2858127.html#a7985094
Sent from the Nutch - User mailing list archive at Nabble.com.