update crawldb
Hello,

I use the Prune tool to remove documents from segment indexes, but it does not remove pages and links from the WebDB. To keep the unwanted URLs from reappearing when new segments are created, it is advised to use your own net.nutch.net.URLFilter, or PruneDBTool (under construction...). PruneDBTool seems to be permanently under construction, and I don't understand how to use my own net.nutch.net.URLFilter. Could somebody please help me?

Thanks in advance.
Aïcha
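For readers with the same question, here is a minimal sketch of what a custom URL filter could look like. It assumes the usual filter contract (return the URL to keep it, return null to reject it). The UrlFilter interface below is a local stand-in so the example compiles without Nutch on the classpath, and the PrefixBlockingFilter class and the example.com prefix are hypothetical names chosen for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Local stand-in for the Nutch URL filter contract (an assumption made so the
// sketch compiles on its own): return the URL to keep it, or null to reject it.
interface UrlFilter {
    String filter(String url);
}

// Hypothetical filter that rejects every URL starting with a blocked prefix.
class PrefixBlockingFilter implements UrlFilter {
    private final List<String> blockedPrefixes;

    PrefixBlockingFilter(List<String> blockedPrefixes) {
        this.blockedPrefixes = blockedPrefixes;
    }

    @Override
    public String filter(String url) {
        if (url == null) {
            return null;
        }
        for (String prefix : blockedPrefixes) {
            if (url.startsWith(prefix)) {
                return null; // rejected: the URL is under a pruned prefix
            }
        }
        return url; // accepted unchanged
    }
}

class UrlFilterDemo {
    public static void main(String[] args) {
        UrlFilter f = new PrefixBlockingFilter(
                Arrays.asList("http://example.com/private/"));
        System.out.println(f.filter("http://example.com/private/a.html"));
        System.out.println(f.filter("http://example.com/public/b.html"));
    }
}
```

In a real installation the class would implement Nutch's own interface and be packaged as a plugin. For simple cases, adding exclusion patterns to the regex URL filter configuration may be enough and avoids writing any Java.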
Re: subcollections IT DOESN'T WORK!
Hi,

I'm new to Nutch. I want to know what the subcollection plugin is for. Where is the introduction to it?

On 12/19/06, liv [EMAIL PROTECTED] wrote:

I may be losing all credibility ... but it's still in the same state: reindexing doesn't change the subcollection field! I did a REFETCH by mistake (before the reindex) and was happy to notice that the subcollections had changed, but I assumed that was due to the reindex alone. However, I am looking for REINDEX only, and the subcollection field does not seem to change (after corresponding changes to the subcollections.xml file). Any help in debugging would be greatly appreciated; I'm not familiar enough with Java to pursue this by myself. Thanks.

liv wrote:

I have no idea why this happened; probably because Luke was not re-reading the indexes? Very strange! Anyway, it works as it should: after a reindex the subcollection field is populated with the latest data. Please excuse my insistence and my clumsiness, and thanks for your answers.

liv wrote:

Unfortunately my Java knowledge is too poor to debug this one. However, I doubt that the subcollections.xml file from inside nutch-xxx.job is used, because nutch-xxx.job is old: it is dated the day I installed Nutch.

Sami Siren-2 wrote:

liv wrote:

- I reindex the db: delete the indexes folder and run the command: bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
- then I inspect the resulting db with Luke again

Unfortunately nothing has changed. Maybe I am missing something... Please tell me if you see anything wrong.

If you did exactly those steps, then what happens is that subcollections.xml is read from inside the .job file. You need to rebuild the .job to put the new file inside it. Simply run ant and rerun indexing, and it should work as expected.

-- Sami Siren

-- View this message in context: http://www.nabble.com/subcollections-tf2821188.html#a7935139 Sent from the Nutch - User mailing list archive at Nabble.com.

-- www.babatu.com
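Sami's point about the configuration being read from inside the .job file can be checked with a few lines of plain Java. The sketch below is a generic classpath diagnostic, not part of Nutch: the ResourceLocator class is a hypothetical helper, and it only reports where the class loader that loaded it finds a given resource name.

```java
import java.net.URL;

// Hypothetical diagnostic helper (not a Nutch class): reports where a
// resource such as subcollections.xml is resolved on the classpath.
class ResourceLocator {
    // Describes where the class loader finds the resource, or says it is missing.
    static String describe(String resourceName) {
        URL url = ResourceLocator.class.getClassLoader().getResource(resourceName);
        if (url == null) {
            return resourceName + " not found on classpath";
        }
        return resourceName + " loaded from " + url;
    }

    public static void main(String[] args) {
        // Resource name defaults to the file discussed in this thread.
        String name = args.length > 0 ? args[0] : "subcollections.xml";
        System.out.println(describe(name));
    }
}
```

A reported URL beginning with jar: means the copy comes from inside an archive, while file: means a plain directory such as conf/. If the archive copy wins, editing the file on disk has no effect until the archive is rebuilt.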
Re: subcollections IT DOESN'T WORK!
Look here: http://issues.apache.org/jira/browse/NUTCH-201?page=all

Unfortunately it doesn't work as expected... yet.

kauu wrote:

Hi, I'm new to Nutch. I want to know what the subcollection plugin is for. Where is the introduction to it?
Re: subcollections
I checked the patch for subcollections (http://issues.apache.org/jira/browse/NUTCH-201), although I had assumed it was included in the latest public release, 0.8.1. Compared to the current source code, the patch appears to have an extra file that doesn't exist in version 0.8.1: src/plugin/subcollection/src/java/org/apache/nutch/util/DomUtil.java

Could this be the cause of the subcollection not being updated on re-indexing?
How best to add sponsored link support..??
Hi all,

I've been tasked with looking into this and am not a coder. That said, Nutch is doing great, and the bean counters have asked me to look into adding sponsored link results. I'm wondering how best to do this.

It would be nice to use the Nutch engine to come up with the pages rather than just doing a lookup on words and results in a flat file, but the keyword data could change daily (or hourly) and would need to be entered by hand (or automatically) as people sign up, so a re-index is not really an option. I'm not sure this would fly within the main Nutch segments and index. I could see maybe a separate index, or possibly a flag added to the existing data, but I've not seen any easy-to-use tools to change/update/insert records in what is already there (yes, Luke works on the index, but that does not touch the segment data, right?).

I don't want to change existing searched data, and I don't see an issue with having duplicate results (sponsored up top and the existing entry down below somewhere), but it would be more elegant not to have that occur. I also see issues with a simple flat-file lookup, as a multiple-word search is best handled inside Nutch to score the results, versus having to do something similar for the sponsored results. I can see the need to control the summary text displayed and also to pass through any codes in the URL, which are currently being stripped during the main crawl/index cycle. I also see issues with seriously customizing the internals, as the changes would have to be maintained as Nutch itself is updated.

If anyone has looked at this and has at least some ideas on how best to do it, let me know. I need to come up with a preliminary estimate before I can engage and pay the coders to make this happen, so if there are any easy or best-practice ways of doing this, any help/pointers would be appreciated.

-- rp
Re: How best to add sponsored link support..??
Let me qualify this: ad banner rotation is dealt with. I'm looking for something that will use our Nutch engine to serve up relevant links from people who pay for that privilege. We do not want to serve up ads from someone else's system (i.e. the big G or Y), but use our own Nutch search results to serve up relevant paying links that we have sold and maintain.

In a simple relational SQL world we would add a flag and another table with the links and scores, look that up, and pass it back when needed. The problem is that we lose the whole multi-word scoring capability in Nutch: e.g. a search for "pizza beer Chicago" should serve up a Chicago pizza ad first and beer ads further down, just as our search results have relevancy (not a great example, but you get the idea). Rewriting a scoring engine to do that in SQL seems like a waste when Nutch already does it just fine. So in a nutshell, we need to do what the big G and Y and others do when serving up keyword-based sponsored links.

My thought: automate the build of a dummy page containing the keywords bought, which would be indexed and served up just like regular crawled and indexed pages, using the scoring to rank them in terms of relevancy and placement. However, I have not seen any snippets of code to do simple insert/update/delete operations on a Nutch segment or index.

This is the idea-gathering phase. Think of a school/college search engine with local paying advertisers: we want to serve those links up to the searchers to help offset the cost of the service, and serve up or flag links that rank first because of payment, followed by normal search results.

rp

Sean Dean wrote:

I might be totally off base with what you're asking to do, but take a look at this open source project: http://phpadsnew.com/two/. It's basically an advertising engine built on PHP. Integration within any application is a breeze, and it supports external advertising such as Google Ads.

Sean
Re: How best to add sponsored link support..??
Are you looking for something like the Google KeyMatch feature described in [1], which was then more or less mimicked in the Nutch web2 module [2], and has since also been released, at least as a lookalike, on Google Code [3]?

-- Sami Siren

[1] http://www.google.com/enterprise/mini/end_user_features.html
[2] http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/web-keymatch/
[3] http://custom-keymatch-onebox.googlecode.com/svn/trunk/Keymatch.java
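To make the keyword-matching idea in this thread concrete, here is a small sketch that keeps sponsored links in a separate in-memory table and ranks them by how many query terms they match. It is not the code of the web-keymatch plugin or of the Google Code project, and the term-overlap count is only a stand-in for real relevance scoring. SponsoredLink, SponsoredMatcher, and the example.com URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical record: a paid link plus the keywords the sponsor bought.
class SponsoredLink {
    final String url;
    final Set<String> keywords = new HashSet<>();

    SponsoredLink(String url, String... keywords) {
        this.url = url;
        for (String k : keywords) {
            this.keywords.add(k.toLowerCase());
        }
    }
}

// Hypothetical matcher: links can be added at any time without re-indexing.
class SponsoredMatcher {
    private final List<SponsoredLink> links = new ArrayList<>();

    void add(SponsoredLink link) {
        links.add(link);
    }

    // Counts how many distinct query terms are among the link's keywords.
    private static int score(SponsoredLink link, Set<String> terms) {
        int hits = 0;
        for (String t : terms) {
            if (link.keywords.contains(t)) {
                hits++;
            }
        }
        return hits;
    }

    // Returns the URLs of links matching at least one term, best match first.
    List<String> match(String query) {
        Set<String> terms = new HashSet<>(
                Arrays.asList(query.toLowerCase().trim().split("\\s+")));
        List<SponsoredLink> hits = new ArrayList<>();
        for (SponsoredLink l : links) {
            if (score(l, terms) > 0) {
                hits.add(l);
            }
        }
        // Stable sort: links with equal scores keep their insertion order.
        hits.sort((a, b) -> Integer.compare(score(b, terms), score(a, terms)));
        List<String> urls = new ArrayList<>();
        for (SponsoredLink l : hits) {
            urls.add(l.url);
        }
        return urls;
    }
}

class SponsoredDemo {
    public static void main(String[] args) {
        SponsoredMatcher m = new SponsoredMatcher();
        m.add(new SponsoredLink("http://example.com/beer", "beer"));
        m.add(new SponsoredLink("http://example.com/chicago-pizza", "pizza", "chicago"));
        // The pizza ad matches two terms, so it ranks above the beer ad.
        System.out.println(m.match("pizza beer Chicago"));
    }
}
```

Keeping the table outside the main index means entries can change hourly without a re-index, at the cost of giving up the search engine's own scoring for the sponsored results.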
Re: How best to add sponsored link support..??
Thanks Sami,

This looks closer at first glance. Does it do anything on the backend as well (i.e. defining the data flags so we can get a match), or do we need to build that?

-- rp
Need help with deleteduplicates
Hello,

I am running Nutch 0.8 against Hadoop 0.4, just for reference. I want to add duplicate deletion based on a similarity algorithm, as opposed to the hash method that is currently in there. I have to say I am pretty lost as to how the delete duplicates class works. I would guess that I need to implement a compareTo method, but I am not really sure what to return. Also, once I do return something, where do I implement the functionality that says "yes, these are dupes, so remove the first one"?

Can anyone help out?

Thanks,
S

-- View this message in context: http://www.nabble.com/Need-help-with-deleteduplicates-tf2858127.html#a7985094
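As one possible starting point for the similarity test itself, the sketch below compares two texts by the Jaccard similarity of their word shingles. This is a technique swapped in for illustration, not the algorithm the poster had in mind and not anything taken from Nutch's delete duplicates code. The NearDuplicate class, the shingle size, and the threshold are all hypothetical, and how such a test would be wired into the deduplication job is not shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: near-duplicate detection by shingle overlap.
class NearDuplicate {
    // Builds the set of k-word shingles (overlapping word windows) of a text.
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            // Texts shorter than k words yield no shingles; treat them as
            // dissimilar so they are never reported as duplicates.
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // True when the shingle sets overlap at least as much as the threshold.
    static boolean isNearDuplicate(String x, String y, int k, double threshold) {
        return jaccard(shingles(x, k), shingles(y, k)) >= threshold;
    }

    public static void main(String[] args) {
        String a = "the quick brown fox jumps";
        String b = "the quick brown fox leaps";
        System.out.println(jaccard(shingles(a, 2), shingles(b, 2)));
        System.out.println(isNearDuplicate(a, b, 2, 0.5));
    }
}
```

Unlike a hash comparison, this test is not transitive, so grouping documents by a single sort key does not work directly. A pairwise or clustering pass would be needed to decide which copy to remove.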