Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt
Hi Dennis,

Not to nit-pick, but the place where you inserted your change isn't at the end (where new entries typically should be placed). You inserted it in the middle of the file, throwing off the numbering (there are now two sets of 18 and 19 in the unreleased changes section). Could you please append your changes to the end of the file and recommit? Thanks a lot!

Cheers,
Chris

On 3/10/07 10:03 AM, [EMAIL PROTECTED] wrote:

> Author: kubes
> Date: Sat Mar 10 10:03:07 2007
> New Revision: 516759
>
> URL: http://svn.apache.org/viewvc?view=rev&rev=516759
> Log: Updated to reflect commits of NUTCH-233 and NUTCH-436.
>
> Modified: lucene/nutch/trunk/CHANGES.txt
>
> Modified: lucene/nutch/trunk/CHANGES.txt
> URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diff&rev=516759&r1=516758&r2=516759
> ==============================================================================
> --- lucene/nutch/trunk/CHANGES.txt (original)
> +++ lucene/nutch/trunk/CHANGES.txt Sat Mar 10 10:03:07 2007
> @@ -50,6 +50,13 @@
>   17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
>
> +18. NUTCH-233 - Wrong regular expression hangs reduce process forever
> +    (Stefan Groschupf via kubes)
> +
> +19. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
> +    path is empty (kubes)
> +
>  ** WARNING !!! **
>  * This upgrade breaks data format compatibility. A tool 'convertdb'       *
>  * was added to migrate existing CrawlDb-s to the new format. Segment data *
Re: Indexing the Interesting Part Only...
We plan to index many websites. Got any suggestions on how to drop the junk without having to do too much work for each such site? Do you know anyone who has a background in doing this sort of thing? What sorts of approaches would you recommend? Are there existing plug-ins I should consider using?

On 3/9/07, J. Delgado [EMAIL PROTECTED] wrote:

> You have to build a special HTML junk parser.
>
> 2007/3/9, d e [EMAIL PROTECTED]:
>
>> If I'm indexing a news article, I want to avoid getting the junk (anything other than the title, author, and article) into the index. I want to avoid getting the advertisements, etc. How do I do that sort of thing? What parts of which manual should I be reading so I will know how to do this sort of thing?
RE: Indexing the Interesting Part Only...
I think if anyone here had the perfect answer for that one, they would have sold it to Google, Microsoft, or Yahoo for a ton of money. You will need an algorithm that can detect ads. I have not written ad filters, since my search engine is currently using a domain whitelist. I can tell you that a whole-web crawl will definitely need it, since it can cut down on pages in the index by 10-20%. If you do a whole-web crawl you will also need spam detection. I would recommend looking for some academic papers on the topic. Maybe use CiteSeer or something like that.

Steve

-----Original Message-----
From: d e [mailto:[EMAIL PROTECTED]]
Sent: Saturday, March 10, 2007 3:07 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Indexing the Interesting Part Only...

We plan to index many websites. Got any suggestions on how to drop the junk without having to do too much work for each such site? Do you know anyone who has a background in doing this sort of thing? What sorts of approaches would you recommend? Are there existing plug-ins I should consider using?

On 3/9/07, J. Delgado [EMAIL PROTECTED] wrote:

> You have to build a special HTML junk parser.
>
> 2007/3/9, d e [EMAIL PROTECTED]:
>
>> If I'm indexing a news article, I want to avoid getting the junk (anything other than the title, author, and article) into the index. I want to avoid getting the advertisements, etc. How do I do that sort of thing? What parts of which manual should I be reading so I will know how to do this sort of thing?
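The domain-whitelist approach Steve describes can be sketched in a few lines. This is a generic illustration, not Nutch's actual URLFilter plugin mechanism, and the whitelisted domain names are made up for the example:

```python
from urllib.parse import urlparse

# Hypothetical whitelist: only hosts under these domains are kept for
# crawling/indexing; everything else (ad servers, trackers) is dropped.
WHITELIST = {"example.com", "news.example.org"}

def allowed(url: str) -> bool:
    """Return True if the URL's host is a whitelisted domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in WHITELIST)
```

In a real deployment the same check would live in the crawler's URL-filtering stage, so disallowed pages are never fetched in the first place rather than filtered out of the index afterwards.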
Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt
Chris Mattmann wrote:

> Hi Dennis,
>
> Not to nit-pick, but the place where you inserted your change isn't at the end (where new entries typically should be placed). You inserted it in the middle of the file, throwing off the numbering (there are now two sets of 18 and 19 in the unreleased changes section). Could you please append your changes to the end of the file and recommit? Thanks a lot!
>
> Cheers,
> Chris

Sorry about that. I saw the warning message and thought it was a version break. Everything should be fixed now.

Dennis Kubes

> On 3/10/07 10:03 AM, [EMAIL PROTECTED] wrote:
>
>> Author: kubes
>> Date: Sat Mar 10 10:03:07 2007
>> New Revision: 516759
>>
>> URL: http://svn.apache.org/viewvc?view=rev&rev=516759
>> Log: Updated to reflect commits of NUTCH-233 and NUTCH-436.
>>
>> Modified: lucene/nutch/trunk/CHANGES.txt
>>
>> [quoted diff for r516759 snipped]
Re: Indexing the Interesting Part Only...
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

There are quite a few ways to do this. In fact, Google's PageRank is one such approach. Text classification (as done in spam filters, for example) is another. It just depends on what you are going to do.

d e wrote:

> We plan to index many websites. Got any suggestions on how to drop the junk without having to do too much work for each such site? Do you know anyone who has a background in doing this sort of thing? What sorts of approaches would you recommend?

- --
Best regards,
Bjoern Wilmsmann

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iD8DBQFF812mgz0R1bg11MERAqXCAKCVTfLN7KXJYdAqLGWMI57ChKaM8QCfdQBc
1CyrQfD+5vCzSBvYbviX17o=
=+TK/
-----END PGP SIGNATURE-----
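The text-classification idea Bjoern mentions can be shown with a toy naive-Bayes word scorer in the spirit of spam filters. The training snippets are invented for demonstration, the smoothing is plain add-one, and class priors are ignored for brevity:

```python
import math
from collections import Counter

def train(examples):
    """examples: list of (label, text) pairs. Returns per-class word counts."""
    counts = {}
    for label, text in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def classify(counts, text):
    """Pick the class with the highest naive-Bayes log score (add-one smoothing)."""
    words = text.lower().split()
    vocab = set().union(*counts.values())
    best, best_score = None, float("-inf")
    for label, c in counts.items():
        total = sum(c.values())
        # Sum of log P(word | class) with Laplace smoothing over the vocabulary.
        score = sum(math.log((c[w] + 1) / (total + len(vocab))) for w in words)
        if score > best_score:
            best, best_score = label, score
    return best

# Made-up training data: one "junk" snippet (ad-like) and one "content" snippet.
counts = train([
    ("junk", "buy now click here free offer"),
    ("content", "the minister announced the new policy today"),
])
```

A real filter would need far more training text per class, but the scoring mechanics are the same.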
Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt
Dennis,

No probs. Thanks a lot!

Cheers,
Chris

On 3/10/07 5:35 PM, Dennis Kubes [EMAIL PROTECTED] wrote:

> Chris Mattmann wrote:
>
>> Hi Dennis,
>>
>> Not to nit-pick, but the place where you inserted your change isn't at the end (where new entries typically should be placed). You inserted it in the middle of the file, throwing off the numbering (there are now two sets of 18 and 19 in the unreleased changes section). Could you please append your changes to the end of the file and recommit? Thanks a lot!
>>
>> Cheers,
>> Chris
>
> Sorry about that. I saw the warning message and thought it was a version break. Everything should be fixed now.
>
> Dennis Kubes
>
> [quoted diff for r516759 snipped]
Re: Indexing the Interesting Part Only...
I'm sorry! I guess I was REALLY not clear. I mean my problem is dropping the junk *on each page*. I am indexing news sites. I want to harvest news STORIES, not the advertisements and other junk text around the outside of each page. Got suggestions for THAT problem? Thanks!

On 3/10/07, Björn Wilmsmann [EMAIL PROTECTED] wrote:

> There are quite a few ways to do this. In fact, Google's PageRank is one such approach. Text classification (as done in spam filters, for example) is another. It just depends on what you are going to do.
>
> d e wrote:
>
>> We plan to index many websites. Got any suggestions on how to drop the junk without having to do too much work for each such site? Do you know anyone who has a background in doing this sort of thing? What sorts of approaches would you recommend?
>
> --
> Best regards,
> Bjoern Wilmsmann
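For the per-page version of the problem, one heuristic family from the boilerplate-removal literature scores each HTML block (paragraph, div, table cell) by word count and link density: ads and navigation tend to be short and link-heavy, while article body text is long and link-sparse. A rough regex-based sketch follows; the thresholds are illustrative guesses, not tuned values, and a production extractor would use a real HTML parser rather than regexes:

```python
import re

def link_density(block_html: str) -> float:
    """Fraction of a block's visible text that sits inside <a> tags."""
    anchor_text = "".join(re.findall(r"<a\b[^>]*>(.*?)</a>", block_html,
                                     re.I | re.S))
    plain = re.sub(r"<[^>]+>", "", block_html)
    return len(anchor_text) / len(plain) if plain else 1.0

def is_content(block_html: str, min_words=10, max_link_density=0.3) -> bool:
    """Keep a block only if it has enough running text and few links."""
    text = re.sub(r"<[^>]+>", "", block_html)
    return (len(text.split()) >= min_words
            and link_density(block_html) < max_link_density)
```

Applied over all blocks of a news page, this keeps the story paragraphs and discards the "Home | Sports | Sponsored links" style furniture; title and author can then be pulled from the retained blocks or the page metadata.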