Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt

2007-03-10 Thread Chris Mattmann
Hi Dennis,

 Not to nit-pick, but the place where you inserted your change isn't at the
end (where they typically should be placed). You inserted in the middle of
the file, throwing off the numbering (there are now two sets of 18 and 19 in
the unreleased changes section). Could you please append your changes to the
end of the file, and recommit?

 Thanks a lot!

Cheers,
  Chris



On 3/10/07 10:03 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Author: kubes
 Date: Sat Mar 10 10:03:07 2007
 New Revision: 516759
 
 URL: http://svn.apache.org/viewvc?view=rev&rev=516759
 Log:
 Updated to reflect commits of NUTCH-233 and NUTCH-436.
 
 Modified:
 lucene/nutch/trunk/CHANGES.txt
 
 Modified: lucene/nutch/trunk/CHANGES.txt
 URL: 
 http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diff&rev=516759&r1=516758&r2=516759
 ==
 --- lucene/nutch/trunk/CHANGES.txt (original)
 +++ lucene/nutch/trunk/CHANGES.txt Sat Mar 10 10:03:07 2007
 @@ -50,6 +50,13 @@
  
  17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
  
 +18. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan
 +Groschupf via kubes)
 +
 +19. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
 + path is empty (kubes)
 +
 +
** WARNING !!! 
* This upgrade breaks data format compatibility. A tool 'convertdb'   *
* was added to migrate existing CrawlDb-s to the new format. Segment data *
 
 




Re: Indexing the Interesting Part Only...

2007-03-10 Thread d e

We plan to index many websites. Got any suggestions on how to drop the junk
without having to do too much work for each such site? Know anyone who has a
background on doing this sort of thing? What sorts of approaches would you
recommend?

Are there existing plug-ins I should consider using?


On 3/9/07, J. Delgado [EMAIL PROTECTED] wrote:


You have to build a special HTML Junk parser.

2007/3/9, d e [EMAIL PROTECTED]:

 If I'm indexing a news article, I want to avoid getting the junk (other than
 the title, author, and article) into the index. I want to avoid getting the
 advertisements, etc. How do I do that sort of thing?

 What parts of what manual should I be reading so I will know how to do this
 sort of thing?




RE: Indexing the Interesting Part Only...

2007-03-10 Thread Steve Severance
I think if anyone here had the perfect answer for that one, they would have
sold it to Google, Microsoft, or Yahoo for a ton of money. You will need an
algorithm that can detect ads. I have not written ad filters, since my search
engine currently uses a domain whitelist. I can tell you that a whole-web
crawl will definitely need them, since ad filtering can cut down on pages in
the index by 10-20%. If you do a whole-web crawl you will also need spam
detection.
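
For what it's worth, here is a rough sketch of the domain whitelist check I
mean, written as a standalone Java class rather than as a Nutch plugin; the
whitelisted domains and sample URLs are just placeholders:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal domain whitelist check (standalone sketch, not wired into
// Nutch's plugin system; the whitelisted domains are placeholders).
public class DomainWhitelist {

    private final Set<String> allowed;

    public DomainWhitelist(Set<String> allowedDomains) {
        this.allowed = allowedDomains;
    }

    // Returns true if the URL's host equals, or is a subdomain of, a
    // whitelisted domain.
    public boolean accepts(String urlString) {
        try {
            String host = new URL(urlString).getHost().toLowerCase();
            for (String domain : allowed) {
                if (host.equals(domain) || host.endsWith("." + domain)) {
                    return true;
                }
            }
            return false;
        } catch (MalformedURLException e) {
            return false; // drop anything we cannot even parse
        }
    }

    public static void main(String[] args) {
        DomainWhitelist wl = new DomainWhitelist(new HashSet<String>(
                Arrays.asList("example.com", "news.example.org")));
        System.out.println(wl.accepts("http://www.example.com/story.html")); // true
        System.out.println(wl.accepts("http://ads.doubleclick.net/banner")); // false
    }
}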

I would recommend looking for some academic papers on the topic. Maybe use
CiteSeer or something like that.

Steve
-Original Message-
From: d e [mailto:[EMAIL PROTECTED] 
Sent: Saturday, March 10, 2007 3:07 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Indexing the Interesting Part Only...

We plan to index many websites. Got any suggestions on how to drop the junk
without having to do too much work for each such site? Know anyone who has a
background on doing this sort of thing? What sorts of approaches would you
recommend?

Are there existing plug-ins I should consider using?


On 3/9/07, J. Delgado [EMAIL PROTECTED] wrote:

 You have to build a special HTML Junk parser.

 2007/3/9, d e [EMAIL PROTECTED]:
 
  If I'm indexing a news article, I want to avoid getting the junk (other than
  the title, author, and article) into the index. I want to avoid getting the
  advertisements, etc. How do I do that sort of thing?

  What parts of what manual should I be reading so I will know how to do this
  sort of thing?
 




Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt

2007-03-10 Thread Dennis Kubes



Chris Mattmann wrote:

Hi Dennis,

 Not to nit-pick, but the place where you inserted your change isn't at the
end (where they typically should be placed). You inserted in the middle of
the file, throwing off the numbering (there are now two sets of 18 and 19 in
the unreleased changes section). Could you please append your changes to the
end of the file, and recommit?

 Thanks a lot!

Cheers,
  Chris


Sorry about that.  I saw the warning message and thought it was a version
break.  Everything should be fixed now.


Dennis Kubes




On 3/10/07 10:03 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:


Author: kubes
Date: Sat Mar 10 10:03:07 2007
New Revision: 516759

URL: http://svn.apache.org/viewvc?view=rev&rev=516759
Log:
Updated to reflect commits of NUTCH-233 and NUTCH-436.

Modified:
lucene/nutch/trunk/CHANGES.txt

Modified: lucene/nutch/trunk/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diff&rev=516759&r1=516758&r2=516759
==
--- lucene/nutch/trunk/CHANGES.txt (original)
+++ lucene/nutch/trunk/CHANGES.txt Sat Mar 10 10:03:07 2007
@@ -50,6 +50,13 @@
 
 17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
 
+18. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan

+Groschupf via kubes)
+
+19. NUTCH-436 - Incorrect handling of relative paths when the embedded URL

+ path is empty (kubes)
+
+
   ** WARNING !!! 
   * This upgrade breaks data format compatibility. A tool 'convertdb'   *
   * was added to migrate existing CrawlDb-s to the new format. Segment data *







Re: Indexing the Interesting Part Only...

2007-03-10 Thread Björn Wilmsmann

There are quite a few ways to do this. In fact, Google's PageRank is  
one such approach. Text classification (as done in spam filters, for  
example) is another. It just depends on what you are going to do.
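
To make the text classification idea a bit more concrete, here is a toy
Java sketch that scores a block of text by counting phrases that usually
indicate ads or navigation. A real setup would train something like naive
Bayes on labelled blocks; the phrase list below is invented purely for
illustration.

import java.util.Arrays;
import java.util.List;

// Toy stand-in for a trained text classifier: scores a block of text by
// counting phrases that usually indicate ads or navigation rather than
// body copy. The phrase list is invented for illustration only.
public class JunkScorer {

    private static final List<String> JUNK_TERMS = Arrays.asList(
            "advertisement", "sponsored", "click here", "subscribe now",
            "all rights reserved", "privacy policy");

    // Fraction of junk phrases present in the text; higher means more
    // likely to be junk.
    public static double junkScore(String text) {
        String lower = text.toLowerCase();
        int hits = 0;
        for (String term : JUNK_TERMS) {
            if (lower.contains(term)) {
                hits++;
            }
        }
        return (double) hits / JUNK_TERMS.size();
    }

    public static void main(String[] args) {
        System.out.println(junkScore("Sponsored links - click here to subscribe now"));
        System.out.println(junkScore("The mayor announced a new budget on Friday."));
    }
}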


d e wrote:

We plan to index many websites. Got any suggestions on how to drop the junk
without having to do too much work for each such site? Know anyone who has a
background on doing this sort of thing? What sorts of approaches would you
recommend?


- --
Best regards,
Bjoern Wilmsmann





Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt

2007-03-10 Thread Chris Mattmann
Dennis,

 No probs. Thanks a lot!

Cheers,
  Chris



On 3/10/07 5:35 PM, Dennis Kubes [EMAIL PROTECTED] wrote:

 
 
 Chris Mattmann wrote:
 Hi Dennis,
 
  Not to nit-pick, but the place where you inserted your change isn't at the
 end (where they typically should be placed). You inserted in the middle of
 the file, throwing off the numbering (there are now two sets of 18 and 19 in
 the unreleased changes section). Could you please append your changes to the
 end of the file, and recommit?
 
  Thanks a lot!
 
 Cheers,
   Chris
 
 Sorry about that.  I saw the warning message and thought it was a version
 break.  Everything should be fixed now.
 
 Dennis Kubes
 
 
 
 On 3/10/07 10:03 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 
 Author: kubes
 Date: Sat Mar 10 10:03:07 2007
 New Revision: 516759
 
 URL: http://svn.apache.org/viewvc?view=rev&rev=516759
 Log:
 Updated to reflect commits of NUTCH-233 and NUTCH-436.
 
 Modified:
 lucene/nutch/trunk/CHANGES.txt
 
 Modified: lucene/nutch/trunk/CHANGES.txt
 URL: 
 http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diff&rev=516759&r1=516758&r2=516759
 
 ==
 --- lucene/nutch/trunk/CHANGES.txt (original)
 +++ lucene/nutch/trunk/CHANGES.txt Sat Mar 10 10:03:07 2007
 @@ -50,6 +50,13 @@
  
  17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
  
 +18. NUTCH-233 - Wrong regular expression hangs reduce process forever
 (Stefan
 +Groschupf via kubes)
 +
 +19. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
 + path is empty (kubes)
 +
 +
** WARNING !!!
 
* This upgrade breaks data format compatibility. A tool 'convertdb'
 *
* was added to migrate existing CrawlDb-s to the new format. Segment data
 *
 
 
 
 




Re: Indexing the Interesting Part Only...

2007-03-10 Thread d e

I'm sorry! I guess I was REALLY not clear. I mean my problem is to drop the
junk *on each page*. I am indexing news sites. I want to harvest news
STORIES, not the advertisements and other junk text around the outside of
each page. Got suggestions for THAT problem?
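
The only idea I have so far is something crude along the lines of the rough
Java sketch below, which keeps blocks of the page that are long and have a
low link density. The block splitting, thresholds, and sample HTML are all
made up, so I suspect there are much better approaches:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch of a link-density heuristic: split the page into text
// blocks, then keep only blocks that are long and mostly non-anchor text.
// The block splitting, thresholds, and tag handling are all simplified.
public class StoryTextExtractor {

    private static final Pattern BLOCK_SPLIT =
        Pattern.compile("</?(p|div|td|li|h[1-6])[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ANCHOR_TEXT =
        Pattern.compile("<a[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static List<String> extract(String html) {
        List<String> story = new ArrayList<String>();
        for (String block : BLOCK_SPLIT.split(html)) {
            // Count characters of text that sit inside <a> tags in this block.
            int linkChars = 0;
            Matcher m = ANCHOR_TEXT.matcher(block);
            while (m.find()) {
                linkChars += TAG.matcher(m.group(1)).replaceAll("").length();
            }
            String text = TAG.matcher(block).replaceAll(" ")
                    .replaceAll("\\s+", " ").trim();
            if (text.length() < 80) {
                continue; // too short to be body copy
            }
            double linkDensity = (double) linkChars / text.length();
            if (linkDensity < 0.25) { // mostly plain text, not navigation or ads
                story.add(text);
            }
        }
        return story;
    }

    public static void main(String[] args) {
        String html = "<div><a href='/'>Home</a> | <a href='/ads'>Ads</a></div>"
            + "<p>The city council voted on Tuesday to approve the new transit "
            + "plan, which officials say will cut commute times across the "
            + "region by a third over the next five years.</p>";
        for (String block : extract(html)) {
            System.out.println(block);
        }
    }
}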

Thanks!


On 3/10/07, Björn Wilmsmann [EMAIL PROTECTED] wrote:


There are quite a few ways to do this. In fact, Google's PageRank is
one such approach. Text classification (as done in spam filters, for
example) is another. It just depends on what you are going to do.

d e wrote:

 We plan to index many websites. Got any suggestions on how to drop the junk
 without having to do too much work for each such site? Know anyone who has a
 background on doing this sort of thing? What sorts of approaches would you
 recommend?

- --
Best regards,
Bjoern Wilmsmann


