[jira] Commented: (NUTCH-299) Bittorrent Parser

2006-06-04 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-299?page=comments#action_12414643 ] 

Stefan Neufeind commented on NUTCH-299:
---

Could you briefly explain what it does? Extract meta-data and index the comment 
as content of that page? Or does it also follow the URL to the tracker 
(maybe) to discover other torrents etc.?

 Bittorrent Parser
 -

  Key: NUTCH-299
  URL: http://issues.apache.org/jira/browse/NUTCH-299
  Project: Nutch
 Type: New Feature

 Reporter: Hasan Diwan
 Priority: Minor
  Attachments: BitTorrent.jar

 BitTorrent information file parser

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-04 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-298?page=comments#action_12414647 ] 

Stefan Neufeind commented on NUTCH-298:
---

Is the description line of this bug correct? I've been indexing pages without 
robots.txt, and I just checked that those hosts give a 404 since robots.txt 
does not exist.

 if a 404 for a robots.txt is returned no page is fetched at all from the host
 -

  Key: NUTCH-298
  URL: http://issues.apache.org/jira/browse/NUTCH-298
  Project: Nutch
 Type: Bug

 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: fixNpeRobotRuleSet.patch

 What happens:
 If no RobotRuleSet is in the cache for a host, we try to fetch the 
 robots.txt.
 In case the HTTP response code is not 200 or 403 but, for example, 404, we do 
 robotRules = EMPTY_RULES;  (line: 402)
 EMPTY_RULES is a RobotRuleSet created with the default constructor.
 tmpEntries and entries are null and will never be changed.
 If we now try to fetch a page from that host, EMPTY_RULES is used 
 and we call isAllowed on the RobotRuleSet.
 In this case an NPE is thrown at this line:
  if (entries == null) {
 entries = new RobotsEntry[tmpEntries.size()];
 Possible solution:
 We can initialize tmpEntries by default and also remove the other null checks 
 and initializations.
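The proposed fix can be sketched as follows. This is a minimal illustration, not the actual Nutch source: only the names RobotRuleSet, RobotsEntry, tmpEntries, entries, EMPTY_RULES and isAllowed come from the issue text; everything else is assumed.

```java
import java.util.ArrayList;

// Simplified, hypothetical RobotRuleSet illustrating the suggested fix:
// tmpEntries is initialized eagerly, so a rule set built with the default
// constructor (like EMPTY_RULES) no longer throws an NPE in isAllowed.
public class RobotRuleSetSketch {

    static class RobotsEntry {
        final String prefix;
        final boolean allowed;
        RobotsEntry(String prefix, boolean allowed) {
            this.prefix = prefix;
            this.allowed = allowed;
        }
    }

    // Initialized by default -- the change suggested in the issue.
    private final ArrayList<RobotsEntry> tmpEntries = new ArrayList<RobotsEntry>();
    private RobotsEntry[] entries = null;

    void addEntry(String prefix, boolean allowed) {
        tmpEntries.add(new RobotsEntry(prefix, allowed));
    }

    boolean isAllowed(String path) {
        if (entries == null) {
            // Previously an NPE here when tmpEntries was still null (EMPTY_RULES).
            entries = tmpEntries.toArray(new RobotsEntry[tmpEntries.size()]);
        }
        for (RobotsEntry e : entries) {
            if (path.startsWith(e.prefix)) {
                return e.allowed;
            }
        }
        return true; // no matching rule: fetch is allowed
    }

    public static void main(String[] args) {
        RobotRuleSetSketch empty = new RobotRuleSetSketch();
        System.out.println(empty.isAllowed("/any/page")); // true, no NPE
    }
}
```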




[jira] Commented: (NUTCH-299) Bittorrent Parser

2006-06-04 Thread Hasan Diwan (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-299?page=comments#action_12414648 ] 

Hasan Diwan commented on NUTCH-299:
---

Extracts and indexes meta-data. Doesn't follow the URL to the tracker. I would 
add that if I have the time, or maybe someone else can.
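For context, a .torrent file is a bencoded dictionary. The following is a hypothetical sketch of the kind of metadata extraction such a parser performs; the attached BitTorrent.jar was not inspected, so this is not the plugin's actual code. Key names like "announce" and "comment" are standard torrent fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal bencode decoder sketch: integers (i...e), strings (len:bytes),
// lists (l...e), and dictionaries (d...e).
public class BencodeSketch {
    private final byte[] data;
    private int pos = 0;

    BencodeSketch(byte[] data) { this.data = data; }

    Object decode() {
        char c = (char) data[pos];
        if (c == 'i') return decodeInt();
        if (c == 'l') return decodeList();
        if (c == 'd') return decodeDict();
        return decodeString();
    }

    long decodeInt() {
        pos++; // skip 'i'
        int end = indexOf('e');
        long v = Long.parseLong(new String(data, pos, end - pos));
        pos = end + 1;
        return v;
    }

    String decodeString() {
        int colon = indexOf(':');
        int len = Integer.parseInt(new String(data, pos, colon - pos));
        String s = new String(data, colon + 1, len);
        pos = colon + 1 + len;
        return s;
    }

    List<Object> decodeList() {
        pos++; // skip 'l'
        List<Object> list = new ArrayList<Object>();
        while ((char) data[pos] != 'e') list.add(decode());
        pos++; // skip closing 'e'
        return list;
    }

    Map<String, Object> decodeDict() {
        pos++; // skip 'd'
        Map<String, Object> map = new LinkedHashMap<String, Object>();
        while ((char) data[pos] != 'e') {
            String key = decodeString();
            map.put(key, decode());
        }
        pos++; // skip closing 'e'
        return map;
    }

    private int indexOf(char c) {
        int i = pos;
        while ((char) data[i] != c) i++;
        return i;
    }

    public static void main(String[] args) {
        byte[] torrent =
            "d8:announce26:http://tracker.example/ann7:comment5:helloe".getBytes();
        Map<String, Object> meta = new BencodeSketch(torrent).decodeDict();
        System.out.println(meta.get("announce")); // http://tracker.example/ann
        System.out.println(meta.get("comment"));  // hello
    }
}
```

Fields extracted this way (announce URL, comment, file names) could then be indexed as document metadata, which matches the behavior described above.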





[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown

2006-06-04 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]

Stefan Groschupf updated NUTCH-298:
---

Summary: if a 404 for a robots.txt is returned a NPE is thrown  (was: if a 
404 for a robots.txt is returned no page is fetched at all from the host)

Sorry, wrong description.

 if a 404 for a robots.txt is returned a NPE is thrown
 -

  Key: NUTCH-298
  URL: http://issues.apache.org/jira/browse/NUTCH-298
  Project: Nutch
 Type: Bug

 Reporter: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: fixNpeRobotRuleSet.patch





[jira] Commented: (NUTCH-294) Topic-maps of related searchwords

2006-06-04 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12414653 ] 

Stefan Neufeind commented on NUTCH-294:
---

I'm not sure. On a quick run I wasn't able to get the 
clustering-carrot2 plugin to work, though I thought I would simply need to 
include it.
Maybe somebody else has already worked with it and could comment on whether 
that plugin is within the scope of this feature request.
From what I found about carrot2, it is also used to cluster data from multiple 
search engines; I'm not sure how that relates to topic clusters.

 Topic-maps of related searchwords
 -

  Key: NUTCH-294
  URL: http://issues.apache.org/jira/browse/NUTCH-294
  Project: Nutch
 Type: New Feature

   Components: searcher
 Reporter: Stefan Neufeind


 Would it be possible to offer users topic-maps? That is, when you search for 
 something you also get topic-related words that might be of interest to you. 
 I wonder if that's somehow possible with the ngram-index for "did you mean" 
 (see the separate feature-enhancement bug for this), but we'd need to have a 
 relation between words (in what context they occur).
 For the web frontend, trees are usually used, which for some users offer 
 quite impressive eye-candy :-) E.g. see this advertisement by Novell, where 
 I've just seen a similar topic-map as well:
 http://www.novell.com/de-de/company/advertising/defineyouropen.html
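A toy sketch of the kind of word relation described above: counting co-occurrence of terms within documents. This is purely illustrative, not Nutch code; a real implementation would work over the index rather than raw strings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// For a query term, count how often other words appear in the same
// documents. High counts suggest "related" words for a topic-map.
public class RelatedWordsSketch {

    public static Map<String, Integer> related(String term, List<String> docs) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String doc : docs) {
            String[] words = doc.toLowerCase().split("\\W+");
            if (!Arrays.asList(words).contains(term)) continue;
            for (String w : words) {
                if (!w.equals(term) && !w.isEmpty()) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "open source search engine",
            "search engine crawler",
            "open source license");
        // "engine" co-occurs with "search" in two documents:
        System.out.println(related("search", docs).get("engine")); // 2
    }
}
```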




Re: search engine spam detector

2006-06-04 Thread Stefan Neufeind
Stefan Groschupf wrote:
 
 an interesting tool:
 http://tool.motoricerca.info/spam-detector/

Do you have good/bad experience with that tool? The idea to have
something like this as a nutch-module (dropping pages or ranking them
very low) might come up :-)

From the FAQ I read that the author is a PHP guy - I'd say luckily ...
but for Nutch that would at least mean translating a big part. The question
still remains how advanced his ideas already are and whether he would
contribute to such an extension. But contributing the ideas behind it
might be an interesting collaboration.

  Stefan


[jira] Resolved: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-04 Thread Chris A. Mattmann (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-258?page=all ]
 
Chris A. Mattmann resolved NUTCH-258:
-

Resolution: Won't Fix

The use of LOG.severe in the fetcher indicates an unrecoverable error: thus, 
this issue is not a bug, and in fact describes the actual intended behavior of 
the system.

 Once Nutch logs a SEVERE log item, Nutch fails forevermore
 --

  Key: NUTCH-258
  URL: http://issues.apache.org/jira/browse/NUTCH-258
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
  Environment: All
 Reporter: Scott Ganyo
 Priority: Critical
  Attachments: dumbfix.patch

 Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. 
  This is from the run() method in Fetcher.java:
 public void run() {
   synchronized (Fetcher.this) {activeThreads++;} // count threads

   try {
     UTF8 key = new UTF8();
     CrawlDatum datum = new CrawlDatum();

     while (true) {
       if (LogFormatter.hasLoggedSevere()) // something bad happened
         break;                            // exit
   
 Notice the last 2 lines.  This will prevent Nutch from ever fetching again 
 once this is hit, as LogFormatter stores this data in a static field.
 (Also note that LogFormatter.hasLoggedSevere() is also checked in 
 org.apache.nutch.net.URLFilterChecker and will disable this class as well.)
 This must be fixed or Nutch cannot be run as any kind of long-running 
 service.  Furthermore, I believe it is a poor decision to rely on a logging 
 event to determine the state of the application - this could have any number 
 of side-effects that would be extremely difficult to track down.  (As it has 
 already for me.)
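The report's complaint can be illustrated with a small contrast between the two approaches. Only the name LogFormatter.hasLoggedSevere() comes from the report; the classes below are hypothetical sketches, not Nutch code.

```java
// A JVM-wide static flag, once set, poisons every later run in the same
// process; an instance-level flag resets naturally with each new job.
public class SevereFlagSketch {

    // Static flag: set once, set forever (the pattern the report criticizes).
    static class StaticFlag {
        private static volatile boolean loggedSevere = false;
        static void logSevere(String msg) { loggedSevere = true; }
        static boolean hasLoggedSevere() { return loggedSevere; }
    }

    // Instance flag: each fetch job starts with fresh error state.
    static class FetchJob {
        private volatile boolean failed = false;
        void recordSevere(String msg) { failed = true; }
        boolean shouldStop() { return failed; }
    }

    public static void main(String[] args) {
        StaticFlag.logSevere("disk full");
        // A brand-new "run" in the same JVM still sees the old failure:
        System.out.println(StaticFlag.hasLoggedSevere()); // true

        FetchJob first = new FetchJob();
        first.recordSevere("disk full");
        FetchJob second = new FetchJob();
        System.out.println(second.shouldStop()); // false: a new job is unaffected
    }
}
```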




[jira] Closed: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-04 Thread Chris A. Mattmann (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-258?page=all ]
 
Chris A. Mattmann closed NUTCH-258:
---


Won't fix: the issue describes the intended behavior of the system (fetcher component).

 Once Nutch logs a SEVERE log item, Nutch fails forevermore
 --





Re: search engine spam detector

2006-06-04 Thread Stefan Groschupf


The idea to have
something like this as a nutch-module (dropping pages or ranking them
very low) might come up :-)


This will be a very long way.
I collected some thoughts and a list of web-spam-related papers on my
blog:
http://www.find23.net/Web-Site/blog/521BA1CD-14C4-4E84-A072-F98E13CAEFE1.html

Feedback is welcome.


Stefan



Re: [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml

2006-06-04 Thread ogjunk-nutch
Hi,

What exactly does this plugin do?  I haven't seen it mentioned and the 
README.txt doesn't really describe it.

Thanks,
Otis

- Original Message 
From: [EMAIL PROTECTED]
To: nutch-commits@lucene.apache.org
Sent: Sunday, June 4, 2006 3:44:23 PM
Subject: [Nutch-cvs] svn commit: r411594 - 
/lucene/nutch/trunk/contrib/web2/plugins/build.xml

Author: siren
Date: Sun Jun  4 12:44:23 2006
New Revision: 411594

URL: http://svn.apache.org/viewvc?rev=411594&view=rev
Log:
initial import of web-keymatch plugin

Modified:
lucene/nutch/trunk/contrib/web2/plugins/build.xml

Modified: lucene/nutch/trunk/contrib/web2/plugins/build.xml
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/build.xml?rev=411594&r1=411593&r2=411594&view=diff
==
--- lucene/nutch/trunk/contrib/web2/plugins/build.xml (original)
+++ lucene/nutch/trunk/contrib/web2/plugins/build.xml Sun Jun  4 12:44:23 2006
@@ -15,6 +15,7 @@
     <ant dir="web-more" target="deploy"/>
     <ant dir="web-resources" target="deploy"/>
     <ant dir="web-clustering" target="deploy"/>
+    <ant dir="web-keymatch" target="deploy"/>
     <ant dir="web-query-propose-ontology" target="deploy"/>
     <ant dir="web-query-propose-spellcheck" target="deploy"/>
   </target>
@@ -25,6 +26,7 @@
   <target name="test">
     <parallel threadCount="2">
       <ant dir="web-caching-oscache" target="test"/>
+      <ant dir="web-keymatch" target="test"/>
     </parallel>
   </target>
 
@@ -35,6 +37,7 @@
     <ant dir="web-caching-oscache" target="clean"/>
     <ant dir="web-resources" target="clean"/>
     <ant dir="web-more" target="clean"/>
+    <ant dir="web-keymatch" target="clean"/>
     <ant dir="web-clustering" target="clean"/>
     <ant dir="web-query-propose-ontology" target="clean"/>
     <ant dir="web-query-propose-spellcheck" target="clean"/>




___
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs