[jira] Commented: (NUTCH-299) Bittorrent Parser
[ http://issues.apache.org/jira/browse/NUTCH-299?page=comments#action_12414643 ]

Stefan Neufeind commented on NUTCH-299:
---------------------------------------

Could you briefly explain what it does? Does it extract the meta-data and index it as the content of that page? Or does it also follow the URL to the tracker, perhaps to discover other torrents etc.?

> Bittorrent Parser
> -----------------
>
>          Key: NUTCH-299
>          URL: http://issues.apache.org/jira/browse/NUTCH-299
>      Project: Nutch
>         Type: New Feature
>     Reporter: Hasan Diwan
>     Priority: Minor
>  Attachments: BitTorrent.jar
>
> BitTorrent information file parser

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host
[ http://issues.apache.org/jira/browse/NUTCH-298?page=comments#action_12414647 ]

Stefan Neufeind commented on NUTCH-298:
---------------------------------------

Is the description line of this bug correct? I've been indexing pages from hosts without a robots.txt, and I just verified that those hosts return a 404 since robots.txt does not exist.

> if a 404 for a robots.txt is returned no page is fetched at all from the host
> -----------------------------------------------------------------------------
>
>          Key: NUTCH-298
>          URL: http://issues.apache.org/jira/browse/NUTCH-298
>      Project: Nutch
>         Type: Bug
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: fixNpeRobotRuleSet.patch
>
> What happens: if no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt. If the HTTP response code is not 200 or 403 but, for example, 404, we do:
>
>     robotRules = EMPTY_RULES;    (line 402)
>
> EMPTY_RULES is a RobotRuleSet created with the default constructor, so tmpEntries and entries are null and are never changed. If we now try to fetch a page from that host, EMPTY_RULES is used and we call isAllowed on the RobotRuleSet. A NullPointerException is thrown at this line:
>
>     if (entries == null) {
>       entries = new RobotsEntry[tmpEntries.size()];
>
> Possible solution: initialize tmpEntries by default, and also remove the other null checks and initialisations.
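The failure mode described in the report can be sketched as follows. This is a simplified, hypothetical reconstruction, not the actual Nutch source: the field and method names (tmpEntries, entries, isAllowed) follow the bug report, and the code shows the proposed fix — tmpEntries initialized by default, so a rule set built with the default constructor (EMPTY_RULES) holds an empty list instead of null.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the RobotRuleSet bug and fix described in NUTCH-298.
// Names follow the report, not the exact Nutch source.
class RobotRuleSetSketch {
    static class RobotsEntry {
        final String prefix;
        final boolean allowed;
        RobotsEntry(String prefix, boolean allowed) {
            this.prefix = prefix;
            this.allowed = allowed;
        }
    }

    // The proposed fix: initialize tmpEntries by default. With the bug,
    // this was null for instances built via the default constructor.
    private List<RobotsEntry> tmpEntries = new ArrayList<>();
    private RobotsEntry[] entries = null;

    void addEntry(String prefix, boolean allowed) {
        tmpEntries.add(new RobotsEntry(prefix, allowed));
    }

    boolean isAllowed(String path) {
        if (entries == null) {
            // With the bug, tmpEntries was null here for EMPTY_RULES,
            // so tmpEntries.size() threw a NullPointerException.
            entries = tmpEntries.toArray(new RobotsEntry[0]);
        }
        for (RobotsEntry e : entries) {
            if (path.startsWith(e.prefix)) return e.allowed;
        }
        return true; // no matching rule: fetching is allowed
    }
}
```

With the fix, an empty rule set simply allows everything, which is the intended semantics for a host whose robots.txt returns 404.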
[jira] Commented: (NUTCH-299) Bittorrent Parser
[ http://issues.apache.org/jira/browse/NUTCH-299?page=comments#action_12414648 ]

Hasan Diwan commented on NUTCH-299:
-----------------------------------

It extracts and indexes the meta-data; it doesn't follow the URL to the tracker. I would add that if I have the time, or maybe someone else can.
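The actual parser is in the attached BitTorrent.jar; as a hypothetical sketch of what extracting torrent meta-data involves, a .torrent file is a "bencoded" dictionary (integers as i<n>e, strings as <len>:<bytes>, lists as l...e, dictionaries as d...e) whose top-level keys include the tracker "announce" URL and an "info" dictionary. The class below is an illustrative minimal decoder over a string, not the plugin's code (real torrents carry binary data and would be decoded over bytes).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal illustrative bencode decoder: i<int>e, <len>:<bytes>,
// l...e for lists, d...e for dictionaries. Not the plugin's actual code.
class BencodeSketch {
    private final String data;
    private int pos = 0;

    BencodeSketch(String data) { this.data = data; }

    Object decode() {
        char c = data.charAt(pos);
        if (c == 'i') return decodeInt();
        if (c == 'l') return decodeList();
        if (c == 'd') return decodeDict();
        return decodeString(); // a digit starts a length-prefixed string
    }

    private long decodeInt() {
        int end = data.indexOf('e', pos);           // i1024e -> 1024
        long v = Long.parseLong(data.substring(pos + 1, end));
        pos = end + 1;
        return v;
    }

    private String decodeString() {
        int colon = data.indexOf(':', pos);         // 4:info -> "info"
        int len = Integer.parseInt(data.substring(pos, colon));
        String s = data.substring(colon + 1, colon + 1 + len);
        pos = colon + 1 + len;
        return s;
    }

    private List<Object> decodeList() {
        pos++; // skip 'l'
        List<Object> out = new ArrayList<>();
        while (data.charAt(pos) != 'e') out.add(decode());
        pos++; // skip 'e'
        return out;
    }

    private Map<String, Object> decodeDict() {
        pos++; // skip 'd'
        Map<String, Object> out = new LinkedHashMap<>();
        while (data.charAt(pos) != 'e') {
            String key = decodeString();            // dict keys are strings
            out.put(key, decode());
        }
        pos++; // skip 'e'
        return out;
    }
}
```

An indexing plugin would then pull fields such as "announce" and info/"name" out of the decoded map and add them to the document; following the announce URL to the tracker (as discussed above) would be a separate fetching step.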
[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]

Stefan Groschupf updated NUTCH-298:
-----------------------------------

    Summary: if a 404 for a robots.txt is returned a NPE is thrown
             (was: if a 404 for a robots.txt is returned no page is fetched at all from the host)

Sorry, wrong description.
[jira] Commented: (NUTCH-294) Topic-maps of related searchwords
[ http://issues.apache.org/jira/browse/NUTCH-294?page=comments#action_12414653 ]

Stefan Neufeind commented on NUTCH-294:
---------------------------------------

I'm not sure. On a quick run I wasn't able to get the clustering-carrot2 plugin to work, though I thought I would simply need to include it. Maybe somebody else has already worked with it and could comment on whether that plugin is within the scope of this feature request. From what I found about carrot2, it is also used to cluster data from multiple search engines; I'm not sure how that relates to topic clusters.

> Topic-maps of related searchwords
> ---------------------------------
>
>          Key: NUTCH-294
>          URL: http://issues.apache.org/jira/browse/NUTCH-294
>      Project: Nutch
>         Type: New Feature
>   Components: searcher
>     Reporter: Stefan Neufeind
>
> Would it be possible to offer the user topic maps? That is, when you search for something, you also get topic-related words that might be of interest to you. I wonder if that's somehow possible with the ngram index for "did you mean" (see the separate feature-enhancement bug for this), but we'd need a relation between words (in what context they occur). For the web frontend, trees are usually used, which for some users offer quite impressive eye-candy :-) E.g. see this advertisement by Novell, where I've just seen a similar topic map: http://www.novell.com/de-de/company/advertising/defineyouropen.html
Re: search engine spam detector
Stefan Groschupf wrote:
> an interesting tool: http://tool.motoricerca.info/spam-detector/

Do you have good/bad experience with that tool? The idea of having something like this as a Nutch module (dropping pages or ranking them very low) might come up :-) From the FAQ I read that the author is a PHP guy; I'd say luckily ... but for Nutch that would at least mean translating a big part. The question remains how advanced his ideas already are and whether he would contribute to such an extension. But contributing the ideas behind it might be an interesting collaboration.

Stefan
[jira] Resolved: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
[ http://issues.apache.org/jira/browse/NUTCH-258?page=all ]

Chris A. Mattmann resolved NUTCH-258:
-------------------------------------

    Resolution: Won't Fix

The use of LOG.severe in the fetcher indicates an unrecoverable error: thus, this issue is not a bug, and in fact describes the actual intended behavior of the system.

> Once Nutch logs a SEVERE log item, Nutch fails forevermore
> ----------------------------------------------------------
>
>          Key: NUTCH-258
>          URL: http://issues.apache.org/jira/browse/NUTCH-258
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: All
>     Reporter: Scott Ganyo
>     Priority: Critical
>  Attachments: dumbfix.patch
>
> Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. This is from the run() method in Fetcher.java:
>
>     public void run() {
>       synchronized (Fetcher.this) {activeThreads++;} // count threads
>       try {
>         UTF8 key = new UTF8();
>         CrawlDatum datum = new CrawlDatum();
>         while (true) {
>           if (LogFormatter.hasLoggedSevere())       // something bad happened
>             break;                                  // exit
>
> Notice the last two lines. This will prevent Nutch from ever fetching again once this is hit, as LogFormatter stores this data in a static. (Also note that LogFormatter.hasLoggedSevere() is also checked in org.apache.nutch.net.URLFilterChecker and will disable that class as well.) This must be fixed or Nutch cannot be run as any kind of long-running service. Furthermore, I believe it is a poor decision to rely on a logging event to determine the state of the application; this could have any number of side effects that would be extremely difficult to track down. (As it already has for me.)
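The behavior the reporter objects to can be sketched as follows. This is a hypothetical, simplified illustration of the pattern, not the actual Nutch classes: a static flag set on the first SEVERE log entry permanently disables every later fetch loop in the same JVM, which is why a one-off error stops a long-running service for good.

```java
// Illustrative sketch of the static severe-flag pattern from the report;
// class and method names are simplified stand-ins for the Nutch originals.
class LogFormatterSketch {
    // Static state: once set, it stays set for the life of the JVM.
    private static volatile boolean loggedSevere = false;

    static void severe(String msg) { loggedSevere = true; }

    static boolean hasLoggedSevere() { return loggedSevere; }
}

class FetcherLoopSketch {
    /** Fetches up to 'items' items; returns how many were fetched
     *  before the loop bailed out on the severe flag. */
    static int run(int items) {
        int fetched = 0;
        for (int i = 0; i < items; i++) {
            if (LogFormatterSketch.hasLoggedSevere()) // something bad happened
                break;                                // exit
            fetched++;
        }
        return fetched;
    }
}
```

Because the flag is static and never reset, every run() after the first severe event fetches nothing, exactly the "fails forevermore" behavior in the issue title; the resolution above states that this is the intended semantics of SEVERE in the fetcher.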
[jira] Closed: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
[ http://issues.apache.org/jira/browse/NUTCH-258?page=all ]

Chris A. Mattmann closed NUTCH-258:
-----------------------------------

Won't fix: issue describes intended behavior of the system (fetcher component).
Re: search engine spam detector
> The idea of having something like this as a Nutch module (dropping
> pages or ranking them very low) might come up :-)

That will be a long road. I have collected some thoughts and a list of web-spam-related papers in my blog: http://www.find23.net/Web-Site/blog/521BA1CD-14C4-4E84-A072-F98E13CAEFE1.html

Feedback is welcome.

Stefan
Re: [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml
Hi,

What exactly does this plugin do? I haven't seen it mentioned and the README.txt doesn't really describe it.

Thanks,
Otis

----- Original Message -----
From: [EMAIL PROTECTED]
To: nutch-commits@lucene.apache.org
Sent: Sunday, June 4, 2006 3:44:23 PM
Subject: [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml

Author: siren
Date: Sun Jun  4 12:44:23 2006
New Revision: 411594

URL: http://svn.apache.org/viewvc?rev=411594&view=rev
Log:
initial import of web-keymatch plugin

Modified:
    lucene/nutch/trunk/contrib/web2/plugins/build.xml

Modified: lucene/nutch/trunk/contrib/web2/plugins/build.xml
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/build.xml?rev=411594&r1=411593&r2=411594&view=diff
==============================================================================
--- lucene/nutch/trunk/contrib/web2/plugins/build.xml (original)
+++ lucene/nutch/trunk/contrib/web2/plugins/build.xml Sun Jun  4 12:44:23 2006
@@ -15,6 +15,7 @@
     <ant dir="web-more" target="deploy"/>
     <ant dir="web-resources" target="deploy"/>
     <ant dir="web-clustering" target="deploy"/>
+    <ant dir="web-keymatch" target="deploy"/>
     <ant dir="web-query-propose-ontology" target="deploy"/>
     <ant dir="web-query-propose-spellcheck" target="deploy"/>
   </target>
@@ -25,6 +26,7 @@
   <target name="test">
     <parallel threadCount="2">
       <ant dir="web-caching-oscache" target="test"/>
+      <ant dir="web-keymatch" target="test"/>
     </parallel>
   </target>
@@ -35,6 +37,7 @@
     <ant dir="web-caching-oscache" target="clean"/>
     <ant dir="web-resources" target="clean"/>
     <ant dir="web-more" target="clean"/>
+    <ant dir="web-keymatch" target="clean"/>
     <ant dir="web-clustering" target="clean"/>
     <ant dir="web-query-propose-ontology" target="clean"/>
     <ant dir="web-query-propose-spellcheck" target="clean"/>

___
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs