[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2005-09-10 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12323158 ] 

Matt Kangas commented on NUTCH-87:
--

Sample plugin.xml file for use with WhitelistURLFilter

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="epile-whitelisturlfilter"
   name="Epile whitelist URL filter"
   version="1.0.0"
   provider-name="teamgigabyte.com">

   <extension-point
      id="org.apache.nutch.net.URLFilter"
      name="Nutch URL Filter"/>

   <runtime></runtime>

   <extension id="org.apache.nutch.net.urlfilter"
      name="Epile Whitelist URL Filter"
      point="org.apache.nutch.net.URLFilter">

      <implementation id="WhitelistURLFilter"
         class="epile.crawl.plugin.WhitelistURLFilter"/>
   </extension>
</plugin>
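
For orientation, here is a minimal sketch of the kind of class such a plugin.xml points at. It is not the attached implementation: it assumes the 0.7-era URLFilter contract (a single filter(String) method that returns the URL to accept it, or null to reject it), and the whitelist-file constructor and host-only lookup are simplifications of the per-domain regex hashtable described below.

// A sketch only (hypothetical config handling): accept a URL iff its host is
// on a whitelist loaded from a plain text file, using a hash-based lookup
// instead of scanning a regex list.
package epile.crawl.plugin;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

import org.apache.nutch.net.URLFilter;

public class WhitelistURLFilter implements URLFilter {

  private final Set whitelist = new HashSet();     // hosts we are willing to crawl

  public WhitelistURLFilter(String whitelistFile) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(whitelistFile));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() > 0 && !line.startsWith("#")) {
        whitelist.add(line.toLowerCase());         // one host per line
      }
    }
    in.close();
  }

  // URLFilter contract: return the URL to accept it, null to reject it.
  public String filter(String urlString) {
    try {
      String host = new URL(urlString).getHost().toLowerCase();
      return whitelist.contains(host) ? urlString : null;
    } catch (Exception e) {
      return null;                                 // reject malformed URLs
    }
  }
}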

 Efficient site-specific crawling for a large number of sites
 

  Key: NUTCH-87
  URL: http://issues.apache.org/jira/browse/NUTCH-87
  Project: Nutch
 Type: New Feature
   Components: fetcher
  Environment: cross-platform
 Reporter: AJ Chen
  Attachments: JIRA-87-whitelistfilter.tar.gz

 There is a gap between whole-web crawling and single (or handful) site 
 crawling. Many applications actually fall in this gap, and they usually 
 require crawling a large number of selected sites, say 10 domains. The current 
 CrawlTool is designed for a handful of sites, so this request calls for a new 
 feature or improvement on CrawlTool so that the nutch crawl command can 
 efficiently deal with a large number of sites. One requirement is to add or 
 change the smallest amount of code so that this feature can be implemented 
 sooner rather than later. 
 There is a discussion about adding a URLFilter to implement this requested 
 feature; see the following thread: 
 http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
 The idea is to use a hashtable in a URLFilter for looking up the regexes for 
 any given domain. A hashtable will be much faster than the list implementation 
 currently used in RegexURLFilter. Fortunately, Matt Kangas has implemented 
 such an idea before for his own application and is willing to make it 
 available for adaptation to Nutch. I'll be happy to help him in this regard. 
 But before we do it, we would like to hear more discussion or comments about 
 this approach or other approaches. In particular, let us know what the 
 potential downsides of a hashtable lookup in a new URLFilter plugin would be.
 AJ Chen

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-82) Nutch Commands should run on Windows without external tools

2005-10-20 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332660 ] 

Matt Kangas commented on NUTCH-82:
--

Another pure Java solution is to rewrite the nutch bash script in BeanShell 
(http://www.beanshell.org).

I just took a quick (~1 hr) stab at this. The syntax seems quite agreeable, 
with many builtin versions of standard unix commands (cd(), cat(), etc). 
However, I quickly hit two barriers:

1) Reading environment variables. System.getenv() works on 1.5, but is 
nonfunctional on Java 1.3 and 1.4. The only workaround on 1.4 is what Ant does: 
run a native command, read the output, and set system properties.

2) Setting -Xmx et al. My sense is that it's simply not possible.

Other than these issues, it would be quite easy to rewrite all of the 
usage/command/path-building logic into a beanshell script. Then there could be 
two *small* scripts (bash & .bat) to handle the stuff that can't be done in 
Java, and one beanshell script for the rest. Does that seem useful? 

FYI, the core beanshell interpreter is ~143k.
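
For reference, the Ant-style workaround mentioned in point 1 looks roughly like this (a hypothetical helper, not something BeanShell or Nutch provides): shell out to a native command, parse its output, and expose the environment as system properties.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class EnvLoader {
  // Copy the process environment into "env.*" system properties for JVMs
  // where System.getenv() is unavailable.
  public static void loadEnvIntoSystemProperties() throws IOException {
    String os = System.getProperty("os.name").toLowerCase();
    String[] cmd = (os.indexOf("windows") >= 0)
        ? new String[] { "cmd.exe", "/c", "set" }  // Windows
        : new String[] { "env" };                  // Unix-ish
    Process p = Runtime.getRuntime().exec(cmd);
    BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
      int eq = line.indexOf('=');
      if (eq > 0) {
        // Prefix with "env." to avoid clobbering real system properties.
        System.setProperty("env." + line.substring(0, eq), line.substring(eq + 1));
      }
    }
    in.close();
  }
}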

 Nutch Commands should run on Windows without external tools
 ---

  Key: NUTCH-82
  URL: http://issues.apache.org/jira/browse/NUTCH-82
  Project: Nutch
 Type: New Feature
  Environment: Windows 2000
 Reporter: AJ Banck
  Attachments: nutch.bat, nutch.bat, nutch.pl

 Currently there is only a shell script to run the Nutch commands. This should 
 be platform independent.
 Best would be Ant tools, or scripts generated by a template tool to avoid 
 replication.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-143) Improper error numbers returned on exit

2005-12-17 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-143?page=comments#action_12360689 ] 

Matt Kangas commented on NUTCH-143:
---

I'd like to see this fixed too. It would make error-checking in wrapper scripts 
much simpler to implement.

A fix would have to touch every .java file that has a main() method, because 
the JVM returns status=0 from main(): main() has a _void_ return type, after 
all.

To solve this, I recommend renaming all existing main() methods to doMain() 
and adding the following to each affected file:

  /**
   * main() wrapper that returns proper exit status
   */
  public static void main(String[] args) {
    Runtime rt = Runtime.getRuntime();
    try {
      boolean status = doMain(args);
      rt.exit(status ? 0 : 1);
    }
    catch (Exception e) {
      LOG.log(Level.SEVERE, LOGPREFIX + "error, caught Exception in main()", e);
      rt.exit(1);
    }
  }
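
A hypothetical example of the matching doMain() (the tool name and usage text are made up; the point is only the boolean return that the wrapper above turns into an exit status):

  public static boolean doMain(String[] args) throws Exception {
    if (args.length < 1) {
      System.err.println("Usage: SomeTool <arg>"); // hypothetical usage message
      return false;                                // wrapper exits with status 1
    }
    // ... original body of main() goes here ...
    return true;                                   // wrapper exits with status 0
  }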


 Improper error numbers returned on exit
 ---

  Key: NUTCH-143
  URL: http://issues.apache.org/jira/browse/NUTCH-143
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
 Reporter: Rod Taylor


 Nutch does not obey standard command line error numbers which can make it 
 difficult to script around commands.
 Both of the below should have exited with an error number larger than 0 
 causing the shell script to enter into the 'Failed' case.
 bash-3.00$ /opt/nutch/bin/nutch updatedb && echo ==Success || echo 
 ==Failed
 Usage: <crawldb> <segment>
 ==Success
 bash-3.00$ /opt/nutch/bin/nutch readdb && echo ==Success || echo 
 ==Failed
 Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -url <url>)
 <crawldb>   directory name where crawldb is located
 -stats  print overall statistics to System.out
 -dump <out_dir> dump the whole db to a text file in <out_dir>
 -url <url>  print information on <url> to System.out
 ==Success
 Note that the nutch shell script functions as expected:
 bash-3.00$ /opt/nutch/bin/nutch && echo ==Success || echo ==Failed
 Usage: nutch COMMAND
 where COMMAND is one of:
   crawl             one-step crawler for intranets
   readdb            read / dump crawl db
   readlinkdb        read / dump link db
   admin             database administration, including creation
   inject            inject new urls into the database
   generate          generate new segments to fetch
   fetch             fetch a segment's pages
   parse             parse a segment's pages
   updatedb          update crawl db from segments after fetching
   invertlinks       create a linkdb from parsed segments
   index             run the indexer on parsed segments and linkdb
   merge             merge several segment indexes
   dedup             remove duplicates from a set of segment indexes
   server            run a search server
   namenode          run the NDFS namenode
   datanode          run an NDFS datanode
   ndfs              run an NDFS admin client
   jobtracker        run the MapReduce job Tracker node
   tasktracker       run a MapReduce task Tracker node
   job               manipulate MapReduce jobs
  or
   CLASSNAME         run the class named CLASSNAME
 Most commands print help when invoked w/o parameters.
 ==Failed

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-12 Thread Matt Kangas (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-

Attachment: build.xml.patch
urlfilter-whitelist.tar.gz

THIS REPLACES THE PREVIOUS TARBALL
SEE THE INCLUDED README.txt FOR USAGE GUIDELINES

Place both of these files into ~nutch/src/plugin, then:
- untar the tarball
- apply the patch to ~nutch/src/plugin/build.xml to permit urlfilter-whitelist 
to be built

Next, cd ~nutch and build (ant).

A JUnit test is included. It will be run automatically by ant test-plugins.

Then follow the instructions in ~nutch/src/plugin/urlfilter-whitelist/README.txt

 Efficient site-specific crawling for a large number of sites
 

  Key: NUTCH-87
  URL: http://issues.apache.org/jira/browse/NUTCH-87
  Project: Nutch
 Type: New Feature
   Components: fetcher
  Environment: cross-platform
 Reporter: AJ Chen
  Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, 
 urlfilter-whitelist.tar.gz

 There is a gap between whole-web crawling and single (or handful) site 
 crawling. Many applications actually fall in this gap, and they usually 
 require crawling a large number of selected sites, say 10 domains. The current 
 CrawlTool is designed for a handful of sites, so this request calls for a new 
 feature or improvement on CrawlTool so that the nutch crawl command can 
 efficiently deal with a large number of sites. One requirement is to add or 
 change the smallest amount of code so that this feature can be implemented 
 sooner rather than later. 
 There is a discussion about adding a URLFilter to implement this requested 
 feature; see the following thread: 
 http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
 The idea is to use a hashtable in a URLFilter for looking up the regexes for 
 any given domain. A hashtable will be much faster than the list implementation 
 currently used in RegexURLFilter. Fortunately, Matt Kangas has implemented 
 such an idea before for his own application and is willing to make it 
 available for adaptation to Nutch. I'll be happy to help him in this regard. 
 But before we do it, we would like to hear more discussion or comments about 
 this approach or other approaches. In particular, let us know what the 
 potential downsides of a hashtable lookup in a new URLFilter plugin would be.
 AJ Chen

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-12 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12362584 ] 

Matt Kangas commented on NUTCH-87:
--

JIRA-87-whitelistfilter.tar.gz is OBSOLETE. Use the newer tarball + patch file 
instead.

 Efficient site-specific crawling for a large number of sites
 

  Key: NUTCH-87
  URL: http://issues.apache.org/jira/browse/NUTCH-87
  Project: Nutch
 Type: New Feature
   Components: fetcher
  Environment: cross-platform
 Reporter: AJ Chen
  Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, 
 urlfilter-whitelist.tar.gz

 There is a gap between whole-web crawling and single (or handful) site 
 crawling. Many applications actually fall in this gap, and they usually 
 require crawling a large number of selected sites, say 10 domains. The current 
 CrawlTool is designed for a handful of sites, so this request calls for a new 
 feature or improvement on CrawlTool so that the nutch crawl command can 
 efficiently deal with a large number of sites. One requirement is to add or 
 change the smallest amount of code so that this feature can be implemented 
 sooner rather than later. 
 There is a discussion about adding a URLFilter to implement this requested 
 feature; see the following thread: 
 http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
 The idea is to use a hashtable in a URLFilter for looking up the regexes for 
 any given domain. A hashtable will be much faster than the list implementation 
 currently used in RegexURLFilter. Fortunately, Matt Kangas has implemented 
 such an idea before for his own application and is willing to make it 
 available for adaptation to Nutch. I'll be happy to help him in this regard. 
 But before we do it, we would like to hear more discussion or comments about 
 this approach or other approaches. In particular, let us know what the 
 potential downsides of a hashtable lookup in a new URLFilter plugin would be.
 AJ Chen

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-19 Thread Matt Kangas (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-

Version: 0.7.2-dev
 0.8-dev

 Efficient site-specific crawling for a large number of sites
 

  Key: NUTCH-87
  URL: http://issues.apache.org/jira/browse/NUTCH-87
  Project: Nutch
 Type: New Feature
   Components: fetcher
 Versions: 0.8-dev, 0.7.2-dev
  Environment: cross-platform
 Reporter: AJ Chen
  Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, 
 urlfilter-whitelist.tar.gz

 There is a gap between whole-web crawling and single (or handful) site 
 crawling. Many applications actually fall in this gap, and they usually 
 require crawling a large number of selected sites, say 10 domains. The current 
 CrawlTool is designed for a handful of sites, so this request calls for a new 
 feature or improvement on CrawlTool so that the nutch crawl command can 
 efficiently deal with a large number of sites. One requirement is to add or 
 change the smallest amount of code so that this feature can be implemented 
 sooner rather than later. 
 There is a discussion about adding a URLFilter to implement this requested 
 feature; see the following thread: 
 http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
 The idea is to use a hashtable in a URLFilter for looking up the regexes for 
 any given domain. A hashtable will be much faster than the list implementation 
 currently used in RegexURLFilter. Fortunately, Matt Kangas has implemented 
 such an idea before for his own application and is willing to make it 
 available for adaptation to Nutch. I'll be happy to help him in this regard. 
 But before we do it, we would like to hear more discussion or comments about 
 this approach or other approaches. In particular, let us know what the 
 potential downsides of a hashtable lookup in a new URLFilter plugin would be.
 AJ Chen

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-19 Thread Matt Kangas (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-

Attachment: build.xml.patch-0.8

The previous patch file is valid for 0.7. Here is one that works for 0.8-dev 
(trunk).

(It's three separate one-line additions, to include the plugin in the deploy, 
test, and clean targets.)

 Efficient site-specific crawling for a large number of sites
 

  Key: NUTCH-87
  URL: http://issues.apache.org/jira/browse/NUTCH-87
  Project: Nutch
 Type: New Feature
   Components: fetcher
 Versions: 0.8-dev, 0.7.2-dev
  Environment: cross-platform
 Reporter: AJ Chen
  Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, 
 build.xml.patch-0.8, urlfilter-whitelist.tar.gz

 There is a gap between whole-web crawling and single (or handful) site 
 crawling. Many applications actually fall in this gap, and they usually 
 require crawling a large number of selected sites, say 10 domains. The current 
 CrawlTool is designed for a handful of sites, so this request calls for a new 
 feature or improvement on CrawlTool so that the nutch crawl command can 
 efficiently deal with a large number of sites. One requirement is to add or 
 change the smallest amount of code so that this feature can be implemented 
 sooner rather than later. 
 There is a discussion about adding a URLFilter to implement this requested 
 feature; see the following thread: 
 http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
 The idea is to use a hashtable in a URLFilter for looking up the regexes for 
 any given domain. A hashtable will be much faster than the list implementation 
 currently used in RegexURLFilter. Fortunately, Matt Kangas has implemented 
 such an idea before for his own application and is willing to make it 
 available for adaptation to Nutch. I'll be happy to help him in this regard. 
 But before we do it, we would like to hear more discussion or comments about 
 this approach or other approaches. In particular, let us know what the 
 potential downsides of a hashtable lookup in a new URLFilter plugin would be.
 AJ Chen

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-182) Log when db.max configuration limits reached

2006-01-19 Thread Matt Kangas (JIRA)
Log when db.max configuration limits reached


 Key: NUTCH-182
 URL: http://issues.apache.org/jira/browse/NUTCH-182
 Project: Nutch
Type: Improvement
  Components: fetcher  
Versions: 0.8-dev
Reporter: Matt Kangas
Priority: Trivial


Followup to http://www.nabble.com/Re%3A-Can%27t-index-some-pages-p2480833.html

There are three db.max parameters currently in nutch-default.xml:
 * db.max.outlinks.per.page
 * db.max.anchor.length
 * db.max.inlinks

Having values that are too low can result in a site being under-crawled. 
However, currently there is nothing written to the log when these limits are 
hit, so users have to guess when they need to raise these values.

I suggest that we add three new log messages at the appropriate points:
 * Exceeded db.max.outlinks.per.page for URL 
 * Exceeded db.max.anchor.length for URL 
 * Exceeded db.max.inlinks for URL 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-182) Log when db.max configuration limits reached

2006-01-19 Thread Matt Kangas (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-182?page=all ]

Matt Kangas updated NUTCH-182:
--

Attachment: ParseData.java.patch
LinkDb.java.patch

Two patches are attached for nutch/trunk (0.8-dev).

LinkDb.java.patch adds two new LOG.info() statements:
 * Exceeded db.max.anchor.length for URL <url>
 * Exceeded db.max.inlinks for URL <url>

ParseData.java.patch adds a private static LOG variable, plus one LOG.info() 
statement:
 * Exceeded db.max.outlinks.per.page

I would have preferred to print the URL too on the latter, but it's not 
available in the method where the cutoff is performed (afaik).
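
For readers who haven't opened the patches, the change is roughly of this shape (illustrative only; the class, method, and variable names here are hypothetical, not the actual LinkDb/ParseData code):

import java.util.List;
import java.util.logging.Logger;

public class DbLimitLogging {
  private static final Logger LOG = Logger.getLogger(DbLimitLogging.class.getName());

  // Truncate an over-long anchor, and now say so in the log.
  static String truncateAnchor(String url, String anchor, int maxAnchorLength) {
    if (anchor.length() > maxAnchorLength) {
      LOG.info("Exceeded db.max.anchor.length for URL " + url);
      anchor = anchor.substring(0, maxAnchorLength);
    }
    return anchor;
  }

  // Stop collecting inlinks for a URL once the cap is hit, and log it.
  static boolean acceptInlink(String url, List inlinks, int maxInlinks) {
    if (inlinks.size() >= maxInlinks) {
      LOG.info("Exceeded db.max.inlinks for URL " + url);
      return false;
    }
    return true;
  }
}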

 Log when db.max configuration limits reached
 

  Key: NUTCH-182
  URL: http://issues.apache.org/jira/browse/NUTCH-182
  Project: Nutch
 Type: Improvement
   Components: fetcher
 Versions: 0.8-dev
 Reporter: Matt Kangas
 Priority: Trivial
  Attachments: LinkDb.java.patch, ParseData.java.patch

 Followup to http://www.nabble.com/Re%3A-Can%27t-index-some-pages-p2480833.html
 There are three db.max parameters currently in nutch-default.xml:
  * db.max.outlinks.per.page
  * db.max.anchor.length
  * db.max.inlinks
 Having values that are too low can result in a site being under-crawled. 
 However, currently there is nothing written to the log when these limits are 
 hit, so users have to guess when they need to raise these values.
 I suggest that we add three new log messages at the appropriate points:
  * Exceeded db.max.outlinks.per.page for URL 
  * Exceeded db.max.anchor.length for URL 
  * Exceeded db.max.inlinks for URL 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-19 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412601 ] 

Matt Kangas commented on NUTCH-272:
---

I've been thinking about this after hitting several sites that explode into 1.5 
M URLs (or more). I could sleep easier at night if I could set a cap at 50k 
URLs/site and just check my log files in the morning.

Counting total URLs/domain needs to happen in one of the places where Nutch 
already traverses the crawldb. For Nutch 0.8 these are "nutch generate" and 
"nutch updatedb". 

URLs are added by both nutch inject and nutch updatedb. These tools use the 
URLFilter plugin x-point to determine which URLs to keep, and which to reject. 
But note that updatedb could only compute URLs/domain _after_ traversing 
crawldb, during which time it merges the new URLs.

So, one way to approach it is:

* Count URLs/domain during update. If a domain exceeds the limit, write it to a 
file (a rough sketch follows below).

* Read this file at the start of update (next cycle) and block further 
additions

* Or: read the file in a new URLFilter plugin, and block the URLs in 
URLFilter.filter()

If you do it all in update, you won't catch URLs added via inject, but it 
would still halt runaway crawls, and it would be simpler because it would be a 
one-file patch.
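
To make the first bullet concrete, here is a rough sketch of the counting/report step (a hypothetical helper; the class name, the one-host-per-line file format, and the idea of calling it from the update traversal are assumptions, not existing Nutch code):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.URL;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class HostQuotaTracker {
  private final Map counts = new HashMap();  // host -> Integer count
  private final int maxUrlsPerHost;

  public HostQuotaTracker(int maxUrlsPerHost) {
    this.maxUrlsPerHost = maxUrlsPerHost;
  }

  // Call once per crawldb URL while the update traverses the db.
  public void count(String urlString) {
    try {
      String host = new URL(urlString).getHost().toLowerCase();
      Integer n = (Integer) counts.get(host);
      counts.put(host, new Integer(n == null ? 1 : n.intValue() + 1));
    } catch (Exception e) {
      // ignore malformed URLs
    }
  }

  // Write one over-limit host per line; the next cycle reads this file back
  // and blocks further additions for those hosts.
  public void writeOverLimitHosts(String file) throws IOException {
    PrintWriter out = new PrintWriter(new FileWriter(file));
    for (Iterator it = counts.entrySet().iterator(); it.hasNext();) {
      Map.Entry e = (Map.Entry) it.next();
      if (((Integer) e.getValue()).intValue() > maxUrlsPerHost) {
        out.println(e.getKey());
      }
    }
    out.close();
  }
}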

 Max. pages to crawl/fetch per site (emergency limit)
 

  Key: NUTCH-272
  URL: http://issues.apache.org/jira/browse/NUTCH-272
  Project: Nutch
 Type: Improvement

 Reporter: Stefan Neufeind


 If I'm right, there is no way in place right now for setting an emergency 
 limit to fetch a certain max. number of pages per site. Is there an easy 
 way to implement such a limit, maybe as a plugin?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-19 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412614 ] 

Matt Kangas commented on NUTCH-272:
---

btw, I'd love to be proven wrong, because if generate.max.per.host parameter 
works as a hard URL cap per site, I could be sleeping better quite soon. :)

 Max. pages to crawl/fetch per site (emergency limit)
 

  Key: NUTCH-272
  URL: http://issues.apache.org/jira/browse/NUTCH-272
  Project: Nutch
 Type: Improvement

 Reporter: Stefan Neufeind


 If I'm right, there is no way in place right now for setting an emergency 
 limit to fetch a certain max. number of pages per site. Is there an easy 
 way to implement such a limit, maybe as a plugin?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-22 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412845 ] 

Matt Kangas commented on NUTCH-272:
---

Scratch my last comment. :-) I assumed that URLFilters.filter() was applied 
while traversing the segment, as it was in 0.7. Not true in 0.8... it's applied 
during Generate.

(Wow. This means the crawldb will accumulate lots of junk URLs over time. Is 
this a feature or a bug?)

 Max. pages to crawl/fetch per site (emergency limit)
 

  Key: NUTCH-272
  URL: http://issues.apache.org/jira/browse/NUTCH-272
  Project: Nutch
 Type: Improvement

 Reporter: Stefan Neufeind


 If I'm right, there is no way in place right now for setting an emergency 
 limit to fetch a certain max. number of pages per site. Is there an easy 
 way to implement such a limit, maybe as a plugin?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-30 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413939 ] 

Matt Kangas commented on NUTCH-289:
---

+1 to saving IP address in CrawlDatum, wherever the value comes from. (Fetcher 
or otherwise)

 CrawlDatum should store IP address
 --

  Key: NUTCH-289
  URL: http://issues.apache.org/jira/browse/NUTCH-289
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Doug Cutting


 If the CrawlDatum stored the IP address of the host of its URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-30 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12413959 ] 

Matt Kangas commented on NUTCH-272:
---

Thanks Doug, that makes more sense now. Running URLFilters.filter() during 
Generate seems very handy, albeit costly for large crawls. (Should have an 
option to turn off?)

I also see that URLFilters.filter() is applied in Fetcher (for redirects) and 
ParseOutputFormat, plus other tools.

Another possible choke-point: CrawlDbMerger.Merger.reduce(). The keys are URLs, 
and they're sorted. You can veto crawldb additions here. Could you effectively 
count URLs/host here? (Not sure when distributed.) Would it require setting a 
Partitioner, like crawl.PartitionUrlByHost?
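
To illustrate the counting idea (a hypothetical helper only, not the actual CrawlDbMerger code, and assuming the keys really do arrive grouped by host thanks to a host-based Partitioner):

import java.net.URL;

public class PerHostVeto {
  private final int maxUrlsPerHost;
  private String currentHost = null;
  private int countForHost = 0;

  public PerHostVeto(int maxUrlsPerHost) {
    this.maxUrlsPerHost = maxUrlsPerHost;
  }

  // Feed URLs in sorted order; returns false once a host exceeds the cap,
  // i.e. the caller should veto adding that URL to the crawldb.
  public boolean accept(String urlString) {
    String host;
    try {
      host = new URL(urlString).getHost().toLowerCase();
    } catch (Exception e) {
      return false;
    }
    if (!host.equals(currentHost)) {  // a new host begins: reset the counter
      currentHost = host;
      countForHost = 0;
    }
    return ++countForHost <= maxUrlsPerHost;
  }
}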

 Max. pages to crawl/fetch per site (emergency limit)
 

  Key: NUTCH-272
  URL: http://issues.apache.org/jira/browse/NUTCH-272
  Project: Nutch
 Type: Improvement

 Reporter: Stefan Neufeind


 If I'm right, there is no way in place right now for setting an emergency 
 limit to fetch a certain max. number of pages per site. Is there an easy 
 way to implement such a limit, maybe as a plugin?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2007-12-04 Thread Matt Kangas (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548420
 ] 

Matt Kangas commented on NUTCH-585:
---

Simplest path forward... that I can think of:

1) Add a new indexing plugin extension-point for filtering page content.
2) Put your a-priori, marked-up content-exclusion logic into a plugin.
3) Someone else figures out a more general-purpose solution later, and swaps 
out your plugin at that time.

Ergo, you generalize the interface, and lazy-load the more general 
implementation. :-)
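
For concreteness, a minimal sketch of the marker-based exclusion described in the issue below (a hypothetical helper, not the contributed patch; the marker strings and the parser.html.ignore.* property names come from the issue description):

import java.util.regex.Pattern;

public class IgnoreBlockStripper {
  // The marker strings could come from configuration, e.g.
  // parser.html.ignore.start / parser.html.ignore.stop in nutch-site.xml.
  private static final Pattern IGNORED = Pattern.compile(
      "<!--\\s*START-IGNORE\\s*-->.*?<!--\\s*STOP-IGNORE\\s*-->",
      Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

  // Remove the ignored regions before the text is indexed.
  public static String strip(String html) {
    return IGNORED.matcher(html).replaceAll("");
  }
}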


 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Priority: Minor

 We are using nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, maybe factorizing the comment 
 strings as constants in the configuration files (say parser.html.ignore.start 
 and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet. Looking forward to any 
 expression of interest - or to an explanation of why what we are doing is 
 plain wrong!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.