[jira] Created: (NUTCH-187) Run Nutch on Windows without Cygwin
Run Nutch on Windows without Cygwin
-----------------------------------

         Key: NUTCH-187
         URL: http://issues.apache.org/jira/browse/NUTCH-187
     Project: Nutch
        Type: Improvement
  Components: ndfs
    Versions: 0.8-dev
 Environment: Windows
    Reporter: Dominik Friedrich
    Priority: Minor

Currently you cannot start Nutch datanodes on Windows outside of a cygwin environment because it relies on the df command to read the free disk space.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-187) Run Nutch on Windows without Cygwin
[ http://issues.apache.org/jira/browse/NUTCH-187?page=all ]

Dominik Friedrich updated NUTCH-187:
-----------------------------------

    Attachment: DF.diff

This patch enables Nutch to read the free disk space on Windows systems. This version cannot read the partition size, only the free space. On Windows, capacity is therefore set to twice the free disk space, used space is set equal to the free disk space, percent used is set to 50, and mount is set to the partition, e.g. c:. This patch has only been used in some experiments, just to be able to start a datanode on Windows from within the Eclipse IDE.

Run Nutch on Windows without Cygwin
-----------------------------------

         Key: NUTCH-187
         URL: http://issues.apache.org/jira/browse/NUTCH-187
     Project: Nutch
        Type: Improvement
  Components: ndfs
    Versions: 0.8-dev
 Environment: Windows
    Reporter: Dominik Friedrich
    Priority: Minor
 Attachments: DF.diff

Currently you cannot start Nutch datanodes on Windows outside of a cygwin environment because it relies on the df command to read the free disk space.
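The placeholder values described in the comment above can be sketched in a few lines of Java. This is only an illustration and not the contents of DF.diff: the class name DiskUsageSketch and both factory methods are hypothetical. The first method reproduces the stated heuristic when only the free space is known. The second is an assumption on my part: on Java 6 or later, java.io.File can report both total and usable space without calling df, which would make the placeholders unnecessary.

```java
import java.io.File;

/** Hypothetical holder for the values a df-style helper reports. */
final class DiskUsageSketch {
    final String mount;      // partition or mount point, e.g. "c:"
    final long capacity;     // bytes
    final long used;         // bytes
    final long available;    // bytes
    final int percentUsed;   // 0..100

    DiskUsageSketch(String mount, long capacity, long used, long available, int percentUsed) {
        this.mount = mount;
        this.capacity = capacity;
        this.used = used;
        this.available = available;
        this.percentUsed = percentUsed;
    }

    /** Heuristic from the comment: only the free space is known, so the rest is derived from it. */
    static DiskUsageSketch fromFreeSpaceOnly(String mount, long freeBytes) {
        return new DiskUsageSketch(mount, 2 * freeBytes, freeBytes, freeBytes, 50);
    }

    /** Assumed alternative (Java 6+): ask the JVM for the real partition size and free space. */
    static DiskUsageSketch fromFile(File dir) {
        long capacity = dir.getTotalSpace();   // 0 if the path does not name a partition
        long available = dir.getUsableSpace();
        long used = capacity - available;
        int percent = capacity == 0 ? 0 : (int) Math.round(100.0 * used / capacity);
        return new DiskUsageSketch(dir.getPath(), capacity, used, available, percent);
    }
}
```

For example, DiskUsageSketch.fromFreeSpaceOnly("c:", freeBytes) yields a capacity of 2 * freeBytes and a percent used of 50, matching the values listed in the comment.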
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363942 ]

Andrzej Bialecki commented on NUTCH-139:
----------------------------------------

Yes, this should work ok ... but it strikes me as unnecessarily complicated. After all, in most cases we will have single values and no overrides, so this solution complicates the most common cases...

At this point it's probably easier just to keep the original key, val[] in one Map, and potential overrides key, val1[] in another Map, and then provide a container/facade with appropriate methods to add/get/set whichever value is necessary. E.g.:

    public class MetaData {
      private HashMap original = new HashMap();
      private HashMap actual = new HashMap();

      public void add(String key, String val) {
        // same as in ContentProperties now, uses the original map
        ...
      }

      public void set(String key, String val) {
        // same as in ContentProperties now, uses the original map
        ...
      }

      public void setFinal(String key, String val) {
        // as above, but uses the actual map
      }

      // return the final value, if it's missing then return the original value
      public Object getFinal(String key) {
        Object res = actual.get(key);
        if (res == null) res = original.get(key);
        return res;
      }
      ...
    }

This seems to satisfy all the requirements, and with minimal overhead. If this is ok with you, please prepare a patch, and we should commit it - there are many other changes waiting in the queue that depend on this patch being applied ...

(BTW. I think it's conceptually the same as using the X-nutch to avoid name clashes, but from the point of view of correct OO programming it looks more kosher now... ;-) )

Standard metadata property names in the ParseData metadata
----------------------------------------------------------

         Key: NUTCH-139
         URL: http://issues.apache.org/jira/browse/NUTCH-139
     Project: Nutch
        Type: Improvement
  Components: fetcher
    Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
 Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment
    Reporter: Chris A. Mattmann
    Assignee: Chris A. Mattmann
    Priority: Minor
     Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

Currently, people are free to name their string-based properties anything that they want, such as having names of Content-type, content-TyPe, CONTENT_TYPE all having the same meaning. Stefan G., I believe, proposed a solution in which all property names are converted to lower case, but in essence this only fixes half the problem (the case of identifying that CONTENT_TYPE and conTeNT_TyPE and all the permutations are really the same). What if I named it Content Type, or ContentType?

I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like:

    public class ParseData {
      ...
      public static final String CONTENT_TYPE = "content-type";
      public static final String CREATOR = "creator";
    }

In this fashion, users could at least know the names of the standard properties that they can obtain from the ParseData, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml");

Of course, this wouldn't preclude users from doing what they are currently doing; it would just provide a standard method of obtaining some of the more common, critical metadata without poring over the code base to figure out what they are named. I'll contribute a patch near the end of this week, or the beginning of next week, that addresses this issue.
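The two-map facade proposed in the comment can be filled in as a small runnable sketch. The class name MetaDataSketch, the use of generics, and the bodies of add, set and setFinal are assumptions made for illustration; the comment leaves those bodies elided and only says they behave like ContentProperties. The getFinal fallback (override first, then original) is the part taken directly from the comment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative two-map container: original values plus optional overrides. */
final class MetaDataSketch {
    private final Map<String, List<String>> original = new HashMap<>();
    private final Map<String, List<String>> actual = new HashMap<>();

    /** Assumed behavior: appends a value to the original values stored under key. */
    void add(String key, String val) {
        original.computeIfAbsent(key, k -> new ArrayList<>()).add(val);
    }

    /** Assumed behavior: replaces the original values under key with a single value. */
    void set(String key, String val) {
        List<String> vals = new ArrayList<>();
        vals.add(val);
        original.put(key, vals);
    }

    /** Records an override in the second map; the original values stay untouched. */
    void setFinal(String key, String val) {
        List<String> vals = new ArrayList<>();
        vals.add(val);
        actual.put(key, vals);
    }

    /** Returns the values as first recorded, or null if the key is unknown. */
    List<String> getOriginal(String key) {
        return original.get(key);
    }

    /** Returns the override if one exists, otherwise the original values, otherwise null. */
    List<String> getFinal(String key) {
        List<String> res = actual.get(key);
        if (res == null) res = original.get(key);
        return res;
    }
}
```

In the common case of a single value and no override, only the first map is ever touched, which is the low-overhead property the comment argues for.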
Re: Optimizing which links to fetch
Ken Krugler wrote:
> It seems that the default behavior of Nutch when sorting links to fetch is to use scoreByLinkCount. This then sets the fetch score for links on a page to be the same as the containing page's in-bound link score (or actually the log of same).

Please also see:

http://issues.apache.org/jira/browse/NUTCH-61

This is an extensible mechanism for altering the fetch schedule. Similarly, we need an extensible mechanism for computing page scores, which are used to prioritize the fetching of scheduled pages.

Note that the scoring mechanism has changed substantially in the development trunk from what is in the 0.7 release.

Doug
Re: Ideas for enhancements
Howie Wang wrote:
> 1. A String[] HitDetails.getValues(String field) method that returns an array of the values. The current one only returns a single string, and Lucene indexes can have multiple values per field.

That sounds useful. Please submit a patch against the trunk attached to a bug report.

> 2. In Link.java, put in a field (parentURL) for the URL of the page that contains the link. Right now it seems we just have the links themselves and we can't backtrack where they come from. Being able to backtrack through the links is handy for doing something like categorization. For example, you see that all the links are coming from a page about poodles, so you might categorize the linked page as a poodle page. It might also come in handy for doing something like a Google TrustRank scoring, where you penalize certain sites if they're a known link farm, or boost them if they're from some place respected like DMOZ.

This would certainly be useful functionality. The link db has changed substantially in the current trunk and there is no longer a class named Link. This has been replaced with Inlink and Outlink. Have a look at the trunk and see if what you need isn't already there.

> 3. Get sorting to work on multiple fields. Lucene already works on multiple fields so it shouldn't be difficult to get this working. Just change the places where it passes down String field so that it accepts an array. The sort fields could be read from the query string in order: search.jsp?sort=score&reverse=true&sort=date&reverse=false

This would also be useful. Please submit a patch against the trunk.

Thanks!

Doug
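As a rough illustration of item 3, the ordered sort fields could be read from the query string along these lines. This is a sketch only, not code from Nutch: the class SortParamsSketch and its SortSpec type are hypothetical, it assumes the parameters are separated by & and that each reverse flag applies to the sort field just before it, and it skips URL decoding. Turning each pair into the sort object the search layer expects is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for repeated sort=...&reverse=... query parameters. */
final class SortParamsSketch {

    /** One sort criterion: a field name and whether to reverse its order. */
    static final class SortSpec {
        final String field;
        final boolean reverse;

        SortSpec(String field, boolean reverse) {
            this.field = field;
            this.reverse = reverse;
        }
    }

    /** Parses a raw query string, keeping the sort fields in the order given. */
    static List<SortSpec> parse(String query) {
        List<SortSpec> specs = new ArrayList<>();
        String pendingField = null;  // a sort field still waiting for its reverse flag
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) continue;    // ignore malformed or empty pairs
            String name = pair.substring(0, eq);
            String value = pair.substring(eq + 1);
            if (name.equals("sort")) {
                // a second sort with no reverse in between defaults to ascending
                if (pendingField != null) specs.add(new SortSpec(pendingField, false));
                pendingField = value;
            } else if (name.equals("reverse") && pendingField != null) {
                specs.add(new SortSpec(pendingField, Boolean.parseBoolean(value)));
                pendingField = null;
            }
        }
        if (pendingField != null) specs.add(new SortSpec(pendingField, false));
        return specs;
    }
}
```

Keeping the result as an ordered list rather than a map preserves the priority of the fields, which is what a multi-field sort needs.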
Re: Searchable mailing lists on nutch.org?
Andy Liu wrote:
> We're getting a lot of repeat questions in the mailing lists these days. I think it's partly because people don't know of a way to search the archives. The Mail Archive provides this: http://www.mail-archive.com/index.php?hunt=nutch Whoever maintains the http://lucene.apache.org/nutch/mailing_lists.html page, maybe post the mail archive link?

Andy,

An Archives section on this page would indeed be useful. Please feel free to submit one as a patch to the source file:

src/site/src/documentation/content/xdocs/mailing_lists.xml

Thanks,

Doug
need volunteer to develop search for apache.org
Would someone volunteer to develop a Nutch-based site-search engine for all apache.org domains? We now have a Solaris zone to host this.

Thanks,

Doug
Re: need volunteer to develop search for apache.org
I'll be happy to do it.

--- Doug Cutting [EMAIL PROTECTED] wrote:
> Would someone volunteer to develop a Nutch-based site-search engine for all apache.org domains? We now have a Solaris zone to host this.
>
> Thanks,
>
> Doug
Re: need volunteer to develop search for apache.org
Hi Doug,

I would be willing to set it up if I can use OpenEdit for formatting results. We use Nutch for crawling sites and I have lots of Lucene experience. We have used OpenEdit on sites that get 200+ simultaneous searches.

http://www.openedit.org

Doug Cutting wrote:
> Would someone volunteer to develop a Nutch-based site-search engine for all apache.org domains? We now have a Solaris zone to host this.
>
> Thanks,
>
> Doug

--
Christopher Burkey
513-542-3401
[EMAIL PROTECTED]
http://www.openedit.org
[jira] Commented: (NUTCH-186) mapred-default.xml is over ridden by nutch-site.xml
[ http://issues.apache.org/jira/browse/NUTCH-186?page=comments#action_12364010 ]

Gal Nitzan commented on NUTCH-186:
----------------------------------

After reading the code, I think I figured it out... :)

The issue of the mapred-default.xml is totally misleading.

Actually, the mapred.map.tasks and mapred.reduce.tasks properties do not have any effect when placed in mapred-default.xml (unless JobConf needs them, which I didn't check), because this file is loaded only when JobConf is constructed. But the tasktracker looks for these properties in nutch-site and not in mapred-default. If these properties do not exist in nutch-site.xml with the correct values for your system, the values will be picked from nutch-default.xml.

Further, I am not sure that nutch-site.xml overriding everything should be the correct behavior. Most users know that nutch-site.xml overrides nutch-default, but I think we should leave them the option to override nutch-site, and it would be a good start toward breaking the configuration into parts (ndfs and mapred are going to be separated from nutch)...

Gal

mapred-default.xml is over ridden by nutch-site.xml
---------------------------------------------------

         Key: NUTCH-186
         URL: http://issues.apache.org/jira/browse/NUTCH-186
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev
 Environment: All
    Reporter: Gal Nitzan
    Priority: Minor
 Attachments: myBeautifulPatch.patch, myBeautifulPatch.patch

If mapred.map.tasks and mapred.reduce.tasks are defined in nutch-site.xml and also in mapred-default.xml, the definitions from nutch-site.xml are those that will take effect. So if a user mistakenly copies those entries into nutch-site.xml from nutch-default.xml, she will not understand what happens.

I would like to propose removing these settings completely from nutch-default.xml and putting them only in mapred-default.xml, where they belong. I will be happy to supply a patch for that if the proposition is accepted.
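The override behavior reported in this issue can be pictured as a stack of property sources in which the one consulted last wins. The sketch below is a simplified model, not Nutch's configuration code: the class LayeredConfSketch and its methods are hypothetical, and the order nutch-default.xml, then mapred-default.xml, then nutch-site.xml is an assumption chosen to match the outcome described in the issue (a value in nutch-site.xml taking effect over one in mapred-default.xml).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative layered configuration: resources added later override earlier ones. */
final class LayeredConfSketch {
    private final List<String> names = new ArrayList<>();
    private final List<Map<String, String>> layers = new ArrayList<>();

    /** Adds a named resource on top of the ones already present. */
    void addResource(String name, Map<String, String> props) {
        names.add(name);
        layers.add(new HashMap<>(props));
    }

    /** Returns the value from the most recently added resource that defines key, or null. */
    String get(String key) {
        int i = indexOfSource(key);
        return i < 0 ? null : layers.get(i).get(key);
    }

    /** Returns the name of the resource that supplies key, or null; handy when debugging overrides. */
    String source(String key) {
        int i = indexOfSource(key);
        return i < 0 ? null : names.get(i);
    }

    /** Searches from the top layer down so later resources take precedence. */
    private int indexOfSource(String key) {
        for (int i = layers.size() - 1; i >= 0; i--) {
            if (layers.get(i).containsKey(key)) return i;
        }
        return -1;
    }
}
```

Under this model, a stray copy of mapred.map.tasks in nutch-site.xml silently hides the value in mapred-default.xml, and a lookup like source("mapred.map.tasks") is the kind of check that would reveal which file actually supplied the setting.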