[jira] Created: (NUTCH-187) Run Nutch on Windows without Cygwin

2006-01-25 Thread Dominik Friedrich (JIRA)
Run Nutch on Windows without Cygwin
---

 Key: NUTCH-187
 URL: http://issues.apache.org/jira/browse/NUTCH-187
 Project: Nutch
Type: Improvement
  Components: ndfs  
Versions: 0.8-dev
 Environment: Windows
Reporter: Dominik Friedrich
Priority: Minor


Currently you cannot start Nutch datanodes on Windows outside of a cygwin 
environment because it relies on the df command to read the free disk space.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-187) Run Nutch on Windows without Cygwin

2006-01-25 Thread Dominik Friedrich (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-187?page=all ]

Dominik Friedrich updated NUTCH-187:


Attachment: DF.diff

This patch enables Nutch to read the free disk space on Windows systems. This 
version can read only the free space, not the partition size. On Windows, 
capacity is therefore set to twice the free disk space, used is set to the 
free disk space, percent used is set to 50, and mount is set to the partition, 
e.g. c:.

This patch has only been used in some experiments, just to be able to start a 
datanode on Windows from within the Eclipse IDE.
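
A minimal sketch of the placeholder values described above, in Java. This is 
not the code from DF.diff: the class name DiskUsageSketch and its methods are 
hypothetical, and File.getUsableSpace() is a Java 6+ call that was not 
available when the patch was written. It only illustrates how a df-style 
report could be filled in when nothing but the free space is known.

```java
import java.io.File;

/** Hypothetical stand-in for the fields a df-style probe reports. */
final class DiskUsageSketch {
    final String mount;
    final long capacity;
    final long used;
    final long available;
    final int percentUsed;

    private DiskUsageSketch(String mount, long capacity, long used,
                            long available, int percentUsed) {
        this.mount = mount;
        this.capacity = capacity;
        this.used = used;
        this.available = available;
        this.percentUsed = percentUsed;
    }

    /** Fills in the placeholder values from a free-space figure alone. */
    static DiskUsageSketch fromFreeSpace(String mount, long freeBytes) {
        // capacity = 2 x free, used = free, percent used = 50
        return new DiskUsageSketch(mount, 2 * freeBytes, freeBytes, freeBytes, 50);
    }

    /** Reads the free space of the partition holding dir without calling df. */
    static DiskUsageSketch probe(File dir) {
        File abs = dir.getAbsoluteFile();
        // Walk up to the filesystem root, e.g. "C:\" on Windows or "/" on Unix.
        File root = abs;
        while (root.getParentFile() != null) {
            root = root.getParentFile();
        }
        return fromFreeSpace(root.getPath(), abs.getUsableSpace());
    }
}
```

Calling DiskUsageSketch.probe(new File(".")) yields a report whose capacity is 
always twice the available space, which is why the percent-used figure is only 
a placeholder.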

 Run Nutch on Windows without Cygwin
 ---

  Key: NUTCH-187
  URL: http://issues.apache.org/jira/browse/NUTCH-187
  Project: Nutch
 Type: Improvement
   Components: ndfs
 Versions: 0.8-dev
  Environment: Windows
 Reporter: Dominik Friedrich
 Priority: Minor
  Attachments: DF.diff

 Currently you cannot start Nutch datanodes on Windows outside of a cygwin 
 environment because it relies on the df command to read the free disk space.




[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-25 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363942 ]

Andrzej Bialecki  commented on NUTCH-139:
-

Yes, this should work ok ... but it strikes me as unnecessarily complicated. 
After all, in most cases we will have single values and no overrides, so this 
solution complicates the most common cases...

At this point it's probably easier just to keep the original key, val[] in 
one Map, and potential overrides key, val1[] in another Map, and then provide 
a container/facade with appropriate methods to add/get/set whichever value is 
necessary.

E.g.:

public class MetaData {
  private HashMap original = new HashMap();
  private HashMap actual = new HashMap();

  public void add(String key, String val) {
    // same as in ContentProperties now, uses the original map
    ...
  }

  public void set(String key, String val) {
    // same as in ContentProperties now, uses the original map
    ...
  }

  public void setFinal(String key, String val) {
    // as above, but uses the actual map
  }

  // return the final value, if it's missing then return the original value
  public Object getFinal(String key) {
    Object res = actual.get(key);
    if (res == null) res = original.get(key);
    return res;
  }
  ...
}

This seems to satisfy all the requirements, and with minimal overhead. If this 
is ok with you, please prepare a patch, and we should commit it - there are 
many other changes waiting in the queue that depend on this patch being applied 
...

(BTW. I think it's conceptually the same as using the X-nutch to avoid name 
clashes, but from the point of view of correct OO programming it looks more 
kosher now... ;-) )
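
A runnable sketch of the two-map facade outlined in this comment might look 
like the following. The class name MetaDataSketch and the getOriginal method 
are hypothetical, and the behavior of add and set is an assumption (append 
versus replace), since the comment only says they work as in 
ContentProperties. It also uses generics and lambdas, which the 2006 code base 
did not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical facade: original values in one map, overrides in another. */
final class MetaDataSketch {
    private final Map<String, List<String>> original = new HashMap<>();
    private final Map<String, List<String>> actual = new HashMap<>();

    /** Assumed semantics: appends a value to the original values for key. */
    void add(String key, String val) {
        original.computeIfAbsent(key, k -> new ArrayList<>()).add(val);
    }

    /** Assumed semantics: replaces the original values with a single value. */
    void set(String key, String val) {
        List<String> vals = new ArrayList<>();
        vals.add(val);
        original.put(key, vals);
    }

    /** Records an override without touching the original values. */
    void setFinal(String key, String val) {
        List<String> vals = new ArrayList<>();
        vals.add(val);
        actual.put(key, vals);
    }

    /** Returns the original values, or an empty list if there are none. */
    List<String> getOriginal(String key) {
        List<String> res = original.get(key);
        return res == null ? new ArrayList<>() : res;
    }

    /** Returns the override if one exists, otherwise the original values. */
    List<String> getFinal(String key) {
        List<String> res = actual.get(key);
        if (res == null) res = original.get(key);
        return res == null ? new ArrayList<>() : res;
    }
}
```

In the common case of a single value and no override, only the original map is 
ever touched, which is the low-overhead property the comment asks for.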

 Standard metadata property names in the ParseData metadata
 --

  Key: NUTCH-139
  URL: http://issues.apache.org/jira/browse/NUTCH-139
  Project: Nutch
 Type: Improvement
   Components: fetcher
 Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
 although bug is independent of environment
 Reporter: Chris A. Mattmann
 Assignee: Chris A. Mattmann
 Priority: Minor
  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
 NUTCH-139.jc.review.patch.txt

 Currently, people are free to name their string-based properties anything 
 that they want, so that names such as Content-type, content-TyPe, and 
 CONTENT_TYPE all have the same meaning. Stefan G., I believe, proposed a 
 solution in which all property names would be converted to lower case, but 
 this really only fixes half the problem (identifying that CONTENT_TYPE, 
 conTeNT_TyPE, and all the other case permutations are really the same). What 
 if I named it Content Type, or ContentType?
  I propose that a way to correct this would be to create a standard set of 
 named Strings in the ParseData class that the protocol framework and the 
 parsing framework could use to identify common properties such as 
 Content-type, Creator, Language, etc.
  The properties would be defined at the top of the ParseData class, something 
 like:
  public class ParseData {
    ...
    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";

  }
 In this fashion, users could at least know the names of the standard 
 properties that they can obtain from the ParseData, for example by making 
 a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
 content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
 "text/xml"). Of course, this wouldn't preclude users from doing what they are 
 currently doing; it would just provide a standard method of obtaining some of 
 the more common, critical metadata without poring over the code base to 
 figure out what they are named.
 I'll contribute a patch near the end of this week, or the beginning of next 
 week, that addresses this issue.
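
 As a rough illustration of the point above, the sketch below shows why 
 lower-casing alone is only a partial fix and how shared constants sidestep 
 the problem. The class StandardNamesSketch and the method lowerCaseOnly are 
 hypothetical and are not part of any attached patch; a plain Map stands in 
 for the ParseData metadata.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical holder for standard property names. */
final class StandardNamesSketch {
    static final String CONTENT_TYPE = "content-type";
    static final String CREATOR = "creator";

    /** The lower-casing proposal: merges case variants and nothing else. */
    static String lowerCaseOnly(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Case variants collapse to the standard name...
        System.out.println(lowerCaseOnly("content-TyPe").equals(CONTENT_TYPE));
        // ...but spelling variants such as "ContentType" still do not match.
        System.out.println(lowerCaseOnly("ContentType").equals(CONTENT_TYPE));

        // With shared constants, writer and reader agree on the key by construction.
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_TYPE, "text/xml");
        System.out.println(metadata.get(CONTENT_TYPE));
    }
}
```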




Re: Optimizing which links to fetch

2006-01-25 Thread Doug Cutting

Ken Krugler wrote:
It seems that the default behavior of Nutch when sorting links to fetch 
is to use scoreByLinkCount. This then sets the fetch score for links on 
a page to be the same as the containing page's in-bound link score (or 
actually the log of same).


Please also see:

http://issues.apache.org/jira/browse/NUTCH-61

This is an extensible mechanism for altering the fetch schedule. 
Similarly, we need an extensible mechanism for computing page scores, 
which are used to prioritize the fetching of scheduled pages.  Note that 
the scoring mechanism has changed substantially in the development trunk 
from what is in the 0.7 release.


Doug


Re: Ideas for enhancements

2006-01-25 Thread Doug Cutting

Howie Wang wrote:

1. A String[] HitDetails.getValues(String field) method that
returns an array of the values. The current method only returns a
single string, and Lucene indexes can have multiple values
per field.


That sounds useful.  Please submit a patch against the trunk attached to 
a bug report.
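
One possible shape for such a method, as a sketch only: the class 
HitDetailsSketch below is hypothetical and assumes the details are held as 
parallel arrays of field names and values, which may not match how the trunk 
stores them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hit details backed by parallel field/value arrays. */
final class HitDetailsSketch {
    private final String[] fields;
    private final String[] values;

    HitDetailsSketch(String[] fields, String[] values) {
        if (fields.length != values.length) {
            throw new IllegalArgumentException("fields and values must have equal length");
        }
        this.fields = fields.clone();
        this.values = values.clone();
    }

    /** Returns the first value stored for field, or null if there is none. */
    String getValue(String field) {
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(field)) return values[i];
        }
        return null;
    }

    /** Returns every value stored for field, in stored order (possibly empty). */
    String[] getValues(String field) {
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(field)) matches.add(values[i]);
        }
        return matches.toArray(new String[0]);
    }
}
```

Keeping getValue alongside getValues leaves existing single-value callers 
unchanged.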



2. In Link.java, put in a field (parentURL) for the URL of the page that
contains the link. Right now it seems we just have the links themselves
and we can't backtrack where they come from. Being able to backtrack
through the links is handy for doing something like categorization. For
example, you see that all the links are coming from a page about poodles,
so you might categorize the linked page as a poodle page. It might also
come in handy for doing something like a Google TrustRank scoring, where
you penalize certain sites if they're a known link farm, or boost them if
they're from some place respected like DMOZ.


This would certainly be useful functionality.  The link db has changed 
substantially in the current trunk and there is no longer a class named 
Link.  This has been replaced with Inlink and Outlink.  Have a look at 
the trunk and see if what you need isn't already there.



3. Get sorting to work on multiple fields. Lucene already works on
multiple fields so it shouldn't be difficult to get this working. Just
change the places where it passes down String field so that it
accepts an array. The sort fields could be read from the query
string in order:

  search.jsp?sort=score&reverse=true&sort=date&reverse=false


This would also be useful.  Please submit a patch against the trunk.
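
As a sketch of the query-string side of this idea, the code below reads 
repeated sort and reverse parameters, in order, from a query string such as 
sort=score&reverse=true&sort=date&reverse=false. SortParamsSketch and SortKey 
are hypothetical names, pairing the i-th reverse with the i-th sort is an 
assumption, and URL decoding is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for repeated sort/reverse query parameters. */
final class SortParamsSketch {

    /** One sort field together with its direction. */
    static final class SortKey {
        final String field;
        final boolean reverse;

        SortKey(String field, boolean reverse) {
            this.field = field;
            this.reverse = reverse;
        }
    }

    /** Pairs the i-th sort with the i-th reverse; a missing reverse means false. */
    static List<SortKey> parse(String query) {
        List<String> fields = new ArrayList<>();
        List<Boolean> reverses = new ArrayList<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) continue; // skip empty or malformed pairs
            String name = pair.substring(0, eq);
            String value = pair.substring(eq + 1);
            if (name.equals("sort")) {
                fields.add(value);
            } else if (name.equals("reverse")) {
                reverses.add(Boolean.parseBoolean(value));
            }
        }
        List<SortKey> keys = new ArrayList<>();
        for (int i = 0; i < fields.size(); i++) {
            boolean reverse = i < reverses.size() && reverses.get(i);
            keys.add(new SortKey(fields.get(i), reverse));
        }
        return keys;
    }
}
```

The resulting ordered list could then be mapped onto whatever multi-field sort 
structure the search code passes down to Lucene.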

Thanks!

Doug


Re: Searchable mailing lists on nutch.org?

2006-01-25 Thread Doug Cutting

Andy Liu wrote:

We're getting a lot of repeat questions in the mailing lists these
days.  I think it's partly because people don't know of a way to
search the archives.  The Mail Archive provides this:

http://www.mail-archive.com/index.php?hunt=nutch

Whoever maintains the
http://lucene.apache.org/nutch/mailing_lists.html page, maybe post the
mail archive link?


Andy,

An Archives section on this page would indeed be useful.  Please feel 
free to submit one as a patch to the source file:


  src/site/src/documentation/content/xdocs/mailing_lists.xml

Thanks,

Doug


need volunteer to develop search for apache.org

2006-01-25 Thread Doug Cutting
Would someone volunteer to develop a Nutch-based site-search engine for 
all apache.org domains?  We now have a Solaris zone to host this.


Thanks,

Doug


Re: need volunteer to develop search for apache.org

2006-01-25 Thread Byron Miller
I'll be happy to do it.

--- Doug Cutting [EMAIL PROTECTED] wrote:

 Would someone volunteer to develop a Nutch-based
 site-search engine for 
 all apache.org domains?  We now have a Solaris zone
 to host this.
 
 Thanks,
 
 Doug
 



Re: need volunteer to develop search for apache.org

2006-01-25 Thread Christopher Burkey

Hi Doug,

   I would be willing to set it up if I can use OpenEdit for 
formatting results.


   We use Nutch for crawling sites and I have lots of Lucene 
experience. We have used OpenEdit on sites that get 200+ simultaneous 
searches.


http://www.openedit.org



Doug Cutting wrote:
Would someone volunteer to develop a Nutch-based site-search engine for 
all apache.org domains?  We now have a Solaris zone to host this.


Thanks,

Doug



--
Christopher Burkey
513-542-3401
[EMAIL PROTECTED]
http://www.openedit.org



[jira] Commented: (NUTCH-186) mapred-default.xml is over ridden by nutch-site.xml

2006-01-25 Thread Gal Nitzan (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-186?page=comments#action_12364010 ]

Gal Nitzan commented on NUTCH-186:
--

After reading the code, I think I have figured it out... :)

The issue with mapred-default.xml is totally misleading.

Actually, the mapred.map.tasks and mapred.reduce.tasks properties do not have 
any effect when placed in mapred-default.xml (unless JobConf needs them, which 
I didn't check) because this file is loaded only when JobConf is constructed. 
But the tasktracker looks for these properties in nutch-site.xml and not in 
mapred-default.xml.

If these properties do not exist in nutch-site.xml with the correct values 
for your system, the values will be picked up from nutch-default.xml.

Further, I am not sure that nutch-site.xml overriding everything is the 
correct behavior. Most users know that nutch-site.xml overrides 
nutch-default.xml, but I think we should leave them the option to override 
nutch-site.xml, and it will be a good start toward breaking the configuration 
into parts (ndfs and mapred are going to be separated from nutch)...

Gal
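
The lookup behavior described in this comment can be modeled with layered 
maps in which later layers win. This is only a sketch of that description: 
ConfigLayersSketch is a hypothetical class, the layer order is an assumption 
drawn from this thread rather than from the Nutch source, and the property 
values are made up.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of layered configuration files. */
final class ConfigLayersSketch {

    /** Merges the layers in order, so a later layer overrides an earlier one. */
    static Map<String, String> resolve(List<Map<String, String>> layers) {
        Map<String, String> merged = new HashMap<>();
        for (Map<String, String> layer : layers) {
            merged.putAll(layer);
        }
        return merged;
    }
}
```

Under these assumptions, a tasktracker that resolves only nutch-default.xml 
and then nutch-site.xml never sees a value set in mapred-default.xml, while a 
JobConf that resolves nutch-default.xml, mapred-default.xml, and then 
nutch-site.xml sees the mapred-default.xml value only when nutch-site.xml does 
not set the same property.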

 mapred-default.xml is over ridden by nutch-site.xml
 ---

  Key: NUTCH-186
  URL: http://issues.apache.org/jira/browse/NUTCH-186
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
  Environment: All
 Reporter: Gal Nitzan
 Priority: Minor
  Attachments: myBeautifulPatch.patch, myBeautifulPatch.patch

 If mapred.map.tasks and mapred.reduce.tasks are defined in nutch-site.xml and 
 also in mapred-default.xml, the definitions from nutch-site.xml are the ones 
 that take effect.
 So if a user mistakenly copies those entries into nutch-site.xml from 
 nutch-default.xml, she will not understand what happens.
 I would like to propose removing these settings completely from 
 nutch-default.xml and putting them only in mapred-default.xml, where they 
 belong.
 I will be happy to supply a patch for that if the proposal is accepted.
