How is the number of required maps set?

2006-01-09 Thread Gal Nitzan

I am trying to figure out how the required number of maps is
set/calculated by Nutch.

I have 3 task trackers.

I added one more.

When I run fetch, only the initial three are fetching.

I added the task tracker before calling generate (if that has any
meaning).

Thanks,

G.





Re: How is the number of required maps set? Oops, wrong list

2006-01-09 Thread Gal Nitzan
On Mon, 2006-01-09 at 12:07 +0200, Gal Nitzan wrote:
 I am trying to figure out how the required number of maps is
 set/calculated by Nutch.
 
 I have 3 task trackers.
 
 I added one more.
 
 When I run fetch, only the initial three are fetching.
 
 I added the task tracker before calling generate (if that has any
 meaning).
 
 Thanks,
 
 G.
 
 
 
 




why the index is not in the segment anymore

2006-01-09 Thread Stefan Groschupf

Hi Doug,
in Nutch 0.8 the index is not in the segment folder any more.
What was the reason for that?  In the context of a web GUI it might be
better to have the index in the segment folder as well, since the
segment folder would then be the single item whose life-cycle has to be
managed.

Thanks for an explanation.

Stefan 


Re: test suite fails?

2006-01-09 Thread Piotr Kosiorowski
It fails on my machine on the parse-ext tests. I am not sure what is
causing it yet, and I am afraid I do not have time to investigate it
today - maybe in a few days. I did a small change to make it compile a
few days ago, but all tests passed before I committed it.

Regards
Piotr
Stefan Groschupf wrote:

Hi,

is anyone able to run the test suite without any problems?

Stefan

---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net







Re: test suite fails?

2006-01-09 Thread Jérôme Charron
I have the same problem, and I don't understand what is happening.
In fact, the CommandRunner returns a -1 exit code, but there is nothing
in the error output, and the expected string is in the standard output
("nutch rocks nutch rocks nutch rocks").
All seems to be OK except the exit code.

Jérôme

On 1/9/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote:

 It fails on my machine on the parse-ext tests. I am not sure what is
 causing it yet, and I am afraid I do not have time to investigate it
 today - maybe in a few days. I did a small change to make it compile a
 few days ago, but all tests passed before I committed it.
 Regards
 Piotr
 Stefan Groschupf wrote:
  Hi,
 
  is anyone able to run the test suite without any problems?
 
  Stefan
 
  ---
  company:http://www.media-style.com
  forum:http://www.text-mining.org
  blog:http://www.find23.net
 
 
 




--
http://motrech.free.fr/
http://www.frutch.org/


Crawl and parse exceptions

2006-01-09 Thread Matt Zytaruk
I've been having a lot of trouble lately with the newest Nutch source.
Both my crawls and parses are failing. (For our fetches we crawl and
parse at the same time with just the default Nutch config, just to get
the outlinks and update the crawldb; later, after the fetch, we do
another parse with custom parse filters.) The exceptions are below.


This exception happens sometimes when crawling (on the linkdb part of 
the crawl):


Exception in thread "main" java.io.IOException: Not a file:
/user/nutch/segments/20060107130328/parse_data/part-0/data

   at org.apache.nutch.ipc.Client.call(Client.java:294)
   at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
   at $Proxy1.submitJob(Unknown Source)
   at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
   at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
   at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:131)

We also got this for a while (it seems the mapred/system dir is never
being created for some reason):

java.io.IOException: Cannot open filename
/nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml

  at org.apache.nutch.ipc.Client.call(Client.java:294)
  at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
  at $Proxy1.open(Unknown Source)
  at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.openInfo(NDFSClient.java:256)
  at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.<init>(NDFSClient.java:242)
  at org.apache.nutch.ndfs.NDFSClient.open(NDFSClient.java:79)
  at org.apache.nutch.fs.NDFSFileSystem.openRaw(NDFSFileSystem.java:66)
  at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
  at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:221)
  at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
  at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:149)
  at org.apache.nutch.fs.NDFSFileSystem.copyToLocalFile(NDFSFileSystem.java:221)
  at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:346)
  at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:332)
  at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:232)
  at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:286)
  at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:651)

Then, on parsing, we got this within 10 seconds of the parse starting:

060109 093759 task_m_ltgpnj  Error running child
060109 093759 task_m_ltgpnj java.lang.RuntimeException: java.io.EOFException
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:57)
060109 093759 task_m_ltgpnj   at org.apache.nutch.protocol.Content.getContent(Content.java:124)
060109 093759 task_m_ltgpnj   at org.apache.nutch.crawl.MD5Signature.calculate(MD5Signature.java:33)
060109 093759 task_m_ltgpnj   at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:62)
060109 093759 task_m_ltgpnj   at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:52)
060109 093759 task_m_ltgpnj   at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060109 093759 task_m_ltgpnj   at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060109 093759 task_m_ltgpnj Caused by: java.io.EOFException
060109 093759 task_m_ltgpnj   at java.io.DataInputStream.readFully(DataInputStream.java:268)
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.UTF8.readChars(UTF8.java:212)
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.UTF8.readString(UTF8.java:204)
060109 093759 task_m_ltgpnj   at org.apache.nutch.protocol.ContentProperties.readFields(ContentProperties.java:169)
060109 093759 task_m_ltgpnj   at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:81)
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:54)
060109 093759 task_m_ltgpnj ... 6 more
060109 093802 task_m_txrnu3 done; removing files.
060109 093802 Server connection on port 50050 from 127.0.0.2: exiting
060109 093805 task_m_ltgpnj done; removing files.
060109 093805 Lost connection to JobTracker [crawler-d-03.internal.wavefire.ca/127.0.0.2:8050]. ex=java.lang.NullPointerException  Retrying...


On a different segment we got this instead:

Exception in thread "main" java.io.IOException: No input directories
specified in: NutchConf: nutch-default.xml , mapred-default.xml ,
/nutch-data/nutch/tmp/nutch/mapred/local/jobTracker/job_tn7u97.xml ,
nutch-site.xml

   at org.apache.nutch.ipc.Client.call(Client.java:294)
   at 

Re: Crawl and parse exceptions

2006-01-09 Thread Matt Zytaruk
Just a follow-up: I figured out the third exception below (Exception in
thread "main" java.io.IOException: No input directories specified in:
NutchConf...), so no worries there, but the others are still issues.


Matt Zytaruk wrote:

[original message quoted in full above]
Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/org/ plugin/

2006-01-09 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

--- lucene/nutch/trunk/src/plugin/build.xml (original)
+++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan  8 16:13:42 2006
@@ -6,13 +6,14 @@
   <!-- Build & deploy all the plugin jars. -->
   <!-- ====================================================== -->
   <target name="deploy">
-    <!--<ant dir="analysis-de" target="deploy"/>-->
-    <!--<ant dir="analysis-fr" target="deploy"/>-->
+    <ant dir="analysis-de" target="deploy"/>
+    <ant dir="analysis-fr" target="deploy"/>


Was this change intentional?  It looks unrelated.

Otherwise, this looks great!

Doug


wiki: command line options & classpaths

2006-01-09 Thread Jerry Russell

I noticed that the command line options in the wiki have net.nutch.*
instead of the newer org.apache.*. Just wanted to confirm that it's OK
to change them all. (I'm new to this group, so I wanted to confirm
first.)

Thanks,
Jerry


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-09 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362242 ] 

Doug Cutting commented on NUTCH-139:


We can just use different names, rather than two metaData objects:
"X-nutch"-prefixed names for derived or other values that are usually
protocol-independent, and (possibly prefixed) names for protocol- or
format-specific values.  The latter are sometimes multi-valued, but the
former are probably not.

The relevance here is that this patch currently uses un-prefixed
protocol-specific names to store derived, protocol-independent data,
which is confusing.  This patch is meant to standardize property names,
so let's standardize them just once.  Protocol- and format-specific
names should be defined in protocol- and format-specific files.  For
example, if we want to define constants for HTTP headers, they should
probably go in the (new) lib-http plugin.

We also need to change ContentProperties to distinguish add(String,String) from 
set(String,String), and we may need to change some protocols to call 
add(String,String) instead of set(String,String).  I think that it makes sense 
to bundle that change in this patch too.
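
For what it's worth, a hedged sketch of the add()/set() distinction
being proposed (a standalone illustration, not the actual
ContentProperties code):

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class MultiValuedProperties {
    private final Map<String, List<String>> props =
        new HashMap<String, List<String>>();

    // set() replaces any existing values: suited to single-valued,
    // derived properties.
    public void set(String name, String value) {
      List<String> vals = new ArrayList<String>();
      vals.add(value);
      props.put(name, vals);
    }

    // add() appends, preserving earlier values: suited to protocol
    // headers that may legitimately repeat.
    public void add(String name, String value) {
      List<String> vals = props.get(name);
      if (vals == null) {
        vals = new ArrayList<String>();
        props.put(name, vals);
      }
      vals.add(value);
    }
  }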

 Standard metadata property names in the ParseData metadata
 --

  Key: NUTCH-139
  URL: http://issues.apache.org/jira/browse/NUTCH-139
  Project: Nutch
 Type: Improvement
   Components: fetcher
 Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
 although bug is independent of environment
 Reporter: Chris A. Mattmann
 Assignee: Chris A. Mattmann
 Priority: Minor
  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
 NUTCH-139.jc.review.patch.txt

 Currently, people are free to name their string-based properties anything
 they want, such as "Content-type", "content-TyPe", and "CONTENT_TYPE" all
 having the same meaning.  Stefan G., I believe, proposed a solution in
 which all property names would be converted to lower case, but in essence
 this only fixes half of the problem (identifying that "CONTENT_TYPE" and
 "conTeNT_TyPE" and all their permutations are really the same).  What if
 I named it "Content Type", or "ContentType"?
  I propose that a way to correct this would be to create a standard set of
 named Strings in the ParseData class that the protocol framework and the
 parsing framework could use to identify common properties such as
 Content-type, Creator, Language, etc.
  The properties would be defined at the top of the ParseData class,
 something like:

  public class ParseData {
    ...
    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";
    ...
  }

  In this fashion, users could at least know the names of the standard
 properties they can obtain from the ParseData, for example by calling
 ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content
 type, or ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml")
 to set it.  Of course, this wouldn't preclude users from doing what they
 are currently doing; it would just provide a standard method of obtaining
 some of the more common, critical metadata without poring over the code
 base to figure out what it is named.
  I'll contribute a patch near the end of this week, or the beginning of
 next week, that addresses this issue.




Re: Reporter interface

2006-01-09 Thread Doug Cutting

Andrew McNabb wrote:

I'm looking at the Reporter interface, and I would like to verify my
understanding of what it is.  It appears to me that Reporter.setStatus()
is called periodically during an operation to give a human-readable
description of how far the progress is so far.  Is that correct?


Yes.  These strings appear in the web interface and in logs.

Reporter also has another function: to tell the MapReduce system that
things are not hung, that progress is still being made.  If an
individual operation (map, reduce, close) may take longer than the task
timeout (10 minutes by default?), then this should be called, or the
task will be assumed to be hung and will be killed.
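
A hedged sketch of that keep-alive pattern (written against the
0.8-era mapred interfaces; exact signatures may differ, and
numChunks/processChunk are illustrative):

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    for (int i = 0; i < numChunks; i++) {
      processChunk(i);  // hypothetical long-running work
      // Human-readable status for the web UI, and proof of life
      // to the task tracker so the task is not killed as hung:
      reporter.setStatus("processed chunk " + i + " of " + numChunks);
    }
  }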



If so, is there a reason that RecordWriter.close() requires a Reporter
(are there situations where it takes a long time)? 


Some reduce processes (e.g., Lucene indexing) write to temporary local 
files and then copy their final output to NDFS on close.



Also, is there a
standard NullReporter class for situations where updating is not
needed?


A NullReporter would be easy to define, but I'm not sure why you ask,
since Reporters are not usually created by user code but rather by the
MapReduce system.


Doug


Re: why the index is not in the segment anymore

2006-01-09 Thread Doug Cutting

Stefan Groschupf wrote:

in Nutch 0.8 the index is not in the segment folder any more.
What was the reason for that?  In the context of a web GUI it might be
better to have the index in the segment folder as well, since the
segment folder would then be the single item whose life-cycle has to be
managed.


The current indexer command line is optimized for one-shot, batch
crawling.  In this case it is best to index everything at the end, in
order to have the most up-to-date page scores from the crawl db.  So it
indexes everything in a single MapReduce pass, which produces a set of
indexes that are not aligned with segments.


It would be easy to modify Indexer.index() to index just a segment at a
time, but each pass would need to process the entire crawl and link dbs
as inputs, and would thus be less efficient than indexing all segments
at once.


So both modes may be useful.  We could add an Indexer.index() method 
that takes just a single segment name and indexes it, storing the index 
in the segment, and modify Indexer.main() to be able to invoke it.  Then 
we'd also need to modify NutchBean to find these indexes, and 
IndexMerger, etc.


Doug


Re: wiki: command line options & classpaths

2006-01-09 Thread ogjunk-nutch
Yes, everything is in org.apache now, I believe.  Thanks for helping out.

Otis

- Original Message 
From: Jerry Russell [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Mon 09 Jan 2006 02:20:02 PM EST
Subject: wiki: command line options & classpaths


I noticed that the command line options in the wiki have net.nutch.*
instead of the newer org.apache.*. Just wanted to confirm that it's OK
to change them all. (I'm new to this group, so I wanted to confirm
first.)

Thanks,
Jerry





Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/org/ plugin/

2006-01-09 Thread Jérôme Charron
... in fact, not really... it is really unrelated!
I will remove it immediately.
Thanks

On 1/9/06, Doug Cutting [EMAIL PROTECTED] wrote:

 [EMAIL PROTECTED] wrote:
 --- lucene/nutch/trunk/src/plugin/build.xml (original)
 +++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan  8 16:13:42 2006
 @@ -6,13 +6,14 @@
    <!-- Build & deploy all the plugin jars. -->
    <!-- ====================================================== -->
    <target name="deploy">
 -    <!--<ant dir="analysis-de" target="deploy"/>-->
 -    <!--<ant dir="analysis-fr" target="deploy"/>-->
 +    <ant dir="analysis-de" target="deploy"/>
 +    <ant dir="analysis-fr" target="deploy"/>

 Was this change intentional?  It looks unrelated.

 Otherwise, this looks great!

 Doug




--
http://motrech.free.fr/
http://www.frutch.org/


[jira] Created: (NUTCH-168) setting http.content.limit to -1 seems to break text parsing on some files

2006-01-09 Thread Jerry Russell (JIRA)
setting http.content.limit to -1 seems to break text parsing on some files
--

 Key: NUTCH-168
 URL: http://issues.apache.org/jira/browse/NUTCH-168
 Project: Nutch
Type: Bug
  Components: fetcher  
Versions: 0.7
 Environment: Windows 2000
java version 1.4.2_05
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
Reporter: Jerry Russell


Setting http.content.limit to -1 (which is supposed to mean "no limit")
causes some pages not to be indexed. I have seen this with some PDFs and
with this one URL in particular. The steps to reproduce are below:

Reproduce:

  1) install fresh nutch-0.7
  2) configure urlfilters to allow any URL
  3) create urllist with only the following URL: 
http://www.circuitsonline.net/circuits/view/71
  4) perform a crawl with a depth of 1
  5) do segread and see that the content is there
  6) change the http.content.limit to -1 in nutch-default.xml 
  7) repeat the crawl to a new directory 
  8) do segread and see that the content is not there

contact [EMAIL PROTECTED] for more information.
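
For context, a hedged sketch of how such a byte limit would typically
be applied to fetched content (NutchConf accessors assumed from the
0.7-era API; the truncation helper is illustrative, not the actual
fetcher code):

  import org.apache.nutch.util.NutchConf;

  public class ContentLimit {
    // Caps fetched content at http.content.limit bytes; a negative
    // value (such as -1) is documented to mean "no limit".
    public static byte[] truncate(byte[] content) {
      int limit = NutchConf.get().getInt("http.content.limit", 65536);
      if (limit < 0 || content.length <= limit) {
        return content;
      }
      byte[] capped = new byte[limit];
      System.arraycopy(content, 0, capped, 0, limit);
      return capped;
    }
  }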





Re: Reporter interface

2006-01-09 Thread Doug Cutting

Andrew McNabb wrote:

One of the great things about open source is that projects can be used
for unintended purposes.  In fact, Nutch works well for parallel
computing in general, not just for web indexing.  Apparently Google has
thousands of projects that use MapReduce.


The plan is to move NDFS and MapReduce from Nutch to a new Lucene 
sub-project, probably sometime in the next few months.



I'm using Nutch right now (and I love it), but I currently have very
little interest in web indexing.  I have a project with a custom Mapper
and Reducer, and I needed to be able to read in the data from a
SequenceFile, which led me to the issue I emailed about.

I'd send you a patch with a NullReporter, but it's only four or five
lines. :)


I'm still not clear why one might need a NullReporter.

Doug


HTMLMetaProcessor a bug?

2006-01-09 Thread Gal Nitzan
Hi,

I was going over the code and noticed the following in class
org.apache.nutch.parse.html.HTMLMetaProcessor, in the method
getMetaTagsHelper: the code below would fail if the meta tag attributes
are in upper case.

Node nameNode = attrs.getNamedItem("name");
Node equivNode = attrs.getNamedItem("http-equiv");
Node contentNode = attrs.getNamedItem("content");
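
A hedged sketch of a case-insensitive lookup that would sidestep this
(the helper name is illustrative, not from the Nutch source; only
standard org.w3c.dom calls are used):

  import org.w3c.dom.NamedNodeMap;
  import org.w3c.dom.Node;

  // Returns the attribute whose name matches ignoring case, or null.
  static Node getNamedItemIgnoreCase(NamedNodeMap attrs, String name) {
    for (int i = 0; i < attrs.getLength(); i++) {
      Node attr = attrs.item(i);
      if (attr.getNodeName().equalsIgnoreCase(name)) {
        return attr;
      }
    }
    return null;
  }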


G.




Re: Reporter interface

2006-01-09 Thread Andrew McNabb
On Mon, Jan 09, 2006 at 03:28:45PM -0800, Doug Cutting wrote:
 
 I'm still not clear why one might need a NullReporter.

To be clearer, I should be a little more specific.  I had to read in
from a SequenceFile to interpret the results of a string of MapReduce
stages.  Here's a simplified snippet.  In this case I made a Reporter
called nullreporter that just does nothing.

SequenceFileInputFormat inputformat = new SequenceFileInputFormat();
RecordReader in = inputformat.getRecordReader(fshandle, split[i], logjob, nullreporter);

I don't like having to specify a Reporter to getRecordReader().
Actually, as I've thought more about it, it's probably a bad idea to
make a NullReporter class (although that might be better than nothing).
Maybe a better solution would be simply to allow null to be passed in,
and to check that the reporter isn't null before calling setStatus().
Is that a good idea?

-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868




[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-09 Thread Paul Baclace (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362272 ] 

Paul Baclace commented on NUTCH-153:


 NUTCH-160?

There is slowness and then there is continental drift.  The quantifier
limits should be used with any regex package unless evaluating the
quantifier itself is a significant cost during match().

The general solution is non-fatal per-file time limits on parsers, at least 
when regular expressions (OutlinkExtractor) are used.  That is, spawn a daemon 
thread as an alarm to interrupt() the thread doing match().  

I could make a match() timeout patch, but I have also seen a case where
TagSoup spent a huge amount of time parsing files of type
text/vnd.viewcvs-markup; I don't know what causes the problem, but this
MIME type must be high in tortuosity, since Chandler's mime-torture
tests include many examples.  Thus, a general solution of non-fatal
per-file time limits on parsing files would be better placed to take
care of present and future problems of this type.
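
A hedged sketch of that alarm-thread idea (illustrative, not the actual
patch; java.util.Timer runs the alarm on a daemon thread here):

  import java.util.Timer;
  import java.util.TimerTask;

  public class ParseWatchdog {
    // Interrupts the worker thread if it is still running after timeoutMs.
    public static Timer arm(final Thread worker, long timeoutMs) {
      Timer timer = new Timer(true);  // true = run as a daemon thread
      timer.schedule(new TimerTask() {
        public void run() { worker.interrupt(); }
      }, timeoutMs);
      return timer;
    }
  }

  // Usage around a potentially slow match():
  //   Timer alarm = ParseWatchdog.arm(Thread.currentThread(), 60000L);
  //   try { doMatch(); } finally { alarm.cancel(); }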



 TextParser is only supposed to parse plain text, but if given postscript, it 
 can take hours and then fail
 -

  Key: NUTCH-153
  URL: http://issues.apache.org/jira/browse/NUTCH-153
  Project: Nutch
 Type: Bug
   Components: fetcher
 Versions: 0.8-dev
  Environment: all
 Reporter: Paul Baclace
  Attachments: TextParser.java.patch

 If TextParser is given PostScript, it can take hours and then fail.  This
 can be avoided with careful configuration, but if the server MIME type is
 wrong and the basename of the URL has no file extension, then this parser
 will take a long time and fail every time.
 Analysis: the real problem is OutlinkExtractor.java, as reported in bug
 NUTCH-150, but it cannot be entirely addressed with that patch, since the
 first call to the regex match() can take a long time despite quantifier
 limits.
 Suggested fix: reject files with "%!PS-Adobe" in the first 40 characters
 of the file.
 Actual experience has shown that for safety and fail-safe reasons it is
 worth protecting against GIGO directly in TextParser for this case, even
 though the suggested fix is not a general solution.  (A general solution
 would be a timeout on match().)




Re: Reporter interface

2006-01-09 Thread Doug Cutting

Andrew McNabb wrote:

SequenceFileInputFormat inputformat = new SequenceFileInputFormat();
RecordReader in = inputformat.getRecordReader(fshandle, split[i], logjob, nullreporter);


To read sequence files directly outside of MapReduce, just use
SequenceFile directly, e.g., something like:

MyKey key = new MyKey();
MyValue value = new MyValue();

SequenceFile.Reader reader =
  new SequenceFile.Reader(NutchFileSystem.get("local"), file);

while (reader.next(key, value)) {
  ... process key/value pair ...
}

Wouldn't that be simpler?

Doug


[jira] Commented: (NUTCH-162) country code jp is used instead of language code ja for Japanese

2006-01-09 Thread Paul Baclace (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-162?page=comments#action_12362274 ] 

Paul Baclace commented on NUTCH-162:


The best practice for identifying a localization is to use the ISO
language and country codes, in the form of a lowercase language code
followed by an uppercase country code.  This makes it possible to use
the specific idioms of particular countries.  English has over a dozen
variants; a few examples are:

  en_AU   English (Australia)
  en_IE   English (Ireland)
  en_JM   English (Jamaica)
  en_US   English (United States)

Inexplicably, different codes were used for the Japanese language ("ja")
and the country Japan ("JP"); the locale is ja_JP.  Meanwhile, Javanese
in Java is jw_JA.

The web GUI should obtain the user's preferred language and country
combination from the HTTP request headers and use the nearest matching
Locale:

  http://java.sun.com/docs/books/tutorial/i18n/locale/create.html

This is preferable to having the user pick the language and/or country
from a list, since the user might not be able to read the labels.
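
A hedged sketch of that header-driven selection (standard Servlet API;
the surrounding method is illustrative):

  import java.util.Locale;
  import javax.servlet.http.HttpServletRequest;

  // getLocale() applies the Accept-Language header for us, falling
  // back to the container's default Locale when the header is absent.
  static Locale pickLocale(HttpServletRequest request) {
    return request.getLocale();
  }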



 country code jp is used instead of language code ja for Japanese
 

  Key: NUTCH-162
  URL: http://issues.apache.org/jira/browse/NUTCH-162
  Project: Nutch
 Type: Bug
   Components: web gui
 Versions: 0.7.1
  Environment: n/a
 Reporter: KuroSaka TeruHiko
 Priority: Trivial


 In the locale-switching link for Japanese, "jp" is used as the language
 code, but it is an ISO country code.  The language code "ja" should be
 used.
 By the way, I don't think many users are familiar with the ISO language
 codes.  A Canadian user may click on "ca" not knowing that "ca" stands
 for Catalan, not Canadian English or French.  Rather than listing the
 language codes, listing the language names in the respective languages
 may be better.  (I say "may be" because the browser could show some
 language names as corrupted text if the current font does not support
 that language --- this is a difficult problem.)
