What/how num of required maps is set?
I am trying to figure out how the required number of maps is set/calculated by Nutch. I have 3 task trackers and added one more, but when I run fetch only the initial three are fetching. I added the task tracker before calling generate (if that has any meaning). Thanks, G.
Re: What/how num of required maps is set? Oops, wrong list.
On Mon, 2006-01-09 at 12:07 +0200, Gal Nitzan wrote:
> I am trying to figure out how the required number of maps is set/calculated by Nutch. [...]
why index not in segment anymore
Hi Doug, in Nutch 0.8 the index is no longer in the segment folder. What was the reason for that? In the context of a web GUI it might be better to also have the index in the segment folder, since the segment folder would then be the single item whose life-cycle needs managing. Thanks for an explanation. Stefan
Re: test suite fails?
It fails on my machine on the parse-ext tests. I am not sure what is causing it yet, and I am afraid I do not have time to investigate it today - maybe in a few days. I did make a small change to get it to compile a few days ago, but all tests passed before I committed it. Regards, Piotr

Stefan Groschupf wrote:
> Hi, is anyone able to run the test suite without any problems? Stefan
--- company: http://www.media-style.com forum: http://www.text-mining.org blog: http://www.find23.net
Re: test suite fails?
I have the same problem, and I don't understand what is happening. In fact, the CommandRunner returns a -1 exit code, but there is nothing in the error output and the expected string is in the standard output (nutch rocks nutch rocks nutch rocks). Everything seems to be OK except the exit code. Jérôme

On 1/9/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote:
> It fails on my machine on the parse-ext tests. I am not sure what is causing it yet [...]

-- http://motrech.free.fr/ http://www.frutch.org/
Crawl and parse exceptions
I've been having a lot of trouble lately with the newest Nutch source. Both my crawls and parses are failing. (For our fetches we crawl and parse at the same time with just the default Nutch config, just to get the outlinks and update the crawldb; later, after the fetch, we do another parse with custom parse filters.) Here are the exceptions. This one happens sometimes when crawling (on the linkdb part of the crawl):

Exception in thread "main" java.io.IOException: Not a file: /user/nutch/segments/20060107130328/parse_data/part-0/data
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:131)

We also got this for a while (it seems the mapred/system dir is never being created for some reason):

java.io.IOException: Cannot open filename /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.open(Unknown Source)
        at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.openInfo(NDFSClient.java:256)
        at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.<init>(NDFSClient.java:242)
        at org.apache.nutch.ndfs.NDFSClient.open(NDFSClient.java:79)
        at org.apache.nutch.fs.NDFSFileSystem.openRaw(NDFSFileSystem.java:66)
        at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
        at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:221)
        at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
        at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:149)
        at org.apache.nutch.fs.NDFSFileSystem.copyToLocalFile(NDFSFileSystem.java:221)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:346)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:332)
        at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:232)
        at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:286)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:651)

Then, on parsing, we got this within 10 seconds of the parse starting:

060109 093759 task_m_ltgpnj Error running child
060109 093759 task_m_ltgpnj java.lang.RuntimeException: java.io.EOFException
060109 093759 task_m_ltgpnj  at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:57)
060109 093759 task_m_ltgpnj  at org.apache.nutch.protocol.Content.getContent(Content.java:124)
060109 093759 task_m_ltgpnj  at org.apache.nutch.crawl.MD5Signature.calculate(MD5Signature.java:33)
060109 093759 task_m_ltgpnj  at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:62)
060109 093759 task_m_ltgpnj  at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:52)
060109 093759 task_m_ltgpnj  at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060109 093759 task_m_ltgpnj  at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060109 093759 task_m_ltgpnj Caused by: java.io.EOFException
060109 093759 task_m_ltgpnj  at java.io.DataInputStream.readFully(DataInputStream.java:268)
060109 093759 task_m_ltgpnj  at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060109 093759 task_m_ltgpnj  at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060109 093759 task_m_ltgpnj  at org.apache.nutch.io.UTF8.readChars(UTF8.java:212)
060109 093759 task_m_ltgpnj  at org.apache.nutch.io.UTF8.readString(UTF8.java:204)
060109 093759 task_m_ltgpnj  at org.apache.nutch.protocol.ContentProperties.readFields(ContentProperties.java:169)
060109 093759 task_m_ltgpnj  at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:81)
060109 093759 task_m_ltgpnj  at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:54)
060109 093759 task_m_ltgpnj  ... 6 more
060109 093802 task_m_txrnu3 done; removing files.
060109 093802 Server connection on port 50050 from 127.0.0.2: exiting
060109 093805 task_m_ltgpnj done; removing files.
060109 093805 Lost connection to JobTracker [crawler-d-03.internal.wavefire.ca/127.0.0.2:8050]. ex=java.lang.NullPointerException Retrying...

On a different segment we got this instead:

Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml, mapred-default.xml, /nutch-data/nutch/tmp/nutch/mapred/local/jobTracker/job_tn7u97.xml, nutch-site.xml
        at org.apache.nutch.ipc.Client.call(Client.java:294) at
Re: Crawl and parse exceptions
Just a follow-up: I figured out the 3rd exception below ("Exception in thread main java.io.IOException: No input directories specified in: NutchConf..."), so no worries there, but the others are still issues.

Matt Zytaruk wrote:
> I've been having a lot of trouble lately with the newest Nutch source. Both my crawls and parses are failing. [...]
Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/org/ plugin/
[EMAIL PROTECTED] wrote:

--- lucene/nutch/trunk/src/plugin/build.xml (original)
+++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006
@@ -6,13 +6,14 @@
   <!-- Build & deploy all the plugin jars. -->
   <!-- == -->
   <target name="deploy">
-    <!--<ant dir="analysis-de" target="deploy"/>-->
-    <!--<ant dir="analysis-fr" target="deploy"/>-->
+    <ant dir="analysis-de" target="deploy"/>
+    <ant dir="analysis-fr" target="deploy"/>

Was this change intentional? It looks unrelated. Otherwise, this looks great! Doug
wiki:commandline options classpaths
I noticed that the command line options in the wiki have net.nutch.* instead of the newer org.apache.*. Just wanted to confirm that it's OK to change them all. (I'm new to this group, so I wanted to check first.) Thanks, Jerry
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362242 ] Doug Cutting commented on NUTCH-139: We can just use different names rather than two metadata objects: X-nutch-prefixed names for derived or other values that are usually protocol-independent, and (possibly prefixed) names for protocol- or format-specific values. The latter are sometimes multi-valued, but the former are probably not. The relevance to this patch is that it currently uses un-prefixed protocol-specific names to store derived, protocol-independent data, which is confusing. This patch is meant to standardize property names; let's standardize them once. Protocol- and format-specific names should be defined in protocol- and format-specific files. For example, if we want to define constants for HTTP headers, they should probably go in the (new) lib-http plugin. We also need to change ContentProperties to distinguish add(String,String) from set(String,String), and we may need to change some protocols to call add(String,String) instead of set(String,String). I think it makes sense to bundle that change into this patch too.

Standard metadata property names in the ParseData metadata
Key: NUTCH-139
URL: http://issues.apache.org/jira/browse/NUTCH-139
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM, although the bug is independent of environment
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

Currently, people are free to name their string-based properties anything they want, such as Content-type, content-TyPe, and CONTENT_TYPE all having the same meaning. Stefan G., I believe, proposed a solution in which all property names are converted to lower case, but in essence this only fixes half the problem (the case of identifying that CONTENT_TYPE, conTeNT_TyPE, and all the other permutations are really the same). What if I named it "Content Type", or "ContentType"? I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as content type, creator, language, etc. The properties would be defined at the top of the ParseData class, something like:

public class ParseData {
  ...
  public static final String CONTENT_TYPE = "content-type";
  public static final String CREATOR = "creator";
  ...
}

In this fashion, users would at least know the names of the standard properties they can obtain from the ParseData, for example by calling ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") to set it. Of course, this wouldn't preclude users from doing what they are currently doing; it would just provide a standard method of obtaining some of the more common, critical metadata without poring over the code base to figure out what it is named. I'll contribute a patch near the end of this week, or the beginning of next week, that addresses this issue.

-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
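[Editor's note: a minimal, self-contained sketch of the idea discussed in this thread - well-known constants plus lower-cased keys so that Content-Type, CONTENT_TYPE, etc. resolve to the same property. Class and method names here are illustrative, not the actual Nutch ParseData API.]

```java
import java.util.HashMap;
import java.util.Map;

public class ParseDataSketch {
    // Standard, well-known property names, defined once (per the proposal).
    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";
    public static final String LANGUAGE = "language";

    // Keys are normalized to lower case, so "Content-Type", "CONTENT-TYPE"
    // and "content-type" all refer to the same property.
    private final Map<String, String> metadata = new HashMap<String, String>();

    public void set(String name, String value) {
        metadata.put(name.toLowerCase(), value);
    }

    public String get(String name) {
        return metadata.get(name.toLowerCase());
    }

    public static void main(String[] args) {
        ParseDataSketch pd = new ParseDataSketch();
        pd.set("Content-Type", "text/xml");
        // Callers use the constant instead of guessing the spelling.
        System.out.println(pd.get(CONTENT_TYPE)); // prints text/xml
    }
}
```

Note this only handles case variations; spelling variants like "ContentType" (raised above) are exactly what the shared constants are meant to prevent.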
Re: Reporter interface
Andrew McNabb wrote:
> I'm looking at the Reporter interface, and I would like to verify my understanding of what it is. It appears to me that Reporter.setStatus() is called periodically during an operation to give a human-readable description of the progress so far. Is that correct?

Yes. These strings appear in the web interface and in logs. Reporter also has another function: to tell the MapReduce system that things are not hung, that progress is still being made. If an individual operation (map, reduce, close) may take longer than the task timeout (10 minutes by default?) then this should be called, or the task will be assumed to be hung and will be killed.

> If so, is there a reason that RecordWriter.close() requires a Reporter (are there situations where it takes a long time)?

Some reduce processes (e.g., Lucene indexing) write to temporary local files and then copy their final output to NDFS on close.

> Also, is there a standard NullReporter class for situations where updating is not needed?

A NullReporter would be easy to define, but I'm not sure why you ask, since Reporters are not usually created by user code but rather by the MapReduce system. Doug
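[Editor's note: to make the "easy to define" remark concrete, here is a sketch of a do-nothing Reporter. The real interface lives in org.apache.nutch.mapred; the minimal stand-in interface below is defined locally only so the example is self-contained.]

```java
// Minimal stand-in for Nutch's Reporter interface (illustrative only).
interface Reporter {
    void setStatus(String status);
}

// A do-nothing Reporter for callers that have no progress to report.
class NullReporter implements Reporter {
    public void setStatus(String status) {
        // Intentionally ignore status updates.
    }
}

public class ReporterDemo {
    // A long-running operation periodically reports progress; with a
    // NullReporter those calls become harmless no-ops.
    static int countRecords(int total, Reporter reporter) {
        int processed = 0;
        for (int i = 0; i < total; i++) {
            processed++;
            if (processed % 1000 == 0) {
                reporter.setStatus("processed " + processed + " of " + total);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(countRecords(5000, new NullReporter())); // prints 5000
    }
}
```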
Re: why index not in segment anymore
Stefan Groschupf wrote:
> in nutch 0.8 the index is not in the segment folder any more. What was the reason for that? [...]

The current indexer command line is optimized for one-shot, batch crawling. In this case it is best to index everything at the end, in order to have the most up-to-date page scores from the crawl db. So it indexes everything in a single MapReduce pass, which produces a set of indexes that are not aligned with segments. It would be easy to modify Indexer.index() to index just one segment at a time, but each run would need to process the entire crawl and link dbs as inputs, and would thus be less efficient than indexing all segments at once. So both modes may be useful. We could add an Indexer.index() method that takes just a single segment name and indexes it, storing the index in the segment, and modify Indexer.main() to be able to invoke it. Then we'd also need to modify NutchBean to find these indexes, and IndexMerger, etc. Doug
Re: wiki:commandline options classpaths
Yes, everything is in org.apache now, I believe. Thanks for helping out. Otis

----- Original Message -----
From: Jerry Russell [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Mon 09 Jan 2006 02:20:02 PM EST
Subject: wiki:commandline options classpaths

> I noticed that the command line options in the wiki have net.nutch.* instead of the newer org.apache.*. [...]
Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/org/ plugin/
... in fact, not really... completely unrelated! I'll remove it immediately. Thanks.

On 1/9/06, Doug Cutting [EMAIL PROTECTED] wrote:
> Was this change intentional? It looks unrelated. Otherwise, this looks great! Doug

-- http://motrech.free.fr/ http://www.frutch.org/
[jira] Created: (NUTCH-168) setting http.content.limit to -1 seems to break text parsing on some files
setting http.content.limit to -1 seems to break text parsing on some files
Key: NUTCH-168
URL: http://issues.apache.org/jira/browse/NUTCH-168
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.7
Environment: Windows 2000; java version 1.4.2_05, Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04), Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
Reporter: Jerry Russell

Setting http.content.limit to -1 (which is supposed to mean no limit) causes some pages not to index. I have seen this with some PDFs and with this one URL in particular. The steps to reproduce are below:

1) install a fresh nutch-0.7
2) configure the URL filters to allow any URL
3) create a urllist with only the following URL: http://www.circuitsonline.net/circuits/view/71
4) perform a crawl with a depth of 1
5) do a segread and see that the content is there
6) change http.content.limit to -1 in nutch-default.xml
7) repeat the crawl to a new directory
8) do a segread and see that the content is not there

contact [EMAIL PROTECTED] for more information.
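[Editor's note: for readers reproducing step 6, the property override looks like the fragment below. Overrides normally go in conf/nutch-site.xml rather than editing nutch-default.xml directly; the fragment is a sketch of the setting, not the file from the report.]

```xml
<property>
  <name>http.content.limit</name>
  <!-- -1 is supposed to mean: do not truncate fetched content -->
  <value>-1</value>
</property>
```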
Re: Reporter interface
Andrew McNabb wrote:
> One of the great things about open source is that projects can be used for unintended purposes.

In fact, Nutch works well for parallel computing in general, not just for web indexing. Apparently Google has thousands of projects that use MapReduce. The plan is to move NDFS and MapReduce from Nutch to a new Lucene sub-project, probably sometime in the next few months.

> I'm using Nutch right now (and I love it), but I currently have very little interest in web indexing. I have a project with a custom Mapper and Reducer, and I needed to be able to read in the data from a SequenceFile, which led me to the issue I emailed about. I'd send you a patch with a NullReporter, but it's only four or five lines. :)

I'm still not clear why one might need a NullReporter. Doug
HTMLMetaProcessor a bug?
Hi, I was going over the code and noticed the following in class org.apache.nutch.parse.html.HTMLMetaProcessor, method getMetaTagsHelper: the following code would fail in case the meta tag attributes are in upper case:

Node nameNode = attrs.getNamedItem("name");
Node equivNode = attrs.getNamedItem("http-equiv");
Node contentNode = attrs.getNamedItem("content");

G.
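[Editor's note: a self-contained demonstration of the case-sensitivity issue reported above, with one possible fix - a helper that scans the attribute map comparing names case-insensitively. The helper name is illustrative, not the fix adopted in Nutch.]

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class MetaAttrDemo {
    // Case-insensitive replacement for attrs.getNamedItem(name):
    // scan all attributes and compare names ignoring case.
    static Node getNamedItemIgnoreCase(NamedNodeMap attrs, String name) {
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            if (attr.getNodeName().equalsIgnoreCase(name)) {
                return attr;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Upper-case attribute names, as they can appear in real-world HTML.
        String xml = "<meta NAME=\"robots\" CONTENT=\"noindex\"/>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        NamedNodeMap attrs = doc.getDocumentElement().getAttributes();

        // Exact-case lookup misses the attribute...
        System.out.println(attrs.getNamedItem("name"));              // null
        // ...while the case-insensitive helper finds NAME="robots".
        System.out.println(getNamedItemIgnoreCase(attrs, "name").getNodeValue());
    }
}
```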
Re: Reporter interface
On Mon, Jan 09, 2006 at 03:28:45PM -0800, Doug Cutting wrote:
> I'm still not clear why one might need a NullReporter.

To be clearer, I should be more specific. I had to read from a SequenceFile to interpret the results of a string of MapReduce stages. Here's a simplified snippet; in this case I made a Reporter called nullreporter that just does nothing:

SequenceFileInputFormat inputformat = new SequenceFileInputFormat();
RecordReader in = inputformat.getRecordReader(fshandle, split[i], logjob, nullreporter);

I don't like having to specify a Reporter to getRecordReader(). Actually, as I've thought more about it, it's probably a bad idea to make a NullReporter class (although that might be better than nothing). Maybe a better solution would be simply to allow null to be passed in, but before calling setStatus(), check to make sure that it isn't null. Is that a good idea?

-- Andrew McNabb http://www.mcnabbs.org/andrew/ PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362272 ] Paul Baclace commented on NUTCH-153: NUTCH-160? There is slowness, and then there is continental drift. The quantifiers should be used with any regex package, unless the quantifier itself is a significant cost during match(). The general solution is non-fatal per-file time limits on parsers, at least when regular expressions (OutlinkExtractor) are used. That is, spawn a daemon thread as an alarm to interrupt() the thread doing the match(). I could make a match() timeout patch, but I have also seen a case where TagSoup spent a huge amount of time parsing files of type text/vnd.viewcvs-markup; I don't know what causes the problem, but this MIME type must be high in tortuosity, since Chandler's mime-torture tests include many examples. Thus, a general solution of non-fatal per-file time limits on parsing files would be better placed to take care of present and future problems of this type.

TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail
Key: NUTCH-153
URL: http://issues.apache.org/jira/browse/NUTCH-153
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: all
Reporter: Paul Baclace
Attachments: TextParser.java.patch

If TextParser is given PostScript, it can take hours and then fail. This can be avoided with careful configuration, but if the server MIME type is wrong and the basename of the URL has no file extension, then this parser will take a long time and fail every time. Analysis: the real problem is in OutlinkExtractor.java, as reported in bug NUTCH-150, but the problem cannot be entirely addressed with that patch, since the first call to the regular-expression match() can take a long time despite quantifier limits. Suggested fix: reject files with "%!PS-Adobe" in the first 40 characters. Actual experience has shown that for safety and fail-safe reasons it is worth protecting against GIGO directly in TextParser for this case, even though the suggested fix is not a general solution. (A general solution would be a timeout on match().)
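[Editor's note: a sketch of the interrupt-based match() timeout discussed in this thread, not code from any Nutch patch. Since a plain Matcher does not check for interruption, the sketch wraps the input in a CharSequence whose charAt() polls the interrupt flag - a known workaround - and uses a Future with a deadline as the "alarm". All class and method names are illustrative.]

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

public class TimedMatch {
    // CharSequence wrapper whose charAt() polls the thread's interrupt
    // flag, so a Matcher walking it can be aborted from outside.
    static final class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;
        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }
        public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException("match interrupted");
            }
            return inner.charAt(index);
        }
        public int length() { return inner.length(); }
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleCharSequence(inner.subSequence(start, end));
        }
        public String toString() { return inner.toString(); }
    }

    // Run pattern.matcher(input).find() with a per-call time limit; returns
    // Boolean.TRUE/FALSE on completion, or null if the limit was exceeded.
    static Boolean findWithTimeout(final Pattern pattern, String input, long millis) {
        final CharSequence cs = new InterruptibleCharSequence(input);
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> f = exec.submit(new Callable<Boolean>() {
                public Boolean call() { return pattern.matcher(cs).find(); }
            });
            try {
                return f.get(millis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true); // interrupt the matching thread
                return null;
            } catch (Exception e) {
                return null;
            }
        } finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A catastrophically backtracking pattern against a non-matching input.
        Pattern evil = Pattern.compile("(a+)+b");
        String input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"; // 35 a's, no 'b'
        System.out.println(findWithTimeout(evil, input, 200)); // null (timed out)
        System.out.println(findWithTimeout(evil, "aaab", 200)); // true
    }
}
```

The same deadline-and-interrupt shape extends naturally to a per-file time limit around a whole parser call, which is the general solution argued for above.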
Re: Reporter interface
Andrew McNabb wrote:
> SequenceFileInputFormat inputformat = new SequenceFileInputFormat();
> RecordReader in = inputformat.getRecordReader(fshandle, split[i], logjob, nullreporter);

To read sequence files directly outside of MapReduce, just use SequenceFile directly, e.g., something like:

MyKey key = new MyKey();
MyValue value = new MyValue();
SequenceFile.Reader reader = new SequenceFile.Reader(NutchFileSystem.get(local), file);
while (reader.next(key, value)) {
  ... process key/value pair ...
}

Wouldn't that be simpler? Doug
[jira] Commented: (NUTCH-162) country code jp is used instead of language code ja for Japanese
[ http://issues.apache.org/jira/browse/NUTCH-162?page=comments#action_12362274 ] Paul Baclace commented on NUTCH-162: The best practice for identifying a localization is to use the ISO language and country codes, in the form of a lower-case language code followed by an upper-case country code. This makes it possible to use specific idioms used in particular countries. English has over a dozen variants; a few examples are:

en_AU  English (Australia)
en_IE  English (Ireland)
en_JM  English (Jamaica)
en_US  English (United States)

Inexplicably, different codes were used for the Japanese language and the country Japan. The locale is ja_JP. Meanwhile, Javanese in Java is jw_JA. The web gui should obtain the user's preferred language and country combination from the HTTP request headers and use the nearest matching Locale: http://java.sun.com/docs/books/tutorial/i18n/locale/create.html This is preferred over having the user pick the language and/or country from a list, since the user might not be able to read the labels.

country code jp is used instead of language code ja for Japanese
Key: NUTCH-162
URL: http://issues.apache.org/jira/browse/NUTCH-162
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.7.1
Environment: n/a
Reporter: KuroSaka TeruHiko
Priority: Trivial

In the locale-switching link for Japanese, "jp" is used as the language code, but it is an ISO country code; the language code "ja" should be used. By the way, I don't think many users are familiar with the ISO language codes. A Canadian user may click on "ca" not knowing that "ca" stands for Catalan, not Canadian English or French. Rather than listing the language codes, listing the language names in the prospective languages may be better. (I say "may be" because the browser could show some language names as corrupted text if the current font does not support that language --- this is a difficult problem.)
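[Editor's note: a quick java.util.Locale illustration of the ja/JP confusion reported in this issue - independent of the Nutch web gui code.]

```java
import java.util.Locale;

public class LocaleCodesDemo {
    public static void main(String[] args) {
        // Language codes are lower case, country codes upper case:
        // "ja" is the Japanese language, "JP" the country Japan.
        Locale japan = new Locale("ja", "JP");
        System.out.println(japan.getLanguage()); // ja
        System.out.println(japan.getCountry());  // JP

        // "jp" is not a valid language code; treating it as one yields
        // a locale for an unknown language, not Japanese.
        Locale wrong = new Locale("jp");
        System.out.println(wrong.getLanguage().equals("ja")); // false
    }
}
```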