Re: problems http-client

2006-01-06 Thread Jérôme Charron
A related issue is that these two plugins replicate a lot of code. At some point we should try to fix that. See: http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html I have begun working on this. Nobody else? Can I go on? Jérôme -- http://motrech.free.fr/

Re: problems http-client

2006-01-06 Thread Andrzej Bialecki
Jérôme Charron wrote: A related issue is that these two plugins replicate a lot of code. At some point we should try to fix that. See: http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html I have begun working on this. Nobody else? Can I go on?

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Doug Cutting updated NUTCH-139: --- Comment: was deleted Standard metadata property names in the ParseData metadata -- Key:

creating MapFiles from unsorted data?

2006-01-06 Thread Matt Kangas
Hi folks, I'm in the process of cleaning up my WhitelistURLFilter (NUTCH-87 on JIRA), and I've got a question about working with org.apache.nutch.io.MapFile. I am parsing a textfile with one key/value pair per line. I want to write this into a new MapFile. MapFile.Writer requires keys to
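
The question turns on key order: MapFile.Writer expects keys to arrive already sorted. A minimal sketch of one way to handle an unsorted text file, assuming small input that fits in memory and tab-separated `key<TAB>value` lines. The class name `SortedPairs` and the sample data are hypothetical, and the actual MapFile.Writer call is deliberately left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortedPairs {
    /** Parses "key<TAB>value" lines and returns the pairs ordered by key. */
    public static TreeMap<String, String> sortPairs(List<String> lines) {
        TreeMap<String, String> sorted = new TreeMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip lines without a separator
            }
            // If a key appears twice, the later line overwrites the earlier one.
            sorted.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("b.com\tdeny", "a.com\tallow", "c.org\tallow");
        for (Map.Entry<String, String> e : sortPairs(lines).entrySet()) {
            // A real implementation would append each pair to the MapFile writer here.
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Note that `TreeMap` orders `String` keys by UTF-16 code unit, which may not match the comparator of the key class used in the MapFile; for non-ASCII keys the sort should use the same ordering as the writer.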

Re: mapred crawling exception - Job failed!

2006-01-06 Thread Lukas Vlcek
Huh... anybody interested in this? Normally I wouldn't be so pushy, but to me it seems that Nutch dies if it meets a Word document which can't be parsed. This seems like a serious issue to me. Or did I overlook something important/fundamental? Lukas On 1/6/06, Lukas Vlcek [EMAIL PROTECTED] wrote:

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361994 ] Doug Cutting commented on NUTCH-139: Jerome, Some HTTP headers have multiple values. Correctly reflecting that was I thought the primary motivation for adding multiple
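
The point about HTTP headers carrying several values can be illustrated with a small container that keeps a list per name. This is only a sketch under assumptions: `MultiValueMetadata` and its method names are hypothetical and are not the Nutch metadata API discussed in the issue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValueMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value, keeping any values already stored under the name. */
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces all values stored under the name with a single value. */
    public void set(String name, String value) {
        List<String> list = new ArrayList<>();
        list.add(value);
        values.put(name, list);
    }

    /** Returns the first value for the name, or null if there is none. */
    public String get(String name) {
        List<String> list = values.get(name);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    /** Returns a read-only view of all values for the name, possibly empty. */
    public List<String> getValues(String name) {
        List<String> list = values.get(name);
        return list == null ? Collections.<String>emptyList() : Collections.unmodifiableList(list);
    }

    public static void main(String[] args) {
        MultiValueMetadata md = new MultiValueMetadata();
        md.add("Set-Cookie", "a=1");
        md.add("Set-Cookie", "b=2");
        System.out.println(md.getValues("Set-Cookie"));
    }
}
```

Keeping `set` and `get` single-valued leaves the common case simple, while `add` and `getValues` serve protocols that genuinely need several values. The sketch treats names as case-sensitive, which real HTTP header handling would not.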

[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-06 Thread KuroSaka TeruHiko (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361995 ] KuroSaka TeruHiko commented on NUTCH-153: - The strings command would work with mostly ASCII text content. It is highly doubtful if we can have a universal strings

[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-06 Thread KuroSaka TeruHiko (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361997 ] KuroSaka TeruHiko commented on NUTCH-153: - Actually, shouldn't turning on the mime.type.magic property do the job that the patch is trying to address? TextParser
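
The idea behind magic-based type detection is to look at the leading bytes of the content rather than trusting the declared type. A minimal sketch, assuming the conventional `%!PS` PostScript header; the class `MagicCheck` is hypothetical and this is not how the mime.type.magic property is implemented in Nutch.

```java
import java.nio.charset.StandardCharsets;

public class MagicCheck {
    // PostScript documents conventionally start with "%!PS".
    private static final byte[] PS_MAGIC = "%!PS".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the content begins with the PostScript magic bytes. */
    public static boolean looksLikePostScript(byte[] content) {
        if (content.length < PS_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PS_MAGIC.length; i++) {
            if (content[i] != PS_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ps = "%!PS-Adobe-3.0\n".getBytes(StandardCharsets.US_ASCII);
        byte[] txt = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePostScript(ps));
        System.out.println(looksLikePostScript(txt));
    }
}
```

A check like this would let a plain-text parser decline content that is clearly not plain text before doing any expensive work.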

Re: Normalizing URLs with anchors

2006-01-06 Thread Doug Cutting
Ken Krugler wrote: I'm wondering whether it would also make sense to remove anchor text from URLs. For example, currently these two URLs are treated as different: http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex and http://www.dina.kvl.dk/~sestoft/gcsharp/index.html Is it
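
The normalization being proposed amounts to dropping everything from the first `#` onward. A minimal sketch, assuming a purely string-based approach; `FragmentStripper` is a hypothetical name and not part of any Nutch URL normalizer.

```java
public class FragmentStripper {
    /** Returns the URL without its fragment ("#..."), or unchanged if it has none. */
    public static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String a = "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex";
        String b = "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html";
        // After stripping, both URLs from the message compare equal.
        System.out.println(stripFragment(a).equals(stripFragment(b)));
    }
}
```

Working on the raw string avoids re-encoding side effects that parsing and rebuilding the URL can introduce.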

Re: [bug?] PRC called emthod require parameter

2006-01-06 Thread Doug Cutting
Stefan Groschupf wrote: Different parameters are sent to each address. So params.length should equal addresses.length, and if params.length==0 then addresses.length==0 and there's no call to be made. Make sense? It might be clearer if the test were changed to addresses.length==0. Yes,

Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
Matt Zytaruk wrote: The newest src (as of this morning) of trunk is occasionally giving ClassCastExceptions when doing a crawl, with parsing (and by occasionally I mean this was the only page out of the small list I crawled that it happened on). This is with nothing changed from the

Re: [bug?] PRC called emthod require parameter

2006-01-06 Thread Stefan Groschupf
What bug was that? What is your one-line fix? http://www.nabble.com/RCP-known-limitation-or-bug--t688207.html something like: Object[] values = method.getReturnType() != null ? (Object[]) Array.newInstance(method.getReturnType(), wrappedValues.length) : new Object[0];
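
The one-liner relies on `java.lang.reflect.Array.newInstance` to build a result array whose component type matches the called method's return type. A self-contained sketch of that technique follows; `TypedResults` is a hypothetical helper, not the actual patch. One assumption differs from the snippet: `Method.getReturnType()` reports `void.class` rather than null for void methods, so the sketch tests for primitive types (which include void) instead of null.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Method;

public class TypedResults {
    /**
     * Creates an array to hold results of the given type.
     * Primitive types (including void) cannot be stored in an Object[]
     * subtype, so they fall back to a plain Object[].
     */
    public static Object[] newResultArray(Class<?> returnType, int length) {
        if (returnType.isPrimitive()) {
            return new Object[length];
        }
        return (Object[]) Array.newInstance(returnType, length);
    }

    public static void main(String[] args) throws Exception {
        Method trim = String.class.getMethod("trim");
        Object[] results = newResultArray(trim.getReturnType(), 2);
        // The runtime type is String[], so a caller may cast it safely.
        System.out.println(results.getClass().getSimpleName());
    }
}
```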

[jira] Commented: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-160?page=comments#action_12361999 ] Doug Cutting commented on NUTCH-160: +1 I like this patch. I don't see a need for us to use oro anywhere, since Java now has good builtin regex support. And Java's
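
For readers unfamiliar with the built-in library, here is a small sketch of `java.util.regex` in use. The pattern and the class `RegexLinks` are illustrative assumptions and are not taken from the patch attached to the issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinks {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    /** Returns every substring of the text that matches the URL pattern. */
    public static List<String> findUrls(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findUrls("see http://a.com/x and https://b.org/y now"));
    }
}
```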

Re: Adaptive fetch interval unmodified content detection, episode II

2006-01-06 Thread Doug Cutting
Andrzej Bialecki wrote: For efficiency reasons, most of this information is stored and passed to processing jobs inside instances of CrawlDatum - for the key step of DB update any other parts of segments (such as Content, ParseData or ParseText) are not used, which prevents easy access to

Re: Class Cast exception

2006-01-06 Thread Matt Zytaruk
Here you go. java.lang.ClassCastException: java.util.ArrayList at org.apache.nutch.parse.ParseData.write(ParseData.java:122) at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51) at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57) at

Re: problems http-client

2006-01-06 Thread AJ Chen
I have started to see this problem recently. topN=20 per crawl, but fetched pages = 15 - 17, while error pages = 2000 - 5000. 25000 pages are missing. This is reproducible with Nutch 0.7.1, with both protocol-http and protocol-httpclient included. I also see lots of Response content

Re: Per-page crawling policy

2006-01-06 Thread Jack Tang
Hi Andrzej, The idea brings vertical search into Nutch and it is definitely great :) I think Nutch should add an information retrieval layer to the whole architecture, and export some abstract interface, say UrlBasedInformationRetrieve (you can implement your URL grouping idea here?),

Re: [bug?] PRC called emthod require parameter

2006-01-06 Thread Doug Cutting
Okay, here's my patch attached. We don't need an all-new unit test file, when just a few lines are needed there. Does this look right to you? Doug Stefan Groschupf wrote: What bug was that? What is your one-line fix? http://www.nabble.com/RCP-known-limitation-or-bug--t688207.html

[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362002 ] Doug Cutting commented on NUTCH-153: Paul, Does http://issues.apache.org/jira/browse/NUTCH-160 address this issue too? I.e., is at least part of the problem that oro has

Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
Matt Zytaruk wrote: Here you go. java.lang.ClassCastException: java.util.ArrayList at org.apache.nutch.parse.ParseData.write(ParseData.java:122) at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51) at

Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
Hi, I attached the patch. Please test. -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362003 ] Doug Cutting commented on NUTCH-139: Also, since the primary use of multiple metadata values should be for protocols where multiple-values are required, the method to add

Re: Class Cast exception

2006-01-06 Thread Matt Zytaruk
So will this throw an exception on older segments? or will it just not get the correct metadata? I have a lot of older segments I still need to use. Thanks for your help. -Matt Zytaruk Andrzej Bialecki wrote: Matt Zytaruk wrote: Here you go. java.lang.ClassCastException:

[jira] Commented: (NUTCH-152) TaskRunner io pipes are not setDaemon(true), cleanup and exception errors are incomplete, max heap too small

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-152?page=comments#action_12362004 ] Doug Cutting commented on NUTCH-152: re 1,2,5: sounds good. re 3: Why is a separate thread needed for stdout? Can you please elaborate on how this causes problems? re 4:
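
The issue title refers to I/O pipe threads that are not marked as daemon threads. As background, the sketch below shows a stream-draining thread that calls `setDaemon(true)` so it cannot keep the JVM alive once the main work is done. `PipeDrainer` is a hypothetical class written for illustration and is not the TaskRunner code under review.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class PipeDrainer {
    /** Starts a daemon thread that copies everything from in to out. */
    public static Thread startDrain(final InputStream in, final OutputStream out) {
        Thread t = new Thread(() -> {
            byte[] buf = new byte[4096];
            try {
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                out.flush();
            } catch (IOException e) {
                // The stream was closed underneath us; nothing more to copy.
            }
        }, "pipe-drainer");
        t.setDaemon(true); // must be set before start()
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Thread t = startDrain(new ByteArrayInputStream("child output\n".getBytes()), sink);
        t.join(); // wait for the copy to finish before reading the sink
        System.out.print(sink.toString());
    }
}
```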

[jira] Resolved: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Doug Cutting resolved NUTCH-151: Fix Version: 0.8-dev Resolution: Fixed I just committed this. Thanks, Paul! CommandRunner can hang after the main thread exec is finished and has

[jira] Resolved: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

2006-01-06 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ] Doug Cutting resolved NUTCH-150: Fix Version: 0.7.2-dev Resolution: Fixed I just committed this. Thanks, Paul! OutlinkExtractor extremely slow on some non-plain text