Re: HTMLMetaProcessor a bug?

2006-01-10 Thread Jérôme Charron
the following code would fail in case the meta tags are in upper case: Node nameNode = attrs.getNamedItem("name"); Node equivNode = attrs.getNamedItem("http-equiv"); Node contentNode = attrs.getNamedItem("content"); This code works well, because Nutch HTML Parser uses Xerces
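
To illustrate the case-sensitivity concern raised in this thread, here is a minimal sketch using only the JDK's default XML parser (not the Nutch HTML parser discussed above). The class `MetaAttrLookup` and the helper `getNamedItemIgnoreCase` are hypothetical names introduced for this example; they are not part of HTMLMetaProcessor.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class MetaAttrLookup {
  // Hypothetical helper: scan the attributes and compare names ignoring case.
  static Node getNamedItemIgnoreCase(NamedNodeMap attrs, String name) {
    for (int i = 0; i < attrs.getLength(); i++) {
      Node attr = attrs.item(i);
      if (attr.getNodeName().equalsIgnoreCase(name)) {
        return attr;
      }
    }
    return null;
  }

  // Parses a single element with the JDK default parser and returns its attributes.
  static NamedNodeMap parseAttrs(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return doc.getDocumentElement().getAttributes();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    NamedNodeMap attrs = parseAttrs("<META NAME=\"robots\" CONTENT=\"noindex\"/>");
    // A plain XML DOM keeps attribute names exactly as written, so this lookup misses.
    System.out.println(attrs.getNamedItem("name"));
    // The case-insensitive scan finds the attribute.
    System.out.println(getNamedItemIgnoreCase(attrs, "name").getNodeValue());
  }
}
```

Whether such a helper is needed depends on how the parser in use normalizes attribute names, which is the point under discussion in the thread.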

XmlInputFormat?

2006-01-10 Thread Jack Tang
Hi, I am going to feed the nutch-0.8-dev crawler with seeds in XML format, and I have read Nutch's TextInputFormat/InputFormatBase. It seems that Nutch currently breaks plain text files into characters and parses them. My question is how to support an XmlInputFormat; as I see it, the XML format is not character-based but

[jira] Created: (NUTCH-169) remove static NutchConf

2006-01-10 Thread Stefan Groschupf (JIRA)
remove static NutchConf --- Key: NUTCH-169 URL: http://issues.apache.org/jira/browse/NUTCH-169 Project: Nutch Type: Improvement Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Removing the static NutchConf.get is
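
As a general illustration of the kind of change this issue names (replacing a static configuration accessor with an instance that is passed to the code that needs it), here is a minimal sketch. `AppConf`, `Worker`, and the key `agent.name` are hypothetical names invented for this example; this is not the NutchConf API and not the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a configuration object; not Nutch's NutchConf.
class AppConf {
  private final Map<String, String> values = new HashMap<>();

  void set(String key, String value) {
    values.put(key, value);
  }

  String get(String key, String defaultValue) {
    String v = values.get(key);
    return v != null ? v : defaultValue;
  }
}

// Hypothetical component: it receives its configuration through the
// constructor instead of calling a static getter.
class Worker {
  private final AppConf conf;

  Worker(AppConf conf) {
    this.conf = conf;
  }

  String agentName() {
    return conf.get("agent.name", "default-agent");
  }
}

class ConfDemo {
  public static void main(String[] args) {
    AppConf a = new AppConf();
    a.set("agent.name", "crawler-a");
    AppConf b = new AppConf();

    // Two differently configured workers coexist in the same JVM.
    System.out.println(new Worker(a).agentName());
    System.out.println(new Worker(b).agentName());
  }
}
```

With a single static instance, both workers would necessarily share one set of values; passing the instance makes the dependency explicit.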

[jira] Updated: (NUTCH-169) remove static NutchConf

2006-01-10 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Stefan Groschupf updated NUTCH-169: --- Attachment: nutchConf.patch The patch was created by Marko Bauhardt with some help from me, so full credit to Marko! It removes any access of nutchConf

Re: HTMLMetaProcessor a bug?

2006-01-10 Thread Gal Nitzan
Thanks, I was checking something with the default from jdk... On Tue, 2006-01-10 at 11:06 +0100, Jérôme Charron wrote: the following code would fail in case the meta tags are in upper case: Node nameNode = attrs.getNamedItem("name"); Node equivNode =

[jira] Reopened: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2006-01-10 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Jerome Charron reopened NUTCH-151: -- Because the barrier call was removed from PumperThread, the process always times out (for instance, the unit tests of parse-ext fail) because only the main

[jira] Updated: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2006-01-10 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Jerome Charron updated NUTCH-151: - Attachment: CommandRunner.060110.patch Here is a very small patch that solves this issue. If Paul is ok with this, I will commit. CommandRunner can hang

Re: Reporter interface

2006-01-10 Thread Andrew McNabb
On Mon, Jan 09, 2006 at 05:00:00PM -0800, Doug Cutting wrote: To read sequence files directly outside of MapReduce, just use SequenceFile directly, e.g., something like: MyKey key = new MyKey(); MyValue value = new MyValue(); SequenceFile.Reader reader = new

[jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-10 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12362334 ] Stefan Groschupf commented on NUTCH-169: I forgot to mention that this is a first version, just for discussion and to provide Jerome with the changed API; it is not the final

fetch of XXX failed with: java.lang.ClassCastException: java.util.ArrayList

2006-01-10 Thread Gal Nitzan
Hi, I traced it to ParseData line 147. UTF8.writeString(out, (String) e.getKey()); UTF8.writeString(out, (String) e.getValue()); It seems that the Set-Cookie key comes with an ArrayList value?
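
The failure described here is the usual result of casting a map value to String when it may in fact be a list. The following is a minimal sketch of a defensive conversion, assuming header values are either a String or a List. `HeaderValues.asString` and the ", " separator are hypothetical choices for illustration and are not the fix that was committed to Nutch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeaderValues {
  // Hypothetical helper: convert a header value to a String without a blind cast.
  static String asString(Object value) {
    if (value instanceof String) {
      return (String) value;
    }
    if (value instanceof List) {
      // Join multi-valued headers; the separator is an assumption of this sketch.
      StringBuilder sb = new StringBuilder();
      for (Object item : (List<?>) value) {
        if (sb.length() > 0) {
          sb.append(", ");
        }
        sb.append(String.valueOf(item));
      }
      return sb.toString();
    }
    return String.valueOf(value);
  }

  public static void main(String[] args) {
    Map<String, Object> headers = new LinkedHashMap<>();
    headers.put("Content-Type", "text/html");
    List<String> cookies = new ArrayList<>();
    cookies.add("a=1");
    cookies.add("b=2");
    headers.put("Set-Cookie", cookies);

    for (Map.Entry<String, Object> e : headers.entrySet()) {
      // (String) e.getValue() would throw ClassCastException for Set-Cookie here.
      System.out.println(e.getKey() + ": " + asString(e.getValue()));
    }
  }
}
```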

Re: Reporter interface

2006-01-10 Thread Doug Cutting
Andrew McNabb wrote: On Mon, Jan 09, 2006 at 05:00:00PM -0800, Doug Cutting wrote: To read sequence files directly outside of MapReduce, just use SequenceFile directly, e.g., something like: MyKey key = new MyKey(); MyValue value = new MyValue(); SequenceFile.Reader reader = new

Re: fetch of XXX failed with: java.lang.ClassCastException: java.util.ArrayList

2006-01-10 Thread Doug Cutting
Gal Nitzan wrote: I traced it to ParseData line 147. UTF8.writeString(out, (String) e.getKey()); UTF8.writeString(out, (String) e.getValue()); It seems that the Set-Cookie key comes with an ArrayList value? I think that was fixed yesterday by Andrzej.

ParserFactory test fail

2006-01-10 Thread Stefan Groschupf
Hi Jerome, I'm not sure, but could it be that with your new html protocol plugin the ParserFactory fails because a component requires log4j? Maybe we should then add log4j to the core classpath: after I added log4j to NUTCH_HOME/lib, the test ran successfully.

Re: ParserFactory test fail

2006-01-10 Thread Jérôme Charron
Hi Stefan, No, in fact, I have refactored the code of the protocol-http plugins, not the HTML parser. So, I don't think the log4j error comes from this code. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

[jira] Created: (NUTCH-170) Crash with multiple temp directories

2006-01-10 Thread Rod Taylor (JIRA)
Crash with multiple temp directories Key: NUTCH-170 URL: http://issues.apache.org/jira/browse/NUTCH-170 Project: Nutch Type: Bug Reporter: Rod Taylor Priority: Critical A brief read of the code indicated it may be

[jira] Commented: (NUTCH-170) Crash with multiple temp directories

2006-01-10 Thread Rod Taylor (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-170?page=comments#action_12362355 ] Rod Taylor commented on NUTCH-170: -- Wish there was an edit option in JIRA. Obviously it was within the SegmentReader -- though I don't believe it does anything special to

Re: Reporter interface

2006-01-10 Thread Andrew McNabb
On Tue, Jan 10, 2006 at 08:44:46AM -0800, Doug Cutting wrote: NutchFileSystem fs = NutchFileSystem.get(); File[] files = fs.listFiles(directory); Thanks. I'll try doing it this way instead of how I was doing it earlier. -- Andrew McNabb http://www.mcnabbs.org/andrew/ PGP Fingerprint:

Re: ParserFactory test fail

2006-01-10 Thread Stefan Groschupf
Sure, my mistake. On 10.01.2006 at 18:24, Jérôme Charron wrote: Hi Stefan, No, in fact, I have refactored the code of the protocol-http plugins, not the HTML parser. So, I don't think the log4j error comes from this code. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

OpenOffice and Excel parsers

2006-01-10 Thread Rida Benjelloun
Hi, Is anyone working on OpenOffice and Excel parsers? I have already developed them in Lius (http://sourceforge.net/projects/lius) and I want to adapt them for Nutch. I have checked the SVN repository and didn't find OO or Excel parsers. Best regards, Rida.

Re: HTMLMetaProcessor a bug?

2006-01-10 Thread Gal Nitzan
Because I needed to add two more fields from the meta tags in the html page I have revised some of the code in HTMLMetaProcessor and in DOMContentUtils. I believe it to be a little more generic than the existing code (look at DOMContentUtils.GetMetaAttributes) and from the sample here from Jérôme

[jira] Commented: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2006-01-10 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=comments#action_12362383 ] Paul Baclace commented on NUTCH-151: The number of threads that invoke _barrier.barrier() or .attemptBarrier() should match the count passed to the constructor of
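
The point about matching the barrier count to the number of participating threads can be sketched with the JDK's `java.util.concurrent.CyclicBarrier`. This is a substitute chosen for illustration, not the barrier class used by CommandRunner, and `BarrierCount.allPass` is a hypothetical helper. When fewer threads arrive than the barrier was built for, every waiter times out or sees a broken barrier, which resembles the timeout reported when this issue was reopened.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

class BarrierCount {
  // Hypothetical helper: starts `threads` workers that all wait on a barrier
  // built for `parties`, and reports whether every worker got through in time.
  static boolean allPass(int parties, int threads, long timeoutMs) {
    CyclicBarrier barrier = new CyclicBarrier(parties);
    AtomicInteger passed = new AtomicInteger();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        try {
          barrier.await(timeoutMs, TimeUnit.MILLISECONDS);
          passed.incrementAndGet();
        } catch (InterruptedException | BrokenBarrierException | TimeoutException e) {
          // This worker did not get through the barrier.
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return passed.get() == threads;
  }

  public static void main(String[] args) {
    // Counts match: both threads pass.
    System.out.println(allPass(2, 2, 2000));
    // Barrier expects three parties but only two threads arrive: nobody passes.
    System.out.println(allPass(3, 2, 200));
  }
}
```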

[jira] Updated: (NUTCH-169) remove static NutchConf

2006-01-10 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Jerome Charron updated NUTCH-169: - Attachment: NutchConf.Http.060111.patch Attached is the patch for http related classes (lib-http, protocol-http and protocol-httpclient). Pfou, Stefan, it

[jira] Updated: (NUTCH-169) remove static NutchConf

2006-01-10 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Jerome Charron updated NUTCH-169: - Attachment: NutchConf.Fetcher.060111.patch Same as the one provided in Stefan's patch, plus the Fetcher sets the NutchConf on the protocol. Not sure it is the right

[jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-10 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12362393 ] Stefan Groschupf commented on NUTCH-169: Great! Thanks a lot Jerome!!! We will continue to fix some smaller bugs we introduced and JobConf-related issues and hopefully

[jira] Updated: (NUTCH-169) remove static NutchConf

2006-01-10 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Jerome Charron updated NUTCH-169: - Attachment: NutchConf.RegexURLFilter.060111.patch This patch is a merge of the version provided in Stefan's patch and the last changes committed by Doug (use