the following code would fail in case the meta tags are in upper case
Node nameNode = attrs.getNamedItem("name");
Node equivNode = attrs.getNamedItem("http-equiv");
Node contentNode = attrs.getNamedItem("content");
This code works well, because Nutch HTML Parser uses Xerces
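For readers who want to see the case issue in isolation, here is a minimal sketch. It assumes a plain JAXP XML parser that keeps attribute names exactly as written, which is not the Nutch HTML parser setup. The class MetaAttrLookup and the helper getNamedItemIgnoreCase are hypothetical names, not part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class MetaAttrLookup {

    // Hypothetical helper (not part of Nutch): returns the first attribute
    // whose name matches ignoring case, or null if there is none.
    public static Node getNamedItemIgnoreCase(NamedNodeMap attrs, String name) {
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            if (attr.getNodeName().equalsIgnoreCase(name)) {
                return attr;
            }
        }
        return null;
    }

    // Parses a small XML snippet and returns the attributes of its root element.
    public static NamedNodeMap rootAttributes(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getAttributes();
    }

    public static void main(String[] args) throws Exception {
        NamedNodeMap attrs = rootAttributes("<META NAME=\"robots\" CONTENT=\"noindex\"/>");
        // Exact-case lookup: the attribute was written as NAME, not name.
        System.out.println(attrs.getNamedItem("name"));
        // Case-insensitive lookup through the helper.
        System.out.println(getNamedItemIgnoreCase(attrs, "name").getNodeValue());
    }
}
```

Whether such a helper is needed depends on whether the parser in use already normalizes attribute names.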
Hi
I am going to feed the nutch-0.8-dev crawler with seeds in XML format, and
I have read Nutch's TextInputFormat/InputFormatBase. It seems Nutch now
breaks plain text files into chars and parses them. My question
is how to support an XmlInputFormat; in my view, the XML format is not
character-based but
remove static NutchConf
---
Key: NUTCH-169
URL: http://issues.apache.org/jira/browse/NUTCH-169
Project: Nutch
Type: Improvement
Reporter: Stefan Groschupf
Priority: Critical
Fix For: 0.8-dev
Removing the static NutchConf.get is
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ]
Stefan Groschupf updated NUTCH-169:
---
Attachment: nutchConf.patch
The patch was created by Marko Bauhardt with some help from me, so full
credits to Marko!
It removes any access to NutchConf
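As a rough illustration of the idea behind the issue (passing a configuration object in rather than fetching it from a static accessor), here is a small sketch. SimpleConf, Worker and the key worker.threads are hypothetical names made up for this example; they are not the NutchConf API and not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

class ConfInjectionSketch {

    // Hypothetical stand-in for a configuration object; this is NOT the NutchConf API.
    public static class SimpleConf {
        private final Map<String, String> values = new HashMap<>();

        public void set(String key, String value) {
            values.put(key, value);
        }

        public int getInt(String key, int defaultValue) {
            String raw = values.get(key);
            return raw == null ? defaultValue : Integer.parseInt(raw.trim());
        }
    }

    // Hypothetical component that receives its configuration through the
    // constructor instead of calling a static accessor.
    public static class Worker {
        private final SimpleConf conf;

        public Worker(SimpleConf conf) {
            this.conf = conf;
        }

        public int threadCount() {
            return conf.getInt("worker.threads", 10);
        }
    }

    public static void main(String[] args) {
        SimpleConf custom = new SimpleConf();
        custom.set("worker.threads", "4");
        SimpleConf defaults = new SimpleConf();
        // Two workers in the same JVM, each with its own configuration.
        System.out.println(new Worker(custom).threadCount());
        System.out.println(new Worker(defaults).threadCount());
    }
}
```

With this shape, two differently configured instances can live in one JVM, which a single static configuration does not allow.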
Thanks, I was checking something with the default from jdk...
On Tue, 2006-01-10 at 11:06 +0100, Jérôme Charron wrote:
the following code would fail in case the meta tags are in upper case
Node nameNode = attrs.getNamedItem("name");
Node equivNode =
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ]
Jerome Charron reopened NUTCH-151:
---
Due to the removal of the barrier call in PumperThread, the process always
times out (for instance, the unit tests of parse-ext fail) because only the main
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ]
Jerome Charron updated NUTCH-151:
---
Attachment: CommandRunner.060110.patch
Here is a very small patch that solves this issue.
If Paul is ok with this, I will commit.
CommandRunner can hang
On Mon, Jan 09, 2006 at 05:00:00PM -0800, Doug Cutting wrote:
To read sequence files directly outside of MapReduce, just use
SequenceFile directly, e.g., something like:
MyKey key = new MyKey();
MyValue value = new MyValue();
SequenceFile.Reader reader =
new
[
http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12362334 ]
Stefan Groschupf commented on NUTCH-169:
I forgot to mention that this is a first version, just for discussion and to
provide Jerome with the changed API; it is not the final
Hi,
I traced it to ParseData line 147.
UTF8.writeString(out, (String) e.getKey());
UTF8.writeString(out, (String) e.getValue());
It seems that the Set-Cookie key comes with an ArrayList value?
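To make the failure mode concrete: a (String) cast on a value that is really a List throws a ClassCastException. Below is a minimal sketch of a more defensive write. It uses java.io.DataOutput.writeUTF as a stand-in for Nutch's UTF8.writeString, and MetadataWriter and asString are hypothetical names; this is not the fix that went into ParseData.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetadataWriter {

    // Hypothetical helper: flattens a metadata value to a String before writing.
    // A List (e.g. several Set-Cookie headers) is joined with ", ".
    public static String asString(Object value) {
        if (value instanceof List) {
            List<String> parts = new ArrayList<>();
            for (Object item : (List<?>) value) {
                parts.add(String.valueOf(item));
            }
            return String.join(", ", parts);
        }
        return String.valueOf(value);
    }

    // Writes the entry count, then each key/value pair via DataOutput.writeUTF.
    public static void write(DataOutput out, Map<String, ?> metadata) throws IOException {
        out.writeInt(metadata.size());
        for (Map.Entry<String, ?> e : metadata.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(asString(e.getValue()));
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "text/html");
        List<String> cookies = new ArrayList<>();
        cookies.add("a=1");
        cookies.add("b=2");
        metadata.put("Set-Cookie", cookies);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        write(new DataOutputStream(bytes), metadata);
        System.out.println(asString(metadata.get("Set-Cookie")));
    }
}
```

Joining the values is only one option and loses the boundary between individual cookies; writing each value as its own entry would be another.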
Andrew McNabb wrote:
On Mon, Jan 09, 2006 at 05:00:00PM -0800, Doug Cutting wrote:
To read sequence files directly outside of MapReduce, just use
SequenceFile directly, e.g., something like:
MyKey key = new MyKey();
MyValue value = new MyValue();
SequenceFile.Reader reader =
new
Gal Nitzan wrote:
I traced it to ParseData line 147.
UTF8.writeString(out, (String) e.getKey());
UTF8.writeString(out, (String) e.getValue());
It seems that the Set-Cookie key comes with an ArrayList value?
I think that was fixed yesterday by Andrzej.
Hi Jerome,
I'm not sure, but could it be that with your new html protocol
plugin the ParserFactory fails because a component requires log4j?
Maybe we should then add log4j to the core classpath, since after I
added log4j to NUTCH_HOME/lib the test ran
successfully.
Hi Stefan,
No in fact, I have refactored the code of protocol-http plugins, not html
parser.
So, I don't think the log4j error comes from this code.
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Crash with multiple temp directories
Key: NUTCH-170
URL: http://issues.apache.org/jira/browse/NUTCH-170
Project: Nutch
Type: Bug
Reporter: Rod Taylor
Priority: Critical
A brief read of the code indicated it may be
[
http://issues.apache.org/jira/browse/NUTCH-170?page=comments#action_12362355 ]
Rod Taylor commented on NUTCH-170:
---
Wish there was an edit option in JIRA. Obviously it was within the
SegmentReader -- though I don't believe it does anything special to
On Tue, Jan 10, 2006 at 08:44:46AM -0800, Doug Cutting wrote:
NutchFileSystem fs = NutchFileSystem.get();
File[] files = fs.listFiles(directory);
Thanks. I'll try doing it this way instead of how I was doing it
earlier.
--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint:
Sure, my mistake.
Am 10.01.2006 um 18:24 schrieb Jérôme Charron:
Hi Stefan,
No in fact, I have refactored the code of protocol-http plugins,
not html
parser.
So, I don't think the log4j error comes from this code.
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Hi,
Is anyone working on OpenOffice and Excel parsers? I have already
developed them in Lius (http://sourceforge.net/projects/lius) and I want to
adapt them for Nutch.
I have checked SVN and didn't find an OO or Excel parser.
Best regards
Rida.
Because I needed to add two more fields from the meta tags in the html
page I have revised some of the code in HTMLMetaProcessor and in
DOMContentUtils.
I believe it to be a little more generic than the existing code (look at
DOMContentUtils.GetMetaAttributes) and from the sample here from Jérôme
[
http://issues.apache.org/jira/browse/NUTCH-151?page=comments#action_12362383 ]
Paul Baclace commented on NUTCH-151:
The number of threads that invoke _barrier.barrier() or .attemptBarrier()
should match the count passed to the constructor of
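The point about matching counts can be sketched with java.util.concurrent.CyclicBarrier. Note this is a swapped-in class: its await() plays the role of the barrier()/attemptBarrier() calls discussed here, and it is not the barrier class used by CommandRunner. BarrierCountDemo and runOnce are hypothetical names.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BarrierCountDemo {

    // Starts `workers` threads that each wait on a barrier built for `parties`
    // participants, then waits on it from the calling thread as well.
    // Returns true if the calling thread passed the barrier within the timeout.
    public static boolean runOnce(int parties, int workers, long timeoutMs)
            throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(parties);
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    barrier.await(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (InterruptedException | BrokenBarrierException | TimeoutException e) {
                    // The barrier did not trip for this worker; nothing more to do.
                }
            });
            threads[i].start();
        }
        boolean passed;
        try {
            barrier.await(timeoutMs, TimeUnit.MILLISECONDS);
            passed = true;
        } catch (BrokenBarrierException | TimeoutException e) {
            passed = false;
        }
        for (Thread t : threads) {
            t.join();
        }
        return passed;
    }

    public static void main(String[] args) throws InterruptedException {
        // Two workers plus the calling thread: three waiters for three parties.
        System.out.println(runOnce(3, 2, 2000));
        // One worker plus the calling thread: only two waiters for three parties.
        System.out.println(runOnce(3, 1, 200));
    }
}
```

When fewer threads call the barrier than the count it was built with, every waiter blocks until its timeout expires, which looks much like the "always times out" symptom reported above.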
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ]
Jerome Charron updated NUTCH-169:
---
Attachment: NutchConf.Http.060111.patch
Attached is the patch for http related classes (lib-http, protocol-http and
protocol-httpclient).
Pfou, Stefan, it
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ]
Jerome Charron updated NUTCH-169:
---
Attachment: NutchConf.Fetcher.060111.patch
Same as the one provided in Stefan's patch, plus the Fetcher sets the NutchConf
on the protocol.
Not sure it is the right
[
http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12362393 ]
Stefan Groschupf commented on NUTCH-169:
Great! Thanks a lot Jerome!!! We will continue to fix some smaller bugs we
introduced and JobConf related issue and hopefully
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ]
Jerome Charron updated NUTCH-169:
---
Attachment: NutchConf.RegexURLFilter.060111.patch
This patch is a merge of the version provided in Stefan's patch and the last
changes committed by Doug (use