A related issue is that these two plugins replicate a lot of code. At
some point we should try to fix that. See:
http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
I have begun working on this. Nobody else? Can I go on?
Jérôme
--
http://motrech.free.fr/
Jérôme Charron wrote:
A related issue is that these two plugins replicate a lot of code. At
some point we should try to fix that. See:
http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
I have begun working on this. Nobody else? Can I go on?
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Doug Cutting updated NUTCH-139:
---
Comment: was deleted
Standard metadata property names in the ParseData metadata
--
Key:
Hi folks,
I'm in the process of cleaning up my WhitelistURLFilter (NUTCH-87 on
JIRA), and I've got a question about working with
org.apache.nutch.io.MapFile.
I am parsing a textfile with one key/value pair per line. I want to
write this into a new MapFile. MapFile.Writer requires keys to
Huh...
anybody interested in this?
Normally I wouldn't be so pushy, but to me it seems that Nutch dies if it
meets a Word document which can't be parsed. This seems like a serious
issue to me.
Or did I overlook something important/fundamental?
Lukas
On 1/6/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361994 ]
Doug Cutting commented on NUTCH-139:
Jerome,
Some HTTP headers have multiple values. Correctly reflecting that was I
thought the primary motivation for adding multiple
[
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361995 ]
KuroSaka TeruHiko commented on NUTCH-153:
-
The strings command would work with mostly ASCII text content. It is highly
doubtful if we can have a universal strings
[
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12361997 ]
KuroSaka TeruHiko commented on NUTCH-153:
-
Actually, shouldn't turning on the mime.type.magic property do the job that the
patch is trying to address?
TextParser
Ken Krugler wrote:
I'm wondering whether it would also make sense to remove anchor text
from URLs. For example, currently these two URLs are treated as different:
http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex
and
http://www.dina.kvl.dk/~sestoft/gcsharp/index.html
Is it
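To illustrate the question above, here is a minimal sketch of stripping the anchor (fragment) from a URL so that both example URLs normalize to the same string. The class and method names (FragmentStripper, stripFragment) are hypothetical and are not part of Nutch; the sketch relies only on the fact that an unescaped '#' in a URL starts the fragment.

```java
public class FragmentStripper {

    /**
     * Returns the URL with everything from the first '#' onward removed.
     * Hypothetical helper for illustration only, not a Nutch API.
     */
    public static String stripFragment(String url) {
        int hash = url.indexOf('#');
        // No '#' means there is no fragment, so the URL is returned unchanged.
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String withAnchor = "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex";
        String plain = "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html";
        // Both URLs should compare equal once the fragment is dropped.
        System.out.println(stripFragment(withAnchor).equals(stripFragment(plain)));
    }
}
```

Whether such a step belongs in a URL normalizer or elsewhere is a design choice the thread leaves open.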
Stefan Groschupf wrote:
Different parameters are sent to each address. So params.length
should equal addresses.length, and if params.length==0 then
addresses.length==0 and there's no call to be made. Make sense? It
might be clearer if the test were changed to addresses.length==0.
Yes,
Matt Zytaruk wrote:
The newest src (as of this morning) of trunk is occasionally giving
ClassCastExceptions when doing a crawl, with parsing (and by
occasionally I mean this was the only page out of the small list I
crawled that it happened on). This is with nothing changed from
the
What bug was that? What is your one-line fix?
http://www.nabble.com/RCP-known-limitation-or-bug--t688207.html
something like:
Object[] values = method.getReturnType() != null
    ? (Object[]) Array.newInstance(method.getReturnType(), wrappedValues.length)
    : new Object[0];
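As a self-contained sketch of the same idea, the helper below builds a result array typed after a method's return type. The names TypedResultArray and newResultArray are hypothetical. One assumption goes beyond the quoted fix: java.lang.reflect.Method.getReturnType() reports void as void.class rather than null, and Array.newInstance rejects void.class, so this sketch falls back to an empty Object[] for void and primitive return types.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Method;

public class TypedResultArray {

    /**
     * Creates an array whose component type is the method's return type.
     * Hypothetical helper for illustration; not the actual Nutch RPC code.
     */
    static Object[] newResultArray(Method method, int length) {
        Class<?> type = method.getReturnType();
        // Assumption: void and primitive types cannot back an Object[],
        // so fall back to an empty array for them (isPrimitive() is also
        // true for void.class).
        if (type == null || type.isPrimitive()) {
            return new Object[0];
        }
        return (Object[]) Array.newInstance(type, length);
    }

    public static void main(String[] args) throws Exception {
        Method trim = String.class.getMethod("trim");
        Object[] values = newResultArray(trim, 3);
        // Prints the runtime component type of the created array.
        System.out.println(values.getClass().getComponentType());
    }
}
```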
[
http://issues.apache.org/jira/browse/NUTCH-160?page=comments#action_12361999 ]
Doug Cutting commented on NUTCH-160:
+1
I like this patch. I don't see a need for us to use oro anywhere, since Java
now has good builtin regex support. And Java's
Andrzej Bialecki wrote:
For efficiency reasons, most of this information is stored and passed to
processing jobs inside instances of CrawlDatum - for the key step of DB
update any other parts of segments (such as Content, ParseData or
ParseText) are not used, which prevents easy access to
Here you go.
java.lang.ClassCastException: java.util.ArrayList
at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
at
org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
at
I have started to see this problem recently. topN=20 per crawl, but
fetched pages = 15 - 17, while error pages = 2000 - 5000. 25000
pages are missing. This is reproducible with Nutch 0.7.1, with both protocol-http
and protocol-httpclient included.
I also see lots of Response content
Hi Andrzej
The idea brings vertical search into Nutch, and it is definitely great :)
I think Nutch should add an information retrieval layer to the whole
architecture and export some abstract interface, say
UrlBasedInformationRetrieve (you could implement your URL grouping idea
here?),
Okay, here's my patch attached.
We don't need an all-new unit test file, when just a few lines are
needed there.
Does this look right to you?
Doug
Stefan Groschupf wrote:
What bug was that? What is your one-line fix?
http://www.nabble.com/RCP-known-limitation-or-bug--t688207.html
[
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362002 ]
Doug Cutting commented on NUTCH-153:
Paul,
Does http://issues.apache.org/jira/browse/NUTCH-160 address this issue too?
I.e., is at least part of the problem that oro has
Matt Zytaruk wrote:
Here you go.
java.lang.ClassCastException: java.util.ArrayList
at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
at
Hi,
I attached the patch. Please test.
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362003 ]
Doug Cutting commented on NUTCH-139:
Also, since the primary use of multiple metadata values should be for protocols
where multiple-values are required, the method to add
So will this throw an exception on older segments? or will it just not
get the correct metadata? I have a lot of older segments I still need to
use.
Thanks for your help.
-Matt Zytaruk
Andrzej Bialecki wrote:
Matt Zytaruk wrote:
Here you go.
java.lang.ClassCastException:
[
http://issues.apache.org/jira/browse/NUTCH-152?page=comments#action_12362004 ]
Doug Cutting commented on NUTCH-152:
re 1,2,5: sounds good.
re 3: Why is a separate thread needed for stdout? Can you please elaborate on
how this causes problems?
re 4:
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ]
Doug Cutting resolved NUTCH-151:
Fix Version: 0.8-dev
Resolution: Fixed
I just committed this. Thanks, Paul!
CommandRunner can hang after the main thread exec is finished and has
[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]
Doug Cutting resolved NUTCH-150:
Fix Version: 0.7.2-dev
Resolution: Fixed
I just committed this. Thanks, Paul!
OutlinkExtractor extremely slow on some non-plain text