[
http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372556 ]
Doug Cutting commented on NUTCH-171:
Ideally we could overlap segment2 map with segment1 reduce to keep bandwidth
usage constant.
Overlapping map2 with reduce1 should
[
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372574 ]
Doug Cutting commented on NUTCH-240:
First, I hope my critical remarks were not taken personally. I am thankful for
this and all of your contributions.
Initially, I did
[
http://issues.apache.org/jira/browse/NUTCH-242?page=comments#action_12372581 ]
Doug Cutting commented on NUTCH-242:
Shouldn't you use the returned value of the filter? If so, then this should be
done in a mapper, not in the reducer.
Add optional
[
http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372597 ]
Doug Cutting commented on NUTCH-171:
Generating 20 segments of 10M each is almost as fast as generating 1 segment
that is 10M in size. A single 200M URL segment is unwieldy
[
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372341 ]
Doug Cutting commented on NUTCH-240:
The generator store/restore score stuff seems ugly. And it is not used by
OPIC. Could we instead have a method that computes
[
http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371122 ]
Doug Cutting commented on NUTCH-235:
I'm concerned about all of the contains() calls this adds to an ArrayList.
This is a linear scan, and makes the cost of building
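Doug's concern can be sketched in plain Java (illustrative only, not Nutch code): ArrayList.contains() walks the list on every call, so building a deduplicated list this way is quadratic overall, while a HashSet makes each membership check constant-time on average.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupCost {
    /** Dedup with ArrayList.contains(): O(n) per lookup, O(n^2) total. */
    static List<String> dedupList(String[] urls) {
        List<String> out = new ArrayList<String>();
        for (String u : urls) {
            if (!out.contains(u)) {   // linear scan of everything added so far
                out.add(u);
            }
        }
        return out;
    }

    /** Dedup with HashSet: O(1) expected per lookup; add() ignores duplicates. */
    static Set<String> dedupSet(String[] urls) {
        Set<String> out = new HashSet<String>();
        for (String u : urls) {
            out.add(u);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] urls = new String[10000];
        for (int i = 0; i < urls.length; i++) {
            urls[i] = "http://example.com/page" + (i % 1000);
        }
        // Both yield 1000 distinct URLs; only the cost differs.
        System.out.println(dedupList(urls).size() + " " + dedupSet(urls).size());
    }
}
```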
[
http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371142 ]
Doug Cutting commented on NUTCH-235:
The iterator shouldn't be a problem. When we're indexing we also dedup them by
domain, which is much more expensive than creating
[
http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371147 ]
Doug Cutting commented on NUTCH-235:
+1 This looks good. It will be a little slower for simple crawls, where each
link is only processed once, but probably not noticeably
Jérôme Charron wrote:
So, two solutions:
1. Keep java regexp ...
2. Switch to automaton and provide a java implementation of this regexp (it
is more a protection pattern than really a filter pattern, and it could
probably be hard-coded).
If it were easy to implement all java regex features in
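Jérôme's suggestion that a protection pattern "could probably be hard-coded" might look roughly like this (purely illustrative; the checks and limits are assumptions, not the actual Nutch pattern): reject overlong URLs and URLs whose path repeats the same segment too often, without invoking a regex engine at all.

```java
import java.util.HashMap;
import java.util.Map;

public class UrlProtection {
    /** Hard-coded stand-in for a regex "protection" pattern: reject
        overlong URLs and URLs whose path repeats the same segment too
        often (a common crawler-trap symptom). Limits are arbitrary. */
    static boolean accept(String url, int maxLength, int maxSameSegment) {
        if (url.length() > maxLength) {
            return false;                       // overlong URL
        }
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String seg : url.split("/")) {
            int c = counts.containsKey(seg) ? counts.get(seg) + 1 : 1;
            counts.put(seg, c);
            if (seg.length() > 0 && c > maxSameSegment) {
                return false;                   // looping path, e.g. /a/a/a/a
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/a/b/c.html", 256, 3));
        System.out.println(accept("http://example.com/a/a/a/a/a.html", 256, 3));
    }
}
```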
Stefan Groschupf wrote:
Instead I would suggest going a step further by adding a (configurable)
timeout mechanism and skipping bad records during reducing in general.
Processing such big data and losing all of it because of just one bad
record is very sad.
That's a good suggestion. Ideally we could use
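The timeout-and-skip idea Stefan describes can be sketched with a bounded worker call (names and structure are illustrative, not Nutch or Hadoop APIs): run each record on a separate thread, wait with a time budget, and drop the record instead of failing the whole job when the budget is exceeded.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SkipBadRecords {
    static final ExecutorService POOL = Executors.newSingleThreadExecutor();

    /** Process one record with a time budget; returns null if it timed out. */
    static String reduceWithTimeout(final String record, long timeoutMs) {
        Future<String> f = POOL.submit(new Callable<String>() {
            public String call() {
                if (record.contains("bad")) {
                    // simulate a record that hangs the reducer
                    while (!Thread.currentThread().isInterrupted()) { }
                    return null;
                }
                return record.toUpperCase();
            }
        });
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            f.cancel(true);   // skip the bad record instead of failing the job
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(reduceWithTimeout("good-record", 500));
        System.out.println(reduceWithTimeout("bad-record", 300));
        POOL.shutdownNow();
    }
}
```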
[
http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370381 ]
Doug Cutting commented on NUTCH-230:
Andrzej, that's true if we think links that are filtered are bad links, but if
we instead think of them as non-links then this fix
Andrzej Bialecki wrote:
When we used WebDB it was possible to overlap generate / fetch / update
cycles, because we would lock pages selected by FetchListTool for a
period of time.
Now we don't do this. The advantage is that we don't have to rewrite
CrawlDB. But operations on CrawlDB are
Jérôme Charron wrote:
It seems that the usage of AnalyzerFactory was removed while porting Indexer
to map/reduce.
(AnalyzerFactory is no more called in trunk code)
Is it intentional?
(if no, I have a patch that I can commit, so thanks to confirm)
It was not intentional. Thanks for fixing
Rod Taylor wrote:
First is to allow for cleaning up. This consists of a new option to
updatedb which can scrub the database of all URLs which no longer
match URLFilter settings (regex-urlfilter.txt). This allows a change in
the urlfilter to be reflected against Nutches current dataset,
Piotr Kosiorowski wrote:
I found an email from Doug with title [Fwd: Crawler submits forms?]
stating: This has been fixed in the mapred branch, but that patch is
not in 0.7.1. This alone might be a reason to make a 0.7.2 release.
I just want to make sure it was fixed by svn commit: r348533
[EMAIL PROTECTED] wrote:
Don't generate URLs that don't pass URLFilters.
Just to be clear, this is to support folks changing their filters while
they're crawling, right? We already filter before we put things into
the db, so we're filtering twice now, no? If so, then perhaps there
should
Andrzej Bialecki wrote:
Stefan Groschupf wrote:
I notice filtering urls is done in the output format until parsing.
Wouldn't it be better to filter it until updating crawlDb?
Until == during ?
As you observed, doing it at this stage saves space in segment data, and
in consequence saves on
Piotr Kosiorowski wrote:
It looks like Nutch web site was updated with site built from latest
trunk - the only problem is it contains tutorial for unreleased (yet)
version 0.8. I think we talked about it and agreed to keep tutorial for
latest release on the Web. I have just updated site in svn
Toby DiPasquale wrote:
I have a question about the MapReduce and NDFS implementations. When
writing records into an NDFS file, how does one make sure that records
terminate cleanly on block boundaries such that a Map job's input does not
span multiple physical blocks?
We do not currently
Jérôme Charron wrote:
It seems that NUTCH-143 patch has been commited too... is it intentional?
That was indeed a mistake. Thanks for catching it! I just reverted the
unintentional changes. Thanks also to:
http://svnbook.red-bean.com/en/1.0/ch04s04.html#svn-ch-4-sect-4.2
Doug
[
http://issues.apache.org/jira/browse/NUTCH-221?page=comments#action_12368779 ]
Doug Cutting commented on NUTCH-221:
+1 Thanks!
prepare nutch for upcoming lucene 2.0
-
Key: NUTCH-221
URL
[EMAIL PROTECTED] wrote:
Modified: lucene/nutch/trunk/src/plugin/analysis-de/build.xml
URL:
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/plugin/analysis-de/build.xml?rev=378655&r1=378654&r2=378655&view=diff
==
---
Jérôme Charron wrote:
It just ensures that the last modified core version is automatically compiled
while compiling a single plugin.
From my point of view the time for a whole build is not a problem.
If I just work on core, then I can use the fast compile-core target.
And if I just work on a
need DOAP file for Nutch
Key: NUTCH-218
URL: http://issues.apache.org/jira/browse/NUTCH-218
Project: Nutch
Type: Task
Reporter: Doug Cutting
Can someone please draft a DOAP file for Nutch, so that we're listed at
http
Andrzej Bialecki wrote:
* CrawlDBReducer (used by CrawlDB.update()) collects all CrawlDatum-s
from crawl_parse with the same URL, which means that we get:
* the original CrawlDatum
* (optionally a CrawlDatum that contains just a Signature)
* all CrawlDatum.LINKED entries pointing to
Nutch developer wrote:
What is the estimated date for a stable version of 0.8?
I'm hoping to have a stable release of Hadoop by April 15th. This
should substantially stabilize Nutch. So a 0.8 release of Nutch should
probably follow shortly thereafter.
By the way:
What are the criteria
[ http://issues.apache.org/jira/browse/NUTCH-216?page=all ]
Doug Cutting resolved NUTCH-216:
Fix Version: 0.8-dev
Resolution: Fixed
The reason 'exec' was used was to also restore file permissions, which 'untar'
does not. So I switched
Mike Smith wrote:
060219 142408 task_m_grycae Parent died. Exiting task_m_grycae
This means the child process, executing the task, was unable to ping its
parent process (the task tracker).
060219 142408 task_m_grycae Child Error
java.io.IOException: Task process exit with nonzero status.
Stefan Groschupf wrote:
do we still need the lib/jetty-ext/ jars? Since the jobtracker info
server is now part of hadoop, someone could delete them.
They're included with Nutch so that folks don't have to separately
download Hadoop. You should be able to simply download Nutch and run
Chris Schneider wrote:
My experience recently seeing attempted fetches of many ingrida.be URLs
made me question the Nutch 0.8 algorithm for partitioning URLs among
TaskTrackers (and their children processes). As I understand it, Nutch
doesn't worry about two lexically distinct domains (e.g.,
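A simplified sketch of host-based fetch partitioning (illustrative; not the actual Nutch 0.8 code): hash the URL's host name to pick a task, so all URLs of one host land in one partition. As Chris notes, two lexically distinct hosts such as `ingrida.be` and `www.ingrida.be` hash independently even when they resolve to the same server.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HostPartitioner {
    /** Assign a URL to one of numTasks partitions by hashing its
        lower-cased host name. All URLs on the same host get the same
        partition, so one task's politeness delays cover the whole host. */
    static int partition(String url, int numTasks) throws MalformedURLException {
        String host = new URL(url).getHost().toLowerCase();
        return (host.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    public static void main(String[] args) throws MalformedURLException {
        // Same host: same partition, regardless of path.
        System.out.println(partition("http://ingrida.be/a.html", 10));
        System.out.println(partition("http://ingrida.be/b/c.html", 10));
        // Lexically distinct host: partitioned independently.
        System.out.println(partition("http://www.ingrida.be/a.html", 10));
    }
}
```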
Jack Tang wrote:
In FetchedSegments class, below code shows how to get the hit summaries.
public String[] getSummary(HitDetails[] details, Query query)
throws IOException {
SummaryThread[] threads = new SummaryThread[details.length];
for (int i = 0; i < threads.length; i++) {
Jérôme Charron wrote:
Finally, the more I look at the ant code for plugins, the more I think we must
redesign it.
In the current ant scripts, each plugin is an ant project, so there is no way
to define ant dependencies between plugins.
(= if you compile a plugin A that depends on another one (B),
Gal Nitzan wrote:
I have implemented a down and dirty Global Locking:
[ ... ]
I changed FetcherThread constructor to create an instance of
SyncManager.
And in also in the run method I try to get a lock on the host. If not
successful I add the url into an ArrayList<key, datum> for a later
[ http://issues.apache.org/jira/browse/NUTCH-211?page=all ]
Doug Cutting resolved NUTCH-211:
Resolution: Fixed
I committed this, with a bunch of whitespace fixes.
FetchedSegments leave readers open
[
http://issues.apache.org/jira/browse/NUTCH-211?page=comments#action_12366505 ]
Doug Cutting commented on NUTCH-211:
The interfaces that FetchedSegments implements should have a close method.
Moreover, these interfaces should extend a Closeable
Andrzej Bialecki wrote:
(FYI: if you wonder how it was working before, the trick was to generate
just 1 split for the fetch job, which then lead to just one task being
created for any input fetchlist.
I don't think that's right. The generator uses setNumReduceTasks() to
the desired number
Jérôme Charron wrote:
Isn't it already reported in http://issues.apache.org/jira/browse/NUTCH-196?
Yes, you're right.
I have still provided a patch for a log4j lib.
If there is no objection, I will commit it and go ahead for
* lib-commons-httpclient
* lib-nekohtml
+1
Thanks!
Doug
There are a number of duplicated libs in the plugins, namely:
commons-httpclient-3.0-beta1.jar src/plugin/parse-rss/lib
commons-httpclient-3.0.jarsrc/plugin/protocol-httpclient/lib
log4j-1.2.11.jar src/plugin/clustering-carrot2/lib
log4j-1.2.6.jar 1
[ http://issues.apache.org/jira/browse/NUTCH-209?page=all ]
Doug Cutting resolved NUTCH-209:
Resolution: Fixed
I just committed this.
Michael, the 'bin/hadoop jar' command is not (yet) used by Nutch. Please file
a Hadoop bug to add the feature
[
http://issues.apache.org/jira/browse/NUTCH-209?page=comments#action_12365798 ]
Doug Cutting commented on NUTCH-209:
Andrzej, sorry, I didn't see your remark before I committed this!
A DFSClassLoader would have problems with plugins, since our plugin
[
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365618 ]
Doug Cutting commented on NUTCH-192:
Since these mappings are not something that users should alter, I'm not sure
they should be in the config file. I added related
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365619 ]
Doug Cutting commented on NUTCH-139:
+1 This looks great. Thanks for all the hard work on this one!
Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]
Doug Cutting updated NUTCH-192:
---
Attachment: (was: metadata08_02_06.patch)
meta data support for CrawlDatum
Key: NUTCH-192
URL: http
[
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365643 ]
Doug Cutting commented on NUTCH-192:
+1 This looks good to me. Thanks for your persistence.
meta data support for CrawlDatum
[
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365450 ]
Doug Cutting commented on NUTCH-192:
Sorry, I misspoke and overstated things too. There are problems, but not with
MapWritable, rather with WritableName: this refers
[
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365087 ]
Doug Cutting commented on NUTCH-193:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to
spell and pronounce, meaningless, and not used elsewhere
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365089 ]
Doug Cutting commented on NUTCH-139:
Jerome: yes, it makes sense, but there's also metadata that's not tightly
related to the protocol or the parser, e.g., the nutch
[
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365130 ]
Doug Cutting commented on NUTCH-193:
Okay, I've moved the code from Nutch to Hadoop. Now I need to repair Nutch so
that it still works!
One remaining problem is the need
[EMAIL PROTECTED] wrote:
URL: http://svn.apache.org/viewcvs?rev=374731&view=rev
Log:
removed unused imports
Sami,
I was in the middle of the process of fixing NUTCH-193 (moving things to
the new Hadoop project) when you made this commit. I merged in those
changes you made to things still in
[ http://issues.apache.org/jira/browse/NUTCH-193?page=all ]
Doug Cutting resolved NUTCH-193:
Resolution: Fixed
I just committed this. Phew!
move NDFS and MapReduce to a separate project
Doug Cutting (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-193?page=all ]
Doug Cutting resolved NUTCH-193:
Resolution: Fixed
I just committed this. Phew!
The major incompatibility I introduced with this was changing the
top
[ http://issues.apache.org/jira/browse/NUTCH-197?page=all ]
Doug Cutting resolved NUTCH-197:
Fix Version: 0.8-dev
Resolution: Fixed
I just committed this. Thanks, Owen!
NullPointerException in TaskRunner if application jar does not have lib
[
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364923 ]
Doug Cutting commented on NUTCH-192:
I'm worried that this will substantially slow things.
I'd like to see some effort made to ensure that:
1. If no metadata is used
move NDFS and MapReduce to a separate project
-
Key: NUTCH-193
URL: http://issues.apache.org/jira/browse/NUTCH-193
Project: Nutch
Type: Task
Components: ndfs
Versions: 0.8-dev
Reporter: Doug Cutting
[
http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364678 ]
Doug Cutting commented on NUTCH-191:
We've thus far avoided loading job-specific code in the JobTracker and
TaskTracker, in order to keep these more reliable. File
FYI
Original Message
Subject: NutchCVS/0.8-dev
Date: Mon, 30 Jan 2006 13:40:45 +0900 (JST)
From: [EMAIL PROTECTED]
Reply-To: nutch-agent@lucene.apache.org
To: nutch-agent@lucene.apache.org
Hi, I see that NutchCVS/0.8-dev is trying to crawl the
firecat.nihonsoft.org website,
[
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364690 ]
Doug Cutting commented on NUTCH-193:
Otis: yes, thanks, I meant org.apache.hadoop.dfs.
Andrzej: I'm awaiting Mike's commit of NUTCH-183, which should happen today.
I'll
Andrzej Bialecki wrote:
I wonder, would it be a good idea to replace the (rather wasteful)
4-byte ints with Lucene's variable-byte int encoding, in all places
where size matters?
I'm not sure there are that many places where it could make a big
difference.
* UTF8 (2-byte string length)
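The variable-byte encoding Andrzej mentions can be sketched as follows (this mirrors Lucene's VInt format: 7 data bits per byte, high bit set on all but the last byte; the code is a standalone illustration, not Nutch or Lucene source):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class VIntDemo {
    /** Write i in as few bytes as possible: 7 data bits per byte,
        continuation flag in the high bit. */
    static void writeVInt(DataOutput out, int i) throws IOException {
        while ((i & ~0x7F) != 0) {
            out.writeByte((byte) ((i & 0x7F) | 0x80));
            i >>>= 7;
        }
        out.writeByte((byte) i);
    }

    /** Read a value written by writeVInt. */
    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writeVInt(out, 127);    // fits in 1 byte instead of a fixed 4
        writeVInt(out, 16384);  // takes 3 bytes
        System.out.println(bytes.size());
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(readVInt(in) + " " + readVInt(in));
    }
}
```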
Andrzej Bialecki wrote:
Namely? I didn't notice any ... I think it's better to avoid bash-isms,
if we easily can. Not all the world looks like Linux. ;-)
IFS, at least. I tried running this on Solaris, where /bin/sh is not
bash, and it didn't work. It complained about unsetting IFS.
Doug
Andrzej Bialecki wrote:
Right, Solaris /bin/sh doesn't allow that... Hmm. Does this IFS
setting/unsetting work for you? I mean, I just tried it on Linux, using
the real Bash. I put the nutch distrib in a path containing spaces, and
I'm not able to run anything...
I initially added it to make
Rod Taylor wrote:
Please don't do that.
bash-2.05b$ ls /bin/bash
ls: /bin/bash: No such file or directory
bash-2.05b$ uname -a
FreeBSD home 6.0-RELEASE FreeBSD 6.0-RELEASE #13: Sat Nov 5
00:19:49 EST 2005 [EMAIL
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Doug Cutting updated NUTCH-139:
---
Attachment: (was: NUTCH-139.jc.review.patch.txt)
Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Doug Cutting updated NUTCH-139:
---
Attachment: (was: NUTCH-139.Mattmann.patch.txt)
Standard metadata property names in the ParseData metadata
Andrzej Bialecki wrote:
#!/usr/bin/env bash
+1
This works on Solaris, Linux & cygwin. Does it work on FreeBSD?
Doug
The Sourceforge archives are still there, just hard to find, e.g.:
http://sourceforge.net/mailarchive/forum.php?forum=nutch-developers
These lists are also archived at mail-archive.com:
http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/
Doug
Gordon Mohr (archive.org)
Gordon Mohr (archive.org) wrote:
Doug Cutting wrote:
The Sourceforge archives are still there, just hard to find, e.g.:
http://sourceforge.net/mailarchive/forum.php?forum=nutch-developers
When I visit that URL, I get:
# Permission Denied
#
# Access to this page is restricted (either
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364125 ]
Doug Cutting commented on NUTCH-139:
I think we're near agreement here.
Here are the changes I think this patch still needs:
MetadataNames belongs in the protocol package
John X wrote:
Please count me in.
Thanks, John.
I forgot to mention that I'd prefer a committer for this, and you're a
committer, so that works well!
Is there a timetable for it?
No, whenever you can get to it.
I'll make you an account and send you the details.
Doug
Andrzej Bialecki wrote:
Erhm.. please bear with me. I'd rather see these two classes in a
separate package altogether, org.apache.nutch.metadata. The reason is
that most likely these two classes will be used elsewhere too, not just
in the protocol and parse/fetch related context. I'm
Ken Krugler wrote:
It seems that the default behavior of Nutch when sorting links to fetch
is to use scoreByLinkCount. This then sets the fetch score for links on
a page to be the same as the containing page's in-bound link score (or
actually the log of same).
Please also see:
Howie Wang wrote:
1. A String[] HitDetails.getValues(String field) method that
returns an array of the values. The current only returns a
single string, and Lucene indexes can have multiple values
per field.
That sounds useful. Please submit a patch against the trunk attached to
a bug
Andy Liu wrote:
We're getting a lot of repeat questions in the mailing lists these
days. I think it's partly because people don't know of a way to
search the archives. The Mail Archive provides this:
http://www.mail-archive.com/index.php?hunt=nutch
Whoever maintains the
Would someone volunteer to develop Nutch-based site-search engine for
all apache.org domains? We now have a Solaris zone to host this.
Thanks,
Doug
[
http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363554 ]
Doug Cutting commented on NUTCH-183:
Byron, that's exactly what Mike means by speculative execution.
MapReduce has a series of problems concerning task-allocation
[ http://issues.apache.org/jira/browse/NUTCH-179?page=all ]
Doug Cutting closed NUTCH-179:
--
Resolution: Invalid
Closed at submitter's request.
Proposition: Enable Nutch to use a parser plugin not just based on content
type
[ http://issues.apache.org/jira/browse/NUTCH-177?page=all ]
Doug Cutting resolved NUTCH-177:
Fix Version: 0.8-dev
Resolution: Fixed
The problem is that your seed url does not end in a slash, yet your url filter
requires a slash. In 0.8-dev
[ http://issues.apache.org/jira/browse/NUTCH-176?page=all ]
Doug Cutting resolved NUTCH-176:
Resolution: Won't Fix
This check is intentionally made to prevent folks from accidentally overwriting
crawls.
Using -dir: creates an error, when
Andrzej Bialecki wrote:
In the 0.7 branch, whenever a segment was generated the WebDB was
modified, so that the entries that ended up in the fetchlist wouldn't be
immediately available to the next segment generation, if that happened
before the WebDB was updated with the data from that first
[
http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363308 ]
Doug Cutting commented on NUTCH-136:
The mapred-default.xml file is actually the best place to set these.
mapreduce segment generator generates 50% less than expected
[
http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12363309 ]
Doug Cutting commented on NUTCH-173:
Couldn't you instead use a prefix-urlfilter generated from your crawl seed?
PerHost Crawling Policy ( crawl.ignore.external.links
[ http://issues.apache.org/jira/browse/NUTCH-102?page=all ]
Doug Cutting resolved NUTCH-102:
Resolution: Fixed
I just applied this patch. Thanks, Owen.
jobtracker does not start when webapps is in src
Stefan Groschupf wrote:
Did I miss something in general to be able to support non required
terms in nutch?
I left OR and nesting out of the API to simplify what query filters have
to process. Nutch's query features are approximately what Google
supported for its first three years. (Google
Matt Zytaruk wrote:
Exception in thread main java.io.IOException: Not a file:
/user/nutch/segments/20060107130328/parse_data/part-0/data
at org.apache.nutch.ipc.Client.call(Client.java:294)
This is an error returned from an RPC call. There should be more
details about this in a
[
http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362507 ]
Doug Cutting commented on NUTCH-171:
I'd like to hear more about why you want multiple segments, what's motivating
this patch. The 0.7 -numFetchers parameter was designed
Andrew McNabb wrote:
On Mon, Jan 09, 2006 at 05:00:00PM -0800, Doug Cutting wrote:
To read sequence files directly outside of MapReduce, just use
SequenceFile directly, e.g., something like:
MyKey key = new MyKey();
MyValue value = new MyValue();
SequenceFile.Reader reader =
new SequenceFile.Reader(fs, file, conf);
while (reader.next(key, value)) {
... // use key and value
}
reader.close();
Gal Nitzan wrote:
I traced it to ParseData line 147.
UTF8.writeString(out, (String) e.getKey());
UTF8.writeString(out, (String) e.getValue());
it seems that the Set-Cookie key comes with an ArrayList value?
I think that was fixed yesterday by Andrzej.
[EMAIL PROTECTED] wrote:
--- lucene/nutch/trunk/src/plugin/build.xml (original)
+++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006
@@ -6,13 +6,14 @@
<!-- Build & deploy all the plugin jars. -->
<!-- ====================================== -->
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362242 ]
Doug Cutting commented on NUTCH-139:
We can just use different names, rather than two metaData objects: X-nutch
names for derived or other values that are usually protocol
Andrew McNabb wrote:
I'm looking at the Reporter interface, and I would like to verify my
understanding of what it is. It appears to me that Reporter.setStatus()
is called periodically during an operation to give a human-readable
description of how far the progress is so far. Is that correct?
Stefan Groschupf wrote:
in nutch 0.8 the index is not in the segment folder any more.
What was the reason for that? in the context of a web gui it would be
may be better to have the index also in the segment folder, since the
segment folder would be the single item to manage a life-cycle,
Andrew McNabb wrote:
One of the great things about open source is that projects can be used
for unintended purposes. In fact, Nutch works well for parallel
computing in general, not just for web indexing. Apparently Google has
thousands of projects that use MapReduce.
The plan is to move
Andrew McNabb wrote:
SequenceFileInputFormat inputformat = new SequenceFileInputFormat();
RecordReader in = inputformat.getRecordReader(fshandle, split[i], logjob,
nullreporter);
To read sequence files directly outside of MapReduce, just use
SequenceFile directly, e.g., something like:
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Doug Cutting updated NUTCH-139:
---
Comment: was deleted
Standard metadata property names in the ParseData metadata
--
Key: NUTCH
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Doug Cutting updated NUTCH-139:
---
Comment: was deleted
Standard metadata property names in the ParseData metadata
--
Key: NUTCH
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Doug Cutting updated NUTCH-139:
---
Comment: was deleted
Standard metadata property names in the ParseData metadata
--
Key: NUTCH
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361994 ]
Doug Cutting commented on NUTCH-139:
Jerome,
Some HTTP headers have multiple values. Correctly reflecting that was, I
thought, the primary motivation for adding multiple
Ken Krugler wrote:
I'm wondering whether it would also make sense to remove anchor text
from URLs. For example, currently these two URLs are treated as different:
http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex
and
http://www.dina.kvl.dk/~sestoft/gcsharp/index.html
Is it
Stefan Groschupf wrote:
Different parameters are sent to each address. So params.length
should equal addresses.length, and if params.length==0 then
addresses.length==0 and there's no call to be made. Make sense? It
might be clearer if the test were changed to addresses.length==0.
Yes,
[
http://issues.apache.org/jira/browse/NUTCH-160?page=comments#action_12361999 ]
Doug Cutting commented on NUTCH-160:
+1
I like this patch. I don't see a need for us to use oro anywhere, since Java
now has good builtin regex support. And Java's
Andrzej Bialecki wrote:
For efficiency reasons, most of this information is stored and passed to
processing jobs inside instances of CrawlDatum - for the key step of DB
update any other parts of segments (such as Content, ParseData or
ParseText) are not used, which prevents easy access to