[
http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364773 ]
Owen O'Malley commented on NUTCH-191:
-
I would schedule the getSplits task and when it completed, I would schedule the
map jobs. It would be pretty parallel to the way the
Hi developers,
some people are already in the process of writing a web-based
administration interface for Nutch.
The goal is to help newbies get started with Nutch faster and more easily.
I have written up our plans so you can get an idea of what we are working
on.
http://wiki.apache.org/nutch/NutchAdmi
[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]
Stefan Groschupf updated NUTCH-192:
---
Attachment: metadata310106.patch
Now 1 byte for the class type and the size of the type itself, this means we
can have only 2 byte keys and 2 byte values
RPC call times out while indexing map task is computing splits
--
Key: NUTCH-195
URL: http://issues.apache.org/jira/browse/NUTCH-195
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
[
http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364743 ]
Bryan Pendleton commented on NUTCH-191:
---
I think the reason to keep getSplits() in the jobtracker, is because the result
of getSplits() determines the actual number of ma
[
http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364739 ]
Owen O'Malley commented on NUTCH-191:
-
Wouldn't it be appropriate to make input splitting into a task, so that
getSplits could be run by the TaskTrackerChild? That way the
Andrzej Bialecki wrote:
I wonder, would it be a good idea to replace the (rather wasteful)
4-byte ints with Lucene's variable-byte int encoding, in all places
where size matters?
I'm not sure there are that many places where it could make a big
difference.
* UTF8 (2-byte string length)
C
[ http://issues.apache.org/jira/browse/NUTCH-194?page=all ]
Marko Bauhardt updated NUTCH-194:
-
Attachment: NutchConf.371869.patch
This patch fixes the problems described above.
> Nutch-169 introduced two tiny bugs
> --
>
>
Nutch-169 introduced two tiny bugs
--
Key: NUTCH-194
URL: http://issues.apache.org/jira/browse/NUTCH-194
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.8-dev
Reporter: Marko Bauhardt
Priority: Blocker
1
+1 :-)
On 31.01.2006 at 22:06, Andrzej Bialecki wrote:
Hi,
I wonder, would it be a good idea to replace the (rather wasteful)
4-byte ints with Lucene's variable-byte int encoding, in all places
where size matters? We could "borrow" the code from Lucene and
create a VIntWritable for this
Hi,
I wonder, would it be a good idea to replace the (rather wasteful)
4-byte ints with Lucene's variable-byte int encoding, in all places
where size matters? We could "borrow" the code from Lucene and create a
VIntWritable for this purpose. I'm thinking specifically about the
following place
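For readers unfamiliar with the encoding discussed in this thread, here is a minimal sketch of a Lucene-style variable-byte int: 7 data bits per byte, low-order group first, with the high bit set on every byte except the last. The class name VIntSketch and the helpers encode/decode are hypothetical names for illustration only; this is neither the proposed VIntWritable nor code taken from Lucene or Nutch.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of a variable-byte int encoding.
// Small values take 1 byte instead of a fixed 4 bytes.
public class VIntSketch {

    // Encode an int as 1 to 5 bytes: 7 data bits per byte, low-order
    // group first; the high bit marks "more bytes follow".
    public static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift, so negative inputs terminate too
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode a value produced by encode(), reading from the start of the array.
    public static int decode(byte[] bytes) {
        int pos = 0;
        byte b = bytes[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 300, 16383, 16384, Integer.MAX_VALUE};
        for (int v : samples) {
            byte[] enc = encode(v);
            System.out.println(v + " -> " + enc.length + " byte(s), decoded " + decode(enc));
        }
    }
}
```

The trade-off in this sketch is that values below 128 need one byte and values below 16384 need two, while large or negative values need five bytes, one more than a fixed 4-byte int. Whether that pays off depends on the value distribution in the places where size matters.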
[
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364699 ]
Stefan Groschupf commented on NUTCH-192:
* plus whatever it takes to put the class name->id mapping in the MapWritable
header (the mapping table): let's assume 40 bytes
[
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364694 ]
Andrzej Bialecki commented on NUTCH-192:
-
What I meant was that both keys and values should be Strings (or rather UTF8),
for the sake of simplicity. Let's take your ex
[
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364690 ]
Doug Cutting commented on NUTCH-193:
Otis: yes, thanks, I meant org.apache.hadoop.dfs.
Andrzej: I'm awaiting Mike's commit of NUTCH-183, which should happen today.
I'll t
Thanks for the clarification, I missed all these cross links!
You definitely 'are in the know'. :-)
Stefan
On 31.01.2006 at 20:31, Doug Cutting wrote:
Stefan Groschupf wrote:
The call CrawlDb.createJob(...) creates the crawl db update job.
In this method the main input folder is defined:
Stefan Groschupf wrote:
The call CrawlDb.createJob(...) creates the crawl db update job. In
this method the main input folder is defined:
job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
However in the update method (line 48, 49) two more input dirs are added.
This confuses me sin
FYI
Original Message
Subject: NutchCVS/0.8-dev
Date: Mon, 30 Jan 2006 13:40:45 +0900 (JST)
From: [EMAIL PROTECTED]
Reply-To: nutch-agent@lucene.apache.org
To: nutch-agent@lucene.apache.org
Hi, I see that NutchCVS/0.8-dev is trying to crawl the
firecat.nihonsoft.org website, but
[
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364683 ]
Stefan Groschupf commented on NUTCH-192:
Andrzej, Doug. I'm not sure if I understand you correctly: do you suggest having
string keys and values, or just string keys?
It
[
http://issues.apache.org/jira/browse/NUTCH-44?page=comments#action_12364679 ]
Sami Siren commented on NUTCH-44:
-
Byron, have you made any progress with this?
> too many search results
> ---
>
> Key: NUTCH-44
> URL: ht
[
http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364678 ]
Doug Cutting commented on NUTCH-191:
We've thus far avoided loading job-specific code in the JobTracker and
TaskTracker, in order to keep these more reliable. File splitti
[
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364674 ]
Doug Cutting commented on NUTCH-192:
I agree that Writable is probably overkill, that strings should be sufficient.
A mapping dictionary would save a lot of space, even wit
[
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364672 ]
Andrzej Bialecki commented on NUTCH-193:
-
Ok, the sooner the better from my POV. I didn't have anything in mind that
would be included in Hadoop, rather Nutch patches
[
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364669 ]
Otis Gospodnetic commented on NUTCH-193:
I assume Doug meant org.apache.hadoop.dfs, not org.apache.nutch.dfs.
> move NDFS and MapReduce to a separate project
>
Andrzej Bialecki (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544 ]
Andrzej Bialecki commented on NUTCH-169:
-
This patch looks good! If there are no further objections, I'll test it and
commit it within
[
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364665 ]
Doug Cutting commented on NUTCH-193:
Andrzej: I'd like to do this soon, this week or next. No matter how long I
wait, there will probably always be a few patches queued th
Sami Siren wrote:
Andrzej Bialecki (JIRA) wrote:
[
http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544
]
Andrzej Bialecki commented on NUTCH-169:
-
This patch looks good! If there are no further objections, I'll test
it an
Well, it was at least the best way we had seen, since NutchConfigured
requires implementing a constructor that in most cases would be unused as
well, since most classes are instantiated via class.newInstance().
So both solutions were optimal, and we decided on the interface solution.
I'm pretty sure this
[
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364663 ]
Sami Siren commented on NUTCH-193:
--
+1
I guess the fuse-j - ndfs work from John/me could be part of hadoop /contrib
after this change?
> move NDFS and MapReduce to a separa
Andrzej Bialecki (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544 ]
Andrzej Bialecki commented on NUTCH-169:
-
This patch looks good! If there are no further objections, I'll test it and
commit it within
[
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364657 ]
Doug Cutting commented on NUTCH-193:
NDFS, the Nutch Distributed Filesystem will be renamed HDFS, the Hadoop
Distributed Filesystem. Its code will live in the package
org
[
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364662 ]
Andrzej Bialecki commented on NUTCH-193:
-
What timeframe did you have in mind? There are a few patches in the queue,
which will be affected by this split.
Other than
move NDFS and MapReduce to a separate project
-
Key: NUTCH-193
URL: http://issues.apache.org/jira/browse/NUTCH-193
Project: Nutch
Type: Task
Components: ndfs
Versions: 0.8-dev
Reporter: Doug Cutting
Assigne
Hi Michael,
this question should be asked in the nutch-users list.
Take a look at a thread: So many Unfetched Pages using MapReduce
G.
On Tue, 2006-01-31 at 15:52 +0100, Michael Nebel wrote:
> Hi,
>
> Over the last few days I gave the mapred-branch a try and I was impressed!
>
> But I still have a pro
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ]
Andrzej Bialecki closed NUTCH-169:
---
Resolution: Fixed
Patches applied, with some changes (mostly whitespace related). Thank you!
> remove static NutchConf
> ---
>
>
Byron Miller wrote:
Has indexsorter code discussed a while back been
pushed to jira or put in SVN? I'd like to give it a
whirl on some of my indexes, but the archive I can find
cut off the post with the code attached.
It's committed to trunk/ . It works very well, if you have good
differentiatio
Hi,
Over the last few days I gave the mapred-branch a try and I was impressed!
But I still have a problem with the incremental crawling. My setup: I
have 4 boxes (1x namenode/jobtracker - 3x datanode/tasktracker). Running
one round of "crawling" consists of the steps:
- generate (I set a limit of
Has indexsorter code discussed a while back been
pushed to jira or put in SVN? I'd like to give it a
whirl on some of my indexes, but the archive I can find
cut off the post with the code attached.
[
http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544 ]
Andrzej Bialecki commented on NUTCH-169:
-
This patch looks good! If there are no further objections, I'll test it and
commit it within the next 12 hours.
> remove sta
[
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364542 ]
Andrzej Bialecki commented on NUTCH-192:
-
I have two comments:
* it's not obvious to me what are the strong arguments in favor of storing
Writables. I'd think that fo