[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364542 ] Andrzej Bialecki commented on NUTCH-192: I have two comments: * it's not obvious to me what are the strong arguments in favor of storing Writables. I'd think that

[jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544 ] Andrzej Bialecki commented on NUTCH-169: This patch looks good! If there are no further objections, I'll test it and commit it within the next 12 hours. remove

indexSorter - applied to SVN or patch in Jira?

2006-01-31 Thread Byron Miller
Has the indexsorter code discussed a while back been pushed to Jira or put in SVN? I'd like to give it a whirl on some of my indexes, but the archive I can find cut off the post with the code attached.

mapred: config parameters

2006-01-31 Thread Michael Nebel
Hi, over the last few days I gave the mapred branch a try and I was impressed! But I still have a problem with incremental crawling. My setup: I have 4 boxes (1x namenode/jobtracker - 3x datanode/tasktracker). Running one round of crawling consists of the steps: - generate (I set a limit of

[jira] Created: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
move NDFS and MapReduce to a separate project Key: NUTCH-193 URL: http://issues.apache.org/jira/browse/NUTCH-193 Project: Nutch Type: Task Components: ndfs Versions: 0.8-dev Reporter: Doug Cutting

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364662 ] Andrzej Bialecki commented on NUTCH-193: What timeframe did you have in mind? There are a few patches in the queue, which will be affected by this split. Other than

Re: [jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-31 Thread Sami Siren
Andrzej Bialecki (JIRA) wrote: [ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544 ] Andrzej Bialecki commented on NUTCH-169: This patch looks good! If there are no further objections, I'll test it and commit it within

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364663 ] Sami Siren commented on NUTCH-193: +1 I guess the fuse-j - ndfs work from John/me could be part of hadoop /contrib after this change? move NDFS and MapReduce to a

Re: [jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-31 Thread Stefan Groschupf
Well, it was at least the best way we had seen, since NutchConfigured requires implementing a constructor that in most cases was unused as well, since most classes are instantiated via class.newInstance(). So both solutions were optimal, and we decided on the interface solution. I'm pretty sure

Re: [jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-31 Thread Andrzej Bialecki
Sami Siren wrote: Andrzej Bialecki (JIRA) wrote: [ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544 ] Andrzej Bialecki commented on NUTCH-169: This patch looks good! If there are no further objections, I'll test it

[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

2006-01-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364678 ] Doug Cutting commented on NUTCH-191: We've thus far avoided loading job-specific code in the JobTracker and TaskTracker, in order to keep these more reliable. File

[jira] Commented: (NUTCH-44) too many search results

2006-01-31 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-44?page=comments#action_12364679 ] Sami Siren commented on NUTCH-44: Byron, have you made any progress with this? too many search results Key: NUTCH-44 URL:

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364683 ] Stefan Groschupf commented on NUTCH-192: Andrzej, Doug. I'm not sure I understand you correctly: do you suggest having string keys and values, or just string keys?

[Fwd: NutchCVS/0.8-dev]

2006-01-31 Thread Doug Cutting
FYI Original Message Subject: NutchCVS/0.8-dev Date: Mon, 30 Jan 2006 13:40:45 +0900 (JST) From: [EMAIL PROTECTED] Reply-To: nutch-agent@lucene.apache.org To: nutch-agent@lucene.apache.org Hi, I see that NutchCVS/0.8-dev is trying to crawl the firecat.nihonsoft.org website,

Re: CrawlDb and inputDir's

2006-01-31 Thread Stefan Groschupf
Thanks for the clarification, I missed all these cross links! You definitely 'are in the know'. :-) Stefan On 31.01.2006 at 20:31, Doug Cutting wrote: Stefan Groschupf wrote: The call CrawlDb.createJob(...) creates the crawl db update job. In this method the main input folder is defined:

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364690 ] Doug Cutting commented on NUTCH-193: Otis: yes, thanks, I meant org.apache.hadoop.dfs. Andrzej: I'm awaiting Mike's commit of NUTCH-183, which should happen today. I'll

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364694 ] Andrzej Bialecki commented on NUTCH-192: What I meant was that both keys and values should be Strings (or rather UTF8), for the sake of simplicity. Let's take your

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364699 ] Stefan Groschupf commented on NUTCH-192: * plus whatever it takes to put the class name-id mapping in the MapWritable header (the mapping table): let's assume 40

[jira] Updated: (NUTCH-194) Nutch-169 introduced two tiny bugs

2006-01-31 Thread Marko Bauhardt (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-194?page=all ] Marko Bauhardt updated NUTCH-194: Attachment: NutchConf.371869.patch This patch fixes the problems described above. Nutch-169 introduced two tiny bugs

Re: Lucene's VInt for lengths/counts/sizes

2006-01-31 Thread Doug Cutting
Andrzej Bialecki wrote: I wonder, would it be a good idea to replace the (rather wasteful) 4-byte ints with Lucene's variable-byte int encoding, in all places where size matters? I'm not sure there are that many places where it could make a big difference. * UTF8 (2-byte string length)
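For readers unfamiliar with the variable-byte encoding mentioned in this thread, here is a minimal sketch of the general technique: seven payload bits per byte, low-order group first, with the high bit set on every byte except the last. It is written in Python for brevity and is not the Lucene or Nutch code; the function names `write_vint` and `read_vint` are illustrative assumptions.

```python
from typing import Tuple


def write_vint(value: int) -> bytes:
    """Encode a non-negative int using 7 bits per byte, low-order group first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    # While more than 7 bits remain, emit the low 7 bits with the
    # continuation flag (high bit) set, then shift them away.
    while value & ~0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte: high bit clear
    return bytes(out)


def read_vint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one value starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear: this was the last byte
            return result, pos
        shift += 7
```

Under this scheme, values below 128 take one byte and values below 16384 take two, compared with a fixed four bytes for a plain int. Whether that saving matters depends on how many small lengths and counts are actually written.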

[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

2006-01-31 Thread Owen O'Malley (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364739 ] Owen O'Malley commented on NUTCH-191: Wouldn't it be appropriate to make input splitting into a task, so that getSplits could be run by the TaskTrackerChild? That way the

[jira] Created: (NUTCH-195) RPC call times out while indexing map task is computing splits

2006-01-31 Thread Chris Schneider (JIRA)
RPC call times out while indexing map task is computing splits Key: NUTCH-195 URL: http://issues.apache.org/jira/browse/NUTCH-195 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev

[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ] Stefan Groschupf updated NUTCH-192: Attachment: metadata310106.patch Now 1 byte for the class type and the size of the type itself, this means we can have only 2 byte keys and 2 byte values

Nutch Administration Interface

2006-01-31 Thread Stefan Groschupf
Hi developers, some people are already in the process of writing a web-based administration interface for Nutch. The goal is to help newbies get started with Nutch faster and more easily. I wrote up our plans so you can get an idea of what we are working on.