[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

2006-01-31 Thread Owen O'Malley (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364773 ] Owen O'Malley commented on NUTCH-191: - I would schedule the getSplits task and when it completed, I would schedule the map jobs. It would be pretty parallel to the way the

Nutch Administration Interface

2006-01-31 Thread Stefan Groschupf
Hi developers, some people are already in the process of writing a web-based administration interface for Nutch. The goal is to help newcomers get started with Nutch faster and more easily. I have written up our plans so you can get an idea of what we are working on. http://wiki.apache.org/nutch/NutchAdmi

[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ] Stefan Groschupf updated NUTCH-192: --- Attachment: metadata310106.patch Now 1 byte for the class type and the size of the type itself, this means we can have only 2 byte keys and 2 byte values

[jira] Created: (NUTCH-195) RPC call times out while indexing map task is computing splits

2006-01-31 Thread Chris Schneider (JIRA)
RPC call times out while indexing map task is computing splits -- Key: NUTCH-195 URL: http://issues.apache.org/jira/browse/NUTCH-195 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev

[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

2006-01-31 Thread Bryan Pendleton (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364743 ] Bryan Pendleton commented on NUTCH-191: --- I think the reason to keep getSplits() in the jobtracker is that the result of getSplits() determines the actual number of ma

[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

2006-01-31 Thread Owen O'Malley (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364739 ] Owen O'Malley commented on NUTCH-191: - Wouldn't it be appropriate to make input splitting into a task, so that getSplits could be run by the TaskTrackerChild? That way the

Re: Lucene's VInt for lengths/counts/sizes

2006-01-31 Thread Doug Cutting
Andrzej Bialecki wrote: I wonder, would it be a good idea to replace the (rather wasteful) 4-byte ints with Lucene's variable-byte int encoding, in all places where size matters? I'm not sure there are that many places where it could make a big difference. * UTF8 (2-byte string length) C

[jira] Updated: (NUTCH-194) Nutch-169 introduced two tiny bugs

2006-01-31 Thread Marko Bauhardt (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-194?page=all ] Marko Bauhardt updated NUTCH-194: - Attachment: NutchConf.371869.patch This patch fixes the problems described above. > Nutch-169 introduced two tiny bugs > -- > >

[jira] Created: (NUTCH-194) Nutch-169 introduced two tiny bugs

2006-01-31 Thread Marko Bauhardt (JIRA)
Nutch-169 introduced two tiny bugs -- Key: NUTCH-194 URL: http://issues.apache.org/jira/browse/NUTCH-194 Project: Nutch Type: Bug Components: searcher Versions: 0.8-dev Reporter: Marko Bauhardt Priority: Blocker 1

Re: Lucene's VInt for lengths/counts/sizes

2006-01-31 Thread Stefan Groschupf
+1 :-) On 31.01.2006 at 22:06, Andrzej Bialecki wrote: Hi, I wonder, would it be a good idea to replace the (rather wasteful) 4-byte ints with Lucene's variable-byte int encoding, in all places where size matters? We could "borrow" the code from Lucene and create a VIntWritable for this

Lucene's VInt for lengths/counts/sizes

2006-01-31 Thread Andrzej Bialecki
Hi, I wonder, would it be a good idea to replace the (rather wasteful) 4-byte ints with Lucene's variable-byte int encoding, in all places where size matters? We could "borrow" the code from Lucene and create a VIntWritable for this purpose. I'm thinking specifically about the following place
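
For readers unfamiliar with the encoding under discussion, here is a minimal sketch of a variable-byte int in the style of Lucene's VInt: seven data bits per byte, low-order group first, with the high bit set when more bytes follow. The class name VIntSketch and its methods are hypothetical and for illustration only; this is neither Lucene's code nor the VIntWritable proposed in the thread.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; not part of Nutch or Lucene.
public class VIntSketch {

    // Encode an int as 1-5 bytes: 7 data bits per byte, low-order group
    // first, high bit set on every byte except the last.
    public static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    // Decode a value produced by encode().
    public static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 16383, 16384, 1000000};
        for (int v : samples) {
            byte[] enc = encode(v);
            System.out.println(v + " -> " + enc.length + " byte(s), decoded " + decode(enc));
        }
    }
}
```

Under this scheme values below 128 take one byte and values below 16384 take two, which is where the saving over a fixed 4-byte int would come from for small lengths and counts; negative values would take the full five bytes.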

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364699 ] Stefan Groschupf commented on NUTCH-192: * plus whatever it takes to put the class name->id mapping in the MapWritable header (the mapping table): let's assume 40 bytes

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364694 ] Andrzej Bialecki commented on NUTCH-192: - What I meant was that both keys and values should be Strings (or rather UTF8), for the sake of simplicity. Let's take your ex

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364690 ] Doug Cutting commented on NUTCH-193: Otis: yes, thanks, I meant org.apache.hadoop.dfs. Andrzej: I'm awaiting Mike's commit of NUTCH-183, which should happen today. I'll t

Re: CrawlDb and inputDir's

2006-01-31 Thread Stefan Groschupf
Thanks for the clarification, I missed all these cross links! You definitely 'are in the know'. :-) Stefan On 31.01.2006 at 20:31, Doug Cutting wrote: Stefan Groschupf wrote: The call CrawlDb.createJob(...) creates the crawl db update job. In this method the main input folder is defined:

Re: CrawlDb and inputDir's

2006-01-31 Thread Doug Cutting
Stefan Groschupf wrote: The call CrawlDb.createJob(...) creates the crawl db update job. In this method the main input folder is defined: job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME)); However in the update method (line 48, 49) two more input dirs are added. This confuses me sin

[Fwd: NutchCVS/0.8-dev]

2006-01-31 Thread Doug Cutting
FYI Original Message Subject: NutchCVS/0.8-dev Date: Mon, 30 Jan 2006 13:40:45 +0900 (JST) From: [EMAIL PROTECTED] Reply-To: nutch-agent@lucene.apache.org To: nutch-agent@lucene.apache.org Hi, I see that NutchCVS/0.8-dev is trying to crawl the firecat.nihonsoft.org website, but

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364683 ] Stefan Groschupf commented on NUTCH-192: Andrzej, Doug. I'm not sure I understand you correctly: do you suggest having string keys and values, or just string keys? It

[jira] Commented: (NUTCH-44) too many search results

2006-01-31 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-44?page=comments#action_12364679 ] Sami Siren commented on NUTCH-44: - Byron, have you made any progress with this? > too many search results > --- > > Key: NUTCH-44 > URL: ht

[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

2006-01-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364678 ] Doug Cutting commented on NUTCH-191: We've thus far avoided loading job-specific code in the JobTracker and TaskTracker, in order to keep these more reliable. File splitti

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364674 ] Doug Cutting commented on NUTCH-192: I agree that Writable is probably overkill, that strings should be sufficient. A mapping dictionary would save a lot of space, even wit

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364672 ] Andrzej Bialecki commented on NUTCH-193: - Ok, the sooner the better from my POV. I didn't have anything in mind that would be included in Hadoop, rather Nutch patches

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364669 ] Otis Gospodnetic commented on NUTCH-193: I assume Doug meant org.apache.hadoop.dfs, not org.apache.nutch.dfs. > move NDFS and MapReduce to a separate project >

Re: [jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-31 Thread Sami Siren
Andrzej Bialecki (JIRA) wrote: [ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544 ] Andrzej Bialecki commented on NUTCH-169: - This patch looks good! If there are no further objections, I'll test it and commit it within

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364665 ] Doug Cutting commented on NUTCH-193: Andrzej: I'd like to do this soon, this week or next. No matter how long I wait, there will probably always be a few patches queued th

Re: [jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-31 Thread Andrzej Bialecki
Sami Siren wrote: Andrzej Bialecki (JIRA) wrote: [ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544 ] Andrzej Bialecki commented on NUTCH-169: - This patch looks good! If there are no further objections, I'll test it an

Re: [jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-31 Thread Stefan Groschupf
Well, it was at least the best way we had seen, since NutchConfigured requires implementing a constructor that in most cases was unused as well, since most classes are instantiated via class.newInstance(). So both solutions were optimal, and we decided on the interface solution. I'm pretty sure this

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364663 ] Sami Siren commented on NUTCH-193: -- +1 I guess the fuse-j - ndfs work from John/me could be part of hadoop /contrib after this change? > move NDFS and MapReduce to a separa

Re: [jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-31 Thread Sami Siren
Andrzej Bialecki (JIRA) wrote: [ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544 ] Andrzej Bialecki commented on NUTCH-169: - This patch looks good! If there are no further objections, I'll test it and commit it within

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364657 ] Doug Cutting commented on NUTCH-193: NDFS, the Nutch Distributed Filesystem will be renamed HDFS, the Hadoop Distributed Filesystem. Its code will live in the package org

[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364662 ] Andrzej Bialecki commented on NUTCH-193: - What timeframe did you have in mind? There are a few patches in the queue, which will be affected by this split. Other than

[jira] Created: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
move NDFS and MapReduce to a separate project - Key: NUTCH-193 URL: http://issues.apache.org/jira/browse/NUTCH-193 Project: Nutch Type: Task Components: ndfs Versions: 0.8-dev Reporter: Doug Cutting Assigne

Re: mapred: config parameters

2006-01-31 Thread Gal Nitzan
Hi Michael, this question should be asked in the nutch-users list. Take a look at a thread: So many Unfetched Pages using MapReduce G. On Tue, 2006-01-31 at 15:52 +0100, Michael Nebel wrote: > Hi, > > the last days I gave the mapred-branch a try and I was impressed! > > But I still have a pro

[jira] Closed: (NUTCH-169) remove static NutchConf

2006-01-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Andrzej Bialecki closed NUTCH-169: --- Resolution: Fixed Patches applied, with some changes (mostly whitespace related). Thank you! > remove static NutchConf > --- > >

Re: indexSorter - applied to SVN or patch in Jira?

2006-01-31 Thread Andrzej Bialecki
Byron Miller wrote: Has the indexsorter code discussed a while back been pushed to Jira or put in SVN? I'd like to give it a whirl on some of my indexes, and the archive I can find cut off the post with the code attached. It's committed to trunk/ . It works very well, if you have good differentiatio

mapred: config parameters

2006-01-31 Thread Michael Nebel
Hi, over the last few days I gave the mapred-branch a try and I was impressed! But I still have a problem with the incremental crawling. My setup: I have 4 boxes (1x namenode/jobtracker - 3x datanode/tasktracker). Running one round of "crawling" consists of these steps: - generate (I set a limit of

indexSorter - applied to SVN or patch in Jira?

2006-01-31 Thread Byron Miller
Has the indexsorter code discussed a while back been pushed to Jira or put in SVN? I'd like to give it a whirl on some of my indexes, and the archive I can find cut off the post with the code attached.

[jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544 ] Andrzej Bialecki commented on NUTCH-169: - This patch looks good! If there are no further objections, I'll test it and commit it within the next 12 hours. > remove sta

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364542 ] Andrzej Bialecki commented on NUTCH-192: - I have two comments: * it's not obvious to me what are the strong arguments in favor of storing Writables. I'd think that fo