Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-03-02 Thread Todd Lipcon
Hey guys, Sorry for the non-responsiveness here. I recently left my old employment and have been packing for a cross-country move. I agree that for 1.0 the best bet is what Sami has done. The code that I was working on is available here: http://github.com/toddlipcon/nutch/tree/nutch-669 But it

[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly

2009-01-21 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665855#action_12665855 ] Todd Lipcon commented on NUTCH-676: --- Have you run some full crawls yet? I wrote pretty

[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly

2009-01-21 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665889#action_12665889 ] Todd Lipcon commented on NUTCH-676: --- Hmm, I can't seem to find the bug I thought I

Re: nutch segment format

2009-01-05 Thread Todd Lipcon
Hi Matt, The Nutch segments are stored as Hadoop SequenceFiles and MapFiles. A MapFile is made up of multiple SequenceFiles. I'm not certain if the format is documented anywhere, but the source is in org.apache.hadoop.io. I doubt you'll find a PHP library for reading them, so you'll probably have
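
For anyone looking at segment files from outside Java, a reasonable first step is to confirm that a file really is a SequenceFile. The sketch below is in Python and is not part of the original message. It assumes only that a SequenceFile begins with the three bytes SEQ followed by a one-byte format version; the function name is hypothetical, and nothing beyond the first four bytes (keys, values, compression) is parsed.

```python
import io

# Magic bytes at the start of a Hadoop SequenceFile (assumption stated above).
SEQ_MAGIC = b"SEQ"


def read_sequencefile_version(stream):
    """Return the format version byte if the stream starts with the
    SequenceFile magic, otherwise None. Hypothetical helper for illustration."""
    header = stream.read(4)
    if len(header) < 4 or header[:3] != SEQ_MAGIC:
        return None
    # Indexing a bytes object yields an int: the version number.
    return header[3]


if __name__ == "__main__":
    # In-memory stand-ins for real files; a real check would open a data
    # file from a segment directory in binary mode instead.
    looks_like_seq = io.BytesIO(b"SEQ\x06" + b"\x00" * 16)
    not_seq = io.BytesIO(b"PK\x03\x04")
    print(read_sequencefile_version(looks_like_seq))
    print(read_sequencefile_version(not_seq))
```

Reading actual records still requires following the rest of the format as implemented in org.apache.hadoop.io, so this check is only useful for a quick sanity test before doing that work.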

[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-01-02 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660382#action_12660382 ] Todd Lipcon commented on NUTCH-669: --- Here's a further report on my progress: - It turns

[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-30 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659857#action_12659857 ] Todd Lipcon commented on NUTCH-669: --- Hey guys, I tried it on production, but ran

[jira] Created: (NUTCH-676) MapWritable is written inefficiently and confusingly

2008-12-30 Thread Todd Lipcon (JIRA)
Reporter: Todd Lipcon Priority: Minor The MapWritable implementation in o.a.n.crawl is written confusingly - it maintains its own internal linked list which I think may have a bug somewhere (I'm getting an NPE in certain cases in the code, though it's hard to track down

[jira] Updated: (NUTCH-676) MapWritable is written inefficiently and confusingly

2008-12-30 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated NUTCH-676: -- Attachment: 0001-NUTCH-676-Replace-MapWritable-implementation-with-t.patch NUTCH-676: Replace

[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-30 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659958#action_12659958 ] Todd Lipcon commented on NUTCH-669: --- Found the exception in a screen log: {noformat

[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly

2008-12-30 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659961#action_12659961 ] Todd Lipcon commented on NUTCH-676: --- Oops - please disregard above patch - it breaks

[jira] Created: (NUTCH-672) allow unit tests to be run from bin/nutch

2008-12-10 Thread Todd Lipcon (JIRA)
: Todd Lipcon Priority: Trivial In development it's handy to be able to run a single test case easily. You can do it with ant -Dtestcase=foo test, but that's slow since it still checks all the plugins for changes, rebuilds jars, etc. This patch adds a command to bin/nutch to run

[jira] Updated: (NUTCH-672) allow unit tests to be run from bin/nutch

2008-12-10 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated NUTCH-672: -- Attachment: 0001-NUTCH-672-allow-junit-tests-to-be-run-from-bin-nutc.patch allow unit tests to be run

[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-10 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655491#action_12655491 ] Todd Lipcon commented on NUTCH-669: --- For those watching this issue: I pushed a couple more

[jira] Commented: (NUTCH-670) feed plugin does not parse RSS2 enclosures

2008-12-10 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655513#action_12655513 ] Todd Lipcon commented on NUTCH-670: --- Turns out this is actually a bit trickier if I'm

[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-05 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653844#action_12653844 ] Todd Lipcon commented on NUTCH-669: --- Agreed on all fronts. I spent several hours

[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-05 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653940#action_12653940 ] Todd Lipcon commented on NUTCH-669: --- I've pushed the initial commit of this rewrite

[jira] Commented: (NUTCH-207) Bandwidth target for fetcher rather than a thread count

2008-12-04 Thread Todd Lipcon (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653412#action_12653412 ] Todd Lipcon commented on NUTCH-207: --- Are both fetcher and fetcher2 supposed

[jira] Created: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-04 Thread Todd Lipcon (JIRA)
Versions: 0.9.0 Reporter: Todd Lipcon Priority: Minor I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences: - Fetcher relies on the Protocol to obey robots.txt and crawl delay settings