Re: Renovating Nutch Hadoop Tutorial wiki page

2014-01-23 Thread d_k
My main concern with the Nutch2Tutorial was that it didn't stand by itself. As a newcomer to Nutch, I treated the NutchTutorial (for 1.x) with suspicion because I didn't know what is relevant for Nutch 2 and what isn't. And the Nutch2Tutorial alone is not enough to get you going. I think

Re: What is the correct way to serialize a MapWritable to WebPage's metadata?

2014-01-23 Thread d_k
Hi Lewis, On Tue, Jan 21, 2014 at 9:03 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi d_k, On Tue, Jan 21, 2014 at 11:20 AM, dev-digest-h...@nutch.apache.org wrote: I'm working on porting NUTCH-1622 to Nutch 2 Excellent and the path I took was to add a MapWritable field

[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions

2014-01-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879847#comment-13879847 ] Sebastian Nagel commented on NUTCH-1253: +1 tested with a collection of

[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879853#comment-13879853 ] Lewis John McGibbney commented on NUTCH-1253: I'll post the patches today

[jira] [Updated] (NUTCH-1164) Write JUnit tests for protocol-http

2014-01-23 Thread Sertac TURKEL (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sertac TURKEL updated NUTCH-1164: Attachment: NUTCH-1164.patch Hi [~tejas.patil], I updated the patch file and I think it's OK. Could

[jira] [Updated] (NUTCH-1164) Write JUnit tests for protocol-http

2014-01-23 Thread Sertac TURKEL (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sertac TURKEL updated NUTCH-1164: Attachment: (was: NUTCH-1158.patch) Write JUnit tests for protocol-http

[jira] [Commented] (NUTCH-1709) Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc

2014-01-23 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879881#comment-13879881 ] Alparslan Avcı commented on NUTCH-1709: +1 on this issue. The Avro generated

[jira] [Commented] (NUTCH-1709) Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879883#comment-13879883 ] Lewis John McGibbney commented on NUTCH-1709: I will probably submit a patch

[jira] [Created] (NUTCH-1711) Normalizer does not encode exclamation mark

2014-01-23 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-1711: Summary: Normalizer does not encode exclamation mark Key: NUTCH-1711 URL: https://issues.apache.org/jira/browse/NUTCH-1711 Project: Nutch Issue Type: Bug

[jira] [Commented] (NUTCH-1711) Normalizer does not encode exclamation mark

2014-01-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879914#comment-13879914 ] Markus Jelsma commented on NUTCH-1711: Well, perhaps it is best to stick with the

[jira] [Comment Edited] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879955#comment-13879955 ] Lewis John McGibbney edited comment on NUTCH-1465 at 1/23/14 2:38 PM:

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879955#comment-13879955 ] Lewis John McGibbney commented on NUTCH-1465: Hey [~tejasp]. Again, great

[jira] [Created] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1712: Summary: Use MultipleInputs in Injector to make it a single mapreduce job Key: NUTCH-1712 URL: https://issues.apache.org/jira/browse/NUTCH-1712 Project: Nutch

[jira] [Updated] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1712: Description: Currently Injector creates two mapreduce jobs: 1. sort job: get the urls from seeds

[jira] [Updated] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1712: Attachment: NUTCH-1712-trunk.v1.patch Use MultipleInputs in Injector to make it a single mapreduce

[jira] [Updated] (NUTCH-1713) IpAddressResolver and DNSCache

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1713: Attachment: NUTCH-1713-trunk.patch Patch contributed by [~wal]. I forgot to open a

[jira] [Updated] (NUTCH-1713) IpAddressResolver and DNSCache

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1713: Fix Version/s: 1.8, 2.3 IpAddressResolver and DNSCache

[jira] [Commented] (NUTCH-1660) Index filter for Page's latitude and longitude

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879981#comment-13879981 ] Lewis John McGibbney commented on NUTCH-1660: [~icebergx5] and [~talat] we

[jira] [Created] (NUTCH-1713) IpAddressResolver and DNSCache

2014-01-23 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1713: Summary: IpAddressResolver and DNSCache Key: NUTCH-1713 URL: https://issues.apache.org/jira/browse/NUTCH-1713 Project: Nutch Issue Type: New

[jira] [Updated] (NUTCH-1714) Nutch 2.x upgrade to use GORA_94 branch

2014-01-23 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alparslan Avcı updated NUTCH-1714: Attachment: NUTCH-1714.patch I've uploaded a patch that makes Nutch 2.x suitable to use

[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-23 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880026#comment-13880026 ] Markus Jelsma commented on NUTCH-1113: I have tried running long sequences with

[jira] [Comment Edited] (NUTCH-1714) Nutch 2.x upgrade to use GORA_94 branch

2014-01-23 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880008#comment-13880008 ] Alparslan Avcı edited comment on NUTCH-1714 at 1/23/14 4:18 PM:

[jira] [Comment Edited] (NUTCH-1714) Nutch 2.x upgrade to use GORA_94 branch

2014-01-23 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880008#comment-13880008 ] Alparslan Avcı edited comment on NUTCH-1714 at 1/23/14 4:17 PM:

[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880007#comment-13880007 ] Sebastian Nagel commented on NUTCH-1113: Great! I'll try to verify it within the

Re: Renovating Nutch Hadoop Tutorial wiki page

2014-01-23 Thread Tejas Patil
On Thu, Jan 23, 2014 at 1:36 PM, d_k mail...@gmail.com wrote: My main concern with the Nutch2Tutorial was that it didn't stand by itself. As a newcomer to Nutch, I treated the NutchTutorial (for 1.x) with suspicion because I didn't know what is relevant for Nutch 2 and what isn't. And the

[jira] [Resolved] (NUTCH-1164) Write JUnit tests for protocol-http

2014-01-23 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1164. Resolution: Fixed The patch is better now and all tests pass. It needed little modification: you

[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880288#comment-13880288 ] Tejas Patil commented on NUTCH-1712: The performance gains due to this patch won't be

[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1465: --- Fix Version/s: (was: 1.9) 1.8 Support sitemaps in Nutch

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880295#comment-13880295 ] Tejas Patil commented on NUTCH-1465: Hi [~lewismc], +1 for the first two suggestions.

Re: Right way to run crawl script in deploy mode

2014-01-23 Thread Tejas Patil
Correction: the subject of this message should have read: Right way to run crawl script in deploy mode ~tejas On Wed, Jan 22, 2014 at 7:56 PM, Tejas Patil tejas.patil...@gmail.com wrote: Hi nutch-dev, I was assuming that the commands to run the bin/crawl script in both local and deploy mode

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880305#comment-13880305 ] Lewis John McGibbney commented on NUTCH-1465: hey [~tejasp] no probs. RE: #3,

Re: Renovating Nutch Hadoop Tutorial wiki page

2014-01-23 Thread d_k
What I was missing when I first started with Nutch, and one can claim that a little research would have solved it, was how to configure nutch-site.xml. When looking at the NutchTutorial, you can't be sure what applies to Nutch 2.x and what doesn't without prior knowledge that the nutch-site.xml is the

[jira] [Commented] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880356#comment-13880356 ] Lewis John McGibbney commented on NUTCH-1645: hey [~msertacturkel] thank you

[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1253: Attachment: TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt Actually, I

[jira] [Comment Edited] (NUTCH-1253) Incompatible neko and xerces versions

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880362#comment-13880362 ] Lewis John McGibbney edited comment on NUTCH-1253 at 1/23/14 9:10 PM:

[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2014-01-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1253: Attachment: nutch1253test.html, nutch1253parsed.html It's likely a regression

[jira] [Updated] (NUTCH-1677) ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION are not set in Parse HTML

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1677: Patch Info: Patch Available ORIGINAL_CHAR_ENCODING and

Re: What is the correct way to serialize a MapWritable to WebPage's metadata?

2014-01-23 Thread Lewis John Mcgibbney
Hi d_k, On Thu, Jan 23, 2014 at 11:06 AM, dev-digest-h...@nutch.apache.org wrote: I attached the patch. If you think it's ready I can add it to JIRA. Yes please open an issue and we can take the conversation there. dev@ is quite busy these days and some mail gets lost in the digest emails I

[jira] [Commented] (NUTCH-1677) ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION are not set in Parse HTML

2014-01-23 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880529#comment-13880529 ] Lewis John McGibbney commented on NUTCH-1677: hi [~ilhamikalkan], thank you

[jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

2014-01-23 Thread Daniel Kugel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Kugel updated NUTCH-1622: Attachment: NUTCH-1622-2.x.patch A patch for Nutch 2.x was added. Create Outlinks with metadata