[jira] [Created] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-06 Thread Julien Nioche (Jira)
Julien Nioche created NUTCH-3025: Summary: urlfilter-fast to filter based on the length of the URL Key: NUTCH-3025 URL: https://issues.apache.org/jira/browse/NUTCH-3025 Project: Nutch Issue

[jira] [Created] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-10-30 Thread Julien Nioche (Jira)
Julien Nioche created NUTCH-3017: Summary: Allow fast-urlfilter to load from HDFS/S3 and support gzipped input Key: NUTCH-3017 URL: https://issues.apache.org/jira/browse/NUTCH-3017 Project: Nutch

[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

2018-10-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643035#comment-16643035 ] Julien Nioche commented on NUTCH-2648: -- [~wastl-nagel] ?? (code borrowed

[jira] [Resolved] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2017-04-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2046. -- Resolution: Fixed Assignee: Julien Nioche (was: Lewis John McGibbney) > The crawl script

[jira] [Closed] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2017-04-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-1371. Resolution: Duplicate > Replace Ivy with Maven Ant tasks > > >

[jira] [Commented] (NUTCH-2363) Fetcher support for reading and setting cookies

2017-03-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890043#comment-15890043 ] Julien Nioche commented on NUTCH-2363: -- Got it! Thanks for the explanation [~markus17]! Had missed

[jira] [Resolved] (NUTCH-1531) URL filtering takes long time for very long URLs

2016-10-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1531. -- Resolution: Duplicate No follow up on this one + same functionality discussed elsewhere > URL

[jira] [Commented] (NUTCH-2320) URLFilterChecker to run as TCP Telnet service

2016-10-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15549206#comment-15549206 ] Julien Nioche commented on NUTCH-2320: -- Hi @markus17, you haven't left much time for people to

[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2016-07-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359504#comment-15359504 ] Julien Nioche commented on NUTCH-1371: -- None whatsoever [~lewismc]. Maybe mark it as duplicate and

[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2016-02-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142863#comment-15142863 ] Julien Nioche commented on NUTCH-2046: -- I agree with the objective but I'd rather have a consistent

[jira] [Reopened] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reopened NUTCH-2213: -- Assignee: Julien Nioche The WARC Export actually has the same issue as its CommonCrawl

[jira] [Comment Edited] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140608#comment-15140608 ] Julien Nioche edited comment on NUTCH-2213 at 2/10/16 10:36 AM: Hi Joris

[jira] [Commented] (NUTCH-2204) remove junit lib from runtime

2016-01-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113021#comment-15113021 ] Julien Nioche commented on NUTCH-2204: -- +1 > remove junit lib from runtime >

[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033491#comment-15033491 ] Julien Nioche commented on NUTCH-2177: -- Do you mean 'mapreduce.framework.name' ? > Generator

[jira] [Updated] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2177: - Attachment: NUTCH-2177.patch > Generator produces only one partition even in distributed mode >

[jira] [Comment Edited] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033491#comment-15033491 ] Julien Nioche edited comment on NUTCH-2177 at 12/1/15 11:43 AM: Do you

[jira] [Resolved] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2177. -- Resolution: Fixed Committed revision 1717412. Thanks [~wastl-nagel] and [~markus17] >

[jira] [Created] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2177: Summary: Generator produces only one partition even in distributed mode Key: NUTCH-2177 URL: https://issues.apache.org/jira/browse/NUTCH-2177 Project: Nutch

[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029037#comment-15029037 ] Julien Nioche commented on NUTCH-2177: -- I am on Hadoop version: 2.4.0-amzn-7 not clear which

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018232#comment-15018232 ] Julien Nioche commented on NUTCH-2069: -- no probs. Would be good to find a way to format based on the

[jira] [Resolved] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2069. -- Resolution: Fixed Trunk committed revision 1715386. Thanks everyone for comments and reviews

[jira] [Closed] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-2069. > Ignore external links based on domain > - > >

[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2069: - Attachment: NUTCH-2069.v2.patch new patch introducing 'db.ignore.external.links.mode' this is

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to encode reserved chars and decode non-reserved chars

2015-11-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998467#comment-14998467 ] Julien Nioche commented on NUTCH-2064: -- FYI have ported the code to Crawler-Commons

[jira] [Resolved] (NUTCH-2064) URLNormalizer basic to encode reserved chars and decode non-reserved chars

2015-11-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2064. -- Resolution: Fixed Fix Version/s: (was: 1.12) 1.11 Trunk :

[jira] [Assigned] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-2158: Assignee: Julien Nioche (was: Chris A. Mattmann) > Upgrade to Tika 1.11 >

[jira] [Updated] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2158: - Attachment: NUTCH-2158.patch Patch which upgrades to Tika 1.11 tests fail for protocol-http

[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943757#comment-14943757 ] Julien Nioche commented on NUTCH-2132: -- Looking at it from a slightly different angle, couldn't you

[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943856#comment-14943856 ] Julien Nioche commented on NUTCH-2132: -- bq. but that locks us into using Kibana, etc. Ideally one

[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939503#comment-14939503 ] Julien Nioche commented on NUTCH-2129: -- I'd rather keep it simple and not modify the CrawlDatum so

[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902651#comment-14902651 ] Julien Nioche commented on NUTCH-2095: -- Thanks [~jorgelbg]. Please add a line to CHANGES.txt to

[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902715#comment-14902715 ] Julien Nioche commented on NUTCH-2095: -- See [https://issues.apache.org/jira/browse/HADOOP-10961].

[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902578#comment-14902578 ] Julien Nioche commented on NUTCH-2095: -- [~jorgelbg] could you please fix the test. See below {code}

[jira] [Resolved] (NUTCH-2102) WARC Exporter

2015-09-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2102. -- Resolution: Fixed Committed revision 1704634. Thanks for the reviews > WARC Exporter >

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Fix Version/s: 1.11 > WARC Exporter > - > > Key: NUTCH-2102 >

[jira] [Closed] (NUTCH-2114) kkk

2015-09-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-2114. Resolution: Invalid > kkk > --- > > Key: NUTCH-2114 > URL:

[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747300#comment-14747300 ] Julien Nioche commented on NUTCH-2102: -- The only modification to existing code is in the class

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Description: This patch adds a WARC exporter

[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747301#comment-14747301 ] Julien Nioche commented on NUTCH-2102: -- Please review > WARC Exporter > - > >

[jira] [Comment Edited] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747327#comment-14747327 ] Julien Nioche edited comment on NUTCH-2102 at 9/16/15 11:21 AM: Hi Markus

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Description: This patch adds a WARC exporter

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Attachment: (was: NUTCH-2102.patch) > WARC Exporter > - > > Key:

[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747327#comment-14747327 ] Julien Nioche commented on NUTCH-2102: -- Hi Markus > I believe this warc format is the updated arc

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Attachment: NUTCH-2102.patch > WARC Exporter > - > > Key: NUTCH-2102

[jira] [Created] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2102: Summary: WARC Exporter Key: NUTCH-2102 URL: https://issues.apache.org/jira/browse/NUTCH-2102 Project: Nutch Issue Type: Improvement Components:

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Attachment: NUTCH-2102.patch > WARC Exporter > - > > Key: NUTCH-2102

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-09-14 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744078#comment-14744078 ] Julien Nioche commented on NUTCH-2064: -- yep, can discuss that post 1.11 > URLNormalizer basic to

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-09-04 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731114#comment-14731114 ] Julien Nioche commented on NUTCH-2064: -- What about moving the basic URL normalizer to

[jira] [Resolved] (NUTCH-1517) CloudSearch indexer

2015-08-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1517. -- Resolution: Fixed trunk committed revision 1697911. Thanks for comments and review

[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2015-08-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712988#comment-14712988 ] Julien Nioche commented on NUTCH-1517: -- Thanks [~jorgelbg]. Will commit soon unless

[jira] [Resolved] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable

2015-08-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2049. -- Resolution: Fixed Committed revision 1697466. Thanks to everyone involved. Upgrade Trunk to

[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable

2015-08-21 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706402#comment-14706402 ] Julien Nioche commented on NUTCH-2049: -- Fantastic work [~lewismc]! I think this is

[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1517: - Attachment: (was: NUTCH-1517.patch) CloudSearch indexer ---

[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1517: - Flags: Patch CloudSearch indexer --- Key: NUTCH-1517

[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1517: - Attachment: NUTCH-1517.patch New implementation of the CloudSearchIndexWriter, uses the latest

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647467#comment-14647467 ] Julien Nioche commented on NUTCH-2069: -- Hi [~wastl-nagel] and [~markus17]. BTW did

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646543#comment-14646543 ] Julien Nioche commented on NUTCH-2069: -- What code restyle? I applied the formatting

[jira] [Created] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2069: Summary: Ignore external links based on domain Key: NUTCH-2069 URL: https://issues.apache.org/jira/browse/NUTCH-2069 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2069: - Attachment: NUTCH-2069.patch Ignore external links based on domain

[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2069: - Patch Info: Patch Available Ignore external links based on domain

[jira] [Commented] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml

2015-07-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640138#comment-14640138 ] Julien Nioche commented on NUTCH-2048: -- howto_upgrade_tika.txt has been around for 2

[jira] [Assigned] (NUTCH-1517) CloudSearch indexer

2015-07-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-1517: Assignee: Julien Nioche CloudSearch indexer --- Key:

[jira] [Commented] (NUTCH-2016) Remove OldFetcher from trunk

2015-06-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600946#comment-14600946 ] Julien Nioche commented on NUTCH-2016: -- +1 Remove OldFetcher from trunk

[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2036: - Affects Version/s: (was: 1.11) Adding some continuous crawl goodies to the crawl script

[jira] [Commented] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600949#comment-14600949 ] Julien Nioche commented on NUTCH-2036: -- Any thoughts on this? This is useful and

[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2036: - Fix Version/s: 1.11 Adding some continuous crawl goodies to the crawl script

[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599840#comment-14599840 ] Julien Nioche commented on NUTCH-2046: -- re-script : what about a positive parameter

[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-06-17 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589951#comment-14589951 ] Julien Nioche commented on NUTCH-2000: -- Hi Seb, +1 to commit. Not sure I'll be able

[jira] [Resolved] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2006. -- Resolution: Fixed Fix Version/s: 1.11 Committed revision 1679567. Thanks Seb

[jira] [Commented] (NUTCH-2012) Merge parsechecker and indexchecker

2015-05-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545534#comment-14545534 ] Julien Nioche commented on NUTCH-2012: -- +1 to merging them into a more generic tool.

[jira] [Commented] (NUTCH-2008) IndexerMapReduce to use single instance of NutchIndexAction for deletions

2015-05-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541843#comment-14541843 ] Julien Nioche commented on NUTCH-2008: -- Makes total sense. +1 Could also make it

[jira] [Created] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-11 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2006: Summary: IndexingFiltersChecker to take custom metadata as input Key: NUTCH-2006 URL: https://issues.apache.org/jira/browse/NUTCH-2006 Project: Nutch Issue

[jira] [Updated] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2006: - Attachment: NUTCH-2006.patch Patch which allows to take custom metadata into account + improved

[jira] [Updated] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2006: - Patch Info: Patch Available IndexingFiltersChecker to take custom metadata as input

[jira] [Updated] (NUTCH-1999) Add http://nutch.apache.org/robots.txt

2015-05-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1999: - Assignee: (was: Julien Nioche) Add http://nutch.apache.org/robots.txt

[jira] [Updated] (NUTCH-2002) ParserChecker to check robots.txt

2015-04-27 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2002: - Attachment: NUTCH-2002.patch ParserChecker to check robots.txt

[jira] [Created] (NUTCH-2002) ParserChecker to check robots.txt

2015-04-27 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2002: Summary: ParserChecker to check robots.txt Key: NUTCH-2002 URL: https://issues.apache.org/jira/browse/NUTCH-2002 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2000: - Fix Version/s: (was: 1.10) 1.11 Link inversion fails with .locked already

[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510629#comment-14510629 ] Julien Nioche commented on NUTCH-2000: -- Lewis - could be, need to investigate but

[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509898#comment-14509898 ] Julien Nioche commented on NUTCH-2000: -- [~lewismc] reverted to 1.10 as this is a

[jira] [Updated] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2000: - Priority: Blocker (was: Major) Link inversion fails with .locked already exists.

[jira] [Updated] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2000: - Fix Version/s: (was: 1.11) 1.10 Link inversion fails with .locked already

[jira] [Created] (NUTCH-1999) Add http://nutch.apache.org/robots.txt

2015-04-23 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1999: Summary: Add http://nutch.apache.org/robots.txt Key: NUTCH-1999 URL: https://issues.apache.org/jira/browse/NUTCH-1999 Project: Nutch Issue Type: Improvement

[jira] [Assigned] (NUTCH-1999) Add http://nutch.apache.org/robots.txt

2015-04-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-1999: Assignee: Julien Nioche Add http://nutch.apache.org/robots.txt

[jira] [Created] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-04-23 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2000: Summary: Link inversion fails with .locked already exists. Key: NUTCH-2000 URL: https://issues.apache.org/jira/browse/NUTCH-2000 Project: Nutch Issue Type:

[jira] [Commented] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer

2015-04-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506745#comment-14506745 ] Julien Nioche commented on NUTCH-1990: -- bq. lot of garbage yep, that's what the

[jira] [Resolved] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer

2015-04-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1990. -- Resolution: Fixed Fix Version/s: 1.10 Committed revision 1675305. Use URI.normalise()

[jira] [Commented] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer

2015-04-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503072#comment-14503072 ] Julien Nioche commented on NUTCH-1990: -- Thanks [~wastl-nagel]! I have extracted

[jira] [Created] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer

2015-04-16 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1990: Summary: Use URI.normalise() in BasicURLNormalizer Key: NUTCH-1990 URL: https://issues.apache.org/jira/browse/NUTCH-1990 Project: Nutch Issue Type:

[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375857#comment-14375857 ] Julien Nioche commented on NUTCH-1958: -- I agree but I think there could be benefits

[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375696#comment-14375696 ] Julien Nioche commented on NUTCH-1958: -- What would you suggest as a replacement?

[jira] [Closed] (NUTCH-1965) My

2015-03-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-1965. Resolution: Fixed WTF is this? My -- Key: NUTCH-1965 URL:

[jira] [Commented] (NUTCH-1942) Remove TopLevelDomain

2015-02-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319992#comment-14319992 ] Julien Nioche commented on NUTCH-1942: -- See

[jira] [Created] (NUTCH-1942) Remove TopLevelDomain

2015-02-12 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1942: Summary: Remove TopLevelDomain Key: NUTCH-1942 URL: https://issues.apache.org/jira/browse/NUTCH-1942 Project: Nutch Issue Type: Task Reporter:

[jira] [Closed] (NUTCH-1937) Error: Could not find or load main class bin.crawl

2015-02-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-1937. Resolution: Invalid Please use the mailing list to ask questions like these instead of filing bugs

[jira] [Resolved] (NUTCH-1889) Store all values from Tika metadata in Nutch metadata

2015-01-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1889. -- Resolution: Fixed Committed revision 1655960. Store all values from Tika metadata in Nutch

[jira] [Commented] (NUTCH-1918) TikaParser specifies a default namespace when generating DOM

2015-01-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296621#comment-14296621 ] Julien Nioche commented on NUTCH-1918: -- Quite an important issue for those who

[jira] [Commented] (NUTCH-1889) Store all values from Tika metadata in Nutch metadata

2015-01-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296619#comment-14296619 ] Julien Nioche commented on NUTCH-1889: -- This one is quite trivial, I'd like to see it

[jira] [Updated] (NUTCH-1889) Store all values from Tika metadata in Nutch metadata

2015-01-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1889: - Fix Version/s: (was: 1.11) 1.10 Store all values from Tika metadata in

[jira] [Updated] (NUTCH-1918) TikaParser specifies a default namespace when generating DOM

2015-01-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1918: - Fix Version/s: (was: 1.11) 1.10 TikaParser specifies a default namespace
