[jira] [Created] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-06 Thread Julien Nioche (Jira)
Julien Nioche created NUTCH-3025: Summary: urlfilter-fast to filter based on the length of the URL Key: NUTCH-3025 URL: https://issues.apache.org/jira/browse/NUTCH-3025 Project: Nutch Issue

[jira] [Created] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-10-30 Thread Julien Nioche (Jira)
Julien Nioche created NUTCH-3017: Summary: Allow fast-urlfilter to load from HDFS/S3 and support gzipped input Key: NUTCH-3017 URL: https://issues.apache.org/jira/browse/NUTCH-3017 Project: Nutch

Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Julien Nioche
What a fantastic addition to the Nutch team! Congrats to Tim On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel wrote: > Dear all, > > It is my pleasure to announce that Tim Allison has joined us > as a committer and member of the Nutch PMC. > > You may already know Tim as a maintainer of and

[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

2018-10-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643035#comment-16643035 ] Julien Nioche commented on NUTCH-2648: -- [~wastl-nagel] ?? (code borrowed [storm-crawler#615|https

Crawler-Commons 0.10 released

2018-06-07 Thread Julien Nioche
Hi We are glad to announce the 0.10 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. This version contains among other things improvements to the

Re: [VOTE] Release Apache Nutch 1.14 RC#1

2017-12-19 Thread Julien Nioche
+1 to release, thanks Seb On 18 December 2017 at 22:12, Sebastian Nagel wrote: > Hi Folks, > > A first candidate for the Nutch 1.14 release is available at: > > https://dist.apache.org/repos/dist/dev/nutch/1.14/ > > The release candidate is a zip and tar.gz archive

Re: [DISCUSS] Release 1.14?

2017-12-14 Thread Julien Nioche
happens this week, I'll make sure that it's included. > > Thanks, > Sebastian > > > On 12/11/2017 10:22 AM, Julien Nioche wrote: > > Tika 1.17 will be released shortly, maybe it would be worth waiting a > bit and integrate it first? > > > > On 8 December 2

Re: [DISCUSS] Release 1.14?

2017-12-11 Thread Julien Nioche
Tika 1.17 will be released shortly, maybe it would be worth waiting a bit and integrate it first? On 8 December 2017 at 22:53, Sebastian Nagel wrote: > Hi all, > > 50+ issues fixed > https://issues.apache.org/jira/projects/NUTCH/versions/12340218 > > Of course, as

Crawler-Commons 0.9 released

2017-10-31 Thread Julien Nioche
Happy Halloween! We are glad to announce the 0.9 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main changes are the removal of DOM-based

Re: Establishment of Static Source Code Analysis

2017-06-16 Thread Julien Nioche
<https://github.com/crawler-commons/crawler-commons/pull/127>. On 16 June 2017 at 08:55, Julien Nioche <lists.digitalpeb...@gmail.com> wrote: > Russian compatriots > > > Are we all Russian then? > > On 16 June 2017 at 04:29, lewis john mcgibbney <lewi...@apache.org

Re: Establishment of Static Source Code Analysis

2017-06-16 Thread Julien Nioche
> > Russian compatriots Are we all Russian then? On 16 June 2017 at 04:29, lewis john mcgibbney wrote: > Hi Folks, > I don't know if anyone else noticed... some of our Russian compatriots > have set up a static auto bot to notify us of source code issues... > An example

Crawler-Commons 0.8 released

2017-06-09 Thread Julien Nioche
Apologies for cross-posting. The Common-Crawl project is pleased to announce its 0.8 release. https://github.com/crawler-commons/crawler-commons/releases/tag/crawler-commons-0.8 If you are wondering what

[jira] [Resolved] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2017-04-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2046. -- Resolution: Fixed Assignee: Julien Nioche (was: Lewis John McGibbney) > The crawl scr

[jira] [Closed] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2017-04-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-1371. Resolution: Duplicate > Replace Ivy with Maven Ant ta

Re: [VOTE] Release Apache Nutch 1.13 RC#1

2017-03-29 Thread Julien Nioche
Hi Lewis +1 compiled from source and ran a small crawl in local mode. All good! Thanks Julien On 29 March 2017 at 05:20, lewis john mcgibbney wrote: > Hi Folks, > > A first candidate for the Nutch 1.13 release is available at: > >

[jira] [Commented] (NUTCH-2363) Fetcher support for reading and setting cookies

2017-03-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890043#comment-15890043 ] Julien Nioche commented on NUTCH-2363: -- Got it! Thanks for the explanation [~markus17]! Had missed

Crawler-Commons 0.7 released

2016-11-24 Thread Julien Nioche
Apologies for cross-posting. The Common-Crawl project is pleased to announce its 0.7 release. https://github.com/crawler-commons/crawler-commons#24th-november-2016crawler-commons-07-released The list of changes can be found here

[jira] [Resolved] (NUTCH-1531) URL filtering takes long time for very long URLs

2016-10-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1531. -- Resolution: Duplicate No follow up on this one + same functionality discussed elsewhere >

[jira] [Commented] (NUTCH-2320) URLFilterChecker to run as TCP Telnet service

2016-10-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15549206#comment-15549206 ] Julien Nioche commented on NUTCH-2320: -- Hi @markus17, you haven't left much time for people

[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2016-07-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359504#comment-15359504 ] Julien Nioche commented on NUTCH-1371: -- None whatsoever [~lewismc]. Maybe mark it as duplicate

ApacheCon EU Sevilla

2016-06-29 Thread Julien Nioche
Hi, Sorry for cross posting. As you are probably aware, the ApacheCon Europe, and Apache Big Data conferences will take place in Seville, Spain, November 14-18, 2016. http://events.linuxfoundation.org/events/apache-big-data-europe/ I just submitted a talk on StormCrawler

Re: [VOTE] Release Apache Nutch 1.12

2016-06-15 Thread Julien Nioche
+1 Thanks Lewis and team! On 15 June 2016 at 06:14, lewis john mcgibbney wrote: > Hi Folks, > > A first candidate for the Nutch 1.12 release is available at: > > https://dist.apache.org/repos/dist/dev/nutch/1.12/ > > The release candidate is a zip and tar archive of the

[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2016-02-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142863#comment-15142863 ] Julien Nioche commented on NUTCH-2046: -- I agree with the objective but I'd rather have a consistent

[jira] [Reopened] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reopened NUTCH-2213: -- Assignee: Julien Nioche The WARC Export actually has the same issue as its CommonCrawl

[jira] [Comment Edited] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140608#comment-15140608 ] Julien Nioche edited comment on NUTCH-2213 at 2/10/16 10:36 AM: Hi Joris

[jira] [Commented] (NUTCH-2204) remove junit lib from runtime

2016-01-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113021#comment-15113021 ] Julien Nioche commented on NUTCH-2204: -- +1 > remove junit lib from runt

Re: [VOTE] Moving to Git

2016-01-08 Thread Julien Nioche
+1 to move to Git Note : I don't think Dennis is on the PMC anymore Ju On 8 January 2016 at 08:46, Chris Mattmann wrote: > Hi Everyone, > > I proposed this earlier, and we said we’d wait until after the > 1.11 release. So it’s time to VOTE to move Nutch to Git. So > far,

Re: [RELEASE] Apache Nutch 1.11

2015-12-08 Thread Julien Nioche
Thanks Lewis for taking care of the release and everyone involved. Julien On 8 December 2015 at 01:34, lewis john mcgibbney wrote: > Hello Folks, > > 07 December 2015 - Nutch 1.11 Release > > The Apache Nutch PMC are pleased to announce the immediate release of > Apache

Re: [VOTE] Release Apache Nutch 1.11 RC#2

2015-12-05 Thread Julien Nioche
+1 Thanks Lewis On 4 December 2015 at 18:03, Lewis John Mcgibbney wrote: > Hi Folks, > > A second candidate for the Nutch 1.11 release is available at: > > https://dist.apache.org/repos/dist/dev/nutch/1.11rc2/ > > The release candidate consists of zip and tar

[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033491#comment-15033491 ] Julien Nioche commented on NUTCH-2177: -- Do you mean 'mapreduce.framework.name' ? > Genera

[jira] [Updated] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2177: - Attachment: NUTCH-2177.patch > Generator produces only one partition even in distributed m

[jira] [Comment Edited] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033491#comment-15033491 ] Julien Nioche edited comment on NUTCH-2177 at 12/1/15 11:43 AM: Do you

[jira] [Resolved] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2177. -- Resolution: Fixed Committed revision 1717412. Thanks [~wastl-nagel] and [~markus17

[jira] [Created] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2177: Summary: Generator produces only one partition even in distributed mode Key: NUTCH-2177 URL: https://issues.apache.org/jira/browse/NUTCH-2177 Project: Nutch

[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029037#comment-15029037 ] Julien Nioche commented on NUTCH-2177: -- I am on Hadoop version: 2.4.0-amzn-7 not clear which

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018232#comment-15018232 ] Julien Nioche commented on NUTCH-2069: -- no probs. Would be good to find a way to format based

[jira] [Resolved] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2069. -- Resolution: Fixed Trunk committed revision 1715386. Thanks everyone for comments and reviews

[jira] [Closed] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-2069. > Ignore external links based on domain > - > >

[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2069: - Attachment: NUTCH-2069.v2.patch new patch introducing 'db.ignore.external.links.mode

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to encode reserved chars and decode non-reserved chars

2015-11-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998467#comment-14998467 ] Julien Nioche commented on NUTCH-2064: -- FYI have ported the code to Crawler-Commons [https

[jira] [Resolved] (NUTCH-2064) URLNormalizer basic to encode reserved chars and decode non-reserved chars

2015-11-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2064. -- Resolution: Fixed Fix Version/s: (was: 1.12) 1.11 Trunk

[jira] [Assigned] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-2158: Assignee: Julien Nioche (was: Chris A. Mattmann) > Upgrade to Tika 1

[jira] [Updated] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2158: - Attachment: NUTCH-2158.patch Patch which upgrades to Tika 1.11 tests fail for protocol-http

Re: [VOTE] Apache Nutch 1.11 Release Candidate #1

2015-10-26 Thread Julien Nioche
Chris -1 We usually release tar.gz as well as zip. More importantly we need to release the sources as well as the binary. We can't even test that it compiles OK Since you released Tika, why don't we include it before cutting 1.11? Thanks Julien On 26 October 2015 at 05:53, Mattmann, Chris

[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943757#comment-14943757 ] Julien Nioche commented on NUTCH-2132: -- Looking at it from a slightly different angle, couldn't you

[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943856#comment-14943856 ] Julien Nioche commented on NUTCH-2132: -- bq. but that locks us into using Kibana, etc. Ideally one

Re: Nutch not recognizing html pages/images retrieved via php

2015-10-05 Thread Julien Nioche
Hi What happens is that parse-tika is used by default but doesn't know what to do with that mime type. You can edit parse-plugins.xml and add to map the mime type to the html parser. Obviously you'll need parse-html to be
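As a rough illustration of the mapping described in the message above, an entry in parse-plugins.xml could look like the fragment below. The mime type shown is a placeholder assumption (the thread snippet does not say which type the server returned), and the fragment assumes the parse-html plugin is active in the installation; treat it as a sketch rather than the exact change suggested in the thread.

```xml
<!-- Sketch of a parse-plugins.xml entry; the mime type below is a
     hypothetical placeholder, replace it with the type Nutch reports. -->
<parse-plugins>

  <!-- Route documents of this mime type to the HTML parser
       instead of the default parse-tika. -->
  <mimeType name="application/x-httpd-php">
    <plugin id="parse-html" />
  </mimeType>

  <!-- The alias ties the plugin id to its parser extension. -->
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>

</parse-plugins>
```

In a stock parse-plugins.xml the aliases section normally already exists, so only the mimeType element would need adding.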

[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939503#comment-14939503 ] Julien Nioche commented on NUTCH-2129: -- I'd rather keep it simple and not modify the CrawlDatum so

Webcast : Apache Nutch on EMR

2015-09-23 Thread Julien Nioche
Hi again, I have uploaded a webcast explaining how to run Nutch on AWS Elastic Map Reduce https://www.youtube.com/watch?v=v9zjcTjjjyU Please excuse the sound quality, hesitations and stuttering. I hope you find it useful nonetheless. Julien -- *Open Source Solutions for Text Engineering*

Tutorial : Index the web with AWS CloudSearch

2015-09-23 Thread Julien Nioche
Hi everyone, Just to let you know that we've just published a new tutorial on how to use Nutch (and StormCrawler) to crawl and index documents into AWS CloudSearch. This is related to the recent addition of NUTCH-1517 in the trunk codebase. The

[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902651#comment-14902651 ] Julien Nioche commented on NUTCH-2095: -- Thanks [~jorgelbg]. Please add a line to CHANGES.txt

[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902715#comment-14902715 ] Julien Nioche commented on NUTCH-2095: -- See [https://issues.apache.org/jira/browse/HADOOP-10961

[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902578#comment-14902578 ] Julien Nioche commented on NUTCH-2095: -- [~jorgelbg] could you please fix the test. See below {code

[jira] [Resolved] (NUTCH-2102) WARC Exporter

2015-09-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2102. -- Resolution: Fixed Committed revision 1704634. Thanks for the reviews > WARC Expor

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Fix Version/s: 1.11 > WARC Exporter > - > > Key

[jira] [Closed] (NUTCH-2114) kkk

2015-09-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-2114. Resolution: Invalid > kkk > --- > > Key: NUTCH-2114 >

Fwd: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

2015-09-18 Thread Julien Nioche
Nutch people, Just in case you missed the announcement below. As you probably know CC use Nutch for their crawls, this is a fantastic opportunity to put your Nutch skills to great use! Julien -- Forwarded message -- From: Sara Crouse Date: 17 September

[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747300#comment-14747300 ] Julien Nioche commented on NUTCH-2102: -- The only modification to existing code is in the class 'src

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Description: This patch adds a WARC exporter [http://bibnum.bnf.fr/warc

[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747301#comment-14747301 ] Julien Nioche commented on NUTCH-2102: -- Please review > WARC Expor

[jira] [Comment Edited] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747327#comment-14747327 ] Julien Nioche edited comment on NUTCH-2102 at 9/16/15 11:21 AM: Hi Markus

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Description: This patch adds a WARC exporter [http://bibnum.bnf.fr/warc

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Attachment: (was: NUTCH-2102.patch) > WARC Exporter > - > >

[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747327#comment-14747327 ] Julien Nioche commented on NUTCH-2102: -- Hi Markus > I believe this warc format is the updated

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Attachment: NUTCH-2102.patch > WARC Exporter > - > > Key

[jira] [Created] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2102: Summary: WARC Exporter Key: NUTCH-2102 URL: https://issues.apache.org/jira/browse/NUTCH-2102 Project: Nutch Issue Type: Improvement Components

[jira] [Updated] (NUTCH-2102) WARC Exporter

2015-09-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2102: - Attachment: NUTCH-2102.patch > WARC Exporter > - > > Key

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-09-14 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744078#comment-14744078 ] Julien Nioche commented on NUTCH-2064: -- yep, can discuss that post 1.11 > URLNormalizer ba

Re: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-10 Thread Julien Nioche
Congratulations Asitang and welcome! Julien On 9 September 2015 at 23:01, Sebastian Nagel wrote: > Dear all, > > on behalf of the Nutch PMC it is my pleasure to announce > that Asitang Mishra has joined the Nutch team as committer > and PMC member. Asitang, please

[jira] [Commented] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-09-04 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731114#comment-14731114 ] Julien Nioche commented on NUTCH-2064: -- What about moving the basic URL normalizer to Crawler-Commons

Re: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Julien Nioche
Hi Lewis I'd love to see https://issues.apache.org/jira/browse/NUTCH-1517 being part of 1.11. It is a separate indexing plugin which should not impact any existing code. It's been reviewed by Jorge and I'll commit it soon unless someone objects. Thanks J. On 26 August 2015 at 03:23, Lewis

Re: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Julien Nioche
Done. Thanks Markus On 26 August 2015 at 13:08, Markus Jelsma markus.jel...@openindex.io wrote: Yes Julien, please commit. I do think https://issues.apache.org/jira/browse/NUTCH-2064 should also be included. But i have my hands full atm. -Original message- From: Julien

[jira] [Resolved] (NUTCH-1517) CloudSearch indexer

2015-08-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1517. -- Resolution: Fixed trunk committed revision 1697911. Thanks for comments and review

[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2015-08-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712988#comment-14712988 ] Julien Nioche commented on NUTCH-1517: -- Thanks [~jorgelbg]. Will commit soon unless

[jira] [Resolved] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable

2015-08-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2049. -- Resolution: Fixed Committed revision 1697466. Thanks to everyone involved. Upgrade Trunk

[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable

2015-08-21 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706402#comment-14706402 ] Julien Nioche commented on NUTCH-2049: -- Fantastic work [~lewismc]! I think

[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1517: - Attachment: (was: NUTCH-1517.patch) CloudSearch indexer

[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1517: - Flags: Patch CloudSearch indexer --- Key: NUTCH-1517

[jira] [Updated] (NUTCH-1517) CloudSearch indexer

2015-08-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1517: - Attachment: NUTCH-1517.patch New implementation of the CloudSearchIndexWriter, uses the latest

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647467#comment-14647467 ] Julien Nioche commented on NUTCH-2069: -- Hi [~wastl-nagel] and [~markus17]. BTW did

[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646543#comment-14646543 ] Julien Nioche commented on NUTCH-2069: -- What code restyle? I applied the formatting

[jira] [Created] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2069: Summary: Ignore external links based on domain Key: NUTCH-2069 URL: https://issues.apache.org/jira/browse/NUTCH-2069 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2069: - Attachment: NUTCH-2069.patch Ignore external links based on domain

[jira] [Updated] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2069: - Patch Info: Patch Available Ignore external links based on domain

[jira] [Commented] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml

2015-07-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640138#comment-14640138 ] Julien Nioche commented on NUTCH-2048: -- howto_upgrade_tika.txt has been around for 2

[jira] [Assigned] (NUTCH-1517) CloudSearch indexer

2015-07-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-1517: Assignee: Julien Nioche CloudSearch indexer --- Key

[jira] [Commented] (NUTCH-2016) Remove OldFetcher from trunk

2015-06-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600946#comment-14600946 ] Julien Nioche commented on NUTCH-2016: -- +1 Remove OldFetcher from trunk

[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2036: - Affects Version/s: (was: 1.11) Adding some continuous crawl goodies to the crawl script

[jira] [Commented] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600949#comment-14600949 ] Julien Nioche commented on NUTCH-2036: -- Any thoughts on this? This is useful

[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2036: - Fix Version/s: 1.11 Adding some continuous crawl goodies to the crawl script

[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599840#comment-14599840 ] Julien Nioche commented on NUTCH-2046: -- re-script : what about a positive parameter

[jira] [Commented] (NUTCH-2000) Link inversion fails with .locked already exists.

2015-06-17 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589951#comment-14589951 ] Julien Nioche commented on NUTCH-2000: -- Hi Seb, +1 to commit. Not sure I'll be able

crawler-commons 0.6 released

2015-06-11 Thread Julien Nioche
[Apologies for cross posting] crawler-commons 0.6 is released. We are glad to announce the 0.6 release of Crawler Commons. See the CHANGES.txt (https://github.com/crawler-commons/crawler-commons/releases/tag/crawler-commons-0.6) file included with the release for a full list of details. We suggest

[jira] [Resolved] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-2006. -- Resolution: Fixed Fix Version/s: 1.11 Committed revision 1679567. Thanks Seb

[jira] [Commented] (NUTCH-2012) Merge parsechecker and indexchecker

2015-05-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545534#comment-14545534 ] Julien Nioche commented on NUTCH-2012: -- +1 to merging them into a more generic tool

[jira] [Commented] (NUTCH-2008) IndexerMapReduce to use single instance of NutchIndexAction for deletions

2015-05-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541843#comment-14541843 ] Julien Nioche commented on NUTCH-2008: -- Makes total sense. +1 Could also make

[jira] [Created] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-11 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2006: Summary: IndexingFiltersChecker to take custom metadata as input Key: NUTCH-2006 URL: https://issues.apache.org/jira/browse/NUTCH-2006 Project: Nutch Issue

[jira] [Updated] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2006: - Attachment: NUTCH-2006.patch Patch which allows to take custom metadata into account + improved

[jira] [Updated] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-2006: - Patch Info: Patch Available IndexingFiltersChecker to take custom metadata as input

[jira] [Updated] (NUTCH-1999) Add http://nutch.apache.org/robots.txt

2015-05-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1999: - Assignee: (was: Julien Nioche) Add http://nutch.apache.org/robots.txt
