Re: [DISCUSS] Release Trunk

2013-11-28 Thread Julien Nioche
Hi Lewis We've done quite a few things in 1.x since the previous release (e.g. generic deduplication, removing indexer.solr package, etc...) and the next 2.x release will be after the changes to GORA have been made, tested and used on the Nutch side so that could be quite a while. I am neutral

Re: Nutch with YARN (aka Hadoop 2.0)

2013-12-09 Thread Julien Nioche
I don't think Nutch has been fully ported to the new mapreduce API which is a prerequisite for running it on Hadoop 2. I can't think of a reason why the performance would be any different with Yarn. Julien On 9 December 2013 06:42, Tejas Patil tejas.patil...@gmail.com wrote: Has anyone

Re: Nightly builds

2014-01-08 Thread Julien Nioche
Great stuff, thanks Lewis On 8 January 2014 12:00, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Folks, On Wed, Jan 8, 2014 at 4:06 AM, dev-digest-h...@nutch.apache.org wrote: I'm working on getting the Jenkins job configuration stable again. Something seems to have been reset

Re: Renovating Nutch Hadoop Tutorial wiki page

2014-01-21 Thread Julien Nioche
Hi The whole thing has been replaced with http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial which does exactly what you described. +1 to remove the old nutchhadooptutorial page J. On 21 January 2014 17:44, Tejas Patil

Re: Renovating Nutch Hadoop Tutorial wiki page

2014-01-22 Thread Julien Nioche
wrote: Actually what I would like to see is a Nutch 2.x tutorial at the same level of detail as the http://wiki.apache.org/nutch/NutchHadoopTutorial What is the process of contributing to that wiki page? On Tue, Jan 21, 2014 at 9:33 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi

Nutch meetup / hackathon at BerlinBuzzwords next May?

2014-01-24 Thread Julien Nioche
Hi guys, I'll certainly be at BerlinBuzzwords and have submitted a talk on Nutch. What about having a Nutch meetup / hackathon / workshop? http://berlinbuzzwords.de/news/hackathons-workshops-berlin-buzzwords Julien -- Open Source Solutions for Text Engineering

Re: [DISCUSS] Release Trunk

2014-02-12 Thread Julien Nioche
) and probably a few others but they could also be done later. At least, these should be done before releasing: NUTCH-1646 IndexerMapReduce to consider DB status NUTCH-1413 Record response time Sebastian On 11/28/2013 05:49 PM, Julien Nioche wrote: Hi Lewis We've done quite a few things

Re: [DISCUSS] Release 1.8?

2014-03-11 Thread Julien Nioche
+1 Thanks for your work on these issues guys! Julien On 11 March 2014 18:24, Markus Jelsma markus.jel...@openindex.io wrote: Yes! Agreed! Sebastian Nagel wastl.na...@googlemail.com wrote: Hi everyone, NUTCH-1113 and NUTCH-1706 are fixed, broken HostDb (NUTCH-1325) has been removed

Re: Who is moderating Nutch lists?

2014-03-13 Thread Julien Nioche
I don't think these lists are moderated. Don't think they should be either J On Thursday, 13 March 2014, Markus Jelsma markus.jel...@openindex.io wrote: Well, that's not me, perhaps Chris? -Original message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Sent:

Re: [VOTE] Release Apache Nutch 1.8RC#2

2014-03-16 Thread Julien Nioche
+1 from me. Thanks everyone On Sunday, 16 March 2014, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: +1 from me! SIGS pass, CHECKSUMS pass: [chipotle:~/tmp/apache-nutch-1.8] mattmann% $HOME/bin/stage_apache_rc apache-nutch 1.8-bin

Re: Pushing content to Solr from Nutch

2014-04-10 Thread Julien Nioche
Hi Xavier Your config file looks a bit outdated. Here are the values set by default (see http://svn.apache.org/repos/asf/nutch/trunk/conf/nutch-default.xml) <property> <name>plugin.includes</name>

Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Julien Nioche
I'd exclude NUTCH-1741 for now and focus on the core updates (GORA, filters, etc...). See comments on NUTCH-1714 https://issues.apache.org/jira/browse/NUTCH-1714 On 1 May 2014 07:27, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alparslan Folks, OK so you can see the road map's

Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Julien Nioche
2014 08:40, Talat Uyarer ta...@uyarer.com wrote: I agree with you Julien. Today Lewis changed some issues' fix version from 2.3 to 2.4. Most of my issues :) May I ask, if I update these issues, can I change the fix version to 2.3? I need them. Thanks Talat 2014-05-01 9:47 GMT+03:00 Julien Nioche

Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Julien Nioche
2.x a lot faster. We haven't released 2.x for some time and loads of interesting stuff has been done to it. It will be an exciting release! Thanks for your contributions and pushing things forward! Julien 2014-05-01 11:32 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com: Hi Talat

Re: Post process Nutch data

2014-05-05 Thread Julien Nioche
Hi As mentioned earlier in a different discussion on this list behemoth would be the right tool for this Julien On Monday, 5 May 2014, Srikanth Shankara Rao srikant...@aditi.com wrote: Hi All, I have crawled Nutch data using 1.8. Data is in HDFS. I would like to post-process this data

Re: Creating Windows bash files for nutch

2014-05-18 Thread Julien Nioche
Hi Currently nutch isn't very friendly to windows users as it requires cygwin to run and there are a lot of issues with Hadoop 1.x branch, which nutch bundles with it, due to the set tmp permission issue. What do you think about doing two things: 1. Move to Hadoop 2.4 to support

Re: Creating Windows bash files for nutch

2014-05-18 Thread Julien Nioche
meant writing batch/cmd scripts for windows that don't require Cygwin. I was thinking of writing those scripts but wanted to check if people think it's a good idea. On Sunday, May 18, 2014, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Currently nutch isn't very friendly to windows

Nutch survey

2014-05-21 Thread Julien Nioche
Hi everyone! I had written a survey about Nutch and its uses and would be very grateful if you could take a couple of minutes to contribute : https://docs.google.com/forms/d/15Jg7dGoU2I1aHur3g5ia9qshCMES8hB1OLpf5q6sGXg/viewform This should help getting a clearer picture of the wider Nutch

Re: Nutch survey

2014-05-22 Thread Julien Nioche
. For those of you who haven't done the survey yet, please do take part. It will definitely help getting a better picture of who we are / what we do as a community. The survey will be online for a few weeks. Thanks Julien On 21 May 2014 16:07, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi

Re: Nutch survey

2014-05-27 Thread Julien Nioche
useful for getting a clearer picture of who we are as a community, what we like or not with Nutch etc... Survey = https://t.co/Xod5Z3Mm5E Please RT : https://twitter.com/digitalpebble/status/469130285284466688 Thanks Julien On 22 May 2014 08:10, Julien Nioche lists.digitalpeb...@gmail.com wrote

ApacheCon CFP closes June 25

2014-06-10 Thread Julien Nioche
Dear Nutch enthusiast, As you may be aware, ApacheCon will be held this year in Budapest, on November 17-23. (See http://apachecon.eu for more info.) The Call For Papers for that conference is still open, but will be closing soon. We need your talk proposals, to represent Nutch at ApacheCon. We

Travel assistance for ApacheCon EU, Budapest November 17-21 2014

2014-06-11 Thread Julien Nioche
The Travel Assistance Committee (TAC) is happy to announce that we now accept applications for ApacheCon Europe 2014, 17-21 November in Budapest, Hungary Applications are welcome from individuals within the Apache community at-large, users, developers, educators, students, Committers, and Members,

Re: nutch elpais.com

2014-06-16 Thread Julien Nioche
Salut Yann, Not really answering your question but where did you get this config from? Some of its elements have been long deprecated (query-*, response-*, summary-*) Julien On 15 June 2014 10:20, Yann Levreau yann.levr...@gmail.com wrote: hi everyone ! I'm sorry to disturb you but i need

Version of Java in Jenkins

2014-06-17 Thread Julien Nioche
Lewis, https://issues.apache.org/jira/browse/NUTCH-1590 requires Java 1.7 for building the Javadoc. Does something need changing in Jenkins? BTW is there a WIKI page somewhere on how to configure Jenkins? Thanks Julien -- Open Source Solutions for Text Engineering

Re: Nutch Extension for realtime processing

2014-06-18 Thread Julien Nioche
Hi Jake Great to hear about your ideas. Sounds like what you are proposing would be only near realtime as much would depend on the generation which is a batch step. How / when would the update step be called? Would this be a fetcher only i.e. does not recursively discover links. If so why not

Re: Nutch Extension for realtime processing

2014-06-19 Thread Julien Nioche
to be relatively trivial. The storm-crawler project looks neat! We’ve contemplated building something similar that would reuse elements from Nutch where possible. Cheers Jake On Jun 18, 2014, at 1:34 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Jake Great to hear about your

Nearing a 1.9 release?

2014-06-29 Thread Julien Nioche
Hi guys, We've done loads of good work on the trunk since the last release, in particular : - NUTCH-1736 https://issues.apache.org/jira/browse/NUTCH-1736 - NUTCH-1647 https://issues.apache.org/jira/browse/NUTCH-1647 - NUTCH-1793 https://issues.apache.org/jira/browse/NUTCH-1793 which

Re: Nearing a 1.9 release?

2014-07-07 Thread Julien Nioche
%20Unresolved%20ORDER%20BY%20updated%20DESC and change their fix version back to 1.9 if you think they should be included in the next release. Thanks Julien On 29 June 2014 10:20, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi guys, We've done loads of good work on the trunk since

[VOTE] Remove pom.xml from source

2014-07-15 Thread Julien Nioche
Hi, One of the frequent issues on the mailing list / JIRA is that users can be led to think that Nutch is built with Maven as they can see what looks like a perfectly valid pom.xml at the root of the project. It becomes clearer when reading the WIKI or FAQ that ANT should be used instead but it

Re: [VOTE] Remove pom.xml from source

2014-07-15 Thread Julien Nioche
good with maven. Talat 2014-07-15 13:36 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com: Hi, One of the frequent issues on the mailing list / JIRA is that users can be led to think that Nutch is built with Maven as they can see what looks like a perfectly valid pom.xml at the root

Re: [DISCUSS] [VOTE] Remove pom.xml from source

2014-07-15 Thread Julien Nioche
., the developer list, etc. So, we need the pom.xml as the template that has that stuff, until someone cooks up a XSL combining solution with that original template and then what ant deploy spits out, no? Cheers, Chris -Original Message- From: Julien Nioche lists.digitalpeb

Re: Problems running some ant targets on recent trunk

2014-07-17 Thread Julien Nioche
Hi This is probably due to some of the recent changes I made e.g. https://issues.apache.org/jira/browse/NUTCH-1804 I'll have a look at this. Thanks Julien On 16 July 2014 23:10, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi, I have some problems running ant targets on recent

Re: Problems running some ant targets on recent trunk

2014-07-17 Thread Julien Nioche
In this case it is the target compile-test of lib-regex-filter which fails. Should it be really called for target runtime? <target name="deps-jar"> <ant target="jar" inheritall="false" dir="../lib-regex-filter"/> <ant target="compile-test" inheritall="false" dir="../lib-regex-filter"/> </target>

Re: Problems running some ant targets on recent trunk

2014-07-21 Thread Julien Nioche
for reporting it. On 17 July 2014 10:18, Julien Nioche lists.digitalpeb...@gmail.com wrote: In this case it is the target compile-test of lib-regex-filter which fails. Should it be really called for target runtime? <target name="deps-jar"> <ant target="jar" inheritall="false" dir="../lib-regex-filter

Re: Push Nutch 1.9

2014-08-07 Thread Julien Nioche
Lewis, Any chance you'd have time to spin a RC? Thanks Julien On 30 July 2014 21:14, Sebastian Nagel wastl.na...@googlemail.com wrote: +1 sebastian 2014-07-30 10:56 GMT+02:00 Julien Nioche lists.digitalpeb...@gmail.com: Hi Lewis https

Re: [VOTE] Apache Nutch 1.9 Release Candidate #1

2014-08-13 Thread Julien Nioche
Hi, +1 to release. Compilation and tests run fine. Signatures look good. Thanks Lewis! Julien On 13 August 2014 06:32, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: VOTE'ing will be open for 'at-least' 72 hours to allow people enough time to cast their VOTE's. Thanks Lewis On

Re: Incorrect download links for Nutch-1.9

2014-08-28 Thread Julien Nioche
Thanks for reporting this Jake, I'll fix this tomorrow (unless a fellow committer beats me to it) Julien On 27 August 2014 17:37, Jake Dodd j...@ontopic.io wrote: Hi all, I noticed that following the download links for Nutch 1.9 (from http://nutch.apache.org/downloads.html) takes users to

Re: Title of the page Version Control

2014-08-28 Thread Julien Nioche
Thanks for reporting this Alfonso, I'll fix this tomorrow (unless a fellow committer beats me to it) Julien On 28 August 2014 10:13, Alfonso Nishikawa alfonso.nishik...@gmail.com wrote: Greetings, I found that the page https://nutch.apache.org/version_control.html states in its title:

Re: Incorrect download links for Nutch-1.9

2014-08-29 Thread Julien Nioche
Thanks Lewis On 28 August 2014 22:41, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Jake, Thank you so much for reporting. Fixed. Thank you, have a great day. Lewis On Wed, Aug 27, 2014 at 9:37 AM, dev-digest-h...@nutch.apache.org wrote: Hi all, I noticed that following

Re: Jump to 3.X WAS [RELEASE] Apache Nutch 1.9

2014-09-01 Thread Julien Nioche
Hi chaps, -1 from me. IMHO moving the trunk code to 3.x does not really solve the issue. I'd rather make it more explicit that the standard Nutch (1.x) and Nutch-GORA (2.x) are two separate beasts for instance by referring to 2.x as Nutch-GORA in the artifacts we release. This way users won't

Re: Jump to 3.X WAS [RELEASE] Apache Nutch 1.9

2014-09-01 Thread Julien Nioche
.x with 2.x Changing to 3.x would imply a major change of architecture or functionality, which certainly won't be the case for the next release of the trunk. I agree with Julien. IMHO Opinion We do not need any changes. Talat 2014-09-01 12:23 GMT+03:00 Julien Nioche lists.digitalpeb

Re: Jump to 3.X WAS [RELEASE] Apache Nutch 1.9

2014-09-01 Thread Julien Nioche
Let's wait a couple of weeks before voting on this. I know Sebastian is on holiday until the 12th and there might be more people in this case. On 1 September 2014 17:34, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Julien, -Original Message- From: Julien Nioche

Re: Nutch won't fetch the whole page if the Transfer Encoding is chunked

2014-09-17 Thread Julien Nioche
Hi Isn't that an effect of <property> <name>http.content.limit</name> <value>65536</value> <description>The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse
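
A minimal sketch of the truncation rule stated in the http.content.limit description above, written in Python purely for illustration. The function name apply_content_limit is hypothetical and this is not Nutch's own code; only the property name, the 65536 default and the nonnegative/negative behaviour come from the quoted description.

```python
def apply_content_limit(content: bytes, limit: int = 65536) -> bytes:
    """Illustrative only: mimic the rule described for http.content.limit.

    A nonnegative limit truncates content longer than the limit;
    a negative limit means no truncation at all.
    """
    if limit >= 0 and len(content) > limit:
        # Keep only the first `limit` bytes of the downloaded content.
        return content[:limit]
    return content
```

Under the rule as described, a page larger than the limit would come back cut off at that many bytes, which would match the symptom in the thread subject; a negative value would leave the page whole.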

Re: Generic xsl parser plugin

2014-09-25 Thread Julien Nioche
Hi Albin, You don't have to have a separate plugin for each html structure you want to parse. You can have a single plugin with multiple HTMLParseFilters. Having a generic extractor with the extraction logic configured in an external file is definitely a good idea and would make a great
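
One way to picture the "generic extractor with the extraction logic configured in an external file" idea is the Python sketch below. It is an assumption-laden illustration, not Nutch code and not an HTMLParseFilter: the rule format (JSON mapping field names to regular expressions), the names RULES_JSON, load_rules and extract_fields, and the use of regexes instead of a real HTML parser are all hypothetical simplifications.

```python
import json
import re

# Hypothetical external rules: field name -> regex with one capture group.
# In practice this text would be read from a configuration file.
RULES_JSON = '{"title": "<title>(.*?)</title>", "heading": "<h1[^>]*>(.*?)</h1>"}'


def load_rules(rules_text: str) -> dict:
    """Compile each configured pattern once, keyed by output field name."""
    flags = re.IGNORECASE | re.DOTALL
    return {field: re.compile(pattern, flags)
            for field, pattern in json.loads(rules_text).items()}


def extract_fields(html: str, rules: dict) -> dict:
    """Apply every rule to the page; fields with no match are omitted."""
    fields = {}
    for field, pattern in rules.items():
        match = pattern.search(html)
        if match:
            fields[field] = match.group(1).strip()
    return fields
```

The point of the design is that supporting a new page structure means adding a rule to the file rather than writing another plugin; a real implementation would more likely use XPath or XSL rules over a parsed DOM than regular expressions.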

Re: Generic xsl parser plugin

2014-09-26 Thread Julien Nioche
going to use this? On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Albin, You don't have to have a separate plugin for each html structure you want to parse. You can have a single plugin with multiple HTMLParseFilters. Having a generic extractor

Re: GSoC 2015

2015-02-04 Thread Julien Nioche
Moving to Hadoop 2.x ? On 4 February 2015 at 14:42, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Folks, Does anyone have any good ideas for GSoC? Seb mentioned moving Nutch towards Spark so potentially a pluggable runtime execution engine abstraction? I am currently working on

Re: [ANNOUNCE] New Nutch committer and PMC - Jorge Luis Betancourt Gonzalez

2015-02-19 Thread Julien Nioche
Congratulations and welcome Jorge! Great to have you with us Julien On 19 February 2015 at 17:20, Sebastian Nagel wastl.na...@googlemail.com wrote: Dear all, on behalf of the Nutch PMC it is my pleasure to announce that Jorge Luis Betancourt Gonzalez has been voted in as committer and

Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format

2015-03-02 Thread Julien Nioche
as an extension of IndexWriter? See [https://issues.apache.org/jira/browse/NUTCH-1949?focusedCommentId=14336272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14336272] - Julien Nioche On March 2, 2015, 5:58 p.m., Giuseppe Totaro wrote

Re: Unsubscribe

2015-02-26 Thread Julien Nioche
Massimo, http://nutch.apache.org/mailing_lists.html = dev-unsubscr...@nutch.apache.org Thanks On 26 February 2015 at 19:11, Massimo Miccoli mmicc...@iltrovatore.it wrote: Massimo On 26 Feb 2015, at 19:31, lewi...@apache.org wrote: Author: lewismc Date: Thu Feb 26

Re: [ANNOUNCE] New Nutch committer and PMC - Mo Omer

2015-03-23 Thread Julien Nioche
Welcome Mo! On 22 March 2015 at 19:31, Markus Jelsma markus.jel...@openindex.io wrote: Welcome Mohammad! -Original message- From: Mohammed Omerbeancinemat...@gmail.com Sent: Sunday 22nd March 2015 18:55 To: u...@nutch.apache.org Cc: dev@nutch.apache.org Subject: Re: [ANNOUNCE]

Re: [ANNOUNCE] New Nutch committer and PMC - Giuseppe Totaro

2015-04-26 Thread Julien Nioche
Congrats and welcome Giuseppe! On 25 April 2015 at 22:43, Giuseppe Totaro totarope...@gmail.com wrote: Thanks a lot Sebastian. I am very proud to be part of this project as committer and member of the Nutch PMC. I am working on Information Retrieval at scale under the supervision of

Re: [VOTE] Release Apache Nutch 1.10

2015-04-30 Thread Julien Nioche
Thanks Lewis +1 : compiled on Linux + ran a small crawl and indexed with ES j On 29 April 2015 at 22:54, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi user@ dev@, This thread is a VOTE for releasing Apache Nutch 1.10. The release candidate comprises the following components. * A

crawler-commons 0.6 released

2015-06-11 Thread Julien Nioche
[Apologies for cross posting] crawler-commons 0.6 is released We are glad to announce the 0.6 release of Crawler Commons. See the CHANGES.txt https://github.com/crawler-commons/crawler-commons/releases/tag/crawler-commons-0.6 file included with the release for a full list of details. We suggest

Re: [VOTE] Apache Nutch 1.11 Release Candidate #1

2015-10-26 Thread Julien Nioche
Chris -1 We usually release tar.gz as well as zip. More importantly we need to release the sources as well as the binary. We can't even test that it compiles OK Since you released Tika, why don't we include it before cutting 1.11? Thanks Julien On 26 October 2015 at 05:53, Mattmann, Chris

Re: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Julien Nioche
Hi Lewis I'd love to see https://issues.apache.org/jira/browse/NUTCH-1517 being part of 1.11. It is a separate indexing plugin which should not impact any existing code. It's been reviewed by Jorge and I'll commit it soon unless someone objects. Thanks J. On 26 August 2015 at 03:23, Lewis

Re: [DISCUSS] Release Nutch trunk 1.11

2015-08-26 Thread Julien Nioche
Done. Thanks Markus On 26 August 2015 at 13:08, Markus Jelsma markus.jel...@openindex.io wrote: Yes Julien, please commit. I do think https://issues.apache.org/jira/browse/NUTCH-2064 should also be included. But i have my hands full atm. -Original message- From: Julien

Re: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-10 Thread Julien Nioche
Congratulations Asitang and welcome! Julien On 9 September 2015 at 23:01, Sebastian Nagel wrote: > Dear all, > > on behalf of the Nutch PMC it is my pleasure to announce > that Asitang Mishra has joined the Nutch team as committer > and PMC member. Asitang, please

Re: Nutch not recognizing html pages/images retrieved via php

2015-10-05 Thread Julien Nioche
Hi What happens is that parse-tika is used by default but doesn't know what to do with that mime type. You can edit parse-plugins.xml and add to map the mime type to the html parser. Obviously you'll need parse-html to be

Fwd: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

2015-09-18 Thread Julien Nioche
Nutch people, Just in case you missed the announcement below. As you probably know CC use Nutch for their crawls, this is a fantastic opportunity to put your Nutch skills to great use! Julien -- Forwarded message -- From: Sara Crouse Date: 17 September

Webcast : Apache Nutch on EMR

2015-09-23 Thread Julien Nioche
Hi again, I have uploaded a webcast explaining how to run Nutch on AWS Elastic Map Reduce https://www.youtube.com/watch?v=v9zjcTjjjyU Please excuse the sound quality, hesitations and stuttering. I hope you find it useful nonetheless. Julien -- *Open Source Solutions for Text Engineering*

Tutorial : Index the web with AWS CloudSearch

2015-09-23 Thread Julien Nioche
Hi everyone, Just to let you know that we've just published a new tutorial on how to use Nutch (and StormCrawler) to crawl and index documents into AWS CloudSearch. This is related to the recent addition of NUTCH-1517 in the trunk codebase. The

Re: [RELEASE] Apache Nutch 1.11

2015-12-08 Thread Julien Nioche
Thanks Lewis for taking care of the release and everyone involved. Julien On 8 December 2015 at 01:34, lewis john mcgibbney wrote: > Hello Folks, > > 07 December 2015 - Nutch 1.11 Release > > The Apache Nutch PMC are pleased to announce the immediate release of > Apache

Re: [VOTE] Release Apache Nutch 1.11 RC#2

2015-12-05 Thread Julien Nioche
+1 Thanks Lewis On 4 December 2015 at 18:03, Lewis John Mcgibbney wrote: > Hi Folks, > > A second candidate for the Nutch 1.11 release is available at: > > https://dist.apache.org/repos/dist/dev/nutch/1.11rc2/ > > The release candidate consists of zip and tar

Re: [VOTE] Moving to Git

2016-01-08 Thread Julien Nioche
+1 to move to Git Note : I don't think Dennis is on the PMC anymore Ju On 8 January 2016 at 08:46, Chris Mattmann wrote: > Hi Everyone, > > I proposed this earlier, and we said we’d wait until after the > 1.11 release. So it’s time to VOTE to move Nutch to Git. So > far,

Re: [VOTE] Release Apache Nutch 1.12

2016-06-15 Thread Julien Nioche
+1 Thanks Lewis and team! On 15 June 2016 at 06:14, lewis john mcgibbney wrote: > Hi Folks, > > A first candidate for the Nutch 1.12 release is available at: > > https://dist.apache.org/repos/dist/dev/nutch/1.12/ > > The release candidate is a zip and tar archive of the

ApacheCon EU Sevilla

2016-06-29 Thread Julien Nioche
Hi, Sorry for cross posting. As you are probably aware, the ApacheCon Europe, and Apache Big Data conferences will take place in Seville, Spain, November 14-18, 2016. http://events.linuxfoundation.org/events/apache-big-data-europe/ I just submitted a talk on StormCrawler

Crawler-Commons 0.7 released

2016-11-24 Thread Julien Nioche
Apologies for cross-posting The Common-Crawl project is pleased to announce its 0.7 release. https://github.com/crawler-commons/crawler-commons#24th-november-2016crawler-commons-07-released The list of changes can be found here

Re: [VOTE] Release Apache Nutch 1.13 RC#1

2017-03-29 Thread Julien Nioche
Hi Lewis +1 compiled from source and ran a small crawl in local mode. All good! Thanks Julien On 29 March 2017 at 05:20, lewis john mcgibbney wrote: > Hi Folks, > > A first candidate for the Nutch 1.13 release is available at: > >

Crawler-Commons 0.8 released

2017-06-09 Thread Julien Nioche
Apologies for cross-posting The Common-Crawl project is pleased to announce its 0.8 release. https://github.com/crawler-commons/crawler-commons/releases/tag/crawler-commons-0.8 If you are wondering what

Re: Establishment of Static Source Code Analysis

2017-06-16 Thread Julien Nioche
> > Russian compatriots Are we all Russian then? On 16 June 2017 at 04:29, lewis john mcgibbney wrote: > Hi Folks, > I don't know if anyone else noticed... some of our Russian compatriots > have set up a static auto bot to notify us of source code issues... > An example

Re: Establishment of Static Source Code Analysis

2017-06-16 Thread Julien Nioche
<https://github.com/crawler-commons/crawler-commons/pull/127>. On 16 June 2017 at 08:55, Julien Nioche <lists.digitalpeb...@gmail.com> wrote: > Russian compatriots > > > Are we all Russian then? > > On 16 June 2017 at 04:29, lewis john mcgibbney <lewi...@apache.org

Re: [DISCUSS] Release 1.14?

2017-12-14 Thread Julien Nioche
happens this week, I'll make sure that it's included. > > Thanks, > Sebastian > > > On 12/11/2017 10:22 AM, Julien Nioche wrote: > > Tika 1.17 will be released shortly, maybe it would be worth waiting a > bit and integrate it first? > > > > On 8 December 2

Re: [VOTE] Release Apache Nutch 1.14 RC#1

2017-12-19 Thread Julien Nioche
+1 to release, thanks Seb On 18 December 2017 at 22:12, Sebastian Nagel wrote: > Hi Folks, > > A first candidate for the Nutch 1.14 release is available at: > > https://dist.apache.org/repos/dist/dev/nutch/1.14/ > > The release candidate is a zip and tar.gz archive

Re: [DISCUSS] Release 1.14?

2017-12-11 Thread Julien Nioche
Tika 1.17 will be released shortly, maybe it would be worth waiting a bit and integrate it first? On 8 December 2017 at 22:53, Sebastian Nagel wrote: > Hi all, > > 50+ issues fixed > https://issues.apache.org/jira/projects/NUTCH/versions/12340218 > > Of course, as

Crawler-Commons 0.9 released

2017-10-31 Thread Julien Nioche
Happy Halloween! We are glad to announce the 0.9 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main changes are the removal of DOM-based

Crawler-Commons 0.10 released

2018-06-07 Thread Julien Nioche
Hi We are glad to announce the 0.10 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. This version contains among other things improvements to the

Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Julien Nioche
What a fantastic addition to the Nutch team! Congrats to Tim On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel wrote: > Dear all, > > It is my pleasure to announce that Tim Allison has joined us > as a committer and member of the Nutch PMC. > > You may already know Tim as a maintainer of and

[jira] Commented: (NUTCH-826) Mailing list is broken.

2010-05-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870528#action_12870528 ] Julien Nioche commented on NUTCH-826: - Nutch has recently become a TLP and some

[jira] Resolved: (NUTCH-826) Mailing list is broken.

2010-05-24 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-826. - Fix Version/s: 1.1 Resolution: Fixed Committed revision 947569. The changes should

[jira] Commented: (NUTCH-828) Fetch Filter

2010-06-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876576#action_12876576 ] Julien Nioche commented on NUTCH-828: - Shall we postpone this after the release of 1.1

[jira] Updated: (NUTCH-830) ScoringFilter to restrict the crawl to the hosts/domains listed in the seeds

2010-06-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-830: Attachment: NUTCH-830.patch ScoringFilter to restrict the crawl to the hosts/domains listed

[jira] Closed: (NUTCH-834) Separate the Nutch web site from trunk

2010-06-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-834. --- Resolution: Fixed Committed revision 959228. Thanks Chris for your comments and help

[jira] Commented: (NUTCH-650) Hbase Integration

2010-06-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883880#action_12883880 ] Julien Nioche commented on NUTCH-650: - The patch has been committed with revision

[jira] Updated: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-836: Attachment: NUTCH-836.patch Remove deprecated parse plugins

[jira] Created: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)
: Julien Nioche Assignee: Julien Nioche Fix For: 2.0 Attachments: NUTCH-836.patch Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely on parse-tika almost

[jira] Commented: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883891#action_12883891 ] Julien Nioche commented on NUTCH-836: - Actually creative-commons + languageidentifier

[jira] Updated: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-836: Description: Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These plugins

[jira] Created: (NUTCH-837) Remove search servers and Lucene dependencies

2010-06-30 Thread Julien Nioche (JIRA)
Affects Versions: 1.1 Reporter: Julien Nioche Fix For: 2.0 One of the main aspects of 2.0 is the delegation of the indexing and search to external resources like SOLR. We can simplify the code a lot by getting rid of the : * search servers * indexing and analysis

[jira] Updated: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-836: Attachment: (was: NUTCH-836.patch) Remove deprecated parse plugins

[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature

2010-07-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884624#action_12884624 ] Julien Nioche commented on NUTCH-835: - This patch has been marked for 1.2 but has been

[jira] Created: (NUTCH-840) Port tests from parse-html to parse-tika

2010-07-02 Thread Julien Nioche (JIRA)
Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 2.0 We don't have test for HTML in parse-tika so I'll copy them from the old parse-html plugin -- This message is automatically generated by JIRA. - You can reply to this email to add a comment

[jira] Updated: (NUTCH-840) Port tests from parse-html to parse-tika

2010-07-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-840: Attachment: NUTCH-840.patch Patch which adds the HTML tests to the Tika Parser The tests currently

[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884671#action_12884671 ] Julien Nioche commented on NUTCH-837: - I think we can also get rid of : * docs/ * WAR

[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884734#action_12884734 ] Julien Nioche commented on NUTCH-837: - :-) Remove search servers and Lucene

[jira] Updated: (NUTCH-821) Use ivy in nutch builds

2010-07-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-821: Attachment: NUTCH-821.patch Adds IVY support for dependencies The lib/. dir is maintained

[jira] Resolved: (NUTCH-791) External links for published javadocs are partially broken

2010-07-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-791. - Fix Version/s: 1.1 Resolution: Duplicate Duplicates 790? External links for published

[jira] Commented: (NUTCH-821) Use ivy in nutch builds

2010-07-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885207#action_12885207 ] Julien Nioche commented on NUTCH-821: - {QUOTE} I think this patch refers to some parts

[jira] Commented: (NUTCH-821) Use ivy in nutch builds

2010-07-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885244#action_12885244 ] Julien Nioche commented on NUTCH-821: - I found [http://ant.apache.org/ivy/ivyde/] which

[jira] Commented: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885260#action_12885260 ] Julien Nioche commented on NUTCH-696: - +1 : this is definitely useful. Hopefully

[jira] Issue Comment Edited: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885260#action_12885260 ] Julien Nioche edited comment on NUTCH-696 at 7/5/10 11:13 AM
