Attaching Andrzej to this thread. As most of you know, Andrzej was the Nutch PMC chair prior to me and a huge contributor to Nutch over the years. He also works for Lucid. Andrzej: would you mind telling us a bit about LW's crawler and why you went for Aperture? Am I right in thinking that this has to do with the fact that you needed to be able to pilot the crawl via a REST-like service?
Julien

On 3 October 2014 09:27, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Folks,
>
> On Thu, Oct 2, 2014 at 4:01 PM, <[email protected]> wrote:
>
> >
> > Hi, the new Fusion product from Lucidworks provides “advanced filesystem
> > and web crawlers”. Has anyone had time to check this out, and how does it
> > compare to the current and future plans with Nutch?
> >
>
> I am always disappointed (but never surprised) when people go and make
> their own crawlers, then run them on 'Hadoop'.
> Nutch is THE native Hadoop application... why people go and write their own
> is utterly beyond me. Maybe they like MatLab too much or something ;) ...
> or maybe modern Fortran.
>
> I do not speak on behalf of the Nutch PMC, however what I will say is this.
> I know that there are many CIOs, CTOs as well as many engineers on this
> list and I know they are watching this thread. Nutch is a different product
> now than it was <1.5 years ago. The work that has been done is unparalleled
> in the Python community, and I make this statement boldly. From what I have
> seen, Nutch is the most comprehensive (if a bit challenging w.r.t.
> configuration) product out there for crawling. There are a number of issues
> to be addressed in Jira. We know this. But this still does not change my
> opinion on the software.
>
> I have been corrected before for making such statements, however
> my justification is as follows:
>
> * There is a HUGE difference between crawling and scraping.
> * There is a huge difference between leveraging Apache Tika within the
> Nutch framework for metadata augmentation of URLs and scraping.
> * There is a HUGE benefit to be obtained by utilising the Nutch community...
> which is sh*t hot in comparison to ~2-3 years ago. The same community has
> also ensured that Nutch has been making regular releases for a number of
> years now.
>
> >
> > Just interested. I personally haven’t been able to download the product and
> > test it, but I’m a bit curious and I would appreciate your comments on this
> > topic.
> >
>
> Hopefully the above conveys my take on things. If LucidWorks have some magic
> sauce then great. Hopefully they consider bringing some of it back into
> Nutch rather than writing some Perl or Python scripts. I would never expect
> this to happen, however I am utterly depressed at how often I see this
> happening.
> Many software projects are failures.
> Nutch is not. It is a decade old.
> Nutch is a success.
>
> hth
> Lewis
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

