Attaching Andrzej to this thread. As most of you know, Andrzej was the Nutch
PMC chair prior to me and a huge contributor to Nutch over the years. He
also works for Lucid.
Andrzej: would you mind telling us a bit about LW's crawler and why you
went for Aperture? Am I right in thinking that this has to do with the fact
that you needed to be able to pilot the crawl via a REST-like service?

Julien

On 3 October 2014 09:27, Lewis John Mcgibbney <[email protected]>
wrote:

> Hi Folks,
>
> On Thu, Oct 2, 2014 at 4:01 PM, <[email protected]> wrote:
>
> >
> > Hi, the new Fusion product from Lucidworks provides “advanced filesystem
> > and web crawlers”. Has anyone had time to check this out and see how it
> > compares to the current and future plans for Nutch?
>
>
> I am always disappointed (but never surprised) when people go and make
> their own crawlers, then run them on 'Hadoop'.
> Nutch is THE native Hadoop application... why people go and write their own
> is utterly beyond me. Maybe they like MatLab too much or something ;) ...
> or maybe modern fortran.
>
> I do not speak on behalf of the Nutch PMC, however what I will say is this.
> I know that there are many CIOs, CTOs as well as many engineers on this
> list and I know they are watching this thread. Nutch is a different product
> now than it was <1.5 years ago. The work that has been done is unparalleled
> in the Python community, and I make this statement boldly. From what I have
> seen, Nutch is the most comprehensive (if a bit challenging w.r.t.
> configuration) product out there for crawling. There are a number of issues
> to be addressed in Jira. We know this. But this still does not change my
> opinion on the software.
>
> I have been corrected before for making such statements, however
> my justification is as follows:
>
> * There is a HUGE difference between crawling and scraping.
> * There is a huge difference between leveraging Apache Tika within the
> Nutch framework for metadata augmentation of URLs over scraping.
> * There is a HUGE benefit to be obtained by utilising the Nutch community...
> which is sh*t hot in comparison to ~2-3 years ago. The same community has
> also ensured that Nutch has been making regular releases for a number of
> years now.
>
>
>
> > Just interested I personally haven’t been able to download the product
> > and test it but I’m a bit curious and I would appreciate your comments on
> > this topic.
> >
> >
> Hopefully the above gives my take on things. If LucidWorks have some magic
> sauce then great. Hopefully they consider bringing some of it back into
> Nutch rather than writing some Perl or Python scripts. I would never expect
> this to happen, however I am utterly depressed at how often I see this
> happening.
> Many software projects are failures.
> Nutch is not. It is a decade old.
> Nutch is a success.
>
>  hth
> Lewis
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
