Thanks for the explanations Andrzej and Grant!
Great to hear that you are using stuff from crawler-commons.

Julien

On 6 October 2014 14:47, Andrzej Białecki <[email protected]> wrote:

>
> On 03 Oct 2014, at 12:44, Julien Nioche <[email protected]>
> wrote:
>
> > Attaching Andrzej to this thread. As most of you know Andrzej was the
> Nutch PMC chair prior to me and a huge contributor to Nutch over the years.
> He also works for Lucid.
> > Andrzej : would you mind telling us a bit about LW's crawler and why you
> went for Aperture? Am I right in thinking that this has to do with the fact
> that you needed to be able to pilot the crawl via a REST-like service?
> >
>
>
> Hi Julien, and the Nutch community,
>
> It’s been a while. :)
>
> First, let me clarify a few issues:
>
> * indeed I now work for Lucidworks and I’m involved in the design and
> implementation of the connectors framework in the Lucidworks Fusion product.
>
> * the connectors framework in Fusion allows us to integrate wildly
> different third-party modules, e.g. we have connectors based on GCM, Hadoop
> map-reduce, databases, local files, remote filesystems, repositories, etc.
> In fact, it’s relatively straightforward to integrate Nutch with this
> framework, and we actually provide docs on how to do this, so nothing stops
> you from using Nutch if it fits the bill.
>
> * this framework provides a uniform REST API to control the processing
> pipeline for documents collected by connectors, and in most cases to manage
> the crawlers’ configurations and processes. Only the first part is in place
> for the integration with Nutch, i.e. configuration and jobs have to be
> managed externally, and only the processing and content enrichment is
> controlled by Lucidworks Fusion. If we get a business case that requires a
> tighter integration I’m sure we will be happy to do it.
>
> * the previous generation of Lucidworks products (called “LucidWorks
> Search”, or LWS for short) used Aperture as a Web crawler. This was a legacy
> integration and while it worked fine for what it was originally intended,
> it definitely had some painful limitations, not to mention the fact that
> the Aperture project is no longer active.
>
> * the current version of the product DOES NOT use Aperture for web
> crawling. It uses a web- and file-crawler implementation created in-house -
> it re-uses some code from crawler-commons, with some insignificant
> modifications.
>
> * our content processing framework uses many Open Source tools (among them
> Tika, OpenNLP, Drools, of course Solr, and many others), on top of which
> we’ve built a powerful system for content enrichment, event processing and
> data analytics.
>
> So, those are the facts. Now, let’s move on to opinions ;)
>
> There are many different use cases for web/file crawling and many
> different scalability and content processing requirements. So far the
> target audience for Lucidworks Fusion required small- to medium-scale web
> crawls, but with sophisticated content processing, extensive controls over
> the crawling frontier (handling sessions for depth-first crawls, cookies,
> form logins, etc) and easy management / control of the process over REST /
> UI. In many cases also the effort to set up and operate a Hadoop cluster
> was deemed too high or irrelevant to the core business. And in reality, as
> you know, there are workload sizes for which Hadoop is a total overkill and
> the roundtrip for processing is on the order of several minutes instead of
> seconds.
>
> For these reasons we wanted to provide a web crawler that is
> self-contained, lean, doesn’t require Hadoop, and scales well from small
> to mid-size workloads without Hadoop’s overhead, while at the same time
> offering an easy way to integrate a high-scale crawler like Nutch for
> customers that need it - and for such customers we DO recommend Nutch as
> the best high-scale crawler. :)
>
> So, in my opinion Lucidworks Fusion satisfies these goals, and provides a
> reasonable tradeoff between ease of use, scalability, rich content
> processing and ease of integration. Don’t take my word for it - download a
> copy and try it yourself!
>
> To Lewis:
>
> > Hopefully the above is my take on things. If LucidWorks have some magic
> > sauce then great. Hopefully they consider bringing some of it back into
> > Nutch rather than writing some Perl or Python scripts. I would never
> expect
> > this to happen, however I am utterly depressed at how often I see this
> > happening.
>
> Lucidworks is a Java/Clojure shop, the connectors framework and the web
> crawler are written in Java - no Perl or Python in sight ;) Our magic sauce
> is in enterprise integration and rich content processing pipelines, not so
> much in base web crawling.
>
> So, that’s my contribution to this discussion … I hope this answered some
> questions. Feel free to ask if you need more information.
>
> --
> Best regards,
> Andrzej Bialecki <[email protected]>
>
> --=# http://www.lucidworks.com #=--
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble