Thanks for the explanations Andrzej and Grant! Great to hear that you are using stuff from crawler-commons.
Julien

On 6 October 2014 14:47, Andrzej Białecki <[email protected]> wrote:

> On 03 Oct 2014, at 12:44, Julien Nioche <[email protected]> wrote:
>
> > Attaching Andrzej to this thread. As most of you know Andrzej was the
> > Nutch PMC chair prior to me and a huge contributor to Nutch over the
> > years. He also works for Lucid.
> >
> > Andrzej : would you mind telling us a bit about LW's crawler and why
> > you went for Aperture? Am I right in thinking that this has to do with
> > the fact that you needed to be able to pilot the crawl via a REST-like
> > service?
>
> Hi Julien, and the Nutch community,
>
> It’s been a while. :)
>
> First, let me clarify a few issues:
>
> * indeed I now work for Lucidworks and I’m involved in the design and
>   implementation of the connectors framework in the Lucidworks Fusion
>   product.
>
> * the connectors framework in Fusion allows us to integrate wildly
>   different third-party modules, e.g. we have connectors based on GCM,
>   Hadoop map-reduce, databases, local files, remote filesystems,
>   repositories, etc. In fact, it’s relatively straightforward to
>   integrate Nutch with this framework, and we actually provide docs on
>   how to do this, so nothing stops you from using Nutch if it fits the
>   bill.
>
> * this framework provides a uniform REST API to control the processing
>   pipeline for documents collected by connectors, and in most cases to
>   manage the crawler configurations and processes. Only the first part
>   is in place for the integration with Nutch, i.e. configuration and
>   jobs have to be managed externally, and only the processing and
>   content enrichment is controlled by Lucidworks Fusion. If we get a
>   business case that requires a tighter integration I’m sure we will be
>   happy to do it.
>
> * the previous generation of Lucidworks products (called “LucidWorks
>   Search”, LWS for short) used Aperture as a Web crawler.
>   This was a legacy integration and while it worked fine for what it was
>   originally intended for, it definitely had some painful limitations,
>   not to mention the fact that the Aperture project is no longer active.
>
> * the current version of the product DOES NOT use Aperture for web
>   crawling. It uses a web- and file-crawler implementation created
>   in-house - it re-uses some code from crawler-commons, with some
>   insignificant modifications.
>
> * our content processing framework uses many Open Source tools (among
>   them Tika, OpenNLP, Drools, of course Solr, and many others), on top
>   of which we’ve built a powerful system for content enrichment, event
>   processing and data analytics.
>
> So, that’s the facts. Now, let’s move on to opinions ;)
>
> There are many different use cases for web/file crawling and many
> different scalability and content processing requirements. So far the
> target audience for Lucidworks Fusion required small- to medium-scale web
> crawls, but with sophisticated content processing, extensive controls
> over the crawling frontier (handling sessions for depth-first crawls,
> cookies, form logins, etc) and easy management / control of the process
> over REST / UI. In many cases the effort to set up and operate a Hadoop
> cluster was also deemed too high or irrelevant to the core business. And
> in reality, as you know, there are workload sizes for which Hadoop is
> total overkill and the roundtrip for processing is in the order of
> several minutes instead of seconds.
>
> For these reasons we wanted to provide a web crawler that is
> self-contained, lean, doesn’t require Hadoop, and scales well enough from
> small to mid-size workloads without Hadoop’s overhead, and at the same
> time to provide an easy way to integrate a high-scale crawler like Nutch
> for customers that need it - and for such customers we DO recommend Nutch
> as the best high-scale crawler.
> :)
>
> So, in my opinion Lucidworks Fusion satisfies these goals, and provides a
> reasonable tradeoff between ease of use, scalability, rich content
> processing and ease of integration. Don’t take my word for it - download
> a copy and try it yourself!
>
> To Lewis:
>
> > Hopefully the above is my outtake on things. If LucidWorks have some
> > magic sauce then great. Hopefully they consider bringing some of it
> > back into Nutch rather than writing some Perl or Python scripts. I
> > would never expect this to happen, however I am utterly depressed at
> > how often I see this happening.
>
> Lucidworks is a Java/Clojure shop, the connectors framework and the web
> crawler are written in Java - no Perl or Python in sight ;) Our magic
> sauce is in enterprise integration and rich content processing pipelines,
> not so much in base web crawling.
>
> So, that’s my contribution to this discussion … I hope this answered some
> questions. Feel free to ask if you need more information.
>
> --
> Best regards,
> Andrzej Bialecki <[email protected]>
>
> --=# http://www.lucidworks.com #=--

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

