Hi - anything on this? These are interesting topics so I am curious :)

Cheers,
Markus
-----Original message-----
> From: Markus Jelsma <[email protected]>
> Sent: Thursday 9th October 2014 0:46
> To: [email protected]; [email protected]
> Subject: RE: Nutch vs Lucidworks Fusion
>
> Hi Andrzej - how are you dealing with text extraction and other relevant
> items such as article date and accompanying images? And what about other
> metadata such as the author of the article or the rating some pasta recipe
> got? Also, must clients (or your consultants) implement site-specific URL
> filters to avoid those dreadful spider traps, or do you automatically
> resolve traps? If so, how?
>
> Looking forward :)
>
> Cheers,
> Markus
>
>
> -----Original message-----
> > From: Andrzej Białecki <[email protected]>
> > Sent: Monday 6th October 2014 15:47
> > To: [email protected]
> > Subject: Re: Nutch vs Lucidworks Fusion
> >
> > On 03 Oct 2014, at 12:44, Julien Nioche <[email protected]>
> > wrote:
> >
> > > Attaching Andrzej to this thread. As most of you know, Andrzej was the
> > > Nutch PMC chair prior to me and a huge contributor to Nutch over the
> > > years. He also works for Lucid.
> > > Andrzej: would you mind telling us a bit about LW's crawler and why you
> > > went for Aperture? Am I right in thinking that this has to do with the
> > > fact that you needed to be able to pilot the crawl via a REST-like
> > > service?
> >
> > Hi Julien, and the Nutch community,
> >
> > It’s been a while. :)
> >
> > First, let me clarify a few issues:
> >
> > * indeed, I now work for Lucidworks and I’m involved in the design and
> > implementation of the connectors framework in the Lucidworks Fusion
> > product.
> >
> > * the connectors framework in Fusion allows us to integrate wildly
> > different third-party modules, e.g. we have connectors based on GCM,
> > Hadoop map-reduce, databases, local files, remote filesystems,
> > repositories, etc. In fact, it’s relatively straightforward to integrate
> > Nutch with this framework, and we actually provide docs on how to do
> > this, so nothing stops you from using Nutch if it fits the bill.
> >
> > * this framework provides a uniform REST API to control the processing
> > pipeline for documents collected by connectors, and in most cases to
> > manage the crawlers’ configurations and processes. Only the first part is
> > in place for the integration with Nutch, i.e. configuration and jobs have
> > to be managed externally, and only the processing and content enrichment
> > is controlled by Lucidworks Fusion. If we get a business case that
> > requires a tighter integration, I’m sure we will be happy to do it.
> >
> > * the previous generation of Lucidworks products (called “LucidWorks
> > Search”, LWS for short) used Aperture as a web crawler. This was a legacy
> > integration, and while it worked fine for what it was originally
> > intended, it definitely had some painful limitations, not to mention the
> > fact that the Aperture project is no longer active.
> >
> > * the current version of the product DOES NOT use Aperture for web
> > crawling. It uses a web- and file-crawler implementation created
> > in-house; it re-uses some code from crawler-commons, with some
> > insignificant modifications.
> >
> > * our content processing framework uses many open-source tools (among
> > them Tika, OpenNLP, Drools, of course Solr, and many others), on top of
> > which we’ve built a powerful system for content enrichment, event
> > processing and data analytics.
> >
> > So, those are the facts. Now, let’s move on to opinions ;)
> >
> > There are many different use cases for web/file crawling, and many
> > different scalability and content processing requirements. So far the
> > target audience for Lucidworks Fusion has required small- to medium-scale
> > web crawls, but with sophisticated content processing, extensive control
> > over the crawling frontier (handling sessions for depth-first crawls,
> > cookies, form logins, etc.) and easy management/control of the process
> > over REST/UI. In many cases the effort to set up and operate a Hadoop
> > cluster was also deemed too high or irrelevant to the core business. And
> > in reality, as you know, there are workload sizes for which Hadoop is
> > total overkill and the roundtrip for processing is on the order of
> > several minutes instead of seconds.
> >
> > For these reasons we wanted to provide a web crawler that is
> > self-contained and lean, doesn’t require Hadoop, and scales well from
> > small to mid-size workloads without Hadoop’s overhead - and at the same
> > time to provide an easy way to integrate a high-scale crawler like Nutch
> > for customers that need it. For such customers we DO recommend Nutch as
> > the best high-scale crawler. :)
> >
> > So, in my opinion Lucidworks Fusion satisfies these goals and provides a
> > reasonable tradeoff between ease of use, scalability, rich content
> > processing and ease of integration. Don’t take my word for it - download
> > a copy and try it yourself!
> >
> > To Lewis:
> >
> > > Hopefully the above is my take on things. If LucidWorks have some magic
> > > sauce then great. Hopefully they consider bringing some of it back into
> > > Nutch rather than writing some Perl or Python scripts. I would never
> > > expect this to happen; however, I am utterly depressed at how often I
> > > see this happening.
> >
> > Lucidworks is a Java/Clojure shop; the connectors framework and the web
> > crawler are written in Java - no Perl or Python in sight ;) Our magic
> > sauce is in enterprise integration and rich content processing pipelines,
> > not so much in base web crawling.
> >
> > So, that’s my contribution to this discussion … I hope this answered
> > some questions. Feel free to ask if you need more information.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <[email protected]>
> >
> > --=# http://www.lucidworks.com #=--
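Markus’s question about spider traps is worth a concrete illustration. Traps are commonly filtered out by heuristics on the URL’s shape - excessive path depth, or the same path segment recurring (as in calendar pages and broken relative links that yield /a/b/a/b/a/b/…). A minimal Java sketch of that general technique; the class and method names here are hypothetical, and this is neither Nutch’s URLFilter plugin interface nor Fusion’s crawler:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative spider-trap guard (hypothetical names, not a real crawler API).
 * Rejects URLs whose path is suspiciously deep or repeats a segment -
 * two common symptoms of crawler traps.
 */
public class TrapGuard {
    private static final int MAX_DEPTH = 12;

    /** Returns the URL if it looks safe to fetch, or null to filter it out. */
    public static String filter(String url) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: drop it
        }
        if (path == null) {
            return url; // no path component, nothing to check
        }
        String[] segments = Arrays.stream(path.split("/"))
                                  .filter(s -> !s.isEmpty())
                                  .toArray(String[]::new);
        if (segments.length > MAX_DEPTH) {
            return null; // excessive depth: likely a trap
        }
        // A repeated segment (e.g. /foo/bar/foo/...) often signals a loop.
        Set<String> seen = new HashSet<>();
        for (String seg : segments) {
            if (!seen.add(seg)) {
                return null;
            }
        }
        return url;
    }
}
```

Real crawlers layer more signals on top of this (per-host URL budgets, duplicate-content detection, query-string limits), but cheap shape heuristics like these catch the worst calendar-style loops before any fetch happens.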

