Hi - anything on this? These are interesting topics, so I am curious :)

Cheers,
Markus

-----Original message-----
> From: Markus Jelsma <[email protected]>
> Sent: Thursday 9th October 2014 0:46
> To: [email protected]; [email protected]
> Subject: RE: Nutch vs Lucidworks Fusion
> 
> Hi Andrzej - how are you dealing with text extraction and other relevant 
> items such as article date and accompanying images? And what about other 
> metadata such as the author of the article or the rating some pasta recipe 
> got? Also, must clients (or your consultants) implement site-specific URL 
> filters to avoid those dreadful spider traps, or do you automatically resolve 
> traps? If so, how?
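> For reference, in stock Nutch this kind of site-specific trap handling is
> usually done with regex rules in conf/regex-urlfilter.txt - a minimal
> sketch, combining patterns from the default config with a hypothetical
> allow rule for www.example.com:
> 
> ```
> # skip URLs containing certain characters, typically session IDs / queries
> -[?*!@=]
> # skip URLs with a repeated path segment - the classic spider-trap signature
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> # accept everything else from the target site (example.com is hypothetical)
> +^https?://www\.example\.com/
> ```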
> 
> Looking forward :)
> 
> Cheers,
> Markus
>  
>  
> -----Original message-----
> > From: Andrzej Białecki <[email protected]>
> > Sent: Monday 6th October 2014 15:47
> > To: [email protected]
> > Subject: Re: Nutch vs Lucidworks Fusion
> > 
> > On 03 Oct 2014, at 12:44, Julien Nioche <[email protected]> 
> > wrote:
> > 
> > > Attaching Andrzej to this thread. As most of you know Andrzej was the 
> > > Nutch PMC chair prior to me and a huge contributor to Nutch over the 
> > > years. He also works for Lucid.
> > > Andrzej : would you mind telling us a bit about LW's crawler and why you 
> > > went for Aperture? Am I right in thinking that this has to do with the 
> > > fact that you needed to be able to pilot the crawl via a REST-like 
> > > service?
> > > 
> > 
> > Hi Julien, and the Nutch community,
> > 
> > It’s been a while. :)
> > 
> > First, let me clarify a few issues:
> > 
> > * indeed I now work for Lucidworks and I’m involved in the design and 
> > implementation of the connectors framework in the Lucidworks Fusion product.
> > 
> > * the connectors framework in Fusion allows us to integrate wildly 
> > different third-party modules, e.g. we have connectors based on GCM, Hadoop 
> > map-reduce, databases, local files, remote filesystems, repositories, etc. 
> > In fact, it’s relatively straightforward to integrate Nutch with this 
> > framework, and we actually provide docs on how to do this, so nothing stops 
> > you from using Nutch if it fits the bill.
> > 
> > * this framework provides a uniform REST API to control the processing 
> > pipeline for documents collected by connectors, and in most cases to manage 
> > the crawlers’ configurations and processes. Only the first part is in place 
> > for the integration with Nutch, i.e. configuration and jobs have to be 
> > managed externally, and only the processing and content enrichment are 
> > controlled by Lucidworks Fusion. If we get a business case that requires a 
> > tighter integration, I’m sure we will be happy to do it.
> > 
> > * the previous generation of Lucidworks products (called “LucidWorks 
> > Search”, LWS for short) used Aperture as a Web crawler. This was a legacy 
> > integration and, while it worked fine for what it was originally intended 
> > to do, it definitely had some painful limitations, not to mention that 
> > the Aperture project is no longer active.
> > 
> > * the current version of the product DOES NOT use Aperture for web 
> > crawling. It uses a web and file crawler implementation created in-house, 
> > which re-uses some code from crawler-commons with minor modifications.
> > 
> > * our content processing framework uses many Open Source tools (among them 
> > Tika, OpenNLP, Drools, of course Solr, and many others), on top of which 
> > we’ve built a powerful system for content enrichment, event processing and 
> > data analytics.
> > 
> > So, that’s the facts. Now, let’s move on to opinions ;)
> > 
> > There are many different use cases for web/file crawling and many different 
> > scalability and content processing requirements. So far the target audience 
> > for Lucidworks Fusion required small- to medium-scale web crawls, but with 
> > sophisticated content processing, extensive controls over the crawling 
> > frontier (handling sessions for depth-first crawls, cookies, form logins, 
> > etc.) and easy management / control of the process over REST / UI. In many 
> > cases the effort to set up and operate a Hadoop cluster was also deemed too 
> > high, or irrelevant to the core business. And in reality, as you know, there 
> > are workload sizes for which Hadoop is total overkill and the roundtrip 
> > for processing is on the order of several minutes instead of seconds.
> > 
> > For these reasons we wanted to provide a web crawler that is 
> > self-contained and lean, doesn’t require Hadoop, and scales well enough 
> > from small to mid-size workloads without Hadoop’s overhead - and at the 
> > same time to provide an easy way to integrate a high-scale crawler like 
> > Nutch for customers that need it. For such customers we DO recommend 
> > Nutch as the best high-scale crawler. :)
> > 
> > So, in my opinion Lucidworks Fusion satisfies these goals, and provides a 
> > reasonable tradeoff between ease of use, scalability, rich content 
> > processing and ease of integration. Don’t take my word for it - download a 
> > copy and try it yourself!
> > 
> > To Lewis:
> > 
> > > Hopefully the above is my take on things. If LucidWorks has some magic
> > > sauce then great. Hopefully they will consider bringing some of it back
> > > into Nutch rather than writing some Perl or Python scripts. I would
> > > never expect this to happen; however, I am utterly depressed at how
> > > often I see this happening.
> > 
> > Lucidworks is a Java/Clojure shop, the connectors framework and the web 
> > crawler are written in Java - no Perl or Python in sight ;) Our magic sauce 
> > is in enterprise integration and rich content processing pipelines, not so 
> > much in base web crawling.
> > 
> > So, that’s my contribution to this discussion … I hope this answered some 
> > questions. Feel free to ask if you need more information.
> > 
> > --
> > Best regards,
> > Andrzej Bialecki <[email protected]>
> > 
> > --=# http://www.lucidworks.com #=--
> > 
> > 
> 
