Hi Markus,

We used the OpenNLP named entity extraction tool. It is basic but very useful if you have a good model.
2014-10-14 0:03 GMT+03:00 Markus Jelsma <[email protected]>:

> Hi - anything on this? These are interesting topics so I am curious :)
>
> Cheers,
> Markus
>
> -----Original message-----
>> From: Markus Jelsma <[email protected]>
>> Sent: Thursday 9th October 2014 0:46
>> To: [email protected]; [email protected]
>> Subject: RE: Nutch vs Lucidworks Fusion
>>
>> Hi Andrzej - how are you dealing with text extraction and other relevant
>> items such as article date and accompanying images? And what about other
>> metadata, such as the author of the article or the rating some pasta
>> recipe got? Also, must clients (or your consultants) implement
>> site-specific URL filters to avoid those dreadful spider traps, or do you
>> automatically resolve traps? If so, how?
>>
>> Looking forward :)
>>
>> Cheers,
>> Markus
>>
>> -----Original message-----
>> > From: Andrzej Białecki <[email protected]>
>> > Sent: Monday 6th October 2014 15:47
>> > To: [email protected]
>> > Subject: Re: Nutch vs Lucidworks Fusion
>> >
>> > On 03 Oct 2014, at 12:44, Julien Nioche <[email protected]> wrote:
>> >
>> > > Attaching Andrzej to this thread. As most of you know, Andrzej was
>> > > the Nutch PMC chair prior to me and a huge contributor to Nutch over
>> > > the years. He also works for Lucid.
>> > > Andrzej: would you mind telling us a bit about LW's crawler and why
>> > > you went for Aperture? Am I right in thinking that this has to do
>> > > with the fact that you needed to be able to pilot the crawl via a
>> > > REST-like service?
>> >
>> > Hi Julien, and the Nutch community,
>> >
>> > It's been a while. :)
>> >
>> > First, let me clarify a few issues:
>> >
>> > * Indeed, I now work for Lucidworks and I'm involved in the design and
>> > implementation of the connectors framework in the Lucidworks Fusion
>> > product.
>> >
>> > * The connectors framework in Fusion allows us to integrate wildly
>> > different third-party modules, e.g. we have connectors based on GCM,
>> > Hadoop map-reduce, databases, local files, remote filesystems,
>> > repositories, etc. In fact, it's relatively straightforward to
>> > integrate Nutch with this framework, and we actually provide docs on
>> > how to do this, so nothing stops you from using Nutch if it fits the
>> > bill.
>> >
>> > * This framework provides a uniform REST API to control the processing
>> > pipeline for documents collected by connectors, and in most cases to
>> > manage the crawlers' configurations and processes. Only the first part
>> > is in place for the integration with Nutch, i.e. configuration and jobs
>> > have to be managed externally, and only the processing and content
>> > enrichment are controlled by Lucidworks Fusion. If we get a business
>> > case that requires a tighter integration, I'm sure we will be happy to
>> > do it.
>> >
>> > * The previous generation of Lucidworks products (called "LucidWorks
>> > Search", LWS for short) used Aperture as a web crawler. This was a
>> > legacy integration, and while it worked fine for what it was originally
>> > intended, it definitely had some painful limitations, not to mention
>> > the fact that the Aperture project is no longer active.
>> >
>> > * The current version of the product DOES NOT use Aperture for web
>> > crawling. It uses a web- and file-crawler implementation created
>> > in-house; it re-uses some code from crawler-commons, with some
>> > insignificant modifications.
>> >
>> > * Our content processing framework uses many open source tools (among
>> > them Tika, OpenNLP, Drools, of course Solr, and many others), on top of
>> > which we've built a powerful system for content enrichment, event
>> > processing and data analytics.
>> >
>> > So, those are the facts. Now, let's move on to opinions ;)
>> >
>> > There are many different use cases for web/file crawling, and many
>> > different scalability and content processing requirements. So far the
>> > target audience for Lucidworks Fusion has required small- to
>> > medium-scale web crawls, but with sophisticated content processing,
>> > extensive control over the crawling frontier (handling sessions for
>> > depth-first crawls, cookies, form logins, etc.) and easy
>> > management/control of the process over REST / UI. In many cases the
>> > effort to set up and operate a Hadoop cluster was also deemed too high
>> > or irrelevant to the core business. And in reality, as you know, there
>> > are workload sizes for which Hadoop is total overkill and the
>> > round-trip for processing is on the order of several minutes instead of
>> > seconds.
>> >
>> > For these reasons we wanted to provide a web crawler that is
>> > self-contained and lean, doesn't require Hadoop, and scales well enough
>> > from small to mid-size workloads without Hadoop's overhead, and at the
>> > same time to provide an easy way to integrate a high-scale crawler like
>> > Nutch for customers that need it - and for such customers we DO
>> > recommend Nutch as the best high-scale crawler. :)
>> >
>> > So, in my opinion Lucidworks Fusion satisfies these goals and provides
>> > a reasonable tradeoff between ease of use, scalability, rich content
>> > processing and ease of integration. Don't take my word for it -
>> > download a copy and try it yourself!
>> >
>> > To Lewis:
>> >
>> > > Hopefully the above is my take on things. If Lucidworks have some
>> > > magic sauce then great. Hopefully they'll consider bringing some of
>> > > it back into Nutch rather than writing some Perl or Python scripts.
>> > > I would never expect this to happen; however, I am utterly depressed
>> > > at how often I see this happening.
>> >
>> > Lucidworks is a Java/Clojure shop; the connectors framework and the web
>> > crawler are written in Java - no Perl or Python in sight ;) Our magic
>> > sauce is in enterprise integration and rich content processing
>> > pipelines, not so much in basic web crawling.
>> >
>> > So, that's my contribution to this discussion … I hope this answered
>> > some questions. Feel free to ask if you need more information.
>> >
>> > --
>> > Best regards,
>> > Andrzej Białecki <[email protected]>
>> >
>> > --=# http://www.lucidworks.com #=--

--
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
LinkedIn: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
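[Editor's note: the thread above touches on crawl-frontier controls - depth limits for depth-first crawls and guarding against spider traps. For readers new to the topic, the core bookkeeping can be sketched in a few lines of plain Java. All class and method names below are illustrative only; this is not Fusion's or Nutch's actual API.]

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Minimal crawl-frontier sketch: per-host FIFO queues plus a depth limit.
// The seen-set deduplicates URLs; the depth cap is a crude spider-trap guard.
class Frontier {
    private final Map<String, Deque<String>> perHost = new LinkedHashMap<>();
    private final Set<String> seen = new HashSet<>();
    private final int maxDepth;

    Frontier(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    // Enqueue a URL discovered at the given link depth;
    // drop it if it is too deep or has already been seen.
    boolean add(String url, int linkDepth) {
        if (linkDepth > maxDepth || !seen.add(url)) {
            return false;
        }
        String host = URI.create(url).getHost();
        perHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        return true;
    }

    // Hand out the next URL, taking from the first host with pending work.
    // A production crawler would rotate hosts and enforce per-host delays.
    String next() {
        for (Deque<String> queue : perHost.values()) {
            if (!queue.isEmpty()) {
                return queue.poll();
            }
        }
        return null;
    }
}

class FrontierDemo {
    public static void main(String[] args) {
        Frontier frontier = new Frontier(2);
        frontier.add("http://example.com/", 0);
        frontier.add("http://example.com/", 0);      // duplicate: dropped
        frontier.add("http://example.com/a/b/c", 3); // too deep: dropped
        frontier.add("http://example.org/", 1);
        String url;
        while ((url = frontier.next()) != null) {
            System.out.println(url);
        }
    }
}
```

Keeping a separate queue per host is what makes politeness (one in-flight request per host, with a delay) easy to add later; a single global queue would interleave hosts unpredictably.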

