Hi Markus,

We used the OpenNLP named entity extraction tool. It is basic but very
useful if you have a good model.
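For anyone curious what that looks like in practice, here is a minimal sketch of OpenNLP's name finder API. It assumes opennlp-tools is on the classpath and that you have a pre-trained model file such as en-ner-person.bin (available from the OpenNLP models download page); the file path and sentence are just illustrative. As said above, the results are only as good as the model.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NerExample {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained name finder model; the path is an assumption.
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            // The finder expects pre-tokenized input.
            String[] tokens = {"Markus", "Jelsma", "works", "on", "Apache", "Nutch", "."};
            Span[] spans = finder.find(tokens);

            // Convert token spans back to surface strings and print them.
            for (String name : Span.spansToStrings(spans, tokens)) {
                System.out.println(name);
            }

            // Reset document-level adaptive features before the next document.
            finder.clearAdaptiveData();
        }
    }
}
```

If you need another entity type (locations, organizations, dates), you load a different model file with the same API, or train your own with TokenNameFinderFactory and your annotated data.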



2014-10-14 0:03 GMT+03:00 Markus Jelsma <[email protected]>:
> Hi - anything on this? These are interesting topics, so I am curious :)
>
> Cheers,
> Markus
>
>
>
> -----Original message-----
>> From: Markus Jelsma <[email protected]>
>> Sent: Thursday 9th October 2014 0:46
>> To: [email protected]; [email protected]
>> Subject: RE: Nutch vs Lucidworks Fusion
>>
>> Hi Andrzej - how are you dealing with text extraction and other relevant 
>> items such as article date and accompanying images? And what about other 
>> metadata such as the author of the article or the rating some pasta recipe 
>> got? Also, must clients (or your consultants) implement site-specific URL 
>> filters to avoid those dreadful spider traps, or do you automatically 
>> resolve traps? If so, how?
>>
>> Looking forward :)
>>
>> Cheers,
>> Markus
>>
>>
>> -----Original message-----
>> > From: Andrzej Białecki <[email protected]>
>> > Sent: Monday 6th October 2014 15:47
>> > To: [email protected]
>> > Subject: Re: Nutch vs Lucidworks Fusion
>> >
>> > On 03 Oct 2014, at 12:44, Julien Nioche <[email protected]> 
>> > wrote:
>> >
>> > > Attaching Andrzej to this thread. As most of you know Andrzej was the 
>> > > Nutch PMC chair prior to me and a huge contributor to Nutch over the 
>> > > years. He also works for Lucid.
>> > > Andrzej : would you mind telling us a bit about LW's crawler and why you 
>> > > went for Aperture? Am I right in thinking that this has to do with the 
>> > > fact that you needed to be able to pilot the crawl via a REST-like 
>> > > service?
>> > >
>> >
>> > Hi Julien, and the Nutch community,
>> >
>> > It’s been a while. :)
>> >
>> > First, let me clarify a few issues:
>> >
>> > * indeed I now work for Lucidworks and I’m involved in the design and 
>> > implementation of the connectors framework in the Lucidworks Fusion 
>> > product.
>> >
>> > * the connectors framework in Fusion allows us to integrate wildly 
>> > different third-party modules, e.g. we have connectors based on GCM, 
>> > Hadoop map-reduce, databases, local files, remote filesystems, 
>> > repositories, etc. In fact, it’s relatively straightforward to integrate 
>> > Nutch with this framework, and we actually provide docs on how to do this, 
>> > so nothing stops you from using Nutch if it fits the bill.
>> >
>> > * this framework provides a uniform REST API to control the processing 
>> > pipeline for documents collected by connectors, and in most cases to 
>> > manage the crawlers' configurations and processes. Only the first part is 
>> > in place for the integration with Nutch, i.e. configuration and jobs have 
>> > to be managed externally, and only the processing and content enrichment 
>> > is controlled by Lucidworks Fusion. If we get a business case that 
>> > requires a tighter integration I’m sure we will be happy to do it.
>> >
>> > * the previous generation of Lucidworks products (called “LucidWorks 
>> > Search”, or LWS for short) used Aperture as a Web crawler. This was a legacy 
>> > integration and while it worked fine for what it was originally intended, 
>> > it definitely had some painful limitations, not to mention the fact that 
>> > the Aperture project is no longer active.
>> >
>> > * the current version of the product DOES NOT use Aperture for web 
>> > crawling. It uses a web- and file-crawler implementation created in-house 
>> > - it re-uses some code from crawler-commons, with some insignificant 
>> > modifications.
>> >
>> > * our content processing framework uses many Open Source tools (among them 
>> > Tika, OpenNLP, Drools, of course Solr, and many others), on top of which 
>> > we’ve built a powerful system for content enrichment, event processing and 
>> > data analytics.
>> >
>> > So, those are the facts. Now, let’s move on to opinions ;)
>> >
>> > There are many different use cases for web/file crawling and many 
>> > different scalability and content processing requirements. So far the 
>> > target audience for Lucidworks Fusion required small- to medium-scale web 
>> > crawls, but with sophisticated content processing, extensive controls over 
>> > the crawling frontier (handling sessions for depth-first crawls, cookies, 
>> > form logins, etc) and easy management / control of the process over REST / 
>> > UI. In many cases, the effort to set up and operate a Hadoop cluster was 
>> > also deemed too high or irrelevant to the core business. And in reality, as 
>> > you know, there are workload sizes for which Hadoop is a total overkill 
>> > and the roundtrip for processing is in the order of several minutes 
>> > instead of seconds.
>> >
>> > For these reasons we wanted to provide a web crawler that is 
>> > self-contained and lean, doesn’t require Hadoop, and scales well from 
>> > small to mid-size workloads without Hadoop’s overhead, while at the same 
>> > time offering an easy way to integrate a high-scale crawler like Nutch for 
>> > customers that need it - and for such customers we DO recommend Nutch as 
>> > the best high-scale crawler. :)
>> >
>> > So, in my opinion Lucidworks Fusion satisfies these goals, and provides a 
>> > reasonable tradeoff between ease of use, scalability, rich content 
>> > processing and ease of integration. Don’t take my word for it - download a 
>> > copy and try it yourself!
>> >
>> > To Lewis:
>> >
>> > > Hopefully the above is my take on things. If LucidWorks have some magic
>> > > sauce then great. Hopefully they consider bringing some of it back into
>> > > Nutch rather than writing some Perl or Python scripts. I would never 
>> > > expect
>> > > this to happen, however I am utterly depressed at how often I see this
>> > > happening.
>> >
>> > Lucidworks is a Java/Clojure shop, the connectors framework and the web 
>> > crawler are written in Java - no Perl or Python in sight ;) Our magic 
>> > sauce is in enterprise integration and rich content processing pipelines, 
>> > not so much in base web crawling.
>> >
>> > So, that’s my contribution to this discussion … I hope this answered some 
>> > questions. Feel free to ask if you need more information.
>> >
>> > --
>> > Best regards,
>> > Andrzej Bialecki <[email protected]>
>> >
>> > --=# http://www.lucidworks.com #=--
>> >
>> >
>>



-- 
Talat UYARER
Websitesi: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
