Re: Nutch vs Lucidworks Fusion

2014-10-06 Thread Andrzej Białecki

On 03 Oct 2014, at 12:44, Julien Nioche lists.digitalpeb...@gmail.com wrote:

 Attaching Andrzej to this thread. As most of you know Andrzej was the Nutch 
 PMC chair prior to me and a huge contributor to Nutch over the years. He also 
 works for Lucid.
 Andrzej : would you mind telling us a bit about LW's crawler and why you went 
 for Aperture? Am I right in thinking that this has to do with the fact that 
 you needed to be able to pilot the crawl via a REST-like service?
 


Hi Julien, and the Nutch community,

It’s been a while. :)

First, let me clarify a few issues:

* indeed I now work for Lucidworks and I’m involved in the design and 
implementation of the connectors framework in the Lucidworks Fusion product.

* the connectors framework in Fusion allows us to integrate wildly different 
third-party modules, e.g. we have connectors based on GCM, Hadoop map-reduce, 
databases, local files, remote filesystems, repositories, etc. In fact, it’s 
relatively straightforward to integrate Nutch with this framework, and we 
actually provide docs on how to do this, so nothing stops you from using Nutch 
if it fits the bill.

* this framework provides a uniform REST API to control the processing pipeline 
for documents collected by connectors, and in most cases to manage the crawlers’ 
configurations and processes. Only the first part is in place for the 
integration with Nutch, i.e. configuration and jobs have to be managed 
externally, and only the processing and content enrichment are controlled by 
Lucidworks Fusion. If we get a business case that requires a tighter 
integration I’m sure we will be happy to do it.

* the previous generation of Lucidworks products (called “LucidWorks Search”, 
or LWS for short) used Aperture as its web crawler. This was a legacy integration, and 
while it worked fine for what it was originally intended to do, it definitely had 
some painful limitations, not to mention that the Aperture project is 
no longer active.

* the current version of the product DOES NOT use Aperture for web crawling. It 
uses an in-house web and file crawler implementation that re-uses some code 
from crawler-commons, with minor modifications.

* our content processing framework uses many Open Source tools (among them 
Tika, OpenNLP, Drools, of course Solr, and many others), on top of which we’ve 
built a powerful system for content enrichment, event processing and data 
analytics.

So, that’s the facts. Now, let’s move on to opinions ;)

There are many different use cases for web/file crawling and many different 
scalability and content processing requirements. So far the target audience for 
Lucidworks Fusion required small- to medium-scale web crawls, but with 
sophisticated content processing, extensive controls over the crawling frontier 
(handling sessions for depth-first crawls, cookies, form logins, etc) and easy 
management / control of the process over REST / UI. In many cases the effort 
to set up and operate a Hadoop cluster was also deemed too high, or irrelevant 
to the core business. And in reality, as you know, there are workload sizes for 
which Hadoop is total overkill and the processing roundtrip is on the order of 
several minutes instead of seconds.

For these reasons we wanted to provide a web crawler that is self-contained and 
lean, doesn’t require Hadoop, and scales well enough from small to mid-size 
workloads without Hadoop’s overhead - and at the same time to provide an easy 
way to integrate a high-scale crawler like Nutch for customers that need it; 
for such customers we DO recommend Nutch as the best high-scale crawler. :)

So, in my opinion Lucidworks Fusion satisfies these goals, and provides a 
reasonable tradeoff between ease of use, scalability, rich content processing 
and ease of integration. Don’t take my word for it - download a copy and try it 
yourself!

To Lewis:

 Hopefull the above is my outtake on things. If LucidWorks have some magic
 sauce then great. Hopefully they consider bringing some of it back into
 Nutch rather than writing some Perl or Python scripts. I would never expect
 this to happen, however I am utterly depressed at how often I see this
 happening.

Lucidworks is a Java/Clojure shop, the connectors framework and the web crawler 
are written in Java - no Perl or Python in sight ;) Our magic sauce is in 
enterprise integration and rich content processing pipelines, not so much in 
base web crawling.

So, that’s my contribution to this discussion … I hope this answered some 
questions. Feel free to ask if you need more information.

--
Best regards,
Andrzej Bialecki a...@lucidworks.com

--=# http://www.lucidworks.com #=--



Link original url with the final redirected url

2014-10-06 Thread Vijay Chakilam
Hi,

I am trying to crawl a bunch of webpages. Many of those redirect to other pages. I’ve set the max redirect setting to 5 and was able to fetch the redirected pages, parse the content, and extract text and data. When I use the segment reader and dump data, I am not able to link the original URL with the redirect page that is actually fetched.

For example, here’s one of the webpages I am trying to fetch:
http://cdn.newsapi.com.au/link/6c35fe0e95b0fb34608eb90c9637f8f1?domain=theaustralian.com.au

The final redirected page that is fetched in this case is:
http://www.theaustralian.com.au/subscribe/news/1/index.html?sourceCode=TAWEB_WRE170_amode=premiumdest=http:/www.theaustralian.com.au/business/opinion/beware-the-watchdogs-bark/story-e6frg9lo-1227077646475?sv=cb5aeda07ef5f9841662884c31232e88nk=9c8dd2e0c0c2f2ee9809449e54bd040bmemtype=anonymous

I am attaching the segment reader dump for generate, fetch, parse, parsedata and parsetext. I am not sure how to link the original URL:
http://cdn.newsapi.com.au/link/6c35fe0e95b0fb34608eb90c9637f8f1?domain=theaustralian.com.au
with the final redirect page that is actually fetched and parsed:
http://www.theaustralian.com.au/subscribe/news/1/index.html?sourceCode=TAWEB_WRE170_amode=premiumdest=http:/www.theaustralian.com.au/business/opinion/beware-the-watchdogs-bark/story-e6frg9lo-1227077646475?sv=cb5aeda07ef5f9841662884c31232e88nk=9c8dd2e0c0c2f2ee9809449e54bd040bmemtype=anonymous

dump.test.fetch
Description: Binary data


dump.test.generate
Description: Binary data


dump.test.parse
Description: Binary data


dump.test.parsedata
Description: Binary data


dump.test.parsetext
Description: Binary data
Thanks,
Vijay
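
For what it’s worth, below is a minimal, untested sketch of how one might walk a
segment’s crawl_fetch data directly and print each URL together with its
CrawlDatum status and metadata. The assumption (worth verifying) is that when
the fetcher follows a redirect it records the original/representative URL in
that metadata; the paths and the single part-00000 layout are illustrative.

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;

public class DumpCrawlFetch {
  public static void main(String[] args) throws Exception {
    // args[0] = path to one segment, e.g. crawl/segments/20141006123456
    // (assumes the segment lives on the default filesystem and was written
    // by a single reducer, hence the part-00000 subdirectory)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path crawlFetch = new Path(args[0], "crawl_fetch/part-00000");

    MapFile.Reader reader = new MapFile.Reader(fs, crawlFetch.toString(), conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      // Print the URL as keyed in the segment, its fetch status, and any
      // metadata the fetcher attached (a redirect/representative URL, if
      // recorded, should appear here).
      System.out.println(url + "\t" + CrawlDatum.getStatusName(datum.getStatus()));
      for (Map.Entry<Writable, Writable> e : datum.getMetaData().entrySet()) {
        System.out.println("  " + e.getKey() + " = " + e.getValue());
      }
    }
    reader.close();
  }
}

If nothing useful turns up there, the parse_data metadata for the fetched page
would be the next place to look.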

Re: Nutch vs Lucidworks Fusion

2014-10-06 Thread Julien Nioche
Thanks for the explanations Andrzej and Grant!
Great to hear that you are using stuff from crawler-commons.

Julien




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/

propagating injected metadata only to child URLs?

2014-10-06 Thread Jonathan Cooper-Ellis
Hello,

I am interested in injecting metadata and propagating it only to the injected
URL's children.

For example, I want to inject www.fakenews.com/boston along with some metadata
that is specific to Boston, and I don't want that metadata to be propagated to
www.fakenews.com or www.fakenews.com/atlanta. It should only go to
www.fakenews.com/boston/.+

Having looked at the wiki, NUTCH-655, and NUTCH-855, it seems like using
the urlmeta plugin out of the box would not achieve this, because the
metadata would be propagated to all outlinks (which presumably would
include its parent, et al.).

Is this correct? If so, is there any built-in way to do this or do I need
to figure something out?
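
To make the idea concrete, here is a small hypothetical sketch (not existing
plugin code) of the kind of prefix test a customized copy of the urlmeta
scoring filter could apply in its outlink-distribution step before copying the
injected metadata to an outlink; the class and method names are made up for
illustration, and the real hook would be the plugin's ScoringFilter
implementation.

public class PrefixScopedMeta {

  /** Returns true if the outlink should inherit the parent's injected metadata. */
  static boolean shouldInherit(String parentUrl, String outlinkUrl) {
    // Normalize the trailing slash so .../boston and .../boston/ behave the same.
    String prefix = parentUrl.endsWith("/") ? parentUrl : parentUrl + "/";
    return outlinkUrl.startsWith(prefix);
  }

  public static void main(String[] args) {
    String parent = "http://www.fakenews.com/boston";
    // Only URLs under /boston/ would inherit the Boston-specific metadata.
    System.out.println(shouldInherit(parent, "http://www.fakenews.com/boston/red-sox")); // true
    System.out.println(shouldInherit(parent, "http://www.fakenews.com/atlanta"));        // false
    System.out.println(shouldInherit(parent, "http://www.fakenews.com/"));               // false
  }
}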

Thanks,
jce


Generated Segment Too Large

2014-10-06 Thread Meraj A. Khan
Hi Folks,

I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of
controlling the segment size, and a single segment is being created which is
very large for the capacity of my Hadoop cluster. I have available storage of
~3TB, but because Hadoop generates the spill*.out files for this large segment,
which takes days to fetch, I am running out of disk space.

I figured that if the segment size were controlled, the spill files for each
segment would be deleted after the job for that segment completed, giving me
more efficient use of the disk space.

I would like to know how I can generate multiple segments of a certain size
(or just a fixed number) at each depth iteration.

Right now it looks like Generator.java needs to be modified, as it does not
consider the number of segments. Is that the right approach? If so, can you
please give me a few pointers on what logic I should be changing? If this is
not the right approach, I would be happy to know whether there is any way to
control the number as well as the size of the generated segments using
configuration/job submission parameters.
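
For illustration, here is a hedged sketch of driving the 1.x Generator
programmatically and asking it to cap and split the selection. The -topN and
-maxNumSegments options are assumptions about this Generator version; check
them against the usage output of your build before relying on them.

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.util.NutchConfiguration;

public class GenerateSmallSegments {
  public static void main(String[] args) throws Exception {
    String[] generatorArgs = {
        "crawl/crawldb",          // CrawlDb location (illustrative path)
        "crawl/segments",         // where the new segments are written
        "-topN", "50000",         // limit the total number of URLs selected per cycle
        "-maxNumSegments", "5"    // split that selection across several smaller segments
    };
    // Generator implements the Hadoop Tool interface, so it can be run this
    // way as well as from the bin/nutch script.
    int res = ToolRunner.run(NutchConfiguration.create(), new Generator(), generatorArgs);
    System.exit(res);
  }
}

If your version supports them, the generate.max.count / generate.count.mode
properties are another way to limit how many URLs per host or domain end up in
a single segment (again, worth verifying against your build).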

Thanks for your help!