Re: Generated Segment Too Large

2014-10-07 Thread Meraj A. Khan
Markus,

I have been using Nutch for a while, but I wasn't clear about this issue;
thank you for reminding me that this is Nutch 101 :)

I will go ahead and use topN as the segment size control mechanism, although
I have one question regarding topN: if I have a topN value of 1000 and there
are more than topN URLs, let's say 2000 URLs unfetched at that point in time,
will the remaining 1000 be addressed in a subsequent fetch phase, meaning
nothing is discarded or left unfetched?
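
For context, this is the kind of Generator invocation I am referring to
(the paths and the topN value here are only placeholders):

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000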





On Tue, Oct 7, 2014 at 3:46 AM, Markus Jelsma wrote:

> Hi - you have been using Nutch for some time already, so aren't you already
> familiar with the generate.max.count configuration directive, possibly
> combined with the -topN parameter for the Generator job? With
> generate.max.count the segment size depends on the number of distinct hosts
> or domains, so it is not a reliable limit; the -topN parameter is strict.
>
> Markus
>
>
>
> -Original message-
> > From: Meraj A. Khan
> > Sent: Tuesday 7th October 2014 5:54
> > To: user@nutch.apache.org
> > Subject: Generated Segment Too Large
> >
> > Hi Folks,
> >
> > I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of
> > controlling the segment size, and since the single segment being created is
> > very large for the capacity of my Hadoop cluster (I have ~3TB of available
> > storage), Hadoop generates spill*.out files for this large segment, which
> > takes days to fetch, and I am running out of disk space.
> >
> > I figured that if the segment size were controlled, then for each segment
> > the spill files would be deleted after the job for that segment completed,
> > giving me more efficient use of the disk space.
> >
> > I would like to know how I can generate multiple segments of a certain size
> > (or just a fixed number) at each depth iteration.
> >
> > Right now it looks like Generator.java needs to be modified, as it does not
> > consider the number of segments. Is that the right approach? If so, could
> > you please give me a few pointers on what logic I should change? If this is
> > not the right approach, I would be happy to know if there is any way to
> > control the number as well as the size of the generated segments using
> > configuration/job submission parameters.
> >
> > Thanks for your help!
> >
>


Re: Exception in NUTCH 2.2.1

2014-10-07 Thread rk_sharma
No, it's just a convention to write APACHE_NUTCH_HOME; it refers to the root
directory where you unzipped your Nutch setup files.
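
For example, when the docs refer to something like
$APACHE_NUTCH_HOME/conf/nutch-site.xml, just substitute your own unpack
directory, e.g. /home/user/apache-nutch-2.2.1/conf/nutch-site.xml (that path
is only an example).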





Re: Nutch vs Lucidworks Fusion

2014-10-07 Thread Mattmann, Chris A (3980)
Thanks AB.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Andrzej Białecki
Reply-To: "user@nutch.apache.org" 
Date: Monday, October 6, 2014 at 3:47 PM
To: "user@nutch.apache.org" 
Subject: Re: Nutch vs Lucidworks Fusion

>
>On 03 Oct 2014, at 12:44, Julien Nioche wrote:
>
>> Attaching Andrzej to this thread. As most of you know Andrzej was the
>>Nutch PMC chair prior to me and a huge contributor to Nutch over the
>>years. He also works for Lucid.
>> Andrzej : would you mind telling us a bit about LW's crawler and why
>>you went for Aperture? Am I right in thinking that this has to do with
>>the fact that you needed to be able to pilot the crawl via a REST-like
>>service?
>> 
>
>
>Hi Julien, and the Nutch community,
>
>It's been a while. :)
>
>First, let me clarify a few issues:
>
>* indeed I now work for Lucidworks and I'm involved in the design and
>implementation of the connectors framework in the Lucidworks Fusion
>product.
>
>* the connectors framework in Fusion allows us to integrate wildly
>different third-party modules, e.g. we have connectors based on GCM,
>Hadoop map-reduce, databases, local files, remote filesystems,
>repositories, etc. In fact, it's relatively straightforward to integrate
>Nutch with this framework, and we actually provide docs on how to do
>this, so nothing stops you from using Nutch if it fits the bill.
>
>* this framework provides a uniform REST API to control the processing
>pipeline for documents collected by connectors, and in most cases to
>manage the crawlers' configurations and processes. Only the first part is
>in place for the integration with Nutch, i.e. configuration and jobs have
>to be managed externally, and only the processing and content enrichment
>is controlled by Lucidworks Fusion. If we get a business case that
>requires a tighter integration I'm sure we will be happy to do it.
>
>* the previous generation of Lucidworks products (called "LucidWorks
>Search", shortly LWS) used Aperture as a Web crawler. This was a legacy
>integration and while it worked fine for what it was originally intended,
>it definitely had some painful limitations, not to mention the fact that
>the Aperture project is no longer active.
>
>* the current version of the product DOES NOT use Aperture for web
>crawling. It uses a web- and file-crawler implementation created in-house
>- it re-uses some code from crawler-commons, with some insignificant
>modifications.
>
>* our content processing framework uses many Open Source tools (among
>them Tika, OpenNLP, Drools, of course Solr, and many others), on top of
>which we've built a powerful system for content enrichment, event
>processing and data analytics.
>
>So, that's the facts. Now, let's move on to opinions ;)
>
>There are many different use cases for web/file crawling and many
>different scalability and content processing requirements. So far the
>target audience for Lucidworks Fusion required small- to medium-scale web
>crawls, but with sophisticated content processing, extensive controls
>over the crawling frontier (handling sessions for depth-first crawls,
>cookies, form logins, etc) and easy management / control of the process
>over REST / UI. In many cases also the effort to set up and operate a
>Hadoop cluster was deemed too high or irrelevant to the core business.
>And in reality, as you know, there are workload sizes for which Hadoop is
>a total overkill and the roundtrip for processing is in the order of
>several minutes instead of seconds.
>
>For these reasons we wanted to provide a web crawler that is
>self-contained and lean, doesn't require Hadoop, scales well enough from
>small to mid-size workloads without Hadoop's overhead, and at the same
>time to provide an easy way to integrate a high-scale crawler like Nutch
>for customers that need it - and for such customers we DO recommend
>Nutch as the best high-scale crawler. :)
>
>So, in my opinion Lucidworks Fusion satisfies these goals, and provides a
>reasonable tradeoff between ease of use, scalability, rich content
>processing and ease of integration. Don't take my word for it - download
>a copy and try it yourself!
>
>To Lewis:
>
>> Hopefully the above is my take on things. If LucidWorks have some magic
>> sauce then great. Hopefully they consider bringing some of it back into
>> Nutch rather than writing some Perl or Python scripts. I would never
>> expect this to happen, however I am utterly depresse

Re: Exception in NUTCH 2.2.1

2014-10-07 Thread sagarhandore007
Had you set the APACHE_NUTCH_HOME variable to /home/.../apache-nutch-2.2.1/?





Re: propagating injected metadata only to child URLs?

2014-10-07 Thread Sebastian Nagel
Hi,

> Having looked at the wiki, NUTCH-655, and NUTCH-855, it seems like using
> the urlmeta plugin out of the box would not achieve this, because the
> metadata would be propagated to all outlinks (which presumably would
> include its parent, et al.).
>
> Is this correct? If so, is there any built-in way to do this or do I need
> to figure something out?

Yes, that's right.

But it would be easy to add the check in distributeScoreToOutlinks()
of URLMetaScoringFilter. Maybe it's also a good idea to make this
functionality generally available via a property and predefined
match classes (e.g., same prefix, same host, same domain). Feel free to
open an issue for that feature.
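
As a rough illustration only (plain Java, not the actual scoring filter
code; the class and method names below are made up), such a check could be
a simple URL prefix match:

  // Hypothetical helper: propagate injected metadata only to outlinks
  // that live "under" the injected parent URL.
  public class UrlPrefixMatch {
      // parent  = "http://www.fakenews.com/boston"
      // outlink = "http://www.fakenews.com/boston/page1.html" -> true
      // outlink = "http://www.fakenews.com/atlanta"           -> false
      public static boolean isChildOf(String parentUrl, String outlinkUrl) {
          // Normalize the trailing slash so "/boston" and "/boston/"
          // compare equally, then do a plain prefix match.
          String prefix = parentUrl.endsWith("/") ? parentUrl : parentUrl + "/";
          return outlinkUrl.startsWith(prefix);
      }
  }

Inside distributeScoreToOutlinks() you would then skip copying the metadata
to any outlink for which such a check returns false.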

Thanks,
Sebastian

On 10/06/2014 11:00 PM, Jonathan Cooper-Ellis wrote:
> Hello,
> 
> I am interested in injecting metadata and propagating that to its children
> only.
> 
> For example, I want to inject www.fakenews.com/boston along with some
> metadata that is specific to Boston, and I don't want it to be propagated to
> www.fakenews.com or www.fakenews.com/atlanta. It should only go to
> www.fakenews.com/boston/.+
> 
> Having looked at the wiki, NUTCH-655, and NUTCH-855, it seems like using
> the urlmeta plugin out of the box would not achieve this, because the
> metadata would be propagated to all outlinks (which presumably would
> include its parent, et al.).
> 
> Is this correct? If so, is there any built-in way to do this or do I need
> to figure something out?
> 
> Thanks,
> jce
> 



RE: Generated Segment Too Large

2014-10-07 Thread Markus Jelsma
Hi - you have been using Nutch for some time already, so aren't you already
familiar with the generate.max.count configuration directive, possibly combined
with the -topN parameter for the Generator job? With generate.max.count the
segment size depends on the number of distinct hosts or domains, so it is not a
reliable limit; the -topN parameter is strict.
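
For example (property names as in a stock nutch-default.xml, the values are
only illustrative; the overrides go in your nutch-site.xml):

  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>

and pass -topN to the Generator job, e.g.:

  bin/nutch generate crawl/crawldb crawl/segments -topN 50000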

Markus

 
 
-Original message-
> From: Meraj A. Khan
> Sent: Tuesday 7th October 2014 5:54
> To: user@nutch.apache.org
> Subject: Generated Segment Too Large
> 
> Hi Folks,
> 
> I am using Nutch 1.7 on Hadoop YARN. Right now there seems to be no way of
> controlling the segment size, and since the single segment being created is
> very large for the capacity of my Hadoop cluster (I have ~3TB of available
> storage), Hadoop generates spill*.out files for this large segment, which
> takes days to fetch, and I am running out of disk space.
> 
> I figured that if the segment size were controlled, then for each segment
> the spill files would be deleted after the job for that segment completed,
> giving me more efficient use of the disk space.
> 
> I would like to know how I can generate multiple segments of a certain size
> (or just a fixed number) at each depth iteration.
> 
> Right now it looks like Generator.java needs to be modified, as it does not
> consider the number of segments. Is that the right approach? If so, could
> you please give me a few pointers on what logic I should change? If this is
> not the right approach, I would be happy to know if there is any way to
> control the number as well as the size of the generated segments using
> configuration/job submission parameters.
> 
> Thanks for your help!
>