RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Thanks Markus, is that "enough" driven by the HDFS block size?

Edoardo, sorry for hijacking your thread. :(
On Sep 22, 2014 9:35 AM, "Markus Jelsma"  wrote:

> Hi - It will only generate more segments when there are enough URLs to
> generate, combined with either topN or generate.count.mode and
> generate.max.count.


RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Markus Jelsma
Hi - It will only generate more segments when there are enough URLs to
generate, combined with either topN or generate.count.mode and
generate.max.count.
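
As a rough illustration of the "enough URLs" condition above, the number of segments one generate run can fill might be estimated as below. This is a simplified model and not Nutch's actual Generator logic: it assumes topN acts as a per-segment cap, it ignores generate.count.mode / generate.max.count and reducer partitioning, and the function name `estimate_segments` is hypothetical.

```python
import math


def estimate_segments(eligible_urls: int, top_n: int, max_num_segments: int) -> int:
    """Rough upper bound on the segments produced by one generate run.

    Simplified model (an assumption, not Nutch's code): each segment holds
    at most top_n URLs, and at most max_num_segments segments are created.
    """
    if eligible_urls <= 0 or top_n <= 0 or max_num_segments <= 0:
        return 0
    return min(max_num_segments, math.ceil(eligible_urls / top_n))


# 1,500 due URLs with topN=1000 fill at most two segments even when
# maxNumSegments is 5; with only 800 due URLs a single segment results.
print(estimate_segments(1500, 1000, 5))  # 2
print(estimate_segments(800, 1000, 5))   # 1
```

Under this model, raising maxNumSegments alone changes nothing unless more URLs are due for fetching than a single segment can hold.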
 
-Original message-
> From:Meraj A. Khan 
> Sent: Monday 22nd September 2014 15:33
> To: user@nutch.apache.org
> Subject: RE: get generated segments from step / fetch all empty segments
> 
> Markus, I have used maxNumSegments but no luck. Is it driven by the
> size of the segment instead?


RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Markus, I have used maxNumSegments but no luck. Is it driven by the
size of the segment instead?
On Sep 22, 2014 9:28 AM, "Markus Jelsma"  wrote:

> You can use maxNumSegments to generate more than one segment. And instead
> of passing a list of segment names around, why not just loop over the
> entire directory and move finished segments to another?


Re: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Thanks, I was thinking of exactly the same thing the other day, because the
crawl script sometimes loses contact with the long-running fetch job and
just keeps on waiting for %complete updates, especially when the script is
run in the background, even after the fetch job is complete, effectively
resulting in the termination of the script execution.

I hope this could be avoided in the Oozie workflow.
On Sep 22, 2014 9:25 AM, "Edoardo Causarano" 
wrote:

> Hi Meraj,
>
> at the moment I’m not, but in the Generator job class the method
> “generate” does return a list of Paths therefore the possibility is there
> (somehow.) For now I’m concentrating on passing at least 1 segment name
> from one step to the other, then I’ll see if and how I can get more.
>
>
> Best,
> Edoardo


RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Markus Jelsma
You can use maxNumSegments to generate more than one segment. And instead of
passing a list of segment names around, why not just loop over the entire
directory and move finished segments to another?
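
The directory-loop suggestion above could be sketched roughly as follows. It assumes a Nutch 1.x segment layout in which a generated segment contains crawl_generate and a fetched one additionally contains crawl_fetch, and it treats "finished" as "fetched" for simplicity. It works on a local filesystem through pathlib; on HDFS the same checks would have to go through the Hadoop filesystem tooling instead. All function names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def is_fetched(segment: Path) -> bool:
    # Assumption: a fetched segment has a crawl_fetch subdirectory,
    # while a freshly generated one only has crawl_generate.
    return (segment / "crawl_fetch").is_dir()


def pending_segments(segments_dir: Path) -> List[Path]:
    """Return generated-but-unfetched segments, oldest first.

    Segment names are timestamps, so a lexical sort is chronological.
    """
    return sorted(
        s for s in segments_dir.iterdir()
        if s.is_dir() and (s / "crawl_generate").is_dir() and not is_fetched(s)
    )


def archive_finished(segments_dir: Path, done_dir: Path) -> List[Path]:
    """Move fetched segments into done_dir and return their new paths."""
    done_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for s in sorted(segments_dir.iterdir()):
        if s.is_dir() and is_fetched(s):
            target = done_dir / s.name
            shutil.move(str(s), str(target))
            moved.append(target)
    return moved
```

A workflow step would then fetch everything returned by `pending_segments` and call `archive_finished` afterwards, so no segment name has to be handed from one step to the next.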

 
 
-Original message-
> From:Edoardo Causarano 
> Sent: Monday 22nd September 2014 15:25
> To: user@nutch.apache.org
> Subject: Re: get generated segments from step / fetch all empty segments
> 
> Hi Meraj,
> 
> at the moment I’m not, but in the Generator job class the method “generate” 
> does return a list of Paths therefore the possibility is there (somehow.) For 
> now I’m concentrating on passing at least 1 segment name from one step to the 
> other, then I’ll see if and how I can get more.
> 
> 
> Best,
> Edoardo


Re: get generated segments from step / fetch all empty segments

2014-09-22 Thread Edoardo Causarano
Hi Meraj,

At the moment I’m not, but in the Generator job class the method “generate”
does return a list of Paths, so the possibility is there (somehow). For
now I’m concentrating on passing at least one segment name from one step to the
other; then I’ll see if and how I can get more.


Best,
Edoardo
    

On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com) wrote:

Hi Edoardo,  

How do you generate the multiple segments at the time of generate phase?  
-- 
Edoardo Causarano
Sent with Airmail

Re: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Hi Edoardo,

How do you generate the multiple segments at the time of generate phase?
On Sep 22, 2014 6:01 AM, "Edoardo Causarano" 
wrote:

> Hi all,
>
> I’m building an Oozie workflow to schedule the generate, fetch, etc…
> workflow. Right now I'm trying to feed the list of generated segments into
> the following fetch stage.
>
> The “crawl” script assumes that the most recently added segment is
> un-fetched and does some hdfs shell scripting to determine its name and
> stuff this into a shell variable, but I’d like to avoid this and somehow
> feed the list of generated segments directly into the following step.
>
> I have the feeling that I could use the Oozie “capture data from action”
> option but I think that will require fiddling with the Generator class
> source; that’s ok but I’m a bit wary of adding custom code that may not be
> part of the core distribution. Has anyone already done something similar,
> preferably without touching the source? (e.g.
> http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it
> now 404s on GitHub)
>
>
> Best,
> Edoardo
>
> --
> Edoardo Causarano
> Sent with Airmail
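
One way to approach the "capture data from action" idea in the original post without touching the Generator source might be a small script run as an Oozie shell action with `<capture-output/>`, which reads the action's stdout as Java-properties-style `key=value` lines. The sketch below only shows the formatting side. The key name `segments`, the function name, and the 2048-byte limit (assumed here to be Oozie's default cap on captured output) are assumptions to check against your own installation.

```python
from typing import List

# Assumption: Oozie's default cap on captured action output is about 2 KB.
MAX_OUTPUT_BYTES = 2048


def format_action_data(key: str, segments: List[str],
                       max_bytes: int = MAX_OUTPUT_BYTES) -> str:
    """Render segment names as a single key=value line for capture-output."""
    line = f"{key}={','.join(segments)}"
    if len(line.encode("utf-8")) > max_bytes:
        raise ValueError("segment list too large for captured action output")
    return line


# What the shell action would print to stdout for two generated segments.
print(format_action_data("segments", ["20140922100000", "20140922110000"]))
# segments=20140922100000,20140922110000
```

A later workflow node could then read the value with the `wf:actionData` EL function and split it on commas before launching the fetch step.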