get generated segments from step / fetch all empty segments

2014-09-22 Thread Edoardo Causarano
Hi all,

I’m building an Oozie workflow to schedule the generate, fetch, etc. steps. 
Right now I'm trying to feed the list of generated segments into the following 
fetch stage.

The “crawl” script assumes that the most recently added segment is un-fetched 
and does some HDFS shell scripting to determine its name and stuff it into a 
shell variable, but I’d like to avoid this and somehow feed the list of 
generated segments directly into the following step.

I have the feeling that I could use the Oozie “capture data from action” option, 
but I think that will require fiddling with the Generator class source. That’s 
OK, but I’m a bit wary of adding custom code that may not be part of the core 
distribution. Has anyone already done something similar, preferably without 
touching the source? (e.g. 
http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it now 
404s on GitHub)


Best,
Edoardo 

-- 
Edoardo Causarano
Sent with Airmail
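
One way to avoid touching the Generator source, sketched here under stated
assumptions: an Oozie shell action declared with <capture-output/> captures
whatever the action prints to stdout in Java properties (key=value) format,
and a later action can read it with the wf:actionData EL function. The helper
below only formats such a line from a list of segment paths. The function
name and the "segments" key are hypothetical, and how the paths are obtained
(for example from an HDFS directory listing) is left out.

```python
import sys


def format_action_output(segment_paths, key="segments"):
    """Build one key=value line listing segment names, comma-separated.

    Only the last path component (the timestamped segment name) is kept.
    The key name "segments" is an assumption made for this sketch.
    """
    names = []
    for path in segment_paths:
        cleaned = path.strip().rstrip("/")
        if cleaned:
            names.append(cleaned.split("/")[-1])
    return f"{key}={','.join(names)}"


if __name__ == "__main__":
    # Segment paths are passed as command-line arguments; the printed line
    # is what a capturing shell action would pick up from stdout.
    print(format_action_output(sys.argv[1:]))
```

A following action could then reference something like
${wf:actionData('generate-step')['segments']}, where 'generate-step' is a
placeholder for the name of the capturing action. Oozie limits the size of
captured output, so this suits a short list of names rather than large data.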

Re: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Hi Edoardo,

How do you generate multiple segments during the generate phase?


RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Markus Jelsma
You can use maxNumSegments to generate more than one segment. And instead of 
passing a list of segment names around, why not just loop over the entire 
segments directory and move finished segments to another directory?
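
The loop-and-move idea above can be sketched roughly as follows. This is a
minimal illustration, not the crawl script's actual logic: it uses the local
filesystem as a stand-in for HDFS, the directory names are hypothetical, and
the process callback is a placeholder for whatever fetch/parse/update steps
the workflow runs on a segment.

```python
import shutil
from pathlib import Path


def process_all(segments_dir, done_dir, process):
    """Process every segment found in segments_dir, then move it to done_dir.

    `process` is a hypothetical callback that receives the segment path.
    Returns the names of the segments that were processed and moved.
    """
    done = Path(done_dir)
    done.mkdir(parents=True, exist_ok=True)
    moved = []
    # Take a sorted snapshot first so moving entries does not disturb the loop.
    for segment in sorted(Path(segments_dir).iterdir()):
        if not segment.is_dir():
            continue
        process(segment)
        shutil.move(str(segment), str(done / segment.name))
        moved.append(segment.name)
    return moved
```

Because finished segments leave the input directory, anything still found
there on the next run is pending by construction, so no segment names need to
be passed between steps.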

 
 
-Original message-
 From:Edoardo Causarano edoardo.causar...@gmail.com
 Sent: Monday 22nd September 2014 15:25
 To: user@nutch.apache.org
 Subject: Re: get generated segments from step / fetch all empty segments
 
 Hi Meraj,
 
 At the moment I’m not, but the “generate” method in the Generator job class 
 does return a list of Paths, so the possibility is there (somehow). For now 
 I’m concentrating on passing at least one segment name from one step to the 
 next; then I’ll see if and how I can get more.
 
 
 Best,
 Edoardo
     
 


RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Markus, I have tried maxNumSegments but had no luck. Is it driven by the
size of the segment instead?



RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Markus Jelsma
Hi - It will only generate more segments when there are enough URLs to 
generate, combined with either topN or generate.count.mode and 
generate.max.count. 
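
As a rough illustration of "enough URLs", here is a simplified model that
considers only topN and maxNumSegments. It assumes each segment is filled
with up to topN URLs before the next one is started, and it ignores the
per-host or per-domain limits set by generate.count.mode and
generate.max.count as well as any split across reducers. The function name
is hypothetical.

```python
import math


def estimate_segment_count(eligible_urls, top_n, max_num_segments):
    """Estimate how many segments one generate run would produce.

    Simplified model: segments are filled with up to top_n URLs each,
    capped at max_num_segments.
    """
    if top_n <= 0 or max_num_segments <= 0:
        raise ValueError("top_n and max_num_segments must be positive")
    if eligible_urls <= 0:
        return 0
    return min(max_num_segments, math.ceil(eligible_urls / top_n))
```

Under this model, 800 eligible URLs with topN 1000 yield a single segment no
matter how high maxNumSegments is set, which would match the "no luck"
symptom described in this thread.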
 


RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Thanks Markus. Is that “enough” driven by the HDFS block size?

Edoardo, sorry for hijacking your thread. :(