get generated segments from step / fetch all empty segments
Hi all,

I’m building an Oozie workflow to schedule the generate, fetch, etc. steps. Right now I'm trying to feed the list of generated segments into the following fetch stage. The “crawl” script assumes that the most recently added segment is unfetched and does some HDFS shell scripting to determine its name and stuff this into a shell variable, but I’d like to avoid this and somehow feed the list of generated segments directly into the following step.

I have the feeling that I could use the Oozie “capture data from action” option, but I think that will require fiddling with the Generator class source; that’s OK, but I’m a bit wary of adding custom code that may not be part of the core distribution.

Has anyone already done something similar, preferably without touching the source? (e.g. http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it now 404s on GitHub)

Best,
Edoardo

--
Edoardo Causarano
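
One way to try the “capture data from action” idea without touching the Generator source is a small script run from an Oozie shell action with capture-output enabled: Oozie reads the action's stdout as Java properties (key=value) and exposes them to later actions through wf:actionData. What follows is only a sketch under assumptions, not a tested workflow. It scans a local or mounted path instead of calling HDFS, it treats a segment as unfetched when it has a crawl_generate subdirectory but no crawl_fetch one, and the property key `segments` and both function names are made up for this example.

```python
import sys
from pathlib import Path


def unfetched_segments(segments_dir):
    """Return names of segments that were generated but not yet fetched.

    Assumption: a generated segment contains crawl_generate, and a fetched
    one additionally contains crawl_fetch. Names are returned sorted, which
    for timestamp-named segments means oldest first.
    """
    root = Path(segments_dir)
    found = []
    for seg in sorted(p for p in root.iterdir() if p.is_dir()):
        has_generate = (seg / "crawl_generate").is_dir()
        has_fetch = (seg / "crawl_fetch").exists()
        if has_generate and not has_fetch:
            found.append(seg.name)
    return found


def capture_output_line(key, values):
    """Format one key=value line in the properties style capture-output reads."""
    return f"{key}={','.join(values)}"


if __name__ == "__main__" and len(sys.argv) == 2:
    # "segments" is a hypothetical key name chosen for this sketch.
    print(capture_output_line("segments", unfetched_segments(sys.argv[1])))
```

A later action could then read the value with something like ${wf:actionData('find-segments')['segments']}, where find-segments is a hypothetical action name. On a real cluster the directory scan would have to go through the HDFS shell or client API instead of pathlib.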
Re: get generated segments from step / fetch all empty segments
Hi Edoardo,

How do you generate multiple segments during the generate phase?

On Sep 22, 2014 6:01 AM, Edoardo Causarano edoardo.causar...@gmail.com wrote: Hi all, I’m building an Oozie workflow to schedule the generate, fetch, etc. steps. [...]
RE: get generated segments from step / fetch all empty segments
You can use maxNumSegments to generate more than one segment. And instead of passing a list of segment names around, why not just loop over the entire directory and move finished segments to another one?

-Original message-
From: Edoardo Causarano edoardo.causar...@gmail.com
Sent: Monday 22nd September 2014 15:25
To: user@nutch.apache.org
Subject: Re: get generated segments from step / fetch all empty segments

Hi Meraj, at the moment I’m not, but in the Generator job class the “generate” method does return a list of Paths, so the possibility is there (somehow). For now I’m concentrating on passing at least one segment name from one step to the next; then I’ll see if and how I can get more.

Best,
Edoardo

On 22 September 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com) wrote: Hi Edoardo, how do you generate multiple segments during the generate phase? [...]
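
The loop-and-move suggestion at the top of this reply could be sketched roughly as follows. This is a minimal illustration on a local filesystem, assuming each pending segment is a directory and that a caller-supplied `process` callable (hypothetical, standing in for fetch, parse, and so on) raises an exception on failure. The function and directory names are invented for the example; on HDFS the move would be done with the HDFS shell or client API instead of shutil.

```python
import shutil
from pathlib import Path


def drain_segments(pending_dir, done_dir, process):
    """Process every segment in pending_dir and move each finished one to done_dir.

    `process` is called with the segment's Path. If it raises, the exception
    propagates and that segment stays in pending_dir, so the next run will
    pick it up again.
    """
    pending, done = Path(pending_dir), Path(done_dir)
    done.mkdir(parents=True, exist_ok=True)
    moved = []
    # Snapshot the listing first so moving directories does not disturb iteration.
    for seg in sorted(p for p in pending.iterdir() if p.is_dir()):
        process(seg)
        shutil.move(str(seg), str(done / seg.name))
        moved.append(seg.name)
    return moved
```

With this layout no segment names need to be passed between steps: whatever is still in the pending directory is, by construction, unfinished.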
RE: get generated segments from step / fetch all empty segments
Markus, I have used maxNumSegments but had no luck. Is it driven by the size of the segment instead?

On Sep 22, 2014 9:28 AM, Markus Jelsma markus.jel...@openindex.io wrote: You can use maxNumSegments to generate more than one segment. [...]
RE: get generated segments from step / fetch all empty segments
Hi - It will only generate more segments when there are enough URLs to generate, combined with either topN or generate.count.mode and generate.max.count.

-Original message-
From: Meraj A. Khan mera...@gmail.com
Sent: Monday 22nd September 2014 15:33
To: user@nutch.apache.org
Subject: RE: get generated segments from step / fetch all empty segments

Markus, I have used maxNumSegments but had no luck. Is it driven by the size of the segment instead? [...]
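
To make the “enough URLs” point concrete, here is a toy model of how candidate URLs might spill over into additional segments. It is deliberately simplified and is not the Generator's actual selection logic: it assumes a per-segment cap (playing the role of topN), a per-host cap per segment (playing the role of generate.max.count with host counting), and that a URL goes into the first segment with room for it. All function and parameter names are invented for this sketch.

```python
from collections import defaultdict


def plan_segments(hosts, top_n, max_per_host, max_num_segments):
    """Return the URL count of each non-empty segment in a toy spill-over model.

    hosts: one host name per candidate URL, best-scoring first.
    A URL is placed in the first segment that is below top_n in total and
    below max_per_host for that URL's host; if no segment has room, the URL
    is left for a later generate run.
    """
    totals = [0] * max_num_segments
    per_host = [defaultdict(int) for _ in range(max_num_segments)]
    for host in hosts:
        for i in range(max_num_segments):
            if totals[i] < top_n and per_host[i][host] < max_per_host:
                totals[i] += 1
                per_host[i][host] += 1
                break
    return [t for t in totals if t > 0]


# Few URLs: only one segment, even though three are allowed.
print(plan_segments(["a.example"] * 5, top_n=10, max_per_host=100, max_num_segments=3))
# More URLs than one segment may hold: the overflow fills further segments.
print(plan_segments(["a.example"] * 25, top_n=10, max_per_host=100, max_num_segments=3))
```

In this model a second segment only appears once one of the caps is actually hit, which matches the behaviour described above: asking for several segments has no visible effect while all eligible URLs still fit in the first one.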
RE: get generated segments from step / fetch all empty segments
Thanks Markus, is that “enough” driven by the HDFS block size? Edoardo, sorry for hijacking your thread. :(

On Sep 22, 2014 9:35 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - It will only generate more segments when there are enough URLs to generate, combined with either topN or generate.count.mode and generate.max.count. [...]