Re: jsessionid not being removed from the url

2014-09-22 Thread S.L
Sebastian, I am using Nutch 1.7. A specific example of a URL whose session
id is not removed is this:

http://www.xyz.com/site/hosa-technology-3-5mm-trs-to-1-4-trs-adapter/8561415.p;jsessionid=7936CA95263E9C78B735E5EBE827BDDA.bbolsp-app04-163?id=1208561582654&skuId=8561415&st=categoryid$abcat0207000&cp=1&lp=8
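
One way to check the pattern in isolation is to apply it to this URL outside of Nutch. The following is a minimal sketch, not Nutch code: it assumes Python's `re` module treats this pattern the same way as the Java regex engine that Nutch uses, and it writes the `$4` substitution as `\4`. The function name `strip_session_id` is made up for illustration.

```python
import re

# Pattern and substitution as quoted in this thread from regex-normalize.xml.
# Assumption: Python's `re` handles this pattern like Java's regex engine;
# `\4` is Python's spelling of the `$4` substitution.
SESSION_ID = re.compile(
    r"(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&|#|$)"
)


def strip_session_id(url: str) -> str:
    """Drop the session id and keep the delimiter that follows it (group 4)."""
    return SESSION_ID.sub(r"\4", url)


url = (
    "http://www.xyz.com/site/hosa-technology-3-5mm-trs-to-1-4-trs-adapter/"
    "8561415.p;jsessionid=7936CA95263E9C78B735E5EBE827BDDA.bbolsp-app04-163"
    "?id=1208561582654&skuId=8561415&st=categoryid$abcat0207000&cp=1&lp=8"
)

print(strip_session_id(url))
```

If the session id disappears here, the pattern itself is probably fine, and the next thing to check would be whether the regex normalizer is actually applied to this URL during the crawl.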



On Mon, Sep 22, 2014 at 4:12 PM, Sebastian Nagel  wrote:

> > Looks like this should have been removed , is the regex in
> > regex-normalize.xml correct ?
> >
>
> Yes. It removes various session ids, see
> src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test
>
> Can you give a concrete example of a session id not removed?
> Which Nutch version is used?
>
> Thanks,
> Sebastian
>
> On 09/22/2014 06:43 AM, S.L wrote:
> > The jsessionid on the crawled URL is not being removed, even though a
> > regex URL normalizer is specified. Can someone please let me know what
> > the issue is here?
> >
> > I have already set the following
> >
> > *nutch-site.xml*
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-optic|urlnormalizer-(pass|regex|basic)</value>
> > </property>
> >
> > <property>
> >   <name>urlnormalizer.order</name>
> >   <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> >   org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
> >   <description>Order in which normalizers will run. If any of these isn't
> >   activated it will be silently skipped. If other normalizers not on the
> >   list are activated, they will run in random order after the ones
> >   specified here are run.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>urlnormalizer.regex.file</name>
> >   <value>regex-normalize.xml</value>
> >   <description>Name of the config file used by the RegexUrlNormalizer class.
> >   </description>
> > </property>
> >
> > *And the regex-normalize.xml file has this entry*
> >
> > <regex>
> >   <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
> >   <substitution>$4</substitution>
> > </regex>
> >
> >
> > Looks like this should have been removed , is the regex in
> > regex-normalize.xml correct ?
> >
>
>


Re: jsessionid not being removed from the url

2014-09-22 Thread Sebastian Nagel
> Looks like this should have been removed , is the regex in
> regex-normalize.xml correct ?
>

Yes. It removes various session ids, see
src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test

Can you give a concrete example of a session id not removed?
Which Nutch version is used?

Thanks,
Sebastian

On 09/22/2014 06:43 AM, S.L wrote:
> The jsessionid on the crawled URL is not being removed, even though a regex
> URL normalizer is specified. Can someone please let me know what the
> issue is here?
> 
> I have already set the following
> 
> *nutch-site.xml*
> 
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-optic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> <property>
>   <name>urlnormalizer.order</name>
>   <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
>   org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
>   <description>Order in which normalizers will run. If any of these isn't
>   activated it will be silently skipped. If other normalizers not on the
>   list are activated, they will run in random order after the ones
>   specified here are run.
>   </description>
> </property>
>
> <property>
>   <name>urlnormalizer.regex.file</name>
>   <value>regex-normalize.xml</value>
>   <description>Name of the config file used by the RegexUrlNormalizer class.
>   </description>
> </property>
>
> *And the regex-normalize.xml file has this entry*
>
> <regex>
>   <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
>   <substitution>$4</substitution>
> </regex>
> 
> 
> Looks like this should have been removed , is the regex in
> regex-normalize.xml correct ?
> 



RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Thanks Markus, is that "enough" driven by the HDFS block size?

Edoardo, sorry for hijacking your thread. :(
On Sep 22, 2014 9:35 AM, "Markus Jelsma"  wrote:

> Hi - It will only generate more segments when there are enough URL's to
> generate combined with either topN or generate.count.mode and
> generate.max.count.
>
> -Original message-
> > From:Meraj A. Khan 
> > Sent: Monday 22nd September 2014 15:33
> > To: user@nutch.apache.org
> > Subject: RE: get generated segments from step / fetch all empty segments
> >
> > Markus, I have used the maxnum segments but no luck, is it driven by the
> > size of the segment instead ?
> > On Sep 22, 2014 9:28 AM, "Markus Jelsma" 
> wrote:
> >
> > > You can use maxNumSegments to generate more than one segment. And
> instead
> > > of passing a list of segment names around, why not just loop over the
> > > entire directory, and move finished segments to another.
> > >
> > >
> > >
> > > -Original message-
> > > > From:Edoardo Causarano 
> > > > Sent: Monday 22nd September 2014 15:25
> > > > To: user@nutch.apache.org
> > > > Subject: Re: get generated segments from step / fetch all empty
> segments
> > > >
> > > > Hi Meraj,
> > > >
> > > > at the moment I’m not, but in the Generator job class the method
> > > “generate” does return a list of Paths therefore the possibility is
> there
> > > (somehow.) For now I’m concentrating on passing at least 1 segment name
> > > from one step to the other, then I’ll see if and how I can get more.
> > > >
> > > >
> > > > Best,
> > > > Edoardo
> > > >
> > > >
> > > > On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com)
> > > wrote:
> > > >
> > > > Hi Edoardo,
> > > >
> > > > How do you generate the multiple segments at the time of generate
> phase?
> > > > On Sep 22, 2014 6:01 AM, "Edoardo Causarano" <
> > > edoardo.causar...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I’m building an Oozie workflow to schedule the generate, fetch,
> etc…
> > > > > workflow. Right now I'm trying to feed the list of generated
> segments
> > > into
> > > > > the following fetch stage.
> > > > >
> > > > > The “crawl” script assumes that the most recently added segment is
> > > > > un-fetched and does some hdfs shell scripting to determine its
> name and
> > > > > stuff this into a shell variable, but I’d like to avoid this and
> > > somehow
> > > > > feed the list of generated segments directly into the following
> step.
> > > > >
> > > > > I have the feeling that I could use the ooze “capture data from
> action”
> > > > > option but I think that will require fiddling with the Generator
> class
> > > > > source; that’s ok but I’m a bit weary of adding custom code that
> may
> > > not be
> > > > > part of the core distribution. Has anyone already done something
> > > similar,
> > > > > preferably without touching the source? (e.g.
> > > > > http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch
> > > but it
> > > > > now 404s on GitHub)
> > > > >
> > > > >
> > > > > Best,
> > > > > Edoardo
> > > > >
> > > > > --
> > > > > Edoardo Causarano
> > > > > Sent with Airmail
> > > > --
> > > > Edoardo Causarano
> > > > Sent with Airmail
> > >
> >
>


RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Markus Jelsma
Hi - It will only generate more segments when there are enough URLs to
generate, combined with either topN or generate.count.mode and
generate.max.count.
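
To make "enough URLs" concrete, here is a deliberately simplified model rather than Nutch's actual selection logic. It assumes each segment holds at most topN URLs and it ignores generate.count.mode and generate.max.count entirely. The function and parameter names are made up for illustration.

```python
import math


def estimate_segment_count(eligible_urls: int, top_n: int,
                           max_num_segments: int) -> int:
    """Rough estimate of segments from one generate run (simplified model).

    Assumption: every segment is filled with up to `top_n` URLs before the
    next one is started, and no per-host/per-domain limits apply.
    """
    if eligible_urls <= 0 or top_n <= 0 or max_num_segments <= 0:
        return 0
    return min(max_num_segments, math.ceil(eligible_urls / top_n))
```

Under this model, asking for 3 segments with a topN of 1000 still yields a single segment when only 500 URLs are due for fetching, which would be consistent with seeing "no luck" from maxNumSegments on a small crawldb.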
 
-Original message-
> From:Meraj A. Khan 
> Sent: Monday 22nd September 2014 15:33
> To: user@nutch.apache.org
> Subject: RE: get generated segments from step / fetch all empty segments
> 
> Markus, I have used the maxnum segments but no luck, is it driven by the
> size of the segment instead ?
> On Sep 22, 2014 9:28 AM, "Markus Jelsma"  wrote:
> 
> > You can use maxNumSegments to generate more than one segment. And instead
> > of passing a list of segment names around, why not just loop over the
> > entire directory, and move finished segments to another.
> >
> >
> >
> > -Original message-
> > > From:Edoardo Causarano 
> > > Sent: Monday 22nd September 2014 15:25
> > > To: user@nutch.apache.org
> > > Subject: Re: get generated segments from step / fetch all empty segments
> > >
> > > Hi Meraj,
> > >
> > > at the moment I’m not, but in the Generator job class the method
> > “generate” does return a list of Paths therefore the possibility is there
> > (somehow.) For now I’m concentrating on passing at least 1 segment name
> > from one step to the other, then I’ll see if and how I can get more.
> > >
> > >
> > > Best,
> > > Edoardo
> > >
> > >
> > > On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com)
> > wrote:
> > >
> > > Hi Edoardo,
> > >
> > > How do you generate the multiple segments at the time of generate phase?
> > > On Sep 22, 2014 6:01 AM, "Edoardo Causarano" <
> > edoardo.causar...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I’m building an Oozie workflow to schedule the generate, fetch, etc…
> > > > workflow. Right now I'm trying to feed the list of generated segments
> > into
> > > > the following fetch stage.
> > > >
> > > > The “crawl” script assumes that the most recently added segment is
> > > > un-fetched and does some hdfs shell scripting to determine its name and
> > > > stuff this into a shell variable, but I’d like to avoid this and
> > somehow
> > > > feed the list of generated segments directly into the following step.
> > > >
> > > > I have the feeling that I could use the ooze “capture data from action”
> > > > option but I think that will require fiddling with the Generator class
> > > > source; that’s ok but I’m a bit weary of adding custom code that may
> > not be
> > > > part of the core distribution. Has anyone already done something
> > similar,
> > > > preferably without touching the source? (e.g.
> > > > http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch
> > but it
> > > > now 404s on GitHub)
> > > >
> > > >
> > > > Best,
> > > > Edoardo
> > > >
> > > > --
> > > > Edoardo Causarano
> > > > Sent with Airmail
> > > --
> > > Edoardo Causarano
> > > Sent with Airmail
> >
> 


RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Markus, I have used maxNumSegments but had no luck. Is it driven by the
size of the segment instead?
On Sep 22, 2014 9:28 AM, "Markus Jelsma"  wrote:

> You can use maxNumSegments to generate more than one segment. And instead
> of passing a list of segment names around, why not just loop over the
> entire directory, and move finished segments to another.
>
>
>
> -Original message-
> > From:Edoardo Causarano 
> > Sent: Monday 22nd September 2014 15:25
> > To: user@nutch.apache.org
> > Subject: Re: get generated segments from step / fetch all empty segments
> >
> > Hi Meraj,
> >
> > at the moment I’m not, but in the Generator job class the method
> “generate” does return a list of Paths therefore the possibility is there
> (somehow.) For now I’m concentrating on passing at least 1 segment name
> from one step to the other, then I’ll see if and how I can get more.
> >
> >
> > Best,
> > Edoardo
> >
> >
> > On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com)
> wrote:
> >
> > Hi Edoardo,
> >
> > How do you generate the multiple segments at the time of generate phase?
> > On Sep 22, 2014 6:01 AM, "Edoardo Causarano" <
> edoardo.causar...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I’m building an Oozie workflow to schedule the generate, fetch, etc…
> > > workflow. Right now I'm trying to feed the list of generated segments
> into
> > > the following fetch stage.
> > >
> > > The “crawl” script assumes that the most recently added segment is
> > > un-fetched and does some hdfs shell scripting to determine its name and
> > > stuff this into a shell variable, but I’d like to avoid this and
> somehow
> > > feed the list of generated segments directly into the following step.
> > >
> > > I have the feeling that I could use the ooze “capture data from action”
> > > option but I think that will require fiddling with the Generator class
> > > source; that’s ok but I’m a bit weary of adding custom code that may
> not be
> > > part of the core distribution. Has anyone already done something
> similar,
> > > preferably without touching the source? (e.g.
> > > http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch
> but it
> > > now 404s on GitHub)
> > >
> > >
> > > Best,
> > > Edoardo
> > >
> > > --
> > > Edoardo Causarano
> > > Sent with Airmail
> > --
> > Edoardo Causarano
> > Sent with Airmail
>


Re: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Thanks, I was thinking of exactly the same thing the other day, because the
crawl script sometimes loses contact with the long-running fetch job.
Especially when the script is run in the background, it just keeps waiting
for %complete updates even after the fetch job is complete, which effectively
terminates the script execution.

I hope this can be avoided in the Oozie workflow.
On Sep 22, 2014 9:25 AM, "Edoardo Causarano" 
wrote:

> Hi Meraj,
>
> at the moment I’m not, but in the Generator job class the method
> “generate” does return a list of Paths therefore the possibility is there
> (somehow.) For now I’m concentrating on passing at least 1 segment name
> from one step to the other, then I’ll see if and how I can get more.
>
>
> Best,
> Edoardo
>
>
> On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com) wrote:
>
> Hi Edoardo,
>
> How do you generate the multiple segments at the time of generate phase?
> On Sep 22, 2014 6:01 AM, "Edoardo Causarano" 
> wrote:
>
> > Hi all,
> >
> > I’m building an Oozie workflow to schedule the generate, fetch, etc…
> > workflow. Right now I'm trying to feed the list of generated segments
> into
> > the following fetch stage.
> >
> > The “crawl” script assumes that the most recently added segment is
> > un-fetched and does some hdfs shell scripting to determine its name and
> > stuff this into a shell variable, but I’d like to avoid this and somehow
> > feed the list of generated segments directly into the following step.
> >
> > I have the feeling that I could use the ooze “capture data from action”
> > option but I think that will require fiddling with the Generator class
> > source; that’s ok but I’m a bit weary of adding custom code that may not
> be
> > part of the core distribution. Has anyone already done something similar,
> > preferably without touching the source? (e.g.
> > http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but
> it
> > now 404s on GitHub)
> >
> >
> > Best,
> > Edoardo
> >
> > --
> > Edoardo Causarano
> > Sent with Airmail
> --
> Edoardo Causarano
> Sent with Airmail


RE: get generated segments from step / fetch all empty segments

2014-09-22 Thread Markus Jelsma
You can use maxNumSegments to generate more than one segment. And instead of
passing a list of segment names around, why not just loop over the entire
directory and move finished segments to another directory?
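
A minimal sketch of that loop, assuming the segments live on a local filesystem (on HDFS the same idea would need the Hadoop filesystem tools instead of `shutil`). The `handle` callback is a placeholder for whatever fetch/parse/update steps are run per segment; it is not a Nutch API.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_segments(segments_dir: Path, done_dir: Path,
                     handle: Callable[[Path], None]) -> List[str]:
    """Run `handle` on every segment directory, then move it to `done_dir`.

    `done_dir` must live outside `segments_dir`, otherwise it would be
    picked up as a segment itself.
    """
    done_dir.mkdir(parents=True, exist_ok=True)
    finished = []
    # sorted() materializes the listing first, so moving directories while
    # looping does not disturb the iteration.
    for segment in sorted(p for p in segments_dir.iterdir() if p.is_dir()):
        handle(segment)  # placeholder: fetch, parse, updatedb, ...
        shutil.move(str(segment), str(done_dir / segment.name))
        finished.append(segment.name)
    return finished
```

If `handle` raises, the segment stays where it is, so a later run picks it up again.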

 
 
-Original message-
> From:Edoardo Causarano 
> Sent: Monday 22nd September 2014 15:25
> To: user@nutch.apache.org
> Subject: Re: get generated segments from step / fetch all empty segments
> 
> Hi Meraj,
> 
> at the moment I’m not, but in the Generator job class the method “generate” 
> does return a list of Paths therefore the possibility is there (somehow.) For 
> now I’m concentrating on passing at least 1 segment name from one step to the 
> other, then I’ll see if and how I can get more.
> 
> 
> Best,
> Edoardo
>     
> 
> On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com) wrote:
> 
> Hi Edoardo,  
> 
> How do you generate the multiple segments at the time of generate phase?  
> On Sep 22, 2014 6:01 AM, "Edoardo Causarano"   
> wrote:  
> 
> > Hi all,  
> >  
> > I’m building an Oozie workflow to schedule the generate, fetch, etc…  
> > workflow. Right now I'm trying to feed the list of generated segments into  
> > the following fetch stage.  
> >  
> > The “crawl” script assumes that the most recently added segment is  
> > un-fetched and does some hdfs shell scripting to determine its name and  
> > stuff this into a shell variable, but I’d like to avoid this and somehow  
> > feed the list of generated segments directly into the following step.  
> >  
> > I have the feeling that I could use the ooze “capture data from action”  
> > option but I think that will require fiddling with the Generator class  
> > source; that’s ok but I’m a bit weary of adding custom code that may not be 
> >  
> > part of the core distribution. Has anyone already done something similar,  
> > preferably without touching the source? (e.g.  
> > http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it  
> > now 404s on GitHub)  
> >  
> >  
> > Best,  
> > Edoardo  
> >  
> > --  
> > Edoardo Causarano  
> > Sent with Airmail  
> -- 
> Edoardo Causarano
> Sent with Airmail


Re: get generated segments from step / fetch all empty segments

2014-09-22 Thread Edoardo Causarano
Hi Meraj,

At the moment I'm not, but in the Generator job class the method “generate”
does return a list of Paths, so the possibility is there (somehow). For
now I’m concentrating on passing at least one segment name from one step to the
next; then I’ll see if and how I can get more.


Best,
Edoardo
    

On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com) wrote:

Hi Edoardo,  

How do you generate the multiple segments at the time of generate phase?  
On Sep 22, 2014 6:01 AM, "Edoardo Causarano"   
wrote:  

> Hi all,  
>  
> I’m building an Oozie workflow to schedule the generate, fetch, etc…  
> workflow. Right now I'm trying to feed the list of generated segments into  
> the following fetch stage.  
>  
> The “crawl” script assumes that the most recently added segment is  
> un-fetched and does some hdfs shell scripting to determine its name and  
> stuff this into a shell variable, but I’d like to avoid this and somehow  
> feed the list of generated segments directly into the following step.  
>  
> I have the feeling that I could use the ooze “capture data from action”  
> option but I think that will require fiddling with the Generator class  
> source; that’s ok but I’m a bit weary of adding custom code that may not be  
> part of the core distribution. Has anyone already done something similar,  
> preferably without touching the source? (e.g.  
> http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it  
> now 404s on GitHub)  
>  
>  
> Best,  
> Edoardo  
>  
> --  
> Edoardo Causarano  
> Sent with Airmail  
-- 
Edoardo Causarano
Sent with Airmail

Re: get generated segments from step / fetch all empty segments

2014-09-22 Thread Meraj A. Khan
Hi Edoardo,

How do you generate multiple segments during the generate phase?
On Sep 22, 2014 6:01 AM, "Edoardo Causarano" 
wrote:

> Hi all,
>
> I’m building an Oozie workflow to schedule the generate, fetch, etc…
> workflow. Right now I'm trying to feed the list of generated segments into
> the following fetch stage.
>
> The “crawl” script assumes that the most recently added segment is
> un-fetched and does some hdfs shell scripting to determine its name and
> stuff this into a shell variable, but I’d like to avoid this and somehow
> feed the list of generated segments directly into the following step.
>
> I have the feeling that I could use the ooze “capture data from action”
> option but I think that will require fiddling with the Generator class
> source; that’s ok but I’m a bit weary of adding custom code that may not be
> part of the core distribution. Has anyone already done something similar,
> preferably without touching the source? (e.g.
> http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it
> now 404s on GitHub)
>
>
> Best,
> Edoardo
>
> --
> Edoardo Causarano
> Sent with Airmail


get generated segments from step / fetch all empty segments

2014-09-22 Thread Edoardo Causarano
Hi all,

I’m building an Oozie workflow to schedule the generate, fetch, etc. steps.
Right now I'm trying to feed the list of generated segments into the following
fetch stage.

The “crawl” script assumes that the most recently added segment is un-fetched
and does some HDFS shell scripting to determine its name and stuff it into a
shell variable, but I’d like to avoid this and somehow feed the list of
generated segments directly into the following step.

I have the feeling that I could use the Oozie “capture data from action” option,
but I think that will require fiddling with the Generator class source. That’s
OK, but I’m a bit wary of adding custom code that may not be part of the core
distribution. Has anyone already done something similar, preferably without
touching the source? (e.g.
http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it now
404s on GitHub)


Best,
Edoardo 

-- 
Edoardo Causarano
Sent with Airmail
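
One way this could work without touching the Generator source is a small helper that lists the segments that have not been fetched yet and formats them as a `key=value` line. Printed to stdout by an action with output capture enabled, such a line can be read by later workflow steps. The following is only a sketch under several assumptions: the segments are on a local filesystem, a segment counts as un-fetched when it has a `crawl_generate` subdirectory but no `crawl_fetch` yet, and the function names and the `segments` key are made up for illustration.

```python
from pathlib import Path
from typing import List


def unfetched_segments(segments_dir: Path) -> List[str]:
    """Names of segments that were generated but not fetched yet.

    Assumption: a generated segment contains `crawl_generate`, and
    `crawl_fetch` only appears once the fetch step has run on it.
    """
    return sorted(
        p.name for p in segments_dir.iterdir()
        if (p / "crawl_generate").is_dir() and not (p / "crawl_fetch").exists()
    )


def as_property(key: str, values: List[str]) -> str:
    """Format the list as one Java-properties style line: key=a,b,c."""
    return f"{key}={','.join(values)}"
```

A wrapper script would then do something like `print(as_property("segments", unfetched_segments(Path("crawl/segments"))))`, where the path is again only an example.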