Re: JEXL and Exchanges

2019-03-05 Thread Dave Beckstrom
Hi Sebastian,

Thank you sir!

Two things you provided solved the problem for me!  One was the correct
syntax for the regex; the other was the info on the indexchecker
command.  Part of what I was dealing with was not having much to go on
when debugging, and that command helped a lot!

In addition, the following line gave me an important clue:

-Dplugin.includes='exchange-jexl|protocol-okhttp|parse-html|indexer-solr|index-(basic|more)'

I realized that I did not have exchange-jexl listed as a plug-in to
include in my nutch-site.xml config file.  I'd never have figured
that out without the clue you provided.
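
For anyone hitting the same thing, here is a minimal Java sketch of how that check can be reproduced by hand. It assumes that plugin.includes is treated as a regular expression that must match the whole plugin id; the class name, the method name, and the "withoutExchange" value are hypothetical, and the code uses plain java.util.regex rather than any Nutch class.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // Assumption: plugin.includes is a regex that must match the entire plugin id.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Value taken from the command line quoted in this thread.
        String fromThread =
            "exchange-jexl|protocol-okhttp|parse-html|indexer-solr|index-(basic|more)";
        // Hypothetical value with exchange-jexl left out, for contrast.
        String withoutExchange =
            "protocol-okhttp|parse-html|indexer-solr|index-(basic|more)";

        System.out.println(isIncluded(fromThread, "exchange-jexl"));      // true
        System.out.println(isIncluded(withoutExchange, "exchange-jexl")); // false
        System.out.println(isIncluded(fromThread, "index-basic"));        // true
    }
}
```

If the plugin id does not pass a check like this against your own plugin.includes value, the exchange would presumably never be loaded, no matter how the expression in exchanges.xml is written.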

The exchanges are working, content is going into the right collections,
life is good!

Thank you again!

Best,

Dave Beckstrom
*Fig Leaf Software*  | "We've Got You Covered"
*Service-Disabled Veteran-Owned Small Business (SDVOSB)*
763-323-3499
dbeckst...@figleaf.com


On Tue, Mar 5, 2019 at 12:44 PM Sebastian Nagel wrote:

> Hi Dave,
>
> I'm by no means an expert in the JEXL syntax (cf.
> http://commons.apache.org/proper/commons-jexl/reference/syntax.html)
> but after a few trials the expression must be
>
>  doc.getFieldValue('url')=~'.*/englishnews/.*'
>
> It's easy to test using the indexchecker, e.g.
>  % bin/nutch indexchecker \
>      -Dplugin.includes='exchange-jexl|protocol-okhttp|parse-html|indexer-solr|index-(basic|more)' \
>      -DdoIndex=true   http://...
>
> If you want to improve the Wiki page
>    https://wiki.apache.org/nutch/Exchanges
> we're happy to grant you write access to the wiki, see
>    https://wiki.apache.org/nutch/
>
> Best,
> Sebastian
>
>
> On 3/5/19 4:06 PM, Dave Beckstrom wrote:
> > Ryan and Roannel,
> >
> > Thank you guys so much for your replies.  I didn't realize it but I was not
> > seeing all of the emails from you.
> >
> > Roannel you sent some really helpful replies that never came in as an
> > email.  I found your replies when I browsed the web-based archives on the
> > apache site.   I wanted to make sure I thanked you for your help!!!
> >
> > I can't find one example of an exchanges.xml other than what ships with
> > Nutch.   I'm really in the blind trying to get the exchanges to work.  I
> > believe this may be the last item I need help with and then I'll have Nutch
> > working the way I need it to.  Any help you can offer would be GREATLY
> > appreciated.
> >
> > Let's say I have a document that was crawled and the URL for the document
> > was as follows:
> >
> > http://www.somedomain.com/news/englishnews/2018/this-is-my-news-article.cfm
> >
> > Here is the expression I have coded in exchanges.xml:
> >
> > 
> >
> > That expression is not triggering.  As near as I can tell the "=~" is the
> > "contains" expression.  The idea being if the url contains "englishnews"
> > then this expression should trigger.  I believe the slashes around
> > "englishnews" make it function as a regular expression, which should
> > evaluate to true, rather than a string compare.
> >
> > If anyone can help get me past this final road block I would greatly
> > appreciate the help!  I spent an entire day on this yesterday and got
> > nowhere.
> >
> > Thank you!
> >
> > Dave
> >
>
>

-- 
*Fig Leaf Software, Inc.* 
https://www.figleaf.com/ 
  

Full-Service Solutions Integrator


Re: JEXL and Exchanges

2019-03-05 Thread Sebastian Nagel
Hi Dave,

I'm by no means an expert in the JEXL syntax (cf.
http://commons.apache.org/proper/commons-jexl/reference/syntax.html)
but after a few trials the expression must be

 doc.getFieldValue('url')=~'.*/englishnews/.*'
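
A minimal Java sketch of why the leading and trailing ".*" matter, assuming that JEXL's =~ on strings behaves like Java's String.matches (a match against the whole string) rather than a substring search. The class and method names are hypothetical, and the code uses plain java.util.regex, not the JEXL library itself; the URL is the sample one from Dave's message.

```java
import java.util.regex.Pattern;

public class RegexMatchDemo {
    // Whole-string match: the behaviour assumed here for JEXL's =~ operator.
    static boolean fullMatch(String input, String regex) {
        return input.matches(regex);
    }

    // Substring search ("contains"), shown only for contrast.
    static boolean contains(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String url =
            "http://www.somedomain.com/news/englishnews/2018/this-is-my-news-article.cfm";

        System.out.println(fullMatch(url, "/englishnews/"));     // false
        System.out.println(fullMatch(url, ".*/englishnews/.*")); // true
        System.out.println(contains(url, "/englishnews/"));      // true
    }
}
```

Under that assumption, a bare pattern only triggers when it describes the entire URL, which would explain why wrapping it in ".*" on both sides makes the exchange fire.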

It's easy to test using the indexchecker, e.g.
 % bin/nutch indexchecker \
     -Dplugin.includes='exchange-jexl|protocol-okhttp|parse-html|indexer-solr|index-(basic|more)' \
     -DdoIndex=true   http://...

If you want to improve the Wiki page
   https://wiki.apache.org/nutch/Exchanges
we're happy to grant you write access to the wiki, see
   https://wiki.apache.org/nutch/

Best,
Sebastian


On 3/5/19 4:06 PM, Dave Beckstrom wrote:
> Ryan and Roannel,
> 
> Thank you guys so much for your replies.  I didn't realize it but I was not
> seeing all of the emails from you.
> 
> Roannel you sent some really helpful replies that never came in as an
> email.  I found your replies when I browsed the web-based archives on the
> apache site.   I wanted to make sure I thanked you for your help!!!
> 
> I can't find one example of an exchanges.xml other than what ships with
> Nutch.   I'm really in the blind trying to get the exchanges to work.  I
> believe this may be the last item I need help with and then I'll have Nutch
> working the way I need it to.  Any help you can offer would be GREATLY
> appreciated.
> 
> Let's say I have a document that was crawled and the URL for the document
> was as follows:
> 
> http://www.somedomain.com/news/englishnews/2018/this-is-my-news-article.cfm
> 
> Here is the expression I have coded in exchanges.xml:
> 
> 
> 
> That expression is not triggering.  As near as I can tell the "=~" is the
> "contains" expression.  The idea being if the url contains "englishnews"
> then this expression should trigger.  I believe the slashes around
> "englishnews" make it function as a regular expression, which should
> evaluate to true, rather than a string compare.
> 
> If anyone can help get me past this final road block I would greatly
> appreciate the help!  I spent an entire day on this yesterday and got
> nowhere.
> 
> Thank you!
> 
> Dave
> 



JEXL and Exchanges

2019-03-05 Thread Dave Beckstrom
Ryan and Roannel,

Thank you guys so much for your replies.  I didn't realize it but I was not
seeing all of the emails from you.

Roannel you sent some really helpful replies that never came in as an
email.  I found your replies when I browsed the web-based archives on the
apache site.   I wanted to make sure I thanked you for your help!!!

I can't find one example of an exchanges.xml other than what ships with
Nutch.   I'm really in the blind trying to get the exchanges to work.  I
believe this may be the last item I need help with and then I'll have Nutch
working the way I need it to.  Any help you can offer would be GREATLY
appreciated.

Let's say I have a document that was crawled and the URL for the document
was as follows:

http://www.somedomain.com/news/englishnews/2018/this-is-my-news-article.cfm

Here is the expression I have coded in exchanges.xml:



That expression is not triggering.  As near as I can tell the "=~" is the
"contains" expression.  The idea being if the url contains "englishnews"
then this expression should trigger.  I believe the slashes around
"englishnews" make it function as a regular expression, which should
evaluate to true, rather than a string compare.

If anyone can help get me past this final road block I would greatly
appreciate the help!  I spent an entire day on this yesterday and got
nowhere.

Thank you!

Dave
