Re: [DIH] URLDataSource and fetching a link

2009-10-20 Thread Grant Ingersoll

Finally getting back to this...

On Sep 17, 2009, at 12:28 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

> 2009/9/17 Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com>:
>> it is possible to have a sub entity which has XPathEntityProcessor
>> which can use the link as the url
>
> This may not be a good solution.
>
> But you can use the $hasMore and $nextUrl options of
> XPathEntityProcessor to recursively loop if there are more links

Is there an example of this somewhere?  The DIH Wiki refers to it, but  
I don't see an example of it.
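
To make the looping idea behind $hasMore and $nextUrl concrete, here is a small Python illustration. It is not DIH code and does not claim to mirror how XPathEntityProcessor is implemented; it only shows the general pattern of "keep fetching while the last response says there is more, using the URL it hands back". The names `fetch_page`, `iter_rows`, the `(rows, has_more, next_url)` tuple and the example.com URLs are all made up for the sketch.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical page shape: (rows, has_more, next_url).
# These names are illustrative only and are not part of DIH.
Page = Tuple[List[Dict[str, str]], bool, Optional[str]]


def iter_rows(start_url: str,
              fetch_page: Callable[[str], Page]) -> Iterator[Dict[str, str]]:
    """Yield rows from start_url, following next_url while has_more is set."""
    url: Optional[str] = start_url
    seen = set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against a feed that links back to itself
        rows, has_more, next_url = fetch_page(url)
        yield from rows
        # Stop as soon as a page says there is nothing more to fetch.
        url = next_url if has_more else None


# Stand-in for real HTTP calls so the sketch runs offline.
FAKE_PAGES: Dict[str, Page] = {
    "http://example.com/feed?page=1": (
        [{"id": "1"}, {"id": "2"}], True, "http://example.com/feed?page=2"),
    "http://example.com/feed?page=2": (
        [{"id": "3"}], False, None),
}


def fake_fetch(url: str) -> Page:
    return FAKE_PAGES[url]


all_ids = [row["id"]
           for row in iter_rows("http://example.com/feed?page=1", fake_fetch)]
```

Note that this pattern pages through one source; it does not by itself fetch the article behind each item link.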


I have:
<entity name="nytSportsFeed"
        pk="link"
        url="http://feeds1.nytimes.com/nyt/rss/Sports"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        dataSource="rss"
        transformer="RegexTransformer,DateFormatTransformer">
  <field column="source" xpath="/rss/channel/title" commonField="true" />
  <field column="source-link" xpath="/rss/channel/link" commonField="true" />
  <field column="title" xpath="/rss/channel/item/title" />
  <field column="id" xpath="/rss/channel/item/guid" />
  <field column="link" xpath="/rss/channel/item/link" />

  <!-- Use the RegexTransformer to strip out ads -->
  <field column="description" xpath="/rss/channel/item/description"
         regex="&lt;a.*?&lt;/a&gt;" replaceWith="" />
  <field column="category" xpath="/rss/channel/item/category" />

  <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
  <field column="pubDate" xpath="/rss/channel/item/pubDate"
         dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
</entity>

And I want to take the value from the link column and go get the  
contents of that link and index them into a body field.


I'm not sure how to link in the sub-entity.
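
The two-step flow being asked for (read the feed, then fetch each item's link and store the result in a body field) can be sketched in plain Python as below. This is only an illustration of the data flow, not DIH configuration. The sample feed, the `build_docs` helper and the `fake_fetch` stand-in are assumptions made for the sketch; a real fetcher (for example one built on `urllib.request.urlopen`) would replace `fake_fetch`.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Made-up RSS 2.0 sample so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Sports</title>
    <link>http://example.com/sports</link>
    <item>
      <title>First story</title>
      <link>http://example.com/a</link>
      <guid>a</guid>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/b</link>
      <guid>b</guid>
    </item>
  </channel>
</rss>"""


def build_docs(rss_xml: str, fetch: Callable[[str], str]) -> List[Dict[str, str]]:
    """Turn each RSS item into a document whose body comes from its link."""
    channel = ET.fromstring(rss_xml).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")
    source = channel.findtext("title", default="")  # shared by every item
    docs = []
    for item in channel.findall("item"):
        link = item.findtext("link", default="")
        docs.append({
            "source": source,
            "id": item.findtext("guid", default=""),
            "title": item.findtext("title", default=""),
            "link": link,
            "body": fetch(link),  # second step: follow the item's link
        })
    return docs


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for an HTTP GET.
    return "content of " + url


docs = build_docs(SAMPLE_RSS, fake_fetch)
```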

Thanks,
Grant




>> On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>>> Many RSS feeds contain a link to some full article.  How can I have the
>>> DIH get the RSS feed and then have it go and fetch the content at the link?
>>>
>>> Thanks,
>>> Grant






Re: [DIH] URLDataSource and fetching a link

2009-10-20 Thread Noble Paul നോബിള്‍ नोब्ळ्

<entity name="nytSportsFeed"
        pk="link"
        url="http://feeds1.nytimes.com/nyt/rss/Sports"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        dataSource="rss"
        transformer="RegexTransformer,DateFormatTransformer">
  <field column="source" xpath="/rss/channel/title" commonField="true" />
  <field column="source-link" xpath="/rss/channel/link" commonField="true" />
  <field column="title" xpath="/rss/channel/item/title" />
  <field column="id" xpath="/rss/channel/item/guid" />
  <field column="link" xpath="/rss/channel/item/link" />
  <!-- Use the RegexTransformer to strip out ads -->
  <field column="description" xpath="/rss/channel/item/description"
         regex="&lt;a.*?&lt;/a&gt;" replaceWith="" />
  <field column="category" xpath="/rss/channel/item/category" />
  <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
  <field column="pubDate" xpath="/rss/channel/item/pubDate"
         dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />

  <entity name="x"
          url="${nytSportsFeed.link}"
          processor="PlainTextEntityProcessor"
          dataSource="rss"
          transformer="HTMLStripTransformer">
    <field column="plainText" name="body" stripHTML="true" />
  </entity>
</entity>
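
For anyone unsure what stripping the fetched page down to a plain-text body amounts to, here is a rough Python sketch using only the standard library. It is an approximation for illustration and makes no claim about how HTMLStripTransformer behaves internally; `TextExtractor` and `strip_html` are names invented for the sketch, and skipping script and style content is a choice made here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent blocks apart; the trade-off is that
    # a word split by inline markup (wel<b>come</b>) gains a space.
    return " ".join(" ".join(parser.parts).split())


page = ("<html><head><style>p{}</style></head><body><h1>Title</h1>"
        "<p>Hello &amp; <b>welcome</b></p><script>var x=1;</script></body></html>")
body = strip_html(page)
```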





-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: [DIH] URLDataSource and fetching a link

2009-09-17 Thread Grant Ingersoll


On Sep 16, 2009, at 9:13 PM, Walter Underwood wrote:

> I would use the RSS feed (hopefully in Atom format) as a source of
> links, then use a regular web spider to fetch the content.
>
> I seriously doubt that DIH is up to the task of general fetching
> from the Wild Wild Web. That is a dirty and difficult job and DIH is
> designed for cooperating data stores.




This is just for a quick demo thing, not production.
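
For the spider route suggested in the quoted message, the first half (using the feed purely as a source of links) might look like the Python sketch below. It assumes an Atom feed; the sample document and the `article_links` helper are invented for illustration, and the crawling itself is left to whatever spider is used.

```python
import xml.etree.ElementTree as ET
from typing import List

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up Atom sample so the sketch runs without network access.
SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>First</title>
    <link rel="alternate" href="http://example.com/a"/>
    <link rel="enclosure" href="http://example.com/a.mp3"/>
  </entry>
  <entry>
    <title>Second</title>
    <link href="http://example.com/b"/>
  </entry>
</feed>"""


def article_links(atom_xml: str) -> List[str]:
    """Return the article URL(s) of every entry, in feed order."""
    root = ET.fromstring(atom_xml)
    links = []
    for entry in root.findall("atom:entry", ATOM_NS):
        for link in entry.findall("atom:link", ATOM_NS):
            # In Atom, a link without a rel attribute counts as "alternate".
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                links.append(link.get("href"))
    return links


to_crawl = article_links(SAMPLE_ATOM)
```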


Re: [DIH] URLDataSource and fetching a link

2009-09-16 Thread Noble Paul നോബിള്‍ नोब्ळ्

2009/9/17 Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com>:
> it is possible to have a sub entity which has XPathEntityProcessor
> which can use the link as the url

This may not be a good solution.

But you can use the $hasMore and $nextUrl options of
XPathEntityProcessor to recursively loop if there are more links

> On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>> Many RSS feeds contain a link to some full article.  How can I have the
>> DIH get the RSS feed and then have it go and fetch the content at the link?
>>
>> Thanks,
>> Grant







