Re: [DIH] URLDataSource and fetching a link
Finally getting back to this...

On Sep 17, 2009, at 12:28 AM, Noble Paul നോബിള് नोब्ळ् wrote:

> 2009/9/17 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com:
>> it is possible to have a sub entity which has XPathEntityProcessor
>> which can use the link as the url
>
> This may not be a good solution.
>
> But you can use the $hasMore and $nextUrl options of
> XPathEntityProcessor to recursively loop if there are more links

Is there an example of this somewhere? The DIH Wiki refers to it, but I don't see an example of it.

I have:

<entity name="nytSportsFeed"
        pk="link"
        url="http://feeds1.nytimes.com/nyt/rss/Sports"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        dataSource="rss"
        transformer="RegexTransformer,DateFormatTransformer">
  <field column="source" xpath="/rss/channel/title" commonField="true" />
  <field column="source-link" xpath="/rss/channel/link" commonField="true" />
  <field column="title" xpath="/rss/channel/item/title" />
  <field column="id" xpath="/rss/channel/item/guid" />
  <field column="link" xpath="/rss/channel/item/link" />
  <!-- Use the RegexTransformer to strip out ads -->
  <field column="description" xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a&gt;" replaceWith="" />
  <field column="category" xpath="/rss/channel/item/category" />
  <!-- 'Sun, 18 May 2008 11:23:11 +' -->
  <field column="pubDate" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM HH:mm:ss Z" />
</entity>

And I want to take the value from the link column, fetch the contents of that link, and index them into a body field. I'm not sure how to link in the sub-entity.

Thanks,
Grant

> On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll gsing...@apache.org wrote:
>> Many RSS feeds contain a link to some full article. How can I have the DIH get the RSS feed and then have it go and fetch the content at the link?
>>
>> Thanks,
>> Grant
Re: [DIH] URLDataSource and fetching a link
<entity name="nytSportsFeed"
        pk="link"
        url="http://feeds1.nytimes.com/nyt/rss/Sports"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        dataSource="rss"
        transformer="RegexTransformer,DateFormatTransformer">
  <field column="source" xpath="/rss/channel/title" commonField="true" />
  <field column="source-link" xpath="/rss/channel/link" commonField="true" />
  <field column="title" xpath="/rss/channel/item/title" />
  <field column="id" xpath="/rss/channel/item/guid" />
  <field column="link" xpath="/rss/channel/item/link" />
  <!-- Use the RegexTransformer to strip out ads -->
  <field column="description" xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a&gt;" replaceWith="" />
  <field column="category" xpath="/rss/channel/item/category" />
  <!-- 'Sun, 18 May 2008 11:23:11 +' -->
  <field column="pubDate" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM HH:mm:ss Z" />
  <entity name="x"
          url="${nytSportsFeed.link}"
          processor="PlainTextEntityProcessor"
          dataSource="rss"
          transformer="HTMLStripTransformer">
    <field column="plainText" name="body" stripHTML="true" />
  </entity>
</entity>

On Tue, Oct 20, 2009 at 6:13 PM, Grant Ingersoll gsing...@apache.org wrote:

> And I want to take the value from the link column and go get the contents of that link and index them into a body field. I'm not sure how to link in the sub-entity.
>
> Thanks,
> Grant

--
Noble Paul | Principal Engineer | AOL | http://aol.com
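The nested entity in the reply above refers to a data source named "rss" that the thread never shows. A minimal sketch of the surrounding data-config.xml, assuming that "rss" is a URLDataSource (as the subject line suggests) and that the Solr schema defines a body field. The encoding and timeout values and the onError="skip" setting are illustrative assumptions, not part of the original reply, and most of the feed-level fields are omitted for brevity.

```xml
<dataConfig>
  <!-- Assumption: one URLDataSource shared by the feed and the article fetch.
       Encoding and timeout values are placeholders. -->
  <dataSource name="rss"
              type="URLDataSource"
              encoding="UTF-8"
              connectionTimeout="5000"
              readTimeout="10000" />
  <document>
    <!-- Outer entity: one row per RSS item -->
    <entity name="nytSportsFeed"
            pk="link"
            url="http://feeds1.nytimes.com/nyt/rss/Sports"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            dataSource="rss">
      <field column="title" xpath="/rss/channel/item/title" />
      <field column="id" xpath="/rss/channel/item/guid" />
      <field column="link" xpath="/rss/channel/item/link" />

      <!-- Inner entity: runs once per outer row and fetches the URL held in
           that row's link column. onError="skip" is an assumption so that an
           unreachable article does not abort the whole import. -->
      <entity name="x"
              url="${nytSportsFeed.link}"
              processor="PlainTextEntityProcessor"
              dataSource="rss"
              transformer="HTMLStripTransformer"
              onError="skip">
        <!-- PlainTextEntityProcessor returns the whole response in the
             plainText column; it is stripped of HTML and stored as body. -->
        <field column="plainText" name="body" stripHTML="true" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

The sub-entity is evaluated for every row the outer entity emits, so `${nytSportsFeed.link}` resolves to a different article URL each time, and the resulting body value is merged into the same Solr document as the feed-level fields.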
Re: [DIH] URLDataSource and fetching a link
On Sep 16, 2009, at 9:13 PM, Walter Underwood wrote:

> I would use the RSS feed (hopefully in Atom format) as a source of links, then use a regular web spider to fetch the content.
>
> I seriously doubt that DIH is up to the task of general fetching from the Wild Wild Web. That is a dirty and difficult job and DIH is designed for cooperating data stores.

This is just for a quick demo thing, not production.
Re: [DIH] URLDataSource and fetching a link
2009/9/17 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com:

> it is possible to have a sub entity which has XPathEntityProcessor
> which can use the link as the url

This may not be a good solution.

But you can use the $hasMore and $nextUrl options of XPathEntityProcessor to recursively loop if there are more links

> On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll gsing...@apache.org wrote:
>> Many RSS feeds contain a link to some full article. How can I have the DIH get the RSS feed and then have it go and fetch the content at the link?
>>
>> Thanks,
>> Grant

--
Noble Paul | Principal Engineer | AOL | http://aol.com
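The thread never shows the $hasMore and $nextUrl approach mentioned above. What follows is only a rough sketch of how it is commonly wired up, under several assumptions that do not come from the thread: a hypothetical paginated feed at http://example.com/feed?page=1, a `<next>` element near the top of each page holding the next page's URL, and the entity, column, and function names (pagedFeed, nextPage). A ScriptTransformer function copies that URL into the two special columns, which XPathEntityProcessor reads once the current page runs out of rows. ScriptTransformer needs a JVM with a JavaScript engine (Java 6 or later).

```xml
<dataConfig>
  <dataSource name="feed" type="URLDataSource" />
  <script><![CDATA[
    // Hypothetical helper: copy the page's <next> URL into the special
    // $hasMore / $nextUrl columns that XPathEntityProcessor checks.
    function nextPage(row) {
      var next = row.get('nextPage');
      if (next != null) {
        row.put('$hasMore', 'true');
        row.put('$nextUrl', next);
      } else {
        // Last page: mark it explicitly so the processor stops fetching.
        row.put('$hasMore', 'false');
      }
      return row;
    }
  ]]></script>
  <document>
    <!-- Assumed feed layout: <feed><next>URL</next><entry>...</entry>...</feed> -->
    <entity name="pagedFeed"
            dataSource="feed"
            url="http://example.com/feed?page=1"
            processor="XPathEntityProcessor"
            forEach="/feed/entry"
            transformer="script:nextPage">
      <field column="id" xpath="/feed/entry/id" />
      <field column="title" xpath="/feed/entry/title" />
      <!-- commonField copies the page-level value onto every entry row;
           this assumes <next> appears before the entries in the document. -->
      <field column="nextPage" xpath="/feed/next" commonField="true" />
    </entity>
  </document>
</dataConfig>
```

This pattern follows "next page" links within a single entity. It does not fetch a separate document for each row, which is what the original question asks for.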