Thanks Lewis.

On Sat, Jun 9, 2012 at 1:34 PM, Lewis John Mcgibbney <[email protected]> wrote:
> Hi Shameema,
>
> I think this depends directly on what tags/elements are within the
> feed(s). From the feeds I looked at yesterday the relevant tags
> appeared to be missing. I was surprised that Tika didn't pick up more,
> so I think I'll head over and see exactly what the Tika 1.1 source
> looks like for the rss+xml parser.
>
> In the meantime the feed plugin packaged with Nutch WILL parse and
> index these additional fields if they are present, but will not if
> they are absent.
>
> Lewis
>
> On Fri, Jun 8, 2012 at 6:32 PM, Shameema Umer <[email protected]> wrote:
> > Hi Lewis, things are clear now. I am upset that I cannot find a way to
> > find the age of a web page with Nutch. I thought publishedDate from the
> > feed plugin would help. If I change the field name from publishedDate to
> > *pubDate*, will this help?
> >
> > Thanks
> > Shameema
> >
> > On Fri, Jun 8, 2012 at 6:48 PM, Lewis John Mcgibbney
> > <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> No. This should not be necessary. The feed parser and accompanying
> >> indexing filter should extract and send (to be indexed) the following
> >> metadata items:
> >> Author, Tags, Published date, Updated date and Feed.
> >>
> >> There is a problem though...
> >>
> >> With many feeds, including the bbci one you provided in another
> >> thread, many of these fields are absent, so the parser and indexing
> >> filter cannot operate on our behalf and subsequently leave these
> >> fields out.
> >>
> >> It is also important to note that in parse-plugins.xml we first try to
> >> parse the application/rss+xml mimetype with parse-tika before feed...
> >> I can only assume this is because parse-tika produces slightly better
> >> results for this mimetype.
> >> Let me explain.
> >>
> >> With the language identifier included and parse-plugins overridden to
> >> parse rss+xml solely with the feed plugin I get
> >>
> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
> >> contentType: application/rss+xml
> >> content :
> >> host : feeds.feedburner.com
> >> tstamp : Fri Jun 08 14:04:04 BST 2012
> >> lang : unknown
> >> url : http://feeds.feedburner.com/gov/GCC?format=xml
> >>
> >> however with parse-tika initiated and the same fetch I get
> >>
> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
> >> contentType: application/rss+xml
> >> content : Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
> >> title : Glasgow City Council - News Feed
> >> host : feeds.feedburner.com
> >> tstamp : Fri Jun 08 14:04:25 BST 2012
> >> lang : en
> >> url : http://feeds.feedburner.com/gov/GCC?format=xml
> >>
> >> Please note that this feed does not include info like publishedDate,
> >> updatedDate etc., instead offering other means of expressing (some of)
> >> this information. In the above case, as the parse data is not present
> >> for the required feed fields, or for argument's sake parse-tika, these
> >> fields are not included in our subsequent index fields.
> >>
> >> I hope this clears things up a bit.
> >>
> >> On a side note, some things to pick up from the above excerpts from
> >> some tests:
> >> 1) The feed plugin fails to recognize the content, title and lang
> >> fields, where parse-tika does this successfully.
> >> 2) Even though parse-tika DOES utilise the language-identifier to
> >> recognize the lang field and provide a value, it fails to include the
> >> full value, which should be lang="en-GB" as opposed to lang="en".
> >>
> >> Can anyone chime in on what the current state of affairs is with
> >> delegation of language detection to parse-tika, or whether this is
> >> already the case but needs to be patched to accommodate the scenario I
> >> provide above?
> >>
> >> Thanks
> >>
> >> Lewis
> >>
> >> On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]> wrote:
> >> > Hi Lewis,
> >> >
> >> > My solrindex-mapping contains
> >> > <mapping>
> >> >   <!-- Simple mapping of fields created by Nutch IndexingFilters
> >> >        to fields defined (and expected) in Solr schema.xml.
> >> >
> >> >        Any fields in NutchDocument that match a name defined
> >> >        in field/@source will be renamed to the corresponding
> >> >        field/@dest.
> >> >        Additionally, if a field name (before mapping) matches
> >> >        a copyField/@source then its values will be copied to
> >> >        the corresponding copyField/@dest.
> >> >
> >> >        uniqueKey has the same meaning as in Solr schema.xml
> >> >        and defaults to "id" if not defined.
> >> >   -->
> >> >   <fields>
> >> >     <field dest="content" source="content"/>
> >> >     <field dest="site" source="site"/>
> >> >     <field dest="title" source="title"/>
> >> >     <field dest="host" source="host"/>
> >> >     <field dest="segment" source="segment"/>
> >> >     <field dest="boost" source="boost"/>
> >> >     <field dest="digest" source="digest"/>
> >> >     <field dest="tstamp" source="tstamp"/>
> >> >     <field dest="publishedDate" source="publishedDate"/>
> >> >     <field dest="id" source="url"/>
> >> >     <copyField source="url" dest="url"/>
> >> >   </fields>
> >> >   <uniqueKey>id</uniqueKey>
> >> > </mapping>
> >> >
> >> > Do I need to edit any source code of the feed plugin to make this
> >> > publishedDate available?
> >> >
> >> > Thanks
> >> > Shameema
> >> >
> >> > On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney
> >> > <[email protected]> wrote:
> >> >> The best way to test this is by doing ad-hoc parsechecker fetches.
> >> >> Also try including this value in your solr-mapping file.
> >> >>
> >> >> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]> wrote:
> >> >>> In my schema there are certain fields used for the feed plugin.
> >> >>>
> >> >>> <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
> >> >>> <field name="author" type="string" stored="true" indexed="true"/>
> >> >>> <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
> >> >>> <field name="feed" type="string" stored="true" indexed="true"/>
> >> >>> <field name="publishedDate" type="date" stored="true" indexed="true"/>
> >> >>> <field name="updatedDate" type="date" stored="true" indexed="true"/>
> >> >>>
> >> >>> I have included the feed plugin in nutch-site.xml. The feed file is
> >> >>> fetched and parsed, and the links in it are working properly. But I
> >> >>> cannot get the publishedDate working.
> >> >>> I cannot retrieve the publishedDate or sort by it.
> >> >>>
> >> >>> Please help.
> >> >>
> >> >> --
> >> >> Lewis
> >>
> >> --
> >> Lewis
>
> --
> Lewis
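
Much of the thread above comes down to whether a given feed actually carries date elements. One quick check, independent of Nutch, is to list the date-like elements present in the raw feed XML. The following is only a minimal sketch under stated assumptions: the element names it looks for (pubDate, lastBuildDate, Dublin Core date, Atom published/updated) are common RSS/Atom conventions chosen for illustration, not the list that the Nutch feed plugin or Tika consults, and both `count_date_elements` and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET

# Local element names that commonly carry dates in RSS 2.0, Dublin Core and Atom.
# Assumption: this set is illustrative; it is NOT what Nutch or Tika look for.
DATE_TAGS = {"pubDate", "lastBuildDate", "date", "published", "updated"}


def local_name(tag):
    """Strip a leading '{namespace}' part from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def count_date_elements(feed_xml):
    """Return a dict mapping each date-like element name to how often it occurs."""
    counts = {}
    for element in ET.fromstring(feed_xml).iter():
        name = local_name(element.tag)
        if name in DATE_TAGS:
            counts[name] = counts.get(name, 0) + 1
    return counts


# Made-up sample feed: the first item has a pubDate, the second has none.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First</title>
      <pubDate>Fri, 08 Jun 2012 14:04:04 GMT</pubDate>
    </item>
    <item>
      <title>Second</title>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    print(count_date_elements(SAMPLE_FEED))
```

If a check like this returns an empty dict for a real feed, that would be consistent with the explanation given in the thread: the date fields are absent from the feed itself, so there is nothing for the parser to pass on to the index.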

