Please explore why the pubDate tag is not parsed and indexed.

Thanks
Shameema

On Thu, Jun 14, 2012 at 6:11 PM, Lewis John Mcgibbney <[email protected]> wrote:
> Depending on what the tag looks like it will be interpreted
> accordingly by the feed parser.
> My instincts are that there is a difference between pubDate and
> publishedDate being parsed and identified by the parser, however then
> the question arises as to how/why the field is not identified as a
> tag.
>
> I will try to do more digging. It might be worth looking at the feed
> source as well.
>
> Best
> Lewis
>
> On Thu, Jun 14, 2012 at 7:04 AM, Shameema Umer <[email protected]> wrote:
> > Hi Lewis,
> >
> > The feed you provided http://feeds.feedburner.com/gov/GCC?format=xml has
> > the pubDate tag.
> > Then why is it not parsed? Please explain.
> >
> > What I need is the value of the pubDate
> > pulled into any of our date fields.
> >
> > Thanks
> > Shameema
> >
> > On Wed, Jun 13, 2012 at 6:28 PM, Shameema Umer <[email protected]> wrote:
> >
> >> I tried parsechecker to verify that no value is retrieved for
> >> publishedDate.
> >>
> >> On Wed, Jun 13, 2012 at 6:22 PM, Shameema Umer <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> I have been trying for days to find a way to retrieve the <pubDate>
> >>> value of a feed. Even though the value is there in the feed, Nutch is
> >>> not parsing it and sending it along with the outlinks.
> >>>
> >>> The feed plugin is included, but it is not populating a value in the
> >>> field publishedDate. Somebody please give me hints where I went wrong.
> >>>
> >>> Or please let me know if it is not possible.
> >>>
> >>> Thanks
> >>> Shameema
> >>>
> >>> On Sat, Jun 9, 2012 at 4:13 PM, Shameema Umer <[email protected]> wrote:
> >>>
> >>>> Thanks Lewis.
> >>>>
> >>>> On Sat, Jun 9, 2012 at 1:34 PM, Lewis John Mcgibbney
> >>>> <[email protected]> wrote:
> >>>>
> >>>>> Hi Shameema,
> >>>>>
> >>>>> I think this depends directly on what tags/elements are within the
> >>>>> feed(s).
> >>>>> From the feeds I looked at yesterday the relevant tags
> >>>>> appeared to be missing. I was surprised that Tika didn't pick up more
> >>>>> so I think I'll head over and see exactly what the Tika 1.1 source
> >>>>> looks like for the rss+xml parser.
> >>>>>
> >>>>> In the meantime the feed plugin packaged with Nutch WILL parse and
> >>>>> index these additional fields if they are present, but will not if
> >>>>> they are absent.
> >>>>>
> >>>>> Lewis
> >>>>>
> >>>>> On Fri, Jun 8, 2012 at 6:32 PM, Shameema Umer <[email protected]> wrote:
> >>>>> > Hi Lewis, things are clear now. I am upset that I cannot find a
> >>>>> > way to determine the age of a web page with Nutch. I thought
> >>>>> > publishedDate from the feed plugin would help. If I change the
> >>>>> > field name from publishedDate to *pubDate*, will this help?
> >>>>> >
> >>>>> > Thanks
> >>>>> > Shameema
> >>>>> >
> >>>>> > On Fri, Jun 8, 2012 at 6:48 PM, Lewis John Mcgibbney
> >>>>> > <[email protected]> wrote:
> >>>>> >
> >>>>> >> Hi,
> >>>>> >>
> >>>>> >> No, this should not be necessary. The feed parser and accompanying
> >>>>> >> indexing filter should extract and send (to be indexed) the
> >>>>> >> following metadata items:
> >>>>> >> Author, Tags, Published date, Updated date and feed.
> >>>>> >>
> >>>>> >> There is a problem though...
> >>>>> >>
> >>>>> >> With many feeds, including the bbci one you provided in another
> >>>>> >> thread, many of these fields are absent, so the parser and indexing
> >>>>> >> filter cannot operate on our behalf and subsequently leave these
> >>>>> >> fields out.
> >>>>> >>
> >>>>> >> It is also important to note that in parse-plugins.xml we first try
> >>>>> >> to parse the application/rss+xml mimetype with parse-tika before
> >>>>> >> feed... I can only assume this is because parse-tika produces
> >>>>> >> slightly better results for this mimetype.
> >>>>> >> Let me explain.
> >>>>> >>
> >>>>> >> With language identifier included and parse-plugins overridden to
> >>>>> >> parse rss+xml solely with the feed plugin I get
> >>>>> >>
> >>>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> contentType: application/rss+xml
> >>>>> >> content :
> >>>>> >> host : feeds.feedburner.com
> >>>>> >> tstamp : Fri Jun 08 14:04:04 BST 2012
> >>>>> >> lang : unknown
> >>>>> >> url : http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >>
> >>>>> >> however with parse-tika initiated and the same fetch I get
> >>>>> >>
> >>>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> contentType: application/rss+xml
> >>>>> >> content : Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
> >>>>> >> title : Glasgow City Council - News Feed
> >>>>> >> host : feeds.feedburner.com
> >>>>> >> tstamp : Fri Jun 08 14:04:25 BST 2012
> >>>>> >> lang : en
> >>>>> >> url : http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >>
> >>>>> >> Please note that this feed does not include info like publishedDate,
> >>>>> >> updatedDate etc., instead offering other means of expressing (some)
> >>>>> >> of this information. In the above case, as the parse data is not
> >>>>> >> present for the required feed fields, or for argument's sake
> >>>>> >> parse-tika, these fields are not included in our subsequent index
> >>>>> >> fields.
> >>>>> >>
> >>>>> >> I hope this clears things up a bit.
> >>>>> >>
> >>>>> >> On a sidenote, some things to pick up from the above excerpts from
> >>>>> >> some tests:
> >>>>> >> 1) The feed plugin fails to recognize the content, title and lang
> >>>>> >> fields where parse-tika does this successfully.
> >>>>> >> 2) Even though parse-tika DOES utilise the language-identifier to
> >>>>> >> recognize the lang field and provide a value, it fails to include
> >>>>> >> the full value, which should be lang="en-GB" as opposed to lang="en".
> >>>>> >>
> >>>>> >> Can anyone chime in on what the current state of affairs is with
> >>>>> >> delegation of language detection to parse-tika, or whether this is
> >>>>> >> already the case but needs to be patched to accommodate the
> >>>>> >> scenario I provide above?
> >>>>> >>
> >>>>> >> Thanks
> >>>>> >>
> >>>>> >> Lewis
> >>>>> >>
> >>>>> >> On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]> wrote:
> >>>>> >> > Hi Lewis,
> >>>>> >> >
> >>>>> >> > My solrindex-mapping contains
> >>>>> >> > <mapping>
> >>>>> >> >   <!-- Simple mapping of fields created by Nutch IndexingFilters
> >>>>> >> >        to fields defined (and expected) in Solr schema.xml.
> >>>>> >> >
> >>>>> >> >        Any fields in NutchDocument that match a name defined
> >>>>> >> >        in field/@source will be renamed to the corresponding
> >>>>> >> >        field/@dest.
> >>>>> >> >        Additionally, if a field name (before mapping) matches
> >>>>> >> >        a copyField/@source then its values will be copied to
> >>>>> >> >        the corresponding copyField/@dest.
> >>>>> >> >
> >>>>> >> >        uniqueKey has the same meaning as in Solr schema.xml
> >>>>> >> >        and defaults to "id" if not defined.
> >>>>> >> >   -->
> >>>>> >> >   <fields>
> >>>>> >> >     <field dest="content" source="content"/>
> >>>>> >> >     <field dest="site" source="site"/>
> >>>>> >> >     <field dest="title" source="title"/>
> >>>>> >> >     <field dest="host" source="host"/>
> >>>>> >> >     <field dest="segment" source="segment"/>
> >>>>> >> >     <field dest="boost" source="boost"/>
> >>>>> >> >     <field dest="digest" source="digest"/>
> >>>>> >> >     <field dest="tstamp" source="tstamp"/>
> >>>>> >> >     <field dest="publishedDate" source="publishedDate"/>
> >>>>> >> >     <field dest="id" source="url"/>
> >>>>> >> >     <copyField source="url" dest="url"/>
> >>>>> >> >   </fields>
> >>>>> >> >   <uniqueKey>id</uniqueKey>
> >>>>> >> > </mapping>
> >>>>> >> >
> >>>>> >> > Do I need to edit any source code of the feed plugin to make
> >>>>> >> > this publishedDate available?
> >>>>> >> >
> >>>>> >> > Thanks
> >>>>> >> > Shameema
> >>>>> >> >
> >>>>> >> > On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney
> >>>>> >> > <[email protected]> wrote:
> >>>>> >> >> The best way to test this is by doing ad-hoc parsechecker
> >>>>> >> >> fetches. Also try including this value in your solr-mapping file.
> >>>>> >> >>
> >>>>> >> >> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]> wrote:
> >>>>> >> >>> In my schema there are certain fields used for the feed plugin.
> >>>>> >> >>>
> >>>>> >> >>> <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
> >>>>> >> >>> <field name="author" type="string" stored="true" indexed="true"/>
> >>>>> >> >>> <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
> >>>>> >> >>> <field name="feed" type="string" stored="true" indexed="true"/>
> >>>>> >> >>> <field name="publishedDate" type="date" stored="true" indexed="true"/>
> >>>>> >> >>> <field name="updatedDate" type="date" stored="true" indexed="true"/>
> >>>>> >> >>>
> >>>>> >> >>> I have included the feed plugin in nutch-site.xml. The feed file
> >>>>> >> >>> is fetched and parsed, and the links in it are working properly.
> >>>>> >> >>> But I cannot get the publishedDate working.
> >>>>> >> >>> I cannot retrieve the publishedDate or sort by it.
> >>>>> >> >>>
> >>>>> >> >>> Please help.
> >>>>> >> >>
> >>>>> >> >> --
> >>>>> >> >> Lewis
> >>>>> >>
> >>>>> >> --
> >>>>> >> Lewis
> >>>>>
> >>>>> --
> >>>>> Lewis
>
> --
> Lewis
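
One way to check independently of Nutch whether a feed really carries usable <pubDate> values is to read them directly from the XML. The following is only a minimal sketch under stated assumptions, not how the Nutch feed plugin or parse-tika work internally: the sample feed text, the example.com links and the function name extract_pub_dates are all made up for illustration. It assumes RSS 2.0 style <item> elements with RFC 822 style dates, and converts each one to the UTC form that a Solr "date" field such as publishedDate expects.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feed for illustration only; this is NOT the content
# of the Glasgow City Council feed discussed in the thread.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
      <pubDate>Fri, 08 Jun 2012 14:04:04 +0100</pubDate>
    </item>
    <item>
      <title>Item without a date</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>
"""


def extract_pub_dates(rss_text):
    """Return (link, solr_date_or_None) for every <item> in the feed."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.iter("item"):
        link = item.findtext("link")
        raw = item.findtext("pubDate")
        solr_date = None
        if raw:
            try:
                parsed = parsedate_to_datetime(raw.strip())
            except (TypeError, ValueError):
                parsed = None  # unparseable date: treat as absent
            if parsed is not None:
                if parsed.tzinfo is None:
                    # Assumption: a date without an offset is taken as UTC.
                    parsed = parsed.replace(tzinfo=timezone.utc)
                solr_date = parsed.astimezone(timezone.utc).strftime(
                    "%Y-%m-%dT%H:%M:%SZ"
                )
        results.append((link, solr_date))
    return results


if __name__ == "__main__":
    for link, date in extract_pub_dates(SAMPLE_RSS):
        print(link, date)
```

If every item of a real feed comes back with None here, the date is genuinely absent from the feed; if dates do come back, the value exists in the source and the question shifts to which parser handled the application/rss+xml content.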

