Hi all.
This is a really interesting topic, as Markus said. However, I think a simple
improvement to the parse-metatags plugin could help.
If you check the source of this URL:
http://www.radiosanctispiritus.cu/es/2013/06/petrocaribe-aprueba-plan-para-crear-zona-economica-especial/
Nutch does not recognize Open Graph meta tags such as
<meta property="og:title" content="PetroCaribe aprueba plan para crear zona económica especial" />
nor this other one:
<meta property="article:published_time" content="2013-06-30T15:11:10+00:00" />
The parse-metatags plugin could solve this problem with a simple modification.
Looking at its source code, it only reads the metatags.names property from the
configuration file:
String[] values = conf.getStrings("metatags.names", "*");
For that reason it only gets tags such as keywords and description. Maybe by
adding one more property, metatags.property, listing all the wanted meta tags, like
og:title;article:published_time
it would be possible to get the last-modified date and other metadata when they
are present in a page.
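A minimal sketch of the idea, in plain Java and outside of Nutch, could look like the following. It is not how parse-metatags is implemented: the class name MetaPropertySketch, the extract method and the regular expression are hypothetical, and a real plugin would walk the parsed DOM instead of matching the raw HTML.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class MetaPropertySketch {

    // Simplification: only matches double-quoted attributes where
    // "property" appears before "content" inside the same <meta> tag.
    private static final Pattern META_PROPERTY = Pattern.compile(
        "<meta\\s+[^>]*?property\\s*=\\s*\"([^\"]*)\"[^>]*?content\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns the wanted property/content pairs in document order.
    // A "*" entry in the wanted set accepts every property.
    static Map<String, String> extract(String html, Set<String> wanted) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = META_PROPERTY.matcher(html);
        while (m.find()) {
            String property = m.group(1).trim().toLowerCase(Locale.ROOT);
            if (wanted.contains("*") || wanted.contains(property)) {
                // Keep the first occurrence if a property is repeated.
                found.putIfAbsent(property, m.group(2));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String html =
            "<meta property=\"og:title\" content=\"PetroCaribe aprueba plan\" />"
            + "<meta property=\"article:published_time\""
            + " content=\"2013-06-30T15:11:10+00:00\" />";
        Set<String> wanted = new HashSet<>(
            Arrays.asList("og:title", "article:published_time"));
        System.out.println(extract(html, wanted));
    }
}
```

The wanted set stands in for the values that would come from the proposed metatags.property setting.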
The JIRA issue NUTCH-1561 is already resolved, but maybe it could be reopened
to include this change, or a new issue could be created for this modification,
so that these meta tags are used when they are present in a web page.
What do you think?
----- Original message -----
From: "Markus Jelsma" <[email protected]>
To: [email protected], [email protected]
Sent: Wednesday, March 11, 2015 17:12:14
Subject: [MASSMAIL]RE: Handling servers with wrong Last Modified HTTP header
Hello Jorge,
This is an interesting but very complicated issue. First of all, do not rely on
HTTP headers; they are incorrect on any scale larger than very small. This is
true for Last-Modified due to dynamic CMSs, but also for many other headers. You
can even expect website descriptions in headers such as Content-Type, madness!
The only reliable source of a document's date and optionally time is within the
document itself. This introduces two new problems: 1) what format and
language, and 2) where exactly can you find it. Let's discuss these two issues.
The first is the most straightforward to deal with, it is a two-stage process.
First you need to extract anything that resembles a date format that is used on
Earth, this includes non-numeric dates such as month names. Then you have to
pass all those date candidates through a series of carefully aligned date
formats (SimpleDateFormat) and set the appropriate Locale. This stage requires
that you have identified the language of the document, or the part of the
document you are processing in case of multi-language documents.
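The two stages described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the NUTCH-1414 code: the class name DateCandidateSketch, the candidate regular expression and the short list of formats are hypothetical, and the Locale is assumed to come from a language identification step that is not shown.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class DateCandidateSketch {

    // Stage 1: a loose pattern for things that look like dates, either
    // "2013-06-30" or "30 June 2013" / "30 de junio de 2013".
    private static final Pattern CANDIDATE = Pattern.compile(
        "\\b\\d{4}-\\d{2}-\\d{2}\\b"
        + "|\\b\\d{1,2}\\s+(?:de\\s+)?\\p{L}+\\s+(?:de\\s+)?\\d{4}\\b",
        Pattern.UNICODE_CHARACTER_CLASS);

    // Stage 2: formats tried in order; a real list would be much longer.
    private static final String[] FORMATS = {
        "yyyy-MM-dd", "d MMMM yyyy", "d 'de' MMMM 'de' yyyy"
    };

    // Returns every candidate in the text that one of the formats accepts.
    static List<Date> extract(String text, Locale locale) {
        List<Date> dates = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            Date parsed = parse(m.group(), locale);
            if (parsed != null) {
                dates.add(parsed);
            }
        }
        return dates;
    }

    // Tries each format strictly; returns null when none consumes the
    // whole candidate string.
    static Date parse(String candidate, Locale locale) {
        for (String format : FORMATS) {
            SimpleDateFormat sdf = new SimpleDateFormat(format, locale);
            sdf.setLenient(false);
            sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
            ParsePosition pos = new ParsePosition(0);
            Date parsed = sdf.parse(candidate, pos);
            if (parsed != null && pos.getIndex() == candidate.length()) {
                return parsed;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "Posted on 30 June 2013 by admin. Updated 2013-07-01.";
        System.out.println(extract(text, Locale.ENGLISH));
    }
}
```

Month names only parse when the Locale matches the language of the text, which is why the language has to be known before the second stage runs.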
Luckily, I uploaded preliminary work as a Nutch parse plugin a few years ago
that does exactly this; check out NUTCH-1414 [1]. You present the extractor
with a language and a piece of text, in this case the document's extracted
text. It is very basic and has many flaws, but it should work nicely if you
present it with concise fragments of text.
The second part of the solution is more cumbersome to deal with. NUTCH-1414
uses the document's extracted text as source for date extraction, and it has
really no clue as to where the date is located in the document's structure. If
you use Nutch's basic text extraction (extract all TEXT nodes) you will get bad
results for most documents. It can be partially solved by relying on
Boilerpipe's text extraction. But using Boilerpipe may in turn prevent you from
extracting dates that actually got extracted using no text extraction algorithm
at all!
Please check out NUTCH-1414 and see if it works for you. Hopefully, in your
case, it will do what you want it to do. A few years ago I decided to move the
improved date extraction tool into a separate project, get rid of Boilerpipe
altogether, and build a new tool from scratch that can interface with a date
extraction tool and has support for looking up the exact spot of the
document's date. It works on 95% of the many hundreds of real web page tests,
so if you need something that works at scale, you can contact me off list. The
stuff has not been open sourced.
Have fun!
Markus
[1]: https://issues.apache.org/jira/browse/NUTCH-1414
-----Original message-----
> From: Jorge Luis Betancourt González <[email protected]>
> Sent: Tuesday 10th March 2015 4:23
> To: [email protected]
> Subject: Handling servers with wrong Last Modified HTTP header
>
> Recently in the search app we are working on we've encountered a lot of
> websites that have a wrong and invalid date in the Last Modified HTTP header,
> meaning for instance that an article posted on a news site back in 2010 has a
> Last Modified header of just a few days back. This could be for any number of
> reasons:
>
> - A new comment was added to the site
> - Some cache invalidation occurring in the source code of the website that
> affects the article's page
> - Perhaps a new ad showing in the sidebar
> - Or just plain wrong header handling in the platform code
>
> From what I've seen, several CMSs handle this, some even allowing the
> published date to be "tweaked". My question is basically whether anyone on
> the list has a suggestion on how to address this situation. For the
> particular case that we've been working on, most of the URLs have the
> published date in the URL in the form of yyyy/mm/dd (or some similar
> fashion), so this could be one way of "guessing" the publication date of the
> article. I realize that this is no silver bullet, but I'd love to get some
> feedback on this type of situation. From my experience, when people filter
> by date in our frontend app, they are usually trying to get news/articles by
> the publication date instead of the Last Modified date, and they are
> confused when the returned results have very old publication dates. They
> usually don't check whether the change is just a new comment, for instance.
>
> I'm leaving the "how to implement this" aside for now; I'm just interested
> in discussing how to deal with this type of situation. As stated, in our
> particular case we can rely on the URL patterns for a very good portion, but
> I was hoping to agree on some general approach that could be integrated into
> Nutch.
>
> Regards,
>
> PS: Should I post this also to the user list?
>
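The URL-based guess mentioned in the original question (a yyyy/mm/dd segment in the path, or a similar layout) could be sketched as below. This is a minimal illustration, not something that exists in Nutch: the class name UrlDateSketch and the guess method are hypothetical, example.com is a placeholder host, and falling back to the first day of the month when only yyyy/mm is present is an assumption made for the sketch.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class UrlDateSketch {

    // Matches /yyyy/mm or /yyyy/mm/dd as whole path segments,
    // with years limited to 19xx and 20xx and two-digit month and day.
    private static final Pattern URL_DATE = Pattern.compile(
        "/((?:19|20)\\d{2})/(0[1-9]|1[0-2])(?:/(0[1-9]|[12]\\d|3[01]))?(?=/|$)");

    // Returns the first date-like path segment, or empty when there is
    // none or when the numbers do not form a real calendar date.
    static Optional<LocalDate> guess(String url) {
        Matcher m = URL_DATE.matcher(url);
        if (!m.find()) {
            return Optional.empty();
        }
        int year = Integer.parseInt(m.group(1));
        int month = Integer.parseInt(m.group(2));
        // Assumption: default to day 1 when the URL only carries yyyy/mm.
        int day = m.group(3) != null ? Integer.parseInt(m.group(3)) : 1;
        try {
            return Optional.of(LocalDate.of(year, month, day));
        } catch (DateTimeException e) {
            return Optional.empty(); // e.g. /2013/02/31/
        }
    }

    public static void main(String[] args) {
        System.out.println(guess("http://example.com/2010/03/15/some-article"));
        System.out.println(guess("http://example.com/about/"));
    }
}
```

As noted in the thread, this is no silver bullet: it only helps for sites whose URL layout carries the publication date.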