[PHP] preg_match and dates

2009-03-02 Thread Michael A. Peters

I have absolutely no control over the source file.

The source file is an XML file (er, sort of, it doesn't follow any 
particular DTD) and has a tag called VERBATIM_DATE in each record. It 
looks to be required in their output, as every record so far has it, but 
w/o a DTD it's hard to know. Time of day, on the other hand, is not 
required and the tag is usually missing.


Here's the beauty - VERBATIM_DATE in the same XML file uses multiple 
different formats, e.g. -


12 March 1945
14 Mar 1967
Apr 1999
12-03-2005
Before 1904
Winter or Spring 1977

etc.

It does seem that if there is a day, the day is always first - but 
sometimes the delimiter is a space, sometimes it is a hyphen, and 
sometimes a date has both, e.g. -


10-15 Dec 1934
12 March-03 April 1956

What I'm trying to do is write a preg_match pattern for each case I come 
across - if a date matches the pattern, it is then parsed accordingly 
to get me an acceptable YYYY-MM-DD (not sure how I'll deal with the 
season case yet ... but I'm serious, that kind of thing is in there 
several times).


To at least get started though, is there a wildcard defined that says 
match a month?


IE

/^([0-9]{2})[\s-](MONTH_MATCH)[\s-]([0-9]{4})$/

where MONTH_MATCH is some special magic that matches Mar, March, Apr, 
April, etc.?

If you must know - it's data from a biology vertebrate museum. Thousands 
of records may match a given query. Most of them look fairly easily 
parsable and it does look like when a day is specified, it is always 
first and year is always last.


The data is needed by me, so I'm planning on having the script die if it 
comes across a date I don't have a regex to parse, before it does 
anything else, so I can add the appropriate regex as necessary. But 
damn - you'd think a vertebrate museum would have cleaned up their DB 
somewhat.
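
A minimal sketch of that "check everything first, die before touching 
anything" pass. The function name unmatched_dates() is made up for 
illustration; it only assumes the dates have already been pulled out of 
the XML into an array and that the regexes live in a second array.

```php
<?php
// Hypothetical helper: returns every date string that none of the
// supplied regexes match, in the order the dates were given.
function unmatched_dates(array $dates, array $patterns)
{
    $unknown = array();
    foreach ($dates as $date) {
        $known = false;
        foreach ($patterns as $regex) {
            if (preg_match($regex, $date)) {
                $known = true;
                break;
            }
        }
        if (!$known) {
            $unknown[] = $date;
        }
    }
    return $unknown;
}
```

If the returned array is not empty, the script can die() with the list 
joined by newlines, which shows every format that still needs a regex in 
one run instead of one per run.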


--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] preg_match and dates

2009-03-02 Thread Per Jessen
Michael A. Peters wrote:

> What I'm trying to do is write a preg_match pattern for each case I come
> across - if a date matches the pattern, it is then parsed accordingly
> to get me an acceptable YYYY-MM-DD (not sure how I'll deal with the
> season case yet ... but I'm serious, that kind of thing is in there
> several times).
>
> To at least get started though, is there a wildcard defined that says
> match a month?
>
> IE
>
> /^([0-9]{2})[\s-](MONTH_MATCH)[\s-]([0-9]{4})$/
>
> where MONTH_MATCH is some special magic that matches Mar, March, Apr,
> April, etc.?

Just write one yourself. 



-- 
Per Jessen, Zürich (6.1°C)





Re: [PHP] preg_match and dates

2009-03-02 Thread Per Jessen

Michael A. Peters wrote:

> This is what I have so far -
>
> $pattern[] = '/^([0-9]{1,2})[\s-]([A-Z][a-z]*)[\s-]([0-9]{4})$/i';
> $clean[]   = '\\3-\\2-\\1';
>
> $pattern[] = '/^([A-Z][a-z]*)[\s-]([0-9]{4})$/';
> $clean[]   = '\\2-\\1-01';
>
> $foo = preg_replace($pattern, $clean, $verb_date);


If I were you, I'd write several regexes, one for each date format you 
wish to recognize.  It makes the regexes much easier to read, and you 
can still write sub-expressions for catching e.g. months and then reuse 
those in your main regexes.
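
A rough sketch of what that could look like, with one month 
sub-expression reused in two of the formats from the original post. The 
names $month, $formats and identify_format() are made up for 
illustration, and accepting "Sept" is an assumption about the data.

```php
<?php
// Hypothetical sub-pattern: full English month names or their
// three-letter abbreviations ("Sept" is also accepted as an assumption).
$month = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
       . 'Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|'
       . 'Nov(?:ember)?|Dec(?:ember)?)';

// One regex per recognised format, each reusing $month.
$formats = array(
    'day_month_year' => '/^([0-9]{1,2})[\s-](' . $month . ')[\s-]([0-9]{4})$/i',
    'month_year'     => '/^(' . $month . ')[\s-]([0-9]{4})$/i',
);

// Returns the name of the first format that matches, or null if none do.
function identify_format($date, array $formats)
{
    foreach ($formats as $name => $regex) {
        if (preg_match($regex, $date)) {
            return $name;
        }
    }
    return null;
}
```

Because the month is spelled out, "Before 1904" no longer slips through 
the month/year pattern the way it would with a generic word match.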




/Per Jessen




Re: [PHP] preg_match and dates

2009-03-02 Thread Michael A. Peters

Per Jessen wrote:
> Michael A. Peters wrote:
>
>> What I'm trying to do is write a preg_match pattern for each case I come
>> across - if a date matches the pattern, it is then parsed accordingly
>> to get me an acceptable YYYY-MM-DD (not sure how I'll deal with the
>> season case yet ... but I'm serious, that kind of thing is in there
>> several times).
>>
>> To at least get started though, is there a wildcard defined that says
>> match a month?
>>
>> IE
>>
>> /^([0-9]{2})[\s-](MONTH_MATCH)[\s-]([0-9]{4})$/
>>
>> where MONTH_MATCH is some special magic that matches Mar, March, Apr,
>> April, etc.?
>
> Just write one yourself.


This is what I have so far -

// day, month name, year -> year-month-day
$pattern[] = '/^([0-9]{1,2})[\s-]([A-Z][a-z]*)[\s-]([0-9]{4})$/i';
$clean[]   = '\\3-\\2-\\1';

// month name, year -> year-month-01
$pattern[] = '/^([A-Z][a-z]*)[\s-]([0-9]{4})$/';
$clean[]   = '\\2-\\1-01';

$foo = preg_replace($pattern, $clean, $verb_date);

That was enough for me to discover that some collectors used two-digit 
years, and I can't differentiate 1902 from 2002, so I'll have to flag 
those and bug the curator to fix 'em.
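
A minimal sketch of the flagging step, assuming the year is always the 
last field (which is what the data looks like so far) and that a 
two-digit year is preceded by a space or a hyphen. The function name 
has_two_digit_year() is made up for illustration.

```php
<?php
// Hypothetical check: true when the string ends in exactly two digits
// that follow a space or hyphen, e.g. "12 Mar 02" or "12-03-05".
// A four-digit year such as "1902" is not flagged, because its last two
// digits are preceded by another digit rather than a delimiter.
function has_two_digit_year($date)
{
    return preg_match('/[\s-][0-9]{2}$/', trim($date)) === 1;
}
```

Records for which this returns true can be collected into a list for the 
curator instead of being guessed at.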


I'd rather have ([A-Z][a-z]*) replaced with something that makes sure 
it is a valid short or long month. Writing one myself is not impossible, 
but if there is a date wildcard (or a tried and proven pattern) built 
into PHP that can match a month, then it is better to use it, no?
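
As far as I know PCRE has no month wildcard. There is also a second 
catch in the patterns above: \\2 in the replacement inserts the month 
name as written, so the result is 1945-March-12 rather than a numeric 
month. Here is a sketch of one way to handle both with preg_match() and 
a lookup, using the built-in checkdate() to reject impossible dates. The 
helper names month_number() and day_month_year_to_iso() are made up for 
illustration. strtotime() does understand English month names, but it 
applies its own rules to ambiguous input, so an explicit lookup seemed 
the safer assumption here.

```php
<?php
// Hypothetical helper: English month name or three-letter abbreviation
// -> 1..12, or null when the name is not a month.
function month_number($name)
{
    $months = array('january', 'february', 'march', 'april', 'may', 'june',
                    'july', 'august', 'september', 'october', 'november',
                    'december');
    $name = strtolower($name);
    foreach ($months as $index => $full) {
        if ($name === $full || $name === substr($full, 0, 3)) {
            return $index + 1;
        }
    }
    return null;
}

// Hypothetical helper: "12 March 1945" or "14-Mar-1967" -> "1945-03-12".
// Returns null when the format, the month name or the calendar date is bad.
function day_month_year_to_iso($date)
{
    if (!preg_match('/^([0-9]{1,2})[\s-]([A-Za-z]+)[\s-]([0-9]{4})$/', $date, $m)) {
        return null;
    }
    $month = month_number($m[2]);
    $day   = (int) $m[1];
    $year  = (int) $m[3];
    if ($month === null || !checkdate($month, $day, $year)) {
        return null;
    }
    return sprintf('%04d-%02d-%02d', $year, $month, $day);
}
```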

