Sorry, for the benefit of others who may not know the GBIF SVN code sites, this 
particular code is all in the GBIF common resources SVN repository:
  
https://code.google.com/p/gbif-common-resources/source/browse/#svn%2Fgbif-parsers%2Ftrunk%2Fsrc%2Fmain%2Fjava%2Forg%2Fgbif%2Fcommon%2Fparsers

It is also available as a Mavenized release in the GBIF Maven repository:
  http://repository.gbif.org/index.html#nexus-search;quick~gbif-parsers
  
And the list mapping all the variations we see is:
  
https://code.google.com/p/gbif-common-resources/source/browse/gbif-parsers/trunk/src/main/resources/dictionaries/parse/countryName.txt
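
For anyone who wants a feel for how a variations list like this can be used, 
here is a minimal sketch of a dictionary lookup. It is not the gbif-parsers 
API: the class name CountryNameLookup, the tab-separated "variant<TAB>ISO code" 
line format, and the normalization rule are all assumptions made for 
illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a variant-to-ISO-code lookup; not the gbif-parsers API. */
class CountryNameLookup {
    // Normalized variant -> ISO country code.
    private final Map<String, String> variantToIso = new HashMap<>();

    /** Collapse case, punctuation and whitespace so near-identical variants share one key. */
    static String normalize(String raw) {
        return raw.toUpperCase(Locale.ROOT)
                  .replaceAll("[^\\p{L}\\p{N}]+", " ")
                  .trim();
    }

    /** Load lines of the assumed form "variant<TAB>code"; blank or malformed lines are skipped. */
    void load(Iterable<String> lines) {
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2 && !parts[0].trim().isEmpty()) {
                variantToIso.put(normalize(parts[0]),
                                 parts[1].trim().toUpperCase(Locale.ROOT));
            }
        }
    }

    /** Return the mapped code for a verbatim value, or empty if the variant is unknown. */
    Optional<String> parse(String verbatim) {
        if (verbatim == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(variantToIso.get(normalize(verbatim)));
    }
}
```

Unknown values come back empty rather than guessed, so they can be collected 
and reviewed before being added to the list as new variations.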

I hope this helps,
Tim


> Hi David,
> 
> You've built your other libraries using GBIF parsers.  Have you looked at how 
> the GBIF country names interpretation works?  It would be helpful to know why 
> it is not suitable for your use.
> 
> The GBIF library concatenates known lists (such as ISO) with about 2500 
> variations we've collected through periodic review of what we observe while 
> indexing.  Using Google Refine, we've mapped those variations to the ISO 
> codes, and we follow the ISO code changes as best we can.  Your 
> narwhal-processor already has a software dependency on the GBIF code.
> 
> Please remember that patches and additions are always welcome to the GBIF 
> code, if you feel it could be improved.  I'm biased of course, but I'd rather 
> see something that is broken get fixed than watch a re-creation of something 
> that already exists.
> 
> Cheers,
> Tim
> 
> 
> On May 17, 2013, at 4:39 PM, Matt Jones wrote:
> 
>> A good official list of countries is available from the Library of Congress:
>>   http://www.loc.gov/standards/codelists/countries.xml
>>   For background, see: http://www.loc.gov/marc/countries/
>> 
>> And of course there's ISO 3166, the list of country codes:
>>   
>> http://www.iso.org/iso/home/standards/country_codes/country_names_and_code_elements_xml.htm
>>   http://www.iso.org/iso/country_codes
>> 
>> Not sure about the alternate representations and misspellings, though.
>> 
>> Matt
>> 
>> 
>> On Fri, May 17, 2013 at 5:57 AM, Shorthouse, David 
>> <[email protected]> wrote:
>> Folks,
>> 
>> The Canadensys development team, http://www.canadensys.net is looking
>> for efficient, low-maintenance ways to validate and reconcile data in
>> its National cache of occurrence data. We are working on a Java
>> library to initially tackle single-field Darwin Core validations,
>> https://github.com/Canadensys/narwhal-processor. We hope this library
>> is sufficiently generalized for uses outside our project.
>> 
>> Our current challenge is to reconcile country names, which requires
>> access to an up-to-date, well-maintained knowledge base of country
>> names, their alternative representations (possibly multilingual), and
>> mappings to known misspellings. For performance reasons, we'd like
>> this thesaurus to be embedded in the library, but with the capacity to
>> be periodically refreshed with data pulled from external resources
>> such as dbpedia.org. This clearly has ties to semantic web thinking
>> and, because we're new to the tools and services in this space, we'd
>> like to solicit pointers and feedback such that we build this part of
>> our library with maximal benefit to other projects. We started
>> collecting thoughts here:
>> https://github.com/Canadensys/narwhal-processor/issues/14.
>> 
>> Cheers,
>> 
>> David P. Shorthouse
>> Christian Gendreau
>> _______________________________________________
>> tdwg mailing list
>> [email protected]
>> http://lists.tdwg.org/mailman/listinfo/tdwg