Thursday, July 12, 2012


As part of a range of data quality checks, the Atlas has identified potential duplicate records within the Biocache. This allows users to discard duplicate records from searches, analysis and mapping where this is appropriate. A discussion on duplicate records is available here.

The ALA uses the scientific name, decimal latitude, decimal longitude, collector and collection date to indicate potential duplicate records: 
  • Records are considered for a species (with synonyms mapped to the accepted name).
  • Collection dates are duplicates when the individual components are identical (year, month and day). Empty values are considered the same.
  • Collector names are compared: if either is null the records are considered duplicates; otherwise a Levenshtein distance is calculated, and a distance within an acceptable threshold indicates a duplicate.
  • Latitudes and longitudes are duplicates when they are identical at the same precision. Null values are excluded from consideration.
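
The rules above can be sketched in a few lines of Python. This is only an illustration of the comparison logic, not the ALA's actual implementation: the `Record` class, the `max_distance` threshold of 3, and the rounding to `decimals` places as a stand-in for "same precision" are all assumptions made for the example. Treating two empty date components as equal is also just one reading of the rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical record shape; field names are illustrative only."""
    species: str                # accepted name, synonyms already mapped
    year: Optional[int]
    month: Optional[int]
    day: Optional[int]
    collector: Optional[str]
    latitude: Optional[float]
    longitude: Optional[float]


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_potential_duplicate(r1: Record, r2: Record,
                           max_distance: int = 3,   # assumed threshold
                           decimals: int = 3) -> bool:  # assumed precision
    # Records are only compared within the same species.
    if r1.species != r2.species:
        return False
    # Year, month and day must match; two empty components count as equal.
    if (r1.year, r1.month, r1.day) != (r2.year, r2.month, r2.day):
        return False
    # Records with null coordinates are excluded from consideration.
    coords = (r1.latitude, r1.longitude, r2.latitude, r2.longitude)
    if any(c is None for c in coords):
        return False
    if round(r1.latitude, decimals) != round(r2.latitude, decimals):
        return False
    if round(r1.longitude, decimals) != round(r2.longitude, decimals):
        return False
    # A null collector on either side is treated as a match.
    if r1.collector is None or r2.collector is None:
        return True
    distance = levenshtein(r1.collector.lower(), r2.collector.lower())
    return distance <= max_distance
```

With these assumptions, two records that differ only in the collector being written "J. Smith" versus "J Smith" would be flagged as potential duplicates, while records collected on different days would not.
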
When a group of records is identified as duplicates, one of them needs to be identified as the "representative" record. The representative record can be used to represent the entire group of duplicates and should not itself be considered a duplicate.

Here is an example "representative" record in the ALA:

There is a JSON web service that returns the complete information that was used in the duplication detection:

It is possible to perform a search for all the records that are duplicates of a specific representative record:

Duplicate records can be excluded from the queries submitted to the biocache by applying the negative filter &fq=-duplicate_status:D.  Example:
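
As a rough sketch of how a client might append this filter, the Python snippet below builds a search URL with the negative filter attached. The base URL, the `q` parameter and the `build_search_url` helper are hypothetical placeholders chosen for illustration; only the `fq=-duplicate_status:D` filter comes from the text above.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str) -> str:
    """Return a search URL that excludes records flagged as duplicates.

    `base_url` and the `q` parameter name are assumptions for this sketch;
    substitute the real biocache search endpoint and query syntax.
    """
    params = [
        ("q", query),
        ("fq", "-duplicate_status:D"),  # negative filter: drop duplicates
    ]
    return base_url + "?" + urlencode(params)


# Placeholder endpoint, not a real service address.
url = build_search_url("https://biocache.example.org/ws/occurrences/search",
                       "Macropus rufus")
print(url)
```

Note that `urlencode` percent-encodes the colon, so the filter appears in the URL as `fq=-duplicate_status%3AD`; the server decodes it back to the same value.
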
