The ALA uses the scientific name, decimal latitude, decimal longitude, collector and collection date to indicate potential duplicate records:
- Records are compared within a single species (with synonyms mapped to the accepted name).
- Collection dates are duplicates when the individual components are identical (year, month and day). Empty values are considered the same.
- Collector names are compared; if one is null the pair is considered a duplicate, otherwise a Levenshtein distance is calculated, and a distance within an acceptable threshold indicates a duplicate.
- Latitudes and longitudes are duplicates when they are identical at the same precision. Null values are excluded from consideration.
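
The rules above can be sketched in a few lines of Python. This is only an illustration, not the ALA's actual code: the `Record` class, the function names, the threshold of 3 and the 4-decimal-place precision are all assumptions made for the example. It also assumes that an empty date component matches only another empty component, and that names have already been mapped to their accepted name.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values; the post does not state what the ALA actually uses.
MAX_COLLECTOR_DISTANCE = 3
COORDINATE_DECIMALS = 4


@dataclass
class Record:
    accepted_name: str  # synonyms assumed to be mapped already
    latitude: Optional[float]
    longitude: Optional[float]
    collector: Optional[str]
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def same_date(a: Record, b: Record) -> bool:
    # Each component must match; None == None counts as a match (assumption).
    return (a.year, a.month, a.day) == (b.year, b.month, b.day)


def same_collector(a: Record, b: Record) -> bool:
    # A missing collector on either side is treated as a match.
    if a.collector is None or b.collector is None:
        return True
    distance = levenshtein(a.collector.lower(), b.collector.lower())
    return distance <= MAX_COLLECTOR_DISTANCE


def same_location(a: Record, b: Record) -> bool:
    # Records without coordinates are excluded from consideration.
    coords = (a.latitude, a.longitude, b.latitude, b.longitude)
    if any(value is None for value in coords):
        return False
    return (round(a.latitude, COORDINATE_DECIMALS) == round(b.latitude, COORDINATE_DECIMALS)
            and round(a.longitude, COORDINATE_DECIMALS) == round(b.longitude, COORDINATE_DECIMALS))


def is_potential_duplicate(a: Record, b: Record) -> bool:
    return (a.accepted_name == b.accepted_name
            and same_location(a, b)
            and same_date(a, b)
            and same_collector(a, b))
```

With these assumptions, two records for the same species, coordinates and date whose collectors are "J. Smith" and "J Smith" would be flagged as potential duplicates, because the collector names differ by a single edit.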
Here is an example "representative" record in the ALA: http://biocache.ala.org.au/occurrence/3cde1570-7a38-4a58-b121-e95c35585a29#inferredOccurrenceDetails
There is a JSON web service that returns the complete information that was used in the duplication detection:
http://biocache.ala.org.au/ws/duplicates/3cde1570-7a38-4a58-b121-e95c35585a29
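
A minimal sketch of calling this web service from Python with the standard library follows. The function names are made up for the example, and no assumption is made about the fields in the response; the JSON is simply parsed and returned.

```python
import json
import urllib.request

# Base URL taken from the example above.
DUPLICATES_SERVICE = "http://biocache.ala.org.au/ws/duplicates/"


def duplicates_url(record_uuid: str) -> str:
    """Build the duplicates web service URL for a record UUID."""
    return DUPLICATES_SERVICE + record_uuid


def fetch_duplicates(record_uuid: str, timeout: float = 10.0):
    """Download and parse the duplicate-detection details (needs network access)."""
    with urllib.request.urlopen(duplicates_url(record_uuid), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_duplicates("3cde1570-7a38-4a58-b121-e95c35585a29")` would return the parsed JSON for the representative record shown above, provided the service is reachable.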
It is possible to perform a search for all the records that are duplicates of a specific representative record: http://biocache.ala.org.au/occurrence/search?q=duplicate_record:3cde1570-7a38-4a58-b121-e95c35585a29
Duplicate records can be excluded from the queries submitted to the biocache by applying the negative filter &fq=-duplicate_status:D. Example: http://biocache.ala.org.au/occurrences/search?taxa=Mountain%20Thornbill&fq=-duplicate_status:D
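
When building such queries in a script, the negative filter can be appended programmatically. The sketch below uses only the Python standard library; the function name and the `exclude_duplicates` flag are made up for the example, and the base URL is the one from the example above.

```python
from urllib.parse import quote, urlencode

# Search endpoint taken from the example above.
SEARCH_URL = "http://biocache.ala.org.au/occurrences/search"


def search_url(taxa: str, exclude_duplicates: bool = True) -> str:
    """Build an occurrence search URL, optionally filtering out duplicates."""
    params = [("taxa", taxa)]
    if exclude_duplicates:
        # Negative filter: drop records whose duplicate_status is D.
        params.append(("fq", "-duplicate_status:D"))
    # quote_via=quote encodes spaces as %20; safe=":" keeps the colon readable.
    return SEARCH_URL + "?" + urlencode(params, safe=":", quote_via=quote)
```

For example, `search_url("Mountain Thornbill")` reproduces the example URL above, and passing `exclude_duplicates=False` leaves the filter off.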