Tuesday, July 31, 2012

PROJECT OUTPUTS #3 - VETTING SERVICES


The ALA has a web service that accepts a POST request with a JSON body that contains the information to support a vetting.   

The URL for the service is: 
http://biocache.ala.org.au/ws/assertions/query/add  

Example JSON for the POST body:

Validating the supplied information

When the JSON body is invalid an HTTP Bad Request (400) will be returned.

When an invalid or no apiKey is provided an HTTP Forbidden (403) will be returned.

Otherwise the supplied information will be validated in two ways: first to ensure that the species name exists in the current system, and then to ensure that the area is in a valid WKT format.  If either of these checks fails, an HTTP Bad Request (400) is returned as the status with a message indicating the issue.
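The order of these checks can be illustrated with a small Python sketch. It is not the ALA implementation: the field names "species" and "area", the 200 returned on success, and the crude regular expression standing in for a real WKT parser are all assumptions made for the example. Only apiKey and the 400/403 statuses come from the description above.

```python
import json
import re

# Crude stand-in for a WKT parser: a geometry keyword followed by a
# parenthesised body. A real service would parse the geometry properly.
_WKT_PATTERN = re.compile(
    r"^\s*(POINT|LINESTRING|POLYGON|MULTIPOINT|MULTILINESTRING|"
    r"MULTIPOLYGON|GEOMETRYCOLLECTION)\s*\(.*\)\s*$",
    re.IGNORECASE | re.DOTALL,
)


def check_vetting(body_text, valid_api_keys, known_species):
    """Return (http_status, message), checking in the order described above."""
    try:
        body = json.loads(body_text)
    except ValueError:
        return 400, "invalid JSON body"
    if not isinstance(body, dict):
        return 400, "invalid JSON body"

    key = body.get("apiKey")
    if not isinstance(key, str) or key not in valid_api_keys:
        return 403, "invalid or missing apiKey"

    # "species" and "area" are hypothetical field names.
    species = body.get("species")
    if not isinstance(species, str) or species not in known_species:
        return 400, "species name not found"

    area = body.get("area")
    if not isinstance(area, str) or not _WKT_PATTERN.match(area):
        return 400, "area is not valid WKT"

    # The success status is an assumption; the post does not state it.
    return 200, "ok"
```

Calling check_vetting with a body that has no apiKey gives a 403 before the species or area is ever looked at, which matches the ordering described above.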

Insert/Update   

When inserting a new validation, a first load date is populated.  This date is never updated.  The purpose of this date is to provide a context for “Historic” vettings.  In the future the ALA may provide additional QAs around records that appear in a “historic” region after the vetting's first load date.

Each vetting that is posted to the web service will be stored in the database as raw JSON.  Other fields populated include a last modified date and a query.  The query will be constructed using the species name and the WKT for the area.  This query will be used to identify the records that are considered part of the vetting.
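As a rough illustration of the insert/update behaviour, the sketch below keeps vettings in a plain Python dict. The store, the vetting_id key, the "species" and "area" field names and the shape of the generated query are all hypothetical. The post only says that the raw JSON, a first load date, a last modified date and a query are stored.

```python
import json
from datetime import datetime, timezone


def build_query(species, wkt):
    # Placeholder structure; the real query syntax is not given in the post.
    return {"species": species, "wkt": wkt}


def upsert_vetting(store, vetting_id, raw_json, now=None):
    """Insert or update a vetting in `store` (a dict used as a stand-in database)."""
    now = now or datetime.now(timezone.utc)
    body = json.loads(raw_json)
    existing = store.get(vetting_id)
    record = {
        "raw_json": raw_json,
        # Set once on insert and carried over unchanged on every update.
        "first_loaded": existing["first_loaded"] if existing else now,
        "last_modified": now,
        # "species" and "area" are hypothetical field names.
        "query": build_query(body["species"], body["area"]),
    }
    store[vetting_id] = record
    return record
```

Posting the same vetting twice changes the last modified date but leaves the first load date at the time of the original insert.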

Deleting  

When a “delete” is issued against an existing vetting it is marked as deleted in the database.  It is not physically deleted until the action is filtered through to the ALA data.  A "deleted" assertion can NOT be resurrected.     
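The delete behaviour amounts to a soft delete followed by a later purge. The Python sketch below shows one way this could look, using a dict as a stand-in database. The function names and the "deleted" flag are assumptions for illustration, not the ALA schema.

```python
def save_vetting(store, vetting_id, raw_json):
    """Insert or update a vetting, refusing to revive one marked as deleted."""
    existing = store.get(vetting_id)
    if existing is not None and existing["deleted"]:
        raise ValueError("a deleted vetting cannot be resurrected")
    store[vetting_id] = {"raw_json": raw_json, "deleted": False}


def mark_deleted(store, vetting_id):
    """Soft delete: flag the row but keep it until the batch process has run."""
    store[vetting_id]["deleted"] = True


def purge_deleted(store):
    """Physically remove flagged rows once the deletion has reached the ALA data."""
    for vetting_id in [v for v, rec in store.items() if rec["deleted"]]:
        del store[vetting_id]
```

In this sketch purge_deleted would be called by the batch process after it has removed the vetting from the affected records.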

Applying Vettings to ALA Data

The exact process that will be used to apply the vettings to the ALA data is not yet known.  It will be a batch process, run nightly, that updates records based on the queries that were generated for each vetting.

New/updated vettings that have not been pushed through to the ALA data will be applied to all records that satisfy the query.  Old vettings will be applied to records that have been inserted/modified since the previous batch process.
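The selection rule for the nightly batch can be sketched in Python as follows. This is only an illustration under stated assumptions: a vetting is treated as "not yet pushed through" when its last modified date is later than the previous batch run, and the matches callable is a hypothetical stand-in for evaluating the vetting's stored query against a record.

```python
from datetime import datetime


def records_to_update(vetting, records, previous_run, matches):
    """Pick the records a vetting should be applied to in this batch run.

    vetting:      dict with a 'last_modified' datetime
    records:      iterable of dicts, each with a 'last_modified' datetime
    previous_run: datetime of the previous batch run
    matches:      callable(vetting, record) -> bool, standing in for the query
    """
    # New or updated vettings apply to every matching record.
    pending = vetting["last_modified"] > previous_run
    return [
        r for r in records
        if matches(vetting, r)
        # Old vettings only apply to records touched since the last run.
        and (pending or r["last_modified"] > previous_run)
    ]
```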

Thursday, July 12, 2012

PROJECT OUTPUTS #2 - DUPLICATE DETECTION

As part of a range of data quality checks the Atlas has identified potential duplicate records within the Biocache.  This allows users to discard duplicate records from searches, analysis and mapping where this is appropriate.  A discussion on duplicate records is available here

The ALA uses the scientific name, decimal latitude, decimal longitude, collector and collection date to indicate potential duplicate records: 
  • Records are considered for a species (with synonyms mapped to the accepted name).
  • Collection dates are duplicates when the individual components are identical (year, month and day). Empty values are considered the same.
  • Collector names are compared; if one is null it is considered a duplicate, otherwise a Levenshtein distance is calculated, with a distance within an acceptable threshold indicating a duplicate.
  • Latitudes and Longitudes are duplicates when they are identical at the same precision. Null values are excluded from consideration.
When a group of records is identified as duplicates, one of them needs to be identified as the "representative" record.  The representative record can be used to represent the entire group of duplicates and should not itself be considered a duplicate.
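The comparison rules listed above can be sketched in Python. This is a simplified illustration rather than the Atlas code: the record field names, the collector threshold of 3, lower-casing names before comparison, and comparing coordinates after rounding to a fixed number of decimals are all assumptions. Records are assumed to have already been grouped by accepted species name.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def dates_match(d1, d2):
    # Dates are (year, month, day) tuples; None marks an empty component,
    # and two empty components count as the same.
    return d1 == d2


def collectors_match(c1, c2, threshold=3):
    # A missing collector on either side is treated as a duplicate.
    if c1 is None or c2 is None:
        return True
    return levenshtein(c1.lower(), c2.lower()) <= threshold


def coords_match(lat1, lon1, lat2, lon2, decimals=3):
    # Null coordinates are excluded from consideration.
    if None in (lat1, lon1, lat2, lon2):
        return False
    return (round(lat1, decimals) == round(lat2, decimals)
            and round(lon1, decimals) == round(lon2, decimals))


def is_potential_duplicate(r1, r2):
    """Both records are assumed to share the same accepted species name."""
    return (dates_match(r1["date"], r2["date"])
            and collectors_match(r1["collector"], r2["collector"])
            and coords_match(r1["lat"], r1["lon"], r2["lat"], r2["lon"]))
```

For example, two records with the same partial date (2010, 5, None), the same coordinates, and collectors "J. Smith" and "J Smith" would be flagged, since the names are one edit apart.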

Here is an example "representative" record in the ALA: http://biocache.ala.org.au/occurrence/3cde1570-7a38-4a58-b121-e95c35585a29#inferredOccurrenceDetails


There is a JSON web service that returns the complete information that was used in the duplicate detection:
http://biocache.ala.org.au/ws/duplicates/3cde1570-7a38-4a58-b121-e95c35585a29

It is possible to perform a search for all the records that are duplicates of a specific representative record: http://biocache.ala.org.au/occurrence/search?q=duplicate_record:3cde1570-7a38-4a58-b121-e95c35585a29

Duplicate records can be excluded from the queries submitted to the biocache by applying the negative filter &fq=-duplicate_status:D.  Example: http://biocache.ala.org.au/occurrences/search?taxa=Mountain%20Thornbill&fq=-duplicate_status:D
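For anyone scripting against these services, the URLs above can be built with Python's standard library. The helper names below are made up for this sketch; the base URLs, the taxa parameter and the fq=-duplicate_status:D filter are taken from the examples in this post.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "http://biocache.ala.org.au/occurrences/search"
DUPLICATES_WS_URL = "http://biocache.ala.org.au/ws/duplicates/"


def search_url(taxa, exclude_duplicates=True):
    """Build an occurrence search URL, optionally filtering out duplicates."""
    params = [("taxa", taxa)]
    if exclude_duplicates:
        # Negative filter: drop records whose duplicate_status is D.
        params.append(("fq", "-duplicate_status:D"))
    # quote_via=quote encodes spaces as %20; ':' is kept readable.
    return SEARCH_URL + "?" + urlencode(params, quote_via=quote, safe=":")


def duplicates_ws_url(record_uuid):
    """URL of the JSON service describing a record's duplicate detection."""
    return DUPLICATES_WS_URL + quote(record_uuid)
```

Calling search_url("Mountain Thornbill") reproduces the example URL given above.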