AP30 - Bird Species Distribution Project: JSON

Wednesday, August 22, 2012

KEY TECHNOLOGIES and FEATURES

The stack we are putting together to support AP30 requirements includes:

Apache Cassandra Database. The database will house the full record details and will store the results of duplicate & outlier detection. It will also provide the persistence for the record vettings provided by Edgar.
Apache Solr search indexes. These indexes will support the search capabilities required by Edgar.
A processing chain implemented in Scala. This will include the algorithms for detecting duplicate records and environmental outliers. This custom code will then update the search indexes so that Edgar can filter for non-duplicates and non-outliers, improving the quality of the model and reducing the number of records to be vetted. The code for this component is available in the Google Code repository http://code.google.com/p/ala-portal/
Java Spring MVC web services. These web services will provide the interface for the Edgar project to download snapshots of data for modelling and vetting purposes. They will also provide a write interface for expert users to submit vettings of bird records. The code for this component is available in the Google Code repository http://code.google.com/p/ala-portal/. An additional important functional requirement is for Edgar to be able to query for record deletions. Services will be developed to allow Edgar to keep track of the deletions that occur periodically when the ALA harvests from data providers.

Tuesday, July 31, 2012

PROJECT OUTPUTS #3 - VETTING SERVICES

The ALA has a web service that accepts a POST request with a JSON body that contains the information to support a vetting.

The URL for the service is:
http://biocache.ala.org.au/ws/assertions/query/add

Example JSON for the POST body:

Validating the supplied information

When the JSON body is invalid, an HTTP Bad Request (400) will be returned.

When an invalid apiKey or no apiKey is provided, an HTTP Forbidden (403) will be returned.

Otherwise the supplied information will be validated in two ways: first, that the species name exists in the current system; second, that the area is in valid WKT format. If either of these checks fails, an HTTP Bad Request (400) is returned as the status, with a message indicating the issue.
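
The order of checks described above can be sketched roughly as follows. This is only an illustration, not the ALA implementation: the field names "apiKey", "species" and "area", the in-memory key and species sets, the 200 status on success, and the simplistic WKT pattern are all assumptions made for the example.

```python
import json
import re

# Hypothetical stand-ins: the real key store and species lookup are not described here.
VALID_API_KEYS = {"example-key"}
KNOWN_SPECIES = {"Dacelo novaeguineae"}

# Very rough WKT shape check: a geometry keyword followed by parenthesised content.
# A real service would use a proper WKT parser instead.
WKT_PATTERN = re.compile(
    r"^\s*(POINT|LINESTRING|POLYGON|MULTIPOINT|MULTILINESTRING|MULTIPOLYGON)\s*\(.*\)\s*$",
    re.IGNORECASE | re.DOTALL,
)


def validate_vetting(raw_body):
    """Return (http_status, message) for a posted vetting body (a string)."""
    # 1. Invalid JSON -> 400
    try:
        body = json.loads(raw_body)
    except ValueError:
        return 400, "invalid JSON body"
    if not isinstance(body, dict):
        return 400, "invalid JSON body"

    # 2. Missing or unknown apiKey -> 403
    api_key = body.get("apiKey")
    if not isinstance(api_key, str) or api_key not in VALID_API_KEYS:
        return 403, "invalid or missing apiKey"

    # 3. Species name must exist in the current system -> otherwise 400
    species = body.get("species")
    if not isinstance(species, str) or species not in KNOWN_SPECIES:
        return 400, "unknown species name"

    # 4. Area must look like WKT -> otherwise 400
    area = body.get("area")
    if not isinstance(area, str) or not WKT_PATTERN.match(area):
        return 400, "area is not valid WKT"

    # Success status is an assumption; the post only specifies the error codes.
    return 200, "ok"
```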

Insert/Update

When a new vetting is inserted, a first load date is populated. This date is never updated. The purpose of this date is to provide a context for “historic” vettings. In the future the ALA may provide additional QAs around records that appear in a “historic” region after the vetting's first load date.

Each vetting that is posted to the web service will be stored in the database as raw JSON. Other fields populated include a last modified date and a query. The query will be constructed using the species name and the WKT for the area. This query will be used to identify the records that are considered part of the vetting.
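
As a rough sketch of the insert/update behaviour described above, the following uses a plain dictionary in place of the database. The field names, the vetting id, and especially the query syntax produced by build_query are hypothetical; the post does not specify the actual query format.

```python
import json
from datetime import datetime, timezone


def build_query(species, area_wkt):
    # Hypothetical query syntax, for illustration only.
    return 'species:"{}" AND wkt:"{}"'.format(species, area_wkt)


def upsert_vetting(store, vetting_id, raw_json, now=None):
    """Insert or update a vetting in `store` (a dict standing in for the database)."""
    now = now or datetime.now(timezone.utc)
    body = json.loads(raw_json)
    existing = store.get(vetting_id)
    store[vetting_id] = {
        # The posted body is kept as raw JSON.
        "raw_json": raw_json,
        # Set once on insert and never changed on later updates.
        "first_load_date": existing["first_load_date"] if existing else now,
        # Refreshed on every insert or update.
        "last_modified_date": now,
        # Rebuilt from the species name and the WKT area.
        "query": build_query(body["species"], body["area"]),
    }
    return store[vetting_id]
```

Posting the same vetting id twice would leave first_load_date at the time of the first call, while last_modified_date and the query follow the most recent body.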

Deleting

When a “delete” is issued against an existing vetting it is marked as deleted in the database. It is not physically deleted until the action is filtered through to the ALA data. A "deleted" assertion can NOT be resurrected.
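
The delete behaviour amounts to a soft delete followed by a later physical removal. A minimal sketch, assuming a dictionary as the store and hypothetical function and field names:

```python
class DeletedVettingError(Exception):
    """Raised when something tries to modify a vetting already marked as deleted."""


def delete_vetting(store, vetting_id):
    # Soft delete: only flag the vetting, the row stays in the store for now.
    store[vetting_id]["deleted"] = True


def update_vetting(store, vetting_id, raw_json):
    record = store[vetting_id]
    # A deleted vetting cannot be resurrected by a later update.
    if record.get("deleted"):
        raise DeletedVettingError(vetting_id)
    record["raw_json"] = raw_json


def purge_applied_deletes(store, applied_ids):
    """Physically remove deleted vettings whose removal has reached the ALA data.

    `applied_ids` is assumed to be supplied by whatever process applies the
    vettings to the records.
    """
    for vetting_id in list(applied_ids):
        if store.get(vetting_id, {}).get("deleted"):
            del store[vetting_id]
```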

Applying Vettings to ALA Data

The exact process that will be used to apply the vettings to the ALA data is not yet known. It is expected to be a batch process, run nightly, that updates records based on the queries that were generated for each vetting.

New/updated vettings that have not been pushed through to the ALA data will be applied to all records that satisfy the query. Old vettings will be applied to records that have been inserted/modified since the previous batch process.
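
Since the process is not yet settled, the following is only a sketch of the selection rule in the paragraph above. The record and vetting fields, the last_batch_time marker, and the matches callback (standing in for running the stored query against a record) are all assumptions.

```python
def records_to_update(vetting, records, last_batch_time, matches):
    """Pick the records a vetting should be applied to in one batch run.

    `matches(query, record)` is a hypothetical callback that evaluates the
    vetting's stored query against a single record.
    """
    if vetting["last_modified_date"] > last_batch_time:
        # New or updated vetting: consider every record.
        candidates = records
    else:
        # Old vetting: only records inserted or modified since the last run.
        candidates = [r for r in records if r["last_modified"] > last_batch_time]
    return [r for r in candidates if matches(vetting["query"], r)]
```

In practice the candidate filtering would more likely be pushed into the search index query than done in memory; the list comprehension here just makes the rule explicit.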