Wednesday, October 24, 2012

Final Product Post

1. Introductory Product Information

The web services developed as part of this project are extensions to a product known as the Biocache. This tool is the ALA's aggregation software for specimen and species observation data.

This product provides data access, including mapping facilities, for a number of external portals (OZCAM, AVH, AMRiN) as well as the main ALA website. These portals all require search capabilities.

The primary focus of the Biocache is to:
  • aggregate occurrence data from multiple sources
  • provide data quality checks and cleaning of the data
  • support assertions about the data made by software or people
  • provide webservice access to this data to facilitate re-use in other portals.

2. Instructional Product Information

This product is intended to provide bulk data access to occurrence data to enable the JCU Edgar team to develop a portal for vetting occurrence data. In addition, this project has developed a number of data quality processes and services for accessing the results of these offline processes. By their nature, these processes generally require scanning across the entire index to analyse records.

These services are listed in their entirety here:

These services are largely REST based with JSON outputs.

Bulk occurrence (localities) downloads

This service provides the ability to download occurrence records. The download will include all records that satisfy the q, fq and wkt parameters. The number of records for a data resource may be restricted based on a collectory configured download limit. Params:

  • q - the initial query. "q=*:*" will query anything, q="macropus" will do a free text search for "macropus", q=kingdom:Fungi will search for records with a kingdom of Fungi.
  • fq - filters to be applied to the original query. These are additional params of the form fq=INDEXEDFIELD:VALUE e.g. fq=kingdom:Fungi
  • wkt - filter polygon area to be applied to the original query. For information on Well Known Text, see this
  • email - the email address of the user requesting the download
  • reason - the reason for the download
  • file - the name to use for the file to download
  • fields - a CSV list of fields to include in the download (has a default list)
  • extra - a CSV list of fields to include in addition to the "fields"

Example: /occurrences/download?q=genus:Macropus will download all records for the genus Macropus
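As a rough illustration of how the parameters above combine, the following Python sketch assembles a download URL. The base URL https://biocache.example.org/ws and the helper name build_download_url are placeholders invented for this example; only the /occurrences/download path and the parameter names come from the description above. It also assumes that fq can be repeated, once per filter.

```python
from urllib.parse import urlencode

# Hypothetical base URL: substitute the address of the Biocache
# web services you are actually calling.
BASE_URL = "https://biocache.example.org/ws"


def build_download_url(q, fqs=(), wkt=None, email=None, reason=None,
                       file=None, fields=None, extra=None):
    """Assemble an /occurrences/download URL from the documented parameters."""
    params = [("q", q)]
    # Assumption: each filter is sent as its own fq parameter.
    params += [("fq", fq) for fq in fqs]
    optional = {"wkt": wkt, "email": email, "reason": reason, "file": file}
    params += [(name, value) for name, value in optional.items()
               if value is not None]
    # fields and extra are comma-separated lists of field names.
    if fields:
        params.append(("fields", ",".join(fields)))
    if extra:
        params.append(("extra", ",".join(extra)))
    # urlencode percent-encodes reserved characters such as ":" and ",".
    return f"{BASE_URL}/occurrences/download?{urlencode(params)}"


url = build_download_url("genus:Macropus",
                         fqs=["kingdom:Animalia"],
                         file="macropus")
print(url)
```

Building the query string with urlencode rather than by hand keeps WKT strings and free-text queries correctly escaped.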
For a full list of indexed fields, see:

Occurrence deletions

This service allows the retrieval of record identifiers for records that have been removed from the system since a certain date. This allows systems to keep in sync with the BioCache.

 /occurrence/deleted?date=yyyy-MM-dd. This service will return a list of occurrence UUIDs that have been deleted since the supplied date (inclusive).
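A client that mirrors Biocache records could use this service along the following lines. This is only a sketch: the function names are made up for illustration, the response is assumed to have already been parsed into a plain list of UUID strings, and the local store is modelled as a dictionary keyed by UUID.

```python
from datetime import date


def deleted_since_path(since):
    """Return the request path for records deleted on or after `since`."""
    # date.isoformat() yields YYYY-MM-DD, matching the yyyy-MM-dd pattern.
    return f"/occurrence/deleted?date={since.isoformat()}"


def apply_deletions(local_records, deleted_uuids):
    """Drop locally held records whose UUID was reported as deleted."""
    deleted = set(deleted_uuids)
    return {uuid: record for uuid, record in local_records.items()
            if uuid not in deleted}


# Illustrative data only; real identifiers are occurrence UUIDs.
local = {"uuid-a": {"species": "Motacilla flava"},
         "uuid-b": {"species": "Macropus rufus"}}
print(deleted_since_path(date(2012, 10, 1)))
print(apply_deletions(local, ["uuid-b"]))
```

Because the supplied date is inclusive, a client can safely re-request from the date of its last successful sync; removing an already removed UUID is a no-op.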

In addition to this, services were added to the BioCache to allow retrieval of records added after a certain date.

Environmental outliers

Once outlier detection has been run, it is possible to retrieve the results using these services:

The following will retrieve records that have been marked as environmental outliers for the environmental surface Precipitation - driest quarter (Bio17): *:*&fq=outlier_layer:el882&pageSize=10&facet=off

The following URL will retrieve records that have been marked as environmental outliers for 2 to 5 surfaces (the range is inclusive): *:*&fq=outlier_layer_count:%5B2%20TO%205%5D&pageSize=10&facet=off
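The percent-encoded filter in the second query can be hard to read. The sketch below decodes it and shows one way to build the same query string in Python. The helper name outlier_query is hypothetical, and the q= prefix is an assumption based on the *:* at the start of the query; the endpoint path is not shown because it does not appear above.

```python
from urllib.parse import quote, unquote, urlencode

# Decode the filter used above to see the underlying range query.
encoded = "outlier_layer_count:%5B2%20TO%205%5D"
print(unquote(encoded))


def outlier_query(min_layers, max_layers, page_size=10):
    """Build a query string selecting records flagged as outliers
    on between min_layers and max_layers surfaces (inclusive)."""
    params = {
        "q": "*:*",  # assumption: the leading *:* is the q parameter
        "fq": f"outlier_layer_count:[{min_layers} TO {max_layers}]",
        "pageSize": page_size,
        "facet": "off",
    }
    # quote (not quote_plus) encodes spaces as %20; "*" and ":" are left as-is.
    return urlencode(params, quote_via=quote, safe="*:")


print(outlier_query(2, 5))
```

Square brackets denote an inclusive range, so [2 TO 5] matches records flagged on 2, 3, 4 or 5 surfaces.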

Duplication detection

Duplicate detection is run across all records in an offline process that takes several hours to complete. Once complete, this webservice can be used to retrieve the details of a duplicate:
A full blog post on the duplicate detection is here

Submitting Vetting information

A set of webservices were developed to support the submission of vetting information from the Edgar project.

  • Add Query Assertion
    To add a query assertion, POST a JSON body to the following URL: /assertions/query/add.
    Example JSON body with a WKT string:
    { "id": 4, "apiKey":"sharedkey", "status": "modified", "comment": "mah comment", "classification": "breeding", "lastModified": "2012-07-23T16:34:34", "species": "Motacilla flava", "user": { "email": "", "authority": 1234 }, "area": "MULTIPOLYGON(((20 10,19.8078528040323 8.04909677983872,19.2387953251129 6.17316567634911,18.3146961230255 4.44429766980398,17.0710678118655 2.92893218813453,15.555702330196 1.68530387697455,13.8268343236509 0.761204674887138,11.9509032201613 0.192147195967697,10 0,8.04909677983873 0.192147195967692,6.17316567634912 0.761204674887125,4.444297669804 1.68530387697453,2.92893218813454 2.92893218813451,1.68530387697456 4.44429766980396,0.761204674887143 6.17316567634908,0.192147195967701 8.04909677983869,0 9.99999999999997,0.192147195967689 11.9509032201612,0.761204674887118 13.8268343236509,1.68530387697453 15.555702330196,2.9289321881345 17.0710678118654,4.44429766980394 18.3146961230254,6.17316567634906 19.2387953251129,8.04909677983868 19.8078528040323,9.99999999999996 20,11.9509032201612 19.8078528040323,13.8268343236509 19.2387953251129,15.555702330196 18.3146961230255,17.0710678118655 17.0710678118655,18.3146961230254 15.555702330196,19.2387953251129 13.8268343236509,19.8078528040323 11.9509032201613,20 10)))" }
  • View Query Assertion details
    This service will return the assertion information. It will NOT return the details of the query.
  • View Query Assertions details
    This service will return the information for all the listed assertions. It will NOT return the details of the queries.
    /assertions/queries/{csv list of uuid}
  • Apply Query Assertion
    This service will apply the supplied query assertion against the biocache records.
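For the Add Query Assertion service, a client might prepare the POST request roughly as follows. This is a sketch under stated assumptions: the base URL is a placeholder, the helper name is invented, the Content-Type header is assumed rather than documented, and the WKT area is a simple square standing in for a real vetted region. The request is built but not sent.

```python
import json
from urllib.request import Request

# Hypothetical base URL: replace with the real web service address.
BASE_URL = "https://biocache.example.org/ws"


def build_add_assertion_request(assertion):
    """Wrap an assertion dictionary in a POST request to /assertions/query/add."""
    body = json.dumps(assertion).encode("utf-8")
    return Request(
        f"{BASE_URL}/assertions/query/add",
        data=body,
        # Assumption: the service expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Field names follow the example JSON body above; values are illustrative.
assertion = {
    "id": 4,
    "apiKey": "sharedkey",
    "status": "modified",
    "comment": "example comment",
    "classification": "breeding",
    "lastModified": "2012-07-23T16:34:34",
    "species": "Motacilla flava",
    "user": {"email": "", "authority": 1234},
    "area": "MULTIPOLYGON(((0 0,0 20,20 20,20 0,0 0)))",
}

request = build_add_assertion_request(assertion)
print(request.get_method(), request.full_url)
# To actually submit it: urllib.request.urlopen(request)
```

Serialising with json.dumps avoids quoting mistakes that are easy to make when a long WKT string is pasted into a hand-written JSON body.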

3. Product Re-usability Information

The web services developed as part of this project are open access and are available to be incorporated into other websites and toolsets. In addition, all of this software is built on an open source stack (Apache Cassandra, Apache SOLR), so it is "free" to set up in other environments. The code developed by the Atlas for this project is accessible in this Google Code repository:

4. Contextual Product Information

This code is developed in the Scala and Java programming languages and is open source under the Mozilla Public Licence 1.1.
