Wednesday, October 24, 2012

Final Product Post

1. Introductory Product Information


The web services developed as part of this project are extensions to the product known as the Biocache. This tool is the ALA's aggregation software for specimen and species observation data.

This product provides data access, including mapping facilities, for a number of external portals (OZCAM, AVH, AMRiN) as well as the main ALA website. All of these portals require search capabilities.

The primary focus of the Biocache is to:
  • aggregate occurrence data from multiple sources
  • provide data quality checks and cleaning of the data
  • support assertions against the data made by software or people
  • provide web service access to this data to facilitate re-use in other portals.

2. Instructional Product Information

This product is intended to provide bulk data access to occurrence data to enable the JCU Edgar team to develop a portal for vetting occurrence data. In addition, this project has developed a number of data quality processes and services for accessing the results of these offline processes. By their nature, these processes generally require scanning across the entire index to analyse records.

These services are listed in their entirety here: http://biocache.ala.org.au/ws

These services are largely REST based with JSON outputs.

Bulk occurrence (localities) downloads

This service provides the ability to download occurrence records. The download will include all records that satisfy the q, fq and wkt parameters. The number of records for a data resource may be restricted by a download limit configured in the collectory. Params:

  • q - the initial query. "q=*:*" will query anything, q=macropus will do a free text search for "macropus", and q=kingdom:Fungi will search for records with a kingdom of Fungi.
  • fq - filters to be applied to the original query. These are additional params of the form fq=INDEXEDFIELD:VALUE, e.g. fq=kingdom:Fungi
  • wkt - a polygon filter to be applied to the original query. For information on Well Known Text, see this
  • email - the email address of the user requesting the download
  • reason - the reason for the download
  • file - the name to use for the file to download
  • fields - a CSV list of fields to include in the download (a default list is used if this is not supplied)
  • extra - a CSV list of fields to include in addition to the "fields"

Example: /occurrences/download?q=genus:Macropus will download all records for the genus Macropus
For a full list of indexed fields, see: http://biocache.ala.org.au/ws/index/fields
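
As an illustration, here is a minimal Scala sketch that calls the download service and saves the response to a local file. The email, reason and file values are placeholders, and the assumption that the response is a zip archive should be verified against the service documentation.

    import java.io.FileOutputStream
    import java.net.{URL, URLEncoder}

    object BiocacheDownload extends App {
      def enc(s: String) = URLEncoder.encode(s, "UTF-8")
      // placeholder values; substitute a real email address and download reason
      val params = Map(
        "q"      -> "genus:Macropus",
        "email"  -> "test@test.com",
        "reason" -> "testing the download service",
        "file"   -> "macropus"
      )
      val query = params.map { case (k, v) => s"$k=${enc(v)}" }.mkString("&")
      val in    = new URL(s"http://biocache.ala.org.au/ws/occurrences/download?$query").openStream()
      val out   = new FileOutputStream("macropus.zip") // assumed zip payload
      val buf   = new Array[Byte](8192)
      Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(n => out.write(buf, 0, n))
      in.close(); out.close()
    }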

Occurrence deletions

This service allows the retrieval of record identifiers for records that have been removed from the system since a given date. This allows external systems to keep in sync with the Biocache.


 /occurrence/deleted?date=yyyy-MM-dd. This service will return a list of occurrence UUIDs that have been deleted since the supplied date (inclusive).


In addition to this, services were added to the Biocache to allow retrieval of records added after a certain date.
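
A minimal Scala sketch of a synchronisation step using the deletions service. It assumes the response is a JSON array of UUID strings; the actual payload shape should be confirmed against the live service.

    import scala.io.Source

    object DeletedRecords extends App {
      val since = "2012-08-01" // hypothetical date of the last local sync
      val json  = Source.fromURL(
        s"http://biocache.ala.org.au/ws/occurrence/deleted?date=$since", "UTF-8").mkString
      // naive extraction of quoted UUIDs from the (assumed) JSON array
      val uuids = "\"([0-9a-fA-F-]{36})\"".r.findAllMatchIn(json).map(_.group(1)).toList
      uuids.foreach(uuid => println(s"remove $uuid from local cache"))
    }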



Environmental outliers

Once outlier detection has been run, it is possible to retrieve the results using these services:

The following will retrieve records that have been marked as environmental outliers for the environmental surface Precipitation - driest quarter (Bio17):

http://biocache.ala.org.au/ws/occurrences/search?q=*:*&fq=outlier_layer:el882&pageSize=10&facet=off

The following URL will retrieve records that have been marked as environmental outliers for between 2 and 5 surfaces:

http://biocache.ala.org.au/ws/occurrences/search?q=*:*&fq=outlier_layer_count:%5B2%20TO%205%5D&pageSize=10&facet=off
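
Note that the square brackets and space in the range filter are URL-encoded (%5B, %20, %5D). A small Scala sketch that builds the same request programmatically, letting the standard library handle the encoding:

    import java.net.URLEncoder
    import scala.io.Source

    object OutlierSearch extends App {
      val fq  = URLEncoder.encode("outlier_layer_count:[2 TO 5]", "UTF-8")
      val url = s"http://biocache.ala.org.au/ws/occurrences/search?q=*:*&fq=$fq&pageSize=10&facet=off"
      println(Source.fromURL(url, "UTF-8").mkString) // JSON search result
    }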

Duplication detection

Duplicate detection is run across all records in an offline process that takes several hours to complete. Once complete, this web service can be used to retrieve the details of a duplicate:
A full blog post on the duplicate detection is here
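
A one-line Scala sketch to fetch the duplicate details for a record; the UUID is the example representative record used in the July duplicate detection post below.

    import scala.io.Source

    object DuplicateDetails extends App {
      val uuid = "3cde1570-7a38-4a58-b121-e95c35585a29" // example representative record
      println(Source.fromURL(s"http://biocache.ala.org.au/ws/duplicates/$uuid", "UTF-8").mkString)
    }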

Submitting Vetting information

A set of web services was developed to support the submission of vetting information from the Edgar project.

  • Add Query Assertion
    To add a query, POST a JSON body to the following URL: /assertions/query/add (a Scala sketch of this POST appears after this list).
    Example JSON body with a WKT string:
    {
      "id": 4,
      "apiKey": "sharedkey",
      "status": "modified",
      "comment": "mah comment",
      "classification": "breeding",
      "lastModified": "2012-07-23T16:34:34",
      "species": "Motacilla flava",
      "user": { "email": "test@test.com", "authority": 1234 },
      "area": "MULTIPOLYGON(((20 10,19.8078528040323 8.04909677983872,19.2387953251129 6.17316567634911,18.3146961230255 4.44429766980398,17.0710678118655 2.92893218813453,15.555702330196 1.68530387697455,13.8268343236509 0.761204674887138,11.9509032201613 0.192147195967697,10 0,8.04909677983873 0.192147195967692,6.17316567634912 0.761204674887125,4.444297669804 1.68530387697453,2.92893218813454 2.92893218813451,1.68530387697456 4.44429766980396,0.761204674887143 6.17316567634908,0.192147195967701 8.04909677983869,0 9.99999999999997,0.192147195967689 11.9509032201612,0.761204674887118 13.8268343236509,1.68530387697453 15.555702330196,2.9289321881345 17.0710678118654,4.44429766980394 18.3146961230254,6.17316567634906 19.2387953251129,8.04909677983868 19.8078528040323,9.99999999999996 20,11.9509032201612 19.8078528040323,13.8268343236509 19.2387953251129,15.555702330196 18.3146961230255,17.0710678118655 17.0710678118655,18.3146961230254 15.555702330196,19.2387953251129 13.8268343236509,19.8078528040323 11.9509032201613,20 10)))"
    }
  • View Query Assertion details
    This service will return the assertion information. It will NOT return the details of the query.
    /assertions/query/{uuid}
  • View Query Assertions details
    This service will return the information for all the listed assertions. It will NOT return the details of the queries.
    /assertions/queries/{csv list of uuid}
  • Apply Query Assertion
    This service will apply the supplied query assertion against the biocache records.
    /assertions/query/{uuid}/apply
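
For illustration, here is a minimal Scala sketch of the Add Query Assertion POST. The apiKey value, the "new" status and the simple square polygon are placeholders only; a real submission requires a valid shared key.

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets

    object AddQueryAssertion extends App {
      // placeholder body: apiKey, status and area are illustrative values only
      val body =
        """{"apiKey":"sharedkey","status":"new","comment":"test vetting",
          |"classification":"breeding","lastModified":"2012-07-23T16:34:34",
          |"species":"Motacilla flava",
          |"user":{"email":"test@test.com","authority":1234},
          |"area":"POLYGON((140 -35,145 -35,145 -30,140 -30,140 -35))"}""".stripMargin

      val conn = new URL("http://biocache.ala.org.au/ws/assertions/query/add")
        .openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      conn.setDoOutput(true)
      conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
      println(s"HTTP ${conn.getResponseCode}") // 400 = invalid JSON/WKT, 403 = invalid apiKey
      conn.disconnect()
    }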

3. Product Re-usability Information

The web services developed as part of this project are open access and are available to be incorporated into other websites and toolsets. In addition, all of this software is built on an open source stack (Apache Cassandra, Apache SOLR), so it is "free" to set up in other environments. The code developed by the Atlas for this project is accessible in this Google Code repository: http://code.google.com/p/ala-portal/


4. Contextual Product Information

This code was developed in the Scala and Java programming languages and is open source under the Mozilla Public Licence 1.1.



Wednesday, August 22, 2012

KEY FACTORS CUSTOMERS WILL USE TO JUDGE THE VALUE OF OUR PRODUCT

One of the most important aspects of the AP30 project is fast access to large numbers of occurrence records at a taxon-by-taxon level. As an example, at the time of writing there are 420,039 distinct records in the Atlas for the Australian Magpie. All of these individual records will need to be retrieved and cached by the Edgar project to facilitate the vetting tool Edgar is developing. Edgar is focussing on bird distributions and modelling. Currently, there are 1,986 taxonomic concepts for Australian birds supplied by the Australian Faunal Directory.

In addition, the persistence of vettings against the data in the Atlas will mean other tools and portals will benefit from the improved data quality. Typically researchers who work with these data will have to undertake a complete gathering of the data and a cleaning process to remove duplicate and erroneous records. This data cleaning is typically not shared or persisted with the source data, leading to duplication of effort within the research community.

By submitting the vettings to the Atlas, Edgar will be sharing the improved data quality with any researcher accessing these data through the Atlas.

HOW THE PRODUCT WILL MEET OUR USERS' NEEDS


The primary customers for this project are the Edgar team who we are working with to produce a platform to support:
  • integration of occurrence data for presentation to users on a map interface
  • the ability to filter environmental outliers
  • the ability to filter duplicate records
  • persistence of record vettings, and application of the vettings to records within a polygon 
However, once these services and data processing are in place, they will benefit the ALA and additional portals wishing to access occurrence data. This includes portals such as the Online Zoological Collections of Australian Museums, which is built upon Atlas web services.



KEY TECHNOLOGIES and FEATURES

The stack we are putting together to support AP30 requirements includes:

  • Apache Cassandra Database. The database will house the full record details and will store the results of duplicate & outlier detection. It will also provide the persistence for the record vettings provided by Edgar
  • Apache SOLR search indexes. These indexes will support the searching capabilities required by Edgar.
  • A processing chain implemented in Scala. This will include the algorithms for detecting duplicate records and environmental outliers. This custom code will then update search indexes to allow Edgar to filter for non-duplicates, and non-outliers hence improving the quality of the model and reducing the number of records to be vetted. The code for this component is accessible in the google code repository http://code.google.com/p/ala-portal/
  • Java Spring MVC web services. These web services will provide the interface for the Edgar project to download snapshots of data for modelling and vetting purposes. They will also provide a write interface for submission of the vettings of bird records by expert users. The code for this component is accessible in the google code repository http://code.google.com/p/ala-portal/. An additional important functional requirement is for Edgar to be able to query for record deletions. Services will be developed to allow Edgar to keep track of these deletions that occur periodically when the ALA harvests from data providers.
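
Putting these pieces together, a client such as Edgar can exclude duplicates and outliers in a single filtered query. A hedged example using the filter fields documented elsewhere in this blog (the exact form of the "no outliers" filter is an assumption):

    http://biocache.ala.org.au/ws/occurrences/search?q=genus:Macropus&fq=-duplicate_status:D&fq=-outlier_layer_count:%5B1%20TO%20*%5D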

Wednesday, August 15, 2012

PROJECT DESCRIPTION

Simply stated, the goals of the AP30 project are to:

  • Provide bulk data access services to assist external software projects requiring biodiversity occurrence data. 
  • Provide web services for the persistence of annotations against occurrence records.
  • Produce data quality processes to improve the quality of the occurrence data.

Primarily, these services will be targeted at meeting the needs of the AP03 project Edgar, but once developed, they will aid the development of other biodiversity portals and analytical tools.

The annotations provided by Edgar will also persist with the occurrence data. This will aid other researchers in the removal of records not suitable for distribution modelling.





Tuesday, July 31, 2012

PROJECT OUTPUTS #3 - VETTING SERVICES


The ALA has a web service that accepts a POST request with a JSON body that contains the information to support a vetting.   

The URL for the service is: 
http://biocache.ala.org.au/ws/assertions/query/add  

Example JSON for the POST body:
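
A representative body, mirroring the worked example in the Final Product Post above (the "area" value is a WKT string, abbreviated here; the full MULTIPOLYGON appears in that post):

    {
      "id": 4,
      "apiKey": "sharedkey",
      "status": "modified",
      "comment": "mah comment",
      "classification": "breeding",
      "lastModified": "2012-07-23T16:34:34",
      "species": "Motacilla flava",
      "user": { "email": "test@test.com", "authority": 1234 },
      "area": "MULTIPOLYGON(((20 10, ... ,20 10)))"
    }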

Validating the supplied information

When the JSON body is invalid, an HTTP Bad Request (400) will be returned.

When an invalid apiKey, or no apiKey, is provided, an HTTP Forbidden (403) will be returned.

Otherwise the supplied information is validated in two ways: first, ensuring that the species name exists in the current system; and second, ensuring that the area is in a valid WKT format. If either of these checks fails, an HTTP Bad Request (400) is returned with a message indicating the issue.

Insert/Update   

When inserting a new validation, a first-load date is populated. This date is never updated. The purpose of this date is to provide a context for "historic" vettings. In the future the ALA may provide additional QA checks for records that appear in a "historic" region after the vetting's first-load date.

Each vetting that is posted to the web service will be stored in the database as raw JSON. Other fields populated include a last-modified date and a query. The query is constructed using the species name and the WKT for the area. This query will be used to identify the records that are considered part of the vetting.
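
For illustration only, the stored query for the example vetting above might take a form such as the following, where taxon_name is an assumed index field (the exact field used to build the query is not documented in this post):

    q=taxon_name:"Motacilla flava"&wkt=MULTIPOLYGON(((20 10, ... ,20 10)))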

Deleting  

When a “delete” is issued against an existing vetting it is marked as deleted in the database.  It is not physically deleted until the action is filtered through to the ALA data.  A "deleted" assertion can NOT be resurrected.     

Applying Vettings to ALA Data

The exact process that will be used to apply the vettings to the ALA data is not yet known. It will be a batch process, run nightly, that updates records based on the queries that were generated for each vetting.

New/updated vettings that have not been pushed through to the ALA data will be applied to all records that satisfy the query.  Old vettings will be applied to records that have been inserted/modified since the previous batch process.

Thursday, July 12, 2012

PROJECT OUTPUTS #2 - DUPLICATE DETECTION

As a part of a range of data quality checks, the Atlas has identified potential duplicate records within the Biocache. This allows users to discard duplicate records from searches, analysis and mapping where this is appropriate. A discussion on duplicate records is available here

The ALA uses the scientific name, decimal latitude, decimal longitude, collector and collection date to indicate potential duplicate records: 
  • Records are considered for a species (with synonyms mapped to the accepted name).
  • Collection dates are duplicates when the individual components are identical (year, month and day). Empty values are considered the same.
  • Collector names are compared; if one is null it is considered a duplicate, otherwise a Levenshtein distance is calculated, with an acceptable threshold indicating a duplicate (see the sketch after this list)
  • Latitudes and Longitudes are duplicates when they are identical at the same precision. Null values are excluded from consideration.
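
To make the collector comparison concrete, here is a minimal Scala sketch of a Levenshtein-based check. The threshold of 3 is a hypothetical value, not the ALA's actual setting.

    object CollectorMatch extends App {
      // classic dynamic-programming Levenshtein edit distance
      def levenshtein(a: String, b: String): Int = {
        val d = Array.tabulate(a.length + 1, b.length + 1)((i, j) => if (j == 0) i else if (i == 0) j else 0)
        for (i <- 1 to a.length; j <- 1 to b.length) {
          val cost = if (a(i - 1) == b(j - 1)) 0 else 1
          d(i)(j) = (d(i - 1)(j) + 1) min (d(i)(j - 1) + 1) min (d(i - 1)(j - 1) + cost)
        }
        d(a.length)(b.length)
      }

      // a missing collector is treated as a duplicate; otherwise compare edit distance
      def collectorsMatch(c1: Option[String], c2: Option[String]): Boolean = (c1, c2) match {
        case (None, _) | (_, None) => true
        case (Some(a), Some(b))    => levenshtein(a.toLowerCase, b.toLowerCase) <= 3 // hypothetical threshold
      }

      println(collectorsMatch(Some("N. Carter"), Some("N Carter")))  // true
      println(collectorsMatch(Some("N. Carter"), Some("D. Martin"))) // false
    }
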
When a group of records is identified as duplicates, one needs to be identified as the "representative" record. The representative record can be used to represent the entire group of duplicates and should not itself be considered a duplicate.

Here is an example "representative" record in the ALA: http://biocache.ala.org.au/occurrence/3cde1570-7a38-4a58-b121-e95c35585a29#inferredOccurrenceDetails:


There is a JSON web service that returns the complete information that was used in the duplication detection:
http://biocache.ala.org.au/ws/duplicates/3cde1570-7a38-4a58-b121-e95c35585a29

It is possible to perform a search for all the records that are duplicates of a specific representative record: http://biocache.ala.org.au/occurrence/search?q=duplicate_record:3cde1570-7a38-4a58-b121-e95c35585a29

Duplicate records can be excluded from the queries submitted to the biocache by applying the negative filter &fq=-duplicate_status:D.  Example: http://biocache.ala.org.au/occurrences/search?taxa=Mountain%20Thornbill&fq=-duplicate_status:D

Tuesday, June 26, 2012

PROJECT OUTPUTS #1 - ENVIRONMENTAL OUTLIER DETECTION

One of the goals of this project is to provide a web service layer on top of occurrence data that supports the Edgar project. This includes:
  • Bulk access to occurrence records, including access to sensitive records not visible to the public but required for species distribution modelling purposes.
  • New data quality control methods, and error reporting
  • Ingestion of feedback and additional locality information
In this post, we'll talk about the first two points and what we have done thus far.

Bulk access

We have developed two services that allow downloading of occurrence records: downloadFromDB and downloadFromIndex.
Both of these services give bulk access. The downloadFromDB service gives a wider range of metadata associated with the record, whereas downloadFromIndex gives the subset of the data that is typically of use for researchers and anyone modelling species distributions (including scientific name, latitude and longitude).

The latter was developed to support Edgar's need for faster bulk downloads.  To support the Edgar project's need to maintain a separate local cache of the data, we have also developed a service so that deleted records can be tracked:
When records are ingested by the Atlas, a UUID is issued for each record, and the Atlas keeps track of the properties within that record that make it unique. This is typically an ID of some sort, but it may be, for example, a combination of lat/long, date and species name.

Lists of species can be retrieved from services listed here:
The service http://bie.ala.org.au/search.json is currently in use by Edgar to retrieve a list of species which then drives the data harvesting. Edgar is also making use of services to retrieve LSIDs, the identifier the ALA is using for a taxon.

Data quality

As part of the AP30 project, the ALA has developed some data quality methods, and exposed the outputs to aid the modelling work in Edgar.

[Image: Mountain Thornbill, by Tom Tarrant, rights: Attribution-NonCommercial-ShareAlike]


For each species, we are processing data to detect outlier records. This involves loading all points for a single species, intersecting these points with 5 chosen environmental surfaces, and running an algorithm known as Reverse Jackknife.
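
The Reverse Jackknife implementation itself lives in the Scala processing chain. As a rough illustration of the leave-one-out idea behind jackknife-style outlier tests, here is a toy Scala sketch; this is a simplified stand-in, not the ALA's actual algorithm, and the threshold of 3 standard deviations and the sample values are invented.

    object LeaveOneOutOutliers extends App {
      // flag a value as an outlier if it lies more than k standard deviations
      // from the mean of the *other* values (leave-one-out)
      def outliers(values: Seq[Double], k: Double = 3.0): Seq[Double] =
        values.filter { v =>
          val rest = values.diff(Seq(v))
          val mean = rest.sum / rest.size
          val sd   = math.sqrt(rest.map(x => (x - mean) * (x - mean)).sum / rest.size)
          sd > 0 && math.abs(v - mean) > k * sd
        }

      // invented environmental values (e.g. Bio17 intersections) for one species
      val bio17 = Seq(12.0, 14.0, 13.5, 12.8, 13.1, 95.0)
      println(outliers(bio17)) // List(95.0)
    }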


Here's an example record in the ALA:

http://biocache.ala.org.au/occurrence/b07bbac2-22d7-4c8a-8d61-4be1ab9e0d09

The details for this record are here (in JSON format):

http://biocache.ala.org.au/ws/outlier/record/b07bbac2-22d7-4c8a-8d61-4be1ab9e0d09

And the overall details of the tests for this species are accessible via the LSID, like so (JSON again):

http://biocache.ala.org.au/ws/outlierInfo/urn:lsid:biodiversity.org.au:afd.taxon:0c139726-2add-4abe-a714-df67b1d4b814.json

The tests use 5 environmental layers that have been chosen for their suitability for testing with this algorithm. There's a write-up of this work available here.

Wednesday, June 13, 2012

PRODUCT TEAM

Peter Doherty - Project Manager

Peter has been the Program Officer for the Atlas since 2009.

Dave Martin - Software Architect

Dave has been working in biodiversity informatics for 6 years, and prior to this has worked on healthcare and banking systems. He started working for the Atlas in 2008. Prior to this, he worked for the Global Biodiversity Information Facility (GBIF) in Copenhagen. He has skills in GIS, databases (relational, nosql), Lucene, Java, Scala, Groovy/Grails.

Miles Nicholls - Data Manager

With a background as a business analyst in data warehousing and business intelligence, Miles has been working with the Atlas since late 2009 as Data Manager. Miles has qualifications in science (although never used in anger and frighteningly out of date) and information systems, and thinks it's great that the ALA combines the two. Miles' work with the ALA involves discussing data sharing, open access licensing, data schemas and formats with the owners of data, and transforming data using whatever tool will do the best job at the time.

Natasha Carter - Software Developer

Natasha has been working as a software engineer for 8 years.  She works for a small company that specialises in providing data integration solutions. She is experienced in the design and implementation of solutions using a variety of tools and technologies including:
  • Java/Scala
  • RDBMS - MySQL, Postgres, Oracle, Ingres and SQLServer
  • NoSQL - Cassandra
Natasha has been involved in the ALA since December 2009 contributing to the data and service layers.

Nick dos Remedios - Software Developer

Nick has been with the Atlas of Living Australia since 2008 and has been working as a software developer since 1999. Prior to that he worked as an immunologist/molecular biologist/bioinformatician. He has experience in the areas of airline logistics, patent informatics and biodiversity informatics. He often gets pigeon-holed as a front-end developer and spends most of his time coding in Java, Groovy, Javascript and HTML/CSS. Favourite frameworks/APIs include Grails, Spring MVC and jQuery.

The project

In conjunction with ANDS Project AP03, this project will collaborate closely with James Cook University (JCU) to produce an online tool that allows users to view current observations and distribution maps for Australian bird species as well as view predicted future distributions taking into account climate change.

This project (AP30) will provide the back-end functionality for the project, i.e., web services and/or bulk data downloads, to:
  • deliver new web services making the locality information held by the ALA available to AP03,
  • deliver new data quality control/cleansing over the data as defined with project AP03,
  • deliver new web services allowing for ingestion of feedback and additional locality information from AP03 into the ALA records, and
  • deliver additional data quality web services of relevance to a wider research base than those targeted for AP03.