Tuesday, June 26, 2012

PROJECT OUTPUTS #1 - ENVIRONMENTAL OUTLIER DETECTION

One of the goals of this project is to provide a web service layer on top of occurrence data that supports the Edgar project. This includes:
  • Bulk access to occurrence records, including access to sensitive records not visible to the public but required for species distribution modelling purposes.
  • New data quality control methods and error reporting
  • Ingestion of feedback and additional locality information
In this post, we'll talk about the first two points and what we have done thus far.

Bulk access

We have developed services that allow downloading of occurrence records:
Both of these services give bulk access. The downloadfromDB service gives a wider range of metadata associated with each record, whereas downloadFromIndex gives a subset of the data that is typically of use to researchers and anyone modelling species distributions (including scientific name, latitude and longitude).

The latter was developed to support Edgar's need for faster bulk downloads. To support the Edgar project's need to maintain a separate local cache of the data, we have also developed a service so that deleted records can be tracked:
When records are ingested by the Atlas, a UUID is issued against each record, and the Atlas keeps track of the properties within that record that make it unique. This is typically an ID of some sort, but it may be a combination of, for example, lat/long, date and species name.
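
To illustrate the idea of issuing a stable UUID per unique record and then pruning a local cache when records are deleted, here is a minimal Python sketch. The class and function names (UuidRegistry, remove_deleted) are hypothetical and chosen for illustration only; this is not the Atlas implementation, and the shape of the unique key is an assumption.

```python
import uuid


class UuidRegistry:
    """Hypothetical registry: maps a record's unique properties to one UUID."""

    def __init__(self):
        self._by_key = {}

    def uuid_for(self, unique_values):
        # unique_values is a tuple of whatever makes the record unique,
        # e.g. ("catalogue-123",) or (lat, lon, date, species_name).
        key = tuple(unique_values)
        if key not in self._by_key:
            self._by_key[key] = str(uuid.uuid4())
        return self._by_key[key]


def remove_deleted(cache, deleted_uuids):
    """Drop deleted records from a local cache (dict keyed by UUID).

    Returns the number of records actually removed.
    """
    removed = 0
    for record_uuid in deleted_uuids:
        if cache.pop(record_uuid, None) is not None:
            removed += 1
    return removed
```

Re-ingesting a record with the same unique properties returns the same UUID, so a downstream cache such as Edgar's can match updates and deletions by UUID alone.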

Lists of species can be retrieved from services listed here:
The service http://bie.ala.org.au/search.json is currently in use by Edgar to retrieve a list of species which then drives the data harvesting. Edgar is also making use of services to retrieve LSIDs, the identifier the ALA is using for a taxon.

Data quality

As part of the AP30 project, the ALA has developed some data quality methods, and exposed the outputs to aid the modelling work in Edgar.

Mountain Thornbill
IMAGE BY: TOM TARRANT
RIGHTS: ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 


For each species, we are processing data to detect outlier records. This involves loading all points for a single species, intersecting these points with 5 chosen environmental surfaces, and running an algorithm known as Reverse Jackknife.
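
For readers unfamiliar with the technique, the following Python sketch shows one commonly described formulation of a reverse jackknife test on the values sampled from a single environmental surface. It is an assumption-laden illustration: the threshold 0.95 * sqrt(n) + 0.2, the use of unique sorted values, and the use of the median to decide which side of a gap is flagged are taken from general descriptions of the method, and this post does not state which variant or parameters the ALA uses.

```python
import math
from statistics import median


def reverse_jackknife(values):
    """Return the set of values flagged as outliers (illustrative sketch)."""
    x = sorted(set(values))
    n = len(x)
    if n < 3:
        return set()

    mean = sum(x) / n
    mid = median(x)
    # Assumed critical value; grows with the number of unique values.
    threshold = 0.95 * math.sqrt(n) + 0.2

    # Weight each gap between neighbours by its distance from the mean,
    # so that large gaps far from the centre stand out.
    weighted = []
    for lo, hi in zip(x, x[1:]):
        distance = (mean - lo) if lo < mean else (hi - mean)
        weighted.append((hi - lo) * distance)

    avg = sum(weighted) / len(weighted)
    spread = math.sqrt(sum((w - avg) ** 2 for w in weighted) / len(weighted))
    if spread == 0:
        return set()

    flagged = set()
    for i, w in enumerate(weighted):
        if w / spread > threshold:
            if x[i] < mid:
                # Gap in the lower tail: flag everything below the gap.
                flagged.update(x[: i + 1])
            else:
                # Gap in the upper tail: flag everything above the gap.
                flagged.update(x[i + 1:])
    return flagged
```

In this sketch the function would be called once per environmental surface, with the values obtained by intersecting a species' points with that surface; how the per-surface results are combined into a record-level outlier flag is not covered here.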


Here's an example record in the ALA:

http://biocache.ala.org.au/occurrence/b07bbac2-22d7-4c8a-8d61-4be1ab9e0d09

The details for this record are here (in JSON format):

http://biocache.ala.org.au/ws/outlier/record/b07bbac2-22d7-4c8a-8d61-4be1ab9e0d09

And the overall details of the tests for this species are accessible via the LSID, like so (JSON again):

http://biocache.ala.org.au/ws/outlierInfo/urn:lsid:biodiversity.org.au:afd.taxon:0c139726-2add-4abe-a714-df67b1d4b814.json
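
As a rough sketch of how a client might call these two services, the Python below builds the URLs following the patterns shown above and fetches the JSON. The helper names are hypothetical, the structure of the returned JSON is not assumed, and the endpoints are taken as given in this post, so they may have changed since.

```python
import json
from urllib.request import urlopen

# Base path as it appears in the URLs quoted in this post.
BIOCACHE_WS = "http://biocache.ala.org.au/ws"


def outlier_record_url(record_uuid):
    """URL for the outlier details of a single occurrence record."""
    return f"{BIOCACHE_WS}/outlier/record/{record_uuid}"


def outlier_info_url(lsid):
    """URL for the overall outlier test details of a species (by LSID)."""
    return f"{BIOCACHE_WS}/outlierInfo/{lsid}.json"


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, fetch_json(outlier_record_url("b07bbac2-22d7-4c8a-8d61-4be1ab9e0d09")) would request the record-level details for the example record above.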

The tests use 5 environmental layers that have been chosen for their suitability for testing with this algorithm. There is a write-up of this work available here.

Wednesday, June 13, 2012

PRODUCT TEAM

Peter Doherty - Project Manager

Peter has been the Program Officer for the Atlas since 2009.

Dave Martin - Software Architect

Dave has been working in biodiversity informatics for 6 years, and prior to this worked on healthcare and banking systems. He started working for the Atlas in 2008. Before that, he worked for the Global Biodiversity Information Facility (GBIF) in Copenhagen. He has skills in GIS, databases (relational, NoSQL), Lucene, Java, Scala, Groovy/Grails.

Miles Nicholls - Data Manager

With a background as a business analyst in data warehousing and business intelligence, Miles has been working with the Atlas since late 2009 as Data Manager. Miles has qualifications in science (although never used in anger and frighteningly out of date) and information systems and thinks it's great that the ALA combines the two. Miles' work with the ALA involves discussing data sharing, open access licensing, data schemas and formats with the owners of data, and transforming data using whatever tool will do the best job at the time.

Natasha Carter - Software Developer

Natasha has been working as a software engineer for 8 years. She works for a small company that specialises in providing data integration solutions. She is experienced in the design and implementation of solutions using a variety of tools and technologies including:
  • Java/Scala
  • RDBMS - MySQL, Postgres, Oracle, Ingres and SQLServer
  • NoSQL - Cassandra
Natasha has been involved in the ALA since December 2009 contributing to the data and service layers.

Nick dos Remedios - Software Developer

Nick has been with the Atlas of Living Australia since 2008 and has been working as a software developer since 1999. Prior to that he worked as an immunologist/molecular biologist/bioinformatician. He has experience in the areas of airline logistics, patent informatics and biodiversity informatics. He often gets pigeon-holed as a front-end developer and spends most of his time coding in Java, Groovy, JavaScript and HTML/CSS. Favourite frameworks/APIs include Grails, Spring MVC and jQuery.

The project

In conjunction with ANDS Project AP03, this project will collaborate closely with James Cook University (JCU) to produce an online tool that allows users to view current observations and distribution maps for Australian bird species as well as view predicted future distributions taking into account climate change.

This project (AP30) will provide the back-end functionality for the project, i.e., web services and/or bulk data downloads. It will deliver:
  • new web services making the locality information held by ALA available to AP03,
  • new data quality control/cleansing over the data, as defined with project AP03,
  • new web services allowing for ingestion of feedback and additional locality information from AP03 into the ALA records,
  • additional data quality web services of relevance to a wider research base than those targeted for AP03.