Tuesday, June 26, 2012


One of the goals of this project is to provide a web service layer on top of occurrence data that supports the Edgar project. This includes:
  • Bulk access to occurrence records, including access to sensitive records not visible to the public but required for species distribution modelling purposes.
  • New data quality control methods, and error reporting
  • Ingestion of feedback and additional locality information
In this post, we'll talk about the first two points and what we have done thus far.

Bulk access

We have developed services that allow downloading of occurrence records:
Both of these services give bulk access. The downloadfromDB service gives a wider range of metadata associated with each record, whereas downloadFromIndex gives a subset of the data that is typically of use to researchers and anyone modelling species distributions (including scientific name, latitude and longitude).

The latter was developed to support Edgar's need for faster bulk downloads. To support the Edgar project's need to maintain a separate local cache of the data, we have also developed a service so that deleted records can be tracked:
When records are ingested by the Atlas, a UUID is issued against each record, and the Atlas keeps track of the properties within that record that make it unique. This is typically an ID of some sort, but it may be, for example, a combination of lat/long, date and species name.
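
As a rough illustration of the idea (not the Atlas's actual code), the sketch below keeps a stable UUID per uniqueness key and prunes a local cache from a list of deleted UUIDs. The names `UuidRegistry`, `uuid_for` and `prune_deleted` are hypothetical, and the in-memory dictionaries simply stand in for whatever storage is really used.

```python
import uuid


class UuidRegistry:
    """Hypothetical helper: maps a record's uniqueness key to a stable UUID."""

    def __init__(self):
        self._by_key = {}

    def uuid_for(self, unique_key):
        """Return the UUID for this key, issuing a new one on first sight.

        `unique_key` is a tuple of the properties that make the record
        unique, e.g. (catalogue_id,) or (lat, lon, date, species_name).
        """
        if unique_key not in self._by_key:
            self._by_key[unique_key] = str(uuid.uuid4())
        return self._by_key[unique_key]


def prune_deleted(cache, deleted_uuids):
    """Remove deleted records from a local cache keyed by UUID.

    Returns the number of records actually removed; UUIDs that are not
    in the cache are ignored.
    """
    removed = 0
    for record_uuid in deleted_uuids:
        if record_uuid in cache:
            del cache[record_uuid]
            removed += 1
    return removed
```

Re-ingesting a record with the same key returns the same UUID, which is what lets a downstream cache such as Edgar's match deletions against the records it already holds.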

Lists of species can be retrieved from services listed here:
The service http://bie.ala.org.au/search.json is currently in use by Edgar to retrieve a list of species which then drives the data harvesting. Edgar is also making use of services to retrieve LSIDs, the identifier the ALA is using for a taxon.

Data quality

As part of the AP30 project, the ALA has developed some data quality methods, and exposed the outputs to aid the modelling work in Edgar.

Mountain Thornbill

For each species, we are processing data to detect outlier records. This involves loading all points for a single species, intersecting these points with 5 chosen environmental surfaces, and running an algorithm known as Reverse JackKnife.
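
The per-species workflow described above can be sketched roughly as follows. This is only an illustration: the layer names and record layout are hypothetical, and because the exact Reverse Jackknife formula is not given in this post, the sketch swaps in a simple interquartile-range (Tukey fence) check as a stand-in detector. It is not the ALA's implementation.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Stand-in detector (NOT Reverse Jackknife): Tukey's IQR fences.

    Returns one boolean per input value, True where the value lies
    more than k * IQR below the first or above the third quartile.
    """
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


def flag_outlier_records(records, layers, detector=iqr_outliers):
    """Flag outliers for one species, one environmental layer at a time.

    `records` is a list of dicts holding a 'uuid' plus the value sampled
    from each layer at the record's location (an assumed layout).
    Returns {uuid: [names of layers in which the record is an outlier]}.
    """
    flagged = {r["uuid"]: [] for r in records}
    for layer in layers:
        values = [r[layer] for r in records]
        for record, is_outlier in zip(records, detector(values)):
            if is_outlier:
                flagged[record["uuid"]].append(layer)
    return flagged
```

Passing the detector as a parameter keeps the per-layer loop independent of the statistical test, so a real Reverse Jackknife function could be dropped in without changing the rest.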

Here's an example record in the ALA:


The details for this record are here (in JSON format):


And the overall details of the tests for this species are accessible via the LSID, like so (JSON again):


The tests use 5 environmental layers that have been chosen for their suitability for testing with this algorithm. There is a write-up of this work available here.


  1. Is there an official definition of 'occurrence data'? - not readily seeing it in the wikipedia :)

  2. Huh, 'reverse jackknife' that's interesting :) thanks for sharing :)

  3. Spatial outliers are really interesting, thanks for the link to the notes. Would be nice to see a shorter summary post explaining the subjectivity of outliers and how the dev and scientists are making this subjective algorithm work.

  4. Cheers David.

    "Occurrence" is a widely used term in biodiversity informatics that covers specimen collections and observations of species. A record will typically consist of (at least) 4 components:

    1) What - a species or subspecies, or a higher level taxon if identification to species level isn't possible
    2) Where - typically geographical coordinates, but could be a locality description
    3) When - a single date or date range
    4) Whom - who recorded the observation or collected the specimen.

    I agree on the subjectivity of the outliers. Essentially the approach here is to allow researchers to filter out records marked by the ALA as outliers if they want, but not to prevent access to any of the data. Hence researchers can use our outlier annotations on the records if they think the annotations meet their needs and/or they agree with the techniques used.