Wednesday, August 22, 2012

KEY FACTORS CUSTOMERS WILL USE TO JUDGE THE VALUE OF OUR PRODUCT

One of the most important aspects of the AP30 project is fast access to large numbers of occurrence records at a taxon-by-taxon level. As an example, at the time of writing there are 420,039 distinct records in the Atlas for the Australian Magpie. All of these individual records will need to be retrieved and cached by the Edgar project to support the vetting tool Edgar is developing. Edgar is focussing on bird distributions and modelling. Currently, there are 1,986 taxonomic concepts for Australian birds supplied by the Australian Faunal Directory.

In addition, the persistence of vettings against the data in the Atlas will mean other tools and portals benefit from the improved data quality. Typically, researchers who work with these data have to undertake a complete gathering of the data and a cleaning process to remove duplicate and erroneous records. This cleaning work is generally not shared or persisted with the source data, leading to duplication of effort within the research community.

By submitting the vettings to the Atlas, Edgar will be sharing the improved data quality with any researcher accessing these data through the Atlas.

HOW THE PRODUCT WILL MEET OUR USERS' NEEDS


The primary customers for this project are the Edgar team, with whom we are working to produce a platform that supports:
  • integration of occurrence data for presentation to users on a map interface
  • the ability to filter environmental outliers
  • the ability to filter duplicate records
  • persistence of record vettings, and application of the vettings to records within a polygon 
However, once these services and data processing are in place, they will benefit the ALA and additional portals wishing to access occurrence data. This includes portals such as the Online Zoological Collections of Australian Museums, which is built upon Atlas web services.
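The last capability in the list above, applying a vetting to all records within a polygon, can be sketched with a standard ray-casting point-in-polygon test. This is an illustrative sketch only; the class and field names are hypothetical and do not reflect the Atlas schema.

```java
import java.util.List;

// Hypothetical sketch: apply a vetting status to every occurrence record
// that falls inside a user-drawn polygon, using the ray-casting test.
public class PolygonVetting {

    // A minimal occurrence record: decimal latitude/longitude plus a vetting flag.
    static class Occurrence {
        final double lat, lon;
        String vetting;          // e.g. "doubtful"; null when unvetted (illustrative values)
        Occurrence(double lat, double lon) { this.lat = lat; this.lon = lon; }
    }

    // Ray-casting test: counts crossings of a horizontal ray from the point;
    // an odd number of crossings means the point is inside the polygon.
    static boolean contains(double[][] polygon, double lat, double lon) {
        boolean inside = false;
        for (int i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
            double yi = polygon[i][0], xi = polygon[i][1];
            double yj = polygon[j][0], xj = polygon[j][1];
            if ((yi > lat) != (yj > lat)
                    && lon < (xj - xi) * (lat - yi) / (yj - yi) + xi) {
                inside = !inside;
            }
        }
        return inside;
    }

    // Stamps every record inside the polygon with the supplied vetting status
    // and returns the number of records updated.
    static int applyVetting(List<Occurrence> records, double[][] polygon, String status) {
        int updated = 0;
        for (Occurrence o : records) {
            if (contains(polygon, o.lat, o.lon)) {
                o.vetting = status;
                updated++;
            }
        }
        return updated;
    }
}
```

In practice the polygon would arrive from Edgar's map interface and the updated statuses would be persisted back to the Atlas, but the geometric core of the operation is this simple containment check.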



KEY TECHNOLOGIES and FEATURES

The stack we are putting together to support AP30 requirements includes:

  • Apache Cassandra Database. The database will house the full record details and will store the results of duplicate and outlier detection. It will also provide the persistence for the record vettings provided by Edgar.
  • Apache SOLR search indexes. These indexes will support the searching capabilities required by Edgar.
  • A processing chain implemented in Scala. This will include the algorithms for detecting duplicate records and environmental outliers. This custom code will then update the search indexes to allow Edgar to filter for non-duplicates and non-outliers, hence improving the quality of the model and reducing the number of records to be vetted. The code for this component is accessible in the Google Code repository http://code.google.com/p/ala-portal/
  • Java Spring MVC web services. These web services will provide the interface for the Edgar project to download snapshots of data for modelling and vetting purposes. They will also provide a write interface for the submission of vettings of bird records by expert users. The code for this component is accessible in the Google Code repository http://code.google.com/p/ala-portal/. An additional important functional requirement is for Edgar to be able to query for record deletions. Services will be developed to allow Edgar to keep track of these deletions, which occur periodically when the ALA harvests from data providers.
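To give a flavour of the duplicate-detection step in the processing chain, the sketch below groups records by a composite key and flags all but the first in each group. The grouping key (taxon, coordinates rounded to two decimal places, collection date) and the field names are assumptions for illustration, not the actual Atlas algorithm or schema.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of duplicate detection: records sharing the same taxon,
// rounded coordinates and collection date are treated as duplicates, and only
// the first record in each group is kept as the representative.
public class DuplicateDetector {

    static class Record {
        final String taxon, eventDate;
        final double lat, lon;
        boolean duplicate;       // set true when flagged by the detector
        Record(String taxon, double lat, double lon, String eventDate) {
            this.taxon = taxon; this.lat = lat; this.lon = lon; this.eventDate = eventDate;
        }
    }

    // Flags every record after the first in each (taxon, lat, lon, date) group.
    // Coordinates are rounded to two decimal places (roughly 1 km) first.
    static int flagDuplicates(List<Record> records) {
        Set<String> seen = new HashSet<>();
        int flagged = 0;
        for (Record r : records) {
            String key = r.taxon
                    + "|" + Math.round(r.lat * 100) / 100.0
                    + "|" + Math.round(r.lon * 100) / 100.0
                    + "|" + r.eventDate;
            if (!seen.add(key)) {     // add() returns false when the key was already seen
                r.duplicate = true;
                flagged++;
            }
        }
        return flagged;
    }
}
```

In the real processing chain the resulting flags would be written to the SOLR indexes, so that Edgar can request only non-duplicate records when building a model.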

Wednesday, August 15, 2012

PROJECT DESCRIPTION

Simply stated, the goals of AP30 are to:

  • Provide bulk data access services to assist external software projects requiring biodiversity occurrence data. 
  • Provide web services for the persistence of annotations against occurrence records.
  • Produce data quality processes to improve the quality of the occurrence data.

Primarily, these services will be targeted at meeting the needs of the AP03 project, Edgar, but once developed they will aid the development of other biodiversity portals and analytical tools.

The annotations provided by Edgar will also persist with the occurrence data. This will aid other researchers in removing records not suitable for distribution modelling.
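One way a downstream researcher might use the persisted annotations is to filter out records whose vetting status makes them unsuitable for modelling. The status values and the exclusion set below are illustrative assumptions, not the actual vetting vocabulary.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: before distribution modelling, drop records that
// expert vettings have marked unsuitable. Status values are illustrative.
public class VettingFilter {

    static class Record {
        final String id;
        final String vetting;    // e.g. "confirmed", "doubtful"; null = unvetted
        Record(String id, String vetting) { this.id = id; this.vetting = vetting; }
    }

    // Assumed set of statuses that exclude a record from modelling.
    static final Set<String> EXCLUDED = Set.of("doubtful", "vagrant");

    // Returns only the records still usable for modelling: unvetted records
    // are kept, records with an excluded status are dropped.
    static List<Record> suitableForModelling(List<Record> records) {
        return records.stream()
                .filter(r -> r.vetting == null || !EXCLUDED.contains(r.vetting))
                .collect(Collectors.toList());
    }
}
```

Because the vettings live with the source data in the Atlas, every researcher applying a filter like this benefits from the expert effort, rather than repeating the cleaning themselves.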