Showing posts with label andsTools. Show all posts

Wednesday, August 22, 2012

KEY FACTORS CUSTOMERS WILL USE TO JUDGE THE VALUE OF OUR PRODUCT

One of the most important aspects of the AP30 project is fast access to large numbers of occurrence records on a taxon-by-taxon basis. For example, at the time of writing there are 420,039 distinct records in the Atlas for the Australian Magpie. All of these individual records will need to be retrieved and cached by the Edgar project to support the vetting tool Edgar is developing. Edgar is focussing on bird distributions and modelling. Currently, there are 1,986 taxonomic concepts for Australian birds supplied by the Australian Faunal Directory.

In addition, persisting vettings against the data in the Atlas means that other tools and portals will benefit from the improved data quality. Researchers who work with these data typically have to gather the complete data set and then clean it to remove duplicate and erroneous records. This data cleaning is usually not shared or persisted with the source data, leading to duplication of effort within the research community.

By submitting the vettings to the Atlas, Edgar will be sharing the improved data quality with any researcher accessing these data through the Atlas.

KEY TECHNOLOGIES AND FEATURES

The stack we are putting together to support AP30 requirements includes:

  • Apache Cassandra database. The database will house the full record details and will store the results of duplicate and outlier detection. It will also provide the persistence for the record vettings provided by Edgar.
  • Apache Solr search indexes. These indexes will support the searching capabilities required by Edgar.
  • A processing chain implemented in Scala. This will include the algorithms for detecting duplicate records and environmental outliers. This custom code will then update the search indexes so that Edgar can filter for non-duplicate, non-outlier records, improving the quality of the model and reducing the number of records to be vetted. The code for this component is available in the Google Code repository http://code.google.com/p/ala-portal/
  • Java Spring MVC web services. These web services will provide the interface for the Edgar project to download snapshots of data for modelling and vetting purposes. They will also provide a write interface for the submission of vettings of bird records by expert users. The code for this component is available in the Google Code repository http://code.google.com/p/ala-portal/. An additional important functional requirement is for Edgar to be able to query for record deletions. Services will be developed to allow Edgar to keep track of the deletions that occur periodically when the ALA harvests from data providers.
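
To make the duplicate and outlier flagging in the processing chain more concrete, here is a minimal sketch. It is written in Python for brevity and is not the project's Scala code. Everything in it is an assumption made for illustration: the Occurrence fields, the rule that two records are duplicates when they share taxon, rounded coordinates and date, the use of a single temperature value as the environmental variable, and Tukey fences (1.5 times the interquartile range) as the outlier test. The actual algorithms used in the Atlas may differ.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Occurrence:
    """Hypothetical, simplified occurrence record (not the Atlas schema)."""
    record_id: str
    taxon: str
    latitude: float
    longitude: float
    event_date: str      # ISO date string, e.g. "2012-08-22"
    temperature: float   # assumed environmental value at the record's location
    is_duplicate: bool = False
    is_outlier: bool = False


def flag_duplicates(records, precision=4):
    """Flag every record after the first that shares taxon, rounded coordinates and date."""
    seen = set()
    for rec in records:
        key = (rec.taxon,
               round(rec.latitude, precision),
               round(rec.longitude, precision),
               rec.event_date)
        if key in seen:
            rec.is_duplicate = True
        else:
            seen.add(key)


def flag_outliers(records, k=1.5):
    """Flag records whose environmental value falls outside the Tukey fences.

    Quartiles are estimated from non-duplicate records only, so repeated
    records do not skew the fences.
    """
    values = [r.temperature for r in records if not r.is_duplicate]
    if len(values) < 4:
        return  # too few points to estimate quartiles meaningfully
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    for rec in records:
        if not low <= rec.temperature <= high:
            rec.is_outlier = True


def clean_records(records):
    """Return only the records that are neither duplicates nor outliers."""
    return [r for r in records if not (r.is_duplicate or r.is_outlier)]
```

Calling flag_duplicates and then flag_outliers on the records for one taxon, followed by clean_records, yields the subset a modelling or vetting tool would work with. Storing the two flags on each record rather than deleting records mirrors the approach described above, where the flags are written to the search indexes and filtering happens at query time.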