Wednesday, August 22, 2012

KEY TECHNOLOGIES and FEATURES

The stack we are putting together to support AP30 requirements includes:

  • Apache Cassandra Database. The database will house the full record details and will store the results of duplicate & outlier detection. It will also provide the persistence for the record vettings provided by Edgar
  • Apache SOLR search indexes. These indexes will support the searching capabilities required by Edgar.
  • A processing chain implemented in Scala. This will include the algorithms for detecting duplicate records and environmental outliers. This custom code will then update search indexes to allow Edgar to filter for non-duplicates, and non-outliers hence improving the quality of the model and reducing the number of records to be vetted. The code for this component is accessible in the google code repository http://code.google.com/p/ala-portal/
  • Java Spring MVC web services. These web services will provide the interface for the Edgar project to download snapshots of data for modelling and vetting purposes. They will also provide a write interface for submission of the vettings of bird records by expert users. The code for this component is accessible in the google code repository http://code.google.com/p/ala-portal/. An additional important functional requirement is for Edgar to be able to query for record deletions. Services will be developed to allow Edgar to keep track of these deletions that occur periodically when the ALA harvests from data providers.

1 comment: