Sunday
Jul 10, 2011

Species Concepts, Occurrences and Individuals

There has been an ongoing discussion on the TDWG list about adding a class for an "Individual" to the DarwinCore vocabulary.
I thought these examples might help people get their head around this idea of "Individual".

 

Some email systems escape the long versions of these URIs, so I have added bit.ly links that should also work.
They are all collected into this bit.ly bundle http://bitly.com/okM8h0
These are from occurrence records that can be browsed here http://ocs.taxonconcept.org/ocs/index.html

The "Tag" for individuals of that species.
txn:SpeciesIndividualTag
KB View http://lsd.taxonconcept.org/describe/?url=http%3A%2F%2Flod.taxonconcept.org%2Fses%2FICmLC%23Individual
http://bitly.com/oM7X6Q
  * Note this statement in the knowledge base view:
  is type of txn_ocs:f522444a-2dd9-400e-be59-47213ef38cb9#Individual

 

 

Monday
Feb 21, 2011

Species Occurrence Records Represented in RDF

 

 

I thought it might be helpful to explain how the example TaxonConcept occurrence records are structured. The diagram above shows the different sections in the RDF document. The different sections make statements about the various related entities. Here is a link to one of the RDF records. The occurrence section gives the what, when, where and who for the occurrence itself. The occurrence section also allows for the inclusion of an image of the original occurrence label.

The occurrence happened within a given geographic area documented by a WGS84 latitude, longitude, and radius in meters. To allow "areas" that overlap or are otherwise identical to be linked, it is important that the precision (number of decimal places) for the longitude and latitude be standardized. For these examples, I simply added one digit to the numbers from a typical GPS. This should allow submeter precision for those projects that need it. Remember that the actual precision is recorded in the radius in meters. Those who are exporting data sets with only two or three decimal places can output those longitudes and latitudes by setting the number of decimal places to 8 in their export software. This would mean that data recorded as 44.86, -87.23 would be exported as 44.86000000, -87.23000000. Again, remember that the actual precision is given in the radius.

One feature of these "areas" is that I only need to define an "area" once, so if I have 10,000 mosquito records I can simply refer to the area "geo:44.86528100,-87.23147800;u=10" in the other 9,999 records. Another feature is that if a different LOD entity has soil or weather data associated with this area, they can add that and it will be visible to those users consuming my occurrence records.
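The fixed-precision export described above can be sketched in a few lines of Python. This is only an illustration: the function name `format_geo_uri` and its parameters are my own assumptions, not part of any export software or of the TaxonConcept code.

```python
def format_geo_uri(lat: float, lon: float, radius_m: int, places: int = 8) -> str:
    """Build a geo: URI with a fixed number of decimal places (hypothetical helper).

    The radius carries the real precision; the padding only makes the
    text of the URI identical for identical points.
    """
    return f"geo:{lat:.{places}f},{lon:.{places}f};u={radius_m}"


# Coarse coordinates are zero-padded rather than rounded.
print(format_geo_uri(44.86, -87.23, 1000))
# -> geo:44.86000000,-87.23000000;u=1000

# A GPS reading with six decimal places gets two trailing zeros.
print(format_geo_uri(44.865281, -87.231478, 10))
# -> geo:44.86528100,-87.23147800;u=10
```

Because every record pads to the same width, two providers describing the same point and radius produce the same string, which is what lets the "area" be defined once and referenced everywhere else.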

The species concept section documents that an instance of the species concept had a particular occurrence in a particular area. It also adds that the species concept can then be "expected" in the particular continent, state/province and county.

The occurrence was of an individual, which could now be a specimen in a collection or an individual that might be encountered in other studies.

The occurrence record has an associated identification. This section contains information about who identified the specimen, how they identified it and what concept and scientific name were assigned to the specimen. The identification section also allows for the inclusion of an image of the original identification label.

The last three sections make assertions about the geographic areas of continent, state/province and county. Since the species concept was observed within these geographical areas, the species concept can be said to be "expected" in these areas.

This method creates the triples that allow LOD consumers to see what species concepts are "expected" in a given geographical area. It is also an emergent phenomenon that individual occurrence records create geographic "species lists."
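To make the "emergent species list" idea concrete, here is a minimal Python sketch. Everything in it is a placeholder I made up for illustration: the `ex:` identifiers, the `ex:isExpectedIn` predicate name, and the two helper functions are assumptions, not the actual TaxonConcept vocabulary.

```python
from collections import defaultdict

# Hypothetical occurrence records:
# (species concept, continent, state/province, county)
occurrences = [
    ("ex:ConceptA", "ex:Continent1", "ex:State1", "ex:County1"),
    ("ex:ConceptB", "ex:Continent1", "ex:State1", "ex:County1"),
    ("ex:ConceptA", "ex:Continent1", "ex:State2", "ex:County2"),
]


def expected_triples(records):
    """Emit one 'expected in' triple per concept and enclosing area."""
    triples = set()
    for concept, continent, state, county in records:
        for area in (continent, state, county):
            triples.add((concept, "ex:isExpectedIn", area))
    return triples


def species_lists(triples):
    """Group the triples by area; the lists fall out of the records."""
    lists = defaultdict(set)
    for concept, _predicate, area in triples:
        lists[area].add(concept)
    return dict(lists)


lists = species_lists(expected_triples(occurrences))
print(sorted(lists["ex:County1"]))  # concepts recorded in the first county
```

Nobody has to curate the per-area lists by hand. They are simply a different way of reading the triples that the occurrence records already assert.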

These examples include more information than they need to, but I thought it might be best to include everything that was part of the DarwinCore. Individual data providers might choose not to include all of these attributes in their own data.

Monday
Dec 20, 2010

Biodiversity Informatics on the Semantic Web

Here is my recent talk from the Entomology Collections Network in San Diego.

 

 

Thursday
Aug 5, 2010

Why Linked Open Data Makes Sense for Biodiversity Informatics

I came across an issue in my own work that I think serves as a good example of the advantages of the Linked Open Data approach.

I have been working to create Linked Open Data compliant identifiers for species. Species are traditionally described in a published paper. These species descriptions along with the type specimens serve as documentation of the species concept. Occasionally, others revise this species concept through published "revisions."

The TaxonConcept.org species concepts would be clearer and more useful if I included links to the original species description in the species concept RDF.

Since the Biodiversity Heritage Library has already been working on collecting, scanning and databasing this information, it seems that the most sensible and efficient approach would be simply to link to their identifiers for the appropriate publications in my RDF.

It does not make sense to replicate this data and functionality in my application when the BHL is already doing a great job databasing biodiversity publications.

These links could be either in the form of a URL to the PDF version of the species description, or as links to an RDF file containing the title, author, and journal of the original species description. All that is really needed are resolvable identifiers for each published species description that expose enough information to make it clear which specific article I am linking to.

This first diagram represents how I might have modeled TaxonConcept.org as a traditional "walled garden" web application. In this example, I recreate tables and data for occurrence records and references. I then curate and expose my reference and occurrence data separately from other, often more complete, data sets.

A Linked Open Data approach would make more sense. One could simply link to occurrence records and references that are already being curated by others.

In addition to improving the value of my data set, other groups could use those same links to improve the quality of information they provide. In the process, those data sets that link to the same BHL identifier are now also interlinked and "findable." From the perspective of the BHL, these links could serve as a way to measure the utility of their service and obtain metrics on each publication.

If each dataset is assigned to a separate graph, it becomes easy to include or exclude the data sets and statements made by other groups.
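A minimal sketch of that idea in Python, treating each statement as a quad whose first element names the graph it came from. The graph names, the `ex:` identifiers and the `select` helper are all hypothetical; a real triple store would do this with named graphs in its query language.

```python
# Each statement is a quad: (graph, subject, predicate, object).
# All names here are made-up placeholders.
quads = [
    ("graph:mine", "ex:ConceptA", "ex:describedIn", "ex:Publication1"),
    ("graph:other_group", "ex:ConceptA", "ex:describedIn", "ex:Publication2"),
    ("graph:other_group", "ex:ConceptB", "ex:describedIn", "ex:Publication1"),
]


def select(quads, include=None, exclude=()):
    """Return the triples from the chosen graphs.

    include=None means "all graphs"; exclude removes specific graphs.
    """
    result = []
    for graph, s, p, o in quads:
        if include is not None and graph not in include:
            continue
        if graph in exclude:
            continue
        result.append((s, p, o))
    return result


# Everything, then everything except one group's statements.
print(len(select(quads)))
print(select(quads, exclude={"graph:other_group"}))
```

Keeping the source graph attached to every statement is what makes it cheap to trust some contributors and ignore others without deleting any data.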

The diagram below shows some of the potentially linkable data sets. There are a lot more on the Linked Open Data Cloud, but I wanted a reasonably sized diagram. Some of these resources already exist and are interlinked, while others like GBIF, the BHL and the Encyclopedia of Life are either not available or are still in the planning stage.

In summary, the Linked Open Data approach makes the best use of everyone's efforts, reduces data redundancy, and makes additional data sets, of which you might not have been originally aware, findable and usable.

Thursday
Jun 10, 2010

A Species has_many Classifications

 

An issue that I need to address while creating species concepts is what to do about alternative classifications. One solution is to simply choose one classification. Unfortunately, I can't find one best classification. My current alternatives are the NCBI Taxonomy, ITIS, Catalog of Life 2010, and DBpedia. Each of these alternatives has its strengths and weaknesses.

In the GeoSpecies Knowledge Base, there is only one classification. The GeoSpecies species have a set of links out to URIs for families, orders, classes, phyla and kingdoms in addition to species. This structure has proven too problematic and difficult to maintain. It also does not accurately represent the reality that a species can have many classifications.

The diversity of alternative classifications might be best represented as links to alternative classification ontologies. Each classification would be modeled as an OWL ontology in Protege. Species concepts would then be linked to the lowest appropriate place in each ontology. I have started creating some of these in Protege, but creating these manually takes a lot of time.

One of these ontologies maps the Catalog of Life 2010 down to the class level. Each of the TaxonConcept species concepts is linked to a Catalog of Life 2010 ontology at the class level, and to the existing DBpedia ontology at Eukaryote, Animal, Mammal, Bird, Insect, or Arachnid, depending on which is most appropriate.

Here is an example of this linking:

In the future, I hope to extend some of these to the order or family level. This would allow systems to infer that if something is in ITIS "Salticidae" then it is an Arthropod, or if something is in DBpedia "Mammal" it is also a "Eukaryote".
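The kind of inference mentioned above amounts to following "is a kind of" links upward. Here is a small Python sketch under loud assumptions: the identifiers are invented, the intermediate ranks are collapsed into single hops, and the `ancestors` helper is mine. A real OWL reasoner would derive this from the subclass axioms in each ontology.

```python
# Hypothetical "is a kind of" links. Intermediate ranks are collapsed
# for brevity; real classifications have many more levels.
parent_of = {
    "ITIS:Salticidae": "ITIS:Arthropoda",
    "DBpedia:Mammal": "DBpedia:Animal",
    "DBpedia:Animal": "DBpedia:Eukaryote",
}


def ancestors(term, parent_of):
    """Follow parent links upward and collect every broader class."""
    found = []
    seen = {term}
    while term in parent_of:
        term = parent_of[term]
        if term in seen:  # guard against accidental cycles in the data
            break
        seen.add(term)
        found.append(term)
    return found


print(ancestors("ITIS:Salticidae", parent_of))
print(ancestors("DBpedia:Mammal", parent_of))
```

Because each classification keeps its own set of links, a species concept attached to several of them can be reasoned about within any one classification without forcing the others to agree.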

Since each of these species concepts links to a number of alternative classifications, the fact that each species can have many classifications is better represented.

I would be interested to hear what others think of this architecture. Is there a potentially better way to represent the same kind of relationships?