Species Concepts, Occurrences and Individuals

There has been an ongoing discussion on the <a href="http://www.tdwg.org/">TDWG</a> list about adding a class for an "Individual" to the DarwinCore vocabulary. I thought these examples might help people get their heads around this idea of an "Individual".

In some email systems the long versions of these URIs* get escaped, so I have added bit.ly links that should also work.

They are all collected into this bit.ly bundle http://bitly.com/okM8h0
These are from occurrence records that can be browsed here http://ocs.taxonconcept.org/ocs/index.html

Silver-bordered Fritillary Butterfly (Boloria selene). http://lod.taxonconcept.org/ses/ICmLC.html

KB View http://lsd.taxonconcept.org/describe/?url=http%3A%2F%2Flod.taxonconcept.org%2Fses%2FICmLC%23Species

An occurrence of that species. http://ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9.html

KB View http://lsd.taxonconcept.org/describe/?url=http://ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9%23Occurrence

An individual of that species.

KB View http://lsd.taxonconcept.org/describe/?url=http%3A%2F%2Focs.taxonconcept.org%2Focs%2Ff522444a-2dd9-400e-be59-47213ef38cb9%23Individual

The "Tag" for individuals of that species.

KB View http://lsd.taxonconcept.org/describe/?url=http%3A%2F%2Flod.taxonconcept.org%2Fses%2FICmLC%23Individual
  * Note the following statement in the knowledge base view:
  is type of txn_ocs:f522444a-2dd9-400e-be59-47213ef38cb9#Individual
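
For anyone who wants to build a KB View link by hand, here is a minimal sketch in Python. It assumes the describe page simply takes the percent-encoded entity URI in its "url" parameter, which is what the links above look like. The function name is my own invention for illustration.

```python
from urllib.parse import quote, unquote

# Endpoint copied from the KB View links above.
DESCRIBE_ENDPOINT = "http://lsd.taxonconcept.org/describe/?url="

def kb_view_url(entity_uri: str) -> str:
    """Percent-encode the whole entity URI, including ':', '/' and '#'."""
    # safe="" tells quote() to escape '/' as well, which it leaves alone by default.
    return DESCRIBE_ENDPOINT + quote(entity_uri, safe="")

species_uri = "http://lod.taxonconcept.org/ses/ICmLC#Species"
url = kb_view_url(species_uri)
print(url)

# Decoding the parameter gives back the original entity URI.
print(unquote(url.split("?url=", 1)[1]))
```

Escaping the "#" matters most: left unescaped, it would be read as a fragment of the describe page itself rather than as part of the entity URI.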

Species Occurrence Records Represented in RDF

[Diagram: Species Occurrence Records]

I thought it might be helpful to explain how the example TaxonConcept occurrence records are structured. The diagram above shows the different sections in the RDF document. The different sections make statements about the various related entities. Here is a link to one of the RDF records. The occurrence section gives the what, when, where and who for the occurrence itself. The occurrence section also allows for the inclusion of an image of the original occurrence label.

The occurrence happened within a given geographic area documented by a WGS84 latitude, longitude, and radius in meters. To allow "areas" that overlap or are identical to be linked, it is important that the precision (number of decimal places) for the latitude and longitude be standardized. For these examples, I simply added one digit to the numbers from a typical GPS. This should allow submeter precision for those projects that need it. Remember that the actual precision is recorded in the radius in meters. Those who are exporting data sets with only two or three decimal places can output those latitudes and longitudes by setting the number of decimal places to 8 in their export software. This would mean that data recorded as 44.86, -87.23 would be exported as 44.86000000, -87.23000000. Again, remember that the actual precision is given in the radius. One feature of these "areas" is that I only need to define an "area" once, so if I have 10,000 mosquito records for one area I can simply refer to the area "geo:44.86528100,-87.23147800;u=10" in the other 9,999 records. Another feature is that if a different LOD entity has soil or weather data associated with this area, they can add that and it will be visible to users consuming my occurrence records.
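
As a rough sketch of the export step, this is how the padding to eight digits after the decimal point could be done in Python. The function name is hypothetical, and I am assuming the radius is a whole number of meters, as in the "u=10" example.

```python
def area_uri(lat: float, lon: float, radius_m: int) -> str:
    """Build a geo: identifier with coordinates padded to 8 decimal places."""
    # The padding only standardizes the string; the real precision is the radius.
    return f"geo:{lat:.8f},{lon:.8f};u={radius_m}"

coarse = area_uri(44.86, -87.23, 10)
fine = area_uri(44.865281, -87.231478, 10)
print(coarse)
print(fine)

# The same place written with different trailing zeros yields the same identifier.
print(area_uri(44.860, -87.2300, 10) == coarse)
```

Because every provider ends up with the same string for the same coordinates and radius, the identifier can be used directly as a link target.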

The species concept section documents that an instance of the species concept had a particular occurrence in a particular area. It also adds that the species concept can then be "expected" in the particular continent, state/province and county.

The occurrence was of an individual, which now could be a specimen in a collection or an individual that might be encountered in other studies.

The occurrence record has an associated identification. This section contains information about who identified the specimen, how they identified it and what concept and scientific name were assigned to the specimen. The identification section also allows for the inclusion of an image of the original identification label.

The last three sections make assertions about the geographic areas of continent, state/province and county. Since the species concept was observed within these geographical areas, the species concept can be said to be "expected" in these areas.

This method creates the triples that allow LOD consumers to see what species concepts are "expected" in a given geographical area. It is also an emergent phenomenon that individual occurrence records create geographic "species lists."
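
To make the "species list" idea concrete, here is a small Python sketch. The records, the identifiers and the "isExpectedIn" predicate name are all invented for illustration; they are not the actual terms used in the TaxonConcept RDF.

```python
from collections import defaultdict

# Invented records for illustration: (species concept, continent, state, county).
occurrences = [
    ("concept:A", "continent:X", "state:X1", "county:X1a"),
    ("concept:B", "continent:X", "state:X1", "county:X1b"),
    ("concept:A", "continent:X", "state:X2", "county:X2a"),
]

def expected_triples(records):
    """Yield one (concept, 'isExpectedIn', area) triple per containing area."""
    for concept, *areas in records:
        for area in areas:
            yield (concept, "isExpectedIn", area)

def species_lists(triples):
    """Group the 'expected' triples by area, giving a species list per area."""
    lists = defaultdict(set)
    for concept, _predicate, area in triples:
        lists[area].add(concept)
    return {area: sorted(concepts) for area, concepts in lists.items()}

lists = species_lists(expected_triples(occurrences))
for area in sorted(lists):
    print(area, lists[area])
```

Nobody wrote a species list here. It falls out of the individual records once each one asserts its containing areas.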

These examples include more information than they need to, but I thought it might be best to include everything that was part of the DarwinCore. Individual data providers might choose not to include all of these attributes in their own data.

Why Linked Open Data Makes Sense for Biodiversity Informatics

I came across an issue in my own work that I think serves as a good example of the advantages of the Linked Open Data approach.

I have been working to create Linked Open Data compliant identifiers for species. Species are traditionally described in a published paper. These species descriptions along with the type specimens serve as documentation of the species concept. Occasionally, others revise this species concept through published "revisions."

The TaxonConcept.org species concepts would be clearer and more useful if I included links to the original species description in the species concept RDF.

Since the Biodiversity Heritage Library has already been working on collecting, scanning and databasing this information, it seems that the most sensible and efficient approach would be simply to link to their identifiers for the appropriate publications in my RDF.

It does not make sense to replicate this data and functionality in my application when the BHL is already doing a great job databasing biodiversity publications.

These links could be either in the form of a URL to the PDF version of the species description, or as links to an RDF file containing the title, author, and journal of the original species description. All that is really needed are resolvable identifiers for each published species description that expose enough information to make it clear which specific article I am linking to.

This first diagram represents how I might have modeled TaxonConcept.org as a traditional "walled garden" web application. In this example, I recreate tables and data for occurrence records and references. I then curate and expose my reference and occurrence data separately from other, often more complete, data sets.

A Linked Open Data approach would make more sense. One could simply link to occurrence records and references that are already being curated by others.

In addition to improving the value of my data set, other groups could use those same links to improve the quality of information they provide. In the process, those data sets that link to the same BHL identifier are now also interlinked and "findable." From the perspective of the BHL, these links could serve as a way to measure the utility of their service and obtain metrics on each publication.

If each dataset is assigned to a separate graph, it becomes easy to include or exclude the data sets and statements made by other groups.
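
Here is a minimal sketch of that idea in plain Python, without any RDF library. Each statement carries the name of the graph it came from, and the graph names and statements are made up for the example.

```python
# Each statement is a quad: (subject, predicate, object, graph).
quads = [
    ("concept:A", "describedIn", "bhl:item1", "graph:taxonconcept"),
    ("concept:A", "hasOccurrence", "occ:1", "graph:providerX"),
    ("concept:A", "hasOccurrence", "occ:2", "graph:providerY"),
]

def select(quads, include=None, exclude=()):
    """Keep quads whose graph is in `include` (if given) and not in `exclude`."""
    return [
        q for q in quads
        if (include is None or q[3] in include) and q[3] not in exclude
    ]

# Drop everything asserted by one provider.
without_y = select(quads, exclude={"graph:providerY"})
# Or look only at what one provider has said.
only_x = select(quads, include={"graph:providerX"})
print(len(without_y), only_x)
```

A real triple store would do this filtering in the query itself, but the principle is the same: the graph name records who said what.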

The diagram below shows some of the potentially linkable data sets. There are a lot more on the Linked Open Data Cloud, but I wanted a reasonably sized diagram. Some of these resources already exist and are interlinked, while others like GBIF, the BHL and the Encyclopedia of Life are either not available or are still in the planning stage.

In summary, the Linked Open Data approach makes the best use of everyone's efforts, reduces data redundancy, and makes additional data sets, of which you might not have been originally aware, findable and usable.

A Species has_many Classifications


An issue that I need to address while creating species concepts is what to do about alternative classifications. One solution is to simply choose one classification. Unfortunately, I can't find one best classification. My current alternatives are the NCBI Taxonomy, ITIS, Catalog of Life 2010, and DBpedia. Each of these alternatives has its strengths and weaknesses.

In the GeoSpecies Knowledge Base, there is only one classification. In addition to species, the GeoSpecies species have a set of links out to URIs for families, orders, classes, phyla and kingdoms. This structure has proven too problematic and difficult to maintain. It also does not accurately represent the reality that a species can have many classifications.

The diversity of alternative classifications might be best represented as links to alternative classification ontologies. Each classification would be modeled as an OWL ontology in Protege. Species concepts would then be linked to the lowest appropriate place in each ontology. I have started creating some of these in Protege, but creating these manually takes a lot of time.

One of these ontologies maps the Catalog of Life 2010 down to the class level. Each of the TaxonConcept species concepts is linked to the Catalog of Life 2010 ontology at the class level, and to the existing DBpedia ontology at Eukaryote, Animal, Mammal, Bird, Insect, or Arachnid, depending on which is most appropriate.

Here is an example of this linking:

In the future, I hope to extend some of these to the order or family level. This would allow systems to infer that if something is in ITIS "Salticidae" then it is an Arthropod, or that if something is in DBpedia "Mammal" it is also a "Eukaryote".
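
The kind of inference I have in mind can be sketched in a few lines of Python. The parent links below are a simplified chain that I typed in for illustration; the real ITIS and DBpedia hierarchies have more levels, and a reasoner working on the OWL ontologies would do this for you.

```python
# Simplified, illustrative parent links, one dictionary per classification.
ITIS_PARENT = {
    "Salticidae": "Araneae",
    "Araneae": "Arachnida",
    "Arachnida": "Arthropoda",
}
DBPEDIA_PARENT = {
    "Mammal": "Animal",
    "Animal": "Eukaryote",
}

def ancestors(taxon, parent_of):
    """Follow parent links upward and return every broader class in order."""
    found = []
    while taxon in parent_of:
        taxon = parent_of[taxon]
        found.append(taxon)
    return found

print(ancestors("Salticidae", ITIS_PARENT))
print(ancestors("Mammal", DBPEDIA_PARENT))
```

Keeping each classification in its own structure means a species concept can be placed in both, and each set of inferences stays attributable to its source.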

Since each of these species concepts links to a number of alternative classifications, the fact that each species can have many classifications is better represented.

I would be interested to hear what others think of this architecture. Is there a potentially better way to represent the same kind of relationships?


Introductory Blog Entry

Welcome to the TaxonConcept Blog! It is my hope that this blog will help me share and discuss various issues of modeling biodiversity data for use in the Linked Open Data Cloud.  

I work in the area of biodiversity informatics, which is the application of informatics techniques to biodiversity information for improved management, presentation, exploration and analysis. (1)  Integrating the large number of disparate data sets that relate to species is a serious challenge. This information occurs in a wide variety of forms, including DNA sequences, images, proteins, and occurrence records. To facilitate large scale analysis and improve understanding, distributed information about species needs to be linked together. Semantic web tools and techniques may allow these different data sets to be connected together into useful knowledge bases. Some background and related issues can be found on the GeoSpecies About Page.

One of the first problems I encountered was that data sets are often tied to a particular combination of genus and specific epithet. This form of identifier serves two purposes. The first is as a form of phylogenetic hypothesis, i.e. this species belongs in this particular genus. The second purpose is to serve as a universal stable identifier for the species. The problem is that these two roles are at odds with each other. How can a species have a stable identifier when a taxonomic revision will change the identifier itself?

A common example of this is with the Eastern Tree Hole Mosquito Ochlerotatus triseriatus / Aedes triseriatus. There are taxonomists who feel that the subgenus Ochlerotatus should be elevated to genus status. The result is that a large number of Aedes mosquitoes are now placed in the genus Ochlerotatus, thereby changing their identifier. Some have argued on Taxacom that the name is the species concept. By this they mean that Aedes triseriatus is not the same thing as Ochlerotatus triseriatus. It is here that we see the split between some taxonomists and some other biologists. If the specimens that are instances of the species concept are the same under either name and the characters used to diagnose the species are the same, then some would argue that this is really the same species. Others argue that the specific phylogenetic hypothesis is what defines the species.

How does this affect how related data can be appropriately interpreted?  Species occurrence records play a major part in biodiversity information. If the species is the phylogenetic hypothesis, then records of Ochlerotatus triseriatus and Aedes triseriatus are occurrences of two different things. You should have one map of Ochlerotatus triseriatus occurrences, and another map of Aedes triseriatus occurrences.

Since the names associated with an occurrence record in GBIF or some other database are usually assigned by the person who identified the specimen, it seems important to understand how they decided what name goes with what specimen. It is my experience with mosquitoes and many other kinds of organisms that the name is assigned by matching the specimen to a particular collection of characters. The name assigned to the specimen is determined by where the characters place it in the key, or by where its DNA sequence places it in BLAST. It is my assertion that the vast majority of occurrence records in GBIF and other repositories have their species name assignments made in this manner.

Although phylogeny often has a role in the construction of a key, the vast majority of collection records are labeled based on matching characters to a pattern. In other words, a particular phylogenetic hypothesis is not what is being considered when a name is assigned. Assuming that the actual characters involved and the specimens entailed are the same, a change in the phylogenetic hypothesis does not change the mental process that is being used to assign names to specimens. It is as if the old name can be crossed off and the new name substituted - the underlying "species concept" remains the same.

In order to properly interpret species occurrence records and other data records, it seemed that there needed to be a system of identifiers for the species itself separate from any particular phylogenetic hypothesis. This is saying a set of specimens exist that cluster into this group based on these specific criteria. These specimens can be considered instances of the species concept. Various humans may hypothesize that a species concept has a particular phylogeny, but that assertion exists as a statement separate from the species concept itself.
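
A toy Python example may help show the difference this makes to occurrence data. The concept identifier and the occurrence records are invented; only the two mosquito names come from the discussion above.

```python
from collections import defaultdict

# Hypothetical concept identifier; the two names are labels attached to it.
CONCEPT = "concept:tree-hole-mosquito"
NAME_TO_CONCEPT = {
    "Aedes triseriatus": CONCEPT,
    "Ochlerotatus triseriatus": CONCEPT,
}

# Invented occurrence records: (record id, name written on the label).
records = [
    ("occ:1", "Aedes triseriatus"),
    ("occ:2", "Ochlerotatus triseriatus"),
    ("occ:3", "Aedes triseriatus"),
]

def group_by_name(records):
    """One group per name string, as when the name is the identifier."""
    groups = defaultdict(list)
    for record_id, name in records:
        groups[name].append(record_id)
    return dict(groups)

def group_by_concept(records, name_to_concept):
    """One group per species concept, whatever name was on the label."""
    groups = defaultdict(list)
    for record_id, name in records:
        groups[name_to_concept[name]].append(record_id)
    return dict(groups)

by_name = group_by_name(records)
by_concept = group_by_concept(records, NAME_TO_CONCEPT)
print(len(by_name), len(by_concept))
```

Grouping by name splits the records into two maps, while grouping by concept puts all three on one map. A later revision only adds another label for the same concept.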

I started minting species concepts for the Linked Open Data web as part of the GeoSpecies Knowledge Base. In the process I learned a number of things: some feel that species should be modeled as a class, others think it should be modeled as an instance. I also came to the conclusion that it might be best to separate the documentation of these species concepts from sites that make assertions about those concepts. This would allow end users to  segment LOD species data through the use of named graphs. It also seemed like a good idea to separate the species concept functionality into a separate domain that could be easily transferred to another entity, if that was required. To achieve these goals and requirements I have created TaxonConcept.org. 

In future blog posts I will elaborate on the features of this conceptualization and solicit feedback on particular methods of linking related data to these concepts.  

Thank you very much for visiting. I look forward to your comments.

Pete DeVries


* URI: Uniform Resource Identifier