Welcome to the TaxonConcept Blog! It is my hope that this blog will help me share and discuss various issues of modeling biodiversity data for use in the Linked Open Data Cloud.
I work in the area of biodiversity informatics, which is the application of informatics techniques to biodiversity information for improved management, presentation, exploration and analysis. (1) Integrating the large number of disparate data sets that relate to species is a serious challenge. This information occurs in a wide variety of forms, including DNA sequences, images, proteins, and occurrence records. To facilitate large-scale analysis and improve understanding, distributed information about species needs to be linked together. Semantic web tools and techniques may allow these different data sets to be connected into useful knowledge bases. Some background and related issues can be found on the GeoSpecies About Page.
One of the first problems I encountered was that data sets are often tied to a particular combination of genus and specific epithet. This form of identifier serves two purposes. The first is as a form of phylogenetic hypothesis, i.e. this species belongs in this particular genus. The second purpose is to serve as a universal stable identifier for the species. The problem is that these two roles are at odds with each other. How can a species have a stable identifier when a taxonomic revision will change the identifier itself?
A common example of this is the Eastern Tree Hole Mosquito, Ochlerotatus triseriatus / Aedes triseriatus. There are taxonomists who feel that the subgenus Ochlerotatus should be elevated to genus status. The result is that a large number of Aedes mosquitoes are now placed in the genus Ochlerotatus, thereby changing their identifier. Some have argued on Taxacom that the name is the species concept. By this they mean that Aedes triseriatus is not the same thing as Ochlerotatus triseriatus. It is here that we see the split between some taxonomists and some other biologists. If the specimens that are instances of the species concept are the same under either name, and the characters used to diagnose the species are the same, then some would argue that this is really the same species. Others argue that the specific phylogenetic hypothesis is what defines the species.
How does this affect how related data can be appropriately interpreted? Species occurrence records play a major part in biodiversity information. If the species is the phylogenetic hypothesis, then records of Ochlerotatus triseriatus and Aedes triseriatus are occurrences of two different things. You should have one map of Ochlerotatus triseriatus occurrences, and another map of Aedes triseriatus occurrences.
Since the names associated with an occurrence record in GBIF or some other database are usually assigned by the person who identified the specimen, it seems important to understand how they decided what name goes with what specimen. It is my experience with mosquitoes and many other kinds of organisms that the name is assigned by matching the specimen to a particular collection of characters. The name assigned to the specimen is determined by where the characters place it in the key, or by where its DNA sequence places it in BLAST. It is my assertion that the vast majority of occurrence records in GBIF and other repositories have their species name assignments made in this manner.
Although phylogeny often has a role in the construction of a key, the vast majority of collection records are labeled based on matching characters to a pattern. In other words, a particular phylogenetic hypothesis is not what is being considered when a name is assigned. Assuming that the actual characters involved and the specimens entailed are the same, a change in the phylogenetic hypothesis does not change the mental process that is being used to assign names to specimens. It is as if the old name can be crossed off and the new name substituted - the underlying "species concept" remains the same.
In order to properly interpret species occurrence records and other data records, it seemed that there needed to be a system of identifiers for the species itself, separate from any particular phylogenetic hypothesis. Such an identifier says only that a set of specimens exists that clusters into this group based on these specific criteria. These specimens can be considered instances of the species concept. Various humans may hypothesize that a species concept has a particular phylogeny, but that assertion exists as a statement separate from the species concept itself.
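To make the difference concrete, here is a minimal Python sketch of occurrence records grouped first by name and then by a name-independent concept identifier. Everything in it is an assumption made for illustration: the identifier string, the `Occurrence` record layout, and the coordinates are invented and are not taken from GBIF or from any real service.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical concept identifier, invented for this example.
CONCEPT_ID = "concept:eastern-tree-hole-mosquito"

# Each name is recorded as a separate statement about the same concept.
NAME_TO_CONCEPT = {
    "Aedes triseriatus": CONCEPT_ID,
    "Ochlerotatus triseriatus": CONCEPT_ID,
}


@dataclass(frozen=True)
class Occurrence:
    name: str   # name written on the label by the identifier
    lat: float  # made-up coordinates, for illustration only
    lon: float


RECORDS = [
    Occurrence("Aedes triseriatus", 43.07, -89.40),
    Occurrence("Ochlerotatus triseriatus", 41.88, -87.63),
    Occurrence("Aedes triseriatus", 39.96, -83.00),
]


def group_by_name(records):
    """One group (one map) per name string."""
    groups = defaultdict(list)
    for record in records:
        groups[record.name].append(record)
    return dict(groups)


def group_by_concept(records, name_to_concept):
    """One group (one map) per species concept, whatever name was used."""
    groups = defaultdict(list)
    for record in records:
        groups[name_to_concept[record.name]].append(record)
    return dict(groups)


by_name = group_by_name(RECORDS)
by_concept = group_by_concept(RECORDS, NAME_TO_CONCEPT)
print(len(by_name), "maps when keyed by name")
print(len(by_concept), "map when keyed by concept")
```

Keyed by name, the same mosquito ends up on two maps. Keyed by concept, a taxonomic revision only adds an entry to the name lookup and the occurrence records stay together.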
I started minting species concepts for the Linked Open Data web as part of the GeoSpecies Knowledge Base. In the process I learned a number of things: some feel that species should be modeled as a class, others think it should be modeled as an instance. I also came to the conclusion that it might be best to separate the documentation of these species concepts from sites that make assertions about those concepts. This would allow end users to segment LOD species data through the use of named graphs. It also seemed like a good idea to separate the species concept functionality into a separate domain that could be easily transferred to another entity, if that was required. To achieve these goals and requirements I have created TaxonConcept.org.
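As a rough illustration of what segmenting by named graph could look like, the sketch below stores statements as quads (subject, predicate, object, graph) in plain Python and filters them by graph name. The `ex:` identifiers, predicate names, and graph names are all hypothetical. This is not the actual TaxonConcept.org vocabulary or layout, and a real deployment would use an RDF store rather than a list.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # the named graph the statement belongs to


# Hypothetical identifiers, invented for this example.
CONCEPT = "ex:concept/eastern-tree-hole-mosquito"
CONCEPT_GRAPH = "ex:graph/concepts"
SOURCE_A = "ex:graph/source-a"
SOURCE_B = "ex:graph/source-b"

QUADS = [
    # Documentation of the concept itself lives in its own graph.
    Quad(CONCEPT, "ex:commonName", "Eastern Tree Hole Mosquito", CONCEPT_GRAPH),
    # Assertions about the concept live in graphs owned by whoever makes them.
    Quad(CONCEPT, "ex:placedInGenus", "Aedes", SOURCE_A),
    Quad(CONCEPT, "ex:placedInGenus", "Ochlerotatus", SOURCE_B),
]


def select_graphs(quads, graphs):
    """Keep only the statements that come from the chosen named graphs."""
    wanted = set(graphs)
    return [q for q in quads if q.graph in wanted]


def genera(quads):
    """Collect every genus placement asserted in the given statements."""
    return {q.obj for q in quads if q.predicate == "ex:placedInGenus"}


view_a = select_graphs(QUADS, [CONCEPT_GRAPH, SOURCE_A])
view_b = select_graphs(QUADS, [CONCEPT_GRAPH, SOURCE_B])
print(genera(view_a))
print(genera(view_b))
```

Both views share the same concept documentation, and each end user picks which set of assertions about that concept to load alongside it.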
In future blog posts I will elaborate on the features of this conceptualization and solicit feedback on particular methods of linking related data to these concepts.
Thank you very much for visiting. I look forward to your comments.
* URI: Uniform Resource Identifier