{"id":804,"date":"2018-09-29T02:55:14","date_gmt":"2018-09-29T02:55:14","guid":{"rendered":"http:\/\/oceansofbiodiversity.blogs.auckland.ac.nz\/?p=804"},"modified":"2024-12-06T11:31:41","modified_gmt":"2024-12-06T10:31:41","slug":"tips-when-selecting-biodiversity-data-from-gbif-and-obis","status":"publish","type":"post","link":"https:\/\/site.nord.no\/oceansofbiodiversity\/2018\/09\/29\/tips-when-selecting-biodiversity-data-from-gbif-and-obis\/","title":{"rendered":"Tips when selecting biodiversity data from GBIF and OBIS"},"content":{"rendered":"<p><span style=\"color: #0000ff\"><strong>What kind of data are available?<\/strong><\/span><br \/>\nThe data comprise records of the occurrence of a species or higher taxon (sometimes animals and plants can only be identified to genus or family) at a particular place and date. Most records include latitude and longitude coordinates. An increasing number of records have additional information, such as who collected and identified the species, its depth or altitude, and other associated information.<br \/>\nOBIS has data on about half of the 240,000 named marine species, and GBIF probably has records of over half of all named species on Earth. The number of records is about 50 million in OBIS and 1 billion in GBIF. Some records go back centuries, but most data are from recent decades, with a time-lag in data entry of about five years.<br \/>\nThe records come from specimens and field observations, and cover all kinds of animals, plants and microbes.<br \/>\n<span style=\"color: #0000ff\"><strong>For marine species, should I use OBIS or GBIF?<\/strong><\/span><br \/>\nIf searching for a marine taxon, use both. Although many datasets are published in both, some are only in one, and many largely terrestrial datasets may have some useful marine records (e.g. herbarium or museum collections).<br \/>\nBecause both GBIF and OBIS use the same data standards and formats, you can merge the downloaded files. 
Next, remove duplicate datasets by searching on dataset name or ID and deleting older versions. Then sort the records, check whether any appear to be duplicates, and delete those too. Removing duplicates reduces the size of your dataset, which enables faster analyses.<br \/>\n<span style=\"color: #0000ff\"><strong>What format are the data in? <\/strong><\/span><br \/>\nYou can download the data as a standard tabular data sheet. This is based on the international standard \u201cDarwin Core\u201d format used by GBIF, OBIS and other biodiversity databases, which makes it easy to integrate datasets.<br \/>\n<span style=\"color: #ff0000\"><em>You should become familiar with this data format if you wish to publish your own data at some point. <\/em><\/span><br \/>\n<span style=\"color: #0000ff\"><strong>What are datasets?<\/strong><\/span><br \/>\nOBIS and GBIF are compilations of thousands of datasets. Datasets can be anything from annual fishery research trawl data, plankton net surveys, benthic samples and bird counts to satellite-tracked paths of whales and turtles, citizen science records, and museum specimen collections. Think of GBIF and OBIS as a journal containing many \u2018papers\u2019. Thus you need to cite each dataset used, as you would papers in a journal.<br \/>\n<span style=\"color: #0000ff\"><strong>Why must I cite the datasets?<\/strong><\/span><br \/>\nIt is a condition of use for almost all datasets that you cite them. Citing OBIS or GBIF is not sufficient (it is like citing a journal and not a paper published in it). In fact, you are breaking the conditions of use if you do not cite the datasets used; to put it more bluntly, it is illegal to use the data if you do not cite the datasets.<br \/>\nKeep track of the datasets you use by copying their citations and DOIs (unique identifiers). 
If you used only a few datasets, cite them in the references alongside other publications; if you used tens or hundreds, cite them in an appendix.<br \/>\nYou will notice that many datasets do not have sensible citations provided. Some even say \u201ccite this dataset\u201d. Some cite a source print publication. However, an increasing number now provide a conventional author-year-title-source citation to which you add the date accessed. You must add the date accessed because datasets may be amended over time and so exist in multiple versions.<br \/>\n<span style=\"color: #0000ff\"><strong>What should I do next to learn more? <\/strong><\/span><br \/>\nDownload some data.\u00a0Select your taxon or geographic area of interest, for example a species or higher taxon, or a country or other geographic region. Look at the information on the web page: how many records and datasets are there, and does the coverage on the map look sufficient for your purpose? If it looks potentially useful, then download the data file.<br \/>\n<span style=\"color: #0000ff\"><strong>Should I make available the data I used? <\/strong><\/span><br \/>\nProbably yes, because you have selected particular data for your purpose. This compilation of data is unique to your research. To make your work reproducible, you should deposit the data you used in an open access archive, e.g., Figshare. Do this when or before you publish the results of your analyses.<br \/>\nDo not just publish your data as a PDF; use the same standard format (comma-separated values file, \u201c.csv\u201d) you received them in so other people can more easily re-use your data for their purpose. Then they are likely to cite your publication.<br \/>\nIf you added data to those you obtained from GBIF or OBIS, such as from the literature, your field records, or other unpublished sources, then include these in the dataset you publish. Each row of the datasheet is a \u2018record\u2019 and notes its origin. 
If these data are not already in GBIF and OBIS, then send them to one of their nodes to publish on your behalf.<br \/>\n<span style=\"color: #0000ff\"><strong>What quality assurance checks should I do? <\/strong><\/span><br \/>\nGBIF and OBIS increasingly provide indicators of completeness and other quality assurance checks on their data. However, you need to do your own checks because only you know what data are suitable for your purpose. The following checks are recommended:<\/p>\n<ol>\n<li>Remove duplicate datasets. Both old and new versions of a dataset may occur in GBIF.<\/li>\n<li>Remove duplicate records. The same record can be published through more than one dataset. If a record has the same species, latitude, longitude and collection date as another, it is likely a duplicate.<\/li>\n<li>Check taxonomic nomenclature. For marine species you can use the \u2018taxon match\u2019 tool on WoRMS or Lifewatch to check which names are synonyms and to organise them in a standard taxonomic classification. Names that are not \u2018matched\u2019 may be misspelt or mistaken names or not marine taxa. For non-marine taxa, the best source for checking names is the Catalogue of Life.<\/li>\n<li>Check temporal resolution. Do you want to use all records over all time, or only recent ones?<\/li>\n<li>Check spatial resolution. There are fields in the data sheet for the geographic precision of each record. You have to decide whether to accept only records with a particular geographic accuracy (e.g., 10 km) and whether to keep records with no data on precision.<\/li>\n<li>Map the data points. Do some look like outliers? Check their metadata and source. Does the place name match the latitude and longitude coordinates? If the point seems questionable you may decide to omit it from your analysis.<\/li>\n<li>Do points for marine species appear on land, or terrestrial species appear in the ocean? 
This could be because the location is associated with an island or country and the exact point is unknown. You need to decide whether such points are useful or not for your purpose.<\/li>\n<\/ol>\n<p>You can create a table of the number of records and species downloaded, and show the reduction in both with each step in this data filtering (sometimes called cleaning).<br \/>\n<strong>Where can I find definitions of variables in the datasheet?<\/strong><br \/>\nThe variables are known as &#8216;Darwin Core Terms&#8217;, and their definitions can be read here: <a href=\"http:\/\/tdwg.github.io\/dwc\/terms\/index.htm\">http:\/\/tdwg.github.io\/dwc\/terms\/index.htm<\/a><br \/>\n<span style=\"color: #0000ff\"><strong>Can I reduce the size of the dataset and still do the analysis?<\/strong><\/span><br \/>\nIf doing a regional or global analysis, you can aggregate records to larger spatial cells, such as the 5<sup>o<\/sup> latitude and longitude cells commonly used for global analyses. Then perhaps all you want to know is which species are present in each 5<sup>o<\/sup> cell. Thus you can reduce the dataset size from many records of the same species to one record of the species per cell. You may also wish to know the total number of records per cell to have an indicator of sampling effort.<br \/>\n<span style=\"color: #0000ff\"><strong>What to do about sampling bias?<\/strong><\/span><br \/>\nAll sampling is biased. This bias includes:<\/p>\n<ol>\n<li>Sampling methods that target particular taxa (e.g., spotting whales, sediment cores, plankton nets). Even a single method will vary in its efficacy in detecting different species and life-stages. Work is underway to extend GBIF and OBIS metadata so that users can select datasets that used similar methods; at present you can do this by checking dataset metadata, or by using common species as indicators to find datasets that used comparable methods.<\/li>\n<li>Sample sizes vary. 
Even when using similar methods, survey area or time may vary, nets may have different mesh size and towing speed, transects and quadrats may have different areas sampled, traps may be deployed for different time periods, etc.<\/li>\n<li>Sampling is spatially biased. Some places are sampled more than others, for many reasons.<\/li>\n<li>Sampling is temporally biased.<\/li>\n<\/ol>\n<p>Sampling bias is not a problem as long as you use the data with the bias in mind and interpret the results accordingly. All sampling is in practice \u201cstratified\u201d in some way, such as to a particular environment, habitat, taxonomic group (guild), or size group. So be up front about the scope of your work: what has and has not been studied.<br \/>\n<span style=\"color: #0000ff\"><strong>Can I get and use species abundance data?<\/strong><\/span><br \/>\nSpecies abundance data are increasingly available in OBIS and GBIF. However, abundance data are highly dependent on sampling method, effort and size. Alternatively, one can use the number of locations of a species in an area (e.g., 5<sup>o<\/sup> cell) at a particular time (e.g., year) to indicate its spatial abundance. This is less sensitive to sampling bias. This is also called species occurrence, incidence or presence.<br \/>\n<span style=\"color: #0000ff\"><strong>Why do most people use species \u201cpresence-only\u201d data?<\/strong><\/span><br \/>\nSpecies presence is far less sensitive to sampling bias, and is the most usual metric of diversity in biogeographic studies. From it, one can calculate a variety of measures of species richness, checklists of species, and change in species composition over time and space. Change in species composition is also called species turnover or beta diversity.<br \/>\nPresence-only data are often incorrectly called presence-absence data. In fact, whether a species is truly absent is often unknown. 
However, it can sometimes be reasonable to consider a species absent if it is known that it would have been sampled if present.<br \/>\n<span style=\"color: #0000ff\"><strong>Which species are most interesting to study?<\/strong><\/span><br \/>\nMany species have important attributes, such as being listed as threatened, endangered, extinct, introduced, and\/or invasive. Others are important as food or may be pests or vectors of diseases. Many species provide habitat for others (e.g. corals, trees), and others are top predators that may control the abundance of other species in a food web.<br \/>\nMost species are rare, and often their role in an ecosystem is unknown. However, geographic rarity and endemicity place species at greater risk of extinction.<br \/>\n<span style=\"color: #0000ff\"><strong>Where do I get more details on how to analyse data?<\/strong><\/span><br \/>\nFor how to use OBIS, see <a href=\"https:\/\/classroom.oceanteacher.org\/course\/view.php?id=349\">https:\/\/classroom.oceanteacher.org\/course\/view.php?id=349<\/a> and <a href=\"http:\/\/iobis.org\/manual\/\">http:\/\/iobis.org\/manual\/<\/a><br \/>\nSee also\u00a0the <a href=\"https:\/\/github.com\/iobis\/obistools\">obistools R<\/a> package with the <a href=\"https:\/\/github.com\/iobis\/xylookup\">XYlookup service<\/a> (returns shore distance, bathymetry, sea surface salinity and sea surface temperature for a given coordinate).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>What kind of data are available? The data comprise records of the occurrence of a species or higher taxon (sometimes animals and plants can only be identified to genus or family) at a particular place and date. Most records include latitude and longitude coordinates. 
An increasing number of records have additional information, such [&hellip;]<\/p>\n","protected":false},"author":96,"featured_media":812,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,28,8],"tags":[],"coauthors":[21],"class_list":["post-804","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-for_graduate_students","category-important-tips","category-resource"],"_links":{"self":[{"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/posts\/804","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/users\/96"}],"replies":[{"embeddable":true,"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/comments?post=804"}],"version-history":[{"count":1,"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/posts\/804\/revisions"}],"predecessor-version":[{"id":2132,"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/posts\/804\/revisions\/2132"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/media\/812"}],"wp:attachment":[{"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/media?parent=804"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/categories?post=804"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/tags?post=804"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/site.nord.no\/oceansofbiodiversity\/wp-json\/wp\/v2\/coauthors?post=804"}],"curies":[{"name":"wp","href":"https:\/\/api.
w.org\/{rel}","templated":true}]}}