Limitations and Cautions for Using the Data Sources

Five major limitations and cautions need to be considered when using water quality data sources. The limitations include:

Disparate data formats
This can be a real problem when attempting to cross-compare or analyze data. For example, data that the user retrieves from a federal database may arrive in different formats. As a result, the user would need to develop a common format to compile the necessary data and reduce the data set to cover the specific area of concern. The user needs to be aware that this is not a quick process when trying to develop a comprehensive local database. However, when looking for specific data for a specific reason, retrieving the data from the databases and formatting them for examination should not be too difficult or time-consuming.

Data duplication
Data may be submitted to national databases (e.g., STORET) by state agencies and still be contained in the state agencies' databases. If data are downloaded from the national databases and then from state and local databases, there is a risk that the same records will be retrieved more than once.

Data bias and data gaps
Since a study can be based on existing data sets, the nature of the data and the sampling plans and objectives behind them need to be understood. Documenting each data set to this level would be difficult. The types of problems that might be present in the underlying data include biases due to contagious sampling. For example, biases may arise if a problem was identified within a system, and the system was then sampled extensively near the problem point but was not sampled away from the problem or in systems that did not have problems. Inclusion of all these data can seriously affect all summary statistics and could result in criteria that represent severely impacted sites. Other examples of bias that arise from grouping data from many studies include pooling data from intensive short-term studies with data from low-intensity long-term studies.
An example would be a study that sampled fixed stations once each season for 25 years versus a study that sampled each station daily for one year. Pooling these data sets can result in unintended biases in distributional parameters and characteristics.

Use of the data statistically should be approached with caution
Once states compile data sources, they must take into account several cautions regarding data bias during statistical analysis. The program under which the data were collected and the purpose of data collection are of critical importance. The objectives of a monitoring program or data collection effort strongly dictate the sampling program that is employed. Taken alone, each data set can be analyzed using standard statistical methodology. However, when data collected under different programs are merged, the differences in the statistical sampling, the analysis, and the aggregation of the data can be critical. Ignoring these issues can result in erroneous conclusions.

For example, imagine a very important water body in a large community that supports recreation, education, industry, and potable water uses. If the water body is very large, it is likely that many users will be monitoring the water quality. Some of the monitoring will be based on regulatory compliance and therefore will have strict periodicity, analytical methods, and purposes. A local university that is interested in understanding ecosystem conditions might undertake regional monitoring. A beach might be monitored for basic water parameters and microbiology. At the same time, a scout camp on the shore of the water body might have science projects conducted by the scouts. In all cases, the monitoring has a specific objective and planned sampling in both space and time. The analyses are performed by parties ranging from well-meaning citizens to professional, certified laboratories.
An analyst aggregating these results in an effort to understand the water quality of the water body could take the hundreds of measurements made with uncalibrated instruments and average them with the single quarterly measurement made for compliance monitoring, and end up with an erroneous result. This may happen because the abundant data from some sources (e.g., citizen groups) far outnumber the quarterly data measured by the professional laboratory. Similarly, the abundant compliance data will be specific to the intake or the outflow from a plant or industry. These data are highly biased. If combined with a research program's randomized sampling plan that covers the entire water body, the compliance data will overwhelm the sparse but highly valued randomized data collection. Giving these results the same weight will result in an incorrect inference about the overall quality of the water body.

Quality control of the data
Similarly, the level of quality assurance performed is critical, as illustrated
in the example above. The amount of quality control will be dictated by
the questions being addressed and the budgets available. An example is
the dissolved oxygen concentration in a water body. If one researcher
is interested in anoxic conditions, the precise concentrations are not
as important as the presence or absence of some oxygen. Therefore, the
calibration of the instrument or the titration methodology may not be
as critical as it would be for a researcher looking at hypoxic conditions
for which highly sensitive calibration is necessary. Mixing these two
types of data for a given water body or sampling station could lead to
results that are not correct or defensible.

RECOMMENDATIONS FOR USING WATER QUALITY DATA SOURCES

The data contained in the source water quality database inventory are a critical resource that needs to be approached with caution. Often users know that data are available but are unsure how to access them, how to avoid data duplication and data bias, and how to analyze data from disparate data sets. Some basic recommendations to assist users with the data mining and data management process, along with approaches for managing the data limitations, are discussed below.

The Data Mining Process
Before mining water quality data, it is necessary to evaluate what questions
are to be answered and what the ultimate use for the data will be. If
there are only one or two questions and the data will not be revisited
in the future to answer more questions, then simply downloading the locations,
time periods, and parameters, and compiling the data in something as simple
as Microsoft Excel, may be the most efficient process.

Protocol for database development that can assist with the issue of data from disparate formats
Developing a comprehensive database requires some data access skills and protocols. Downloading data is usually a straightforward task; however, organizing and formatting the data into a framework that is useful for retrieval requires effort, repetitive actions, and attention to detail. An organization's staff may change throughout this process, so it is imperative to develop protocols that state exactly how the data are formatted for the database and how additional pieces of information (or fields) are added to the data to make them relevant to the database. If the user is downloading and formatting a lot of data, developing automated code for formatting or uploading data can save significant amounts of time and reduce operator error.

Advice and/or constraints for reviewing data sets
When obtaining data from the various databases, there are a number of ways to collect them. First, the data can be downloaded in bulk, and the user can format and sort through them later. This is usually the most time-intensive of the choices. Second, the user can examine which locations have sufficient data to meet the data objectives (i.e., number of observations, time period, geographical locations, etc.) and then download those specific files. For example, if the user is interested in nutrient data for the Delaware River between 1990 and 1999, the data download can be narrowed from hundreds of locations to fewer than a dozen without a lot of work. Then it can be determined whether the data are sufficient for the specific areas of interest in the watershed and whether the number of observations is sufficient. In some cases, even these requests may be too large and unwieldy for the computers or Internet connection to handle the downloads in a timely manner.
Therefore, it may be necessary to consider dividing requests and downloads into smaller pieces to recombine at a later time. This is more time-consuming and increases the chances of error and duplication, so careful precautions must be followed.

Customize information from the database
Developing some background information for each database would be useful, as it can be difficult to constantly manipulate the national databases and apply filters to the data. For example, PWD downloaded and formatted as much data as possible into its own database, so that database possesses some unique attributes that other databases may lack. PWD obtained the latitude and longitude for each site and assigned a table with the watershed, stream, and subwatershed to each site in an effort to provide querying capabilities at various levels. PWD also developed queries by date and by parameter, and can automatically bring up data in formats suited to long-term trends. In addition, the data source and organization are recorded to enable a number of ways to filter data, as certain data sets may bias the statistical analysis. For example, if the user only wants data for a specific water body and time period, does not want data collected by volunteer organizations, and only wants sites located between certain stream or river miles, the database is flexible enough to accommodate such requests. PWD found that many data sets lack strong geographical components, so it suggests using GIS skills to enhance each data set significantly.

Conduct some basic exploratory analyses on the different data sets to assess any potential data bias/data gaps and/or data duplication
Exploratory analyses can result in an understanding of these potential problems and possible sources of bias in the distributions. Based on the exploratory analyses, transformations and aggregations of the data can be performed, which will help identify appropriate statistical methodologies for evaluating the hypotheses.
The exploratory analyses will focus on the development of distributions for all of the data collected as well as for some categorized data. The categories can be developed statistically or from established expectations. This approach should also help with the limitations of using the data statistically. Since data may be based on various sampling techniques, it is imperative to research those techniques and begin with the lowest level of analysis to document any similarities or major dissimilarities between the data sets.
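
As a minimal sketch of the exploratory checks described above (an illustration only, not a method taken from the source), the following Python example works through a small set of made-up records. It first removes records that appear twice because the same sample was downloaded from two databases, then summarizes the remaining values by collecting organization, and finally compares the pooled mean with the mean of the per-organization means. The record layout, the organization and station names, the parameter label, and every value are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records (all values invented for illustration):
# (download_source, collecting_org, station, date, parameter, value)
RECORDS = [
    ("national", "state_agency", "S1", "1995-04-01", "DO_mg_L", 8.0),
    # Same sample as above, retrieved a second time from the state database.
    ("state", "state_agency", "S1", "1995-04-01", "DO_mg_L", 8.0),
    ("state", "state_agency", "S1", "1995-07-01", "DO_mg_L", 6.0),
    ("local", "volunteer", "S2", "1995-07-01", "DO_mg_L", 3.0),
    ("local", "volunteer", "S2", "1995-07-02", "DO_mg_L", 3.5),
    ("local", "volunteer", "S2", "1995-07-03", "DO_mg_L", 2.5),
    ("local", "volunteer", "S2", "1995-07-04", "DO_mg_L", 3.0),
]


def drop_duplicates(records):
    """Keep the first record for each (org, station, date, parameter, value)."""
    seen = set()
    unique = []
    for rec in records:
        key = rec[1:]  # ignore which database the record was downloaded from
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def summarize_by_org(records):
    """Return count, mean, and median of the values for each organization."""
    groups = defaultdict(list)
    for _source, org, _station, _date, _parameter, value in records:
        groups[org].append(value)
    return {
        org: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for org, vals in groups.items()
    }


unique = drop_duplicates(RECORDS)
summary = summarize_by_org(unique)

# Pooled mean: every record counts equally, so abundant sources dominate.
pooled_mean = mean(rec[5] for rec in unique)
# Mean of per-organization means: each organization counts equally.
mean_of_org_means = mean(stats["mean"] for stats in summary.values())

print(f"records before/after de-duplication: {len(RECORDS)}/{len(unique)}")
for org, stats in sorted(summary.items()):
    print(org, stats)
print(f"pooled mean: {pooled_mean:.2f}")
print(f"mean of organization means: {mean_of_org_means:.2f}")
```

With these hypothetical values, the pooled mean falls below the mean of the two organization means because the volunteer group contributes four of the six unique records. Neither figure is offered as the correct summary; a gap between them is simply a prompt to look into the sampling plans and objectives behind each data set before pooling.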