
Goals and use

The goal of NUSAP is to discipline and structure the critical appraisal of the knowledge base behind quantitative policy-relevant scientific information. The basic idea is to qualify quantities using the five qualifiers of the NUSAP acronym: Numeral, Unit, Spread, Assessment, and Pedigree. By adding expert judgment of reliability (Assessment) and systematic multi-criteria evaluation of the production process of numbers (Pedigree), NUSAP extends the statistical approach to uncertainty (inexactness) with the methodological (unreliability) and epistemological (ignorance) dimensions. By providing a separate qualification for each dimension of uncertainty, it enables flexibility in their expression. By means of NUSAP, nuances of meaning about quantities can be conveyed concisely and clearly, to a degree that is quite impossible with statistical methods alone.

We will discuss the five qualifiers in turn. The first is Numeral; this will usually be an ordinary number, but when appropriate it can be a more general quantity, such as the expression "a million" (which is not the same as the number lying between 999,999 and 1,000,001). Second comes Unit, which may be of the conventional sort, but which may also contain extra information, such as the date at which the unit is evaluated (most commonly with money). The middle category is Spread, which generalizes from the "random error" of experiments or the "variance" of statistics. Although Spread is usually conveyed by a number (either ±, %, or "factor of"), it is not an ordinary quantity, for its own inexactness is not of the same sort as that of measurements. Methods to address Spread include statistical data analysis, sensitivity analysis, and Monte Carlo analysis, possibly in combination with expert elicitation.
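As a minimal illustration of how the Spread qualifier might be quantified by Monte Carlo analysis, the sketch below propagates two hypothetical input distributions (an emission factor and activity data, invented for illustration) through a simple multiplicative model and reports the output spread as a relative standard deviation.

```python
# Minimal Monte Carlo sketch for the Spread qualifier.
# Parameter names, distributions, and the unit are hypothetical illustrations,
# not taken from the NUSAP literature.
import numpy as np

rng = np.random.default_rng(seed=1)
n = 100_000

# Hypothetical inputs: an emission factor (kg per unit of activity) and activity data.
emission_factor = rng.normal(loc=2.5, scale=0.25, size=n)   # ~10% relative spread (1 sigma)
activity = rng.normal(loc=1.0e6, scale=5.0e4, size=n)       # ~5% relative spread (1 sigma)

emissions = emission_factor * activity  # simple multiplicative model

mean = emissions.mean()
std = emissions.std()
print(f"Numeral: {mean:.3g}")
print("Unit: kg/yr (hypothetical)")
print(f"Spread: +/- {100 * std / mean:.1f}% (1 sigma, from Monte Carlo)")
```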

The remaining two qualifiers constitute the more qualitative side of the NUSAP expression. Assessment expresses qualitative judgments about the information. In the case of statistical tests, this might be the significance level; in the case of numerical estimates for policy purposes, it might be the qualifier "optimistic" or "pessimistic". In some experimental fields, information is given with two ± terms, of which the first is the spread, or random error, and the second is the "systematic error", which must be estimated on the basis of the history of the measurement, and which corresponds to our Assessment. It might be thought that the "systematic error" must always be less than the "experimental error", or else the stated "error bar" would be meaningless or misleading. But the "systematic error" can be well estimated only in retrospect, and then it can give surprises.

Finally there is P for Pedigree, which conveys an evaluative account of the production process of information, and indicates different aspects of the underpinning of the numbers and the scientific status of the knowledge used. Pedigree is expressed by means of a set of pedigree criteria to assess these different aspects. Assessment of pedigree involves qualitative expert judgment. To minimize arbitrariness and subjectivity in measuring strength, a pedigree matrix is used to code qualitative expert judgments for each criterion into a discrete numeral scale from 0 (weak) to 4 (strong), with linguistic descriptions (modes) of each level on the scale. Each special sort of information has its own aspects that are key to its pedigree, so different pedigree matrices using different pedigree criteria can be used to qualify different sorts of information. Table 1 gives an example of a pedigree matrix for emission monitoring data. An overview of pedigree matrices found in the literature is given in the pedigree matrices section of http://www.nusap.net. Risbey et al. (2001) document a method to draft pedigree scores by means of expert elicitation. Examples of questionnaires used for eliciting pedigree scores can be found at http://www.nusap.net.

| Score | Proxy representation | Empirical basis | Methodological rigour | Validation |
|---|---|---|---|---|
| 4 | An exact measure of the desired quantity | Controlled experiments and large sample direct measurements | Best available practice in well-established discipline | Compared with independent measurements of the same variable over long domain |
| 3 | Good fit or measure | Historical/field data; uncontrolled experiments; small sample direct measurements | Reliable method common within established discipline; best available practice in immature discipline | Compared with independent measurements of closely related variable over shorter period |
| 2 | Well correlated but not measuring the same thing | Modelled/derived data; indirect measurements | Acceptable method but limited consensus on reliability | Measurements not independent; proxy variable; limited domain |
| 1 | Weak correlation but commonalities in measure | Educated guesses; indirect approximation; rule-of-thumb estimate | Preliminary methods of unknown reliability | Weak and very indirect validation |
| 0 | Not correlated and not clearly related | Crude speculation | No discernible rigour | No validation performed |

Table 1 Pedigree matrix for emission monitoring data (Risbey et al., 2001; adapted from Ellis et al., 2000a, 2000b).

We will briefly elaborate on the four criteria in this example pedigree matrix.

Proxy representation

Sometimes it is not possible to measure directly the thing we are interested in, or to represent it by a parameter, so some form of proxy measure is used. Proxy refers to how good or close the quantity that we measure or model is to the actual quantity we seek to represent. Think of first-order approximations, oversimplifications, idealizations, gaps in aggregation levels, differences in definitions, non-representativeness, and incompleteness issues.

Empirical basis

Empirical basis typically refers to the degree to which direct observations, measurements and statistics are used to estimate the parameter. Sometimes directly observed data are not available and the parameter or variable is estimated based on partial measurements or calculated from other quantities. Parameters or variables determined by such indirect methods have a weaker empirical basis and will generally score lower than those based on direct observations.

Methodological rigour

Some method will be used to collect, check, and revise the data used for making parameter or variable estimates. Methodological quality refers to the norms for methodological rigour in this process applied by peers in the relevant disciplines. Well-established and respected methods for measuring and processing the data would score high on this metric, while untested or unreliable methods would tend to score lower.

Validation

This metric refers to the degree to which one has been able to cross-check the data and assumptions used to produce the numeral of the parameter against independent sources. In many cases, independent data for the same parameter over the same time period are not available and other data sets must be used for validation. This may require a compromise in the length or overlap of the data sets, or may require use of a related, but different, proxy variable for indirect validation, or perhaps use of data that have been aggregated on different scales. The more indirect or incomplete the validation, the lower it will score on this metric.
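As an illustration of how such judgments might be recorded before visualization or aggregation, the sketch below stores hypothetical 0-4 scores from a few experts for each criterion of Table 1 and summarizes them per criterion. The expert identifiers and scores are invented for illustration only.

```python
# Sketch: recording 0-4 pedigree scores for the Table 1 criteria.
# Expert identifiers and scores are invented, not taken from any study.
CRITERIA = ["Proxy representation", "Empirical basis",
            "Methodological rigour", "Validation"]

# scores[expert][criterion] on the discrete 0 (weak) .. 4 (strong) scale
scores = {
    "expert_A": {"Proxy representation": 3, "Empirical basis": 2,
                 "Methodological rigour": 3, "Validation": 1},
    "expert_B": {"Proxy representation": 4, "Empirical basis": 2,
                 "Methodological rigour": 2, "Validation": 2},
    "expert_C": {"Proxy representation": 3, "Empirical basis": 1,
                 "Methodological rigour": 3, "Validation": 2},
}

for criterion in CRITERIA:
    values = [s[criterion] for s in scores.values()]
    print(f"{criterion:25s} min={min(values)} max={max(values)} "
          f"mean={sum(values) / len(values):.1f}")
```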

Visualizing pedigree scores

In general, pedigree scores will be established using judgments from more than one expert. Two ways of visualizing the results of a pedigree analysis are discussed here: radar diagrams and kite diagrams (Risbey et al., 2001; Van der Sluijs et al., 2001a). An example of both representations is given in Figure 2.

Figure 2   Radar diagram and kite diagram representations of the same pedigree scores for the gas depletion multiplier in the TIMER model, as assessed by 6 experts (Van der Sluijs et al., 2001a).

Both representations use polygons with one axis for each criterion, with 0 at the center and 4 at each corner point of the polygon. In the radar diagrams, a colored line connecting the scores represents the scoring of each expert, whereas a black line represents the average scores.

The kite diagrams follow a traffic light analogy. The minimum scores in each group for each pedigree criterion span the green kite; the maximum scores span the amber kite. The remaining area is red. The width of the amber band represents expert disagreement on the pedigree scores. In some cases the size of the green area was strongly influenced by a single deviating low score given by one of the experts. In those cases the light green kite shows what the green kite would look like if that outlier had been omitted. Note that the algorithm for calculating the light green kite is such that outliers are evaluated per pedigree criterion, so that outliers defining the light green area need not be from the same expert.
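To make this construction concrete, the sketch below computes the kite vertices from a set of expert scores. The expert identifiers and scores are invented, and dropping the single lowest score per criterion is only one simple reading of the outlier rule described above, not necessarily the exact procedure used in the cited studies.

```python
# Sketch: kite-diagram vertices from expert pedigree scores (0-4 scale).
# Expert identifiers and scores are invented; the light-green rule below
# (drop the single lowest score per criterion) is an illustrative reading.
CRITERIA = ["Proxy representation", "Empirical basis",
            "Methodological rigour", "Validation"]

scores = {  # scores[expert][criterion]
    "expert_A": {"Proxy representation": 3, "Empirical basis": 2,
                 "Methodological rigour": 3, "Validation": 1},
    "expert_B": {"Proxy representation": 4, "Empirical basis": 2,
                 "Methodological rigour": 2, "Validation": 2},
    "expert_C": {"Proxy representation": 3, "Empirical basis": 1,
                 "Methodological rigour": 3, "Validation": 2},
    "expert_D": {"Proxy representation": 0, "Empirical basis": 2,
                 "Methodological rigour": 3, "Validation": 2},
}

def kites(scores, criteria):
    green, amber, light_green = {}, {}, {}
    for c in criteria:
        vals = sorted(s[c] for s in scores.values())
        green[c] = vals[0]        # minimum score spans the green kite
        amber[c] = vals[-1]       # maximum score spans the amber kite
        light_green[c] = vals[1]  # green kite with the lowest score dropped
    return green, amber, light_green

green, amber, light_green = kites(scores, CRITERIA)
print("green      :", green)
print("amber      :", amber)
print("light green:", light_green)
```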

A web-tool to produce kite diagrams is available from http://www.nusap.net.

The kite diagrams can be interpreted as follows: the green colored area reflects the (apparent minimal consensus) strength of the underpinning of each parameter; the greener the diagram, the stronger the underpinning. The amber colored zone shows the range of expert disagreement on that underpinning. The remaining area is red: the more red you see, the weaker the underpinning (all according to the assessment by the group of experts represented in the diagram).

A kite diagram captures the information from all experts in the group without the need to average expert opinion. Averaging expert opinion is a controversial issue in elicitation methodologies. A second advantage is that it provides a fast and intuitive overview of parameter strength, preserving key aspects of the underlying information.

Propagation of pedigree in calculations

Ellis et al. (2000) have developed a pedigree calculator to assess propagation of pedigree in a calculation in order to establish pedigree scores for quantities calculated from other quantities. For more information we refer to http://www.esapubs.org/archive/appl/A010/006/default.htm

Diagnostic Diagram

The method chosen to address the spread qualifier (typically sensitivity analysis or Monte Carlo analysis) provides for each input quantity a quantitative metric for uncertainty contribution (or sensitivity), for instance the relative contribution to the variance in a given model output. The pedigree scores can be aggregated (by dividing the sum of the scores on the pedigree criteria by the sum of the maximum attainable scores) to produce a metric for parameter strength. These two independent metrics can be combined in a NUSAP diagnostic diagram.

The diagnostic diagram is based on the notion that neither spread alone nor strength alone is a sufficient measure of quality. Model output can be robust even if parameter strength is low, provided that the model outcome is not critically influenced by the spread in that parameter. In this situation our ignorance of the true value of the parameter has no immediate consequences, because it has a negligible effect on calculated model outputs. Alternatively, model outputs can be robust against parameter spread even if the parameter's relative contribution to the total spread in the model output is high, provided that parameter strength is also high. In the latter case, the uncertainty in the model outcome adequately reflects the inherent irreducible uncertainty in the system represented by the model. In other words, the uncertainty is then a property of the modelled system and does not stem from imperfect knowledge of that system. Mapping model parameters in the diagnostic diagram thus reveals the weakest critical links in the knowledge base of the model with respect to the model outcome assessed, and helps in setting priorities for model improvement.
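A minimal sketch of this combination is given below. The parameter names, pedigree scores, variance contributions, and the priority threshold are illustrative assumptions, not values from the NUSAP literature; only the strength formula (sum of pedigree scores divided by the maximum attainable sum) follows the text above.

```python
# Sketch: placing parameters in a NUSAP diagnostic diagram.
# Pedigree scores, variance contributions, and the threshold are invented.
MAX_SCORE = 4  # maximum attainable score per pedigree criterion

parameters = {
    # parameter: (pedigree scores over the four criteria, relative variance contribution)
    "gas_depletion_multiplier": ([3, 2, 3, 1], 0.45),
    "emission_factor":          ([4, 3, 3, 2], 0.10),
    "activity_data":            ([2, 1, 2, 1], 0.35),
}

for name, (pedigree, variance_share) in parameters.items():
    # strength = sum of pedigree scores / sum of maximum attainable scores
    strength = sum(pedigree) / (MAX_SCORE * len(pedigree))
    print(f"{name:25s} strength={strength:.2f} "
          f"spread contribution={variance_share:.0%}")
    if strength < 0.5 and variance_share > 0.3:
        print("  -> weak and sensitive: priority candidate for model improvement")
```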