Weight Functions in Nonlinear Regression

 

Background – SigmaPlot equation items sometimes use a weight variable to assign a weight to each observation (or response) in a regression data set. The weight for an observation measures its uncertainty relative to the probability distribution from which it is sampled. A larger weight indicates an observation that varies little from the true mean of its distribution, while a smaller weight indicates an observation sampled from farther out in the tail of its distribution.

Under the statistical assumptions for estimating the parameters of a fit model using the least squares approach, the weights are, up to a scale factor, equal to the reciprocal of the population variances of the (Gaussian) distributions from which the observations are sampled.
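In symbols, writing $\sigma_i^2$ for the population variance of the $i$-th observation and $c$ for a scale factor common to all observations:

$$w_i = \frac{c}{\sigma_i^2}$$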

If the variances for all observations are equal (homoscedasticity), then a weight variable is unnecessary and the unweighted least squares problem of minimizing the sum of squares of the residuals is solved to find the best-fit parameters. Here, we define a residual, sometimes called a raw residual, to be the difference between an observation and the predicted value (the value of the fit model) at a given value of the independent variable(s).
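With observations $y_i$, independent values $x_i$, and fit model $f$ with parameter vector $p$, the raw residuals and the unweighted problem are:

$$r_i = y_i - f(x_i;\,p), \qquad \min_p \sum_i r_i^2$$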

If the variances of the observations are not all the same (heteroscedasticity), then a weight variable is needed and the weighted least squares problem of minimizing the weighted sum of squares of the residuals is solved to find the best-fit parameters. If the variances differ among the observations and the corresponding weights are not included in the regression analysis, then the best-fit parameters and best-fit predicted values may not estimate the true fit model. In particular, an incorrectly assumed unweighted design may lead one to interpret certain observations as outliers when in fact they are not.
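The weighted problem replaces the plain sum of squares with the weighted sum of squares:

$$\min_p \sum_i w_i\, r_i^2$$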

There is another important point concerning the usage of the term weight. Many authors and some software products associate this term with the standard deviation of the population, instead of the variance as described above. With this interpretation, weight is the square root of the value we are describing above. To make matters a little more confusing, we use the term weighted residual in the report. This is equal to the square root of the weight (as defined earlier) times the value of the raw residual.
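In symbols: under that alternative convention the weight equals $\sqrt{w_i}$, and the weighted residual that appears in the report is

$$\sqrt{w_i}\; r_i$$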

Current requirements for weighted regressions – The SigmaPlot fitting algorithm (Levenberg-Marquardt) is designed to use only constant weights for a set of independent observations. In other words, weights are assumed to have fixed values that do not change during the regression process. Usually the weight variable is defined in terms of a worksheet column where each row is used to compute the weight that is assigned to the observation in the same row of the dependent variable column.

Weight functions and some general applications – Our new feature will allow the user to define a weight variable as a function of the parameters contained in the fit model. One application of this more general adaptive weighting is in situations where the variances of the observations cannot be determined prior to performing the fit. For example, if all observations are Poisson distributed, then the population means, which are what the predicted values estimate, are equal to the population variances.

Although the least squares approach for estimating parameters is designed for normally distributed data, other distributions are sometimes used with least squares when other methods are unavailable. In the case of Poisson data, we need to define the weight variable as the reciprocal of the predicted values. This procedure is sometimes referred to as “weighting by predicted values”. The Example section shows how to specify this type of weight variable in an equation item.
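Concretely, since the variance of a Poisson observation equals its mean, the weight for observation $i$ is (up to scale) the reciprocal of its predicted value $f_i$:

$$w_i = \frac{1}{f_i}$$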

Another application of adaptive weighting is to obtain robust procedures for estimating parameters that mitigate the effects of outliers. Occasionally, a few observations in a data set are sampled, with small probability, from the tails of their distributions, or a few observations are sampled from distributions that deviate slightly from the normality assumption used in least squares estimation and thus contaminate the data set.

These aberrant observations, called outliers in the response variable, can have a significant impact on the fit results because they have relatively large (raw or weighted) residuals and thus inflate the sum of squares that is being minimized. Least squares estimation methods are often criticized for their instability when a few observations change significantly or when many observations change by a small amount.

One way to mitigate the effects of outliers is to use a weight variable that is a function of the residuals (and hence, also a function of the parameters), where the weight assigned to an observation is inversely related to the size of the residual. The choice of weighting function depends upon assumptions about the distributions of the observations (assuming they are not normal) and a scheme for deciding what size of residual to tolerate. The Example section shows a particular type of weight variable that can be used in robust fitting.

M-Estimation – One robust fitting technique, called M-estimation, can be applied in many cases with the adaptive weighting that we will provide. For this type of robust fitting, independent data is assumed to be sampled from a distribution that is not necessarily normal, but usually one with a symmetric probability density function. We then solve a maximum likelihood estimation problem by minimizing a merit function defined as the negative of the log-likelihood function.

The merit function can often be expressed as a function of only the residuals, where a residual is the difference between an observation and some parameterized expression that defines one of the distribution's parameters. The method of linear least squares is a special case where the distribution for the data is a normal distribution. In this case, the merit function simply becomes the sum of squares of the residuals and the distribution parameter that's being estimated is the mean.
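In symbols, with residuals $r_i$ and a function $\rho$ determined by the assumed distribution, the merit function takes the form

$$\Phi(p) = \sum_i \rho\big(r_i(p)\big),$$

and $\rho(r) = r^2$ (up to additive and multiplicative constants) recovers least squares in the normal case.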

For the robust fitting of data that has outliers with respect to least squares, the idea is to choose a distribution with wider tails than the normal distribution. This moves the outliers into a region with a higher probability of occurrence. If we perform M-estimation with respect to this new distribution and minimize the corresponding merit function, then we obtain parameter estimates that are more resistant to change when outliers are present.

As it turns out, many M-estimation problems can be solved by considering a corresponding weighted least squares problem where the weights are functions of the residuals. The technique we discuss later for solving this type of least squares problem will then yield, assuming convergence, a local if not global solution to the M-estimation problem.
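The correspondence is the standard one: setting the gradient of $\Phi$ to zero matches the normal equations of a weighted least squares problem whose weights are, with $\psi = \rho'$ and up to a constant factor,

$$w(r) = \frac{\psi(r)}{r}.$$

For the Cauchy merit function $\rho(r) = \ln(1 + r^2)$, for example, this gives $w(r) \propto 1/(1 + r^2)$.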

The program Prism includes a modified Levenberg-Marquardt algorithm for solving a particular M-estimation problem based on the Cauchy distribution. This is their primary robust fitting method used to detect outliers.

New parameter-estimation procedure – In a weighted regression problem with independent observations, the main objective is to estimate the true parameter values at which the fit model gives the true means of the observations and at which the weights become inversely proportional to the true variances (with the proportionality constant the same for all observations).

If the weights are constants, so that the true variances of the observations are assumed to be known up to some constant multiple, then the usual approach of obtaining the parameter estimates is to minimize a weighted sum of squares of the residuals between the observations and the fit model.

If the observations are normally distributed, then it is easy to show that this approach is equivalent to maximum likelihood estimation. This weighted least squares problem, under fairly general conditions, yields a statistically consistent method of estimating the true parameter values (i.e. a method that converges to the true parameter values as the sample size increases without bound).

If the weights are functions of the parameters and we obtain our parameter estimates by minimizing the corresponding weighted sum of squares, we run into difficulties. First, using our current algorithm to solve the minimization problem would result in slow convergence or no convergence due to instabilities caused by estimating the derivatives of a more complicated fit model that now includes the weight functions. A modified version of the algorithm would be required.

Second, it is known that the parameter values minimizing the sum of squares do not generally give consistent estimates of the true parameter values when the weights are functions of the parameters. There is, however, another approach that can often yield consistent estimates and is called iteratively reweighted least squares (IRLS).

It must be emphasized that the limiting values of the parameters obtained by iterative reweighting do not generally yield a minimum (local or global) of the weighted sum of squares discussed above. In other words, assuming normally distributed errors, this is not a maximum likelihood procedure.

The parameter estimation algorithm we will use for adaptive weighting, IRLS, is based on solving a sequence of constant-weighted least squares problems where each sub-problem is solved using our current implementation of the Levenberg-Marquardt algorithm. This process begins by evaluating the weights using the initial parameter values and then minimizing the sum of squares with these fixed weights. The best-fit parameters are then used to re-evaluate the weights.

With the new weight values, the above process of minimizing the sum of squares is repeated. We continue in this fashion until convergence is achieved. The convergence criterion is that the relative difference between the square root of the weighted sum of squared residuals for the current parameter values and that for the parameter values of the previous iteration is less than the tolerance value set in the equation item. As with other estimation procedures, convergence is not guaranteed.
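To make the iteration concrete, here is a minimal sketch in Python. It is an illustration only, not SigmaPlot's implementation: the names irls, model, weight_fn, line, and w_pred are ours, and SciPy's Levenberg-Marquardt solver stands in for SigmaPlot's.

    # Minimal IRLS sketch (illustration only, not SigmaPlot's code).
    # Each pass evaluates the weights, holds them fixed while a
    # Levenberg-Marquardt solver minimizes the constant-weight sum of
    # squares, then re-evaluates the weights at the new parameters.
    import numpy as np
    from scipy.optimize import least_squares

    def irls(model, weight_fn, x, y, p0, tol=1e-10, max_iter=50):
        p = np.asarray(p0, dtype=float)
        prev = None
        for _ in range(max_iter):
            w = weight_fn(p, x, y)                 # weights fixed for this pass
            sw = np.sqrt(np.where(w > 0, w, 0.0))  # non-positive weights drop the point
            fit = least_squares(lambda q: sw * (y - model(x, q)), p, method="lm")
            p = fit.x
            cur = np.sqrt(np.sum((sw * (y - model(x, p))) ** 2))
            if prev is not None and abs(cur - prev) <= tol * max(prev, np.finfo(float).tiny):
                break                              # relative change below tolerance
            prev = cur
        return p

    # "Weighting by predicted values" for a straight line: w = 1/|f|,
    # applied to the data from the first example below.
    line = lambda x, p: p[0] + p[1] * x
    w_pred = lambda p, x, y: 1.0 / np.abs(line(x, p))
    x = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 6, 5], dtype=float)
    y = np.array([1, 2, 2, 3, 5, 5, 6, 20, 25, 25, 30], dtype=float)
    print(irls(line, w_pred, x, y, p0=[1.0, 1.0]))

Each sub-problem in the loop is exactly the constant-weighted least squares problem described above; only the weight evaluation between passes changes.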

The IRLS estimation procedure is used by the NLIN Procedure in SAS 9.2 and by SYSTAT 12. The program Prism also uses IRLS, but only when it performs “weighting by predicted values”.

Positive weight values – In order for weights to have any influence on the results, they must be positive. If the weight variable, whether defining weights as constant or as functions of the parameters, evaluates to zero or a negative number for any observation, then that observation is ignored in the computations for minimizing the sum of squares.

Other remarks – In the current version of SigmaPlot (and previous versions), you can define a weight variable in terms of the parameters of the fit equation. You can then use this weight variable to perform a fit, without errors, and obtain results that are meaningful as long as you realize how the weight variable is used. Currently, when this situation arises, we evaluate the weight variable for the initial parameter values and then treat the weights as constant for the duration of the fit process. In other words, there is no iterative reweighting.

Examples Using Weight Functions in Nonlinear Regression

 

Example (weighting with predicted values) – This example simply shows how to set up the equation text to perform weighting with predicted values as discussed above. The fit text entered in the Function dialog is used to fit a straight line to the data shown in the table below:

x y
1.0000 1.0000
1.0000 2.0000
2.0000 2.0000
2.0000 3.0000
2.0000 5.0000
3.0000 5.0000
3.0000 6.0000
3.0000 20.0000
4.0000 25.0000
6.0000 25.0000
5.0000 30.0000

Note the definition of the weight variable in the Variables section of the fit text. When weighting with predicted values, we mean that the weights are defined as the reciprocal of the predicted values and, in this example, the variable f represents our predicted values. The absolute value is used to ensure that the weights are positive.
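The Function dialog itself is not reproduced here, but fit text along the following lines produces the two fits described (the section layout follows SigmaPlot's fit-file format; the initial parameter values are placeholders chosen for illustration). The unweighted run uses the first fit statement; for the weighted run, the first statement is commented out and the second, fit f to y with weight w, is used instead:

    [Variables]
    x = col(1)
    y = col(2)
    w = 1/abs(f)

    [Parameters]
    y0 = 1
    a = 1

    [Equation]
    f = y0 + a*x
    fit f to y
    fit f to y with weight w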

First, we perform the fit with no weighting using the first fit statement in the Equation section of the fit text. The best-fit parameter results, the residual sum of squares, and the R-squared (coefficient of determination) value are shown in the tables below. The graph of the fit curve is also shown.

Parameter Value Std. Error
y0 -6.8613 3.7141
a 6.2336 1.1340

 

Residual SS 288.2774
R^2 0.7705

 

We next use the weight variable that is defined in the Variables section by commenting out the first fit statement and using the second fit statement in the Equation section. We assume the Reduced Chi-square option for computing the parameter standard errors is turned on. The table of numeric results is given below as well as a graph of the fit curve.

Parameter Value Std. Error
y0 -4.4363 1.7852
a 5.4000 0.9888
Residual SS 25.0005
R^2 0.7682

Notice that the standard errors of the parameters are smaller in the weighted fit. Although some of the data (the observations with larger deviations) have larger error variance, most of the data have smaller error variance, and these errors are propagated into the computation for the standard errors of the parameters.

Example (weighting with residuals) – The fit text entered in the Function dialog is used to fit a Gaussian curve to the data shown in the table below:

x y
-6.0000 1.6000
-5.0000 1.8000
-3.0000 2.0000
0.0000 4.0000
2.0000 2.7500
3.0000 1.6800
4.0000 0.6400
6.0000 0.2000

First, we perform the fit with no weighting using the first fit statement in the Equation section of the fit text. The best-fit parameter results, the residual sum of squares, and the R-squared (coefficient of determination) value are shown in the tables below. The graph of the fit curve is also shown.

Parameter Value Std. Error
a 3.6667 0.5243
b 3.1567 0.4416
x0 -0.8084 0.3944

 

Residual SS 1.8569
R^2 0.8092

We next use the weight variable that is defined in the Variables section by commenting out the first fit statement and using the second fit statement in the Equation section. We assume the Reduced Chi-square option for computing the parameter standard errors is turned on.

The expression for the weight variable is derived from the Cauchy probability distribution and is expressed as a function of the residuals. This is a type of robust fitting that will allow the fit to be more stable with respect to outliers by the manner in which it assigns smaller weight values to large residuals. The table of numeric results is given below as well as a graph of the fit curve.
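The exact expression in the fit text is not reproduced here, but as derived in the M-estimation discussion above, a Cauchy-derived weight has, up to a constant factor and a choice of residual scale, the form

$$w = \frac{1}{1 + r^2}, \qquad r = y - f,$$

so observations with large residuals receive small weights.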

Parameter Value Std. Error
a 4.0090 0.3055
b 2.4681 0.2033
x0 -0.2593 0.2055

 

Residual SS 0.5231
R^2 0.9436

From the graph, it is seen that the influence of any outliers is diminished and most of the data fit more closely. Not only does robust fitting reduce the effects of outliers, it also helps to identify them. In the graph above, we see that the first two observations do not fit well with the rest of the data. If we remove these from the data set and go back to perform a fit with no weighting, we obtain the following results:

Parameter Value Std. Error
a 4.0720 0.1343
b 2.3460 0.0808
x0 -0.1742 0.0842

 

Residual SS 0.0607
R^2 0.9937

After removing the first two observations, we see that the parameter values are similar to the values obtained from the robust fitting. Notice that the standard errors of the parameters are almost twice as large in the robust fitting as in the final unweighted curve fit with the first two observations removed.

The reason for this is that the small weights assigned to the first two observations increase the standard errors of these observations and these errors are propagated into the computation for the standard errors of the parameters. In short, although the outliers are having little impact on the parameter values in the robust fitting, they are having a much larger impact on the standard errors.

Outliers, in general, should be investigated carefully. For example, if certain assumptions are violated, then an observation may be falsely identified as an outlier. This can happen if an incorrect fit model is used or if the error variance for an observation is unknown or specified incorrectly. Even if the assumptions are correct, an outlier cannot always be observed visually.

A procedure like the one above, with a judicious choice of weight function, is usually necessary to identify outliers, to reduce their impact, or both. In the above example, we identified the outliers and then repeated the original fit with the outliers removed. Frequently, the objective is just to identify the outliers rather than to obtain a robust fit that is insensitive to them or a non-robust fit with the outliers removed.

Weighting Functions in SigmaPlot Fit Equations

Seven weighting functions are now available in all curve fit equations (3D equations are slightly different from 2D). These weightings, as they appear in each fit file, are 1/y, 1/y^2, 1/x, 1/x^2, 1/predicteds, 1/predicteds^2, and Cauchy.

References

Seber, G. A. F. and Wild, C. J. (1989). Nonlinear Regression. New York: John Wiley & Sons.

Huber, P. J. (1981). Robust Statistics. New York: John Wiley & Sons.

SAS/STAT User's Guide, Version 9.2 (2008). SAS Institute, Cary, NC. Chapter 60.
