20 Discriminant Analysis
Introduction
Discriminant Analysis finds a set of prediction equations based on independent variables that are used to classify individuals into groups. There are two possible objectives in a discriminant analysis: finding a predictive equation for classifying new individuals or interpreting the predictive equation to better understand the relationships that may exist among the variables.
In many ways, discriminant analysis parallels multiple regression analysis. The main difference between these two techniques is that regression analysis deals with a continuous dependent variable, while discriminant analysis must have a discrete dependent variable. The methodology used to complete a discriminant analysis is similar to regression analysis. You plot each independent variable versus the group variable. You often go through a variable selection phase to determine which independent variables are beneficial. You conduct a residual analysis to determine the accuracy of the discriminant equations.
The mathematics of discriminant analysis are related very closely to the one-way MANOVA. In fact, the roles of the variables are simply reversed. The classification (factor) variable in the MANOVA becomes the dependent variable in discriminant analysis. The dependent variables in the MANOVA become the independent variables in the discriminant analysis.
Source: “Cluster & Discriminant Analysis, Rural & International Marketing Research” by NPTEL is licensed under CC BY-NC-SA 4.0
Technical Details
Suppose you have data for K groups, with Nk observations per group. Let N represent the total number of observations. Each observation consists of the measurements of p variables. The ith observation is represented by Xki. Let M represent the vector of means of these variables across all groups and Mk the vector of means of observations in the kth group.
Define three sums of squares and cross-products matrices, S_T, S_W, and S_A, as follows:

S_W = Σ_{k=1..K} Σ_{i=1..Nk} (X_ki − M_k)(X_ki − M_k)'

S_T = Σ_{k=1..K} Σ_{i=1..Nk} (X_ki − M)(X_ki − M)'

S_A = S_T − S_W
A discriminant function is a weighted average of the values of the independent variables. The weights are selected so that the resulting weighted average separates the observations into the groups. High values of the average come from one group, low values of the average come from another group. The problem reduces to one of finding the weights which, when applied to the data, best discriminate among groups according to some criterion. The solution reduces to finding the eigenvectors, V, of S_W⁻¹S_A. The canonical coefficients are the elements of these eigenvectors.
A goodness-of-fit statistic, Wilks’ lambda, is defined as follows:

Λ = Π_{j=1..m} 1 / (1 + λ_j) = |S_W| / |S_T|

where λ_j is the jth eigenvalue corresponding to the eigenvector described above and m is the minimum of K − 1 and p.
The canonical correlation between the jth discriminant function and the independent variables is related to these eigenvalues as follows:

r_j = √( λ_j / (1 + λ_j) )
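These quantities can be sketched with NumPy. The two small groups below are hypothetical, chosen only to illustrate the computation of the SSCP matrices, the eigen-problem, Wilks’ lambda, and the canonical correlations:

```python
import numpy as np

# Hypothetical data: K = 2 groups, p = 2 variables, N_k = 4 observations each.
groups = [
    np.array([[5.0, 3.4], [4.8, 3.1], [5.2, 3.6], [4.9, 3.3]]),
    np.array([[6.4, 2.9], [6.6, 3.1], [6.1, 2.8], [6.7, 3.0]]),
]
X = np.vstack(groups)
M = X.mean(axis=0)                       # overall mean vector

# Sums of squares and cross-products matrices.
S_W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
S_T = (X - M).T @ (X - M)
S_A = S_T - S_W

# Canonical coefficients: eigenvectors of S_W^-1 S_A.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_A))
order = np.argsort(eigvals.real)[::-1]
lam = eigvals.real[order]                # eigenvalues, largest first
V = eigvecs.real[:, order]

m = min(len(groups) - 1, X.shape[1])     # number of non-zero eigenvalues
wilks = np.prod(1.0 / (1.0 + lam[:m]))   # Wilks' lambda
canon_r = np.sqrt(lam[:m] / (1.0 + lam[:m]))   # canonical correlations
```

With two groups there is a single non-zero eigenvalue, so only one discriminant function is produced.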
Various other matrices are often considered during a discriminant analysis. The overall covariance matrix, T, is given by:

T = S_T / (N − 1)
The within-group covariance matrix, W, is given by:

W = S_W / (N − K)
The among-group (or between-group) covariance matrix, A, is given by:

A = S_A / (K − 1)
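A short NumPy sketch of these three covariance matrices, using synthetic three-group data (the group centers and spread are made up for illustration):

```python
import numpy as np

# Synthetic data: K = 3 groups, p = 2 variables, N_k = 4 observations each.
rng = np.random.default_rng(0)
centers = [np.array([5.0, 3.4]), np.array([6.0, 2.8]), np.array([6.6, 3.0])]
groups = [c + 0.2 * rng.standard_normal((4, 2)) for c in centers]
X = np.vstack(groups)
N, p = X.shape
K = len(groups)

S_W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
S_T = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0))
S_A = S_T - S_W

T = S_T / (N - 1)    # overall covariance matrix
W = S_W / (N - K)    # within-group (pooled) covariance matrix
A = S_A / (K - 1)    # among-group covariance matrix
```

Note that T matches the ordinary sample covariance of the pooled data, while W pools the covariance about each group’s own mean.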
The linear discriminant functions are defined as:

f_k(X) = M_k' W⁻¹ X − (1/2) M_k' W⁻¹ M_k + ln(p_k)

where W is the within-group covariance matrix defined above and p_k is the prior probability of membership in group k.
The standardized canonical coefficients are given by:

b_ij = v_ij √(w_ii)

where v_ij are the elements of V and w_ii are the diagonal elements of W.
The correlations between the independent variables and the canonical variates (often called structure coefficients) are also computed and displayed.
Discriminant Analysis Checklist
Tabachnick (1989) provides the following checklist for conducting a discriminant analysis. We suggest that you consider these issues and guidelines carefully.
Unequal Group Size and Missing Data
You should begin by screening your data. Pay particular attention to patterns of missing values. When using discriminant analysis, you should have more observations per group than you have independent variables. If you do not, there is a good chance that your results cannot be generalized, and future classifications based on your analysis will be inaccurate.
Unequal group size does not influence the direct solution of the discriminant analysis problem. However, unequal group size can cause subtle changes during the classification phase. Normally, the sampling proportion of each group (the proportion of the total sample that belongs to a particular group) is used during the classification stage. If the relative group sample sizes are not representative of their sizes in the overall population, the classification procedure will be erroneous. (You can make appropriate adjustments to prevent these erroneous classifications by adjusting the prior probabilities.)
NCSS ignores rows with missing values. If it appears that most missing values occur in one or two variables, you might want to leave these out of the analysis in order to obtain more data and hence more accuracy.
Multivariate Normality and Outliers
Discriminant analysis does not make the strong normality assumptions that MANOVA does because the emphasis is on classification. A sample size of at least twenty observations in the smallest group is usually adequate to ensure robustness of any inferential tests that may be made.
Outliers can cause severe problems that even the robustness of discriminant analysis will not overcome. You should screen your data carefully for outliers using the various univariate and multivariate normality tests and plots to determine if the normality assumption is reasonable. You should perform these tests on one group at a time.
Homogeneity of Covariance Matrices
Discriminant analysis makes the assumption that the group covariance matrices are equal. This assumption may be tested with Box’s M test in the Equality of Covariances procedure or by looking for equal slopes in the Probability Plots. If the covariance matrices appear to be grossly different, you should take some corrective action. Although the inferential part of the analysis is robust, the classification of new individuals is not. These will tend to be classified into the groups with larger covariances. Corrective action usually includes the close screening for outliers and the use of variance-stabilizing transformations such as the logarithm.
Linearity
Discriminant analysis assumes linear relations among the independent variables. You should study scatter plots of each pair of independent variables, using a different color for each group. Look carefully for curvilinear patterns and for outliers. The occurrence of a curvilinear relationship will reduce the power and the discriminating ability of the discriminant equation.
Multicollinearity and Singularity
Multicollinearity occurs when one predictor variable is almost a weighted average of the others. This collinearity will only show up when the data are considered one group at a time. Forms of multicollinearity may show up when you have very small group sample sizes (when the number of observations is less than the number of variables). In this case, you must reduce the number of independent variables.
Multicollinearity is easily controlled for during the variable selection phase. You should only include variables that show an R² with other X’s of less than 0.99.
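This screen can be computed directly: regress each predictor on the remaining predictors and flag R² values above 0.99. The NumPy sketch below uses synthetic data with a deliberately near-duplicate column so the flag fires:

```python
import numpy as np

def r2_vs_others(X):
    """R-squared of each column regressed on the remaining columns."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # intercept + other X's
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        ss_res = ((y - Z @ beta) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        out.append(1.0 - ss_res / ss_tot)
    return np.array(out)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
# Add a near-duplicate of the first column to create severe multicollinearity.
X = np.column_stack([X, X[:, 0] + 0.001 * rng.standard_normal(50)])

r2 = r2_vs_others(X)
flag = r2 > 0.99          # candidates for removal, one at a time
```

Here the first and last columns flag each other, which is the expected symptom: remove one of them and rerun.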
See the chapter on Multiple Regression for a more complete discussion of multicollinearity.
Data Structure
The data given in the table below are the first eight rows (out of the 150 in the database) of the famous “iris data” published by Fisher (1936). These data are measurements in millimeters of sepal length, sepal width, petal length, and petal width of fifty plants for each of three varieties of iris: (1) Iris setosa, (2) Iris versicolor, and (3) Iris virginica. Note that Iris versicolor is a polyploid hybrid of the two other species. Iris setosa is a diploid species with 38 chromosomes, Iris virginica is a tetraploid, and Iris versicolor is a hexaploid with 108 chromosomes.
Discriminant analysis finds a set of prediction equations, based on sepal and petal measurements, that classify additional irises into one of these three varieties. Here Iris is the dependent variable, while SepalLength, SepalWidth, PetalLength, and PetalWidth are the independent variables.
Fisher dataset (subset)
SepalLength  SepalWidth  PetalLength  PetalWidth  Iris 
50  33  14  2  1 
64  28  56  22  3 
65  28  46  15  2 
67  31  56  24  3 
63  28  51  15  3 
46  34  14  3  1 
69  31  51  23  3 
62  22  45  15  2 
Missing Values
If missing values are found in any of the independent variables being used, the row is omitted. If they occur only in the dependent (categorical) variable, the row is not used during the calculation of the prediction equations, but a predicted group (and scores) is calculated. This allows you to classify new observations.
Procedure Options
This section describes the options available in this procedure.
Variables Tab
This panel specifies the variables used in the analysis.
Group Variable
Y: Group Variable
This is the dependent, Y, grouping, or classification variable. It must be discrete in nature. That means it can only have a few unique values. Each unique value represents a separate group of individuals. The values may be text or numeric.
Independent Variables
X’s: Independent Variables
These are the set of independent variables. Although the probability statements used in discriminant analysis assume that these variables are continuous (and normal), the technique is robust enough that it can tolerate a few discrete variables (assuming that they are numeric).
Estimation Options
Estimation Method
This option designates the classification method used.
 Linear Discriminant Function
Signifies that you want to classify using the linear discriminant functions (assumes multivariate normality with equal covariance matrices). This is the most popular technique.
 Regression Coefficients
Indicates that you want to classify using multiple regression coefficients (no special assumptions). This method develops a multiple regression equation for each group, ignoring the discrete nature of the dependent variable. Each of the dependent variables is constructed by using a 1 if a row is in the group and a 0 if it is not.
Estimation Options – Linear Discriminant Function Options
Prior Probabilities
Allows you to specify the prior probabilities for linear discriminant classification. If this option is left blank, the prior probabilities are assumed equal. This option is not used by the regression classification method. The numbers should be separated by blanks or commas. They will be adjusted so that they sum to one. For example, you could use “4 4 2” or “2 2 1” when you have three groups whose population proportions are 0.4, 0.4, and 0.2, respectively.
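The rescaling of entered priors can be sketched in a few lines of Python; `normalize_priors` is a hypothetical helper written for illustration, not part of the procedure:

```python
# Hypothetical helper showing how entered priors (e.g. "4 4 2" or "2, 2, 1")
# are rescaled so that they sum to one.
def normalize_priors(text):
    vals = [float(v) for v in text.replace(",", " ").split()]
    total = sum(vals)
    return [v / total for v in vals]

priors = normalize_priors("4 4 2")   # [0.4, 0.4, 0.2]
```

Both “4 4 2” and “2 2 1” rescale to the same proportions, which is why either entry is acceptable.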
Variable Selection Options
Variable Selection
This option specifies whether a stepwise variable-selection phase is conducted.
 None
All independent variables are used in the analysis. No variable selection is conducted.
 Stepwise
A stepwise variable selection is performed using the “in” and “out” probabilities specified next.
Variable Selection Options – Stepwise Selection Options
Maximum Iterations
(Stepwise only.) This option sets the maximum number of steps used in the stepwise procedure. It is possible to set the entry and removal probabilities so that one or two variables are alternately entered and removed, over and over. We call this an infinite loop. To avoid such an occurrence, you can set the maximum number of steps permitted.
Probability Enter
(Stepwise only.) This option sets the probability level for tests used to determine if a variable may be brought into the discriminant equation. At each step, the variable (not in the equation) with the smallest probability level below this cutoff value is entered.
Probability Removed
(Stepwise only.) This option sets the probability level for tests used to determine if a variable should be removed from the discriminant equation. At each step, the variable (in the equation) with the largest probability level above this cutoff value is removed.
Reports Tab
The following options control the format of the reports.
Select Reports
Group Means – Canonical Scores
These options let you specify which reports you want displayed. This is especially useful if you have a lot of data, since some of the reports produce a separate report row for each data row. You may want to omit these reports.
The row-wise reports are Predicted Classification, Linear Discriminant Function Scores, Regression Scores, and Canonical Scores.
Report Options
Report Format
This option lets you specify whether to output a “brief” or “verbose” report during the variable-selection phase. Normally, you would only select “verbose” if you have fewer than ten independent variables.
Precision
Specify the precision of numbers in the report. A single-precision number will show seven-place accuracy, while a double-precision number will show thirteen-place accuracy. Note that the reports are formatted for single precision. If you select double precision, some numbers may run into others. Also note that all calculations are performed in double precision, regardless of which option you select here. This is for reporting purposes only.
Variable Names
This option lets you select whether to display only variable names, variable labels, or both.
Value Labels
This option applies to the Group Variable. It lets you select whether to display data values, value labels, or both. Use this option if you want the output to automatically attach labels to the values (like 1=Yes, 2=No, etc.). See the section on specifying Value Labels elsewhere in this manual.
Plots Tab
These panels specify the pairwise plots of the scores generated for each set of functions. A separate function is generated for each group. A separate plot is constructed for each pair of functions.
Select Plots
LD-Score Plots – Canonical-Score Plots
These options let you specify which plots you want displayed. Click the plot format button to change the plot settings.
Storage Tab
These options let you specify where to store various row-wise statistics.
Warning: Any data already in a column is replaced by the new data. Be careful not to specify columns that contain data.
Data Storage Columns
Predicted Group
You can automatically store the predicted group for each row into the column specified here. The predicted group is generated for each row of data in which all independent variable values are nonmissing.
Linear Discriminant Scores
You can automatically store the linear discriminant scores for each row into the columns specified here. These scores are generated for each row of data in which all independent variable values are nonmissing. Note that a column must be specified for each group.
Linear Discriminant Probabilities
You can automatically store the linear discriminant probabilities for each row into the columns specified here. These probabilities are generated for each row of data in which all independent variable values are nonmissing. Note that a column must be specified for each group.
Regression Coefficient Scores
You can automatically store the regression coefficient scores for each row into the columns specified here. These scores are generated for each row of data in which all independent variable values are nonmissing. Note that a column must be specified for each group.
Canonical Scores
You can automatically store the canonical scores for each row into the columns specified here. These scores are generated for each row of data in which all independent variable values are nonmissing. Note that the number of columns specified should be one less than the number of groups.
Example 1 – Discriminant Analysis
This section presents an example of how to run a discriminant analysis. The data used are shown in the table above and found in the Fisher dataset.
You may follow along here by making the appropriate entries or load the completed template Example1 by clicking on Open Example Template from the File menu of the Discriminant Analysis window.
1. Open the Fisher dataset.

 From the File menu of the NCSS Data window, select Open Example Data.
 Click on the file Fisher.NCSS.
 Click Open.
2. Open the Discriminant Analysis window.

 Using the Analysis menu or the Procedure Navigator, find and select the Discriminant Analysis procedure.
 On the menus, select File, then New Template. This will fill the procedure with the default template.
3. Specify the variables.

 On the Discriminant Analysis window, select the Variables tab.
 Double-click in the Y: Group Variable box. This will bring up the variable selection window.
 Select Iris from the list of variables and then click Ok. “Iris” will appear in the Y: Group Variable box.
 Double-click in the X’s: Independent Variables text box. This will bring up the variable selection window.
 Select SepalLength through PetalWidth from the list of variables and then click Ok. “SepalLength-PetalWidth” will appear in the X’s: Independent Variables box.
4. Specify the reports

 Select the Reports tab.
 Check all reports and plots. Normally you would only view a few of these reports, but we are selecting them all so that we can document them.
 Enter Labels in the Variable Names box.
 Enter Value Labels in the Value Labels box.
5. Run the procedure.

 From the Run menu, select Run Procedure. Alternatively, just click the green Run button.
Group Means Report
Group Means
Variable  Setosa  Versicolor  Virginica  Overall
Sepal Length  50.06  59.36  65.88  58.43333 
Sepal Width  34.28  27.7  29.74  30.57333 
Petal Length  14.62  42.6  55.52  37.58 
Petal Width  2.46  13.26  20.26  11.99333 
Count  50  50  50  150 
This report shows the means of each of the independent variables across each of the groups. The last row shows the count (number of observations) in the group. Note that the column headings come from the use of value labels for the group variable.
Group Standard Deviations Report
Group Standard Deviations
Variable  Setosa  Versicolor  Virginica  Overall
Sepal Length  3.524897  5.161712  6.358796  8.280662 
Sepal Width  3.790644  3.137983  3.224966  4.358663 
Petal Length  1.73664  4.69911  5.518947  17.65298 
Petal Width  1.053856  1.977527  2.7465  7.622377 
Count  50  50  50  150 
This report shows the standard deviations of each of the independent variables across each of the groups. The last row shows the count or number of observations in the group.
Discriminant analysis makes the assumption that the covariance matrices are identical for each of the groups. This report lets you glance at the standard deviations to check if they are about equal.
Total Correlation\Covariance Report
Total Correlation\Covariance
Variable  Sepal Length  Sepal Width  Petal Length  Petal Width
Sepal Length  68.56935  -4.243401  127.4315  51.62707 
Sepal Width  -0.117570  18.99794  -32.96564  -12.16394 
Petal Length  0.871754  -0.428440  311.6278  129.5609 
Petal Width  0.817941  -0.366126  0.962865  58.10063 
This report shows the correlation and covariance matrices that are formed when the grouping variable is ignored. Note that the correlations are on the lower left and the covariances are on the upper right. The variances are on the diagonal.
BetweenGroup Correlation\Covariance Report
BetweenGroup Correlation\Covariance
Variable  Sepal Length  Sepal Width  Petal Length  Petal Width
Sepal Length  3160.607  -997.6334  8262.42  3563.967 
Sepal Width  -0.745075  567.2466  -2861.98  -1146.633 
Petal Length  0.994135  -0.812838  21855.14  9338.7 
Petal Width  0.999768  -0.759258  0.996232  4020.667 
This report displays the correlations and covariances formed using the group means as the individual observations. The correlations are shown in the lower-left half of the matrix. The between-group covariances are shown on the diagonal and in the upper-right half of the matrix. Note that if there are only two groups, all correlations will be equal to one in absolute value since they are formed from only two rows (the two group means).
WithinGroup Correlation\Covariance Report
WithinGroup Correlation\Covariance
Variable  Sepal Length  Sepal Width  Petal Length  Petal Width
Sepal Length  26.50082  9.272109  16.75143  3.840136 
Sepal Width  0.530236  11.53878  5.524354  3.27102 
Petal Length  0.756164  0.377916  18.51878  4.266531 
Petal Width  0.364506  0.470535  0.484459  4.188163 
This report shows the correlations and covariances that would be obtained from data in which the group means had been subtracted. The correlations are shown in the lower-left half of the matrix. The within-group covariances are shown on the diagonal and in the upper-right half of the matrix.
Variable Influence Report
This report analyzes the influence of each of the independent variables on the discriminant analysis.
Variable
The name of the independent variable.
Removed Lambda
This is the value of a Wilks’ lambda computed to test the impact of removing this variable.
Removed FValue
This is the Fratio that is used to test the significance of the above Wilks’ lambda.
Removed FProb
This is the probability (significance level) of the above Fratio. It is the probability to the right of the Fratio. The test is significant (the variable is important) if this value is less than the value of alpha that you are using, such as 0.05.
Alone Lambda
This is the value of a Wilks’ lambda that would be obtained if this were the only independent variable used.
Alone FValue
This is an Fratio that is used to test the significance of the above Wilks’ lambda.
Alone FProb
This is the probability (significance level) of the above Fratio. It is the probability to the right of the Fratio. The test is significant (the variable is important) if this value is less than the value of alpha that you are using, such as 0.05.
RSquared OtherX’s
This is the RSquared value that would be obtained if this variable were regressed on all other independent variables. When this RSquared value is larger than 0.99, severe multicollinearity problems exist. You should remove variables (one at a time) with large RSquared and rerun your analysis.
Linear Discriminant Functions Report
Linear Discriminant Functions Section
Variable  Setosa  Versicolor  Virginica
Constant  -85.20985  -71.754  -103.2697 
Sepal Length  2.354417  1.569821  1.244585 
Sepal Width  2.358787  0.707251  0.3685279 
Petal Length  -1.643064  0.5211451  1.276654 
Petal Width  -1.739841  0.6434229  2.107911 
This report presents the linear discriminant function coefficients. These are often called the discriminant coefficients. They are also known as the “plug-in” estimators, since the true variance-covariance matrices are required but their estimates are plugged in. This technique assumes that the independent variables in each group follow a multivariate normal distribution with equal variance-covariance matrices across groups. Studies have shown that this technique is fairly robust to departures from either assumption.
The report represents three classification functions, one for each of the three groups. Each function is represented vertically. When a weighted average of the independent variables is formed using these coefficients as the weights (and adding the constant), the discriminant scores result. To determine which group an individual belongs to, select the group with the highest score.
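As a sketch of this classification rule, the following Python assumes multivariate normality with equal covariance matrices and uses a common plug-in form of the linear discriminant functions, with the log prior added to the constant term. The two training groups are hypothetical:

```python
import numpy as np

# Hypothetical training data: two groups, two variables.
groups = [
    np.array([[5.0, 3.4], [4.8, 3.1], [5.2, 3.6], [4.9, 3.3]]),
    np.array([[6.4, 2.9], [6.6, 3.1], [6.1, 2.8], [6.7, 3.0]]),
]
priors = np.array([0.5, 0.5])

N = sum(len(g) for g in groups)
K = len(groups)
S_W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
W = S_W / (N - K)                       # pooled within-group covariance
W_inv = np.linalg.inv(W)

# Linear discriminant function for group k (plug-in form):
#   f_k(x) = M_k' W^-1 x - 0.5 * M_k' W^-1 M_k + ln(prior_k)
means = [g.mean(axis=0) for g in groups]
coefs = [W_inv @ m for m in means]
consts = [-0.5 * m @ W_inv @ m + np.log(pr) for m, pr in zip(means, priors)]

def classify(x):
    scores = [c @ x + c0 for c, c0 in zip(coefs, consts)]
    return int(np.argmax(scores))       # group with the highest score wins
```

An observation near a group’s mean vector receives the highest score from that group’s function, so `classify` returns that group’s index.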
Regression Coefficients Report
Regression Coefficients Section
Variable  Setosa  Versicolor  Virginica
Constant  0.1182229  1.577059  -0.6952819 
Sepal Length  -6.602977E-03  2.015369E-03  4.587608E-03 
Sepal Width  2.428479E-02  -4.456162E-02  2.027684E-02 
Petal Length  -2.246571E-02  2.206692E-02  3.987911E-04 
Petal Width  -5.747273E-03  -4.943066E-02  5.517793E-02 
This report presents the regression coefficients. These coefficients are determined as follows:
 Create three indicator variables, one for each of the three varieties of iris. Each indicator variable is set to one when the row belongs to that group and zero otherwise.
 Fit a multiple regression of the independent variables on each of the three indicator variables.
 The regression coefficients obtained are those shown in this table.
Hence, predicted values generated by these coefficients will generally lie near zero and one, although they are not constrained to that interval. To determine which group an individual belongs to, select the group with the highest score.
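The steps above can be sketched with NumPy on hypothetical data. Because the indicator columns sum to one for every row, the fitted values for any observation also sum to one across the groups:

```python
import numpy as np

# Hypothetical data: three groups, two predictors.
X = np.array([[5.0, 3.4], [4.8, 3.1], [5.2, 3.6],
              [5.9, 2.8], [6.1, 2.9], [5.7, 2.7],
              [6.6, 3.0], [6.8, 3.1], [6.4, 2.9]])
group = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
K = 3

Z = np.column_stack([np.ones(len(X)), X])   # design matrix with intercept
Y = np.eye(K)[group]                        # one 0/1 indicator column per group

# One least-squares fit per indicator column; column k of B holds the
# regression coefficients for group k (constant term first).
B, *_ = np.linalg.lstsq(Z, Y, rcond=None)

def classify(x):
    scores = np.append(1.0, x) @ B
    return int(np.argmax(scores))           # highest predicted value wins
```

Note that nothing in the least-squares fit forces an individual prediction into [0, 1]; only the sum across groups is constrained.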
Classification Count Table Report
Classification Count Table for Iris
        Predicted
Actual  Setosa  Versicolor  Virginica  Total
Setosa  50  0  0  50 
Versicolor  0  48  2  50 
Virginica  0  1  49  50 
Total  50  49  51  150 
This report presents a matrix that indicates how accurately the current discriminant functions classify the observations. If perfect classification has been achieved, all off-diagonal elements will be zero. The rows of the table represent the actual groups, while the columns represent the predicted groups.
Percent Reduction
The percent reduction is the classification accuracy achieved by the current discriminant functions over what is expected if the observations were randomly classified. The formula for the percent reduction in classification error is

100 × [Sum of diagonals − N/k] / [N − N/k]

where N is the total number of observations and k is the number of groups.
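Applied to the classification count table above (147 correct out of 150, three groups), the formula gives a 97% reduction over chance. A minimal sketch:

```python
import numpy as np

# Classification count table from the report (rows = actual, columns = predicted).
counts = np.array([[50,  0,  0],
                   [ 0, 48,  2],
                   [ 0,  1, 49]])

N = counts.sum()
k = counts.shape[0]
correct = np.trace(counts)

# Reduction over chance: [sum of diagonals - N/k] / [N - N/k]
reduction = (correct - N / k) / (N - N / k)   # 0.97, i.e. a 97% reduction
```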
Misclassified Rows Report
Misclassified Rows Section
Row  Actual  Predicted  Pcnt1  Pcnt2  Pcnt3
5  Virginica  Versicolor  0.0  72.9  27.1 
9  Versicolor  Virginica  0.0  25.3  74.7 
12  Versicolor  Virginica  0.0  14.3  85.7 
This report shows the actual group and the predicted group of each observation that was misclassified. It also shows the estimated probability, P(i), that the row is in each group, multiplied by 100 so that it reads as a percent (between 0 and 100) rather than a regular probability (between 0 and 1). A value near 100 gives a strong indication that the observation belongs in that group.
P(i)
If the linear discriminant classification technique was used, these are the estimated probabilities that this row belongs to the ith group. See James (1985), page 69, for details of the algorithm used to estimate these probabilities. This algorithm is briefly outlined here.
Let f_i (i = 1, 2, …, K) be the linear discriminant function score for group i, which incorporates the prior probability P(G_i) of classifying an individual into group i. Let max(f_k) be the maximum score over all groups. The values of P(i) are generated using the following equation:

P(i) = exp(f_i − max(f_k)) / Σ_{j=1..K} exp(f_j − max(f_k))
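This conversion is a softmax of the discriminant scores, shifted by the maximum score so the exponentials cannot overflow. A minimal sketch with arbitrary illustrative scores:

```python
import numpy as np

def ldf_probabilities(scores):
    """Estimated group-membership probabilities from linear discriminant
    function scores: a softmax, shifted by the maximum score so that the
    exponentials cannot overflow."""
    scores = np.asarray(scores, dtype=float)
    e = np.exp(scores - scores.max())
    return e / e.sum()

p = ldf_probabilities([12.1, 15.3, 9.8])    # arbitrary illustrative scores
```

The probabilities are non-negative, sum to one, and the largest probability always belongs to the group with the highest score, so the probability report agrees with the classification rule.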
If the regression classification technique was used, this is the predicted value of the regression equation. The implicit Y value in the regression equation is one or zero, depending on whether this observation is in the ith group or not. Hence, a predicted value near zero indicates that the observation is not in the ith group, while a value near one indicates a strong possibility that this observation is in the ith group. There is nothing to prevent these predicted values from being greater than one or less than zero. They are not estimated probabilities.
You can store these values for further analysis by listing variables in the appropriate Storage Tab options.