Cluster Analysis
Johann Bacher
Chair of Sociology
University Erlangen-Nuremberg
Findelgasse 7-9
D-90402 Nuremberg
Nuremberg, 2002
Note: Do not quote without permission of the author.
Contents
Chapter 1: Overview and Examples
Chapter 2: Transformation of Variables
Chapter 3: Dissimilarity and Similarity Measures
Chapter 4: Hierarchical Clustering Techniques
Chapter 5: K-Means
Chapter 6: Special Issues
Chapter 7: Probabilistic Clustering
Chapter 1:
Overview and Examples
1.1 Purpose and Techniques
1.2 Examples
1.3 Criteria for a Good Classification
1.4 Typologies without Cluster Analysis
1.5 Further Applications of Clustering Techniques
References
1.1 Purpose and Techniques
The main idea of cluster analysis is very simple (Bacher 1996: 1-4):
Find K clusters (or a classification that consists of K clusters) so that the objects of
one cluster are similar to each other whereas objects of different clusters are
dissimilar.
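
The idea can be illustrated with a minimal sketch. The data, the cluster labels and the helper name mean_pair_distance below are hypothetical and not taken from the lecture; the sketch assumes Euclidean distance as the dissimilarity measure and simply compares the average distance between objects of the same cluster with the average distance between objects of different clusters.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: six objects, each described by two numerical variables.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
# Hypothetical classification with K = 2 clusters (one label per object).
labels = [0, 0, 0, 1, 1, 1]


def mean_pair_distance(points, labels, same_cluster):
    """Average Euclidean distance over all pairs of objects that lie in the
    same cluster (same_cluster=True) or in different clusters (False)."""
    distances = [
        dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
        if (labels[i] == labels[j]) == same_cluster
    ]
    return sum(distances) / len(distances)


within = mean_pair_distance(points, labels, same_cluster=True)
between = mean_pair_distance(points, labels, same_cluster=False)
print(f"mean distance within clusters:  {within:.2f}")
print(f"mean distance between clusters: {between:.2f}")
```

For a classification that fits the idea stated above, the within-cluster value should be clearly smaller than the between-cluster value.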
The following quotations further illustrate this task:
"This monograph will be concerned with certain techniques for the analysis of multivariate data,
which attempt to solve the following problem:
Given a number of objects or individuals, each of which is described by a set of numerical measures,
devise a classification scheme for grouping the objects into a number of classes such that objects
within classes are similar in some respect and unlike those from other classes. The number of classes
and the characteristics of each class are to be determined".
(Everitt 1981: 1).
"The subject of classification is concerned with the investigation of the relationships within a set of
'objects' in order to establish whether or not the data can validly be summarized by a small number of
classes (or clusters) of similar objects."
(Gordon 1999: 1)
Everitt's characterization requires two remarks:
Clustering techniques can also be applied to cluster variables. Everitt mentions only cases.
Clustering techniques can also be applied in a confirmatory way. Everitt's definition
suggests that cluster analysis is an exploratory technique.
Gordon's description also calls for some remarks:
According to Gordon, the classification must be valid.
The number of clusters should be small.
Gordon thus formulates additional criteria: a cluster should not only contain similar objects,
but the classification should also be valid and consist of a small number of clusters.
Compared to Everitt's definition, Gordon's reflects the development of cluster analysis. Until
the 1980s the discussion concentrated mainly on techniques. At the end of the 1980s the whole
process of clustering, from the selection of cases and variables to the validation of
clusters, became the dominant concern. The steps in a clustering process are:
1. selection of appropriate cases, variables and methods
2. application of the methods
3. evaluation of the results.
This last step includes:
1. determination of the number of clusters, if unknown
2. substantive interpretation of clusters
3. test of stability
4. test of internal validity (model fit), relative validity and external validity
Techniques
Different techniques have been developed to cluster cases or variables. The lecture will
discuss the most important ones:
Hierarchical clustering methods (see chapter 4). They result in a hierarchy of
classifications (partitions).
K-means clustering methods (see chapter 5). They result in a classification with K
clusters. A sequence of classifications with different numbers of clusters is not
automatically generated.
Probabilistic methods (see chapter 7), like latent class and latent profile methods or
mixture models. These methods differ from the two approaches mentioned before
(hierarchical and k-means techniques) in the assignment of objects to the clusters.
Hierarchical and k-means techniques result in a deterministic assignment. An object can
only belong to one cluster, e.g. object 1 belongs to cluster 2, object 2 to cluster 2, object 3
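
The deterministic assignment produced by k-means techniques can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure discussed in chapter 5: the function name k_means, the toy data and the starting centers are hypothetical, Euclidean distance is assumed, and clusters are numbered from 0 rather than from 1.

```python
from math import dist


def k_means(points, initial_centers, max_iter=100):
    """Plain k-means sketch: alternate between assigning every object to its
    nearest center and recomputing each center as the mean of its objects."""
    centers = list(initial_centers)  # copy, so the caller's list is not modified
    labels = None
    for _ in range(max_iter):
        # Assignment step: every object receives exactly one cluster label.
        new_labels = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # no object changed its cluster: stop
            break
        labels = new_labels
        # Update step: move each center to the mean of its assigned objects.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Hypothetical toy data and starting centers for K = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = k_means(points, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
for number, label in enumerate(labels, start=1):
    print(f"object {number} belongs to cluster {label}")
```

Each object ends up with a single cluster label. A probabilistic method would instead report, for every object, a membership probability for each cluster.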