
Cluster Analysis

Johann Bacher

Chair of Sociology

University Erlangen-Nuremberg

Findelgasse 7-9

D-90402 Nuremberg

Nuremberg, 2002

Note: Do not quote without permission of the author.

Contents

Chapter 1: Overview and Examples

Chapter 2: Transformation of Variables

Chapter 3: Dissimilarity and Similarity Measures

Chapter 4: Hierarchical Clustering Techniques

Chapter 5: K-Means

Chapter 6: Special Issues

Chapter 7: Probabilistic Clustering


Chapter 1: Overview and Examples

1.1 Purpose and Techniques

1.2 Examples

1.3 Criteria for a Good Classification

1.4 Typologies without Cluster Analysis

1.5 Further Applications of Clustering Techniques

References

1.1 Purpose and Techniques

The main idea of cluster analysis is very simple (Bacher 1996: 1-4):

• Find K clusters (or a classification that consists of K clusters) so that the objects of one cluster are similar to each other, whereas objects of different clusters are dissimilar.
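As a minimal sketch of this idea, the following Python fragment compares the mean distance between objects of the same cluster with the mean distance between objects of different clusters. The toy data, the cluster assignment, the use of the Euclidean distance and the helper name `mean_distances` are all assumptions made for illustration; they are not taken from the lecture.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: six objects, each described by two variables.
objects = {
    "a": (1.0, 1.0),
    "b": (1.0, 2.0),
    "c": (2.0, 1.0),
    "d": (8.0, 8.0),
    "e": (8.0, 9.0),
    "f": (9.0, 8.0),
}

# Assumed classification with K = 2 clusters.
clusters = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}


def mean_distances(objects, clusters):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for i, j in combinations(objects, 2):
        d = dist(objects[i], objects[j])  # Euclidean distance (an assumption)
        if clusters[i] == clusters[j]:
            within.append(d)
        else:
            between.append(d)
    return sum(within) / len(within), sum(between) / len(between)


within, between = mean_distances(objects, clusters)
print(f"mean within-cluster distance:  {within:.2f}")
print(f"mean between-cluster distance: {between:.2f}")
```

For a classification that fits the main idea, the within-cluster value should be clearly smaller than the between-cluster value, as it is for this toy example.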

The following quotations further illustrate this task:

"This monograph will be concerned with certain techniques for the analysis of multivariate data, which attempt to solve the following problem:

Given a number of objects or individuals, each of which is described by a set of numerical measures, devise a classification scheme for grouping the objects into a number of classes such that objects within classes are similar in some respect and unlike those from other classes. The number of classes and the characteristics of each class are to be determined."

(Everitt 1981: 1)

"The subject of classification is concerned with the investigation of the relationships within a set of 'objects' in order to establish whether or not the data can validly be summarized by a small number of classes (or clusters) of similar objects."

(Gordon 1999: 1)

Everitt's characterization requires two remarks:

• Clustering techniques can also be applied to cluster variables. Everitt mentions only cases.

• Clustering techniques can also be applied in a confirmatory way. Everitt's definition suggests that cluster analysis is an exploratory technique.

Gordon's description also calls for some remarks:

• According to Gordon, the classification must be valid.

• The number of clusters should be small.

Gordon thus formulates additional criteria: a cluster should not only contain similar objects but should also satisfy further requirements. Compared to Everitt, Gordon's definition reflects the development of cluster analysis. Until the 1980s the discussion concentrated mainly on techniques. At the end of the 1980s the whole process of clustering, starting with the selection of cases and variables and ending with the validation of clusters, became the dominant concern. The steps in a clustering process are:

1. selection of appropriate cases, variables and methods

2. application of the methods

3. evaluation of the results.

This last step includes:

1. determination of the number of clusters, if unknown

2. substantive interpretation of clusters

3. test of stability

4. test of internal validity (model fit), relative validity and external validity
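The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cases, the variable names `x` and `y`, the very small k-means routine `kmeans` (started from the first k cases) and the within-cluster sum of squares `within_ss` as the evaluation criterion are all hypothetical choices, not the procedures prescribed in the lecture. Stability and validity tests are not covered.

```python
from math import dist

# Step 1: select cases, variables and a method.
# Hypothetical cases; "x" and "y" are the variables chosen for clustering.
cases = [
    {"id": "a", "x": 1.0, "y": 1.0},
    {"id": "b", "x": 1.0, "y": 2.0},
    {"id": "c", "x": 2.0, "y": 1.0},
    {"id": "d", "x": 8.0, "y": 8.0},
    {"id": "e", "x": 8.0, "y": 9.0},
    {"id": "f", "x": 9.0, "y": 8.0},
]
variables = ("x", "y")
points = [tuple(case[v] for v in variables) for case in cases]


def kmeans(points, k, iterations=20):
    """Small k-means sketch: start from the first k points, then alternate
    between assigning points to the nearest center and updating the centers."""
    centers = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers


def within_ss(points, labels, centers):
    """Sum of squared distances of each point to the center of its cluster."""
    return sum(dist(p, centers[label]) ** 2 for p, label in zip(points, labels))


# Step 2: apply the method for several candidate numbers of clusters.
# Step 3: evaluate the results, here only by comparing the within-cluster
# sum of squares across K (one possible aid for choosing K).
results = {}
for k in (1, 2, 3):
    labels, centers = kmeans(points, k)
    results[k] = (labels, within_ss(points, labels, centers))
    print(k, labels, round(results[k][1], 2))
```

In this toy example the sum of squares drops sharply from K = 1 to K = 2, which would suggest two clusters. The substantive interpretation of the clusters and the tests of stability and validity still have to follow.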

Techniques

Different techniques have been developed to cluster cases or variables. The lecture will discuss the most important ones:

• Hierarchical clustering methods (see chapter 4). They result in a hierarchy of classifications (partitions).

• K-means clustering methods (see chapter 5). They result in a classification with K clusters. A sequence of classifications with different numbers of clusters is not automatically generated.

• Probabilistic methods (see chapter 7), such as latent class and latent profile methods or mixture models. These methods differ from the two approaches mentioned before (hierarchical and k-means techniques) in the assignment of objects to the clusters.

Hierarchical and k-means techniques result in a deterministic assignment. An object can

only belong to one cluster, e.g. object 1 belongs to cluster 2, object 2 to cluster 2, object 3
