This function performs Agglomerative Hierarchical Clustering (AHC) on a given problem of regionalisation (contiguity/connectivity constraint) with optional size constraints on the clusters. Multiple iterations of AHR can be made if multiple values are given for some parameters.
Usage
AHR(
distances = NULL,
contiguity = NULL,
sizes = NULL,
d = NULL,
data = NULL,
m = 0,
M = Inf,
standardQuant = FALSE,
binarQual = FALSE,
nbTries = 5L,
propFusionsInit = 0.01,
linkages = "saut_min",
fusionConstraints = NA,
fusionConstraintModes = available_fusion_modes(),
splitConnectedComponents = FALSE,
criteria = NULL,
evalLinkages = linkages,
minNbClusters = 2L,
maxNbClusters = Inf,
parallel = TRUE,
nbCores = detectCores() - 1L,
verbose = TRUE
)
Arguments
- distances
The distance matrix of the problem. This can be omitted if a distance function
d
and data contextdata
are provided. If onlydistances
is provided, all distances must be present. (distance matrix)- contiguity
A contiguity matrix or an
igraph
contiguity graph. If not provided, the problem is considered completely contiguous (all elements are neighbors of each other). In that case solving time might be long for each iteration.- sizes
Represents the size of each element. By default, it is set to
1
for each element (the size of a cluster becomes its cardinal). All data must be positive or zero. (positive real numeric vector)- d
Distance function between elements. This can be omitted if
distances
is already indicated. If present,data
must also be specified. Some classical distances are available, it is recommended to use them rather than a personal function for optimisation reasons :"
euclidean
": Euclidean distance."
manhattan
" : Manhattan distance."
minkowski
" : Minkowski's distance. In that case a value for p >= 1 must be specified.
(function or string)
- data
A data.frame where each row represents data related to an element. This can be omitted if
d
is omitted. Present variables can be quantitative or qualitative. If qualitative variables are present, some distances may not be used. Possibility of standardising variables and transforming qualitative variables into binary variables (one-hot encoding) usingstandardQuant
andbinarQual
. (data.frame)- m
Minimum size constraint. Must be positive or zero and small enough for the problem to be feasible. Default is
0
(no constraint). (positive number)- M
Maximum size constraint. Must be positive, superior or equal to
m
and large enough for the problem to be feasible. Default isInf
(no constraint). Multiple values can be given if different values for the maximum size constraint must be tried. (positive real vector)- standardQuant
TRUE
if the variables indata
should be standardised (i.e., centered and scaled),FALSE
(default) otherwise. Standardisation is applied after the possible binarization of qualitative variables (seebinarQual
). (flag)- binarQual
TRUE
if qualitative variables should be binarized (one-hot encoding), for example, to make the data set compatible with common distances or to standardize these variables.FALSE
(default) otherwise. (flag)- nbTries
The number of initialisations to test. The first one will start from a single-element partition, and the remaining ones will be generated randomly respecting connectivity and maximum cluster size constraints. The number of random fusions to perform for each randomly generated partition is determined by
propFusionsInit
. (strictly positive integer)- propFusionsInit
Proportion of fusions (relatively to the number of elements) to be randomly performed for the initialisations. Ignored if
nbTries = 1
. Default is1%
. (value in ]0,1[)- linkages
A vector of distance measures to be tested. Can include both function names and actual distance functions. These measures determine the linkage criterion for the hierarchical clustering. If using custom distance functions, they must take the pairwise distance matrix as an argument and return the linkage distance. Default implemented linkages can be seen with
available_linkages()
.- fusionConstraints
Type of constraints to add on the fusions. When a new fusion must be done the algorithm will choose the best fusion in term of the linkage distance respecting the constraint. If no fusion check the constraint a partial relaxation is realized. This kind of constraint can be useful when the problem have a minimum size constraint. Multiple values can be given. The implemented constraints are the following:
FALSE
(orNA
): no constraint added"
singletonMin
": fusion between a cluster of one element and a cluster which doesn't verify the minimum size constraint."
singletonAny
" : fusion between a cluster of one element and any other cluster"
minMin
" : fusion between clusters which do not verify the minimum size constraint"
minAny
" : fusion between a cluster which do not verify the minimum size constraint and any other one. (vector)
- fusionConstraintModes
The way the fusion constraint, if any has been supplied in
fusionConstraints
parameters (ignored otherwise) should be applied. Actually two modes are available:"
deterministic
": the fusion constraint is applied constantly until it is unfeasible."
random
": the fusion constraint is applied randomly with a probability increasing with the number of iterations.
Default to "deterministic". (character vector)
- splitConnectedComponents
A flag indicating if AHR should be done independently on connected components. This can consequently improve calculation time if the problem is not connected and will not impact the results of the AHR. If
TRUE
the grid will be realized on the different connected components and the return will be a list with one element per component. Ignored if the problem is connected. Default toFALSE
. (flag)- criteria
A vector of criteria for cluster evaluation to be calculated at the end of the hierarchical clustering. Optional. Use
available_criteria()
to see what criteria are available. (character vector, possiblyNULL
)- evalLinkages
a vector of linkages that will be used by each criterion in
criteria
needing this parameter. By default it is equal tolinkages
.- minNbClusters
Minimum number of clusters allowed. Default is
2
. Setting a higher value forminNbClusters
can reduce computation time. (strictly positive integer)- maxNbClusters
Maximum number of clusters allowed. At the end of each iteration if solutions with a number of cluster inferior to this value exist, those with a superior number of clusters will be removed. Must be superior or equal to
minNbClusters
. Can reduce computation time. Default toInf
. (strictly positive integer)- parallel
Logical indicating whether to use parallel processing. Default is TRUE.
- nbCores
Number of CPU cores to use for parallel processing (sockets method). Default is one less than the detected number of cores.
- verbose
Logical indicating whether to display progress messages. Default is TRUE.
Value
Depending on splitConnectedComponents
and if the problem is
connected or not.
If splitConnectedComponents = FALSE
or the problem is connected, a list
with 3 elements:
results
: A tibble containing partitions given by AHR with some information like what constraint are respected. If at least one feasible solution have been found only those will be return, with calculated criteria if any given incriteria
. If at least one criterion is given results will be order by quality regarding the first criterion. Otherwise every solution will be returned, ordered by their size constraint score.grid
: the different parameters used for each AHR in a tibble.initialPartitions
: the different partitions that have been used for the first iteration of the AHRs. Stored in a tibble.
Otherwise a list with one element per connected component. Each of the elements are a list corresponding to the return of this function for the connected component.
See also
stats::hclust()
if you don't have any constraint.
Examples
set.seed(123L)
grid <- simple_grid(4L, 5L)
data <- grid$context
sizes <- grid$repartition$nbIndividuals
contiguity <- grid$contiguity
m <- 200.0
M <- 800.0
d <- "euclidean"
criterion <- "CHI"
AHR(contiguity = contiguity,
d = d, data = data,
sizes = sizes,
m = m, M = M,
criteria = criterion,
nbTries = 1L,
fusionConstraints = available_fusion_constraints(),
fusionConstraintModes = available_fusion_modes(),
parallel = FALSE)
#> ℹ Starting time: 2023-11-28 11:58:14.022351
#> ℹ 9 AHC to evaluate
#> → 118 non-trivial regionalisations obtained
#> ✔ 7 feasible partitions obtained
#> → Calculation of the CHI criterion
#> → Execution time: 1.04459047317505
#> $grid
#> # A tibble: 9 × 9
#> idExp linkage minNbClusters idpartitionInit fusionConstraint
#> <int> <chr> <int> <int> <chr>
#> 1 1 Single 2 1 NA
#> 2 2 Single 2 1 singletonMin
#> 3 3 Single 2 1 singletonMin
#> 4 4 Single 2 1 singletonAny
#> 5 5 Single 2 1 singletonAny
#> 6 6 Single 2 1 minMin
#> 7 7 Single 2 1 minMin
#> 8 8 Single 2 1 NaN
#> 9 9 Single 2 1 NaN
#> # ℹ 4 more variables: fusionConstraintMode <chr>, m <dbl>, M <dbl>,
#> # contiguity <lgl>
#>
#> $initialPartitions
#> # A tibble: 1 × 4
#> idpartitionInit partitionInit mode nbFusions
#> <int> <named list> <chr> <dbl>
#> 1 1 <int [20]> unitary 0
#>
#> $results
#> # A tibble: 7 × 11
#> method idExp partition nbClusters checkContiguityConst checkMinSizeConst
#> <chr> <int> <named list> <int> <lgl> <lgl>
#> 1 AHR 2 <int [20]> 3 TRUE TRUE
#> 2 AHR 2 <int [20]> 4 TRUE TRUE
#> 3 AHR 2 <int [20]> 5 TRUE TRUE
#> 4 AHR 3 <int [20]> 3 TRUE TRUE
#> 5 AHR 3 <int [20]> 4 TRUE TRUE
#> 6 AHR 3 <int [20]> 6 TRUE TRUE
#> 7 AHR 3 <int [20]> 5 TRUE TRUE
#> # ℹ 5 more variables: checkMaxSizeConst <lgl>, scoreMinSizeConst <dbl>,
#> # scoreMaxSizeConst <dbl>, scoreSizeConsts <dbl>, CHI <dbl>
#>