Skip to contents

This function performs Agglomerative Hierarchical Clustering (AHC) on a given problem of regionalisation (contiguity/connectivity constraint) with optional size constraints on the clusters. Multiple iterations of AHR can be made if multiple values are given for some parameters.

Usage

AHR(
  distances = NULL,
  contiguity = NULL,
  sizes = NULL,
  d = NULL,
  data = NULL,
  m = 0,
  M = Inf,
  standardQuant = FALSE,
  binarQual = FALSE,
  nbTries = 5L,
  propFusionsInit = 0.01,
  linkages = "saut_min",
  fusionConstraints = NA,
  fusionConstraintModes = available_fusion_modes(),
  splitConnectedComponents = FALSE,
  criteria = NULL,
  evalLinkages = linkages,
  minNbClusters = 2L,
  maxNbClusters = Inf,
  parallel = TRUE,
  nbCores = detectCores() - 1L,
  verbose = TRUE
)

Arguments

distances

The distance matrix of the problem. This can be omitted if a distance function d and data context data are provided. If only distances is provided, all distances must be present. (distance matrix)

contiguity

A contiguity matrix or an igraph contiguity graph. If not provided, the problem is considered completely contiguous (all elements are neighbors of each other). In that case solving time might be long for each iteration.

sizes

Represents the size of each element. By default, it is set to 1 for each element (the size of a cluster becomes its cardinal). All data must be positive or zero. (positive real numeric vector)

d

Distance function between elements. This can be omitted if distances is already indicated. If present, data must also be specified. Some classical distances are available, it is recommended to use them rather than a personal function for optimisation reasons :

  • "euclidean": Euclidean distance.

  • "manhattan" : Manhattan distance.

  • "minkowski" : Minkowski's distance. In that case a value for p >= 1 must be specified.

(function or string)

data

A data.frame where each row represents data related to an element. This can be omitted if d is omitted. Present variables can be quantitative or qualitative. If qualitative variables are present, some distances may not be used. Possibility of standardising variables and transforming qualitative variables into binary variables (one-hot encoding) using standardQuant and binarQual. (data.frame)

m

Minimum size constraint. Must be positive or zero and small enough for the problem to be feasible. Default is 0 (no constraint). (positive number)

M

Maximum size constraint. Must be positive, superior or equal to m and large enough for the problem to be feasible. Default is Inf (no constraint). Multiple values can be given if different values for the maximum size constraint must be tried. (positive real vector)

standardQuant

TRUE if the variables in data should be standardised (i.e., centered and scaled), FALSE (default) otherwise. Standardisation is applied after the possible binarization of qualitative variables (see binarQual). (flag)

binarQual

TRUE if qualitative variables should be binarized (one-hot encoding), for example, to make the data set compatible with common distances or to standardize these variables. FALSE (default) otherwise. (flag)

nbTries

The number of initialisations to test. The first one will start from a single-element partition, and the remaining ones will be generated randomly respecting connectivity and maximum cluster size constraints. The number of random fusions to perform for each randomly generated partition is determined by propFusionsInit. (strictly positive integer)

propFusionsInit

Proportion of fusions (relatively to the number of elements) to be randomly performed for the initialisations. Ignored if nbTries = 1. Default is 1%. (value in ]0,1[)

linkages

A vector of distance measures to be tested. Can include both function names and actual distance functions. These measures determine the linkage criterion for the hierarchical clustering. If using custom distance functions, they must take the pairwise distance matrix as an argument and return the linkage distance. Default implemented linkages can be seen with available_linkages().

fusionConstraints

Type of constraints to add on the fusions. When a new fusion must be done the algorithm will choose the best fusion in term of the linkage distance respecting the constraint. If no fusion check the constraint a partial relaxation is realized. This kind of constraint can be useful when the problem have a minimum size constraint. Multiple values can be given. The implemented constraints are the following:

  • FALSE (or NA): no constraint added

  • "singletonMin": fusion between a cluster of one element and a cluster which doesn't verify the minimum size constraint.

  • "singletonAny" : fusion between a cluster of one element and any other cluster

  • "minMin" : fusion between clusters which do not verify the minimum size constraint

  • "minAny" : fusion between a cluster which do not verify the minimum size constraint and any other one. (vector)

fusionConstraintModes

The way the fusion constraint, if any has been supplied in fusionConstraints parameters (ignored otherwise) should be applied. Actually two modes are available:

  • "deterministic": the fusion constraint is applied constantly until it is unfeasible.

  • "random": the fusion constraint is applied randomly with a probability increasing with the number of iterations.

Default to "deterministic". (character vector)

splitConnectedComponents

A flag indicating if AHR should be done independently on connected components. This can consequently improve calculation time if the problem is not connected and will not impact the results of the AHR. If TRUE the grid will be realized on the different connected components and the return will be a list with one element per component. Ignored if the problem is connected. Default to FALSE. (flag)

criteria

A vector of criteria for cluster evaluation to be calculated at the end of the hierarchical clustering. Optional. Use available_criteria() to see what criteria are available. (character vector, possibly NULL)

evalLinkages

a vector of linkages that will be used by each criterion in criteria needing this parameter. By default it is equal to linkages.

minNbClusters

Minimum number of clusters allowed. Default is 2. Setting a higher value for minNbClusters can reduce computation time. (strictly positive integer)

maxNbClusters

Maximum number of clusters allowed. At the end of each iteration if solutions with a number of cluster inferior to this value exist, those with a superior number of clusters will be removed. Must be superior or equal to minNbClusters. Can reduce computation time. Default to Inf. (strictly positive integer)

parallel

Logical indicating whether to use parallel processing. Default is TRUE.

nbCores

Number of CPU cores to use for parallel processing (sockets method). Default is one less than the detected number of cores.

verbose

Logical indicating whether to display progress messages. Default is TRUE.

Value

Depending on splitConnectedComponents and if the problem is connected or not. If splitConnectedComponents = FALSE or the problem is connected, a list with 3 elements:

  • results: A tibble containing partitions given by AHR with some information like what constraint are respected. If at least one feasible solution have been found only those will be return, with calculated criteria if any given in criteria. If at least one criterion is given results will be order by quality regarding the first criterion. Otherwise every solution will be returned, ordered by their size constraint score.

  • grid: the different parameters used for each AHR in a tibble.

  • initialPartitions: the different partitions that have been used for the first iteration of the AHRs. Stored in a tibble.

Otherwise a list with one element per connected component. Each of the elements are a list corresponding to the return of this function for the connected component.

Details

[Experimental]

See also

merge_cc_partitions()

stats::hclust() if you don't have any constraint.

Examples

set.seed(123L)
grid  <- simple_grid(4L, 5L)
data <- grid$context
sizes <- grid$repartition$nbIndividuals
contiguity <- grid$contiguity

m <- 200.0
M <- 800.0

d <- "euclidean"

criterion <- "CHI"

AHR(contiguity = contiguity,
    d = d, data = data,
    sizes = sizes,
    m = m, M = M,
    criteria = criterion,
    nbTries = 1L,
    fusionConstraints = available_fusion_constraints(),
    fusionConstraintModes = available_fusion_modes(),
    parallel = FALSE)
#>  Starting time: 2023-11-28 11:58:14.022351
#>  9 AHC to evaluate
#> → 118 non-trivial regionalisations obtained
#>  7 feasible partitions obtained
#> → Calculation of the CHI criterion
#> → Execution time: 1.04459047317505
#> $grid
#> # A tibble: 9 × 9
#>   idExp linkage minNbClusters idpartitionInit fusionConstraint
#>   <int> <chr>           <int>           <int> <chr>           
#> 1     1 Single              2               1 NA              
#> 2     2 Single              2               1 singletonMin    
#> 3     3 Single              2               1 singletonMin    
#> 4     4 Single              2               1 singletonAny    
#> 5     5 Single              2               1 singletonAny    
#> 6     6 Single              2               1 minMin          
#> 7     7 Single              2               1 minMin          
#> 8     8 Single              2               1 NaN             
#> 9     9 Single              2               1 NaN             
#> # ℹ 4 more variables: fusionConstraintMode <chr>, m <dbl>, M <dbl>,
#> #   contiguity <lgl>
#> 
#> $initialPartitions
#> # A tibble: 1 × 4
#>   idpartitionInit partitionInit mode    nbFusions
#>             <int> <named list>  <chr>       <dbl>
#> 1               1 <int [20]>    unitary         0
#> 
#> $results
#> # A tibble: 7 × 11
#>   method idExp partition    nbClusters checkContiguityConst checkMinSizeConst
#>   <chr>  <int> <named list>      <int> <lgl>                <lgl>            
#> 1 AHR        2 <int [20]>            3 TRUE                 TRUE             
#> 2 AHR        2 <int [20]>            4 TRUE                 TRUE             
#> 3 AHR        2 <int [20]>            5 TRUE                 TRUE             
#> 4 AHR        3 <int [20]>            3 TRUE                 TRUE             
#> 5 AHR        3 <int [20]>            4 TRUE                 TRUE             
#> 6 AHR        3 <int [20]>            6 TRUE                 TRUE             
#> 7 AHR        3 <int [20]>            5 TRUE                 TRUE             
#> # ℹ 5 more variables: checkMaxSizeConst <lgl>, scoreMinSizeConst <dbl>,
#> #   scoreMaxSizeConst <dbl>, scoreSizeConsts <dbl>, CHI <dbl>
#>