Algorithm that improves, according to a given criterion, a feasible solution to a classification problem with connectivity and size constraints. [Experimental]

Usage

enhance_feasible(
  regionalisation,
  distances = NULL,
  contiguity = NULL,
  sizes = NULL,
  d = NULL,
  data = NULL,
  m = 0,
  M = Inf,
  standardQuant = FALSE,
  binarQual = FALSE,
  enhanceCriteria = c("AHC", "Silhouette", "Dunn"),
  linkages = "saut max",
  evaluationCriteria = enhanceCriteria,
  maxIt = Inf,
  parallel = TRUE,
  nbCores = detectCores() - 1L,
  verbose = TRUE
)

Arguments

regionalisation

A feasible regionalisation to improve.

distances

The distance matrix of the problem. This can be omitted if a distance function d and a data set data are provided. If only distances is provided, all distances must be present. (distance matrix)
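As an illustration, a complete distance matrix can be built with base R before being supplied as distances. This is only a minimal sketch on hypothetical toy data (`toy` and `dist_mat` are names chosen here); the exact class the function expects is not stated on this page, so the conversion to a plain matrix is an assumption.

```r
# Hypothetical toy data: 4 elements described by two quantitative variables
toy <- data.frame(x = c(0, 3, 0, 3), y = c(0, 0, 4, 4))

# Full symmetric matrix of Euclidean distances, with no missing entry
dist_mat <- as.matrix(dist(toy, method = "euclidean"))
```

Every pair of elements has a distance in `dist_mat`, which matches the requirement that all distances be present when d and data are not given.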

contiguity

A contiguity matrix or an igraph contiguity graph. If not provided, the problem is considered completely contiguous (all elements are neighbors of each other).

sizes

The size of each element. By default, it is set to 1 for each element (the size of a cluster is then its cardinality). All sizes must be positive or zero. (positive real numeric vector)

d

Distance function between elements. This can be omitted if distances is already provided. If present, data must also be specified. Some classical distances are available; for performance reasons, it is recommended to use them rather than a user-defined function:

  • "euclidean": Euclidean distance.

  • "manhattan": Manhattan distance.

  • "minkowski": Minkowski distance. In that case, a value for p >= 1 must be specified.

(function or string)

data

A data.frame where each row contains the data related to one element. This can be omitted if d is omitted. Variables can be quantitative or qualitative. If qualitative variables are present, some distances cannot be used. Variables can be standardised, and qualitative variables can be transformed into binary variables (one-hot encoding), using standardQuant and binarQual. (data.frame)

m

Minimum size constraint. Must be positive or zero and small enough for the problem to be feasible. Default is 0 (no constraint). (positive number)

M

Maximum size constraint. Must be positive, greater than or equal to m, and large enough for the problem to be feasible. Default is Inf (no constraint). (positive number)
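To make the role of sizes, m and M concrete, the following base R sketch checks the size constraints on a partition. It assumes, purely for illustration, that a regionalisation is represented as a vector of cluster labels (one per element); the representation actually used by the package is not specified on this page, and all values below are hypothetical.

```r
# Hypothetical inputs: cluster label and size of each of 5 elements
regionalisation <- c(1, 1, 2, 2, 2)
sizes <- c(10, 20, 5, 5, 15)
m <- 20  # minimum size allowed for a cluster
M <- 40  # maximum size allowed for a cluster

# Total size of each cluster (sum of the sizes of its elements)
cluster_sizes <- tapply(sizes, regionalisation, sum)

# The size constraints hold if every cluster total lies in [m, M]
size_ok <- all(cluster_sizes >= m & cluster_sizes <= M)
```

With the default sizes of 1 per element, `cluster_sizes` would simply count the elements in each cluster.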

standardQuant

TRUE if the variables in data should be standardised (i.e., centered and scaled), FALSE (default) otherwise. Standardisation is applied after the possible binarization of qualitative variables (see binarQual). (flag)

binarQual

TRUE if qualitative variables should be binarized (one-hot encoding), for example, to make the data set compatible with common distances or to standardize these variables. FALSE (default) otherwise. (flag)
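The effect of combining binarQual and standardQuant can be pictured with base R. The sketch below is not the package's internal code; it only mirrors the order described above (one-hot encoding first, then centring and scaling) on hypothetical toy data.

```r
# Hypothetical toy data: one quantitative and one qualitative variable
toy <- data.frame(
  income = c(10, 20, 30, 40),
  type = factor(c("urban", "rural", "urban", "rural"))
)

# One-hot encode the qualitative variable: one 0/1 column per level
one_hot <- sapply(levels(toy$type), function(lv) as.numeric(toy$type == lv))

# Bind with the quantitative variable, then centre and scale every column
encoded <- cbind(income = toy$income, one_hot)
standardised <- scale(encoded)
```

After this step every column is numeric, so a distance such as "euclidean" can be computed on the rows of `standardised`.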

enhanceCriteria

A vector of criteria used to enhance the current feasible solution. Currently available choices are those in available_criteria(), plus "AHC" (which depends on the linkages parameter). Unlike the other criteria, AHC does not improve a global criterion but works locally, in the hope of reducing computing time. Under this criterion, a feasible solution obtained by moving a single element from one cluster to another is better if the element is closer to the new cluster than to its current one (according to the chosen linkage).
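The local comparison behind "AHC" can be sketched in base R as follows. This is an illustrative reading of the description, not the package's implementation: `linkage_dist` is a hypothetical helper, the linkage is passed as a summary function (`max` for a complete-style linkage, `min` for a single-style one), and the connectivity and size checks that a move must also satisfy to remain feasible are deliberately left out.

```r
# Hypothetical helper: distance from element i to a set of elements,
# under a linkage given as a summary function of the pairwise distances
linkage_dist <- function(dist_mat, i, members, linkage = max) {
  linkage(dist_mat[i, setdiff(members, i)])
}

# Toy problem: 5 points on a line, with element 3 placed in cluster 1
pos <- c(0, 1, 7, 8, 9)
dist_mat <- as.matrix(dist(pos))
clusters <- c(1, 1, 1, 2, 2)

i <- 3
d_own <- linkage_dist(dist_mat, i, which(clusters == 1))    # max(7, 6) = 7
d_other <- linkage_dist(dist_mat, i, which(clusters == 2))  # max(1, 2) = 2

# Under this local rule, the move is an improvement if the element is
# closer to the other cluster than to its current one
move_improves <- d_other < d_own
```

Because only the distances involving the moved element are examined, each candidate move is cheap to evaluate compared with recomputing a global criterion.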

linkages

Vector of linkage distances used when a criterion ("Dunn", "AHC") needs it.

evaluationCriteria

Criteria used for comparison after enhancement. They are evaluated on each feasible solution produced by each enhancement criterion. Must be a vector of criteria available in c3t. For the Dunn index, there is one criterion per linkage given. See available_criteria().

maxIt

Maximum number of iterations allowed. Default is Inf. (strictly positive integer)

parallel

Logical indicating whether to use parallel processing. Default is TRUE.

nbCores

Number of CPU cores to use for parallel processing (sockets method). Default is one less than the detected number of cores.

verbose

Logical indicating whether to display progress messages. Default is TRUE.

Value

A tibble with one row per attempt. Each row contains the following variables:

  • criterion: name of the criterion used for improvement.

  • linkage: type of linkage distance used (NA if this argument is irrelevant for the current criterion).

  • sampleSize: size of the sample for the calculation of the criterion (NA if irrelevant).

  • statut: state of improvement. Indicates whether an improvement could be made or not.

  • iterations: number of improving iterations performed.

  • regionalisationOpti: the new regionalisation. Identical to the input argument if no improvement could be made.

  • one column per criterion given in evaluationCriteria. If some of those criteria use a linkage distance, there is one column per criterion and per linkage distance given in linkages.

References

Marc Christine and Michel Isnard. "Un algorithme de regroupement d'unités statistiques selon certains critères de similitudes", Insee Méthodes, 2000, p. 50.