Objective Computer Clusterization algorithm (OCC)

In this algorithm clusterization of input data sample, optimal after balance of clusterization criterion, is found by rationally organised sorting-out procedure. It finds optimal clusterizations of input data sample among all possible clusterizations.

Adequateness Law and black box concept

Almost all objects of recognition and control in economy, ecology, biology and medicine are undeterministic or fuzzy. They can be represented by deterministic (robust) part and additional black boxes acting on each output of object. The only information about these boxes is that they have limited values of output variables which are similar to the corresponding states of object.

According to W.R.Ashby work [33] diversity of control system is to be not smaller, than diversity of the object itself. The Law of Adequateness, given by S.Beer, establishes that for optimal control the objects are to be compensated by corresponding black boxes of the control system [13]. For optimal pattern recognition and clustering only partial compensation is necessary. More of what we are interested in is to minimise the degree of compensation by all means to get more accurate results.

The simplest presentation of black boxes outputs are the set of random figures with even distribution. Random figures can be compensated by another random figures only after smoothing or averaging. This operation is realised by interval discretization of the input data to D levels. According to Widrow theorem each level is to unite in itself almost equal number of the input data realisations (points).

" Pointing Finger" clusterization algorithm

The degree of black boxes mutual compensation can be easily regulated and optimized by balance of two clusterizations criterion calculation. It is obtained using two trees of clusterization construction.

There are shown:
    1 - input data sample;
    2 - interval discretizated sample;
    3,7 - calculation of the distances between points;
    4,8 - first and second hierarchical clusterization trees;
    6 - interval discretizated tree, calculated with the account of analogues;
    9 - calculation of number of clusters and balance criterion, which is equal BL=0 for several values of discretization levels D and several values of compensation coefficient l. The choice of the D-plane, where s=3 (two trivial clusterizations and optimal one).

The realizations presented in data sample correspond to the points of multidimensional hyperspace. Each point has its nearest neighbor or first analogue. To calculate analogues the city-block measure of distance is used. Then the sample of analogues is calculated, according to weighted summation formulae:

where: B - realization, given in the input data sample;
A1 - its first analogue (nearest neighbor);
l - coefficient of black boxes mutual compensation.

Formulae is valid for continue-valued and interval discretizated features. For binary variables the voting procedures are developed [8]. The hierarchical tree of clusterization is constructed for discretizated input data sample B and for the sample of analogues A1. There is proved, that the hierarchical tree construction can be considered as a procedure, which minimises the sorting-out volume: the optimal clusterization is not excluded in the result of this procedure [25]. Then the balance of clusterizations criterion is calculated for two hierarchical trees:

where: k - number of clusters;
- number of similar clusters.

The pointing-out characteristic on the figure shows the change of the criterion along the steps of trees construction. Except of the tree clusterization balance criterion is to be equal to zero at the very beginning and at the end of trees construction, i.m. for clusterizations: s = 1 - every point is the separate cluster; s = N - all points are united to one cluster. Optimal clusterization we can find by means of D and l variation :

D = N, N/2, N/3, ..., 2 ;
l = 0, 0.05, 0.1, ... , 1.

The value of noise compensation coefficient l is chosen to get single zero value of balance criterion. If the number of optimal clusterizations is cannot be reduced by increasing of l coefficient there are necessary to invite experts for final decision.

It was shown that one can apply computer sorting algorithms to choosing clusterings and pattern recognition and not just to modeling [17,20]. There is no difference in general between modeling and clusterization. The difference is only in the degree of detail of the mathematical language. The language that is used is fuzzier: instead of equations, it uses cluster relationships.