This vignette is an introduction to using the R package
‘odetector’ (Cebeci et al, 2022). You can install the
latest release of the package from CRAN with the following command:
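A minimal sketch of the CRAN installation, assuming the package is published on CRAN under the name ‘odetector’ as stated above. The mirror URL is the one used elsewhere in this vignette, and the guard merely skips the download when the package is already present:

```r
# Install 'odetector' from CRAN only if it is not already available.
if (!requireNamespace("odetector", quietly = TRUE)) {
  install.packages("odetector", repos = "https://cloud.r-project.org")
}
```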
You can also install the development version of the package from GitHub as follows:
if (!requireNamespace("devtools", quietly = TRUE))
  install.packages("devtools", repos = "https://cloud.r-project.org")
devtools::install_github("zcebeci/odetector")
If you have already installed ‘odetector’, you can load it into the R working environment with library(odetector).
We demonstrate outlier detection with ‘odetector’ on a
synthetic data set consisting of three features (p1,
p2 and p3) and four clusters. This
three-dimensional data set was created with the R package
‘MixSim’ (Melnykov et al, 2013). It contains a
total of 130 data objects: 30 in each cluster, plus 10 objects
added as outliers at the bottom of the data set.
In the following code chunk, the data set is loaded into the R working environment, and its first and last rows are displayed to give an idea of its content.
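A sketch of such a chunk, assuming the data set ships with ‘odetector’ under the name x3p4c (the name used later in this vignette):

```r
library(odetector)

# Load the synthetic data set; assumed to be bundled with 'odetector'.
data(x3p4c)

head(x3p4c)  # first six rows: features p1, p2, p3 and the cluster label cl
tail(x3p4c)  # last six rows: the added outliers, labelled cl = 0
```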
## p1 p2 p3 cl
## [1,] 0.9968195 0.3472756 0.4891324 1
## [2,] 0.9933293 0.3108594 0.5058799 1
## [3,] 1.0163660 0.3563446 0.5144635 1
## [4,] 0.9969506 0.4276673 0.4889330 1
## [5,] 0.9883648 0.3068805 0.5345190 1
## [6,] 0.9213208 0.2802930 0.6342422 1
## p1 p2 p3 cl
## [125,] 0.68032778 0.4338203 0.24720960 0
## [126,] 0.09832769 0.6726663 0.71949486 0
## [127,] 0.78069212 0.5256899 0.82378164 0
## [128,] 0.25003095 0.5115713 0.29874354 0
## [129,] 0.19075014 0.9381229 0.05666282 0
## [130,] 0.62097163 0.8947515 0.66280757 0
The following command plots the data set by cluster. The objects marked in black are the outliers.
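One way to draw such a plot with base graphics is sketched here. It assumes that the fourth column (cl) holds the cluster labels, with 0 marking the outliers as in the listing above. The plotting symbol is an arbitrary choice:

```r
library(odetector)
data(x3p4c)

# Colour each object by its label. Adding 1 maps the outlier label 0 to
# colour 1, which is black in the default palette.
cols <- x3p4c[, 4] + 1
pairs(x3p4c[, 1:3], col = cols, pch = 19)
```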
The outlier detection algorithm in the package
‘odetector’ uses the typicality degrees produced
by a possibilistic clustering algorithm such as Possibilistic C-means
(PCM), Fuzzy Possibilistic C-means (FPCM), Possibilistic Fuzzy C-means
(PFCM) or Unsupervised Possibilistic Fuzzy C-means (UPFC). In this
example, we apply outlier detection to the results of the UPFC
algorithm (Wu et al, 2010) as implemented in the package
‘ppclust’ (Cebeci, 2018). For details, see the manual
and vignettes of the R package ‘ppclust’ at https://CRAN.R-project.org/package=ppclust.
To run UPFC, load ‘ppclust’ into the
working environment with library(ppclust).
For clustering, we select the feature columns from the data set
‘x3p4c’ and store them in an object named x as
follows:
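A sketch of this selection, assuming the features occupy the first three columns as in the listing of the data set:

```r
library(odetector)
data(x3p4c)

# Keep the three feature columns and drop the label column cl.
x <- x3p4c[, 1:3]
head(x)
tail(x)
```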
## p1 p2 p3
## [1,] 0.9968195 0.3472756 0.4891324
## [2,] 0.9933293 0.3108594 0.5058799
## [3,] 1.0163660 0.3563446 0.5144635
## [4,] 0.9969506 0.4276673 0.4889330
## [5,] 0.9883648 0.3068805 0.5345190
## [6,] 0.9213208 0.2802930 0.6342422
## p1 p2 p3
## [125,] 0.68032778 0.4338203 0.24720960
## [126,] 0.09832769 0.6726663 0.71949486
## [127,] 0.78069212 0.5256899 0.82378164
## [128,] 0.25003095 0.5115713 0.29874354
## [129,] 0.19075014 0.9381229 0.05666282
## [130,] 0.62097163 0.8947515 0.66280757
Since the data set ‘x3p4c’ has four clusters, we run UPFC
for 4 clusters and display the first rows of the clustering results with
the following commands:
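A sketch of these commands, assuming the upfc function of ‘ppclust’ is called with its default settings and that the rows displayed are those of the typicality matrix t (the printed rows do not sum to 1, which suggests typicalities rather than memberships):

```r
library(ppclust)    # provides upfc()
library(odetector)  # assumed to provide the x3p4c data
data(x3p4c)
x <- x3p4c[, 1:3]

# Run UPFC for four clusters. Values and cluster order can differ
# between runs because the initialization is random.
res.upfc <- upfc(x, centers = 4)

# Typicality degrees of the first six objects
head(res.upfc$t)
```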
## Cluster 1 Cluster 2 Cluster 3 Cluster 4
## 1 0.0018554019 2.244214e-04 0.9218609 0.0001413868
## 2 0.0033327338 8.354150e-05 0.9068443 0.0001624117
## 3 0.0015015496 2.301656e-04 0.9104836 0.0001264145
## 4 0.0006668342 1.438301e-03 0.8493402 0.0001618827
## 5 0.0047133546 6.533282e-05 0.9175368 0.0002465250
## 6 0.0281144849 1.646894e-05 0.7123001 0.0021139593
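To read this table: each row holds one object's typicality degree for each cluster, and a degree close to 1 means the object is typical of that cluster. The base R sketch below re-enters the six printed rows by hand and extracts, for each object, the cluster with the highest typicality. The object names tm, most.typical and max.typ are illustrative only:

```r
# Typicality degrees of objects 1 to 6, copied from the printed output.
tm <- matrix(c(
  0.0018554019, 2.244214e-04, 0.9218609, 0.0001413868,
  0.0033327338, 8.354150e-05, 0.9068443, 0.0001624117,
  0.0015015496, 2.301656e-04, 0.9104836, 0.0001264145,
  0.0006668342, 1.438301e-03, 0.8493402, 0.0001618827,
  0.0047133546, 6.533282e-05, 0.9175368, 0.0002465250,
  0.0281144849, 1.646894e-05, 0.7123001, 0.0021139593
), ncol = 4, byrow = TRUE,
dimnames = list(as.character(1:6), paste("Cluster", 1:4)))

most.typical <- apply(tm, 1, which.max)  # index of the most typical cluster
max.typ <- apply(tm, 1, max)             # the highest typicality itself

most.typical       # all six objects are most typical of Cluster 3
round(max.typ, 3)
```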
In clustering-based outlier detection, using the optimal number of
clusters is critical for partitioning a data set properly, because
the number of clusters must be fixed before the clustering algorithm
starts and it strongly affects the clustering result.
One can determine it by running an appropriate clustering algorithm for
each number of clusters in a range, from ‘c1’ to
‘c2’, and computing clustering validation indices for each run. The
majority of validation indices have been proposed for the results
of hard clustering algorithms, e.g. K-means. For validating the
partitioning results of fuzzy clustering algorithms, a large number
of validation indices have also been proposed, e.g. partition entropy (PE), partition
coefficient (PC), Xie-Beni index (XB), Kwon index (Kwon) and Fuzzy
Hypervolume index (FHV). The R
implementations of such fuzzy and possibilistic validation
indices are described in Cebeci (2020). Below is an example that uses the R package
‘fcvalid’ to determine the optimal number of clusters
(k) in the data set ‘x3p4c’. The package can be installed
from GitHub as follows:
if (!requireNamespace("devtools", quietly = TRUE))
  install.packages("devtools", repos = "https://cloud.r-project.org")
suppressMessages(devtools::install_github("zcebeci/fcvalid"))
After installing the package, run the fcm function of
the ppclust package for each number of clusters from
c1 to c2. Then compute the fuzzy index values with
the relevant functions of the package ‘fcvalid’:
library(ppclust)
library(fcvalid)
c1 <- 2  # Starting number of clusters
c2 <- 5  # Final number of clusters
indnames <- c("PC", "MPC", "PE", "XB", "Kwon", "TSS", "CL", "FS", "PBMF", "FSIL", "FHV", "APD")
indvals <- matrix(ncol = length(indnames), nrow = (c2 - c1 + 1))
colnames(indvals) <- indnames
rownames(indvals) <- paste0("c=", c1:c2)
i <- 1  # Row of indvals filled by the current run
for (nc in c1:c2) {  # Named nc rather than c to avoid masking base::c()
  resfcm <- ppclust::fcm(x = x, centers = nc, nstart = 3)
  indvals[i, 1] <- pc(resfcm)
  indvals[i, 2] <- mpc(resfcm)
  indvals[i, 3] <- pe(resfcm)
  indvals[i, 4] <- xb(resfcm)
  indvals[i, 5] <- kwon(resfcm)
  indvals[i, 6] <- tss(resfcm)
  indvals[i, 7] <- cl(resfcm)
  indvals[i, 8] <- fs(resfcm)
  indvals[i, 9] <- pbm(resfcm)
  indvals[i, 10] <- si(resfcm)$sif
  indvals[i, 11] <- fhv(resfcm)
  indvals[i, 12] <- apd(resfcm)
  i <- i + 1
}
In the output of the R script above, you will see that the majority of the fuzzy indices suggest 3 as the optimal number of clusters, while some suggest 4. For example, the Fuzzy Hypervolume (FHV) index, a powerful fuzzy index, suggests 4 clusters in the data set. So we can extract this number as the optimal number of clusters as follows:
# Display the fuzzy indices in various runs of FCM
indvals <- round(t(indvals), 3)
print(indvals)
# Optimal number of clusters with Fuzzy Hypervolume (FHV) index
optk <- colnames(indvals)[which.min(indvals["FHV", ])]
optk
# Map the position of the minimum back to a cluster count in c1:c2
k <- (c1:c2)[which.min(indvals["FHV", ])]
k
Below is an example of running the UPFC algorithm with the optimal
number of clusters found in the previous example. In the result object
named res.upfc, the component t contains the typicality
degrees to be used in outlier detection.
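A sketch of this run, assuming default settings for upfc. Here k is set by hand to the value suggested by the FHV index so that the chunk runs on its own:

```r
library(ppclust)
library(odetector)
data(x3p4c)
x <- x3p4c[, 1:3]

k <- 4  # number of clusters suggested by the FHV index

res.upfc <- upfc(x, centers = k)
dim(res.upfc$t)  # one row per object, one column per cluster
```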
The outlier detection algorithm uses the object of class
ppclust returned by a possibilistic or fuzzy
clustering algorithm, as shown in the previous section. Outlier detection
is started with the default argument values of the function
detect.outliers as follows:
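A sketch of the default call, assuming the clustering result is the only argument that has to be supplied. The name outs.default is illustrative:

```r
library(ppclust)
library(odetector)
data(x3p4c)
res.upfc <- upfc(x3p4c[, 1:3], centers = 4)

# Detect outliers with the default thresholds.
outs.default <- detect.outliers(res.upfc)
class(outs.default)
```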
To change the typicality thresholds, the arguments
alpha and alpha2 can be set to different
values. In the following command, alpha is set to
0.05 for Approach 1 and alpha2 is set to 0.4 for
Approach 2. See the package manual for details about these
arguments.
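The call recorded in the result object printed later in this vignette has the following form. The object name outs is an assumption, and the setup lines are repeated only so that the chunk is self-contained:

```r
library(ppclust)
library(odetector)
data(x3p4c)
res.upfc <- upfc(x3p4c[, 1:3], centers = 4)

# alpha: threshold for Approach 1; alpha2: threshold for Approach 2
outs <- detect.outliers(x = res.upfc, alpha = 0.05, alpha2 = 0.4)
names(outs)
```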
The structure of the result object:
## List of 5
## $ X : num [1:130, 1:3] 0.997 0.993 1.016 0.997 0.988 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:130] "1" "2" "3" "4" ...
## .. ..$ : chr [1:3] "p1" "p2" "p3"
## $ outliers1: Named int [1:10] 121 122 123 124 125 126 127 128 129 130
## ..- attr(*, "names")= chr [1:10] "Obj.1" "Obj.2" "Obj.3" "Obj.4" ...
## $ outliers2: Named int [1:11] 66 121 122 123 124 125 126 127 128 129 ...
## ..- attr(*, "names")= chr [1:11] "Obj.1" "Obj.2" "Obj.3" "Obj.4" ...
## $ outliers3: Named int(0)
## ..- attr(*, "names")= chr(0)
## $ call : language detect.outliers(x = res.upfc, alpha = 0.05, alpha2 = 0.4)
## - attr(*, "class")= chr "outliers"
The result of detect.outliers is an object of
class outliers. The components of this object can be
displayed individually. For example, the first command displays
the outliers detected with Approach 1, and the second displays the
outliers detected with Approach 2:
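A sketch of the two commands, using the component names shown in the structure of the result object (outs is an assumed object name):

```r
library(ppclust)
library(odetector)
data(x3p4c)
res.upfc <- upfc(x3p4c[, 1:3], centers = 4)
outs <- detect.outliers(x = res.upfc, alpha = 0.05, alpha2 = 0.4)

outs$outliers1  # row indices of the outliers detected with Approach 1
outs$outliers2  # row indices of the outliers detected with Approach 2
```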
## Obj.1 Obj.2 Obj.3 Obj.4 Obj.5 Obj.6 Obj.7 Obj.8 Obj.9 Obj.10
## 121 122 123 124 125 126 127 128 129 130
## Obj.1 Obj.2 Obj.3 Obj.4 Obj.5 Obj.6 Obj.7 Obj.8 Obj.9 Obj.10 Obj.11
## 66 121 122 123 124 125 126 127 128 129 130
The full list of outliers obtained with all of the approaches can be
displayed together with the function print.outliers as
follows:
## Outliers from detect.outliers(x = res.upfc, alpha = 0.05, alpha2 = 0.4)
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
## List of outliers computed with Approach 1
## p1 p2 p3
## 121 0.48240696 0.6277246 0.12007434
## 122 0.56841818 0.7060801 0.49683778
## 123 0.21629682 0.8977075 0.50038525
## 124 0.74970854 0.7773038 0.84919884
## 125 0.68032778 0.4338203 0.24720960
## 126 0.09832769 0.6726663 0.71949486
## 127 0.78069212 0.5256899 0.82378164
## 128 0.25003095 0.5115713 0.29874354
## 129 0.19075014 0.9381229 0.05666282
## 130 0.62097163 0.8947515 0.66280757
##
## List of data set with the outliers (marked with *) by using Approach 1
## 0.997 0.347 0.489
## 0.993 0.311 0.506
## 1.016 0.356 0.514
## 0.997 0.428 0.489
## 0.988 0.307 0.535
## 0.921 0.280 0.634
## 0.975 0.389 0.484
## 0.965 0.429 0.504
## 0.947 0.409 0.574
## 0.890 0.300 0.542
## 0.961 0.337 0.563
## 0.838 0.233 0.722
## 0.927 0.440 0.589
## 0.972 0.351 0.466
## 0.910 0.391 0.569
## 0.967 0.404 0.465
## 0.907 0.365 0.520
## 0.982 0.444 0.486
## 0.923 0.373 0.597
## 0.862 0.296 0.596
## 0.957 0.312 0.561
## 0.965 0.369 0.552
## 0.920 0.344 0.527
## 1.003 0.321 0.541
## 0.932 0.285 0.488
## 0.986 0.328 0.506
## 0.610 0.022 0.721
## 0.608 0.065 0.816
## 0.478 0.078 0.756
## 0.500 0.123 0.779
## 0.540 0.041 0.727
## 0.499 0.098 0.734
## 0.575 -0.010 0.662
## 0.584 0.038 0.777
## 0.598 0.007 0.675
## 0.616 0.034 0.659
## 0.477 0.043 0.696
## 0.595 0.039 0.847
## 0.600 -0.010 0.684
## 0.611 -0.043 0.633
## 0.512 0.124 0.712
## 0.543 0.055 0.833
## 0.642 -0.013 0.747
## 0.701 -0.064 0.746
## 0.554 0.120 0.836
## 0.563 0.053 0.674
## 0.596 0.023 0.771
## 0.600 -0.029 0.677
## 0.448 0.120 0.713
## 0.545 0.042 0.802
## 0.600 -0.039 0.697
## 0.506 0.091 0.691
## 0.598 0.048 0.804
## 0.536 0.065 0.709
## 0.536 0.039 0.694
## 0.529 0.067 0.657
## 0.492 0.077 0.743
## 0.549 0.074 0.663
## 0.580 0.071 0.786
## 0.587 0.027 0.629
## 0.674 0.027 0.784
## 0.591 0.088 0.769
## 0.596 0.030 0.813
## 0.560 0.076 0.692
## 0.860 0.837 0.286
## 0.789 0.817 0.305
## 0.875 1.082 0.402
## 0.969 1.080 0.351
## 0.960 0.954 0.291
## 0.950 1.094 0.484
## 0.940 0.932 0.394
## 0.999 1.040 0.401
## 0.940 0.960 0.378
## 1.025 1.068 0.302
## 0.911 1.058 0.334
## 0.895 0.973 0.341
## 0.943 1.067 0.374
## 0.893 1.075 0.407
## 0.981 1.039 0.426
## 0.876 1.002 0.370
## 0.875 1.099 0.411
## 0.941 0.998 0.314
## 0.949 1.050 0.473
## 0.872 0.785 0.325
## 0.808 0.900 0.407
## 0.847 0.934 0.320
## 0.911 1.039 0.407
## 0.831 0.964 0.400
## 0.939 0.917 0.358
## 0.447 0.330 0.915
## 0.333 0.410 0.959
## 0.446 0.468 0.832
## 0.353 0.362 0.820
## 0.368 0.262 0.797
## 0.401 0.401 0.919
## 0.445 0.352 0.779
## 0.490 0.501 0.681
## 0.446 0.460 0.834
## 0.327 0.306 0.950
## 0.408 0.424 0.746
## 0.345 0.388 0.807
## 0.399 0.501 0.786
## 0.425 0.540 0.668
## 0.368 0.468 0.793
## 0.408 0.252 0.963
## 0.524 0.528 0.726
## 0.365 0.460 0.884
## 0.420 0.465 0.854
## 0.422 0.417 0.827
## 0.524 0.398 0.758
## 0.395 0.382 0.912
## 0.574 0.230 0.720
## 0.452 0.509 0.715
## 0.428 0.367 0.856
## 0.473 0.491 0.794
## 0.359 0.421 0.875
## 0.368 0.498 0.801
## 0.346 0.485 0.779
## 0.372 0.403 0.881
## 0.373 0.389 0.924
## * 0.482 0.628 0.120
## * 0.568 0.706 0.497
## * 0.216 0.898 0.500
## * 0.750 0.777 0.849
## * 0.680 0.434 0.247
## * 0.098 0.673 0.719
## * 0.781 0.526 0.824
## * 0.250 0.512 0.299
## * 0.191 0.938 0.057
## * 0.621 0.895 0.663
##
## List of outliers computed with Approach 2
## p1 p2 p3
## 66 0.78902573 0.8172686 0.30533197
## 121 0.48240696 0.6277246 0.12007434
## 122 0.56841818 0.7060801 0.49683778
## 123 0.21629682 0.8977075 0.50038525
## 124 0.74970854 0.7773038 0.84919884
## 125 0.68032778 0.4338203 0.24720960
## 126 0.09832769 0.6726663 0.71949486
## 127 0.78069212 0.5256899 0.82378164
## 128 0.25003095 0.5115713 0.29874354
## 129 0.19075014 0.9381229 0.05666282
## 130 0.62097163 0.8947515 0.66280757
##
## List of data set with the outliers (marked with *) by using Approach 2
## 0.997 0.347 0.489
## 0.993 0.311 0.506
## 1.016 0.356 0.514
## 0.997 0.428 0.489
## 0.988 0.307 0.535
## 0.921 0.280 0.634
## 0.975 0.389 0.484
## 0.965 0.429 0.504
## 0.947 0.409 0.574
## 0.890 0.300 0.542
## 0.961 0.337 0.563
## 0.838 0.233 0.722
## 0.927 0.440 0.589
## 0.972 0.351 0.466
## 0.910 0.391 0.569
## 0.967 0.404 0.465
## 0.907 0.365 0.520
## 0.982 0.444 0.486
## 0.923 0.373 0.597
## 0.862 0.296 0.596
## 0.957 0.312 0.561
## 0.965 0.369 0.552
## 0.920 0.344 0.527
## 1.003 0.321 0.541
## 0.932 0.285 0.488
## 0.986 0.328 0.506
## 0.610 0.022 0.721
## 0.608 0.065 0.816
## 0.478 0.078 0.756
## 0.500 0.123 0.779
## 0.540 0.041 0.727
## 0.499 0.098 0.734
## 0.575 -0.010 0.662
## 0.584 0.038 0.777
## 0.598 0.007 0.675
## 0.616 0.034 0.659
## 0.477 0.043 0.696
## 0.595 0.039 0.847
## 0.600 -0.010 0.684
## 0.611 -0.043 0.633
## 0.512 0.124 0.712
## 0.543 0.055 0.833
## 0.642 -0.013 0.747
## 0.701 -0.064 0.746
## 0.554 0.120 0.836
## 0.563 0.053 0.674
## 0.596 0.023 0.771
## 0.600 -0.029 0.677
## 0.448 0.120 0.713
## 0.545 0.042 0.802
## 0.600 -0.039 0.697
## 0.506 0.091 0.691
## 0.598 0.048 0.804
## 0.536 0.065 0.709
## 0.536 0.039 0.694
## 0.529 0.067 0.657
## 0.492 0.077 0.743
## 0.549 0.074 0.663
## 0.580 0.071 0.786
## 0.587 0.027 0.629
## 0.674 0.027 0.784
## 0.591 0.088 0.769
## 0.596 0.030 0.813
## 0.560 0.076 0.692
## 0.860 0.837 0.286
## * 0.789 0.817 0.305
## 0.875 1.082 0.402
## 0.969 1.080 0.351
## 0.960 0.954 0.291
## 0.950 1.094 0.484
## 0.940 0.932 0.394
## 0.999 1.040 0.401
## 0.940 0.960 0.378
## 1.025 1.068 0.302
## 0.911 1.058 0.334
## 0.895 0.973 0.341
## 0.943 1.067 0.374
## 0.893 1.075 0.407
## 0.981 1.039 0.426
## 0.876 1.002 0.370
## 0.875 1.099 0.411
## 0.941 0.998 0.314
## 0.949 1.050 0.473
## 0.872 0.785 0.325
## 0.808 0.900 0.407
## 0.847 0.934 0.320
## 0.911 1.039 0.407
## 0.831 0.964 0.400
## 0.939 0.917 0.358
## 0.447 0.330 0.915
## 0.333 0.410 0.959
## 0.446 0.468 0.832
## 0.353 0.362 0.820
## 0.368 0.262 0.797
## 0.401 0.401 0.919
## 0.445 0.352 0.779
## 0.490 0.501 0.681
## 0.446 0.460 0.834
## 0.327 0.306 0.950
## 0.408 0.424 0.746
## 0.345 0.388 0.807
## 0.399 0.501 0.786
## 0.425 0.540 0.668
## 0.368 0.468 0.793
## 0.408 0.252 0.963
## 0.524 0.528 0.726
## 0.365 0.460 0.884
## 0.420 0.465 0.854
## 0.422 0.417 0.827
## 0.524 0.398 0.758
## 0.395 0.382 0.912
## 0.574 0.230 0.720
## 0.452 0.509 0.715
## 0.428 0.367 0.856
## 0.473 0.491 0.794
## 0.359 0.421 0.875
## 0.368 0.498 0.801
## 0.346 0.485 0.779
## 0.372 0.403 0.881
## 0.373 0.389 0.924
## * 0.482 0.628 0.120
## * 0.568 0.706 0.497
## * 0.216 0.898 0.500
## * 0.750 0.777 0.849
## * 0.680 0.434 0.247
## * 0.098 0.673 0.719
## * 0.781 0.526 0.824
## * 0.250 0.512 0.299
## * 0.191 0.938 0.057
## * 0.621 0.895 0.663
##
## List of outliers computed with Approach 3
## No outliers detected.
##
## List of outliers computed with Approach 4
## No outliers detected.
The function summary.outliers calculates the descriptive
statistics of the detected outliers.
## Summary of Outliers from detect.outliers(x = res.upfc, alpha = 0.05, alpha2 = 0.4)
## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
##
## Summary of outliers computed with Approach 1
## p1 p2 p3
## Min. :0.09833 Min. :0.4338 Min. :0.05666
## 1st Qu.:0.22473 1st Qu.:0.5512 1st Qu.:0.26009
## Median :0.52541 Median :0.6894 Median :0.49861
## Mean :0.46379 Mean :0.6985 Mean :0.47752
## 3rd Qu.:0.66549 3rd Qu.:0.8654 3rd Qu.:0.70532
## Max. :0.78069 Max. :0.9381 Max. :0.84920
##
## Summary of outliers computed with Approach 2
## p1 p2 p3
## Min. :0.09833 Min. :0.4338 Min. :0.05666
## 1st Qu.:0.23316 1st Qu.:0.5767 1st Qu.:0.27298
## Median :0.56842 Median :0.7061 Median :0.49684
## Mean :0.49336 Mean :0.7093 Mean :0.46187
## 3rd Qu.:0.71502 3rd Qu.:0.8560 3rd Qu.:0.69115
## Max. :0.78903 Max. :0.9381 Max. :0.84920
##
## Summary of outliers computed with Approach 3
## No outliers detected
## Summary of outliers in small clusters
## No outliers detected
##
## Available components:
## [1] "X" "outliers1" "outliers2" "outliers3" "call"
There are many ways to visualize the results of an
outlier detection analysis. A traditional way is to plot the results
with the functions plot.outliers and
pairs.outliers.
The function plot.outliers plots the scatter of the
outliers. The argument ot specifies the number of the approach used to
detect the outliers. It ranges between 1 and 4.
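A sketch of both plots, called through the plot and pairs generics. The ot argument is the one described above. Calling pairs with only the result object assumes that the remaining arguments of pairs.outliers have usable defaults, which should be checked against the package manual:

```r
library(ppclust)
library(odetector)
data(x3p4c)
res.upfc <- upfc(x3p4c[, 1:3], centers = 4)
outs <- detect.outliers(x = res.upfc, alpha = 0.05, alpha2 = 0.4)

plot(outs, ot = 1)  # outliers found with Approach 1
plot(outs, ot = 2)  # outliers found with Approach 2
pairs(outs)         # scatter plot matrix; default arguments assumed
```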
For further analysis, the outliers can be removed from the
original data set with the function remove.outliers as
follows:
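A sketch of the removal step. It assumes that remove.outliers takes the detection result as its first argument and that any other arguments have defaults. Only the sc option is documented in this vignette, so the exact signature and the form of the return value should be checked in the package manual. The name x.clean is illustrative:

```r
library(ppclust)
library(odetector)
data(x3p4c)
res.upfc <- upfc(x3p4c[, 1:3], centers = 4)
outs <- detect.outliers(x = res.upfc, alpha = 0.05, alpha2 = 0.4)

# sc = TRUE also treats objects in small clusters as collective outliers.
x.clean <- remove.outliers(outs, sc = TRUE)
str(x.clean)  # inspect what was returned
```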
In the above command, the option sc is set to
TRUE if the data objects in small clusters should
be treated as collective outliers. Compare the following figure to the
figure plotted for the original data in the second
section.
By default, the outlier detection algorithm uses a threshold
typicality level (alpha) of 0.05. Higher values of this argument are
expected to yield more outliers, while lower values
yield fewer. In the following command,
alpha is set to 0.1 instead of the default value of
0.05.
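A sketch of this call. The name outs.a10 is illustrative, and alpha2 is left at its default here, which is an assumption about the original command:

```r
library(ppclust)
library(odetector)
data(x3p4c)
res.upfc <- upfc(x3p4c[, 1:3], centers = 4)

# Raise the Approach 1 threshold from 0.05 to 0.1.
outs.a10 <- detect.outliers(x = res.upfc, alpha = 0.1)
outs.a10$outliers1
```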
As seen below, data object 66 is also flagged as
an outlier with the setting alpha=0.1. For details,
see the package manual of ‘odetector’.
## Obj.1 Obj.2 Obj.3 Obj.4 Obj.5 Obj.6 Obj.7 Obj.8 Obj.9 Obj.10 Obj.11
## 66 121 122 123 124 125 126 127 128 129 130
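The monotone effect of the threshold can be seen in isolation with a few lines of base R. This is only a sketch. It assumes a simplified rule in which an object is flagged when its largest typicality degree falls below alpha, which may differ from the rules that ‘odetector’ actually applies, and the typicality values are hypothetical:

```r
# Hypothetical largest typicality degree of eight objects (illustrative values).
max.typ <- c(0.92, 0.85, 0.71, 0.30, 0.08, 0.04, 0.02, 0.01)

# Simplified, assumed rule: flag an object when its largest typicality < alpha.
flag <- function(alpha) which(max.typ < alpha)

flag(0.05)  # objects 6, 7 and 8
flag(0.10)  # objects 5, 6, 7 and 8: a higher alpha flags more objects
```

Under this rule every object flagged at a lower alpha stays flagged at a higher one, which matches the behaviour seen above, where raising alpha from 0.05 to 0.1 adds object 66 to the list.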
Wu, X., Wu, B., Sun, J. & Fu, H. (2010). Unsupervised possibilistic fuzzy clustering. J. of Information & Computational Sci., 7(5): 1075-1080.
Melnykov, V., Chen, W.-C. & Maitra, R. (2013). MixSim: An R package for simulating data to study performance of clustering algorithms. J. of Statistical Software, 51(12):1-25. DOI: https://doi.org/10.18637/jss.v051.i12.
Cebeci, Z. (2018). Comparison of internal validity indices for fuzzy clustering. Journal of Agricultural Informatics, 10(2):1-14. DOI: https://doi.org/10.17700/jai.2019.10.2.537.
Cebeci, Z. (2020). fcvalid: An R Package for Internal Validation of Probabilistic and Possibilistic Clustering. Sakarya University Journal of Computer and Information Sciences, 3(1), 11-27. DOI: https://doi.org/10.35377/saucis.03.01.664560.
Cebeci, Z., Cebeci, C., Tahtali, Y. & Bayyurt, L. (2022). Two novel outlier detection approaches based on unsupervised possibilistic and fuzzy clustering. Peerj Computer Science, 8:e1060. https://doi.org/10.7717/peerj-cs.1060.