Title: | Outlier Detection Using Partitioning Clustering Algorithms |
---|---|
Description: | An object is called an "outlier" if it deviates markedly from the other objects in a data set. Outlier detection is the process of finding outliers with methods based on distance measures, clustering, and spatial methods (Ben-Gal, 2005 <ISBN 0-387-24435-2>). It is an intensively studied research topic, used to identify novelties, frauds, anomalies, deviations, or exceptions, as well as to remove outliers during data processing. This package implements some novel approaches for detecting outliers based on typicality degrees obtained with soft partitioning clustering algorithms such as Fuzzy C-Means and its variants. |
Authors: | Zeynel Cebeci [aut, cre] |
Maintainer: | Zeynel Cebeci <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.0 |
Built: | 2025-03-09 03:10:35 UTC |
Source: | https://github.com/zcebeci/odetector |
An object is an "outlier" if it deviates markedly from the other objects in a data set. Outlier detection is the process of identifying outliers with methods based on distance measures, clustering, and spatial methods (Ben-Gal, 2005). This package provides functions for some novel approaches that detect outliers based on typicality degrees obtained from fuzzy and possibilistic clustering algorithms, such as the Unsupervised Possibilistic Fuzzy C-Means clustering algorithm (Wu et al., 2010).
Although the task is mostly called outlier detection or anomaly detection, it goes by many synonymous names in different application domains, e.g., fraud detection, discordant detection, exception mining, aberration detection, surprise detection, peculiarity detection, and contaminant detection.
Outlier detection methods can be classified under different taxonomies. A common taxonomy divides them into clustering-based, distance-based, and density-based methods. Clustering-based methods divide the data objects into clusters and look for objects that are not typical members of any cluster. The novel approaches in this package use the typicality degrees produced by possibilistic and fuzzy clustering algorithms to decide whether a data point is atypical. For example, an object is considered atypical if its average possibilistic membership degree over all clusters is less than a pre-defined threshold typicality degree. Objects that satisfy this rule are labeled as outliers.
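The average-typicality rule in the example above can be written compactly. The symbols used here are notation introduced for illustration only: t_ij denotes the typicality degree of object i to cluster j, k is the number of clusters, and alpha is the pre-defined threshold.

```latex
\bar{t}_i = \frac{1}{k} \sum_{j=1}^{k} t_{ij},
\qquad
\text{object } i \text{ is labeled an outlier if } \bar{t}_i < \alpha .
```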
Zeynel Cebeci, Cagatay Cebeci, Yalcin Tahtali
Ben-Gal, I. (2005). Outlier detection. In Maimon, O. & Rokach, L. (Eds.), Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic Publishers, <ISBN 0-387-24435-2>.
Wu, X., Wu, B., Sun, J. & Fu, H. (2010). Unsupervised possibilistic fuzzy clustering. J. of Information & Computational Sci., 7 (5): 1075-1080.
upfc, detect.outliers, plot.outliers, pairs.outliers, print.outliers, remove.outliers, summary.outliers
The detect.outliers function finds the outliers by using four different approaches based on the typicality degrees of the data objects in a data set.
detect.outliers (x, k, alpha=0.05, alpha2=0.2, tsc="m1")
x |
an object of class ‘ppclust’ containing the clustering results from a possibilistic and fuzzy clustering algorithm in the package ppclust. Alternatively, a numeric data frame or matrix containing the data set can be supplied, in which case the object of class ‘ppclust’ is generated internally. |
k |
an integer specifying the number of clusters. If the argument |
alpha |
a number specifying the threshold typicality value used to detect the outliers. If the typicality value of an object is less than this value, the object is labeled as an outlier. The default value of |
alpha2 |
a number specifying the threshold typicality value used with Approach 2 to detect the outliers. Objects whose typicality degrees sum to less than this value (row sums over all clusters) are labeled as outliers. The default value of |
tsc |
a string specifying the method used to determine the size of small clusters for finding collective outliers. The default value is m1 and the alternative is m2. See the Details. |
The function detect.outliers computes the outliers by using four different approaches. The first approach (Approach 1) assumes that a data object is an outlier if its average typicality is less than alpha, a user-defined threshold typicality degree. In the second approach (Approach 2), an object is labeled as an outlier if the sum of its typicality degrees to all clusters is less than alpha2, a user-defined threshold for the row sums of the typicalities. In the third approach (Approach 3), an object is labeled as an outlier if its typicality to every cluster is less than alpha. The last approach (Approach 4) treats all members of a small cluster as collective outliers, which can also be labeled as outliers.
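The three point-wise rules can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the typicality matrix `t_mat` and its values are invented, and "less than alpha for all clusters" in Approach 3 is read here as "the largest typicality in the row is below alpha".

```r
# Hypothetical typicality matrix: 5 objects (rows) x 3 clusters (columns).
# The values are invented for illustration only.
t_mat <- matrix(c(
  0.90, 0.10, 0.05,
  0.02, 0.03, 0.04,
  0.10, 0.80, 0.20,
  0.04, 0.10, 0.04,
  0.01, 0.02, 0.01
), ncol = 3, byrow = TRUE)

alpha  <- 0.05  # threshold for Approaches 1 and 3 (documented default)
alpha2 <- 0.2   # threshold for Approach 2 (documented default)

# Approach 1: average typicality over all clusters is below alpha
approach1 <- which(rowMeans(t_mat) < alpha)

# Approach 2: sum of typicalities over all clusters is below alpha2
approach2 <- which(rowSums(t_mat) < alpha2)

# Approach 3 (assumed reading): typicality to every cluster is below alpha,
# i.e. the row maximum is below alpha
approach3 <- which(apply(t_mat, 1, max) < alpha)

print(list(approach1 = approach1, approach2 = approach2, approach3 = approach3))
```

In this toy matrix, object 4 is flagged only by Approach 2, which shows that the three rules need not agree for the same data.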
With Approach 4, the members of small clusters are considered collective outliers. In the function detect.outliers, two different methods are available to compute the threshold small cluster size (tsc). Of the two equations for tsc, the first was proposed by Santos-Pereira & Pires (2002) and works well for small data sets. The second is a novel method proposed by the authors of this document and works better than the first for larger data sets.
In these equations, p is the number of features, k is the number of clusters, and n is the number of objects.
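Once a threshold small cluster size is available, the collective-outlier rule of Approach 4 amounts to flagging every member of an undersized cluster. The sketch below assumes a fixed, hypothetical threshold `tsc_value` rather than either of the package's formulas, invents the cluster labels, and assumes that "small" means a size strictly below the threshold.

```r
# Hypothetical hard cluster labels for 12 objects (invented for illustration).
cluster <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4)

# Assumed threshold small cluster size. In the package this value is derived
# from p, k and n according to tsc = "m1" or "m2"; here it is simply fixed.
tsc_value <- 3

# Cluster sizes: cluster 1 has 5 members, 2 has 4, 3 has 2, 4 has 1
sizes <- table(cluster)

# Clusters whose size is strictly below the threshold (assumed comparison)
small_clusters <- names(sizes)[sizes < tsc_value]

# Row indexes of all members of the small clusters
collective_outliers <- which(as.character(cluster) %in% small_clusters)
print(collective_outliers)
```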
an object of class ‘outliers’ containing the following items:
X |
a numeric data matrix containing the processed data set. |
outliers1 |
a numeric vector containing the labels (row indexes) of outliers found by the Approach 1. |
outliers2 |
a numeric vector containing the labels (row indexes) of outliers found by the Approach 2. |
outliers3 |
a numeric vector containing the labels (row indexes) of outliers found by the Approach 3. |
outliers4 |
a numeric vector containing the labels (row indexes) of the objects in the small clusters, to be treated as outliers. |
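Because each approach returns its own vector of row indexes, it can be useful to compare them. The following sketch uses a plain list as a hypothetical stand-in for a returned object; only the documented component names are taken from the description above, and the index values are invented.

```r
# Hypothetical stand-in for a detect.outliers() result. Only the component
# names come from the documentation; the index values are invented.
out <- list(
  outliers1 = c(121L, 124L, 130L),
  outliers2 = c(121L, 122L, 124L, 130L),
  outliers3 = c(121L, 130L),
  outliers4 = integer(0)
)

point_sets <- out[c("outliers1", "outliers2", "outliers3")]

# Rows flagged by all three point-wise approaches
consensus <- Reduce(intersect, point_sets)

# Rows flagged by at least one point-wise approach
any_rule <- sort(unique(unlist(point_sets, use.names = FALSE)))

print(consensus)
print(any_rule)
```

Taking the intersection gives a conservative set of outliers, while the union is more inclusive; which is preferable depends on the cost of false alarms in the application.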
Zeynel Cebeci
Santos-Pereira, C.M. & Pires, A.M. (2002), Detection of outliers in multivariate data: A method based on clustering and robust estimators. In Haerdle W., Roenz B. (eds) Compstat. Physica, Heidelberg. pp. 291-296.
Wu, X., Wu, B., Sun, J. & Fu, H. (2010). Unsupervised possibilistic fuzzy clustering. J. of Information & Computational Sci., 7 (5): 1075-1080.
plot.outliers, pairs.outliers, print.outliers, remove.outliers, summary.outliers, upfc
# Load the dataset x3p4c and extract the first three columns
data(x3p4c)
x <- x3p4c[,1:3]

# For 4 clusters, run Unsupervised Possibilistic
# Fuzzy C-Means (UPFC) algorithm of the package ppclust
res.upfc <- ppclust::upfc(x, centers=4)

# Detect the outliers with a ppclust object
out <- detect.outliers(res.upfc)

# Summarize and plot the outliers
summary(out)
plot(out)

# Detect the outliers with a higher possibility
out <- detect.outliers(res.upfc, alpha=0.1)

# Summarize and plot the outliers
summary(out)
plot(out)

# Detect the outliers with an original data frame or matrix
x <- x3p4c[,1:3]
head(x)
out <- detect.outliers(x=x, k=4, alpha=0.1)

# Summarize and plot the outliers
summary(out)
plot(out)
Plots the scatter plots showing the outliers found in a data set.
## S3 method for class 'outliers'
pairs(x, ...)
x |
an object of |
... |
additional arguments for S3 method |
scatter plots showing the outliers by the variable pairs.
Zeynel Cebeci, Cagatay Cebeci, Yalcin Tahtali
detect.outliers, plot.outliers, print.outliers, remove.outliers, summary.outliers
# Load the dataset x3p4c and extract the first three columns to x
data(x3p4c)
x <- x3p4c[,1:3]

# For 4 clusters, run Unsupervised Possibilistic Fuzzy C-Means (UPFC) algorithm
# of the package ppclust
res.upfc <- ppclust::upfc(x, centers=4)

# Detect the outliers
out <- detect.outliers(res.upfc)

# Plot the outliers by the variable pairs
pairs(out)
Plots the outliers found in a data set.
## S3 method for class 'outliers'
plot(x, ot=1, ...)
x |
an object of |
ot |
an integer in the range [1,4] specifying the outlier detection approach. |
... |
additional arguments for S3 method |
plots of the object of outliers class.
Zeynel Cebeci, Cagatay Cebeci, Yalcin Tahtali
detect.outliers, pairs.outliers, print.outliers, remove.outliers, summary.outliers
# Load the dataset x3p4c and extract the first three columns to x
data(x3p4c)
x <- x3p4c[,1:3]

# For 4 clusters, run Unsupervised Possibilistic Fuzzy C-Means (UPFC) algorithm
# of the package ppclust
res.upfc <- ppclust::upfc(x, centers=4)

# Detect the outliers
outs <- detect.outliers(res.upfc)

# Plot the outliers
plot(outs, ot=1)
Prints the outliers found in a data set.
## S3 method for class 'outliers'
print(x, ...)
x |
an object of |
... |
additional arguments for S3 method |
Printout of the object of outliers class.
Zeynel Cebeci, Cagatay Cebeci, Yalcin Tahtali
detect.outliers, pairs.outliers, plot.outliers, remove.outliers, summary.outliers, upfc
# Load the dataset x3p4c and use the first three columns
data(x3p4c)
x <- x3p4c[,1:3]

# For 4 clusters, run Unsupervised Possibilistic Fuzzy C-Means (UPFC) algorithm
# of the package ppclust
res.upfc <- ppclust::upfc(x, centers=4)

# Detect the outliers
out <- detect.outliers(res.upfc)

# Print the outliers
print(out)
Removes the detected outliers from a data set.
remove.outliers(x, ot=1, sc=FALSE)
x |
an object of |
ot |
an integer specifying the outlier detection approach. The default is 1 for the Approach 1. For the other methods use 2 or 3. See |
sc |
a logical value indicating whether the objects in the small clusters are included in the removal process. The default is FALSE. Use TRUE to also remove the objects in the small clusters. |
Xr |
a numeric matrix containing the outliers-removed data set. |
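Conceptually, removal amounts to dropping the flagged rows from the data matrix. The helper below is a hypothetical sketch, not the package's code: the name `drop_rows`, its arguments, and the toy matrix are all assumptions. It also guards against a common base R pitfall, namely that negative indexing with an empty index vector returns zero rows instead of the full matrix.

```r
# Hypothetical helper (not part of the package): drop flagged rows from X.
# idx       : row indexes flagged by one of the point-wise approaches
# small_idx : row indexes of objects in small clusters
# sc        : if TRUE, the small-cluster members are removed as well
drop_rows <- function(X, idx, small_idx = integer(0), sc = FALSE) {
  if (sc) idx <- unique(c(idx, small_idx))
  # X[-integer(0), ] would return an empty matrix, so return X unchanged
  if (length(idx) == 0) return(X)
  X[-idx, , drop = FALSE]
}

# Toy data: 6 rows, 2 columns (invented for illustration)
X <- matrix(1:12, ncol = 2)

Xr1 <- drop_rows(X, idx = c(2L, 5L))
Xr2 <- drop_rows(X, idx = c(2L, 5L), small_idx = 6L, sc = TRUE)
Xr0 <- drop_rows(X, idx = integer(0))

print(nrow(Xr1))
print(nrow(Xr2))
print(nrow(Xr0))
```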
Zeynel Cebeci, Cagatay Cebeci, Yalcin Tahtali
detect.outliers, pairs.outliers, plot.outliers, print.outliers, summary.outliers
# Load the dataset x3p4c and extract the first three columns to x
data(x3p4c)
x <- x3p4c[,1:3]

# For 4 clusters, run Unsupervised Possibilistic Fuzzy C-Means (UPFC) algorithm
# of the package ppclust
res.upfc <- ppclust::upfc(x, centers=4)

# Detect the outliers
out <- detect.outliers(res.upfc)

# Remove the outliers
Xr1 <- remove.outliers(out, ot=1)
print(Xr1)

# Remove the outliers including the collective outliers
Xr2 <- remove.outliers(out, ot=1, sc=TRUE)
print(Xr2)
Summarizes the detected outliers for a data set.
## S3 method for class 'outliers'
summary(object, ...)
object |
an object of |
... |
additional arguments for S3 method |
Printout of the descriptive statistics for the outliers in an object of outliers class.
Zeynel Cebeci, Yalcin Tahtali, Cagatay Cebeci
detect.outliers, pairs.outliers, plot.outliers, print.outliers, remove.outliers
# Load the dataset x3p4c and extract the first three columns to x
data(x3p4c)
x <- x3p4c[,1:3]

# For 4 clusters, run Unsupervised Possibilistic Fuzzy C-Means (UPFC) algorithm
# of the package ppclust
res.upfc <- ppclust::upfc(x, centers=4)

# Detect the outliers
out <- detect.outliers(res.upfc)

# Summarize the outliers
summary(out)
A synthetic data set created with the R package ‘MixSim’ (Melnykov et al., 2013). It consists of three continuous variables forming four clusters. The last ten rows of the data set (rows 121 to 130) contain the outliers, which are labeled as class "0".
data(x3p4c)
A data frame with 130 rows and 3 numeric variables:
a numeric continuous variable
a numeric continuous variable
a numeric continuous variable
an integer variable containing the class labels. While the label 0 represents the generated outliers, the labels 1-4 stand for the classes of the clusters.
The data set x3p4c is recommended for learning the outlier detection algorithms.
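Since the generated outliers occupy rows 121 to 130, detected row indexes can be scored against them. The sketch below assumes a hypothetical vector `detected` with invented values; in practice it would be one of the index vectors returned by a detection run.

```r
# Ground truth: the documentation states that rows 121-130 hold the outliers
truth <- 121:130

# Hypothetical detection result (invented indexes, for illustration only)
detected <- c(50L, 121L, 123L, 124L, 128L, 130L)

# True positives: detected rows that are generated outliers
tp <- length(intersect(detected, truth))

# Precision: share of detected rows that are true outliers
precision <- tp / length(detected)

# Recall: share of true outliers that were detected
recall <- tp / length(truth)

print(c(tp = tp, precision = precision, recall = recall))
```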
Melnykov, V., Chen, W.-C. & Maitra, R. (2013). MixSim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software, 51 (12): 1-25.
data(x3p4c)

# Descriptive statistics of the data set
summary(x3p4c)

# Plot the data set
pairs(x3p4c[,-4], col=x3p4c[,4], pch=19, cex=2)