Ivan Jordanov
Computational Intelligence for Data Analytics
Abstract: Humankind has been collecting data since records began, but in the last decade, with
considerable advances in computing and storage technologies, advances in cloud computing, and the
development of ubiquitous connectivity and the Internet of Things, there has been an explosion in the size
and variety of collected data. Nevertheless, one can be data-rich and knowledge-poor, and this is
where data analytics and the development and application of machine learning models become a
necessity for gaining insight into complex processes, in order to substantiate scientific theories and discoveries, support
decision making and enhance strategic planning in different areas of the economy, finance, industry,
healthcare, etc. Recently, there has been an influx of polymorphic, unstructured and multimodal data (social
media, images, audio, video, etc.), which further complicates data processing and knowledge extraction. But
even traditional structured datasets present problems that need to be addressed and overcome in the early stages of data
pre-processing, feature extraction and feature selection. This is because they usually contain a variety of data formats, e.g.,
categorical, continuous and ordinal, and frequently contain missing data (usually the result of sensor faults, human errors, or collection,
transportation or storage problems). The most popular approaches to dealing with missing data generally fall into three groups:
deletion methods; single imputation methods; and model-based methods [1].
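As a rough illustration of the first two groups, the sketch below contrasts listwise deletion with mean imputation on a tiny table. The data values and the function names (listwise_deletion, mean_imputation) are hypothetical, chosen only for this example; they are not taken from the tutorial material.

```python
from statistics import mean

# Toy records (hypothetical values); None marks a missing entry.
rows = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [4.0, 40.0],
]

def listwise_deletion(data):
    """Deletion method: keep only rows with no missing entries."""
    return [row for row in data if None not in row]

def mean_imputation(data):
    """Single imputation: replace each missing entry with its column mean."""
    n_cols = len(data[0])
    # Column means are computed over the observed entries only.
    col_means = [
        mean(row[j] for row in data if row[j] is not None)
        for j in range(n_cols)
    ]
    return [
        [col_means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in data
    ]

complete_rows = listwise_deletion(rows)   # half of the rows are discarded
imputed_rows = mean_imputation(rows)      # all rows kept, gaps filled
```

Deletion here throws away two of the four rows, while mean imputation keeps every row but shrinks the variance of each imputed column, which is one reason model-based methods are often preferred.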
In this tutorial I will talk about the third group of methods, the 'modern' model-based approaches, which are considered
to be the most popular [1]. In particular, the Multiple Imputation (MI) method will be introduced and discussed, in addition to
K-Nearest Neighbour Imputation (KNN-I) and Bagged Tree Imputation (BTI) [2].
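To make the idea behind KNN-I concrete, here is a minimal sketch using only the Python standard library. It assumes purely numeric features, a Euclidean distance over the columns observed in both rows (rescaled by the share of columns compared), and an unweighted mean of the k nearest donors; these are assumptions of this example, not necessarily the variant used in the case study. The function name knn_impute and the data are hypothetical.

```python
import math

def knn_impute(data, k=2):
    """Fill each None with the mean of that column over the k nearest rows."""
    n_cols = len(data[0])

    def distance(a, b):
        # Compare only columns that are observed in both rows.
        shared = [j for j in range(n_cols)
                  if a[j] is not None and b[j] is not None]
        if not shared:
            return math.inf
        sq = sum((a[j] - b[j]) ** 2 for j in shared)
        # Rescale so rows with fewer shared columns are not favoured.
        return math.sqrt(sq * n_cols / len(shared))

    imputed = [list(row) for row in data]  # copy; the input is left untouched
    for i, row in enumerate(data):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            # Candidate donors: other rows in which column j is observed.
            donors = [
                (distance(row, other), other[j])
                for idx, other in enumerate(data)
                if idx != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # if no usable donor exists, the gap stays None
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed

# Toy data (hypothetical): the second row is missing its second feature.
rows = [
    [1.0, 10.0],
    [2.0, None],
    [3.0, 30.0],
    [9.0, 90.0],
]
filled = knn_impute(rows, k=2)
```

In this toy case the two nearest donors are the first and third rows, so the gap is filled with the mean of 10.0 and 30.0. On real data the features would normally be scaled before distances are computed, and categorical attributes would need a different distance measure.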
Subsequently, MI, KNN-I and BTI will be applied in a case study for pre-processing a large real-world radar signal dataset
(more than 30 000 samples). The dataset comprises intercepted and collected pulse train characteristics, which typically
include signal frequencies, type of modulation, scan period, pulse repetition intervals, etc., and usually consist of a mixture of
continuous, discrete and categorical data, and also frequently include missing values. Missing values are an inherent part of
real-world datasets, and radar datasets are no exception.
Then I will briefly talk about supervised and unsupervised learning and the use of three supervised approaches, Neural
Networks (NN), Random Forests (RF) and Support Vector Machines (SVM), for solving a radar signal classification and source
identification problem [3, 4]. Results from applying NN, RF and SVM (using R and Matlab) to a complete data subset
(without missing data) and to the full dataset, in which the missing data (up to 60%) are substituted using MI, KNN-I and BTI, will be critically
analysed and discussed.
Finally, I’ll talk about the opportunities and challenges in applying computational intelligence and machine learning
techniques to Big Data [5] and the available software for Big Data [6].
[1] J. W. Osborne, Best Practices in Data Cleaning, SAGE, 2013.
[2] D. T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining (2nd Ed.), Wiley, 2014.
[3] I. Jordanov and A. Georgieva, Neural Network Learning with Global Heuristic Optimization, IEEE Trans. on Neural
Networks, vol. 18 (3), pp. 937-942, 2007.
[4] N. Petrov, A. Georgieva and I. Jordanov, Self-Organising Maps for Texture Classification, Neural Computing and
Applications, vol. 22 (7), pp. 1499-1508, 2013.
[5] V. Prajapati, Big Data Analytics with R and Hadoop, Packt Publishing, 2013.
[6] Apache Spark and Scala - http://www.edureka.co/apache-spark-scala-training - retrieved September 2015.
Ivan Jordanov received his PhD degree in Computer Aided Optimization of Dynamic Systems and MSc in Applied
Mathematics and Informatics from the Technical University of Sofia (Bulgaria). He is currently a Reader at the School of
Computing, University of Portsmouth, UK. His research interests include computational intelligence (Machine Learning and
Neural Networks, Big Data and Data Analytics, and Evolutionary Computation) for solving pattern recognition, classification
and optimization problems. He has been a guest editor of the Neural Computing and Applications journal and is currently an
associate editor of three international journals. His publication list includes three co-authored textbooks, three book chapters,
five edited Springer-Verlag books, and more than 80 papers in peer-reviewed journals and conference proceedings.