Ivan Jordanov
Computational Intelligence for Data Analytics 

Abstract: Humankind has been collecting data since records began, but in the last decade considerable advances in computing and storage technologies, the rise of cloud computing, ubiquitous connectivity and the Internet of Things have led to an explosion in the size and variety of collected data. Nevertheless, one can be data-rich and knowledge-poor, and this is where data analytics and the development and application of machine learning models become a necessity for gaining insight into complex processes, in order to prove scientific theories and discoveries, support decision making and enhance strategic planning in different areas of the economy, finance, industry, healthcare, etc. Recently, there has been an influx of polymorphic, unstructured and multimodal data (social media, images, audio, video, etc.), which further complicates data processing and knowledge extraction. But even traditional structured datasets present problems that need to be addressed and overcome in the early stages of data pre-processing, feature extraction and feature selection. This is because they usually contain a variety of data formats (e.g., categorical, continuous, ordinal) and frequently have missing data (usually the result of sensor faults, human errors, or collection, transportation or storage problems). The most popular approaches to dealing with missing data generally fall into three groups: deletion methods, single imputation methods, and model-based methods [1].
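
As a minimal illustration of the first two groups, the following Python sketch contrasts listwise deletion with mean imputation on a small table in which `None` marks a missing value. The toy records and the function names (`listwise_delete`, `mean_impute`) are assumptions made for this example and are not taken from the case study.

```python
from statistics import mean

# Hypothetical toy records: (frequency, pulse interval); None marks a missing value.
rows = [
    [9.1, 1.2],
    [9.4, None],
    [None, 1.0],
    [8.9, 1.4],
]

def listwise_delete(data):
    """Deletion method: keep only the rows that have no missing values."""
    return [row for row in data if None not in row]

def mean_impute(data):
    """Single imputation: replace each missing value with its column mean."""
    n_cols = len(data[0])
    col_means = [
        mean(row[j] for row in data if row[j] is not None)
        for j in range(n_cols)
    ]
    return [
        [col_means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in data
    ]

complete_rows = listwise_delete(rows)
imputed_rows = mean_impute(rows)
print(len(complete_rows), "rows survive deletion")
print(imputed_rows)
```

In this toy table, deletion discards half of the records, while mean imputation keeps every record but reduces the variance of the imputed columns.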

In this tutorial I will talk about the third group of methods, which are considered the most popular ‘modern’ model-based approaches [1]. In particular, the Multiple Imputation (MI) method will be introduced and discussed, in addition to K-Nearest Neighbour Imputation (KNN-I) and Bagged Tree Imputation (BTI) [2].
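
The idea behind KNN-I can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in the tutorial: donors are restricted to complete rows, distances are Euclidean over the features observed in the incomplete row, and the imputed value is the unweighted mean of the k nearest donors. The function name `knn_impute` and the toy records are hypothetical.

```python
import math

def knn_impute(data, k=2):
    """Fill each missing value (None) with the mean of that feature over the
    k complete rows closest to the incomplete row on its observed features.
    Assumes that at least one complete row exists."""
    donors = [row for row in data if None not in row]
    result = []
    for row in data:
        if None not in row:
            result.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        # Euclidean distance computed over the observed features only.
        def distance(donor):
            return math.sqrt(sum((row[j] - donor[j]) ** 2 for j in observed))

        nearest = sorted(donors, key=distance)[:k]
        filled = [
            sum(d[j] for d in nearest) / len(nearest) if v is None else v
            for j, v in enumerate(row)
        ]
        result.append(filled)
    return result

# Hypothetical records: (frequency, pulse repetition interval, scan period).
data = [
    [1.0, 10.0, 100.0],
    [1.1, 11.0, 110.0],
    [5.0, 50.0, 500.0],
    [5.2, 52.0, 520.0],
    [1.05, None, 105.0],
]
imputed = knn_impute(data, k=2)
print(imputed[-1])
```

In practice the features would normally be scaled before distances are computed, since otherwise the feature with the largest numeric range dominates the choice of neighbours.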

Subsequently, MI, KNN-I and BTI will be applied in a case study on pre-processing a large real-world radar signal dataset (more than 30 000 samples). The dataset comprises intercepted and collected pulse train characteristics, which typically include signal frequencies, type of modulation, scan period, pulse repetition intervals, etc., and usually consist of a mixture of continuous, discrete and categorical data, frequently with missing values. Missing values are an inherent part of real-world datasets, and radar datasets are no exception.

I will then briefly talk about supervised and unsupervised learning and the use of three supervised approaches, Neural Networks (NN), Random Forests (RF) and Support Vector Machines (SVM), for solving a radar signal classification and source identification problem [3, 4]. Results from applying NN, RF and SVM (using R and Matlab) to a complete data subset (without missing data) and to the full dataset, with missing data (up to 60%) substituted using MI, KNN-I and BTI, will be critically analysed and discussed.
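
The shape of such a comparison can be sketched as follows. This is an illustrative harness only, with several stand-ins that should be read as assumptions: a leave-one-out 1-nearest-neighbour classifier replaces NN, RF and SVM; mean imputation replaces MI, KNN-I and BTI; the two-class data are synthetic rather than radar signals; and values are removed completely at random at rates of 20%, 40% and 60%.

```python
import random

def mask_values(data, rate, rng):
    """Return a copy of data with each value set to None with probability rate."""
    return [[None if rng.random() < rate else v for v in row] for row in data]

def mean_impute(data):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(data[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in data if row[j] is not None]
        # Fall back to 0.0 if a column happens to be entirely missing.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in data]

def loo_accuracy(data, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, row in enumerate(data):
        best = min(
            (k for k in range(len(data)) if k != i),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(row, data[k])),
        )
        correct += labels[best] == labels[i]
    return correct / len(data)

# Synthetic two-class data: class 0 near (1, 1, 1), class 1 near (10, 10, 10).
features = [[1 + 0.1 * i, 1 + 0.2 * i, 1 - 0.1 * i] for i in range(10)]
features += [[10 + 0.1 * i, 10 - 0.2 * i, 10 + 0.1 * i] for i in range(10)]
labels = [0] * 10 + [1] * 10

rng = random.Random(0)  # fixed seed so that the run is repeatable
baseline = loo_accuracy(features, labels)
results = {}
for rate in (0.2, 0.4, 0.6):
    imputed = mean_impute(mask_values(features, rate, rng))
    results[rate] = loo_accuracy(imputed, labels)

print("complete data:", baseline)
print("after imputation:", results)
```

Swapping in a different imputation function or classifier only requires replacing `mean_impute` or `loo_accuracy`, which makes it straightforward to tabulate accuracy against the missing-data rate for each combination.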

Finally, I will talk about the opportunities and challenges in applying computational intelligence and machine learning techniques to Big Data [5] and about the software available for Big Data [6].

References

[1] J. W. Osborne, Best Practices in Data Cleaning, SAGE, 2013.

[2] D. T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining (2nd Ed.), Wiley, 2014.

[3] I. Jordanov and A. Georgieva, Neural Network Learning with Global Heuristic Optimization, IEEE Trans. on Neural Networks, vol. 18 (3), pp. 937-942, 2007.

[4] N. Petrov, A. Georgieva and I. Jordanov, Self-Organising Maps for Texture Classification, Neural Computing and Applications, vol. 22 (7), pp. 1499-1508, 2013.

[5] V. Prajapati, Big Data Analytics with R and Hadoop, Packt Publishing, 2013.

[6] Apache Spark and Scala - http://www.edureka.co/apache-spark-scala-training - retrieved September 2015.

Ivan Jordanov received his PhD degree in Computer Aided Optimization of Dynamic Systems and his MSc in Applied Mathematics and Informatics from the Technical University of Sofia (Bulgaria). He is currently a Reader at the School of Computing, University of Portsmouth, UK. His research interests include computational intelligence (Machine Learning and Neural Networks, Big Data and Data Analytics, and Evolutionary Computation) for solving pattern recognition, classification and optimization problems. He has been a guest editor of the Neural Computing and Applications journal and is currently an associate editor of three international journals. His publications include three co-authored textbooks, three book chapters, five edited Springer-Verlag books, and more than 80 papers in peer-reviewed journals and conference proceedings.