Machine Learning

In progress: machine learning algorithms and visualisation techniques will be developed in future sections…

Popular Machine Learning Techniques:

Regression: used to predict and/or explain the value of a continuous variable such as the price of a house based on its characteristics.
Classification: used to predict the class (the value of a discrete variable) of a case, e.g., whether a cell is benign or malignant.
Clustering: used to group similar cases into the same cluster, e.g., customer segmentation in banking. Clustering is mostly used to discover structure, summarize data, and detect anomalies.
Association: used to find items or events that often co-occur.
Dimension reduction: used to reduce the size of the data by removing or combining redundant features.
Recommendation systems: used to match people with similar tastes and suggest new items to them, such as books or movies, based on their shared preferences.
Density estimation: mostly used to explore and find structures within data.

Supervised and Unsupervised Learning

Supervised learning consists in supervising a machine learning model by "teaching" it with examples whose outcomes are known. Once trained, the model is used to predict outcomes for unseen data. In supervised learning you have input variables (X) and an output variable (Y), and you train the model to approximate the function that maps the input to the output:

Y = f(X)

Learning stops when the algorithm (the function) achieves an acceptable level of performance.
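As a rough illustration of approximating Y = f(X), the sketch below fits a straight line to noisy data with NumPy's least-squares solver. The data is synthetic and invented for this example (true slope 3, true intercept 2), and the mean squared error stands in for the "level of performance" mentioned above.

```python
import numpy as np

# Synthetic data, invented for illustration: Y = 3*X + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
Y = 3.0 * X + 2.0 + rng.normal(scale=0.5, size=100)

# Design matrix: one column for X and a column of ones for the intercept.
A = np.column_stack([X, np.ones_like(X)])

# Least-squares estimate of the parameters of f(X) = slope * X + intercept.
(slope, intercept), *_ = np.linalg.lstsq(A, Y, rcond=None)

# Measure how well the learned function reproduces the known outcomes.
predictions = slope * X + intercept
mse = np.mean((Y - predictions) ** 2)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, mse={mse:.3f}")
```

The estimated slope and intercept should land close to the values used to generate the data, and the error should be close to the variance of the noise.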

As an example, we will use a supervised algorithm on the cancer dataset. This dataset contains historical data describing patients' tumors, such as Clump thickness, Uniformity of cell size, Uniformity of cell shape, etc., and a Boolean outcome parameter that indicates whether each patient's tumor is malignant or benign. The columns that describe the tumors are called features.

Most popular supervised machine learning algorithms:

Regression algorithms: Linear, Polynomial, Logistic, Neural Network, Regression Trees and Random Forests, etc.
Classification algorithms: Logistic Regression, K-Nearest Neighbours, Random Forest, Support Vector Machine, Stochastic Gradient Descent, Naïve Bayes, etc.
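To make one of the classification algorithms listed above more concrete, here is a minimal sketch of K-Nearest Neighbours written in plain NumPy. The function name `knn_predict` and the tiny two-feature dataset are made up for illustration (label 0 stands for benign, 1 for malignant); they are not taken from the cancer dataset itself.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Predict a class for each row of X_new by majority vote of its k nearest neighbours."""
    predictions = []
    for x in X_new:
        # Euclidean distance from the new case to every training case.
        distances = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k closest training cases.
        nearest = np.argsort(distances)[:k]
        # Majority vote among the labels of those neighbours.
        votes = np.bincount(y_train[nearest])
        predictions.append(votes.argmax())
    return np.array(predictions)

# Hypothetical training data: two features per case, labels 0 (benign) or 1 (malignant).
X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [10, 10]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

X_new = np.array([[2, 2], [9, 9]], dtype=float)
predictions = knn_predict(X_train, y_train, X_new, k=3)
print(predictions)  # [0 1]
```

The labels in `y_train` are what makes this supervised: the prediction for a new case is derived entirely from cases whose outcome is already known.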

Unsupervised learning consists in not supervising the model: we let the model work on its own to discover information. The unsupervised algorithm trains on the dataset and draws conclusions from unlabeled data (the outcomes are unknown).

Most popular unsupervised machine learning algorithms:

Clustering algorithms: K-means
Dimension reduction: Principal Component Analysis (PCA), etc. Linear Discriminant Analysis (LDA) also reduces dimensions, but it relies on class labels and is therefore a supervised technique.
Density estimation: Kernel Density Estimation
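The K-means algorithm listed above can be sketched in a few lines of NumPy. This is a simplified version for illustration only: the function name `kmeans`, the random initialisation, and the two synthetic blobs of points are all assumptions made for this example. Note that no labels are passed to the algorithm.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Simplified K-means: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: give each point the label of its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic, unlabeled data: two well-separated blobs invented for illustration.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(centroids)
```

The cluster numbers themselves are arbitrary: the algorithm only discovers that the points fall into two groups, not what those groups mean.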

In comparison to supervised learning, unsupervised learning has fewer models and fewer evaluation methods that can be used to ensure that the outcomes of the model are accurate. For this reason, unsupervised learning creates a less controllable environment.

Python Libraries for Machine Learning Applications:

from https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/

The four main libraries used for data science computing are NumPy, SciPy, Matplotlib and Pandas.

LIBRARY	FEATURES
NumPy (Numerical Python)	numerical computing with a powerful N-dimensional array object; useful linear algebra, Fourier transform, and random number capabilities; tools for integrating C/C++ and Fortran code
SciPy (Scientific Python)	built on NumPy: uses the multidimensional arrays provided by the NumPy module; efficient numerical routines: optimization, regression, interpolation
Matplotlib	2-D visualization; "publication-ready" plots: histograms, line plots, heat plots; similar to MATLAB; allows LaTeX commands to add math to your plots
Pandas	powerful and flexible open source data analysis; for data munging and preparation; for structured data operations and manipulations

OTHER LIBRARIES	FEATURES
Scikit-learn	for machine learning; built on NumPy, SciPy and Matplotlib; tools for statistical modeling and classification
Statsmodels	for statistical modeling; to explore data and estimate statistical models
Seaborn	for statistical data visualization; for making attractive and informative statistical graphics; based on Matplotlib
Bokeh	for creating interactive plots, dashboards and data applications in modern web browsers (like D3.js); high-performance interactivity over very large or streaming datasets
Blaze	to extend the capabilities of NumPy and Pandas to distributed and streaming datasets; used to access data from a multitude of sources (Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc.); together with Bokeh, to create effective visualizations and dashboards on huge chunks of data
Scrapy	for web crawling; a useful framework for getting specific patterns of data; can dig through the pages of a website to gather information
SymPy (symbolic computation)	wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics; can format the results of computations as LaTeX code
Requests	for accessing the web; similar to the standard Python library urllib2 (urllib.request in Python 3) but easier to code
os	for operating system and file operations
networkx and igraph	for graph-based data manipulation
regular expressions	to find patterns in text data
BeautifulSoup	for web scraping; extracts information from just a single web page per run

Scikit-learn provides most of the classification, regression and clustering algorithms, and it is designed to work with the Python numerical and scientific libraries, NumPy and SciPy. The preprocessing package of scikit-learn provides several common utility functions and transformer classes to change raw feature vectors into a form suitable for modeling. When developing a model, you split your dataset into training and test sets: you train the model on the training set, then evaluate its accuracy separately on the test set. Scikit-learn can split arrays or matrices into random train and test subsets for you.
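The workflow described above (split, preprocess, train, evaluate) might look like the following sketch. It assumes scikit-learn is installed and uses its bundled breast cancer dataset (`load_breast_cancer`) as a stand-in; that dataset has different feature columns from the cancer dataset described earlier. The choice of `StandardScaler` and `LogisticRegression`, the 80/20 split, and `random_state=0` are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in dataset bundled with scikit-learn: features X and binary labels y.
X, y = load_breast_cancer(return_X_y=True)

# Split into random train and test subsets (80% / 20%, an arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Preprocessing: learn the scaling on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train on the training set, then evaluate accuracy separately on the test set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
accuracy = model.score(X_test_scaled, y_test)

print(f"Test accuracy: {accuracy:.3f}")
```

Fitting the scaler on the training set alone keeps information from the test set out of the training process, so the reported accuracy reflects data the model has not seen.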

What is the difference between Machine Learning and Deep Learning?

Machine Learning covers the statistical part of artificial intelligence. From Wikipedia: “Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers.” “The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning”. “In its application across business problems, machine learning is also referred to as predictive analytics.”
Deep Learning “is part of a broader family of machine learning methods based on artificial neural networks with representation learning.” “Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering”, etc.