Machine Learning

In progress: machine learning algorithms and visualisation techniques will be developed in future sections…

Supervised and Unsupervised Learning

In supervised learning, we “teach” a machine learning model using examples whose outcomes are already known (labeled data). Once trained, the model is used to predict outcomes for new, unseen data. Formally, supervised learning is where you have input variables (X) and an output variable (Y), and you train the model to approximate the function that maps the input to the output:

Y = f(X)

Learning stops when the algorithm (the learned function) achieves an acceptable level of performance.
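
A minimal sketch of this idea, assuming scikit-learn and NumPy are installed. The data is synthetic and the "true" mapping f(X) = 3X + 2 is chosen here purely for illustration; a linear regression model is trained to approximate it from noisy examples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, illustrative data: the "unknown" function is f(X) = 3X + 2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))            # input variable X
Y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, 200)    # known outcomes Y

# Training: the model learns an approximation of f from (X, Y) pairs.
model = LinearRegression()
model.fit(X, Y)

# Prediction for an input the model did not see during training.
X_new = np.array([[12.0]])
Y_pred = model.predict(X_new)

print("learned slope:", model.coef_[0])
print("learned intercept:", model.intercept_)
print("prediction for X=12:", Y_pred[0])
```

The learned slope and intercept should land close to 3 and 2; how close counts as "acceptable" depends on the application.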

As an example, we will apply a supervised algorithm to a cancer dataset. This dataset contains historical data describing patients' tumors, with columns such as clump thickness, uniformity of cell size and uniformity of cell shape, plus a Boolean outcome that indicates whether each patient's tumor is malignant or benign. The descriptive columns are called features. Most popular supervised machine learning algorithms:

In unsupervised learning, the model is not supervised: we let it work on its own to discover information. The unsupervised algorithm trains on the dataset and draws conclusions from unlabeled data (the outcomes are unknown). Most popular unsupervised machine learning algorithms:

Compared with supervised learning, unsupervised learning has fewer models and fewer evaluation methods for checking that the outcomes of the model are accurate. For this reason, unsupervised learning offers a less controllable environment.
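
As a rough illustration of learning from unlabeled data, the sketch below clusters synthetic points with k-means, assuming scikit-learn and NumPy are installed. The two blobs of points and the choice of `n_clusters=2` are assumptions made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled data: two blobs of points; no outcome column is given.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# The algorithm groups the points on its own; n_clusters=2 is our assumption.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster sizes:", np.bincount(labels))
print("cluster centers:")
print(kmeans.cluster_centers_)
```

Because no known outcomes are available here, there is no accuracy score to check the clusters against, which is the evaluation difficulty described above.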

Python Libraries for Machine Learning Applications:

from https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/

The four main libraries used for data science computing are NumPy, SciPy, Matplotlib and Pandas.

LIBRARY FEATURES
NumPy (Numerical Python)
  • numerical computing with powerful numerical N-dimensional array object
  • useful linear algebra, Fourier transform, and random number capabilities
  • tools for integrating C/C++ and Fortran code
SciPy (Scientific Python)
  • built on NumPy: uses the multidimensional arrays provided by the NumPy module
  • efficient numerical routines: optimization, regression, interpolation
Matplotlib
  • 2-D visualization, “publication-ready” plots: histograms, line plots, heat plots
  • similar to MATLAB; allows LaTeX commands to add math to your plot
Pandas
  • powerful and flexible open source data analysis
  • for data munging and preparation
  • for structured data operations and manipulations
OTHER LIBRARIES FEATURES
Scikit Learn
  • for machine learning
  • built on NumPy, SciPy and Matplotlib
  • statistical modeling, including classification, regression and clustering
Statsmodels
  • for statistical modeling
  • to explore data
  • to estimate statistical models
Seaborn
  • for making attractive and informative statistical graphics
  • based on matplotlib
Bokeh
  • for creating interactive plots
  • dashboards and data applications on modern web-browsers (like D3.js)
  • high-performance interactivity over very large or streaming datasets
Blaze
  • to extend the capability of NumPy and Pandas to distributed and streaming datasets
  • used to access data from a multitude of sources (Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc.)
  • to create effective visualizations and dashboards on huge chunks of data with Bokeh
Scrapy
  • for web crawling
  • useful framework for getting specific patterns of data
  • to dig through web-pages within a website to gather information
SymPy (symbolic computation)
  • wide-ranging capabilities, from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics
  • the capability of formatting the result of the computations as LaTeX code
Requests for accessing the web; similar to the standard Python library urllib2 (urllib.request in Python 3) but easier to code
os for operating system and file operations
networkx and igraph for graph-based data
regular expressions (the re module) for finding patterns in text data
BeautifulSoup for web scraping; extracts information from just a single webpage per run
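
To give a feel for how some of the core libraries listed above are used, here is a small sketch, assuming NumPy, SciPy and Pandas are installed. The matrix, the function being minimized and the table contents are all made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import optimize

# NumPy: N-dimensional arrays and linear algebra.
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)    # solves A @ x = b

# SciPy: numerical routines built on NumPy (here, a 1-D minimization).
result = optimize.minimize_scalar(lambda t: (t - 3.0) ** 2)

# Pandas: structured data operations on a small, made-up table.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 3.0, 10.0]})
means = df.groupby("group")["value"].mean()

print("solution of A @ x = b:", x)
print("minimizer of (t - 3)^2:", result.x)
print(means)
```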

Scikit-learn provides most of the common classification, regression and clustering algorithms, and it is designed to work with the Python numerical and scientific libraries NumPy and SciPy. The preprocessing package of scikit-learn provides several common utility functions and transformer classes that change raw feature vectors into a form better suited to modeling. When developing a model, you split your dataset into train and test sets, so that you train the model on one set and then test its accuracy separately on the other. Scikit-learn can split arrays or matrices into random train and test subsets for you.
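
The workflow described here (preprocess, split, train, test) might look like the sketch below, assuming scikit-learn is installed. It uses the breast cancer dataset bundled with scikit-learn as a stand-in; that dataset has different feature columns from the cancer dataset described earlier. The 80/20 split, the `random_state` value and the choice of logistic regression are assumptions for illustration, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in dataset bundled with scikit-learn (not the exact dataset described earlier).
X, y = load_breast_cancer(return_X_y=True)

# Random split: 80% for training, 20% held out for testing (ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Preprocessing: fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train on the training set, then measure accuracy on the unseen test set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
accuracy = model.score(X_test_scaled, y_test)
print("test accuracy:", accuracy)
```

Fitting the scaler on the training set alone keeps information from the test set out of training, so the reported accuracy reflects data the model has not seen.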

What is the difference between Machine Learning and Deep Learning