In progress: machine learning algorithms and visualisation techniques will be developed in future sections…
In supervised learning, we "teach" a machine learning model using data whose outcomes are already known. Once trained, the model is used to predict outcomes for unseen data. Formally, there are input variables (X) and an output variable (Y), and the model is trained to approximate the function f that maps the input to the output:
Y = f(X)
Learning stops when the algorithm (the function) achieves an acceptable level of performance.
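The idea of approximating Y = f(X) and stopping at an acceptable level of performance can be sketched in plain Python. This is only an illustration under stated assumptions: the data values, the linear form of f, the gradient descent update, and the names `learning_rate`, `tolerance` and `max_steps` are all chosen for the example and are not taken from a particular library.

```python
# Known inputs X and known outcomes Y (made-up values that follow Y = 2X + 1).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [3.0, 5.0, 7.0, 9.0, 11.0]


def mean_squared_error(w, b):
    """Average squared gap between the model's predictions and the known outcomes."""
    return sum((w * x + b - y) ** 2 for x, y in zip(X, Y)) / len(X)


# Assumed model: f(x) = w * x + b, starting from an uninformed guess.
w, b = 0.0, 0.0
learning_rate = 0.05   # step size (assumption)
tolerance = 1e-6       # "acceptable level of performance" (assumption)
max_steps = 10_000     # safety limit so the loop always terminates

steps = 0
while mean_squared_error(w, b) > tolerance and steps < max_steps:
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(X, Y)) / len(X)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(X, Y)) / len(X)
    # Move the parameters a small step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    steps += 1


def f(x):
    """The learned approximation of the mapping from input to output."""
    return w * x + b


print(f"f(x) = {w:.3f} * x + {b:.3f} after {steps} steps")
print(f"prediction for unseen x = 6: {f(6.0):.3f}")
```

The loop ends as soon as the error drops below `tolerance`. After that, `f` can be applied to inputs that were not part of the training data.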
As an example, we will apply a supervised algorithm to the cancer dataset. This dataset contains historical data describing patients' tumors, such as clump thickness, uniformity of cell size, uniformity of cell shape, etc., and a Boolean outcome parameter that indicates whether each patient's tumor is malignant or benign. The columns are called features.

Most popular supervised machine learning algorithms:
In unsupervised learning, the model is not supervised: we let it work on its own to discover information. An unsupervised algorithm trains on the dataset and draws conclusions from unlabeled data (the outcomes are unknown).

Most popular unsupervised machine learning algorithms:
Compared to supervised learning, unsupervised learning has fewer models and fewer evaluation methods for checking that the model's outcomes are accurate. For this reason, unsupervised learning offers a less controllable environment.
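To make "drawing conclusions from unlabeled data" more concrete, here is a minimal k-means clustering sketch in plain Python. The points, the choice of two clusters, and the initial centroids are assumptions made for the illustration. K-means is used here only as one example of an unsupervised algorithm.

```python
# Unlabeled 2-D points (made-up values that form two visible groups).
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def mean_point(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(group) for coords in zip(*group))


def k_means(points, centroids, max_iterations=100):
    """Alternate between assigning points and moving centroids until stable."""
    for _ in range(max_iterations):
        assignments = [nearest(p, centroids) for p in points]
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            # Keep the old centroid if no point was assigned to it.
            new_centroids.append(mean_point(members) if members else old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignments, centroids


# Initial centroids: the first and last point (a simple deterministic choice).
assignments, centroids = k_means(points, [points[0], points[-1]])
print(assignments)  # → [0, 0, 0, 1, 1, 1]
print(centroids)
```

No outcome column is ever given to the algorithm: the two groups are discovered from the distances between points alone.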
from https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
The four main libraries used for data science computing are NumPy, SciPy, Matplotlib and Pandas.
LIBRARY | FEATURES |
---|---|
NumPy (Numerical Python) |
|
SciPy (Scientific Python) |
|
Matplotlib |
|
Pandas |
|
OTHER LIBRARIES | FEATURES |
---|---|
Scikit Learn |
|
Statsmodels |
|
Seaborn |
|
Bokeh |
|
Blaze |
|
Scrapy |
|
SymPy (symbolic computation) |
|
Requests | for accessing the web; similar to the standard Python library urllib (urllib2 in Python 2) but easier to code |
os | for operating system and file operations |
networkx and igraph | for graphs |
regular expressions | for finding patterns in text data |
BeautifulSoup | for scraping the web; extracts information from a single webpage per run |
Scikit-learn provides most of the common classification, regression and clustering algorithms, and it is designed to work with the Python numerical and scientific libraries NumPy and SciPy. Its preprocessing package provides several common utility functions and transformer classes that change raw feature vectors into a form suitable for modeling. When developing a model, you have to split your dataset into train and test sets so that you can train the model on one and then test its accuracy separately on the other. Scikit-learn can split arrays or matrices into random train and test subsets for you.
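The split and preprocessing steps described above could look like the following sketch. The data is synthetic and generated only for the illustration. The 80/20 split, the `random_state` value and the choice of `StandardScaler` among the available transformers are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 100 samples, 3 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = (X[:, 0] > 5.0).astype(int)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Fit the scaler on the training rows only, then apply it to both subsets,
# so that no information from the test set leaks into preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # → (80, 3) (20, 3)
```

After scaling, each training feature has zero mean and unit variance. The test set is transformed with the same parameters, which keeps the later accuracy check independent of the training step.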
Machine Learning covers the statistical part of artificial intelligence. From Wikipedia: “Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers.” “The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning”. “In its application across business problems, machine learning is also referred to as predictive analytics.”
Deep Learning “is part of a broader family of machine learning methods based on artificial neural networks with representation learning.” “Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering”, etc.