Data science

What is Data Science according to Wikipedia?

“Data science employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualisation.”

Whether a dataset is “Big data” or not, we can use data science to support data-driven decisions and make better choices.

The volume of data collected from connected objects (sensors of all kinds), social networks (LinkedIn, Facebook), smart devices, search engines (Google), scientific research centres (CERN), banks, etc., is colossal. Connected objects are proliferating and will interact with more and more aspects of everyday life. The amount of collected data is increasing exponentially, which has led to the use of ever larger measurement units such as the terabyte, petabyte, and exabyte.

### Measurement Units

| Symbol | Value | Name      |
|--------|-------|-----------|
| kB     | 10^3  | Kilobyte  |
| MB     | 10^6  | Megabyte  |
| GB     | 10^9  | Gigabyte  |
| TB     | 10^12 | Terabyte  |
| PB     | 10^15 | Petabyte  |
| EB     | 10^18 | Exabyte   |
| ZB     | 10^21 | Zettabyte |
| YB     | 10^24 | Yottabyte |
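
To make these magnitudes concrete, here is a minimal Python sketch (the helper name and unit list are our own, for illustration) that converts a raw byte count into a human-readable string using the decimal units above:

```python
# Illustrative helper: convert a raw byte count to a human-readable string.
# The decimal (power-of-ten) units from the table above are assumed.

UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(n_bytes: float) -> str:
    """Express n_bytes using the largest decimal unit below the value."""
    unit = 0
    while n_bytes >= 1000 and unit < len(UNITS) - 1:
        n_bytes /= 1000.0
        unit += 1
    return f"{n_bytes:.2f} {UNITS[unit]}"

print(human_readable(3.5e15))  # -> "3.50 PB"
```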

How to deal with Big data storage and processing

In order to allow larger volumes of data to be stored, research and experimentation efforts in distributed systems began in earnest in the 1970s and continued through the 1990s, with focused interest peaking in the late 1980s. A number of distributed operating systems were introduced during this period; however, very few of these implementations achieved even modest commercial success.

Nevertheless, to perform computations, the necessary data had to be transferred over the network from the data site to the computation site. This implied many redundant downloads and uploads, and data-processing time increased with the number of users and the available network bandwidth. Furthermore, server failures could cause additional difficulties.

The concept of distributed computation emerged to overcome this problem: computations are moved to the data and executed where the data is stored. Only the instructions to be executed and the computation results have to be transferred over the network. The time savings are substantial.

This idea was implemented by Doug Cutting and Mike Cafarella in 2005 as a framework called Hadoop. The Hadoop Distributed File System (HDFS) is one of Hadoop's basic modules and is used for distributed storage.

Hadoop was developed as an open-source software framework for the storage and large-scale processing of datasets on clusters. It can store data volumes of up to several petabytes and offers the flexibility to add or remove servers from a cluster. Server failures are handled by the system: file contents are replicated on multiple DataNodes for reliability.

An important characteristic of Hadoop is the partitioning of data and computation across many hosts (potentially thousands), and the execution of application computations in parallel, close to their data.

MapReduce

MapReduce is a generic framework to handle big data by:

  1. splitting the data into subsets
  2. processing each subset separately on different machines
  3. putting the partial results back together

Hadoop includes an implementation of MapReduce.
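
To make the pattern concrete, here is a minimal single-machine Python sketch of MapReduce applied to the canonical word-count example. It is purely conceptual: a real Hadoop job is written against the Hadoop APIs and runs distributed across a cluster.

```python
from collections import defaultdict
from itertools import chain

# Map: emit (word, 1) pairs for every word in a document.
def map_phase(document):
    for word in document.split():
        yield (word.lower(), 1)

# Shuffle: group values by key, as the framework does between phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: combine the grouped values for each key into a final result.
def reduce_phase(key, values):
    return key, sum(values)

documents = ["big data needs big tools", "data science uses data"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 3, 'needs': 1, ...}
```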

Storage

The most popular database management systems (DBMSs) since the 1980s have all supported the relational model (invented in 1970 by Edgar F. Codd), as represented by the SQL language (Structured Query Language).

Well-known DBMSs include MySQL, PostgreSQL, Microsoft SQL Server, Oracle, Sybase, SAP HANA, and IBM DB2.
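
As a quick illustration of the relational model, here is a minimal sketch using Python's built-in sqlite3 module (the tables and data are invented for the example): two related tables queried with a SQL join.

```python
import sqlite3

# In-memory database; tables and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob')")
conn.execute("INSERT INTO orders VALUES (1, 1, 9.5), (2, 1, 20.0), (3, 2, 5.0)")

# A join: the relational operation that distributed NoSQL systems
# typically avoid (see the NoSQL section below).
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('Alice', 29.5), ('Bob', 5.0)]
```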

These DBMSs are efficient up to data volumes of tens or hundreds of terabytes. Furthermore, a relational database management system is typically hosted on a single server. NoSQL and NewSQL technologies were developed to overcome these limitations.

NoSQL

A distributed database system can no longer afford joins between tables as in a conventional relational DBMS (SQL); the relational model is no longer well adapted to distributed databases. The term NoSQL (meaning “Not Only SQL”) describes this new information-storage paradigm: database and data-management systems that support new, more efficient ways to access data, such as key-value stores, column-family stores, document stores, and graph databases.
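
As a conceptual sketch (plain Python dicts standing in for a real store; no particular NoSQL product's API is implied), key-value and document access retrieves whole records by key instead of joining tables:

```python
# Conceptual sketch: key-value and document access patterns,
# using plain Python dicts to stand in for a NoSQL store.

# Key-value store: an opaque value looked up by its key.
kv_store = {}
kv_store["session:42"] = "user=alice;expires=3600"
print(kv_store["session:42"])

# Document store: the whole record is kept together under one key,
# denormalised so that no join is needed to read it back.
doc_store = {}
doc_store["user:alice"] = {
    "name": "Alice",
    "orders": [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 20.0}],
}
print(sum(o["amount"] for o in doc_store["user:alice"]["orders"]))
```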

NewSQL

NewSQL systems keep the SQL (Structured Query Language) platform but make technical adjustments that give it the flexibility and scalability of NoSQL systems.


Computer Languages and Statistical tools

The explosion of “data” problems requires statistical-learning skills, along with parallel developments in computer science and, in particular, machine learning. Statistical knowledge about the lasso and sparse regression, classification and regression trees, boosting, and support vector machines is important for modelling and understanding complex datasets.
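
For instance, here is a minimal sketch of a sparse (lasso) regression fitted with scikit-learn; the synthetic data and the penalty value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 100 samples, 10 features, only 2 of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# The L1 penalty (alpha) drives most coefficients exactly to zero,
# which is what makes the lasso a sparse regression method.
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # near zero everywhere except features 0 and 1
```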

R and Python are today the most popular languages for processing data.

R is an open-source statistical programming language and environment. The R project was conceived in the 1990s. R is an interpreted language; it is limited to in-memory data processing but offers a wide variety of statistical and graphical techniques, and it is popular for its great visualisation capabilities. Modern environments have extended R to overcome its in-memory limitations, either through libraries (e.g. bigmemory) or by integrating R into a distributed architecture (e.g. RHadoop).

Python is a scripting programming language conceived in the late 1980s. It is great at handling text and unstructured data, and it is becoming more and more popular in data science. Data-science libraries have transformed Python from a general-purpose programming language into a powerful and robust tool for data analysis and visualisation.
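
As a small example of those libraries at work, here is a minimal pandas sketch (the dataset is invented for illustration):

```python
import pandas as pd

# A small, invented dataset of measurements per sensor.
df = pd.DataFrame({
    "sensor": ["a", "a", "b", "b", "b"],
    "value": [1.2, 1.4, 0.7, 0.9, 0.8],
})

# Group, aggregate, and summarise in a few expressive lines.
summary = df.groupby("sensor")["value"].agg(["mean", "max"])
print(summary)
```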