"I don't know what you mean by 'glory,' " Alice said.
Humpty Dumpty smiled contemptuously. "Of course you don't—till I tell you. I meant 'there's a nice knock-down argument for you!' "
"But 'glory' doesn't mean 'a nice knock-down argument'," Alice objected.
"When I use a word," Humpty Dumpty said, in rather a scornful tone, "it means just what I choose it to mean—neither more nor less."
"The question is," said Alice, "whether you can make words mean so many different things."
"The question is," said Humpty Dumpty, "which is to be master—that's all."
I don't think Lewis Carroll had 'Big Data' or 'Machine Learning' in mind when he penned these words, but I think the quote is quite apt in this context. All too often these buzzwords fall foul of the Humpty Dumpty principle: they mean just what the speaker chooses them to mean, regardless of what the words mean to anyone else. So what do these terms actually mean?
Machine Learning is the field of study that investigates algorithms which give computers the ability to learn without being explicitly programmed.
What do we mean by 'learn' in this context? The definition used by Machine Learning practitioners, originally stated by Tom Mitchell, is:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
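This definition can be made concrete with a toy sketch (my own invented example, not from the quote above): the task T is classifying integers as 'large' (at least 10) or 'small', the performance measure P is accuracy on a fixed test set, and the experience E is a stream of labelled examples. As more of the stream is seen, accuracy improves.

```python
def train(examples):
    """Learn a decision threshold: the midpoint between the largest
    'small' example and the smallest 'large' example seen so far."""
    largest_small = max(x for x, label in examples if label == 0)
    smallest_large = min(x for x, label in examples if label == 1)
    return (largest_small + smallest_large) / 2

def accuracy(threshold, test_set):
    """Performance measure P: fraction of the test set classified correctly."""
    correct = sum((x >= threshold) == label for x, label in test_set)
    return correct / len(test_set)

# Task T: decide whether x is 'large' (true rule: x >= 10).
test_set = [(x, x >= 10) for x in range(20)]

# Experience E: labelled examples arriving over time.
stream = [(2, 0), (19, 1), (9, 0), (10, 1)]

p_after_2 = accuracy(train(stream[:2]), test_set)  # threshold 10.5, one error
p_after_4 = accuracy(train(stream), test_set)      # threshold 9.5, no errors

print(p_after_2, p_after_4)  # performance P improves with experience E
```

The program is never told the rule "x >= 10"; it infers a threshold from the examples, and its measured performance improves as experience accumulates, which is exactly what the definition requires.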
So what problems can Machine Learning algorithms be applied to? The main advances in machine learning have been in the following areas:
- Classification – classifying data items into groups based on a training set. For example, given a set of emails labelled as spam or non-spam, a machine learning algorithm can learn to classify new emails as spam or non-spam.
- Cluster Analysis – identifying groups of similar items in data sets without being explicitly told what to look for.
- Computer vision – teaching computers to understand what they are seeing from visual inputs.
- Object recognition – a combination of the above three areas, in which a computer is able to correctly recognise objects from visual inputs.
- Natural Language Processing – being able to correctly interpret natural languages.
- Search Engines – taking human input to a search engine and suggesting appropriate results.
- Speech and handwriting recognition – translating speech and handwriting into written text.
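The spam example in the first bullet can be sketched in a few lines (a deliberately crude word-counting classifier on invented toy emails, not a production technique): count how often each word appears in labelled spam and non-spam training messages, then label a new email by which class its words have been seen in more often.

```python
from collections import Counter

# Invented toy training data: emails already labelled by a human.
spam = ["win money now", "free money offer", "claim your free prize"]
ham = ["meeting at noon", "lunch tomorrow", "project update attached"]

# 'Training': count word frequencies in each class.
spam_words = Counter(w for msg in spam for w in msg.split())
ham_words = Counter(w for msg in ham for w in msg.split())

def classify(email):
    """Label an email 'spam' if its words were seen more often in the
    spam training messages than in the non-spam ones (ties -> 'ham')."""
    spam_score = sum(spam_words[w] for w in email.split())
    ham_score = sum(ham_words[w] for w in email.split())
    return "spam" if spam_score > ham_score else "ham"

print(classify("free money"))        # -> spam
print(classify("meeting tomorrow"))  # -> ham
```

Nobody wrote an explicit rule saying "free" and "money" indicate spam; that knowledge was extracted from the labelled training set, which is the essence of the classification task.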
A trait shared by all these problems is that computers were previously thought incapable of tackling them. This is one reason why Machine Learning is such an exciting and growing field of study.
If you'd like to know more about Machine Learning then Andrew Ng at Stanford University has released a really good free online course through Coursera which can be accessed through the following link:
Big Data can be defined as data which conforms to the 3Vs: it is available at a higher volume, higher velocity (the rate at which data is generated) and/or greater variety than normal data sources.
So, for example, at an insurance company, claims data would not count as Big Data: the volume is fairly low, the velocity is slow, and there is little variety in the data.
The browsing patterns of users on an aggregator website, on the other hand, would count as Big Data. For example, the amount of time someone spends on Comparethemarket.com, what they click on, what they search for, how many searches they make, and how often they return to the website before making a purchase would all count. There would be a massive volume of data to analyse, and the data would be available in real time. (It wouldn't meet the variety criterion, but that's not a necessary condition.)
Due to the need to extract useful information from Big Data, and the difficulties created by the 3Vs, we cannot rely on traditional methods of data analysis. Given the volume and velocity of Big Data, we require methods of analysis that do not need to be programmed explicitly; this is where Machine Learning fits in. Machine Learning, in the guise of speech and handwriting recognition, can also be important if the data generated is in audio form but needs to be combined with other data.
Data Mining is a catch-all term for the process of analysing and summarising data into useful information. The data may be in the form of Big Data, and the methods used may be based on Machine Learning (where the algorithm learns from the data) or may be more traditional.
Data Visualisation is the process of creating visual graphics that aid in understanding and exploring data. It has become increasingly important for two reasons: firstly, the rise in the volume of data sets means that new methods are required to understand data; secondly, an increase in computing power means that more advanced visualisation techniques are now possible.
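As a minimal illustration of the idea (on invented toy data, and nothing so advanced as the techniques mentioned above), even a crude text histogram can reveal structure, here a skewed distribution, that a raw list of numbers hides:

```python
from collections import Counter

# Invented toy data: most values are small, with a long tail of larger ones.
data = [1, 1, 1, 2, 2, 2, 2, 3, 3, 5, 8, 9]

# Count occurrences of each value, then draw a bar of '#'s per value.
counts = Counter(data)
for value in sorted(counts):
    print(f"{value:>2} | {'#' * counts[value]}")
```

Scanning the bars makes the skew obvious at a glance, which is precisely the "aid in understanding" that visualisation provides, whatever the sophistication of the technique.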
Data Science is a broad term encompassing the processes that aim to extract knowledge or insight from data. Data Science therefore includes all of the fields above.
For example, in carrying out an analysis, we first collect our data, which may or may not be Big Data; we then mine the data, possibly using Machine Learning; and finally we present our results through Data Visualisation.