What we’re talking about when we talk about artificial intelligence
There is this thing that people talk a lot about called artificial intelligence or AI. It is many things to many people. To some, it represents families of algorithms that “learn” to recognize patterns given enough data. Others will ask you to be extremely specific about the context because to them it is the elusive technology that we always advance towards but never arrive at. In the movies, AI is sometimes embodied by a calculating humanoid robot that could never understand humanity and doesn’t care to anyway. In other media, it is the helpful sidekick that knows how you best like your coffee. It can answer your question before you know what you’re asking and you can fall in love with it because it knows you better than any human has ever known you.
You’ve probably heard or seen many of these definitions, and you’ve probably heard them a lot. The best and brightest in our culture have their own opinions on the subject and have even brought it to the attention of the United Nations. And every day it seems like there is an article on how to constrain it, how to keep it in a box. So here’s a safe prediction: Discussion of artificial intelligence is not going away.
For these reasons, we’ll define various forms of artificial intelligence and the terms associated with it. The intent here is to clarify. Machine Love US will continue to put out media referencing these terms; we are going to be using all of this language, and we want to be clear about what we mean. It will help our discussion and, hopefully, it will help you. In the end, we’ll have a better chance of understanding one another.
Let’s go from the broadest thing to the most specific.
The most transformative: General (Strong) Artificial Intelligence
General (sometimes referred to as Strong) AI is what our culture thinks about when someone says “artificial intelligence.” For many, it is considered an inevitability. General AI might be conscious, but it doesn’t have to be. We can imagine a very smart thinking machine that is not conscious. If you are confused or intrigued by that last sentence, you might be interested in podcasts by Sam Harris.
A generally intelligent system will learn from its surroundings and respond to new inputs. It can adapt and improve itself. Given new data in new contexts, it can make some sense of the situation and optimize towards whatever goal it was programmed (or chose) to pursue.
Here’s the thing: this is hard to do, and we’re not really sure we can pull it off. It’s difficult because intelligence is (probably) more than data manipulation. The biggest challenges are those where no data exists. In the natural world, these problems are resolved by mechanisms that have not been identified, let alone understood.
If we are going to create a general artificial intelligence, it is likely to be developed by moving through the arc of intelligence outlined in the graphic above. It could start with an intelligence close to that of a bacterium or, if we’re very good, some sort of insect. Moving along the arc, it is likely to pass through the intelligence of a rodent, then a primate, then a human, and then beyond.
Nick Bostrom’s recent book Superintelligence deals with this discussion directly, and it is worth reading if you are at all interested in how a general artificial intelligence might be invented and how it could take off. His prevailing point is that if intelligence is like a railroad track, human intelligence is just one stop along it. Now, think of general artificial intelligence as a train traveling along that track. It is likely to spend a very minimal amount of time at our station before speeding on into the distance.
This technology may not arise for many years, although many researchers agree that we will eventually create it. It is difficult to describe how transformational a general artificial intelligence could be to society. It is very exciting to think about…and a bit scary too.
What most people mean: Narrow (Weak) Artificial Intelligence
Narrow artificial intelligence already exists and is found in many applications. It distinguishes cat pictures from dog pictures, prices goods, helps drive cars, identifies cancers, and matches you with potential partners. Apple has released face recognition software on its phones to unlock them. Businesses and researchers will continue rushing into this space looking for ripe opportunities because they are finding them.
It is the role of the data scientist to build these systems. To do this, a data scientist will employ machine learning.
The practice and mechanics: Machine Learning
A field of study that gives computers the ability to learn without being explicitly programmed.
– Arthur Samuel
Machine learning is how we build artificially intelligent systems. The fuel for these systems is data, and the traditional approach is to use labeled data to conduct supervised learning. With supervised learning, we know the answer to the question we are asking. We provide this data to various algorithms (depending on the problem, we’ll choose a different tool), and if we supply enough examples, and if there really is something that differentiates a cat from a dog, for example, these algorithms will start predicting the results to varying levels of precision.
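To make this concrete, here is a minimal sketch of supervised learning using scikit-learn. The data is entirely made up for illustration (two invented measurements per animal and a “cat” or “dog” label), but the fit-then-predict pattern is the same one used on real problems:

```python
# Minimal supervised learning sketch with scikit-learn.
# The "data" is invented purely for illustration.
from sklearn.ensemble import RandomForestClassifier

# Labeled examples: [ear_length_cm, weight_kg] -> "cat" or "dog"
X = [[6.0, 4.0], [5.5, 3.5], [7.0, 5.0],        # cats
     [12.0, 20.0], [10.0, 25.0], [11.0, 30.0]]  # dogs
y = ["cat", "cat", "cat", "dog", "dog", "dog"]

# Fit an algorithm on the labeled data (this is the "supervision").
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)

# Ask the model to predict a label for a new, unlabeled example.
print(model.predict([[6.5, 4.2]]))  # most likely ['cat']
```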
I recommend this book to software developers interested in learning the mechanics of machine learning. Even if you don’t have software experience but are interested in the ‘doing’ of machine learning, it is worth your time.
The practice of machine learning involves testing and tuning various algorithms in order to determine their effectiveness at predicting the target variable (which might be dollars if you’re predicting income, or a yes/no if you’re predicting whether someone has diabetes). The data scientist will have prepared a test set that has not been used to build the algorithm, so they can judge how accurate it is at predicting its target. To increase the algorithm’s efficacy, the data scientist might tune hyperparameters, adjust the learning rate, or otherwise experiment with feature engineering.
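Here is a rough sketch of that workflow, using one of scikit-learn’s bundled datasets as a stand-in for a real problem. The specific model and settings are arbitrary; the point is the held-out test set:

```python
# Evaluate a model on data it never saw during training.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Hold back a test set that the model never sees while training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit on the training set only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Judge accuracy on the unseen test set.
predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))
```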
Blogger, podcast host, and animator CGP Grey has animated how machine learning works at a high level. It’s clever and cute, and I encourage you to watch it.
Recent advances in machine learning: Deep Learning
Deep learning is a subset of machine learning. The fundamentals of deep learning have existed since the 1950s, though some trace them back even earlier. Building models with deep learning involves artificial neural networks, a category of algorithm very loosely inspired by how the human brain works.
Francois Chollet is the creator of Keras, a deep learning framework that is growing in popularity. He also has a fantastic blog that I recommend you follow if you are at all interested in deep learning. Keras should not be taken lightly.
If you are not going to be implementing these algorithms anytime soon it might be enough to know that applied deep learning is exciting the data science field. If you want to go a bit further you should do more reading on the perceptron, backpropagation, and activation functions.
The perceptron is the building block of deep neural networks, which use activation functions to transform the signals passed between layers, a loss function to evaluate how well they are doing, and backpropagation to fine-tune their predictive power.
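If you’d like to see what that looks like in code, here is a tiny sketch in Keras (assuming TensorFlow is installed). The data is random noise generated on the spot, purely to show the mechanics of stacking layers of perceptron-like units with activation functions and training them with backpropagation:

```python
# A tiny neural network in Keras, trained on made-up data.
import numpy as np
from tensorflow import keras

# Invented data: 200 examples with 4 features and a binary target.
X = np.random.rand(200, 4)
y = (X.sum(axis=1) > 2).astype(int)

model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),     # hidden layer of perceptron-like units
    keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit() runs backpropagation to adjust the weights.
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[:3], verbose=0))
```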
Rapid-fire section of terms
OK now let’s do a rapid-fire term round. This is meant to be helpful to those who hear some of these terms thrown around the office or at parties. This is not meant to be comprehensive but to give an overview of the language data scientists use to ply their trade.
Accuracy: Within supervised learning this is a measure of how well the predictions made on the test dataset compare with reality. So if you have 100 pictures of cats and dogs in your test set and you correctly label 75 of them then your score is 75%.
Agglomerative clustering: Refers to a collection of clustering algorithms all building on the same principle, whereby each point in a dataset starts as its own cluster and the closest clusters are then merged together, step by step.
Anscombe’s Quartet: This can be a neat discussion item at a party. The quartet consists of four datasets that look very different when plotted. However, the traditional measures used to describe datasets (mean, variance, correlation coefficient, and even a regression line) are all the same! This demonstrates that descriptive statistics should not be the only measures used when investigating datasets. It also shows the power of visualization.
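You can reproduce the punchline yourself; seaborn ships a loader for the quartet (it fetches a small file the first time you call it):

```python
# The four datasets of Anscombe's Quartet share nearly identical statistics.
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y
for name, group in df.groupby("dataset"):
    print(name,
          "mean_x=%.2f" % group.x.mean(),
          "mean_y=%.2f" % group.y.mean(),
          "corr=%.3f" % group.x.corr(group.y))
# All four rows print nearly identical numbers, yet the scatter plots differ wildly.
```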
Area under the curve (AUC): The area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate. This number is commonly used to evaluate classifiers.
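For example, scikit-learn will compute it from true labels and predicted probabilities (the numbers here are invented):

```python
# AUC from made-up labels and predicted probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_scores = [0.1, 0.7, 0.8, 0.65, 0.9, 0.3]  # invented model outputs
print(roc_auc_score(y_true, y_scores))  # about 0.89; 1.0 is perfect, 0.5 is chance
```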
Bag of Words: A representation of text data for use in machine learning. Characteristics of text like paragraphs and chapter headings are discarded in favor of looking at individual words themselves.
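A quick sketch of what that looks like with scikit-learn (the two “documents” are made up):

```python
# Bag of words: text becomes a matrix of word counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary
print(counts.toarray())                    # one row of word counts per document
```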
Bayes’ Theorem: A critical statistical formula. Bayesian methods work well even given very small amounts of data, and they are often very intuitive. The difficulty with Bayes’ Theorem is that you need to identify a prior.
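A toy worked example, with numbers invented purely for illustration: suppose 1% of email is spam (the prior), a filter flags 90% of spam, and it also flags 5% of legitimate mail. Bayes’ Theorem tells us the chance an email is actually spam given that it was flagged:

```python
# P(spam | flagged) = P(flagged | spam) * P(spam) / P(flagged)
p_spam = 0.01             # prior: 1% of email is spam (made-up number)
p_flag_given_spam = 0.90  # made-up
p_flag_given_ham = 0.05   # made-up

p_flag = p_flag_given_spam * p_spam + p_flag_given_ham * (1 - p_spam)
p_spam_given_flag = p_flag_given_spam * p_spam / p_flag
print(round(p_spam_given_flag, 3))  # about 0.154 -- lower than most people guess
```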
Bias-Variance Tradeoff: The dilemma whereby the data scientist needs to evaluate and trade off between a model’s bias and its variance. Models with low bias are usually more complex (resulting in high variance), whereas simpler models typically do not capture the complexity of the underlying relationship (high bias).
Brier Score: A statistical score measuring the accuracy of probabilistic predictions.
Chain Rule: A rule of calculus. Learning the chain rule will help in your understanding of the fundamentals of machine learning.
Chi-Squared: A statistical test for whether two categorical variables are related, useful when a variable has more than two groups. You would use this test to look at whether results in your experiment differ between different age groups, for example.
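As a sketch, scipy will run the test on a table of counts (the counts below are invented):

```python
# Chi-squared test: do conversion rates differ across age groups?
from scipy.stats import chi2_contingency

#        converted, did not convert
table = [[30, 70],   # ages 18-34 (invented counts)
         [45, 55],   # ages 35-54
         [25, 75]]   # ages 55+
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)  # a small p-value suggests the groups really do differ
```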
Classification: A common task for machine learning models. Think of this as telling cats from dogs (which was very difficult for a long time!)
Curse of dimensionality: Refers to the trouble encountered when modeling with a significant number of sparse features. When modeling, it’s important for the model to observe at least a few examples of each value of a dimension. Basic example: you are attempting to estimate income using a person’s height and the typical color of shirt they wear. You may have very few individuals who are over six feet tall and like red shirts. This is problematic when trying to predict the income of the next individual you encounter with these characteristics.
Derivative: The slope of a curve at any given point. You really should understand derivatives when you begin learning machine learning.
Eigenvector: A fundamental concept within linear algebra: a vector that a given linear transformation merely stretches or shrinks rather than changes in direction.
Ensemble: Shorthand for ensemble models, a single model that is composed of a group of models. Interestingly, ensembles have often been shown to perform better than their individual component models.
F1 score: A measure of overall model fitness for classification, calculated as the harmonic mean of precision and recall.
Feature importance: How valuable a given variable is in your predictive model. Let’s say you are trying to predict how much a particular person on the street makes in income, and you know how many years of education they have and also how tall they are in feet. Feature importance would tell you which of these two items of information is more predictive in explaining how much they make. In this particular example, it may be less obvious which of the two features is more important, sadly.
Gradient Descent: An algorithm used to solve optimization problems. It does this essentially through trial and error, by taking steps in a given direction within a model (or function) and re-evaluating its location in the function after each step. By way of an example, think about how you might go about climbing down a hill while blindfolded.
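Here is a toy sketch of that blindfolded hill climb, minimizing a one-variable function whose derivative we know:

```python
# Gradient descent on f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
x = 0.0               # arbitrary starting point
learning_rate = 0.1   # size of each step (see "learning rate" below)

for _ in range(100):
    gradient = 2 * (x - 3)            # which way is downhill, and how steep
    x = x - learning_rate * gradient  # take a step downhill

print(x)  # converges toward 3, the minimum of the function
```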
Grid search: A process whereby different settings of hyperparameters are searched to find the combination that optimizes the model.
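A sketch with scikit-learn, where the grid of settings and the dataset are chosen arbitrarily for illustration:

```python
# Grid search: try every combination of the listed hyperparameter settings
# and keep the one that scores best in cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [2, 4, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```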
Heteroskedasticity: Means that the variance of the error in the model is not constant. This is evidence of a volatile model.
Hyperparameter: A particular setting on a model. These are tuned during the model building phase to optimize particular attributes of the model.
Imputation: Filling in values for data points that were not actually observed, typically by estimating them from the data that was. This is to be avoided as much as possible.
Learning rate: Commonly tuned when building deep learning models. This parameter can be likened to the size of steps you decide to take as you make your way, blindfolded, down a hilltop. Too large a step and you might get into trouble and potentially miss the fastest way down; too small a step and you’ll be there all day.
Logistic regression: A type of machine learning model especially useful for classification tasks. These models are widely used.
Mean Absolute Error (MAE): A measure of predictive model fitness. This measure is more easily understandable than its cousin MSE. For example, when attempting to predict a given individual’s income, this would be how far off, in dollars, the model is from reality on average. More discussion of the tradeoff between MAE and MSE can be found here.
Mean Squared Error (MSE): A measure of predictive model fitness. Because squaring penalizes large errors more heavily, this measure should be more closely considered when outliers are troubling for the model.
One-hot encoding: A data science method whereby categorical features (for simplicity: male vs. female, dead vs. alive, rich vs. poor vs. middle-income) are transformed into numerical features, with one binary (0/1) column per category. Useful for getting your data into shape for model building.
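A quick sketch with pandas, using an invented column:

```python
# One-hot encoding: each category becomes its own 0/1 column.
import pandas as pd

df = pd.DataFrame({"income_bracket": ["rich", "poor", "middle-income", "poor"]})
print(pd.get_dummies(df, columns=["income_bracket"]))
```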
Overfitting: A problem faced by data scientists when their model learns the given dataset too closely (studying to the test). It is problematic in that the model does not generalize to new data.
Regularization: A technique in data science to reduce the likelihood of overfitting.
R-squared: A measure of a model’s fitness. Think of it as how much of the variation in your target variable (what you’re trying to predict) is captured by your features (what you know about your target). The larger, the better.
Support Vector Machines: A machine learning algorithm. Powerful when properly used, though they have been cited as being hard to tune appropriately.
TF-IDF: A useful concept in natural language processing (NLP) whereby each word is weighted by how frequent it is in a given document relative to how common it is across the whole batch of text, surfacing the words that best characterize each document.
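A sketch with scikit-learn, using three made-up “documents”:

```python
# TF-IDF: words frequent in one document but rare across the collection
# get the highest weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "the cat chased the dog"]
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(weights.toarray().round(2))
```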
Again this is not meant to be a comprehensive list of terms. I barely feel like we’ve scratched the surface, but perhaps this will give you some notion about what machine learning involves and how data scientists talk about their trade.
Final thoughts: It’s not going to get less complicated in the near future, but AI is very promising
This post covered a lot of terms and you’re probably exhausted from the effort. Unfortunately, it’s unlikely to get less complex.
It is hard to recognize the impact that AI will have on our world because it isn’t yet widely deployed and we aren’t riding in self-driving vehicles. But remember that back in the early 1990s it would have been difficult to believe the impact the internet would have on all aspects of our lives. Most people didn’t see how the internet was relevant to them and didn’t know what it could accomplish. The same is true for machine learning and AI today. But let’s be clear: AI is coming.
As a side note, if you appreciate this content and would like further technical posts about how to build these kinds of systems and learn how to use them, please leave a comment so I know. I want to write about the topics you are interested in reading about.
Yes, they are listening…and learning.
Bonus fun book: After On
I recommended a few books during the course of this blog post, so let me recommend one more. After On is a fun novel set in the near future. I particularly enjoyed its description of what it felt like to have a fictional machine-learning-powered app improve over time:
When Phluttr first launched, its little interruptions were so irrelevant, people installed the app just to chuckle at them, and #TryAgainPhluttr was a hot hashtag. Later, it felt more like a fortune cookie; still very random, but at times weirdly topical, in fun, coincidental ways. These days, Phluttr’s accuracy makes Mitchell’s skin crawl daily.