15.1 Introduction to Machine Learning

  • machine learning—one of the most exciting and promising subfields of artificial intelligence
  • You’ll see how to quickly solve challenging and intriguing problems that novices and even the most experienced programmers probably would not have attempted just a few years ago.
  • Big, complex topic.
  • Our goal is a friendly, hands-on introduction to a few of the simpler machine-learning techniques.

What Is Machine Learning?

  • Can we really make our machines (computers) learn?
  • “Secret sauce” is data, and lots of it
  • Rather than programming expertise into our applications, we program them to learn from data
  • Build working machine-learning models then use them to make remarkably accurate predictions

Prediction

  • Improve weather forecasting to save lives, minimize injuries and property damage
  • Improve cancer diagnoses and treatment regimens to save lives
  • Improve business forecasts to maximize profits and secure people’s jobs
  • Detect fraudulent credit-card purchases and insurance claims
  • Predict customer “churn”, what prices houses are likely to sell for, ticket sales of new movies, and anticipated revenue of new products and services
  • Predict the best strategies for coaches and players to use to win more games and championships
  • All of these kinds of predictions are happening today with machine learning.

Machine learning applications

  • Anomaly detection
  • Chatbots
  • Classifying emails as spam or not spam
  • Classifying news articles as sports, financial, politics, etc.
  • Computer vision and image classification
  • Credit-card fraud detection
  • Customer churn prediction
  • Data compression
  • Data exploration
  • Data mining social media (like Facebook, Twitter, LinkedIn)
  • Detecting objects in scenes
  • Detecting patterns in data
  • Diagnostic medicine
  • Facial recognition
  • Handwriting recognition
  • Insurance fraud detection
  • Intrusion detection in computer networks
  • Marketing: dividing customers into clusters
  • Natural language translation (English to Spanish, French to Japanese, etc.)
  • Predicting mortgage loan defaults
  • Recommender systems (“people who bought this product also bought…”)
  • Self-driving cars (more generally, autonomous vehicles)
  • Sentiment analysis (like classifying movie reviews as positive, negative or neutral)
  • Spam filtering
  • Time series predictions like stock-price forecasting and weather forecasting
  • Voice recognition

15.1.1 Scikit-Learn

  • Scikit-learn, also called sklearn, conveniently packages the most effective machine-learning algorithms as estimators.
  • Each is encapsulated, so you don’t see the intricate details and heavy mathematics of how these algorithms work.
  • With scikit-learn and a small amount of Python code, you’ll create powerful models quickly for analyzing data, extracting insights from the data and most importantly making predictions.
  • You’ll use scikit-learn to train each model on a subset of your data, then test each model on the rest to see how well your model works.
  • Once your models are trained, you’ll put them to work making predictions based on data they have not seen.
  • The computer you’ve used mostly for rote chores will take on characteristics of intelligence.
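The train-then-test workflow just described can be sketched in a few lines. This is a minimal illustration (using the bundled Iris dataset and a k-nearest neighbors estimator as placeholders), not the chapter's full case study:

```python
# Minimal scikit-learn workflow: train an estimator on a subset of the
# data, then test it on held-out samples the model has never seen.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# by default, train_test_split holds out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)              # train on the training subset

# evaluate on data the model has not seen
print(f'accuracy: {knn.score(X_test, y_test):.2%}')
```

Once trained, the same estimator's `predict` method makes predictions for new samples.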

Which Scikit-Learn Estimator Should You Choose for Your Project (1 of 2)

  • It’s difficult to know in advance which model(s) will perform best on your data, so you typically try many models and pick the one that performs best.
  • As you’ll see, scikit-learn makes this convenient for you.
  • A popular approach is to run many models and pick the best one(s).
  • How do we evaluate which model performed best?
  • You’ll want to experiment with lots of different models on different kinds of datasets.

Which Scikit-Learn Estimator Should You Choose for Your Project (2 of 2)

  • You’ll rarely get to know the details of the complex mathematical algorithms in the sklearn estimators, but with experience, you’ll become familiar with which algorithms may be best for particular types of datasets and problems.
  • Even with that experience, it’s unlikely that you’ll be able to intuit the best model for each new dataset.
  • So scikit-learn makes it easy for you to “try ’em all.”
  • The models report their performance so you can compare the results and pick the model(s) with the best performance.

15.1.2 Types of Machine Learning (1 of 2)

Types of machine learning diagram


Supervised Machine Learning

  • Supervised machine learning falls into two categories—classification and regression.
  • You train machine-learning models on datasets that consist of rows and columns.
  • Each row represents a data sample.
  • Each column represents a feature of that sample.
  • In supervised machine learning, each sample has an associated label called a target (like “dog” or “cat”).
  • This is the value you’re trying to predict for new data that you present to your models.
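The rows/columns/target structure described above looks like this for the bundled Iris dataset (a small illustrative sketch):

```python
# Each row of `data` is a sample; each column is a feature of that
# sample; `target` holds one label per sample.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)       # (150, 4): 150 samples, 4 features each
print(iris.feature_names)    # one name per feature column
print(iris.target[:5])       # one target label per sample
print(iris.target_names)     # the classes the labels refer to
```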

Datasets

  • You’ll work with some “toy” datasets, each with a small number of samples with a limited number of features.
  • You’ll also work with several richly featured real-world datasets, one containing a few thousand samples and one containing tens of thousands of samples.
  • In the world of big data, datasets commonly have millions or billions of samples, or even more.
  • There’s an enormous number of free and open datasets available for data science studies.
  • Libraries like scikit-learn package up popular datasets for you to experiment with and provide mechanisms for loading datasets from various repositories (such as openml.org).
  • Governments, businesses and other organizations worldwide offer datasets on a vast range of subjects.
  • We’ll work with several popular free datasets, using a variety of machine learning techniques.

Classification

  • We’ll use one of the simplest classification algorithms, k-nearest neighbors, to analyze the Digits dataset bundled with scikit-learn.
  • Classification algorithms predict the discrete classes (categories) to which samples belong.
  • Binary classification uses two classes, such as “spam” or “not spam” in an email classification application.
  • Multi-class classification uses more than two classes, such as the 10 classes, 0 through 9, in the Digits dataset.
  • A classification scheme looking at movie descriptions might try to classify them as “action,” “adventure,” “fantasy,” “romance,” “history” and the like.
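A sketch of the k-nearest neighbors classification described above, applied to the Digits dataset's 10 classes:

```python
# Classify handwritten digits (classes 0 through 9) with k-nearest
# neighbors, then check accuracy on held-out test samples.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=11)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

predicted = knn.predict(X_test)    # discrete class predictions
print('first five predictions:', predicted[:5])
print(f'accuracy: {knn.score(X_test, y_test):.2%}')
```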

Regression

  • Regression models predict a continuous output.
  • We’ll perform simple linear regression using scikit-learn’s LinearRegression estimator.
  • Next, we’ll use a LinearRegression estimator to perform multiple linear regression with the California Housing dataset that’s bundled with scikit-learn.
  • We’ll predict the median house value of a U.S. census block of homes, considering eight features per block, such as the average number of rooms, median house age, average number of bedrooms and median income.
  • The LinearRegression estimator, by default, uses all the numerical features in a dataset to make more sophisticated predictions than you can with a single-feature simple linear regression.
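The multiple-linear-regression mechanics can be sketched as follows. To keep the sketch self-contained it uses the bundled Diabetes toy dataset rather than California Housing (which scikit-learn downloads on first use via `fetch_california_housing`):

```python
# Multiple linear regression: LinearRegression fits one coefficient
# per feature, then predicts a continuous output for new samples.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)   # 10 numeric features per sample
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

lr = LinearRegression()
lr.fit(X_train, y_train)                # one coefficient per feature

print('coefficients:', lr.coef_)
print('intercept:', lr.intercept_)
print(f'R^2 on test set: {lr.score(X_test, y_test):.3f}')
```

For regression estimators, `score` reports the coefficient of determination (R²) rather than classification accuracy.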

Unsupervised Machine Learning

  • Next, we’ll introduce unsupervised machine learning with clustering algorithms.
  • We’ll use dimensionality reduction (with scikit-learn’s TSNE estimator) to compress the Digits dataset’s 64 features down to two for visualization purposes.
  • This will enable us to see how nicely the Digits data “cluster up.”
  • This dataset contains handwritten digits like those the post office’s computers must recognize to route each letter to its designated zip code.
  • This is a challenging computer-vision problem, given that each person’s handwriting is unique.
  • Yet, we’ll build this clustering model with just a few lines of code and achieve impressive results.
  • And we’ll do this without having to understand the inner workings of the clustering algorithm.
  • This is the beauty of object-based programming.
  • We’ll see this kind of convenient object-based programming again in the next chapter, where we’ll build powerful deep learning models using the open source Keras library.
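The dimensionality reduction step described above can be sketched like this (plotting omitted; note that running TSNE on all 1,797 Digits samples may take a minute):

```python
# Compress the Digits dataset's 64 features down to 2 with TSNE so
# each sample can be plotted as a point in two dimensions.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
tsne = TSNE(n_components=2, random_state=11)
reduced = tsne.fit_transform(digits.data)
print(reduced.shape)    # one 2D point per digit image
```

To see the clusters, scatter-plot `reduced[:, 0]` against `reduced[:, 1]`, coloring each point by its `digits.target` value.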

K-Means Clustering and the Iris Dataset (1 of 2)

  • We’ll present the simplest unsupervised machine-learning algorithm, k-means clustering, and use it on the Iris dataset that’s also bundled with scikit-learn.
  • We’ll use dimensionality reduction (with scikit-learn’s PCA estimator) to compress the Iris dataset’s four features to two for visualization purposes.
  • We’ll show the clustering of the three Iris species in the dataset and graph each cluster’s centroid, which is the cluster’s center point.

K-Means Clustering and the Iris Dataset (2 of 2)

  • Finally, we’ll run multiple clustering estimators to compare their ability to divide the Iris dataset’s samples effectively into three clusters.
  • You normally specify the desired number of clusters, k.
  • K-means works through the data trying to divide it into that many clusters.
  • As with many machine learning algorithms, k-means is iterative and gradually zeros in on the clusters to match the number you specify.
  • K-means clustering can find similarities in unlabeled data.
  • This can ultimately help with assigning labels to that data so that supervised learning estimators can then process it.
  • Given that it’s tedious and error-prone for humans to have to assign labels to unlabeled data, and given that the vast majority of the world’s data is unlabeled, unsupervised machine learning is an important tool.
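The k-means-plus-PCA approach described above can be sketched as follows (a minimal version; the chapter's case study adds visualization and comparisons with other clustering estimators):

```python
# Cluster the Iris samples into k=3 clusters with k-means, then use
# PCA to reduce the 4 features to 2 so samples and centroids can be
# plotted in two dimensions.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

kmeans = KMeans(n_clusters=3, n_init=10, random_state=11)
kmeans.fit(iris.data)                  # iteratively zeros in on 3 clusters

pca = PCA(n_components=2, random_state=11)
reduced = pca.fit_transform(iris.data)             # samples in 2D
centroids = pca.transform(kmeans.cluster_centers_) # centroids in 2D

print('cluster sizes:', [list(kmeans.labels_).count(i) for i in range(3)])
print('centroids (2D):')
print(centroids)
```

Note that k-means never sees `iris.target`; the cluster labels it assigns are found from the unlabeled feature data alone.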

Big Data and Big Computer Processing Power

  • The amount of data that’s available today is already enormous and continues to grow exponentially.
  • By some accounts, the data produced in the world in the last few years equals all the data produced before then, going back to the dawn of civilization.
  • We commonly talk about big data, but “big” may not be a strong enough term to describe how truly huge data is getting.
  • People used to say “I’m drowning in data and I don’t know what to do with it.”
  • With machine learning, we now say, “Flood me with big data so I can use machine-learning technology and powerful computing capabilities to extract insights and make predictions from it.”
  • This is occurring at a time when computing power is exploding and computer memory and secondary storage are exploding in capacity while costs dramatically decline.
  • All of this enables us to think differently about the solution approaches.
  • We now can program computers to learn from data, and lots of it.
  • It’s now all about predicting from data.

15.1.3 Datasets Bundled with Scikit-Learn

  • Scikit-learn also provides capabilities for loading datasets from other sources, such as the 20,000+ datasets available at https://openml.org.
Datasets bundled with scikit-learn

"Toy" datasets:
  • Boston house prices
  • Iris plants
  • Diabetes
  • Optical recognition of handwritten digits
  • Linnerrud
  • Wine recognition
  • Breast cancer Wisconsin (diagnostic)

Real-world datasets:
  • Olivetti faces
  • 20 newsgroups text
  • Labeled Faces in the Wild face recognition
  • Forest cover types
  • RCV1
  • Kddcup 99
  • California Housing
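The bundled "toy" datasets are available through `load_*` functions in the `sklearn.datasets` module; the real-world and openml.org datasets are downloaded on first use via `fetch_*` functions such as `fetch_openml`. A small sketch using three bundled loaders:

```python
# Each load_* function returns a Bunch object with the samples in
# `data` (rows = samples, columns = features) and labels in `target`.
from sklearn.datasets import load_digits, load_iris, load_wine

for loader in (load_digits, load_iris, load_wine):
    bunch = loader()
    print(f'{loader.__name__}: {bunch.data.shape[0]} samples, '
          f'{bunch.data.shape[1]} features')
```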

15.1.4 Steps in a Typical Data Science Study

  • We’ll perform the steps of a typical machine-learning case study, including:
    • loading the dataset
    • exploring the data with pandas and visualizations
    • transforming your data (converting non-numeric data to numeric data because scikit-learn requires numeric data; in the chapter, we use datasets that are “ready to go,” but we’ll discuss the issue again in the “Deep Learning” chapter)
    • splitting the data for training and testing
    • creating the model
    • training and testing the model
    • tuning the model and evaluating its accuracy
    • making predictions on live data that the model hasn’t seen before.
  • Preparing your data properly, particularly the exploration and transformation steps, is essential before using it for machine learning.
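The steps above can be sketched end to end on a bundled, already-numeric dataset (so no transformation step is needed here); the use of k-nearest neighbors and a grid search over k is one illustrative choice of model and tuning strategy:

```python
# A typical study in miniature: load, explore, split, create, tune,
# train, test, then predict on an unseen sample.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load the dataset
iris = load_iris()

# explore the data with pandas
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.describe().loc[['mean', 'std']])

# split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=11)

# create and tune the model: search over k for k-nearest neighbors
grid = GridSearchCV(KNeighborsClassifier(),
                    {'n_neighbors': list(range(1, 11))}, cv=5)
grid.fit(X_train, y_train)          # trains a model for each k

# evaluate accuracy on the held-out test set
print('best k:', grid.best_params_['n_neighbors'])
print(f'test accuracy: {grid.score(X_test, y_test):.2%}')

# predict for a new, unseen sample (hypothetical flower measurements)
sample = [[5.1, 3.5, 1.4, 0.2]]
print('predicted species:', iris.target_names[grid.predict(sample)[0]])
```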

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.