8.12 Introduction to Regular Expressions

  • A regular expression describes a search pattern for matching characters in other strings
  • Can help you extract data from unstructured text
  • Can help you ensure that data is in the correct format before processing it
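
A minimal sketch of the idea, assuming Python's standard `re` module. The sample text and the date pattern are illustrative assumptions, not taken from the original notes:

```python
import re

# Hypothetical unstructured text, used only for illustration.
text = 'Order 1042 shipped on 2020-03-15 to Boston.'

# Four digits, a hyphen, two digits, a hyphen, two digits.
date_pattern = r'\d{4}-\d{2}-\d{2}'

# re.search scans the string and returns the first match, or None.
match = re.search(date_pattern, text)
found_date = match.group() if match else None
print(found_date)
```

Note that this pattern only checks the shape of the text; it would also accept an impossible date such as 2020-99-99.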

Validating Data

  • Regular expressions are often used to validate data, for example:
    • A U.S. ZIP Code consists of five digits (such as 02215) or five digits followed by a hyphen and four more digits (such as 02215-4775)
    • A last name contains only letters, spaces, apostrophes and hyphens
    • An e-mail address contains only the allowed characters in the allowed order
    • A U.S. Social Security number contains three digits, a hyphen, two digits, a hyphen and four digits, and adheres to other rules about the specific numbers that can be used in each group of digits
  • You will rarely need to create your own regular expressions for common items like these
  • Several websites offer existing regular expressions that you can copy and use
    • https://regex101.com
    • http://www.regexlib.com
    • https://www.regular-expressions.info
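
The ZIP Code and last-name rules above can be sketched as follows. This is one possible approach, assuming Python's `re` module; the names `ZIP_PATTERN`, `LAST_NAME_PATTERN` and `is_valid` are hypothetical, and the last-name pattern assumes unaccented ASCII letters only:

```python
import re

# Five digits, optionally followed by a hyphen and four more digits.
ZIP_PATTERN = r'\d{5}(-\d{4})?'

# One or more letters, spaces, apostrophes or hyphens.
# Assumption: ASCII letters only, so a name such as "Müller" would be rejected.
LAST_NAME_PATTERN = r"[A-Za-z' -]+"


def is_valid(pattern, value):
    """Return True only if the entire value matches the pattern."""
    return re.fullmatch(pattern, value) is not None


for zip_code in ['02215', '02215-4775', '2215', '02215-47']:
    print(f'{zip_code!r}: {is_valid(ZIP_PATTERN, zip_code)}')
```

`re.fullmatch` is used rather than `re.search` because validation should reject a string that merely contains a valid value somewhere inside it.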

Other Uses of Regular Expressions

  • Extract data from text (sometimes known as scraping)
    • e.g., locating all URLs in a web page
    • You might prefer tools like BeautifulSoup, XPath and lxml for this
  • Clean data
    • Removing data that’s not required, removing duplicate data, handling incomplete data, fixing typos, ensuring consistent data formats, dealing with outliers and more
  • Transform data into other formats
    • Reformatting data that was collected as tab-separated or space-separated values into comma-separated values (CSV) for an application that requires data to be in CSV format
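
Two of these uses, extraction and transformation, can be sketched with Python's `re` module. The sample HTML, the sample record and the deliberately simple URL pattern are assumptions made for illustration; a real page is usually better handled by an HTML parser, and real CSV output by the `csv` module:

```python
import re

# Hypothetical page fragment, used only for illustration.
html = ('<a href="https://example.com/a">A</a> and '
        '<a href="http://example.org/b?x=1">B</a>')

# Simplified URL pattern: http or https, then any run of characters
# that are not whitespace, double quotes or angle brackets.
url_pattern = r'https?://[^\s"<>]+'
urls = re.findall(url_pattern, html)
print(urls)

# Hypothetical record mixing a tab and several spaces as separators.
record = 'alice\t30   Boston'

# Replace each run of spaces and tabs with a single comma.
# Assumption: the fields themselves contain no spaces or commas.
csv_line = re.sub(r'[ \t]+', ',', record.strip())
print(csv_line)
```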

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.
