8.13 Intro to Data Science: Pandas, Regular Expressions and Data Munging

  • Data does not always come in forms ready for analysis
  • Data could be
    • wrong format
    • incorrect
    • missing
  • Data scientists can spend as much as 75% of their time preparing data before they begin their studies
  • Called data munging or data wrangling

8.13 Intro to Data Science: Pandas, Regular Expressions and Data Munging (cont.)

  • Two of the most important steps in data munging are data cleaning and transforming data into optimal formats for database systems and analytics software
  • Common data cleaning examples include:
    • deleting observations with missing values,
    • substituting reasonable values for missing values,
    • deleting observations with bad values,
    • substituting reasonable values for bad values,
    • tossing outliers (although sometimes you’ll want to keep them),
    • duplicate elimination (although sometimes duplicates are valid),
    • dealing with inconsistent data,
    • and more.
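Several of the cleaning operations above map directly onto pandas methods. A minimal sketch, using a small hypothetical DataFrame (the names and scores are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical scores with a missing value and a duplicated row
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Bob', 'Cy'],
                   'score': [91.0, np.nan, np.nan, 87.0]})

dropped = df.dropna()                    # delete observations with missing values
filled = df.fillna(df['score'].mean())   # substitute a reasonable value (the column mean)
deduped = df.drop_duplicates()           # eliminate duplicate observations
```

Which of these you apply, and in what order, depends on the project and the data.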

8.13 Intro to Data Science: Pandas, Regular Expressions and Data Munging (cont.)

  • Data cleaning is a difficult and messy process where you could easily make bad decisions that would negatively impact your results
  • The actions data scientists take can vary per project, depend on the quality and nature of the data, and be affected by evolving organizational and professional standards
  • Common data transformations include:
    • removing unnecessary data and features (we’ll say more about features in the data science case studies),
    • combining related features,
    • sampling data to obtain a representative subset (we’ll see in the data science case studies that random sampling is particularly effective for this and we’ll say why),
    • standardizing data formats,
    • grouping data,
    • and more.
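The transformations above also correspond to common pandas idioms. A minimal sketch on a hypothetical 100-row dataset (the column names and the four-way bucketing are invented for illustration):

```python
import pandas as pd

# Hypothetical dataset of 100 observations
df = pd.DataFrame({'id': range(100), 'value': range(100)})

# random sampling: a repeatable 10% subset
subset = df.sample(frac=0.1, random_state=17)

# removing an unnecessary feature
trimmed = df.drop(columns=['value'])

# grouping: count observations per (hypothetical) bucket
buckets = df['id'] % 4
counts = df.groupby(buckets).size()
```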

Cleaning Your Data

  • Bad data values and missing values can significantly impact data analysis
  • Some data scientists advise against any attempts to insert “reasonable values”
    • Instead, they advocate clearly marking missing data and leaving it up to the data analytics package to handle the issue
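One conventional way to mark missing data is NaN, which pandas' statistical methods skip by default. A minimal sketch using the temperature readings from the next slide's example:

```python
import numpy as np
import pandas as pd

# Mark the missing fourth reading as NaN instead of inventing a value
temps = pd.Series([98.6, 98.4, 98.7, np.nan])

# pandas skips NaN by default, so the statistics stay sensible
average = temps.mean()       # mean of the three real readings
missing = temps.isna().sum() # count of missing readings
```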

Cleaning Your Data (cont.)

  • Consider a hospital that records patients’ temperatures (and probably other vital signs) four times per day
  • Assume that the data consists of a name and four float values, such as
    ['Brown, Sue', 98.6, 98.4, 98.7, 0.0]
    
  • Patient’s first three recorded temperatures are 98.6, 98.4 and 98.7
  • Last temperature was missing and recorded as 0.0, perhaps because the sensor malfunctioned
  • Average of the first three values is 98.57, which is close to normal
  • If you calculate the average temperature including the missing value for which 0.0 was substituted, the average is only 73.93, clearly a questionable result
  • Crucial to “get the data right.”
  • One way to clean the data is to substitute a reasonable value for the missing temperature, such as the average of the patient’s other readings
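A minimal sketch of that substitution, using the record shown above and treating 0.0 as the sentinel for a missing reading:

```python
record = ['Brown, Sue', 98.6, 98.4, 98.7, 0.0]
name, *temps = record

# average of the readings that are not the 0.0 sentinel
valid = [t for t in temps if t != 0.0]
replacement = sum(valid) / len(valid)

# substitute the average for the sentinel value
cleaned = [t if t != 0.0 else replacement for t in temps]
```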

Data Validation

  • Create a Series of ZIP Codes from a dictionary of city-name/five-digit-ZIP-Code key–value pairs
  • We intentionally entered an invalid four-digit ZIP Code for Miami
In [1]:
import pandas as pd
In [2]:
zips = pd.Series({'Boston': '02215', 'Miami': '3310'})
In [3]:
zips
Out[3]:
Boston    02215
Miami      3310
dtype: object

Data Validation (cont.)

  • The “first column” represents the Series’ indices (from the dictionary’s keys)
  • The “second column” represents the Series’ ZIP Code values (from the dictionary’s values)
  • Can use regular expressions with Pandas to validate data
  • The str attribute of a Series provides string-processing and various regular expression methods
  • Use the str attribute’s match method to check whether each ZIP Code is valid:
In [4]:
zips.str.match(r'\d{5}')
Out[4]:
Boston     True
Miami     False
dtype: bool
  • match applies the regular expression \d{5} to each Series element
  • Returns a new Series containing True for each element that matches and False otherwise
  • Like re.match, str.match checks for a match at the beginning of each string; newer pandas versions also offer str.fullmatch to require the entire string to match

Data Validation (cont.)

  • Several ways to deal with invalid data
  • One is to catch it at its source and interact with the source to correct the value
    • Not always possible
  • In the case of the bad Miami ZIP Code of 3310, we might look for Miami ZIP Codes beginning with 3310
    • There are two—33101 and 33109
    • We could pick one of those
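Once we have chosen a correct code, one way to apply the fix is to assign it back into the Series. A minimal sketch, continuing the zips example (the choice of 33101 is just one of the two valid options mentioned above):

```python
import pandas as pd

zips = pd.Series({'Boston': '02215', 'Miami': '3310'})

# flag entries that do not begin with five digits
valid = zips.str.match(r'\d{5}')

# after looking up a correct code (33101 is one valid Miami ZIP),
# assign it in place
zips['Miami'] = '33101'
```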

Data Validation (cont.)

  • Sometimes, rather than matching an entire value to a pattern, you’ll want to know whether a value contains a substring that matches the pattern
  • Use method contains instead of match
In [5]:
cities = pd.Series(['Boston, MA 02215', 'Miami, FL 33101'])
In [6]:
cities
Out[6]:
0    Boston, MA 02215
1     Miami, FL 33101
dtype: object
In [7]:
cities.str.contains(r' [A-Z]{2} ')
Out[7]:
0    True
1    True
dtype: bool
In [8]:
cities.str.match(r' [A-Z]{2} ')
Out[8]:
0    False
1    False
dtype: bool

Reformatting Your Data

  • Consider munging data into a different format
  • Assume that an application requires U.S. phone numbers in the format ###-###-####
  • The phone numbers have been provided to us as 10-digit strings without hyphens
In [9]:
contacts = [['Mike Green', 'demo1@deitel.com', '5555555555'],
            ['Sue Brown', 'demo2@deitel.com', '5555551234']]   
In [10]:
contactsdf = pd.DataFrame(contacts, 
                          columns=['Name', 'Email', 'Phone'])
In [11]:
contactsdf
Out[11]:
         Name             Email       Phone
0  Mike Green  demo1@deitel.com  5555555555
1   Sue Brown  demo2@deitel.com  5555551234

Reformatting Your Data (cont.)

  • Munge the data with functional-style programming
  • Can map the phone numbers to the proper format by calling the Series method map on the DataFrame’s 'Phone' column
  • map’s argument is a function that receives a value and returns the mapped value
  • Our function get_formatted_phone maps 10 consecutive digits into the format ###-###-####
In [12]:
import re
In [13]:
def get_formatted_phone(value):
    result = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', value)
    return '-'.join(result.groups()) if result else value

Reformatting Your Data (cont.)

  • Regular expression in the block’s first statement matches only 10 consecutive digits
  • Captures substrings containing the first three digits, the next three digits and the last four digits
  • return statement:
    • If result is None, returns value unmodified
    • Otherwise, calls result.groups() to get a tuple containing the captured substrings and passes that tuple to string method join to concatenate the elements, separating each from the next with '-' to form the mapped phone number

Reformatting Your Data (cont.)

  • Series method map returns a new Series containing the results of calling its function argument for each value in the column
In [14]:
formatted_phone = contactsdf['Phone'].map(get_formatted_phone)
In [15]:
formatted_phone
Out[15]:
0    555-555-5555
1    555-555-1234
Name: Phone, dtype: object
  • Once you’ve confirmed that the data is in the correct format, you can update it in the original DataFrame
In [16]:
contactsdf['Phone'] = formatted_phone
In [17]:
contactsdf
Out[17]:
         Name             Email         Phone
0  Mike Green  demo1@deitel.com  555-555-5555
1   Sue Brown  demo2@deitel.com  555-555-1234

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.