8.13 Intro to Data Science: Pandas, Regular Expressions and Data Munging

  • Data does not always come in forms ready for analysis
  • Data could be
    • wrong format
    • incorrect
    • missing
  • Data scientists can spend as much as 75% of their time preparing data before they begin their studies
  • Called data munging or data wrangling

8.13 Intro to Data Science: Pandas, Regular Expressions and Data Munging (cont.)

  • Two of the most important steps in data munging are data cleaning and transforming data into optimal formats for database systems and analytics software
  • Common data cleaning examples include:
    • deleting observations with missing values,
    • substituting reasonable values for missing values,
    • deleting observations with bad values,
    • substituting reasonable values for bad values,
    • tossing outliers (although sometimes you’ll want to keep them),
    • duplicate elimination (although sometimes duplicates are valid),
    • dealing with inconsistent data,
    • and more.
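Several of the cleaning operations above map directly onto pandas methods. A minimal sketch, using a small hypothetical DataFrame (the names and scores are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical scores with a missing value and a duplicated row
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Bob', 'Cy'],
                   'score': [91.0, np.nan, np.nan, 87.0]})

dropped = df.dropna()                    # delete observations with missing values
filled = df.fillna(df['score'].mean())   # substitute a reasonable value (the column mean)
deduped = df.drop_duplicates()           # eliminate duplicate observations
```

Which of these you apply, and in what order, depends on the project and the data.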

8.13 Intro to Data Science: Pandas, Regular Expressions and Data Munging (cont.)

  • Data cleaning is a difficult and messy process where you could easily make bad decisions that would negatively impact your results
  • The actions data scientists take can vary per project, depend on the quality and nature of the data, and be affected by evolving organizational and professional standards
  • Common data transformations include:
    • removing unnecessary data and features (we’ll say more about features in the data science case studies),
    • combining related features,
    • sampling data to obtain a representative subset (we’ll see in the data science case studies that random sampling is particularly effective for this and we’ll say why),
    • standardizing data formats,
    • grouping data,
    • and more.
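The transformations above also correspond to common pandas idioms. A minimal sketch on a hypothetical 100-row dataset (the column names and the four-way bucketing are invented for illustration):

```python
import pandas as pd

# Hypothetical dataset of 100 observations
df = pd.DataFrame({'id': range(100), 'value': range(100)})

# random sampling: a repeatable 10% subset
subset = df.sample(frac=0.1, random_state=17)

# removing an unnecessary feature
trimmed = df.drop(columns=['value'])

# grouping: count observations per (hypothetical) bucket
buckets = df['id'] % 4
counts = df.groupby(buckets).size()
```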

Cleaning Your Data

  • Bad data values and missing values can significantly impact data analysis
  • Some data scientists advise against any attempts to insert “reasonable values”
    • Instead, they advocate clearly marking missing data and leaving it up to the data analytics package to handle the issue
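One conventional way to mark missing data is NaN, which pandas' statistical methods skip by default. A minimal sketch using the temperature readings from the next slide's example:

```python
import numpy as np
import pandas as pd

# Mark the missing fourth reading as NaN instead of inventing a value
temps = pd.Series([98.6, 98.4, 98.7, np.nan])

# pandas skips NaN by default, so the statistics stay sensible
average = temps.mean()       # mean of the three real readings
missing = temps.isna().sum() # count of missing readings
```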

Cleaning Your Data (cont.)

  • Consider a hospital that records patients’ temperatures (and probably other vital signs) four times per day
  • Assume that the data consists of a name and four float values, such as
    ['Brown, Sue', 98.6, 98.4, 98.7, 0.0]
    
  • Patient’s first three recorded temperatures are 98.6, 98.4 and 98.7
  • Last temperature was missing and recorded as 0.0, perhaps because the sensor malfunctioned
  • Average of the first three values is 98.57, which is close to normal
  • If you calculate the average temperature including the missing value for which 0.0 was substituted, the average is only 73.93, clearly a questionable result
  • Crucial to “get the data right.”
  • One way to clean the data is to substitute a reasonable value for the missing temperature, such as the average of the patient’s other readings
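A minimal sketch of that substitution, using the record shown above and treating 0.0 as the sentinel for a missing reading:

```python
record = ['Brown, Sue', 98.6, 98.4, 98.7, 0.0]
name, *temps = record

# average of the readings that are not the 0.0 sentinel
valid = [t for t in temps if t != 0.0]
replacement = sum(valid) / len(valid)

# substitute the average for the sentinel value
cleaned = [t if t != 0.0 else replacement for t in temps]
```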

Data Validation

  • Create a Series of ZIP Codes from a dictionary of city-name/five-digit-ZIP-Code key–value pairs
  • We intentionally entered an invalid four-digit ZIP Code for Miami
In [1]:
import pandas as pd
In [2]:
zips = pd.Series({'Boston': '02215', 'Miami': '3310'})
In [3]:
zips
Out[3]:
Boston    02215
Miami      3310
dtype: object

Data Validation (cont.)

  • The “first column” represents the Series’ indices (from the dictionary’s keys)
  • The “second column” represents the Series’ ZIP Code values (from the dictionary’s values)
  • Can use regular expressions with Pandas to validate data
  • The str attribute of a Series provides string-processing and various regular expression methods
  • Use the str attribute’s match method to check whether each ZIP Code is valid:
In [4]:
zips.str.match(r'\d{5}')
Out[4]:
Boston     True
Miami     False
dtype: bool
  • match applies the regular expression \d{5} to each Series element
  • Returns a new Series containing True for each element that matches and False otherwise
  • Like re.match, str.match checks for a match at the beginning of each string; newer pandas versions also offer str.fullmatch to require the entire string to match

Data Validation (cont.)

  • Several ways to deal with invalid data
  • One is to catch it at its source and interact with the source to correct the value
    • Not always possible
  • In the case of the bad Miami ZIP Code of 3310, we might look for Miami ZIP Codes beginning with 3310
    • There are two—33101 and 33109
    • We could pick one of those
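Once we have chosen a correct code, one way to apply the fix is to assign it back into the Series. A minimal sketch, continuing the zips example (the choice of 33101 is just one of the two valid options mentioned above):

```python
import pandas as pd

zips = pd.Series({'Boston': '02215', 'Miami': '3310'})

# flag entries that do not begin with five digits
valid = zips.str.match(r'\d{5}')

# after looking up a correct code (33101 is one valid Miami ZIP),
# assign it in place
zips['Miami'] = '33101'
```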

Data Validation (cont.)

  • Sometimes, rather than matching an entire value to a pattern, you’ll want to know whether a value contains a substring that matches the pattern
  • Use method contains instead of match
In [5]:
cities = pd.Series(['Boston, MA 02215', 'Miami, FL 33101'])
In [6]:
cities
Out[6]:
0    Boston, MA 02215
1     Miami, FL 33101
dtype: object
In [7]:
cities.str.contains(r' [A-Z]{2} ')
Out[7]:
0    True
1    True
dtype: bool
In [8]:
cities.str.match(r' [A-Z]{2} ')
Out[8]:
0    False
1    False
dtype: bool

Reformatting Your Data

  • Consider munging data into a different format
  • Assume that an application requires U.S. phone numbers in the format ###-###-####
  • The phone numbers have been provided to us as 10-digit strings without hyphens
In [9]:
contacts = [['Mike Green', 'demo1@deitel.com', '5555555555'],
            ['Sue Brown', 'demo2@deitel.com', '5555551234']]   
In [10]:
contactsdf = pd.DataFrame(contacts, 
                          columns=['Name', 'Email', 'Phone'])
In [11]:
contactsdf
Out[11]:
         Name             Email       Phone
0  Mike Green  demo1@deitel.com  5555555555
1   Sue Brown  demo2@deitel.com  5555551234

Reformatting Your Data (cont.)

  • Munge the data with functional-style programming
  • Can map the phone numbers to the proper format by calling the Series method map on the DataFrame’s 'Phone' column
  • map’s argument is a function that receives a value and returns the mapped value
  • Our function get_formatted_phone maps 10 consecutive digits into the format ###-###-####
In [12]:
import re
In [13]:
def get_formatted_phone(value):
    result = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', value)
    return '-'.join(result.groups()) if result else value

Reformatting Your Data (cont.)

  • Regular expression in the block’s first statement matches only 10 consecutive digits
  • Captures substrings containing the first three digits, the next three digits and the last four digits
  • return statement:
    • If result is None, returns value unmodified
    • Otherwise, calls result.groups() to get a tuple containing the captured substrings and passes that tuple to string method join to concatenate the elements, separating each from the next with '-' to form the mapped phone number

Reformatting Your Data (cont.)

  • Series method map returns a new Series containing the results of calling its function argument for each value in the column
In [14]:
formatted_phone = contactsdf['Phone'].map(get_formatted_phone)
In [15]:
formatted_phone
Out[15]:
0    555-555-5555
1    555-555-1234
Name: Phone, dtype: object
  • Once you’ve confirmed that the data is in the correct format, you can update it in the original DataFrame
In [16]:
contactsdf['Phone'] = formatted_phone
In [17]:
contactsdf
Out[17]:
         Name             Email         Phone
0  Mike Green  demo1@deitel.com  555-555-5555
1   Sue Brown  demo2@deitel.com  555-555-1234

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 5 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.