Import cleaning process
The following notebook is part of our import cleaning process. This notebook accomplishes the following:
- Imports a CSV file
- Removes extra columns
- Converts strings to correct data types
- Saves in the cleansed directory
# Pip install
%pip install pandas
# Here we would import libraries
import sys
import pandas as pd
Import file from raw data folder
Here we import the file from the raw folder
# Pseudo code for opening a file, importing a CSV, and loading it into pandas
# Define the file path
file_path = 'path/to/your/csvfile.csv'
# Use pandas to read the CSV file
df = pd.read_csv(file_path)
# Display the first few rows of the dataframe
print(df.head())
Removes extra columns
Removing the address and phone fields
# BEGIN: Remove extra columns
# List of columns to remove
columns_to_remove = ['address', 'phone']
# Remove the specified columns
df_cleaned = df.drop(columns=columns_to_remove)
# Display the first few rows of the cleaned dataframe
print(df_cleaned.head())
# END: Remove extra columns
Set data types
Sets the correct datatypes for date and identity fields.
# Pseudo code for setting data types
# Convert the 'date_field' to datetime
df_cleaned['date_field'] = pd.to_datetime(df_cleaned['date_field'])
# Convert the 'identity_field' to numeric (integer)
df_cleaned['identity_field'] = pd.to_numeric(df_cleaned['identity_field'], errors='coerce')
# Display the data types of the dataframe to verify changes
print(df_cleaned.dtypes)
Save new data to a cleansed directory
Write the cleansed data from Pandas to a new CSV file in the cleansed folder
# Pseudo code to write the cleansed data to a new CSV file in the cleansed folder
# Define the output file path
output_file_path = 'path/to/cleansed/folder/cleansed_data.csv'
# Use pandas to write the dataframe to a CSV file
df_cleaned.to_csv(output_file_path, index=False)
# Confirm the file has been written
print(f"Cleansed data has been written to {output_file_path}")