ChatGPT Code Interpreter: How It Saved Me Hours of Work | by Soner Yıldırım | Jul, 2023

Data cleaning and preprocessing

The next step after reading the data is to clean and preprocess it. I’m sure you’ve heard several times that this step usually takes the most time.

There are a few issues that need to be handled before creating the world map visualizations. ChatGPT detected them and provided a solution in less than a minute. Quite impressive!

(image by author)

The code in the above snippet:

# code generated by Code Interpreter
# Check if the country names are consistent across both
population_countries = set(population_data['Country Name'])
coordinates_countries = set(country_coordinates['name'])

# Countries in the population data but not in the coordinates data
missing_in_coordinates = population_countries - coordinates_countries

# Countries in the coordinates data but not in the population data
missing_in_population = coordinates_countries - population_countries

Some country names in the population dataset have no match in the coordinates dataset, and vice versa.

What ChatGPT did in the code snippet above:

  • Create sets of country names in both datasets
  • Take the set difference in each direction to find the names that appear in only one dataset
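To see what the two set differences look like in practice, here is a minimal sketch on made-up toy data. The two small DataFrames are hypothetical stand-ins for the real datasets; only the column names 'Country Name' and 'name' come from the snippet above.

```python
import pandas as pd

# Hypothetical toy data standing in for the two real datasets
population_data = pd.DataFrame(
    {"Country Name": ["Bahamas, The", "Brunei Darussalam", "France"]}
)
country_coordinates = pd.DataFrame(
    {"name": ["Bahamas", "Brunei", "France", "Antarctica"]}
)

population_countries = set(population_data["Country Name"])
coordinates_countries = set(country_coordinates["name"])

# Names that appear only in the population data
missing_in_coordinates = population_countries - coordinates_countries

# Names that appear only in the coordinates data
missing_in_population = coordinates_countries - population_countries

# Sorting makes the mismatches easier to scan by eye
print(sorted(missing_in_coordinates))
print(sorted(missing_in_population))
```

Printing both sets side by side tends to reveal that many "missing" countries are really the same country spelled differently, which is the case a name mapping can fix.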

ChatGPT resolved the mismatched names in less than a minute:

The code in the above snippet:

# code generated by Code Interpreter
# Mapping of country names from the population dataset to the coordinates dataset
country_name_mapping = {
'Bahamas, The': 'Bahamas',
'Brunei Darussalam': 'Brunei',
'Cabo Verde': 'Cape Verde',
'Congo, Dem. Rep.': 'Congo [DRC]',
'Congo, Rep.': 'Congo [Republic]',
"Cote d'Ivoire": "Côte d'Ivoire",
'Czechia': 'Czech Republic',
'Egypt, Arab Rep.': 'Egypt',
'Gambia, The': 'Gambia',
'Hong Kong SAR, China': 'Hong Kong',
'Iran, Islamic Rep.': 'Iran',
"Korea, Dem. People's Rep.": 'North Korea',
'Korea, Rep.': 'South Korea',
'Kyrgyz Republic': 'Kyrgyzstan',
'Lao PDR': 'Laos',
'Macao SAR, China': 'Macau',
'Micronesia, Fed. Sts.': 'Micronesia',
'Myanmar': 'Myanmar [Burma]',
'North Macedonia': 'Macedonia [FYROM]',
'Russian Federation': 'Russia',
'Sao Tome and Principe': 'São Tomé and Príncipe',
'Slovak Republic': 'Slovakia',
'Syrian Arab Republic': 'Syria',
'Turkiye': 'Turkey',
'Venezuela, RB': 'Venezuela',
'Yemen, Rep.': 'Yemen',
'Eswatini': 'Swaziland'
}

# Clean the population dataset
population_data_clean = population_data.replace({"Country Name": country_name_mapping})

# Drop the rows that do not exist in the coordinates dataset
population_data_clean = population_data_clean[population_data_clean['Country Name'].isin(coordinates_countries)]

# Merge the population data with the country coordinates data
merged_data = pd.merge(population_data_clean, country_coordinates, left_on='Country Name', right_on='name')

# Keep only the necessary columns
merged_data = merged_data[['Country Name', '2022', 'latitude', 'longitude']]

# Rename the columns for clarity
merged_data.columns = ['Country', 'Population', 'Latitude', 'Longitude']

merged_data.head()

Cleaned and preprocessed data (image by author)
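One optional sanity check, not part of the code ChatGPT generated, is to see which rows would still fail to match after the renaming. The sketch below uses hypothetical toy data with placeholder numbers and relies on the `indicator=True` option of pandas' `merge`, which adds a `_merge` column describing where each row was found.

```python
import pandas as pd

# Hypothetical toy data; the numbers are placeholders, not real figures
population_data = pd.DataFrame({
    "Country Name": ["Bahamas, The", "France", "World"],
    "2022": [100, 200, 300],
})
country_coordinates = pd.DataFrame({
    "name": ["Bahamas", "France"],
    "latitude": [1.0, 2.0],
    "longitude": [3.0, 4.0],
})

# Same renaming technique as in the article, with a one-entry mapping
country_name_mapping = {"Bahamas, The": "Bahamas"}
cleaned = population_data.replace({"Country Name": country_name_mapping})

# A left merge keeps every population row; _merge flags rows with no match
checked = cleaned.merge(
    country_coordinates,
    left_on="Country Name",
    right_on="name",
    how="left",
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Country Name"].tolist()
print(unmatched)
```

In this toy case the only unmatched row is the aggregate "World", which is the kind of row the `isin` filter in the article drops on purpose. Any real country showing up in such a list would suggest the mapping needs another entry.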

We now have the data in a nice, clean format. This step would normally take a lot of time and manual effort. ChatGPT did it in about a minute.