Python pandas Practice Problems for Beginner Coders
July 1, 2022
Wrangling large datasets is simpler with the help of programmatic analysis and built-in methods. Pandas is an open-source Python package widely used for data cleaning, manipulation, and inspection.
With pandas DataFrame objects, programmers can easily find missing values, calculate new fields and search for insights in their data. The library is also useful for machine learning, making it possible for machine learning engineers to handle large amounts of data and prepare it for a model.
To help beginner coders practice Python pandas fundamentals and learn how to explore data, datascience@berkeley collected six exercises covering the basics of data analysis in Python.
Are You Ready to Start Your Python pandas Practice?
Consider the following questions to make sure you have the proper prior knowledge and coding environment to continue.
How much Python and pandas do I already need to know?
This problem set is intended for people who are already familiar with Python syntax, data types, and data structures. Each exercise focuses on a different set of operations or functionalities in pandas, and they progressively become more complex.
Readers who want to learn more about Python and pandas before starting can explore the following resources:
datascience@berkeley created a Google Colab notebook as a starting point for readers to execute their code. Google Colab is a free computational environment that allows anyone with an Internet connection to execute Python code via the browser.
This notebook contains the questions and corresponding solutions.
Python pandas Practice Problems
1. DataFrame Basic Properties Exercise
Our DataFrame (df) contains data on registered voters in the United States, including demographic information and political preference. Using pandas, print the first 5 rows of the DataFrame to get a sense of what the data looks like. Next, answer the following questions:
How many observations are in the DataFrame?
How many variables are measured (how many columns)?
What is the age of the youngest person in the data? The oldest?
How many days a week does the average respondent watch TV news (round to the nearest tenth)?
Check for missing values. Are there any?
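The questions above map onto a handful of DataFrame attributes and methods. The sketch below runs them on a small made-up DataFrame, not the voter data from the notebook; the column names age and TVnews are assumptions for illustration, so substitute the names used in the real df.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's df.
# The column names "age" and "TVnews" are assumptions.
df = pd.DataFrame({
    "age": [23, 41, 67, 35, 52, 19],
    "TVnews": [7, 3, 0, 5, 2, 4],
    "educ": [3, 4, 2, 6, 5, 3],
})

# Look at the first 5 rows.
print(df.head())

# shape is (number of rows, number of columns).
n_rows, n_cols = df.shape

# Youngest and oldest respondent.
youngest = df["age"].min()
oldest = df["age"].max()

# Average days of TV news per week, rounded to the nearest tenth.
avg_tv = round(df["TVnews"].mean(), 1)

# Count missing values per column, then check whether any exist at all.
missing_per_column = df.isna().sum()
has_missing = df.isna().any().any()

print(n_rows, n_cols, youngest, oldest, avg_tv, has_missing)
```

df.describe() and df.info() report much of the same information in a single call and are worth trying as a cross-check.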
2. Cleaning Data Exercise
We want to adjust the dataset for our use. Do the following:
Rename the educ column education.
Create a new column called party based on each respondent’s answer to PID. party should equal Democrat if the respondent selected either Strong Democrat or Weak Democrat, Republican if they selected either Strong Republican or Weak Republican, and Independent if they selected anything else.
Create a new column called age_group that buckets respondents into the following categories based on their age: 18-24, 25-34, 35-44, 45-54, 55-64, and 65 and over.
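One possible approach to these three steps uses rename(), map(), and pd.cut(). The sketch below works on made-up rows and assumes PID is stored as text labels; if the real column uses numeric codes, the keys of the mapping would need to change. The age column name is also an assumption.

```python
import pandas as pd

# Hypothetical toy data; PID stored as text labels is an assumption.
df = pd.DataFrame({
    "educ": [3, 5, 2, 6],
    "PID": ["Strong Democrat", "Weak Republican", "Independent", "Weak Democrat"],
    "age": [22, 47, 65, 34],
})

# 1. Rename educ to education.
df = df.rename(columns={"educ": "education"})

# 2. Map the PID answers to a party; anything not listed becomes Independent.
party_map = {
    "Strong Democrat": "Democrat",
    "Weak Democrat": "Democrat",
    "Strong Republican": "Republican",
    "Weak Republican": "Republican",
}
df["party"] = df["PID"].map(party_map).fillna("Independent")

# 3. Bucket ages. right=False makes each bin include its lower edge
#    and exclude its upper edge, so 25 falls in "25-34", not "18-24".
bins = [18, 25, 35, 45, 55, 65, float("inf")]
labels = ["18-24", "25-34", "35-44", "45-54", "55-64", "65 and over"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

print(df)
```

Passing right=False matters here because the category labels are written with inclusive lower bounds.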
3. Filtering Data Exercise
Use boolean filtering to find all the respondents who have the impression that Bill Clinton is moderate or conservative (ClinLR equals 4 or higher). How many respondents are in this subset?
Among these respondents, how many have a household income less than $50,000 and attended at least some college?
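Filtering in pandas usually means building a boolean mask and using it to index the DataFrame. The sketch below uses made-up rows and two assumptions that may not match the real data: income is stored in dollars, and education is a numeric code where a hypothetical value of 4 or higher means at least some college.

```python
import pandas as pd

# Hypothetical toy data. The encodings of income and education are assumptions.
df = pd.DataFrame({
    "ClinLR": [2, 4, 5, 6, 3, 4],
    "income": [30000, 45000, 80000, 20000, 60000, 52000],
    "education": [2, 5, 6, 3, 4, 5],
})

SOME_COLLEGE = 4  # hypothetical code for "some college"

# Respondents who see Clinton as moderate or conservative.
moderate_or_conservative = df[df["ClinLR"] >= 4]
n_subset = len(moderate_or_conservative)

# Combine conditions with & and wrap each one in parentheses.
mask = (
    (moderate_or_conservative["income"] < 50000)
    & (moderate_or_conservative["education"] >= SOME_COLLEGE)
)
n_lower_income_college = len(moderate_or_conservative[mask])

print(n_subset, n_lower_income_college)
```

The parentheses around each condition are required because & binds more tightly than the comparison operators in Python.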
4. Calculating From Data Exercise
For each of the match-ups below, choose the group that is more likely to vote for Bill Clinton. You can calculate this using the percentage of each group that intends to vote for Clinton (vote). Which match-up was the closest? Which had the biggest difference?
Democrats or Republicans
People younger than 44 or People 44 and older
People who watch TV news at least 6 days a week or People who watch TV news less than 3 days a week
People who live somewhere with a population greater than the average respondent or People who live in a place with a population equal to or less than the average respondent
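A minimal sketch of how one of these comparisons might be computed follows. It assumes vote is encoded as 1 for an intended Clinton vote and 0 otherwise, so that the mean of vote within a group is the share intending to vote for Clinton; the data and the helper clinton_share are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; vote == 1 meaning "intends to vote for Clinton"
# is an assumption about the encoding.
df = pd.DataFrame({
    "party": ["Democrat", "Democrat", "Republican",
              "Republican", "Democrat", "Republican"],
    "age": [30, 50, 40, 28, 44, 39],
    "vote": [1, 1, 0, 0, 1, 1],
})

def clinton_share(mask):
    """Percentage of the rows selected by mask that intend to vote for Clinton."""
    return df.loc[mask, "vote"].mean() * 100

# Absolute gap in Clinton support for each match-up.
gaps = {
    "party": abs(clinton_share(df["party"] == "Democrat")
                 - clinton_share(df["party"] == "Republican")),
    "age": abs(clinton_share(df["age"] < 44)
               - clinton_share(df["age"] >= 44)),
}

closest = min(gaps, key=gaps.get)
biggest = max(gaps, key=gaps.get)
print(gaps, closest, biggest)
```

The same pattern extends to the remaining match-ups; for the population comparison, the mask could compare a population column against its own mean().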
5. Grouping Data Exercise
Use the groupby() method to bucket respondents by age_group. Which age group is the most conservative? Which watches TV news the least?
Next, split respondents into five percentile groups (quintiles) based on income. Group the dataset by these quintiles. Which income bracket is the most liberal? Which is the most conservative? The oldest? The most highly educated?
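As a rough sketch of both steps, the example below groups made-up rows by age_group and then by income quintiles built with pd.qcut(). It assumes that higher selfLR values mean more conservative (consistent with the ClinLR scale described earlier) and that the TV news and income columns are named TVnews and income; the Q1 to Q5 labels are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names TVnews and income are assumptions.
df = pd.DataFrame({
    "age_group": ["18-24", "18-24", "25-34", "25-34", "35-44",
                  "35-44", "45-54", "45-54", "65 and over", "65 and over"],
    "selfLR": [2, 3, 3, 4, 4, 5, 5, 4, 6, 5],
    "TVnews": [1, 2, 3, 2, 4, 3, 5, 4, 7, 6],
    "income": [20000, 10000, 40000, 30000, 60000,
               50000, 80000, 70000, 100000, 90000],
})

# Average leaning and TV news habits per age group.
by_age = df.groupby("age_group")[["selfLR", "TVnews"]].mean()
most_conservative_age = by_age["selfLR"].idxmax()
least_tv_age = by_age["TVnews"].idxmin()

# qcut splits on quantiles, so each of the 5 groups holds
# roughly the same number of respondents.
df["income_group"] = pd.qcut(
    df["income"], q=5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"]
)
by_income = df.groupby("income_group", observed=True)["selfLR"].mean()
most_liberal_income = by_income.idxmin()
most_conservative_income = by_income.idxmax()

print(by_age)
print(by_income)
```

pd.qcut() is used rather than pd.cut() because the exercise asks for percentile-based groups, not fixed-width income ranges.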
6. Voting Across the Aisle
We are interested in learning more about respondents whose political views differ strongly from the candidate they expect to vote for. Using selfLR, vote, ClinLR, and DoleLR, work through the following questions. Your interpretation may differ from the answer key.
What is the largest recorded difference between a respondent’s political leaning and their impression of their intended candidate’s political leaning?
How many respondents exhibit a difference of that magnitude?
Make a separate DataFrame called sway that includes only the voters whose difference is greater than 3 in absolute value.
Among those in sway, are respondents more likely to be voting for a candidate more conservative or more liberal than their own political leaning?
In sway, which candidate is the more popular choice?
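One possible interpretation of these questions is sketched below on made-up rows. It assumes vote is 1 for Clinton and 0 for Dole, picks the matching candidate's score with np.where(), and measures the gap as candidate leaning minus the respondent's own leaning. The helper columns candidate_LR and diff are hypothetical names, and the answer key may frame the problem differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; vote == 1 for Clinton and 0 for Dole is an assumption.
df = pd.DataFrame({
    "selfLR": [1, 6, 4, 7, 2, 5],
    "vote":   [0, 1, 1, 1, 1, 0],
    "ClinLR": [2, 2, 3, 1, 3, 2],
    "DoleLR": [6, 6, 5, 6, 6, 6],
})

# Impression of the candidate each respondent intends to vote for.
df["candidate_LR"] = np.where(df["vote"] == 1, df["ClinLR"], df["DoleLR"])

# Positive diff: the candidate is seen as more conservative than the respondent.
df["diff"] = df["candidate_LR"] - df["selfLR"]

largest_gap = df["diff"].abs().max()
n_at_largest = int((df["diff"].abs() == largest_gap).sum())

# Voters whose gap is greater than 3 in absolute value.
sway = df[df["diff"].abs() > 3]

n_more_conservative = int((sway["diff"] > 0).sum())
n_more_liberal = int((sway["diff"] < 0).sum())
votes_in_sway = sway["vote"].value_counts()

print(largest_gap, n_at_largest, len(sway))
print(n_more_conservative, n_more_liberal)
print(votes_in_sway)
```

Keeping the sign of diff, rather than storing only the absolute value, is what makes the question about more conservative versus more liberal candidates answerable from the same column.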
Additional Python pandas Exercises
Here are some additional problem sets to work on data analysis in Python:
Copyright (c) 2022, UC Berkeley School of Information. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.