Understanding pandas.DataFrame.loc[] through 6 examples
Last updated: February 24, 2024
Table of Contents
- Introduction
- Creating a sample DataFrame
- Example 1: Basic selection
- Example 2: Select multiple rows
- Example 3: Slicing rows
- Example 4: Selecting rows and columns
- Example 5: Conditional selection
- Example 6: Setting values
- Advanced use: Combining with other methods
The pandas library in Python is a powerhouse for data manipulation and analysis. Among its many features, DataFrame.loc[] stands out for its ability to select data based on label information. This tutorial will guide you through understanding and utilizing loc[] with six comprehensive examples.
Preparation
Ensure you have pandas installed and imported in your Python environment:
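The original sample data is not reproduced here, so the following is a minimal sketch with a hypothetical DataFrame. The column names (Name, Age, City) and all values are assumptions chosen to fit the examples that follow.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative assumptions.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [25, 32, 37, 28],
        "City": ["New York", "Paris", "London", "Tokyo"],
    }
)
print(df)
```

With no explicit index, pandas assigns the default integer labels 0 through 3, which is what loc[] matches against here.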
Starting with the basics, you can select a single row:
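A minimal sketch of single-row selection, assuming a hypothetical DataFrame with Name, Age, and City columns and the default integer index:

```python
import pandas as pd

# Hypothetical sample data (assumption, not the article's original data).
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"],
     "Age": [25, 32, 37, 28],
     "City": ["New York", "Paris", "London", "Tokyo"]}
)

first_row = df.loc[0]  # 0 is an index label here, not a position
print(first_row)
```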
The output will show information for the first row, indexed at 0.
Selecting multiple rows by specifying a list of indices:
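As a sketch under the same assumptions (hypothetical Name, Age, City data with a default integer index), passing a list of labels returns a DataFrame:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"],
     "Age": [25, 32, 37, 28],
     "City": ["New York", "Paris", "London", "Tokyo"]}
)

rows = df.loc[[0, 2]]  # a list of labels selects several rows
print(rows)
```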
This will output rows 0 and 2.
You can slice rows using a colon:
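A short sketch of label slicing, again on hypothetical data with the default integer index:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"],
     "Age": [25, 32, 37, 28],
     "City": ["New York", "Paris", "London", "Tokyo"]}
)

sliced = df.loc[1:3]  # label slice: the stop label 3 is included
print(sliced)
```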
Because loc[] slices by label, both endpoints are included, so this slice returns rows 1 through 3.
More selective data access by specifying row and column labels:
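One way this might look, assuming hypothetical Name, Age, and City columns:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"],
     "Age": [25, 32, 37, 28],
     "City": ["New York", "Paris", "London", "Tokyo"]}
)

name = df.loc[0, "Name"]                    # single cell: row 0, column "Name"
subset = df.loc[[1, 3], ["Name", "City"]]   # rows 1 and 3, two columns
print(name)
print(subset)
```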
Outputs will show the name of the first person and names with cities of the second and fourth persons, respectively.
Using conditions to filter rows:
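A minimal sketch of boolean filtering, assuming a hypothetical Age column:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"],
     "Age": [25, 32, 37, 28],
     "City": ["New York", "Paris", "London", "Tokyo"]}
)

older = df.loc[df["Age"] > 30]  # the comparison yields a boolean Series used as a mask
print(older)
```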
This command lists all persons older than 30 years.
loc[] can also be used to modify data:
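A sketch of assignment through loc[], on the same hypothetical data:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"],
     "Age": [25, 32, 37, 28],
     "City": ["New York", "Paris", "London", "Tokyo"]}
)

df.loc[0, "Age"] = 29  # overwrite one cell in place
print(df.loc[0])
```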
The age for the first person has been updated to 29.
Combining loc[] with other pandas methods can unlock even more power. For instance, using loc[] with groupby() for aggregated data selection:
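One possible way to combine the two is to aggregate with groupby() first and then use loc[] on the aggregated result. The data below is hypothetical, with repeated cities so that grouping is meaningful:

```python
import pandas as pd

# Hypothetical sample data with repeated cities (assumption).
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"],
     "Age": [25, 32, 37, 28],
     "City": ["New York", "Paris", "New York", "Paris"]}
)

mean_age = df.groupby("City")["Age"].mean()   # Series indexed by city
above_30 = mean_age.loc[mean_age > 30]        # loc on the aggregated Series
print(above_30)
```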
Note: A GroupBy object has no loc[] indexer, so groupby().loc[] cannot be called directly. Aggregate first, then apply loc[] to the result, and adjust the details to your own data. The point here is the concept of combining loc[] with other methods.
The pandas.DataFrame.loc[] indexer is essential for precise data selection and manipulation. Through these examples, you’ve seen its versatility – from basic to more sophisticated data operations. Experiment with these techniques on your own data sets to discover the true power of pandas.
pandas: Get/Set values with loc, iloc, at, iat
You can use loc, iloc, at, and iat to access data in pandas.DataFrame and get/set values. Use square brackets [] as in loc[], not parentheses () as in loc().
- pandas.DataFrame.loc — pandas 2.0.3 documentation
- pandas.DataFrame.iloc — pandas 2.0.3 documentation
- pandas.DataFrame.at — pandas 2.0.3 documentation
- pandas.DataFrame.iat — pandas 2.0.3 documentation
The differences are as follows:
- at, loc: Row/Column name (label)
- iat, iloc: Row/Column number
- at, iat: Single value
- loc, iloc: Single or multiple values
- at, iat: Access and get/set a single value
- Access a single value
- Access multiple values using lists and slices
- Access rows and columns
- Mask by boolean array and pandas.Series
- Duplicated row/column names
- Specify by number and name
- Implicit type conversion when selecting a row as pandas.Series
You can also select rows and columns of pandas.DataFrame and elements of pandas.Series by indexing with [].
- pandas: Select rows/columns by index (numbers and names)
Note that the previously provided get_value() and ix[] were removed in version 1.0.
The sample code in this article is based on pandas version 2.0.3. The following pandas.DataFrame is used as an example.
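The original sample data is not available here, so this sketch builds a small stand-in. The row names, column names, and values are all assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the article's sample data (assumption).
df = pd.DataFrame(
    {"age": [24, 42, 18, 68],
     "state": ["NY", "CA", "CA", "TX"],
     "point": [64, 92, 70, 70]},
    index=["Alice", "Bob", "Charlie", "Dave"],
)
print(df)
```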
You can specify the row/column name in at. In addition to getting data, you can also set (assign) a new value.
You can specify the row/column number (0-based indexing) in iat.
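A brief sketch of both accessors, using a hypothetical DataFrame with named rows:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"age": [24, 42, 18, 68],
     "state": ["NY", "CA", "CA", "TX"],
     "point": [64, 92, 70, 70]},
    index=["Alice", "Bob", "Charlie", "Dave"],
)

print(df.at["Bob", "age"])  # get by row/column name
print(df.iat[1, 0])         # get the same cell by row/column number

df.at["Bob", "age"] = 43    # set by name
df.iat[2, 2] = 75           # set by number (row "Charlie", column "point")
```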
loc, iloc: Access and get/set single or multiple values
loc and iloc can access both single and multiple values using lists or slices. You can use row/column names for loc and row/column numbers for iloc.
You can access a single value with loc and iloc as well as with at and iat. However, at and iat are faster than loc and iloc.
In addition to retrieving data, you can also set a new value for the element.
With loc and iloc, you can access multiple values by specifying a group of data with a list [a, b, c, ...] or a slice start:stop:step.
Note that in the slice notation start:stop:step, the step is optional and can be omitted. For basic usage of slices, see the following article.
- How to slice a list, string, tuple in Python
When using the slice notation start:stop:step with loc (which uses row/column names), the stop value is inclusive. However, with iloc (which uses row/column numbers), the stop value is exclusive, following the typical behavior of standard Python slices.
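The difference can be seen in a small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"age": [24, 42, 18, 68], "point": [64, 92, 70, 70]},
    index=["Alice", "Bob", "Charlie", "Dave"],
)

by_name = df.loc["Alice":"Charlie"]  # stop label "Charlie" is included
by_number = df.iloc[0:2]             # stop position 2 is excluded

print(by_name)
print(by_number)
```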
When specified by a list, rows and columns follow the order of that list.
For example, you can extract odd/even rows by specifying step.
You can set multiple values simultaneously. If you assign a scalar value, all selected elements will be set to that value. For assigning values to a range, use a two-dimensional list (list of lists) or a two-dimensional NumPy array (ndarray).
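The points above (list order, step, and assigning several values at once) might be sketched as follows. The data and the assigned values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"age": [24, 42, 18, 68], "point": [64, 92, 70, 70]},
    index=["Alice", "Bob", "Charlie", "Dave"],
)

# The result follows the order of the lists.
reordered = df.loc[["Charlie", "Alice"], ["point", "age"]]

# A step of 2 picks every other row.
even_rows = df.iloc[::2]
odd_rows = df.iloc[1::2]

# A scalar is broadcast to every selected element.
df.loc["Alice":"Bob", "point"] = 0

# A list of lists assigns a 2x2 block element by element.
df.loc["Charlie":"Dave", ["age", "point"]] = [[1, 2], [3, 4]]
print(df)
```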
Note that selecting a row or a column by specifying it as a scalar value returns a Series, whereas the same row or column, specified as a slice or a list, returns a DataFrame.
In particular, be aware of potential implicit type conversions when retrieving rows as a Series. See below for details.
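A quick sketch of the return types, using hypothetical data:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"age": [24, 42, 18], "point": [64, 92, 70]},
    index=["Alice", "Bob", "Charlie"],
)

as_series = df.loc["Bob"]          # scalar label
as_frame_list = df.loc[["Bob"]]    # one-element list
as_frame_slice = df.loc["Bob":"Bob"]  # slice covering one row

print(type(as_series).__name__)
print(type(as_frame_list).__name__)
print(type(as_frame_slice).__name__)
```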
You can select rows and columns with df[]. They can be specified as:
- Rows: Slice of row name/number
- Columns: Column name or list of column names
For more information, see the following article.
You can specify rows and columns in various ways with loc and iloc.
If you omit the column specification with loc or iloc, rows are selected. You can specify them by row name/number or by a list of such names/numbers.
You can select columns with loc and iloc by specifying rows as a bare colon (:). Columns can also be specified by slice.
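As a sketch on hypothetical data, row-only and column-only selection could look like this:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"age": [24, 42, 18],
     "state": ["NY", "CA", "CA"],
     "point": [64, 92, 70]},
    index=["Alice", "Bob", "Charlie"],
)

rows = df.loc[["Alice", "Bob"]]   # columns omitted: whole rows
one_col = df.loc[:, "age"]        # all rows, one column (Series)
col_slice = df.iloc[:, 0:2]       # all rows, first two columns by number

print(rows)
print(one_col)
print(col_slice)
```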
As mentioned above, specifying a single row or column with a scalar value returns a Series, while using a slice or list returns a DataFrame.
Note that selecting a row as pandas.Series may result in implicit type conversion. See below for details.
With loc and iloc, you can use a boolean array or list to filter data. While the following example demonstrates row filtering, the same approach can be applied to columns.
If the number of elements does not match, an error is raised.
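A minimal sketch of masking with a plain boolean list, assuming hypothetical data. The list has exactly one entry per row.

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"age": [24, 42, 18, 68], "point": [64, 92, 70, 70]},
    index=["Alice", "Bob", "Charlie", "Dave"],
)

mask = [True, False, True, False]  # one boolean per row

print(df.loc[mask])
print(df.iloc[mask])
```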
You can also use a boolean Series with loc for filtering. Note that the filtering is based on matching labels, not on the order of the data.
You cannot specify a Series in iloc.
Even with loc, an error is raised if the labels do not match.
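To illustrate the label matching, this sketch uses a hypothetical boolean Series whose index is in a different order from the DataFrame. Converting the Series to a NumPy array for iloc is one possible workaround, and it then filters by position rather than by label.

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"age": [24, 42, 18, 68]},
    index=["Alice", "Bob", "Charlie", "Dave"],
)

# Same labels as df, but in reverse order.
mask = pd.Series(
    [True, False, True, False],
    index=["Dave", "Charlie", "Bob", "Alice"],
)

by_label = df.loc[mask]                # True for "Dave" and "Bob", matched by label
by_position = df.iloc[mask.to_numpy()] # plain array: positions 0 and 2

print(by_label)
print(by_position)
```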
Both row names (index) and column names (columns) can have duplicates.
Consider the following DataFrame with duplicate row and column names as an example.
For at and loc, specifying duplicate names selects the corresponding multiple elements.
When using iat and iloc to specify by row/column number, duplicated names are not an issue because they operate based on position.
To avoid confusion, it's advisable to use unique values for row and column names unless there's a compelling reason otherwise.
You can check whether row and column names are unique (not duplicated) with index.is_unique and columns.is_unique.
- pandas.Index.is_unique — pandas 2.0.3 documentation
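The behavior with duplicate names can be sketched as follows. For brevity, this hypothetical example duplicates only a row name.

```python
import pandas as pd

# Hypothetical data with a duplicated row name "x" (assumption).
df_dup = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6]},
    index=["x", "x", "y"],
)

both_rows = df_dup.loc["x"]   # the label matches two rows, so a DataFrame is returned
first_row = df_dup.iloc[0]    # position-based: always exactly one row

print(both_rows)
print(df_dup.index.is_unique)    # False
print(df_dup.columns.is_unique)  # True
```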
See the following article on how to rename row and column names.
- pandas: Rename column/index names of DataFrame
If you want to specify by both number and name, use at or loc in combination with the index or columns attributes.
You can retrieve row or column names based on their number using the index and columns attributes.
For index and columns, you can use slices and lists to retrieve multiple names.
Using this together with at or loc, you can specify by number and name.
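A short sketch of mixing numbers and names, on hypothetical data:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"age": [24, 42, 18, 68], "point": [64, 92, 70, 70]},
    index=["Alice", "Bob", "Charlie", "Dave"],
)

second_name = df.index[1]                    # row number 1 -> its name
value = df.at[df.index[1], "age"]            # row by number, column by name
block = df.loc[df.index[[0, 2]], ["point"]]  # several rows by number

print(second_name)
print(value)
print(block)
```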
Using indexing operations in succession, such as df[...][...], df.loc[...].iloc[...], and other similar patterns, is known as "chained indexing". This approach can trigger a SettingWithCopyWarning.
- pandas: How to fix SettingWithCopyWarning: A value is trying to be set on ...
While this approach causes no issues during simple data retrieval and checking, be cautious as assigning new values might yield unexpected results.
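As a hedged sketch on hypothetical data: chained indexing is used only for reading here, while the assignment goes through a single loc call, which avoids the ambiguity described above.

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame(
    {"age": [24, 42, 18], "point": [64, 92, 70]},
    index=["Alice", "Bob", "Charlie"],
)

chained = df["age"]["Bob"]     # chained indexing: fine for reading
direct = df.loc["Bob", "age"]  # the same value in one indexing step

df.loc["Bob", "age"] = 43      # assign with a single loc call instead of chaining
print(chained, direct, df.loc["Bob", "age"])
```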
If the columns of the original DataFrame have different data types, then when selecting a row as a Series with loc or iloc, the data type of the elements in the selected Series might differ from the data types in the original DataFrame.
- pandas: How to use astype() to cast dtype of DataFrame
Consider a DataFrame with columns of integers (int) and floating point numbers (float).
If you retrieve a row as a Series using loc or iloc, its data type becomes float. Elements in int columns are converted to float.
If you execute the following code, the element is returned as float.
You can get elements of the original type with at or iat.
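The conversion can be observed in a small sketch. The column names col_int and col_float are hypothetical.

```python
import pandas as pd

# Hypothetical data mixing an int column and a float column (assumption).
df_mix = pd.DataFrame(
    {"col_int": [0, 1, 2], "col_float": [0.1, 0.2, 0.3]},
    index=["A", "B", "C"],
)

row = df_mix.loc["B"]           # one row as a Series: a single dtype for all elements
from_row = row["col_int"]       # already converted to float
from_at = df_mix.at["B", "col_int"]  # read directly from the int column

print(row.dtype)
print(type(from_row).__name__, type(from_at).__name__)
```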
When a row is selected using a list or slice with loc or iloc, a DataFrame is returned instead of a Series.
Set Pandas Conditional Column Based on Values of Another Column
- August 9, 2021 February 22, 2022
There are many times when you may need to set a Pandas column value based on the condition of another column. In this post, you’ll learn all the different ways in which you can create Pandas conditional columns.
Loading a Sample Dataframe
Let’s begin by loading a sample Pandas dataframe that we can use throughout this tutorial.
We’ll begin by importing pandas and loading a dataframe using the .from_dict() method:
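The original data is not reproduced here, so this is a sketch with hypothetical values. The column names Name, Age, and City are assumptions chosen to fit the rest of the tutorial.

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame.from_dict(
    {
        "Name": ["Jane", "Melissa", "John", "Matt"],
        "Age": [23, 45, 35, 64],
        "City": ["London", "Paris", "Toronto", "Atlanta"],
    }
)
print(df)
```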
This returns the following dataframe:
Using Pandas loc to Set Pandas Conditional Column
Pandas loc is incredibly powerful! If you need a refresher on loc (or iloc), check out my tutorial here. With loc, you can pass a boolean mask built from a condition. Sometimes you simply select rows and columns by label, but a mask also lets you filter dataframes. These filtered rows can then have values assigned to them.
Let’s explore the syntax a little bit:
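In general terms, the pattern is df.loc[condition, column] = value. The sketch below shows it on a hypothetical score column; the names and the threshold are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data (assumption).
df = pd.DataFrame({"score": [40, 75, 90]})

# Pattern: df.loc[<boolean condition>, <column label>] = <value>
df.loc[df["score"] < 50, "score"] = 50  # raise every score below 50 up to 50
print(df)
```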
With the syntax above, we filter the dataframe using .loc and then assign a value to any row in the column (or columns) where the condition is met.
Let’s try this out by assigning the string ‘Under 30’ to anyone with an age less than 30, and ‘Over 30’ to anyone 30 or older.
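A minimal sketch of this, assuming a hypothetical dataframe with an Age column:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame.from_dict(
    {"Name": ["Jane", "Melissa", "John", "Matt"], "Age": [23, 45, 35, 64]}
)

df["Age Category"] = "Over 30"                        # default for every row
df.loc[df["Age"] < 30, "Age Category"] = "Under 30"   # overwrite where the condition holds
print(df)
```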
Let's take a look at what we did here:
- We assigned the string 'Over 30' to every record in the dataframe. To learn more about this, check out my post here on creating new columns.
- We then use .loc to create a boolean mask on the Age column to filter down to rows where the age is less than 30. When this condition is met, the Age Category column is assigned the new value 'Under 30'
But what happens when you have multiple conditions? You could, of course, use .loc multiple times, but this is difficult to read and fairly unpleasant to write. Let's see how we can accomplish this using numpy's select() function.
Using Numpy Select to Set Values using Multiple Conditions
Similar to using .loc to create a conditional column in Pandas, we can use numpy's select() function.
Let's begin by importing numpy, giving it the conventional alias np:
Now, say we wanted to apply a number of different age groups, as below:
- <20 years old,
- 20-39 years old,
- 40-59 years old,
- 60+ years old
In order to do this, we'll create a list of conditions and corresponding values to fill:
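One way this might look, using hypothetical ages. The default label "Unknown" is an assumption; it is passed explicitly so that the fallback has the same type as the string labels.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame.from_dict(
    {"Name": ["Jane", "Melissa", "John", "Matt"], "Age": [23, 45, 35, 64]}
)

# Conditions and values are matched by position.
conditions = [
    df["Age"] < 20,
    (df["Age"] >= 20) & (df["Age"] < 40),
    (df["Age"] >= 40) & (df["Age"] < 60),
    df["Age"] >= 60,
]
values = ["<20 years old", "20-39 years old", "40-59 years old", "60+ years old"]

df["Age Group"] = np.select(conditions, values, default="Unknown")
print(df)
```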
Running this returns the following dataframe:
Let's break down what happens here:
- We first define a list of conditions in which the criteria are specified. Recall that lists are ordered, so the conditions should appear in the same order as the values you want them to produce.
- We then define a list of values to use, which corresponds to the values you'd like applied in your new column.
Something to consider here is that this can be a bit counterintuitive to write. You can similarly define a function to apply different values. We'll cover this in the section on using the Pandas .apply() method below.
One of the key benefits is that using numpy is very fast, especially when compared to using the .apply() method.
Using Pandas Map to Set Values in Another Column
The Pandas .map() method is very helpful when you're applying labels to another column. In order to use this method, you define a dictionary to apply to the column.
For our sample dataframe, let's imagine that we have offices in America, Canada, and France. We want to map the cities to their corresponding countries and apply an "Other" value for any other city.
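A sketch of the mapping step, with a hypothetical City column and a hypothetical city-to-country dictionary:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame.from_dict(
    {"Name": ["Jane", "Melissa", "John", "Matt"],
     "City": ["London", "Paris", "Toronto", "Atlanta"]}
)

# Hypothetical lookup table; London is deliberately left out.
city_dict = {"Paris": "France", "Toronto": "Canada", "Atlanta": "USA"}

df["Country"] = df["City"].map(city_dict)  # cities without an entry become missing values
print(df)
```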
When we print this out, we get the following dataframe returned:
What we can see here, is that there is a NaN value associated with any City that doesn't have a corresponding country. If we want to apply "Other" to any missing values, we can chain the .fillna() method:
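Chaining .fillna() onto the mapping might look like this, again with a hypothetical City column and lookup dictionary:

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame.from_dict(
    {"Name": ["Jane", "Melissa", "John", "Matt"],
     "City": ["London", "Paris", "Toronto", "Atlanta"]}
)

# Hypothetical lookup table; London is deliberately left out.
city_dict = {"Paris": "France", "Toronto": "Canada", "Atlanta": "USA"}

# Map first, then replace the missing results with "Other".
df["Country"] = df["City"].map(city_dict).fillna("Other")
print(df)
```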
Using Pandas Apply to Apply a function to a column
Finally, you can apply built-in or custom functions to a dataframe using the Pandas .apply() method.
Let's take a look at both applying built-in functions such as len() and even applying custom functions.
Applying Python Built-in Functions to a Column
We can easily apply a built-in function using the .apply() method. Let's see how we can use the len() function to count the length of each string in a given column.
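A minimal sketch, assuming a hypothetical Name column. The new column name is also an assumption.

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame.from_dict({"Name": ["Jane", "Melissa", "John", "Matt"]})

df["Name Length"] = df["Name"].apply(len)  # len is passed as an object, not called
print(df)
```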
Take note of a few things here:
- We apply the .apply() method to a particular column,
- We pass the function itself (len) and omit the parentheses "()"
Using Third-Party Packages in Pandas Apply
Similarly, you can use functions from third-party packages. Let's use numpy's sqrt() function to find the square root of a person's age.
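As a sketch with hypothetical ages (the new column name is an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame.from_dict(
    {"Name": ["Jane", "Melissa", "John", "Matt"], "Age": [23, 45, 35, 64]}
)

df["Age Square Root"] = df["Age"].apply(np.sqrt)
print(df)
```

For a NumPy function like this one, calling np.sqrt(df["Age"]) directly gives the same values without going through .apply().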
Using Custom Functions with Pandas Apply
Something that makes the .apply() method extremely powerful is the ability to define and apply your own functions.
Let's revisit how we could use an if-else statement to create age categories as in our earlier example:
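One possible version of such a function is sketched below. The function name age_groups and the sample ages are hypothetical.

```python
import pandas as pd

# Hypothetical sample data (assumption).
df = pd.DataFrame.from_dict(
    {"Name": ["Jane", "Melissa", "John", "Matt"], "Age": [23, 45, 35, 64]}
)


def age_groups(age):
    """Return an age-group label for a single age (hypothetical helper)."""
    if age < 20:
        return "<20 years old"
    elif age < 40:
        return "20-39 years old"
    elif age < 60:
        return "40-59 years old"
    else:
        return "60+ years old"


df["Age Group"] = df["Age"].apply(age_groups)  # called once per value
print(df)
```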
In this post, you learned a number of ways in which you can apply values to a dataframe column to create a Pandas conditional column, including using .loc, np.select(), Pandas .map(), and Pandas .apply(). Each of these methods has a different use case that we explored throughout this post.
Learn more about Pandas methods covered here by checking out their official documentation:
- Pandas Apply
- Numpy Select
Nik Piepenbreier
Nik is the author of datagy.io and has over a decade of experience working with data analytics, data science, and Python. He specializes in teaching developers how to use Python for data science using hands-on tutorials.