Pandas – Drop Rows by Label | Index
Shishir Kant Singh – Tue, 28 Jan 2025

By using the pandas.DataFrame.drop() method you can drop/remove/delete rows from a DataFrame. The axis param specifies which axis to remove from. By default axis=0, meaning remove rows; use axis=1 or the columns param to remove columns. By default, Pandas returns a copy of the DataFrame after deleting rows; use inplace=True to remove rows from the existing (referring) DataFrame.

In this article, I will cover how to remove rows by labels, indexes, and ranges, how to drop rows in place, and how to drop rows with None/NaN & Null values, with examples. If you have duplicate rows, use drop_duplicates() to drop duplicate rows from a pandas DataFrame.

Key Points –

  • Use the drop() method to remove rows by specifying the row labels or indices.
  • Set the axis parameter to 0 (or omit it) to indicate that rows should be dropped.
  • Use the inplace parameter to modify the original DataFrame directly without creating a new one.
  • After dropping rows, consider resetting the index with reset_index() to maintain sequential indexing.
  • Set the errors parameter to ‘ignore’ to suppress errors when attempting to drop non-existent row labels.
  • Leverage the query() method to filter and drop rows based on complex conditions.
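The last point above can be sketched as follows; this is a minimal example with illustrative column names (Fee, Discount), not the article's dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [20000, 25000, 26000],
    "Discount": [1000, 2300, 1500],
})

# query() keeps the rows matching the condition; everything else is dropped
kept = df.query("Fee > 21000 and Discount >= 1500")

# Equivalent: drop the rows matching the inverse condition
dropped = df.drop(df.query("not (Fee > 21000 and Discount >= 1500)").index)
```

Since query() selects rows to keep, "dropping by condition" is expressed either by negating the condition or by passing the matching index to drop().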

Pandas.DataFrame.drop() Syntax – Drop Rows & Columns

Let’s look at the syntax of the DataFrame drop() function.


# Pandas DataFrame drop() Syntax
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Parameters

  • labels – Single label or list-like. It’s used with axis param.
  • axis – Default sets to 0. 1 to drop columns and 0 to drop rows.
  • index – Use to specify rows. Accepts single label or list-like.
  • columns – Use to specify columns. Accepts single label or list-like.
  • level – int or level name, optional, use for Multiindex.
  • inplace – Default False, returns a copy of the DataFrame. When True, it drops rows/columns in place (on the current DataFrame) and returns None.
  • errors – {‘ignore’, ‘raise’}, default ‘raise’.
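As a quick sketch of the errors param (the labels r2 and r9 here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=["r1", "r2", "r3"])

# errors='ignore' drops the labels that exist and silently skips the rest
df2 = df.drop(["r2", "r9"], errors="ignore")

# errors='raise' (the default) throws a KeyError for a missing label
try:
    df.drop(["r9"])
    raised = False
except KeyError:
    raised = True
```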

Let’s create a DataFrame, run some examples, and explore the output. Note that our DataFrame contains index labels for rows which I am going to use to demonstrate removing rows by labels.


# Create a DataFrame
import pandas as pd
import numpy as np

technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python"],
    'Fee' :[20000,25000,26000,22000],
    'Duration':['30day','40days',np.nan, None],
    'Discount':[1000,2300,1500,1200]
               }

indexes=['r1','r2','r3','r4']
df = pd.DataFrame(technologies,index=indexes)
print(df)

The output shows the DataFrame with row index labels r1 through r4.

Pandas Drop Rows From DataFrame Examples

By default drop() method removes rows (axis=0) from DataFrame. Let’s see several examples of how to remove rows from DataFrame.

Drop rows by Index Labels or Names

One of Pandas’ advantages is that you can assign labels/names to rows, similar to column names. If you have a DataFrame with row labels (index labels), you can specify which rows you want to remove by their label names.


# Drop rows by Index Label
df = pd.DataFrame(technologies,index=indexes)
df1 = df.drop(['r1','r2'])
print("Drop rows from DataFrame:\n", df1)

The output shows the DataFrame with rows r1 and r2 removed.

Alternatively, you can also write the same statement by using the field name 'index'.


# Delete Rows by Index Labels
df1 = df.drop(index=['r1','r2'])

And by using labels and axis as below.


# Delete Rows by Index Labels & axis
df1 = df.drop(labels=['r1','r2'])
df1 = df.drop(labels=['r1','r2'],axis=0)

Notes:

  • As you can see, using labels with axis=0 is equivalent to using index=label names.
  • axis=0 means rows. By default, the drop() method considers axis=0, hence you don’t have to specify it to remove rows. To remove columns, explicitly specify axis=1 or the columns param.

Drop Rows by Index Number (Row Number)

Similarly, by using the drop() method you can also remove rows by index position from a pandas DataFrame. drop() doesn’t have a position index as a param, hence we need to get the row labels from the index and pass those to the drop method. We will use df.index to get the row labels for the indexes we want to delete.

  • df.index.values returns all row labels as an array.
  • df.index[[1,3]] gets the row labels for the 2nd and 4th rows; passing these to the drop() method removes those rows. Note that in Python, the index starts from zero.

# Delete Rows by Index numbers
df = pd.DataFrame(technologies,index=indexes)
df1=df.drop(df.index[[1,3]])
print(df1)

This drops rows r2 and r4. In order to drop the first row, you can use df.drop(df.index[0]), and to drop the last row use df.drop(df.index[-1]).


# Removes First Row
df=df.drop(df.index[0])

# Removes Last Row
df=df.drop(df.index[-1])

Delete Rows by Index Range

You can also remove rows by specifying an index range. The below example removes all rows starting from the 3rd row.


# Delete Rows by Index Range
df = pd.DataFrame(technologies,index=indexes)
df1=df.drop(df.index[2:])
print(df1)

Yields below output.


# Output:
    Courses    Fee Duration  Discount
r1    Spark  20000    30day      1000
r2  PySpark  25000   40days      2300

Delete Rows when you have Default Index

By default, pandas assigns a sequential number to each row, called the index; it starts from zero and increments by 1 for every row. If you are not using custom index labels, the pandas DataFrame assigns these sequence numbers as the index. To remove rows with the default index, you can try the below.


# Remove rows when you have default index.
df = pd.DataFrame(technologies)
df1 = df.drop(0)
df3 = df.drop([0, 3])
df4 = df.drop(range(0,2))

Note that df.drop(-1) doesn’t remove the last row, as the -1 label is not present in the DataFrame. You can still use df.drop(df.index[-1]) to remove the last row.
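As the key points mention, after dropping rows with the default index the remaining index keeps gaps; reset_index() restores sequential numbering. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"Courses": ["Spark", "PySpark", "Hadoop", "Python"]})

# Dropping rows 0 and 2 leaves index values 1 and 3
df1 = df.drop([0, 2])

# drop=True discards the old index instead of keeping it as a column
df2 = df1.reset_index(drop=True)
```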

Remove DataFrame Rows Inplace

All the examples you have seen above return a copy of the DataFrame after removing rows. If you want to remove rows in place from the referring DataFrame, use inplace=True. By default, the inplace param is set to False.


# Delete Rows inplace
df = pd.DataFrame(technologies,index=indexes)
df.drop(['r1','r2'],inplace=True)
print(df)

Drop Rows by Checking Conditions

Most of the time we would also need to remove DataFrame rows based on some condition (column value). You can do this by using the loc[] and iloc[] properties to select the rows you want to keep, which effectively drops the rest.


# Delete Rows by Checking Conditions
df = pd.DataFrame(technologies)
df1 = df.loc[df["Discount"] >=1500 ]
print(df1)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
1  PySpark  25000   40days      2300
2   Hadoop  26000      NaN      1500
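Note that the loc[] example above selects the rows to keep. To drop the rows that match a condition instead, negate the condition or pass the matching index to drop(); a sketch using the same columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python"],
    "Discount": [1000, 2300, 1500, 1200],
})

# Keep the rows that do NOT satisfy the condition
df1 = df.loc[df["Discount"] < 1500]

# Equivalent: drop the rows that do satisfy it
df2 = df.drop(df[df["Discount"] >= 1500].index)
```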

Drop Rows that have NaN/None/Null Values

While working with analytics you would often be required to clean up data that has None, Null & np.nan values. By using df.dropna() you can remove rows with NaN values from a DataFrame.


# Delete rows with Nan, None & Null Values
df = pd.DataFrame(technologies,index=indexes)
df2=df.dropna()
print(df2)

This removes all rows that have None, Null & NaN values on any columns.


# Output:
    Courses    Fee Duration  Discount
r1    Spark  20000    30day      1000
r2  PySpark  25000   40days      2300
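dropna() also takes parameters that control which rows are removed; a brief sketch (the column values are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Duration": ["30day", np.nan, None],
    "Discount": [1000, np.nan, 1500],
})

# Drop rows where ANY column is NaN/None (the default, how='any')
df1 = df.dropna()

# Drop only rows where specific columns are missing
df2 = df.dropna(subset=["Duration"])

# Drop only rows where ALL columns are missing
df3 = df.dropna(how="all")
```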

Remove Rows by Slicing DataFrame

You can also drop a list of DataFrame rows by slicing. Remember index starts from zero.


# Remove Rows by Slicing DataFrame
df2=df[1:]     # Drops the first row
df2=df[1:-1]   # Drops the first and last rows
df2=df[2:4]    # Keeps only rows at positions 2 and 3 (drops the rest)
Pandas – Rename Column
Shishir Kant Singh – Tue, 28 Jan 2025

The Pandas DataFrame.rename() function is used to change a single column name, change multiple columns, rename by index position, rename in place, rename with a list, with a dict, rename all columns, etc. We are often required to change the column names of a DataFrame before we perform any operations on it. In fact, changing the name of a column is one of the most searched and used functions in Pandas.


The good thing about this function is it provides a way to rename a specific single column.

In this pandas article, you will learn several ways to rename a column name of a Pandas DataFrame, with examples, by using functions like DataFrame.rename(), DataFrame.set_axis(), DataFrame.add_prefix(), DataFrame.add_suffix() and more.

Key Points –

  • Use the rename() method to rename columns, specifying a dictionary that maps old column names to new ones.
  • The rename() method does not modify the original DataFrame by default; to apply changes directly, set the inplace parameter to True.
  • Column renaming is case-sensitive, so ensure the exact match of column names when renaming.
  • To rename a single column, use the columns parameter in the rename() method with a dictionary that maps only the specific column.
  • Use the DataFrame.columns attribute to directly modify column names by assigning a new list of names. Ensure the new list has the same length as the original.

1. Quick Examples Rename Columns of DataFrame

If you are in a hurry, below are some quick examples of renaming column names in Pandas DataFrame.

Pandas Rename Scenario – Rename Column Example

  • Rename columns with a list – df.columns=['A','B','C']
  • Rename a column name by index – df.columns.values[2] = "C"
  • Rename columns using a dict – df2=df.rename(columns={'a': 'A', 'b': 'B'})
  • Rename columns using a dict & axis – df2=df.rename({'a': 'A', 'b': 'B'}, axis=1) or df2=df.rename({'a': 'A', 'b': 'B'}, axis='columns')
  • Rename columns in place – df.rename(columns={'a': 'A', 'b': 'B'}, inplace=True)
  • Rename using a lambda function – df.rename(columns=lambda x: x[1:], inplace=True)
  • Rename with an error on missing labels – df.rename(columns={'x':'X'}, errors="raise")
  • Rename using set_axis() – df2=df.set_axis(['A','B','C'], axis=1)

Pandas Rename Column(s) Examples

Now let’s see the Syntax and examples.

2. Syntax of Pandas DataFrame.rename()

Following is the syntax of the pandas.DataFrame.rename() method; this returns either a DataFrame or None. By default it returns a pandas DataFrame after renaming columns. When inplace=True is used, it updates the existing DataFrame in place (self) and returns None.


#  DataFrame.rename() Syntax
DataFrame.rename(mapper=None, index=None, columns=None, axis=None, 
       copy=True, inplace=False, level=None, errors='ignore')

2.1 Parameters

The following are the parameters.

  • mapper – dictionary or function to rename columns and indexes.
  • index – dictionary or function to rename the index. Using (mapper, axis=0) is equivalent to index=mapper.
  • columns – dictionary or function to rename columns. Using (mapper, axis=1) is equivalent to columns=mapper.
  • axis – Value can be either 0 or ‘index’, or 1 or ‘columns’. Default is 0.
  • copy – Copies the data as well. Default set to True.
  • inplace – Used to specify the DataFrame referred to be updated. Default to False. When used True, copy property will be ignored.
  • level – Used with MultiIndex. Takes Integer value. Default set to None.
  • errors – Takes the value raise or ignore. If ‘raise’ is used, a KeyError is raised when a dict-like mapper, index, or columns contains labels that are not present in the Index being transformed. If ‘ignore’ is used, existing keys will be renamed and extra keys will be ignored. Default is ignore.

Let’s create a DataFrame with a dictionary of lists; our pandas DataFrame contains the column names Courses, Fee, and Duration.


import pandas as pd
technologies = ({
  'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
  'Fee' :[20000,25000,26000,22000,24000,21000,22000],
  'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
              })
df = pd.DataFrame(technologies)
print(df.columns)

Yields Index(['Courses', 'Fee', 'Duration'], dtype='object').

3. Pandas Rename Column Name

In order to rename a single column name in a pandas DataFrame, you can use the columns={} parameter with a dictionary mapping of the old name to the new name. Note that when you use the columns param, you cannot explicitly use the axis param.

pandas DataFrame.rename() accepts a dict (dictionary) as a param for the columns you want to rename, so you just pass a dict with key-value pairs; the key is the existing column name you would like to rename and the value is your preferred new column name.


# Rename a Single Column 
df2=df.rename(columns = {'Courses':'Courses_List'})
print(df2.columns)

As you can see, it renames the column from Courses to Courses_List: Index(['Courses_List', 'Fee', 'Duration'], dtype='object').

Alternatively, you can also write the above statement by using axis=1 or axis='columns'.


# Alternatively you can write above code using axis
df2=df.rename({'Courses':'Courses_List'}, axis=1)
df2=df.rename({'Courses':'Courses_List'}, axis='columns')

In order to change columns on the existing DataFrame without copying to the new DataFrame, you have to use inplace=True.


# Replace existing DataFrame (inplace). This returns None
df.rename({'Courses':'Courses_List'}, axis='columns', inplace=True)
print(df.columns)

4. Rename Multiple Columns

You can also use the same approach to rename multiple columns of a Pandas DataFrame. All you need to do is specify the columns you want to rename in a dictionary mapping.


# Rename multiple columns
df.rename(columns = {'Courses':'Courses_List','Fee':'Courses_Fee', 
   'Duration':'Courses_Duration'}, inplace = True)
print(df.columns)

As you can see, it renames multiple columns: Index(['Courses_List', 'Courses_Fee', 'Courses_Duration'], dtype='object').

5. Rename the Column by Index or Position

To rename a column by index/position, use df.columns.values[index]='value'. Index and position can be used interchangeably to access a column at a given position; using this you can rename any column, from the first to the last.

As you can see from the above, df.columns returns the column names as a pandas Index, and df.columns.values gets the column names as an array; you can then set a specific index/position to a new value. The below example renames the column at index 2, Duration, to Courses_Duration. Note that the column index starts from zero.


# Pandas rename column by index
df.columns.values[2] = "Courses_Duration"
print(df.columns)

# Output:
# Index(['Courses', 'Fee', 'Courses_Duration'], dtype='object')

6. Rename Columns with a List

A Python list can be used to rename all columns in a pandas DataFrame; the length of the list should be the same as the number of columns in the DataFrame. Otherwise, an error occurs.


# Rename columns with list
column_names = ['Courses_List','Courses_Fee','Courses_Duration']
df.columns = column_names
print(df.columns)

Yields below output.


# Output:
Index(['Courses_List', 'Courses_Fee', 'Courses_Duration'], dtype='object')

7. Rename Columns Inplace

By default, the rename() function returns a new pandas DataFrame after updating the column names; you can change this behavior and rename in place by using the inplace=True param.


# Rename multiple columns
df.rename(columns = {'Courses':'Courses_List','Fee':'Courses_Fee', 
   'Duration':'Courses_Duration'}, inplace = True)
print(df.columns)

This renames column names on DataFrame in place and returns the None type.

8. Rename All Columns by adding Suffixes or Prefix

Sometimes you may need to add a string text to the suffix or prefix of all column names. You can do this by getting all columns one by one in a loop and adding a suffix or prefix string.


# Rename All Column Names by adding Suffix or Prefix
df.columns = ['col_'+str(col) for col in df.columns]

You can also use pandas.DataFrame.add_prefix() and pandas.DataFrame.add_suffix() to add prefixes and suffixes respectively to the pandas DataFrame column names.


# Add prefix to the column names
df2=df.add_prefix('col_')
print(df2.columns)

# Add suffix to the column names
df2=df.add_suffix('_col')
print(df2.columns)

Yields below output.


# Output:
Index(['col_Courses', 'col_Fee', 'col_Duration'], dtype='object')
Index(['Courses_col', 'Fee_col', 'Duration_col'], dtype='object')

9. Rename the Column using the Lambda Function

You can also change column names using a Pandas lambda expression; this gives us more control and lets us apply custom functions. The below example adds a ‘col_’ string to all column names. You can also use it, for example, to remove spaces from column names.


# Rename using Lambda function
df.rename(columns=lambda x: 'col_'+x, inplace=True)
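As mentioned, the same lambda approach can clean up column names, for example removing spaces; a sketch with hypothetical messy names:

```python
import pandas as pd

df = pd.DataFrame({" Courses ": ["Spark"], "Course Fee": [20000]})

# Strip surrounding spaces and replace inner spaces with underscores
df.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
```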

10. Rename or Convert All Columns to Lower or Upper Case

When column names are mixed with lower and upper case, it would be best practice to convert/update all column names to either lower or upper case.


# Change to all lower case
df = pd.DataFrame(technologies)
df2=df.rename(str.lower, axis='columns')
print(df2.columns)

# Change to all upper case
df = pd.DataFrame(technologies)
df2=df.rename(str.upper, axis='columns')
print(df2.columns)

Yields below output.


# Output:
Index(['courses', 'fee', 'duration'], dtype='object')
Index(['COURSES', 'FEE', 'DURATION'], dtype='object')

11. Change Column Names Using DataFrame.set_axis()

By using DataFrame.set_axis() you can also change the column names. Note that with set_axis() you need to assign all column names; this updates the DataFrame with a new set of column names. set_axis() is also used to rename the pandas DataFrame index. Note that the inplace parameter of set_axis() was removed in pandas 2.0, so assign the result back to the DataFrame.


# Change column name using set_axis()
df = df.set_axis(['Courses_List', 'Course_Fee', 'Course_Duration'], axis=1)
print(df.columns)

12. Using String replace()

Pandas df.columns.str.replace() is used to replace a string or a regex pattern in the column labels. For example, df.columns = df.columns.str.replace("Fee", "Courses_Fee") replaces 'Fee' in the column names with 'Courses_Fee'.


# Change column name using String.replace()
df.columns = df.columns.str.replace("Fee","Courses_Fee")
print(df.columns)

Yields below output.


# Output:
Index(['Courses', 'Courses_Fee', 'Duration'], dtype='object')

You can also replace a substring in all column names at once with str.replace(). Given a DataFrame with column names ('Courses_List', 'Course_Fee', 'Course_Duration'), applying str.replace() over the columns below replaces the underscores ("_") with spaces (" ").


# Rename all column names
df.columns = df.columns.str.replace("_"," ")
print(df.columns)

Yields below output.


# Output:
Index(['Courses List', 'Course Fee', 'Course Duration'], dtype='object')

13. Raise Error when Column Not Found

By default, when a column label passed to rename() is not found in the Pandas DataFrame, the rename() method just ignores that column. If you want to throw an error when a column is not found, use errors="raise".


# Throw Error when Rename column doesn't exists.
df.rename(columns = {'Cour':'Courses_List'}, errors = "raise")

Yields the below KeyError.


# Output:
raise KeyError("{} not found in axis".format(missing_labels))
KeyError: "['Cour'] not found in axis"

14. Rename Only If the Column Exists

This example changes the Courses column to Courses_List, and it doesn’t update Fees as we don’t have a Fees column. Note that even though the Fees column does not exist, no error is raised even with errors="raise", because the dict comprehension filters out keys that are not present in df.columns.


# Change column only if column exists.
df = pd.DataFrame(technologies)
d={'Courses':'Courses_List','Fees':'Courses_fees'}
df.rename(columns={k: v for k, v in d.items() if k in df.columns}, inplace=True,errors = "raise")
print(df.columns)
Pandas – Add New Column
Shishir Kant Singh – Tue, 28 Jan 2025

In Pandas, you can add a new column to an existing DataFrame using the DataFrame.insert() function, which updates the DataFrame in place. Alternatively, you can use DataFrame.assign() to insert a new column, but this method returns a new DataFrame with the added column.


In this article, I will cover examples of adding multiple columns, adding a constant value, and deriving new columns from existing ones in a Pandas DataFrame.

Key Points –

  • A new column can be created by assigning values directly to a new column name with df['new_column'] = values.
  • The assign() method adds columns and returns a modified copy of the DataFrame, leaving the original DataFrame unchanged unless reassigned.
  • Adding a column directly modifies the DataFrame in place, while using assign() creates a new DataFrame.
  • Lambda functions within assign() enable complex calculations or conditional logic to define values for the new column.
  • The insert() method allows adding a new column at a specific position within the DataFrame, providing flexibility for organizing columns.
  • Using functions like np.where() or apply(), you can populate a new column based on conditional values.
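The last point above can be sketched with np.where() and apply(); the Fee threshold and Tier names here are illustrative:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [22000, 25000, 23000],
})

# Populate a new column based on a vectorized condition
df["Tier"] = np.where(df["Fee"] >= 24000, "Premium", "Standard")

# apply() works too, row by row (slower, but allows arbitrary logic)
df["Tier2"] = df["Fee"].apply(lambda f: "Premium" if f >= 24000 else "Standard")
```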

Quick Examples of Adding Column

If you are in a hurry, below are some quick examples of adding column to pandas DataFrame.


# Quick examples of add column to dataframe

# Add new column to the dataframe
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df2 = df.assign(TutorsAssigned=tutors)

# Add a multiple columns to the dataframe
MNCCompanies = ['TATA','HCL','Infosys','Google','Amazon']
df2 =df.assign(MNCComp = MNCCompanies,TutorsAssigned=tutors )

# Derive new Column from existing column
df = pd.DataFrame(technologies)
df2=df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)

# Add a constant or empty value to the DataFrame
df = pd.DataFrame(technologies)
df2=df.assign(A=None,B=0,C="")

# Add new column to the existing DataFrame
df = pd.DataFrame(technologies)
df["MNCCompanies"] = MNCCompanies

# Add new column at the specific position
df = pd.DataFrame(technologies)
df.insert(0,'Tutors', tutors )

# Add new column by mapping to the existing column
df = pd.DataFrame(technologies)
tutors = {"Spark":"William", "PySpark":"Henry", "Hadoop":"Michael","Python":"John", "pandas":"Messi"}
df['Tutors'] = df['Courses'].map(tutors)
print(df)

To run some examples of adding column to DataFrame, let’s create DataFrame using data from a dictionary.


# Create DataFrame
import pandas as pd
import numpy as np

technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Discount':[1000,2300,1000,1200,2500]
          }

df = pd.DataFrame(technologies)
print("Create a DataFrame:\n", df)

The output shows the DataFrame with the columns Courses, Fee, and Discount.

Add Column to DataFrame

DataFrame.assign() is used to add/append a column to the DataFrame, this method generates a new DataFrame incorporating the added column, while the original DataFrame remains unchanged.

Below is the syntax of the assign() method.


# Syntax of DataFrame.assign()
DataFrame.assign(**kwargs)

Now let’s add a column ‘TutorsAssigned’ to the DataFrame. Using assign() we cannot modify the existing DataFrame in place; instead it returns a new DataFrame after adding the column. The below example adds a list of values as a new column to the DataFrame.


# Add new column to the DataFrame
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df2 = df.assign(TutorsAssigned=tutors)
print("Add column to DataFrame:\n", df2)

The output shows the DataFrame with the new TutorsAssigned column appended.

Add Multiple Columns to the DataFrame

You can add multiple columns to a Pandas DataFrame by using the assign() function.


# Add multiple columns to the DataFrame
MNCCompanies = ['TATA','HCL','Infosys','Google','Amazon']
df2 = df.assign(MNCComp = MNCCompanies,TutorsAssigned=tutors )
print("Add multiple columns to DataFrame:\n", df2)

Yields below output.


# Output:
# Add multiple columns to DataFrame:
    Courses    Fee  Discount  MNCComp TutorsAssigned
0    Spark  22000      1000     TATA        William
1  PySpark  25000      2300      HCL          Henry
2   Hadoop  23000      1000  Infosys        Michael
3   Python  24000      1200   Google           John
4   Pandas  26000      2500   Amazon          Messi

Adding a Column From Existing

In real-time scenarios, there’s often a need to compute and add new columns to a dataset based on existing ones. The following demonstration calculates the Discount_Percent column based on Fee and Discount. In this instance, I’ll utilize a lambda function to generate a new column from the existing data.


# Derive New Column from Existing Column
df = pd.DataFrame(technologies)
df2 = df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)
print("Add column to DataFrame:\n", df2)

You can explore deriving multiple columns and appending them to a DataFrame within a single statement. This example yields the below output.


# Output:
# Add column to DataFrame:
   Courses    Fee  Discount  Discount_Percent
0    Spark  22000      1000          220000.0
1  PySpark  25000      2300          575000.0
2   Hadoop  23000      1000          230000.0
3   Python  24000      1200          288000.0
4   Pandas  26000      2500          650000.0
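Deriving multiple columns and appending them in a single statement, as suggested above, can look like this (the Final_Fee column is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Fee": [22000, 25000], "Discount": [1000, 2300]})

# assign() accepts several keyword arguments at once; each callable
# receives the DataFrame (including columns created earlier in the call)
df2 = df.assign(
    Discount_Percent=lambda x: x.Discount / x.Fee * 100,
    Final_Fee=lambda x: x.Fee - x.Discount,
)
```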

Add a Constant or Empty Column

The below example adds 3 new columns to the DataFrame: one column with all None values, a second column with the value 0, and a third column with an empty string value.


# Add a constant or empty value to the DataFrame.
df = pd.DataFrame(technologies)
df2=df.assign(A=None,B=0,C="")
print("Add column to DataFrame:\n", df2)

Yields below output.


# Output:
# Add column to DataFrame:
    Courses    Fee  Discount     A  B C
0    Spark  22000      1000  None  0  
1  PySpark  25000      2300  None  0  
2   Hadoop  23000      1000  None  0  
3   Python  24000      1200  None  0  
4   Pandas  26000      2500  None  0  

Append Column to Existing Pandas DataFrame

The above examples create a new DataFrame after adding new columns instead of appending a column to an existing DataFrame. The example explained in this section is used to append a new column to the existing DataFrame.


# Add New column to the existing DataFrame
df = pd.DataFrame(technologies)
df["MNCCompanies"] = MNCCompanies
print("Add column to DataFrame:\n", df)

Yields below output.


# Output:
# Add column to DataFrame:
   Courses    Fee  Discount MNCCompanies
0    Spark  22000      1000         TATA
1  PySpark  25000      2300          HCL
2   Hadoop  23000      1000      Infosys
3   Python  24000      1200       Google
4   Pandas  26000      2500       Amazon

You can also use this approach to add a new column derived from an existing column.


# Derive a new column from existing column
df2 = df['Discount_Percent'] = df['Fee'] * df['Discount'] / 100
print("Add column to DataFrame:\n", df2)

# Output:
# Add column to DataFrame:
#  0    220000.0
# 1    575000.0
# 2    230000.0
# 3    288000.0
# 4    650000.0
# dtype: float64

Add Column to Specific Position of DataFrame

The DataFrame.insert() method offers the flexibility to add columns at any position within an existing DataFrame. While many examples often showcase appending columns at the end of the DataFrame, this method allows for insertion at the beginning, in the middle, or at any specific column index of the DataFrame.


# Add new column at the specific position
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df.insert(0,'Tutors', tutors)
print("Add column to DataFrame:\n", df)

# Insert 'Tutors' at a position held in a variable
# (re-create the DataFrame first; insert() raises an error
# if the column already exists)
df = pd.DataFrame(technologies)
position = 0
df.insert(position, 'Tutors', tutors)
print("Add column to DataFrame:\n", df)

Yields below output.


# Output:
# Add column to DataFrame:
    Tutors  Courses    Fee  Discount
0  William    Spark  22000      1000
1    Henry  PySpark  25000      2300
2  Michael   Hadoop  23000      1000
3     John   Python  24000      1200
4    Messi   Pandas  26000      2500

Add a Column From Dictionary Mapping

If you want to add a column with specific values for each row based on an existing value, you can do this using a dictionary. Here, the values from the dictionary will be added as the Tutors column in df, by matching each dictionary key with the value in the 'Courses' column.


# Add new column by mapping to the existing column
df = pd.DataFrame(technologies)
tutors = {"Spark":"William", "PySpark":"Henry", "Hadoop":"Michael","Python":"John", "pandas":"Messi"}
df['Tutors'] = df['Courses'].map(tutors)
print("Add column to DataFrame:\n", df)

Note that it is unable to map pandas, as the key in the dictionary does not exactly match the value in the Courses column (the match is case-sensitive). This example yields the below output.


# Output:
# Add column to DataFrame:
   Courses    Fee  Discount   Tutors
0    Spark  22000      1000  William
1  PySpark  25000      2300    Henry
2   Hadoop  23000      1000  Michael
3   Python  24000      1200     John
4   Pandas  26000      2500      NaN
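One way to avoid the NaN above is to normalize case on both sides of the mapping, or to fill unmatched rows afterwards with fillna(); a sketch:

```python
import pandas as pd

df = pd.DataFrame({"Courses": ["Spark", "Pandas"]})
tutors = {"Spark": "William", "pandas": "Messi"}

# Lower-case both the column and the dict keys so matching is case-insensitive
lookup = {k.lower(): v for k, v in tutors.items()}
df["Tutors"] = df["Courses"].str.lower().map(lookup)

# Or keep the original map and fill the gaps after mapping
df["Tutors2"] = df["Courses"].map(tutors).fillna("Unassigned")
```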

Using loc[] Add Column

Using pandas loc[] you can access rows and columns by labels or names; however, you can also use it to add a new column to a pandas DataFrame. The loc[] property takes rows as its first argument and columns as its second, hence I will use the second argument to add a new column.


# Assign the column to the DataFrame
df = pd.DataFrame(technologies)
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df.loc[:, 'Tutors'] = tutors
print("Add column to DataFrame:\n", df)

Yields the same output as above.

Pandas – Get Cell Value
Shishir Kant Singh – Tue, 28 Jan 2025

You can use the DataFrame properties loc[], iloc[], at[], iat[] and other ways to get/select a cell value from a Pandas DataFrame. A Pandas DataFrame is structured as rows & columns like a table, and a cell is the basic block that stores the data. Each cell contains information relating to the combination of its row and column.


loc[] & iloc[] are also used to select rows and to select columns from a pandas DataFrame.

Key Points –

  • Use .loc[] to get a cell value by row label and column label.
  • Use .iloc[] to get a cell value by row and column index.
  • at[] is a faster alternative for accessing a single cell using label-based indexing.
  • .iat[] is similar to .at[], but uses integer-based indexing for faster access to a single cell.
  • Convert the DataFrame to a NumPy array and access elements by array indexing.
  • Prefer .at[] when performance is critical and only one value needs to be accessed.
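The NumPy-array point above can be sketched as:

```python
import pandas as pd

df = pd.DataFrame(
    {"Duration": ["30day", "40days"], "Discount": [1000, 1200]},
    index=["r1", "r2"],
)

# Convert to a NumPy array and access elements by [row, column] position
arr = df.to_numpy()
value = arr[1, 0]
```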

1. Quick Examples of Get Cell Value of DataFrame

If you are in a hurry, below are some quick examples of how to select cell values from Pandas DataFrame.


# Quick examples of get cell value of DataFrame

# Using loc[]. Get cell value by name & index
print(df.loc['r4']['Duration'])
print(df.loc['r4'][2])

# Using iloc[]. Get cell value by index & name
print(df.iloc[3]['Duration'])
print(df.iloc[3,2])

# Using DataFrame.at[]
print(df.at['r4','Duration'])
print(df.at[df.index[3],'Duration'])

# Using DataFrame.iat[]
print(df.iat[3,2])

# Get a cell value
print(df["Duration"].values[3])

# Get cell value from last row
print(df.iloc[-1,2])
print(df.iloc[-1]['Duration'])
print(df.at[df.index[-1],'Duration'])

Now, let’s create a DataFrame with a few rows and columns, execute some examples, and validate the results. Our DataFrame contains the column names Courses, Fee, Duration, and Discount.


# Create DataFrame
import pandas as pd
technologies = {
     'Courses':["Spark","PySpark","Hadoop","Python","pandas"],
     'Fee' :[24000,25000,25000,24000,24000],
     'Duration':['30day','50days','55days', '40days','60days'],
     'Discount':[1000,2300,1000,1200,2500]
          }
index_labels=['r1','r2','r3','r4','r5']
df = pd.DataFrame(technologies, index=index_labels)
print("Create DataFrame:\n", df)

Yields below output.

Create DataFrame:
    Courses    Fee Duration  Discount
r1    Spark  24000    30day      1000
r2  PySpark  25000   50days      2300
r3   Hadoop  25000   55days      1000
r4   Python  24000   40days      1200
r5   pandas  24000   60days      2500

2. Using DataFrame.loc[] to Get a Cell Value by Column Name

In Pandas, the DataFrame.loc[] property is used to get a specific cell value by row label & column name. All the examples below return a cell value from the row labeled r4 and the Duration column (3rd column).


# Using loc[]. Get cell value by name & index
print(df.loc['r4']['Duration'])
print(df.loc['r4','Duration'])
print(df.loc['r4'][2])

Yields below output. From the above examples df.loc['r4'] returns a pandas Series.


# Output:
40days

3. Using DataFrame.iloc[] to Get a Cell Value by Column Position

If you want to get a cell value by column number or index position, use DataFrame.iloc[]. Index positions start from 0 and go up to length-1. To refer to the last column, use -1 as the column position.


# Using iloc[]. Get cell value by index & name
print(df.iloc[3]['Duration'])
print(df.iloc[3][2])
print(df.iloc[3,2])

This returns the same output as above. Note that iloc[] doesn’t support mixing labels and positions; df.iloc[3,'Duration'] raises an error.
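As a quick self-contained check of this behavior (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Courses": ["Spark", "PySpark"],
                   "Duration": ["30day", "50days"]})

# Mixing a label into iloc[] fails; iloc[] is strictly positional
raised = False
try:
    df.iloc[1, "Duration"]  # a label where a position is expected
except (ValueError, IndexError, TypeError):
    raised = True
print("raised:", raised)
```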

4. Using DataFrame.at[] to select Specific Cell Value by Column Label Name

DataFrame.at[] property is used to access a single cell by a row and column label pair. Like loc[], it doesn’t support selecting a column by position. It performs better than loc[] when you want a single specific cell value from a Pandas DataFrame, as it looks up one scalar by row and column labels. Note that at[] doesn’t support a negative index to refer to rows or columns from the end.


# Using DataFrame.at[]
print(df.at['r4','Duration'])
print(df.at[df.index[3],'Duration'])

These examples also yield the same output 40days.
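If you are curious why at[] is preferred for scalar access, here is a rough micro-benchmark sketch (toy data; absolute timings vary by machine, so only the printed comparison is illustrative):

```python
import pandas as pd
from timeit import timeit

df = pd.DataFrame({"Duration": ["30day", "50days", "55days", "40days"]},
                  index=["r1", "r2", "r3", "r4"])

# Both return the same scalar; at[] skips loc[]'s heavier indexing machinery
assert df.at["r4", "Duration"] == df.loc["r4", "Duration"]

t_loc = timeit(lambda: df.loc["r4", "Duration"], number=10_000)
t_at = timeit(lambda: df.at["r4", "Duration"], number=10_000)
print(f"loc[]: {t_loc:.4f}s, at[]: {t_at:.4f}s")
```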

5. Using DataFrame.iat[] select Specific Cell Value by Column Position

DataFrame.iat[] is another property to select a specific cell value by row and column position. Using this you can refer to columns only by position but not by a label. This also doesn’t support a negative index or column position.


# Using DataFrame.iat[]
print(df.iat[3,2])

6. Select Cell Value from DataFrame Using df[‘col_name’].values[]

We can use df['col_name'].values to get the column as a NumPy array, then access an element of that array by position to get a cell value, for instance, df["Duration"].values[3].


# Get a cell value
print(df["Duration"].values[3])

7. Get Cell Value from Last Row of Pandas DataFrame

If you want to get a specific cell value from the last row of a Pandas DataFrame, use a negative index to point to rows from the end. For example, index -1 represents the last row and -2 the second row from the end. Similarly, use -1 to refer to the last column.


# Get cell value from last row
print(df.iloc[-1,2])                  # prints 60days
print(df.iloc[-1]['Duration'])        # prints 60days
print(df.at[df.index[-1],'Duration']) # prints 60days

To select the cell value of the last row and last column use df.iloc[-1,-1], this returns 2500. Similarly, you can also try other approaches.
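The negative-index behavior described above is easy to verify with a small self-contained sketch (column names assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Duration": ["30day", "50days"],
                   "Discount": [1000, 2500]},
                  index=["r1", "r2"])

# -1 counts from the end on both axes with iloc[]
last_cell = df.iloc[-1, -1]
print(last_cell)  # 2500
```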

]]>
4358
Pandas – Query Rows by Value https://shishirkant.com/pandas-query-rows-by-value/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-query-rows-by-value Tue, 28 Jan 2025 15:48:58 +0000 https://shishirkant.com/?p=4356 The pandas.DataFrame.query() method is used to query rows based on the provided expression (single or multiple column conditions) and returns a new DataFrame. If you want to modify the existing DataFrame in place, you can set the inplace=True argument. This allows for efficient filtering and manipulation of DataFrame data without creating additional copies.

In this article, I will explain the syntax of the Pandas DataFrame query() method and several working examples, like a query with multiple conditions and a query with a string-contains condition, to name a few.

Key Points –

  • Pandas.DataFrame.query() function filters rows from a DataFrame based on a specified condition.
  • Pandas.DataFrame.query() offers a powerful and concise syntax for filtering DataFrame rows, resembling SQL queries, enhancing code readability and maintainability.
  • The method supports a wide range of logical and comparison operators, including ==, !=, >, <, >=, <=, and logical operators like and, or, and not.

Quick Examples of Pandas query()

Following are quick examples of the Pandas DataFrame query() method.


# Quick examples of pandas query()

# Query Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")

# Using variable
value='Spark'
df2=df.query("Courses == @value")

# Inplace
df.query("Courses == 'Spark'",inplace=True)

# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")

First, let’s create a Pandas DataFrame.


# Create DataFrame
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

Create DataFrame:
    Courses    Fee Duration  Discount
0     Spark  22000   30days      1000
1   PySpark  25000   50days      2300
2    Hadoop  23000   30days      1000
3    Python  24000     None      1200
4    Pandas  26000      NaN      2500

Note that the DataFrame contains None and NaN values in the Duration column; these will be taken into account in the examples below when selecting rows with None & NaN values, or when selecting rows while disregarding them.
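For example, to drop the None & NaN rows inside query() itself, you can call notnull() in the expression (a sketch with toy data; engine="python" is passed so the method call also works on older pandas versions):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Courses": ["Spark", "Python", "Pandas"],
    "Duration": ["30days", None, np.nan],
})

# notnull() inside the expression drops both None and NaN rows
df2 = df.query("Duration.notnull()", engine="python")
print(df2)
```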

Using DataFrame.query()

Following is the syntax of the DataFrame.query() method.


# Query() method syntax
DataFrame.query(expr, inplace=False, **kwargs)
  • expr – This parameter specifies the query expression string, which follows Python’s syntax for conditional expressions.
  • inplace – Defaults to False. When it is set to True, it updates the existing DataFrame, and query() method returns None.
  • **kwargs –  This parameter allows passing additional keyword arguments to the query expression. It is optional and accepts the same keyword arguments that eval() supports.

DataFrame.query() is used to filter rows based on one or multiple conditions in an expression.


# Query all rows with Courses equals 'Spark'
df2 = df.query("Courses == 'Spark'")
print("After filtering the rows based on condition:\n", df2)

Yields below output.

After filtering the rows based on condition:
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000

You can use the @ character followed by the variable name. This allows you to reference Python variables directly within the query expression.


# Query Rows by using Python variable
value='Spark'
df2 = df.query("Courses == @value")
print("After filtering the rows based on condition:\n", df2)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0   Spark  22000   30days      1000

In the above example, the variable value is referenced within the query expression "Courses == @value", enabling dynamic filtering based on the value stored in the Python variable.

To filter and update the existing DataFrame in place using the query() method, you can use the inplace=True parameter. This will modify the original DataFrame directly without needing to reassign it to a new variable.


# Replace current existing DataFrame
df.query("Courses == 'Spark'",inplace=True)
print("After filtering the rows based on condition:\n", df)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0   Spark  22000   30days      1000

In the above example, the DataFrame df is modified in place using the query() method. The expression "Courses=='Spark'" filters rows where the Courses column equals Spark. By setting inplace=True, the original DataFrame df is updated with the filtered result.

The != operator in a DataFrame query expression allows you to select rows where a specific column’s value does not equal a given value.


# Not equals condition
df2 = df.query("Courses != 'Spark'")
print("After filtering the rows based on condition:\n", df2)

# Output:
#    Courses    Fee Duration  Discount
# 1  PySpark  25000   50days      2300
# 2   Hadoop  23000   30days      1000
# 3   Python  24000     None      1200
# 4   Pandas  26000      NaN      2500

In the above example, the DataFrame df is filtered to create a new DataFrame df2, where the Courses column does not equal Spark. This expression ensures that only rows with Courses values different from Spark are included in the resulting DataFrame df2.

Query Rows by the List of Values

Using the in operator in a DataFrame query expression allows you to filter rows based on whether a specific column’s value is present in a Python list of values.


# Query rows by list of values
df2 = df.query("Courses in ('Spark','PySpark')")
print("After filtering the rows based on condition:\n", df2)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0    Spark  22000   30days      1000
# 1  PySpark  25000   50days      2300

Similarly, you can define a Python variable to hold a list of values and then use that variable in your query. This approach allows for more dynamic filtering based on the contents of the list variable.


# Query rows by list of values
values=['Spark','PySpark']
df2 = df.query("Courses in @values")
print("After filtering the rows based on condition:\n", df2)

Using the not-in operator in a DataFrame query expression allows you to filter rows based on values that are not present in a specified list.


# Query rows not in list of values
values=['Spark','PySpark']
df2 = df.query("Courses not in @values")
print("After filtering the rows based on condition:\n", df2)

# Output:
# After filtering the rows based on condition:
#   Courses    Fee Duration  Discount
# 2  Hadoop  23000   30days      1000
# 3  Python  24000     None      1200
# 4  Pandas  26000      NaN      2500

When dealing with column names containing special characters, such as spaces, you can enclose the column name within backticks (`) to ensure it is recognized properly in a query expression.


import pandas as pd
import numpy as np

# Create DataFrame
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Courses Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

# Using columns with special characters
df2 = df.query("`Courses Fee` >= 23000")
print("After filtering the rows based on condition:\n", df2)

Yields below output.


# Output:
# Create DataFrame:
    Courses  Courses Fee Duration  Discount
0    Spark        22000   30days      1000
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
3   Python        24000     None      1200
4   Pandas        26000      NaN      2500

# After filtering the rows based on condition:
    Courses  Courses Fee Duration  Discount
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
3   Python        24000     None      1200
4   Pandas        26000      NaN      2500

Query with Multiple Conditions

Querying with multiple conditions involves filtering data in a DataFrame based on more than one criterion simultaneously. Each condition typically involves one or more columns of the DataFrame and specifies a logical relationship that must be satisfied for a row to be included in the filtered result.


# Query by multiple conditions
df2 = df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")
print("After filtering the rows based on multiple conditions:\n", df2)

# Output:
# After filtering the rows based on multiple conditions:
#   Courses  Courses Fee Duration  Discount
# 2  Hadoop        23000   30days      1000
# 3  Python        24000     None      1200
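The same pattern works with or to keep rows matching either condition (a minimal sketch with toy data):

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python"],
    "Fee": [22000, 25000, 23000, 24000],
})

# 'or' keeps rows satisfying either condition
df2 = df.query("Fee < 23000 or Courses == 'Python'")
print(df2)
```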

Query Rows using apply()

If you want to filter rows using apply() along with a lambda function, you can do so, but the lambda function needs to return a boolean indicating whether each row should be included or not.


# By using lambda function with axis=1 to build a boolean row mask
df2 = df[df.apply(lambda row: row['Courses'] in ['Spark','PySpark'], axis=1)]
print("After filtering the rows based on condition:\n", df2)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0    Spark  22000   30days      1000
# 1  PySpark  25000   50days      2300
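For larger DataFrames, the same filter is usually faster with a vectorized isin() mask than a row-wise apply(), since it avoids one Python call per row; a minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [22000, 25000, 23000],
})

# Vectorized membership test producing a boolean mask
df2 = df[df["Courses"].isin(["Spark", "PySpark"])]
print(df2)
```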

Other Examples using df[] and loc[]


# Other examples you can try to query rows
df[df["Courses"] == 'Spark'] 
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]

# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])

# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])

# Select startswith
print(df[df['Courses'].str.startswith("P")])
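One caveat with str.contains(): by default it returns NaN for missing values, which breaks boolean indexing; the na parameter handles this (a sketch with toy data):

```python
import pandas as pd

df = pd.DataFrame({"Courses": ["Spark", None, "PySpark"]})

# na=False treats missing values as non-matches instead of
# propagating NaN into the boolean mask
df2 = df[df["Courses"].str.contains("Spark", na=False)]
print(df2)
```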
]]>
4356
Pandas – Select Columns https://shishirkant.com/pandas-select-columns/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-select-columns Tue, 28 Jan 2025 15:39:37 +0000 https://shishirkant.com/?p=4351  In Pandas, selecting columns by name or index allows you to access specific columns in a DataFrame based on their labels (names) or positions (indices). Use loc[] & iloc[] to select a single column or multiple columns from pandas DataFrame by column names/label or index position respectively.

In this article, I will explain how to select one or more columns from a DataFrame using different methods such as column labels, index, positions, and ranges.

Key Points –

  • Pandas allow selecting columns from a DataFrame by their names using square brackets notation or the .loc[] accessor.
  • The .loc[] accessor allows for more explicit selection, accepting row and column labels or boolean arrays.
  • Alternatively, you can use the .iloc[] accessor to select columns by their integer index positions.
  • For selecting the last column, use df.iloc[:,-1:], and for the first column, use df.iloc[:,:1].
  • Understanding both column name and index-based selection is essential for efficient data manipulation with Pandas.

Quick Examples of Select Columns by Name or Index

If you are in a hurry, below are some quick examples of selecting columns by name or index in Pandas DataFrame.


# Quick examples of select columns by name or index

# Example 1: By using df[] notation
df2 = df[["Courses","Fee","Duration"]] # Select multiple columns

# Example 2: Using loc[] to take column slices
df2 = df.loc[:, ["Courses","Fee","Duration"]] # Select multiple columns
df2 = df.loc[:, ["Courses","Fee","Discount"]] # Select random columns
df2 = df.loc[:,'Fee':'Discount'] # Select columns between two columns
df2 = df.loc[:,'Duration':]  # Select columns by range
df2 = df.loc[:,:'Duration']  # Select columns by range
df2 = df.loc[:,::2]          # Select every alternate column

# Example 3: Using iloc[] to select column by Index
df2 = df.iloc[:,[1,3,4]] # Select columns by Index
df2 = df.iloc[:,1:4] # Select between indexes 1 and 4 (2,3,4)
df2 = df.iloc[:,2:] # Select From 3rd to end
df2 = df.iloc[:,:2] # Select First Two Columns

First, let’s create a pandas DataFrame.


import pandas as pd
technologies = {
    'Courses':["Shishir","Pandas"],
    'Fee' :[20000,25000],
    'Duration':['30days','40days'],
    'Discount':[1000,2300],
    'Tutor':['Michel','Sam']
              }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

Create DataFrame:
   Courses    Fee Duration  Discount   Tutor
0  Shishir  20000   30days      1000  Michel
1   Pandas  25000   40days      2300     Sam

Using loc[] to Select Columns by Name

The df[] and DataFrame.loc[] methods in Pandas provide convenient ways to select multiple columns by names or labels, you can use the syntax [:, start:stop:step] to define the range of columns to include, where the start is the index where the slice starts (inclusive), stop is the index where the slice ends (exclusive), and step is the step size between elements. Another syntax supported by pandas.DataFrame.loc[] is [:, [labels]], where you provide a list of column names as labels.


# loc[] syntax to slice columns
df.loc[:,start:stop:step]

Select DataFrame Columns by Name

To select DataFrame columns by name, you can directly specify the column names within square brackets []. Here, df[['Courses', 'Fee', 'Duration']] selects only the Courses, Fee, and Duration columns from the DataFrame df.


# Select Columns by labels
df2 = df[["Courses","Fee","Duration"]]
print("Select columns by labels:\n", df2)

Yields below output.

Select columns by labels:
   Courses    Fee Duration
0  Shishir  20000   30days
1   Pandas  25000   40days

Select Columns by Index in Multiple Columns

To select multiple columns using df.loc[], you specify both row and column labels. If you want to select all rows and specific columns, you can use : to select all rows and provide a list of column labels. Note that loc[] also supports multiple conditions when selecting rows based on column values.


# Select multiple columns
df2 = df.loc[:, ["Courses","Fee","Discount"]]
print("Select multiple columns by labels:\n", df2)

# Output:
# Select multiple columns by labels:
#   Courses    Fee  Discount
# 0  Shishir  20000      1000
# 1   Pandas  25000      2300

In the above example, df.loc[:, ["Courses", "Fee", "Discount"]] selects all rows (:) and the columns labeled Courses, Fee, and Discount from the DataFrame df.

Select Columns Based on Label Indexing

When you want to select columns based on label indexes, provide start and stop labels.

  • If you don’t specify a start label, loc[] selects from the first column.
  • If you don’t provide a stop label, loc[] selects all columns from the start label to the last column.
  • Specifying both start and stop labels selects all columns in between, including both the start and the stop label (label slices are inclusive of the stop).

# Select all columns between Fee and Discount columns
df2 = df.loc[:,'Fee':'Discount']
print("Select columns by labels:\n", df2)

# Output:
# Select columns by labels:
#     Fee Duration  Discount
# 0  20000   30days      1000
# 1  25000   40days      2300

# Select from 'Duration' column
df2 = df.loc[:,'Duration':]
print("Select columns by labels:\n", df2)

# Output
# Select columns by labels:
#  Duration  Discount   Tutor
# 0   30days      1000  Michel
# 1   40days      2300     Sam

# Select from beginning and end at 'Duration' column
df2 = df.loc[:,:'Duration']
print("Select columns by labels:\n", df2)

# Output
# Select columns by labels:
#   Courses    Fee Duration
# 0  Shishir  20000   30days
# 1   Pandas  25000   40days

Select Every Alternate Column

You can select every alternate column from a DataFrame by using the loc[] (or iloc[]) accessor with a step size of 2.


# Select every alternate column
df2 = df.loc[:,::2]
print("Select columns by labels:\n", df2)

# Output:
# Select columns by labels:
#   Courses Duration   Tutor
# 0  Shishir   30days  Michel
# 1   Pandas   40days     Sam

This code effectively selects every alternate column, starting from the first one, which results in selecting Courses, Duration, and Tutor.

Pandas iloc[] to Select Column by Index or Position

By using pandas.DataFrame.iloc[], you can select multiple columns from a DataFrame by their positional indices. You can use the syntax [:, start:stop:step] to define the range of columns to include, where start is the index where the slice starts (inclusive), stop is the index where the slice ends (exclusive), and step is the step size between elements. Or, you can use the syntax [:, [indices]] with iloc[], where you provide a list of column positions.

Select Columns by Index Position

To select multiple columns from a DataFrame by their index positions, use the iloc[] accessor. For instance, df.iloc[:,[1,3,4]] retrieves the Fee, Discount, and Tutor columns and returns a new DataFrame with the selected columns.


# Select columns by position
df2 = df.iloc[:,[1,3,4]]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#     Fee  Discount   Tutor
# 0  20000      1000  Michel
# 1  25000      2300     Sam

Select Columns by Position Range

You can also slice a DataFrame by a range of positions. For instance, select columns by position range using the .iloc[] accessor in Pandas. It selects columns with positions 1 through 3 (exclusive of position 4) from the DataFrame df and assigns them to df2.


# Select between indexes 1 and 4 (2,3,4)
df2 = df.iloc[:,1:4]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#     Fee Duration  Discount
# 0  20000   30days      1000
# 1  25000   40days      2300

# Select From 3rd to end
df2 = df.iloc[:,2:]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#  Duration  Discount   Tutor
# 0   30days      1000  Michel
# 1   40days      2300     Sam

# Select First Two Columns
df2 = df.iloc[:,:2]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#   Courses    Fee
# 0  Shishir  20000
# 1   Pandas  25000

To retrieve the last column of a DataFrame, you can use df.iloc[:,-1:], and to obtain just the first column, you can use df.iloc[:,:1].
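A minimal self-contained sketch of this tip (hypothetical column names A, B, C, for illustration only):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

first = df.iloc[:, :1]   # first column, kept as a DataFrame
last = df.iloc[:, -1:]   # last column, kept as a DataFrame
print(first.columns.tolist(), last.columns.tolist())
```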

Complete Example


import pandas as pd
technologies = {
    'Courses':["Shishir","Pandas"],
    'Fee' :[20000,25000],
    'Duration':['30days','40days'],
    'Discount':[1000,2300],
    'Tutor':['Michel','Sam']
              }
df = pd.DataFrame(technologies)
print(df)

# Select multiple columns
print(df[["Courses","Fee","Duration"]])

# Select Random columns
print(df.loc[:, ["Courses","Fee","Discount"]])

# Select columns by range
print(df.loc[:,'Fee':'Discount']) 
print(df.loc[:,'Duration':])
print(df.loc[:,:'Duration'])

# Select every alternate column
print(df.loc[:,::2])

# Selected by column position
print(df.iloc[:,[1,3,4]])

# Select between indexes 1 and 4 (2,3,4)
print(df.iloc[:,1:4])

# Select From 3rd to end
print(df.iloc[:,2:])

# Select First Two Columns
print(df.iloc[:,:2])
]]>
4351
Pandas – Select Rows https://shishirkant.com/pandas-select-rows/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-select-rows Tue, 28 Jan 2025 15:25:54 +0000 https://shishirkant.com/?p=4345 Use Pandas DataFrame.iloc[] & DataFrame.loc[] to select rows by integer index and by index labels respectively. iloc[] can accept a single index, multiple indexes from a list, indexes by a range, and more. loc[] is explicitly used with labels and can accept a single index label, multiple index labels from a list, or a range of labels (between two index labels). When using iloc[] or loc[] with an index that doesn’t exist, it raises an error.


In this article, I will explain how to select rows from Pandas DataFrame by integer index and label (single & multiple rows), by the range, and by selecting first and last n rows with several examples. loc[] & iloc[] attributes are also used to select columns from Pandas DataFrame.

Key Points –

  • The iloc method is used to select rows by their integer position, starting from 0.
  • The loc method is used to select rows based on the index label.
  • You can use slicing with iloc to select a range of rows based on their positions.
  • With loc, you can specify a range of index labels to select multiple rows.
  • Rows can be selected using square brackets for simpler cases, though this is less flexible than iloc or loc.
  • A list of specific index labels can be passed to loc to select multiple non-consecutive rows.

1. Quick Examples of Select Rows by Index Position & Labels

If you are in a hurry, below are some quick examples of how to select a row of Pandas DataFrame by index.


# Quick examples of select rows by index position & labels

# Select rows by integer index
df2 = df.iloc[2]     # Select Row by Index
df2 = df.iloc[[2,3,6]]  # Select rows by index list
df2 = df.iloc[1:5]   # Select rows by integer index range
df2 = df.iloc[:1]    # Select First Row
df2 = df.iloc[:3]    # Select First 3 Rows
df2 = df.iloc[-1:]   # Select Last Row
df2 = df.iloc[-3:]   # Select Last 3 Row
df2 = df.iloc[::2]   # Selects alternate rows

# Select Rows by Index Labels
df2 = df.loc['r2']          # Select Row by Index Label
df2 = df.loc[['r2','r3','r6']]  # Select Rows by Index Label List
df2 = df.loc['r1':'r5']     # Select Rows by Label Index Range
df2 = df.loc['r1':'r5':2]   # Select Alternate Rows with in Index Labels

Let’s create a DataFrame with a few rows and columns and execute some examples to learn how to use an index. Our DataFrame contains the column names Courses, Fee, Duration, and Discount.


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30days','40days','35days','40days',np.nan,None,'55days'],
    'Discount':[1000,2300,1500,1200,2500,2100,2000]
               }
index_labels=['r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies,index=index_labels)
print("Create DataFrame:\n", df)

Yields below output.

Create DataFrame:
    Courses    Fee Duration  Discount
r1    Spark  20000   30days      1000
r2  PySpark  25000   40days      2300
r3   Hadoop  26000   35days      1500
r4   Python  22000   40days      1200
r5   Pandas  24000      NaN      2500
r6   Oracle  21000     None      2100
r7     Java  22000   55days      2000

2. Select Rows by Index using Pandas iloc[]

pandas.iloc[] attribute is used for integer-location-based indexing to select rows and columns in a DataFrame. Remember index starts from 0, you can use pandas.DataFrame.iloc[] with the syntax [start:stop:step]; where start indicates the index of the first row to start, stop indicates the index of the last row to stop, and step indicates the number of indices to advance after each extraction. Or, use the syntax: [[indices]] with indices as a list of row indices to take.
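The [start:stop:step] semantics follow plain Python list slicing (stop is exclusive), as a small self-contained sketch shows (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"]})

# stop is exclusive, exactly like Python list slicing
df2 = df.iloc[1:4:2]   # positions 1 and 3
print(df2)
```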

2.1 Select Row by Integer Index

You can select a single row from Pandas DataFrame by integer index using df.iloc[n]. Replace n with a position you want to select.


# Select Row by Integer Index
df1 = df.iloc[2]
print("After selecting a row by index position:\n", df1)
# Output:
# After selecting a row by index position:
# Courses    Hadoop
# Fee         26000
# Duration   35days
# Discount     1500
# Name: r3, dtype: object

2.2. Get Multiple Rows by Index List

Sometimes you may need to get multiple rows from DataFrame by specifying indexes as a list. Certainly, you can do this. For example, df.iloc[[2,3,6]] selects rows 3, 4, and 7 as the index starts from zero.


# Select Rows by Index List
df1 = df.iloc[[2,3,6]]
print("After selecting rows by index position:\n", df1)

# Output:
# After selecting rows by index position:
#   Courses    Fee Duration  Discount
# r3  Hadoop  26000   35days      1500
# r4  Python  22000   40days      1200
# r7    Java  22000   55days      2000

2.3. Get DataFrame Rows by Index Range

When you want to select a DataFrame by the range of Indexes, provide start and stop indexes.

  • By not providing a start index, iloc[] selects from the first row.
  • By not providing stop, iloc[] selects all rows from the start index.
  • Providing both start and stop, selects all rows in between.

# Select Rows by Integer Index Range
df1 = df.iloc[1:5]
print("After selecting rows by index range:\n", df1)

# Output:
# After selecting rows by index range:
#    Courses    Fee Duration  Discount
# r2  PySpark  25000   40days      2300
# r3   Hadoop  26000   35days      1500
# r4   Python  22000   40days      1200
# r5   Pandas  24000      NaN      2500

# Select First Row by Index
print(df.iloc[:1])

# Outputs:
# Courses    Fee Duration  Discount
# r1   Spark  20000   30days      1000

# Select First 3 Rows
print(df.iloc[:3])

# Outputs:
#    Courses    Fee Duration  Discount
# r1    Spark  20000   30days      1000
# r2  PySpark  25000   40days      2300
# r3   Hadoop  26000   35days      1500

# Select Last Row by Index
print(df.iloc[-1:])

# Outputs:
#    Courses    Fee Duration  Discount
# r7    Java  22000   55days      2000

# Select Last 3 Row
print(df.iloc[-3:])

# Output:
#   Courses    Fee Duration  Discount
# r5  Pandas  24000      NaN      2500
# r6  Oracle  21000     None      2100
# r7    Java  22000   55days      2000

# Selects alternate rows
print(df.iloc[::2])

# Output:
#   Courses    Fee Duration  Discount
# r1   Spark  20000   30days      1000
# r3  Hadoop  26000   35days      1500
# r5  Pandas  24000      NaN      2500
# r7    Java  22000   55days      2000

3. Select Rows by Index Labels using Pandas loc[]

By using pandas.DataFrame.loc[] you can get rows by index names or labels. To select the rows, the syntax is df.loc[start:stop:step]; where start is the name of the first-row label to take, stop is the name of the last row label to take, and step as the number of indices to advance after each extraction; for example, you can use it to select alternate rows. Or, use the syntax: [[labels]] with labels as a list of row labels to take.
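One subtlety worth a sketch: unlike iloc[], a loc[] label slice includes the stop label (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Fee": [100, 200, 300]}, index=["r1", "r2", "r3"])

# loc[] label slices include the stop label; iloc[] stop positions are exclusive
assert df.loc["r1":"r2"].index.tolist() == ["r1", "r2"]
assert df.iloc[0:1].index.tolist() == ["r1"]
print("loc stop label is inclusive; iloc stop position is exclusive")
```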

3.1. Get Row by Label

If you have custom index labels on a DataFrame, you can use these label names to select a row. For example, df.loc['r2'] returns the row with label ‘r2’.


# Select Row by Index Label
df1 = df.loc['r2']
print("After selecting a row by index label:\n", df1)

# Output:
# After selecting a row by index label:
# Courses     PySpark
# Fee           25000
# Duration     40days
# Discount       2300
# Name: r2, dtype: object

3.2. Get Multiple Rows by Label List

If you have a list of row labels, you can use this to select multiple rows from Pandas DataFrame.


# Select Rows by Index Label List
df1 = df.loc[['r2','r3','r6']]
print("After selecting rows by index label:\n", df1)

# Output:
# After selecting rows by index label:
#    Courses    Fee Duration  Discount
# r2  PySpark  25000   40days      2300
# r3   Hadoop  26000   35days      1500
# r6   Oracle  21000     None      2100

3.3. Get Rows Between Two Labels

You can also select rows between two index labels.


# Select Rows by Label Index Range
df1 = df.loc['r1':'r5']
print("After selecting rows by index label range:\n", df1)

# Output:
# After selecting rows by index label range:
#    Courses    Fee Duration  Discount
# r1    Spark  20000   30days      1000
# r2  PySpark  25000   40days      2300
# r3   Hadoop  26000   35days      1500
# r4   Python  22000   40days      1200
# r5   Pandas  24000      NaN      2500

# Select Alternate Rows with in Index Labels
print(df.loc['r1':'r5':2])

# Outputs:
#   Courses    Fee Duration  Discount
# r1   Spark  20000   30days      1000
# r3  Hadoop  26000   35days      1500
# r5  Pandas  24000      NaN      2500

You can get the first two rows using df.loc[:'r2'], but this approach is rarely used because you need to know the row labels. To select the first n rows, it is recommended to use df.iloc[:n], replacing n with the number of rows you want. The same applies to getting the last n rows.
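As a side note, head(n) and tail(n) are readable shorthands for these slices (a minimal sketch with toy data):

```python
import pandas as pd

df = pd.DataFrame({"Courses": ["Spark", "PySpark", "Hadoop", "Python"]},
                  index=["r1", "r2", "r3", "r4"])

# head(n) and tail(n) are equivalent to iloc[:n] and iloc[-n:]
assert df.head(2).equals(df.iloc[:2])
assert df.tail(2).equals(df.iloc[-2:])
print(df.head(2))
```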

4. Complete Example


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30days','40days','35days','40days',np.nan,None,'55days'],
    'Discount':[1000,2300,1500,1200,2500,2100,2000]
               }
index_labels=['r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies,index=index_labels)
print(df)

# Select Row by Index
print(df.iloc[2])

# Select Rows by Index List
print(df.iloc[[2,3,6]])

# Select Rows by Integer Index Range
print(df.iloc[1:5])

# Select First Row
print(df.iloc[:1])

# Select First 3 Rows
print(df.iloc[:3])

# Select Last Row
print(df.iloc[-1:])

# Select Last 3 Row
print(df.iloc[-3:])

# Selects alternate rows
print(df.iloc[::2])

# Select Row by Index Label
print(df.loc['r2'])

# Select Rows by Index Label List
print(df.loc[['r2','r3','r6']])

# Select Rows by Label Index Range
print(df.loc['r1':'r5'])

# Select alternate rows within index labels
print(df.loc['r1':'r5':2])
Pandas – Create DataFrame https://shishirkant.com/pandas-create-dataframe/ Tue, 28 Jan 2025 15:19:56 +0000

Python pandas is widely used for data science/data analysis and machine learning applications. It is built on top of another popular package named NumPy, which provides scientific computing in Python. A pandas DataFrame is a 2-dimensional labeled data structure with rows and columns (columns of potentially different types such as integers, strings, floats, None, Python objects, etc.). You can think of it as an Excel spreadsheet or SQL table.

1. Create Pandas DataFrame

One of the easiest ways to create a pandas DataFrame is by using its constructor. DataFrame constructor takes several optional params that are used to specify the characteristics of the DataFrame.

Below is the syntax of the DataFrame constructor.


# DataFrame constructor syntax
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

Now, let’s create a DataFrame from a list of lists (with a few rows and columns).


# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Shishir",20000, "30days"], 
                 ["Pandas",25000, "40days"], 
               ]
df=pd.DataFrame(technologies)
print(df)

Since we have not given index and column labels, DataFrame by default assigns incremental sequence numbers as labels to both rows and columns.


# Output:
         0      1       2
0  Shishir  20000  30days
1   Pandas  25000  40days

Column names with sequence numbers don’t make sense, as it’s hard to identify what data each column holds. Hence, it is always best practice to provide column names that identify the data. Use the columns param and index param to provide column labels and a custom index, respectively, to the DataFrame.


# Add Column & Row Labels to the DataFrame
column_names=["Courses","Fee","Duration"]
row_label=["a","b"]
df=pd.DataFrame(technologies,columns=column_names,index=row_label)
print(df)

Yields below output. Alternatively, you can also add column labels to an existing DataFrame.


# Output:
   Courses    Fee Duration
a  Shishir  20000   30days
b   Pandas  25000   40days

By default, pandas identifies the data types from the data and assigns them to the DataFrame. df.dtypes returns the data type of each column.


# Output:
Courses     object
Fee          int64
Duration    object
dtype: object

You can also assign custom data types to columns.


# Set custom types to DataFrame
types={'Courses': str,'Fee':float,'Duration':str}
df=df.astype(types)
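As a quick sanity check of astype(), the sketch below (using a stand-in one-row DataFrame) confirms that the Fee column becomes float64 after the cast:

```python
import pandas as pd

# Stand-in DataFrame for illustration; Fee starts out as int64
df = pd.DataFrame({'Courses': ['Spark'], 'Fee': [20000], 'Duration': ['30days']})

# Cast columns to the custom types, as in the example above
types = {'Courses': str, 'Fee': float, 'Duration': str}
df = df.astype(types)

print(df.dtypes['Fee'])  # float64
```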

2. Create DataFrame from Dict (Dictionary)

Another common way to create a pandas DataFrame is from a Python Dict (dictionary) object. This comes in handy if you want to convert a dictionary object into a DataFrame. Keys from the Dict object become the columns and the values become the rows.


# Create DataFrame from Dict
technologies = {
    'Courses':["Shishir","Pandas"],
    'Fee' :[20000,25000],
    'Duration':['30days','40days']
              }
df = pd.DataFrame(technologies)
print(df)

3. Create DataFrame with Index

By default, DataFrame adds a numeric index starting from zero. It can be changed to a custom index while creating a DataFrame.


# Create DataFrame with Index.
technologies = {
    'Courses':["Shishir","Pandas"],
    'Fee' :[20000,25000],
    'Duration':['30days','40days']
              }
index_label=["r1","r2"]
df = pd.DataFrame(technologies, index=index_label)
print(df)

4. Creating DataFrame from List of Dicts Object

Sometimes we get data as a JSON string (a list of dict-like records); you can convert it to a DataFrame as shown below.


# Creates DataFrame from list of dict
technologies = [{'Courses':'Shishir', 'Fee': 20000, 'Duration':'30days'},
        {'Courses':'Pandas', 'Fee': 25000, 'Duration': '40days'}]

df = pd.DataFrame(technologies)
print(df)
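If the data actually arrives as a raw JSON string rather than a Python list, a minimal sketch is to parse it with the standard json module first and then pass the resulting list of dicts to the DataFrame constructor (the json_str value here is illustrative):

```python
import json
import pandas as pd

# An illustrative JSON string holding a list of records
json_str = '[{"Courses": "Shishir", "Fee": 20000}, {"Courses": "Pandas", "Fee": 25000}]'

# Parse the JSON into a list of dicts, then build the DataFrame
records = json.loads(json_str)
df = pd.DataFrame(records)
print(df.shape)  # (2, 2)
```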

5. Creating DataFrame From Series

By using the concat() method you can create a DataFrame from multiple Series. This method takes several params; for this scenario we pass a list of the Series to combine and axis=1 to merge the Series as columns instead of rows.


# Create pandas Series
courses = pd.Series(["Shishir","Pandas"])
fees = pd.Series([20000,25000])
duration = pd.Series(['30days','40days'])

# Create DataFrame from series objects.
df=pd.concat([courses,fees,duration],axis=1)
print(df)

# Outputs
#          0      1       2
# 0  Shishir  20000  30days
# 1   Pandas  25000  40days

6. Add Column Labels

As you see above, by default the concat() method doesn’t add column labels. You can add them as below.


# Assign Index to Series
index_labels=['r1','r2']
courses.index = index_labels
fees.index = index_labels
duration.index = index_labels

# Concat Series by Changing Names
df=pd.concat({'Courses': courses,
              'Course_Fee': fees,
              'Course_Duration': duration},axis=1)
print(df)

# Outputs:
#     Courses  Course_Fee Course_Duration
# r1  Shishir       20000          30days
# r2   Pandas       25000          40days

7. Creating DataFrame using zip() function

Multiple lists can be merged using the zip() function, and the output used to create a DataFrame.


# Create Lists
Courses = ['Shishir', 'Pandas']
Fee = [20000,25000]
Duration = ['30days','40days']
   
# Merge lists by using zip().
tuples_list = list(zip(Courses, Fee, Duration))
df = pd.DataFrame(tuples_list, columns = ['Courses', 'Fee', 'Duration'])

8. Create an Empty DataFrame in Pandas

Sometimes you need to create an empty pandas DataFrame, with or without columns. This is required in many cases; below is one example.

When working with files, there are times when a file may not be available for processing. However, we may still need to manually create a DataFrame with the expected column names. Failing to use the correct column names can cause operations or transformations, such as unions, to fail, as they rely on columns that may not exist.

To handle situations like these, it’s important to always create a DataFrame with the expected columns, ensuring that the column names and data types are consistent, whether the file exists or if we’re processing an empty file.


# Create Empty DataFrame
df = pd.DataFrame()
print(df)

# Outputs:
# Empty DataFrame
# Columns: []
# Index: []

To create an empty DataFrame with just column names but no data.


# Create Empty DataFrame with Column Labels
df = pd.DataFrame(columns = ["Courses","Fee","Duration"])
print(df)

# Outputs:
# Empty DataFrame
# Columns: [Courses, Fee, Duration]
# Index: []
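Putting the schema idea above into practice, one minimal sketch is to create the empty DataFrame with the expected columns and then cast it to the expected data types, so later unions or concats behave consistently (the column names and dtypes here are assumptions for illustration):

```python
import pandas as pd

# Expected schema: column names mapped to expected dtypes (illustrative)
expected = {'Courses': 'object', 'Fee': 'int64', 'Duration': 'object'}

# Empty DataFrame with the expected columns, cast to the expected dtypes
df = pd.DataFrame(columns=list(expected)).astype(expected)

print(len(df))               # 0
print(df.dtypes['Fee'])      # int64
```

Even with zero rows, the DataFrame now carries the column names and dtypes that downstream operations expect.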

9. Create DataFrame From CSV File

In real time, we often need to read the contents of CSV files into a DataFrame. In pandas, creating a DataFrame from CSV is done by using the pandas.read_csv() method, which returns a DataFrame with the contents of the CSV file.


# Create DataFrame from CSV file
df = pd.read_csv('data_file.csv')

10. Create From Another DataFrame

Finally, you can also copy a DataFrame from another DataFrame using the copy() method.


# Copy DataFrame to another
df2=df.copy()
print(df2)
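A small demonstration of the point: copy() returns a deep copy by default, so modifying the copy leaves the original DataFrame intact (the sample data below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Courses': ['Spark', 'Pandas'], 'Fee': [20000, 25000]})

# Deep copy (the default); df2 has its own data
df2 = df.copy()
df2.loc[0, 'Fee'] = 99999

print(df.loc[0, 'Fee'])   # 20000 -- original unchanged
print(df2.loc[0, 'Fee'])  # 99999
```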
Pandas – Install on Windows https://shishirkant.com/pandas-install-on-windows/ Tue, 28 Jan 2025 15:11:52 +0000

You can install the latest or a specific version of pandas on Windows using either the pip command that comes with the Python binary, or conda if you are using the Anaconda distribution. Before using either of these commands, you need to install Python or the Anaconda distribution. If you already have either one installed, you can skip the first section and jump directly to installing pandas. If not, let’s see how to install pandas using these two approaches; you can use either one.

  • pip is the Python package manager that comes with Python and installs third-party packages from PyPI. Using pip you can install, uninstall, upgrade, or downgrade any Python library that is part of the Python Package Index.
  • conda is the package manager that comes with the Anaconda distribution; it is both cross-platform and language-agnostic.

1. Install Python Pandas On Windows

As I said above, if you already have Python installed and have set the path to run python and pip from the command prompt, you can skip this section and jump directly to installing pandas using the pip command.

1.1 Download & Install Python

Let’s see step-by-step how to install python and set environment variables.

1.1.1 Download Python

Go to https://www.python.org/downloads/ and download the latest version for Windows. If you want a specific version, use the Active Python Releases section or scroll down to select the version to download.

python pandas installation
Python Download

This downloads the .exe file to your downloads folder.

1.1.2 Install Python to Custom Location

Now double-click on the download to install it on Windows. This gives you an installer screen similar to the one below.

From the below screen, you can select the “Install Now” option if you want to install to the default location, or select “Customize installation” to change the location where Python is installed. In my case, I used the second option and installed it in the c:\apps\opt\python folder.

Note: Select the check box at the bottom of the screen that reads “Add Python 3.9 to PATH”. This adds the Python location to the PATH environment variable so that you can run pip and python from the command line. If you do not select it, don’t worry; I will show you how to add the Python installation location to PATH post-installation.

Python pandas custom installation

1.1.3 Set Python Installed Location to PATH Environment

Now add the Python installed location and scripts location (C:\apps\opt\Python\Python39;C:\apps\opt\Python\Python39\Scripts) to the PATH environment variable by following the below images in order.

1.1.4 Run Python shell from Command Prompt

Now open the Windows command prompt by entering cmd in Windows Run (press Windows key + R) or from the search box.

install pandas windows

This opens the command prompt. Now type python and press Enter; this should give you a Python prompt.

If you get an error like "'python' is not recognized as an internal or external command", then something is wrong with the PATH environment variable from the above step. Correct it, re-open the command line, and try python again. If you still get an error, try setting PATH from the command prompt by running the below command, changing the paths according to your installation.


set PATH=%PATH%;C:\apps\opt\Python\Python39;C:\apps\opt\Python\Python39\Scripts;

Now type python again and confirm you see the message below.

pandas install command prompt

1.2 Install Pandas Using pip Command on Windows

The Python that I have installed comes with pip and pip3 commands (you can find these in the Python installation folder at C:\apps\opt\Python\Python39\Scripts).

pip (the Python package manager) is used to install third-party packages from PyPI. Using pip you can install, uninstall, upgrade, or downgrade any Python library that is part of the Python Package Index.

Since the pandas package is available in PyPI, we use pip to install the latest pandas version on Windows.


# Install pandas using pip
pip install pandas
(or)
pip3 install pandas

This should give you output as below. If your pip is not up to date, then upgrade pip to the latest version.

Install pandas pip windows

To check what version of pandas is installed, use the pip list or pip3 list command.

show pandas install version

If you want to install a specific version of pandas, use the below command


# Installing pandas to specific version
pip install pandas==1.3.1

If you want to upgrade pandas to the latest or a specific version:


# Using pip3 to upgrade pandas
pip3 install --upgrade pandas

# Alternatively you can also try
python -m pip install --upgrade pandas

This completes the installation of pandas to the latest or a specific version on Windows. If you have trouble installing, or any steps here are incorrect, please comment. Your comment would help others!

2. Install Pandas From Anaconda Distribution

If you already have Anaconda installed, jump to installing pandas using the conda command on Windows.

2.1 Download & Install Anaconda distribution

Follow the below step-by-step instructions to install Anaconda on windows.

2.1.1 Download Anaconda .exe File

Go to https://anaconda.com/ and select Anaconda Individual Edition to download the latest version of Anaconda. This downloads the .exe file to the Windows default Downloads folder.

anaconda pandas window install

2.1.2 Install Anaconda on Windows

Double-clicking the .exe file starts the Anaconda installation. Follow the below screenshots to complete the installation.

First step pandas installation
installing pandas setp-by-step
install pandas all users
complete anaconda pandas installation

This finishes the installation of the Anaconda distribution. Now let’s see how to install pandas.

2.2 Install Pandas using conda command on Windows

2.2.1 Open Anaconda Navigator from the Windows Start menu or search box.

2.2.2 Create Anaconda Environment

This is optional, but it is recommended to create an environment before you proceed. This gives complete segregation of package installs between the different projects you work on. If you already have an environment, you can use it too.

Select the + Create option, select the Python version you would like to use, and enter your environment name. I am naming the environment pandas-tutorial.

2.2.3 Open Anaconda Terminal

You can open the Anaconda terminal from Anaconda Navigator, or open it from the Windows Start menu/search.

create pandas environment

2.2.4 Install Pandas using conda

Now enter conda install pandas to install pandas in your environment. Note that along with pandas it also installs several other packages, including the widely used NumPy.

install pandas conda windows

2.2.5 Test Pandas From Command Line or Using Jupyter Notebook

Now open the Python terminal by entering python on the command line, then run the following commands at the >>> prompt.


>>> import pandas as pd
>>> pd.__version__
'1.3.2'
>>>
Run Pandas statements

Writing pandas commands in the terminal is not practical for real work, so let’s see how to run pandas programs from Jupyter Notebook.

Go to Anaconda Navigator -> Environments -> your environment (mine is pandas-tutorial) -> select Open With Jupyter Notebook.

install anaconda pandas windows

This opens up Jupyter Notebook in the default browser.

Open Jupyter Notebook

Now select New -> PythonX and enter the below lines and select Run.

Run pandas jupyter windows

This completes installing pandas on Anaconda and running sample pandas statements on the command line and Jupyter Notebook.

Programming C – What is a Translator https://shishirkant.com/programming-c-what-is-a-translator/ Sat, 05 Aug 2023 16:11:42 +0000

Translators in Programming Languages

In this article, I am going to discuss what a translator is and why it is needed in programming languages.

What is a Translator?

The instructions a user writes are in an English-like form called source code. But the computer cannot understand this source code; it understands only binary (machine) code. To convert this source code into binary code we use interface software called translators.

Translators are system software that convert programming language code into binary format. Translators are classified into three types:

  1. Compiler
  2. Interpreter
  3. Assembler

For better understanding please have a look at the following image.

What is a Translator

Compiler and interpreter are both used to convert high-level programs to machine code. Assembler is used to convert low-level programs to machine code.

Compiler:
Compiler

A compiler is system software that translates high-level programming language code into binary format in a single step, reporting the lines that contain errors. It checks all kinds of limits, ranges, errors, etc. However, its translation takes more time and occupies a larger part of memory.

Interpreter:
Interpreter

It is system software that converts programming language code into binary format step by step, i.e. line by line. It reads one statement, executes it, and then proceeds to the next, stopping as soon as an error occurs. For development, an interpreter is recommended.

Note: The compiler converts the entire source code at once, reporting the error lines. The interpreter works line by line. C & C++ are compiler-based languages. Java, .NET, Python, etc. are compiler-based interpreted languages. The assembler’s working style is similar to the compiler’s.
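Python itself illustrates this note: it first compiles source code to bytecode (a compilation step), and its virtual machine then interprets that bytecode. A minimal sketch using the built-in compile() and exec():

```python
# Compile a source string to a bytecode object (compilation step)
source = "x = 2 + 3"
code_obj = compile(source, "<example>", "exec")

# Execute the bytecode (interpretation step)
namespace = {}
exec(code_obj, namespace)

print(namespace["x"])  # 5
```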

Assembler:

It is the system software that converts assembly language instructions into binary formats.

Assembler
Operating System:

An Operating System (OS) is an interface between a computer user and the computer hardware. An operating system is software that performs all the basic tasks like file management, memory management, process management, handling input and output, and controlling peripheral devices such as disk drives and printers.

Loader:

A loader is a program that loads the machine code of a program into system memory. A locator is a program that assigns specific memory addresses to each machine code of a program that is to be loaded into system memory.

Linker:

Usually, a longer program is divided into a number of smaller subprograms called modules. It is easier to develop, test, and debug smaller programs. A linker is a program that links these smaller programs together to form a single program. The linker links the machine code of the programs; it accepts the user’s programs after the editor has edited them and the compiler has produced their machine code. This process is called linking.
