Shishir Kant Singh – https://shishirkant.com

Pandas – Get Row Count https://shishirkant.com/pandas-get-row-count/ Sun, 04 May 2025 15:40:47 +0000

You can get the number of rows in a Pandas DataFrame using len(df.index) or the df.shape[0] property. Pandas lets us get the shape of the DataFrame, which includes the number of rows.

The DataFrame.shape property returns the rows and columns as a tuple; get the row count from the first index, df.shape[0], and the column count from df.shape[1]. Alternatively, you can use the DataFrame.count() method to find the number of rows, but this is not a recommended approach due to performance, and because it excludes None/NaN values.


In this article, I will explain how to count or find the DataFrame rows count with examples.

Key Points –

  • The shape attribute returns a tuple of the form (rows, columns), where the first element is the number of rows.
  • The len() function returns the number of rows in a DataFrame.
  • Accessing the first element of the shape tuple, shape[0], gives the number of rows directly.
  • len(df.index), len(df), and df.shape[0] all return the same count; the performance differences between them are minor in practice.
  • When applying filters or conditions, the number of rows can change, and you can use any of these methods to get the updated count.

1. Quick Examples of Get the Number of Rows in DataFrame

If you are in a hurry, below are some quick examples of how to get the number of rows (row count) in a Pandas DataFrame.


# Quick examples of get the number of rows

# Example 1: Get the row count 
# Using len(df.index)
rows_count = len(df.index)

# Example 2: Get count of rows 
# Using len(df.axes[0])
rows_count = len(df.axes[0])

# Example 3: Get count of rows 
# Using df.shape[0]
rows_count = df.shape[0]

# Example 4: Get count of rows
# Using count()
rows_count = df.count().iloc[0]

If you are a Pandas learner, read through the article, as I have explained these examples with sample data to help you understand them better.

Let’s create a Pandas DataFrame from a dictionary of lists, with column names Courses, Courses Fee, Duration, and Discount.


import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Courses Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

# Output:
# Create DataFrame:
#     Courses  Courses Fee Duration  Discount
# 0     Spark        22000   30days      1000
# 1   PySpark        25000   50days      2300
# 2    Hadoop        23000   30days      1000
# 3    Python        24000     None      1200
# 4    Pandas        26000      NaN      2500

2. Get Number of Rows in DataFrame

You can use len(df.index) to find the number of rows in a Pandas DataFrame; df.index returns RangeIndex(start=0, stop=5, step=1), and passing it to len() gives the count. You can also use len(df), but this performs slightly slower than len(df.index) since it makes one extra function call. Both of these are faster than df.shape[0] for getting the count.

If performance is not a constraint, prefer len(df), as it is neat and easy to read.
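As a quick sanity check, the relative cost of these approaches can be measured with timeit. This is a minimal sketch on a made-up 100,000-row DataFrame; absolute timings will vary by machine and Pandas version, so treat them as relative indicators only.

```python
import timeit

import pandas as pd

# A throwaway DataFrame, only for timing purposes
df = pd.DataFrame({"a": range(100_000)})

# Time each row-count approach
t_len_index = timeit.timeit(lambda: len(df.index), number=10_000)
t_len_df = timeit.timeit(lambda: len(df), number=10_000)
t_shape = timeit.timeit(lambda: df.shape[0], number=10_000)

print(f"len(df.index): {t_len_index:.4f}s")
print(f"len(df)      : {t_len_df:.4f}s")
print(f"df.shape[0]  : {t_shape:.4f}s")
```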


# Get the row count using len(df.index)
print(df.index)

# Outputs: 
# RangeIndex(start=0, stop=5, step=1)

print('Row count is:', len(df.index))
print('Row count is:', len(df))

# Outputs:
# Row count is: 5
# Row count is: 5

3. Get Row Count in DataFrame Using len(DataFrame.axes[0])

Pandas also provides the DataFrame.axes property that returns a tuple of your DataFrame's axes for rows and columns. Access axes[0] and call len(df.axes[0]) to return the number of rows. For the column count, use df.axes[1], for example: len(df.axes[1]).

Here, DataFrame.axes[0] returns the row axis (index), and len() is then used to get the length of that axis, which corresponds to the number of rows in the DataFrame.


# Get the row count using len(df.axes[0])
print(df.axes)

# Output:
# [RangeIndex(start=0, stop=5, step=1), Index(['Courses', 'Courses Fee', 'Duration', 'Discount'], dtype='object')]

print(df.axes[0])

# Output:
# RangeIndex(start=0, stop=5, step=1)

print('Row count is:', len(df.axes[0]))

# Outputs:
# Row count is: 5

4. Using df.shape[0] to Get Rows Count

Pandas DataFrame.shape returns the count of rows and columns; df.shape[0] is used to get the number of rows, and df.shape[1] to get the column count.

In the below example, df.shape returns a tuple containing the number of rows and columns in the DataFrame, and df.shape[0] specifically extracts the number of rows. This approach is concise and widely used for obtaining the row count in Pandas DataFrames.


# Get row count using df.shape[0]
df = pd.DataFrame(technologies)
row_count = df.shape[0]  # Returns number of rows
col_count = df.shape[1]  # Returns number of columns
print("Number of rows:", row_count)

# Outputs:
# Number of rows: 5

5. Using df.count() Method

This approach is not recommended due to its performance, but I still need to cover it as it is also one of the ways to get the row count of a DataFrame. Note that count() excludes None/NaN values when calculating the count. As you can see, my DataFrame contains two None/NaN values in the Duration column, hence it returned 3 instead of 5 for that column in the example below.


# Get count of each column
print(df.count())

# Outputs: 
# Courses        5
# Courses Fee    5
# Duration       3
# Discount       5
# dtype: int64

Now let’s see how to get the row count.


# Get count of rows using count()
rows_count = df.count().iloc[0]
rows_count = df[df.columns[0]].count()
print('Number of Rows count is:', rows_count)

# Outputs:
# Number of Rows count is: 5
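To make the difference concrete, here is a small sketch (with a made-up DataFrame) contrasting len(), which counts every row, with count(), which skips None/NaN values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Duration": ["30days", "50days", "30days", None, np.nan],
})

total_rows = len(df)                        # counts every row
non_null_duration = df["Duration"].count()  # skips None/NaN

print(total_rows)         # 5
print(non_null_duration)  # 3
```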

Pandas – Cast Column Type https://shishirkant.com/pandas-cast-column-type/ Sun, 04 May 2025 15:36:45 +0000

While working with a Pandas DataFrame or any table-like data structure, we are often required to change the data type (dtype) of a column, also called type casting. For example, converting from int to string, or string to int. In Pandas, you can do this with several methods like astype(), to_numeric(), convert_dtypes(), and infer_objects(). In this article, I will explain different examples of how to change or convert the data type in a Pandas DataFrame: converting all columns to a specific type, converting single or multiple column types, converting to numeric types, etc.

Key Points –

  • Applying the .astype() method to convert data types directly, specifying the desired dtype.
  • Utilizing the .to_numeric() function to coerce object types into numeric types, with options for handling errors and coercing strings.
  • Using the infer_objects() method to automatically infer and convert data types.
  • Employing the astype() method with extension dtypes, such as nullable integers, for conversions that must preserve missing values.
  • Utilizing custom functions or mapping techniques for more complex type conversions.
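For conversions that astype() or to_numeric() cannot handle directly, a custom function applied with map() works well. The sketch below is a hypothetical example; the price column and the parse_price helper are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"price": ["$1,200", "$950", "$2,000"]})

def parse_price(value: str) -> int:
    # Strip the currency symbol and thousands separator,
    # then cast the remaining digits to int.
    return int(value.replace("$", "").replace(",", ""))

df["price"] = df["price"].map(parse_price)
print(df["price"].tolist())  # [1200, 950, 2000]
```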

1. Quick Examples of Changing Data Type

Below are some quick examples of converting column data types in a Pandas DataFrame.


# Quick examples of converting data types 

# Example 1: Convert all types to best possible types
df2=df.convert_dtypes()

# Example 2: Change All Columns to Same type
df = df.astype(str)

# Example 3: Change Type For One or Multiple Columns
df = df.astype({"Fee": int, "Discount": float})

# Example 4: Ignore errors
df = df.astype({"Courses": int},errors='ignore')

# Example 5: Converts object types to possible types
df = df.infer_objects()

# Example 6: Converts fee column to numeric type
df['Fee'] = pd.to_numeric(df['Fee'])

# Example 7: Convert Fee and Discount to numeric types
df[['Fee', 'Discount']] = df[['Fee', 'Discount']].apply(pd.to_numeric)

Now let’s see this with an example. First, create a Pandas DataFrame with column names Courses, Fee, Duration, and Discount.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30day','40days','35days', '40days','60days','50days','55days'],
    'Discount':[11.8,23.7,13.4,15.7,12.5,25.4,18.4]
    }
df = pd.DataFrame(technologies)
print(df.dtypes)

Yields below output.


# Output:
Courses       object
Fee            int64
Duration      object
Discount     float64

2. DataFrame.convert_dtypes() to Convert Data Type in Pandas

convert_dtypes() has been available in Pandas DataFrame since version 1.0.0; it is a commonly used method, as it automatically converts the columns to the best possible dtypes.

Below is the Syntax of the pandas.DataFrame.convert_dtypes().


# Syntax of DataFrame.convert_dtypes
DataFrame.convert_dtypes(infer_objects=True, convert_string=True,
      convert_integer=True, convert_boolean=True, convert_floating=True)

Now, let’s see a simple example.


# Convert all types to best possible types
df2=df.convert_dtypes()
print(df2.dtypes)

Yields below output. Note that it converted columns with object type to string type.


# Output:
Courses       string
Fee            int64
Duration      string
Discount     float64

This method is handy when you want to leverage Pandas’ built-in type inference capabilities to automatically convert data types, especially when dealing with large datasets or when you’re unsure about the optimal data type for each column.
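One practical benefit, to the best of my understanding of convert_dtypes(), is its handling of missing values: an integer column containing NaN is normally forced to float64, but convert_dtypes() can move it to the nullable Int64 dtype instead. A minimal sketch with a made-up column:

```python
import numpy as np
import pandas as pd

# NaN forces the column to float64 under the default dtypes
df = pd.DataFrame({"Fee": [20000, 25000, np.nan]})
print(df["Fee"].dtype)   # float64

# convert_dtypes() picks the nullable Int64 dtype,
# keeping the missing value as <NA>
df2 = df.convert_dtypes()
print(df2["Fee"].dtype)  # Int64
```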

3. DataFrame.astype() to Change Data Type in Pandas

In a Pandas DataFrame, use the DataFrame.astype() function to convert one type to another for single or multiple columns at a time; you can also use it to change all column types to the same type. When you call astype() on a DataFrame without specifying a column name, it changes all columns to the specified type. To convert specific columns, you need to explicitly specify them.

Below is the syntax of pandas.DataFrame.astype()


# Below is syntax of astype()
DataFrame.astype(dtype, copy=True, errors='raise')

3.1 Change All Columns to Same type in Pandas

df.astype(str) converts all columns of a Pandas DataFrame to string type, as can be confirmed by printing the data types before and after the conversion. Each column will be of type object, which is the dtype Pandas uses for storing strings.


# Change All Columns to Same type
df = df.astype(str)
print(df.dtypes)

Yields below output.


# Output:
Courses      object
Fee          object
Duration     object
Discount     object
dtype: object

3.2 Change Type For One or Multiple Columns in Pandas

With astype(), pass a dictionary with the column name as the key and the type you want to convert to as the value to change one or multiple columns. The below example casts DataFrame column Fee to int type and Discount to float type.


# Change Type For One or Multiple Columns
df = df.astype({"Fee": int, "Discount": float})
print(df.dtypes)

3.3 Convert Data Type for All Columns in a List

Sometimes you may need to convert a list of DataFrame columns to a specific type, you can achieve this in several ways. Below are 3 different ways that convert columns Fee and Discount to float type.


# Convert data type for all columns in a list
df = pd.DataFrame(technologies)
cols = ['Fee', 'Discount']
df[cols] = df[cols].astype('float')

# By using a loop
for col in ['Fee', 'Discount']:
    df[col] = df[col].astype('float')

# By using apply() & astype() together
df[['Fee', 'Discount']] = df[['Fee', 'Discount']].apply(lambda x: x.astype('float'))

3.4 Raise or Ignore Error when Convert Column type Fails

By default, when you try to change a column to a type that is not supported by the data, Pandas raises an error. To ignore the error, use the errors param, which takes either 'ignore' or 'raise' as its value. In the below example, I am converting a column that has string values to int, which is not supported and hence raises an error; I used errors='ignore' to suppress it.


# Ignores error
df = df.astype({"Courses": int},errors='ignore')

# Generates error
df = df.astype({"Courses": int},errors='raise')

4. DataFrame.infer_objects() to Change Data Type in Pandas

Use the DataFrame.infer_objects() method to automatically convert object columns to the type of data they hold. It checks the data of each object column and converts it to a more specific dtype where possible. Note that it converts only object columns: for example, if a column with object type is holding int or float values, infer_objects() converts it to the respective numeric type.


# Converts object types to possible types
df = pd.DataFrame(technologies)
df = df.infer_objects()
print(df.dtypes)

5. Using pandas.to_numeric() to Convert Numeric Types

pandas.to_numeric() is used to convert columns with non-numeric dtypes to the most suitable numeric type.

5.1 Convert Numeric Types

Using pd.to_numeric() is another way to convert a specific column to a numeric type in Pandas. Here’s how you can use it to convert the Fee column to a numeric type.


# Converts fee column to numeric type
df['Fee'] = pd.to_numeric(df['Fee'])
print(df.dtypes)

If the Fee column holds numbers stored as strings, this code converts them to numeric values, as can be confirmed by printing the data types after the conversion.
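pd.to_numeric() also takes an errors parameter. A minimal sketch, on a made-up Series, showing errors='coerce', which turns unparseable values into NaN instead of raising:

```python
import pandas as pd

s = pd.Series(["22000", "25000", "not-a-number"])

# errors='coerce' replaces values that cannot be parsed
# with NaN rather than raising a ValueError.
converted = pd.to_numeric(s, errors="coerce")
print(converted.tolist())  # [22000.0, 25000.0, nan]
```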

5.2 Convert Multiple Numeric Types using apply() Method

Use to_numeric() along with DataFrame.apply() method to convert multiple columns into a numeric type. The below example converts column Fee and Discount to numeric types.


# Convert Fee and Discount to numeric types
df = pd.DataFrame(technologies)
df[['Fee', 'Discount']] = df[['Fee', 'Discount']].apply(pd.to_numeric)
print(df.dtypes)

Pandas Drop Rows Based on Column Value https://shishirkant.com/pandas-drop-rows-based-on-column-value/ Sun, 04 May 2025 15:17:59 +0000

Use the drop() method to delete rows based on column value in a Pandas DataFrame. As part of data cleansing, you may be required to drop rows from the DataFrame when a column value matches a static value or another column's value. In this article, I will explain dropping rows based on column values.

Key Points –

  • Use boolean indexing to filter rows based on specific conditions in a DataFrame column.
  • The condition inside the boolean indexing can involve any comparison or logical operation.
  • Apply the mask to the DataFrame using the .loc[] indexer or DataFrame.drop() method.
  • Use boolean indexing or conditional statements to create a mask identifying rows to be dropped.
  • Always ensure to create a new DataFrame or use inplace=True parameter to modify the original DataFrame when dropping rows to avoid unintended consequences.

Create DataFrame

To run some examples of drop rows based on column value, let’s create Pandas DataFrame.


# Create pandas DataFrame
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python"],
    'Fee' :[22000,25000,np.nan,24000],
    'Duration':['30days',None,'55days',np.nan],
    'Discount':[1000,2300,1000,np.nan]
          }
df = pd.DataFrame(technologies)
print("DataFrame:\n", df)

Yields below output.

# Output:
# DataFrame:
#    Courses      Fee Duration  Discount
# 0    Spark  22000.0   30days    1000.0
# 1  PySpark  25000.0     None    2300.0
# 2   Hadoop      NaN   55days    1000.0
# 3   Python  24000.0      NaN       NaN

Delete Rows Using drop()

To delete rows based on specific column values in a Pandas DataFrame, you typically filter the DataFrame using boolean indexing and then reassign the filtered DataFrame back to the original variable or use the drop() method to remove those rows.


# Delete rows using drop()
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print("Drop rows based on column value:\n", df)

Yields below output.

# Output:
# Drop rows based on column value:
#   Courses      Fee Duration  Discount
# 0   Spark  22000.0   30days    1000.0
# 2  Hadoop      NaN   55days    1000.0

In the above example, use the drop() method to remove the rows where the Fee column is greater than or equal to 24000. We used inplace=True to modify the original DataFrame df.
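An equivalent pattern that avoids inplace=True is to keep the rows you want and reassign. A small sketch with the same condition; note that the NaN row survives, because NaN >= 24000 evaluates to False, matching the drop() behavior described above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python"],
    "Fee": [22000, 25000, np.nan, 24000],
})

# Keep only the rows that do NOT match the drop condition,
# then reassign -- no inplace needed.
df = df[~(df["Fee"] >= 24000)]
print(df["Courses"].tolist())  # ['Spark', 'Hadoop']
```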

Using loc[]

Using loc[] to drop rows based on a column value involves leveraging the loc[] accessor in pandas to filter rows from a DataFrame according to a condition applied to a specific column, effectively filtering out rows that do not meet the condition.


# Remove rows (keep only rows matching the condition)
df2 = df[df.Fee >= 24000]
print("Drop rows based on column value:\n", df2)

# Using loc[]
df2 = df.loc[df["Fee"] >= 24000]
print("Drop rows based on column value:\n", df2)

# Output:
#  Drop rows based on column value:
#    Courses      Fee Duration  Discount
# 1  PySpark  25000.0     None    2300.0
# 3   Python  24000.0      NaN       NaN

Delete Rows Based on Multiple Column Values

To delete rows from a DataFrame based on multiple column values in pandas, you can use the drop() function along with boolean indexing.


# Delete rows based on multiple column values
df = pd.DataFrame(technologies)
df = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print("Drop rows based on multiple column values:\n", df)

# Output:
# Drop rows based on multiple column values:
#    Courses      Fee Duration  Discount
# 1  PySpark  25000.0     None    2300.0

Delete Rows Based on None or NaN Column Values

When you have None or NaN values in columns, you may need to remove them before applying calculations. You can do this using the notnull() function.

Note: With None or NaN values you cannot use == or != operators.


# Drop rows with None/NaN values
df2 = df[df.Discount.notnull()]
print("Drop rows based on column value:\n", df2)

# Output:
#  Drop rows based on column value:
#    Courses      Fee Duration  Discount
# 0    Spark  22000.0   30days    1000.0
# 1  PySpark  25000.0     None    2300.0
# 2   Hadoop      NaN   55days    1000.0
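A common alternative to the notnull() filter is dropna() with the subset parameter, which removes rows where the listed columns are None/NaN. A minimal sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python"],
    "Discount": [1000, 2300, 1000, np.nan],
})

# Drop rows whose Discount is None/NaN; other columns
# may still contain missing values.
df2 = df.dropna(subset=["Discount"])
print(df2["Courses"].tolist())  # ['Spark', 'PySpark', 'Hadoop']
```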

Using query()

The DataFrame.query() function is primarily used for filtering rows based on a condition, rather than directly deleting rows. However, you can filter rows using query() and then assign the filtered DataFrame back to the original variable, effectively removing the rows that do not meet the specified condition.


# Delete rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")

# Using variable
value='Spark'
df2=df.query("Courses == @value")

# Inplace
df.query("Courses == 'Spark'",inplace=True)

# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("Fee >= 23000")
df.query("Fee >= 23000 and Fee <= 24000")
# Note: column names containing spaces must be backtick-quoted,
# e.g. df.query("`Courses Fee` >= 23000")

# Other ways to delete rows
value = 'Spark'
values = ['Spark','PySpark']
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]

df[df["Courses"] == 'Spark'] 
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]

# Using a row-wise lambda with apply()
df[df.apply(lambda row: row['Courses'] in ['Spark','PySpark'], axis=1)]
df.dropna()

Based on the Inverse of Column Values

To delete rows from a DataFrame where the value in the Courses column is not equal to PySpark, invert the boolean condition with the tilde (~) operator.


# Delete rows based on inverse of column values
df1 = df[~(df['Courses'] == "PySpark")].index 
df.drop(df1, inplace = True)
print("Drop rows based on column value:\n", df)

# Output:
# Drop rows based on column value:
#    Courses      Fee Duration  Discount
# 1  PySpark  25000.0     None    2300.0

The above code will drop rows from the DataFrame df where the value in the Courses column is not equal to PySpark. It first finds the index of such rows using the boolean condition and then drops those rows using the drop() method.

Complete Example


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python"],
    'Fee' :[22000,25000,np.nan,24000],
    'Duration':['30days',None,'55days',np.nan],
    'Discount':[1000,2300,1000,np.nan]
          }
df = pd.DataFrame(technologies)
print(df)

# Using drop() to remove rows
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
print(df)

# Remove rows
df = pd.DataFrame(technologies)
df2 = df[df.Fee >= 24000]
print(df2)

# Reset index after deleting rows
df2 = df[df.Fee >= 24000].reset_index()
print(df2)

# If a column name contains a space, surround it
# with quotes when selecting, for example:
# df2 = df[df['column name'] >= 24000]

# Using loc
df2 = df.loc[df["Fee"] >= 24000 ]
print(df2)

# Delete rows based on multiple column values
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
print(df2)

# Drop rows with None/NaN
df2 = df[df.Discount.notnull()]
print(df2)

Pandas – Drop Columns by Label | Index https://shishirkant.com/pandas-drop-columns-by-label-index/ Sun, 02 Mar 2025 08:20:59 +0000

How do you drop column(s) by index in Pandas? You can use the drop() function in Pandas to remove columns by index. Set the axis parameter to 1 (indicating columns) and specify either a single column index or a list of column indices you want to remove.

In this article, I will explain how to drop column(s) by index in multiple ways, using the DataFrame.drop() function and the DataFrame.loc[] and DataFrame.iloc[] indexers together with the columns property.

Key Points –

  • Use the DataFrame.drop() method with the axis parameter set to 1 to drop columns by index.
  • Specify the column index or a list of column indices to drop multiple columns at once.
  • Dropping columns by index returns a new DataFrame by default; the original is modified only when the inplace parameter is set to True.
  • Provide either a single-column index or a list of column indices to be dropped.
  • To drop multiple columns, pass a list containing the indices of the columns you want to drop.

Quick Examples of Dropping Columns by Index

Below are some quick examples of dropping column(s) by an index in pandas.


# Quick examples of dropping columns by index

# Example 1: Using DataFrame.drop() method
df2=df.drop(df.columns[1], axis=1)

# Example 2: Drop the second column with param inplace = True
df.drop(df.columns[1], axis=1, inplace = True)
                    
# Example 3: Drop columns based on column index
df2 = df.drop(df.columns[[0, 1, 2]],axis = 1)

# Example 4: Drop column of index 
# Using DataFrame.iloc[] and drop() methods
df2 = df.drop(df.iloc[:, 1:3],axis = 1)

# Example 5: Drop columns by labels 
# Using DataFrame.loc[] and drop() methods
df2 = df.drop(df.loc[:, 'Courses':'Fee'].columns,axis = 1)

To run some examples of drop column(s) by index. let’s create DataFrame using data from a dictionary.


# Create a Pandas DataFrame.
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","Spark","PySpark","JAVA","Hadoop",".Net","Python","AEM","Oracle","SQL DBA","C","WebTechnologies"],
    'Fee' :[22000,25000,23000,24000,26000,30000,27000,28000,35000,32000,20000,15000],
    'Duration':['30days','35days','40days','45days','50days','55days','60days','35days','30days','40days','50days','55days']
          }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

# Output:
# Create DataFrame:
#              Courses    Fee Duration
# 0              Spark  22000   30days
# 1              Spark  25000   35days
# 2            PySpark  23000   40days
# 3               JAVA  24000   45days
# 4             Hadoop  26000   50days
# 5               .Net  30000   55days
# 6             Python  27000   60days
# 7                AEM  28000   35days
# 8             Oracle  35000   30days
# 9            SQL DBA  32000   40days
# 10                 C  20000   50days
# 11   WebTechnologies  15000   55days

Using DataFrame.drop() Column by Index

You can use the DataFrame.drop() function to remove a column by index. The drop() function drops specified labels from rows or columns. Remove rows or columns by specifying label names and the corresponding axis, or by specifying the column index directly. When using a MultiIndex, labels on different levels can be removed by specifying the level.


# Using DataFrame.drop() method
df2=df.drop(df.columns[1], axis=1)
print("After dropping the second column:\n", df2)

Yields below output. This deletes the second column (Fee), as the index starts from 0.

# Output:
#              Courses Duration
# 0              Spark   30days
# 1              Spark   35days
# 2            PySpark   40days
# 3               JAVA   45days
# 4             Hadoop   50days
# 5               .Net   55days
# 6             Python   60days
# 7                AEM   35days
# 8             Oracle   30days
# 9            SQL DBA   40days
# 10                 C   50days
# 11   WebTechnologies   55days

If you want to update the existing DataFrame instead of creating a new one after dropping a column, use the inplace=True parameter of the drop() function. With inplace=True, drop() modifies the original DataFrame in place and returns None.


# Drop the second column with param inplace = True
df.drop(df.columns[1], axis=1, inplace = True)
print("After dropping the second column:\n", df)

Yields below output.


# Output:
After dropping the second column:
             Courses Duration
0             Spark   30days
1             Spark   35days
2           PySpark   40days
3              JAVA   45days
4            Hadoop   50days
5              .Net   55days
6            Python   60days
7               AEM   35days
8            Oracle   30days
9           SQL DBA   40days
10                C   50days
11  WebTechnologies   55days

Drop Multiple Columns By Index

In this section, you’ll learn how to drop multiple columns by index. You can use df.columns[[index1, index2, indexn]] to identify the list of column names in that index position and pass that list to the drop method.


# Drop multiple columns based on column index
df2 = df.drop(df.columns[[1, 2]],axis = 1)
print("After dropping multiple columns:\n", df2)

Yields below output.


# Output:
After dropping multiple columns:
            Courses
0             Spark
1             Spark
2           PySpark
3              JAVA
4            Hadoop
5              .Net
6            Python
7               AEM
8            Oracle
9           SQL DBA
10                C
11  WebTechnologies

Drop Columns Using DataFrame.iloc[] and drop() Methods

To drop columns using DataFrame.iloc[] and drop() methods, you specify the index positions of the columns you want to drop using iloc[]. For instance, df.iloc[:, 1:3] selects all rows (:) and columns from index position 1 up to, but not including, index position 3. This selects columns at index positions 1 and 2, which are Fee and Duration. Then you can use df.drop() to drop these selected columns along the specified axis (axis=1 for columns).


# Drop column of index 
# Using DataFrame.iloc[] and drop() methods
df2 = df.drop(df.iloc[:, 1:3],axis = 1)
print("After dropping multiple columns:\n", df2)                     

Yields below output.


After dropping multiple columns:
# Output:
            Courses
0             Spark
1             Spark
2           PySpark
3              JAVA
4            Hadoop
5              .Net
6            Python
7               AEM
8            Oracle
9           SQL DBA
10                C
11  WebTechnologies
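A slightly more direct spelling of the same operation, in my view, is to slice df.columns for the labels instead of selecting a sub-DataFrame with iloc first. A minimal sketch on a trimmed-down DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark"],
    "Fee": [22000, 25000],
    "Duration": ["30days", "40days"],
})

# df.columns[1:3] yields the labels at positions 1 and 2
# ('Fee' and 'Duration'), which drop() accepts directly.
df2 = df.drop(columns=df.columns[1:3])
print(df2.columns.tolist())  # ['Courses']
```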

Drop Columns of Index Using DataFrame.loc[] and drop() Methods

Similarly, to drop columns using DataFrame.loc[] and the drop() method, you specify the range of column labels you want to drop using loc[]. For instance, df.loc[:, 'Courses':'Fee'] selects all rows (:) and columns from Courses to Fee using label-based indexing; .columns returns the column labels within the specified range. Then you can use df.drop() to drop the columns obtained from the previous step along the specified axis (axis=1 for columns).


# Drop columns of index 
# Using DataFrame.loc[] and drop() methods.
df2 = df.drop(df.loc[:, 'Courses':'Fee'].columns,axis = 1)
print("After dropping multiple columns:\n", df2)

# Drop columns of index 
# Using DataFrame.loc[] and drop() methods
columns_to_drop = df.loc[:, 'Courses':'Fee'].columns
df2 = df.drop(columns_to_drop, axis=1)
print("After dropping multiple columns:\n", df2)

Yields below output.


# Output:
   Duration
0    30days
1    35days
2    40days
3    45days
4    50days
5    55days
6    60days
7    35days
8    30days
9    40days
10   50days
11   55days

Complete Examples of Drop Columns By Index


# Create a Pandas DataFrame.
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","Spark","PySpark","JAVA","Hadoop",".Net","Python","AEM","Oracle","SQL DBA","C","WebTechnologies"],
    'Fee' :[22000,25000,23000,24000,26000,30000,27000,28000,35000,32000,20000,15000],
    'Duration':['30days','35days','40days','45days','50days','55days','60days','35days','30days','40days','50days','55days']
          }
df = pd.DataFrame(technologies)
print(df)

# Using DataFrame.drop() method.
df2=df.drop(df.columns[1], axis=1)
print(df2)

# Drop Multiple Columns by labels.
df2 = df.drop(['Courses', 'Duration'],axis = 1)
print(df2)
                    
# Drop columns based on column index.
df2 = df.drop(df.columns[[0, 1, 2]],axis = 1)
print(df2)

# Drop column by index using DataFrame.iloc[] and drop() methods.
df2 = df.drop(df.iloc[:, 1:3],axis = 1)
print(df2)

# Drop columns by labels using DataFrame.loc[] and drop() methods.
df2 = df.drop(df.loc[:, 'Courses':'Fee'].columns,axis = 1)
print(df2)

FAQ on Drop Columns by Index

How do I drop a single column by its index in a Pandas DataFrame?

Use the DataFrame.drop() method with the column label selected by its index. For example: df.drop(columns=df.columns[index]) or, equivalently, df.drop(df.columns[index], axis=1).

How can I drop multiple non-contiguous columns by index at once?

You can drop multiple non-contiguous columns by specifying a list of column indexes within the df.columns accessor. For example, df = df.drop(df.columns[[0, 2, 4]], axis=1)

How can I remove multiple columns using their indices?

Select the labels by index and pass them to the columns parameter in the drop() method. For example: df.drop(columns=[df.columns[index1], df.columns[index2]]).

What if I want to drop columns by index using column names?

If you prefer to use column names instead of indexes, you can directly specify the column names within the drop() method. For example, df = df.drop(['column1', 'column3'], axis=1)

Will the DataFrame be modified in place when dropping columns by index?

By default, drop() does not modify the DataFrame in place. To modify the DataFrame in place, set the inplace parameter to True.
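A small sketch illustrating the difference (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# Without inplace: drop() returns a new DataFrame.
df2 = df.drop(df.columns[0], axis=1)

# With inplace=True: df itself is modified, drop() returns None.
df.drop(df.columns[0], axis=1, inplace=True)

print(df2.columns.tolist())  # ['B', 'C']
print(df.columns.tolist())   # ['B', 'C']
```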

Pandas – Drop Rows by Label | Index https://shishirkant.com/pandas-drop-rows-by-label-index/ Tue, 28 Jan 2025 16:11:01 +0000

By using the pandas.DataFrame.drop() method, you can drop/remove/delete rows from a DataFrame. The axis param is used to specify which axis you would like to remove: by default axis=0, meaning rows. Use axis=1 or the columns param to remove columns. By default, Pandas returns a copy of the DataFrame after deleting rows; use inplace=True to remove rows from the existing DataFrame.

In this article, I will cover how to remove rows by labels, indexes, and ranges, how to drop rows in place, and how to drop rows with None/NaN values, with examples. If you have duplicate rows, use drop_duplicates() to drop duplicate rows from a Pandas DataFrame.

Key Points –

  • Use the drop() method to remove rows by specifying the row labels or indices.
  • Set the axis parameter to 0 (or omit it) to indicate that rows should be dropped.
  • Use the inplace parameter to modify the original DataFrame directly without creating a new one.
  • After dropping rows, consider resetting the index with reset_index() to maintain sequential indexing.
  • Set the errors parameter to ‘ignore’ to suppress errors when attempting to drop non-existent row labels.
  • Leverage the query() method to filter and drop rows based on complex conditions.

Pandas.DataFrame.drop() Syntax – Drop Rows & Columns

Let’s know the syntax of the DataFrame drop() function.


# Pandas DataFrame drop() syntax
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Parameters

  • labels – Single label or list-like. It’s used with axis param.
  • axis – Defaults to 0. Use 1 to drop columns and 0 to drop rows.
  • index – Use to specify rows. Accepts single label or list-like.
  • columns – Use to specify columns. Accepts single label or list-like.
  • level – int or level name, optional, use for Multiindex.
  • inplace – Default False, returns a copy of the DataFrame. When True, it drops rows/columns inplace (on the current DataFrame) and returns None.
  • errors – {‘ignore’, ‘raise’}, default ‘raise’.
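The errors param can be illustrated with a tiny hypothetical DataFrame: with the default errors='raise', dropping a label that does not exist raises a KeyError, while errors='ignore' silently skips it.

```python
import pandas as pd

df = pd.DataFrame({'A': [1]}, index=['r1'])

# Default errors='raise': dropping a missing label raises KeyError
try:
    df.drop(['r9'])
except KeyError as e:
    print("KeyError:", e)

# errors='ignore': missing labels are silently skipped
df2 = df.drop(['r9'], errors='ignore')
print(df2.shape)  # unchanged: (1, 1)
```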

Let’s create a DataFrame, run some examples, and explore the output. Note that our DataFrame contains index labels for rows which I am going to use to demonstrate removing rows by labels.


# Create a DataFrame
import pandas as pd
import numpy as np

technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python"],
    'Fee' :[20000,25000,26000,22000],
    'Duration':['30day','40days',np.nan, None],
    'Discount':[1000,2300,1500,1200]
               }

indexes=['r1','r2','r3','r4']
df = pd.DataFrame(technologies,index=indexes)
print(df)

Yields below output.

pandas drop rows

Pandas Drop Rows From DataFrame Examples

By default drop() method removes rows (axis=0) from DataFrame. Let’s see several examples of how to remove rows from DataFrame.

Drop rows by Index Labels or Names

One of the Panda’s advantages is you can assign labels/names to rows, similar to column names. If you have DataFrame with row labels (index labels), you can specify what rows you want to remove by label names.


# Drop rows by Index Label
df = pd.DataFrame(technologies,index=indexes)
df1 = df.drop(['r1','r2'])
print("Drop rows from DataFrame:\n", df1)

Yields below output.

pandas drop rows

Alternatively, you can write the same statement by using the index parameter.


# Delete Rows by Index Labels
df1 = df.drop(index=['r1','r2'])

And by using labels and axis as below.


# Delete Rows by Index Labels & axis
df1 = df.drop(labels=['r1','r2'])
df1 = df.drop(labels=['r1','r2'],axis=0)

Notes:

  • As you can see, using labels with axis=0 is equivalent to using index=label names.
  • axis=0 means rows. By default the drop() method considers axis=0, hence you don’t have to specify it to remove rows. To remove columns, explicitly specify axis=1 or the columns param.
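For contrast with the row examples, a quick sketch of the column case (hypothetical one-row data; both statements are equivalent):

```python
import pandas as pd

df = pd.DataFrame({'Courses': ['Spark'], 'Fee': [20000], 'Discount': [1000]})

# Both statements remove the 'Discount' column
df1 = df.drop('Discount', axis=1)
df2 = df.drop(columns='Discount')
print(df1.columns.tolist())  # ['Courses', 'Fee']
```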

Drop Rows by Index Number (Row Number)

Similarly, by using the drop() method you can also remove rows by index position from a pandas DataFrame. drop() doesn’t take a positional index as a param, hence we need to get the row labels from the index and pass these to the drop() method. We will use df.index to get the row labels for the indexes we want to delete.

  • df.index.values returns all row labels as an array.
  • df.index[[1,3]] gets the row labels for the 2nd and 4th rows; passing these to the drop() method removes those rows. Note that in Python, the list index starts from zero.

# Delete Rows by Index numbers
df = pd.DataFrame(technologies,index=indexes)
df1=df.drop(df.index[[1,3]])
print(df1)

This removes the 2nd and 4th rows (r2 and r4). In order to drop the first row, you can use df.drop(df.index[0]), and to drop the last row use df.drop(df.index[-1]).


# Removes First Row
df=df.drop(df.index[0])

# Removes Last Row
df=df.drop(df.index[-1])

Delete Rows by Index Range

You can also remove rows by specifying an index range. The below example removes all rows starting from the 3rd row.


# Delete Rows by Index Range
df = pd.DataFrame(technologies,index=indexes)
df1=df.drop(df.index[2:])
print(df1)

Yields below output.


# Output:
    Courses    Fee Duration  Discount
r1    Spark  20000    30day      1000
r2  PySpark  25000   40days      2300

Delete Rows when you have Default Index

By default, pandas assigns a sequence number to all rows, also called the index; the row index starts from zero and increments by 1 for every row. If you are not using custom index labels, pandas DataFrame assigns sequence numbers as the index. To remove rows with the default index, you can try the below examples.


# Remove rows when you have default index.
df = pd.DataFrame(technologies)
df1 = df.drop(0)
df3 = df.drop([0, 3])
df4 = df.drop(range(0,2))

Note that df.drop(-1) doesn’t remove the last row because the -1 label is not present in the DataFrame index. You can still use df.drop(df.index[-1]) to remove the last row.

Remove DataFrame Rows Inplace

All the examples above return a copy of the DataFrame after removing rows. If you want to remove rows inplace from the referring DataFrame, use inplace=True. By default, the inplace param is set to False.


# Delete Rows inplace
df = pd.DataFrame(technologies,index=indexes)
df.drop(['r1','r2'],inplace=True)
print(df)

Drop Rows by Checking Conditions

Most of the time you would also need to remove DataFrame rows based on some condition (column value); you can do this by using boolean indexing with loc[] and iloc[]. The below example keeps the rows where Discount >= 1500, effectively dropping the rest.


# Delete Rows by Checking Conditions
df = pd.DataFrame(technologies)
df1 = df.loc[df["Discount"] >=1500 ]
print(df1)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
1  PySpark  25000   40days      2300
2   Hadoop  26000      NaN      1500

Drop Rows that have NaN/None/Null Values

While working with analytics, you would often be required to clean up data that has None, Null & np.NaN values. By using df.dropna() you can remove rows with NaN values from a DataFrame.


# Delete rows with Nan, None & Null Values
df = pd.DataFrame(technologies,index=indexes)
df2=df.dropna()
print(df2)

This removes all rows that have None, Null & NaN values on any columns.


# Output:
    Courses    Fee Duration  Discount
r1    Spark  20000    30day      1000
r2  PySpark  25000   40days      2300
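dropna() can also be restricted to specific columns, or to a minimum number of non-missing values per row. A sketch with a cut-down version of the data (the subset and thresh parameter names are standard dropna() arguments):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python"],
    'Duration': ['30day', '40days', np.nan, None],
    'Discount': [1000, 2300, 1500, 1200],
})

# Only consider the 'Duration' column when deciding which rows to drop
df2 = df.dropna(subset=['Duration'])
print(df2)

# thresh: keep rows having at least N non-NA values
df3 = df.dropna(thresh=3)
```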

Remove Rows by Slicing DataFrame

You can also drop a list of DataFrame rows by slicing. Remember index starts from zero.


# Remove Rows by Slicing DataFrame
df2=df[4:]     # Returns rows from 4th row
df2=df[1:-1]   # Removes first and last row
df2=df[2:4]    # Return rows between 2 and 4
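As mentioned in the key points, dropping rows leaves gaps in a default integer index; reset_index() restores a sequential one. A minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'Courses': ['Spark', 'PySpark', 'Hadoop', 'Python']})

# Dropping rows leaves gaps in the default index
df2 = df.drop([0, 2])
print(df2.index.tolist())  # [1, 3]

# reset_index(drop=True) restores a clean sequential index
df2 = df2.reset_index(drop=True)
print(df2.index.tolist())  # [0, 1]
```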
]]>
4370
Pandas – Rename Column https://shishirkant.com/pandas-rename-column/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-rename-column Tue, 28 Jan 2025 16:06:46 +0000 https://shishirkant.com/?p=4367 Pandas DataFrame.rename() function is used to change a single column name, multiple column names, rename by index position, in place, with a list, with a dict, and to rename all columns, etc. We are often required to change the column names of a DataFrame before we perform any operations. In fact, changing the name of a column is one of the most searched and used functions in Pandas.


The good thing about this function is it provides a way to rename a specific single column.

In this pandas article, you will learn several ways to rename a column name of a Pandas DataFrame with examples by using functions like DataFrame.rename(), DataFrame.set_axis(), DataFrame.add_prefix(), DataFrame.add_suffix() and more.

Key Points –

  • Use the rename() method to rename columns, specifying a dictionary that maps old column names to new ones.
  • The rename() method does not modify the original DataFrame by default; to apply changes directly, set the inplace parameter to True.
  • Column renaming is case-sensitive, so ensure the exact match of column names when renaming.
  • To rename a single column, use the columns parameter in the rename() method with a dictionary that maps only the specific column.
  • Use the DataFrame.columns attribute to directly modify column names by assigning a new list of names. Ensure the new list has the same length as the original.

1. Quick Examples Rename Columns of DataFrame

If you are in a hurry, below are some quick examples of renaming column names in Pandas DataFrame.

Pandas Rename Scenario – Rename Column Example
  • Rename columns with list – df.columns=['A','B','C']
  • Rename column name by index – df.columns.values[2] = "C"
  • Rename the column using Dict – df2=df.rename(columns={'a': 'A', 'b': 'B'})
  • Rename column using Dict & axis – df2=df.rename({'a': 'A', 'b': 'B'}, axis=1)
  • Rename column using Dict & axis='columns' – df2=df.rename({'a': 'A', 'b': 'B'}, axis='columns')
  • Rename column in place – df.rename(columns={'a': 'A', 'b': 'B'}, inplace=True)
  • Rename using lambda function – df.rename(columns=lambda x: x[1:], inplace=True)
  • Rename with error – df.rename(columns={'x':'X'}, errors="raise")
  • Rename using set_axis() – df2=df.set_axis(['A','B','C'], axis=1)

Pandas Rename Column(s) Examples

Now let’s see the Syntax and examples.

2. Syntax of Pandas DataFrame.rename()

Following is the syntax of the pandas.DataFrame.rename() method; this returns either a DataFrame or None. By default it returns a pandas DataFrame after renaming columns. When using inplace=True it updates the existing DataFrame in place (self) and returns None.


#  DataFrame.rename() Syntax
DataFrame.rename(mapper=None, index=None, columns=None, axis=None, 
       copy=True, inplace=False, level=None, errors='ignore')

2.1 Parameters

The following are the parameters.

  • mapper – dictionary or function to rename columns and indexes.
  • index – dictionary or function to rename index. Using (mapper, axis=0) is equivalent to index=mapper.
  • columns – dictionary or function to rename columns. Using (mapper, axis=1) is equivalent to columns=mapper.
  • axis – Value can be either 0 or ‘index’ | 1 or ‘columns’. Defaults to 0.
  • copy – Copies the data as well. Default set to True.
  • inplace – Used to specify the DataFrame referred to be updated. Default to False. When used True, copy property will be ignored.
  • level – Used with MultiIndex. Takes Integer value. Default set to None.
  • errors – Take values raise or ignore. if ‘raise’ is used, raise a KeyError when a dict-like mapper, index, or column contains labels that are not present in the Index being transformed. If ‘ignore’ is used, existing keys will be renamed and extra keys will be ignored. Default set to ignore.

Let’s create a DataFrame with a dictionary of lists; our pandas DataFrame contains column names Courses, Fee, Duration.


import pandas as pd
technologies = ({
  'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
  'Fee' :[20000,25000,26000,22000,24000,21000,22000],
  'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
              })
df = pd.DataFrame(technologies)
print(df.columns)

Yields below output.

pandas rename column

3. Pandas Rename Column Name

In order to rename a single column name on pandas DataFrame, you can use column={} parameter with the dictionary mapping of the old name and a new name. Note that when you use column param, you cannot explicitly use axis param.

pandas DataFrame.rename() accepts a dict(dictionary) as a param for columns you want to rename, so you just pass a dict with a key-value pair; the key is an existing column you would like to rename and the value would be your preferred column name.


# Rename a Single Column 
df2=df.rename(columns = {'Courses':'Courses_List'})
print(df2.columns)

Yields below output. As you see it rename the column from Courses to Courses_List.

pandas rename columns

Alternatively, you can also write the above statement by using axis=1 or axis='columns'.


# Alternatively you can write above code using axis
df2=df.rename({'Courses':'Courses_List'}, axis=1)
df2=df.rename({'Courses':'Courses_List'}, axis='columns')

In order to change columns on the existing DataFrame without copying to the new DataFrame, you have to use inplace=True.


# Replace existing DataFrame (inplace). This returns None
df.rename({'Courses':'Courses_List'}, axis='columns', inplace=True)
print(df.columns)

4. Rename Multiple Columns

You can also use the same approach to rename multiple columns of Pandas DataFrame. All you need to specify multiple columns you want to rename in a dictionary mapping.


# Rename multiple columns
df.rename(columns = {'Courses':'Courses_List','Fee':'Courses_Fee', 
   'Duration':'Courses_Duration'}, inplace = True)
print(df.columns)

Yields below output. As you see it renames multiple columns.

5. Rename the Column by Index or Position

To rename a column by index/position, you can use df.columns.values[index]='value'. Index and position can be used interchangeably to access columns at a given position. Using this you can rename any column, from the first to the last.

As you can see from the above, df.columns returns the column names as a pandas Index and df.columns.values gets the column names as an array; you can then set a specific index/position to a new value. The below example updates the column at index 2 (Duration) to Courses_Duration. Note that the column index starts from zero.


# Pandas rename column by index
df.columns.values[2] = "Courses_Duration"
print(df.columns)

# Output:
# Index(['Courses', 'Fee', 'Courses_Duration'], dtype='object')

6. Rename Columns with a List

A Python list can be used to rename all columns in a pandas DataFrame; the length of the list should be the same as the number of columns in the DataFrame. Otherwise, a ValueError is raised.


# Rename columns with list
column_names = ['Courses_List','Courses_Fee','Courses_Duration']
df.columns = column_names
print(df.columns)

Yields below output.


# Output:
Index(['Courses_List', 'Courses_Fee', 'Courses_Duration'], dtype='object')
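The length-mismatch error mentioned above can be sketched like this (hypothetical three-column DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'Courses': ['Spark'], 'Fee': [20000], 'Duration': ['30day']})

# Assigning a list of the wrong length raises ValueError
try:
    df.columns = ['A', 'B']  # DataFrame has 3 columns
except ValueError as e:
    print("ValueError:", e)

# The list length must match the number of columns
df.columns = ['A', 'B', 'C']
print(df.columns.tolist())
```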

7. Rename Columns Inplace

By default rename() function returns a new pandas DataFrame after updating the column name, you can change this behavior and rename in place by using inplace=True param.


# Rename multiple columns
df.rename(columns = {'Courses':'Courses_List','Fee':'Courses_Fee', 
   'Duration':'Courses_Duration'}, inplace = True)
print(df.columns)

This renames column names on DataFrame in place and returns the None type.

8. Rename All Columns by adding Suffixes or Prefix

Sometimes you may need to add a string text to the suffix or prefix of all column names. You can do this by getting all columns one by one in a loop and adding a suffix or prefix string.


# Rename All Column Names by adding Suffix or Prefix
df.columns = column_names
df.columns = ['col_'+str(col) for col in df.columns]

You can also use pandas.DataFrame.add_prefix() and pandas.DataFrame.add_suffix() to add prefixes and suffixes respectively to the pandas DataFrame column names.


# Add prefix to the column names
df2=df.add_prefix('col_')
print(df2.columns)

# Add suffix to the column names
df2=df.add_suffix('_col')
print(df2.columns)

Yields below output.


# Output:
Index(['col_Courses', 'col_Fee', 'col_Duration'], dtype='object')
Index(['Courses_col', 'Fee_col', 'Duration_col'], dtype='object')

9. Rename the Column using the Lambda Function

You can also change the column name using the Pandas lambda expression, This gives us more control and applies custom functions. The below examples add a ‘col_’ string to all column names. You can also try removing spaces from columns etc.


# Rename using Lambda function
df.rename(columns=lambda x: 'col_'+x, inplace=True)
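The "removing spaces" idea mentioned above can be sketched with a lambda as well (hypothetical column names containing stray whitespace):

```python
import pandas as pd

df = pd.DataFrame(columns=[' Courses ', 'Course Fee'])

# Strip surrounding whitespace and replace inner spaces with underscores
df.rename(columns=lambda x: x.strip().replace(' ', '_'), inplace=True)
print(df.columns.tolist())  # ['Courses', 'Course_Fee']
```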

10. Rename or Convert All Columns to Lower or Upper Case

When column names are mixed with lower and upper case, it would be best practice to convert/update all column names to either lower or upper case.


# Change to all lower case
df = pd.DataFrame(technologies)
df2=df.rename(str.lower, axis='columns')
print(df2.columns)

# Change to all upper case
df = pd.DataFrame(technologies)
df2=df.rename(str.upper, axis='columns')
print(df2.columns)

Yields below output.


# Output:
Index(['courses', 'fee', 'duration'], dtype='object')
Index(['COURSES', 'FEE', 'DURATION'], dtype='object')

11. Change Column Names Using DataFrame.set_axis()

By using DataFrame.set_axis() you can also change the column names. Note that with set_axis() you need to assign all column names; this updates the DataFrame with a new set of column names. set_axis() is also used to rename the pandas DataFrame index.


# Change column names using set_axis()
# (the inplace param for set_axis() was removed in pandas 2.0,
# so assign the result back instead)
df = df.set_axis(['Courses_List', 'Course_Fee', 'Course_Duration'], axis=1)
print(df.columns)

12. Using String replace()

The Pandas str.replace() method is used to replace a substring in DataFrame column names (and more generally strings in a Series). This is a very rich function as it has many variations. For example, df.columns.str.replace("Fee", "Courses_Fee") replaces the column name 'Fee' with 'Courses_Fee'.


# Change column name using String.replace()
df.columns = df.columns.str.replace("Fee","Courses_Fee")
print(df.columns)

Yields below output.


# Output:
Index(['Courses', 'Courses_Fee', 'Duration'], dtype='object')

To replace all column names in a DataFrame using the str.replace() method, you can define a DataFrame with column names ('Courses_List', 'Courses_Fee', 'Courses_Duration') and then apply str.replace() over the DataFrame columns to replace underscores ("_") with white space (" ").


# Rename all column names
df.columns = df.columns.str.replace("_"," ")
print(df.columns)

Yields below output.


# Output:
Index(['Courses List', 'Course Fee', 'Course Duration'], dtype='object')

13. Raise Error when Column Not Found

By default, when a column label to rename is not found in the Pandas DataFrame, the rename() method just ignores it. In case you want to throw an error when a column is not found, use errors="raise".


# Throw Error when Rename column doesn't exists.
df.rename(columns = {'Cour':'Courses_List'}, errors = "raise")

Yields error message raise KeyError("{} not found in axis".format(missing_labels)).


# Output:
raise KeyError("{} not found in axis".format(missing_labels))
KeyError: "['Cour'] not found in axis"

14. Rename Only If the Column Exists

This example changes the Courses column to Courses_List and it doesn’t update Fees as we don’t have the Fees column. Note that even though the Fees column does not exist it didn’t raise errors even when we used errors="raise".


# Change column only if column exists.
df = pd.DataFrame(technologies)
d={'Courses':'Courses_List','Fees':'Courses_fees'}
df.rename(columns={k: v for k, v in d.items() if k in df.columns}, inplace=True,errors = "raise")
print(df.columns)
]]>
4367
Pandas – Add New Column https://shishirkant.com/pandas-add-new-column/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-add-new-column Tue, 28 Jan 2025 16:02:24 +0000 https://shishirkant.com/?p=4362  In Pandas, you can add a new column to an existing DataFrame using the DataFrame.insert() function, which updates the DataFrame in place. Alternatively, you can use DataFrame.assign() to insert a new column, but this method returns a new DataFrame with the added column.


In this article, I will cover examples of adding multiple columns, adding a constant value, deriving new columns from an existing column to the Pandas DataFrame.

Key Points –

  • A new column can be created by assigning values directly to a new column name with df['new_column'] = values.
  • The assign() method adds columns and returns a modified copy of the DataFrame, leaving the original DataFrame unchanged unless reassigned.
  • Adding a column directly modifies the DataFrame in place, while using assign() creates a new DataFrame.
  • Lambda functions within assign() enable complex calculations or conditional logic to define values for the new column.
  • The insert() method allows adding a new column at a specific position within the DataFrame, providing flexibility for organizing columns.
  • Using functions like np.where() or apply(), you can populate a new column based on conditional values.
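The np.where()/apply() idea from the last key point can be sketched like this (illustrative data; the 'Deal' column names are hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Courses': ['Spark', 'PySpark'], 'Discount': [1000, 2300]})

# New column from a vectorized condition on an existing column
df['Deal'] = np.where(df['Discount'] >= 2000, 'Big', 'Small')

# Same idea with apply() and an element-wise function
df['Deal2'] = df['Discount'].apply(lambda d: 'Big' if d >= 2000 else 'Small')
print(df)
```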

Quick Examples of Adding Column

If you are in a hurry, below are some quick examples of adding column to pandas DataFrame.


# Quick examples of add column to dataframe

# Add new column to the dataframe
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df2 = df.assign(TutorsAssigned=tutors)

# Add a multiple columns to the dataframe
MNCCompanies = ['TATA','HCL','Infosys','Google','Amazon']
df2 =df.assign(MNCComp = MNCCompanies,TutorsAssigned=tutors )

# Derive new Column from existing column
df = pd.DataFrame(technologies)
df2=df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)

# Add a constant or empty value to the DataFrame
df = pd.DataFrame(technologies)
df2=df.assign(A=None,B=0,C="")

# Add new column to the existing DataFrame
df = pd.DataFrame(technologies)
df["MNCCompanies"] = MNCCompanies

# Add new column at the specific position
df = pd.DataFrame(technologies)
df.insert(0,'Tutors', tutors )

# Add new column by mapping to the existing column
df = pd.DataFrame(technologies)
tutors = {"Spark":"William", "PySpark":"Henry", "Hadoop":"Michael","Python":"John", "pandas":"Messi"}
df['Tutors'] = df['Courses'].map(tutors)
print(df)

To run some examples of adding column to DataFrame, let’s create DataFrame using data from a dictionary.


# Create DataFrame
import pandas as pd
import numpy as np

technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Discount':[1000,2300,1000,1200,2500]
          }

df = pd.DataFrame(technologies)
print("Create a DataFrame:\n", df)

Yields below output.

Pandas Add Column DataFrame

Add Column to DataFrame

DataFrame.assign() is used to add/append a column to the DataFrame, this method generates a new DataFrame incorporating the added column, while the original DataFrame remains unchanged.

Below is the syntax of the assign() method.


# Syntax of DataFrame.assign()
DataFrame.assign(**kwargs)

Now let’s add a column 'TutorsAssigned' to the DataFrame. Using assign() we cannot modify the existing DataFrame in place; instead it returns a new DataFrame after adding a column. The below example adds a list of values as a new column to the DataFrame.


# Add new column to the DataFrame
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df2 = df.assign(TutorsAssigned=tutors)
print("Add column to DataFrame:\n", df2)

Yields below output.

Pandas Add Column DataFrame

Add Multiple Columns to the DataFrame

You can add multiple columns to a Pandas DataFrame by using the assign() function.


# Add multiple columns to the DataFrame
MNCCompanies = ['TATA','HCL','Infosys','Google','Amazon']
df2 = df.assign(MNCComp = MNCCompanies,TutorsAssigned=tutors )
print("Add multiple columns to DataFrame:\n", df2)

Yields below output.


# Output:
# Add multiple columns to DataFrame:
    Courses    Fee  Discount  MNCComp TutorsAssigned
0    Spark  22000      1000     TATA        William
1  PySpark  25000      2300      HCL          Henry
2   Hadoop  23000      1000  Infosys        Michael
3   Python  24000      1200   Google           John
4   Pandas  26000      2500   Amazon          Messi

Adding a Column From Existing

In real-time scenarios, there’s often a need to compute and add new columns to a dataset based on existing ones. The following demonstration calculates the Discount_Percent column based on Fee and Discount. In this instance, I’ll utilize a lambda function to generate a new column from the existing data.


# Derive New Column from Existing Column
df = pd.DataFrame(technologies)
df2 = df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)
print("Add column to DataFrame:\n", df2)

You can explore deriving multiple columns and appending them to a DataFrame within a single statement. This example yields the below output.


# Output:
# Add column to DataFrame:
   Courses    Fee  Discount  Discount_Percent
0    Spark  22000      1000          220000.0
1  PySpark  25000      2300          575000.0
2   Hadoop  23000      1000          230000.0
3   Python  24000      1200          288000.0
4   Pandas  26000      2500          650000.0

Add a Constant or Empty Column

The below example adds 3 new columns to the DataFrame, one column with all None values, a second column with 0 value, and the third column with an empty string value.


# Add a constant or empty value to the DataFrame.
df = pd.DataFrame(technologies)
df2=df.assign(A=None,B=0,C="")
print("Add column to DataFrame:\n", df2)

Yields below output.


# Output:
# Add column to DataFrame:
    Courses    Fee  Discount     A  B C
0    Spark  22000      1000  None  0  
1  PySpark  25000      2300  None  0  
2   Hadoop  23000      1000  None  0  
3   Python  24000      1200  None  0  
4   Pandas  26000      2500  None  0  

Append Column to Existing Pandas DataFrame

The above examples create a new DataFrame after adding new columns instead of appending a column to an existing DataFrame. The example explained in this section is used to append a new column to the existing DataFrame.


# Add new column to the existing DataFrame
df = pd.DataFrame(technologies)
df["MNCCompanies"] = MNCCompanies
print("Add column to DataFrame:\n", df)

Yields below output.


# Output:
# Add column to DataFrame:
   Courses    Fee  Discount MNCCompanies
0    Spark  22000      1000         TATA
1  PySpark  25000      2300          HCL
2   Hadoop  23000      1000      Infosys
3   Python  24000      1200       Google
4   Pandas  26000      2500       Amazon

You can also use this approach to add a new column derived from an existing column.


# Derive a new column from an existing column
df['Discount_Percent'] = df['Fee'] * df['Discount'] / 100
print("Add column to DataFrame:\n", df['Discount_Percent'])

# Output:
# Add column to DataFrame:
# 0    220000.0
# 1    575000.0
# 2    230000.0
# 3    288000.0
# 4    650000.0
# Name: Discount_Percent, dtype: float64

Add Column to Specific Position of DataFrame

The DataFrame.insert() method offers the flexibility to add columns at any position within an existing DataFrame. While many examples often showcase appending columns at the end of the DataFrame, this method allows for insertion at the beginning, in the middle, or at any specific column index of the DataFrame.


# Add new column at the specific position
df = pd.DataFrame(technologies)
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df.insert(0, 'Tutors', tutors)
print("Add column to DataFrame:\n", df)

# Insert 'Tutors' column at a specified position
# (re-create the DataFrame first; inserting an already existing
# column raises ValueError)
df = pd.DataFrame(technologies)
position = 0
df.insert(position, 'Tutors', tutors)
print("Add column to DataFrame:\n", df)

Yields below output.


# Output:
# Add column to DataFrame:
    Tutors  Courses    Fee  Discount
0  William    Spark  22000      1000
1    Henry  PySpark  25000      2300
2  Michael   Hadoop  23000      1000
3     John   Python  24000      1200
4    Messi   Pandas  26000      2500

Add a Column From Dictionary Mapping

If you want to add a column with specific values for each row based on an existing value, you can do this using a dictionary. Here, the values from the dictionary will be added as the Tutors column in df, by matching the dictionary key with the value in the 'Courses' column.


# Add new column by mapping to the existing column
df = pd.DataFrame(technologies)
tutors = {"Spark":"William", "PySpark":"Henry", "Hadoop":"Michael","Python":"John", "pandas":"Messi"}
df['Tutors'] = df['Courses'].map(tutors)
print("Add column to DataFrame:\n", df)

Note that it is unable to map Pandas because the dictionary key 'pandas' does not exactly match the value 'Pandas' in the Courses column (mapping is case-sensitive). This example yields the below output.


# Output:
# Add column to DataFrame:
   Courses    Fee  Discount   Tutors
0    Spark  22000      1000  William
1  PySpark  25000      2300    Henry
2   Hadoop  23000      1000  Michael
3   Python  24000      1200     John
4   Pandas  26000      2500      NaN
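One workaround for this case-sensitivity issue, sketched with a cut-down version of the data, is to normalize case on both sides before mapping:

```python
import pandas as pd

df = pd.DataFrame({'Courses': ['Spark', 'Pandas']})
tutors = {'Spark': 'William', 'pandas': 'Messi'}

# Lower-case both the column values and the dict keys before mapping
tutors_lower = {k.lower(): v for k, v in tutors.items()}
df['Tutors'] = df['Courses'].str.lower().map(tutors_lower)
print(df)
```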

Using loc[] Add Column

Using pandas loc[] you can access rows and columns by labels or names however, you can also use this for adding a new column to pandas DataFrame. This loc[] property uses the first argument as rows and the second argument for columns hence, I will use the second argument to add a new column.


# Assign the column to the DataFrame
df = pd.DataFrame(technologies)
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df.loc[:, 'Tutors'] = tutors
print("Add column to DataFrame:\n", df)

Yields the same output as above.

]]>
4362
Pandas – Get Cell Value https://shishirkant.com/pandas-get-cell-value/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-get-cell-value Tue, 28 Jan 2025 15:54:17 +0000 https://shishirkant.com/?p=4358 You can use DataFrame properties loc[], iloc[], at[], iat[] and other ways to get/select a cell value from a Pandas DataFrame. A Pandas DataFrame is structured as rows & columns like a table, and a cell is referred to as a basic block that stores the data. Each cell contains information relating to the combination of the row and column.


loc[] & iloc[] are also used to select rows from pandas DataFrame and select columns from pandas DataFrame.

Key Points –

  • Use .loc[] to get a cell value by row label and column label.
  • Use .iloc[] to get a cell value by row and column index.
  • at[] is a faster alternative for accessing a single cell using label-based indexing.
  • .iat[] is similar to .at[], but uses integer-based indexing for faster access to a single cell.
  • Convert the DataFrame to a NumPy array and access elements by array indexing.
  • Prefer .at[] when performance is critical and only one value needs to be accessed.
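The NumPy-array key point can be sketched like this (hypothetical two-column data; to_numpy() is the standard conversion method):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Convert to a NumPy array, then index by [row, column] position
arr = df.to_numpy()
print(arr[1, 0])  # 2
```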

1. Quick Examples of Get Cell Value of DataFrame

If you are in a hurry, below are some quick examples of how to select cell values from Pandas DataFrame.


# Quick examples of get cell value of DataFrame

# Using loc[]. Get cell value by name & index
print(df.loc['r4']['Duration'])
print(df.loc['r4'][2])

# Using iloc[]. Get cell value by index & name
print(df.iloc[3]['Duration'])
print(df.iloc[3,2])

# Using DataFrame.at[]
print(df.at['r4','Duration'])
print(df.at[df.index[3],'Duration'])

# Using DataFrame.iat[]
print(df.iat[3,2])

# Get a cell value
print(df["Duration"].values[3])

# Get cell value from last row
print(df.iloc[-1,2])
print(df.iloc[-1]['Duration'])
print(df.at[df.index[-1],'Duration'])

Now, let’s create a DataFrame with a few rows and columns, execute some examples, and validate the results. Our DataFrame contains column names Courses, Fee, Duration, Discount.


# Create DataFrame
import pandas as pd
technologies = {
     'Courses':["Spark","PySpark","Hadoop","Python","pandas"],
     'Fee' :[24000,25000,25000,24000,24000],
     'Duration':['30day','50days','55days', '40days','60days'],
     'Discount':[1000,2300,1000,1200,2500]
          }
index_labels=['r1','r2','r3','r4','r5']
df = pd.DataFrame(technologies, index=index_labels)
print("Create DataFrame:\n", df)

Yields below output.

Pandas DataFrame Get Value Cell

2. Using DataFrame.loc[] to Get a Cell Value by Column Name

In Pandas, the DataFrame.loc[] property is used to get a specific cell value by row label & column name. All the examples below return the cell value from row label r4 and the Duration column (3rd column).


# Using loc[]. Get cell value by name & index
print(df.loc['r4']['Duration'])
print(df.loc['r4','Duration'])
print(df.loc['r4'][2])

Yields below output. From the above examples df.loc['r4'] returns a pandas Series.


# Output:
40days

3. Using DataFrame.iloc[] to Get a Cell Value by Column Position

If you want to get a cell value by column number or index position, use DataFrame.iloc[]. Index positions run from 0 to length-1. To refer to the last column, use -1 as the column position.


# Using iloc[]. Get cell value by index & name
print(df.iloc[3]['Duration'])
print(df.iloc[3][2])
print(df.iloc[3,2])

This returns the same output as above. Note that the iloc[] property doesn’t support df.iloc[3,'Duration']; using this notation returns an error.

4. Using DataFrame.at[] to select Specific Cell Value by Column Label Name

DataFrame.at[] property is used to access a single cell by a row and column label pair. Like loc[], this doesn’t support selecting a column by position. It performs better when you want to get a specific cell value from a Pandas DataFrame as it uses both row and column labels. Note that the at[] property doesn’t support a negative index to refer to rows or columns from the last.


# Using DataFrame.at[]
print(df.at['r4','Duration'])
print(df.at[df.index[3],'Duration'])

These examples also yield the same output 40days.

5. Using DataFrame.iat[] select Specific Cell Value by Column Position

DataFrame.iat[] is another property for selecting a specific cell value by row and column position. With it, you can refer to columns only by position, not by label. This also doesn’t support a negative index or column position.


# Using DataFrame.iat[]
print(df.iat[3,2])

6. Select Cell Value from DataFrame Using df[‘col_name’].values[]

We can use df['col_name'].values to get the column’s values as a NumPy array, then index into that array to get a cell value, for instance, df["Duration"].values[3].


# Get a cell value
print(df["Duration"].values[3])
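Note that the pandas documentation recommends to_numpy() over the .values attribute; a small sketch of the equivalent access, assuming the same DataFrame as above:

```python
import pandas as pd

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "pandas"],
    'Fee': [24000, 25000, 25000, 24000, 24000],
    'Duration': ['30day', '50days', '55days', '40days', '60days'],
    'Discount': [1000, 2300, 1000, 1200, 2500],
}
df = pd.DataFrame(technologies, index=['r1', 'r2', 'r3', 'r4', 'r5'])

# to_numpy() is the recommended modern equivalent of .values
print(df["Duration"].to_numpy()[3])   # 40days
```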

7. Get Cell Value from Last Row of Pandas DataFrame

If you want to get a specific cell value from the last row of a Pandas DataFrame, use a negative index to point to rows from the end. For example, index -1 represents the last row and -2 the second row from the end. Similarly, you can use -1 for the last column.


# Get cell value from last row
print(df.iloc[-1,2])                  # prints 60days
print(df.iloc[-1]['Duration'])        # prints 60days
print(df.at[df.index[-1],'Duration']) # prints 60days

To select the cell value of the last row and last column, use df.iloc[-1,-1]; this returns 2500. Similarly, you can also try the other approaches.
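The last row/last column access described above can be sketched as follows, using the same DataFrame:

```python
import pandas as pd

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "pandas"],
    'Fee': [24000, 25000, 25000, 24000, 24000],
    'Duration': ['30day', '50days', '55days', '40days', '60days'],
    'Discount': [1000, 2300, 1000, 1200, 2500],
}
df = pd.DataFrame(technologies, index=['r1', 'r2', 'r3', 'r4', 'r5'])

# Last row, last column by position
print(df.iloc[-1, -1])                        # 2500

# Equivalent label-based access via the last row/column labels
print(df.at[df.index[-1], df.columns[-1]])    # 2500
```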

]]>
4358
Pandas – Query Rows by Value https://shishirkant.com/pandas-query-rows-by-value/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-query-rows-by-value Tue, 28 Jan 2025 15:48:58 +0000 https://shishirkant.com/?p=4356 The pandas.DataFrame.query() method is used to query rows based on the provided expression (single or multiple column conditions) and returns a new DataFrame. If you want to modify the existing DataFrame in place, you can set the inplace=True argument. This allows for efficient filtering and manipulation of DataFrame data without creating additional copies.

In this article, I will explain the syntax of the Pandas DataFrame query() method and several working examples, such as a query with multiple conditions and a query with a string, to name a few.

Key Points –

  • Pandas.DataFrame.query() function filters rows from a DataFrame based on a specified condition.
  • Pandas.DataFrame.query() offers a powerful and concise syntax for filtering DataFrame rows, resembling SQL queries, enhancing code readability and maintainability.
  • The method supports a wide range of logical and comparison operators, including ==, !=, >, <, >=, <=, and logical operators like and, or, and not.

Quick Examples of Pandas query()

Following are quick examples of the Pandas DataFrame query() method.


# Quick examples of pandas query()

# Query Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")

# Using variable
value='Spark'
df2=df.query("Courses == @value")

# In place
df.query("Courses == 'Spark'", inplace=True)

# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")

First, let’s create a Pandas DataFrame.


# Create DataFrame
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

pandas dataframe query

Note that the DataFrame contains None and NaN values in the Duration column; the examples below take these into account when selecting rows with None & NaN values, or when disregarding them.
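For example, rows with missing Duration values can be selected (or excluded) with isnull()/notnull() inside the query expression; method calls like these require engine='python'. A minimal sketch using the same DataFrame:

```python
import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500],
}
df = pd.DataFrame(technologies)

# Rows where Duration is None/NaN (method calls need the python engine)
missing = df.query("Duration.isnull()", engine='python')
print(missing['Courses'].tolist())   # ['Python', 'Pandas']

# Rows where Duration has a value
present = df.query("Duration.notnull()", engine='python')
print(present['Courses'].tolist())   # ['Spark', 'PySpark', 'Hadoop']
```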

Using DataFrame.query()

Following is the syntax of the DataFrame.query() method.


# Query() method syntax
DataFrame.query(expr, inplace=False, **kwargs)
  • expr – This parameter specifies the query expression string, which follows Python’s syntax for conditional expressions.
  • inplace – Defaults to False. When it is set to True, it updates the existing DataFrame, and query() method returns None.
  • **kwargs – Optional. Additional keyword arguments passed through to eval(), such as engine or parser.
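As one example of a keyword argument, eval() accepts a local_dict mapping that supplies the variables referenced with @ in the expression. A hedged sketch (the threshold name here is just an illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop"],
    'Fee': [22000, 25000, 23000],
})

# Supply the @-referenced variable through the local_dict kwarg
df2 = df.query("Fee >= @threshold", local_dict={'threshold': 23000})
print(df2['Courses'].tolist())   # ['PySpark', 'Hadoop']
```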

DataFrame.query() is used to filter rows based on one or multiple conditions in an expression.


# Query all rows with Courses equals 'Spark'
df2 = df.query("Courses == 'Spark'")
print("After filtering the rows based on condition:\n", df2)

Yields below output.

pandas dataframe query

You can use the @ character followed by the variable name. This allows you to reference Python variables directly within the query expression.


# Query Rows by using Python variable
value='Spark'
df2 = df.query("Courses == @value")
print("After filtering the rows based on condition:\n", df2)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0   Spark  22000   30days      1000

In the above example, the variable value is referenced within the query expression "Courses == @value", enabling dynamic filtering based on the value stored in the Python variable.

To filter and update the existing DataFrame in place using the query() method, you can use the inplace=True parameter. This will modify the original DataFrame directly without needing to reassign it to a new variable.


# Modify the existing DataFrame in place
df.query("Courses == 'Spark'", inplace=True)
print("After filtering the rows based on condition:\n", df)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0   Spark  22000   30days      1000

In the above example, the DataFrame df is modified in place using the query() method. The expression "Courses=='Spark'" filters rows where the Courses column equals Spark. By setting inplace=True, the original DataFrame df is updated with the filtered result.

The != operator in a DataFrame query expression allows you to select rows where a specific column’s value does not equal a given value.


# Not equals condition
df2 = df.query("Courses != 'Spark'")
print("After filtering the rows based on condition:\n", df2)

# Output:
#    Courses    Fee Duration  Discount
# 1  PySpark  25000   50days      2300
# 2   Hadoop  23000   30days      1000
# 3   Python  24000     None      1200
# 4   Pandas  26000      NaN      2500

In the above example, the DataFrame df is filtered to create a new DataFrame df2, where the Courses column does not equal Spark. This expression ensures that only rows with Courses values different from Spark are included in the resulting DataFrame df2.

Query Rows by the List of Values

Using the in operator in a DataFrame query expression allows you to filter rows based on whether a specific column’s value is present in a Python list of values.


# Query rows by list of values
df2 = df.query("Courses in ('Spark','PySpark')")
print("After filtering the rows based on condition:\n", df2)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0    Spark  22000   30days      1000
# 1  PySpark  25000   50days      2300

Similarly, you can define a Python variable to hold a list of values and then use that variable in your query. This approach allows for more dynamic filtering based on the contents of the list variable.


# Query rows by list of values
values=['Spark','PySpark']
df2 = df.query("Courses in @values")
print("After filtering the rows based on condition:\n", df2)

Using the not-in operator in a DataFrame query expression allows you to filter rows based on values that are not present in a specified list.


# Query rows not in list of values
values=['Spark','PySpark']
df2 = df.query("Courses not in @values")
print("After filtering the rows based on condition:\n", df2)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 2  Hadoop  23000   30days      1000
# 3  Python  24000     None      1200
# 4  Pandas  26000      NaN      2500

When dealing with column names containing special characters, such as spaces, you can enclose the column name within backticks (`) to ensure it is recognized properly in a query expression.


import pandas as pd
import numpy as np

# Create DataFrame
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Courses Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

# Using columns with special characters
df2 = df.query("`Courses Fee` >= 23000")
print("After filtering the rows based on condition:\n", df2)

Yields below output.


# Output:
# Create DataFrame:
#     Courses  Courses Fee Duration  Discount
# 0    Spark        22000   30days      1000
# 1  PySpark        25000   50days      2300
# 2   Hadoop        23000   30days      1000
# 3   Python        24000     None      1200
# 4   Pandas        26000      NaN      2500

# After filtering the rows based on condition:
#     Courses  Courses Fee Duration  Discount
# 1  PySpark        25000   50days      2300
# 2   Hadoop        23000   30days      1000
# 3   Python        24000     None      1200
# 4   Pandas        26000      NaN      2500

Query with Multiple Conditions

Querying with multiple conditions involves filtering data in a DataFrame based on more than one criterion simultaneously. Each condition typically involves one or more columns of the DataFrame and specifies a logical relationship that must be satisfied for a row to be included in the filtered result.


# Query by multiple conditions
df2 = df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")
print("After filtering the rows based on multiple conditions:\n", df2)

# Output:
# After filtering the rows based on multiple conditions:
#   Courses  Courses Fee Duration  Discount
# 2  Hadoop        23000   30days      1000
# 3  Python        24000     None      1200
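The or and not operators listed in the key points work the same way; a short sketch using the same DataFrame with the Courses Fee column:

```python
import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Courses Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500],
}
df = pd.DataFrame(technologies)

# or: a row matches if either condition holds
df2 = df.query("`Courses Fee` < 23000 or Discount > 2000")
print(df2['Courses'].tolist())   # ['Spark', 'PySpark', 'Pandas']

# not: negate a condition
df3 = df.query("not (Courses == 'Spark')")
print(df3['Courses'].tolist())   # ['PySpark', 'Hadoop', 'Python', 'Pandas']
```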

Query Rows using apply()

If you want to filter rows using apply() along with a lambda function, apply the lambda row-wise with axis=1 and have it return a boolean indicating whether each row should be included.


# By using a lambda function (row-wise, returns a boolean per row)
df2 = df[df.apply(lambda row: row['Courses'] in ['Spark','PySpark'], axis=1)]
print("After filtering the rows based on condition:\n", df2)

# Output:
# After filtering the rows based on condition:
#    Courses    Fee Duration  Discount
# 0    Spark  22000   30days      1000
# 1  PySpark  25000   50days      2300

Other Examples using df[] and loc[]


# Other examples you can try to query rows
df[df["Courses"] == 'Spark'] 
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]

# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])

# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])

# Select startswith
print(df[df['Courses'].str.startswith("P")])
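One caveat worth noting: str.contains() propagates NaN for missing values, so filtering a column like Duration (which holds None/NaN here) with a boolean mask fails unless you pass na=False. A minimal sketch:

```python
import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500],
}
df = pd.DataFrame(technologies)

# Duration contains None/NaN; na=False treats missing values as non-matches
df2 = df[df['Duration'].str.contains("30", na=False)]
print(df2['Courses'].tolist())   # ['Spark', 'Hadoop']
```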
]]>
4356
Pandas – Select Columns https://shishirkant.com/pandas-select-columns/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-select-columns Tue, 28 Jan 2025 15:39:37 +0000 https://shishirkant.com/?p=4351  In Pandas, selecting columns by name or index allows you to access specific columns in a DataFrame based on their labels (names) or positions (indices). Use loc[] & iloc[] to select a single column or multiple columns from pandas DataFrame by column names/label or index position respectively.

In this article, I will explain how to select one or more columns from a DataFrame using different methods such as column labels, index, positions, and ranges.

Key Points –

  • Pandas allows selecting columns from a DataFrame by name using square-bracket notation or the .loc[] accessor.
  • The .loc[] accessor allows for more explicit selection, accepting row and column labels or boolean arrays.
  • Alternatively, you can use the .iloc[] accessor to select columns by their integer index positions.
  • For selecting the last column, use df.iloc[:,-1:], and for the first column, use df.iloc[:,:1].
  • Understanding both column name and index-based selection is essential for efficient data manipulation with Pandas.

Quick Examples of Select Columns by Name or Index

If you are in a hurry, below are some quick examples of selecting columns by name or index in Pandas DataFrame.


# Quick examples of select columns by name or index

# Example 1: By using df[] notation
df2 = df[["Courses","Fee","Duration"]] # Select multiple columns

# Example 2: Using loc[] to take column slices
df2 = df.loc[:, ["Courses","Fee","Duration"]] # Select multiple columns
df2 = df.loc[:, ["Courses","Fee","Discount"]] # Select Random columns
df2 = df.loc[:,'Fee':'Discount'] # Select columns between two columns
df2 = df.loc[:,'Duration':]  # Select columns by range
df2 = df.loc[:,:'Duration']  # Select columns by range
df2 = df.loc[:,::2]          # Select every alternate column

# Example 3: Using iloc[] to select column by Index
df2 = df.iloc[:,[1,3,4]] # Select columns by Index
df2 = df.iloc[:,1:4] # Select between indexes 1 and 4 (2,3,4)
df2 = df.iloc[:,2:] # Select From 3rd to end
df2 = df.iloc[:,:2] # Select First Two Columns

First, let’s create a pandas DataFrame.


import pandas as pd
technologies = {
    'Courses':["Shishir","Pandas"],
    'Fee' :[20000,25000],
    'Duration':['30days','40days'],
    'Discount':[1000,2300],
    'Tutor':['Michel','Sam']
              }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

Create DataFrame:
    Courses    Fee Duration  Discount   Tutor
0  Shishir  20000   30days      1000  Michel
1   Pandas  25000   40days      2300     Sam

Using loc[] to Select Columns by Name

The df[] notation and DataFrame.loc[] provide convenient ways to select multiple columns by name or label. With loc[], you can use the syntax [:, start:stop:step] to define the range of columns to include, where start is the label where the slice starts, stop is the label where the slice ends (with loc[], both the start and stop labels are included), and step is the step size between columns. Another syntax supported by pandas.DataFrame.loc[] is [:, [labels]], where you provide a list of column names.


# loc[] syntax to slice columns
df.loc[:,start:stop:step]

Select DataFrame Columns by Name

To select DataFrame columns by name, you can directly specify the column names within square brackets []. Here, df[['Courses', 'Fee', 'Duration']] selects only the Courses, Fee, and Duration columns from the DataFrame df.


# Select Columns by labels
df2 = df[["Courses","Fee","Duration"]]
print("Select columns by labels:\n", df2)

Yields below output.

Select columns by labels:
    Courses    Fee Duration
0  Shishir  20000   30days
1   Pandas  25000   40days

Select Multiple Columns by Label

To select multiple columns using df.loc[], you specify both row and column labels. If you want to select all rows and specific columns, you can use : to select all rows and provide a list of column labels. Note that loc[] also supports multiple conditions when selecting rows based on column values.


# Select multiple columns
df2 = df.loc[:, ["Courses","Fee","Discount"]]
print("Select multiple columns by labels:\n", df2)

# Output:
# Select multiple columns by labels:
#   Courses    Fee  Discount
# 0  Shishir  20000      1000
# 1   Pandas  25000      2300

In the above example, df.loc[:, ["Courses", "Fee", "Discount"]] selects all rows (:) and the columns labeled Courses, Fee, and Discount from the DataFrame df.

Select Columns Based on Label Indexing

When you want to select columns based on label indexes, provide start and stop labels.

  • If you don’t specify a start label, loc[] selects from the first column.
  • If you don’t provide a stop label, loc[] selects all columns from the start label to the last column.
  • Specifying both start and stop labels selects all columns in between, including both the start and stop labels (unlike iloc[], label slicing with loc[] includes the stop label).

# Select all columns between the Fee and Discount columns
df2 = df.loc[:,'Fee':'Discount']
print("Select columns by labels:\n", df2)

# Output:
# Select columns by labels:
#     Fee Duration  Discount
# 0  20000   30days      1000
# 1  25000   40days      2300

# Select from 'Duration' column
df2 = df.loc[:,'Duration':]
print("Select columns by labels:\n", df2)

# Output
# Select columns by labels:
#  Duration  Discount   Tutor
# 0   30days      1000  Michel
# 1   40days      2300     Sam

# Select from beginning and end at 'Duration' column
df2 = df.loc[:,:'Duration']
print("Select columns by labels:\n", df2)

# Output
# Select columns by labels:
#   Courses    Fee Duration
# 0  Shishir  20000   30days
# 1  Pandas  25000   40days

Select Every Alternate Column

To select every alternate column from a DataFrame, slice the columns with a step size of 2, for example df.loc[:, ::2].


# Select every alternate column
df2 = df.loc[:,::2]
print("Select columns by labels:\n", df2)

# Output:
# Select columns by labels:
#   Courses Duration   Tutor
# 0  Shishir   30days  Michel
# 1  Pandas   40days     Sam

This code effectively selects every alternate column, starting from the first column, which results in selecting Courses, Duration, and Tutor.

Pandas iloc[] to Select Column by Index or Position

By using pandas.DataFrame.iloc[], you can select multiple columns from a DataFrame by their positional indices. You can use the syntax [:, start:stop:step] to define the range of columns to include, where start is the index where the slice starts (inclusive), stop is the index where the slice ends (exclusive), and step is the step size between elements. Or, you can use the syntax [:, [indices]] with iloc[], where you provide a list of integer column positions.

Select Columns by Index Position

To select multiple columns from a DataFrame by their index positions, you can use the iloc[] accessor. For instance, the example below retrieves the Fee, Discount, and Tutor columns and returns a new DataFrame with the selected columns.


# Select columns by position
df2 = df.iloc[:,[1,3,4]]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#     Fee  Discount   Tutor
# 0  20000      1000  Michel
# 1  25000      2300     Sam

Select Columns by Position Range

You can also slice a DataFrame by a range of positions. For instance, select columns by position range using the .iloc[] accessor in Pandas. It selects columns with positions 1 through 3 (exclusive of position 4) from the DataFrame df and assigns them to df2.


# Select between indexes 1 and 4 (2,3,4)
df2 = df.iloc[:,1:4]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#     Fee Duration  Discount
# 0  20000   30days      1000
# 1  25000   40days      2300

# Select From 3rd to end
df2 = df.iloc[:,2:]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#  Duration  Discount   Tutor
# 0   30days      1000  Michel
# 1   40days      2300     Sam

# Select First Two Columns
df2 = df.iloc[:,:2]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#   Courses    Fee
# 0  Shishir  20000
# 1   Pandas  25000

To retrieve the last column of a DataFrame, you can use df.iloc[:,-1:], and to obtain just the first column, you can use df.iloc[:,:1].
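Those first/last column selections can be sketched as follows, using the five-column DataFrame from the complete example:

```python
import pandas as pd

technologies = {
    'Courses': ["Shishir", "Pandas"],
    'Fee': [20000, 25000],
    'Duration': ['30days', '40days'],
    'Discount': [1000, 2300],
    'Tutor': ['Michel', 'Sam'],
}
df = pd.DataFrame(technologies)

# Last column, returned as a one-column DataFrame
last_col = df.iloc[:, -1:]
print(last_col.columns.tolist())    # ['Tutor']

# First column, returned as a one-column DataFrame
first_col = df.iloc[:, :1]
print(first_col.columns.tolist())   # ['Courses']
```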

Complete Example


import pandas as pd
technologies = {
    'Courses':["Shishir","Pandas"],
    'Fee' :[20000,25000],
    'Duration':['30days','40days'],
    'Discount':[1000,2300],
    'Tutor':['Michel','Sam']
              }
df = pd.DataFrame(technologies)
print(df)

# Select multiple columns
print(df[["Courses","Fee","Duration"]])

# Select Random columns
print(df.loc[:, ["Courses","Fee","Discount"]])

# Select columns by range
print(df.loc[:,'Fee':'Discount']) 
print(df.loc[:,'Duration':])
print(df.loc[:,:'Duration'])

# Select every alternate column
print(df.loc[:,::2])

# Selected by column position
print(df.iloc[:,[1,3,4]])

# Select between indexes 1 and 4 (2,3,4)
print(df.iloc[:,1:4])

# Select From 3rd to end
print(df.iloc[:,2:])

# Select First Two Columns
print(df.iloc[:,:2])
]]>
4351