Python – Shishir Kant Singh - https://shishirkant.com

Pandas – Drop Rows by Label | Index
https://shishirkant.com/pandas-drop-rows-by-label-index/ - Tue, 28 Jan 2025

By using the pandas.DataFrame.drop() method you can drop/remove/delete rows from a DataFrame. The axis param is used to specify which axis you would like to remove from. By default axis=0, meaning rows are removed; use axis=1 or the columns param to remove columns. By default, Pandas returns a copy of the DataFrame after deleting rows; use inplace=True to remove rows from the existing (referring) DataFrame.

In this article, I will cover how to remove rows by labels, indexes, and ranges, how to drop rows in place, and how to drop rows with None/NaN/Null values, with examples. If you have duplicate rows, use drop_duplicates() to drop duplicate rows from a pandas DataFrame.

Key Points –

  • Use the drop() method to remove rows by specifying the row labels or indices.
  • Set the axis parameter to 0 (or omit it) to indicate that rows should be dropped.
  • Use the inplace parameter to modify the original DataFrame directly without creating a new one.
  • After dropping rows, consider resetting the index with reset_index() to maintain sequential indexing.
  • Set the errors parameter to ‘ignore’ to suppress errors when attempting to drop non-existent row labels.
  • Leverage the query() method to filter and drop rows based on complex conditions.
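As a quick sketch of the last two points (using a small hypothetical DataFrame, not the one from this article):

```python
import pandas as pd

# Hypothetical single-column DataFrame for illustration
df = pd.DataFrame({'A': [1, 2, 3]}, index=['r1', 'r2', 'r3'])

# Drop by label; 'r9' does not exist, but errors='ignore' suppresses the KeyError
df1 = df.drop(['r1', 'r9'], errors='ignore')
print(df1.index.tolist())  # ['r2', 'r3']

# Keep only rows matching a condition (effectively dropping the rest)
df2 = df.query('A >= 2')
print(df2.index.tolist())  # ['r2', 'r3']
```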

Pandas.DataFrame.drop() Syntax – Drop Rows & Columns

Let’s look at the syntax of the DataFrame drop() function.


# Pandas DataFrame drop() Syntax
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Parameters

  • labels – Single label or list-like. It’s used with axis param.
  • axis – Default sets to 0. 1 to drop columns and 0 to drop rows.
  • index – Use to specify rows. Accepts single label or list-like.
  • columns – Use to specify columns. Accepts single label or list-like.
  • level – int or level name, optional, use for Multiindex.
  • inplace – Default False, returns a copy of the DataFrame. When set to True, it drops the rows/columns in place (modifies the current DataFrame) and returns None.
  • errors – {‘ignore’, ‘raise’}, default ‘raise’.

Let’s create a DataFrame, run some examples, and explore the output. Note that our DataFrame contains index labels for rows which I am going to use to demonstrate removing rows by labels.


# Create a DataFrame
import pandas as pd
import numpy as np

technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python"],
    'Fee' :[20000,25000,26000,22000],
    'Duration':['30day','40days',np.nan, None],
    'Discount':[1000,2300,1500,1200]
               }

indexes=['r1','r2','r3','r4']
df = pd.DataFrame(technologies,index=indexes)
print(df)

Yields below output.

# Output:
     Courses    Fee Duration  Discount
r1     Spark  20000    30day      1000
r2   PySpark  25000   40days      2300
r3    Hadoop  26000      NaN      1500
r4    Python  22000     None      1200

Pandas Drop Rows From DataFrame Examples

By default, the drop() method removes rows (axis=0) from the DataFrame. Let’s see several examples of how to remove rows from a DataFrame.

Drop rows by Index Labels or Names

One of Pandas’ advantages is that you can assign labels/names to rows, similar to column names. If you have a DataFrame with row labels (index labels), you can specify which rows you want to remove by their label names.


# Drop rows by Index Label
df = pd.DataFrame(technologies,index=indexes)
df1 = df.drop(['r1','r2'])
print("Drop rows from DataFrame:\n", df1)

Yields below output.

# Output:
# Drop rows from DataFrame:
     Courses    Fee Duration  Discount
r3    Hadoop  26000      NaN      1500
r4    Python  22000     None      1200

Alternatively, you can write the same statement by using the index parameter.


# Delete Rows by Index Labels
df1 = df.drop(index=['r1','r2'])

And by using labels and axis as below.


# Delete Rows by Index Labels & axis
df1 = df.drop(labels=['r1','r2'])
df1 = df.drop(labels=['r1','r2'],axis=0)

Notes:

  • As you can see, using labels with axis=0 is equivalent to using index=label names.
  • axis=0 means rows. By default the drop() method assumes axis=0, so you don’t have to specify it to remove rows. To remove columns, explicitly specify axis=1 or the columns param.

Drop Rows by Index Number (Row Number)

Similarly, by using the drop() method you can also remove rows by index position from a pandas DataFrame. The drop() method doesn’t have a positional index param, hence we need to get the row labels from the index and pass them to the drop() method. We will use df.index to get the row labels for the rows we want to delete.

  • df.index.values returns all row labels as an array.
  • df.index[[1,3]] gets the row labels for the 2nd and 4th rows; passing these to the drop() method removes those rows. Note that in Python, the index starts from zero.

# Delete Rows by Index numbers
df = pd.DataFrame(technologies,index=indexes)
df1=df.drop(df.index[[1,3]])
print(df1)

This drops the 2nd and 4th rows (r2 and r4). In order to drop the first row, you can use df.drop(df.index[0]), and to drop the last row use df.drop(df.index[-1]).


# Removes First Row
df=df.drop(df.index[0])

# Removes Last Row
df=df.drop(df.index[-1])

Delete Rows by Index Range

You can also remove rows by specifying an index range. The below example removes all rows from the 3rd row onward.


# Delete Rows by Index Range
df = pd.DataFrame(technologies,index=indexes)
df1=df.drop(df.index[2:])
print(df1)

Yields below output.


# Output:
    Courses    Fee Duration  Discount
r1    Spark  20000    30day      1000
r2  PySpark  25000   40days      2300

Delete Rows when you have Default Index

By default, pandas assigns a sequence number to every row, also called the index; the row index starts from zero and increments by 1 for every row. If you are not using custom index labels, pandas DataFrame assigns these sequence numbers as the index. To remove rows with the default index, you can try the below.


# Remove rows when you have default index.
df = pd.DataFrame(technologies)
df1 = df.drop(0)
df3 = df.drop([0, 3])
df4 = df.drop(range(0,2))

Note that df.drop(-1) doesn’t remove the last row, as the label -1 is not present in the DataFrame index. You can still use df.drop(df.index[-1]) to remove the last row.
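A minimal sketch confirming this behavior (hypothetical single-column data):

```python
import pandas as pd

# Hypothetical DataFrame with the default RangeIndex
df = pd.DataFrame({'A': [10, 20, 30, 40]})

# df.drop(-1) would raise a KeyError: the label -1 is not in the index.
# Use the positional label instead:
df1 = df.drop(df.index[-1])   # removes the last row
print(df1.index.tolist())     # [0, 1, 2]

# Or suppress the error when the label may not exist
df2 = df.drop(-1, errors='ignore')  # returns the DataFrame unchanged
print(len(df2))               # 4
```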

Remove DataFrame Rows Inplace

All examples you have seen above return a copy of the DataFrame after removing rows. If you want to remove rows in place from the referring DataFrame, use inplace=True. By default, the inplace param is set to False.


# Delete Rows inplace
df = pd.DataFrame(technologies,index=indexes)
df.drop(['r1','r2'],inplace=True)
print(df)

Drop Rows by Checking Conditions

Most of the time we also need to remove DataFrame rows based on some condition (column value); you can do this by using the loc[] and iloc[] indexers. Note that the example below keeps the rows that satisfy the condition, effectively dropping all other rows.


# Delete Rows by Checking Conditions
df = pd.DataFrame(technologies)
df1 = df.loc[df["Discount"] >=1500 ]
print(df1)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
1  PySpark  25000   40days      2300
2   Hadoop  26000      NaN      1500
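The example above keeps matching rows. If you want to drop the rows that match a condition instead, a common pattern is to pass the matching index labels to drop(); a sketch using a reduced version of the same data:

```python
import pandas as pd

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python"],
    'Fee': [20000, 25000, 26000, 22000],
    'Discount': [1000, 2300, 1500, 1200]
}
df = pd.DataFrame(technologies)

# Drop rows whose Discount is >= 1500
df1 = df.drop(df[df['Discount'] >= 1500].index)
print(df1['Courses'].tolist())  # ['Spark', 'Python']
```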

Drop Rows that have NaN/None/Null Values

While working with analytics you would often be required to clean up data that has None, Null & np.nan values. By using df.dropna() you can remove rows with NaN values from the DataFrame.


# Delete rows with Nan, None & Null Values
df = pd.DataFrame(technologies,index=indexes)
df2=df.dropna()
print(df2)

This removes all rows that have None, Null & NaN values on any columns.


# Output:
    Courses    Fee Duration  Discount
r1    Spark  20000    30day      1000
r2  PySpark  25000   40days      2300

Remove Rows by Slicing DataFrame

You can also drop a list of DataFrame rows by slicing. Remember index starts from zero.


# Remove Rows by Slicing DataFrame
df2=df[4:]     # Drops the first 4 rows (returns rows from position 4 onward)
df2=df[1:-1]   # Removes the first and last rows
df2=df[2:4]    # Returns the 3rd and 4th rows (positions 2 and 3)
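A quick sketch of these slices on a small labeled DataFrame (hypothetical single-column data):

```python
import pandas as pd

# Hypothetical DataFrame with four labeled rows
df = pd.DataFrame({'Fee': [1, 2, 3, 4]}, index=['r1', 'r2', 'r3', 'r4'])

# Row slicing with df[...] is positional, even with custom labels
print(df[1:-1].index.tolist())  # ['r2', 'r3'] - first and last rows removed
print(df[2:4].index.tolist())   # ['r3', 'r4'] - 3rd and 4th rows
```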
Pandas – Rename Column
https://shishirkant.com/pandas-rename-column/ - Tue, 28 Jan 2025

Pandas DataFrame.rename() function is used to change a single column name, multiple column names, rename by index position, rename in place, rename with a list, rename with a dict, rename all columns, etc. We are often required to change the column names of a DataFrame before we perform any operations on it. In fact, changing the name of a column is one of the most searched and used functions of Pandas.


The good thing about this function is it provides a way to rename a specific single column.

In this pandas article, you will learn several ways to rename a column of a Pandas DataFrame, with examples, using functions like DataFrame.rename(), DataFrame.set_axis(), DataFrame.add_prefix(), DataFrame.add_suffix(), and more.

Key Points –

  • Use the rename() method to rename columns, specifying a dictionary that maps old column names to new ones.
  • The rename() method does not modify the original DataFrame by default; to apply changes directly, set the inplace parameter to True.
  • Column renaming is case-sensitive, so ensure the exact match of column names when renaming.
  • To rename a single column, use the columns parameter in the rename() method with a dictionary that maps only the specific column.
  • Use the DataFrame.columns attribute to directly modify column names by assigning a new list of names. Ensure the new list has the same length as the original.

1. Quick Examples of Renaming DataFrame Columns

If you are in a hurry, below are some quick examples of renaming column names in Pandas DataFrame.

Pandas Rename Scenario            Rename Column Example
Rename columns with a list        df.columns = ['A','B','C']
Rename column name by index       df.columns.values[2] = "C"
Rename column using a dict        df2 = df.rename(columns={'a': 'A', 'b': 'B'})
Rename using dict & axis          df2 = df.rename({'a': 'A', 'b': 'B'}, axis=1)
Rename using axis='columns'       df2 = df.rename({'a': 'A', 'b': 'B'}, axis='columns')
Rename column in place            df.rename(columns={'a': 'A', 'b': 'B'}, inplace=True)
Rename using a lambda function    df.rename(columns=lambda x: x[1:], inplace=True)
Rename with error on missing      df.rename(columns={'x': 'X'}, errors="raise")
Rename using set_axis()           df2 = df.set_axis(['A','B','C'], axis=1)

Pandas Rename Column(s) Examples

Now let’s see the Syntax and examples.

2. Syntax of Pandas DataFrame.rename()

Following is the syntax of the pandas.DataFrame.rename() method; it returns either a DataFrame or None. By default it returns a pandas DataFrame after renaming columns. When you use inplace=True it updates the existing DataFrame in place (self) and returns None.


#  DataFrame.rename() Syntax
DataFrame.rename(mapper=None, index=None, columns=None, axis=None, 
       copy=True, inplace=False, level=None, errors='ignore')

2.1 Parameters

The following are the parameters.

  • mapper – dictionary or function to rename columns and indexes.
  • index – dictionary or function to rename the index. Using it with the axis param as (mapper, axis=0) is equivalent to index=mapper.
  • columns – dictionary or function to rename columns. Using it with the axis param as (mapper, axis=1) is equivalent to columns=mapper.
  • axis – Value can be either 0 or index | 1 or columns. Default set to ‘0’.
  • copy – Copies the data as well. Default set to True.
  • inplace – Used to specify the DataFrame referred to be updated. Default to False. When used True, copy property will be ignored.
  • level – Used with MultiIndex. Takes Integer value. Default set to None.
  • errors – Take values raise or ignore. if ‘raise’ is used, raise a KeyError when a dict-like mapper, index, or column contains labels that are not present in the Index being transformed. If ‘ignore’ is used, existing keys will be renamed and extra keys will be ignored. Default set to ignore.
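A minimal sketch of the mapper/axis equivalences described above (hypothetical two-column data):

```python
import pandas as pd

# Hypothetical DataFrame for illustration
df = pd.DataFrame({'a': [1], 'b': [2]}, index=['x'])

# These three calls are equivalent ways to rename a column
r1 = df.rename(columns={'a': 'A'})
r2 = df.rename({'a': 'A'}, axis=1)
r3 = df.rename({'a': 'A'}, axis='columns')
print(r1.columns.tolist())  # ['A', 'b']

# index= (or axis=0) renames row labels instead
r4 = df.rename(index={'x': 'X'})
print(r4.index.tolist())    # ['X']
```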

Let’s create a DataFrame with a dictionary of lists; our pandas DataFrame contains the column names Courses, Fee, and Duration.


import pandas as pd
technologies = ({
  'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
  'Fee' :[20000,25000,26000,22000,24000,21000,22000],
  'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
              })
df = pd.DataFrame(technologies)
print(df.columns)

Yields below output.

# Output:
Index(['Courses', 'Fee', 'Duration'], dtype='object')

3. Pandas Rename Column Name

In order to rename a single column name on pandas DataFrame, you can use column={} parameter with the dictionary mapping of the old name and a new name. Note that when you use column param, you cannot explicitly use axis param.

pandas DataFrame.rename() accepts a dict(dictionary) as a param for columns you want to rename, so you just pass a dict with a key-value pair; the key is an existing column you would like to rename and the value would be your preferred column name.


# Rename a Single Column 
df2=df.rename(columns = {'Courses':'Courses_List'})
print(df2.columns)

Yields below output. As you can see, it renames the column from Courses to Courses_List.

# Output:
Index(['Courses_List', 'Fee', 'Duration'], dtype='object')

Alternatively, you can also write the above statement by using axis=1 or axis='columns'.


# Alternatively you can write above code using axis
df2=df.rename({'Courses':'Courses_List'}, axis=1)
df2=df.rename({'Courses':'Courses_List'}, axis='columns')

In order to change columns on the existing DataFrame without copying to the new DataFrame, you have to use inplace=True.


# Replace existing DataFrame (inplace). This returns None
df.rename({'Courses':'Courses_List'}, axis='columns', inplace=True)
print(df.columns)

4. Rename Multiple Columns

You can also use the same approach to rename multiple columns of a Pandas DataFrame; all you need to do is specify the columns you want to rename in the dictionary mapping.


# Rename multiple columns
df.rename(columns = {'Courses':'Courses_List','Fee':'Courses_Fee', 
   'Duration':'Courses_Duration'}, inplace = True)
print(df.columns)

Yields below output. As you can see, it renames multiple columns.

# Output:
Index(['Courses_List', 'Courses_Fee', 'Courses_Duration'], dtype='object')

5. Rename the Column by Index or Position

To rename a column by index/position, you can use df.columns.values[index] = 'value'. Index and position can be used interchangeably to access a column at a given position. Using this, you can rename anything from the first column to the last column.

As you can see from the above, df.columns returns the column names as a pandas Index, and df.columns.values gets the column names as an array; you can then set a specific index/position to a new value. The below example updates the column Duration to Courses_Duration at index 2. Note that the column index starts from zero.


# Pandas rename column by index
df.columns.values[2] = "Courses_Duration"
print(df.columns)

# Output:
# Index(['Courses', 'Fee', 'Courses_Duration'], dtype='object')

6. Rename Columns with a List

A Python list can be used to rename all columns in a pandas DataFrame; the length of the list should be the same as the number of columns in the DataFrame, otherwise an error occurs.


# Rename columns with list
column_names = ['Courses_List','Courses_Fee','Courses_Duration']
df.columns = column_names
print(df.columns)

Yields below output.


# Output:
Index(['Courses_List', 'Courses_Fee', 'Courses_Duration'], dtype='object')

7. Rename Columns Inplace

By default the rename() function returns a new pandas DataFrame after updating the column names; you can change this behavior and rename in place by using the inplace=True param.


# Rename multiple columns
df.rename(columns = {'Courses':'Courses_List','Fee':'Courses_Fee', 
   'Duration':'Courses_Duration'}, inplace = True)
print(df.columns)

This renames column names on DataFrame in place and returns the None type.

8. Rename All Columns by adding Suffixes or Prefix

Sometimes you may need to add a prefix or suffix string to all column names. You can do this by iterating over the columns in a loop and adding the prefix or suffix string to each name.


# Rename All Column Names by adding a Prefix
df = pd.DataFrame(technologies)
df.columns = ['col_'+str(col) for col in df.columns]

You can also use pandas.DataFrame.add_prefix() and pandas.DataFrame.add_suffix() to add prefixes and suffixes respectively to the pandas DataFrame column names.


# Add prefix to the column names
df2=df.add_prefix('col_')
print(df2.columns)

# Add suffix to the column names
df2=df.add_suffix('_col')
print(df2.columns)

Yields below output.


# Output:
Index(['col_Courses', 'col_Fee', 'col_Duration'], dtype='object')
Index(['Courses_col', 'Fee_col', 'Duration_col'], dtype='object')

9. Rename the Column using the Lambda Function

You can also change column names using a lambda expression; this gives more control and lets you apply custom functions. The below example adds a 'col_' prefix to all column names. You can also use this approach to remove spaces from column names, etc.


# Rename using Lambda function
df.rename(columns=lambda x: 'col_'+x, inplace=True)
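As suggested above, the same lambda approach can remove spaces from column names; a sketch with hypothetical column labels:

```python
import pandas as pd

# Hypothetical columns containing stray spaces
df = pd.DataFrame(columns=['Course Name', ' Fee '])

# Strip surrounding whitespace and replace inner spaces with underscores
df.rename(columns=lambda x: x.strip().replace(' ', '_'), inplace=True)
print(df.columns.tolist())  # ['Course_Name', 'Fee']
```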

10. Rename or Convert All Columns to Lower or Upper Case

When column names mix lower and upper case, it is best practice to convert all column names to either lower or upper case.


# Change to all lower case
df = pd.DataFrame(technologies)
df2=df.rename(str.lower, axis='columns')
print(df2.columns)

# Change to all upper case
df = pd.DataFrame(technologies)
df2=df.rename(str.upper, axis='columns')
print(df2.columns)

Yields below output.


# Output:
Index(['courses', 'fee', 'duration'], dtype='object')
Index(['COURSES', 'FEE', 'DURATION'], dtype='object')

11. Change Column Names Using DataFrame.set_axis()

By using DataFrame.set_axis() you can also change the column names. Note that with set_axis() you need to assign all column names; it updates the DataFrame with a new set of column names. set_axis() can also be used to rename the pandas DataFrame index.


# Change column names using set_axis()
# Note: the inplace param was removed from set_axis() in pandas 2.0,
# so assign the result back instead of using inplace=True
df = df.set_axis(['Courses_List', 'Course_Fee', 'Course_Duration'], axis=1)
print(df.columns)

12. Using String replace()

The pandas str.replace() method is used to replace a substring (it also accepts regex) and can be applied to the column labels via df.columns.str.replace(). For example, df.columns.str.replace("Fee", "Courses_Fee") replaces 'Fee' in the column names with 'Courses_Fee'.


# Change column name using String.replace()
df.columns = df.columns.str.replace("Fee","Courses_Fee")
print(df.columns)

Yields below output.


# Output:
Index(['Courses', 'Courses_Fee', 'Duration'], dtype='object')

You can also use the str.replace() method to modify all column names at once. Given a DataFrame with the column names 'Courses_List', 'Course_Fee', and 'Course_Duration', applying str.replace() over df.columns replaces the underscores ("_") with white space (" ").


# Rename all column names
df.columns = df.columns.str.replace("_"," ")
print(df.columns)

Yields below output.


# Output:
Index(['Courses List', 'Course Fee', 'Course Duration'], dtype='object')

13. Raise Error when Column Not Found

By default, when a column label given to rename() is not found in the Pandas DataFrame, the method just ignores it. If you want to throw an error when a column is not found, use errors = "raise".


# Throw Error when Rename column doesn't exists.
df.rename(columns = {'Cour':'Courses_List'}, errors = "raise")

Yields the below error message.


# Output:
raise KeyError("{} not found in axis".format(missing_labels))
KeyError: "['Cour'] not found in axis"
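If you need to handle this failure programmatically, a sketch of catching the KeyError (hypothetical minimal DataFrame):

```python
import pandas as pd

# Hypothetical minimal DataFrame
df = pd.DataFrame({'Courses': ['Spark'], 'Fee': [20000]})

try:
    # 'Cour' is not a column, so errors='raise' triggers a KeyError
    df.rename(columns={'Cour': 'Courses_List'}, errors='raise')
except KeyError as e:
    print('rename failed:', e)
```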

14. Rename Only If the Column Exists

This example changes the Courses column to Courses_List and doesn’t update Fees since there is no Fees column. Note that even though the Fees column does not exist, no error is raised even with errors="raise", because the dictionary comprehension filters out keys that are not in df.columns.


# Change column only if column exists.
df = pd.DataFrame(technologies)
d={'Courses':'Courses_List','Fees':'Courses_fees'}
df.rename(columns={k: v for k, v in d.items() if k in df.columns}, inplace=True,errors = "raise")
print(df.columns)
Pandas – Add New Column
https://shishirkant.com/pandas-add-new-column/ - Tue, 28 Jan 2025

In Pandas, you can add a new column to an existing DataFrame using the DataFrame.insert() function, which updates the DataFrame in place. Alternatively, you can use DataFrame.assign() to insert a new column, but this method returns a new DataFrame with the added column.


In this article, I will cover examples of adding multiple columns, adding a constant value, and deriving new columns from an existing column in a Pandas DataFrame.

Key Points –

  • A new column can be created by assigning values directly to a new column name with df['new_column'] = values.
  • The assign() method adds columns and returns a modified copy of the DataFrame, leaving the original DataFrame unchanged unless reassigned.
  • Adding a column directly modifies the DataFrame in place, while using assign() creates a new DataFrame.
  • Lambda functions within assign() enable complex calculations or conditional logic to define values for the new column.
  • The insert() method allows adding a new column at a specific position within the DataFrame, providing flexibility for organizing columns.
  • Using functions like np.where() or apply(), you can populate a new column based on conditional values.
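A sketch of the last point, populating a new column conditionally with np.where() (hypothetical fee data and a made-up 'Level' column):

```python
import pandas as pd
import numpy as np

# Hypothetical data for illustration
df = pd.DataFrame({'Fee': [22000, 25000, 23000]})

# Populate a new column based on a condition
df['Level'] = np.where(df['Fee'] >= 24000, 'High', 'Low')
print(df['Level'].tolist())  # ['Low', 'High', 'Low']
```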

Quick Examples of Adding Column

If you are in a hurry, below are some quick examples of adding column to pandas DataFrame.


# Quick examples of add column to dataframe

# Add new column to the dataframe
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df2 = df.assign(TutorsAssigned=tutors)

# Add a multiple columns to the dataframe
MNCCompanies = ['TATA','HCL','Infosys','Google','Amazon']
df2 =df.assign(MNCComp = MNCCompanies,TutorsAssigned=tutors )

# Derive new Column from existing column
df = pd.DataFrame(technologies)
df2=df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)

# Add a constant or empty value to the DataFrame
df = pd.DataFrame(technologies)
df2=df.assign(A=None,B=0,C="")

# Add new column to the existing DataFrame
df = pd.DataFrame(technologies)
df["MNCCompanies"] = MNCCompanies

# Add new column at the specific position
df = pd.DataFrame(technologies)
df.insert(0,'Tutors', tutors )

# Add new column by mapping to the existing column
df = pd.DataFrame(technologies)
tutors = {"Spark":"William", "PySpark":"Henry", "Hadoop":"Michael","Python":"John", "pandas":"Messi"}
df['Tutors'] = df['Courses'].map(tutors)
print(df)

To run some examples of adding column to DataFrame, let’s create DataFrame using data from a dictionary.


# Create DataFrame
import pandas as pd
import numpy as np

technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Discount':[1000,2300,1000,1200,2500]
          }

df = pd.DataFrame(technologies)
print("Create a DataFrame:\n", df)

Yields below output.

# Output:
# Create a DataFrame:
    Courses    Fee  Discount
0    Spark  22000      1000
1  PySpark  25000      2300
2   Hadoop  23000      1000
3   Python  24000      1200
4   Pandas  26000      2500

Add Column to DataFrame

DataFrame.assign() is used to add/append a column to the DataFrame, this method generates a new DataFrame incorporating the added column, while the original DataFrame remains unchanged.

Below is the syntax of the assign() method.


# Syntax of DataFrame.assign()
DataFrame.assign(**kwargs)

Now let’s add a column ‘TutorsAssigned’ to the DataFrame. Using assign() you cannot modify the existing DataFrame in place; instead, it returns a new DataFrame after adding the column. The below example adds a list of values as a new column to the DataFrame.


# Add new column to the DataFrame
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df2 = df.assign(TutorsAssigned=tutors)
print("Add column to DataFrame:\n", df2)

Yields below output.

# Output:
# Add column to DataFrame:
    Courses    Fee  Discount TutorsAssigned
0    Spark  22000      1000        William
1  PySpark  25000      2300          Henry
2   Hadoop  23000      1000        Michael
3   Python  24000      1200           John
4   Pandas  26000      2500          Messi

Add Multiple Columns to the DataFrame

You can add multiple columns to a Pandas DataFrame by using the assign() function.


# Add multiple columns to the DataFrame
MNCCompanies = ['TATA','HCL','Infosys','Google','Amazon']
df2 = df.assign(MNCComp = MNCCompanies,TutorsAssigned=tutors )
print("Add multiple columns to DataFrame:\n", df2)

Yields below output.


# Output:
# Add multiple columns to DataFrame:
    Courses    Fee  Discount  MNCComp TutorsAssigned
0    Spark  22000      1000     TATA        William
1  PySpark  25000      2300      HCL          Henry
2   Hadoop  23000      1000  Infosys        Michael
3   Python  24000      1200   Google           John
4   Pandas  26000      2500   Amazon          Messi

Adding a Column From Existing

In real-time scenarios, there’s often a need to compute and add new columns to a dataset based on existing ones. The following demonstration calculates the Discount_Percent column based on Fee and Discount. In this instance, I’ll utilize a lambda function to generate a new column from the existing data.


# Derive New Column from Existing Column
df = pd.DataFrame(technologies)
df2 = df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)
print("Add column to DataFrame:\n", df2)

You can explore deriving multiple columns and appending them to a DataFrame within a single statement. This example yields the below output.


# Output:
# Add column to DataFrame:
   Courses    Fee  Discount  Discount_Percent
0    Spark  22000      1000          220000.0
1  PySpark  25000      2300          575000.0
2   Hadoop  23000      1000          230000.0
3   Python  24000      1200          288000.0
4   Pandas  26000      2500          650000.0

Add a Constant or Empty Column

The below example adds 3 new columns to the DataFrame, one column with all None values, a second column with 0 value, and the third column with an empty string value.


# Add a constant or empty value to the DataFrame.
df = pd.DataFrame(technologies)
df2=df.assign(A=None,B=0,C="")
print("Add column to DataFrame:\n", df2)

Yields below output.


# Output:
# Add column to DataFrame:
    Courses    Fee  Discount     A  B C
0    Spark  22000      1000  None  0  
1  PySpark  25000      2300  None  0  
2   Hadoop  23000      1000  None  0  
3   Python  24000      1200  None  0  
4   Pandas  26000      2500  None  0  

Append Column to Existing Pandas DataFrame

The above examples create a new DataFrame with the added columns instead of appending a column to the existing DataFrame. The example in this section appends a new column to the existing DataFrame.


# Add new column to the existing DataFrame
df = pd.DataFrame(technologies)
df["MNCCompanies"] = MNCCompanies
print("Add column to DataFrame:\n", df)

Yields below output.


# Output:
# Add column to DataFrame:
   Courses    Fee  Discount MNCCompanies
0    Spark  22000      1000         TATA
1  PySpark  25000      2300          HCL
2   Hadoop  23000      1000      Infosys
3   Python  24000      1200       Google
4   Pandas  26000      2500       Amazon

You can also use this approach to add a new column derived from an existing column.


# Derive a new column from existing columns
df['Discount_Percent'] = df['Fee'] * df['Discount'] / 100
print("Add column to DataFrame:\n", df['Discount_Percent'])

# Output:
# Add column to DataFrame:
# 0    220000.0
# 1    575000.0
# 2    230000.0
# 3    288000.0
# 4    650000.0
# Name: Discount_Percent, dtype: float64

Add Column to Specific Position of DataFrame

The DataFrame.insert() method offers the flexibility to add columns at any position within an existing DataFrame. While many examples often showcase appending columns at the end of the DataFrame, this method allows for insertion at the beginning, in the middle, or at any specific column index of the DataFrame.


# Add new column at a specific position
df = pd.DataFrame(technologies)
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
position = 0  # insert as the first column
df.insert(position, 'Tutors', tutors)
print("Add column to DataFrame:\n", df)

# Note: calling insert() again with the same column name
# raises a ValueError, since the column already exists.

Yields below output.


# Output:
# Add column to DataFrame:
    Tutors  Courses    Fee  Discount
0  William    Spark  22000      1000
1    Henry  PySpark  25000      2300
2  Michael   Hadoop  23000      1000
3     John   Python  24000      1200
4    Messi   Pandas  26000      2500

Add a Column From Dictionary Mapping

If you want to add a column with specific values for each row based on an existing column, you can do this using a dictionary. Here, the values from the dictionary will be added as the Tutors column in df, by matching the dictionary keys with the values in the 'Courses' column.


# Add new column by mapping to the existing column
df = pd.DataFrame(technologies)
tutors = {"Spark":"William", "PySpark":"Henry", "Hadoop":"Michael","Python":"John", "pandas":"Messi"}
df['Tutors'] = df['Courses'].map(tutors)
print("Add column to DataFrame:\n", df)

Note that it is unable to map Pandas because the key 'pandas' in the dictionary does not exactly match the value in the Courses column (the match is case-sensitive). This example yields the below output.


# Output:
# Add column to DataFrame:
   Courses    Fee  Discount   Tutors
0    Spark  22000      1000  William
1  PySpark  25000      2300    Henry
2   Hadoop  23000      1000  Michael
3   Python  24000      1200     John
4   Pandas  26000      2500      NaN
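If unmatched courses should receive a default value instead of NaN, one option is to chain fillna() after map(); a sketch (the 'TBD' default is a made-up placeholder):

```python
import pandas as pd

# Reduced version of the article's data
df = pd.DataFrame({'Courses': ["Spark", "PySpark", "Pandas"]})

tutors = {"Spark": "William", "PySpark": "Henry"}  # no key for "Pandas"

# map() yields NaN for unmatched keys; fillna() replaces it with a default
df['Tutors'] = df['Courses'].map(tutors).fillna('TBD')
print(df['Tutors'].tolist())  # ['William', 'Henry', 'TBD']
```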

Add a Column Using loc[]

Using pandas loc[] you can access rows and columns by labels or names; however, you can also use it to add a new column to a pandas DataFrame. The loc[] indexer takes rows as the first argument and columns as the second, hence I will use the second argument to add a new column.


# Assign the column to the DataFrame
df = pd.DataFrame(technologies)
tutors = ['William', 'Henry', 'Michael', 'John', 'Messi']
df.loc[:, 'Tutors'] = tutors
print("Add column to DataFrame:\n", df)

Yields the same output as above.

Pandas – Select Columns
https://shishirkant.com/pandas-select-columns/ - Tue, 28 Jan 2025

In Pandas, selecting columns by name or index allows you to access specific columns in a DataFrame based on their labels (names) or positions (indices). Use loc[] and iloc[] to select a single column or multiple columns from a pandas DataFrame by column names/labels or index positions respectively.

In this article, I will explain how to select one or more columns from a DataFrame using different methods such as column labels, index, positions, and ranges.

Key Points –

  • Pandas allow selecting columns from a DataFrame by their names using square brackets notation or the .loc[] accessor.
  • The .loc[] accessor allows for more explicit selection, accepting row and column labels or boolean arrays.
  • Alternatively, you can use the .iloc[] accessor to select columns by their integer index positions.
  • For selecting the last column, use df.iloc[:,-1:], and for the first column, use df.iloc[:,:1].
  • Understanding both column name and index-based selection is essential for efficient data manipulation with Pandas.

Quick Examples of Select Columns by Name or Index

If you are in a hurry, below are some quick examples of selecting columns by name or index in Pandas DataFrame.


# Quick examples of select columns by name or index

# Example 1: By using df[] notation
df2 = df[["Courses","Fee","Duration"]] # Select multiple columns

# Example 2: Using loc[] to take column slices
df2 = df.loc[:, ["Courses","Fee","Duration"]] # Select multiple columns
df2 = df.loc[:, ["Courses","Fee","Discount"]] # Select non-adjacent columns
df2 = df.loc[:,'Fee':'Discount'] # Select columns between two columns (inclusive)
df2 = df.loc[:,'Duration':]  # Select columns from 'Duration' onward
df2 = df.loc[:,:'Duration']  # Select columns up to and including 'Duration'
df2 = df.loc[:,::2]          # Select every alternate column

# Example 3: Using iloc[] to select columns by index
df2 = df.iloc[:,[1,3]] # Select columns at positions 1 and 3
df2 = df.iloc[:,1:4] # Select columns at positions 1 through 3
df2 = df.iloc[:,2:] # Select columns from position 2 to the end
df2 = df.iloc[:,:2] # Select the first two columns

First, let’s create a pandas DataFrame.


import pandas as pd
technologies = {
    'Courses':["Shishir","Pandas"],
    'Fee' :[20000,25000],
    'Duration':['30days','40days'],
    'Discount':[1000,2300],
    'Tutor':['Michel','Sam']
              }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.

Create DataFrame:
    Courses    Fee Duration  Discount   Tutor
0  Shishir  20000   30days      1000  Michel
1   Pandas  25000   40days      2300     Sam

Using loc[] to Select Columns by Name

The df[] notation and DataFrame.loc[] accessor in Pandas provide convenient ways to select multiple columns by name or label. You can use the syntax [:, start:stop:step] to define the range of columns to include, where start is the label where the slice starts, stop is the label where the slice ends (with loc[], both endpoints are inclusive), and step is the step size between columns. Another syntax supported by pandas.DataFrame.loc[] is [:, [labels]], where you provide a list of column names as labels.


# loc[] syntax to slice columns
df.loc[:,start:stop:step]

Select DataFrame Columns by Name

To select DataFrame columns by name, you can directly specify the column names within square brackets []. Here, df[['Courses', 'Fee', 'Duration']] selects only the Courses, Fee, and Duration columns from the DataFrame df.


# Select Columns by labels
df2 = df[["Courses","Fee","Duration"]]
print("Select columns by labels:\n", df2)

Yields below output.

Select columns by labels:
    Courses    Fee Duration
0  Shishir  20000   30days
1   Pandas  25000   40days

Select Multiple Columns by Label

To select multiple columns using df.loc[], you specify both row and column labels. If you want to select all rows and specific columns, you can use : to select all rows and provide a list of column labels. Note that loc[] also supports multiple conditions when selecting rows based on column values.


# Select multiple columns
df2 = df.loc[:, ["Courses","Fee","Discount"]]
print("Select multiple columns by labels:\n", df2)

# Output:
# Select multiple columns by labels:
#   Courses    Fee  Discount
# 0  Shishir  20000      1000
# 1   Pandas  25000      2300

In the above example, df.loc[:, ["Courses", "Fee", "Discount"]] selects all rows (:) and the columns labeled Courses, Fee, and Discount from the DataFrame df.

Select Columns Based on Label Indexing

When you want to select columns based on label indexes, provide start and stop labels.

  • If you don’t specify a start label, loc[] selects from the first column.
  • If you don’t provide a stop label, loc[] selects all columns from the start label to the last column.
  • Specifying both start and stop labels selects all columns in between; note that with label-based slicing both endpoints are included.

# Select all columns between Fee and Discount columns
df2 = df.loc[:,'Fee':'Discount']
print("Select columns by labels:\n", df2)

# Output:
# Select columns by labels:
#     Fee Duration  Discount
# 0  20000   30days      1000
# 1  25000   40days      2300

# Select from 'Duration' column
df2 = df.loc[:,'Duration':]
print("Select columns by labels:\n", df2)

# Output
# Select columns by labels:
#  Duration  Discount   Tutor
# 0   30days      1000  Michel
# 1   40days      2300     Sam

# Select from beginning and end at 'Duration' column
df2 = df.loc[:,:'Duration']
print("Select columns by labels:\n", df2)

# Output
# Select columns by labels:
#   Courses    Fee Duration
# 0  Shishir  20000   30days
# 1  Pandas  25000   40days

Select Every Alternate Column

To select every alternate column from a DataFrame, use slicing with a step size of 2 (shown here with loc[]; the same step syntax also works with iloc[]).


# Select every alternate column
df2 = df.loc[:,::2]
print("Select columns by labels:\n", df2)

# Output:
# Select columns by labels:
#   Courses Duration   Tutor
# 0  Shishir   30days  Michel
# 1   Pandas   40days     Sam

This code effectively selects every alternate column, starting from the first column, which results in selecting Courses and Duration.

Pandas iloc[] to Select Column by Index or Position

By using pandas.DataFrame.iloc[], you can select multiple columns from a DataFrame by their positional indices. You can use the syntax [:, start:stop:step] to define the range of columns to include, where start is the index where the slice starts (inclusive), stop is the index where the slice ends (exclusive), and step is the step size between elements. Or, you can use the syntax [:, [indices]] with iloc[], where you provide a list of integer column positions.

Select Columns by Index Position

To select multiple columns from a DataFrame by their index positions, you can use the iloc[] accessor. For instance, df.iloc[:,[1,3,4]] retrieves the Fee, Discount, and Tutor columns and returns a new DataFrame with the selected columns.


# Select columns by position
df2 = df.iloc[:,[1,3,4]]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#     Fee  Discount   Tutor
# 0  20000      1000  Michel
# 1  25000      2300     Sam

Select Columns by Position Range

You can also slice a DataFrame by a range of positions. For instance, select columns by position range using the .iloc[] accessor in Pandas. It selects columns with positions 1 through 3 (exclusive of position 4) from the DataFrame df and assigns them to df2.


# Select between indexes 1 and 4 (2,3,4)
df2 = df.iloc[:,1:4]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#     Fee Duration  Discount
# 0  20000   30days      1000
# 1  25000   40days      2300

# Select From 3rd to end
df2 = df.iloc[:,2:]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#  Duration  Discount   Tutor
# 0   30days      1000  Michel
# 1   40days      2300     Sam

# Select First Two Columns
df2 = df.iloc[:,:2]
print("Select columns by position:\n", df2)

# Output:
# Select columns by position:
#   Courses    Fee
# 0  Shishir  20000
# 1   Pandas  25000

To retrieve the last column of a DataFrame, you can use df.iloc[:,-1:], and to obtain just the first column, you can use df.iloc[:,:1].
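As a quick check, the first- and last-column selections can be verified on a DataFrame matching the one used in this article:

```python
import pandas as pd

# DataFrame matching the example used in this article
df = pd.DataFrame({
    'Courses': ["Shishir", "Pandas"],
    'Fee': [20000, 25000],
    'Duration': ['30days', '40days'],
    'Discount': [1000, 2300],
    'Tutor': ['Michel', 'Sam'],
})

first_col = df.iloc[:, :1]   # first column, still a DataFrame
last_col = df.iloc[:, -1:]   # last column, still a DataFrame
print(list(first_col.columns))  # ['Courses']
print(list(last_col.columns))   # ['Tutor']
```

Note that slicing (:1, -1:) keeps the result a DataFrame; df.iloc[:, -1] would return the last column as a Series instead.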

Complete Example


import pandas as pd
technologies = {
    'Courses':["Shishir","Pandas"],
    'Fee' :[20000,25000],
    'Duration':['30days','40days'],
    'Discount':[1000,2300],
    'Tutor':['Michel','Sam']
              }
df = pd.DataFrame(technologies)
print(df)

# Select multiple columns
print(df[["Courses","Fee","Duration"]])

# Select Random columns
print(df.loc[:, ["Courses","Fee","Discount"]])

# Select columns by range
print(df.loc[:,'Fee':'Discount']) 
print(df.loc[:,'Duration':])
print(df.loc[:,:'Duration'])

# Select every alternate column
print(df.loc[:,::2])

# Selected by column position
print(df.iloc[:,[1,3,4]])

# Select between indexes 1 and 4 (2,3,4)
print(df.iloc[:,1:4])

# Select From 3rd to end
print(df.iloc[:,2:])

# Select First Two Columns
print(df.iloc[:,:2])
Pandas – Select Rows https://shishirkant.com/pandas-select-rows/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-select-rows Tue, 28 Jan 2025 15:25:54 +0000 https://shishirkant.com/?p=4345 Use Pandas DataFrame.iloc[] & DataFrame.loc[] to select rows by integer index and by index labels respectively. The iloc[] attribute can accept a single index, multiple indexes from a list, indexes by a range, and more. The loc[] operator is explicitly used with labels and can accept a single index label, multiple index labels from a list, indexes by a range (between two index labels), and more. When using iloc[] or loc[] with an index that doesn’t exist, it returns an error.

In this article, I will explain how to select rows from Pandas DataFrame by integer index and label (single & multiple rows), by the range, and by selecting first and last n rows with several examples. loc[] & iloc[] attributes are also used to select columns from Pandas DataFrame.
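The error behavior mentioned above is easy to demonstrate; this small sketch catches the exceptions that loc[] and iloc[] raise for a missing label or position:

```python
import pandas as pd

df = pd.DataFrame({'Courses': ["Spark", "PySpark"]}, index=['r1', 'r2'])

# loc[] with a label that does not exist raises KeyError
missing_label = False
try:
    df.loc['r9']
except KeyError:
    missing_label = True

# iloc[] with an out-of-range position raises IndexError
missing_position = False
try:
    df.iloc[9]
except IndexError:
    missing_position = True

print(missing_label, missing_position)  # True True
```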

Key Points –

  • The iloc method is used to select rows by their integer position, starting from 0.
  • The loc method is used to select rows based on the index label.
  • You can use slicing with iloc to select a range of rows based on their positions.
  • With loc, you can specify a range of index labels to select multiple rows.
  • Rows can be selected using square brackets for simpler cases, though this is less flexible than iloc or loc.
  • A list of specific index labels can be passed to loc to select multiple non-consecutive rows.

1. Quick Examples of Select Rows by Index Position & Labels

If you are in a hurry, below are some quick examples of how to select a row of Pandas DataFrame by index.


# Quick examples of select rows by index position & labels

# Select rows by integer index
df2 = df.iloc[2]     # Select Row by Index
df2 = df.iloc[[2,3,6]]  # Select rows by index list
df2 = df.iloc[1:5]   # Select rows by integer index range
df2 = df.iloc[:1]    # Select First Row
df2 = df.iloc[:3]    # Select First 3 Rows
df2 = df.iloc[-1:]   # Select Last Row
df2 = df.iloc[-3:]   # Select Last 3 Row
df2 = df.iloc[::2]   # Selects alternate rows

# Select Rows by Index Labels
df2 = df.loc['r2']          # Select Row by Index Label
df2 = df.loc[['r2','r3','r6']]  # Select Rows by Index Label List
df2 = df.loc['r1':'r5']     # Select Rows by Label Index Range
df2 = df.loc['r1':'r5':2]   # Select Alternate Rows with in Index Labels

Let’s create a DataFrame with a few rows and columns and execute some examples to learn how to use an index. Our DataFrame contains the column names Courses, Fee, Duration, and Discount.


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30days','40days','35days','40days',np.nan,None,'55days'],
    'Discount':[1000,2300,1500,1200,2500,2100,2000]
               }
index_labels=['r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies,index=index_labels)
print("Create DataFrame:\n", df)

Yields below output.

Create DataFrame:
     Courses    Fee Duration  Discount
r1     Spark  20000   30days      1000
r2   PySpark  25000   40days      2300
r3    Hadoop  26000   35days      1500
r4    Python  22000   40days      1200
r5    Pandas  24000      NaN      2500
r6    Oracle  21000     None      2100
r7      Java  22000   55days      2000

2. Select Rows by Index using Pandas iloc[]

pandas.iloc[] attribute is used for integer-location-based indexing to select rows and columns in a DataFrame. Remember the index starts from 0. You can use pandas.DataFrame.iloc[] with the syntax [start:stop:step], where start is the index of the first row to take (inclusive), stop is the index where the slice ends (exclusive), and step is the number of indices to advance after each extraction. Or, use the syntax [[indices]] with indices as a list of row indices to take.

2.1 Select Row by Integer Index

You can select a single row from Pandas DataFrame by integer index using df.iloc[n]. Replace n with a position you want to select.


# Select Row by Integer Index
df1 = df.iloc[2]
print("After selecting a row by index position:\n", df1)
# Output:
# After selecting a row by index position:
# Courses     Hadoop
# Fee          26000
# Duration    35days
# Discount      1500
# Name: r3, dtype: object

2.2. Get Multiple Rows by Index List

Sometimes you may need to get multiple rows from DataFrame by specifying indexes as a list. Certainly, you can do this. For example, df.iloc[[2,3,6]] selects rows 3, 4, and 7 as the index starts from zero.


# Select Rows by Index List
df1 = df.iloc[[2,3,6]]
print("After selecting rows by index position:\n", df1)

# Output:
# After selecting rows by index position:
#   Courses    Fee Duration  Discount
# r3  Hadoop  26000   35days      1500
# r4  Python  22000   40days      1200
# r7    Java  22000   55days      2000

2.3. Get DataFrame Rows by Index Range

When you want to select a DataFrame by the range of Indexes, provide start and stop indexes.

  • By not providing a start index, iloc[] selects from the first row.
  • By not providing a stop index, iloc[] selects all rows from the start index onward.
  • Providing both start and stop selects all rows in between, including start but excluding stop.

# Select Rows by Integer Index Range
df1 = df.iloc[1:5]
print("After selecting rows by index range:\n", df1)

# Output:
# After selecting rows by index range:
#    Courses    Fee Duration  Discount
# r2  PySpark  25000   40days      2300
# r3   Hadoop  26000   35days      1500
# r4   Python  22000   40days      1200
# r5   Pandas  24000      NaN      2500

# Select First Row by Index
print(df.iloc[:1])

# Outputs:
# Courses    Fee Duration  Discount
# r1   Spark  20000   30days      1000

# Select First 3 Rows
print(df.iloc[:3])

# Outputs:
#    Courses    Fee Duration  Discount
# r1    Spark  20000   30days      1000
# r2  PySpark  25000   40days      2300
# r3   Hadoop  26000   35days      1500

# Select Last Row by Index
print(df.iloc[-1:])

# Outputs:
#   Courses    Fee Duration  Discount
# r7    Java  22000   55days      2000

# Select Last 3 Row
print(df.iloc[-3:])

# Output:
#   Courses    Fee Duration  Discount
# r5  Pandas  24000      NaN      2500
# r6  Oracle  21000     None      2100
# r7    Java  22000   55days      2000

# Selects alternate rows
print(df.iloc[::2])

# Output:
#   Courses    Fee Duration  Discount
# r1   Spark  20000   30days      1000
# r3  Hadoop  26000   35days      1500
# r5  Pandas  24000      NaN      2500
# r7    Java  22000   55days      2000

3. Select Rows by Index Labels using Pandas loc[]

By using pandas.DataFrame.loc[] you can get rows by index names or labels. To select rows, the syntax is df.loc[start:stop:step], where start is the first row label to take, stop is the last row label to take (inclusive), and step is the number of indices to advance after each extraction; for example, you can use it to select alternate rows. Or, use the syntax [[labels]] with labels as a list of row labels to take.

3.1. Get Row by Label

If you have custom index labels on a DataFrame, you can use these label names to select a row. For example, df.loc['r2'] returns the row with label ‘r2’.


# Select Row by Index Label
df1 = df.loc['r2']
print("After selecting a row by index label:\n", df1)

# Output:
# After selecting a row by index label:
# Courses     PySpark
# Fee           25000
# Duration     40days
# Discount       2300
# Name: r2, dtype: object

3.2. Get Multiple Rows by Label List

If you have a list of row labels, you can use this to select multiple rows from Pandas DataFrame.


# Select Rows by Index Label List
df1 = df.loc[['r2','r3','r6']]
print("After selecting rows by index label:\n", df1)

# Output:
# After selecting rows by index label:
#    Courses    Fee Duration  Discount
# r2  PySpark  25000   40days      2300
# r3   Hadoop  26000   35days      1500
# r6   Oracle  21000     None      2100

3.3. Get Rows Between Two Labels

You can also select rows between two index labels.


# Select Rows by Label Index Range
df1 = df.loc['r1':'r5']
print("After selecting rows by index label range:\n", df1)

# Output:
# After selecting rows by index label range:
#    Courses    Fee Duration  Discount
# r1    Spark  20000   30days      1000
# r2  PySpark  25000   40days      2300
# r3   Hadoop  26000   35days      1500
# r4   Python  22000   40days      1200
# r5   Pandas  24000      NaN      2500

# Select Alternate Rows with in Index Labels
print(df.loc['r1':'r5':2])

# Outputs:
#   Courses    Fee Duration  Discount
# r1   Spark  20000   30days      1000
# r3  Hadoop  26000   35days      1500
# r5  Pandas  24000      NaN      2500

You can get the first two rows using df.loc[:'r2'], but this approach is rarely used because you need to know the row labels. To select the first n rows, it is recommended to use df.iloc[:n] instead, replacing n with the number of rows you want. The same applies to getting the last n rows.
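For this DataFrame the two approaches are equivalent, which you can verify directly:

```python
import pandas as pd

df = pd.DataFrame(
    {'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Oracle", "Java"],
     'Fee': [20000, 25000, 26000, 22000, 24000, 21000, 22000]},
    index=['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7'])

# Label slicing needs knowledge of the labels; positional slicing does not
first_two_by_label = df.loc[:'r2']
first_two_by_pos = df.iloc[:2]
print(first_two_by_label.equals(first_two_by_pos))  # True
```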

4. Complete Example


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30days','40days','35days','40days',np.nan,None,'55days'],
    'Discount':[1000,2300,1500,1200,2500,2100,2000]
               }
index_labels=['r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies,index=index_labels)
print(df)

# Select Row by Index
print(df.iloc[2])

# Select Rows by Index List
print(df.iloc[[2,3,6]])

# Select Rows by Integer Index Range
print(df.iloc[1:5])

# Select First Row
print(df.iloc[:1])

# Select First 3 Rows
print(df.iloc[:3])

# Select Last Row
print(df.iloc[-1:])

# Select Last 3 Row
print(df.iloc[-3:])

# Selects alternate rows
print(df.iloc[::2])

# Select Row by Index Label
print(df.loc['r2'])

# Select Rows by Index Label List
print(df.loc[['r2','r3','r6']])

# Select Rows by Label Index Range
print(df.loc['r1':'r5'])

# Select alternate rows with in index labels
print(df.loc['r1':'r5':2])
Pandas – Create DataFrame https://shishirkant.com/pandas-create-dataframe/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-create-dataframe Tue, 28 Jan 2025 15:19:56 +0000 https://shishirkant.com/?p=4341 Python pandas is widely used for data science/data analysis and machine learning applications. It is built on top of another popular package named NumPy, which provides scientific computing in Python. A pandas DataFrame is a 2-dimensional labeled data structure with rows and columns (columns of potentially different types like integers, strings, floats, None, Python objects, etc.). You can think of it as an Excel spreadsheet or SQL table.

1. Create Pandas DataFrame

One of the easiest ways to create a pandas DataFrame is by using its constructor. DataFrame constructor takes several optional params that are used to specify the characteristics of the DataFrame.

Below is the syntax of the DataFrame constructor.


# DataFrame constructor syntax
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

Now, let’s create a DataFrame from a list of lists (with a few rows and columns).


# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Shishir",20000, "30days"], 
                 ["Pandas",25000, "40days"], 
               ]
df=pd.DataFrame(technologies)
print(df)

Since we have not given index and column labels, DataFrame by default assigns incremental sequence numbers as labels to both rows and columns.


# Output:
        0      1       2
0  Shishir  20000  30days
1  Pandas  25000  40days

Column names with sequence numbers don’t make sense as it’s hard to identify what data each column holds. Hence, it is always best practice to provide column names that identify the data they hold. Use the columns and index params to provide column labels and a custom index to the DataFrame.


# Add Column & Row Labels to the DataFrame
column_names=["Courses","Fee","Duration"]
row_label=["a","b"]
df=pd.DataFrame(technologies,columns=column_names,index=row_label)
print(df)

Yields below output. Alternatively, you can also add column labels to the existing DataFrame.


# Output:
  Courses    Fee Duration
a  Shishir  20000   30days
b   Pandas  25000   40days

By default, pandas identifies the data types from the data and assigns them to the DataFrame. df.dtypes returns the data type of each column.


# Output:
Courses     object
Fee          int64
Duration    object
dtype: object

You can also assign custom data types to columns.


# Set custom types to DataFrame
types={'Courses': str,'Fee':float,'Duration':str}
df=df.astype(types)
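A quick check confirms the cast took effect; df.dtypes now reports Fee as float64:

```python
import pandas as pd

technologies = [["Shishir", 20000, "30days"],
                ["Pandas", 25000, "40days"]]
df = pd.DataFrame(technologies, columns=["Courses", "Fee", "Duration"])

# Cast Fee to float and verify the resulting dtypes
df = df.astype({'Courses': str, 'Fee': float, 'Duration': str})
print(df.dtypes)
```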

2. Create DataFrame from the Dict (dictionary)

Another common way to create a pandas DataFrame is from a Python dict (dictionary) object. This comes in handy if you want to convert a dictionary into a DataFrame. Keys from the dict become column names and values become the column data.


# Create DataFrame from Dict
technologies = {
    'Courses':["Shishir","Pandas"],
    'Fee' :[20000,25000],
    'Duration':['30days','40days']
              }
df = pd.DataFrame(technologies)
print(df)

3. Create DataFrame with Index

By default, DataFrame adds a numeric index starting from zero. It can be changed to a custom index while creating the DataFrame.


# Create DataFrame with Index.
technologies = {
    'Courses':["Shishir","Pandas"],
    'Fee' :[20000,25000],
    'Duration':['30days','40days']
              }
index_label=["r1","r2"]
df = pd.DataFrame(technologies, index=index_label)
print(df)

4. Creating DataFrame from List of Dicts Object

Sometimes we get data as a JSON string (similar to a dict); you can convert it to a DataFrame as shown below.


# Creates DataFrame from list of dict
technologies = [{'Courses':'Shishir', 'Fee': 20000, 'Duration':'30days'},
        {'Courses':'Pandas', 'Fee': 25000, 'Duration': '40days'}]

df = pd.DataFrame(technologies)
print(df)

5. Creating DataFrame From Series

By using the concat() method you can create a DataFrame from multiple Series. This method takes several params; for this scenario we pass a list of the Series to combine and axis=1 to merge the Series as columns instead of rows.


# Create pandas Series
courses = pd.Series(["Shishir","Pandas"])
fees = pd.Series([20000,25000])
duration = pd.Series(['30days','40days'])

# Create DataFrame from series objects.
df=pd.concat([courses,fees,duration],axis=1)
print(df)

# Outputs
#        0      1       2
# 0  Shishir  20000  30days
# 1   Pandas  25000  40days

6. Add Column Labels

As you see above, by default concat() method doesn’t add column labels. You can do so as below.


# Assign Index to Series
index_labels=['r1','r2']
courses.index = index_labels
fees.index = index_labels
duration.index = index_labels

# Concat Series by Changing Names
df=pd.concat({'Courses': courses,
              'Course_Fee': fees,
              'Course_Duration': duration},axis=1)
print(df)

# Outputs:
#    Courses  Course_Fee Course_Duration
# r1  Shishir       20000          30days
# r2   Pandas       25000          40days

7. Creating DataFrame using zip() function

Multiple lists can be merged using zip() method and the output is used to create a DataFrame.


# Create Lists
Courses = ['Shishir', 'Pandas']
Fee = [20000,25000]
Duration = ['30days','40days']
   
# Merge lists by using zip().
tuples_list = list(zip(Courses, Fee, Duration))
df = pd.DataFrame(tuples_list, columns = ['Courses', 'Fee', 'Duration'])

8. Create an Empty DataFrame in Pandas

Sometimes you would need to create an empty pandas DataFrame with or without columns. This would be required in many cases, below is one example.

When working with files, there are times when a file may not be available for processing. However, we may still need to manually create a DataFrame with the expected column names. Failing to use the correct column names can cause operations or transformations, such as unions, to fail, as they rely on columns that may not exist.

To handle situations like these, it’s important to always create a DataFrame with the expected columns, ensuring that the column names and data types are consistent, whether the file exists or if we’re processing an empty file.


# Create Empty DataFrame
df = pd.DataFrame()
print(df)

# Outputs:
# Empty DataFrame
# Columns: []
# Index: []

To create an empty DataFrame with just column names but no data.


# Create Empty DataFrame with Column Labels
df = pd.DataFrame(columns = ["Courses","Fee","Duration"])
print(df)

# Outputs:
# Empty DataFrame
# Columns: [Courses, Fee, Duration]
# Index: []
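The missing-file scenario described earlier can be sketched like this; the file name and column list here are illustrative, not from the original article:

```python
import pandas as pd

expected_columns = ["Courses", "Fee", "Duration"]

def load_or_empty(path):
    """Read a CSV, falling back to an empty DataFrame
    with the expected schema when the file is missing."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        return pd.DataFrame(columns=expected_columns)

df = load_or_empty("no_such_file.csv")  # hypothetical path
print(list(df.columns))  # ['Courses', 'Fee', 'Duration']
```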

9. Create DataFrame From CSV File

In real-world projects we are often required to read the contents of CSV files and create a DataFrame. In pandas, creating a DataFrame from a CSV is done by using the pandas.read_csv() method. This returns a DataFrame with the contents of the CSV file.


# Create DataFrame from CSV file
df = pd.read_csv('data_file.csv')

10. Create From Another DataFrame

Finally, you can also copy a DataFrame from another DataFrame using the copy() method.


# Copy DataFrame to another
df2=df.copy()
print(df2)
Pandas – What is Series https://shishirkant.com/pandas-what-is-series/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-what-is-series Mon, 27 Jan 2025 15:05:21 +0000 https://shishirkant.com/?p=4333 Pandas Series Introduction

This is a beginner’s guide of Python pandas Series Tutorial where you will learn what is pandas Series? its features, advantages, and how to use panda Series with sample examples.

Every sample example explained in this tutorial is tested in our development environment and is available for reference.

All pandas Series examples provided in this tutorial are basic, simple, and easy to practice for beginners who are enthusiastic to learn pandas and advance their career in Data Science, analytics, and Machine Learning.

Note: In case you can’t find the pandas Series examples you are looking for on this tutorial page, I would recommend using the Search option from the menu bar to find your tutorial and sample example code, there are hundreds of tutorials in pandas on this website you can learn from.

What is the Pandas Series

Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, etc.). It’s similar to a one-dimensional array or a list in Python, but with additional functionalities. Each element in a Pandas Series has a label associated with it, called an index. This index allows for fast and efficient data access and manipulation. Pandas Series can be created from various data structures like lists, dictionaries, NumPy arrays, etc.

Pandas Series vs DataFrame?

  • As I explained above, pandas Series is a one-dimensional labeled array of the same data type whereas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. 
  • In a DataFrame, each column of data is represented as a pandas Series.
  • DataFrame column can have a name/label but, Series cannot have a column name.
  • DataFrame can also be converted to Series and single or multiple Series can be converted to a DataFrame. Refer to pandas DataFrame Tutorial for more details and examples on DataFrame.
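The relationship between the two structures is easy to demonstrate; each DataFrame column is a Series, and a Series converts back to a one-column DataFrame with to_frame():

```python
import pandas as pd

df = pd.DataFrame({'Courses': ["Spark", "PySpark"], 'Fee': [20000, 25000]})

col = df['Fee']            # a single column is a Series
print(type(col).__name__)     # Series

back = col.to_frame()      # and converts back to a DataFrame
print(type(back).__name__)    # DataFrame
```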

Syntax of pandas.Series()

Following is the syntax of pandas.Series(), which is used to create Pandas Series objects.


# Pandas Series Constructor Syntax
pandas.Series(data, index, dtype, copy)
  • data – The data to be stored in the Series. It can be a list, ndarray, dictionary, scalar value (like an integer or string), etc.
  • index – Optional. It allows you to specify the index labels for the Series. If not provided, default integer index labels (0, 1, 2, …) will be used.
  • dtype – Optional. The data type of the Series. If not specified, it will be inferred from the data.
  • copy – Optional. If True, it makes a copy of the data. Default is False.
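The index and dtype parameters from the syntax above can be seen together in a small example:

```python
import pandas as pd

# index supplies custom labels; dtype forces the element type
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], dtype='float64')
print(s['b'])    # 20.0
print(s.dtype)   # float64
```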

Create pandas Series

pandas Series can be created in multiple ways, From array, list, dict, and from existing DataFrame.

Create Series using array

Before creating a Series, first we have to import the NumPy module and use the array() function. If the data is an ndarray, the passed index should be of the same length; if no index is passed, the default is range(n).


# Create Series from array
import pandas as pd 
import numpy as np
data = np.array(['python','php','java'])
series = pd.Series(data)
print (series)

# Output:
# 0    python
# 1       php
# 2      java
# dtype: object

Notice that the column doesn’t have a name, and the Series adds an incremental sequence number as the index (first column) by default.

To customize the index of a Pandas Series, you can provide the index parameter when creating the Series using the pd.Series() constructor.


# Create pandas DataFrame with custom index
s2=pd.Series(data=data, index=['r1', 'r2', 'r3'])
print(s2)

# Output:
# r1    python
# r2       php
# r3      java
# dtype: object

Create Series using Dict

A dict can be used as input. Keys from the dict are used as the index and values are used as the data.


# Create a Dict from a input
data = {'Courses' :"pandas", 'Fees' : 20000, 'Duration' : "30days"}
s2 = pd.Series(data)
print (s2)

# Output:
# Courses     pandas
# Fees         20000
# Duration    30days
# dtype: object

Note that when you pass both a dict and an explicit index, pandas reindexes the Series by the given labels: labels that match dict keys keep their values, while labels not found in the dict get NaN.


# Pass an explicit index along with a Dict
data = {'Courses' :"pandas", 'Fees' : 20000, 'Duration' : "30days"}
s2 = pd.Series(data, index=['Courses','Course_Fee','Course_Duration'])
print (s2)

# Output:
# Courses            pandas
# Course_Fee            NaN
# Course_Duration       NaN
# dtype: object

Create Series using List

Below is an example of creating a Series from a list.


# Creating Series from List
data = ['python','php','java']
s2 = pd.Series(data, index=['r1', 'r2','r3'])
print(s2)

# Output:
# r1    python
# r2       php
# r3      java
# dtype: object

Create Empty Series

Sometimes you may need to create an empty Series. You can do so by using the empty constructor.


# Create empty Series
import pandas as pd
s = pd.Series(dtype='object')
print(s)

This shows an empty Series. Passing an explicit dtype avoids the deprecation warning some pandas versions emit when creating an empty Series without one.

Convert a Series into a DataFrame

To convert a Series into a DataFrame, you can use pandas.concat(), pandas.merge(), or DataFrame.join(). Below I have explained using the concat() function. For others, please refer to pandas combine two Series to DataFrame.


# Convert series to dataframe
courses = pd.Series(["Spark","PySpark","Hadoop"], name='courses')
fees = pd.Series([22000,25000,23000], name='fees')
df=pd.concat([courses,fees],axis=1)
print(df)

# Output:
#   courses   fees
# 0    Spark  22000
# 1  PySpark  25000
# 2   Hadoop  23000

Convert pandas DataFrame to Series

In this section of the pandas Series Tutorial, I will explain different ways to convert a DataFrame to a Series. As explained at the beginning, each column in a DataFrame is essentially a Series, so we can easily extract single or multiple columns from a DataFrame and convert them into Series objects.

  1. You can convert a single-column DataFrame into a Series by extracting that single column.
  2. To obtain a Series from a specific column in a multi-column DataFrame, simply access that column using its name.
  3. To convert a single row of a DataFrame into a Series, you can utilize indexing to select the row and obtain it as a Series

Convert a single DataFrame column into a series

To run some examples of converting a single-column DataFrame into a Series, let’s create a DataFrame and use DataFrame.squeeze() to convert it into a Series:


# Create DataFrame with single column
data =  ["Python","PHP","Java"]
df = pd.DataFrame(data, columns = ['Courses'])
my_series = df.squeeze()
print(my_series)
print (type(my_series))

The DataFrame will now get converted into a Series:


# Output:
0    Python
1       PHP
2      Java
Name: Courses, dtype: object
<class 'pandas.core.series.Series'>

Convert the DataFrame column into a series

You can use the .squeeze() method to convert a DataFrame column into a Series.

For example, take a multiple-column DataFrame:


# Create DataFrame with multiple columns
import pandas as pd
data = {'Courses': ['Spark', 'PySpark', 'Python'],
        'Duration':['30 days', '40 days', '50 days'],
        'Fee':[20000, 25000, 26000]
        }
df = pd.DataFrame(data, columns = ['Courses', 'Duration', 'Fee'])
print(df)
print (type(df))

This will give you the Fee column of the DataFrame df as a Series named my_series. Note that df['Fee'] already returns a Series, so .squeeze() is a no-op here; it is mainly useful on a single-column DataFrame such as df[['Fee']], which it reduces to a Series.


# Pandas DataFrame column to series
my_series= df['Fee'].squeeze()

Convert DataFrame Row into a Series

You can use .iloc[] to access a row by its integer position. df.iloc[2] already returns the row as a Series (with the column names as its index), so the trailing .squeeze() is a no-op here.


# Convert dataframe row to series
my_series = df.iloc[2].squeeze()
print(my_series)
print (type(my_series))

Then, we can get the following series:


# Output:
Courses      Python
Duration    50 days
Fee           26000
Name: 2, dtype: object
<class 'pandas.core.series.Series'>

Merge DataFrame and Series?

  1. Construct a dataframe from the series.
  2. After that merge with the dataframe.
  3. Specify the data as the values, multiply them by the length, set the columns to the index and set params for left_index and set the right_index to True.

# Syntax for merge with the DataFrame.
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
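A small worked sketch of this pattern (with illustrative data): the Series is broadcast across every row of the DataFrame and then merged on the index.

```python
import pandas as pd

df = pd.DataFrame({'Courses': ['Spark', 'PySpark']})
s = pd.Series({'Fee': 20000, 'Duration': '30days'})

# One row per df row, columns taken from the Series index
broadcast = pd.DataFrame([s.values] * len(df), columns=s.index)
merged = df.merge(broadcast, left_index=True, right_index=True)
print(merged)
#    Courses    Fee Duration
# 0    Spark  20000   30days
# 1  PySpark  20000   30days
```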

Pandas Series Attributes:

T – Return the transpose, which is by definition self.
array – The ExtensionArray of the data backing this Series or Index.
at – Access a single value for a row/column label pair.
attrs – Dictionary of global attributes of this dataset.
axes – Return a list of the row axis labels.
dtype – Return the dtype object of the underlying data.
dtypes – Return the dtype object of the underlying data.
flags – Get the properties associated with this pandas object.
hasnans – Return True if there are any NaNs; enables various perf speedups.
iat – Access a single value for a row/column pair by integer position.
iloc – Purely integer-location based indexing for selection by position.
index – The index (axis labels) of the Series.
is_monotonic – Return boolean if values in the object are monotonically increasing.
is_monotonic_decreasing – Return boolean if values in the object are monotonically decreasing.
is_monotonic_increasing – Alias for is_monotonic.
is_unique – Return boolean if values in the object are unique.
loc – Access a group of rows and columns by label(s) or a boolean array.
name – Return the name of the Series.
nbytes – Return the number of bytes in the underlying data.
ndim – Number of dimensions of the underlying data, by definition 1.
shape – Return a tuple of the shape of the underlying data.
size – Return the number of elements in the underlying data.
values – Return Series as ndarray or ndarray-like depending on the dtype.

Pandas Series Methods:

abs() – Return a Series/DataFrame with the absolute numeric value of each element.
add(other[, level, fill_value, axis]) – Return addition of series and other, element-wise (binary operator add).
add_prefix(prefix) – Prefix labels with string prefix.
add_suffix(suffix) – Suffix labels with string suffix.
agg([func, axis]) – Aggregate using one or more operations over the specified axis.
aggregate([func, axis]) – Aggregate using one or more operations over the specified axis.
align(other[, join, axis, level, copy, …]) – Align two objects on their axes with the specified join method.
all([axis, bool_only, skipna, level]) – Return whether all elements are True, potentially over an axis.
any([axis, bool_only, skipna, level]) – Return whether any element is True, potentially over an axis.
append(to_append[, ignore_index, …]) – Concatenate two or more Series.
apply(func[, convert_dtype, args]) – Invoke a function on values of a Series.
argmax([axis, skipna]) – Return int position of the largest value in the Series.
argmin([axis, skipna]) – Return int position of the smallest value in the Series.
argsort([axis, kind, order]) – Return the integer indices that would sort the Series values.
asfreq(freq[, method, how, normalize, …]) – Convert time series to specified frequency.
asof(where[, subset]) – Return the last row(s) without any NaNs before where.
astype(dtype[, copy, errors]) – Cast a pandas object to a specified dtype.
at_time(time[, asof, axis]) – Select values at a particular time of day (e.g., 9:30AM).
autocorr([lag]) – Compute the lag-N autocorrelation.
backfill([axis, inplace, limit, downcast]) – Synonym for DataFrame.fillna() with method="bfill".
between(left, right[, inclusive]) – Return boolean Series equivalent to left <= series <= right.
between_time(start_time, end_time[, …]) – Select values between particular times of the day (e.g., 9:00-9:30 AM).


bfill([axis, inplace, limit, downcast]) – Synonym for DataFrame.fillna() with method="bfill".
bool() – Return the bool of a single-element Series or DataFrame.
cat – Alias of pandas.core.arrays.categorical.CategoricalAccessor.
clip([lower, upper, axis, inplace]) – Trim values at input threshold(s).
combine(other, func[, fill_value]) – Combine the Series with a Series or scalar according to func.
combine_first(other) – Update null elements with the value in the same location in ‘other’.
compare(other[, align_axis, keep_shape, …]) – Compare to another Series and show the differences.
convert_dtypes([infer_objects, …]) – Convert columns to the best possible dtypes using dtypes supporting pd.NA.
copy([deep]) – Make a copy of this object’s indices and data.
corr(other[, method, min_periods]) – Compute correlation with other Series, excluding missing values.
count([level]) – Return number of non-NA/null observations in the Series.
cov(other[, min_periods, ddof]) – Compute covariance with Series, excluding missing values.
cummax([axis, skipna]) – Return cumulative maximum over a DataFrame or Series axis.
cummin([axis, skipna]) – Return cumulative minimum over a DataFrame or Series axis.
cumprod([axis, skipna]) – Return cumulative product over a DataFrame or Series axis.
cumsum([axis, skipna]) – Return cumulative sum over a DataFrame or Series axis.
describe([percentiles, include, exclude, …]) – Generate descriptive statistics.
diff([periods]) – First discrete difference of element.
div(other[, level, fill_value, axis]) – Return floating division of series and other, element-wise (binary operator truediv).
divide(other[, level, fill_value, axis]) – Return floating division of series and other, element-wise (binary operator truediv).
divmod(other[, level, fill_value, axis]) – Return integer division and modulo of series and other, element-wise (binary operator divmod).
dot(other) – Compute the dot product between the Series and the columns of other.
drop([labels, axis, index, columns, level, …]) – Return Series with specified index labels removed.
drop_duplicates([keep, inplace]) – Return Series with duplicate values removed.
droplevel(level[, axis]) – Return Series/DataFrame with requested index / column level(s) removed.
dropna([axis, inplace, how]) – Return a new Series with missing values removed.
dt – Alias of pandas.core.indexes.accessors.CombinedDatetimelikeProperties.
duplicated([keep]) – Indicate duplicate Series values.
eq(other[, level, fill_value, axis]) – Return equal-to of series and other, element-wise (binary operator eq).
equals(other) – Test whether two objects contain the same elements.
ewm([com, span, halflife, alpha, …]) – Provide exponentially weighted (EW) functions.
expanding([min_periods, center, axis, method]) – Provide expanding transformations.
explode([ignore_index]) – Transform each element of a list-like to a row.
factorize([sort, na_sentinel]) – Encode the object as an enumerated type or categorical variable.
ffill([axis, inplace, limit, downcast]) – Synonym for DataFrame.fillna() with method="ffill".
fillna([value, method, axis, inplace, …]) – Fill NA/NaN values using the specified method.
filter([items, like, regex, axis]) – Subset the DataFrame rows or columns according to the specified index labels.
first(offset) – Select initial periods of time series data based on a date offset.
first_valid_index() – Return index for first non-NA value, or None if no non-NA value is found.

Conclusion

In this pandas Series tutorial, we have learned what a pandas Series is, how to create a Series from different types of inputs, and how to convert a pandas Series to a DataFrame and vice versa, with working examples.

]]>
Pandas – What is DataFrame https://shishirkant.com/pandas-what-is-dataframe/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-what-is-dataframe Mon, 27 Jan 2025 14:56:41 +0000 https://shishirkant.com/?p=4329 What is Pandas DataFrame?

A pandas DataFrame represents a two-dimensional dataset, characterized by labeled rows and columns, making it a versatile tabular structure. It comprises three essential components: the data itself, along with its rows and columns. Built upon the robust foundation of the NumPy library, pandas is implemented using languages such as Python, Cython, and C.

DataFrame Features

  • DataFrames support named rows & columns (you can also provide names to rows)
  • Supports heterogeneous collections of data.
  • DataFrame labeled axes (rows and columns).
  • Can perform arithmetic operations on rows and columns.
  • Supports reading flat files such as CSV, Excel, and JSON, as well as SQL tables.
  • Handling of missing data.

Create Pandas DataFrame

In this section of the tutorial, I will explain different ways to create pandas DataFrame with examples.

– Create using Constructor

A simple method to create a DataFrame is by using its constructor.


# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Spark",20000, "30days"], 
                 ["pandas",20000, "40days"], 
               ]
df=pd.DataFrame(technologies)
print(df)

As we haven’t provided labels for the columns and indexes, the DataFrame automatically assigns incremental sequence numbers as labels for both rows and columns. These are referred to as the Index.


# Output
        0      1       2
0   Spark  20000  30days
1  pandas  20000  40days

Assigning sequence numbers as column names can be confusing, as it becomes challenging to discern the content of each column. Therefore, it’s advisable to assign meaningful names to columns that reflect the data they contain. To achieve this, utilize the columns parameter to label columns and the index parameter to label rows of the DataFrame.


# Add Column & Row Labels to the DataFrame
column_names=["Courses","Fee","Duration"]
row_label=["a","b"]
df=pd.DataFrame(technologies,columns=column_names,index=row_label)
print(df)

Output.


# Output
  Courses    Fee Duration
a   Spark  20000   30days
b  pandas  20000   40days

By default, pandas automatically detects the data types from the data and assigns them to the DataFrame. The df.dtypes attribute returns the data type of each column.


# types
df.dtypes

# Output
Courses     object
Fee          int64
Duration    object
dtype: object

Custom data types can also be assigned to the columns.


# Set custom types to DataFrame
types={'Courses': str,'Fee':float,'Duration':str}
df=df.astype(types)
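After casting you can confirm the new types via df.dtypes. A minimal self-contained sketch, re-creating the two-row DataFrame from earlier:

```python
import pandas as pd

df = pd.DataFrame([["Spark", 20000, "30days"],
                   ["pandas", 20000, "40days"]],
                  columns=["Courses", "Fee", "Duration"])

# Cast Fee to float; str maps to pandas' object dtype
df = df.astype({'Courses': str, 'Fee': float, 'Duration': str})
print(df.dtypes)
# Courses      object
# Fee         float64
# Duration     object
# dtype: object
```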

Another commonly used method to create a DataFrame is from a dictionary.


# Create DataFrame from dictionary
technologies = {
    'Courses':["Spark","PySpark","Hadoop"],
    'Fee' :[20000,25000,26000],
    'Duration':['30day','40days','35days'],
    'Discount':[1000,2300,1500]
              }
df = pd.DataFrame(technologies)

— Create DataFrame From CSV File

In real-time we are often required to read TEXT, CSV, JSON files to create a DataFrame. In pandas, creating a DataFrame from CSV is pretty simple.


# Create DataFrame from CSV file
df = pd.read_csv('data_file.csv')

DataFrame Basic Operations

In order to explain some basic operations with pandas DataFrame let’s create it with some data.


# Create DataFrame with None/Null to work with examples
import pandas as pd
import numpy as np
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas",None,"Spark","Python"],
    'Fee' :[22000,25000,23000,24000,np.nan,25000,25000,22000],
    'Duration':['30day','50days','55days','40days','60days','35day','','50days'],
    'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
          })
row_labels=['r0','r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies, index=row_labels)
print(df)

Note that our data contains np.nan, None, and empty values. Also note that every column in a DataFrame is internally represented as a pandas Series.

— DataFrame properties

DataFrame has several properties, in this pandas DataFrame tutorial I will cover most used properties with examples.

df.shape – (8, 4) – Returns the shape of the DataFrame (number of rows, number of columns) as a tuple.
df.size – 32 – Returns the number of cells (rows * columns).
df.empty – False – Returns a boolean; True when the DataFrame is empty.
df.columns – Index(['Courses', 'Fee', 'Duration', 'Discount'], dtype='object') – Returns all column names as an Index.
df.columns.values – ['Courses' 'Fee' 'Duration' 'Discount'] – Returns column names from the header as an array.
df.index – Index(['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7'], dtype='object') – Returns the Index of the DataFrame.
df.index.values – ['r0' 'r1' 'r2' 'r3' 'r4' 'r5' 'r6' 'r7'] – Returns the index labels as an array.
df.dtypes – Returns the data type of each column (Courses: object, Fee: float64, Duration: object, Discount: int64).
df['Fee'] or df[['Fee','Duration']] – Selects columns by name; a single column is returned as a Series, a list of names as a DataFrame.
df2 = df[df['Fee'] == 22000] – Filters the DataFrame; returns the rows r0 (Spark) and r7 (Python) where Fee equals 22000.
df2 = df[6:] – Selects rows by position; returns the rows from the 6th index onward (r6 and r7).
df['Duration'][3] or df["Duration"].values[3] – Gets a cell value (row x column); returns '40days'.
df['Fee'] = df['Fee'] - 500 – Updates a column; subtracts 500 from every value of the 'Fee' column.
df['new_column'] = '' – Adds a new column with empty values.

Manipulate DataFrame

— Describe DataFrame

describe() – The describe() function calculates count, mean, std, min, max, and different percentiles of each numeric column of a pandas DataFrame.


# Describe DataFrame for all numeric columns
df.describe()

# Output
                Fee     Discount
count      7.000000     8.000000
mean   23714.285714  1537.500000
std     1380.131119   570.557372
min    22000.000000  1000.000000
25%    22500.000000  1150.000000
50%    24000.000000  1350.000000
75%    25000.000000  1775.000000
max    25000.000000  2500.000000

— Filter Rows from DataFrame

query()/apply()/loc[] – These are used to query a pandas DataFrame. You can also chain operators while filtering pandas rows.

  • pandas.DataFrame.filter() – To filter rows by index and columns by name.
  • pandas.DataFrame.loc[] – To select rows by indices label and column by name.
  • pandas.DataFrame.iloc[] – To select rows by index and column by position.
  • pandas.DataFrame.apply() – To custom select using lambda function.

# Using DataFrame.query()
df.query("Courses == 'Spark'",inplace=True)
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("Fee >= 23000 and Fee <= 24000")

# Using DataFrame.loc[]
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]

# Using apply()
df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])])

# Other ways to filter 
df[df["Courses"] == 'Spark'] 
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]

— Insert Rows & Columns to DataFrame

insert()/assign() – Adds a new column to the pandas DataFrame

By using assign() & insert() methods you can add one or multiple columns to the pandas DataFrame.


df = pd.DataFrame(technologies, index=row_labels)

# Adds new column 'TutorsAssigned' to DataFrame
tutors = ['William', 'Henry', 'Michael', 'John', 
          'Messi', 'Ramana','Kumar','Vasu']
df2 = df.assign(TutorsAssigned=tutors)

# Add new column from existing column
df2=df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)

# Other way to add a column
df["TutorsAssigned"] = tutors

# Add new column at the beginning
df.insert(0,'TutorsAssigned', tutors )

— Rename DataFrame Columns

rename() – Renames pandas DataFrame columns

Pandas DataFrame.rename() method is used to change/replace columns (single & multiple columns), by index, and all columns of the DataFrame.


df = pd.DataFrame(technologies, index=row_labels)

# Assign new header by setting new column names.
df.columns=['A','B','C','D']

# Change column name by index. This changes 3rd column 
df.columns.values[2] = "C"

# Rename Column Names using rename() method
df2 = df.rename({'a': 'A', 'b': 'B'}, axis=1)
df2 = df.rename({'a': 'A', 'b': 'B'}, axis='columns')
df2 = df.rename(columns={'a': 'A', 'b': 'B'})

# Rename columns inplace (self DataFrame)
df.rename(columns={'a': 'A', 'b': 'B'}, inplace = True)

# Rename using lambda function
df.rename(columns=lambda x: x[1:], inplace=True)

# Rename with errors='raise'. When column 'x' is not present, it throws an error.
df.rename(columns = {'x':'X'}, errors = "raise")

— Drop DataFrame Rows and Columns

drop() – drop method is used to drop rows and columns

Below are some examples. To understand better, go through drop rows from pandas DataFrame with examples; learning to drop rows isn’t complete without knowing how to drop rows by condition.


df = pd.DataFrame(technologies, index=row_labels)

# Drop rows by labels
df1 = df.drop(['r1','r2'])

# Delete Rows by position
df1=df.drop(df.index[[1,3]])

# Delete Rows by Index Range
df1=df.drop(df.index[2:])

# When rows have the default numeric index
df1 = df.drop(0)
df1 = df.drop([0, 3])
df1 = df.drop(range(0,2))

# Filter rows by condition (keeps rows where Discount >= 1500, dropping the rest)
df1 = df.loc[df["Discount"] >= 1500]

# DataFrame slicing
df2=df[4:]     # Returns rows from 4th row
df2=df[1:-1]   # Removes first and last row
df2=df[2:4]    # Return rows between 2 and 4

Now let’s see how to drop columns from a pandas DataFrame with examples. In order to drop columns, you have to use either axis=1 or the columns param of the drop() method.


df = pd.DataFrame(technologies, index=row_labels)

# Delete Column by Name
df2=df.drop(["Fee"], axis = 1)

# Drop by using labels & axis
df2=df.drop(labels=["Fee"], axis = 1)

# Drop by using columns
df2=df.drop(columns=["Fee"])

# Drop column by index
df2=df.drop(df.columns[[1]], axis = 1)

# Other ways to drop columns
df.drop(df.loc[:, 'Courses':'Fee'].columns, axis=1, inplace=True)
df.drop(df.columns[1:2], axis=1, inplace=True)

If you wanted to drop duplicate rows from pandas DataFrame use DataFrame.drop_duplicates()
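A brief sketch of DataFrame.drop_duplicates() with illustrative data: by default it keeps the first occurrence of each duplicate row, and the subset/keep parameters control which columns are compared and which occurrence survives.

```python
import pandas as pd

df = pd.DataFrame({'Courses': ['Spark', 'Spark', 'Python'],
                   'Fee': [20000, 20000, 22000]})

# Keep the first occurrence of each fully duplicated row (default)
df2 = df.drop_duplicates()

# Compare only the 'Courses' column and keep the last occurrence
df3 = df.drop_duplicates(subset=['Courses'], keep='last')
print(len(df2), len(df3))  # 2 2
```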

Pandas Join, Merge, Concat to Combine DataFrames

In this section of the python pandas tutorial, I will cover how to combine DataFrames using the join(), merge(), and concat() methods. All these methods support the join types below and work similarly to SQL joins.

Join types, the methods that support them, and what they do:

  • inner – join(), merge() and concat() – Performs an inner join on pandas DataFrames.
  • left – join(), merge() – Performs a left join on pandas DataFrames.
  • right – join(), merge() – Performs a right join on pandas DataFrames.
  • outer – join(), merge() and concat() – Performs an outer join on pandas DataFrames.
  • cross – merge() – Performs a cross join on pandas DataFrames.

Both pandas.merge() and DataFrame.merge() operate similarly, allowing the merging of two or more DataFrames. When performing a join based on columns, they disregard indexes. However, when joining on the index, the resulting DataFrame retains the indexes from the source DataFrames. In cases where no parameters are specified, the default behavior is to perform the join on all common columns.


# Quick Examples of pandas merge DataFrames
# pandas.merge()
df3=pd.merge(df1,df2)

# DataFrame.merge()
df3=df1.merge(df2)

# Merge by column
df3=pd.merge(df1,df2, on='Courses')

# Merge by specifying left and right key columns
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')

# Merge by Index
df3 = pd.merge(df1,df2,left_index=True,right_index=True)

# Merge by multiple columns
df3 = pd.merge(df3, df1, how='left', left_on=['col1','col2'], right_on=['col1','col2'])

# Merge by left join
df3=pd.merge(df1,df2, on='Courses', how='left')

# Merge by right join
df3=pd.merge(df1,df2, on='Courses', how='right')

# Merge by outer join
df3=pd.merge(df1,df2, on='Courses', how='outer')

Alternatively use join() for joining on the index. pandas.DataFrame.join() method is the most efficient way to join two pandas DataFrames on row index.


# pandas default join
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")

# pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')

# pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')

# pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')

# pandas join on columns
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')

Similarly, pandas also supports concatenating two DataFrames using the concat() method.
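A short sketch of pd.concat() with illustrative data: by default it stacks DataFrames vertically (row-wise), while axis=1 aligns them side by side on the index.

```python
import pandas as pd

df1 = pd.DataFrame({'Courses': ['Spark'], 'Fee': [20000]})
df2 = pd.DataFrame({'Courses': ['Python'], 'Fee': [22000]})

# Stack rows; ignore_index rebuilds a clean 0..n-1 index
df3 = pd.concat([df1, df2], ignore_index=True)

# axis=1 concatenates side by side, aligning on the index
df4 = pd.concat([df1, df2], axis=1)
print(df3.shape, df4.shape)  # (2, 2) (1, 4)
```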

Iterate over Rows to perform an operation

Pandas DataFrame offers two methods, iterrows() and itertuples(), for iterating over each row. With iterrows(), you receive a tuple containing the index of the row and a Series representing its data. Conversely, itertuples() returns all DataFrame elements as an iterator, with each row represented as a tuple. Notably, itertuples() is quicker than iterrows() and maintains data types intact.


df = pd.DataFrame(technologies, index=row_labels)

# Iterate all rows using DataFrame.iterrows()
for index, row in df.iterrows():
    print (index,row["Fee"], row["Courses"])

# Iterate all rows using DataFrame.itertuples()
for row in df.itertuples(index = True):
    print (getattr(row,'Index'),getattr(row, "Fee"), getattr(row, "Courses"))

# Using DataFrame.index
for idx in df.index:
     print(df['Fee'][idx], df['Courses'][idx])

Working with Null, np.NaN & Empty Values

Below are some of the articles that I have covered to handle None/NaN values in pandas DataFrame. It is very important to handle missing data in Pandas before you perform any analytics or run with machine learning algorithms.

  • How to replace None & NaN values with Blank or Empty String in pandas DataFrame
  • How to replace None & NaN values with zero (0) in pandas DataFrame
  • Check If any Value is NaN in pandas DataFrame
  • Drop Rows with NaN Values in pandas DataFrame
  • Drop Columns with NaN Values in pandas DataFrame
  • Drop Infinite Values From DataFrame
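The most common of these operations can be sketched with fillna() and dropna() (illustrative data; the per-column fill values are my own choice):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Courses': ['Spark', None, 'Python'],
                   'Fee': [20000, np.nan, 22000]})

# Replace missing values per column: empty string for text, 0 for numbers
df1 = df.fillna({'Courses': '', 'Fee': 0})

# Or drop any row that contains a missing value
df2 = df.dropna()

print(df.isna().any().any(), len(df2))  # True 2
```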

Column Manipulations

One of the most used ways to manipulate data is by applying the pandas apply() function to DataFrame columns. If you are familiar with lambda expressions, you can also use them with apply().

If you’re new to the concept of lambda functions, they’re essentially concise, anonymous functions in Python capable of handling any number of arguments and executing expressions. These expressions are particularly handy for creating functions on-the-fly without the need for formal definition using the lambda keyword.
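A minimal sketch of apply() with a lambda on a column (illustrative data; the 10% discount is just an example):

```python
import pandas as pd

df = pd.DataFrame({'Fee': [20000, 25000]})

# Apply a lambda to every value of the 'Fee' column
df['Fee_After_Discount'] = df['Fee'].apply(lambda x: x * 0.9)
print(df['Fee_After_Discount'].tolist())  # [18000.0, 22500.0]
```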

DataFrame also provides several methods to manipulate data on columns.

  • pandas Concatenate Two Columns

Pandas Read & Write Excel

Use the pandas DataFrame.to_excel() function to write a DataFrame to an Excel sheet with the .xlsx extension, and use the pandas.read_excel() function to read an Excel sheet into a pandas DataFrame.

Read excel sheet Example


# Read Excel file
df = pd.read_excel('c:/apps/courses_schedule.xlsx')
print(df)

Write DataFrame to excel sheet


# Write DataFrame to Excel file
df.to_excel('Courses.xlsx')
]]>
Pandas Introduction https://shishirkant.com/pandas-introduction/?utm_source=rss&utm_medium=rss&utm_campaign=pandas-introduction Mon, 27 Jan 2025 14:41:55 +0000 https://shishirkant.com/?p=4324 This is a beginner’s guide of Python Pandas DataFrame Tutorial where you will learn what is DataFrame? its features, its advantages, and how to use DataFrame with sample examples.

Every sample example explained in this tutorial is tested in our development environment and is available for reference.

All pandas DataFrame examples provided in this tutorial are basic, simple, and easy to practice for beginners who are enthusiastic to learn about Pandas and advance their careers in Data Science, Analytics, and Machine Learning.

Note: In case you can’t find the pandas DataFrame examples you are looking for on this tutorial page, I would recommend using the Search option from the menu bar to find your example code, there are hundreds of tutorials in pandas on this website you can learn from.

2. What is Python Pandas?

Pandas is the most popular open-source library in the Python programming language and pandas is widely used for data science/data analysis and machine learning applications. It is built on top of another popular package named Numpy, which provides scientific computing in Python and supports multi-dimensional arrays. It is developed by Wes McKinney, check his GitHub for other projects he is working on.

Following are the main data structures supported by Pandas.

  • pandas Series
  • pandas DataFrame
  • pandas Index

2.1 What is Pandas Series

In simple words Pandas Series is a one-dimensional labeled array that holds any data type (integers, strings, floating-point numbers, None, Python objects, etc.). The axis labels are collectively referred to as the index. The later section of this pandas tutorial covers more on the Series with examples.

2.2 What is Pandas DataFrame

Pandas DataFrame is a 2-dimensional labeled data structure with rows and columns (columns of potentially different types like integers, strings, float, None, Python objects e.t.c). You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. The later section of this pandas tutorial covers more on DataFrame with examples.

3. Pandas Advantages

4. Pandas vs PySpark

In very simple words, Pandas runs operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a Machine Learning application dealing with larger datasets, PySpark is the best fit, as it can process operations many times (100x) faster than Pandas.

PySpark is also widely used in the Data Science and Machine Learning community, as there are many widely used data science libraries written in Python, including NumPy and TensorFlow. PySpark is also favored for its efficient processing of large datasets. It has been used by many organizations like Walmart, Trivago, Sanofi, Runtastic, and many more.

PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities. Using PySpark we can run applications parallelly on the distributed cluster (multiple nodes) or even on a single node.

Apache Spark is an analytical processing engine for large scale powerful distributed data processing and machine learning applications.

Spark was originally written in Scala, and later, due to its industry adoption, its Python API, PySpark, was released using Py4J. Py4J is a Java library integrated within PySpark that allows Python to dynamically interface with JVM objects; hence, to run PySpark you also need Java installed along with Python and Apache Spark.

Additionally, For the development, you can use Anaconda distribution (widely used in the Machine Learning community) which comes with a lot of useful tools like Spyder IDE, Jupyter notebook to run PySpark applications.

4.1 How to Decide Between Pandas vs PySpark

Below are a few considerations when choosing PySpark over Pandas.

  • If your data is huge and grows significantly over the years and you want to improve processing time.
  • If you want fault tolerance.
  • If you need ANSI SQL compatibility.
  • Language to choose (Spark supports Python, Scala, Java & R).
  • When you want machine-learning capability.
  • When you would like to read Parquet, Avro, Hive, Cassandra, Snowflake, etc.
  • If you want to stream data and process it in real time.

5. Installing Pandas

In this section of the python pandas tutorial, let’s see how to install & upgrade pandas. In order to run pandas, you should have Python installed first. You can install Python either by downloading it directly from python.org or by using the Anaconda distribution. Depending on your need, follow the below links to install Python, Anaconda, and Jupyter Notebook to run the pandas examples. I would recommend installing Anaconda with Jupyter if you intend to learn pandas for data science, analytics & machine learning.

  • Step-by-Step Instruction of Install Anaconda & Pandas
  • Run pandas from Anaconda & Jupyter Notebook
  • Install Python & Run pandas from Windows

Once you have either Python or Anaconda setup, you can install pandas on top of Python or Anaconda in simple steps.

5.1 Install Pandas using Python pip Command

pip (Python package manager) is used to install third-party packages from PyPI. Using pip you can install/uninstall/upgrade/downgrade any python library that is part of Python Package Index.

Since the Pandas package is available in PyPI (the Python Package Index), we should use it to install the latest version of Pandas on Windows.


# Install pandas using pip
pip install pandas
(or)
pip3 install pandas

If your pip is not up to date, then upgrade pip to the latest version.

5.2 Install Pandas using Anaconda conda Command

Anaconda distribution comes with a conda tool that is used to install/upgrade/downgrade most of the python and other packages.


# Install pandas using conda
conda install pandas

6. Upgrade Pandas to Latest or Specific Version

In order to upgrade pandas to the latest or a specific version, you can use either the pip install command or conda install if you are using the Anaconda distribution. Before you start the upgrade, you can check the currently installed version of pandas with pd.__version__.

Below are statements to upgrade pandas. Depending on how you wanted to update, use either pip or conda statements.


# Using pip to upgrade pandas
pip install --upgrade pandas

# Alternatively you can also try
python -m pip install --upgrade pandas

# Upgrade pandas to specific version
pip install pandas==specific-higher-version

# Use conda update
conda update pandas

#Upgrade to specific version
conda update pandas==0.14.0


7. Run Pandas Hello World Example

7.1 Run Pandas From Command Line

If you installed Anaconda, open the Anaconda command line or open the python shell/command prompt and enter the following lines to get the version of pandas, to learn more follow the links from the left-hand side of the pandas tutorial.


>>> import pandas as pd
>>> pd.__version__
'1.3.2'
>>>

7.2 Run Pandas From Jupyter

Go to Anaconda Navigator -> Environments -> your environment (I have created pandas-tutorial) -> select Open With Jupyter Notebook.

This opens up Jupyter Notebook in the default browser.

Now select New -> PythonX, enter the lines from the previous section, and select Run.

7.3 Run Pandas from IDE

You can also run pandas from any Python IDE, such as Spyder, PyCharm, etc.

8. Pandas Series Introduction

A pandas Series is a one-dimensional array that can accommodate diverse data types, including integers, strings, floats, Python objects, and more. Utilizing the Series() constructor, we can convert lists, tuples, and dictionaries into Series objects. Within a pandas Series, the row labels are referred to as the index. It’s important to note that a Series can only consist of a single column and cannot hold multiple columns simultaneously. Lists, NumPy arrays, and dictionaries can all be transformed into pandas Series.

8.1 pandas.Series() Constructor

Below is the syntax of the pandas Series constructor, which is used to create a Series object.


# Pandas Series Constructor Syntax
pandas.Series(data, index, dtype, copy)

  • data: the input data, such as an ndarray, list, or constants.
  • index: index labels must be unique and hashable; defaults to np.arange(n) if no index is passed.
  • dtype: the data type of the Series values.
  • copy: whether to copy the input data.

8.2 Create Pandas Series

pandas Series can be created in multiple ways, From array, list, dict, and from existing DataFrame.

8.2.1 Creating Series from NumPy Array


# Create Series from array
import pandas as pd 
import numpy as np
data = np.array(['python','php','java'])
series = pd.Series(data)
print (series)

8.2.2 Creating Series from Dict


# Create a Dict from a input
data = {'Courses' :"pandas", 'Fees' : 20000, 'Duration' : "30days"}
s2 = pd.Series(data)
print (s2)

8.2.3 Creating Series from List


# Creating Series from List
data = ['python','php','java']
s2 = pd.Series(data, index=['r1', 'r2','r3'])
print(s2)
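The section above also mentions creating a Series from an existing DataFrame. As a minimal sketch (the column names here are illustrative), selecting a single column of a DataFrame returns a Series:

```python
import pandas as pd

# A sample DataFrame with illustrative columns
df = pd.DataFrame({'Courses': ['python', 'php', 'java'],
                   'Fees': [20000, 15000, 18000]})

# Selecting one column returns a pandas Series
s = df['Courses']
print(type(s))
print(s)
```

The Series keeps the DataFrame’s row index, so labels carry over automatically.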

9. Pandas DataFrame

I have a dedicated tutorial for the pandas DataFrame, hence in this section I will briefly explain what a DataFrame is. A DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled axes (rows and columns). A pandas DataFrame consists of three principal components: the data, the rows, and the columns.

9.1 DataFrame Features

  • DataFrames support named rows and columns (you can provide labels for rows as well).
  • A pandas DataFrame is size-mutable.
  • Supports heterogeneous collections of data.
  • Has labeled axes (rows and columns).
  • Can perform arithmetic operations on rows and columns.
  • Supports reading flat files such as CSV, Excel, and JSON, as well as reading SQL tables.
  • Handles missing data.
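The features above can be sketched in a few lines (the column and row labels here are illustrative):

```python
import pandas as pd
import numpy as np

# Named rows (index) and named columns
df = pd.DataFrame(
    {'Courses': ['pandas', 'spark'], 'Fees': [20000.0, 25000.0]},
    index=['r1', 'r2'])

# Size-mutable: append a new row, including a missing value
df.loc['r3'] = ['java', np.nan]

# Arithmetic on a column
df['Discounted'] = df['Fees'] * 0.9

# Missing-data handling: count NaN values per column
print(df.isna().sum())
```

Note how the NaN in the new row propagates through the arithmetic, and how `isna()` makes missing values easy to find.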

10. Pandas Series vs DataFrame?

Here is a comparison between pandas Series and DataFrames.

| Feature        | Series                                         | DataFrame                                           |
|----------------|------------------------------------------------|-----------------------------------------------------|
| Dimensionality | One-dimensional                                | Two-dimensional                                     |
| Structure      | Labeled array                                  | Labeled data structure with rows and columns        |
| Components     | Consists of data and index                     | Consists of data, row index, and column index       |
| Data Types     | Homogeneous (same data type)                   | Heterogeneous (different data types per column)     |
| Creation       | From lists, arrays, dictionaries, or scalars   | From dictionaries, arrays, lists, or other DataFrames |
| Operations     | Supports indexing, slicing, arithmetic operations | Supports merging, joining, grouping, reshaping   |
| Use Cases      | Representing a single column of data or simple data structures | Tabular data with multiple columns and rows |
Control Statements in C https://shishirkant.com/control-statements-in-c/ Sat, 02 Sep 2023

Control Statements in C:

Control statements are statements that alter the flow of execution and give the programmer better control over the order in which statements run. They are useful for writing better and more complex programs. A program normally executes from top to bottom; with control statements, we can control the order of execution based on logic and values.

In C, control statements can be divided into the following three categories:

  • Selection or branching statements (e.g., if-else, nested if-else, if-else ladder, switch case)
  • Iteration or looping statements (e.g., while loop, do-while loop, for loop)
  • Jump statements (e.g., break, continue, return, goto)
Control Statements in C Language

C control statements let you write powerful programs by repeating important sections of a program and selecting between optional sections.

In the next article, I am going to discuss if-else selection statements in C with examples. Here, in this article, I explained what control statements in C are and their types. I hope you enjoyed this article on control statements in the C language; please post your feedback, questions, or comments about it.
