crosstab () in Python

crosstab computes aggregated metrics among two or more columns in a dataset that contains categorical values.

Import Modules

import pandas as pd
import seaborn as sns

Get Tips Dataset

Let’s get the tips dataset from the seaborn library and assign it to the DataFrame df_tips.

df_tips = sns.load_dataset('tips')

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

column namecolumn description
total_billfinancial amount of meal in U.S. dollars
tipfinancial amount of the meal’s tip in U.S. dollars
sexgender of server
smokerboolean to represent if server smokes or not
dayday of week
timemeal name (Lunch or Dinner)
sizecount of people eating meal

Preview the first 5 rows of df_tips.

df_tips.head()
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4

Implement Crosstabs with Tips Dataset

Grouping By a Single Field for the Index and Column

Let’s compute a simple crosstab across the day and sex column. We want our returned index to be the unique values from day and our returned columns to be the unique values from sex. By default in pandas, the crosstab() computes an aggregated metric of a count (aka frequency).

So, each of the values inside our table represent a count across the index and column. For example, males served 30 unique groups across all Thursdays in our dataset.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
sexMaleFemale
day
Thur3032
Fri109
Sat5928
Sun5818

One issue with this crosstab output is the column names are nonsensical. Just saying Male or Female isn’t very specific. They should be renamed to be clearer. We can use the rename() method and set the argument columns to be a dictionary in which the keys are the current column names and the values are the respective new names to set.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
sexcount_meals_served_by_malescount_meals_served_by_females
day
Thur3032
Fri109
Sat5928
Sun5818

Also, in the output above, see where it says sex as the column name and day for the row name? We can modify those names using arguments in the crosstab() method. Let’s set the colnames argument to gender since that’s a more specific name than sex.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
gendercount_meals_served_by_malescount_meals_served_by_females
day
Thur3032
Fri109
Sat5928
Sun5818

In this example, we passed in two columns from our DataFrame. However, one nice feature of crosstab() is that you don’t need the data to be in a DataFrame. For the index and columns arguments, you can pass in two numpy arrays.

Let’s double check the logic from above makes sense. Let’s use filtering in pandas to verify that there were 30 meals served by a male on Thursday. Our query below matches the 30 number we see above.

len(df_tips.query("sex=='Male' and day=='Thur'"))
30

Alternatively, given the crosstab output above, you can present it in a different format that may be easier for further analysis. I won’t dive into details of this operation, but in addition to the code above, you can chain the unstack() method and then the reset_index() method to pivot the DataFrame so each row is a unique combination of a value from sex and day with the appropriate count of meals served.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})
genderdaycount_meals_served
0MaleThur30
1MaleFri10
2MaleSat59
3MaleSun58
4FemaleThur32
5FemaleFri9
6FemaleSat28
7FemaleSun18

For each row and column of this previous crosstab, we can modify an argument to get the totals. Set the argument margins to True to get these totals. By default, the returned output will have a column and row name of All.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
gendercount_meals_served_by_malescount_meals_served_by_femalesAll
day
Thur303262
Fri10919
Sat592887
Sun581876
All15787244

In the crosstab() method, we can also rename the All column. First, use all the same arguments from above. Then, set the argument margins_name to count_meals_served .

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="count_meals_served").rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
gendercount_meals_served_by_malescount_meals_served_by_femalescount_meals_served
day
Thur303262
Fri10919
Sat592887
Sun581876
count_meals_served15787244

For each cell value, we can calculate what percentage it is of the row’s total. To do that, set the normalize argument to 'index' (since index applies to each row).

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="proportion_meals_served", normalize='index').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})
genderproportion_meals_served_by_malesproportion_meals_served_by_females
day
Thur0.4838710.516129
Fri0.5263160.473684
Sat0.6781610.321839
Sun0.7631580.236842
proportion_meals_served0.6434430.356557

For each cell value, we can also calculate what percentage it is of the column’s total. To do that, set the normalize argument to 'columns'.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="proportion_meals_served", normalize='columns').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})
genderproportion_meals_served_by_malesproportion_meals_served_by_femalesproportion_meals_served
day
Thur0.1910830.3678160.254098
Fri0.0636940.1034480.077869
Sat0.3757960.3218390.356557
Sun0.3694270.2068970.311475

Given two categorical columns, the crosstab() method can additionally utilize a column with numerical values to perform an aggregate operation. That sentence may sound daunting – so let’s walk through it with a simple example.

We know there exists total_bill values in our datasets for males that served meals on Thursday. Below, I preview the first few total_bill values that meet this criteria.

df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].head()
77    27.20
78    22.76
79    17.29
80    19.44
81    16.66
Name: total_bill, dtype: float64

We may want to know the average bill size that meet the criteria above. So, given that series, we can calculate the mean and we arrive at a result of 18.71.

df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].mean()
18.714666666666666

Now, we can perform this same operation using the crosstab() method. Same as before, we want our returned index to be the unique values from day and our returned columns to be the unique values from sex. Additionally, we want the values inside the table to be from our total_bill column so we’ll set the argument values to be df_tips['total_bill']. We also want to calculate the mean total bill for each combination of a day and gender so we’ll set the aggfunc argument to 'mean'.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], values=df_tips['total_bill'], colnames=['gender'], aggfunc='mean').rename(columns={"Male": "mean_bill_size_meals_served_by_males", "Female": "mean_bill_size_meals_served_by_females"})
gendermean_bill_size_meals_served_by_malesmean_bill_size_meals_served_by_females
day
Thur18.71466716.715312
Fri19.85700014.145556
Sat20.80254219.680357
Sun21.88724119.872222

This crosstab calculation outputted the same 18.71 value as expected!

We can pass in many other aggregate methods to the aggfunc method too such as mean and standard deviation.

You can learn more about details of using crosstab() from the official pandas documentation page.

Grouping By Multiple Fields for the Index and/or Columns

We can also use the crosstabs() method to group by multiple pandas columns for the index or columns arguments.

For example, we can find out for days of the week, for each gender, what was the count of meals they served for lunch or dinner. In pandas We want our returned index to be the unique values from day and our returned columns to be the unique values from all combinations of the sex (gender) and time (meal time) columns.

To interpret a value from our table below, across all Thursdays in our dataset, females served 31 lunch meals.

pd.crosstab(index=df_tips['day'], columns=[df_tips['sex'], df_tips['time']], colnames=['gender', 'meal']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"})
genderMaleFemale
mealcount_lunch_meals_servedcount_dinner_meals_servedcount_lunch_meals_servedcount_dinner_meals_served
day
Thur300311
Fri3745
Sat059028
Sun058018

We can verify the logic in a value with a separate query. I’m curious to see the count of meals on Thursday served by females during lunch time. Our result on this query is 31 which matches the value in the crosstab above!

len(df_tips.query("day=='Thur' and sex=='Female' and time=='Lunch'"))
31

If you’re familiar with the groupby() method in pandas, the code below performs a similar operation to the crosstab above, but the output format is a little different.

df_tips.groupby(['day', 'sex'])['time'].count()
day   sex   
Thur  Male      30
      Female    32
Fri   Male      10
      Female     9
Sat   Male      59
      Female    28
Sun   Male      58
      Female    18
Name: time, dtype: int64

We can also group by multiple pandas columns in the index argument of the crosstabs() method.

Below is a fairly complex operation. The code below helps us find out for every combination of day of the week, gender and meal type what was the count of meals served for the various group sizes (ex – party of two people versus party of three).

In our output, the index includes unique values from the day and sex fields while columns include unique values from the time and size fields.

To interpret a value from our table below, across all Thursdays in our dataset, males 24 lunch meals that contained a group of 2 people.

pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"})
mealcount_lunch_meals_servedcount_dinner_meals_served
party_people_size123456123456
daysex
ThurMale0242211000000
Female1232302010000
FriMale120000060100
Female031000050000
SatMale000000034131110
Female0000002195200
SunMale00000003291421
Female000000076410

We can verify 24 lunch meals for a party of 2 were served by males on Thursdays with the query below too.

len(df_tips.query("day=='Thur' and sex=='Male' and time=='Lunch' and size==2"))
24

Style Output

While the above output is daunting to read with all the cells, there’s a cool trick in Seaborn to make your DataFrame output appear as a heatmap. I added some code based off the example from the pandas documention for built-in styles. Now, we can more easily identify the large values as being dark orange colors in our visualization below.

orange = sns.light_palette("orange", as_cmap=True)
pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"}).style.background_gradient(cmap=orange)
mealcount_lunch_meals_servedcount_dinner_meals_served
party_people_size123456123456
daysex
ThurMale0242211000000
Female1232302010000
FriMale120000060100
Female031000050000
SatMale000000034131110
Female0000002195200
SunMale00000003291421
Female000000076410
Follow Us On