Pivot vs Pivot_Table vs GroupBy

6 min readApr 17, 2021

Understanding differences in Pandas pivot_table, pivot and groupby.

The Pandas DataFrame.pivot_table and DataFrame.pivot functions allow one to look at data in different ways. In Data Science lingo, this is called reshaping or transforming a data set in order to glean information. Initially, these functions seem similar and behave the same way but deeper understanding will alleviate some frustration. Each function acts on a data frame by taking the initial data frame and your input of the index, columns, and values you want to see.

Basically, they filter and flip-flop the data. Lets look at an example:

Data

We have a small dataset of 5 students.

table.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   gender  5 non-null      object
 1   sport   5 non-null      object
 2   age     5 non-null      int64 
 3   height  5 non-null      int64 
 4   weight  5 non-null      int64 
dtypes: int64(3), object(2)
memory usage: 328.0+ bytes
table

Pivot Table

Lets pass the “gender” column to the index param(resulting left column),

pass the columns param equal to “sport”,

and our values param(data in the resulting cells) will be the “age”, “height” and “weight”.

I have explicitly entered the aggfunc to ‘mean’ which is the default value, if not explicitly set.

We will save the result into the student_pivot_tab for ease of use.

Calling the pivot_table function on our initial data frame, table results:

student_pivot_tab = table.pivot_table( index='gender', columns='sport', values=['age','height','weight'], aggfunc='mean')
student_pivot_tab

The gender values are now the left most column as we directed.

The values from the previous ‘sport’ column are the column headers now.

The values in the cells are the mean values of the age, height and weight columns.

Note: ’NaN’s are inserted where no values exist. For example, there are no female baseball players in this data set.

Pretty cool! We now see just 2 rows of data, which are aggregated for the data.

Pivot

Ok lets do the same thing with the Pivot function.

student_pivot = table.pivot(index=’gender’, columns=’sport’, values=[‘age’,’height’,’weight’])
student_pivot

This results in an Error:

ValueError: Index contains duplicate entries, cannot reshape

What does this mean? Hold that thought. Uno momento.

Lets try the same command without the index argument.

student_pivot = table.pivot( columns=’sport’, values=[‘age’,’height’,’weight’])
student_pivot

Without the index explicitly set we get these results:

Data Frame for student pivot — student_pivot results

This results in the columns of the ‘sport’ but not the columns of the ‘gender’. Notice the data frame has the index of our initial data frame(0…4), the one which is there by default. Also there are ‘Nan’s entered for the values which don’t exist as we transposed the rows into columns.

Back to the error:

ValueError: Index contains duplicate entries, cannot reshape

When in pivot and the index is set to ‘gender’, the function tries to set the left key to ‘female’ and then match the column name of the differing sport(basketball) values. In this case there are two rows which have female and columns of basketball. The function doesn't know what value to put into cell values.

In the pivot_table function, we passed the aggfunc=’mean’ to decide this.

It’s similar to when you tell Fire TV to play insert_Fav_Series and the Fire TV asks would you like to play this on Prime TV or Netflix or Apple TV.

Back in the pivot_table function we have already told it to average the value.

The pivot function does not know duplicate keys so it throws the duplicate entries error!

GroupBy

Lets look at the same data with the groupby function. It returns a data frame as well, based off your input. It is just in a different shape.

student_groupby = table.groupby([‘gender’,’sport’]).agg(‘mean’)
student_groupby

Note: A different shape to view the data. Lets look at the info() of it.

student_groupby.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4 entries, ('female', 'basketball') to ('male', 'basketball')
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     4 non-null      float64
 1   height  4 non-null      float64
 2   weight  4 non-null      float64
dtypes: float64(3)
memory usage: 279.0+ bytes

Note: The MultiIndex of 4 entries based off the index and columns we passed, a combination of the two gender values and the two sport values.

What about the pivot dataset info?

Without the gender index (remember: with it we get ValueError due to dupe keys) we get:

student_pivot.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   (age, baseball)       1 non-null      float64
 1   (age, basketball)     3 non-null      float64
 2   (age, golf)           1 non-null      float64
 3   (height, baseball)    1 non-null      float64
 4   (height, basketball)  3 non-null      float64
 5   (height, golf)        1 non-null      float64
 6   (weight, baseball)    1 non-null      float64
 7   (weight, basketball)  3 non-null      float64
 8   (weight, golf)        1 non-null      float64
dtypes: float64(9)
memory usage: 400.0 bytes

What about the student_pivot_tab info?

student_pivot_tab.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, female to male
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   (age, baseball)       1 non-null      float64
 1   (age, basketball)     2 non-null      float64
 2   (age, golf)           1 non-null      float64
 3   (height, baseball)    1 non-null      float64
 4   (height, basketball)  2 non-null      float64
 5   (height, golf)        1 non-null      float64
 6   (weight, baseball)    1 non-null      float64
 7   (weight, basketball)  2 non-null      float64
 8   (weight, golf)        1 non-null      float64
dtypes: float64(9)
memory usage: 160.0+ bytes

Note: The index is based off the two values for gender which we passed upon creation of the pivot_table.

Additionally, for example lets change the second female basketball player to be an archer.

The resulting data frames of the pivot_table call and pivot are exactly the same! There is no ValueError: thrown.

pivot_table and pivot return same results without dupe error

Lets compare:

if student_pivot.equals(student_pivot_tab):
    print(‘What What??!! I'm equal’)What What??!! I'm equal

Summary

Its all about the Indexes and the shape of the returning data frame.
Is the returning data frame MultiIndexed or not? One should take this into account when visualizing with Matplotlib or Seaborn and exploring data. Compare the basic bar plots for each as an exercise.
Pivot and pivot_table may only exhibit the same functionality if the data allows. If there are duplicate entries possible from the index(es) of interest you will need to aggregate the data in pivot_table, not pivot(due to duplicate error).
Groupby allows you to group by in a similar way and also chain an aggregate function.

Pivot vs Pivot_Table vs GroupBy

Written by Daniel M. Smith