Iterate Over Rows in a Pandas DataFrame

Abhinav Shukla
4 min read · May 16, 2023

“How can I iterate over my DataFrame to do X?” This is a common question from users who are new to pandas and to the DataFrame concept. Someone who has not been introduced to vectorization will likely picture the solution to their problem as a loop over their data. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question.

The answer is: YOU DON’T

Through this article, my aim is to help new users understand that iteration is not necessarily the solution to every problem, that better, faster, and more idiomatic solutions often exist, and that it is worth investing time in exploring them. I’m not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not even needed. Iteration in pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with “iter” in its name for more than a few thousand rows, or you will have to get used to a lot of waiting.

Do you want to compute something? In that case, search for methods in this order:

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (for loop)
  4. DataFrame.apply()
  5. DataFrame.itertuples()
  6. DataFrame.iterrows()

iterrows() and itertuples() should be used only in very rare circumstances, such as generating row objects or namedtuples for sequential processing, which is really the only thing these functions are useful for. Of the two, itertuples() is reported to be much faster, by close to a factor of 100.

It’s actually a little more complicated than “don’t”.

df.iterrows() is the correct answer to the question (sorry, not sorry), but “vectorize your ops” is the better one. I will concede that there are circumstances where iteration cannot be avoided, for example some operations where the result depends on the value computed for the previous row.
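As an illustration of a row-dependent computation, here is a minimal sketch. The data and the rule (a running balance that is never allowed to drop below zero) are hypothetical and chosen only because each result depends on the previously computed one, which makes a vectorized form awkward to write.

```python
import pandas as pd

# Hypothetical data: a series of deposits (+) and withdrawals (-).
df = pd.DataFrame({"change": [5, -8, 3, -1, 4]})

# Each balance depends on the previously *computed* balance,
# so the loop carries state from one row to the next.
balance = 0
balances = []
for change in df["change"]:
    balance = max(0, balance + change)  # floor the running balance at zero
    balances.append(balance)

df["balance"] = balances
```

Note that the loop runs over a single column rather than over whole rows, which avoids building a row object on every step.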

Vectorization, Cython

A good number of basic operations and computations are “vectorized” by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations.
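A few of these operations are sketched here on a small, made-up DataFrame. The column names and values are placeholders; the point is only that each line works on whole columns at once, with no explicit loop.

```python
import pandas as pd

# Hypothetical example data; the column names are placeholders.
df = pd.DataFrame({
    "group": ["x", "y", "x", "y"],
    "A": [1, 2, 3, 4],
    "B": [10, 20, 30, 40],
})

# Arithmetic: adds the two columns element-wise, no Python-level loop.
total = df["A"] + df["B"]

# Comparison: produces a boolean Series in a single call.
is_large = df["B"] > 15

# Reduction: collapses a column to a single value.
a_sum = df["A"].sum()

# Groupby: per-group aggregation handled inside pandas.
per_group = df.groupby("group")["A"].sum()
```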

Vectorization in Python, as implemented by NumPy, can give you faster operations by using fast, low-level code to operate on bulk data. And Pandas builds on NumPy to provide similarly fast functionality. But vectorization isn’t a magic bullet that will solve all your problems:

  1. sometimes it will come at the cost of higher memory usage,
  2. sometimes the operation you need isn’t supported, and
  3. sometimes it’s just not relevant.

The other approach to faster code is to compile code ahead of time, using Cython, Rust, or C.

List Comprehension

At its most basic level, a list comprehension is a syntactic construct for creating a list from an existing list or any other sequence you can loop over. If no vectorized solution is available, if performance matters, or if you need an element-wise transformation of your data, a list comprehension is going to help you.
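A minimal sketch of an element-wise transformation with a list comprehension, assuming a toy DataFrame with columns A and B (the data and the label format are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# zip() walks the two columns in parallel, so no row objects are built.
# The comprehension returns a plain list, which pandas stores as a new column.
df["label"] = [f"{a}-{b}" for a, b in zip(df["A"], df["B"])]
```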

DataFrame.apply()

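DataFrame.apply() lets you pass a function and apply it along an axis of the DataFrame, that is, to each column or, with axis=1, to each row. Its counterpart Series.apply() applies the function to every single value of a pandas Series. It is a convenient tool when you need to transform or segregate data according to custom conditions, but it still calls your Python function once per value, row, or column.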
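The two forms can be sketched as follows, on a toy DataFrame whose columns A and B are assumptions made for the example:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Series.apply: the function receives one value at a time.
squared = df["A"].apply(lambda x: x ** 2)

# DataFrame.apply with axis=1: the function receives one row (as a Series) at a time.
row_sum = df.apply(lambda row: row["A"] + row["B"], axis=1)
```

Both results here could also be computed with vectorized expressions (df["A"] ** 2 and df["A"] + df["B"]), which is usually the better choice when one exists.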

DataFrame.iterrows() & DataFrame.itertuples()

The usual iterrows() is convenient, but damn slow: reportedly almost 100 times slower than itertuples().

iterrows()
itertuples()
  1. As a general rule, use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 of them.
  2. Otherwise, use df.itertuples(), unless your column names contain special characters such as spaces or ‘-’.
  3. It is still possible to use itertuples() when your DataFrame has such column names: those fields are renamed positionally, so you can access them by position instead of by name.
  4. Only use iterrows() if you cannot use the previous solutions (it’s slow!).
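These rules can be sketched on a tiny DataFrame. The column name "my col" is a made-up example of a name that is not a valid Python identifier.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "my col": [10, 20]})

# name=None yields plain tuples of (index, A, "my col"), accessed by position.
plain = [row[1] + row[2] for row in df.itertuples(name=None)]

# The default itertuples() yields namedtuples; valid identifiers become attributes.
named = [row.A for row in df.itertuples()]

# Column names that are not valid identifiers are renamed positionally.
fields = next(df.itertuples())._fields

# iterrows() yields (index, Series) pairs; a Series is built for every row.
via_iterrows = [row["A"] + row["my col"] for _, row in df.iterrows()]
```

Here fields is ('Index', 'A', '_2'), so the awkward column is still reachable as row[2] or row._2.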

Let’s demonstrate the difference with a simple example of adding two pandas columns A+B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.

Here’s the code:
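A minimal sketch of such a comparison, assuming a randomly generated DataFrame of 1,000 rows with integer columns A and B. The helper function names, the data size, and the repeat count are choices made for this illustration, and the timings will vary with your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 1,000 rows of random integers in columns A and B.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.integers(0, 100, 1_000),
    "B": rng.integers(0, 100, 1_000),
})

def vectorized(df):
    return (df["A"] + df["B"]).tolist()

def list_comp(df):
    return [a + b for a, b in zip(df["A"], df["B"])]

def with_apply(df):
    return df.apply(lambda row: row["A"] + row["B"], axis=1).tolist()

def with_itertuples(df):
    return [row.A + row.B for row in df.itertuples()]

def with_iterrows(df):
    return [row["A"] + row["B"] for _, row in df.iterrows()]

methods = [vectorized, list_comp, with_apply, with_itertuples, with_iterrows]

# Compute each result once so the methods can be checked against each other.
results = {f.__name__: f(df) for f in methods}

# Time three runs of each method; f=f binds the current function to the lambda.
timings = {f.__name__: timeit.timeit(lambda f=f: f(df), number=3) for f in methods}

# Print the methods from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} {seconds:.4f} s")
```

All five functions return the same sums; only the time taken differs, so try it with data shaped like your own before drawing conclusions.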

Sometimes the answer to “what is the best method for an operation” is “it depends on your data”. My advice is to test out different approaches on your data before settling on one.


Abhinav Shukla

Just a curious speck, hopping from Neuroscience to Astrophysics, though currently landed at Data Science planet