pandas: inplace, view, and copy

In recent pandas versions, I sometimes get FutureWarning: Downcasting behavior in `replace` is deprecated warnings when I use inplace=True, e.g. df.x.replace({1: True, 0: False}, inplace=True). And I never understand the SettingWithCopyWarning either. I thought inplace operations should be preferred, as they are faster and they save memory, because they do things inplace rather than copying the object. After some readings, I realise I was wrong. The takeaways are

copy is generally safer,
a copy is almost always created, no matter whether I do it inplace or copy it, and
the speed difference between inplace and copy is minimal.

View vs copy: the notorious `SettingWithCopyWarning` and how to be safe

Inplace, copy, view, reference, … What are they exactly? Look at the code below.

>>> a = [0, 1, 2]  # Create a brand new object `a` in memory
>>> b = a          # `b` gets `a`, but it "points" to `a`, instead of copy
>>> print(a)
[0, 1, 2]
>>> print(b)
[0, 1, 2]
>>> print(id(a) == id(b))  # `a` and `b` has the same memory address
True
>>> b[1] = 'xxx'           # Therefore, change `b` will also change `a`
>>> print(b)
[0, 'xxx', 2]
>>> print(a)
[0, 'xxx', 2]

It’s usually very clear in native Python when we get a copy and when we get a reference. In C terminology, b kind of works like a pointer which points to a and automatically dereferences itself when you use it.¹ However, the same thing is a bit confusing in pandas. Again, check the code below.

>>> import pandas as pd
>>> df = pd.DataFrame({'col1': [0, 1, 2, 3],
                       'col2': [False, False, True, True]})
>>> df2 = df[df.col2 == True]  # A subset of `df`. A copy? A view/reference?
>>> print(df)
   col1   col2
0     0  False
1     1  False
2     2   True
3     3   True
>>> print(df2)
   col1  col2
2     2  True
3     3  True
>>> df2.col1 = 999              # Get a warning here
<python-input-71>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2.col1 = 999
>>> df2  # `df2` changed as expected
   col1  col2
2   999  True
3   999  True
>>> df  # `df` untouched. Is it expected or not?
   col1   col2
0     0  False
1     1  False
2     2   True
3     3   True
>>> df[df.col2 == True].col1 = 999  # I want to change `df` inplace
<python-input-75>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df.col2 == True].col1 = 999
>>> df                              # But why it is not changed?
   col1   col2
0     0  False
1     1  False
2     2   True
3     3   True

The above is a very typical workflow for a data project: we have a raw input df, and we work on a subset of it, which is df2. The problem is, is the subset df2 a “pointer”/view pointing to df, or is it a copy, an entirely new object? “View” is a pandas terminology. It is more or less the same as a “pointer”. Basically a view views the memory of the original object. In the above example, df2 is a copy, but why do we still get the SettingWithCopyWarning? Then if we want to change certain values in df inplace, why nothing happens with the same warning?

The short answer is, we don’t really know when it’s a copy or a view. I know, it’s a deterministic code so must return a deterministic thing, but it’s just so confusing. Therefore, whenever returning a copy and view is confusing, pandas gives a warning, when things work as expected, and when things fail.²

pandas’ current behavior on whether indexing returns a view or copy is confusing. Even for experienced users, it’s hard to tell whether a view or copy will be returned[.]

—PDEP-7: Consistent copy/view semantics in pandas with Copy-on-Write

Currently, if we want to get rid of SettingWithCopyWarning and be safe, using df2 = xxxxx.copy() explicitly will do the job. That is, explicitly copy the dataframe when you don’t want to change the original.

No sorry, inplace does not save memory

Now we know what a copy is and what a view is, and we know it’s confusing so we frequently get SettingWithCopyWarning whenever there is ambiguity. But how is that related to inplace operations? I (wrongly) learnt from many many places and people that inplace saves memory, because it does not create a copy. It “sounds” like inplace directly works on the original object (or its view), without copying it. This is generally wrong! Regardless of inplace or not, we always need additional memory. It’s very clear in the source code that inplace=True or inplace=False only affects whether we return None or return a copy, but during the calculation, the copy is always created!

To see the actual memory usage, consider the following example: we create a dataframe with one single column and 100 million rows, filling with all one’s (int64). Then we replace all one’s with 100’s, and then sort the dataframe. The dataframe roughly takes $\frac{10^8 * 64\text{ bit}}{1024\times 1024} = 762$ MB. pythonprofilers/memory_profiler confirms that df does take 763MB RAM. Then the inplace replace uses additional 191.062 MB memory, and the sort uses 763.141 more memory. Note that the 763.141 MB is basically the same size of the original dataframe. So inplace actually copies the dataframe!

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
 91.312 MiB   91.312 MiB           1   @profile(precision=3)
                                       def foo():
854.438 MiB  763.125 MiB           1       df = pd.DataFrame(1, index=range(100_000_000), columns=['a'])
1045.500 MiB  191.062 MiB           1       df.a.replace({1: 100}, inplace=True)
1808.641 MiB  763.141 MiB           1       df.sort_values('a', inplace=True)
                                       
1808.641 MiB    0.000 MiB           1       return df

The not inplace version uses almost identical 191.047 MB for replace and 763.141 MB for sort.

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
 90.078 MiB   90.078 MiB           1   @profile(precision=3)
                                       def foo():
853.203 MiB  763.125 MiB           1       df = pd.DataFrame(1, index=range(100_000_000), columns=['a'])
1044.250 MiB  191.047 MiB           1       df.a = df.a.replace({1: 100}, inplace=False)
1807.391 MiB  763.141 MiB           1       df = df.sort_values('a', inplace=False)
                                       
1807.391 MiB    0.000 MiB           1       return df

Inplace does not generally save memory!³

no inplace creates a new copy then assigns the pointer

these are equivalent from a memory perspective

—Jeff Reback (Two, Sigma, pandas core developer) at pandas-dev/pandas#20639

inplace does not generally do anything inplace

—Jeff Reback (Two Sigma, pandas core developer) at pandas-dev/pandas#16529

Speedwise, all are also very close

I run a simple benchmark test between inplace and not inplace operations, i.e. df.sort_values(xx, inplace=True) vs df = df.sort_values(xx). The test dataset has three integer columns and ten million rows. 20% of the cells are missing. The figure below shows there is almost no speed difference between inplace and not inplace operations!

Notes: I normalise the time for inplace operations to unit variance, so all bars can fit into the same $y$ scale. The error bars show 95% confidence interval.

Many other people have done similar benchmark before, but the results are very mixed. E.g. this blog plost finds inplace operations are slower than not inplace ones across the board. But this post shares similar results with me that there are no significant speed difference. But I believe the big picture is the time complexity of both is very close,

Another popular argument for not using inplace=True is that it kills method chaining. Compare

df = df.sort_values('a')
df = df.fillna(0)
df = df.replace(3, 5)

df = df.sort_values('a').fillna(0).replace(3, 5)

If you use inplace, because inplace always returns None, there is not way to chain the methods, and you have to write three methods in three lines. But does method chaining really improve efficiency? Not really, as shown below.

Notes: The error bars show 95% confidence interval.

Unlike Spark or Polars, pandas does not do lazy execution (I believe). So chaining or not chaining doesn’t matter much. Since inplace doesn’t matter for every single operation, the chained and unchained operation should also perform similarly in terms of speed. This is expected.

Conclusion

Don’t use inplace=True any more. There is no memory gain, no speed gain, and involves some risks. Just copy the dataframe, as you are doing it anyways already.

I found the blog post and the PyCon 2015 presentation by Ned Batchelder, who is a CPython core developer, extremely helpful in explaining how these things work in Python. ↩
If you don’t have much time, read the blog post by Joris Van den Bossche, who is a pandas core developer. His blog post is so clear and concise. If you have a lot of time, read PDEP-7 and pandas doc on returning a view versus a copy and on Copy-on-Write (CoW). They are much longer but cover more details. The long discussion at pandas-dev/pandas#57734 is also worth reading. ↩
Many people have asked whether inplace actually saves memory. And the answer is mixed, which is why I got the wrong impression that inplace saves memory. For example, people say no: https://stackoverflow.com/a/47253732, and people say yes: https://stackoverflow.com/a/74751146. ↩

View vs copy: the notorious SettingWithCopyWarning and how to be safe

No sorry, inplace does not save memory

Speedwise, all are also very close

Conclusion

View vs copy: the notorious `SettingWithCopyWarning` and how to be safe