The great upgrade
In October we were contacted by a customer who wanted to upgrade the backend of one of their services from Python 2 to Python 3. In this article I would like to report some of the most interesting aspects of the upgrade, focusing strictly on the Python part, along with some new things I learned. I hope it will be useful to anyone who finds themselves in the same situation!
Alessandra Morelli, Simone Cardis, Luca Consonni and I followed this project from the beginning to the end.
I thank Ferdinando Reviglio for reviewing this article.
The service
I'm not new to upgrading Python code, but this was the first time I had done an upgrade of this scale and complexity.
The service is a private cloud storage (Dropbox-like) whose backend serves a web frontend and client applications, nothing really special here. When the service was first developed, the developers decided not to use an existing framework (such as Django or Flask) but to write their own WSGI framework in Python. This added some complexity: the framework was 5 years old, and there was no standard documentation or standard way to upgrade it. The backend code required many modules: some could easily be found on PyPI and upgraded via pip (the Python package manager), some legacy ones were no longer maintained, and some had been developed internally by the company.
Tests were available, and this was of great help to us.
As a final note, they also had Docker scripts to run the service in a container.
Note on the approach
Starting from a clean slate today, I would use a framework like Django for this kind of project, which is much more reliable and maintainable than writing your own WSGI framework. Back when this service was developed, the programmers decided that Django was not a good fit at the time, so they chose to write their own framework.
The natural approach would be to combine the upgrade with a change of framework and have the new version of the service running in Python 3 on Django, for example. But we had a very short deadline, so we couldn't do it: upgrading the existing functionality on the current framework was the fastest solution.
How to make a time estimate
This is not easy, especially because you usually have a short deadline to deal with. Here is how we did it.
First, we checked whether all the dependencies were still maintained. This is important because if a dependency is not maintained you have to find a substitute and, if there isn't one, you could waste a lot of time working around the problem. Fortunately, we found only one unmaintained module: pycrypto, which had been replaced by pycryptodome, a module I knew implemented a compatible API.
Secondly, we checked the code. The codebase was huge, but we still took a (quick) look at all the files to spot critical points. This phase will not uncover the small things, but it can still be useful. In our case we found that one of the modules implemented internally by the company had some obfuscated source code. This must be taken into account because it slows down the process a lot.
Lastly, we checked the Docker code, since it also had to be taken into account, but I won't go deeper into this.
One thing we would have liked to do, and which I recommend doing, is to run all the tests to spot any further existing problems. We were not able to do this because we didn't have a testing environment ready (the backend service depends on an Oracle DB, a storage solution, an Identity & Access service and a ZooKeeper cluster).
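As a rough illustration of the dependency check, here is a minimal sketch that reads the trove classifiers of installed packages and reports whether each one declares Python 3 support. It is not what we ran on the project: the function names are hypothetical, it assumes the dependencies are installed in a Python 3.8+ environment, and classifiers are only a hint, so a package that passes may still be unmaintained.

```python
from importlib import metadata


def declares_python3(classifiers):
    """Return True if any trove classifier mentions Python 3."""
    prefix = "Programming Language :: Python :: 3"
    return any(c.startswith(prefix) for c in classifiers)


def python3_report(package_names):
    """Map each package name to True/False, or None if it is not installed.

    Hypothetical helper: it only inspects the metadata of installed packages.
    """
    report = {}
    for name in package_names:
        try:
            classifiers = metadata.metadata(name).get_all("Classifier") or []
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        report[name] = declares_python3(classifiers)
    return report
```

Calling python3_report(["pycryptodome", "pycrypto"]) in the target environment gives a quick first list of modules that deserve a closer look.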
How to proceed
A natural first approach would be to run the tests sequentially and fix the code wherever it crashes. Unfortunately, we weren't able to do that.
We decided to start by fixing all the syntax problems and the calls to methods that are obviously wrong in Python 3.
Python 3 does change some major things which are clearly visible when looking at the code, including:
- removing the unicode type and introducing the bytes type
- making heavy use of generators and other lazy iterators where Python 2 used lists
- requiring parentheses for print, which is now a function, and changing some built-in methods
Most of these changes were fixed more or less easily without running the tests.
The first change is a big deal for a WSGI application, since it handles HTTP requests and does a lot of checks and conversions between the unicode type and the string type, and these all had to be fixed. Also, many of the methods that used to accept a string now require a bytes object, so we had to do a great number of conversions. To convert bytes to a string use the bytes.decode() method; for the opposite direction use the str.encode() method.
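The two conversions can be wrapped in small helpers so that the same call works whether the value is already converted or not. This is only a sketch: the helper names are hypothetical and UTF-8 is an assumption, so use whatever encoding your application actually expects.

```python
def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding it only if it is a str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it only if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A non-ASCII name makes the difference between the two types visible.
raw = to_bytes("caffè")   # two bytes are needed for the accented letter
text = to_text(raw)       # back to the original str
```

Helpers like these are convenient at the boundaries of the application (request parsing, storage calls), while the rest of the code keeps working with str.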
The second point is a little tricky: many methods that returned list objects in Python 2 return generator objects (or other lazy iterators) in Python 3. This affects all the code that expects a list as input, for example a call to len(), which does not work on the result of map() or filter(). An IDE will help you spot these cases and highlight them for you. But sometimes it is more difficult, as when a method that returns a generator is used in another file, where this information is lost. We fixed this mostly by replacing the generator creation with a list comprehension or, when that was not possible, by passing the generator to the list constructor.
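Here is a small sketch of the problem and of the two fixes. The function and file names are made up for the example and do not come from the project.

```python
def visible_names(names):
    # filter() returned a list in Python 2; in Python 3 it returns a lazy iterator.
    return filter(lambda name: not name.startswith("."), names)


def visible_names_fixed(names):
    # Fix 1: a list comprehension restores the list that callers expect.
    return [name for name in names if not name.startswith(".")]


files = ["report.pdf", ".cache", "photo.jpg"]

try:
    len(visible_names(files))
    len_failed = False
except TypeError:
    # The iterator has no length, so code written for Python 2 breaks here.
    len_failed = True

count = len(visible_names_fixed(files))

# Fix 2: when the producer cannot be changed, wrap its result in list().
as_list = list(visible_names(files))
```

The list comprehension is the cleaner fix when you own the producing function; wrapping in list() is the quick fix at the call site.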
The third point accounted for most of the hassle. Changing the print statements can mostly be done with an automatic bulk substitution, but sometimes a manual change is more precise. The method renames, instead, can all be done by automatic substitution: for example, the dict.iteritems() method is to be replaced by dict.items().
These are usually the most common changes, and I suggest doing this kind of work before running the tests. You won't enjoy seeing a test fail 20 times because of 20 dictionary methods that need replacing. This way you can afterwards focus on the major problems that cannot be seen with a static analysis of the code.
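To give an idea of what such a bulk substitution can look like, here is a minimal sketch based on plain string replacement and one regular expression. It is an illustration, not the script we used: the names are hypothetical, and it deliberately handles only the simplest print form. Statements such as print >>sys.stderr, "x" or a print with a trailing comma are exactly the cases where a manual change is more precise.

```python
import re

# Python 2 dictionary methods and their Python 3 replacements.
METHOD_RENAMES = {
    ".iteritems()": ".items()",
    ".itervalues()": ".values()",
    ".iterkeys()": ".keys()",
}

# Matches only a whole line of the form `print <expression>`;
# lines that already use print(...) are left alone.
PRINT_STATEMENT = re.compile(r"^(\s*)print\s+(?!\()(.+?)\s*$")


def upgrade_line(line):
    """Apply the renames and the print rewrite to one line (no trailing newline)."""
    for old, new in METHOD_RENAMES.items():
        line = line.replace(old, new)
    return PRINT_STATEMENT.sub(r"\1print(\2)", line)


def upgrade_source(source):
    """Apply upgrade_line to every line of a source string."""
    return "\n".join(upgrade_line(line) for line in source.split("\n"))
```

Reviewing the resulting diff by hand is still worthwhile, since a textual substitution knows nothing about the meaning of the code it touches.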
When the testing environment was ready, we could run the tests, and we saw that the vast majority of the errors were still related to the points described above, but finding them required dynamic analysis. For example, the methods of many modules, both built-in and third-party, now require or return bytes objects where they required or returned string objects in Python 2.
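Two standard library modules show this behavior well. The sketch below uses hashlib and base64 with a made-up credentials string; it is an example I chose for illustration, not code from the project.

```python
import base64
import hashlib

credentials = "user:secret"  # made-up value for the example

try:
    # Python 2 accepted a str here; Python 3 requires bytes.
    hashlib.sha256(credentials)
    str_accepted = True
except TypeError:
    str_accepted = False

# Encode explicitly before hashing.
digest = hashlib.sha256(credentials.encode("utf-8")).hexdigest()

# b64encode() both requires and returns bytes in Python 3.
encoded = base64.b64encode(credentials.encode("utf-8"))

# Decode the result before concatenating it with a str.
header_value = "Basic " + encoded.decode("ascii")
```

Errors of this kind only show up when the line is actually executed, which is why the tests were needed to find them.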
During this phase we found another obstacle: default algorithms and encodings have sometimes changed. Some methods produce their output using an algorithm or an encoding that is not the same as in Python 2. This is the case for the pickle module, which implements object serialization. Its default output was a printable string in Python 2; in Python 3 the default is a binary format, and if you need printable output you have to ask for it explicitly by selecting protocol 0: pickle.dumps(data, 0). Note that the result is still a bytes object, so it has to be decoded if a str is needed.
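A short sketch of the difference, with made-up data. For this particular dictionary the protocol 0 output contains only ASCII characters, so decoding it as ASCII is safe; I would not assume that for arbitrary data.

```python
import pickle

data = {"user": "alice", "quota": 10}  # made-up data for the example

# Default protocol in Python 3: a compact binary format.
binary_dump = pickle.dumps(data)

# Protocol 0: the older text-oriented format, still returned as bytes.
ascii_dump = pickle.dumps(data, 0)

# Decode to obtain a str, for example to store it in a text column.
text = ascii_dump.decode("ascii")

# To load it back, encode the str to bytes again.
restored = pickle.loads(text.encode("ascii"))
```

Whatever protocol is chosen, pickle.loads() detects it automatically, so only the writing side needs to change.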
The last interesting point is what to do when a module has to be replaced, for any reason. We found that the pycrypto module was no longer maintained and had been replaced by the pycryptodome module. This is not a built-in module, but it is a de facto standard. In this case you need to check the API documentation and see whether the behavior is the same. It was for us, but sometimes you may need to write extra code to get the same result as the old module.
Other updates that are not covered here were also included in this project, such as Docker scripts, Apache configurations, the Python interpreter, etc.
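Besides reading the documentation, a tiny smoke test against a known value is a cheap way to confirm that the replacement behaves as expected. The sketch below is an assumption-laden example, not our actual code: sha256_hex is a hypothetical wrapper, and the fallback to hashlib is there only so that the snippet also runs where pycryptodome is not installed.

```python
import hashlib

try:
    # The Crypto package is provided by pycryptodome (assumed to be installed).
    from Crypto.Hash import SHA256
except ImportError:
    SHA256 = None


def sha256_hex(data):
    """Hypothetical wrapper: hash bytes with the replacement module if present."""
    if SHA256 is not None:
        return SHA256.new(data).hexdigest()
    # Fallback used only to keep this sketch runnable without pycryptodome.
    return hashlib.sha256(data).hexdigest()


# Well-known SHA-256 digest of the empty byte string.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

smoke_test_passed = sha256_hex(b"") == EMPTY_SHA256
```

Keeping the calls to the replaced module behind a wrapper like this also gives you a single place to add the extra code, should the new module behave differently.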
Summary and final note
To estimate the time needed:
- foresee critical module upgrades
- search for cryptic/obfuscated code
- where possible find changes in default algorithms and encoding
- pay attention to complex bytes / unicode operations
To upgrade the code:
- use the tests (whenever possible) to help you
- use bulk substitution for the most common changes (prints, methods, ...)
- replace old modules either with new ones or with completely new code
- fix any changes in the algorithms / encodings
As a main suggestion: a good initial analysis and organization will help you and save you a lot of time.
As I said, this was my first time doing such a major upgrade, and I hope this post helps you. If you have done something similar and have any suggestions that could help me, please leave a comment!
Happy programming :)