In the summer of 2020, I worked with the contributors of pip, trying to improve the networking performance of the package manager. Admittedly, at the end of the internship period, the benchmark said otherwise; still, I really hope that the clean-up and minor fixes I made to the codebase over the summer, in addition to the implementation of the parallel utils and lazy wheel, might actually help the project.
Personally, I learned a lot: not just about Python packaging and networking, but also about how to work with others. I am really grateful to @pradyunsg (my mentor), @chrahunt, @uranusjr, @pfmoore, @brainwane, @sbidoul, @xavfernandez, @webknjaz, @jaraco, @deveshks, @gutsytechster, @dholth, @dstufft, @cosmicexplorer and @ofek. While this feels like a long shout-out list, it really isn't. These people are the maintainers and contributors of pip and/or other Python packaging projects, and more importantly, they have been more than helpful, encouraging and patient with me throughout all my activities, showing me the way when I was lost, correcting me when I was wrong, putting up with my carelessness and showing me support across different social media.
To best serve the community, I have tried my best to document below what I have done, how I've done it and why I've done it over the last three months. At the time of writing, some work is still in progress, so this also serves as a reference point for myself and others to reason about decisions in relevant topics.
The storyline can be divided into the following four main acts.
In this first act, I ensured the portability of parallelization measures for later use in the final act. The multithreading and multiprocessing map functions now fall back properly on platforms without full support.
GH-8538: Make utils.parallel tests tear down properly
GH-8504: Parallelize pip list --outdated and --uptodate (using GH-8320)
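The fallback described above can be sketched roughly as follows. This is a minimal illustration rather than the code merged into pip: the names `LACK_SEM_OPEN` and `parallel_map` are chosen for this example, and the sketch assumes that failing to import `multiprocessing.synchronize` is a good enough signal that pools are unusable on the platform.

```python
from multiprocessing.pool import ThreadPool

try:
    # On platforms lacking sem_open, importing this module fails
    # and multiprocessing pools cannot be created.
    import multiprocessing.synchronize  # noqa: F401
except ImportError:
    LACK_SEM_OPEN = True
else:
    LACK_SEM_OPEN = False


def parallel_map(func, iterable, chunksize=1):
    """Apply func to each item, using threads when the platform allows.

    Hypothetical helper for illustration: results keep the input order,
    and the built-in map is used where thread pools are unavailable.
    """
    if LACK_SEM_OPEN:
        return list(map(func, iterable))
    with ThreadPool() as pool:
        return pool.map(func, iterable, chunksize)


if __name__ == "__main__":
    print(parallel_map(len, ["pip", "wheel", "setuptools"]))
```

Callers get the same result either way, so code such as a parallel `pip list --outdated` would not need to know which branch was taken.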
As proposed by @cosmicexplorer in GH-7819, it is possible to download only a portion of a wheel to obtain metadata during dependency resolution. Not only would this reduce the total amount of data transmitted over the network when the resolver needs to perform heavy backtracking, but it would also create a synchronization point at the end of the resolution process where parallel downloading can be applied to the needed wheels (some wheels solely serve their metadata during dependency backtracking and are not needed by the users).
GH-8467: Add utility to lazily acquire wheel metadata over HTTP
GH-8584: Revise lazy wheel and its tests
GH-8681: Make range requests closer to chunk size (help GH-8670)
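To make the idea of a lazy wheel more concrete, here is a minimal sketch that works without network access. It is not pip's implementation: `LazyRangeFile` and `fake_fetch` are hypothetical names, and `fetch(start, end)` merely stands in for an HTTP request with a `Range: bytes=start-end` header. Since a wheel is a zip archive whose index sits at the end of the file, `zipfile` only needs a few small ranges to locate and read the metadata.

```python
import io
import zipfile


class LazyRangeFile:
    """Seekable read-only file that fetches byte ranges on demand.

    fetch(start, end) is a hypothetical callable returning the bytes
    from start to end inclusive, mirroring an HTTP range request.
    """

    def __init__(self, size, fetch):
        self._size = size
        self._fetch = fetch
        self._pos = 0

    def seekable(self):
        return True

    def readable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._pos + offset
        else:  # io.SEEK_END
            target = self._size + offset
        if target < 0:
            raise OSError("negative seek position")
        self._pos = target
        return self._pos

    def read(self, size=-1):
        if size is None or size < 0:
            size = self._size - self._pos
        end = min(self._pos + size, self._size)
        if end <= self._pos:
            return b""
        data = self._fetch(self._pos, end - 1)
        self._pos += len(data)
        return data


# Build a fake wheel in memory: a large payload plus a small METADATA.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo/data.bin", b"\0" * 200_000)
    archive.writestr("demo-1.0.dist-info/METADATA", "Name: demo\nVersion: 1.0\n")
wheel = buffer.getvalue()

fetched = []  # size of every range that was requested


def fake_fetch(start, end):
    chunk = wheel[start:end + 1]
    fetched.append(len(chunk))
    return chunk


with zipfile.ZipFile(LazyRangeFile(len(wheel), fake_fetch)) as lazy_wheel:
    metadata = lazy_wheel.read("demo-1.0.dist-info/METADATA").decode()

print(metadata)
print(f"fetched {sum(fetched)} of {len(wheel)} bytes")
```

This sketch issues one request per read and caches nothing. A real implementation would probably want to fetch in larger chunks and keep what it has already downloaded, which is the kind of trade-off GH-8681 deals with.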
During this act, the main work was refactoring to integrate the lazy wheel into pip's codebase and clear the way for download parallelization.
GH-8411: Refactor operations.prepare.prepare_linked_requirement
GH-8629: Abstract away AbstractDistribution in higher-level resolver code
GH-8442, GH-8532 and GH-8588 (later reworked by @chrahunt in GH-8685): Use lazy wheel to obtain dependency information for the new resolver
GH-8743: Test hash checking for fast-deps
GH-8804: Check download directory before making range requests
The final act is mostly about the UI of the parallel download. My work revolved around how the progress should be displayed and how other relevant information should be reported to the users.
GH-8710: Revise method fetching metadata using lazy wheels
GH-8737: Add a hook for batch downloading
GH-8771: Parallelize wheel download
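As a rough illustration of the problem this act deals with, the sketch below downloads a batch concurrently and reports progress once per finished file, from a single thread. It is not what pip ships: `download_batch`, `download_one` and `report` are hypothetical names, and the fake downloader stands in for real network requests.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_batch(urls, download_one, report, max_workers=4):
    """Download all urls concurrently, reporting each completion.

    download_one(url) -> bytes and report(done, total, url) are
    hypothetical callables supplied by the caller.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(download_one, url): url for url in urls}
        # as_completed yields in finishing order, so the counter moves
        # as soon as any download is done.
        for done, future in enumerate(as_completed(futures), start=1):
            url = futures[future]
            results[url] = future.result()
            report(done, len(futures), url)
    return results


def fake_download(url):
    # Stand-in for a real HTTP download.
    return url.encode()


progress = []
files = download_batch(
    ["a.whl", "b.whl", "c.whl"],
    fake_download,
    lambda done, total, url: progress.append((done, total)),
)
print(progress)
```

Reporting only from the thread that collects the results keeps the output from interleaving, at the cost of showing no progress within each individual file.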
In order to keep the wheel turning (no pun intended) and avoid wasting time waiting for the pull requests above to be reviewed, I decided to create even more PRs (as I am typing this, many of the patches listed below are nowhere near being merged).
GH-7878: Fail early when install path is not writable
GH-7928: Fix rst syntax in Getting Started guide
GH-7988: Fix tabulate col size in case of empty cell
GH-8137: Add subcommand alias mechanism
GH-8143: Make mypy happy with beta release automation
GH-8248: Fix typo and simplify ireq call
GH-8332: Add license requirement to _vendor/README.rst
GH-8423: Nitpick logging calls
GH-8435: Use str.format style in logging calls
GH-8456: Lint src/pip/_vendor/README.rst
GH-8568: Declare constants in configuration.py as such
GH-8571: Clean up Configuration.unset_value and nit __init__
GH-8578: Allow verbose/quiet level to be specified via config files and environment variables
GH-8599: Replace tabs by spaces for consistency
GH-8614: Use monkeypatch.setenv to mock environment variables
GH-8674: Fix tests/functional/test_install_check.py, when run with new resolver
GH-8692: Make assertion failure give better message
GH-8709: List downloaded distributions before exiting (fix GH-8696)
GH-8759: Allow py2 deprecation warning from setuptools
GH-8766: Use the new resolver for test requirements
GH-8790: Mark tests using remote svn and hg as xfail
GH-8795: Reformat a few spots in user guide
Every Monday throughout the Summer of Code, I summarized what I had done in the week before in the form of either a short blog or an (even shorter) check-in. These write-ups often contain handfuls of popular culture references and were originally hosted on Python GSoC.