Motivation for range-streams
I came across
`stream_response.py
<https://gist.github.com/obskyr/b9d4b4223e7eaf4eedcd9defabb34f13>`__
by GitHub user obskyr which encouraged me to pursue this idea I had to
stream GET requests for large files as file-like objects.
My idea is similar, except that rather than get the entire file, what if you just wanted to download particular ranges, and avoid downloading bytes for the parts you don’t care about (or defer downloading them until you do).
In particular, “streaming” GET requests work by supplying the
"Transfer-Encoding": "chunked
header, which a server that supports
this transfer encoding will respond to by returning ‘chunks’. If the
server doesn’t support this, but does support range requests, then it
should be possible to achieve a similar outcome with a little effort.
Running stream_response.py
demonstrates this by retrieving a
chunk-capable and chunk-less URL:
url='https://httpbin.org/stream/20'
b'{"url": "https://httpbin.org/stream/20", "args": {'
url='https://raw.githubusercontent.com/lmmx/range-streams/bb5e0cc2e6980ea9e716a569ab0322587d3aa785/example_text_file.txt'
ValueError: Not a chunked stream
An example of a filetype which this would suit is .zip
, which has
well-defined sections (see my
notes).
People want various things from zip files:
the UK Department for International Trade just want the files, and ``pass` the central directory record entirely <https://github.com/uktrade/stream-unzip/blob/131767e806f09518cd51614ec7acd651099910bd/stream_unzip.py#L177-L181>`__,
when surveying the Anaconda and Conda-Forge repositories of Python binaries, (using my tool
`impscan
<https://github.com/lmmx/impscan/>`__ parsing`.conda
files <https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/packages.html#conda-file-format>`__ (which are zip files with a particular directory format), I want to conditionally download the metadata and package binaries themselves (which are compressed in separate Zstd sub-archives of the.conda
archive).The package binary in particular will likely be rather larger, so I’d like to avoid needlessly downloading that if I don’t need it (specifically if I don’t need to check it for
site-packages
paths, if the metadata tells me it doesn’t have any).
For this reason, sometimes you don’t want the bytes from the server for
a particular part of a file, and if the headers returned by the server
from a GET request include Accept-Ranges': 'bytes'
then that means
it supports HTTP range
requests
Here’s an excerpt of the headers returned when I request a packaged ``tqdm` build <https://repo.anaconda.com/pkgs/main/noarch/tqdm-4.61.1-pyhd3eb1b0_1.conda>`__:
'Content-Length': '83498',
'Content-Type': 'binary/octet-stream',
and here’s what you get when adding a range header to just get 2 bytes
using requests.get(url, headers={"range": "bytes=0-1"})
:
'Content-Length': '2',
'Content-Range': 'bytes 0-1/83498',
'Content-Type': 'binary/octet-stream'
When you send a HTTP range GET request rather than a regular one, note
you get back the additional “Content-Range” header. This can be used in
an equivalent way to seek
on a file, and can avoid the routine you
sometimes see (such as in the Python standard library’s zipfile
module) of seeking to a negative index, then tell()
ing the
position of the file cursor, to determine an absolute position from a
relative position.
For a zip file, you know the final 22 bytes contain the “end of central directory record”, so this is my test case for this library as proof of concept.
Another server which supports range requests is… GitHub! But only for files in repos, not gists.
The response code returned is 206 (partial response)
When used with
stream=True
, the response does not actually retrieve the bytes in question untilraw.read()
oriter_bytes()
is called on itNote: for binary/compressed files, don’t use
content
oriter_content
Running the demo
`demo_range_requests.py
<https://github.com/lmmx/range-streams/blob/d2c64d988697ff31252970d425335983e52d7c72/preamble/demo_range_requests.py>`__
No byte: bytes_range='-0' --> r.raw.read()=b''
File length from response: r.headers["Content-Range"].split("/")[-1]='11'
File length from file: len(Path("example_text_file.txt").read_bytes())=11
First byte: bytes_range='0-0' --> r.raw.read()=b'P'
First 2 bytes: bytes_range='0-1' --> r.raw.read()=b'P\x00'
Last byte: bytes_range='-1' --> r.raw.read()=b'K'