author | James Taylor <user234683@users.noreply.github.com> | 2020-12-21 16:55:51 -0800
committer | Jesús <heckyel@hyperbola.info> | 2020-12-21 21:24:03 -0500
commit | 35b2b3e97327c2e304beff66c40c7b76ec0a32d2
tree | f1444557ee4a00bf3e2a4380a5f494fbdb1f07ad
parent | 84c6192e08617c166065bbc6b91ecb7f9217f5e2
Add coding guidelines, architecture overview, and proto_debug
proto_debug is the toolchain for reverse engineering protobuf
Signed-off-by: Jesús <heckyel@hyperbola.info>
-rw-r--r-- | HACKING.md | 62
-rw-r--r-- | README.md | 8
-rw-r--r-- | youtube/proto_debug.py | 488
3 files changed, 556 insertions, 2 deletions
diff --git a/HACKING.md b/HACKING.md
new file mode 100644
index 0000000..889ee08
--- /dev/null
+++ b/HACKING.md
@@ -0,0 +1,62 @@

# Coding guidelines

* Follow the [PEP 8 guidelines](https://www.python.org/dev/peps/pep-0008/) for all new Python code as best you can. Some old code doesn't follow PEP 8 yet. This includes limiting line length to 79 characters (with an exception for long strings such as URLs that can't reasonably be broken across multiple lines) and using 4 spaces for indentation.

* Do not use single-letter or cryptic names for variables (except iterator variables or the like). When in doubt, choose the more verbose option.

* For consistency, use ' instead of " for strings in all new code. Only use " when the string contains ' inside it. Exception: " is used for html attributes in Jinja templates.

* Don't leave trailing whitespace at the end of lines. Configure your editor to prevent this from happening.

* Make commits highly descriptive, so that other people (and yourself in the future) know exactly why a change was made. The first line of the commit is a short summary. Add a blank line and then a more extensive summary. If it is a bug fix, this should include a description of what caused the bug and how this commit fixes it. There's a lot of knowledge you gather while solving a problem; dump as much of it as possible into the commit for others and yourself to learn from. Mention the issue number (e.g. Fixes #23) in your commit if applicable. [Here](https://www.freecodecamp.org/news/writing-good-commit-messages-a-practical-guide/) are some useful guidelines.

* The same guidelines apply to commenting code. If a piece of code is not self-explanatory, add a comment explaining what it does and why it's there.

# Testing and releases

* This project uses pytest. To install pytest and any future dependencies needed for development, run `pip3 install -r requirements-dev.txt`. To run tests, run `python3 -m pytest` rather than just `pytest`, because the former makes sure the top-level directory is in Python's import search path.

* To build releases for Windows, run `python3 generate_release.py [intended python version here, without the v in front]`. The required software (such as 7z and git) is listed in the `generate_release.py` file. For instance, wine is required if building on Linux. The build script will automatically download the embedded Python release to include. Use the latest release of Python 3.7.x so that Vista will be supported. See https://github.com/user234683/yt-local/issues/6#issuecomment-672608388

# Overview of the software architecture

## server.py

* This is the entry point; it sets up the HTTP server that listens for incoming requests and delegates each request to the appropriate "site_handler". For instance, `localhost:8080/youtube.com/...` goes to the `youtube` site handler, whereas `localhost:8080/ytimg.com/...` (the url for video thumbnails) goes to the site handler that just fetches static resources such as images from youtube.

* The reason for this architecture: the original design philosophy when I first conceived the project was that this would work for any site supported by youtube-dl, including Youtube, Vimeo, DailyMotion, etc. I've dropped this idea for now, though I might pick it up later. (youtube-dl is no longer used.)

* This file uses the raw [WSGI request](https://www.python.org/dev/peps/pep-3333/) format. WSGI is a Python standard for how HTTP servers (I use the stock server provided by gevent) should call HTTP applications. That's why the file contains stuff like `env['REQUEST_METHOD']`.
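To make the calling convention concrete, here is a minimal, self-contained WSGI sketch (illustrative only, not yt-local's actual server.py; the dispatch logic is simplified):

```python
# Minimal sketch of the WSGI convention described above (hypothetical code)
def application(env, start_response):
    # WSGI hands the request over as a plain dict of CGI-style variables
    method = env['REQUEST_METHOD']             # e.g. 'GET'
    path = env['PATH_INFO']                    # e.g. '/youtube.com/watch'
    site = path.lstrip('/').split('/', 1)[0]   # first component picks the site handler
    body = ('%s request routed to the %s handler' % (method, site)).encode('utf-8')
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [body]

if __name__ == '__main__':
    # gevent's stock WSGI server, as mentioned above
    from gevent.pywsgi import WSGIServer
    WSGIServer(('127.0.0.1', 8080), application).serve_forever()
```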
## Flask and Gevent

* The `youtube` handler in server.py then delegates the request to the Flask `yt_app` object, which the rest of the project uses. [Flask](https://flask.palletsprojects.com/en/1.1.x/) is a web application framework that makes handling requests easier than accessing the raw WSGI requests. Flask (Werkzeug specifically) figures out which function to call for a particular url. Each request handling function is registered into Flask's routing table by using decorators above it. The request handling functions are always at the bottom of the file for a particular youtube page (channel, watch, playlist, etc.), and they're where you want to look to see how the response gets constructed for a particular url. Miscellaneous request handlers that don't belong anywhere else are located in `__init__.py`, which is where the `yt_app` object is instantiated.

* The actual html for yt-local is generated using Jinja templates. Jinja lets you embed a Python-like language inside html files so you can use constructs such as for loops to construct the html for a list of 30 videos given a dictionary with information for those videos. Jinja is included as part of Flask. It has some annoying differences from Python in a lot of details, so check the [docs here](https://jinja.palletsprojects.com/en/2.11.x/) when you use it. The request handling functions pass the information that has been scraped from Youtube into these templates for the final result.

* The project uses the gevent library for parallelism (such as for launching requests in parallel), as opposed to using the async keyword.
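As a sketch of the route-plus-template pattern described above (the route, data, and inline template here are invented for illustration; they are not actual yt-local code):

```python
# Hypothetical example of the Flask + Jinja pattern used in this project
import flask

yt_app = flask.Flask(__name__)

@yt_app.route('/playlist')   # the decorator registers this url into the routing table
def get_playlist():
    # ...scraping and extraction would happen here...
    videos = [{'title': 'Example video', 'duration': '1:23'}]
    # render_template_string stands in for a real .html template file
    return flask.render_template_string(
        '{% for video in videos %}'
        '<p>{{ video.title }} ({{ video.duration }})</p>'
        '{% endfor %}',
        videos=videos)
```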
## util.py

* util.py is a grab-bag of miscellaneous things; admittedly I need to get around to refactoring it. The biggest thing it has is the `fetch_url` function, which is what I use for sending out requests to Youtube. The Tor routing is managed here. `fetch_url` will raise a `FetchError` exception if the request fails. The parameter `debug_name` in `fetch_url` is the filename that the response from Youtube will be saved to if the hidden debugging option is enabled in settings.txt. So if there's a bug when Youtube changes something, you can check the response from Youtube in that file.

## Data extraction - protobuf, polymer, and yt_data_extract

* proto.py is used for generating what are called ctokens, which are needed when making requests to Youtube. These ctokens use Google's [protobuf](https://developers.google.com/protocol-buffers) format. Figuring out how to generate these in new instances requires some reverse engineering. I have a messy python file I use to make this convenient, which you can find under ./youtube/proto_debug.py

* The responses from Youtube are in a JSON format called polymer (polymer is the name of the 2017-present Youtube layout). The JSON consists of a bunch of nested dictionaries which basically specify the layout of the page via objects called renderers. A renderer represents an object on a page in a similar way to html tags; renderers often contain other renderers inside them. The Javascript on Youtube's page translates this JSON to HTML. Example: `compactVideoRenderer` represents a video item you can click on, such as in the related videos (so these are called "items" in the codebase). This JSON is very messy. You'll need a JSON prettifier or something that gives you a tree view in order to study it.

* `yt_data_extract` is a module that parses this raw JSON page layout and extracts the useful information from it into a standardized dictionary. For instance, it can take the raw JSON response from the watch page and return a dictionary containing keys such as `title`, `description`, `related_videos` (list), `likes`, etc. This module contains a lot of abstractions designed to make parsing the polymer format easier and more resilient to changes from Youtube. (A lot of Youtube extractors just traverse the JSON tree like `response[1]['response']['continuation']['gridContinuationRenderer']['items']...`, but this tends to break frequently when Youtube changes things.) If it fails to extract a piece of data, such as the like count, it will place `None` in that entry. Exceptions are not used in this module; instead, it uses functions which return None on failure, such as `deep_get(response, 1, 'response', 'continuation', 'gridContinuationRenderer', 'items')`, which returns None if any of those keys aren't present. The general purpose abstractions are located in `common.py`, while the functions for parsing specific responses (watch page, playlist, channel, etc.) are located in `watch_extraction.py` and `everything_else.py`.

* Most of these abstractions are self-explanatory, except for `extract_items_from_renderer`, a function that performs a recursive search for the specified renderers. You give it a renderer which contains nested renderers, and a set of the renderer types you want to extract (by default, these are the video/playlist/channel preview items). It will search through the nested renderers and gather the specified items, in addition to the continuation token (ctoken) for the last list of items it finds, if there is one. Using this function achieves resiliency against Youtube rearranging the items into a different hierarchy.

* The `extract_items` function is similar but works on the response object, automatically finding the appropriate renderer to call `extract_items_from_renderer` on.
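A minimal sketch of what a None-on-failure traversal helper like `deep_get` could look like (the real implementation lives in `common.py` and may differ):

```python
# Sketch of a deep_get-style helper; the actual one in common.py may differ
def deep_get(obj, *path, default=None):
    '''Traverse nested dicts/lists; return default instead of raising on failure.'''
    for key in path:
        try:
            obj = obj[key]   # works for both dict keys and list indices
        except (KeyError, IndexError, TypeError):
            return default
    return obj

response = {'response': {'continuation': {}}}
print(deep_get(response, 'response', 'continuation',
               'gridContinuationRenderer', 'items'))
# -> None, rather than a KeyError
```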
## Other

* subscriptions.py uses SQLite to store data.

* Hidden settings only relevant to developers (such as for debugging) are not displayed on the settings page. They can be found in the settings.txt file.

* Since I can't anticipate the things that will trip up beginners to the codebase, if you spend awhile figuring something out, go ahead and make a pull request adding a brief description of your findings to this document to help other beginners.

## Development tips

* When developing functionality to interact with Youtube in new ways, you'll want to use the network tab in your browser's devtools to inspect which requests get made under normal usage of Youtube. You'll also want a tool you can use to construct custom requests and specify headers, in order to reverse engineer the request format. I use the [HeaderTool](https://github.com/loreii/HeaderTool) extension in Firefox, but there's probably a more streamlined program out there.

* You'll want to have a utility or IDE that can perform full text search on a repository, since this is crucial for navigating unfamiliar codebases to figure out where certain strings appear or where things get defined.

* If you're confused about the purpose of a particular line/section of code, you can use the "git blame" feature on github (click the line number and then the three dots) to view the commit where the line of code was created and check the commit message. This will give you an idea of how it was put together.

diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -104,14 +104,18 @@
 In the settings page, set "Route Tor" to "On, except video" (the second option). Ensure Tor is listening for Socks5 connections on port 9150 (a simple way to accomplish this is by opening the Tor Browser Bundle and leaving it open). Your connections should now be routed through Tor.
-Pull requests and issues are welcome
-
 ### Tor video routing
 
 If you wish to route the video through Tor, set "Route Tor" to "On, including video". Because this is bandwidth-intensive, you are strongly encouraged to donate to the [consortium of Tor node operators](https://torservers.net/donate.html). For instance, donations to [NoiseTor](https://noisetor.net/) go straight towards funding nodes. Using their numbers for bandwidth costs, together with an average of 485 kbit/sec for a diverse sample of videos, and assuming n hours of video watched per day, gives $0.03n/month. A $1/month donation will be a very generous amount to not only offset losses, but help keep the network healthy.
 
 In general, Tor video routing will be slower (for instance, moving around in the video is quite slow). I've never seen any signs that watch history in yt-local affects on-site Youtube recommendations. It's likely that requests to googlevideo are logged for some period of time, but are not integrated into Youtube's larger advertisement/recommendation systems, since those presumably depend more heavily on in-page tracking through Javascript rather than CDN requests to googlevideo.
 
+## Contributing
+
+Pull requests and issues are welcome
+
+For coding guidelines and an overview of the software architecture, see the HACKING.md file.
+
 ## License
 
 This project is licensed under the GNU Affero General Public License v3 (GNU AGPLv3) or any later version.
diff --git a/youtube/proto_debug.py b/youtube/proto_debug.py
new file mode 100644
index 0000000..3a5541c
--- /dev/null
+++ b/youtube/proto_debug.py
@@ -0,0 +1,488 @@

# TODO: clean this file up more and heavily refactor

''' Helper functions for reverse engineering protobuf.

Basic guide:

Run interactively with python3 -i proto_debug.py

The function dec will decode a base64 string
(regardless of whether it includes = or %3D at the end) to a bytestring

The function pb (parse_protobuf) will return a list of tuples.
Each tuple is (wire_type, field_number, field_data)

The function enc encodes as base64 (inverse of dec)
The function uenc is like enc but replaces = with %3D

See https://developers.google.com/protocol-buffers/docs/encoding#structure

Example usage:
>>> pb(dec('4qmFsgJcEhhVQ1lPX2phYl9lc3VGUlY0YjE3QUp0QXcaQEVnWjJhV1JsYjNNWUF5QUFNQUU0QWVvREdFTm5Ua1JSVlVWVFEzZHBYM2gwTTBaeFRuRkZiRFZqUWclM0QlM0Q%3D'))
[(2, 80226972, b'\x12\x18UCYO_jab_esuFRV4b17AJtAw\x1a@EgZ2aWRlb3MYAyAAMAE4AeoDGENnTkRRVUVTQ3dpX3h0M0ZxTnFFbDVjQg%3D%3D')]

>>> pb(b'\x12\x18UCYO_jab_esuFRV4b17AJtAw\x1a@EgZ2aWRlb3MYAyAAMAE4AeoDGENnTkRRVUVTQ3dpX3h0M0ZxTnFFbDVjQg%3D%3D')
[(2, 2, b'UCYO_jab_esuFRV4b17AJtAw'), (2, 3, b'EgZ2aWRlb3MYAyAAMAE4AeoDGENnTkRRVUVTQ3dpX3h0M0ZxTnFFbDVjQg%3D%3D')]

>>> pb(dec(b'EgZ2aWRlb3MYAyAAMAE4AeoDGENnTkRRVUVTQ3dpX3h0M0ZxTnFFbDVjQg%3D%3D'))
[(2, 2, b'videos'), (0, 3, 3), (0, 4, 0), (0, 6, 1), (0, 7, 1), (2, 61, b'CgNDQUESCwi_xt3FqNqEl5cB')]

>>> pb(dec(b'CgNDQUESCwi_xt3FqNqEl5cB'))
[(2, 1, b'CAA'), (2, 2, b'\x08\xbf\xc6\xdd\xc5\xa8\xda\x84\x97\x97\x01')]

>>> pb(b'\x08\xbf\xc6\xdd\xc5\xa8\xda\x84\x97\x97\x01')
[(0, 1, 10893665244101960511)]

>>> pb(dec(b'CAA'))
[(0, 1, 0)]

The function recursive_pb will try to do dec/pb recursively automatically.
It's a dumb function (so it might try to dec or pb something that isn't really
base64 or protobuf), and it's a mess right now, so disclaimer.
The function pp will pretty print the recursive structure:

>>> pp(recursive_pb('4qmFsgJcEhhVQ1lPX2phYl9lc3VGUlY0YjE3QUp0QXcaQEVnWjJhV1JsYjNNWUF5QUFNQUU0QWVvREdFTm5Ua1JSVlVWVFEzZHBYM2gwTTBaeFRuRkZiRFZqUWclM0QlM0Q%3D'))

('base64',
 [
  [2, 80226972,
   [
    [2, 2, b'UCYO_jab_esuFRV4b17AJtAw'],
    [2, 3,
     ('base64',
      [
       [2, 2, b'videos'],
       [0, 3, 3],
       [0, 4, 0],
       [0, 6, 1],
       [0, 7, 1],
       [2, 61,
        ('base64',
         [
          [2, 1, b'CAA'],
          [2, 2,
           [
            [0, 1, 10893665244101960511],
           ]
          ],
         ]
        )
       ],
      ]
     )
    ],
   ]
  ],
 ]
)

make_proto will take a recursive_pb structure and make a ctoken out of it:
- base64 means a base64 encode with equals sign paddings
- base64s means a base64 encode without padding
- base64u means a url base64 encode with equals signs replaced with %3D

recursive_pb cannot distinguish between base64, base64u, and base64s, so
those must be specified manually if recreating the token. You will not have
make_proto(recursive_pb(x)) == x if x uses base64u or base64s.

There are some other functions I wrote while reverse engineering stuff
that may or may not be useful.
'''


import urllib.request
import urllib.parse
import re
import time
import json
import os
import pprint


# ------ from proto.py -----------------------------------------------
from math import ceil
import base64
import io


def byte(n):
    return bytes((n,))


def varint_encode(offset):
    '''In this encoding system, for each 8-bit byte, the first bit is 1 if there are more bytes, and 0 if this is the last one.
    The next 7 bits are data. These 7-bit sections represent the data in little-endian order. For example, suppose the data is
    aaaaaaabbbbbbbccccccc (each of these sections is 7 bits). It will be encoded as:
    1ccccccc 1bbbbbbb 0aaaaaaa

    This encoding is used in youtube parameters to encode offsets and to encode the length for length-prefixed data.
    See https://developers.google.com/protocol-buffers/docs/encoding#varints for more info.'''
    needed_bytes = ceil(offset.bit_length()/7) or 1  # (0).bit_length() returns 0, but we need 1 in that case.
    encoded_bytes = bytearray(needed_bytes)
    for i in range(0, needed_bytes - 1):
        encoded_bytes[i] = (offset & 127) | 128  # 7 least significant bits
        offset = offset >> 7
    encoded_bytes[-1] = offset & 127  # leave first bit as zero for last byte

    return bytes(encoded_bytes)
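# (Illustrative aside, not part of proto_debug.py: a quick round trip through
# the varint helpers. 300 = 0b10_0101100, so its little-endian 7-bit groups
# are 0101100 then 0000010, giving bytes 0xac 0x02:
#
#   >>> varint_encode(300)
#   b'\xac\x02'
#   >>> varint_decode(b'\xac\x02')
#   300
# )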
def varint_decode(encoded):
    decoded = 0
    for i, byte in enumerate(encoded):
        decoded |= (byte & 127) << 7*i

        if not (byte & 128):
            break
    return decoded


def string(field_number, data):
    data = as_bytes(data)
    return _proto_field(2, field_number, varint_encode(len(data)) + data)
nested = string


def uint(field_number, value):
    return _proto_field(0, field_number, varint_encode(value))


def _proto_field(wire_type, field_number, data):
    ''' See https://developers.google.com/protocol-buffers/docs/encoding#structure '''
    return varint_encode((field_number << 3) | wire_type) + data


def percent_b64encode(data):
    return base64.urlsafe_b64encode(data).replace(b'=', b'%3D')


def unpadded_b64encode(data):
    return base64.urlsafe_b64encode(data).replace(b'=', b'')


def as_bytes(value):
    if isinstance(value, str):
        return value.encode('utf-8')
    return value


def read_varint(data):
    result = 0
    i = 0
    while True:
        try:
            byte = data.read(1)[0]
        except IndexError:
            if i == 0:
                raise EOFError()
            raise Exception('Unterminated varint starting at ' + str(data.tell() - i))
        result |= (byte & 127) << 7*i
        if not byte & 128:
            break

        i += 1
    return result


def read_group(data, end_sequence):
    start = data.tell()
    index = data.original.find(end_sequence, start)
    if index == -1:
        raise Exception('Unterminated group')
    data.seek(index + len(end_sequence))
    return data.original[start:index]


def parse(data):
    # was read_protobuf (defined in proto.py); renamed so this file runs standalone
    return {field_number: value for _, field_number, value in parse_protobuf(data)}


def b64_to_bytes(data):
    if isinstance(data, bytes):
        data = data.decode('ascii')
    data = data.replace("%3D", "=")
    return base64.urlsafe_b64decode(data + "="*((4 - len(data)%4)%4))
# --------------------------------------------------------------------


dec = b64_to_bytes


def enc(t):
    return base64.urlsafe_b64encode(t).decode('ascii')


def uenc(t):
    return enc(t).replace("=", "%3D")


def b64_to_ascii(t):
    return base64.urlsafe_b64decode(t).decode('ascii', errors='replace')


def b64_to_bin(t):
    decoded = base64.urlsafe_b64decode(t)
    #print(len(decoded)*8)
    return " ".join(["{:08b}".format(x) for x in decoded])


def bytes_to_bin(t):
    return " ".join(["{:08b}".format(x) for x in t])


def bin_to_bytes(t):
    return int(t, 2).to_bytes((len(t) + 7) // 8, 'big')


def bytes_to_hex(t):
    return ' '.join(hex(n)[2:].zfill(2) for n in t)
tohex = bytes_to_hex
fromhex = bytes.fromhex


def aligned_ascii(data):
    return ' '.join(' ' + chr(n) if n in range(32,128) else ' _' for n in data)
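# (Illustrative aside, not part of proto_debug.py: the encoding helpers above
# compose directly. For example, the inner b'videos' message from the module
# docstring can be rebuilt by hand:
#
#   >>> string(2, 'videos') + uint(3, 3)
#   b'\x12\x06videos\x18\x03'
#   >>> unpadded_b64encode(string(2, 'videos') + uint(3, 3))
#   b'EgZ2aWRlb3MYAw'
#
# string() length-prefixes its payload, and _proto_field() packs
# (field_number << 3) | wire_type into the leading tag varint.)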
def parse_protobuf(data, mutable=False, spec=()):
    data_original = data
    data = io.BytesIO(data)
    data.original = data_original
    while True:
        try:
            tag = read_varint(data)
        except EOFError:
            break
        wire_type = tag & 7
        field_number = tag >> 3

        if wire_type == 0:
            value = read_varint(data)
        elif wire_type == 1:
            value = data.read(8)
        elif wire_type == 2:
            length = read_varint(data)
            value = data.read(length)
        elif wire_type == 3:
            end_bytes = varint_encode((field_number << 3) | 4)
            value = read_group(data, end_bytes)
        elif wire_type == 5:
            value = data.read(4)
        else:
            raise Exception("Unknown wire type: " + str(wire_type) + ", Tag: "
                            + bytes_to_hex(varint_encode(tag))
                            + ", at position " + str(data.tell()))
        if mutable:
            yield [wire_type, field_number, value]
        else:
            yield (wire_type, field_number, value)


def pb(data, mutable=False):
    return list(parse_protobuf(data, mutable=mutable))


def make_proto(fields):
    if len(fields) == 2 and fields[0] == 'base64':
        return enc(make_proto(fields[1]))
    result = b''
    for field in fields:
        if field[0] == 0:
            result += _proto_field(0, field[1], varint_encode(field[2]))
        elif field[0] == 2:
            data = field[2]
            if isinstance(data, str):
                data = data.encode('utf-8')
            elif len(data) == 2 and data[0] == 'base64':
                data = base64.urlsafe_b64encode(make_proto(data[1]))
            elif len(data) == 2 and data[0] == 'base64s':
                data = base64.urlsafe_b64encode(make_proto(data[1])).rstrip(b'=')
            elif len(data) == 2 and data[0] == 'base64u':
                data = base64.urlsafe_b64encode(make_proto(data[1])).replace(b'=', b'%3D')
            elif isinstance(data, list):
                data = make_proto(data)
            result += _proto_field(2, field[1], varint_encode(len(data)) + data)
        else:
            raise NotImplementedError('Wire type ' + str(field[0]) + ' not implemented')
    return result
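# (Illustrative aside, not part of proto_debug.py: make_proto inverts pb/dec
# for simple structures, e.g. the b'CAA' token from the module docstring:
#
#   >>> make_proto(('base64', [[0, 1, 0]]))
#   'CAA='
#   >>> pb(dec('CAA='))
#   [(0, 1, 0)]
#
# Note the trailing '=' padding: 'base64' pads, 'base64s' strips the padding,
# and 'base64u' replaces '=' with '%3D', so pick the variant the original
# token used when recreating it.)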
def bytes_to_base4(data):
    result = ''
    for b in data:
        result += str(b >> 6) + str((b >> 4) & 0b11) + str((b >> 2) & 0b11) + str(b & 0b11)
    return result


import re
import struct
import binascii


# Base32 encoding/decoding must be done in Python
_b32alphabet = b'abcdefghijklmnopqrstuvwxyz012345'
_b32tab2 = None
_b32rev = None

bytes_types = (bytes, bytearray)  # Types acceptable as binary data


def _bytes_from_decode_data(s):
    if isinstance(s, str):
        try:
            return s.encode('ascii')
        except UnicodeEncodeError:
            raise ValueError('string argument should contain only ASCII characters')
    if isinstance(s, bytes_types):
        return s
    try:
        return memoryview(s).tobytes()
    except TypeError:
        raise TypeError("argument should be a bytes-like object or ASCII "
                        "string, not %r" % s.__class__.__name__) from None


def b32decode(s, casefold=False, map01=None):
    """Decode the Base32 encoded bytes-like object or ASCII string s.

    Optional casefold is a flag specifying whether a lowercase alphabet is
    acceptable as input. For security purposes, the default is False.

    RFC 3548 allows for optional mapping of the digit 0 (zero) to the
    letter O (oh), and for optional mapping of the digit 1 (one) to
    either the letter I (eye) or letter L (el). The optional argument
    map01 when not None, specifies which letter the digit 1 should be
    mapped to (when map01 is not None, the digit 0 is always mapped to
    the letter O). For security purposes the default is None, so that
    0 and 1 are not allowed in the input.

    The result is returned as a bytes object. A binascii.Error is raised if
    the input is incorrectly padded or if there are non-alphabet
    characters present in the input.
    """
    global _b32rev
    # Delay the initialization of the table to not waste memory
    # if the function is never called
    if _b32rev is None:
        _b32rev = {v: k for k, v in enumerate(_b32alphabet)}
    s = _bytes_from_decode_data(s)
    if len(s) % 8:
        raise binascii.Error('Incorrect padding')
    # Handle section 2.4 zero and one mapping. The flag map01 will be either
    # False, or the character to map the digit 1 (one) to. It should be
    # either L (el) or I (eye).
    if map01 is not None:
        map01 = _bytes_from_decode_data(map01)
        assert len(map01) == 1, repr(map01)
        s = s.translate(bytes.maketrans(b'01', b'O' + map01))
    if casefold:
        s = s.upper()
    # Strip off pad characters from the right. We need to count the pad
    # characters because this will tell us how many null bytes to remove from
    # the end of the decoded string.
    l = len(s)
    s = s.rstrip(b'=')
    padchars = l - len(s)
    # Now decode the full quanta
    decoded = bytearray()
    b32rev = _b32rev
    for i in range(0, len(s), 8):
        quanta = s[i: i + 8]
        acc = 0
        try:
            for c in quanta:
                acc = (acc << 5) + b32rev[c]
        except KeyError:
            raise binascii.Error('Non-base32 digit found') from None
        decoded += acc.to_bytes(5, 'big')
    # Process the last, partial quanta
    if padchars:
        acc <<= 5 * padchars
        last = acc.to_bytes(5, 'big')
        if padchars == 1:
            decoded[-5:] = last[:-1]
        elif padchars == 3:
            decoded[-5:] = last[:-2]
        elif padchars == 4:
            decoded[-5:] = last[:-3]
        elif padchars == 6:
            decoded[-5:] = last[:-4]
        else:
            raise binascii.Error('Incorrect padding')
    return bytes(decoded)


def dec32(data):
    if isinstance(data, bytes):
        data = data.decode('ascii')
    return b32decode(data + "="*((8 - len(data)%8)%8))


def recursive_pb(data, filt=True):
    b64 = False
    if isinstance(data, str) or all(i > 32 for i in data):
        try:
            if len(data) > 11 and data[0:2] != b'UC':
                data = b64_to_bytes(data)
                b64 = True
            else:
                return data
        except Exception as e:
            return data

    try:
        result = pb(data, mutable=True)
    except Exception as e:
        return data
    for tuple in result:
        if tuple[0] == 2:
            try:
                tuple[2] = recursive_pb(tuple[2])
            except Exception:
                pass
    if b64:
        return ('base64', result)
    return result


def indent_lines(lines, indent):
    return re.sub(r'^', ' '*indent, lines, flags=re.MULTILINE)


def _pp(obj, indent):  # not my best work
    if isinstance(obj, tuple):
        if len(obj) == 3:  # (wire_type, field_number, data)
            return obj.__repr__()
        else:  # (base64, [...])
            return ('(' + obj[0].__repr__() + ',\n'
                    + indent_lines(_pp(obj[1], indent), indent) + '\n'
                    + ')')
    elif isinstance(obj, list):
        # [wire_type, field_number, data]
        if (len(obj) == 3
                and not any(isinstance(x, (list, tuple)) for x in obj)):
            return obj.__repr__()

        # [wire_type, field_number, [...]]
        elif (len(obj) == 3
                and not any(isinstance(x, (list, tuple)) for x in obj[0:2])):
            return ('[' + obj[0].__repr__() + ', ' + obj[1].__repr__() + ',\n'
                    + indent_lines(_pp(obj[2], indent), indent) + '\n'
                    + ']')
        else:
            s = '[\n'
            for x in obj:
                s += indent_lines(_pp(x, indent), indent) + ',\n'
            s += ']'
            return s
    else:
        return obj.__repr__()


def pp(obj, indent=1):
    '''Pretty prints the recursive pb structure'''
    print(_pp(obj, indent))


desktop_user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0'
desktop_headers = (
    ('Accept', '*/*'),
    ('Accept-Language', 'en-US,en;q=0.5'),
    ('X-YouTube-Client-Name', '1'),
    ('X-YouTube-Client-Version', '2.20180830'),
) + (('User-Agent', desktop_user_agent),)

mobile_user_agent = 'Mozilla/5.0 (Linux; Android 7.0; Redmi Note 4 Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Mobile Safari/537.36'
mobile_headers = (
    ('Accept', '*/*'),
    ('Accept-Language', 'en-US,en;q=0.5'),
    ('X-YouTube-Client-Name', '2'),
    ('X-YouTube-Client-Version', '2.20180830'),
) + (('User-Agent', mobile_user_agent),)
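# (Illustrative aside, not part of proto_debug.py: these header tuples are in
# the right shape for urllib when probing endpoints during reverse
# engineering. The url below is a placeholder, not a real endpoint:
#
#   req = urllib.request.Request('https://m.youtube.com/...',
#                                headers=dict(mobile_headers))
#   response = urllib.request.urlopen(req).read()
# )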