From 7fd2c3474fa71cfb36f64e7f5c4d89fb21c38334 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jes=C3=BAs?= Date: Thu, 10 Jun 2021 16:41:45 -0500 Subject: Capitalize name app --- docs/HACKING.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) (limited to 'docs') diff --git a/docs/HACKING.md b/docs/HACKING.md index 82128e5..6e6b7fd 100644 --- a/docs/HACKING.md +++ b/docs/HACKING.md @@ -21,7 +21,7 @@ ## server.py * This is the entry point, and sets up the HTTP server that listens for incoming requests. It delegates the request to the appropriate "site_handler". For instance, `localhost:8080/youtube.com/...` goes to the `youtube` site handler, whereas `localhost:8080/ytimg.com/...` (the url for video thumbnails) goes to the site handler for just fetching static resources such as images from youtube. -* The reason for this architecture: the original design philosophy when I first conceived the project was that this would work for any site supported by youtube-dl, including Youtube, Vimeo, DailyMotion, etc. I've dropped this idea for now, though I might pick it up later. (youtube-dl is no longer used) +* The reason for this architecture: the original design philosophy when I first conceived the project was that this would work for any site supported by youtube-dl, including YouTube, Vimeo, DailyMotion, etc. I've dropped this idea for now, though I might pick it up later. (youtube-dl is no longer used) * This file uses the raw [WSGI request](https://www.python.org/dev/peps/pep-3333/) format. The WSGI format is a Python standard for how HTTP servers (I use the stock server provided by gevent) should call HTTP applications. So that's why the file contains stuff like `env['REQUEST_METHOD']`. @@ -29,20 +29,20 @@ ## Flask and Gevent * The `youtube` handler in server.py then delegates the request to the Flask yt_app object, which the rest of the project uses. [Flask](https://flask.palletsprojects.com/en/1.1.x/) is a web application framework that makes handling requests easier than accessing the raw WSGI requests. Flask (Werkzeug specifically) figures out which function to call for a particular url. Each request handling function is registered into Flask's routing table by using function annotations above it. The request handling functions are always at the bottom of the file for a particular youtube page (channel, watch, playlist, etc.), and they're where you want to look to see how the response gets constructed for a particular url. Miscellaneous request handlers that don't belong anywhere else are located in `__init__.py`, which is where the `yt_app` object is instantiated. -* The actual html for yt-local is generated using Jinja templates. Jinja lets you embed a Python-like language inside html files so you can use constructs such as for loops to construct the html for a list of 30 videos given a dictionary with information for those videos. Jinja is included as part of Flask. It has some annoying differences from Python in a lot of details, so check the [docs here](https://jinja.palletsprojects.com/en/2.11.x/) when you use it. The request handling functions will pass the information that has been scraped from Youtube into these templates for the final result. +* The actual html for yt-local is generated using Jinja templates. Jinja lets you embed a Python-like language inside html files so you can use constructs such as for loops to construct the html for a list of 30 videos given a dictionary with information for those videos. Jinja is included as part of Flask. It has some annoying differences from Python in a lot of details, so check the [docs here](https://jinja.palletsprojects.com/en/2.11.x/) when you use it. The request handling functions will pass the information that has been scraped from YouTube into these templates for the final result. * The project uses the gevent library for parallelism (such as for launching requests in parallel), as opposed to using the async keyword. ## util.py -* util.py is a grab-bag of miscellaneous things; admittedly I need to get around to refactoring it. The biggest thing it has is the `fetch_url` function which is what I use for sending out requests for Youtube. The Tor routing is managed here. `fetch_url` will raise an a `FetchError` exception if the request fails. The parameter `debug_name` in `fetch_url` is the filename that the response from Youtube will be saved to if the hidden debugging option is enabled in settings.txt. So if there's a bug when Youtube changes something, you can check the response from Youtube from that file. +* util.py is a grab-bag of miscellaneous things; admittedly I need to get around to refactoring it. The biggest thing it has is the `fetch_url` function which is what I use for sending out requests for YouTube. The Tor routing is managed here. `fetch_url` will raise an a `FetchError` exception if the request fails. The parameter `debug_name` in `fetch_url` is the filename that the response from YouTube will be saved to if the hidden debugging option is enabled in settings.txt. So if there's a bug when YouTube changes something, you can check the response from YouTube from that file. ## Data extraction - protobuf, polymer, and yt_data_extract -* proto.py is used for generating what are called ctokens needed when making requests to Youtube. These ctokens use Google's [protobuf](https://developers.google.com/protocol-buffers) format. Figuring out how to generate these in new instances requires some reverse engineering. I have a messy python file I use to make this convenient which you can find under ./youtube/proto_debug.py +* proto.py is used for generating what are called ctokens needed when making requests to YouTube. These ctokens use Google's [protobuf](https://developers.google.com/protocol-buffers) format. Figuring out how to generate these in new instances requires some reverse engineering. I have a messy python file I use to make this convenient which you can find under ./youtube/proto_debug.py -* The responses from Youtube are in a JSON format called polymer (polymer is the name of the 2017-present Youtube layout). The JSON consists of a bunch of nested dictionaries which basically specify the layout of the page via objects called renderers. A renderer represents an object on a page in a similar way to html tags; the renders often contain renders inside them. The Javascript on Youtube's page translates this JSON to HTML. Example: `compactVideoRenderer` represents a video item in you can click on such as in the related videos (so these are called "items" in the codebase). This JSON is very messy. You'll need a JSON prettifier or something that gives you a tree view in order to study it. +* The responses from YouTube are in a JSON format called polymer (polymer is the name of the 2017-present YouTube layout). The JSON consists of a bunch of nested dictionaries which basically specify the layout of the page via objects called renderers. A renderer represents an object on a page in a similar way to html tags; the renders often contain renders inside them. The Javascript on YouTube's page translates this JSON to HTML. Example: `compactVideoRenderer` represents a video item in you can click on such as in the related videos (so these are called "items" in the codebase). This JSON is very messy. You'll need a JSON prettifier or something that gives you a tree view in order to study it. -* `yt_data_extract` is a module that parses this this raw JSON page layout and extracts the useful information from it into a standardized dictionary. So for instance, it can take the raw JSON response from the watch page and return a dictionary containing keys such as `title`, `description`,`related_videos (list)`, `likes`, etc. This module contains a lot of abstractions designed to make parsing the polymer format easier and more resilient towards changes from Youtube. (A lot of Youtube extractors just traverse the JSON tree like `response[1]['response']['continuation']['gridContinuationRenderer']['items']...` but this tends to break frequently when Youtube changes things.) If it fails to extract a piece of data, such as the like count, it will place `None` in that entry. Exceptions are not used in this module. So it uses functions which return None if there's a failure, such as `deep_get(response, 1, 'response', 'continuation', 'gridContinuationRenderer', 'items')` which returns None if any of those keys aren't present. The general purpose abstractions are located in `common.py`, while the functions for parsing specific responses (watch page, playlist, channel, etc.) are located in `watch_extraction.py` and `everything_else.py`. +* `yt_data_extract` is a module that parses this this raw JSON page layout and extracts the useful information from it into a standardized dictionary. So for instance, it can take the raw JSON response from the watch page and return a dictionary containing keys such as `title`, `description`,`related_videos (list)`, `likes`, etc. This module contains a lot of abstractions designed to make parsing the polymer format easier and more resilient towards changes from YouTube. (A lot of YouTube extractors just traverse the JSON tree like `response[1]['response']['continuation']['gridContinuationRenderer']['items']...` but this tends to break frequently when YouTube changes things.) If it fails to extract a piece of data, such as the like count, it will place `None` in that entry. Exceptions are not used in this module. So it uses functions which return None if there's a failure, such as `deep_get(response, 1, 'response', 'continuation', 'gridContinuationRenderer', 'items')` which returns None if any of those keys aren't present. The general purpose abstractions are located in `common.py`, while the functions for parsing specific responses (watch page, playlist, channel, etc.) are located in `watch_extraction.py` and `everything_else.py`. -* Most of these abstractions are self-explanatory, except for `extract_items_from_renderer`, a function that performs a recursive search for the specified renderers. You give it a renderer which contains nested renderers, and a set of the renderer types you want to extract (by default, these are the video/playlist/channel preview items). It will search through the nested renderers and gather the specified items, in addition to the continuation token (ctoken) for the last list of items it finds if there is one. Using this function achieves resiliency against Youtube rearranging the items into a different hierarchy. +* Most of these abstractions are self-explanatory, except for `extract_items_from_renderer`, a function that performs a recursive search for the specified renderers. You give it a renderer which contains nested renderers, and a set of the renderer types you want to extract (by default, these are the video/playlist/channel preview items). It will search through the nested renderers and gather the specified items, in addition to the continuation token (ctoken) for the last list of items it finds if there is one. Using this function achieves resiliency against YouTube rearranging the items into a different hierarchy. * The `extract_items` function is similar but works on the response object, automatically finding the appropriate renderer to call `extract_items_from_renderer` on. @@ -55,7 +55,7 @@ * Since I can't anticipate the things that will trip up beginners to the codebase, if you spend awhile figuring something out, go ahead and make a pull request adding a brief description of your findings to this document to help other beginners. ## Development tips -* When developing functionality to interact with Youtube in new ways, you'll want to use the network tab in your browser's devtools to inspect which requests get made under normal usage of Youtube. You'll also want a tool you can use to construct custom requests and specify headers to reverse engineer the request format. I use the [HeaderTool](https://github.com/loreii/HeaderTool) extension in Firefox, but there's probably a more streamlined program out there. +* When developing functionality to interact with YouTube in new ways, you'll want to use the network tab in your browser's devtools to inspect which requests get made under normal usage of YouTube. You'll also want a tool you can use to construct custom requests and specify headers to reverse engineer the request format. I use the [HeaderTool](https://github.com/loreii/HeaderTool) extension in Firefox, but there's probably a more streamlined program out there. * You'll want to have a utility or IDE that can perform full text search on a repository, since this is crucial for navigating unfamiliar codebases to figure out where certain strings appear or where things get defined. -- cgit v1.2.3