batch-file enumeration improvements (https://github.com/ytdl-org/youtube-dl/pull/26813)

Co-authored by: glenn-slayden

Modified from https://github.com/ytdl-org/youtube-dl/pull/26813/commits/c9a9ccf8a35e157e22afeaafc2851176ddd87e68

These improvements apply to reading the list of URLs from the file supplied via the `--batch-file` (`-a`) command line option.

1. Skip blank and empty lines in the file. Currently, lines with leading whitespace are only skipped when that whitespace is followed by a comment character (`#`, `;`, or `]`). This means that empty lines and lines consisting only of whitespace are returned as (trimmed) empty strings in the list of URLs to process.

2. [bug fix] Detect and remove the Unicode BOM when the file descriptor is already decoding Unicode. With Python 3, the `batch_fd` enumerator returns the lines of the file as Unicode. For UTF-8, this means that the raw BOM bytes from the file (`\xef\xbb\xbf`) show up converted into a single `\ufeff` character prefixed to the first enumerated text line.

   This fix resolves several buggy interactions between the presence of a BOM, the skipping of comments and/or blank lines, and the consistent trimming of the list of URLs. For example, if the first line of the file is blank, the BOM is incorrectly returned as a URL standing alone. If the first line contains a URL, it will be prefixed with this unwanted single character, and the presence of that character will also have prevented any leading whitespace from being trimmed.

   Currently, the `UnicodeBOMIE` helper attempts to recover from some of these error cases, but this fix prevents the error from happening in the first place (at least on Python 3). In any case, the `UnicodeBOMIE` approach is flawed, because it is illogical for a BOM to appear in the (non-batch) URL(s) specified directly on the command line (or, for that matter, in URLs *after the first line* of a batch list).

3. Add trimming of trailing comments introduced by " #" (whitespace followed by `#`) to the `read_batch_urls` processing, so that the URLs it enumerates are cleaned and trimmed more consistently.
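The BOM behavior described in item 2 can be reproduced in isolation. The following is a minimal sketch, assuming Python 3 and an in-memory byte stream standing in for a batch file; the sample URL and the variable names are made up for illustration and are not part of the patch.

```python
import codecs
import io

# Hypothetical batch file content: a UTF-8 BOM followed by one URL.
raw = codecs.BOM_UTF8 + b"https://example.com/video\n"

# Decoding as plain UTF-8 keeps the BOM as a single U+FEFF character
# at the start of the first line.
first_line = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8").readline()
has_bom = first_line.startswith("\ufeff")

# U+FEFF is not whitespace, so str.strip() neither removes it nor
# reaches the spaces that follow it.
padded = "\ufeff  https://example.com/video"
stripped = padded.strip()

# For comparison, the 'utf-8-sig' codec drops the BOM while decoding.
clean_line = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig").readline()
```

Here `has_bom` is true and `stripped` is identical to `padded`, which is why the BOM has to be removed before any whitespace trimming takes place.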
author: pukkandan <pukkandan@gmail.com> 2021-01-09 18:08:03 +0530
committer: pukkandan <pukkandan@gmail.com> 2021-01-09 18:08:03 +0530
commit: 8c04f0be96399cf23d092b286574f48d768783da (patch)
tree: 026b81d9ba83c49e54e5abeec23ba6267335e2ba /youtube_dlc/utils.py
parent: ab8e5e516f38c3eab8947614e2347a2473e5dbbc (diff)
download: hypervideo-pre-8c04f0be96399cf23d092b286574f48d768783da.tar.lz
hypervideo-pre-8c04f0be96399cf23d092b286574f48d768783da.tar.xz
hypervideo-pre-8c04f0be96399cf23d092b286574f48d768783da.zip
1 files changed, 9 insertions, 6 deletions
diff --git a/youtube_dlc/utils.py b/youtube_dlc/utils.py
index 586ad4150..ae293589b 100644
--- a/youtube_dlc/utils.py
+++ b/youtube_dlc/utils.py
@@ -3892,13 +3892,16 @@ def read_batch_urls(batch_fd):
     def fixup(url):
         if not isinstance(url, compat_str):
             url = url.decode('utf-8', 'replace')
-        BOM_UTF8 = '\xef\xbb\xbf'
-        if url.startswith(BOM_UTF8):
-            url = url[len(BOM_UTF8):]
-        url = url.strip()
-        if url.startswith(('#', ';', ']')):
+        BOM_UTF8 = ('\xef\xbb\xbf', '\ufeff')
+        for bom in BOM_UTF8:
+            if url.startswith(bom):
+                url = url[len(bom):]
+        url = url.lstrip()
+        if not url or url.startswith(('#', ';', ']')):
             return False
-        return url
+        # "#" cannot be stripped out since it is part of the URI
+        # However, it can be safely stripped out if following a whitespace
+        return re.split(r'\s#', url, 1)[0].rstrip()
 
     with contextlib.closing(batch_fd) as fd:
         return [url for url in map(fixup, fd) if url]
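To see the combined effect of the patched `fixup`, the logic can be approximated outside the code base. This is a sketch under stated assumptions, not the project's implementation: it targets Python 3 only, replaces `compat_str` with a plain `bytes` check, uses the hypothetical name `read_batch_urls_sketch`, and feeds it made-up sample lines.

```python
import io
import re


def read_batch_urls_sketch(batch_fd):
    """Standalone approximation of the patched read_batch_urls (Python 3 only)."""
    def fixup(url):
        if isinstance(url, bytes):
            url = url.decode('utf-8', 'replace')
        # Drop a BOM, whether it arrived as mis-decoded bytes or as U+FEFF.
        for bom in ('\xef\xbb\xbf', '\ufeff'):
            if url.startswith(bom):
                url = url[len(bom):]
        url = url.lstrip()
        # Skip blank lines and comment lines.
        if not url or url.startswith(('#', ';', ']')):
            return False
        # Cut a trailing comment only when '#' follows whitespace,
        # so URL fragments such as '/a#fragment' survive.
        return re.split(r'\s#', url, maxsplit=1)[0].rstrip()

    with batch_fd as fd:
        return [url for url in map(fixup, fd) if url]


# Made-up batch file content exercising each case from the commit message.
sample = io.StringIO(
    '\ufeff\n'
    '   \n'
    '# a comment line\n'
    '  https://example.com/a#fragment  \n'
    'https://example.com/b # trailing comment\n'
)
urls = read_batch_urls_sketch(sample)
```

With this input, the BOM-only first line, the whitespace-only line, and the comment line are all dropped, the fragment in the first URL is kept, and the trailing comment on the second URL is removed.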