author: Andrea Pappacoda <andrea@pappacoda.it> 2023-01-28 14:57:53 +0100
committer: Robin Jarry <robin@jarry.cc> 2023-01-28 23:06:59 +0100
commit: 406e4750bd0d8ec756e4b5e28eefc46839a7f230
tree: 56e13c3485be14c4ddbd28788352158551d614ba /filters
parent: 89f01264cd0b2f84f113772a3d7f2751a9ed222d
filters: make colorize URL regex more strict
The previous URL regex was too lax, allowing all "[:graph:]" characters after the protocol:// part. As a result, the filter also marked characters such as ">" as part of a URL, even though ">" is commonly used as a URL delimiter in plain text and Markdown. The urls() function tried to account for this with a heuristic that removes trailing characters, but it didn't always work (see the screenshots below).

As RFC 3986 specifies the list of characters allowed in URLs, we can simply make our regex stricter and only mark characters as part of a URL if they belong to the allowed set. Since fewer characters are now allowed, the aforementioned heuristic has been slightly simplified.

I've also removed the backslash escapes from the bracket expressions, as POSIX regular expressions do not require them. The only characters that need special handling are ']' and '-', which need to be placed at the start and at the end of the expression, respectively.

Signed-off-by: Andrea Pappacoda <andrea@pappacoda.it>
Acked-by: Robin Jarry <robin@jarry.cc>
Diffstat (limited to 'filters')
-rw-r--r-- filters/colorize.c | 8
1 files changed, 4 insertions, 4 deletions
diff --git a/filters/colorize.c b/filters/colorize.c
index 17eb548a..dcc486b6 100644
--- a/filters/colorize.c
+++ b/filters/colorize.c
@@ -423,8 +423,8 @@ static void diff_chunk(const char *in)
 }
 
 #define URL_RE \
-	"[a-z]{2,8}://[[:graph:]]{4,}" \
-	"|(mailto:)?[[:alnum:]_\\+\\.~/-]*[[:alnum:]]@[a-z][[:alnum:]\\.-]*[a-z]"
+	"[a-z]{2,8}://[][:alnum:]._~:/?#[@!$&'()*+,;=%-]{4,}" \
+	"|(mailto:)?[[:alnum:]_+.~/-]*[[:alnum:]]@[a-z][[:alnum:].-]*[a-z]"
 static regex_t url_re;
 
 static void urls(const char *in, struct style *ctx)
@@ -442,8 +442,8 @@ static void urls(const char *in, struct style *ctx)
 		trim = 1;
 		while (trim && len > 0) {
 			switch (in[len - 1]) {
-			case '>': case '.': case ',': case ';': case ')':
-			case '!': case '?': case '"': case '\'':
+			case '.': case ',': case ';': case ')':
+			case '!': case '?': case '\'':
 				len--;
 				break;
 			default: