[tin-dev] URL_REGEX update
Urs Janßen
urs at tin.org
Tue Sep 20 14:27:25 CEST 2022
The following update to URL_REGEX should stop it from capturing illegal
path components (non-ASCII characters), as in the second URL below, and
takes into account that non-punycode TLDs with 18 characters exist *sigh*
<https://micha.freeshell.org/tmp/An-URI-Test.html>
<https://micha.freeshell.org/tmp/Αn-URI-Test.html>
=== modified file 'include/tin.h'
--- include/tin.h 2022-08-29 13:27:10 +0000
+++ include/tin.h 2022-09-20 12:06:30 +0000
@@ -707,7 +707,7 @@
* - test IDNA (RFC 3490) case
* - adjust to follow RFC 3986 (section 2.3)
*/
-#define URL_REGEX "\\b(?:https?|ftp|gopher)://(?:[^:@/\\s]*(?::[^:@/\\s]*)?@)?(?:(?:(?:[^\\W_](?:(?:-|[^\\W_]){0,61}(?<!---)[^\\W_])?|xn--[^\\W_](?:-(?!-)|[^\\W_]){1,57}[^\\W_])\\.)+[a-z]{2,14}\\.?|localhost|(?:(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)\\.){3}(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)|\\[(?:(?:[0-9A-F]{0,4}:){1,7}[0-9A-F]{1,4}|(?:[0-9A-F]{0,4}:){1,3}(?:(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)\\.){3}(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?))\\])(?::\\d+)?(?(?=[^\\)\\]\\>\"\\s]*\\()(?:/[^\\]\\>\"\\s]*|$|(?=[)\\]\\>\"\\s]))|(?:/[^)\\]\\>\"\\s]*|$|(?=[)\\]\\>\"\\s])))"
+#define URL_REGEX "\\b(?:https?|ftp|gopher)://(?:[^:@/\\s]*(?::[^:@/\\s]*)?@)?(?:(?:(?:[^\\W_](?:(?:-|[^\\W_]){0,61}(?<!---)[^\\W_])?|xn--[^\\W_](?:-(?!-)|[^\\W_]){1,57}[^\\W_])\\.)+[a-z]{2,18}\\.?|localhost|(?:(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)\\.){3}(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)|\\[(?:(?:[0-9A-F]{0,4}:){1,7}[0-9A-F]{1,4}|(?:[0-9A-F]{0,4}:){1,3}(?:(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)\\.){3}(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?))\\])(?::\\d+)?(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=\$,]*)"
/*
* case insensitive
* TOFO: check against RFC 6068
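
To illustrate the effect of the new trailing path class on the two test
URLs above, here is a minimal sketch. It is not part of the patch and does
not run the regex itself: url_path_chars and url_path_span() are
hypothetical names that only mirror the ASCII class
[-a-zA-Z0-9()@:%_\+.~#?&\/=\$,]* using strspn(). It also assumes that the
first path character of the second URL is a Greek capital alpha (UTF-8
bytes 0xCE 0x91).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of tin: the characters accepted by the
 * trailing class [-a-zA-Z0-9()@:%_\+.~#?&\/=\$,]* of the new URL_REGEX.
 */
static const char url_path_chars[] =
	"-abcdefghijklmnopqrstuvwxyz"
	"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
	"0123456789()@:%_+.~#?&/=$,";

/*
 * Return the number of leading bytes of `path` that the class would
 * consume. Any byte outside the class, including every byte of a
 * multi-byte UTF-8 sequence, ends the span.
 */
static size_t
url_path_span(const char *path)
{
	assert(path != NULL);
	return strspn(path, url_path_chars);
}
```

With this, url_path_span("/tmp/An-URI-Test.html") covers the whole path,
while url_path_span("/tmp/\xCE\x91n-URI-Test.html") stops right after
"/tmp/", so under the assumption above the second URL would be captured
only up to <https://micha.freeshell.org/tmp/>.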