[tin-dev] URL_REGEX update

Urs Janßen urs at tin.org
Tue Sep 20 14:27:25 CEST 2022


The following update for the URL_REGEX should stop capturing illegal path
components (non ascii chars) like in the 2nd url and take into account that
non punycode TLDs exists with 18 chars *sigh*

<https://micha.freeshell.org/tmp/An-URI-Test.html>
<https://micha.freeshell.org/tmp/Αn-URI-Test.html>

=== modified file 'include/tin.h'
--- include/tin.h	2022-08-29 13:27:10 +0000
+++ include/tin.h	2022-09-20 12:06:30 +0000
@@ -707,7 +707,7 @@
  *       - test IDNA (RFC 3490) case
  *       - adjust to follow RFC 3986 (section 2.3)
  */
-#define URL_REGEX	"\\b(?:https?|ftp|gopher)://(?:[^:@/\\s]*(?::[^:@/\\s]*)?@)?(?:(?:(?:[^\\W_](?:(?:-|[^\\W_]){0,61}(?<!---)[^\\W_])?|xn--[^\\W_](?:-(?!-)|[^\\W_]){1,57}[^\\W_])\\.)+[a-z]{2,14}\\.?|localhost|(?:(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)\\.){3}(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)|\\[(?:(?:[0-9A-F]{0,4}:){1,7}[0-9A-F]{1,4}|(?:[0-9A-F]{0,4}:){1,3}(?:(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)\\.){3}(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?))\\])(?::\\d+)?(?(?=[^\\)\\]\\>\"\\s]*\\()(?:/[^\\]\\>\"\\s]*|$|(?=[)\\]\\>\"\\s]))|(?:/[^)\\]\\>\"\\s]*|$|(?=[)\\]\\>\"\\s])))"
+#define URL_REGEX	"\\b(?:https?|ftp|gopher)://(?:[^:@/\\s]*(?::[^:@/\\s]*)?@)?(?:(?:(?:[^\\W_](?:(?:-|[^\\W_]){0,61}(?<!---)[^\\W_])?|xn--[^\\W_](?:-(?!-)|[^\\W_]){1,57}[^\\W_])\\.)+[a-z]{2,18}\\.?|localhost|(?:(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)\\.){3}(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)|\\[(?:(?:[0-9A-F]{0,4}:){1,7}[0-9A-F]{1,4}|(?:[0-9A-F]{0,4}:){1,3}(?:(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?)\\.){3}(?:2[0-4]\\d|25[0-5]|[01]?\\d\\d?))\\])(?::\\d+)?(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=\$,]*)"
 /*
  * case insensitive
  * TOFO: check against RFC 6068



More information about the tin-dev mailing list