Tuesday 29 November 2011

URL encoding

Today I was doing more work on my photo website.

I wanted to find the regex that WordPress uses to convert page / post titles into permalinks (also known as 'slugs'). However, despite much googling and searching through the source code (the WordPress codebase is too large to search easily), I didn't find anything.

So instead I made my own:

/**
 * Remove apostrophes (e.g. don't becomes dont) and replace each run of other
 * punctuation or whitespace with a single dash. The characters . _ ~ are left alone.
 * @param string $url The string to be encoded
 * @return string The encoded string
 */
function myurlencode($url){
    return preg_replace('/[!"#$%&\'()*+,\/:;<=>\-?@\\\[\]^`{|}\s]+/', '-', str_replace("'", '', $url));
}

Unless I've made a mistake (quite possible), this should replace any ASCII punctuation or whitespace with a dash (apart from ., _ and ~, which are left alone), with runs being collapsed to a single dash, e.g. 'hel!o - there' should become 'hel-o-there'. I also elected to change words like don't to dont rather than don-t or don%27t. So this should allow me to use RFC 3987 compatible IRIs/URLs without using any percent encoding at all (since the characters that need percent encoding are removed or converted to dashes).

Because &, <, >, ", and ' are removed or converted to dashes, the URL also doesn't need to be run through htmlspecialchars before being printed as part of a web page or XML document. The function doesn't deal with every character that RFC 3987 disallows in URLs, but the ones it misses are control characters or reserved blocks, and there is virtually no chance of those appearing in any string I run through the function.
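To make the last two paragraphs a bit more concrete, here is a small sketch that runs a few titles through myurlencode and checks whether htmlspecialchars would change the result. The function is copied from above so the snippet runs on its own; the sample titles are made up for illustration.

```php
<?php
// Copy of myurlencode() from above so the example runs on its own.
function myurlencode($url){
    return preg_replace('/[!"#$%&\'()*+,\/:;<=>\-?@\\\[\]^`{|}\s]+/', '-', str_replace("'", '', $url));
}

// Sample titles are invented for illustration only.
$titles = array(
    "hel!o - there",
    "Don't stop",
    "Cats & Dogs: a guide",
    "Tom & Jerry's <\"best\"> bits",
    "Hello, world!",   // trailing punctuation leaves a trailing dash
);

foreach ($titles as $title) {
    $slug = myurlencode($title);
    // htmlspecialchars() should have nothing left to escape in the slug.
    $unchanged = (htmlspecialchars($slug, ENT_QUOTES, 'UTF-8') === $slug);
    echo $slug, ' ', ($unchanged ? 'html-safe' : 'needs escaping'), "\n";
}
```

One thing the last title shows is that punctuation at the end of a title leaves a trailing dash (Hello-world-). If that matters, wrapping the result in trim($slug, '-') would be one way to tidy it up.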

After some help on the SitePoint forums, I managed to get a regex that should encode a URL correctly per RFC 3987. I tested it against a more simplistic str_replace function and another regex, neither of which bothers to encode the later blocks (which tend to be reserved / not used). All three functions encode control characters and should be good enough (I think) to make a URL comply with RFC 3987, similar to how rawurlencode works for RFC 3986.
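As a rough illustration of the difference between the two kinds of encoding, the sketch below compares rawurlencode with a cut-down IRI-style encoder. The name iri_encode_sketch is hypothetical, and as a simplifying assumption it only allows the RFC 3987 ranges up to U+FFEF, so characters beyond that would still be percent encoded.

```php
<?php
// Hypothetical, simplified IRI-style encoder: percent-encodes everything except
// ASCII unreserved characters and the RFC 3987 ranges up to U+FFEF.
function iri_encode_sketch(string $text): string
{
    $result = preg_replace_callback(
        '/[^0-9a-zA-Z\-._~\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}]+/u',
        function (array $m): string {
            // Each run of disallowed characters is percent-encoded as a whole.
            return rawurlencode($m[0]);
        },
        $text
    );
    if ($result === null) {
        // With the /u modifier, preg functions return null on invalid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

$sample = "café 好/x";
echo rawurlencode($sample), "\n";       // every non-ASCII byte is percent-encoded
echo iri_encode_sketch($sample), "\n";  // only the space and the slash are
```

For this sample, rawurlencode gives caf%C3%A9%20%E5%A5%BD%2Fx, while the IRI-style version keeps é and 好 readable and gives café%20好%2Fx.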

function iriencode($url){
 $notiunreserved = array("\x25","\x0","\x1","\x2","\x3","\x4","\x5","\x6","\x7","\x8","\x9","\x0a","\x0b","\x0c","\x0d","\x0e","\x0f","\x10","\x11","\x12","\x13","\x14","\x15","\x16","\x17","\x18","\x19","\x1a","\x1b","\x1c","\x1d","\x1e","\x1f","\x20","\x21","\x22","\x23","\x24","\x26","\x27","\x28","\x29","\x2a","\x2b","\x2c","\x2f","\x3a","\x3b","\x3c","\x3d","\x3e","\x3f","\x40","\x5b","\x5c","\x5d","\x5e","\x60","\x7b","\x7c","\x7d","\x7f","\xc2\x80","\xc2\x81","\xc2\x82","\xc2\x83","\xc2\x84","\xc2\x85","\xc2\x86","\xc2\x87","\xc2\x88","\xc2\x89","\xc2\x8a","\xc2\x8b","\xc2\x8c","\xc2\x8d","\xc2\x8e","\xc2\x8f","\xc2\x90","\xc2\x91","\xc2\x92","\xc2\x93","\xc2\x94","\xc2\x95","\xc2\x96","\xc2\x97","\xc2\x98","\xc2\x99","\xc2\x9a","\xc2\x9b","\xc2\x9c","\xc2\x9d","\xc2\x9e","\xc2\x9f","\xef\xbf\xb0","\xef\xbf\xb1","\xef\xbf\xb2","\xef\xbf\xb3","\xef\xbf\xb4","\xef\xbf\xb5","\xef\xbf\xb6","\xef\xbf\xb7","\xef\xbf\xb8","\xef\xbf\xb9","\xef\xbf\xba","\xef\xbf\xbb","\xef\xbf\xbc","\xef\xbf\xbd");
 $notiunreservedEncoded = array('%25','%00','%01','%02','%03','%04','%05','%06','%07','%08','%09','%0A','%0B','%0C','%0D','%0E','%0F','%10','%11','%12','%13','%14','%15','%16','%17','%18','%19','%1A','%1B','%1C','%1D','%1E','%1F','%20','%21','%22','%23','%24','%26','%27','%28','%29','%2A','%2B','%2C','%2F','%3A','%3B','%3C','%3D','%3E','%3F','%40','%5B','%5C','%5D','%5E','%60','%7B','%7C','%7D','%7F','%C2%80','%C2%81','%C2%82','%C2%83','%C2%84','%C2%85','%C2%86','%C2%87','%C2%88','%C2%89','%C2%8A','%C2%8B','%C2%8C','%C2%8D','%C2%8E','%C2%8F','%C2%90','%C2%91','%C2%92','%C2%93','%C2%94','%C2%95','%C2%96','%C2%97','%C2%98','%C2%99','%C2%9A','%C2%9B','%C2%9C','%C2%9D','%C2%9E','%C2%9F','%EF%BF%B0','%EF%BF%B1','%EF%BF%B2','%EF%BF%B3','%EF%BF%B4','%EF%BF%B5','%EF%BF%B6','%EF%BF%B7','%EF%BF%B8','%EF%BF%B9','%EF%BF%BA','%EF%BF%BB','%EF%BF%BC','%EF%BF%BD');
 return str_replace($notiunreserved, $notiunreservedEncoded, $url);
}
function preg_iriencode($url){
 // A callback is used rather than the /e modifier, which mangles quotes and was removed in PHP 7
 return preg_replace_callback('/[^0-9a-zA-Z\-._~\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}]+/u', function($m){ return rawurlencode($m[0]); }, $url);
}
function preg_iriencode_basic($url){
    // Passes every run of characters up to U+009F through rawurlencode, which leaves unreserved ASCII as it is
    return preg_replace_callback('/[\x{0000}-\x{009F}]+/u', function($m){ return rawurlencode($m[0]); }, $url);
}
$i=0;
$url = 'Exclamation!Question?NBSP Newline
Atsign@Tab Hyphen-Plus+Tilde~好';
$iriencode = 0;
$preg_iriencode = 0;
$preg_iriencode_basic = 0;
$methods = array('iriencode','preg_iriencode','preg_iriencode_basic');
while($i<500){
    shuffle($methods);
    foreach($methods as $method){
        $start = microtime(true);
        $method($url);
        $end=microtime(true);
        $$method+=($end-$start);
    }
    $i++;
}
foreach($methods as $method){
    echo $method.' '.number_format($$method/500, 30)."\n";
}

Example results (average seconds per call):

  • preg_iriencode 0.000042529106140136717665797828
  • preg_iriencode_basic 0.000018751144409179686578428153
  • iriencode 0.000083773136138916016709202172

I did try a 50,000 iteration loop, but the str_replace function's performance dropped off a lot. With a 500 iteration loop, performance is similar to a single run. I will be sticking with my myurlencode function though, at least until I find some problem with it.
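Since the numbers seem to depend on how many times the loop runs, another way to take the measurement is to time a whole batch of calls at once instead of adding up lots of tiny intervals. The sketch below is only one possible approach: mean_seconds_per_call is a hypothetical helper, it assumes PHP 7.3 or later (for hrtime), and it uses the built-in rawurlencode and urlencode as stand-ins so that it runs on its own.

```php
<?php
// Hypothetical helper: mean seconds per call of $fn over $runs iterations,
// timing the whole batch once rather than summing many small intervals.
function mean_seconds_per_call(callable $fn, string $input, int $runs): float
{
    // A few untimed warm-up calls so one-off setup costs are not counted.
    for ($i = 0; $i < 10; $i++) {
        $fn($input);
    }

    $start = hrtime(true);              // monotonic clock, in nanoseconds
    for ($i = 0; $i < $runs; $i++) {
        $fn($input);
    }
    $elapsedNs = hrtime(true) - $start;

    return $elapsedNs / 1e9 / $runs;    // nanoseconds -> seconds, per call
}

// Made-up sample string; the built-in encoders stand in for the real functions.
$sample = "Exclamation!Question? Atsign@ Hyphen-Plus+Tilde~";
foreach (array('rawurlencode', 'urlencode') as $name) {
    printf("%-14s %.9f s/call\n", $name, mean_seconds_per_call($name, $sample, 50000));
}
```

Any of the function names from the benchmark above could be passed in place of the built-ins. I haven't checked whether this changes the ranking of the three functions.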
