PHP and encodings

It drives me crazy when I import text from various sources across the internet that was written in Word or Pages and is full of what I’m going to call decorative characters (curly quotes, long dashes, ellipses and the like) to make it look nicer. Whichever program they came from, these characters cause problems.
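To see why these characters turn into junk, here's a small sketch. It assumes the usual cause, which is my guess rather than something I've confirmed for every source: the text is UTF-8, but something along the way reads it as Windows-1252. The variable names are just for illustration.

```php
<?php
// The right single quote U+2019 is three bytes in UTF-8: E2 80 99.
$quote = "\u{2019}";
echo bin2hex($quote), "\n"; // e28099

// Assumption: some tool treats those bytes as Windows-1252 (CP1252).
// Each byte then becomes its own character when converted to UTF-8.
$garbled = iconv('CP1252', 'UTF-8', $quote);
echo $garbled, "\n"; // â€™
```

That would explain the 'â€' fragments that show up in imported text: they are the first two bytes of a curly quote or dash, decoded one byte at a time.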

So, here’s some PHP code I use to help clean the text up. It’s still a work in progress, but it seems to work for me.



// Replace typographic ("decorative") characters in $string with plain ASCII.
function cleanText($string)
{
    echo "cleaning"; // debug output

    /*
        Reference notes (not used by the code below).

        // Windows codepage 1252
        "\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
        "\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
        "\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
        "\x91" => "'", // U+0091⇒U+2018 left single quotation mark
        "\x92" => "'", // U+0092⇒U+2019 right single quotation mark
        "\x93" => '"', // U+0093⇒U+201C left double quotation mark
        "\x94" => '"', // U+0094⇒U+201D right double quotation mark
        "\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark

        // Regular Unicode (UTF-8 byte sequences)   // U+0022 quotation mark (")
                                                    // U+0027 apostrophe     (')
        "\xC2\xAB"     => '"', // U+00AB left-pointing double angle quotation mark
        "\xC2\xBB"     => '"', // U+00BB right-pointing double angle quotation mark
        "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
        "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
        "\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
        "\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
        "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
        "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
        "\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
        "\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
        "\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
        "\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
        ". . . "       => " ", // spaced-out ellipsis typed as periods
        "... "         => " ", // three periods followed by a space
        "–"            => "-", // U+2013 en dash

        $find[] = '“'; // left side double smart quote
        $find[] = '‘'; // left side single smart quote
        $find[] = '’'; // right side single smart quote
        $find[] = '…'; // ellipsis
        $find[] = '—'; // em dash
        $find[] = '–'; // en dash
        $find[] = 'â€'; // right side double smart quote (mis-decoded form)
        $find[] = '�'; // left side double smart quote (shows up as the replacement character)
    */

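    // Replacement map. The literal keys only match text that is still UTF-8,
    // so they do nothing once iconv() below has reduced the string to ASCII;
    // the %XX keys target URL-encoded text.
    $chr_map = array(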
    'Â ' => " ",
    '‘' => "'",
    '’' => "'",
    '–' => "-",
    "%E2%80%9D" => '"',
    "%E2%80%9C" => '"',
    "%E2%80%99" => "'", 
    "%E2%80%93" => "-",
    "%60" => "'",
    "%C2%B1" => "'",
    '%C2%B2' => "'",
    '%C2%B3' => "'",
    '%C2%B4' => "'",
    '%C2%B5' => "'",
    '%C2%B6' => "'",
    '%C2%B7' => "'",
    '%C2%B8' => "'",
    '%C2%B9' => "'",
    '%C2%BA' => "'",
    '%C2%BB' => "'",
    '%C2%BC' => "'",
    '%C2%BD' => "'",
    '%C2%BE' => "'",
    '%C2%BF' => "'",
    '%C3%80' => "'",
    '%C3%81' => "'",
    '%C3%82' => "'",
    '%C3%83' => "'",
    '%C3%84' => "'",
    '%C3%85' => "'",
    '%C3%86' => "'",
    '%C3%87' => "'",
    '%C3%88' => "'",
    '%C3%89' => "'",
    '%C3%8A' => "'",
    '%C3%8B' => "'",
    '%C3%8C' => "'",
    '%C3%8D' => "'",
    '%C3%8E' => "'",
    '%C3%8F' => "'",
    '%C3%90' => "'",
    '%C3%91' => "'",
    '%C3%92' => "'",
    '%C3%93' => "'",
    '%C3%94' => "'",
    '%C3%95' => "'",
    '%C3%96' => "'",
    '%C3%97' => "'",
    '%C3%98' => "'",
    '%C3%99' => "'",
    '%C3%9A' => "'",
    '%C3%9B' => "'",
    '%C3%9C' => "'",
    '%C3%9D' => "'",
    '%C3%9E' => "'",
    '%C3%9F' => "'",
    '%C3%A0' => "'",
    '%C3%A1' => "'",
    '%C3%A2' => "'",
    '%C3%A3' => "'",
    '%C3%A4' => "'",
    '%C3%A5' => "'",
    '%C3%A6' => "'",
    '%C3%A7' => "'",
    '%C3%A8' => "'",
    '%C3%A9' => "'",
    '%C3%AA' => "'",
    '%C3%AB' => "'",
    '%C3%AC' => "'",
    '%C3%AD' => "'",
    '%C3%AE' => "'",
    '%C3%AF' => "'",
    '%C3%B0' => "'",
    '%C3%B1' => "'",
    '%C3%B2' => "'",
    '%C3%B3' => "'",
    '%C3%B4' => "'",
    '%C3%B5' => "'",
    '%C3%B6' => "'",
    '%C3%B7' => "'",
    '%C3%B8' => "'",
    '%C3%B9' => "'",
    '%C3%BA' => "'",
    '%C3%BB' => "'",
    '%C3%BC' => "'",
    '%C3%BD' => "'",
    '%C3%BE' => "'",
    '%C3%BF' => "'", 
    "'%C2%80%C2%27" => '%27' 
    );
    

    // Note: for efficiency, these two arrays could be built once outside the function.
    $chr = array_keys  ($chr_map); // search strings
    $rpl = array_values($chr_map); // replacements, in the same order

    

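    $oldstring = $string;
    // Transliterate to ASCII. iconv() returns false on failure, so keep the
    // original text in that case instead of losing it.
    $converted = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
    if ($converted !== false)
    {
        $string = $converted;
    }

    // Strict comparison: != would treat numeric-looking strings as equal.
    if ($oldstring !== $string)
    {
        echo "They are different 1 UTF-8: \n\n";
        echo $oldstring . "\n\n";
        echo $string . "\n\n";
    }

    $oldstring = $string;
    $string = str_replace($chr, $rpl, $string);

    if ($oldstring !== $string)
    {
        echo "They are different: \n";
        echo $oldstring . "\n";
        echo $string . "\n";
    }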
 
  return $string;
}
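
A possible variation, sketched under my own assumptions rather than taken from the function above: map the typographic characters to ASCII while the text is still UTF-8, using strtr() with an array. The helper name normalizePunctuation and the particular characters in the map are hypothetical choices for illustration.

```php
<?php
// Hypothetical helper, not part of cleanText(): maps a few common
// typographic characters to ASCII while the string is still UTF-8.
function normalizePunctuation(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quotation mark
        "\u{2019}" => "'",   // right single quotation mark
        "\u{201C}" => '"',   // left double quotation mark
        "\u{201D}" => '"',   // right double quotation mark
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // horizontal ellipsis
        "\u{00A0}" => ' ',   // no-break space
    ];

    // With an array argument, strtr() replaces every key it finds and
    // does not search again inside text it has already replaced.
    return strtr($text, $map);
}

echo normalizePunctuation("\u{201C}It\u{2019}s 5\u{2013}10 pages\u{2026}\u{201D}"), "\n";
// "It's 5-10 pages..."
```

Doing this before any transliteration means the result doesn't depend on how the local iconv build chooses to render each character.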

