emrahgunduz
always eats his vegetables
Take 2 on UTF8 BOM : Remove BOM with PHP

Some people asked me about my UTF8 BOM problems in PHP and XML post. They were wondering whether the BOM could be removed from their files without damaging them, and whether PHP could do it. They had hundreds of files with a UTF8 BOM in them, and removing it by hand would be far too time consuming.

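My answer was, “of course”. PHP can read any file and strip the BOM from it. We encounter this problem only in text based files, and the UTF8 BOM is always the same three bytes (0xEF 0xBB 0xBF) at the very start of the file, so cutting them off the string will do the trick. Applause for substr().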
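To make the substr() idea concrete, here is a minimal sketch that works on a string instead of a file. The function name stripUtf8Bom is my own invention for this example and is not part of the script in this post.

```php
<?php
// Hypothetical helper (not part of the original script):
// returns $content without a leading UTF-8 BOM, or unchanged if there is none.
function stripUtf8Bom(string $content): string
{
    // The same three bytes that pack("CCC", 0xef, 0xbb, 0xbf) produces
    $bom = "\xEF\xBB\xBF";

    if (strncmp($content, $bom, 3) === 0) {
        // The cast covers older PHP versions, where substr() can return false
        // when nothing is left after the BOM
        return (string) substr($content, 3);
    }

    return $content;
}
```

Only a BOM at the very beginning is touched. The same three bytes anywhere else in the content are left alone.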

At the end of the post, you can find my old BOM PHP code tweaked a little. This time it both finds and removes the UTF8 BOM, and takes these problems out of your life.

Remember

GET A COMPLETE BACKUP OF YOUR FILES BEFORE YOU RUN THIS SCRIPT. Some files and software depend on the BOM to detect the content encoding. I won’t accept any responsibility for how you use the code or what happens with it. So, be careful.

After this paranoid paragraph, here is the refurbished code. Just copy paste it to a text file, save as .php and run. Don’t use notepad. Oh what the hell, use it if you like, this baby will remove BOM from itself too :D

<?php
// Tell me the root folder path.
// You can also try this one
// $HOME = $_SERVER["DOCUMENT_ROOT"];
// Or this
// dirname(__FILE__)
$HOME = dirname(__FILE__);

// Is this a Windows host ? If it is, change this line to $WIN = 1;
$WIN = 0;

// That's all I need
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>UTF8 BOM FINDER and REMOVER</title>
<style>
body { font-size: 10px; font-family: Arial, Helvetica, sans-serif; background: #FFF; color: #000; }
.FOUND { color: #F30; font-size: 14px; font-weight: bold; }
</style>
</head>
<body>
<?php
$BOMBED = array();
RecursiveFolder($HOME);

echo '<h2>These files had UTF8 BOM, but i cleaned them:</h2><p class="FOUND">';
foreach ($BOMBED as $utf) {
  echo $utf . "<br />\n";
}
echo '</p>';

// Recursive finder
function RecursiveFolder($sHOME) {
  global $BOMBED, $WIN;

  $win32 = ($WIN == 1) ? "\\" : "/";
  $folder = dir($sHOME);
  $foundfolders = array();

  // Compare against false, so a file named "0" does not end the loop early
  while (false !== ($file = $folder->read())) {
    if ($file != "." and $file != "..") {
      $path = $sHOME . $win32 . $file;
      if (filetype($path) == "dir") {
        $foundfolders[] = $path;
      } else {
        $content = file_get_contents($path);
        if (SearchBOM($content)) {
          $BOMBED[] = $path;
          // Remove first three bytes (the BOM) from the content
          $content = substr($content, 3);
          // Write to file
          file_put_contents($path, $content);
        }
      }
    }
  }
  $folder->close();

  // Walk the subfolders; RecursiveFolder() takes a single argument
  foreach ($foundfolders as $subfolder) {
    RecursiveFolder($subfolder);
  }
}

// Searching for BOM in files
function SearchBOM($string) {
  return substr($string, 0, 3) == pack("CCC", 0xef, 0xbb, 0xbf);
}
?>
</body>
</html>
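If you want a safety net on top of your full backup, one option is to copy each file aside right before rewriting it. This is only a sketch under my own assumptions: the function name removeBomWithBackup and the ".bak" suffix are made up for this example and are not part of the script above.

```php
<?php
// Hypothetical variation (not part of the original script):
// keeps a copy of the file as "<path>.bak" before the BOM is removed.
// Returns true only if the file had a BOM and was rewritten.
function removeBomWithBackup(string $path): bool
{
    $content = file_get_contents($path);

    // Unreadable file or no BOM at the start: leave the file untouched
    if ($content === false || strncmp($content, "\xEF\xBB\xBF", 3) !== 0) {
        return false;
    }

    // Refuse to modify a file that could not be backed up
    if (!copy($path, $path . '.bak')) {
        return false;
    }

    // Write the content back without its first three bytes
    return file_put_contents($path, (string) substr($content, 3)) !== false;
}
```

The extra copies double the disk usage for the affected files, so remember to delete the .bak files once you are sure nothing depended on the BOM.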

This post's short url is: http://emrg.me/6j

UTF8 BOM problems in PHP and XML

Yep, it’s hell if you’ve got the UTF8 BOM (byte order mark) at the beginning of your PHP or XML files. These files need to send their own headers before anything else. Because the BOM sits in the very first bytes of the file, it goes out as content before the headers, so the headers can no longer be sent properly and unintended errors might occur.

For PHP the error will mostly be “Warning: Cannot modify header information”, and for XML, “XML declaration allowed only at the start of the document”. If you are having header errors in your WordPress (including admin pages), they are most probably caused by a byte order mark in your theme files (first check the functions.php file of your theme).

How can you find and delete the BOM from text files? First, most frameworks and editors include a setting for saving UTF8 files without a BOM. Check your help file, or ask at the forum or helpdesk of the tool you are using. Second, never use Notepad on Windows for development purposes. It inserts the BOM whenever you save your file in UTF8 format.

If you are dealing with hundreds of files, finding and deleting the BOM is time consuming. So here is a PHP file I wrote for finding the files that you’ll need to correct. What this script does is check the first bytes of every file for the BOM, by recursively moving around your home folder and subfolders. Every subfolder and file is checked. After the script ends, it will give you a list of the files that have a BOM.
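Since only the first three bytes matter, the check itself does not need the whole file in memory. Here is a minimal sketch of a lighter check; the function name fileStartsWithUtf8Bom is hypothetical, and treating unreadable files as “no BOM” is my own assumption.

```php
<?php
// Hypothetical helper (not part of the original script):
// reads only the first three bytes of a file and compares them to the UTF-8 BOM.
function fileStartsWithUtf8Bom(string $path): bool
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        // Assumption: a file we cannot open is reported as "no BOM"
        return false;
    }

    $head = fread($handle, 3);
    fclose($handle);

    // Files shorter than three bytes can never match
    return $head === "\xEF\xBB\xBF";
}
```

This keeps memory use flat even when the folder contains large images or archives, because nothing beyond the first three bytes is ever read.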

Before running, check the $HOME line and change it to your home directory. You can also try setting it to the document root or to the directory of the script itself. If you are hosted on a Windows based machine, do not forget to change the $WIN line to 1, as paths on Windows use backslashes instead of forward slashes.
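If you would rather not set the slash type by hand, PHP’s built-in directory iterators build the paths for you on any platform. A rough sketch of the folder walk with them follows; the function name listFilesRecursively is made up for this example, and the script in this post does not use these classes.

```php
<?php
// Hypothetical alternative to the manual $WIN switch (not part of the original script):
// collects the paths of all regular files below $root.
function listFilesRecursively(string $root): array
{
    $files = array();

    // SKIP_DOTS leaves out the "." and ".." entries
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $fileInfo) {
        if ($fileInfo->isFile()) {
            $files[] = $fileInfo->getPathname();
        }
    }

    sort($files);
    return $files;
}
```

Each returned path can then be handed to a BOM check, in the same way the script passes its own paths to SearchBOM().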

PS. Delete the file after use. It prints file paths from your domain’s root and subfolders.
PPS. The PHP file is a resource monster, as it reads every file completely. So try not to run it in your host’s peak hours.

<?php
// Tell me the root folder path.
// You can also try this one
// $HOME = $_SERVER["DOCUMENT_ROOT"];
// Or this
// dirname(__FILE__)
$HOME = dirname(__FILE__);

// Is this a Windows host ? If it is, change this line to $WIN = 1;
$WIN = 0;

// That's all I need
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>UTF8 BOM FINDER</title>
<style>
body { font-size: 10px; font-family: Arial, Helvetica, sans-serif; background: #FFF; color: #000; }
.FOUND { color: #F30; font-size: 14px; font-weight: bold; }
</style>
</head>
<body>
<?php
$BOMBED = array();
RecursiveFolder($HOME);

echo '<h2>These files have UTF8 BOM:</h2><p class="FOUND">';
foreach ($BOMBED as $utf) {
  echo $utf . "<br />\n";
}
echo '</p>';

// Recursive finder
function RecursiveFolder($sHOME) {
  global $BOMBED, $WIN;

  $win32 = ($WIN == 1) ? "\\" : "/";
  $folder = dir($sHOME);
  $foundfolders = array();

  // Compare against false, so a file named "0" does not end the loop early
  while (false !== ($file = $folder->read())) {
    if ($file != "." and $file != "..") {
      $path = $sHOME . $win32 . $file;
      if (filetype($path) == "dir") {
        $foundfolders[] = $path;
      } else {
        // Only report the file, nothing is changed on disk
        if (SearchBOM(file_get_contents($path))) $BOMBED[] = $path;
      }
    }
  }
  $folder->close();

  // Walk the subfolders; RecursiveFolder() takes a single argument
  foreach ($foundfolders as $subfolder) {
    RecursiveFolder($subfolder);
  }
}

// Searching for BOM in files
function SearchBOM($string) {
  return substr($string, 0, 3) == pack("CCC", 0xef, 0xbb, 0xbf);
}
?>
</body>
</html>

This post's short url is: http://emrg.me/6n

Author: Emrah Gunduz