Saving bandwidth and CPU: caching output in PHP
After reading Dave Child's excellent article "Caching output in PHP" and the comments on that page,
I tried to rewrite the code with some of the suggested improvements.
I also looked at the suggestions in Omar AlBadri's "Better Php Caching"
and at others found at php.net, and of course added some ideas of my own.
Improvements
- If possible use the browser cache.
- Compress the cachefile with PHP
If a cachefile is served 100 times, then gzip has to compress it 100 times, wasting CPU.
So it's better to disable gzip (for the cached files only) and compress the cachefile with PHP.
In the cache directory create a file .htaccess with the following line in it:
SetEnv no-gzip dont-vary
- Handle race conditions and solve the multiple non cached request problem.
The basic idea behind the code from Omar AlBadri is a very smart one: if there are 6 visitors at the same time and the cachefile has expired, serve it anyway to the last 5 of them, don't run your program more than necessary.
- Only cache GET requests.
- Make it possible to exclude a PHP file from caching by putting a special character (or sequence of characters) in its name. With the marker "__", the file "__cacl.php" will not be cached.
- The cache is created even if the visitor leaves the page.
- Optimized the program for speed and fixed some bugs.
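The race-condition idea from the list above can be tried out in isolation. This is a minimal sketch, not the code used later in this article: the helper name try_rebuild_lock_bc() and the temporary file are made up for the example, and it assumes flock() behaves as on a typical Linux host, where a second handle on the same file cannot take an exclusive lock while the first one holds it.

```php
<?php
// Hypothetical helper: try to become the one request that rebuilds the cache.
// Returns the locked file handle, or false if another request holds the lock.
function try_rebuild_lock_bc($cachefile)
{
    $fp = fopen($cachefile, 'ab'); // 'ab' creates the file if needed and never truncates it
    if ($fp === false) {
        return false;
    }
    if (flock($fp, LOCK_EX | LOCK_NB)) { // LOCK_NB: do not wait for the lock
        return $fp;
    }
    fclose($fp);
    return false;
}

$demo_file = tempnam(sys_get_temp_dir(), 'bc_'); // throwaway file for the demo
$first  = try_rebuild_lock_bc($demo_file); // first visitor: asks for the lock
$second = try_rebuild_lock_bc($demo_file); // second visitor: asks while the first still holds it

echo ($first !== false) ? "first: rebuild\n" : "first: serve old cachefile\n";
echo ($second !== false) ? "second: rebuild\n" : "second: serve old cachefile\n";

// Clean up: release whatever we got and remove the demo file.
if ($second !== false) { flock($second, LOCK_UN); fclose($second); }
if ($first !== false) { flock($first, LOCK_UN); fclose($first); }
unlink($demo_file);
```

The visitor that gets the handle back runs the program and rewrites the cachefile; everybody who gets false is served the old file instead of waiting.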
<?php
// original code
$ignore_page = false;
for ($i = 0; $i < count($ignore_list); $i++) {
$ignore_page = (strpos($page, $ignore_list[$i]) !== false) ? true : $ignore_page;
}
// Basically, it sets $ignore_page to true if the page was found in the ignore_list.
// Once it is set to true it stays true, because the code then effectively reads:
// $ignore_page = (strpos($page, $ignore_list[$i]) !== false) ? true : true;
// Neat trick in only 3 lines of code, HOWEVER!
// 1) If there are 100 items in the ignore_list you calculate 100 times the length of this array
// This is better:
$count = count($ignore_list);
for ($i = 0; $i < $count; $i++) {
// etc.
}
// By the way, every time you find yourself counting the length of an array, consider using a foreach loop.
// 2) If the match is made on the first record the $ignore_page will be true.
// Why process the other parts of the array when they will not change the $ignore_page?
// For small arrays there is hardly a problem but as they grow the extra overhead will become a problem.
// Improved code, yes more lines:
$ignore_page = false;
foreach($ignore_list as $file_to_ignore) // no need to calculate the length of the array, native php does it faster than we can
{
if (strpos($page, $file_to_ignore) !== false)
{
$ignore_page = true; // we do this only once
break; // we are done it cannot become more true than true
}
}
// I tested this: the foreach was more than twice as fast.
// The final improvement uses native PHP: more readable, fewer lines and much faster than the foreach.
// Note: in_array() only matches when $page equals an entry exactly, while strpos() also matches substrings.
if( in_array($page,$ignore_list) )
$ignore_page = true;
// For very large arrays you could use array_flip and isset, which is even faster than in_array
$flipped_ignore_list = array_flip($ignore_list);
if ( isset($flipped_ignore_list[$page]) )
$ignore_page = true;
// The real time saving comes when you need to adjust or maintain the code, and the in_array code is by far the most readable.
// The original code used time(); the code below is faster.
if ( isset($_SERVER["REQUEST_TIME"]) ) // usually available, but not always; if so, faster than time()
$now_bc = $_SERVER["REQUEST_TIME"];
else
$now_bc = time();
// The original code used file_exists(); is_file() is faster.
if (is_file($cachefile_bc)) {
// serve or refresh the cachefile here
}
?>
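One thing to keep in mind when swapping the loop for in_array(): the two do not match in the same way. strpos() finds an entry anywhere inside $page, while in_array() only matches when $page equals an entry exactly. A small sketch with a made-up URL shows the difference; the helper ignored_by_substring() is only an illustration and is not part of the code in this article.

```php
<?php
// Hypothetical helper: true if any ignore-list entry occurs somewhere in $page.
function ignored_by_substring($page, array $ignore_list)
{
    foreach ($ignore_list as $file_to_ignore) {
        if (strpos($page, $file_to_ignore) !== false) {
            return true; // first hit is enough
        }
    }
    return false;
}

$ignore_list = array('counter.php', 'search.php');
$page_url  = 'http://example.com/search.php?q=php'; // made-up full URL
$page_name = 'search.php';                          // bare file name

var_dump(ignored_by_substring($page_url, $ignore_list)); // bool(true)
var_dump(in_array($page_url, $ignore_list));             // bool(false)
var_dump(in_array($page_name, $ignore_list));            // bool(true)
```

So in_array() is only a drop-in replacement when $page holds exactly the kind of value that is stored in the list.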
This gives the following program flow:
- Route 1: Saves bandwidth and CPU; we just send a 304 Not Modified header (fastest)
- Route 2: Next best; we serve the already compressed, valid cachefile, saving bandwidth and CPU (fast)
- Route 3: Next best; we serve the already compressed, expired cachefile without waiting for a lock, saving bandwidth and CPU (fast)
- Route 4: We run our program, serve the compressed content and save the compressed cachefile (slow)
- Route 5: We run our program, serve the compressed content and save the compressed cachefile (slow)
Note that routes 3 and 4 only happen when there are several visitors at the same time, so in 99% of the cases route 1, 2 or 5 will happen.
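The decision between the routes can be written down as one small function. This is only a sketch of my reading of the flow: the function name and its four boolean parameters are invented for the example, and because routes 4 and 5 are described identically above, it assumes that route 4 means "expired cachefile and we got the lock" and route 5 means "no cachefile yet".

```php
<?php
// Hypothetical helper that maps the state of a request to one of the five routes.
function choose_route_bc($cache_exists, $cache_valid, $browser_has_copy, $got_lock)
{
    if (!$cache_exists) {
        return 5; // nothing cached yet: run the program and create the cachefile
    }
    if ($cache_valid && $browser_has_copy) {
        return 1; // 304 Not Modified, headers only
    }
    if ($cache_valid) {
        return 2; // serve the compressed cachefile
    }
    // Expired cachefile: rebuild if we got the lock, otherwise serve the old file.
    return $got_lock ? 4 : 3;
}

echo choose_route_bc(true, true, true, false), "\n";   // 1
echo choose_route_bc(true, true, false, false), "\n";  // 2
echo choose_route_bc(true, false, false, false), "\n"; // 3
echo choose_route_bc(true, false, false, true), "\n";  // 4
echo choose_route_bc(false, false, false, true), "\n"; // 5
```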
Structure
begin_cache.php
yourPhpFile.php
end_cache.php
The first, "begin_cache.php" in this case, will run before any other PHP on your site.
The second, "end_cache.php" in this case, runs after normal scripts have run.
The two scripts effectively wrap around your current site.
How to do it:
Method 1
Simply use the require_once() function and add them manually to every script you run.
Advantage: portable, but a lot of work.
<?php
require_once ("path_to_begin_cache.php/begin_cache.php");
//your php code
require_once ("path_to_end_cache.php/end_cache.php");
?>
Method 2
This method relies on adding the following two lines (modified to reflect the correct path to the two PHP files) to your .htaccess file.
It is easy to turn off (just comment out the relevant lines in the .htaccess file with #), but session variables are needed, so it is a lot of work.
php_value auto_prepend_file /full/path/to/begin_cache.php
php_value auto_append_file /full/path/to/end_cache.php
If you use this method, the following variables need to be session variables: $key_bc, $fp_bc, $cachefile_bc, $ignore_page_bc, $now_bc, $cachetime_bc
Method 3
The preferred way: easy to turn off (just comment out the relevant line in the .htaccess file with #) and quick to do.
Add the following line to your .htaccess file:
RewriteRule ^([a-z]+)$ loader.php?p=$1
Create a file loader.php with the following lines in it:
<?php
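$p = isset($_GET["p"]) ? $_GET["p"] : "";
if (!preg_match('/^[a-z]+$/', $p)) // same pattern as the RewriteRule: blocks direct calls with arbitrary paths
die("Not found");
require_once("begin_cache.php");
require_once($p);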
require_once("end_cache.php");
?>
The code in begin_cache.php
<?php
/* begin_cache.php
* programmer RvH date 09-11-2013
*/
if ($_SERVER['REQUEST_METHOD'] === 'GET') // Only process get REQUESTS
{
$cachedir_bc = '/path/to/your/cache/'; // Directory to cache files in (keep outside web root)
$extension_bc = '.htm'; // DOT + Extension to give cached files (usually .cache, .htm, .txt)
$exclude_bc = '__'; // Don't cache files with these markers ( caches t_est_.php but not tes__t.php )
$key_bc = "s3crEt";
$cachetime_bc = 86400; // cache is valid: time in seconds 60*60*24 = 86400 = 1 day
$page_bc = 'http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
error_reporting(0); // expensive
ignore_user_abort(TRUE); // make sure the cachefile gets written to disk
$ignore_list_bc = array( // files not to be cached = ignore list
'counter.php',
'search.php',
);
$cachefile_bc = $cachedir_bc . md5($page_bc) . $extension_bc; // Cache file to either load or create
$ignore_page_bc = (strpos($page_bc, $exclude_bc) !== false); // never cache files with the marker
foreach ($ignore_list_bc as $ignore_bc) // $page_bc is a full URL, so look for each entry inside it
if (strpos($page_bc, $ignore_bc) !== false) // do we need to ignore this file
{
$ignore_page_bc = true; // if a match was found set it to true
break;
}
if ( isset($_SERVER['REQUEST_TIME']) ) // not always available, but if so faster than time()
$now_bc = $_SERVER['REQUEST_TIME'];
else
$now_bc = time();
// caching
if ($ignore_page_bc === false) // not in ignore list, then proceed
{
if (is_file($cachefile_bc)) // do we have one
{
clearstatcache(); // make sure we read fresh file information
$filesize_bc = filesize($cachefile_bc);
$last_modified_bc = filemtime($cachefile_bc);
$etag_bc = md5_file($cachefile_bc);
$valid_bc = ($now_bc - $cachetime_bc < $last_modified_bc); // is the cachefile still valid
header_remove ("Pragma"); // sometimes hosting sets Pragma no-cache
header_remove ("Expires");
header("Last-Modified: ".gmdate("D, d M Y H:i:s", $last_modified_bc)." GMT");
header("Etag: $etag_bc");
header ("Cache-Control: must-revalidate");
// serve from browser cache, just send headers (only while the cachefile is still valid)
$since_bc = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) : false;
$match_bc = isset($_SERVER['HTTP_IF_NONE_MATCH']) ? trim($_SERVER['HTTP_IF_NONE_MATCH']) : '';
if ($valid_bc && ($since_bc === $last_modified_bc || $match_bc === $etag_bc))
{
header("Warning: BROWSER cache size $filesize_bc");
header("HTTP/1.1 304 Not Modified");
exit;
}
// serve from cache if still valid
if ($valid_bc)
{
header("Content-Encoding: gzip");
header('Expires: '.gmdate('D, d M Y H:i:s \G\M\T', $last_modified_bc + $cachetime_bc)); // Expires must be an HTTP date
header("Warning: 200 VALID from cache filesize $filesize_bc");
if (readfile($cachefile_bc))
exit();
}
// expired cache
$fp_bc = fopen( $cachefile_bc, "ab"); // Let's first try to lock (w or w+ would erase the file)
if (! flock($fp_bc, LOCK_EX | LOCK_NB)) // lock but don't wait for it, if we can't get one
{ // can't lock, SERVE AN OLD CACHEFILE
header("Content-Encoding: gzip");
header("Content-Length: ".$filesize_bc);
header("Warning: 200 EXPIRED cache");
if (readfile($cachefile_bc)) // $fp_bc is write-only ("ab"), so read the file by name and without output buffering
{
fclose($fp_bc);
exit();
}
header_remove("Content-Length"); // nothing was sent: forget the old size
flock($fp_bc, LOCK_EX); // wait for the lock and create a new cachefile
}
// otherwise we already hold the lock and create a new cachefile
}// eof cachefile exists
else
{
$fp_bc = fopen( $cachefile_bc, "ab"); // No cachefile at all, let's first try to lock ( w or w+ would erase the file)
flock($fp_bc, LOCK_EX); // lock and wait for it, then create new cachefile
}
ob_start();
}// eof ignore page
}// eof GET
?>
The code in end_cache.php
<?php
/* end_cache.php
* programmer RvH date 09-11-2013
*/
if ( ! isset( $key_bc ) || $key_bc !== "s3crEt") // if someone forgets to put this outside webspace
{
ob_end_clean();
@fclose($fp_bc);
@unlink($cachefile_bc);
die("No direct access allowed");
}
if ($_SERVER['REQUEST_METHOD'] === 'GET') // ignore post!
{
if ($ignore_page_bc === false) // not in ignore list, then proceed
{
ftruncate($fp_bc, 0); // same as fopen with w
$contents_bc = ob_get_contents();
$filesize_bc = ob_get_length();
ob_end_clean();
if( stripos($contents_bc,"<pre>") === false ) // not to be used with <pre> tags
$contents_bc = sanitize_output($contents_bc);
$compressed_out_bc = gzencode($contents_bc,9);
$etag_bc = md5($compressed_out_bc); // same value md5_file() will give for the new cachefile
header("Content-Encoding: gzip");
header("Content-Length: ".strlen($compressed_out_bc));
header("Warning: 200 cache CREATED $filesize_bc bytes time = ".date("Y-m-d\TH:i:s",$now_bc));
header("Cache-Control: must-revalidate");
header('Last-Modified: '.gmdate('D, d M Y H:i:s \G\M\T', $now_bc));
header('Expires: '.gmdate('D, d M Y H:i:s \G\M\T', $now_bc + $cachetime_bc));
header("Etag: ".$etag_bc);
fwrite($fp_bc, $compressed_out_bc); // save the contents of output buffer to the file
flock($fp_bc, LOCK_UN);
fclose($fp_bc);
echo $compressed_out_bc;
exit();
}
}
function sanitize_output($buffer)
{
$search = array(
'/\>[^\S ]+/s', // strip whitespaces after tags, except space
'/[^\S ]+\</s', // strip whitespaces before tags, except space
'/(\s)+/s' // shorten multiple whitespace sequences
);
$replace = array(
'>',
'<',
'\\1'
);
$buffer = preg_replace($search, $replace, $buffer);
return $buffer;
}
?>
The names of the variables end in _bc to avoid conflicts with the variables in your own scripts.
Configuration
- $cachedir_bc: '/pathToYourCacheDirectory/'. Directory to cache files in. Keep it outside the web root and use it only for cache files!
- $cachetime_bc: a site-specific setting. Default → 86400 (1 day)
On a high-traffic or frequently updated site this should be a lot smaller than 1 day.
- $extension_bc: dot plus extension for cached files. Default → ".htm"
It could also be ".cache" or ".html". Don't forget the dot!
- $exclude_bc: any character or sequence of characters that is allowed in filenames. Files with $exclude_bc in their name will not be cached.
If your file is smaller than 1KB, exclude it; it isn't worth the overhead.
- $key_bc: not really necessary if you keep begin_cache.php and end_cache.php outside the web root.
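To see how $cachedir_bc and $extension_bc end up in a file name, here is a minimal sketch of the way begin_cache.php builds $cachefile_bc. The helper name cachefile_for_bc() and the example.com URLs are made up for the illustration.

```php
<?php
// Hypothetical helper mirroring: $cachedir_bc . md5($page_bc) . $extension_bc
function cachefile_for_bc($cachedir, $url, $extension)
{
    return rtrim($cachedir, '/') . '/' . md5($url) . $extension;
}

$page1 = cachefile_for_bc('/path/to/your/cache/', 'http://example.com/news.php?page=1', '.htm');
$page2 = cachefile_for_bc('/path/to/your/cache/', 'http://example.com/news.php?page=2', '.htm');

echo $page1, "\n"; // /path/to/your/cache/ followed by 32 hex characters and .htm
echo $page2, "\n"; // a different file: the query string is part of the key
```

Because the whole URL is hashed, every combination of query parameters gets its own cachefile, which is worth remembering when you choose $cachetime_bc and plan the cleanup.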
The code:
We check if the request is a GET.
We define the settings.
We add the files with the marker "__" to the ignore list.
We check if the requested file is to be ignored.
We send only headers if the file is in the browser cache and not expired.
We serve a cachefile if it exists and is valid.
If not, we try to get a lock. If we can't get one, we serve an old cachefile if it exists.
If we are still in the program, we need to run end_cache.php.
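The browser-cache step above boils down to comparing two request headers with what we know about the cachefile. The following is a minimal sketch under my own assumptions: the helper name is invented, $server stands in for $_SERVER, and weak validators such as W/"..." are not handled.

```php
<?php
// Hypothetical helper: does the browser already hold this exact version?
function browser_copy_is_current_bc(array $server, $etag, $last_modified)
{
    // If-None-Match: compare the ETag, ignoring surrounding spaces and quotes.
    if (isset($server['HTTP_IF_NONE_MATCH'])
        && trim($server['HTTP_IF_NONE_MATCH'], " \"") === $etag) {
        return true;
    }
    // If-Modified-Since: compare the date with the modification time of the cachefile.
    if (isset($server['HTTP_IF_MODIFIED_SINCE'])
        && strtotime($server['HTTP_IF_MODIFIED_SINCE']) === $last_modified) {
        return true;
    }
    return false;
}

$etag  = md5('cached page body'); // stand-in for md5_file($cachefile_bc)
$mtime = 1384000000;              // stand-in for filemtime($cachefile_bc)

$by_date = array('HTTP_IF_MODIFIED_SINCE' => gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
$by_etag = array('HTTP_IF_NONE_MATCH' => '"' . $etag . '"');

var_dump(browser_copy_is_current_bc($by_date, $etag, $mtime)); // bool(true)
var_dump(browser_copy_is_current_bc($by_etag, $etag, $mtime)); // bool(true)
var_dump(browser_copy_is_current_bc(array(), $etag, $mtime));  // bool(false)
```

When the function returns true and the cachefile has not expired, a 304 Not Modified header is all that has to be sent.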
Usually this method will reduce the file size by 30%-70%.
The page will also load faster: the onload time is reduced by around 25%.
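If you want to check the saving for your own pages, you can measure it with the same gzencode() call that end_cache.php uses. The sketch below needs the zlib extension and uses made-up, very repetitive sample HTML, so it will compress far better than a real page; treat the percentage it prints as an illustration only.

```php
<?php
// Build a repetitive HTML page as sample input (made-up content).
$html = "<html><body>\n";
for ($i = 0; $i < 200; $i++) {
    $html .= "  <p class=\"item\">Row number $i of the sample table</p>\n";
}
$html .= "</body></html>\n";

$compressed = gzencode($html, 9); // level 9, as in end_cache.php
$saved = 100 - (strlen($compressed) * 100 / strlen($html));

printf("original: %d bytes, gzip: %d bytes, saved: %.1f%%\n",
    strlen($html), strlen($compressed), $saved);
```

To test a real page, replace $html with the output of one of your own scripts, for example captured with ob_start() and ob_get_clean().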
If you see raw compressed data instead of HTML, either directly or after refreshing the page (F5), you need to put this rule in .htaccess or change the value in php.ini:
php_flag zlib.output_compression Off
Finally, we need to delete old cachefiles. Rule of thumb: run the cleanup once per $cachetime_bc.
With the default settings, cron should run the following PHP script once a day:
<?php
$cachedir_bc = '/path/to/your/cache/'; // the directory where the cachefiles are
$cachetime_bc = 86400; // the time in seconds the cachefiles stay valid (same value as in begin_cache.php)
$cachedir_bc .= ( substr($cachedir_bc, -1) !== "/" ) ? "/" : "";
$files = array_diff( scandir($cachedir_bc), array("..", ".",".htaccess","index.html") ); // skip these
foreach($files as $file)
if ( time() - $cachetime_bc > filemtime($cachedir_bc.$file) ) // delete cachefiles if they are too old
@unlink($cachedir_bc.$file);
?>
Or add the following line to crontab:
10 5 * * * find /path/to/cachedir/ -name "*.htm" -mtime +1 -delete
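This example cleans the cache directory at 5:10 AM every day, assuming your files have the extension .htm and your $cachetime_bc is 1 day.
Note that find rounds file ages down to whole days, so -mtime +1 only matches files that are at least two days old; use -mmin +1440 if you want to match everything older than 24 hours.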
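The age test used by the cleanup can be checked without touching any files. This is a small sketch with invented file names and a fixed "current time"; the helper expired_cachefiles_bc() is hypothetical and only repeats the comparison from the PHP cleanup script.

```php
<?php
// Hypothetical helper: given file name => mtime pairs, return the names that are too old.
function expired_cachefiles_bc(array $mtimes, $now, $cachetime)
{
    $expired = array();
    foreach ($mtimes as $name => $mtime) {
        if ($now - $cachetime > $mtime) { // same test as in the cleanup script
            $expired[] = $name;
        }
    }
    return $expired;
}

$now = 1384000000; // fixed "current time" so the example is repeatable
$mtimes = array(
    'fresh.htm' => $now - 3600,      // one hour old
    'stale.htm' => $now - 2 * 86400, // two days old
);

print_r(expired_cachefiles_bc($mtimes, $now, 86400)); // lists only stale.htm
```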
Test the program: look at the response headers.