How to Read Big Files with PHP (Without Killing Your Server)
It’s not often that we, as PHP developers, need to worry about memory management. The PHP engine does a stellar job of cleaning up after us, and the web server model of short-lived execution contexts means even the sloppiest code has no long-lasting effects.
There are rare times when we may need to step outside of this comfortable boundary --- like when we're trying to run Composer for a large project on the smallest VPS we can create, or when we need to read large files on an equally small server.
It’s the latter problem we'll look at in this tutorial.
The code for this tutorial can be found on GitHub.
Measuring Success
The only way to be sure we’re making any improvement to our code is to measure a bad situation and then compare that measurement to another after we’ve applied our fix. In other words, unless we know how much a “solution” helps us (if at all), we can’t know if it really is a solution or not.
There are two metrics we can care about. The first is CPU usage. How fast or slow is the process we want to work on? The second is memory usage. How much memory does the script take to execute? These are often inversely proportional --- meaning that we can offload memory usage at the cost of CPU usage, and vice versa.
In an asynchronous execution model (like with multi-process or multi-threaded PHP applications), both CPU and memory usage are important considerations. In traditional PHP architecture, these generally become a problem when either one reaches the limits of the server.
It's impractical to measure CPU usage inside PHP. If that’s the area you want to focus on, consider using something like top, on Ubuntu or macOS. For Windows, consider using the Linux Subsystem, so you can use top in Ubuntu.
For the purposes of this tutorial, we’re going to measure memory usage. We’ll look at how much memory is used in “traditional” scripts. We’ll implement a couple of optimization strategies and measure those too. In the end, I want you to be able to make an educated choice.
The methods we’ll use to see how much memory is used are:
// formatBytes is taken from the php.net documentation
memory_get_peak_usage();
function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}
We’ll use these functions at the end of our scripts, so we can see which script uses the most memory at one time.
What Are Our Options?
There are many approaches we could take to read files efficiently. But there are also two likely scenarios in which we could use them. We could want to read and process data all at the same time, outputting the processed data or performing other actions based on what we read. We could also want to transform a stream of data without ever really needing access to the data.
Let’s imagine, for the first scenario, that we want to be able to read a file and create separate queued processing jobs every 10,000 lines. We’d need to keep at least 10,000 lines in memory, and pass them along to the queued job manager (whatever form that may take).
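As a rough sketch of how that first scenario could look, the loop below reads one line at a time and hands off every 10,000 lines as a batch. Both dispatchChunk() and big-file.txt are placeholders invented for this example, standing in for whatever queue manager and input file you actually have:
// Sketch only: split a large file into queued jobs of 10,000 lines each.
// dispatchChunk() is a stand-in for a real queue manager call.
function dispatchChunk(array $lines) {
    // push $lines onto a queue (Beanstalkd, SQS, a database table, etc.)
}

$handle = fopen("big-file.txt", "r");
$chunk = [];

while (($line = fgets($handle)) !== false) {
    $chunk[] = $line;

    if (count($chunk) === 10000) {
        dispatchChunk($chunk);
        $chunk = [];
    }
}

if (count($chunk) > 0) {
    dispatchChunk($chunk);
}

fclose($handle);
Written this way, no more than 10,000 lines need to be in memory at any one time, however large the file is.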
For the second scenario, let’s imagine we want to compress the contents of a particularly large API response. We don’t care what it says, but we need to make sure it’s backed up in a compressed form.
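To make the second scenario concrete, here is a minimal sketch, assuming allow_url_fopen is enabled and using a made-up URL. PHP’s stream wrappers let us pipe the response straight into a gzip-compressed file, so the full payload never has to sit in memory:
// Sketch only: stream a (hypothetical) large API response into a
// gzip-compressed backup file instead of buffering the whole thing.
$source = fopen("https://api.example.com/huge-export.json", "r");
$destination = fopen("compress.zlib://backup.json.gz", "w");

stream_copy_to_stream($source, $destination);

fclose($source);
fclose($destination);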
In both scenarios, we need to read large files. In the first, we need to know what the data is. In the second, we don’t care what the data is. Let’s explore these options…
Reading Files, Line By Line
There are many functions for working with files. Let’s combine a few into a naive file reader:
// from memory.php
function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}
print formatBytes(memory_get_peak_usage());
// from reading-files-line-by-line-1.php
function readTheFile($path) {
    $lines = [];
    $handle = fopen($path, "r");

    while (!feof($handle)) {
        $lines[] = trim(fgets($handle));
    }

    fclose($handle);
    return $lines;
}
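To see what this costs, we can point it at a suitably large text file and then pull in memory.php; the file name here is only a placeholder. Because every line ends up in the $lines array, peak memory usage grows roughly in step with the size of the file:
readTheFile("shakespeare.txt");

require "memory.php";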