Wagner Meira Jr
Erik Fonseca
Cristina Murta
Virgílio Almeida
Computing Systems Performance Analysis and Modeling Lab
Department of Computer Science
- UFMG
Belo Horizonte - MG - Brazil
Although caching and the creation of cache server hierarchies has become a popular strategy for reducing user waiting time and network traffic, there is no standard recipe for determining the best hierarchy configuration given a set of machines and the workload that they have to serve.
This paper presents a framework for evaluating the performance of cache proxy hierarchies. By using this framework, administrators can determine the trade-offs among possible cache hierarchy configurations. We implemented this framework for Squid cache servers and demonstrated its use by evaluating possible hierarchy configurations for a real cache proxy.
WWW caching has been a widely adopted solution for reducing user waiting time, server load and network traffic [6]. Moreover, these caches are very important for user communities that have restricted connectivity, such as countries in which the Internet is not well established [1]. Many cache servers in use today are already overloaded, presenting scalability problems and motivating the creation of co-operative caches, where several servers work together and are organized as a hierarchy.
Despite of this fact, configuring and tuning cache hierarchies is still done by trial-and-error in most cases. The absence of ``recipes'' for configuring these hierarchies is explained by several factors such as the variable connectivity between the caches and the Internet, different user profiles [1], and a variety of hardware/software platforms. As a consequence, administrators usually tune their cache hierarchies through trial-and-error.
In this paper we present a framework for performance evaluation of cache hierarchies, which helps to characterize the workload that is submitted to the cache, the execution of experiments for gathering performance information, and the presentation of performance data, providing an intuitive notion of the cost-benefit associated with possible configurations. This paper is organized in six sections. This section serves as introduction. The next section presents Squid, a popular cache server. Section 3 describes our performance evaluation framework. We demonstrate the use of our framework in Section 4. The last two sections present some related work and our conclusions.
Squid [4] caches Internet data by acting as a proxy server for Internet users. That is, whenever a user requests a page to a WWW browser, the browser sends the request to Squid that is responsible for getting the requested objects for the client. If these objects are available at the Squid server, they are sent to the browser. Otherwise, Squid connects to the remote server and requests the objects. When the remote server answers, Squid sends the objects to the client machine, and keeps a copy of them. The next time a request is made for each of those objects, Squid checks whether they are available at the server, retrieves them from memory or disk, transferring the data to the client machine almost immediately.
Squid servers can co-operate, increasing both the storage and the connectivity capacities, and thus the efficacy of the cache system as a whole. Moreover, these servers can be arranged as a hierarchy, which works as one cluster that answers requests, and objects can be cached at one or more levels of the hierarchy. Also, each machine in the hierarchy can be assigned to a specific range of domains (e.g., .com, .net) so that it is possible to balance the load and to exploit the reference locality inherent to accesses. ICP [13] is a protocol designed to speed up the communication between servers that compose a hierarchy. By using the approach presented in this paper, users can quantify the intrusion caused by ICP .
While creating Squid hierarchies [12], we are able to set machine parameters (such as the amount of disk and memory that can be used for WWW caching) and to configure the hierarchy. There are basically two hierarchy-related settings for each server: the relationships that each server has with other machines (i.e., sibling, parent, none) and the domain(s) that are answered by each server (if not all of them). Note that there is no standard recipe for configuring a Squid hierarchy, because machine configurations, network resources, and traffic characteristics may vary drastically among sites. Our experience is that system managers usually tune hierarchy parameters by trial-and-error, which is both laborious and error-prone.
For sake of performance evaluation, we define three parameters that characterize the configuration of each server S that participates in a cache hierarchy:
In order to tune a Squid hierarchy, we have to address the trade-off between the amount of data that is available in the hierarchy and the intrusion that is caused by satisfying requests from other cache servers in the hierarchy. Thus, an increase in request connectivity usually increases not only the probability of finding the desired data, but also the intrusion associated with requests from other cache servers that compose the hierarchy, since the response connectivity also increases. As mentioned, domain coverage facilitates load balance both in terms of disk utilization and communication demands, but requires knowledge about the domain popularity in the workload. In the next section we describe how we can evaluate these trade-offs and compare possible hierarchies for a given set of machines.
In this section we present a performance evaluation framework for cache server hierarchies that consists of a methodology for workload characterization and performance evaluation, programs that simulate client browsers and WWW servers, and monitoring facilities for Squid servers. Our framework is based on trace-driven experimentation of cache server hierarchies. By utilizing our framework, administrators can test different configurations, determine potential bottlenecks in advance, and evaluate trade-offs among configurations. The framework is depicted in Figure 1 and is described in detail in the next subsections.
We characterize WWW cache workloads based on real traces from a proxy cache server. We adopted this trace-driven approach because both hit ratio and characteristics of stored data (e.g., source) vary significantly across caches [1,10], as a consequence of user interests and connectivity. A workload consists of one or more characterizations, each associated with a pre-defined domain. Each characterization is defined through the following parameters:
These parameters are used to drive the generation of streams of synthetic requests. Each request also encodes the size of the requested object, which follows the file size distribution of Inkbench Benchmark [5].
The clients in our framework are implemented as Unix processes that generate requests to a proxy server according to a workload specified as presented in Section 3.1. Each client generates a pre-defined number of requests, keeping a given number of outstanding requests. These clients are implemented as single processes that communicate through asynchronous I/O primitives, allowing them to handle several simultaneous connections.
Servers are also implemented as single processes that answer requests on demand, creating the synthetic pages associated with requested URLs, and returning them to the requestor. In a experiment, each server represents a domain of the workload. In the current version of our workload, the requested pages are sent immediately to the requestor, i.e., we do not model response times, which is the topic currently investigated. The implementations of both clients and servers are able to handle about 100,000 requests per hour, which is more than most proxy server hierarchies handle nowadays [1].
We instrumented the source code of Squid to collect cumulative timings from four perspectives:
We calculate the cache efficiency by using only the requestor perspective, i.e., we divide the time devoted to answer requests from clients by the total computation time. The other perspectives are used for further performance analysis, as described in Section 4.
We measure the performance of cache proxy hierarchies in two dimensions: hit ratio and efficiency. The hit ratio tells the percentage of the requests that were satisfied within the hierarchy, while the efficiency tells the percentage of the computation time that is devoted to answering requests made by clients.
We determine the hit ratio associated with a given cache proxy hierarchy in a per-client basis. Thus, for each client that requests data to the hierarchy, we divide the number of requests that are satisfied within the hierarchy by the total number of requests. We can calculate this ratio by analyzing the access.log file produced by Squid. More specifically, we verify the TCP requests in a per requestor basis. For each client, the number of its requests that are satisfied within the hierarchy is the sum of the number of requests that were satisfied by the server (TCP_HIT), and the number of requests that were satisfied by siblings (SIBLING_HIT) and parents (PARENT_HIT).
After determining the hit ratio and cache efficiency associated
with each hierarchy, we plot these two values in the same graph,
generating a ``Hit Ratio
Efficiency Graph'' (HEG).
HEGs are a simple and an efficient representation of the
trade-offs among configurations, providing a visualization of both the cost (eff
iciency)
and the benefit (hit ratio) of each configuration.
Note that the shape of the HEG is dependent on the characteristics of the workload, thus, after determining some of these characteristics (an example is given in Section 4), the administrator plots the HEG for possible configurations and pick the configuration that seems to be more appropriate. For example, organizations that have slow communication links to the Internet may choose configurations that provide better hit ratio. On the other hand, organizations that have fast external connections but limitations regarding computing facilities may adopt configurations that are more efficient computationally.
| Domain | Hot | Hot Set | Domain | Cache Size |
| Ratio | Size | Popularity | Ratio | |
| .br | 0.68 | 2304 | 0.58 | 0.1 |
| .com | 0.36 | 4356 | 0.33 | 0.1 |
| !.br !.com | 0.39 | 1764 | 0.17 | 0.1 |
| Configuration | Request | Response | Domain | Response |
| Connectivity | Connectivity | Coverage | Time (sec) | |
| 4.0 | 3 Sib. | 3 | All | 2.807 |
| 3.1 | 2 Sib./1 Par. | 2 | All | 4.084 |
| 2.2 | 1 Sib./2 Par. | 1 | All | 2.454 |
| 2.2.dom | 1 Sib./1 Par. | 1 | .br/!.br | 2.154 |
| Conf. | 4|c|Hits | |||
| Server | Parent | Sibling | Hierarchy | |
| 4.0 | 0.317 | 0.000 | 0.128 | 0.445 |
| 3.1 | 0.320 | 0.008 | 0.086 | 0.414 |
| 2.2 | 0.317 | 0.030 | 0.040 | 0.389 |
| 2.2.dom | 0.318 | 0.033 | 0.041 | 0.392 |
| Conf. | Children(%) | Sibling(%) | Overhead (%) |
| 3.1 | 66 | - | 34 |
| 2.2 | 55 | 3 | 42 |
| 2.2.dom | 57 | - | 43 |
| Conf. | Parent(%) | Sibling(%) | Overhead(%) | Client(%) |
| 4.0 | - | 14 | 43 | 43 |
| 3.1 | 2 | 12 | 44 | 42 |
| 2.2 | 5 | 8 | 40 | 47 |
| 2.2.dom | 3 | 8 | 39 | 50 |
In order to demonstrate how our framework can be used for analyzing the performance of cache proxy hierarchies, we evaluated possible configurations of a set of four machines to work as a cache server hierarchy for a given workload. Three of these machines are Pentium 166 Mhz with 64 MB RAM memory, and an IDE disk. The other machine is a Pentium Pro with 128 MB RAM memory, and an ultra-wide SCSI disk. All of these machines run FreeBSD 2.2.5 and our instrumented version of Squid 1.1.17. Finally, all machines have 300 Mbytes of disk for caching WWW objects. The use of two different machine configurations allows us to quantify how much a faster machine with a faster disk is better than a low-end personal computer, under the same conditions. These machines are connected to two networks. The first is an 100Mb Fast-Ethernet that connects only the four Pentium machines. The second is an Ethernet network, from where the requests come.
The workload used in our tests is based on logs from POP-MG, which is the Internet backbone that serves the state of Minas Gerais, Brazil. POP-MG has several national and international links that add to a bandwidth of up to 9 Mbps and an average traffic that is close to 6 Mbps. The average load of the caching proxy servers is 1,800,000 requests per day. We analyzed a log containing 4,235,511 requests for 1,079,044 unique objects, which results in about 12 Gbytes of data that has to be stored in our machines. On the other hand, our machines can store up to 1.2 Gbytes, i.e., the cache size ratio is 10%. As described in [1], half of the requests (50%) to POP-MG proxy servers is for Brazilian domains, while the US commercial domain accounts for 33% of the requests. For the purpose of analyzing domain-based caches, we divided our workload into characterizations for three domains: (1) .br, representing Brazilian sites, (2) .com, representing commercial US sites, and (3) other, representing the remaining sites. The parameters for these characterizations are presented in Table 1 and the file popularity distributions in Figure 2.
When evaluating hierarchies, administrators usually want to minimize
the cost associated with experiments. By using our framework it is
possible to scale down the experiments if we maintain the workload
ratios. Thus, after characterizing the workload, we generate a stream
of requests that is feasible to experiment with and set the disk and
memory parameters so that our workload ratios are kept. In the case of
POP-MG for example, we generated a set of 96,000 requests that follow
the workload and determined the total size of unique objects (
400 Mbytes). We then used 10% of this size as our total cache size,
leading to 10 Mbytes of disk cache per machine.
We investigated four hierarchy configurations in our experiments: (1) 4.0, where all four servers answer client requests and are siblings one from each other; (2) 3.1, a two-level hierarchy with three children (the three Pentium 166 Mhz) and one parent (Pentium Pro); (3) 2.2, a two-level hierarchy with two children (two Pentium 166Mhz) and two parents (one Pentium 166Mhz and the Pentium Pro); and (4) 2.2.dom, similar to 2.2, but the parents answer only pre-defined domains, i.e., one parent answers requests to .br domain, while the other answers the remaining requests. We performed the experiments for each of the four configurations. Each experiment consists of two runs, each requesting 96,000 randomly chosen URLs to the hierarchy as a whole. The results presented are always the average from second runs, since the purpose of the first runs is to ``warm up'' the caches. A summary of these configurations is presented in Table 2, where the last column presents the average response time from the client perspective for each configuration. Although these measurements do not reflect real response times, they indicate the configurations that are more efficient for the workload.
The hit ratios achieved using each configuration are presented in Table 3, where the first column identifies the configuration, the second column the average hit ratio for the children servers, the third and fourth columns give the percentage of requests that were satisfied by parents and siblings, respectively, and the last column gives the hierarchy hit ratio, which is the sum of the three previous columns. Note that the flat configuration (4.0) provides the best hit ratio, because of the higher degree of cooperation and temporal locality among requests to different clients.
Summaries of the computational profiles are presented in Tables 4 and 5. Table 4 presents the average results for servers that worked as parents in the various configurations. As we may observe in the table, their computational time was divided into three categories: (1) time spent answering children requests; (2) time spent answering sibling requests, i.e., other parents; and (3) time spent on server overhead (e.g., message reception, storage management). We can observe that children-related time always account for more than half of the computational time of the parents, and this time increases with the response connectivity, as expected. Table 5 presents the average results for servers that worked as children in the various configurations. The computational time of these servers was divided into four categories: (1) time spent receiving parent's responses; (2) time spent handling siblings requests; (3) time spent on server overhead; and (4) time spent answering client's requests. Note that the efficiency (i.e., client's time) increases as we use more structured hierarchies, while the intrusion caused by siblings decreases. Also, note that the use of domain-based configurations (2.2.dom) reduces by half the time devoted to parent-related work.
The HEG resultant from our evaluation (Figure 3) gives a graphical representation of the trade-offs inherent to the various hierarchies. By analyzing the graph, we can easily see that more structured configurations result in smaller hit ratios but are more efficient. In this case, the administrator would probably decide to use either 4.0 or 2.2.dom, depending on the constraints in implementing the cache hierarchy (i.e., network and hardware availability).
In all cases the server overhead time is very high and the most important problem to be addressed in order to improve the performance of the cache hierarchy. Checking code location profiles, we found that about 70% of this overhead is associated with reception of requests and responses associated with both TCP and UDP protocols, another 20% is storage management, and the remaining 10% is spread out among various tasks. This result indicates an urgent need for more efficient network primitives, which is a research topic that we intend to investigate.
Comparing the profiles from the Pentium with the Pentium Pro, we did not find any significant variations, although the Pentium Pro was able to answer requests faster (30%) than the other Pentium machines. This invariant may be explained by the fact that the faster machine also has the faster disk, presenting the same ratio processor/disk of the slower configurations.
We also performed tests varying the number of simultaneous connections from each client, and there was no variation on the profiles observed only on the response time, as a result of the varying multiprogramming level. This observation is very relevant because it allows users to profile and study their hierarchy configurations without the need to stress them to their limits, which may have consequences that are not easily predictable.
There are several tools that provide performance information for Squid proxy servers such as the cache manager, which is part of the Squid distribution, and Calamaris [2]. These tools usually provide request-oriented information, quantifying the costs of requests and the throughput of the proxy server.
Access characteristics of a client proxy server were discussed in [3]. One of their conclusions is that hit rate increases with request rate, which has implications for cache design. They discuss the cache design at a higher level, that is, hierarchical structure versus monolithic cache structure in a single level. However, they do not discuss the organization of the hierarchy and its implications for the cache performance.
Rousskov [11] studied the performance of seven SQUID proxies using per request measurements of network and disk delays. The hierarchical cache itself is seen as a black box. Although the cache software is fixed, the data were collected in various environments and cooperative caches with several hierarchy levels, begin hard to evaluate the trade-offs among configurations.
Krishnam and Sugla [9] evaluated the impact of co-operation of cache servers on the performance and efficacy of cache proxy hierarchies. They observed that co-operation does increase the document hit rate and has more impact on smaller proxies. On the other hand, co-operation also caused the largest (up to 300%) overhead on the smaller proxies. They propose a ``fairness'' metric that is the fraction between the hit ratio increase as a consequence of the co-operation and the increase in terms of extra connections that are inherent to the co-operation. This metric differs from our work because our efficiency is based on real-time profiles and reflects all sources of overhead. Moreover, to the best of our knowledge, our work is the first that quantifies the costs associated with hierarchies as a whole and determines the influence of the ICP protocol.
In this paper we presented a framework for evaluating and tuning the
performance of cache proxy hierarchies.
By using this framework, administrators can determine the trade-offs
among possible cache hierarchy configurations.
For sake of evaluation, performance data is presented in two dimensions:
hit ratio and
computational efficiency. The performance data associated with
various hierarchy configurations can be visualized via
``Hit Ratio
Efficiency Graphs'',
which allow easy understanding of the various trade-offs
across different hierarchy configurations.
We demonstrate the use of our approach by evaluating possible
configurations for the largest cache proxy in Brazil.
Our future research efforts include quantifying the effects of network delays, investigating the dynamic behavior of these cache hierarchies by using cause-effect analysis [7], developing models that facilitate the analysis of configuration trade-offs, and studying more efficient communication primitives that reduce the overhead observed, which implies an evaluation of the benefits of persistent TCP connections [8], in the scope of cache server hierarchies.