Michal Kurcewicz, Wojtek Sylwestrzak, Adam Wierzbicki
Interdisciplinary Center for Mathematical and Computational Modeling
University of Warsaw
{mkur, wojsyl, adamw}@icm.edu.pl
Topic: Caching software, Cache server performance analysis
The growth of HTTP traffic and experience with use of proxy caches
have demonstrated the need for increasing the scalability and
efficiency of proxy cache software. To increase scalability, proxy
caching software needs to be written in a modular, decentralized
fashion. On the other hand, to improve efficiency, the problem needs
to be decomposed. Both these basic and common-sense considerations
point to a new design solution: a distributed proxy cache.
In a centralized cache server, the main bottleneck will be a central
resource. In a distributed cache, it will be communication.
Therefore we stand before design considerations different than in a
centralized solution: efficient request direction to minimize
communication overhead, and keeping essential data structures
consistent - as cheaply as possible.
Finally, to exploit fully the opportunities for improving performance
by a distributed design, we must manage global resources of a
distributed cache as efficiently as possible. This means that
components of a distributed server must cooperate with each other
closely. To achieve improved scalability, we must distinguish
functional components of a proxy cache and build the design from
autonomous modules which realize these functions.
In this paper, we focus on the described above problems. We attempt
to improve the efficiency of proxy cache operation in basic ways.
This can be achieved through additional functionality of our
design. The distributed cache has better global memory management and
avoids use of disk in favor of transfers between memory caches.
Another field of our attention is resilience of a proxy cache. The distributed cache does not have a single point of failure. This is achieved through the use of election algorithms (see section 4). As a result, our design is safer than currently used solutions.
Our design is in many respect based on the distributed file
system xFS [1], [2]. However, there are significant differences
because of the specific character of our application.
Our design uses a hierarchy similar to that of xFS, but it is more
simple. A distributed file system has complex functions for
consistency control, which are needed because of repeated, concurrent
writes. A cache or WWW server writes objects once, then removes them
if they expire - possibly replaces them after a modification.
In the next section, we briefly discuss some design considerations which arose from the evaluation of performance of existing solutions. In section three we describe the functional architecture of our distributed proxy cache. Section four discusses failure control of the functional components. In section five, we describe load sharing. In section six, we recount our experiences with a prototype implementation of our design and conclude.
In this section, let us describe some problems which we aim to solve by our design. We do this by analyzing the characteristics of two mechanisms for dividing load among individual caches.
A commonly suggested way to partition traffic to a set of proxy
servers is to use a client-initiated proxy selection mechanism called
proxy auto-configuration. In the suggested scheme [5], [9] a hash function is computed from
the URL to route requests to different proxies.
Let us briefly criticize this solution: the points mentioned here shall be elaborated in further sections.
In this solution, several caches - members of the cluster - are provided to clients as a single cache through a load balancing TCP router or round-robin DNS. Internally, every cache has a routing array which is used to select a parent.
Our design is based on a division of a distributed server into functional
components. These components are not separate caches connected by a
hierarchy; in fact, more than one of the functional components could run
together on one computer. There are three kinds of server components
which we distinguish: the Frontend, the Backend and the Manager. No
components of the same kind can run together on one machine.
Main Memory Caches on Frontends
The Frontends are responsible for handling client connections. One
can think of a Frontend as an autonomous proxy, which maintains its
own cache in main memory.
A Frontend is selected for a client using Round Robin DNS or a load
balancing TCP router. After that, the selected Frontend checks if it
has the requested object in its main memory cache. If not, it asks
the Manager where the object can be found.
The Frontend's main memory caches store popular objects. Because of
that, there is a high probability that a client request will be
handled directly by some Frontend, without the need to touch disk.
The hit rate of the Frontends is determined by three factors: the
sizes of their main memory caches, the quality of their replacement
policy and the quality of the algorithm which determines if a object
should be placed on some Frontend.
We can now demonstrate how our design removes the problem with using
persistent connections and proxy auto-configuration. A persistent
HTTP connection is made to a Frontend, and it can remain open
regardless of what objects are requested and where the requests are
directed.
Processing a client request
On Figure 1 we show two kinds of components: the Frontends and Backends. A Manager is also present: the table on the figure is part of the Manager's data structures and is called the Manager routing table. The small boxes containing ``Hit'' or ``Miss'' on Figure 1 also represent information internal to the Manager. Figure 1 shows three Frontends (F1, F2, F3).

After a client makes a request to some Frontend (for example,
F2) and the object (URL3) is not found immediately in
F2's memory cache, F2 asks the Manager to find URL3. He
consults his data structures to determine if the requested object is
cached on some other Frontend. On Figure1, URL3 is found on
F1. The Manager asks F1 to send the data to F2 and
determines if F2 can store a duplicate of the object himself
(for details, see section 5). If the requested object is not found
on some Frontend (as in the case of URL2), the Manager computes
the value of the hash function on the object URL modulo the size of
the routing array. This value is used as an index to determine the id
of the Backend which is written in the indexed field of the routing
array.
The Manager checks if the object is present on the selected Backend.
This is done without querying the Backend, because the Manager has all
object metadata. If the object is found, the Manager asks the Backend
to send it to the Frontend which made the request. Otherwise, the
Manager asks the Frontend to fetch the object from the home server and
to store it on the selected Backend.
One of the issues raised in section two can be resolved here. In our
distributed cache, we use transfers between Frontends to fetch the
object because this can be faster than using local disk. This
strategy makes sense if the cluster is connected by a fast, switched
network such as ATM. The remote Frontend's memory becomes a stage in
the memory hierarchy which fits between local memory and local disk
[3].
Disk Caches on Backends
As mentioned earlier, the Backends form a distributed file server to
which Managers direct requests. Basically the function of a Backend
is very simple: it only needs to store objects and return them on
request. In our design the Backend does not even decide whether it
has an object or not: this is done by the Manager, who stores the
object metadata. We make no other functional assumptions about
Backends. However, we believe that a Backends main memory use should
be as small as possible, so that it would not retard Manager or
Frontend operation.
We believe that the Backend could use dedicated solutions for optimal
storing of files, such as the WAFL file system used by NetCache. In
our design the role of Backend could be fulfilled by any caching
software: in fact, in our first prototype implementation Backends were
Squid caches.
We say that a Backend ``owns'' the fields in the routing table which
have its identifier. The amount of fields owned by a Backend is a
variable which can be used to control the portion of all requests
which will be served by the Backend. We return to this point when we
discuss load sharing in section 5.
Management of Metadata and Routing
All object metadata and the routing table to Backends are maintained
by the Manager. Frontends ask Managers to direct object requests. If
the design is to scale well, there cannot be too many Frontends to one
Manager (for example, not more than 8 Frontends to one Manager).
Otherwise, the Managers will be forced to store too much data and
become overloaded. Therefore, when the distributed server has more
components, there should be more than one Manager.
In such a case, the stream of requests has to be divided by Frontends
among the Managers. Every Frontend has a Frontend routing
table which tells him which Manager to ask for the requested URL.
The Frontend routing table is provided for the Frontends by a selected
Manager, who acts as a consistency server. We shall say more about
this in the next section, when we discuss Manager failures.
On Figure 2 we show the situation when there are many Managers. F1 and F2 have an identical Frontend routing table, which contains the id's of 3 Managers. Only two are shown on the Figure, along with requests for which the F1 asks them. The two Managers are still shown in the form of Manager routing tables, to demonstrate why our design is not a one-level hierarchy.

On Figure 2, the Backends are not numbered. The reason for this is that the two shown requests, for URL1 and URL2, could be directed to the same Backend by the two Managers. Such a solution would make sense if there are many Frontends and few Backends in the distributed cache.
In this section, we discuss how failures of various functional components are handled in our design.
A client selects a Frontend by the use of a mechanism external to our design, such as round-robin DNS or a load balancing TCP router. The failure of a Frontend must be discovered by that mechanism. It is common knowledge that the load balancing TCP router has an advantage over the DNS server in that respect. This is because clients often cache DNS server responses. The used mechanism would also need to handle Frontend recovery. Regardless of the mechanism used, the delay and damage associated with correcting a Frontend failure will be smaller than in the proxy auto-configuration scheme, where each client must individually notice and correct the failure.
The reaction to a failure of a Backend is more complex. A Backend
owns several fields of the routing tables of the system's Managers.
After a Backend failure, the fields which it owned must be assigned to
a different Backend. After a Backend recovery, the Managers
repartition their routing arrays to add the Backend again.
To complete our discussion of Backend failures, let us remind the reader that the objects which are stored on Backends are not very popular. A Backend failure does not seriously impair the distributed cache.
The Manager is a critical component with respect to resilience. A
Manager failure means a loss of object metadata and consequently, of
direction for object requests. However, a Manager can create a backup
of object metadata on a selected group of Backends. To make this less
expensive and more resilient, the backup can be done in a
log-structured way (as in standard caching software). In this way,
the backed up metadata is always consistent, and after a Manager
failure a different Manager can pick up the backed up metadata and add
it to his own.
The Frontend routing table (described in section 4) is kept consistent
by a selected Manager, called the coordinating Manager. In
the case of a failure of an ordinary Manager, the coordinating Manager
determines which Manager should replace the failed one.
Let us stress that the coordinating Managers role is restricted to
intervention in the case of a failure or recovery. During ordinary
operation, the coordinating Manager does not have additional work or
store much additional data.
Because of the need for a coordinating Manager, another issue arises.
It is the failure of the coordinating Manager itself. We can discuss
this issue together with the situation of a failure of the only one
Manager in the system. This only Manager can also be called
coordinating.
In the case of failure of the coordinating Manager, an election takes
place. We use the well-known tyrant algorithm [7], in which all
parties participating in the election have unique numbers - and the
one with the largest number wins. If we allow all machines running
Frontends without Managers and all machines running Managers to
participate in the election, and assign to the former numbers smaller
than to the latter, then we have a solution to the stated problem. If
any Managers are present, one of them will win. If no Managers are
running, one of the Machines running Frontends will win and start a
new Manager.
In order to achieve good scalability, the system needs a certain minimal number of Managers (as stated in the previous section). Therefore, after a Manager failure, it may be necessary to start a new Manager on another machine. This should be decided by the coordinating Manager.
Frontends Replacement Policy and Filtering Algorithms
The Frontend's main memory caches cannot be very large. Therefore, a
Frontend replacement policy will need to be very efficient, because
most Frontend objects are popular and a removal from the cache might
result in significant loss. However, on the basis of investigation of
log files at the Polish national cache we can say that the 10% most
popular objects account for about 50% of all requests. The Frontend's
main memory cache does not need to be large to achieve a very good hit
rate.
The algorithm which places objects in Frontend caches needs to select
the most popular objects. It will therefore require a model of Web
object popularity or client access characteristics. Such models and
algorithms have been proposed in our paper ``A Filtering Algorithm for
Proxy Caches'' and will not be discussed here.
The main memory of all Frontends is coordinated globally by the
Managers. This is done by the use of a mechanism of popularity
counters. After a object is placed in some Frontend's cache by the
filtering algorithm, it will not be placed into another Frontend's
cache until the Manager which is responsible for that object decides
that it is sufficiently popular. Presently, we use a simple set of
popularity intervals to determine how many times a object can be
replicated in the Frontend's caches. The most popular objects will
be placed on all Frontends.
Every time a Frontend removes a object, it needs to notify the
object's Manager. These notifications are carried out every
specified time interval. In this way, the Frontend can at the same
time report to the Managers that it is still operational, and some
communication is avoided.
Here we describe how our design solves the problem of hot-spots.
Since requests for hot-spots are satisfied by Frontends, the requests
which are forwarded to Backends are for less popular objects. Because
of that, the routing of these requests by the means of a hashing
function creates a quite even distribution of load. To prove this
contention, we presents on the following table some quantitative data
about the distribution of load in 800 fields of a routing array before
and after the application of a filtering algorithm. The data were
gathered from a simulation of algorithm operation on a log file, which
lasted 2 hours. The data in the table describe the last hour.
| Percent | max-min | max-min | variance | variance | mean | mean |
| of requests | without | with | without | with | without | with |
| to Frontends | filter | filter | filter | filter | filter | filter |
| 43% | 42 | 16 | 5.2 | 2.7 | 6.6 | 3.75 |
We have said much about directing requests by the means of routing
arrays. The efficiency of this method is crucial to the performance
of the distributed cache under heavy load. The amount of load which
will be placed on the individual components of the distributed cache
(a Manager or a Backend) is controlled by the number of fields
assigned to that component in the routing array. Assigning too many
fields may lead to overloading of the component and deterioration of
distributed cache performance.
We have been able to show that a proportional division of the array is not the best method and have developed an algorithm which takes into account load information to partition the array as to minimize overall load. The algorithm has been tested in a simulator of a queuing model for a proxy cache. However, the discussion of the algorithm operation, simulator construction and performance tests is beyond the scope of this paper. It will be the subject of another paper, which is in preparation.
The described design was prototyped and its algorithms tested on real
workloads from several caches in Poland. However, none of the
workloads available to us had a sufficiently high request rate and
came from a cache with high network speed. Because of these
limitations, we cannot consider prototype testing of the design to be
sufficiently reliable. This will be one area of future work.
The prototype was implemented in Java for maximum portability and ease
of implementation. During the first set of tests and work on the
prototype we investigated the effect of a simple filtering algorithm
on Frontend hit rate. This led to a detailed investigation of
filtering algorithms, which is beyond the scope of this paper.
The functional components of a distributed cache to cooperate closely to improve performance. We have investigated techniques for better global memory management and increased the resilience of the distributed cache. All the described improvements were made possible by an increase of functionality due to the distributed design of our cache. The distributed cache is in several respects optimized to reduce communication overhead. The amount of data which is kept consistent is minimal. We believe that a production implementation of the presented design is feasible with the advent of high-speed, switched local networks.