Linking Cache Performance to User Behaviour.

Ian Marshall and Chris Roadknight

BT Research Laboratories, Martlesham Heath, Ipswich, Suffolk, UK. IP5 7RE

{marshall,roadknic}@drake.bt.co.uk

Abstract.

The performance of HTTP cache servers varies dramatically from server to server. Much of the variation is independent of cache size and network topology and thus appears to be related to differences in the user communities. Based on analysis of a range of user traces, a simple model of individual users is proposed. This model could be used as the basis of a range of user community models, and thereby enable the accurate prediction of cache performance for a wide range of communities.
Keywords. Traffic, Cache, Internet, Users

1.Introduction.

HTTP caching has been shown to deliver considerable benefits in network utilisation and user perceived performance [BAE97] [ABR95]. In order to optimise the benefits it is important to develop accurate models of cache behaviour. To our knowledge, accurate models have yet to be fully defined. The performance of HTTP cache servers varies dramatically from server to server and invariants have proved hard to identify. Many of the metrics proposed have single sided probability distributions and are thus not amenable to simple analysis based on statistical means. Much of the observed variability is independent of cache size [DUN97] and network topology [ARL97] and appears to be related to differences in the user communities of the caches. It therefore seems desirable to develop a good model of user behaviour. Such a model could be derived by estimating proportions of users of different types in a particular community. If it could be shown that user behaviours form a distribution with a well-defined mean (many human metrics, such as PSTN calling patterns, follow Poisson distributions [BOU88]) it would be possible to build simple models based on a typical user. Unfortunately studies of the behaviour of individual users are not readily available, and it is not yet possible to quantify the distribution of behaviours. This is probably because it is not possible to analyse the behaviour of anything other than large (atypical) groups of users in the anonymised statistics published by some cache operators [e.g. ftp://ircache.nlanr.net/Traces/]. In order to build reliable statistics it is necessary to encourage the publication of long-term traces for small and medium users.
In this paper we report the first results from a study of individual user behaviour that we have undertaken. In section 2 we illustrate the analysis that is possible using the published daily statistics from a large cache. In section 3 we report some longer-term results for a single user on a small cache of our own. We hope that others will find these results useful and will offer similar traces. In section 4 a simple user model is proposed. In section 5 we combine the single user results and the results derived from the published statistics examined in section 2 and propose a user community model based on the user model proposed in section 4. The model is currently forced to make a number of assumptions. If individual user traces were available from a wide range of sources these assumptions could be validated directly. We use the model to derive some quantitative predictions of cache behaviour. Comparison with observed results shows that our assumptions are reasonable, however we cannot yet be confident about the range of applicability of the model.
 

2. Analysis of Daily statistics

The clearest link to user choices and behaviour in published cache statistics is the host popularity curve (or locality plot). This curve plots the number of requests for files from a given site, against the popularity ranking of the site. Previous studies [CUN95] have shown that the curve usually approximates to Zipf’s law. Zipf's First Law [ZIP49] states that if the frequencies of occurrence of each item are ranked by frequency then the frequency of the second most popular item will be half the frequency of the most common item. The frequency of the third most popular item will be a third of the frequency, and so on. That is:

Frequency of RankingN = (Frequency of Ranking1)/N

In order to get a clear picture of user activity however, it is necessary to know the target file for each request, rather than just the target host. This is because some users could access a large range of files on a small range of hosts, and some users could repeatedly access a small range of files on the same range of hosts. This would lead to identical popularity curves but very different results for metrics such as hit rate, and number of cached files. Unfortunately the statistics published by most caches do not provide any detail of the files accessed by individual users. However, we were able to obtain a log which reported the target URL for each request, from the operators of the Funet (Finnish University and Research Network) cache in Finland [http://www.funet.fi/funet/]. The Funet cache is a primary cache serving universities, polytechnics and some research organisations in Finland. The cache has about 2500 users, 500000 requests per day, a hit rate of ~30% and a popularity curve which is approximately Zipf. The cache appears, on these metrics, to be similar to many primary caches run by other universities or group of academic institutions [e.g. http://www.swin.edu.au/proxy/usage/]. The analysis reported in this section is based on the log files obtained from Finland.
Figure 1 shows the user community of the cache sorted by number of requests generated in a single day (20/1/98). The logarithmic distribution of request rates (figure 2) is approximately Poisson with a mean of 101 requests (where the mean is defined as the inverse log of the mean of the logarithmic distribution in fig. 2). It is thus clear that one of the obvious metrics for user behaviour has a meaningful maximum likelihood, which could be used in a simple model. However, in order to characterise user behaviour with respect to caching it is also necessary to understand the locality of the files accessed by individual users. Many file requests are automatically generated by embedded objects such as .gif files, rather than by the user. As we are attempting to analyse user behaviour rather than the composition of web pages we have attempted to eliminate the auto-generated requests from the analysis. Therefore all requests which occurred within a second of the previous request and were made to the same host as the previous request have been filtered out of the data.
Figure 1. Usage variations of clients using the Finland proxy cache.
 
 
Figure 2. Users in different log based usage groups.
Figure 3 shows the host popularity data for the largest two users of the cache. A line with the gradient of Zipf’s law is also shown as an aid to the eye. It is apparent that user 1 (no.of requests) visits fewer hosts than user 2 (no.of requests), and that whilst user 1 approximately follows Zipf’s law user 2 does not. In fact user 2 shows weaker locality since his most popular sites are less popular than Zipf would predict. We chose to analyse large users since performance parameters based on locality, such as hit rate, are known to be partly stochastic in origin. In other words it seemed a good idea to use large sample sizes. In fact it is possible to analyse much smaller users but the benefits of such analysis are debatable for two reasons. Firstly common sense would suggest that data from only one day will consistently underestimate the locality shown by users because daily visits to sites (e.g. news sites) will not be represented. Secondly smaller than average users do generate samples which are too small after only one day, so the selection of users can never be fully representative.
Figure 3. Popularity curve for hosts accessed by two users of the Finland cache.
In order to illustrate the need for data on file requests rather than just the usual data on host requests, the file request popularity curve for the same two users is shown in Figure 4, again with the Zipf gradient shown as an aid. It is clear that user 1 is visiting more pages than user 2, even though he visited fewer hosts, illustrating the importance of the file data. It is also clear that page request popularity does not follow Zipf’s law, since both curves now show less locality than Zipf would predict. The reason for this is likely to be that data has only been analysed for one day and thus the popularity of the most frequently visited sites has been underestimated as discussed above. It also important to note that the request gradient for hosts will always be steeper than files because the former is usually an aggregate of the latter. This means that there is always more files requested than hosts visited and the most popular host is always more popular than the most popular file.
Figure 4. Popularity curves for files accessed by two users of the Finland cache.
Even with access to the file request data from the Funet cache we are unable to draw any strong conclusions. The reasons can be summarised as follows:
1. Only large users generate enough requests to be worthy of analysis and they are not likely to be representative of small users.
2. A user identity cannot be traced for long time periods since the data is freshly anonymised on a daily basis. This is particularly important as there is likely to be significant long range dependency in user activity based on the users memory which will be lost if only the activity of a single day is analysed.
In an attempt to overcome these issues we have assembled a long term data set on a small cache of our own. The initial analysis of some of the data from our own cache is reported in section 3 below.

3.Analysis of long term statistics

The cache which is the source of the data presented in this section has been used by 5 individuals over the course of a year. The cache is small since only users who are members of our research team, and are happy for their logs to be analysed in non-anonymised form have been encouraged to use it. Over a 26-week period (25/8/97 – 22/2/98) the cache generated 14000 requests and sustained a hit rate of around 20%. The cache is sited at the subnet router for the team’s network and is thus caching requests to sites in other buildings on the local Intranet. These local requests do not appear in the data for larger caches and should possibly be eliminated. In this paper these requests have not been eliminated as it is difficult to know where to stop eliminating sites from a global Intranet such as that operated by BT. We do not feel that the local requests significantly alter the principal findings of the work.
Table 1. Users statistics for biggest user of small scale cache.
Table 1 shows the usage statistics of the most consistent user of the cache over the 26-week period. This user was using the WWW through the cache whilst in the office, primarily as a research tool.
On days when the WWW was used the user spent between one and three hours browsing and generated up to 345 page requests in a session. The typical number of initial page requests (auto generated requests for embedded files have been filtered as above) on a working day was 100. This places the user in the most probable part of the usage distribution observed on the Finland cache. One cannot on this basis alone claim that the user is representative, but at least we can be sure we are analysing a user with significant differences to the users analysed in section 2, and can derive some preliminary inferences about the range of user behaviours which were not previously possible.
On the basis of the amount of time this user spent in WWW based activity it seems unlikely that the requests attributed to the heavy users on the Finland cache were entirely generated by human end users. 14% of the requests generated by our user were actually automated requests. The user was using Internet explorer and has experimented with the subscription mechanisms this browser provides. In the 26 week period the user had 6 subscriptions for a period of about one month. One of the subscriptions was checking the currency of bookmarked pages. Had these subscriptions been extended over the entire 26 weeks 50% of the traffic would have been "robot" generated. It is conceivable that an enthusiastic subscription user would have a much higher proportion of robot traffic. Table 2 shows the ten most popular sites for all three users.
 
Table 2. The 10 most popular hosts for 3 users.
As expected from previous reports [GWE96] the favourite sites for our moderate user include a local directory, some intranet pages, some news pages, some experimental research activity, and developer sites of large software companies. On the other hand the popular sites list for Finland user 1 is heavily biased towards the sites which are default subscriptions in internet explorer. This tends to support our supposition that the high request rates are robot assisted. It also leads us to question the benefits of the subscription mechanism which is potentially increasing web traffic by up to an order of magnitude.
Figure 5. Comparison of popularity curves over different timespans.
Figure 5 shows the host popularity data for our cache analysed over progressively increasing periods of 1 week, one month and 6 months. As the sample size and time period increase the results tend towards following Zipf’s law. It is not clear whether this is an artifact of small sample sizes or whether it indicates a very long-range dependency of more than 6 months. It is to be expected that there will be significant self-similarity at long timescales as the users' memory and interests are quite persistent, but it is also clear that for less popular sites the statistics will be very poor.
Figure 6 compares the page popularity curves for the two Finland users and for our local user (with auto generated requests filtered as before). It is clear from figure 6 that our local, moderate user has a popularity curve with a similar gradient to that of Finland user 2, but Finland user 1 has a curve with a steeper gradient. Despite the results being taken over a longer time period, our user still exhibits less locality than would be predicted by Zipf’s law. We suspect that this may be because the lifetime of pages is significantly shorter than that of sites so the usage of the most popular pages cannot build up over long periods as it does for the most popular sites. It is equally possible that it is simply because our user is different. To resolve this question we need access to further long term user traces
Figure 6. Popularity curves of three users at two sites.
Figure 7 shows the variance of the "auto hit rate" generated by our user plotted against sample size. The "auto hit rate" is the rate of request for files previously requested by the user and successfully cached. As observed elsewhere [ROA98] for the aggregate hit rate on much larger caches the variance does not decay rapidly with sample size. This is consistent with long range dependency (or Fractal behaviour). The Hurst parameter can be estimated (unreliably since the sample is quite small) from the data at ~0.7. This is consistent with estimates derived elsewhere [MAR98] from published cache statistics.
Figure 7. Decline in hit rate variance with increasing sample size.
Two observations are worthy of note
1. The sample size we have used is sufficiently large to enable the stochastic factors to be averaged out. It seems that a sample of ~2500 requests is sufficient
2. The long-range dependency in hit rate observed previously is apparent in a single user trace. This supports the conclusion that one source of the Fractal properties observed in caches is the behaviour of the users. The origin is most likely to be the users memory.
Figure 8. Change in theoretical hit rate over 10 working days.
In figure 8 we show the build up of the auto hit rate of our user with time. The rate plateaus after just over 1 working week, indicating that the characteristic time of the dominant long-range dependency should be around 1 week. The data also indicates that the ideal length of user trace is at least 1 week even for large users. For typical users with ~100 requests per day a 25 day sample would be ideal to overcome stochastic effects. Users with 20 requests per day would ideally require a 6 month long trace. Users with less than 10 requests/day will be hard to analyse accurately as the sample would need to be 250 days long, i.e. it would exceed the typical lifetime of web pages. However since these users represent the tail of the distribution and have minimal impact on cache performance they can almost certainly be safely ignored.
Based on the results reported in sections 2 and 3 we now move on to propose a simple model of users

 

4. User model

The simplest conceivable model of a web user has one assumption and one parameter. The assumption is that page popularity will follow Zipf’s law as do other popularity curves for human beings [BER92]. The parameter is usage level. The results we have reported indicate that this is not sufficient to explain the variability of user behaviour – since in practice the users do not follow Zipf’s law and users deviate in ways which are not always the same (e.g. the two Finland users are different). In addition it is clear that users are a significant source of long range dependency. This leads us to the assertion that at least 3 parameters are required to model the observed behaviour
1. Usage
2. Auto hit rate
3. Hurst parameter derived from auto hit rate
Of course for modelling situations where the burstiness is not important the third parameter can be neglected and we are left with a simple two-parameter model
Auto hit rate is chosen as the second parameter because it is the most obvious single number metric for the locality of users requests, and it has no dependencies on other users or the properties of the WWW. In addition it is plausible to argue that it will show a distribution with a well defined maximum likelihood, as does the usage parameter. This characteristic is crucial in the development of simple community models such as that proposed in the next section.
 

5. Community model

Based on a two parameter user model as proposed in section 4 and additional input from the observations we can propose a method for building user community models. This method is intended as an illustration of our objectives rather than as a definitive proposal due to the small amount of data available.
Assumptions
1. the distribution of request rates is Poisson with a mean of 101 as observed in the Finland data
2. the distribution of auto hit rates is Poisson with a mean of 0.2 (the observed auto hit rates of our user) and independent of request rate
If the community is mostly human the above assumptions are likely to be true. However if there is a substantial proportion of robot traffic there are likely to be at least two independent distributions with different means. The impact of robots will be the subject of a further study.
Since we have assumed Poisson distributions it is sufficient to have studied users enough to characterise the means of the above distributions accurately. The community model is not dependent on individual user properties. However, we also need a characterising parameter – the degree of overlap between users in a given community. We propose an overlap parameter, v, which can be derived empirically using
(1)
We anticipate that caches with similar user communities will have very similar values for v, and that v will have a useful probability distribution, which can be treated as the sum of the distributions for v for a small set of community types (e.g. researchers, home users, business users, robots, mixed). Each type represents a different mixture of user objectives, and should thus represent a defined market sector for WWW products. We are currently attempting to calculate v for a wide range of caches with published data in order to validate this hypothesis.
v can be used to make predictions, for example it can be shown that:
Using:
meanrequestrate= 101
meanautohitrate = 0.2
and taking observations from the Finland cache for observedno.of cachedpages = 161628, workingdays = 1, and no.ofusers = 2284, we derive v = 0.124
Applying this value of v to the Finland cache we get no.ofhits = 69010
The true value of no.of hits was 70999 so our model has predicted a reasonable answer.
As it stands we have used values and made assumptions which due to lack of user traces are not statistically safe. However we have shown that the assumptions lead to reasonable results, thus we are confident that further work obtaining and analysing data to strengthen our assumptions will be worthwhile. We have also used a linear model to illustrate our objective. It is probable that reality is non-linear, but there is not sufficient data to justify a non-linear model at this stage.
 

6. Conclusions

Based on a preliminary analysis of user behaviour in a daily cache log and in a long term trace it is clear that long term traces, lasting at least a week, of individual users file requests are required to develop models of WWW caches based on user behaviour. If such traces were widely available they would form the basis of a simple but powerful approach to modelling cache performance. It is possible that analysis of a large number of traces would show that user properties have a probability distribution with a maximum likelihood and consequently, there is a meaningful definition of a typical user of the WWW. This would enable simple typed community models to be built where the community types map to user objectives. The user objectives could be derived from marketing studies. A minimalistic model has been proposed to illustrate this approach to cache modelling.
 

7. Acknowledgements.

We would like to thank Pekka J¬rvel¬inen for supplying us with anonymised logs for the Funet proxy cache.
 

8. References

[ABR95] M. Abrams, C.R. Standridge, G. Abdulla, S. Williams and E.A. Fox. Caching Proxies: Limitations and Potentials. Proc. 4th Inter. World-Wide Web Conference, Boston, MA, Dec. 1995.

[ARL97] M. Arlitt and C. Williamson. Internet web servers: Workload Characterization and performance implications. IEEE/ACM Transactions on Networking. Vol. 5, No. 5. October 1997.

[BAE97] M. Baentsch, L. Baum, G. Molter, S. Rothkugel and P. Sturm. Enhancing the web’s infrastructure: From caching to replication. IEEE Internet Computing. March 1997. P. 18-27.

[BER92] J. Beran. Statistical methods for data with long-range dependance. Statistical Science. Vol. 7. No. 4. p404-427.

[BOU88] J. Boucher. Voice teletraffic systems engineering. Chapter 2. ISBN 0-89006-335-4.

[CUN95] C.A. Cunha, A. Bestavros, and M.E Crovella. Characteristics of WWW client-based traces. Technical report TR-95-010, Boston University Department of Computer Science, April 1995.

[DUS97] B.M. Duska, D. Marwood and M.J. Feeley. The measured access characteristics of WWW proxy caches. 1997. URL: http://www.cs.ubc.ca/spider/marwood/Projects/SPA/

[GWE96] J Gwertzmann and M Seltzerle, People, Places, and Things: The Next Generation Web. COMPCON 1996 65-70

[MAR98] I. Marshall and C. Roadknight. Characteristic times for long range dependency in WWW page requests – in preparation

[ROA98] C. Roadknight and I. Marshall. Variations in cache behavior. Proceedings of WWW7 (in press).

[ZIP49] G. K Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA, 1949.