Topic: Experiences in implementing national caches, regional caches and Institution/site caches

Analyzing the Behavior of a Proxy Server in Light of Regional and Cultural Issues

Virgílio F. Almeida      Márcio G. Cesário       Rodrigo C. Fonseca
Wagner Meira Jr.        Cristina D. Murta
{virgilio, magc, rfonseca, meira, cristina}@dcc.ufmg.br
Computer Science Department - Federal University of Minas Gerais, Brazil

Abstract

The Internet is not only a technological concept but also a cultural and linguistic one. Therefore, to understand the performance behavior of the WWW, the role of regional, cultural and social issues in traffic patterns must be analyzed. The way users access the Internet depends heavily upon the telecommunication infrastructure and social context of each country. Based on the analysis of different proxy server logs, this paper presents evidence of the influence of these issues on the performance of a caching proxy server.

1 Introduction

The Internet and the WWW are not only technological concepts but also cultural and linguistic ones [6]. Therefore, to understand the performance behavior of the WWW, one must also understand the role of regional, cultural and social issues in traffic patterns. The way users access the Internet depends heavily upon the telecommunication infrastructure and social context of each country. Based on the analysis of different proxy server logs, this paper presents evidence of the influence of these issues on the behavior of a caching proxy server. Our approach is to examine the performance statistics drawn from logs of a busy Brazilian proxy server and interpret the measures in light of geographical and cultural issues.

History, geography, culture, language and economics are features that shape the regional identity of a nation. In this paper, we investigate the quantitative behavior of a caching proxy server in light of these features. In reference [9], an American historian discusses the cultural patterns of Brazilian society. The author selects a group of features that he considers to be at the core of Brazilian society and culture: class, color, family, sexuality, music, soccer, religion, carnival, television, film and literature. He concludes: ``These are features that define Brazil and Brazilians.'' Accordingly, we use the geographical and cultural features of Brazil to interpret the statistics drawn from the logs of a busy caching proxy server.

2 The Internet infrastructure in Brazil

The history of the Internet in Brazil dates back to 1989, with the implementation of the National Research Network's backbone (RNP), which provides Internet access throughout the country. Figure 1 shows its status as of March 1998. Points Of Presence (POPs) were created in most states of the country to provide universities and institutions with a link to the Internet. Like other countries, Brazil has seen exponential growth of the Internet in its territory. As of January 1997, Brazil ranked 19th in the world in number of hosts, and 3rd in the Americas, after the US and Canada. According to [20], the number of .com hosts in the Brazilian national domain (.br) grew 1947% from January 1996 to July 1997.


  
Figure 1: RNP's backbone (Source: http://www.rnp.br/1.3.bone.html)

The POP in the state of Minas Gerais (also known as POP-MG), located in the city of Belo Horizonte, serves almost all the users in the state. It has connections with the main Brazilian backbones and direct links to the United States, which together provide a bandwidth of 9 Mbps. The average total traffic rate measured is close to 6 Mbps. POP-MG is the main gateway to the Internet for twenty-five universities and a hundred business organizations, including Internet Service Providers (ISPs) and research institutes.

2.1 The POP-MG cache proxy server

POP-MG's cache server became operational in March 1997, with the goal of reducing Web response time and bandwidth utilization. It is configured as a two-level hierarchy, as shown in Figure 2.

  
Figure 2: The Cache Proxy Server Hierarchy in the POP-MG

The first level consists of servers that answer only end users' requests. Ideally, this level would be located at ISPs, universities and other sites close to the final users. Because most of its customers do not have their own cache server, POP-MG hosts the first level itself and designed its caching hierarchy with two levels of servers.

If an object requested by an end user cannot be found locally in a first-level proxy, this server tries to fetch it from the second level. The second-level servers are responsible for fetching the object from the source when a miss occurs. They are all located in POP-MG, and they answer only requests from first-level servers. To balance the load, POP-MG has three second-level servers, and requests are divided among them according to top-level domain. One server answers ``.br'' requests, another answers ``.com'' requests, and the third is responsible for all other domains (denoted ``!.com !.br''). The second-level cache servers are said to be ``parents'' of the first-level ones.
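The domain-based division of requests among the three parents can be illustrated with a minimal sketch. It is not POP-MG's actual configuration: the parent names `parent-br`, `parent-com` and `parent-other` and the function `select_parent` are hypothetical, and the sketch only assumes that the parent is chosen from the last label of the requested host name.

```python
from urllib.parse import urlsplit

# Hypothetical parent names; the real host names are not given in the text.
PARENTS = {"br": "parent-br", "com": "parent-com"}
DEFAULT_PARENT = "parent-other"  # serves the "!.com !.br" requests


def select_parent(url: str) -> str:
    """Return the second-level parent responsible for the URL's top-level domain."""
    host = urlsplit(url).hostname or ""
    # The top-level domain is the last dot-separated label of the host name.
    tld = host.rsplit(".", 1)[-1].lower()
    return PARENTS.get(tld, DEFAULT_PARENT)
```

With this rule, a miss on `http://www.ufmg.br/` would be forwarded to the ``.br'' parent, while a request to a `.org` or `.pt` host falls through to the third server.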

All of POP-MG's cache servers are Pentium-based machines running FreeBSD [15] and Squid [11]. Table 1 shows their configuration and load, measured in number of connections per hour, including HTTP, FTP and ICP requests. The total average load of the caching proxy servers is around 1,800,000 requests per day.

 
Table 1: Configuration of the Cache Proxy Server
Server       RAM      Disk   Cache Software     Connections/Hour
.br          128 MB   8 GB   Squid 1.1.20       17000 avg / 40000 peak
.com         128 MB   8 GB   Squid 1.1.20       21000 avg / 45000 peak
!.br !.com   64 MB    4 GB   Squid 1.novm.17    7000 avg / 15000 peak
1st level    128 MB   4 GB   Squid 1.1.17       16000 avg / 40000 peak
1st level    192 MB   4 GB   Squid 1.1.17       16000 avg / 40000 peak

3 Cache Proxy Statistics

Cache proxy servers throughout the world exhibit different access patterns. Using data available at [10], we compiled statistics for cache proxy servers in several countries, as shown in Table 2. For each domain, we present the total number of accesses (i.e., count), its share of the accesses to all domains, and the hit ratio. The figures for the Brazilian proxy servers were obtained from the logs of POP-MG. All the statistics refer to data collected from November 17 to November 23, 1997. The USA data was collected from all servers that make up the NLANR hierarchy [10]. Table 2 shows that in five countries the most accessed domain is the national one: USA, Brazil, Japan, Italy, and Taiwan. More than 50% of the accesses in the US, Brazil and Japan are directed to sites in their national domains. Table 2 also shows that the hit ratio for the national domain is never lower than that of the other listed domain, and is higher in every country except Taiwan, where the two are equal.
 
Table 2: Access patterns throughout the world
              Most accessed domain             2nd most accessed domain
Country       Dom.  Counts      %all  %hit     Dom.  Counts     %all  %hit
Austria       com   181,610     38    14       at    141,273    29    22
Belgium       com   1,346,956   52    51       be    385,759    15    63
Brazil        br    2,146,625   50    58       com   1,402,558  33    36
Italy         it    28,630      39    21       com   23,911     32    20
Japan         jp    206,037     75    48       com   53,533     19    36
Netherlands   com   352,883     43    44       nl    234,992    29    53
Portugal      com   1,148,262   55    27       pt    448,275    21    37
Taiwan        tw    6,285       49    15       com   3,661      29    15
USA           com   14,125,456  71    35       net   1,474,418  7     22
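The per-domain columns of Table 2 (count, share of all accesses, hit ratio) could be derived from a proxy log along the following lines. This is a sketch under stated assumptions, not the procedure actually used: it assumes Squid's native access log layout, in which the fourth whitespace-separated field is the result code and the seventh is the URL, and it simplifies by counting any result code containing `HIT` as a hit. The function name `domain_stats` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_stats(lines):
    """Map each top-level domain to (count, % of all accesses, % hit ratio)."""
    requests, hits = Counter(), Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip malformed or truncated records
        result_code = fields[3].split("/")[0]  # e.g. TCP_HIT from TCP_HIT/200
        host = urlsplit(fields[6]).hostname
        if not host:
            continue  # skip records whose URL has no host part
        tld = host.rsplit(".", 1)[-1].lower()
        requests[tld] += 1
        # Simplifying assumption: every *HIT* result code counts as a cache hit.
        if "HIT" in result_code:
            hits[tld] += 1
    total = sum(requests.values())
    return {
        tld: (n, 100.0 * n / total, 100.0 * hits[tld] / n)
        for tld, n in requests.items()
    }
```

Sorting the resulting dictionary by count gives the most accessed and second most accessed domains for a given server.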

We also collected the hit ratio for the caching proxy hierarchy of POP-MG. It is around 65%, higher than the average hit ratios published for England and Japan [17,16], which are below 60%. The quantitative behavior of a cache proxy server can be analyzed through statistics such as the number of requests, the number of distinct objects, the number of accesses per object and the hit ratio, which the next section reports for POP-MG.

4 Analysis of Cache Proxy Server Logs

We analyze the influence of regional and cultural characteristics on the behavior of proxy servers by examining 4,235,311 requests that arrived at one of the first-level proxy servers at POP-MG. The requests correspond to a seven-day operation interval and come from both commercial (40%) and educational (60%) organizations. The amount of transferred data is more than 25 gigabytes. We also analyzed data from proxy servers in two other locations, namely the National Laboratory for Applied Network Research (NLANR) in the United States, and Esoterica, a large ISP in Portugal. We chose Portugal because Brazil and Portugal share the same language and many cultural characteristics. In order to have a comparable number of requests, we gathered data corresponding to one day and seven days of operation of those two proxy servers, respectively.

 
Table 3: Statistics for accesses to the first level proxy of POP-MG
                     All                     .br                     .com
Requests             4,235,311 (100% Req)    2,146,625 (50.2% Req)   1,402,558 (33.2% Req)
Objects              1,079,044 (100% Obj)    319,937 (29.7% Obj)     518,256 (48.0% Obj)
Accesses/object      3.92                    6.70                    2.70
1-access objects     709,759 (65.8% Obj)     180,385 (16.7% Obj)     349,737 (32.4% Obj)
Non-first accesses   74.52%                  85.10%                  63.05%
Hit ratio            47%                     58%                     36%

Table 3 displays workload statistics for POP-MG's log. The column labeled ``All'' covers all requests handled by the proxy during the period, and the other two columns cover the two most accessed domains, .br and .com, which together account for more than 80% of the accesses.
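Two rows of Table 3 follow directly from the request and object counts: the accesses per object, and the share of non-first accesses (every request beyond the first reference to each object). The sketch below shows that arithmetic; the function name `derived_stats` is ours, and the results agree with the table to within 0.01.

```python
def derived_stats(requests: int, objects: int):
    """Return (accesses per object, % of requests that are non-first accesses)."""
    accesses_per_object = requests / objects
    # Each distinct object is referenced for the first time exactly once,
    # so the remaining requests are repeated (non-first) accesses.
    non_first_pct = 100.0 * (requests - objects) / requests
    return accesses_per_object, non_first_pct


# Counts taken from Table 3.
for label, req, obj in [
    ("All", 4_235_311, 1_079_044),
    (".br", 2_146_625, 319_937),
    (".com", 1_402_558, 518_256),
]:
    per_object, non_first = derived_stats(req, obj)
    print(f"{label}: {per_object:.2f} accesses/object, {non_first:.2f}% non-first")
```

The non-first share is an upper bound on the achievable hit ratio, since the first access to an object can never be served from the cache; this is consistent with the measured hit ratios (47%, 58%, 36%) lying below the corresponding non-first percentages.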

4.1 Language Influence

The first observation from Table 3 is the high percentage of accesses to the .br domain. Based on the results of the Brazilian Internet User Survey [14] and on the telecommunication infrastructure, we offer the following explanations: (1) only 58% of the users speak English and are thus able to use English-language sites; (2) most Brazilian users are interested in news (80%), scientific information (67%), music (67%), and adult entertainment (61%), topics that are heavily related to regional culture; and (3) accesses to Brazilian sites are usually faster, since they do not require traversing busy international links.

The second observation regards the average number of accesses per object and the hit ratio. Both are much higher for Brazilian objects (6.7 and 58%, respectively) than for objects from the .com domain (2.7 and 36%, respectively). This phenomenon - the hit ratio for the national domain being higher than for other domains - has been consistently observed in all countries from which we analyzed data. For Brazil, it can be explained not only by the number of accesses to .br sites, but also by the fact that the number of unique Brazilian objects (319,937) is significantly smaller than the number of unique objects from the .com domain (518,256).

4.2 Cultural Influence

By examining the POP-MG, NLANR and Esoterica logs, we found a significant difference regarding the popularity of HTTP-based chat sites in Brazil. Accesses to sites with chat applications correspond to 4.9% of the total accesses recorded in the POP-MG log. In the US, requests for chat sites represent 1.2% of the accesses at NLANR. In Portugal, requests for chat sites represent 1.17% of the total number of requests.

Table 4 shows access patterns to HTML-based chat sites, contrasting Brazil and Portugal. The row labeled National gives the percentage of the accessed chat pages that are in the national domain (.br or .pt, respectively). US gives the percentage of these pages that are in the .com, .net, .org or .edu domains, and Reciprocal gives the accesses from Brazil to Portuguese pages and from Portugal to Brazilian pages. Not only are chat sites much more popular in Brazil, but most of them are also located in Brazil. The accesses from Brazil to Portuguese chat sites are almost negligible, whereas the accesses in the opposite direction are very significant. Our explanation for these statistics stems from the fact that Brazilian TV soap operas are very popular in Portugal. As a consequence, Brazilian TV sites get a lot of attention from WWW users in Portugal.

 
Table 4: HTML-based chat access patterns
             POP-MG-Brazil   Esoterica(PT)
National     87%             7.3%
US           6.2%            79.4%
Reciprocal   0.5%            10.5%

It is also worth noting that Web chat sites are among the most popular sites in Brazil. This characteristic is important for caching projects, because chat pages are dynamic and cannot be cached.

4.3 Infrastructure Influence

In Brazil, telecommunication services are much more expensive than in the US. Thus, most Internet users tend to navigate the WWW in periods when telephone rates are lower. As a consequence, we note the existence of heavy load peaks that coincide with the low-rate periods. Using the logs, we calculated the hourly arrival rates for the proxy servers of NLANR and POP-MG. We noticed a high variability in the load, which we attribute to the different telephone charging policies in the two countries (i.e., Brazil and the US). The traffic patterns observed at POP-MG follow the variations in the phone rates. Figure 3 shows the combined utilization of all of POP-MG's first-level proxy servers (in number of accesses per hour) on a day when the phone rates in Brazil are less expensive. On Saturdays, the pricing scheme is the following: from 6:00AM to 2:00PM, a telephone customer pays the normal rate per minute (i.e., it is not a flat rate); from 2:00PM to 11:59PM, a customer is charged for only one minute, no matter the duration of the phone call. As can be seen in Figure 3, there is a load peak just after 2:00PM. During the least expensive period, the peak arrival rate is 116% higher than the day's average rate. In the NLANR servers, the peak exceeds the average by only 46%. This type of infrastructure information is therefore useful for capacity planning of proxy servers, which should be able to handle the load peaks.
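The peak figures quoted above can be computed from per-hour request counts as follows. This is a minimal sketch: the function name `peak_over_average` is ours, and it assumes that both percentages (116% and 46%) express how far the busiest hour exceeds the day's mean hourly rate.

```python
def peak_over_average(hourly_counts):
    """Return by how many percent the busiest hour exceeds the mean hourly rate."""
    if not hourly_counts:
        raise ValueError("at least one hourly count is required")
    average = sum(hourly_counts) / len(hourly_counts)
    return 100.0 * (max(hourly_counts) - average) / average
```

Under this definition, a value of 116% means that the server must sustain about 2.16 times its average hourly load at the peak, which is the quantity a capacity plan would need to accommodate.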

4.4 Document Popularity

The highly uneven popularity of various Web documents has been noted in many references [5,7,3,18], which showed that Zipf's Law applies to document popularity versus rank. The law states that if one ranks the words in a given text by popularity (the rank being denoted by $\rho$) according to their frequency of use (denoted by P), then

\begin{displaymath}
P \sim 1/\rho
\end{displaymath}

Our logs show that Zipf's law does not apply to documents requested from a proxy cache. This is demonstrated in Figure 4 for the 4,235,311 requests recorded in the POP-MG log. The figure shows a log-log plot of the number of references to each document as a function of the document's rank in overall popularity. The straight line represents the Zipf's law function and does not fit the actual data from the logs.
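One way to check the fit numerically, rather than visually, is to estimate the slope of the rank-frequency curve on log-log axes; under Zipf's law as stated above ($P \sim 1/\rho$) the slope is -1. The sketch below uses an ordinary least-squares fit, which is our choice for illustration and not necessarily the method behind the figure; the function name `zipf_slope` is hypothetical.

```python
import math
from collections import Counter


def zipf_slope(urls):
    """Least-squares slope of log(reference count) versus log(popularity rank)."""
    counts = sorted(Counter(urls).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct documents")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope far from -1, or a curve that is visibly not a straight line on log-log axes, indicates that the simple form of the law does not describe the workload.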

  
Figure 3: POP-MG's first level proxy server's utilization


  
Figure 4: Concentration of references in a Brazilian server and a Portuguese server

5 Related Work

Characteristics of Web cache workloads have been studied in many references [12,5,18]. Arlitt et al. [4] analyzed six Web server workloads and identified ten invariants, that is, ten characteristics that are common across all the data sets studied. A similar study of proxy cache workloads was done by Abdulla et al. [12]. They identified ten invariants that hold true across the ten workloads analyzed, which come from five different organizations. There are also studies of client-based traces [5,8]. None of these works analyzes the influence of geographical, social or cultural factors on proxy cache performance, and all of the logs studied except one are from North America.

Gwertzman and Seltzer [13] proposed geographical push-caching, a cooperative caching strategy based on knowledge of the network topology (proximity of the hosts) and the access history of the documents. Nabeshima [16] proposed the concept of a domain cache, based on experiments in the Japan Cache Project. The Japan Cache Server is dedicated to answering accesses to .jp (Japan) domain servers. The author concludes that a domain cache is an effective cache server operation method. The high cache hit ratio obtained in that cache server can be seen as a confirmation of our results. Neal [17] wrote about the Harvest Object Cache in New Zealand and pointed out how New Zealand's geographical characteristics, such as the small size of the country and its considerable distance from the Western countries, and regional issues, such as the high cost of bandwidth between New Zealand and the west coast of the United States, influenced the creation of a cache project. Pitkow [18] presents a summary of WWW characterizations and points out the need for characterizations of international WWW users. The presence of the Internet in Brazil is recognized in a recent article [19], which points out that Brazil is the first country after the U.S. and Canada to be included in the bimonthly directory of Internet Service Providers published by Boardwatch Magazine [1].

6 Conclusions

In this paper, we have analyzed the logs of a busy cache proxy server in light of geographical and cultural issues, such as language, social interaction and telecommunication infrastructure, among others. We noted a correlation between national characteristics (taking Brazil as our example) and the quantitative behavior of a cache proxy server, represented by the percentage of accesses to the national domain, the hit ratio for each domain and the traffic peaks. As noted in [9], Brazilians naturally like to chat, and this fact is reflected in a high percentage of accesses to chat sites, as compared to an American server. Language and interest in regional information, as reported by a Brazilian WWW user survey [14], as well as the limited bandwidth of international links, explain the high percentage of accesses from Brazilian users to pages in Brazilian sites and the high hit ratios observed in the cache of POP-MG. The charging policies adopted by the local phone companies - a strong geographical factor - are found to have a significant influence on the traffic patterns of POP-MG's cache server. These facts, drawn from the cache proxy server logs, provide useful information for the design of efficient regional caching infrastructures. For example, one can use this type of information to define the architecture of a cache proxy hierarchy (e.g., domain-based caching), as well as to size cache capacity to handle load peaks.

Acknowledgments

The authors would like to thank the ISP Esoterica - Novas Tecnologias de Informacao, S.A. for allowing them to obtain statistics from its access logs.

7 Bibliography

1
Boardwatch magazine, July 1997.

2
V. Almeida, M. Cesário, R. Fonseca, W. Meira Jr., and C. Murta.
The influence of geographical and cultural issues on the cache proxy server workload.
http://www.dcc.ufmg.br/anades/submissions/habits/296.html .

3
Virgilio Almeida, Azer Bestavros, Mark Crovella, and Adriana de Oliveira.
Characterizing Reference Locality in the WWW.
Technical Report 96-011, Boston University, Computer Science Department, 1996.

4
Martin F. Arlitt and Carey L. Williamson.
Web Server Workload Characterization: The Search for Invariants.
In Proceedings of the 1996 ACM Sigmetrics Conference, pages 126-137, May 1996.

5
Carlos R. Cunha, Azer Bestavros, and Mark E. Crovella.
Characteristics of WWW Client-based Traces.
Technical Report TR-95-010, Boston University Computer Science Department, 1995.

6
Alberto Cavicchiolo.
Internet: a network, networks.
Cybersphere, 02, April 1996.

7
M. Crovella and A. Bestavros.
Self-Similarity in World-Wide Web Traffic: Evidence and Possible Causes.
In Proceedings of the 1996 ACM Sigmetrics Conference, May 1996.

8
Mark E. Crovella and Azer Bestavros.
Explaining World Wide Web Traffic Self-Similarity.
Technical Report TR-95-015, Boston University Computer Science Department, 1995.

9
Marshall Eakin.
Brazil: the once and future country.
St. Martin's Press, New York, 1997.

10
National Laboratory for Applied Network Research.
Cache Statistics pages.
http://ircache.nlanr.net/Cache/cache-stats-links.html .

11
National Laboratory for Applied Network Research.
Squid Internet Object Cache.
http://squid.nlanr.net/Squid/ .

12
Ghaleb Abdulla, Edward A. Fox, Marc Abrams, and Stephen Williams.
WWW Proxy Traffic Characterization with Application to Caching.
Technical Report 97-04, Virginia Tech, Computer Science Department, 1997.

13
James Gwertzman and Margo Seltzer.
The Case for Geographical Push-Caching.
In Proceedings of the Fifth Annual Workshop on Hot Topics in Operating Systems, pages 51-55, May 1995.

14
IBOPE.
2a. Pesquisa Cadê?/IBOPE.
http://www.ibope.com.br/cade97/welcome.htm .

15
FreeBSD Inc.
FreeBSD.
http://www.freebsd.org/ .

16
Masaaki Nabeshima.
The Japan Cache Project: An Experiment on Domain Cache.
In Sixth International World Wide Web Conference, 1997.

17
Donald Neal.
The Harvest Object Cache in New Zealand.
http://www.waikato.ac.nz/harvest/www5/Overview.html , 1996.

18
James Pitkow.
Summary of WWW Characterizations.
In Seventh International World Wide Web Conference, 1998.

19
Larry Press.
Tracking the Global Diffusion of the Internet.
Communications of the ACM, 40(11):11-17, November 1997.

20
Brazilian Science and Technology Ministry.
Hosts por Domínio.
http://www.gt-er.cg.org.br/estatisticas/hosts/tab-host.html .