Topic: Experiences in implementing national caches, regional caches and Institution/site caches
Analyzing the Behavior of a Proxy Server in Light of Regional and
Cultural Issues
Vírgilio F. Almeida Márcio G. Cesário Rodrigo C. Fonseca
Wagner Meira Jr. Cristina D. Murta
{virgilio, magc, rfonseca, meira, cristina}@dcc.ufmg.br
Computer Science Department - Federal University of Minas Gerais, Brazil
Abstract
The Internet is not only a technological concept, but
also a cultural and linguistic concept.
Therefore, to
understand the performance behavior of the WWW, the role
of regional, cultural and social issues on traffic patterns must
be analyzed.
The way users access the Internet depends heavily upon the telecommunication
infrastructure and social context of each country.
Based on the analysis of
different proxy server logs, this paper presents evidence of the influence
of these issues on the performance of a caching proxy server.
The Internet and the WWW are not only technological concepts, but
also cultural and linguistic ones [6].
Therefore, to
understand the performance behavior of the WWW, one must also
understand the role
of regional, cultural and social issues in traffic patterns.
The way users access the Internet depends heavily upon the telecommunication
infrastructure and social context of each country.
Based on the analysis of
different proxy server logs, this paper presents evidence of the influence
of these issues on the behavior of a caching proxy server.
Our approach is to examine the performance statistics drawn from logs of
a busy Brazilian proxy server and interpret the
measures in light of geographical and cultural issues.
History, geography, culture, language and economics are features
that shape the regional identity of a nation. In this paper,
we investigate the quantitative behavior of a caching proxy
server in light of these features. In reference [9],
an American historian discusses the cultural patterns of
Brazilian society. The author chose a select group of features that
he feels are at the core of Brazilian society and culture:
class, color, family, sexuality, music, soccer, religion,
carnival, television, film and literature. He concludes: "These
are features that define Brazil and Brazilians."
Our approach, then, is to use the geographical and cultural features
of Brazil to interpret the statistics
drawn from the logs of a busy caching proxy server.
The history of the Internet in Brazil dates back to 1989, with the
implementation of the National Research Network's backbone
(RNP), which provides Internet access throughout the country.
Figure 1 shows its status as of March 1998.
Points Of Presence (POPs) were created in most states of the country, to
provide universities and institutions with a link to the Internet.
Like other countries, Brazil has seen exponential growth of
the Internet in its territory. As of January 1997, Brazil stood
as the 19th country in the world in number of hosts,
and the 3rd in the Americas, after the US and Canada. According
to [20], the number of .com hosts in the Brazilian national
domain (.br)
grew by 1947% from January 1996 to July 1997.
Figure 1: RNP's backbone (Source: http://www.rnp.br/1.3.bone.html)
The POP in the state of Minas Gerais (also known as POP-MG),
located in the city of Belo Horizonte,
serves almost all the users in the state and
has connections to the main Brazilian backbones
and direct links to the United States,
which together provide a bandwidth of 9 Mbps.
The average total traffic rate measured is close to 6 Mbps.
POP-MG is the main gateway to the Internet for twenty-five
universities and a hundred business organizations, including Internet
Service Providers (ISPs) and research institutes.
POP-MG's cache server became operational in March 1997, with the
goal of reducing web response time and bandwidth utilization.
It is configured as a two-level hierarchy, as shown in
Figure 2.
Figure 2: The Cache Proxy Server Hierarchy in the POP-MG
The first level consists of servers that answer only end
users' requests. Ideally, this level would be located in ISPs,
universities and other locations closer to the final users. Because most
of its customers do not have their own cache server,
POP-MG designed its caching hierarchy with two levels of servers.
If an object requested by an end user cannot be found locally in a
first-level proxy, this server tries to
fetch it from the second level.
The second-level servers are responsible for fetching the object from
the source when a miss occurs.
They are all located in POP-MG, and they only
answer requests from first-level servers. To balance the load,
POP-MG has three second-level servers, and the requests are divided among them
according to
top-level domain. One server answers ".br" requests, another
answers
".com" requests, and the third is responsible for all other domains
(denoted "!.com !.br").
The second-level cache servers are said to be "parents" of
the first-level ones.
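The routing rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Squid configuration used at POP-MG: the parent names, the dictionary-based caches and the `fetch_from_origin` callback are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical names for the three second-level ("parent") servers.
PARENTS = {"br": "parent-br", "com": "parent-com"}
DEFAULT_PARENT = "parent-other"  # handles every other domain ("!.com !.br")


def select_parent(url):
    """Pick the second-level server responsible for the URL's top-level domain."""
    host = urlsplit(url).hostname or ""
    top_level = host.rsplit(".", 1)[-1]
    return PARENTS.get(top_level, DEFAULT_PARENT)


def resolve(url, first_level_cache, parent_caches, fetch_from_origin):
    """Serve a request through the two-level hierarchy.

    Caches are modeled as plain dicts (an assumption made for brevity).
    Returns the object and a label telling where it was found.
    """
    if url in first_level_cache:
        return first_level_cache[url], "first-level hit"
    parent_cache = parent_caches[select_parent(url)]
    if url in parent_cache:
        obj, outcome = parent_cache[url], "second-level hit"
    else:
        # Only second-level servers contact the origin server.
        obj, outcome = fetch_from_origin(url), "miss"
        parent_cache[url] = obj
    first_level_cache[url] = obj
    return obj, outcome
```

Splitting by top-level domain keeps all objects of a given domain on one parent, so no object is stored twice at the second level.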
All of POP-MG's cache servers are
Pentium-based machines running FreeBSD [15]
and Squid [11].
Table 1 shows their configuration and
load, measured in number of connections per hour,
including HTTP, FTP and ICP requests. The total average load of the
caching proxy servers is around 1,800,000 requests per day.
Table 1: Configuration of the Cache Proxy Server
| Server | RAM | Disk | Cache Software | Connections/Hour |
| .br | 128 MB | 8 GB | Squid 1.1.20 | 17000 avg / 40000 peak |
| .com | 128 MB | 8 GB | Squid 1.1.20 | 21000 avg / 45000 peak |
| !.br !.com | 64 MB | 4 GB | Squid 1.novm.17 | 7000 avg / 15000 peak |
| 1st level | 128 MB | 4 GB | Squid 1.1.17 | 16000 avg / 40000 peak |
| 1st level | 192 MB | 4 GB | Squid 1.1.17 | 16000 avg / 40000 peak |
Cache proxy servers throughout the world exhibit different access
patterns. Using data available at [10], we compiled
statistics for cache proxy servers in several countries, as shown
in Table 2. For each domain, we present the
total number of accesses (Counts),
the percentage of accesses relative
to the number of accesses to all domains (%all), and the hit ratio (%hit).
The figures for the Brazilian proxy
servers were obtained from the logs of POP-MG. All the statistics
refer to data collected from November 17 to November 23, 1997.
The USA data was collected from all servers that make up the
NLANR hierarchy [10]. Examining the table, we note that in five
countries the most accessed domain is the national one
(taking .com as the de facto national domain of the USA).
The countries are: USA, Brazil, Japan, Italy, and Taiwan.
More than 50% of the accesses in the US, Brazil and Japan are directed to sites
in their national domains. It can also be noted from Table 2
that the hit ratio for the national domain is never lower than the ratio
for the other listed domain, and is strictly higher in every country except
Taiwan, where the two are equal.
Table 2: Access patterns throughout the world (most accessed and 2nd most accessed domains per country)
| Country | 1st Dom. | 1st Counts | 1st %all | 1st %hit | 2nd Dom. | 2nd Counts | 2nd %all | 2nd %hit |
| Austria | com | 181,610 | 38 | 14 | at | 141,273 | 29 | 22 |
| Belgium | com | 1,346,956 | 52 | 51 | be | 385,759 | 15 | 63 |
| Brazil | br | 2,146,625 | 50 | 58 | com | 1,402,558 | 33 | 36 |
| Italy | it | 28,630 | 39 | 21 | com | 23,911 | 32 | 20 |
| Japan | jp | 206,037 | 75 | 48 | com | 53,533 | 19 | 36 |
| Netherlands | com | 352,883 | 43 | 44 | nl | 234,992 | 29 | 53 |
| Portugal | com | 1,148,262 | 55 | 27 | pt | 448,275 | 21 | 37 |
| Taiwan | tw | 6,285 | 49 | 15 | com | 3,661 | 29 | 15 |
| USA | com | 14,125,456 | 71 | 35 | net | 1,474,418 | 7 | 22 |
We also collected the hit ratio for the caching proxy hierarchy of
POP-MG. It is around 65%, greater than the average hit ratios published
for England and Japan [17,16], which are below 60%.
The quantitative behavior of a
cache proxy server can be analyzed through
the following statistics:
- Access frequency per domain: the
ratio of accesses to a top-level domain to the total number of
accesses;
- Size of the universe of objects accessed: the number of unique
objects accessed by users during the period of analysis;
- Hit ratio: the percentage of requests that can be serviced with
objects found in the cache. This metric is influenced
by factors such as the users' access patterns (i.e., reference locality),
Time-To-Live information, cache size, etc.;
- Object access frequency: the average number of accesses per
object, i.e., the total number of accesses divided by the number of unique
objects.
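As a rough illustration, the statistics listed above can be computed from a list of requested URLs as in the following Python sketch. The function names and the result keys are hypothetical. The hit ratio returned here is an idealized value, not a measured one: it assumes an initially empty cache of unlimited size in which every object is cacheable and never expires, so every repeated access counts as a hit.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the last label of the URL's host name, e.g. 'br' or 'com'."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def workload_statistics(urls):
    """Compute the four statistics for a non-empty list of requested URLs."""
    if not urls:
        raise ValueError("at least one request is required")
    total = len(urls)
    per_object = Counter(urls)
    per_domain = Counter(top_level_domain(u) for u in urls)
    unique = len(per_object)
    return {
        # Access frequency per domain, as a fraction of all accesses.
        "domain_share": {d: n / total for d, n in per_domain.items()},
        # Size of the universe of objects accessed.
        "unique_objects": unique,
        # Object access frequency: accesses divided by unique objects.
        "accesses_per_object": total / unique,
        # Idealized hit ratio (assumption: infinite, initially empty cache).
        "ideal_hit_ratio": (total - unique) / total,
    }
```

A real log would also need the cache result recorded for each request in order to obtain the measured hit ratio, which is normally lower than the idealized one.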
We analyze the influence of regional and cultural characteristics on
the behavior of proxy servers by examining 4,235,311 requests
that arrived at one of the first-level proxy servers at POP-MG.
The requests correspond to a seven-day operation interval
and come from both commercial (40%) and educational (60%)
organizations. The amount of transferred data is more than 25 gigabytes.
We also analyzed data from proxy servers in two other locations, namely the
National Laboratory for Applied Network Research (NLANR) in the United States,
and Esoterica, a large ISP in Portugal. We chose Portugal because Brazil
and Portugal share the same language and many cultural characteristics.
In order to have a comparable number of requests,
we gathered data corresponding to
one day for NLANR and seven days for Esoterica.
Table 3: Statistics for accesses to the first level proxy of POP-MG
| | All | .br | .com |
| Requests | 4,235,311 (100% Req) | 2,146,625 (50.2% Req) | 1,402,558 (33.2% Req) |
| Objects | 1,079,044 (100% Obj) | 319,937 (29.7% Obj) | 518,256 (48.0% Obj) |
| Accesses/object | 3.92 | 6.70 | 2.70 |
| 1-access objects | 709,759 (65.8% Obj) | 180,385 (16.7% Obj) | 349,737 (32.4% Obj) |
| Non-first accesses | 74.52% | 85.10% | 63.05% |
| Hit ratio | 47% | 58% | 36% |
Table 3 displays workload statistics for POP-MG's log.
The column labeled "All" covers all requests handled by the proxy during
the period, and the other two columns represent the two most
accessed domains (comprising more than 80% of the accesses):
.br and .com.
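The derived rows of Table 3 follow directly from the request and object counts. The short Python check below recomputes them; the helper names are hypothetical, and reading the "Non-first accesses" row as a ceiling on the hit ratio rests on the assumption that the cache held none of these objects when the measurement period began.

```python
# (requests, unique objects, measured hit ratio in %) copied from Table 3.
TABLE_3 = {
    "All": (4_235_311, 1_079_044, 47),
    ".br": (2_146_625, 319_937, 58),
    ".com": (1_402_558, 518_256, 36),
}


def accesses_per_object(requests, objects):
    """Average number of accesses per unique object."""
    return requests / objects


def non_first_access_percent(requests, objects):
    """Percentage of requests that are not the first access to their object."""
    return 100.0 * (requests - objects) / requests


if __name__ == "__main__":
    for name, (requests, objects, hit) in TABLE_3.items():
        ratio = accesses_per_object(requests, objects)
        ceiling = non_first_access_percent(requests, objects)
        print(f"{name:5s} accesses/object={ratio:.4f} "
              f"non-first={ceiling:.2f}% measured hit ratio={hit}%")
```

The recomputed values agree with the published "Accesses/object" and "Non-first accesses" rows to within 0.01, and in each column the measured hit ratio stays below the share of non-first accesses.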
Based on
the results of the Brazilian Internet User Survey [14] and
on the telecommunication infrastructure, we offer the following
explanations for the high percentage of accesses to the .br domain:
(1) only 58% of the users speak English and are able to access
English-language sites; (2) most Brazilian users are interested in news
(80%), scientific information (67%), music (67%), and adult entertainment
(61%), which are topics heavily related to regional culture; and (3)
accesses to Brazilian sites are usually faster, since they do not require
traversing busy international links.
The second observation regards the average number of accesses per
object. Both this average and the hit ratio are much higher for
Brazilian objects (6.7 and 58%, respectively) than for objects from
the .com domain (2.7 and 36%, respectively). This phenomenon - the
hit ratio for the national domain being higher than for other domains - has
been consistently observed in all countries from which we analyzed data.
For Brazil, this can be explained not only by the number of accesses
to .br sites,
but also by the fact that the number of unique Brazilian objects
(319,937) is significantly smaller than the number of unique objects
from the .com domain (518,256).
By examining the logs of POP-MG, NLANR and Esoterica,
we found a significant difference regarding the popularity of HTTP-based
chat
sites in Brazil.
Accesses to sites with chat applications correspond to 4.9% of the
total accesses recorded in the POP-MG
log. In the US, requests for chat sites represent
1.2% of the accesses at NLANR. In Portugal, requests for chat
sites represent
1.17% of the total number of requests.
Table 4 shows some access patterns to HTML-based chat sites,
contrasting Brazil and Portugal. The row labeled National
gives the percentage of the accessed chat pages that are from the
national domain (.br or .pt, respectively). US gives the
percentage of these pages that are in one of the .com, .net,
.org or .edu domains, and Reciprocal represents the
accesses
from Brazil to pages in Portugal and from Portugal to pages in Brazil. We can note that not only are
chat sites much more popular in Brazil, but also that most of them are located
in Brazil. The accesses from Brazil to Portuguese chat sites are almost
negligible, whereas accesses in the other direction are very significant.
Our explanation for these statistics stems from the fact that
Brazilian TV soap operas are very popular in Portugal. As
a consequence, Brazilian TV sites get a lot of attention from
WWW users in Portugal.
Table 4: HTML-based chat access patterns
| | POP-MG (Brazil) | Esoterica (Portugal) |
| National | 87% | 7.3% |
| US | 6.2% | 79.4% |
| Reciprocal | 0.5% | 10.5% |
It is also worth noting
that Web chat sites are among the most popular sites in Brazil.
This characteristic is important for caching projects,
because chat pages are dynamic and cannot be cached.
In Brazil, telecommunication services are much more expensive than in the US.
Thus, most Internet users tend to navigate the WWW during periods
when telephone rates are lower. As a consequence, we observe
heavy load peaks that coincide with the low-rate periods.
Using the logs, we calculated the hourly arrival rates for the
proxy servers of NLANR and POP-MG.
We noticed high variability in the load,
which we attribute to the different telephone charging policies in the two countries
(i.e., Brazil and the US). The traffic patterns
observed at POP-MG follow the variations in the phone rates.
Figure 3 shows the combined utilization of all of POP-MG's first-level
proxy servers (in number of accesses per hour) on a day when
phone rates in Brazil are lower. On Saturdays, the pricing
scheme is the following: from 6:00AM to 2:00PM, a telephone customer
pays the normal rate per minute (i.e., it is not a flat rate);
from 2:00PM to 11:59PM, a customer is charged for only one minute, no
matter the duration of the phone call.
As can be noted in Figure 3,
there is a load peak just after 2:00PM.
During the least expensive period, the peak
arrival rate is 116% higher than the day's average rate.
In the NLANR servers, the corresponding figure falls to 46%.
Thus, this type of infrastructure information is useful for
capacity planning
of proxy servers, which should be able to handle the load peaks.
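A minimal Python sketch of this kind of peak analysis is given below. It assumes the request timestamps of a single day have already been parsed from the log into `datetime` objects; the function names are hypothetical, and the excess is defined as how far the busiest hour lies above the hourly average, matching the "116% higher" phrasing.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps):
    """Count requests per hour of day (0-23) for one day of timestamps."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def peak_excess_percent(per_hour):
    """Return by how many percent the busiest hour exceeds the hourly average."""
    average = sum(per_hour) / len(per_hour)
    if average == 0:
        raise ValueError("no requests in the period")
    return 100.0 * (max(per_hour) / average - 1.0)


if __name__ == "__main__":
    # Tiny made-up sample: three requests on one day.
    sample = [
        datetime(1997, 11, 22, 9, 0),
        datetime(1997, 11, 22, 14, 5),
        datetime(1997, 11, 22, 14, 30),
    ]
    print(hourly_counts(sample))
```

For capacity planning, the peak-hour count rather than the daily average is the figure a proxy server would have to sustain.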
The highly uneven popularity of various Web documents has been
noted in many references [5,7,3,18].
These studies showed the applicability
of Zipf's Law to document popularity versus ranking.
The law states that if one ranks the popularity of words in a given
text (denoted by
$\rho$) by their frequency of use (denoted by $P$),
then
$$P \sim 1/\rho$$
Our logs show that Zipf's law does not apply to documents requested
from a proxy cache. This is demonstrated in Figure 4
for the 4,235,311 requests recorded in the POP-MG log. The figure
shows a log-log plot of the number of references to each document as a function of
the document's rank in overall popularity. The straight line represents
the Zipf's law function and does not fit the actual data from the logs.
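One simple way to check such a fit numerically is sketched below in Python. Zipf's law predicts that reference count is inversely proportional to rank, which appears on a log-log plot as a straight line of slope -1. The sketch ranks documents by reference count and estimates the slope by ordinary least squares; this is an illustrative assumption, not necessarily the procedure used to produce the figure, and the function names are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(urls):
    """Return reference counts sorted from most to least popular document."""
    return sorted(Counter(urls).values(), reverse=True)


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two positive frequencies; ranks start at 1.
    A slope close to -1 is what Zipf's law predicts.
    """
    if len(frequencies) < 2:
        raise ValueError("at least two documents are required")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A fitted slope far from -1, or a visibly curved plot, indicates that the workload departs from the law. A least-squares fit on log-log data is only a crude diagnostic, since the long tail of documents referenced once dominates the estimate.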
Figure 3: POP-MG's first level proxy server's utilization
Figure 4: Concentration of references in a Brazilian server and a Portuguese server
Characteristics of Web cache workloads have been studied in many
references [12,5,18]. Arlitt and Williamson [4] analyzed six web server
workloads and identified ten invariants, that is, ten characteristics
that are common across all the data sets studied.
Abdulla et al. [12] carried out similar work for proxy cache workloads.
They identified ten
invariants that hold true across the ten workloads analyzed. The
workloads are from five different organizations. There are also
characterizations based on client traces [5,8].
None of these studies analyzes the influence of geographical, social or
cultural factors on proxy cache performance. All of
the logs studied except one are from North America.
Gwertzman and Seltzer [13] proposed
geographical push-caching, a cooperative caching strategy based on
knowledge of the network topology (proximity of the hosts) and the access
history of the documents. Nabeshima [16] proposed the
concept of a domain cache, based on experiments in the Japan Cache
Project. The Japan Cache Server is dedicated to
answering accesses to .jp
(Japan) domain servers. The author
concludes that the domain cache is an effective
method of operating a cache server. The high cache hit ratio obtained in
that cache server can be seen as a confirmation of our results.
Neal [17] wrote about the Harvest
Object Cache in New Zealand and pointed out how New Zealand's
geographical characteristics, such as the small size of the country and its
considerable distance from the Occident, and regional issues such as the
high cost of bandwidth between New Zealand and the United States' west
coast, influenced the creation of a cache project.
Pitkow [18] presents a summary of WWW characterizations and
points out the need for characterizations
of international WWW users.
The presence of the Internet in Brazil is recognized in a recent
article [19], which points out that
Brazil is the
first country, after the U.S. and Canada, to be included in the bimonthly
directory of Internet Service
Providers published by Boardwatch Magazine [1].
In this paper, we have analyzed the logs of a busy cache proxy
server in light of geographical and cultural issues, such as language,
social interaction, and telecommunication infrastructure, among others. We noted
a correlation between national characteristics (taking Brazil as our
example) and the quantitative behavior of a cache proxy server,
represented by the percentage of accesses to the national domain, the
hit ratio for each domain and traffic peaks.
As noted by [9], Brazilians naturally like to chat, and this fact
is reflected in a high percentage of accesses to chat sites, as compared
to an American server. Language and interest in regional information,
according to a Brazilian WWW user survey [14],
as well as the limited bandwidth of international links, are used to explain the
high percentage of accesses from Brazilian users to pages
on Brazilian sites and the high hit ratios observed in the cache of POP-MG.
The charging policies adopted by the local phone companies
- a strong geographical
factor - are found to have a significant influence on the traffic patterns of
POP-MG's cache server.
These facts, drawn from the cache proxy server logs,
provide useful information for the design of efficient regional caching
infrastructures.
For example, one can use this type of information
to define the architecture of a
cache proxy hierarchy (e.g., domain-based caching), as well as to size
cache capacity to handle load peaks.
The authors would like to thank the ISP
Esoterica - Novas Tecnologias de Informacao, S.A. for allowing them to obtain
statistics from its access logs.
[1] Boardwatch Magazine, July 1997.
[2] V. Almeida, M. Cesário, R. Fonseca, W. Meira Jr., and C. Murta. The influence of geographical and cultural issues on the cache proxy server workload. http://www.dcc.ufmg.br/anades/submissions/habits/296.html
[3] Virgilio Almeida, Azer Bestavros, Mark Crovella, and Adriana de Oliveira. Characterizing Reference Locality in the WWW. Technical Report 96-011, Boston University, Computer Science Department, 1996.
[4] Martin F. Arlitt and Carey L. Williamson. Web Server Workload Characterization: The Search for Invariants. In Proceedings of the 1996 ACM Sigmetrics Conference, pages 126-137, May 1996.
[5] Carlos R. Cunha, Azer Bestavros, and Mark E. Crovella. Characteristics of WWW Client-based Traces. Technical Report TR-95-010, Boston University Computer Science Department, 1995.
[6] Alberto Cavicchiolo. Internet: a network, networks. Cybersphere, 02, April 1996.
[7] M. Crovella and A. Bestavros. Self-Similarity in World-Wide Web Traffic: Evidence and Possible Causes. In Proceedings of the 1996 ACM Sigmetrics Conference, May 1996.
[8] Mark E. Crovella and Azer Bestavros. Explaining World Wide Web Traffic Self-Similarity. Technical Report TR-95-015, Boston University Computer Science Department, 1995.
[9] Marshall Eakin. Brazil: the once and future country. St. Martin's Press, New York, 1997.
[10] National Laboratory for Applied Network Research. Cache Statistics pages. http://ircache.nlanr.net/Cache/cache-stats-links.html
[11] National Laboratory for Applied Network Research. Squid Internet Object Cache. http://squid.nlanr.net/Squid/
[12] Ghaleb Abdulla, Edward A. Fox, Marc Abrams, and Stephen Williams. WWW Proxy Traffic Characterization with Application to Caching. Technical Report 97-04, Virginia Tech, Computer Science Department, 1997.
[13] James Gwertzman and Margo Seltzer. The Case for Geographical Push-Caching. In Proceedings of the Fifth Annual Workshop on Hot Operating Systems, pages 51-55, May 1995.
[14] IBOPE. 2a. Pesquisa Cadê?/IBOPE. http://www.ibope.com.br/cade97/welcome.htm
[15] FreeBSD Inc. FreeBSD. http://www.freebsd.org/
[16] Masaaki Nabeshima. The Japan Cache Project: An Experiment on Domain Cache. In Sixth International World Wide Web Conference, 1997.
[17] Donald Neal. The Harvest Object Caching in New Zealand. http://www.waikato.ac.nz/harvest/www5/Overview.html, 1996.
[18] James Pitkow. Summary of WWW Characterizations. In Seventh International World Wide Web Conference, 1998.
[19] Larry Press. Tracking the Global Diffusion of the Internet. Communications of the ACM, 40(11):11-17, November 1997.
[20] Brazilian Science and Technology Ministry. Hosts por Domínio. http://www.gt-er.cg.org.br/estatisticas/hosts/tab-host.html