Position Paper for NLANR Cache Workshop
The Japan Cache Project was motivated by the Japan Window Project, which aims to disseminate information about Japan to users outside Japan. The goal of the Japan Cache Project, by contrast, is to give users in North America more convenient access to raw Japanese information found on the web.
To achieve this goal, we operate a public cache server in the USA for access to Japanese information. Infrequently accessed information, such as Japanese information, tends to be scattered across cache servers at many different sites. If we can gather such minor information into one cache, we can get a higher hit ratio. In other words, a cache server is most effective when the range of accesses it serves is narrow. Our final goal is to fill this cache server with information related to Japan. Currently, the Japan Cache server is dedicated to accesses to JP (Japan) domain servers.
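The argument for gathering minor information into one cache can be illustrated with a minimal sketch. The workload below is hypothetical (four sites, five pages on an invented host, each page requested once per site), and the cache is assumed to be unbounded with no expiration; it is not a measurement from our server.

```python
def hit_ratio(requests):
    """Return the hit ratio of an unbounded cache that sees `requests` in order."""
    seen = set()
    hits = 0
    for url in requests:
        if url in seen:
            hits += 1
        else:
            seen.add(url)  # the first request for a URL is always a miss
    return hits / len(requests) if requests else 0.0


# Hypothetical workload: four sites each request the same five JP pages once.
pages = [f"http://www.example.co.jp/page{i}.html" for i in range(5)]
per_site_requests = [list(pages) for _ in range(4)]

# Separate caches: each site's cache only ever sees first requests.
separate = [hit_ratio(reqs) for reqs in per_site_requests]

# One dedicated cache: the requests of all sites arrive at the same server.
shared = hit_ratio([url for reqs in per_site_requests for url in reqs])
```

Under these assumptions every separate cache has a hit ratio of 0, while the single shared cache reaches 0.75, because only the first of the four requests for each page is a miss.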
In Japan, we also operate a cache server (cache.imnet.ad.jp) on IMNET (Inter-Ministry research information NETwork), which was founded by the Japanese Science and Technology Agency. Although we operate cache.imnet.ad.jp as a provider cache in IMNET, we are willing to accept JP domain accesses from outside Japan. It could be a good cache server for users in Asian countries that have direct international links. However, more than ten Internet providers in Japan have their own international links to the USA. The Japan Cache Server makes better use of these multiple international links than cache.imnet.ad.jp does, and it avoids narrow international links. Therefore, the Japan Cache Server is the better cache server for users in North America.
The Japan Cache server is a Sun SPARCstation 20 with 128 MB of memory and 8 GB of cache space, running the Squid Internet Object Cache. It is located at NTT Multimedia Communication Laboratories in Palo Alto. The USA ends of many of the international links between the USA and Japan are in the western USA, an area through which much of the traffic to Japan passes. The location is therefore ideal in terms of network topology.
The early morning is the best time to access web servers, so it is a good time to store information that users will access later. In this experiment, the target information for prefetching was classified into two types: the 100 most frequently accessed pages, and current events and newly submitted information.
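Selecting the most frequently accessed pages as prefetch targets can be sketched as follows. This is only an illustration: it assumes the access log has already been reduced to a list of requested URLs, and the function name `top_pages` is hypothetical rather than part of our system.

```python
from collections import Counter


def top_pages(accessed_urls, n=100):
    """Return the n most frequently requested URLs, most frequent first."""
    counts = Counter(accessed_urls)
    return [url for url, _count in counts.most_common(n)]


# Hypothetical log excerpt: one entry per request.
log = [
    "http://www.ntt.co.jp/",
    "http://www.ntt.co.jp/index-e.html",
    "http://www.ntt.co.jp/",
    "http://www.ntt.co.jp/",
    "http://www.ntt.co.jp/index-e.html",
    "http://www.ntt.co.jp/news/",
]
targets = top_pages(log, n=2)
```

The resulting list would then be handed to whatever job fetches pages through the cache in the early morning.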
With regard to new submissions and current-event information, around 80% of the prefetched information was newly fetched from web servers. However, only about 10% of the prefetched information stored in the cache was later retrieved from the cache server by users. The target information did not match users' demands; we need to establish a method for predicting user access.
Regarding the prefetching of frequently accessed information, 81% of the information already existed in the main cache. Prefetching frequently accessed information therefore had little effect.
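The ratios used to judge prefetching can be computed along the following lines. This is a sketch under stated assumptions: the function name, the three input sets, and the choice of newly fetched objects as the denominator for later use are ours for illustration, and the sample data is invented.

```python
def prefetch_report(targets, cached_before, requested_after):
    """Summarize one prefetch run as three ratios.

    targets:         URLs selected for prefetching
    cached_before:   URLs already in the cache before the run
    requested_after: URLs that users requested after the run
    """
    targets = set(targets)
    if not targets:
        raise ValueError("no prefetch targets given")
    already_cached = targets & set(cached_before)
    newly_fetched = targets - already_cached
    used_later = newly_fetched & set(requested_after)
    return {
        "already_cached": len(already_cached) / len(targets),
        "newly_fetched": len(newly_fetched) / len(targets),
        # Assumption: later use is measured against newly fetched objects only.
        "used_later": len(used_later) / len(newly_fetched) if newly_fetched else 0.0,
    }


# Invented example: ten targets, two already cached, one new object used later.
targets = [f"http://www.example.co.jp/news/{i}.html" for i in range(10)]
report = prefetch_report(
    targets,
    cached_before=targets[:2],
    requested_after=[targets[5], "http://www.example.co.jp/other.html"],
)
```

A high `already_cached` value, or a low `used_later` value, indicates that the prefetch run added little beyond what normal cache operation already provides.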
Notification of updates by the primary server, as in the Andrew File System, is a good way to keep a cache server with a long expiration time coherent. Currently, we have set up an email-based refresh notification mechanism on www.ntt.co.jp. The mechanism for the notification is the following.
Many users cannot use the Japan Cache server. One reason is that at some sites, users must access the outside through a firewall proxy server for security reasons. Another is a client software limitation: the Japan Cache server only permits accesses to JP domain servers, so client software must have a domain-name-based proxy selection mechanism. Currently, only Netscape Navigator supports this.
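The domain-name-based proxy selection that a client needs can be sketched as below. This is an illustration of the decision only, not the configuration format of any particular browser; the proxy host name and port are hypothetical placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical address; substitute the real host and port of the cache server.
JAPAN_CACHE_PROXY = "japan-cache.example.com:3128"


def select_proxy(url):
    """Return the proxy to use for `url`, or None for a direct connection."""
    host = urlsplit(url).hostname or ""
    # Only hosts under the JP top-level domain go through the Japan Cache server.
    if host.rstrip(".").endswith(".jp"):
        return JAPAN_CACHE_PROXY
    return None
```

For example, `select_proxy("http://www.ntt.co.jp/")` returns the proxy address, while a URL on a non-JP host returns `None`, so that request is made directly or through the site's usual proxy.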
To accept accesses from these users, we chose to operate a cache replication server, which is not a proxy server. To the user, it appears to be a kind of mirror server; however, the information is retrieved from the primary server through the cache and stored in it. As the cache replication server, we run the Squid Internet Object Cache in HTTP accelerator mode.
Currently www-ntt.nttam.com (in California), whose primary server is www.ntt.co.jp (in Tokyo), is in operation. Its hit ratio was around 75 percent, and its average cache disk usage was 30 MB, while the primary server holds 500 MB.
The Japan Cache server works as a JP domain parent cache of the NLANR caching project. In other words, in the NLANR caching project, every JP domain access goes through sv.cache.nlanr.net, which is the final JP domain parent cache in the NLANR caching project. Our server acts as the JP domain parent cache of sv.cache.nlanr.net.
However, many accesses from users seem to time out along the way. In addition, infrequently accessed information, such as Japanese information, that is stored on a regional cache server may never be retrieved again by other users. It would therefore be better for users in North America to configure the Japan Cache Server directly as their JP domain parent cache server.
The hit ratio was around 58% higher than in usual cache server operation, which helped to show the effectiveness of a cache server dedicated to accesses to a specific domain. However, we could not get a high effectiveness from prefetching.
With regard to maintaining coherence and operating the cache replication server, we have set up the system. At present, we are asking for cooperation from the primary servers in Japan; in particular, we would like cooperation from news companies. We are asking them to send refresh notifications to the Japan Cache Server and to give us permission to open cache replication servers for them to the public. However, this cooperation is hard to obtain.