Cache them if you can | High Performance Web Sites
“The fastest HTTP request is the one not made.”
I always smile when I hear a web performance speaker say this. I forget who said it first, but I’ve heard it numerous times at conferences and meetups over the past few years. It’s true! Caching is critical for making web pages faster. I’ve written extensively about caching:
- Call to improve browser caching
- (lack of) Caching for iPhone Home Screen Apps
- Redirect caching deep dive
- Mobile cache file sizes
- Improving app cache
- Storager case study: Bing, Google
- App cache & localStorage survey
- HTTP Archive: max-age
Things are getting better – but not quickly enough. The chart below from the HTTP Archive shows that the percentage of resources that are cacheable has increased 10% during the past year (from 42% to 46%). Over that same time the number of requests per page has increased 12% and total transfer size has increased 24% (chart).
Perhaps it’s hard to make progress on caching because the problem doesn’t belong to a single group – responsibility spans website owners, third party content providers, and browser developers. One thing is certain – we have to do a better job when it comes to caching.
I’ve gathered some compelling statistics over the past few weeks that illuminate problems with caching and point to some next steps. Here are the highlights:
- 55% of resources don’t specify a max-age value
- 46% of the resources without any max-age remained unchanged over a 2 week period
- some of the most popular resources on the Web are only cacheable for an hour or two
- 40-60% of daily users to your site don’t have your resources in their cache
- 30% of users have a full cache
- for users with a full cache, the median time to fill their cache is 4 hours of active browsing
Read on to understand the full story.
My kingdom for a max-age header
Many of the caching articles I’ve written address issues such as size & space limitations, bugs with less common HTTP headers, and outdated purging logic. These are critical areas to focus on. But the basic function of caching hinges on websites specifying caching headers for their resources. This is typically done using max-age in the Cache-Control response header. This example specifies that a response can be read from cache for 1 year:
Cache-Control: max-age=31536000Since you’re reading this blog post you probably already use max-age, but the following chart from the HTTP Archive shows that 55% of resources don’t specify a max-age value. This translates to 45 of the average website’s 81 resources needing a HTTP request even for repeat visits.
Missing max-age != dynamic
Why do 55% of resources have no caching information? Having looked at caching headers across thousands of websites my first guess is lack of awareness – many website owners simply don’t know about the benefits of caching. An alternative explanation might be that many resources are dynamic (JSON, ads, beacons, etc.) and shouldn’t be cached. Which is the bigger cause – lack of awareness or dynamic resources? Luckily we can quantify the dynamicness of these uncacheable resources using data from the HTTP Archive.
The HTTP Archive analyzes the world’s top ~50K web pages on the 1st and 15th of the month and records the HTTP headers for every resource. Using this history it’s possible to go back in time and quantify how many of today’s resources without any max-age value were identical in previous crawls. The data for the chart above (showing 55% of resources with no max-age) was gathered on Feb 15 2012. The chart below shows the percentage of those uncacheable resources that were identical in the previous crawl on Feb 1 2012. We can go back even further and see how many were identical in both the Feb 1 2012 and the Jan 15 2012 crawls. (The HTTP Archive doesn’t save response bodies so the determination of “identical” is based on the resource having the exact same URL, Last-Modified, ETag, and Content-Length.)
46% of the resources without any max-age remained unchanged over a 2 week period. This works out to 21 resources per page that could have been read from cache without any HTTP request but weren’t. Over a 1 month period 38% are unchanged – 17 resources per page.
This is a significant missed opportunity. Here are some popular websites and the number of resources that were unchanged for 1 month but did not specify max-age:
- http://www.toyota.jp/ – 172 resources without max-age & unchanged for 1 month
- http://www.sfgate.com/ – 133
- http://www.hasbro.com/ – 122
- http://www.rakuten.co.jp/ – 113
- http://www.ieee.org/ – 97
- http://www.elmundo.es/ – 80
- http://www.nih.gov/ – 76
- http://www.frys.com/ – 68
- http://www.foodnetwork.com/ – 66
- http://www.irs.gov/ – 58
- http://www.ca.gov/ – 53
- http://www.oracle.com/ – 52
- http://www.blackberry.com/ – 50
Recalling that “the fastest HTTP request is the one not made”, this is a lot of unnecessary HTTP traffic. I can’t prove it, but I strongly believe this is not intentional – it’s just a lack of awareness. The chart below reinforces this belief – it shows the percentage of resources (both cacheable and uncacheable) that remain unchanged starting from Feb 15 2012 and going back for one year.
The percentage of resources that are unchanged is nearly the same when looking at all resources as it is for only uncacheable resources: 44% vs. 46% going back 2 weeks and 35% vs. 38% going back 1 month. Given this similarity in “dynamicness” it’s likely that the absence of max-age has nothing to do with the resources themselves and is instead caused by website owners overlooking this best practice.
3rd party content
If a website owner doesn’t make their resources cacheable, they’re just hurting themselves (and their users). But if a 3rd party content provider doesn’t have good caching behavior it impacts all the websites that embed that content. This is both bad a good. It’s bad in that one uncacheable 3rd party resource can impact multiple sites. The good part is that shifting 3rd party content to adopt good caching practices also has a magnified effect.
So how are we doing when it comes to caching 3rd party content? Below is a list of the top 30 most-used resources according to the HTTP Archive. These are the resources that were used the most across the world’s top 50K web pages. The max-age value (in hours) is also shown.
- http://www.google-analytics.com/ga.js (2 hours)
- (8760 hours)
- http://pagead2.googlesyndication.com/pagead/js/r20120208/r20110914/show_ads_impl.js (336 hours)
- http://pagead2.googlesyndication.com/pagead/render_ads.js (336 hours)
- http://pagead2.googlesyndication.com/pagead/show_ads.js (1 hour)
- https://apis.google.com/_/apps-static/_/js/gapi/gcm_ppb,googleapis_client,plusone/[…] (720 hours)
- http://pagead2.googlesyndication.com/pagead/osd.js (24 hours)
- http://pagead2.googlesyndication.com/pagead/expansion_embed.js (24 hours)
- https://apis.google.com/js/plusone.js (1 hour)
- http://googleads.g.doubleclick.net/pagead/drt/s?safe=on (1 hour)
- (3825 hours)
- http://connect.facebook.net/rsrc.php/v1/yQ/r/f3KaqM7xIBg.swf (164 hours)
- https://ssl.gstatic.com/s2/oz/images/stars/po/Publisher/sprite2.png (8760 hours)
- https://apis.google.com/_/apps-static/_/js/gapi/googleapis_client,iframes_styles[…] (720 hours)
- http://static.ak.fbcdn.net/rsrc.php/v1/yv/r/ZSM9MGjuEiO.js (8742 hours)
- http://static.ak.fbcdn.net/rsrc.php/v1/yx/r/qP7Pvs6bhpP.js (8699 hours)
- https://plusone.google.com/_/apps-static/_/ss/plusone/[…] (720 hours)
- http://b.scorecardresearch.com/beacon.js (336 hours)
- http://static.ak.fbcdn.net/rsrc.php/v1/yx/r/lP_Rtwh3P-S.css (8710 hours)
- http://static.ak.fbcdn.net/rsrc.php/v1/yA/r/TSn6F7aukNQ.js (8760 hours)
- http://static.ak.fbcdn.net/rsrc.php/v1/yk/r/Wm4bpxemaRU.js (8702 hours)
- http://static.ak.fbcdn.net/rsrc.php/v1/yZ/r/TtnIy6IhDUq.js (8699 hours)
- http://static.ak.fbcdn.net/rsrc.php/v1/yy/r/0wf7ewMoKC2.css (8699 hours)
- http://static.ak.fbcdn.net/rsrc.php/v1/yO/r/H0ip1JFN_jB.js (8760 hours)
- http://platform.twitter.com/widgets/hub.1329256447.html (87659 hours)
- (8699 hours)
- http://platform.twitter.com/widgets.js (1 hour)
- https://plusone.google.com/_/apps-static/_/js/plusone/[…] (720 hours)
- http://pagead2.googlesyndication.com/pagead/js/graphics.js (24 hours)
- http://s0.2mdn.net/879366/flashwrite_1_2.js (720 hours)
There are some interesting patterns.
- simple URLs have short cache times – Some resources have very short cache times, e.g., ga.js (1), show_ads.js (5), and twitter.com/widgets.js (27). Most of the URLs for these resources are very simple (no querystring or URL “fingerprints”) because these resource URLs are part of the snippet that website owners paste into their page. These “bootstrap” resources are given short cache times because there’s no way for the resource URL to be changed if there’s an emergency fix – instead the cached resource has to expire in order for the emergency update to be retrieved.
- long URLs have long cache times – Many 3rd party “bootstrap” scripts dynamically load other resources. These code-generated URLs are typically long and complicated because they contain some unique fingerprinting, e.g., http://pagead2.googlesyndication.com/pagead/js/r20120208/r20110914/show_ads_impl.js (3) and http://platform.twitter.com/widgets/hub.>1329256447.html (25). If there’s an emergency change to one of these resources, the fingerprint in the bootstrap script can be modified so that a new URL is requested. Therefore, these fingerprinted resources can have long cache times because there’s no need to rev them in the case of an emergency fix.
- where’s Facebook’s like button? – Facebook’s like.php and likebox.php are also hugely popular but aren’t in this list because the URL contains a querystring that differs across every website. Those resources have an even more aggressive expiration policy compared to other bootstrap resources – they use
no-cache, no-store, must-revalidate. Once the like[box] bootstrap resource is loaded, it loads the other required resources: lP_Rtwh3P-S.css (19), TSn6F7aukNQ.js (20), etc. Those resources have long URLs and long cache times because they’re generated by code, as explained in the previous bullet.- short caching resources are often async – The fact that bootstrap scripts have short cache times is good for getting emergency updates, but is bad for performance because they generate many Conditional GET requests on subsequent requests. We all know that scripts block pages from loading, so these Conditional GET requests can have a significant impact on the user experience. Luckily, some 3rd party content providers are aware of this and offer async snippets for loading these bootstrap scripts mitigating the impact of their short cache times. This is true for ga.js (1), plusone.js (9), twitter.com/widgets.js (27), and Facebook’s like[box].php.
These extremely popular 3rd party snippets are in pretty good shape, but as we get out of the top widgets we quickly find that these good caching patterns degrade. In addition, more 3rd party providers need to support async snippets.
Cache sizes are too small
In January 2007 Tenni Theurer and I ran an experiment at Yahoo! to estimate how many users had a primed cache. The methodology was to embed a transparent 1×1 image in the page with an expiration date in the past. If users had the expired image in their cache the browser would issue a Conditional GET request and receive a 304 response (primed cache). Otherwise they’d get a 200 response (empty cache). I was surprised to see that 40-60% of daily users to the site didn’t have the site’s resources in their cache and 20% of page views were done without the site’s resources in the cache.
Numerous factors contribute to this high rate of unique users missing the site’s resources in their cache, but I believe the primary reason is small cache sizes. Browsers have increased the size of their caches since this experiment was run, but not enough. It’s hard to test browser cache size. Blaze.io’s article Understanding Mobile Cache Sizes shows results from their testing. Here are the max cache sizes I found for browsers on my MacBook Air. (Some browsers set the cache size based on available disk space, so let me mention that my drive is 250 GB and has 54 GB available.) I did some testing and searching to find max cache sizes for my mobile devices and IE.
- Chrome: 320 MB
- Internet Explorer 9: 250 MB
- Firefox 11: 830 MB (shown in about:cache)
- Opera 11: 20 MB (shown in Preferences | Advanced | History)
- iPhone 4, iOS 5.1: 30-35 MB (based on testing)
- Galaxy Nexus: 18 MB (based on testing)
I’m surprised that Firefox 11 has such a large cache size – that’s almost close to what I want. All the others are (way) too small. 18-35 MB on my mobile devices?! I have seven movies on my iPhone – I’d gladly trade Iron Man 2 (1.82 GB) for more cache space.
Caching in the real world
In order to justify increasing browser cache sizes we need some statistics on how many real users overflow their cache. This topic came up at last month’s Velocity Summit where we had representatives from Chrome, Internet Explorer, Firefox, Opera, and Silk. (Safari was invited but didn’t show up.) Will Chan from the Chrome team (working on SPDY) followed-up with this post on Chromium cache metrics from Windows Chrome. These are the most informative real user cache statistics I’ve ever seen. I strongly encourage you to read his article.
Some of the takeaways include:
- ~30% of users have a full cache (capped at 320 MB)
- for users with a full cache, the median time to fill their cache is 4 hours of active browsing (20 hours of clock time)
- 7% of users clear their cache at least once per week
- 19% of users experience “fatal cache corruption” at least once per week thus clearing their cache
The last stat about cache corruption is interesting – I appreciate the honesty. The IE 9 team experienced something similar. In IE 7&8 the cache was capped at 50 MB based on tests showing increasing the cache size didn’t improve the cache hit rate. They revisited this surprising result in IE9 and found that larger cache sizes actually did improve the cache hit rate:
In IE9, we took a much closer look at our cache behaviors to better understand our surprising finding that larger caches were rarely improving our hit rate. We found a number of functional problems related to what IE treats as cacheable and how the cache cleanup algorithm works. After fixing these issues, we found larger cache sizes were again resulting in better hit rates, and as a result, we’ve changed our default cache size algorithm to provide a larger default cache.
Will mentions that Chrome’s 320 MB cap should be revisited. 30% seems like a low percentage for full caches, but could be accounted for by users that aren’t very active and active users that only visit a small number of websites (for example, just Gmail and Facebook). If possible I’d like to see these full cache statistics correlated with activity. It’s likely that user who account for the biggest percentage of web visits are more likely to have a full cache, and thus experience slower page load times.
Next steps
First, much of the data for this post came from the HTTP Archive, so I’d like to thank our sponsors: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Strangeloop, dynaTrace Software, and Torbit.
The data presented here suggest a few areas to focus on:
Website owners need to increase their use of a Cache-Control max-age, and the max-age times need to be longer. 38% of resources were unchanged over a 1 month period, and yet only 11% of resources have a max-age value that high. Most resources, even if they change, can be refreshed by including a fingerprint in the URL specified in the HTML document. Only bootstrap scripts from 3rd parties should have short cache times (hours). Truly dynamic responses (JSON, etc.) should specify must-revalidate. A year from now rather than seeing 55% of resources without any max-age value we should see 55% cacheable for a month or more.
3rd party content providers need wider adoption of the caching and async behavior shown by the top Google, Twitter, and Facebook snippets.
Browser developers stand to bring the biggest improvements to caching. Increasing cache sizes is a likely win, especially for mobile devices. Data correlating cache sizes and user activity is needed. More intelligence around purging algorithms, such as IE 9′s prioritization based on mime type, will help when the cache fills up. More focus on personalization (what are the sites I visit most often?) would also create a faster user experience when users go to their favorite websites.
It’s great that the number of resources with caching headers grew 10% over the last year, but that just isn’t enough progress. We should really expect to double the number of resources that can be read from cache over the coming year. Just think about all those HTTP requests that can be avoided!
How Reddit ranking algorithms work - amix.dk
This is a follow up post to How Hacker News ranking algorithm works. This time around I will examine how Reddit’s default story and comment rankings work. Reddit’s algorithms are fairly simple to understand and to implement and in this post I’ll dig deeper into them.
The first part of this post will focus on story ranking, i.e. how are Reddit stories ranked? The second part of this post will focus on comment ranking, which does not use the same ranking as stories (unlike Hacker News), Reddit’s comment ranking algorithm is quite interesting and the idea guy behind it is Randall Munroe (the author of xkcd).
Digging into the story ranking code
Reddit is open sourced and the code is freely available. Reddit is implemented in Python and their code is located here. Their sorting algorithms are implemented in Pyrex, which is a language to write Python C extensions. They have used Pyrex for speed reasons. I have rewritten their Pyrex implementation into pure Python since it’s easier to read.
The default story algorithm called the hot ranking is implemented like this:
#Rewritten code from /r2/r2/lib/db/_sorts.pyx from datetime import datetime, timedelta from math import log epoch = datetime(1970, 1, 1) def epoch_seconds(date): """Returns the number of seconds from the epoch to date.""" td = date - epoch return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000) def score(ups, downs): return ups - downs def hot(ups, downs, date): """The hot formula. Should match the equivalent function in postgres.""" s = score(ups, downs) order = log(max(abs(s), 1), 10) sign = 1 if s > 0 else -1 if s < 0 else 0 seconds = epoch_seconds(date) - 1134028003 return round(order + sign * seconds / 45000, 7)In mathematical notation the hot algorithm looks like this (I have this from SEOmoz, but I doubt they are the author of this):
Effects of submission time
Following things can be said about submission time related to story ranking:
- Submission time has a big impact on the ranking and the algorithm will rank newer stories higher than older
- The score won’t decrease as time goes by, but newer stories will get a higher score than older. This is a different approach than the Hacker News’s algorithm which decreases the score as time goes by
Here is a visualization of the score for a story that has same amount of up and downvotes, but different submission time:
The logarithm scale
Reddit’s hot ranking uses the logarithm function to weight the first votes higher than the rest. Generally this applies:
- The first 10 upvotes have the same weight as the next 100 upvotes which have the same weight as the next 1000 etc…
Here is a visualization:
Without using the logarithm scale the score would look like this:
Effects of downvotes
Reddit is one of the few sites that has downvotes. As you can read in the code a story’s “score” is defined to be:
- up_votes - down_votes
The meaning of this can be visualized like this:
This has a big impact for stories that get a lot of upvotes and downvotes (e.g. controversial stories) as they will get a lower ranking than stories that just get upvotes. This could explain why kittens (and other non-controversial stories) rank so high :)Conclusion of Reddit’s story ranking
- Submission time is a very important parameter, generally newer stories will rank higher than older
- The first 10 upvotes count as high as the next 100. E.g. a story that has 10 upvotes and a story that has 50 upvotes will have a similar ranking
- Controversial stories that get similar amounts of upvotes and downvotes will get a low ranking compared to stories that mainly get upvotes
How Reddit’s comment ranking works
Randall Munroe of xkcd is the idea guy behind Reddit’s best ranking. He has written a great blog post about it:
You should read his blog post as it explains the algorithm in a very understandable way. The outline of his blog post is following:
- Using the hot algorithm for comments isn’t that smart since it seems to be heavily biased toward comments posted early
- In a comment system you want to rank the best comments highest regardless of their submission time
- A solution for this has been found in 1927 by Edwin B. Wilson and it’s called “Wilson score interval”, Wilson’s score interval can be made into “the confidence sort”
- The confidence sort treats the vote count as a statistical sampling of a hypothetical full vote by everyone - like in an opinion poll
- How Not To Sort By Average Rating outlines the confidence ranking in higher detail, definitely recommended reading!
Digging into the comment ranking code
The confidence sort algorithm is implemented in _sorts.pyx, I have rewritten their Pyrex implementation into pure Python (do also note that I have removed their caching optimization):
#Rewritten code from /r2/r2/lib/db/_sorts.pyx from math import sqrt def _confidence(ups, downs): n = ups + downs if n == 0: return 0 z = 1.0 #1.0 = 85%, 1.6 = 95% phat = float(ups) / n return sqrt(phat+z*z/(2*n)-z*((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n) def confidence(ups, downs): if ups + downs == 0: return 0 else: return _confidence(ups, downs)The confidence sort uses Wilson score interval and the mathematical notation looks like this:
In the above formula the parameters are defined in a following way:
- p is the observed fraction of positive ratings
- n is the total number of ratings
- zα/2 is the (1-α/2) quantile of the standard normal distribution
Let’s summarize the above in a following manner:
- The confidence sort treats the vote count as a statistical sampling of a hypothetical full vote by everyone
- The confidence sort gives a comment a provisional ranking that it is 85% sure it will get to
- The more votes, the closer the 85% confidence score gets to the actual score
- Wilson’s interval has good properties for a small number of trials and/or an extreme probability
Randall has a great example of how the confidence sort ranks comments in his blog post:
If a comment has one upvote and zero downvotes, it has a 100% upvote rate, but since there’s not very much data, the system will keep it near the bottom. But if it has 10 upvotes and only 1 downvote, the system might have enough confidence to place it above something with 40 upvotes and 20 downvotes — figuring that by the time it’s also gotten 40 upvotes, it’s almost certain it will have fewer than 20 downvotes. And the best part is that if it’s wrong (which it is 15% of the time), it will quickly get more data, since the comment with less data is near the top.
Effects of submission time: there are none!
The great thing about the confidence sort is that submission time is irrelevant (much unlike the hot sort or Hacker News’s ranking algorithm). Comments are ranked by confidence and by data sampling - - i.e. the more votes a comment gets the more accurate its score will become.
Visualization
Let’s visualize the confidence sort and see how it ranks comments. We can use Randall’s example:
As you can see the confidence sort does not care about how many votes a comment have received, but about how many upvotes it has compared to the total number of votes and to the sampling size!Application outside of ranking
Like Evan Miller notes Wilson’s score interval has applications outside of ranking. He lists 3 examples:
- Detect spam/abuse: What percentage of people who see this item will mark it as spam?
- Create a “best of” list: What percentage of people who see this item will mark it as “best of”?
- Create a “Most emailed” list: What percentage of people who see this page will click “Email”?
To use it you only need two things:
- the total number of ratings/samplings
- the positive number of ratings/samplings
Given how powerful and simple this is, it’s amazing that most sites today use the naive ways to rank their content. This includes billion dollar companies like Amazon.com, which define Average rating = (Positive ratings) / (Total ratings).
Conclusion
I hope you have found this useful and leave comments if you have any questions or remarks.
Happy hacking as always :)
Related
- Reddit’s comment ranking algorithm, discusses a bug that Reddit’s implementation has
Remarkably simple and powerful.
Mario is hard, and that’s mathematically official - 14 March 2012 - New Scientist
IF YOU have ever struggled to complete classic Nintendo games, don’t feel bad - they are officially difficult.
An analysis of the computational complexity of video games, including those in the Mario and Legend of Zelda series, proves that many of them belong to a class of mathematical problems called NP-hard. This means that for a given game level, it can be very difficult to work out whether it is possible for a player to reach the end. The results suggest that some hard problems could be solved by playing a game.
The commercial versions of games are designed to be doable, of course, but Erik Demaine of the Massachusetts Institute of Technology and colleagues studied versions without that constraint. “We are working within the same rules, but we are free to design the levels however we like,” he says.
The team transformed each game into a type of logical puzzle called the Boolean satisfiability problem. This asks whether the variables in a collection of logical statements can be chosen to make all the statements true, or whether the statements inevitably contradict each other.
For each game, the team constructed sections of a level that force players to choose one of two paths - equivalent to assigning variables in the Boolean satisfiability problem. Arrangements of enemies and power-ups are equivalent to logical statements. If they allow a level to be completed, that is equivalent to all the statements in the Boolean problem being true; if they make the level impossible, that is equivalent to a contradiction.
Many of the games in the Mario, Donkey Kong, Legend of Zelda, Metroid and Pokémon series prove to be NP-hard. That means deciding whether a player can complete them is at least as hard as the hardest problems in NP, a complexity class involved in the tantalising problem of P versus NP (see “Million-dollar proof”). Not every game in each series was included in the proof, as they follow different rules.
For Mario, the team also prove that the games are NP-complete, an additional property with important consequences. Many difficult problems can be converted into any problem in the NP-complete category. Then if you can solve the NP-complete problem - say, by completing a level of Mario - you have solved the original problem too.
That includes the travelling salesman problem - finding the shortest route between a series of points - which is of real interest in the field of logistics, and also the knapsack problem, used in deciding how to allocate resources. So theoretically you could convert an example of either problem into a Mario level, and play the game to solve it. That approach would be fun, says Demaine, although it would probably be simpler to solve the satisfiability problem directly.
The results offer mixed news for game designers, says Giovanni Viglietta, a computer scientist at the University of Pisa, Italy, who was not involved in the research. If it is an NP-hard problem to decide whether a level can be successfully navigated, there is no easy way for a designer to check this. But it does ensure that playing the game is interesting, as players can’t easily decide if they are heading the right way or straying into an impassable area of the level. “The game demands some creativity and ingenuity,” Viglietta says.
Prof Aims to Rebuild Google With Stuff In Desk Drawer | Wired Enterprise
Dave Anderson looked into a desk drawer filled with tiny computers. Each was no bigger than a hardback novel, and their chips ran no faster than 600 MHz. Built by a little-known company called Soekris Engineering, they were meant to be wireless access points or network firewalls, and that’s how Anderson — a computer science professor at Carnegie Mellon — used them in a previous research project. But that project was over, and he thought: “They’ve got to be good for something else.”
At first, he decided these tiny machines could be super-low-power DNS (domain name system) servers — servers that take site names and translate them to a numeric internet address — and he asked some Ph.D. students to make it happen. “I wondered,” he remembers, “if we could do this on a wimpy platform that consumed only about 5 watts of power rather than 500.” Those students proved they could. But they also told Anderson he was thinking too small.
After tinkering with his tiny machines, they realized that if you strung a bunch of them together, you could run a massive application each machine could never execute on its own. The trick was to split the application’s duties into tiny pieces and spread them evenly across the network. “They were right,” Anderson says of his students. “We could use these boxes to run high-performance large-scale key-value stores — the kind of [databases] you would run behind the scenes at Facebook or Twitter. And the rest is publication history.”
The year was 2008, and as it turns out, Anderson and his students were at the forefront of a movement that could reinvent the way the world uses its servers, making them significantly more efficient — and cramming them into much smaller spaces. Startups such as SeaMicro and Calxeda are now building servers using the hundreds of low-power processor cores originally designed for cell phones and other mobile devices. HP is set to resell Calxeda machines as it explores similar systems with a research effort called Project Moonshot. And the giants of the internet — including Google, Amazon, and Facebook — are seriously considering the possibility of running their operations atop the sort of “wimpy” processors Anderson found in his desk drawer.
“Wimpy” is the official term. Now into its fourth year, Anderson’s project is known as the Fast Array of Wimpy Nodes, or FAWN. He regrets the name. “No manufacturer wants to advertise their products as wimpy,” he says. But the name certainly suits his research, and despite the negative connotation, the project has attracted the interest of the largest chip maker on earth. Intel sponsors Anderson’s research, and he works closely with researchers at the Pittsburgh lab Intel runs on the Carnegie Mellon campus.
The rub is that the Fast Array of Wimpy Nodes isn’t always fast. In some cases, software must be significantly rewritten to achieve high speeds on a collection of low-power processors, and other applications aren’t suited to the setup at all.
Like so many others across the server world, Intel is approaching the wimpy-node idea with skepticism — and not just because it makes an awful lot of money selling the far-from-wimpy processors that power today’s servers. “Intel is trying to walk a difficult line,” Anderson says. “Yes, a lot of their profit is from big brawny processors — and they don’t want to undercut that. But they also don’t want their customers to get inappropriately excited about wimpy processors and then be disappointed.”
Dave Anderson says that skepticism is healthy. But only up to a point. His research shows that many applications can be far more efficient on wimpy nodes, including not only ordinary web serving but, yes, large databases. “Intel realizes this too,” he says. “And they don’t want to get blindsided.”
Google Slaps Wimps
Google is a search and advertising company. But it’s also the company the world looks to for the latest thinking on hardware and software infrastructure. Google uses custom-built software platforms to distribute enormous applications across a worldwide network of custom-built servers, and this do-it-yourself approach to parallel computing has inspired everything from Hadoop, the increasingly popular open source platform for crunching data with vast server clusters, to Facebook’s Open Compute Project, a collective effort to improve the efficiency of the world’s servers.
So when Urs Hölzle, the man who oversees Google’s infrastructure, weighed in on the wimpy node idea, the server world sat up and noticed. If anyone believes in wimpy nodes, the world assumed, it’s Hölzle. But with a paper published in chip design magazine IEEE Micro, Google’s parallel computing guru actually took the hype down a notch. “Brawny cores still beat wimpy cores, most of the time,” read the paper’s title.
The problem, Hölzle said, was something called Amdahl’s law: If you parallelize only part of a system, there’s a limit to the performance improvement. “Slower but energy efficient ‘wimpy’ cores only win for general workloads if their single-core speed is reasonably close to that of mid-range ‘brawny’ cores,” he wrote. “In many corners of the real world, [wimpy core systems] are prohibited by law — Amdahl’s law.”
In short, he argued that moving information between so many cores can bog down the entire system. But he also complained that if you install a wimpy node array, you may have to rewrite your applications. “Cost numbers used by wimpy-core evangelists always exclude software development costs,” he said. “Unfortunately, wimpy-core systems can require applications to be explicitly parallelized or otherwise optimized for acceptable performance.”
Many “wimpy-core evangelists” took issue with Hölzle’s paper. But Dave Anderson calls it “reasonably balanced,” and he urges readers to consider the source. “I think you should also realize that this is written from the perspective of a company that doesn’t want to change too much of its software,” he says.
Anderson’s research has shown that some applications do require a significant rewrite, including virus scanning and other tasks that look for patterns in large amounts of data. “We actually locked our entire cluster because the [pattern recognition] algorithms we used allocated more memory than our individual cores had,” he remembers. “If you’re using wimpy cores, they probably don’t have as much memory per processor as the brawny cores. This can be a big limiter.”
But not all applications use as much memory. And in some cases, software can run on a wimpy core system with relatively few changes. Mozilla is using SeaMircro servers — based on Intel’s ATOM mobile processor — to facilitate downloads of its Firefox browser, saying the cluster draws about one fifth the power and uses about a fourth of the space of its previous cluster. Anderson points to this as an example of a wimpy core system that can be rolled out with relatively little effort.
Anderson’s stance echos that of Intel. This summer, when we asked Jason Waxman — the general manager of high-density computing in Intel’s data center group — about the company’s stance on wimpy nodes, he said that many applications — including those run by Google — are unsuited to the setup, but that others — including basic web serving — work just fine.
In other words, Google’s needs may not be your needs. Even if your applications are similar to Google’s, you may be more willing to rewrite your code. “I’m a researcher,” Anderson says. “I’m completely happy — and actually enjoy — reinventing the software. But there are others who would never ever want to rewrite their software. The question should be: As a company, where do you fit on that spectrum?”
Wimps Get Brawny
At the same time, wimpy nodes are evolving. Although low-power processors such as the Intel Atom and the ARM chips used by Calxeda can’t handle as much memory as “brawny” servers chips from Intel and AMD, newer versions are on the way — and these will shrink the memory gap. Facebook has said it can’t move to ARM chips because of the memory limitations, but it has also indicated it could move to wimpy cores once those limitations are resolved.
As the chips evolve, the rest of the system is evolving around them. Dave Anderson’s array uses flash storage rather than hard disks, and similar research from Steve Swanson — a professor of computer science and engineering at the University of San Diego — has shown wimpy nodes and flash go hand-in-hand. If you move to flash — the same solid-state storage used by smartphones — in place of spinning hard drives, you can use chips with lower clock speeds.
An old fashioned hard drive burns about 10 watts of power even when it’s doing nothing. In order to get the most out of the drive, you need a fast processor. But flash storage doesn’t burn that much power when idle, and that means you can use slower chips. “Adding solid state drives lets you use wimpier cores without giving up as much energy efficiency as you would if you were using a hard drive,” Swanson says. “With a hard drive, you want to use a faster core because it can access the hard drive and then race ahead as quickly as possible for the next access. With a solid state drive, it’s less critical that the processor race ahead to save power while the drive is idle.”
Anderson is also looking at ways to better balance workloads across wimpy node systems — an issue Urs Hölzle alludes to in his paper. “It is a problem,” he says, “but it’s a solvable problem. It just takes research and programmer effort to solve it.” What Hölzle identifies as difficulties, Anderson prefers to think of as research opportunities.
This includes software rewrites. In the short term, many companies — including Google — will frown on the idea. But in the long term, this changes. Since Hölzle published his paper, Google has resolved to rewrite its backend software — which is now being stretched into its second decade — and the new platform may very well move closer to the wimpy end of the spectrum.
Dave Anderson isn’t just looking at how wimpy core systems can be used today. He’s looking at how they can be used tomorrow. “If you came to me and you said: ‘Hey, Dave, how should I build my data center?’, I would not tell you to go and use the wimpiest cores you could find. That’s how I built mine, but I’m trying to push the limit and understand how to make these things practical.”
Why we need even more programming languages
Whenever a new programming language is announced, a certain segment of the developer population always rolls its eyes and groans that we have quite enough to choose from already, thank you very much.
As a case in point, take Dart, the language Google hopes will replace JavaScript [1]. Was it really necessary to create a whole new client-side language for the Web? Superficially, Dart doesn’t seem like much more than JavaScript with a few Java-like features mixed in. It even compiles to JavaScript (and not very efficiently [2]). Wouldn’t it have been better to work to improve the JavaScript we already have?
To its credit, Google says it’s doing just that [8], in tandem with its Dart effort. But once a language reaches a certain tipping point of popularity, overhauling it to include support for new features, paradigms, and patterns is easier said than done. In fact, judging by the past ten years or so, it can be very, very difficult.
The PHP 6 debacle
Take PHP, for example. The next version of the popular Web applications language shipped its second release candidate this week, and the final build is expected to arrive early next year. This release won’t be the long-awaited PHP 6, however. Instead, it will be a far less ambitious revision [9] designated PHP 5.4.Doubtless that’s a disappointment to developers who have been anticipating PHP 6 since the project launched in October 2005. But at this point, even if a release dubbed PHP 6 does eventually appear, it will bear little resemblance to the version that’s borne the designation so far. PHP creator Rasmus Lerdorf officially shelved the PHP 6 project [10] in March 2010 after almost five years of fruitless labor, in favor of refocusing on a formal 5.4 release.
Some of the reasons for the PHP 6 effort’s failure were technical. The primary focus of the project was retooling PHP to include native support for Unicode [11]. That wouldn’t be limited to strings, either; in PHP 6, developers would be able to specify variable names, function names, and other identifiers using any Unicode script, including multibyte encodings such as Chinese and Hindi. As the years rolled on and hidden gotchas began to surface, however, it became clear that the PHP developers had bitten off more than they could chew.
It didn’t help that as an open source project, PHP development is largely a volunteer effort. According to PHP contributor Andrei Zmievski, relatively few developers [12] really understood the Unicode push and were committed to making it happen. It was hard to get excited about rewriting lots of working code to support Unicode, and enthusiasm for the project waned. By the time PHP 6 was abandoned in 2010, Lerdorf observed [13] that PHP development “hasn’t been fun for quite a while.”
Languages move forward, but slowly
Personally, I’m no fan of PHP. I’ve long held it’s a poster child for bad language design, so it doesn’t surprise me to learn that evolving it into something better is a Sisyphean task. But I shouldn’t single out PHP here. In fact, many of the more popular languages have struggled to move forward with major new versions in recent years.The best example is surely Perl, whose community has been working on version 6 continuously since 2000. In 2008, the Perl community chastised me [14] for suggesting that Perl 6 was “vaporware.” It exists, they insisted, and you can use it today. But while that may be technically true, whether Perl 6 will ever be mature enough for production use remains an open question. Although several implementations are available, and some are even fairly stable [15], so far none of them supports all of the features [16] of the Perl 6 specification.
The Python community has had better luck implementing its language. The most recent version, Python 3, was released in 2008, after about three years of development. Three years later, however, adoption remains slow. Python 3 took the radical step of breaking backward compatibility with earlier versions, and a number of popular Python libraries and frameworks (including Django [17]) have yet to catch up. As a result, many Python developers still cling to version 2.x [18], particularly for Web work, and widespread migration to Python 3 is expected to take several more years.
These kinds of difficulties aren’t limited to scripting languages, either. The Java community has long clamored for significant language updates. Before Java SE 7 shipped [19] earlier this year, it had been five years since the last major release. But while Java SE 7 was originally expected to include much-requested capabilities such as lambda expressions and a module system, Oracle has since delayed those features [20] until Java SE 8 at the earliest.
Darting ahead of JavaScript
All of this brings us back to Dart and why Google chose to invent its own language rather than working to improve JavaScript. JavaScript is based on the ECMAScript standard. Before the publication of the ECMAScript 5 specification in 2009, ECMAScript had remained unchanged for a full decade — it made the Perl community look like a hotbed of activity!Not that the ECMAScript working group had been idle throughout those years. Work on ECMAScript 4 began in 1999, but soon foundered and went on hiatus. The committee reconvened in 2003, but still the various stakeholders couldn’t agree. Around and around they went for another five years, until — much like PHP 6 — the ECMAScript 4 effort was officially abandoned in 2008 [21], in favor of a less-contentious specification that became ECMAScript 3.1.
The lesson from all of these examples is clear: Programming languages move slowly, and the more popular a language is, the slower it moves. It is far, far easier to create a new language from whole cloth than it is to convince the existing user base of a popular language to accept radical changes.
So don’t knock Google for the work it’s done on Dart. If the Web development community responds well to Dart, perhaps it will eventually replace JavaScript. On the other hand, it’s possible that by creating a working prototype of Dart, Google will encourage the ECMAScript working group to evolve JavaScript more quickly. Either way, efforts like Dart are an important engine of progress for the software development community and a hedge against stagnation. Remember that the next time you ask whether we really need another new programming language.
This article, “Why we need even more programming languages [22],” originally appeared at InfoWorld.com [23]. Read more of Neil McAllister’s Fatal Exception blog [24] and follow the latest news in programming [25] at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter [26].
Distinguished Lecture: Vinton Cerf
Generation Jobless: Students Pick Easier Majors Despite Less Pay
By JOE LIGHT And RACHEL EMMA SILVERMAN
Jeff Swensen for the Wall Street Journal
Biyan Zhou switched her major from electrical and computer engineering to a double major in psychology and policy management.
Biyan Zhou wanted to major in engineering. Her mother and her academic adviser also wanted her to major in it, given the apparent career opportunities for engineers in a tough job market.
For today’s college graduates, the message is clear: If you want a job, the best thing you can do is build a career in rigorous disciplines of science, technology, engineering and math, Joe Light reports on Markets Hub. Photo: AP.
But during her sophomore year at Carnegie Mellon University, Ms. Zhou switched her major from electrical and computer engineering to a double major in psychology and policy management. Workers who majored in psychology have median earnings that are $38,000 below those of computer engineering majors, according to an analysis of U.S. Census data by Georgetown University.
“My ability level was just not there,” says Ms. Zhou of her decision. She now plans to look for jobs in public relations or human resources.
Ms. Zhou’s dilemma is one that educators, politicians and companies have been trying to solve for decades amid fears that U.S. science and technology training may be trailing other countries. The weak economy is putting those fears into deeper relief.
Time will tell if the poor job market persuaded more students to push into disciplines such as engineering and science. Although the number of college graduates increased about 29% between 2001 and 2009, the number graduating with engineering degrees only increased 19%, according to the most recent statistics from the U.S. Dept. of Education. The number with computer and information-sciences degrees decreased 14%. Since students typically set their majors during their sophomore year, the first class that chose their major in the midst of the recession graduated this year.
Research has shown that graduating with these majors provides a good foundation not just for so-called STEM jobs, or those in the science, technology, engineering, and math fields, but a whole range of industries where earnings expectations are high. Business, finance and consulting firms, as well as most health-care professions, are keen to hire those who bring quantitative skills and can help them stay competitive.
For 22-year-old Ms. Zhou, from Miami, the last straw was a project for one of her second-year courses that kept her and her partner in the lab well past midnight for several days. Their task was to program a soda machine. Though she and her partner managed to make it dispense the right items, they couldn’t get it to give the correct change.
To avoid getting an “incomplete” for the course, Ms. Zhou withdrew before the lab ended. Since switching majors she has earned almost straight A’s instead of the B’s and C’s she took home in engineering.
Students who drop out of science majors and professors who study the phenomenon say that introductory courses are often difficult and abstract. Some students, like Ms. Zhou, say their high schools didn’t prepare them for the level of rigor in the introductory courses.
Overall, only 45% of 2011 U.S. high-school graduates who took the ACT test were prepared for college-level math and only 30% of ACT-tested high-school graduates were ready for college-level science, according to a 2011 report by ACT Inc.
“If you haven’t been given the proper foundation early on, you fall farther and farther behind as the material gets more difficult. It’s discouraging, demoralizing,” says Claus von Zastrow, the chief operating officer and director of research at Change the Equation, a Washington, D.C., nonprofit group that seeks to improve science and math education. It has led professors to anticipate the high levels of attrition.
“I get direct emails from a handful of students who say, ‘I struggled in this class. I realize that I’m not cut out to be a [computer science] major,’ ” says professor Adam Klivans, who teaches an introductory math and computer science class at the University of Texas, Austin.
Science classes may also require more time—something U.S. college students may not be willing to commit. In a recent study, sociologists Richard Arum of New York University and Josipa Roksa of the University of Virginia found that the average U.S. student in their sample spent only about 12 to 13 hours a week studying, about half the time spent by students in 1960. They found that math and science—though not engineering—students study on average about three hours more per week than their non-science-major counterparts.
Educators have tried to tackle the attrition problem with new programs that they say make engineering more accessible. In 2003, Georgia Institute of Technology split its introductory computer-science class into three separate courses. One was geared toward computer science majors, another to engineering majors, and a third to liberal arts, architecture and management majors. The liberal arts course cut down on computer-science theory in favor of practical tasks like using programming to manipulate photographs, says computer science professor Mark Guzdial. Since the switch, about 85% of students pass, he says.
Meanwhile, only a third of science and engineering college graduates actually take jobs in science and tech fields, according to a 2007 study by Georgetown University professor B. Lindsay Lowell and Rutgers University professor Hal Salzman.
That may partly be because the jobs don’t pay enough to attract or retain top graduates. Science, technology, engineering and math majors who stay in a related profession had average annual earnings of $78,550 in 2009, but those who decided to go into managerial and professional positions made more than $102,000, according to an analysis of U.S. Census data by the Georgetown University Center on Education and the Workforce.
“If you’re a high math student in America, from a purely economic point of view, it’s crazy to go into STEM,” says Anthony Carnevale, director of the Georgetown center.
Some science and math graduates also say they would rather channel their analytical skills into fields that pay higher and seem less tedious. Charles Mokuolu, 23, graduated from Georgia Tech in 2010 with a civil-engineering degree, and now heads the finance club at Duke University’s master of engineering management program. He recently secured a business-strategy job in the commercial leadership program of a large global manufacturing company.
After interning at an engineering firm, “I realized that although I did enjoy learning about all this cool stuff and doing math problems that no one else could solve, it’s not something I wanted long term as a career,” he says.
I’m not cut out to be a CS major but no pain no gain.
















Jeff Swensen for the Wall Street Journal