Asynchronous Data Processing
100 Websites and Lots of Stats
Shaun Haber & Ethan Kaplan from the Tech Department of Warner Bros. Records
87 sites with 87 databases in a single Drupal environment and lots of data.
A single huge multisite setup. Drupal 5.
Multisite: all sites share the same code base and modules, but each site has its own database, which contains a lot of information valuable for data mining (number of users, etc.).
They don’t deal with shared databases.
Use Case: I need to view the data
How do we view the data in a consolidated fashion?
One site to rule them all, capable of aggregating data from all the sites and displaying it to the user.
Pull model – "Master" pulls data from sites
Push model – "Master" is server: Sites push data to "master"
Shaun: pull model
Ethan: push model
Data Output < Data aggregation < Query API < Site Node
- Site node: new content type (URL, DB Info, Document Root…)
- Query API
  - Queries all site nodes
  - DB Connection management
  - Store results as PHP Array
  - Don’t do anything stupid!
- Data aggregation
  - SELECT COUNT(*) FROM users
  - Cache management
  - Semaphore management
  - Storage management
  - Cron hook
- Data Output
  - Sortable tables
Go to the API and execute SELECT COUNT(*) FROM users WHERE mail LIKE '%gmail.com'
Returns an array of sites with count.
They have a view showing newly registered users, users online, etc. by site, in a sortable table.
Check whether cron is running on all sites: the master queries the last cron run from each database (a good way to check that a site is up).
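A minimal sketch of that health check, assuming the last-cron timestamps have already been pulled from each site's database by the query layer. The function name, the one-hour threshold, and the sample data are assumptions made for illustration.

```python
import time

def stale_sites(last_cron_by_site, max_age_seconds=3600, now=None):
    """Return the sites whose last cron run is missing or older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for site, last_run in last_cron_by_site.items():
        # A missing timestamp means cron has never run, or the query failed.
        if last_run is None or now - last_run > max_age_seconds:
            stale.append(site)
    return sorted(stale)

# Hypothetical timestamps (seconds); "now" is fixed so the example is repeatable.
sample = {"artist-a.example": 9900, "artist-b.example": 1000, "artist-c.example": None}
flagged = stale_sites(sample, max_age_seconds=3600, now=10000)
print(flagged)
```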
They have a flex application on their desktops so that they can monitor sites in real time.
Zero Latency! by using queuing and other messaging services.
Python multi-threaded processes are very efficient.
Spread is a more efficient communications alternative to XML-RPC, which is used by Flickr, etc. (see http://www.schlossnagle.org/~george/spread/)
Spread: not a node protocol; used in large-scale applications; a transport protocol and not a queue; C++ and not Ruby. No session setup. Connect/send takes .002 ms. With XMPP there is spin-up (it waits for an answer).
They do provide an XML-RPC alternative for slower sites so as not to flood Spread.
Spread is a transport protocol, first to get, last to get.
Why Python: Python is just easy, and more fun than Java. Python has exceedingly good multithreading, without the kind of threading you have in Java.
On the Drupal side, Actions send a Spread message. Geocoding of data this way is much more scalable than the Location module (it happens out of band, so it scales).
- Mailing list
- Discussion board
- SMS list
- Any other external systems/APIs
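The out-of-band idea above (the site fires off a message and moves on, while a separate multi-threaded Python process does the slow work) can be sketched as below. Note the substitution: this uses the standard library's `queue` and `threading` as an in-process stand-in, not Spread itself, and `fake_geocode` is a hypothetical placeholder for a slow external lookup.

```python
import queue
import threading

def start_workers(handler, num_workers=2):
    """Start worker threads that consume messages from a shared queue."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            message = tasks.get()
            try:
                if message is None:  # sentinel: shut this worker down
                    return
                outcome = handler(message)
                with lock:
                    results.append(outcome)
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    return tasks, results, threads

def fake_geocode(message):
    # Hypothetical stand-in for a slow call to an external service.
    return (message["uid"], message["city"].upper())

tasks, results, threads = start_workers(fake_geocode)

# The "site" side: enqueue and return immediately, without waiting for an answer.
for uid, city in [(1, "Burbank"), (2, "Boston"), (3, "Szeged")]:
    tasks.put({"uid": uid, "city": city})

tasks.join()                 # wait for the demo messages to be processed
for _ in threads:
    tasks.put(None)          # one sentinel per worker
for thread in threads:
    thread.join()

processed = sorted(results)
print(processed)
```

The point of the pattern is that the page request never blocks on the mailing list, SMS, or geocoding service; only the worker does.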
The ultimate goal: User of record centralized as a master on a single site and connected to all services.
Site of record
API abstraction layer
All of this operates in a Grand Central Station methodology.
They would like to Open Source it in order to move towards a more generic and less specific approach.
Very interesting presentation
Question on scaling: they have gotten good at scaling Drupal: 86,000 on 3 servers with 3 DB servers.
The key to scaling Drupal is caching, using memcache, and being smart in optimizing heavy queries in views, for example. Diligence in debugging to detect and eliminate locking.
Question: What was it about Drupal that led them to use it?
Ethan has used Drupal since version 2… they evaluated CMSs, but Drupal has a strong, unified community, and a generic architecture. Three other record labels also use Drupal. They started using it three years ago, when it wasn’t as popular as now.
Question: Where is Drupal getting in the way, what pitfalls does it have?
Ethan: We are dependent on many different people (Ubercart, services module, etc.), and their projects are at different levels of maturity, given the need to migrate to Drupal 6. The challenge has to do with dealing with Open Source in general. But the act of thinking a lot lowers our costs a hundred-fold. It gives us more than it costs. Required: more education in the company. But it enables us to respond better to requests.
Question: How do you update so many sites?
Shaun: A great question. They have four people doing this, wearing many hats. They have an infrastructure for developers to work on, based on SVN. For a small change, they commit to the repository and update the live site. For major updates, they take a snapshot of the site (code + DB dump), replicate it, and develop on the copy while the public site continues to run. A data sync is then required to merge in all the new data accumulated in the interim. This is a complicated process, but it is broken down into two phases. First, sequence IDs: the delta is shifted up to allow space for the data coming into the live site; the script shifts all nodes up. (In Drupal 6 this will be unnecessary, since the sequence table is no more.) Then comes the merge.
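One possible reading of the two-phase sync, sketched with plain dictionaries. This is an interpretation of the notes, not the presenters' script: nodes created on the development copy after the snapshot are shifted up by a delta so that they cannot collide with nodes created on the live site in the meantime, and then the two sets are merged. The function name and sample data are hypothetical, and a real script would also have to rewrite every table that references the shifted IDs.

```python
def merge_after_snapshot(dev_nodes, live_nodes, snapshot_max_id):
    """Merge {nid: title} mappings from a dev copy and the live site.

    Phase 1: shift dev-created IDs up past the highest live ID.
    Phase 2: merge snapshot-era nodes, live-created nodes, and shifted dev nodes.
    """
    live_new = {nid: t for nid, t in live_nodes.items() if nid > snapshot_max_id}
    dev_new = {nid: t for nid, t in dev_nodes.items() if nid > snapshot_max_id}

    # The delta leaves room for everything that arrived on the live site.
    delta = max(live_new) - snapshot_max_id if live_new else 0

    merged = {nid: t for nid, t in dev_nodes.items() if nid <= snapshot_max_id}
    merged.update(live_new)
    merged.update({nid + delta: t for nid, t in dev_new.items()})
    return merged, delta

# Snapshot taken when the highest node ID was 3.
dev = {1: "Home", 2: "Tour", 3: "News", 4: "New landing page"}
live = {1: "Home", 2: "Tour", 3: "News", 4: "Fan post", 5: "Another fan post"}

merged, delta = merge_after_snapshot(dev, live, snapshot_max_id=3)
print(delta, merged)
```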
Question: Drupal core updates and module updates?
Shaun: With core updates we wait a few days to make sure the new release is stable. Since it is a multisite, all sites share the same code base.
Apache: mod_vhost_alias allows for a virtual document root
You can launch a new site without restarting servers with this form of dynamic hosting.
This allows for symbolic links, so the population of the new site is done by creating/changing symlinks.
You can update the sites one at a time, at whatever rate is comfortable… no need to update all 87 at once.
New sites are based on a skeleton folder of symlinks which can be copied.
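The skeleton-of-symlinks approach can be sketched like this. It is only an illustration under assumed paths: the directory layout, host names, and the `create_site_from_skeleton` and `repoint` helpers are hypothetical, and everything runs inside a temporary directory.

```python
import os
import shutil
import tempfile

def create_site_from_skeleton(skeleton_dir, sites_root, hostname):
    """Copy the skeleton folder for a new host, keeping symlinks as symlinks."""
    target = os.path.join(sites_root, hostname)
    shutil.copytree(skeleton_dir, target, symlinks=True)
    return target

def repoint(link_path, new_target):
    """Switch one site's symlink to another code release via rename."""
    tmp = link_path + ".tmp"
    os.symlink(new_target, tmp)
    os.replace(tmp, link_path)

with tempfile.TemporaryDirectory() as root:
    # Two hypothetical code releases sharing one server.
    old_release = os.path.join(root, "releases", "drupal-a", "modules")
    new_release = os.path.join(root, "releases", "drupal-b", "modules")
    os.makedirs(old_release)
    os.makedirs(new_release)

    # The skeleton contains only symlinks into the shared code base.
    skeleton = os.path.join(root, "skeleton")
    os.makedirs(skeleton)
    os.symlink(old_release, os.path.join(skeleton, "modules"))

    # Launching a new site is just a copy of the skeleton.
    site = create_site_from_skeleton(skeleton, os.path.join(root, "vhosts"), "newartist.example")
    link = os.path.join(site, "modules")
    is_link = os.path.islink(link)

    # Updating this one site later does not touch any other site.
    repoint(link, new_release)
    points_to_new = os.readlink(link) == new_release
    skeleton_unchanged = os.readlink(os.path.join(skeleton, "modules")) == old_release

print(is_link, points_to_new, skeleton_unchanged)
```

Because each site has its own set of links, sites can be moved to a new release one at a time, which matches the per-site update pace described above.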