Scalability is always a challenging issue, especially when your limiting factor is never physical space, but instead the physical limitations of the servers hosting the data. Distributed systems have several challenging elements in their construction. It doesn't matter if it's a giant forum, client records, or even weblog posts. There are some basic strategies though that can help simplify data management issues. If nothing else, these can be great starting points for more complex solutions.
Define the Content
Before content can be streamlined, it is important to define what the content itself is. In doing so, it's also possible to find where the content's weight resides. Four our purposes, we will define weight in two contexts. First, the content's weight is where the bulk of the content (in physical space is). Often times, this weight coincides with where the content areas are located. The second definition of weight is in terms of access. Content that is either written heavily or is accessed frequently carries a lot of weight. For Gaia Online, this weight was the posts inside of the forums. With more than 191,000,000 forum posts, and "viewtopic.php" being the most frequently accessed page, the Gaia Forums were the heaviest both in size and in access.
Group the Content
Defined content can then be grouped into manageable chunks. Chunks can exist per user, per topic, per blog category, or per any other content meta information that defines the content. In selecting a group, the goal is to determine a grouping that will not shift, even if the majority of content's meta information changes.
Minimize Load
With the data grouped into pieces, it can be broken up across servers or whatever means are necessary to solve one half of the weight issue. To solve the second half of the weight issue-- content access, the goal is to reduce the amount of times that the content is accessed. Movable Type, a piece of software very popular in the blogging world, pre-renders HTML pages (in the default mode), shifting the access to once during a page build, and once during a comment build. Template caching and query caching are also common solutions to minimize data access load. Reducing the data load by this point is also made easier by the distribution of the content's weight.
Thinking Practical: Gaia
Much of this talk about scalability is nice, but it needs a practical example. The Gaia Online forum rewrite is a really good example of rewriting and recoding with scalability in mind. The original architecture made use of one large server for writing post information, and three slave servers that handled client reads. Because all four servers must maintain the same data set, and only one server is capable of writing, the demand to post exceeded the single server's limitation. When tackling the scalability problem of the forums, both physical space and the distribution of load were the two primary concerns. By developing a method where 1/4 of the data was on each of the four servers, it would be possible to spread out not only the data, but the reading and writing load on a per server basis. With this in mind, it was time to group the content, and minimize the load.
Grouping for a Forum, Handling Load
Breaking things up in a forum environment is very simple. A post in a forum is categorized by three sets of meta information. The poster's information, the information for the topic that the post belongs to, and the forum information to which the topic belongs to. Since a topic can be moved from one forum to another (and therefore the posts be moved as well), the most static meta information associated with a post is the topic to which it is bound. The four servers being used became buckets for the posts, divided up based on their topic ID. This also ensures that all posts for a topic will be on the same server. To help minimize load, topics are spread out onto the servers until they are no longer active. At that point, the topics are moved to a new server for the purpose of archival. The master topic index makes use of MySQL cache in order to lessen the strain for repeat queries. Finally, to reduce load on the servers, another bucket can be added at any time, and will spread out the load by 1/n. The downside to this method is that the growth, while scalable, is logarithmic in theory, meaning that with 9 forum servers, the 10th will only lessen the load by 1/10 of the total strain.
Scalability of data and the access to that data can be challenging, no matter the task. When done correctly, the end result can be faster, more streamlined, and allow for rapid growth and expansion.
In response to "Spreading Out the Data":
Ok. Cool