Partitioning in Large Data Systems Alberto Lerner Motivation + achieving scalability through data partitioning is not new + but new desiderata: + not all systems are born large so expand as need be, just by adding nodes (disk+memory) + access patterns not known or not fixed always: so balance load as it goes + no intervention other than adding machines + Two systems have done it nicely: Bigtable (google) and Dynano (amazon) + In investigating the partitioning aspect of those, we'll study Paxos (to solve distributed consensus) and DHT's (to do away with central authorities in distributed systems). === Problems to solve To achieve... + ...incremental scalability + create new partitions as data grows + accept a new node in the system + ...adaptive load balancing + move partitions from over- to underloaded nodes + for the above to work + how to detect who is over- or underloaded in a distributed system? + how to even know who is *in* the system, if new nodes can get added? (or failures can happen) + if things get moved around how do clients find it? === Successful systems... + ... had at least two distinct ways of looking at partitioning and positioning data + partition above info (no pun intended) consequently, central decisions such as balancing and group membership may need a central authority + replicate this info (roughly) everywhere so, we might get away without a central authority + First, that's Bigtable's approach. Latter, Dynamo's. === Bigtable system brief + system is a collection of servers that export a key-value data model + roughly, server translates put's-get's keys into disk read-writes + each server handles a portion of the partitions partitioning scheme + partitioning data is maintained in a table and is itself partitioned + partitining table has two levels (fixed height tree) in that the first partition points to all others + everybody can read, only system can update locating keys + locating a key means descending a tree (visit fist partition, visit table data partition, visit table partition) + trick is, how to find first partition? partition split + partition split would be mostly local operation + but partitioning data gets updated + can partition due to size, access pattern, ... load balancing + load balancing decision is a centralized operation + but state can be volatile (collected just for that) + load balancer reads servers loads at coarse granularity + tells some servers to drop load, again coarse + then load balancer finds homes for dropped tablets on underloaded servers node addition + a new node joins the group by creating a file in a special directory and grabbing a lock to it + anybody needing to know all servers just scans that directory + wouldn't that directory be a single point of failure? Paxos and Chubby + So that "magic" file server... not a SPOF because it is fault tolerant + Paxos solves distributed consensus + consensus drives a set of replicated state machines that manipulate a database + there's layer of file services (locks and watches) atop the database + so a file server like this can be used in maintaing some global state about the system, such as its members, or who has a particular piece of metadata. === Dynamo system brief + like bigtable, system is a collection of servers that expors a key-value data model + big difference: no central authority, so we'll see a lot more partitioning and positioning information replicated around + system agrees on a hash function whose domain is seen as circular (max's key successor is first key) + each server takes care of a portion of the ring, defined by where in the ring that server's "token" falls partitioning scheme + no need to maintain explicit partitioning data, since anyone can compute where in the ring each key would fall + but all need to know which server is responsible for which portion of the ring + tradeoff of size of the state and time to route. Route in O(1) means state is O(n) # of servers + but gets away saying the O(1) routing is active research topic (Cornel's Beehive) locating keys + servers know all servers, locating is O(1) + client points to any server + server redirects if need be node addition + server inserts itself in the ring + servers contact one another periodically using a gossip scheme + eventually all learn about newcomers + new server asks for part of the successor's ring portion partition split + not really possible + can increase number of tokens per server to increase the number of partitions + assumes that access is somewhat uniform Load balancing w/o central authority + argument is that in high traffic, there are many "hot" keys + hashing function distributes those randomly + but having one "token" per server doesn't always guarantee partitions with roughly the same number of keys + so use "virtual nodes", that is, pick T tokens per server + but what if we want to add server and maintain partitioning? + divide in Q partitions, Q >> #servers * T + each partition still belong to "next server" in the ring + now, did we loose the ability to issue efficient scans? + argh... so pick an order preserving hash-function then! + so bad choice of keys by the application writer may not distribute "hot" ones conveniently. + Was it really worth the trouble of doing load balancing without a central authority? === Wrap-up + decision on how to go about the partitioning has deep implications on the system + Paxos and Chubby add yet another set of failure modes to the system whereas having all nodes agree on a hash function seem simpler (and has not faults associated) + but freedom to do balancing is a bit compromised + better, worse? no such thing because the partitioning strategy works in tandem with other aspects of the system. Dynamo use replication and balancing is done across replicas as well. + both approaches were successful in the companies they were deployed.