273 - Distributed System
Consensus
It allows a collection of machines to work as a coherent group that can survive failures of some of its members. It is a key building block for reliable, large-scale distributed software systems.
- Quorum Consensus algorithm
A quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system. A quorum-based technique is implemented to enforce consistent operation in a distributed system.
Quorum-based voting for replica control
Suppose you have two nodes (A & B) storing the same set of data.
N = # of replicas (nodes)
R = # of READ nodes in quorum
W = # of WRITE nodes in quorum
If R = 1, W = 1, then R + W = N
Client can write into A and read from B (might not see its latest write): inconsistent view.
If R = 2, W = 1, then R + W > N
Writes go to either A or B, but client must read from both A and B: consistent view.
If a given data item has a total of V votes, the read quorum Vr and the write quorum Vw have to obey the following rules (a quick check of both is sketched below):
- Vr + Vw > V : ensures that a data item is not read and written by two transactions concurrently. Additionally, it ensures that a read quorum contains at least one site with the newest version of the data item.
- Vw > V/2 : ensures that two write operations from two transactions cannot occur concurrently on the same data item.
The two rules ensure that one-copy serializability is maintained.
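A minimal sketch that checks both quorum rules for a given vote assignment (function and variable names are illustrative):

```python
def quorum_ok(V, Vr, Vw):
    # Rule 1: read and write quorums must overlap (Vr + Vw > V), so a read
    # quorum always contains at least one site with the newest version.
    # Rule 2: two write quorums must overlap (Vw > V/2), so two writes cannot
    # commit concurrently on the same data item.
    return (Vr + Vw > V) and (2 * Vw > V)

print(quorum_ok(V=3, Vr=2, Vw=2))  # True: satisfies both rules
print(quorum_ok(V=3, Vr=1, Vw=1))  # False: quorums may not overlap
```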
Raft consensus algorithm
Raft is a consensus algorithm designed as an alternative to Paxos. Raft offers a generic way to distribute a state machine across a cluster of computing systems, ensuring that each node in the cluster agrees upon the same series of state transitions.
Decomposition into sub components:
- Leader election : Centralized control simplifies data and task synchronization. Raft uses randomized timers to elect leaders.
- Log replication : the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own
- Safety
Raft achieves consensus via an elected leader. A server in a raft cluster is either a leader or a follower, and can be a candidate in the precise case of an election (leader unavailable). The leader is responsible for log replication to the followers. It regularly informs the followers of its existence by sending a heartbeat message. Each follower has a timeout (typically between 150 and 300 ms) in which it expects the heartbeat from the leader. The timeout is reset on receiving the heartbeat. If no heartbeat is received, the follower changes its status to candidate and starts a leader election.
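A minimal sketch of just the follower-side election timer described above (class and method names are assumptions, not from the Raft paper):

```python
import random
import time

ELECTION_TIMEOUT = (0.150, 0.300)  # seconds, matching the 150-300 ms range above

class Follower:
    def __init__(self):
        self.state = "follower"
        self.reset_timer()

    def reset_timer(self):
        # Randomized timeouts reduce the chance of split votes.
        self.deadline = time.monotonic() + random.uniform(*ELECTION_TIMEOUT)

    def on_heartbeat(self):
        # A heartbeat from the leader resets the election timer.
        self.reset_timer()

    def tick(self):
        # No heartbeat within the timeout: become a candidate and start an election.
        if time.monotonic() >= self.deadline:
            self.state = "candidate"

node = Follower()
time.sleep(0.35)   # simulate a silent (failed) leader
node.tick()
print(node.state)  # "candidate"
```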
Non-Consensus Approach
- Consistency As Logical Monotonicity (CALM)
In general terms, the CALM principle says that:
monotonic -> without coordination protocols
non-monotonic -> with coordination protocols.
- logically monotonic distributed code is eventually consistent without any need for coordination protocols (distributed locks, two-phase commit, paxos, etc.)
- eventual consistency can be guaranteed in any program by protecting non-monotonic statements (“points of order”) with coordination protocols.
(A block of code is logically monotonic if it satisfies a simple property: adding things to the input can only increase the output. By contrast, non-monotonic code may need to “retract” a previous output if more is added to its input.)
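An illustrative contrast (not from the CALM papers) between a monotonic and a non-monotonic statement:

```python
# Monotonic: set union -- adding input can only add to the output.
def members_union(known, incoming):
    return known | incoming

# Non-monotonic: "is 'c' absent?" -- a later arrival of 'c' retracts the
# earlier answer, which is why such statements need a coordination point.
def c_is_missing(items):
    return "c" not in items

print(members_union({"a", "b"}, {"c"}))   # {'a', 'b', 'c'} (order may vary)
print(c_is_missing({"a"}))                # True ... until 'c' arrives
print(c_is_missing({"a", "c"}))           # False -- the earlier output is retracted
```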
- Conflict-Free Replicated Data Types (CRDT)
consistency without consensus
CRDTs are objects that can be updated without expensive synchronization/consensus and they are guaranteed to converge eventually if all concurrent updates are commutative and if all updates are executed by each replica eventually.
There are two approaches, state-based and operation-based; based on the replication model, they define two types of CRDTs: CvRDT (convergent replicated data type) and CmRDT (commutative replicated data type). A tiny state-based example follows below.
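A grow-only counter is the classic state-based (CvRDT) example; the class below is a sketch, not a library API:

```python
class GCounter:
    """Grow-only counter: each replica increments only its own slot."""
    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.replica_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); b.increment(); b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())  # 3 3 -- both replicas converge
```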
- ACID 2.0
- Associative
- Commutative
- Idempotent
- Distributed
Bloom filter
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. It tells us whether a search key is definitely not in the set or may be in the set.
Two Bloom filter operations:
- The ability to add an extra object ‘x’ to the set ‘S’.
- To determine whether a given object ‘x’ is a member of set ‘S’.
Its space efficiency and query time far exceed those of typical algorithms; the drawbacks are a certain false-positive rate and the difficulty of deleting elements.
While risking false positives, Bloom filters have a strong space advantage over other data structures for representing sets.
Bloom filters also have the unusual property that the time needed either to add an item or to check whether an item is in the set is a fixed constant, O(k) for k hash functions, completely independent of the number of items already in the set.
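A minimal sketch of both operations (the sizes and the salted-SHA-256 hashing scheme are illustrative choices):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k        # m bits, k hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        # Add object x to set S: set all k bits.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # Membership test: False means "definitely not in S", True means "maybe".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"), bf.might_contain("bob"))  # True False (with high probability)
```

Both operations touch exactly k bits, which is where the constant add/lookup time comes from.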
Gossip Protocol
3 Stages in Gossip:
- Susceptible: Node “n” has not yet received an update.
- Infective: Node “n” has received an update and is now ready to forward the update.
- Removed: Node “n” has the update but is no longer willing to share it.
2 Styles of Epidemic Protocols:
Anti-entropy
- Each Node “n” periodically contacts a random peer “p” selected from the pool.
- Then, n and p engage in an information exchange protocol: n to p (push), n from p (pull), or both (push-pull).
Rumor mongering (or Dissemination protocols)
- Nodes are initially ignorant.
- When an update is received by a node, it becomes a hot rumor.
- While a node holds a hot rumor, it periodically chooses a random peer from the pool and sends (push) the rumor to it.
- Eventually, a node will lose interest in spreading the rumor.
Each node maintains a small local membership table that provides a partial view on the complete set of nodes, and periodically refreshes the table using a gossiping procedure.
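A toy push-style rumor-mongering loop on a fully connected pool (node count and round count are arbitrary):

```python
import random

def gossip_rounds(n_nodes=16, rounds=10, seed=1):
    random.seed(seed)
    infected = {0}                 # node 0 starts with the update (infective)
    for _ in range(rounds):
        pushed_to = set()
        for node in infected:
            peer = random.randrange(n_nodes)   # pick a random peer from the pool
            pushed_to.add(peer)                # push the rumor to it
        infected |= pushed_to
    return len(infected)

print(gossip_rounds())  # typically all (or nearly all) 16 nodes after a few rounds
```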
Performance
Capacity: Amount of available resources.
Utilization: % of capacity in use.
Processing Time: Time to process a request.
Latency: Any wait time to start processing the request, including network transfer.
Throughput: request/second
- Threads
  - General-purpose solution for managing concurrency.
  - Multiple independent execution streams.
  - Shared state.
  - Pre-emptive scheduling.
  - Synchronization via locks, semaphores, conditions.
Usage
- OS: one kernel thread for each user process.
- CPU-bound apps: one kernel thread per CPU (e.g. scientific applications).
- Distributed systems: process requests concurrently (overlap I/Os).
- GUIs: threads correspond to user actions.
Problem
Synchronization : Must coordinate access to shared resources with locks. Forget to hold a lock? Corrupted data.
Deadlock: Circular dependencies among locks. Each process waits for some other process: system hangs!
Callbacks don’t work with locks.
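A minimal sketch of the shared-state point above: two threads update one counter, and the lock is what keeps the result correct:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # coordinate access to the shared resource
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)              # 200000; without the lock the result can be corrupted
```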
- Events
  - One execution stream: no CPU concurrency.
  - Register interest in event callbacks.
  - Event loop (scheduler) waits for events, invokes handlers.
  - No preemption of event handlers.
  - Handlers generally short-lived.
Usage
- GUIs or the browser's HTML DOM: one handler for each event (press button, right-click on mouse); the handler implements the behavior (delete, undo files, etc.).
- Distributed systems: one handler for each source of input (socket); the handler processes the incoming request and sends the response. Event-driven I/O for the I/O event loop.
Problem
Long-running handlers make application non-responsive.
Cannot maintain local state across events because handler must return.
No CPU concurrency
Event-driven I/O not always well supported (e.g. poor write buffering)
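A toy single-threaded event loop mirroring the bullets above (the register/emit/run structure is hypothetical, not a real framework API):

```python
from collections import deque

handlers = {}
event_queue = deque()

def register(event_type, handler):
    handlers.setdefault(event_type, []).append(handler)

def emit(event_type, payload=None):
    event_queue.append((event_type, payload))

def run():
    # One execution stream: handlers run to completion, no preemption.
    while event_queue:
        event_type, payload = event_queue.popleft()
        for handler in handlers.get(event_type, []):
            handler(payload)     # handlers should be short-lived

register("click", lambda _: print("button clicked"))
emit("click")
run()
```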
- Threads vs. Events
Events avoid concurrency as much as possible; threads embrace it.
Debugging easier with events.
Application logic flows are typically sequential, so it is harder to code in an event-driven style.
Events are faster than threads on a single CPU: no locks, no context switching.
Threads provide true concurrency
Use events for GUIs, distributed systems.
Only use threads where true CPU concurrency is needed.
- scale up: make it faster for the user experience, refactor code
- scale out: optimize for serving hundreds more users
Application caching and cache replication; update the database and the local cache at the same time.
The application cache above serves content; the one below is for read-only / hot data.
Parallel Programming
Programming model :
- shared memory
Multiple threads communicate through shared memory, so they need locking and synchronization to keep atomicity. Many scalability issues: deadlock, starvation, race conditions.
- distributed memory
Data distribution and communication orchestration is essential for performance.
- Message passing
Processes communicate by explicit send-receive pairs.
| Advantages of Shared Memory | Advantages of Message Passing |
|---|---|
| Implicit communication | Explicit communication (send/receive) |
| Low overhead when cached | Easier to control data placement |

| Disadvantages of Shared Memory | Disadvantages of Message Passing |
|---|---|
| Complex to build a scalable system | Message passing overhead can be quite high |
| Requires synchronization | More complex to program |
| Hard to control data placement within the caching system | Introduces the question of reception technique |
- Single Program Multiple Data pattern (SPMD):
the same program runs against multiple data on each processor or node.
Flow: initialize -> split the data correctly and evenly -> run the same program -> combine the results -> finalize (a single-machine sketch follows below)
Drawback: no dynamic load balancing.
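A single-machine analogue of the SPMD flow above using processes (an illustration, not a cluster setup):

```python
from multiprocessing import Pool

def same_program(chunk):
    # Every worker runs this same function on its own slice of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000))
    n = 4
    chunks = [data[i::n] for i in range(n)]    # static, even split -- no rebalancing later
    with Pool(n) as pool:
        partials = pool.map(same_program, chunks)
    print(sum(partials))                        # combine the results / finalize
```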
- Master-worker pattern:
Particularly relevant for problems using the task parallelism pattern, where tasks have no dependencies (see the sketch below).
The main challenge is determining when the entire problem is complete, because you cannot know the finishing time of each task.
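A sketch of the pattern with independent tasks; `as_completed` is how the master learns the whole problem is done even though individual finish times are unknown:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def work(task):
    return task * task        # any independent unit of work

if __name__ == "__main__":
    tasks = range(20)
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(work, t) for t in tasks]     # master hands out tasks
        results = [f.result() for f in as_completed(futures)]
    print(sorted(results)[:5])   # [0, 1, 4, 9, 16]
```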
- Future/Defer
FutureTask can be used to start a lengthy computation before its results are needed, e.g. loading info in the background.
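A Python analogue of the FutureTask idea (using concurrent.futures rather than Java's API):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def load_info():
    time.sleep(0.2)              # pretend this is a slow load
    return {"user": "demo"}

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(load_info)   # computation starts now, before it is needed
    # ... do other work here ...
    print(future.result())            # block only when the value is actually needed
```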
- Fork/Join pattern:
Supports the divide-and-conquer pattern, using work stealing to balance load among worker threads.
- Parallel Array - Fine-grained Algorithm
Parallelize a large data set by recursively breaking it into smaller data sets.
How to choose a design pattern that facilitates the mapping of tasks to units of execution?
1. Organize by tasks 2. Organize by data decomposition 3. Organize by flow of data
4 steps to create a parallel program
decomposition -> assignment -> orchestration -> mapping
- Data Parallelism
Array data structure – by rows, columns or blocks
Recursive data structure – trees into sub-trees
- Task Parallelism
Each execution process (thread) performs different functions on same or different data.
- Pipeline Parallelism
Each stage works on an item while the previous stage handles the next one. Throughput goes up, but per-item latency stays the same (a toy pipeline is sketched below).
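A toy two-stage pipeline connected by queues: each stage works on a different item at the same time, so throughput rises while per-item latency stays the same:

```python
import threading, queue

def stage1(inp, out):
    for item in iter(inp.get, None):   # read until the None sentinel
        out.put(int(item))             # stage 1: parse
    out.put(None)

def stage2(inp, results):
    for value in iter(inp.get, None):
        results.append(value * 2)      # stage 2: compute

q1, q2, results = queue.Queue(), queue.Queue(), []
threads = [threading.Thread(target=stage1, args=(q1, q2)),
           threading.Thread(target=stage2, args=(q2, results))]
for t in threads: t.start()
for raw in ["1", "2", "3"]: q1.put(raw)
q1.put(None)
for t in threads: t.join()
print(results)   # [2, 4, 6]
```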
Amdahl's law
A program can be sped up by adding computing resources, based on proportion of serial and parallelizable components.
p = fraction of work that can be parallelized
1 - p = the remaining work that cannot be parallelized
n = the number of processors
speedup = 1 / ((1 - p) + (p / n))
speed up = old running time / new running time
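A quick numeric check of the formula above:

```python
def speedup(p, n):
    # p = parallelizable fraction, n = number of processors
    return 1 / ((1 - p) + p / n)

print(round(speedup(p=0.9, n=8), 2))     # ~4.71: the 10% serial part limits the gain
print(round(speedup(p=0.9, n=1000), 2))  # ~9.91: can never exceed 1 / (1 - p) = 10
```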
3 Keys to Parallel performance
- Coverage of parallelism in the algorithm : Amdahl's law
- Granularity of partitioning among CPUs : load balance, communication cost
- Locality of computation and communication : Communication between CPUs or between CPUs and their memories.
Single Thread Performance
1. Load balance the work among processors. 2. Make execution on each processor faster.
Performance Consideration
For computation-intensive (CPU-bound) tasks that do no I/O and access no shared data:
- N_cpu or N_cpu + 1 threads -> optimal throughput, because more threads can degrade performance as the threads compete for CPU and memory resources.
Load balancing factors
1. Task costs (load balance) 2. Task dependencies 3. Locality
MERKLE TREES
- Implemented based on Binary Trees.
- Hashes the data and stores the data hashes in the leaf nodes only.
- Each node has up to 2 children.
- Space efficient as it stores hashes instead of full content of the files.
- Uses for data verification in distributed systems: P2P networks, Bitcoin, Git.
- Bitcoin uses it as a digital fingerprint of the entire set of transactions.
If there is an odd number of transactions, the last transaction hash will be duplicated to create an even number of leaf nodes.
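A minimal Merkle-root computation following the rules above (SHA-256 is an illustrative choice of hash):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transactions):
    # Leaves hold hashes of the data; each parent hashes the concatenation
    # of its two children; an odd last hash is duplicated.
    level = [sha256(tx) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

print(merkle_root([b"tx1", b"tx2", b"tx3"]).hex())
```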
Mining
Mining is the mechanism that allows the blockchain to be decentralized yet secure. Its main job is verifying the legitimacy of transactions.
Proof-Of-Work: Miners compete to solve a difficult mathematical problem based on a cryptographic hash algorithm. The solution found is called the Proof-Of-Work
Proof of Stake: a person can mine or validate block transactions according to how many coins he or she holds. This means that the more Bitcoin or altcoin owned by a miner, the more mining power he or she has.
Zero Knowledge Proof : One party can prove to another party that a statement is true without conveying any information beyond the fact that the statement is true.
- Completeness : If the statement is true, the honest verifier will be convinced of the fact by an honest prover.
- Soundness : If the statement is false, no cheating prover can convince the honest verifier that it is true, except with some small probability.
- Zero-knowledge : If the statement is true, the verifier learns nothing other than this fact; knowing just the statement (not the secret) is sufficient to imagine a scenario showing that the prover knows the secret.
BlockChain
Digital currency and online payment system that uses encryption techniques to
- Regulate the generation of units of currency
- Verify the transfer of funds
- Decentralized virtual currency
- Mining: anyone can use their machines to verify and record payments in the public ledger.
Ownership of bitcoin is established through private and public keys and a wallet with a bitcoin address.
Mining:
It is an operation of inverse hashing: finding a nonce so that the block's hash meets the difficulty target.
Miners compete to solve a difficult mathematical problem based on a cryptographic hash algorithm (known as Proof-of-Work).
The competition to solve the PoW algorithm is to earn the reward and the right to record transactions on the blockchain.
Block chain smart contract: piece of code that lives on a blockchain, not a server. When a pre-programmed condition is triggered, the smart contract executes the corresponding contractual clause.
Replication
- To keep data geographically close to users.
- To allow the system to continue working even if some parts of the system have failed.
- To scale out the number of nodes that can serve read queries.
Leader-based Replication
client --> read/write --> Leader ----[write]----> Followers ---> Leader -----[response] --> client
Hashing
Hashing is a process that maps data of a variable length to data of a fixed length.
A common way of load balancing across n cache nodes is to put object o in cache node number hash(o) mod n.
=> disadvantage: when cache nodes are added or removed, nearly every object needs to hash to a new location
Consistent Hashing (CH)
Hash both objects and caches using the same hash function to map the cache to an interval.
When a cache is removed, its interval is taken over by a cache with an adjacent interval while the other caches remain unchanged; only K/n keys need to be remapped on average, where K is the number of keys and n is the number of slots.
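A minimal hash-ring sketch (no virtual nodes, which real implementations add for smoother balance; MD5 is just an example hash):

```python
import bisect, hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Both caches and objects are hashed onto the same ring.
        self.ring = sorted((h(n), n) for n in nodes)

    def lookup(self, key):
        # An object belongs to the first cache clockwise from its hash.
        hashes = [hv for hv, _ in self.ring]
        idx = bisect.bisect(hashes, h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(["cacheA", "cacheB", "cacheC"])
print(ring.lookup("user:42"))
```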
Rendezvous(HRW) Hashing
Rendezvous hashing uses Highest Random Weight (HRW) to distribute objects uniformly over all nodes.
When a node is removed, only the keys mapped to that node need to be remapped (see the sketch below).
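A sketch of HRW lookup: score every node for the key and pick the highest; removing a node only remaps the keys that node was winning:

```python
import hashlib

def score(node: str, key: str) -> int:
    return int(hashlib.md5(f"{node}:{key}".encode()).hexdigest(), 16)

def owner(nodes, key):
    return max(nodes, key=lambda n: score(n, key))

nodes = ["n1", "n2", "n3"]
winner = owner(nodes, "user:42")
print(winner)
print(owner([n for n in nodes if n != winner], "user:42"))  # only this key's owner changes
```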