226 - Database Final 2

Lec12: Distributed database

  • Sites share no physical components
  • Sites run independently of each other
  • A transaction may access data at one or more sites

Replication – multiple copies of data, stored in different sites, for faster retrieval and fault tolerance

Fragmentation – Relation is partitioned into several fragments stored in distinct sites

Data replication

  • Full replication of a relation: relation stored at all sites
  • Fully redundant databases: entire DB in every site
  • Pros
    • Availability: site failure is OK if replicas exist
    • Parallelism: queries processed by several nodes in parallel
    • Reduced data transfer: relation available locally at site w/ a replica
  • Cons
    • Increased cost of updates: update each replica
    • Increased complexity of concurrency control: concurrent updates to distinct replicas
      • One solution: choose one copy as primary copy

Horizontal fragmentation

–Partition a table horizontally
–A fragment == a subset of rows == shard
–A table is divided into one or more horizontal fragments

Vertical fragmentation

– Partition a table vertically
–A fragment == a subset of columns
–Ex: 2 vertical fragments of employee:
•(ssn, name, bdate, addr, sex), (ssn, salary, super_ssn, Dno)
–All fragments must contain a common candidate key (or superkey, or tuple-id)
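As a purely illustrative sketch (plain Python lists and dicts, not a real DBMS; the column names follow the employee example above), horizontal and vertical fragmentation might look like this:

```python
# Hypothetical employee rows; columns follow the example above.
employees = [
    {"ssn": "111", "name": "Ann", "bdate": "1990-01-01", "addr": "A st", "sex": "F",
     "salary": 50000, "super_ssn": None, "dno": 5},
    {"ssn": "222", "name": "Bob", "bdate": "1985-06-15", "addr": "B st", "sex": "M",
     "salary": 60000, "super_ssn": "111", "dno": 4},
]

# Horizontal fragmentation (sharding): each fragment is a subset of rows,
# here partitioned by department number.
shard_dno_4 = [r for r in employees if r["dno"] == 4]
shard_dno_5 = [r for r in employees if r["dno"] == 5]

# Vertical fragmentation: each fragment is a subset of columns, and every
# fragment keeps the candidate key (ssn) so rows can be reconstructed by join.
frag1_cols = ["ssn", "name", "bdate", "addr", "sex"]
frag2_cols = ["ssn", "salary", "super_ssn", "dno"]
frag1 = [{c: r[c] for c in frag1_cols} for r in employees]
frag2 = [{c: r[c] for c in frag2_cols} for r in employees]

# Reconstruction: join the vertical fragments on the shared key.
rebuilt = [{**a, **next(b for b in frag2 if b["ssn"] == a["ssn"])} for a in frag1]
assert rebuilt == employees
```

The last line shows why every vertical fragment must keep a candidate key: it is what allows the original rows to be reconstructed by a join.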

Data Fragmentation

  • Advantages:
    • parallel processing on fragments of a relation
    • Locality
  • Data transparency

    • User is unaware of details of how and where data is stored

      • Fragmentation transparency

      • Replication transparency

      • Location transparency

Transaction coordinator:

  • Start execution of txns that originate at the site
  • Distribute sub-txns to sites for execution
  • Coordinate termination of txn that originates at the site (commit or abort at all sites)

Transaction Manager

  • Maintain log for recovery purposes
  • Coordinate concurrent txns executing at this site

System Failure Modes

Network partition : A network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them
-- Network partitioning and site failures are generally indistinguishable

Commit Protocol

Ensure atomicity across sites: a txn executing at multiple sites is either committed at all the sites or aborted at all the sites

Two-phase commit (2PC) protocol: widely used

Phase 1 : Get a decision

Ci asks all participants to prepare to commit txn T
–Ci adds <prepare T> to (local) log and forces log to stable storage
–sends prepare T msg to all sites at which T executed

Upon receiving msg, txn manager at each site determines if it can commit T
–if not, add <no T> to log and send abort T msg to Ci
–if yes
•add <ready T> to log: a promise by the site to commit or abort T as Ci decides
•force all logs for T to stable storage
•send ready T msg to Ci

Phase 2 : Record a decision

Ci waits to receive responses from all participating sites, or times out
• If Ci received a ready T msg from all participating sites, commit T; else abort T
• Receiving just one abort T msg: good enough for Ci to abort T
Ci adds <commit T> or <abort T> to (local) log and forces log to stable storage
–Once on stable storage, it is irrevocable (even if failures occur)
Ci sends the decision to each participant
–Each participant records <commit T> or <abort T> to log
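A minimal in-memory sketch of the two phases (hypothetical classes, no networking, and log writes modeled as list appends rather than forced writes to stable storage):

```python
class Participant:
    """Transaction manager at one site."""
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.log = name, can_commit, []

    def prepare(self, txn):
        # Phase 1: decide locally; <ready T> is a promise to follow Ci's decision.
        if self.can_commit:
            self.log.append(f"<ready {txn}>")   # forced to stable storage in a real system
            return "ready"
        self.log.append(f"<no {txn}>")
        return "abort"

    def finish(self, txn, decision):
        # Phase 2: record the coordinator's decision locally.
        self.log.append(f"<{decision} {txn}>")


class Coordinator:
    """Transaction coordinator Ci."""
    def __init__(self, participants):
        self.participants, self.log = participants, []

    def run(self, txn):
        # Phase 1: ask every site that executed T to prepare.
        self.log.append(f"<prepare {txn}>")
        votes = [p.prepare(txn) for p in self.participants]

        # Phase 2: commit only if every site voted ready; one abort vote is enough to abort.
        decision = "commit" if all(v == "ready" for v in votes) else "abort"
        self.log.append(f"<{decision} {txn}>")   # irrevocable once on stable storage
        for p in self.participants:
            p.finish(txn, decision)
        return decision


sites = [Participant("S1"), Participant("S2", can_commit=False)]
print(Coordinator(sites).run("T1"))   # -> "abort", because S2 voted no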

Failure Handling

2PC Cons - coordinator failure may result in blocking, where a decision either to commit or to abort T may have to be postponed until Ci recovers

Site Failure

  • Ci detects a site failure
    • if the site fails before responding with a ready T msg: Ci assumes an abort T msg
    • if the site fails after Ci received its ready T msg: Ci ignores the failure ➜ it holds a promise
  • When site Sk recovers, it examines its log to determine fate of txns active at the time of the failure

    –Log contains <commit T> record: redo (T)

    –Log contains <abort T> record: undo (T)

    –Log contains <ready T> record: consult Ci for the fate of T
    –Log contains no control records (commit, abort, ready) concerning T:
    ➜ Sk failed before responding to the prepare T msg from Ci; since the failure of Sk precludes sending such a response, Ci must abort T
    ➜ Sk must execute undo (T)
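A hypothetical sketch of the check a recovering site Sk could run over its local log (the consult-Ci call and the redo/undo actions are stand-ins passed in as functions):

```python
def recover(log, txn, ask_coordinator, redo, undo):
    """Decide the fate of transaction `txn` at a recovering site Sk from its local log."""
    if f"<commit {txn}>" in log:
        redo(txn)                        # decision already recorded: redo T
    elif f"<abort {txn}>" in log:
        undo(txn)                        # decision already recorded: undo T
    elif f"<ready {txn}>" in log:
        decision = ask_coordinator(txn)  # promise made: must consult Ci for T's fate
        redo(txn) if decision == "commit" else undo(txn)
    else:
        undo(txn)                        # never voted ready, so Ci must have aborted T

# Example: the site crashed after voting ready; it asks Ci and learns the outcome.
recover(["<ready T1>"], "T1",
        ask_coordinator=lambda t: "commit",
        redo=lambda t: print("redo", t),
        undo=lambda t: print("undo", t))   # -> redo T1
```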

Coordinator Failure

If coordinator Ci fails while the commit protocol for T is executing, then participating sites must decide on T’s fate:

  • If an active site contains <commit T> in log ➜ commit T
  • If an active site contains <abort T> in log ➜ abort T
  • If some active participating site does not contain <ready T> in log, then the failed coordinator Ci cannot have decided to commit T : Can abort T; such a site must reject any subsequent <prepare T> message from Ci
  • else, all active sites must have <ready T> in their logs but no additional control records (such as <abort T> or <commit T>) ➜ active sites must wait for Ci to recover to find the decision ➜ blocking problem

Single lock manager

– a single lock manager resides in a single chosen site, say Si
– a txn sends a lock request to Si, and the lock manager determines whether the lock can be granted

  • Txn can read data from any site w/ a replica of the data
  • Writes must be performed on all replicas of a data item
  • Pros: simple implementation, simple deadlock handling
  • Cons: bottleneck, vulnerability

Distributed lock manager

Local lock manager at each site; controls access to local data items

  • Pros: Simple implementation, reduce bottleneck, reduce vulnerability
  • Cons: deadlock detection is more complex
  • Primary copy: choose one replica of data item to be the primary copy

    –Site containing the replica is called the primary site for that data item

    • When a txn needs to lock a data item Q, it requests a lock at the primary site of Q
    • Pros: simple implementation
    • Cons: If the primary site of Q fails, Q is inaccessible even though other sites containing a replica may be accessible
  • Quorum Consensus Protocol :

    • Each site is assigned a weight; S = the total of all site weights. A more reliable site can be given a higher weight
    • Choose two values: read quorum QR and write quorum QW

      –Rules: QR + QW > S and 2*QW > S

    • Read: must lock enough replicas that sum(site weight) >= QR

    • Write: must lock enough replicas that sum(site weight) >= QW
    • Pros: the cost of either read or write locking can be selectively reduced

      •Optimize read: smaller QR, larger QW; optimize write: larger QR, smaller QW
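A small sketch of the quorum rules with assumed, made-up site weights:

```python
def valid_quorums(weights, qr, qw):
    """Check the quorum consensus rules: QR + QW > S and 2*QW > S."""
    s = sum(weights.values())
    return qr + qw > s and 2 * qw > s

def has_quorum(locked_sites, weights, quorum):
    """A read (or write) may proceed once the locked replicas' weights reach the quorum."""
    return sum(weights[site] for site in locked_sites) >= quorum

weights = {"S1": 3, "S2": 1, "S3": 1}             # S = 5; the more reliable site S1 weighs more
qr, qw = 2, 4                                     # read-optimized: small QR, large QW
assert valid_quorums(weights, qr, qw)
assert has_quorum({"S2", "S3"}, weights, qr)      # a cheap read from two light replicas
assert not has_quorum({"S2", "S3"}, weights, qw)  # a write must also lock S1
```

With QR = 2 and QW = 4, reads are cheap (any two light replicas suffice) while writes must include the heavyweight site, which is the read-optimized trade-off described above.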

Timestamp

Each transaction must be given a unique timestamp (TS)

– Each site generates a unique local TS using either a logical counter or the local clock
Global unique TS: {unique local TS, unique site identifier}
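A minimal sketch of globally unique timestamps as (local counter, site id) pairs compared lexicographically (the class and names are illustrative):

```python
import itertools

class TimestampGenerator:
    """Per-site generator: a logical counter plus the site identifier."""
    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = itertools.count(1)

    def next_ts(self):
        # Tuples compare lexicographically, so ties on the counter are broken by
        # site id, making every global timestamp unique across sites.
        return (next(self.counter), self.site_id)

s1, s2 = TimestampGenerator(1), TimestampGenerator(2)
assert s1.next_ts() < s2.next_ts() < s1.next_ts()   # (1,1) < (1,2) < (2,1)
```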

Replication with Weak Consistency

  • Replication of data with weak degrees of consistency (w/o guarantee of serializability)
    • i.e., QR + QW ≤ S or 2*QW ≤ S
    • Usually only when not enough sites are available to ensure quorum
    • Tradeoffs: consistency, availability, latency/performance
  • Key issues:
    • Reads may get old versions
    • Writes may occur in parallel, leading to inconsistent versions
  • Master-slave Replication

    • txn updates: done at a single “master” site and propagated to “slave” sites
    • Update propagation - not part of the txn

      • immediately (after txn commits), or async (lazy, periodic)

      –txns may read from slave sites, but not update them

      –Replicas see a transaction-consistent snapshot of the DB at the primary site

  • Multi-master replication

    –txn updates done at any replica: update the local copy, and the system propagates updates to other replicas transparently - the txn is unaware of data replication

  • Lazy replication

    –Txn updates: at one site (primary or any)

    –Update propagation: lazy (not on critical path, after txn commits)
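An illustrative in-memory sketch of master-slave replication with lazy propagation (dicts standing in for sites; propagation is a deferred step, not part of the committing txn):

```python
class Slave:
    def __init__(self):
        self.data = {}   # read-only for transactions; may lag behind the master

class Master:
    def __init__(self, replicas):
        self.data, self.pending, self.replicas = {}, [], replicas

    def commit(self, updates):
        # Updates are applied at the single master site; propagation is queued,
        # not performed on the transaction's critical path.
        self.data.update(updates)
        self.pending.append(dict(updates))

    def propagate(self):
        # Lazy/asynchronous step, run after commit (e.g., periodically).
        for updates in self.pending:
            for r in self.replicas:
                r.data.update(updates)
        self.pending.clear()

slaves = [Slave(), Slave()]
m = Master(slaves)
m.commit({"x": 1})
print(slaves[0].data)   # {} -- stale until propagation runs
m.propagate()
print(slaves[0].data)   # {'x': 1}
```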

Deadlock detection: a distributed deadlock cannot be detected locally at any single site
➜ Centralized detection: a global wait-for graph is constructed and maintained at a single site

CAP

Availability : To be robust (ability of system to function even during failures)
➜ Detect failures, Reconfigure system, Recovery/Reintegration

Consistency:
● Strong consistency: serializable schedule
● Weak consistency: master-slave, multi-master, lazy replication, etc.

Trade-off between A and C
–Most distributed consensus algorithms block during partitions to ensure consistency
• Paxos does not block as long as a majority of participants are alive
–Many apps require continued operation even during network partition
• Even at the cost of consistency

Consistency: All copies have the same value

Availability: System can run even if parts have failed

Partition-tolerant: Survive network partitioning

★ It is impossible to achieve all three ➜ Any two can be achieved
★ CAP theorem only matters when there is a partition

Eventual Consistency

when no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
– You may not know how long it may take

• Known as BASE (Basically Available, Soft state, Eventual consistency)
Soft state: copies of a data item may be inconsistent

• Used by most NoSQL

Lec13 : No SQL

  • Schema-less data stored as some form of key-value pair DB
  • Scalability: scale out
  • (usually) availability instead of consistency: CAP
  • Eventual consistency (BASE)

| | RDBMS | NoSQL |
| --- | --- | --- |
| Type of data | structured data | semi- and un-structured data |
| Organization of data | schema: tables/rows/columns | schema-less: key-value pairs |
| Design steps | requirements ➜ data model (ERD, UML) ➜ tables ➜ … | store data first ➜ (transform data) ➜ get/put |
| Data association | referential integrity, foreign keys | you are on your own |
| Functionality | complex | simple |
| Properties | ACID | BASE |
| Pros | (strong) consistency; transactions; joins | schema-less; scalability, availability; performance |
| Cons | scalability; availability | eventual consistency; tricky join; no transactions |

NoSQL–Key-value Pair

  • Simple architecture
    • a unique identifier (key) maps to an obj (value)
    • the DB itself does not interpret the obj (see the example below)
  • Good for
    • Simple data model
    • Scalability
  • Bad for : “Complex” datasets
  • Examples: AWS DynamoDB, Azure Tables, Google Cloud Datastore, Riak, Redis, Azure CosmosDB
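For example, with the redis-py client (assuming a Redis server on the default local port; the key and value are made up):

```python
import json
import redis

# Connect to a local Redis instance (assumed to be running on localhost:6379).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The DB sees only an opaque value under a unique key; any structure inside
# the value (here a JSON string) is the application's concern, not the DB's.
r.set("user:42", json.dumps({"name": "Ann", "plan": "free"}))
print(json.loads(r.get("user:42"))["name"])   # -> Ann
```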

NoSQL–Column-family/BigTable

  • Key-value pair + grouping
    • Key – {a set of values} (see the sketch below)
    • E.g., time-series data from sensors, languages of a web page
  • Good for
    • More complex data set (than simple key-value pair)
    • Scalability
  • Examples : Cassandra, HBase, Hypertable, Azure CosmosDB
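A rough in-memory picture of the model (pure Python, no real store): a row key maps to a named group of values, and different rows need not share columns:

```python
# row key -> {column name -> value}; e.g., time-series readings from sensors,
# where each row collects one sensor's samples and the columns are timestamps.
sensor_readings = {
    "sensor-1": {"2024-01-01T00:00": 21.5, "2024-01-01T00:05": 21.7},
    "sensor-2": {"2024-01-01T00:00": 19.2},          # rows need not share columns
}

# Reads address a row key and then pick the wanted columns.
row = sensor_readings["sensor-1"]
latest = max(row)            # ISO timestamps sort chronologically as strings
print(latest, row[latest])   # -> 2024-01-01T00:05 21.7
```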

NoSQL –Document

  • The DB knows the nature of the data
    • Document – JSON
    • Needs indexes to improve performance
  • Scalability – clustering, replication; some are partition-tolerant
  • Good for
    • Systems already using document-style data
    • Scalability
  • Examples: MongoDB, CouchDB, AWS DynamoDB, Azure CosmosDB

NoSQL–Graph

  • Models how data is related and what calculations are to be performed
  • Usually also supports transactions
  • Good for
    • Interconnected Data and non-tabular
      • Geospatial problems, recommendation engine, network analysis, bioinformatics
    • “Closeness” of relationship
      • How vulnerable a company is to “bad news” for another company
  • Bad for: Simple tabular data
  • Examples: Neo4j, Azure CosmosDB
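A toy sketch of the “closeness” idea on interconnected data (an adjacency dict plus a breadth-first search; the companies and edges are invented):

```python
from collections import deque

# Company -> companies it is directly related to (suppliers, owners, partners, ...).
edges = {
    "AcmeCo": ["PartsInc"],
    "PartsInc": ["MetalCorp"],
    "MetalCorp": [],
}

def distance(graph, start, target):
    """Number of hops between two nodes; fewer hops means more exposure to each other's news."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == target:
            return d
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None   # not connected

print(distance(edges, "AcmeCo", "MetalCorp"))   # -> 2
```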

Cassandra

  • Scalable High Availability NoSQL, key-value pair + column family (≈ table in RDBMS)
  • Data model: a partitioned row store
    • Rows are organized into tables
    • Partition: consistent hashing - decides storage node (≈ AWS DynamoDB)
      • adding/removing nodes in a cluster only reshuffles keys among neighboring nodes (see the sketch below)
  • Tunable consistency: decided by client apps, per operation
    • Consistency level: 1, 2, 3, quorum, all, any, etc. (quorum ≈ AWS DynamoDB)
  • W: sent to all replicas; R: the number of replicas consulted is decided by the consistency level
  • Replication and multi-DC replication
    • asynchronous masterless (peer-to-peer) replication
  • Cassandra Query Language (CQL): SQL like
    • No joins and subqueries
    • materialized views (i.e. pre-computed results of queries)
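A simplified sketch of the consistent-hashing idea behind the partitioner (one token per node, no virtual nodes or replication, and MD5 standing in for the real partitioner's hash):

```python
import bisect
import hashlib

class Ring:
    """Map keys to nodes on a hash ring; a new node takes over only part of one neighbor's range."""
    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        tokens = [t for t, _ in self.ring]
        i = bisect.bisect(tokens, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in ("alice", "bob", "carol", "dave")}
ring = Ring(["node-a", "node-b", "node-c", "node-d"])           # add a node
after = {k: ring.node_for(k) for k in before}
moved = [k for k in before if before[k] != after[k]]
print(moved)   # only keys in the new node's slice of the ring change owner
```

Only keys whose hashes fall into the new node's slice of the ring move, which is why adding or removing a node reshuffles keys only among neighboring nodes.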

HBase

  • Column-oriented DB based on Google BigTable
  • Distributed, (timestamp) versioned, NoSQL DB on top of Hadoop and HDFS
  • Strongly consistent reads and writes: not eventual consistency
    • All operations are atomic at the row level, regardless of the # of columns
    • Write-ahead logging (WAL): protects from server failure
  • Region-based storage architecture: a set of rows, [starting-row-key, ending-row-key) (illustrated below)
    • Auto sharding of tables
  • HBase Master: master node
    • Assign regions to region servers
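A toy illustration of region-based storage: regions are sorted [starting-row-key, ending-row-key) ranges, and a row key is served by exactly one region server (region boundaries and server names are invented):

```python
import bisect

# Regions are sorted, non-overlapping row-key ranges assigned to region servers.
regions = [
    ("",  "g", "regionserver-1"),   # [ "", "g" )
    ("g", "p", "regionserver-2"),   # [ "g", "p" )
    ("p", "",  "regionserver-3"),   # [ "p", +inf )
]

def server_for(row_key):
    """Find the region whose [start, end) range contains the row key."""
    starts = [start for start, _, _ in regions]
    i = bisect.bisect_right(starts, row_key) - 1
    return regions[i][2]

print(server_for("order-123"))   # -> regionserver-2 ("o" falls in ["g", "p"))
```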

MongoDB

  • Scalable distributed document DB
    • Nested JSON document (binary form of JSON –BSON)
    • A document: a JSON obj w/ key-value pairs ≈ row in RDBMS
  • CRUD: JavaScript, ACID at document level
  • No joins: related data can still be retrieved via relationships ➜ $lookup (see the sketch below)
  • Replica Sets
    • One server as primary (master), others as secondary (w/ replicated data)
    • Client only reads from or writes to master
  • Auto sharding – range-based, hash-based, user-defined
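A hedged example with the pymongo driver (assuming a local mongod and made-up collections orders and customers), where $lookup plays the role of a join:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
db = client["shop"]                                  # hypothetical database

# Made-up documents; explicit _id values keep the example deterministic.
db.customers.insert_one({"_id": 1, "name": "Ann"})
db.orders.insert_one({"_id": 10, "customer_id": 1, "total": 25})

# No SQL join, but $lookup pulls related documents in from another collection.
pipeline = [
    {"$lookup": {
        "from": "customers",
        "localField": "customer_id",
        "foreignField": "_id",
        "as": "customer",
    }}
]
for order in db.orders.aggregate(pipeline):
    print(order["total"], order["customer"][0]["name"])   # -> 25 Ann
```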

CouchDB

  • JSON- and REST-based document-oriented DB
  • Each document
    • Unique immutable _id: auto-generated or explicitly assigned per doc
    • _rev: revision per doc, starting from 1, auto-modified after the doc changes
    • Each update/delete must specify both _id and _rev
  • CRUD: Futon web interface
  • No transaction, No locking
  • View: Content generated by mapreduce functions incrementally
  • Changes API
  • Multi-master replication: HA
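A sketch of the _id/_rev discipline over CouchDB's REST API using the requests library (assuming a local CouchDB on port 5984, a database named demo, and authentication omitted):

```python
import requests

base = "http://localhost:5984/demo"   # assumed local CouchDB database
requests.put(base)                    # create the database if it does not exist yet

# Create a document with an explicitly assigned _id; CouchDB returns its first _rev.
doc_id = "book-1"
resp = requests.put(f"{base}/{doc_id}", json={"title": "Distributed DBs"})
rev = resp.json()["rev"]

# Every update must carry both the _id (in the URL) and the current _rev;
# with a stale _rev, CouchDB rejects the write as a conflict.
resp = requests.put(f"{base}/{doc_id}",
                    json={"_rev": rev, "title": "Distributed DBs", "edition": 2})
print(resp.json())   # new revision on success; an error document on a conflict
```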

NoSQL Pros and Cons

Pros:
–schema-less
–scalability (scale out), availability
–performance
–simpler change management (“agile”, faster dev)

Cons:
–eventual consistency
–tricky (server-side) join
–no transaction
–schema-less

Trade-offs (w/ RDBMS): functionality, availability, consistency, durability, scalability, performance
