Core Principles

This lab dives into the low-level "magic" that makes cloud storage reliable and consistent: data integrity via hashing, data protection via erasure coding, and cluster coordination via consensus protocols.

Learning Objectives

Manually verify data integrity using hashing.
Understand how erasure coding protects against data loss with less overhead than replication.
Observe distributed consensus and leader election in action.

Prerequisites

Linux environment with Python installed.
Docker installed.
Tools: sha256sum (part of GNU coreutils) and zfec (installed with pip install zfec).

Step 1: Bit-Rot and Hashing

Cloud storage providers use checksums to detect "bit-rot" (silent data corruption).

Create a file and hash it:

echo "This is important data that must not be corrupted." > data.txt
sha256sum data.txt > data.txt.sha256
cat data.txt.sha256

Simulate corruption: Use sed to change a single character without changing the file size.

sed -i 's/important/importamt/' data.txt

Verify the hash:

sha256sum -c data.txt.sha256

Output should show: data.txt: FAILED
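The check that sha256sum -c performs can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how any particular storage provider implements scrubbing; the function names sha256_of_file and verify are hypothetical, and the checksum file is assumed to contain a single line in the "HEX  filename" layout that sha256sum writes.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Reading in chunks keeps memory use flat even for very large files.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, checksum_path):
    """Compare a file against the first 'HEX  filename' line of a checksum file."""
    with open(checksum_path) as f:
        expected = f.read().split()[0]
    return sha256_of_file(path) == expected
```

Calling verify("data.txt", "data.txt.sha256") after the sed step should return False, mirroring the FAILED line above.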

Analysis: Note how changing a single character (only two bits of the input, since "n" and "m" differ in their two lowest bits) results in a completely different hash (the "avalanche effect").
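To put a number on the avalanche effect, the sketch below counts how many bits differ between the two inputs and between their SHA-256 digests. It is an illustration only: the helper differing_bits is a hypothetical name, and the byte strings assume the file content produced by the echo command above (including its trailing newline).

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = b"This is important data that must not be corrupted.\n"
corrupted = original.replace(b"important", b"importamt")

input_diff = differing_bits(original, corrupted)
hash_diff = differing_bits(
    hashlib.sha256(original).digest(),
    hashlib.sha256(corrupted).digest(),
)

print(f"input bits changed: {input_diff}")
print(f"hash bits changed:  {hash_diff} of 256")
```

For a well-behaved hash function, roughly half of the 256 output bits are expected to flip, even though only two input bits changed.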

Step 2: Erasure Coding in Action

Replication (keeping 3 copies) is expensive: it stores 3x the raw data. Erasure Coding (EC) provides comparable protection with much less overhead; the 4+2 layout used in this step stores only 1.5x.
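The overhead comparison is simple arithmetic, sketched here. The function name storage_overhead is hypothetical. Replication is treated as the special case of one data block plus extra full copies, and the erasure-coded example uses the 4 data plus 2 parity layout from later in this step.

```python
def storage_overhead(data_blocks: int, redundant_blocks: int) -> float:
    """Raw bytes stored per byte of user data."""
    return (data_blocks + redundant_blocks) / data_blocks


# 3-way replication: 1 original plus 2 extra copies, survives 2 losses.
print(storage_overhead(1, 2))  # 3.0

# 4 data + 2 parity blocks: also survives 2 losses.
print(storage_overhead(4, 2))  # 1.5
```

Both layouts tolerate two lost blocks, but the erasure-coded one stores half as many raw bytes.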

Setup zfec:

pip install zfec

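Encode a file: We will split a file into 4 data blocks ($k=4$) and add 2 parity blocks ($m=2$), for 6 blocks in total. Any 4 of the 6 blocks are enough to rebuild the file, so this setup can survive the loss of ANY 2 blocks. Note that the zfec command line defines -m differently: it is the total number of blocks (data plus parity), which is why the command passes -m 6.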

# Create a 40KB file
dd if=/dev/urandom of=original_file.dat bs=1k count=40

# Encode
zfec -p encoded_ -k 4 -m 6 original_file.dat

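You should see six share files. Recent zfec releases name them like encoded_.0_6.fec through encoded_.5_6.fec (share number, then total number of shares). Run ls encoded_*.fec to check, and adjust the names in the next command if your version differs.

Simulate Disk Failure: Delete two of the "data" blocks (shares 0 through 3 hold the original data; shares 4 and 5 hold parity).

rm encoded_.0_6.fec encoded_.3_6.fec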

Reconstruct:

zunfec -o reconstructed.dat encoded_*.fec

Verify:

diff original_file.dat reconstructed.dat && echo "SUCCESS: Reconstruction perfect!"
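To build intuition for why a missing block can be recomputed, here is a toy erasure code in pure Python. It is deliberately not the Reed-Solomon-style code that zfec uses: it adds a single XOR parity block, so it survives only one lost block rather than two. The function names encode and recover are hypothetical and unrelated to the zfec API.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data: bytes, k: int):
    """Split data into k equal blocks and append one XOR parity block."""
    if len(data) % k:
        raise ValueError("toy example: data length must be a multiple of k")
    size = len(data) // k
    blocks = [data[i * size:(i + 1) * size] for i in range(k)]
    return blocks + [xor_blocks(blocks)]


def recover(shares):
    """Rebuild the data from k+1 shares in which exactly one entry is None."""
    if shares.count(None) != 1:
        raise ValueError("single parity can repair exactly one lost share")
    missing = shares.index(None)
    survivors = [s for s in shares if s is not None]
    repaired = list(shares)
    # XOR of all surviving shares equals the missing one.
    repaired[missing] = xor_blocks(survivors)
    return b"".join(repaired[:-1])  # drop the parity block


shares = encode(b"ABCDEFGHIJKLMNOP", k=4)
shares[2] = None  # simulate a failed disk
print(recover(shares))  # b'ABCDEFGHIJKLMNOP'
```

Surviving two or more losses requires more parity blocks computed with finite-field arithmetic, which is the part zfec handles for you.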

Step 3: Consensus Dynamics (etcd & Raft)

Distributed systems need to agree on who is the "leader" and what the current state is. We will use etcd, which uses the Raft protocol.
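Raft makes progress only while a majority (a quorum) of the cluster is reachable, which is why cluster sizes are usually odd. The arithmetic can be sketched as follows; the function names are hypothetical.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Nodes that can fail while a majority still remains."""
    return cluster_size - quorum(cluster_size)


for n in (1, 3, 4, 5):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

A 4-node cluster tolerates no more failures than a 3-node one, so the extra node adds cost without adding fault tolerance.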

Run an etcd node: A real lab would use a compose file to run 3 nodes. For simplicity, we will run a single node and observe it elect itself leader.

docker run -d --name etcd-node \
  -p 2379:2379 \
  quay.io/coreos/etcd:v3.5.0 \
  /usr/local/bin/etcd \
  --name s1 \
  --advertise-client-urls http://0.0.0.0:2379 \
  --listen-client-urls http://0.0.0.0:2379 \
  --initial-advertise-peer-urls http://0.0.0.0:2380 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-cluster s1=http://0.0.0.0:2380 \
  --initial-cluster-token tkn \
  --initial-cluster-state new

Write a value:

docker exec etcd-node etcdctl put mykey "myvalue"

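Kill the Leader (Simulation): In a multi-node setup, you would docker stop the leader node and watch the remaining nodes elect a new one. With a single node there is no failover to observe, so inspect the startup election instead.

docker logs -f etcd-node

Look for terms like became leader, term, and vote.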

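Analysis: In a multi-node cluster, the nodes exchange vote requests and heartbeats so that only one leader handles writes in any given term, maintaining a consistent global state. The single node here is a majority of one, so it elects itself.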
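The voting rule behind those log messages can be sketched as a small simulation. This is a heavily simplified model and not etcd's implementation: it omits Raft's log up-to-date check, randomized timeouts, and heartbeats. The Node class, the run_election function, and the node names s2 and s3 are hypothetical (s1 matches the --name used above).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    term: int = 0
    voted_for: Optional[str] = None
    alive: bool = True

    def handle_vote_request(self, candidate: str, term: int) -> bool:
        """Grant at most one vote per term; adopt any newer term seen."""
        if not self.alive or term < self.term:
            return False
        if term > self.term:
            self.term, self.voted_for = term, None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(candidate: Node, cluster: list) -> bool:
    """Start a new term; win only with a majority of the full cluster."""
    candidate.term += 1
    candidate.voted_for = candidate.name
    votes = 1  # the candidate votes for itself
    for peer in cluster:
        if peer is not candidate and peer.handle_vote_request(
            candidate.name, candidate.term
        ):
            votes += 1
    return votes >= len(cluster) // 2 + 1


cluster = [Node("s1"), Node("s2"), Node("s3")]
cluster[0].alive = False                  # the old leader s1 is stopped
print(run_election(cluster[1], cluster))  # True: s2 and s3 form a majority
cluster[2].alive = False                  # a second node fails
print(run_election(cluster[1], cluster))  # False: one vote out of three
```

Because each node grants at most one vote per term and a win needs a strict majority, two candidates can never both win the same term.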

Lab Reflection

If you have $k=10, m=4$ (10 data blocks, 4 parity blocks) in erasure coding, how many disk failures can you survive? What is the storage overhead?
Why is Raft preferred over simple "master-slave" replication for cloud metadata?
How does hashing help detect tampering by a "man-in-the-middle" in cloud storage, and why must the hash itself be authenticated or obtained over a trusted channel?