This lab dives into the low-level "magic" that makes cloud storage reliable and consistent: data integrity via hashing, data protection via erasure coding, and cluster coordination via consensus protocols.

Learning Objectives

  • Manually verify data integrity using hashing.
  • Understand how erasure coding protects against data loss with less overhead than replication.
  • Observe distributed consensus and leader election in action.

Prerequisites

  • Linux environment with Python installed.
  • Docker installed.
  • Tools: sha256sum (part of GNU coreutils) and zfec (installed with pip install zfec).

Step 1: Bit-Rot and Hashing

Cloud storage providers use checksums to detect "bit-rot" (silent data corruption).

  1. Create a file and hash it:
echo "This is important data that must not be corrupted." > data.txt
sha256sum data.txt > data.txt.sha256
cat data.txt.sha256
  2. Simulate corruption: Use sed to change a single character without changing the file size.
sed -i 's/important/importamt/' data.txt
  3. Verify the hash:
sha256sum -c data.txt.sha256

The output should show data.txt: FAILED, followed by a warning that 1 computed checksum did NOT match.
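
For readers who want to see what the check does under the hood, here is a minimal Python sketch of the same verification. The function names (sha256_of_file, verify_checksum_file, demo) are made up for this example, and it resolves filenames relative to the checksum file's directory, which is a simplification of how sha256sum -c behaves.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum_file(checksum_path):
    """Check each '<hex digest>  <filename>' line; return {filename: matches}."""
    results = {}
    base = Path(checksum_path).parent  # assumption: files sit next to the checksum file
    for line in Path(checksum_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes '*' to the name in binary mode
        results[name] = sha256_of_file(base / name) == expected.lower()
    return results


def demo():
    """Recreate the lab steps in a throwaway directory."""
    with tempfile.TemporaryDirectory() as workdir:
        data = Path(workdir) / "data.txt"
        sums = Path(workdir) / "data.txt.sha256"
        data.write_bytes(b"This is important data that must not be corrupted.\n")
        sums.write_text(f"{sha256_of_file(data)}  {data.name}\n")
        before = verify_checksum_file(sums)
        # Same one-character change as the sed command in the lab.
        data.write_bytes(data.read_bytes().replace(b"important", b"importamt"))
        after = verify_checksum_file(sums)
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("before corruption:", before)
    print("after corruption: ", after)
```

Reading in chunks keeps memory use constant, which matters when the object being verified is much larger than this lab's one-line file.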

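Analysis: The edit changed a single character (just two bits of the file), yet the hash no longer matches. Run sha256sum data.txt and compare the result with data.txt.sha256 to see that the new hash is completely different (the "avalanche effect").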
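
The avalanche effect can also be measured rather than eyeballed. The sketch below hashes the two versions of the lab sentence in memory and counts how many bits differ in the input and in the digest. The helper name bit_difference is invented for this example; for a good hash function the digest difference is typically close to half of the 256 bits.

```python
import hashlib

# The same bytes that echo writes to data.txt (echo appends a newline).
original = b"This is important data that must not be corrupted.\n"
corrupted = original.replace(b"important", b"importamt")


def bit_difference(a, b):
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


digest_a = hashlib.sha256(original).digest()
digest_b = hashlib.sha256(corrupted).digest()

# 'n' (0x6E) and 'm' (0x6D) differ only in their two lowest bits.
print("input bits changed: ", bit_difference(original, corrupted))
print("digest bits changed:", bit_difference(digest_a, digest_b), "of 256")
print(digest_a.hex())
print(digest_b.hex())
```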


Step 2: Erasure Coding in Action

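Replication (keeping 3 copies) is expensive: it stores 3x the data, a 200% overhead, to survive the loss of 2 copies. Erasure Coding (EC) provides similar protection with much less overhead: 4 data blocks plus 2 parity blocks also survive 2 losses while storing only 1.5x the data.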
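
The comparison comes down to simple arithmetic, sketched below. The function name storage_profile is hypothetical, and the model assumes a code in which any set of blocks equal in number to the data blocks is enough to rebuild the original (true of the Reed-Solomon style coding zfec uses). Replication is modelled as 1 data block plus 2 redundant copies.

```python
def storage_profile(data_blocks, redundant_blocks):
    """Summarise cost and fault tolerance for a data + redundancy layout."""
    total = data_blocks + redundant_blocks
    return {
        "raw_per_logical": total / data_blocks,           # bytes stored per byte of data
        "extra_percent": 100 * redundant_blocks / data_blocks,
        "losses_tolerated": redundant_blocks,             # assumes any data_blocks suffice
    }


replication_3x = storage_profile(data_blocks=1, redundant_blocks=2)
ec_4_plus_2 = storage_profile(data_blocks=4, redundant_blocks=2)

print("3x replication:", replication_3x)
print("EC 4+2:        ", ec_4_plus_2)
```

Both layouts tolerate two losses, but the erasure-coded one stores half as many raw bytes per byte of data.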

  1. Set up zfec:
pip install zfec
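  2. Encode a file: We will split a file into 4 data blocks and 2 parity blocks, for 6 blocks in total. Any 4 of the 6 blocks are enough to rebuild the file, so this setup can survive the loss of ANY 2 blocks. Note that zfec's -k is the number of blocks required ($k=4$) and its -m is the total number of blocks ($m=6$), not the parity count.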
# Create a 40KB file
dd if=/dev/urandom of=original_file.dat bs=1k count=40

# Encode
zfec -p encoded_ -k 4 -m 6 original_file.dat

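You will see six share files named in the form encoded_.0_6.fec, encoded_.1_6.fec ... up to encoded_.5_6.fec (share number, then total shares). Run ls encoded_* to confirm the exact names produced by your zfec version.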

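  3. Simulate Disk Failure: Delete two of the "data" blocks (shares 0 to 3 carry the file's data; shares 4 and 5 carry parity). Adjust the names if ls shows a different pattern.
rm encoded_.0_6.fec encoded_.3_6.fec
  4. Reconstruct:
zunfec -o reconstructed.dat encoded_*.fec
  5. Verify: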
diff original_file.dat reconstructed.dat && echo "SUCCESS: Reconstruction perfect!"
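
To build intuition for how a lost block can be recomputed from the survivors, here is a deliberately simplified sketch. It is not what zfec does: it uses a single XOR parity block, so it can repair only one lost block, whereas zfec uses Reed-Solomon style coding to repair several. The function names (xor_blocks, encode, recover) are invented for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data, k):
    """Split data into k equal blocks (zero-padded) plus one XOR parity block."""
    block_size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(block_size * k, b"\0")
    blocks = [padded[i * block_size:(i + 1) * block_size] for i in range(k)]
    return blocks + [xor_blocks(blocks)]


def recover(shares, original_length):
    """Rebuild the data from k+1 shares in which at most one entry is None."""
    missing = [i for i, share in enumerate(shares) if share is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can only repair one lost block")
    shares = list(shares)
    if missing:
        # XOR of all surviving shares equals the missing one.
        present = [share for share in shares if share is not None]
        shares[missing[0]] = xor_blocks(present)
    return b"".join(shares[:-1])[:original_length]


if __name__ == "__main__":
    payload = bytes(range(40))
    shares = encode(payload, k=4)
    shares[2] = None  # simulate one failed disk
    print("recovered correctly:", recover(shares, len(payload)) == payload)
```

Surviving two arbitrary losses, as in this step, needs a second parity block that is independent of the first, which is where the finite-field arithmetic behind Reed-Solomon codes comes in.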

Step 3: Consensus Dynamics (etcd & Raft)

Distributed systems need to agree on who is the "leader" and what the current state is. We will use etcd, which uses the Raft protocol.

  1. Run etcd: A full lab would start a 3-node cluster from a Compose file. For simplicity, we run a single node here and observe it elect itself leader.
docker run -d --name etcd-node \
  -p 2379:2379 \
  quay.io/coreos/etcd:v3.5.0 \
  /usr/local/bin/etcd \
  --name s1 \
  --advertise-client-urls http://0.0.0.0:2379 \
  --listen-client-urls http://0.0.0.0:2379 \
  --initial-advertise-peer-urls http://0.0.0.0:2380 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-cluster s1=http://0.0.0.0:2380 \
  --initial-cluster-token tkn \
  --initial-cluster-state new
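  2. Write a value:
docker exec etcd-node etcdctl put mykey "myvalue"
  3. Kill the Leader (Simulation): In a multi-node setup, you would docker stop the leader node and watch the remaining nodes elect a new one. With a single node there is nothing to fail over to, so inspect the startup election instead.
  4. Run docker logs -f etcd-node (press Ctrl+C to stop following).
  5. Look for terms like became leader, term, and vote.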

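Analysis: In a multi-node cluster, a candidate becomes leader only after a majority of nodes vote for it in the current term, so at most one leader is elected per term and all writes go through it, maintaining a consistent global state. In this single-node cluster the node is its own majority: it votes for itself and becomes leader immediately.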
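
The majority rule behind leader election can be captured in a few lines. This is a toy model of the arithmetic only, not etcd's or Raft's actual implementation, and the function names (quorum, wins_election, failures_tolerated) are made up for illustration.

```python
def quorum(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def wins_election(votes_received, cluster_size):
    """A candidate wins a term only with votes from a majority of the cluster."""
    return votes_received >= quorum(cluster_size)


def failures_tolerated(cluster_size):
    """Nodes that can fail while a majority is still reachable."""
    return cluster_size - quorum(cluster_size)


if __name__ == "__main__":
    for size in (1, 3, 5):
        print(f"{size} node(s): quorum={quorum(size)}, "
              f"tolerates {failures_tolerated(size)} failure(s)")
```

Because two disjoint groups of nodes cannot both be a majority, two candidates cannot both win the same term. It also shows why the single-node cluster in this step elects itself at once, and why it cannot survive any failure.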


Lab Reflection

  1. If you have 10 data blocks and 4 parity blocks ($k=10, m=4$; in zfec terms, -k 10 -m 14), how many disk failures can you survive? What is the storage overhead?
  2. Why is Raft preferred over simple "master-slave" replication for cloud metadata?
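  3. How does hashing help detect tampering by a "man-in-the-middle" in cloud storage, and why must the hash itself be protected (for example, signed or delivered over a trusted channel)?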