This lab dives into the low-level "magic" that makes cloud storage reliable and consistent: data integrity via hashing, data protection via erasure coding, and cluster coordination via consensus protocols.
Learning Objectives
- Manually verify data integrity using hashing.
- Understand how erasure coding protects against data loss with less overhead than replication.
- Observe distributed consensus and leader election in action.
Prerequisites
- Linux environment with Python installed.
- Docker installed.
- Tools: sha256sum (part of GNU coreutils) and zfec (install with pip install zfec).
Step 1: Bit-Rot and Hashing
Cloud storage providers use checksums to detect "bit-rot" (silent data corruption).
- Create a file and hash it:
echo "This is important data that must not be corrupted." > data.txt
sha256sum data.txt > data.txt.sha256
cat data.txt.sha256
- Simulate corruption: use sed to change a single character without changing the file size.
sed -i 's/important/importamt/' data.txt
- Verify the hash:
sha256sum -c data.txt.sha256
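The output should show data.txt: FAILED, followed by a warning that 1 computed checksum did NOT match; the command also exits with a non-zero status.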
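The same check can be sketched in Python with the standard hashlib module. This is roughly what a storage node's background "scrubber" does when it re-reads stored objects: hash each file in chunks and compare the result with the recorded digest. It is a minimal sketch that assumes the two-column DIGEST  FILENAME format sha256sum writes, resolves filenames relative to the current directory (as sha256sum -c does), and uses the hypothetical helper names sha256_file and verify_checksum_file.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=64 * 1024):
    """Return the hex SHA-256 of a file, read in chunks so large files never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_checksum_file(checksum_path):
    """Mimic `sha256sum -c`: map each listed filename to True (match) or False (mismatch)."""
    results = {}
    for line in Path(checksum_path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes the name with '*' in binary mode
        results[name] = sha256_file(name) == expected.lower()
    return results
```

Run from the lab directory after the sed step, print(verify_checksum_file("data.txt.sha256")) should report {'data.txt': False}.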
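Analysis: Rerun sha256sum data.txt and compare the new digest with the one stored in data.txt.sha256. Changing a single character ('n' to 'm', which flips only two bits) produces a completely different hash. This is the "avalanche effect", and it is why even a tiny on-disk corruption is detected.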
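To measure the avalanche effect directly, the short Python sketch below hashes the file contents from before and after the sed edit and counts how many bits differ, first in the inputs and then in the SHA-256 digests. The byte strings are assumed to match data.txt exactly, including the trailing newline that echo appends; bit_difference is a hypothetical helper.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Contents of data.txt before and after the sed edit (echo appends "\n").
before = b"This is important data that must not be corrupted.\n"
after = b"This is importamt data that must not be corrupted.\n"

d1 = hashlib.sha256(before).digest()
d2 = hashlib.sha256(after).digest()

print("input bits changed: ", bit_difference(before, after))  # 2 ('n' -> 'm')
# For a good hash, roughly half of the 256 digest bits are expected to flip.
print("digest bits changed:", bit_difference(d1, d2), "of", len(d1) * 8)
```

A two-bit change in the input should flip somewhere around 128 of the 256 output bits, so the old and new digests look unrelated.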
Step 2: Erasure Coding in Action
Replication (keeping 3 full copies) is expensive: it stores 3x the raw data in order to survive the loss of any 2 copies. Erasure Coding (EC) provides the same failure tolerance with much less storage overhead.
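To put numbers on that claim, here is the raw storage needed per byte of user data, using the notation of this step ($k$ data blocks plus $m$ parity blocks, with $k=4$, $m=2$ as in the encoding below):

$$
\text{overhead}_{\text{replication}} = 3,
\qquad
\text{overhead}_{\text{EC}} = \frac{k+m}{k} = \frac{4+2}{4} = 1.5
$$

Both layouts survive the loss of any 2 pieces, but the $k=4, m=2$ code stores half as many bytes as 3-way replication.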
- Set up zfec:
pip install zfec
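- Encode a file: We will split a file into 4 data blocks ($k=4$) and 2 parity blocks ($m=2$), 6 blocks in total. This setup can survive the loss of ANY 2 blocks. Note that the zfec command-line tool uses -m for the total number of blocks ($k+m=6$), not the number of parity blocks.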
# Create a 40 KiB file of random data
dd if=/dev/urandom of=original_file.dat bs=1k count=40
# Encode
zfec -p encoded_ -k 4 -m 6 original_file.dat
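zfec writes one share file per block, six in total, numbered 0 to 5 (run ls encoded_* to see them). The exact naming depends on the zfec version; typical names are encoded_.0_6.fec through encoded_.5_6.fec.
- Simulate Disk Failure: Delete two of the "data" blocks (shares 0 and 3), adjusting the names to match your ls output.
rm encoded_.0_6.fec encoded_.3_6.fec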
- Reconstruct:
zunfec -o reconstructed.dat encoded_*.fec
- Verify:
diff original_file.dat reconstructed.dat && echo "SUCCESS: Reconstruction perfect!"
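To see why any 2 of the 6 blocks can be lost, here is a minimal Reed-Solomon-style sketch in Python. It is an illustrative assumption, not zfec's implementation: zfec works over the finite field GF(2^8), whereas this sketch does arithmetic modulo the prime 2^31 - 1 so that plain Python integers suffice. The k data symbols are treated as the values of a polynomial of degree less than k at x = 0..k-1 (so the code is systematic), the m parity symbols are its values at x = k..k+m-1, and any k surviving points determine the polynomial again through Lagrange interpolation. The names encode and decode are hypothetical.

```python
# Illustrative Reed-Solomon-style erasure code over the prime field GF(P).
# Assumption: a teaching sketch, NOT zfec's implementation (zfec uses GF(2^8)).
P = 2**31 - 1  # Mersenne prime; every operation below is taken modulo P


def _interpolate(points, x):
    """Evaluate, at x, the unique polynomial of degree < len(points) through points (Lagrange)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # Division in GF(P): multiply by den^(P-2), the inverse of den (Fermat).
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total


def encode(data, m):
    """Systematic encoding: shares 0..k-1 are the data itself, shares k..k+m-1 are parity."""
    k = len(data)
    points = list(enumerate(data))  # data[i] is the polynomial's value at x = i
    parity = [_interpolate(points, x) for x in range(k, k + m)]
    return list(data) + parity


def decode(survivors, k):
    """Recover the k data symbols from any k surviving (share_index, value) pairs."""
    if len(survivors) < k:
        raise ValueError(f"need at least {k} shares, got {len(survivors)}")
    points = survivors[:k]
    return [_interpolate(points, x) for x in range(k)]


# k = 4 data symbols (e.g. bytes) plus m = 2 parity symbols, as in the zfec step.
data = [72, 105, 33, 7]
shares = encode(data, m=2)
# Simulate losing shares 0 and 3, like deleting two data blocks.
survivors = [(i, s) for i, s in enumerate(shares) if i not in (0, 3)]
print(decode(survivors, k=4) == data)  # True
```

Because every share, data or parity, is just another point on the same polynomial, it does not matter which 2 are lost. This "any k of k+m" property is what the zfec reconstruction above relies on; with only 3 survivors, decode refuses because 3 points cannot pin down a degree-3 polynomial.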
Step 3: Consensus Dynamics (etcd & Raft)
Distributed systems need to agree on who is the "leader" and what the current state is. We will use etcd, a key-value store that implements the Raft consensus protocol.
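Raft only elects a leader or commits a write with the agreement of a majority (quorum) of the $n$ cluster members, which fixes how many failed nodes a cluster can tolerate:

$$
q = \left\lfloor \frac{n}{2} \right\rfloor + 1,
\qquad
f = n - q = \left\lfloor \frac{n-1}{2} \right\rfloor
$$

For $n=3$ this gives $q=2$ and $f=1$; for $n=1$ it gives $f=0$, which is why a single node can demonstrate a self-election but not a failover.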
- Run a single etcd node: for simplicity, we run one node and observe it elect itself leader. A realistic lab would start 3 nodes (for example with a Docker Compose file) so you can watch an actual election between peers.
docker run -d --name etcd-node \
-p 2379:2379 \
quay.io/coreos/etcd:v3.5.0 \
/usr/local/bin/etcd \
--name s1 \
--advertise-client-urls http://0.0.0.0:2379 \
--listen-client-urls http://0.0.0.0:2379 \
--initial-advertise-peer-urls http://0.0.0.0:2380 \
--listen-peer-urls http://0.0.0.0:2380 \
--initial-cluster s1=http://0.0.0.0:2380 \
--initial-cluster-token tkn \
--initial-cluster-state new
- Write a value:
docker exec etcd-node etcdctl put mykey "myvalue"
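- Kill the Leader (Simulation): In a multi-node setup, you would docker stop the leader node and watch the remaining nodes elect a new one. With the single node used here, observe how it elected itself instead.
- Run:
docker logs -f etcd-node
- Look for log messages containing "became leader", "term", and "vote".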
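Analysis: In a multi-node cluster, the logs show candidates requesting votes and at most one node becoming leader per term. All writes go through that leader and are committed only once a majority of nodes has stored them, which keeps the replicated state consistent. With a single node, you will only see it vote for itself and become leader.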
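The election rules behind those log lines can be sketched in a few lines of Python. This is a toy model under loud assumptions: it has no timers, heartbeats, or log replication, and it omits Raft's check that a candidate's log is at least as up to date as the voter's. It keeps the two rules that guarantee at most one leader per term: each node grants at most one vote per term, and a candidate needs votes from a majority of all cluster members. Node and run_election are hypothetical names, not etcd APIs.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    alive: bool = True
    current_term: int = 0
    voted_for: Optional[str] = None

    def handle_request_vote(self, term: int, candidate: str) -> bool:
        """Simplified RequestVote rule: at most one vote per term, no log check."""
        if not self.alive or term < self.current_term:
            return False  # stopped node, or a stale candidate
        if term > self.current_term:
            # Seeing a newer term moves this node into it with a fresh vote.
            self.current_term = term
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(candidate: Node, cluster: list[Node]) -> bool:
    """Candidate starts a new term, votes for itself, and asks every peer for a vote."""
    if not candidate.alive:
        return False
    candidate.current_term += 1
    candidate.voted_for = candidate.name
    votes = 1
    for peer in cluster:
        if peer is not candidate and peer.handle_request_vote(candidate.current_term, candidate.name):
            votes += 1
    majority = len(cluster) // 2 + 1  # quorum counts ALL members, stopped or not
    return votes >= majority


cluster = [Node("s1"), Node("s2"), Node("s3")]
print(run_election(cluster[0], cluster))  # True: s1 wins term 1 with 3 votes
cluster[0].alive = False                  # simulate "docker stop" on the leader
print(run_election(cluster[1], cluster))  # True: s2 wins term 2 with 2 of 3 votes
cluster[2].alive = False                  # lose a second node
print(run_election(cluster[1], cluster))  # False: 1 of 3 votes is no quorum
```

Stopping the leader of a 3-node cluster still leaves a majority, so a new leader is elected in a higher term. Losing a second node leaves no quorum, and the cluster stops electing leaders (and accepting writes) rather than risk two leaders disagreeing.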
Lab Reflection
- If you have $k=10, m=4$ in erasure coding, how many disk failures can you survive? What is the storage overhead?
- Why is Raft preferred over simple "master-slave" replication for cloud metadata?
- How does hashing help detect tampering by a "man-in-the-middle" in cloud storage, and why must the expected hash itself come from a trusted source?