Exercise 1: Shell Script for Detecting and Removing Duplicate Files

Objective:

Write a shell script that compares the content of files in a specified directory to identify duplicates and optionally removes them.


Requirements

  1. Input:
     • The script should accept a directory path as an argument.
     • Example:
         ./check-duplicates.sh /path/to/directory
  2. Output:
     • The script should list all files in the directory, with a column indicating whether each file is a duplicate. For a duplicate, it should name the file it matches.
     • Example:
         File                | Duplicate
         --------------------------------
         file1.txt           | No
         file2.txt           | Yes
         file3.txt           | Yes (file2.txt)
  3. Optional Flag:
     • Add a --fix flag to remove duplicate files. Exactly one copy of each set of duplicate files should be kept.
     • Example:
         ./check-duplicates.sh /path/to/directory --fix
         # Deleting duplicate: file3.txt
  4. Implementation Details:
     • Compare files based on their content. Use tools like md5sum or sha256sum to generate file hashes for comparison.
     • Ensure the script handles files of very different sizes efficiently. For instance, two files of different sizes can never be duplicates, so they do not need to be hashed against each other.
  5. Edge Cases:
     • Handle empty directories gracefully.
     • Display error messages for invalid directories or insufficient permissions.
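
The input and edge-case requirements above could be approached along the following lines. This is a minimal sketch, not a reference solution: the helper names parse_args and validate_dir and the globals TARGET_DIR and FIX_MODE are hypothetical, the flag is spelled --fix as in the example invocation, and bash is assumed as the shell.

```shell
#!/usr/bin/env bash
# Sketch only: parse_args, validate_dir, TARGET_DIR and FIX_MODE are
# hypothetical names chosen for illustration.

# Parse "<directory> [--fix]" into the globals TARGET_DIR and FIX_MODE.
parse_args() {
    TARGET_DIR=""
    FIX_MODE=0
    local arg
    for arg in "$@"; do
        case "$arg" in
            --fix) FIX_MODE=1 ;;
            -*)    echo "Error: unknown option: $arg" >&2; return 2 ;;
            *)
                if [ -n "$TARGET_DIR" ]; then
                    echo "Error: only one directory may be given" >&2
                    return 2
                fi
                TARGET_DIR=$arg
                ;;
        esac
    done
    if [ -z "$TARGET_DIR" ]; then
        echo "Usage: check-duplicates.sh <directory> [--fix]" >&2
        return 2
    fi
}

# Reject paths that are missing, are not directories, or cannot be read.
validate_dir() {
    local dir=$1
    if [ ! -e "$dir" ]; then
        echo "Error: '$dir' does not exist" >&2; return 1
    elif [ ! -d "$dir" ]; then
        echo "Error: '$dir' is not a directory" >&2; return 1
    elif [ ! -r "$dir" ] || [ ! -x "$dir" ]; then
        echo "Error: insufficient permissions to read '$dir'" >&2; return 1
    fi
}
```

A script built on these helpers might call parse_args "$@" followed by validate_dir "$TARGET_DIR", exiting with a non-zero status if either fails. Error messages go to standard error so they do not mix with the duplicate report.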

Evaluation Criteria

  1. Correctness:
     • Does the script accurately identify duplicates?
     • Does it correctly delete duplicates when --fix is provided?
  2. Code Quality:
     • Is the script modular and easy to read?
     • Does it make effective use of shell utilities (e.g., find, awk, sort)?
  3. Error Handling:
     • Are invalid inputs handled gracefully?
     • Are meaningful error messages displayed?
  4. Efficiency:
     • Does the script process large directories without excessive resource usage?
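
One common way to address the efficiency criterion is to hash only files that share their size with at least one other file, since files of different sizes cannot have identical content. The sketch below illustrates that pre-filter. The function name candidates_by_size is hypothetical, and the code assumes GNU find (for -printf) and file names without embedded newlines.

```shell
# Sketch only: candidates_by_size is a hypothetical helper name.
# Prints the paths of files whose byte size is shared with at least one
# other file under the given directory. Assumes GNU find (-printf) and
# file names that contain no newline characters.
candidates_by_size() {
    local dir=$1
    find "$dir" -type f -printf '%s\t%p\n' |
        awk -F'\t' '
            {
                count[$1]++                          # how many files have this size
                size[NR] = $1
                path[NR] = substr($0, length($1) + 2) # everything after the first tab
            }
            END {
                for (i = 1; i <= NR; i++)
                    if (count[size[i]] > 1) print path[i]
            }
        '
}
```

Only the paths printed by this filter would then need to be passed to md5sum or sha256sum. Every other file can be reported as unique without reading its content.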

Hints and Tips

  • Use find to list files recursively in the directory.
  • Use md5sum or sha256sum to calculate file hashes for comparison.
  • Store hashes and file paths in a temporary file or an associative array for processing.
  • Use awk or sed for formatting the output.
  • Test your script with different types of files and directory structures.
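
The hints above can be combined into a small reporting pipeline. The following is a rough sketch under stated assumptions rather than a complete solution: report_duplicates is a hypothetical function name, GNU sha256sum is assumed (64 hex digits, two separator characters, then the path), file names are assumed to contain no newlines or backslashes, and the --fix deletion step is left out.

```shell
# Sketch only: report_duplicates is a hypothetical helper name.
# Prints a "File | Duplicate" table for every regular file under a directory.
# Assumes GNU sha256sum output and file names without newlines or backslashes.
report_duplicates() {
    local dir=$1
    # Run find from inside the directory so paths are relative ("./name").
    # With "-exec ... {} +", sha256sum is not run at all if no files match.
    ( cd "$dir" && find . -type f -exec sha256sum {} + ) |
        LC_ALL=C sort -k2 |
        awk '
            {
                hash[NR] = substr($0, 1, 64)   # the hex digest
                path[NR] = substr($0, 69)      # skip digest, 2 separators, "./"
                if (!(hash[NR] in first)) first[hash[NR]] = path[NR]
                count[hash[NR]]++
            }
            END {
                printf "%-20s| %s\n", "File", "Duplicate"
                print "--------------------------------"
                for (i = 1; i <= NR; i++) {
                    h = hash[i]
                    if (count[h] == 1)            status = "No"
                    else if (first[h] == path[i]) status = "Yes"
                    else                          status = "Yes (" first[h] ")"
                    printf "%-20s| %s\n", path[i], status
                }
            }
        '
}
```

Sorting by path makes the choice of which copy counts as the first one deterministic. For an empty directory the pipeline prints only the two header lines, which is one possible way of handling that edge case gracefully.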