Find Duplicate File in System [Medium] — Content HashMap
Problem Statement
Given a list of directory info strings in the format "root/d1/d2 f1.txt(content1) f2.txt(content2)", where the first token is a directory path and each following token is a file name with its content in parentheses, return all groups of files that share the same content. Only groups containing at least two files are reported.
Input: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)"]
Output: [["root/a/1.txt","root/c/3.txt"]] — "2.txt" has unique content ("efgh"), so it is not part of any group.
Solutions
Python
from collections import defaultdict

def findDuplicate(paths: list[str]) -> list[list[str]]:
    content_map = defaultdict(list)      # content -> list of full file paths
    for path in paths:
        parts = path.split()
        root = parts[0]                  # directory path
        for f in parts[1:]:              # each "name.txt(content)" token
            name, content = f.split('(')
            content = content[:-1]       # strip trailing ')'
            content_map[content].append(f"{root}/{name}")
    # keep only groups with more than one file, i.e. true duplicates
    return [v for v in content_map.values() if len(v) > 1]
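As a quick sanity check, the hashing approach can be exercised on the sample input from the problem statement. The snippet below is a self-contained sketch (it repeats the function so it runs on its own):

```python
from collections import defaultdict

def findDuplicate(paths: list[str]) -> list[list[str]]:
    content_map = defaultdict(list)      # content -> list of full file paths
    for path in paths:
        root, *files = path.split()
        for f in files:
            name, content = f.split('(')
            content_map[content[:-1]].append(f"{root}/{name}")
    return [v for v in content_map.values() if len(v) > 1]

paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)"]
print(findDuplicate(paths))  # [['root/a/1.txt', 'root/c/3.txt']]
```

Because `defaultdict` preserves insertion order, the group for "abcd" appears first; "efgh" maps to a single file and is filtered out.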
C++
#include <string>
#include <sstream>
#include <vector>
#include <unordered_map>
using namespace std;

vector<vector<string>> findDuplicate(vector<string>& paths) {
    unordered_map<string, vector<string>> mp;   // content -> full file paths
    for (auto& p : paths) {
        stringstream ss(p);
        string root, f;
        ss >> root;                             // directory path
        while (ss >> f) {                       // each "name.txt(content)" token
            int l = f.find('('), r = f.find(')');
            string content = f.substr(l + 1, r - l - 1);
            string name = root + "/" + f.substr(0, l);
            mp[content].push_back(name);
        }
    }
    vector<vector<string>> res;
    for (auto& [k, v] : mp)
        if (v.size() > 1) res.push_back(v);     // duplicates only
    return res;
}
Complexity
- Time: O(n×k), where n is the total number of files and k is the average length of a file entry (path plus content) — every character is scanned once during parsing and hashing
- Space: O(n×k) for the hash map keyed by content