187. Repeated DNA Sequences
problem description
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Example:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
algorithm thought
简单暴力的办法就是用一个map保存所有的切割的字符串,遍历字符串,每次substr 长度为10的字符串,然后与之前记录的字符串比较。如果有重复就加入res。但是对字符串操作是麻烦的,比较也需要一位位比。这里限制的条件是字符串中只有4个字符,如果用bit表示的话,4个字符只需要两位,一共10个字符,所以20位bit就可以表示所有substr的情况了。int类型是32位的,所以使用int就可以表示每个substr,用int来比较和保存也比字符串快多了。
tips:这里能用int表示,一是因为只有4个字符,二是长度只有10.如果有5个字符就需要3个bit,9个字符就需要4个bit。长度为10的话,int类型就不能表示所有的substr了,需要64位的longlong来辅助解题了。如果再长,就还是用字符串解题吧
code
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
unordered_map<char,int> ma{{'A',0},{'C',1},{'G',2},{'T',3}};
unordered_map<int,int> cache;
vector<string>res;
int hash=0,mask=(1<<20)-1;
for(int i=0;i<s.size();++i){
hash=(hash<<2)+ma[s[i]];
if(i>=9){
hash=hash&mask;
cache[hash]++;
if(cache[hash]==2){
res.push_back(s.substr(i-9,10));
}
}
}
return res;
}
};
algorithm analysis
hashmap插入和取值操作是O(1),循环中时间复杂度是O(1),最后时间复杂度是O(n)
Last updated