187. Repeated DNA Sequences

problem description

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Example:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC", "CCCCCAAAAA"]

algorithm thought

简单暴力的办法就是用一个map保存所有的切割的字符串，遍历字符串，每次substr 长度为10的字符串，然后与之前记录的字符串比较。如果有重复就加入res。但是对字符串操作是麻烦的，比较也需要一位位比。这里限制的条件是字符串中只有4个字符，如果用bit表示的话，4个字符只需要两位，一共10个字符，所以20位bit就可以表示所有substr的情况了。int类型是32位的，所以使用int就可以表示每个substr，用int来比较和保存也比字符串快多了。

tips：这里能用int表示，一是因为只有4个字符，二是长度只有10.如果有5个字符就需要3个bit，9个字符就需要4个bit。长度为10的话，int类型就不能表示所有的substr了，需要64位的longlong来辅助解题了。如果再长，就还是用字符串解题吧

code

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        unordered_map<char,int> ma{{'A',0},{'C',1},{'G',2},{'T',3}};
        unordered_map<int,int> cache;
        vector<string>res;
        int hash=0,mask=(1<<20)-1;
        for(int i=0;i<s.size();++i){
            hash=(hash<<2)+ma[s[i]];
            if(i>=9){
                hash=hash&mask;
                cache[hash]++;
                if(cache[hash]==2){
                    res.push_back(s.substr(i-9,10));
                }
            }
        }
        return res;
    }
};

algorithm analysis

hashmap插入和取值操作是O(1),循环中时间复杂度是O(1)，最后时间复杂度是O(n)

Previous179. Largest Number Next188. Best Time to Buy and Sell Stock IV

Last updated 5 years ago