Finding the Longest Palindromic Substring in Linear Time
Fred Akalin
November 28, 2007

Another interesting problem I stumbled across on reddit is finding the longest substring of a given string that is a palindrome. I found the explanation on Johan Jeuring's blog somewhat confusing and I had to spend some time poring over the Haskell code (eventually rewriting it in Python) and walking through examples before it "clicked." I haven't found any other explanations of the same approach so hopefully my explanation below will help the next person who is curious about this problem.

Of course, the most naive solution would be to exhaustively examine all (n choose 2) substrings of the given n-length string, test each one to see if it's a palindrome, and keep track of the longest one seen so far. This has complexity O(n^3), but we can easily do better by realizing that a palindrome is centered on either a letter (for odd-length palindromes) or a space between letters (for even-length palindromes). Therefore we can examine all 2n + 1 possible centers and find the longest palindrome for that center, keeping track of the overall longest palindrome. This has complexity O(n^2).
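As a point of reference, the exhaustive approach described above might be sketched as follows. The function name is made up for illustration; it simply tries every substring and keeps the longest palindromic one, which is cubic in the worst case.

```python
def longest_palindrome_bruteforce(s):
    """Return the first longest palindromic substring of s (cubic time)."""
    best = ""
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = s[i:j]
            # Only bother with the reversal test if sub would beat the
            # current best.
            if len(sub) > len(best) and sub == sub[::-1]:
                best = sub
    return best


print(longest_palindrome_bruteforce("xabcbay"))  # abcba
```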

It is not immediately clear that we can do better but if we're told that a Θ(n) algorithm exists we can infer that the algorithm is most likely structured as an iteration through all possible centers. As an off-the-cuff first attempt, we can adapt the above algorithm by keeping track of the current center and expanding until we find the longest palindrome around that center, in which case we then consider the last letter (or space) of that palindrome as the new center. The algorithm (which isn't correct) looks like this informally:

    1. Set the current center to the first letter.
    2. Loop while the current center is valid:
        a. Expand to the left and right simultaneously until we find the largest palindrome around this center.
        b. If the current palindrome is bigger than the stored maximum one, store the current one as the maximum one.
        c. Set the space following the current palindrome as the current center unless the two letters immediately surrounding it are different, in which case set the last letter of the current palindrome as the current center.
    3. Return the stored maximum palindrome.

This seems to work but it doesn't handle all cases: consider the string "abababa". The first non-trivial palindrome we see is "a|bababa", followed by "aba|baba". Considering the current space as the center doesn't get us anywhere but considering the preceding letter (the second 'a') as the center, we can expand to get "ababa|ba". From this state, considering the current space again doesn't get us anywhere but considering the preceding letter as the center, we can expand to get "abababa|". However, this is incorrect as the longest palindrome is actually the entire string! We can remedy this case by changing the algorithm to try and set the new center to be one before the end of the last palindrome, but it is clear that having a fixed "lookbehind" doesn't solve the general case and anything more than that will probably bump us back up to quadratic time.

The key question is this: given the state from the example above, "ababa|ba", what makes the second 'b' so special that it should be the new center? To use another example, in "abcbabcba|bcba", what makes the second 'c' so special that it should be the new center? Hopefully, the answer to this question will lead to the answer to the more important question: once we stop expanding the palindrome around the current center, how do we pick the next center? To answer the first question, first notice that the current palindromes in the above examples themselves contain smaller non-trivial palindromes: "ababa" contains "aba" and "abcbabcba" contains "abcba" which also contains "bcb". Then, notice that if we expand around the "special" letters, we get a palindrome which shares a right edge with the current palindrome; that is, the longest palindromes around the special letters are proper suffixes of the current palindrome. With a little thought, we can then answer the second question: to pick the next center, take the center of the longest palindromic proper suffix of the current palindrome. Our algorithm then looks like this:

    1. Set the current center to the first letter.
    2. Loop while the current center is valid:
        a. Expand to the left and right simultaneously until we find the largest palindrome around this center.
        b. If the current palindrome is bigger than the stored maximum one, store the current one as the maximum one.
        c. Find the maximal palindromic proper suffix of the current palindrome.
        d. Set the center of the suffix from c as the current center and start expanding from the suffix as it is palindromic.
    3. Return the stored maximum palindrome.

However, unless step 2c can be done efficiently, it will cause the algorithm to be superlinear. Doing step 2c efficiently seems impossible since we have to examine the entire current palindrome to find the longest palindromic suffix unless we somehow keep track of extra state as we progress through the input string. Notice that the longest palindromic suffix would by definition also be a palindrome of the input string so it might suffice to keep track of every palindrome that we see as we move through the string and hopefully, by the time we finish expanding around a given center, we would know where all the palindromes with centers lying to the left of the current one are. However, if the longest palindromic suffix has a center to the right of the current center, we would not know about it. But we also have at our disposal the very useful fact that a palindromic proper suffix of a palindrome has a corresponding dual palindromic proper prefix. For example, in one of our examples above, "abcbabcba", notice that "abcba" appears twice: once as a prefix and once as a suffix. Therefore, while we wouldn't know about all the palindromic suffixes of our current palindrome, we would know about either it or its dual.
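The prefix/suffix duality mentioned above is easy to check by hand, but a small sketch may help. The helper names here are invented for illustration: the code lists the palindromic proper suffixes of a palindrome and confirms that each one also appears as a prefix.

```python
def is_palindrome(s):
    return s == s[::-1]


def palindromic_proper_suffixes(p):
    """Return the palindromic proper suffixes of p, longest first."""
    return [p[k:] for k in range(1, len(p)) if is_palindrome(p[k:])]


p = "abcbabcba"
suffixes = palindromic_proper_suffixes(p)
print(suffixes)  # ['abcba', 'a']

# Reversing p maps each suffix onto a prefix of the same length.  Since
# both p and the suffix are palindromes, that prefix reads the same as
# the suffix.
for suffix in suffixes:
    assert p.startswith(suffix)
```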

Another crucial realization is the fact that we don't have to keep track of all the palindromes we've seen. To use the example "abcbabcba" again, we don't really care about "bcb" that much, since it's already contained in the palindrome "abcba". In fact, we only really care about keeping track of the longest palindromes for a given center or equivalently, the length of the longest palindrome for a given center. But this is simply a more general version of our original problem, which is to find the longest palindrome around any center! Thus, if we can keep track of this state efficiently, maybe by taking advantage of the properties of palindromes, we don't have to keep track of the maximal palindrome and can instead figure it out at the very end.

Unfortunately, we seem to be back where we started; the second naive algorithm that we have is simply to loop through all possible centers and for each one find the longest palindrome around that center. But our discussion has led us to a different incremental formulation: given a current center, the longest palindrome around that center, and a list of the lengths of the longest palindromes around the centers to the left of the current center, can we figure out the new center to consider and extend the list of longest palindrome lengths up to that center efficiently? For example, if we have the state:

<"ab[a]ba|??", [0, 1, 0, 3, 0, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?]>

where the bracketed letter is the current center, the vertical line is our current position, the question marks represent unread characters or unknown quantities, and the array represents the list of longest palindrome lengths by center, can we get to the state:

<"aba[b]a|??", [0, 1, 0, 3, 0, 5, 0, ?, ?, ?, ?, ?, ?, ?, ?]>

and then to:

<"aba[b]aba|", [0, 1, 0, 3, 0, 5, 0, 7, 0, 5, 0, 3, 0, 1, 0]>

efficiently? The crucial thing to notice is that the longest palindrome lengths array (we'll call it simply the lengths array) in the final state is palindromic since the original string is palindromic. In fact, the lengths array obeys a more general property: the longest palindrome d places to the right of the current center (the d-right palindrome) is at least as long as the longest palindrome d places to the left of the current center (the d-left palindrome) if the d-left palindrome is completely contained in the longest palindrome around the current center (the center palindrome), and it is of equal length if the d-left palindrome is not a prefix of the center palindrome or if the center palindrome is a suffix of the entire string. This then implies that we can more or less fill in the values to the right of the current center from the values to the left of the current center. For example, from [0, 1, 0, 3, 0, 5, ?, ?, ?, ?, ?, ?, ?, ?, ?] we can get to [0, 1, 0, 3, 0, 5, 0, ≥3?, 0, ≥1?, 0, ?, ?, ?, ?]. This also implies that the first unknown entry (in this case, ≥3?) should be the new center because it means that the center palindrome is not a suffix of the input string (i.e., we're not done) and that the d-left palindrome is a prefix of the center palindrome.
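To make the mirroring property concrete, here is a small sketch that checks it against lengths computed by plain center expansion. The names are hypothetical, and d is measured in entries of the lengths array. For a center palindrome of length L at index c, the d-left palindrome fits inside the center palindrome with room to spare exactly when its length is below L - d, and it is a prefix of the center palindrome when its length equals L - d.

```python
def lengths_by_center(seq):
    """Quadratic reference: longest palindrome length for each of the
    2 * len(seq) + 1 centers (letters and the spaces around them)."""
    n = len(seq)
    out = []
    for i in range(2 * n + 1):
        s = i // 2
        e = s + i % 2
        while s > 0 and e < n and seq[s - 1] == seq[e]:
            s -= 1
            e += 1
        out.append(e - s)
    return out


def check_mirror_property(seq):
    """Return True if every center of seq obeys the mirroring property."""
    lengths = lengths_by_center(seq)
    for c, length in enumerate(lengths):
        for d in range(1, length + 1):
            left, right = lengths[c - d], lengths[c + d]
            bound = length - d  # distance from c - d to the left edge
            if left == bound:
                # The d-left palindrome is a prefix of the center
                # palindrome: the d-right one is at least as long.
                if right < bound:
                    return False
            elif right != min(left, bound):
                # Otherwise the d-right value is fully determined.
                return False
    return True


print(lengths_by_center("ababa"))        # [0, 1, 0, 3, 0, 5, 0, 3, 0, 1, 0]
print(check_mirror_property("abababa"))  # True
```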

From these observations we can construct our final algorithm which returns the lengths array, and from which it is easy to find the longest palindromic substring:

    1. Initialize the lengths array with one entry per possible center.
    2. Set the current center to the first center.
    3. Loop while the current center is valid:
        a. Expand to the left and right simultaneously until we find the largest palindrome around this center.
        b. Fill in the appropriate entry in the longest palindrome lengths array.
        c. Iterate through the longest palindrome lengths array backwards and fill in the corresponding values to the right of the entry for the current center until an unknown value (as described above) is encountered.
        d. Set the new center to the index of this unknown value.
    4. Return the lengths array.
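The steps above can be sketched fairly directly in Python. This is only one possible rendering, with names of my own choosing (palindrome_lengths, known, bound); centers are indices into the lengths array, so odd indices are letters and even indices are spaces between them.

```python
def palindrome_lengths(seq):
    """Return the longest palindrome length for each of the
    2 * len(seq) + 1 centers of seq."""
    n = len(seq)
    m = 2 * n + 1
    lengths = [0] * m
    c = 0      # current center, as an index into lengths
    known = 0  # length already known to be palindromic around c
    while c < m:
        # Expand around c, starting from what is already known.  A
        # letter (odd index) is trivially a palindrome of length 1.
        length = max(known, c % 2)
        s = (c - length) // 2
        e = (c + length) // 2
        while s > 0 and e < n and seq[s - 1] == seq[e]:
            s -= 1
            e += 1
        length = e - s
        lengths[c] = length

        # Walk left from c and copy values over to the right until we
        # hit an entry whose palindrome is a prefix of the center
        # palindrome; its mirror image is the next center.
        for d in range(1, length + 1):
            bound = length - d
            left = lengths[c - d]
            if left == bound:
                c, known = c + d, bound
                break
            lengths[c + d] = min(left, bound)
        else:
            # Nothing to the right is still unknown, so move to the
            # first center past the right edge.
            c, known = c + length + 1, 0
    return lengths


print(palindrome_lengths("abababa"))
# [0, 1, 0, 3, 0, 5, 0, 7, 0, 5, 0, 3, 0, 1, 0]
```

The right edge of the palindrome under consideration never moves left, since a new center inherits the right edge of the previous palindrome, which is the intuition behind the linear bound.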

Note that at each step of the algorithm we're either incrementing our current position in the input string or filling in an entry in the lengths array. Since the lengths array has size linear in the size of the input array, the algorithm has worst-case linear running time. Since given the lengths array we can find and return the longest palindromic substring in linear time, a linear-time algorithm to find the longest palindromic substring is the composition of these two operations.

Here is Python code that implements the above algorithm (although it is closer to Johan Jeuring's Haskell implementation than to the above description):
* An exercise for the reader: at the line marked "# *" in the code below, you might think that you can replace the == with >= to improve performance. This does not change the correctness of the algorithm but it does hurt performance, contrary to expectations. Why?

def fastLongestPalindromes(seq):
    """
    Behaves identically to naiveLongestPalindromes (see below), but
    runs in linear time.
    """
    seqLen = len(seq)
    l = []
    i = 0
    palLen = 0
    # Loop invariant: seq[(i - palLen):i] is a palindrome.
    # Loop invariant: len(l) >= 2 * i - palLen. The code path that
    # increments palLen skips the l-filling inner-loop.
    # Loop invariant: len(l) < 2 * i + 1. Any code path that
    # increments i past seqLen - 1 exits the loop early and so skips
    # the l-filling inner loop.
    while i < seqLen:
        # First, see if we can extend the current palindrome.  Note
        # that the center of the palindrome remains fixed.
        if i > palLen and seq[i - palLen - 1] == seq[i]:
            palLen += 2
            i += 1
            continue

        # The current palindrome is as large as it gets, so we append
        # it.
        l.append(palLen)

        # Now to make further progress, we look for a smaller
        # palindrome sharing the right edge with the current
        # palindrome.  If we find one, we can try to expand it and see
        # where that takes us.  At the same time, we can fill the
        # values for l that we neglected during the loop above. We
        # make use of our knowledge of the length of the previous
        # palindrome (palLen) and the fact that the values of l for
        # positions on the right half of the palindrome are closely
        # related to the values of the corresponding positions on the
        # left half of the palindrome.

        # Traverse backwards starting from the second-to-last index up
        # to the edge of the last palindrome.
        s = len(l) - 2
        e = s - palLen
        for j in range(s, e, -1):
            # d is the value l[j] must have in order for the
            # palindrome centered there to share the left edge with
            # the last palindrome.  (Drawing it out is helpful to
            # understanding why the - 1 is there.)
            d = j - e - 1

            # We check to see if the palindrome at l[j] shares a left
            # edge with the last palindrome.  If so, the corresponding
            # palindrome on the right half must share the right edge
            # with the last palindrome, and so we have a new value for
            # palLen.
            if l[j] == d: # *
                palLen = d
                # We actually want to go to the beginning of the outer
                # loop, but Python doesn't have loop labels.  Instead,
                # we use an else block corresponding to the inner
                # loop, which gets executed only when the for loop
                # exits normally (i.e., not via break).
                break

            # Otherwise, we just copy the value over to the right
            # side.  We have to bound l[j] because palindromes on the
            # left side could extend past the left edge of the last
            # palindrome, whereas their counterparts won't extend past
            # the right edge.
            l.append(min(d, l[j]))
        else:
            # This code is executed in two cases: when the for loop
            # isn't taken at all (palLen == 0) or the inner loop was
            # unable to find a palindrome sharing the left edge with
            # the last palindrome.  In either case, we're free to
            # consider the palindrome centered at seq[i].
            palLen = 1
            i += 1

    # We know from the loop invariant that len(l) < 2 * seqLen + 1, so
    # we must fill in the remaining values of l.

    # Obviously, the last palindrome we're looking at can't grow any
    # more.
    l.append(palLen)

    # Traverse backwards starting from the second-to-last index up
    # until we get l to size 2 * seqLen + 1. We can deduce from the
    # loop invariants we have enough elements.
    lLen = len(l)
    s = lLen - 2
    e = s - (2 * seqLen + 1 - lLen)
    for i in range(s, e, -1):
        # The d here uses the same formula as the d in the inner loop
        # above.  (Computes distance to left edge of the last
        # palindrome.)
        d = i - e - 1
        # We bound l[i] with min for the same reason as in the inner
        # loop above.
        l.append(min(d, l[i]))

    return l

And here is a naive quadratic version for comparison:

def naiveLongestPalindromes(seq):
    """
    Given a sequence seq, returns a list l such that l[2 * i + 1]
    holds the length of the longest palindrome centered at seq[i]
    (which must be odd), l[2 * i] holds the length of the longest
    palindrome centered between seq[i - 1] and seq[i] (which must be
    even), and l[2 * len(seq)] holds the length of the longest
    palindrome centered past the last element of seq (which must be 0,
    as is l[0]).

    The actual palindrome for l[i] is seq[s:(s + l[i])] where s is i
    // 2 - l[i] // 2. (// is integer division.)

    Example:
    naiveLongestPalindromes('ababa') -> [0, 1, 0, 3, 0, 5, 0, 3, 0, 1, 0]

    Runs in quadratic time.
    """
    seqLen = len(seq)
    lLen = 2 * seqLen + 1
    l = []

    for i in range(lLen):
        # If i is even (i.e., we're on a space), this will produce e
        # == s.  Otherwise, we're on an element and e == s + 1, as a
        # single letter is trivially a palindrome.
        s = i // 2
        e = s + i % 2

        # Loop invariant: seq[s:e] is a palindrome.
        while s > 0 and e < seqLen and seq[s - 1] == seq[e]:
            s -= 1
            e += 1

        l.append(e - s)

    return l
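Given a lengths array from either function, recovering the longest palindromic substring takes one more linear pass. The helper below is a sketch with a made-up name; it uses the index formula from the docstring above (s = i // 2 - l[i] // 2).

```python
def longest_palindrome_from_lengths(seq, lengths):
    """Return the longest palindromic substring of seq, given its
    lengths array (one entry per center, 2 * len(seq) + 1 in total)."""
    # Index of the first maximal entry.
    best_center = max(range(len(lengths)), key=lengths.__getitem__)
    best_len = lengths[best_center]
    start = best_center // 2 - best_len // 2
    return seq[start:start + best_len]


print(longest_palindrome_from_lengths(
    "ababa", [0, 1, 0, 3, 0, 5, 0, 3, 0, 1, 0]))  # ababa
```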

Note that this is not the only efficient solution to this problem. Building a suffix tree takes time linear in the length of the input string, and you can use one to solve this problem, but as Johan also mentions, that is a much less direct and less efficient solution than this one.