Friday, August 24, 2012

Finding all locations of substring in a string.


There are a lot of methods for fining all locations of a substring substr in a string s, and if we ever find ourselves teaching again Python, I would first emphasize the simplest algorithm to do so.

We repeatedly use the basic string object find function. If s is a string and substr is a string pattern (no wild cards!) then s.find(substr) will -1 if substr is not found in s, otherwise it will return the starting index of substr in s.

Here is our function for findind all occurence of substr in a string:


"""
def findall(s, substr):
   """
   Find all locations of substr in s.
   """
   out = []
   oldpos = 0
   n = len(substr)
   while (1):
     pos = s[oldpos:].find(substr)
     if pos >=0:
        out.append(oldpos + pos)
        oldpos += (pos+n+1) #remove +n if you want to find occurences starting within substr also!
     else:
        break
   return out
 

if __name__ == "__main__":
   s   = 'the quick brown fox jumps over the lazy dog near the bank of the river.'
   sub = 'the'
   x=  findall(s, sub)
   print x
   for v in x:
      print s[v:v+len(sub)]

The string find function can actually perform a search using a sprcified starting and ending indices. Thus s[oldpos:].find(substr) may be rewritten as s.find(substr, oldpos). We ask the reader to experiment by modifying the above code.

Help on built-in function find:

find(...)
    S.find(sub [,start [,end]]) -> int
    
    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.
    
    Return -1 on failure.


Other methods may be searched in Google. Here are some links to get you started in learning alternative and complicated approaches: