Saturday, February 26, 2011

Python: a high level word count program

Introduction


Python lets the beginning programmer use powerful built-in routines, freeing the programmer from implementing his own (usually error-prone) versions. Here we use Python dictionaries and the convenient string-splitting methods to solve the word counting problem, which is non-trivial in the sense that it does more than print the message "Hello World" to the screen.

The Python code


"""
file     wordcount.py
author   Ernesto P. Adorio
         U.P.Clarkfield
version  0.0.1
desc     counts characters, words, lines
         lines are set off by the newline character, "\n"
         words are whitespace-separated tokens; quotes are not stripped.
"""

def wc(textfilename):
    D = {}
    nc = 0 # number of characters.
    nw = 0 # number of words.
    nl = 0 # number of lines.
    for line in open(textfilename, "r").readlines():
        # print(line)  # uncomment for debugging.
        nc += len(line)
        words = line.split()
        for word in words:
            if word in D:
                D[word] += 1
            else:
                D[word] = 1
            nw += 1
        nl += 1
    return nc, nw, nl


if __name__ == "__main__":
    # print(...) with a single tuple argument behaves the same in Python 2 and 3.
    print(wc("wordcount.py"))

Notice the code's compactness. In a lower-level language such as C, the code would be far longer and more complicated. The essence of the code is:



Read in all lines and store them in a list.

Process each line:
Add the length of the line to the number of characters.
Split the line into words, which are tokens separated by whitespace.
Add the number of words in the line to the total number of words.
Increment the number of lines.
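The steps above can be traced on a small in-memory string. The sketch below is only an illustration: count_text is a hypothetical helper, not part of wordcount.py, and it swaps the hand-written dictionary update for collections.Counter from the standard library. It also returns the per-word tally, which wc() builds in D but does not return. It assumes Python 3.

```python
from collections import Counter


def count_text(text):
    """Return (characters, words, lines, per-word tally) for a string.

    Hypothetical helper for illustration; not part of wordcount.py.
    """
    tally = Counter()
    nc = nw = nl = 0
    # keepends=True keeps each "\n", so line lengths include the newline,
    # just as they do for lines read from a file.
    for line in text.splitlines(keepends=True):
        nc += len(line)          # add length of line to character count
        words = line.split()     # tokens separated by whitespace
        tally.update(words)      # same effect as the D[word] += 1 logic
        nw += len(words)         # add words in this line to the total
        nl += 1                  # one more line seen
    return nc, nw, nl, tally


sample = "the cat\nthe hat\n"
nc, nw, nl, tally = count_text(sample)
print((nc, nw, nl))           # (16, 4, 2)
print(tally.most_common(1))   # [('the', 2)]
```

Each sample line is 8 characters long including its newline, which gives 16 characters, 4 words and 2 lines.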



When the program is fed its own source code, it prints the number of characters, words and lines as


$ python wordcount.py
(762, 104, 36)


As a check, the Unix wc utility (wc wordcount.py) reports the same counts in reverse order:


36 104 762



Issues?


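There may be instances when the number of lines is off by one relative to wc. This happens if the last line does not contain a terminating line feed character: wc counts line feed characters, while the code above still counts the unterminated last line.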
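A minimal sketch of the off-by-one case follows. The two function names are hypothetical and chosen only for this illustration: newline_count mimics how wc arrives at its line count, and loop_count mimics what a line-by-line loop such as the one in wc() sees.

```python
def newline_count(text):
    # wc-style line count: the number of "\n" characters.
    return text.count("\n")


def loop_count(text):
    # Loop-style line count: one per line read, terminated or not.
    return len(text.splitlines(keepends=True))


terminated = "one\ntwo\n"
unterminated = "one\ntwo"

print((newline_count(terminated), loop_count(terminated)))      # (2, 2)
print((newline_count(unterminated), loop_count(unterminated)))  # (1, 2)
```

The two counts agree when the file ends with a newline and differ by one when it does not.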
Another issue is that the code may be fed an extremely large file. Since readlines() loads every line at once, the program may run out of memory if the working memory is small. This problem is easy to fix, and we will revise the above initial working code later.
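One possible revision, offered here only as a sketch and not as the author's eventual fix, is to iterate over the file object itself, so that only one line is held in memory at a time. The name wc_stream is hypothetical. The sketch also leaves out the word dictionary, which would otherwise grow with the number of distinct words. A single enormous line would still have to fit in memory.

```python
import os
import tempfile


def wc_stream(textfilename):
    """Count characters, words and lines one line at a time.

    Hypothetical revision for illustration; not the author's code.
    """
    nc = nw = nl = 0
    # Iterating over the file object yields one line per step, and the
    # with block closes the file when the loop ends.
    with open(textfilename, "r") as f:
        for line in f:
            nc += len(line)
            nw += len(line.split())
            nl += 1
    return nc, nw, nl


# Small self-check on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha beta\ngamma\n")
print(wc_stream(tmp.name))   # (17, 3, 2)
os.remove(tmp.name)
```

The return value has the same (characters, words, lines) order as wc(), so the function can be dropped into the same main block.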
