Friday, February 18, 2011

Python: Converting a matrix of raw scores to ranks.

There are instances where we have to convert a matrix of raw scores to ranking scores, for example in Statistics where we have to compute the Kruskal-Wallis statistics of a block of values. Although we have written a Kruskal Wallis routine previously, we shall write an independent function to return ranks based on raw scores.

A complication in computing ranks is that of processing ties. An approach is to replace any tied scores with the mean of all ranks. For example if 4 values v1,v2,v3,v4 have ranks 5,6,7,8, then each value should have an associated rank of (5+6+7+8)/4.

Here is an outline for converting raw scores to ranks:
1. Create a list L containing tuples of the form (v, r,c) for value, row, column for each element in the input raw scores matrix.
2. Sort the temporaty list L in ascending order.
3. Process the list L to fill in the output Ranks matrix,

Here is our python code for outputting ranks values, complete with an example from BerensonÅ› Basic Business Statistics, page 493.(We do have a local copy of the book) for testing the routine.

"""
File    matrix_scores_to_ranks.py
Author  Dr. Ernesto P. Adorio
        U.P. Clarkfield
        ernesto.adorio@gmail.com
"""


def matrix_scores_to_ranks(M, break_ties = 1, start_rank = 1, ZTOL = 1.0e-5):
    """
    Args
       M          - a list of lists of values.
       break_ties - 1 if ties need special processing, otherwise zero
       start_rank - 0 or 1
       ZTOL       - equality comparison of two real numbers.
    Return value
       A list of lists of ranks of the values. 
    """
    R=[]
    L=[]
    for i, row in enumerate(M):
        R.append([0] * len(row))
        for j, v in enumerate(row):
    L.append((v, i, j))
    L.sort()

    if break_ties != 1:
       for k, tup in enumerate(L):
          v, i, j = tup
   if break_ties != 1:   
     R[i][j] = k + start_rank
    else:
        prevv = L[0][0]    
        starti = 0 
        endi   = 0 
        for k, tup in enumerate(L):
            v, i, j = tup
            if abs(v- prevv) < ZTOL: # a tie?
               endi = k
            else:
               # print previous tied scores!
               rank = sum(range(starti, endi+1))/(endi-starti + 1.0) + start_rank

               for i in range(starti, endi+1):
                   ii = L[i][1]
                   jj = L[i][2]
                   R[ii][jj] = rank
               prevv = v
               starti = endi = k   
        # Final processing of any remaining ties.
        rank = sum(range(starti, endi+1))/(endi-starti + 1.0) + start_rank
        for i in range(starti, endi+1):
            ii = L[i][1]
            jj = L[i][2]
            R[ii][jj] = rank
    return R

#Berenson p. 493.

if __name__ == "__main__":
  M = [[ 18.5, 24.0, 17.2, 19.9, 18.0],
     [ 26.3, 25.3, 24.0, 21.2, 24.5],
     [20.6,  25.2, 20.8, 24.7, 22.9],
     [25.4,  19.9, 22.6, 17.5, 20.4]]

  R = scores_to_ranks(M, break_ties = 1, start_rank = 1)

  for row in R:
     print row


When the above program runs, it outputs and matches with the results in the textbook.
[4.0, 13.5, 1.0, 5.5, 3.0] [20.0, 18.0, 13.5, 10.0, 15.0] [8.0, 17.0, 9.0, 16.0, 12.0] [19.0, 5.5, 11.0, 2.0, 7.0]
The above is for a matrix. We will post array_scores_to_ranks() with the next article posting. Of course, the output must be transposed to read properly. start_rank may be set to any starting value, usually a 1, but there may be occasion when a starting zero rank may be appropriate. The above code is written for clarity. We may speed it up later, but there is no incentive to do so. We think most of the running time is spent on the sort, and Python sort routine is already fast enough! Feb. 18, 2011: I need to install the Syntax Highlighter addin immediately! Done. I refer a blogger user to Coding Freak There are other strategies in breaking ties. See Python, statistics: Ranking a single array of raw scores and we will update the above code someday.

No comments:

Post a Comment