2011-12-23

SICP Exercise 2.72: Efficiency of Huffman Encoding

Consider the encoding procedure that you designed in exercise 2.68. What is the order of growth in the number of steps needed to encode a symbol? Be sure to include the number of steps needed to search the symbol list at each node encountered. To answer this question in general is difficult. Consider the special case where the relative frequencies of the n symbols are as described in exercise 2.71, and give the order of growth (as a function of n) of the number of steps needed to encode the most frequent and least frequent symbols in the alphabet.

Here's the encoding procedure, encode-symbol, we designed in exercise 2.68:
(define (encode-symbol symbol tree)
  ;; Signal an error if the symbol isn't in the tree at all.
  (cond ((not (memq symbol (symbols tree)))
         (error "bad symbol -- ENCODE-SYMBOL" symbol))
        ;; Reached the symbol's leaf: no more bits to generate.
        ((leaf? tree) '())
        ;; Otherwise emit a 0 or 1 and recurse into the branch
        ;; whose symbol list contains the symbol.
        ((memq symbol (symbols (left-branch tree)))
         (cons 0 (encode-symbol symbol (left-branch tree))))
        ((memq symbol (symbols (right-branch tree)))
         (cons 1 (encode-symbol symbol (right-branch tree))))))
And to recap, the summary of exercise 2.71 was that, for an alphabet of n symbols where the relative frequencies of the symbols are 1, 2, 4, ..., 2^(n-1), a Huffman tree will encode the most frequent symbol using a 1-bit encoding, while the least frequent symbol will be encoded using n-1 bits. The reasoning for these bit lengths is that for such an alphabet generate-huffman-tree will produce a tree that is n levels deep in which each non-leaf node has a right child that is a leaf. The most frequent symbol will be in the right child of the root node (and so only one edge needs to be traversed to reach it), while the least frequent symbol will be at the nth level of the tree, in the left-most branch (and so n-1 edges need to be traversed to reach it).

As with previous exercises we will assume that list operations (such as list, cons, car, cadr and caddr) are all Θ(1) operations. Since left-branch and right-branch simply delegate to car and cadr respectively this means we can assume they're Θ(1) too. If we also assume that eq? is Θ(1) then this allows us to assume that leaf? is Θ(1) too... and if leaf? is Θ(1) then that means symbols will also be Θ(1), as it performs a fixed number of operations, all of which are Θ(1).
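For reference, these are the tree-representation procedures from section 2.3.4 of the book that those assumptions apply to; each performs a fixed number of constant-time list operations:
(define (make-leaf symbol weight) (list 'leaf symbol weight))
(define (leaf? object) (eq? (car object) 'leaf))
(define (symbol-leaf x) (cadr x))
(define (weight-leaf x) (caddr x))

(define (left-branch tree) (car tree))
(define (right-branch tree) (cadr tree))

(define (symbols tree)
  (if (leaf? tree)
      (list (symbol-leaf tree))  ; a leaf holds a single symbol
      (caddr tree)))             ; non-leaf nodes cache their symbol list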

Now, how about memq? Well, memq iterates through a list, comparing each element to the specified symbol. If it finds a match it returns the tail of the list starting at the first matching element; otherwise it returns #f. Iterating through a list of n elements takes Θ(n) steps, and we've already stated that we're assuming the individual list operations are Θ(1), so memq is an Θ(n) operation.
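memq is a Scheme primitive, but it behaves as if it were defined like this (essentially the definition given in section 2.3.1 of the book):
(define (memq item x)
  (cond ((null? x) #f)               ; exhausted the list: no match
        ((eq? item (car x)) x)       ; found it: return this tail
        (else (memq item (cdr x))))) ; keep scanning the rest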

Okay, we've now stated all our assumptions about the orders of growth of the operations used by encode-symbol, so we can go on to calculate the best case (when the symbol is the most frequent in the alphabet) and worst case (when the symbol is the least frequent in the alphabet) orders of growth.

Let's start with the best case: encoding the most frequent symbol. In this case the symbol's corresponding leaf node is the right child node of the root, and so the order of operations is:
  1. Start at the root of the tree.
  2. Ensure that symbol is in the list of symbols for the root - otherwise we'd throw an error. The check uses symbols and memq, and so this is Θ(n+1)=Θ(n).
  3. Check whether the root is a leaf node (which it's not) using leaf?, an Θ(1) operation.
  4. Check whether the symbol is under the left branch of the tree (which it's not). This uses left-branch, symbols and memq, and so this is Θ(1+n+1)=Θ(n).
  5. Check whether the symbol is under the right branch of the tree (which it is). This uses right-branch, symbols and memq, and so is also Θ(1+n+1)=Θ(n).
  6. cons the value 1 onto the result of invoking encode-symbol on the right branch of the tree. cons is an Θ(1) operation. The recursive call works out at Θ(n+1+1)=Θ(n), using the following two steps:
    1. Ensure that symbol is in the list of symbols for the node (Θ(n+1)).
    2. Check whether the node is a leaf node, which it is, so return the empty list (Θ(1)).
Put that all together and, for the most frequent symbol in the alphabet, you get an order of growth of Θ(4n+2)=Θ(n).
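We can't easily count individual steps at the REPL, but we can at least confirm the shape of the computation using the n=5 tree from exercise 2.71 (n5-tree here is just a local name for it, and generate-huffman-tree is from exercise 2.69): encoding the most frequent symbol makes a single recursive call and produces one bit.
> (define n5-tree (generate-huffman-tree '((A 1) (B 2) (C 4) (D 8) (E 16))))
> (encode-symbol 'E n5-tree)
'(1)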

Now let's move on to the worst case: encoding the least frequent symbol. In this case the symbol's corresponding leaf node is the left-most leaf at the lowest level of the tree. That means that at each non-leaf node we'll need to perform the following steps:
  1. Ensure that symbol is in the list of symbols for the root (Θ(n)).
  2. Check whether the root is a leaf node, which it's not (Θ(1)).
  3. Check whether the symbol is under the left branch of the tree, which it is (Θ(n)).
  4. cons the value 0 onto the result of invoking encode-symbol on the left branch of the tree (Θ(1), plus the cost of the recursive call).
So, if we ignore the recursive call, for any non-leaf node we process the order of growth is Θ(2n+2)=Θ(n). Starting at the root there'll be n-1 non-leaf nodes we'll need to process before we reach the leaf node corresponding to the symbol we're looking for, so the order of growth of processing all of the non-leaf nodes will be Θ(n(n-1))=Θ(n^2-n)=Θ(n^2).

We just need to incorporate the order of growth of processing a leaf node into this. We already calculated this as Θ(n) in step 6 for the best case, so the total order of growth for encoding the least frequent symbol is Θ(n^2+n)=Θ(n^2).
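Again we can confirm the shape of this at the REPL, using the n5-tree defined above: encoding the least frequent symbol recurses through all n-1=4 non-leaf nodes down the left spine of the tree.
> (encode-symbol 'A n5-tree)
'(0 0 0 0)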

So, to summarize, for an alphabet of n symbols where the relative frequencies of the symbols are 1, 2, 4, ..., 2^(n-1), a Huffman tree will encode:
  • The most frequent symbol with an order of growth of Θ(n)
  • The least frequent symbol with an order of growth of Θ(n^2)

SICP Exercise 2.71: Huffman With Power-of-2 Frequencies

Suppose we have a Huffman tree for an alphabet of n symbols, and that the relative frequencies of the symbols are 1, 2, 4, ..., 2^(n-1). Sketch the tree for n=5; for n=10. In such a tree (for general n) how many bits are required to encode the most frequent symbol? the least frequent symbol?

Before we sketch the trees out, let's check what the interpreter gives us:
> (generate-huffman-tree '((A 1) (B 2) (C 4) (D 8) (E 16)))
'(((((leaf A 1) (leaf B 2) (A B) 3)
    (leaf C 4)
    (A B C)
    7)
   (leaf D 8)
   (A B C D)
   15)
  (leaf E 16)
  (A B C D E)
  31)
> (generate-huffman-tree '((A 1) (B 2) (C 4) (D 8) (E 16)
                           (F 32) (G 64) (H 128) (I 256) (J 512)))
'((((((((((leaf A 1) (leaf B 2) (A B) 3)
         (leaf C 4)
         (A B C)
         7)
        (leaf D 8)
        (A B C D)
        15)
       (leaf E 16)
       (A B C D E)
       31)
      (leaf F 32)
      (A B C D E F)
      63)
     (leaf G 64)
     (A B C D E F G)
     127)
    (leaf H 128)
    (A B C D E F G H)
    255)
   (leaf I 256)
   (A B C D E F G H I)
   511)
  (leaf J 512)
  (A B C D E F G H I J)
  1023)
You'll note that in the resulting tree each non-leaf node has a leaf node as its right child. When you sketch them out you get the following for n=5:

[sketch of the n=5 Huffman tree: a left-leaning spine with leaves A through E]

...and the following for n=10:

[sketch of the n=10 Huffman tree: the same shape, with leaves A through J]
You'll note that:
  • When n=5, the tree is 5 levels deep. When we use a Huffman tree for encoding we generate a bit for each edge we traverse in the tree (not for each node we visit), so the longest path from the root to a leaf node, which is for the least frequent symbol, will generate a 4-bit encoding.
  • When n=10, the tree is 10 levels deep, so the longest path from the root to a leaf node will generate a 9-bit encoding.
  • In both trees the shortest path from the root to a leaf node, which is for the most frequent symbol, traverses a single edge and so will generate a 1-bit encoding.
In general we can therefore state that, for an alphabet of n symbols where the relative frequencies of the symbols are 1, 2, 4, ..., 2^(n-1), a Huffman tree will encode the most frequent symbol using a 1-bit encoding, while the least frequent symbol will be encoded using n-1 bits.
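If you'd rather not take the sketches' word for it, encode-symbol from exercise 2.68 confirms the bit counts (n10-tree here is just a name for the n=10 tree generated above):
> (define n10-tree
    (generate-huffman-tree '((A 1) (B 2) (C 4) (D 8) (E 16)
                             (F 32) (G 64) (H 128) (I 256) (J 512))))
> (length (encode-symbol 'J n10-tree))
1
> (length (encode-symbol 'A n10-tree))
9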

SICP Exercise 2.70: Yip a Yip Wah Boom!

The following eight-symbol alphabet with associated relative frequencies was designed to efficiently encode the lyrics of 1950s rock songs. (Note that the "symbols" of an "alphabet" need not be individual letters.)

A     2     NA   16
BOOM  1     SHA   3
GET   2     YIP   9
JOB   2     WAH   1

Use generate-huffman-tree (exercise 2.69) to generate a corresponding Huffman tree, and use encode (exercise 2.68) to encode the following message:

Get a job
Sha na na na na na na na na
Get a job
Sha na na na na na na na na
Wah yip yip yip yip yip yip yip yip yip
Sha boom

How many bits are required for the encoding? What is the smallest number of bits that would be needed to encode this song if we used a fixed-length code for the eight-symbol alphabet?

Okay, first let's build our Huffman tree:
> (define fifties-song-tree
    (generate-huffman-tree
      '((A 2) (BOOM 1) (GET 2) (JOB 2) (NA 16) (SHA 3) (YIP 9) (WAH 1))))
> fifties-song-tree
'((leaf NA 16)
  ((leaf YIP 9)
   (((leaf A 2) ((leaf WAH 1) (leaf BOOM 1) (WAH BOOM) 2) (A WAH BOOM) 4)
    ((leaf SHA 3) ((leaf JOB 2) (leaf GET 2) (JOB GET) 4) (SHA JOB GET) 7)
    (A WAH BOOM SHA JOB GET)
    11)
   (YIP A WAH BOOM SHA JOB GET)
   20)
  (NA YIP A WAH BOOM SHA JOB GET)
  36)
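Before encoding the whole song it's worth reading the individual codes off this tree; encode-symbol from exercise 2.68 gives us one code per symbol, and, as you'd expect, the frequent symbols get the short codes:
> (encode-symbol 'NA fifties-song-tree)
'(0)
> (encode-symbol 'YIP fifties-song-tree)
'(1 0)
> (encode-symbol 'A fifties-song-tree)
'(1 1 0 0)
> (encode-symbol 'SHA fifties-song-tree)
'(1 1 1 0)
> (encode-symbol 'WAH fifties-song-tree)
'(1 1 0 1 0)
> (encode-symbol 'BOOM fifties-song-tree)
'(1 1 0 1 1)
> (encode-symbol 'JOB fifties-song-tree)
'(1 1 1 1 0)
> (encode-symbol 'GET fifties-song-tree)
'(1 1 1 1 1)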
And now let's encode our song:
> (define encoded-song (encode '(GET A JOB
                                 SHA NA NA NA NA NA NA NA NA
                                 GET A JOB
                                 SHA NA NA NA NA NA NA NA NA
                                 WAH YIP YIP YIP YIP YIP YIP YIP YIP YIP 
                                 SHA BOOM)
                               fifties-song-tree))
> encoded-song
'(1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1
  1 0 0 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0 1 0 1
  0 1 0 1 0 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1)
> (length encoded-song)
84
So we need 84 bits to encode the song using our Huffman tree.
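That tallies with the per-symbol codes we read off the tree above: NA occurs 16 times at 1 bit each (16 bits), YIP 9 times at 2 bits (18), SHA 3 times at 4 bits (12), A twice at 4 bits (8), GET and JOB twice each at 5 bits (20), and WAH and BOOM once each at 5 bits (10), for a total of 16+18+12+8+20+10=84 bits.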

How about if we use a fixed-length code? Well, we have eight symbols to encode so, using a fixed-length code, we could encode the message using 3 bits per symbol (as 2^3=8). As there are a total of 36 symbols to encode, that means a fixed-length code would need 36×3=108 bits.
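As a quick sanity check of that arithmetic at the REPL (song here is just a name for the same message list we encoded above):
> (define song '(GET A JOB
                 SHA NA NA NA NA NA NA NA NA
                 GET A JOB
                 SHA NA NA NA NA NA NA NA NA
                 WAH YIP YIP YIP YIP YIP YIP YIP YIP YIP
                 SHA BOOM))
> (length song)
36
> (* (length song) 3)
108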

So using Huffman encoding has saved us 24 bits - a 22% reduction in the size of the message.