2011-12-23

SICP Exercise 2.70: Yip a Yip Wah Boom!

The following eight-symbol alphabet with associated relative frequencies was designed to efficiently encode the lyrics of 1950s rock songs. (Note that the ``symbols'' of an ``alphabet'' need not be individual letters.)

A2NA16
BOOM1SHA3
GET2YIP9
JOB2WAH1

Use generate-huffman-tree (exercise 2.69) to generate a corresponding Huffman tree, and use encode (exercise 2.68) to encode the following message:

Get a job
Sha na na na na na na na na
Get a job
Sha na na na na na na na na
Wah yip yip yip yip yip yip yip yip yip
Sha boom

How many bits are required for the encoding? What is the smallest number of bits that would be needed to encode this song if we used a fixed-length code for the eight-symbol alphabet?

Okay, first let's build our Huffman tree:
> (define fifties-song-tree
    (generate-huffman-tree
      '((A 2) (BOOM 1) (GET 2) (JOB 2) (NA 16) (SHA 3) (YIP 9) (WAH 1))))
> fifties-song-tree
'((leaf NA 16)
  ((leaf YIP 9)
   (((leaf A 2) ((leaf WAH 1) (leaf BOOM 1) (WAH BOOM) 2) (A WAH BOOM) 4)
   ((leaf SHA 3) ((leaf JOB 2) (leaf GET 2) (JOB GET) 4) (SHA JOB GET) 7)
   (A WAH BOOM SHA JOB GET)
   11)
  (YIP A WAH BOOM SHA JOB GET)
  20)
 (NA YIP A WAH BOOM SHA JOB GET)
 36)
And now let's encode our song:
> (define encoded-song (encode '(GET A JOB
                                 SHA NA NA NA NA NA NA NA NA
                                 GET A JOB
                                 SHA NA NA NA NA NA NA NA NA
                                 WAH YIP YIP YIP YIP YIP YIP YIP YIP YIP 
                                 SHA BOOM)
                               fifties-song-tree))
> encoded-song
'(1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1
  1 0 0 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0 1 0 1
  0 1 0 1 0 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1)
> (length encoded-song)
84
So we need 84 bits to encode the song using our Huffman tree.

How about if we use a fixed-length code? Well, we have eight symbols to encode so, using a fixed-length code, we could encode the message using 3 bits per symbol (as 23=8). As there are a total of 36 symbols to encode, that means we could encode this using a fixed-length code using 36×3=108 bits.

So using Huffman encoding has saved us 24 bits - a 22% reduction in the size of the message.

2011-12-22

SICP Exercise 2.69: Generating Huffman Trees

The following procedure takes as its argument a list of symbol-frequency pairs (where no symbol appears in more than one pair) and generates a Huffman encoding tree according to the Huffman algorithm.
(define (generate-huffman-tree pairs)
  (successive-merge (make-leaf-set pairs)))
Make-leaf-set is the procedure given above that transforms the list of pairs into an ordered set of leaves. Successive-merge is the procedure you must write, using make-code-tree to successively merge the smallest-weight elements of the set until there is only one element left, which is the desired Huffman tree. (This procedure is slightly tricky, but not really complicated. If you find yourself designing a complex procedure, then you are almost certainly doing something wrong. You can take significant advantage of the fact that we are using an ordered set representation.)

The procedure generate-huffman-tree calls successive-merge with the results of (make-leaf-set pairs). As noted above, this is an ordered set of leaves. To be more precise, it is a list of leaves, ordered by increasing frequency of occurrence. This means that, initially at least, we can find the smallest-weight elements of the set at the start of the set. If this were an invariant of the procedure then this would mean that we could always find the smallest-weight elements of the set at the start of the list. We can do this by defining successive-merge as a recursive procedure that, at each call, picks the first two nodes off the list, merges them, inserts the result of the merge into the appropriate place in the remainder of the list (to maintain the invariant) and then makes the recursive call with this result.

So provided we can perform two actions (merge a pair of elements and insert a node into the appropriate place in the list) then we will be able to do this... and thankfully the authors provide us with a couple of procedures that will let us do this. make-code-tree takes two nodes and produces a new node with an appropriate symbols list and weight and with those nodes as its children. adjoin-set takes a node and inserts it into the appropriate position in a list of nodes that are ordered by increasing weight.

All we need to do now is to provide a base case and we're done. Or, as we'll actually do, a couple of base cases. The first base case is when we have completed merging all of the nodes in a non-empty list of nodes. In this situation we have a list with one node in it, and we can simply return that node as the result. The second base case is an edge case - we need to deal with what happens when pairs is an empty list. In that case make-leaf-set also returns an empty list. To keep it consistent we'll do the same.

Okay, so we've got all the components we need, we know what we need them to do, so let's put them together:
(define (successive-merge set)
  (cond ((null? set) '())
        ((null? (cdr set)) (car set))
        (else (successive-merge
               (adjoin-set (make-code-tree (car set) (cadr set))
                           (cddr set))))))
To test this, let's use the frequencies that were used to generate sample-tree for exercise 2.67, and compare the result against sample-tree:
> (generate-huffman-tree '((A 4) (B 2) (C 1) (D 1)))
'((leaf A 4) ((leaf B 2) ((leaf D 1) (leaf C 1) (D C) 2) (B D C) 4) (A B D C) 8)
> sample-tree
'((leaf A 4) ((leaf B 2) ((leaf D 1) (leaf C 1) (D C) 2) (B D C) 4) (A B D C) 8)

SICP Exercise 2.68: Encoding Symbols

The encode procedure takes as arguments a message and a tree and produces the list of bits that gives the encoded message.
(define (encode message tree)
  (if (null? message)
      '()
      (append (encode-symbol (car message) tree)
              (encode (cdr message) tree))))
Encode-symbol is a procedure, which you must write, that returns the list of bits that encodes a given symbol according to a given tree. You should design encode-symbol so that it signals an error if the symbol is not in the tree at all. Test your procedure by encoding the result you obtained in exercise 2.67 with the sample tree and seeing whether it is the same as the original sample message.

First things first, let's deal with the error case. We're given a procedure, symbols, that will give us a list of all the symbols in a particular node of the tree. We can use this to test whether a particular node contains a symbol or not by using it in conjunction with memq which was introduced way back in section 2.3.1. memq will return false if the symbol isn't in the list, and a non-empty list otherwise, so we can use this as a predicate (since non-false values are treated as true).

With the error case out of the way, let's consider how we can generate our list of bits. We have a binary tree, so we can recurse down this to find the leaf node corresponding to the symbol we're trying to encode. We know (from the error case) that we can test whether or not our symbol is under the left or right child nodes via symbols and memq, and we know that we want to put a 0 onto the list for each time we move to a left child and a 1 onto the list for each time we move to a right child. So all we need to do is determine which branch to take at each (non-leaf) node and then cons the appropriate binary digit onto the list returned by recursively calling encode-symbol for that branch. Oh, and we'll also need a base case... Once we reach a leaf node we have no further children to navigate, so we can simply return the empty list.

This gives us the following implementation:
(define (encode-symbol symbol tree)
  (cond ((not (memq symbol (symbols tree)))
         (error "bad symbol -- ENCODE-SYMBOL" symbol))
        ((leaf? tree) '())
        ((memq symbol (symbols (left-branch tree)))
         (cons 0 (encode-symbol symbol (left-branch tree))))
        ((memq symbol (symbols (right-branch tree)))
         (cons 1 (encode-symbol symbol (right-branch tree))))))
Let's see it in action:
> (encode '(A D A B B C A) sample-tree)
'(0 1 1 0 0 1 0 1 0 1 1 1 0)