Consider the encoding procedure that you designed in exercise 2.68. What is the order of growth in the number of steps needed to encode a symbol? Be sure to include the number of steps needed to search the symbol list at each node encountered. To answer this question in general is difficult. Consider the special case where the relative frequencies of the n symbols are as described in exercise 2.71, and give the order of growth (as a function of n) of the number of steps needed to encode the most frequent and least frequent symbols in the alphabet.
Here's the encoding procedure, `encode-symbol`, that we designed in exercise 2.68:

```scheme
(define (encode-symbol symbol tree)
  (cond ((not (memq symbol (symbols tree)))
         (error "bad symbol -- ENCODE-SYMBOL" symbol))
        ((leaf? tree) '())
        ((memq symbol (symbols (left-branch tree)))
         (cons 0 (encode-symbol symbol (left-branch tree))))
        ((memq symbol (symbols (right-branch tree)))
         (cons 1 (encode-symbol symbol (right-branch tree))))))
```

And to recap, the summary of exercise 2.71 was that, for an alphabet of n symbols where the relative frequencies of the symbols are 1, 2, 4, ..., 2^(n-1), a Huffman tree will encode the most frequent symbol using a 1-bit encoding, while the least frequent symbol will be encoded using n-1 bits. The reasoning for these bit lengths is that for such an alphabet
`generate-huffman-tree` will produce a tree that is n levels deep in which each non-leaf node has a right child that is a leaf. The most frequent symbol will be in the right child of the root node (and so only one edge needs to be traversed to reach it), while the least frequent symbol will be at the nth level of the tree, in the left-most branch (and so n-1 edges need to be traversed to reach it).

As with previous exercises we will assume that list operations (such as
`list`, `cons`, `car`, `cadr` and `caddr`) are all Θ(1) operations. Since `left-branch` and `right-branch` simply delegate to `car` and `cadr` respectively, we can assume they're Θ(1) too. If we also assume that `eq?` is Θ(1), then `leaf?` is Θ(1) as well; and if `leaf?` is Θ(1), then `symbols` will also be Θ(1), as it performs a fixed number of operations, all of which are Θ(1).

Now, how about `memq`? Well, `memq` iterates through a list and compares each element to the specified symbol. If it finds a match it returns the tail of the list starting at the first matching element; otherwise it returns `#f`. Iterating through a list of n elements is Θ(n), and we've already stated that we're assuming the underlying list operations are Θ(1), so this makes `memq` a Θ(n) operation.

Okay, now we've stated all our assumptions about the orders of growth of the operations used by `encode-symbol`. Now we can calculate the best-case order of growth (when the symbol is the most frequent in the alphabet) and the worst-case order of growth (when the symbol is the least frequent in the alphabet).

Let's start with the best case: encoding the most frequent symbol. In this case the symbol's corresponding leaf node is the right child of the root, and so the sequence of operations is:
1. Start at the root of the tree.
2. Ensure that `symbol` is in the list of symbols for the root (otherwise we'd throw an error). The check uses `symbols` and `memq`, and so is Θ(n+1) = Θ(n).
3. Check whether the root is a leaf node (which it's not) using `leaf?`, a Θ(1) operation.
4. Check whether the symbol is under the left branch of the tree (which it's not). This uses `left-branch`, `symbols` and `memq`, and so is Θ(1+n+1) = Θ(n).
5. Check whether the symbol is under the right branch of the tree (which it is). This uses `right-branch`, `symbols` and `memq`, and so is also Θ(1+n+1) = Θ(n).
6. `cons` the value `1` onto the result of invoking `encode-symbol` on the right branch of the tree. `cons` is a Θ(1) operation, and the recursive call works out at Θ(n+1+1) = Θ(n), as it consists of the following two steps:
   - Ensure that `symbol` is in the list of symbols for the node (Θ(n+1)).
   - Check whether the node is a leaf node, which it is, so return the empty list (Θ(1)).

Summing these steps gives a total of Θ(n) for encoding the most frequent symbol.
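We can tally these charged costs concretely. The following Python sketch (the function name and cost model are hypothetical, not from SICP) adds up the scans performed in steps 2-6 for the exercise 2.71 tree, charging each `memq` a full scan of the list it searches:

```python
def best_case_steps(n):
    """Charged scans to encode the most frequent symbol in the
    exercise 2.71 tree of n symbols (each memq charged a full scan)."""
    # root symbol-list scan (n) + left-branch check (n - 1)
    # + right-branch check (1) + membership check at the leaf (1)
    return n + (n - 1) + 1 + 1

for n in [4, 8, 16, 32]:
    assert best_case_steps(n) == 2 * n + 1  # grows linearly: Θ(n)
```

The constant factors depend on the assumptions above, but the linear growth does not.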
Now let's move on to the worst case: encoding the least frequent symbol. In this case the symbol's corresponding leaf node is the left-most child node at the lowest level of the tree. That means that at each non-leaf node we'll need to perform the following steps:

1. Ensure that `symbol` is in the list of symbols for the node (Θ(n)).
2. Check whether the node is a leaf node, which it's not (Θ(1)).
3. Check whether the symbol is under the left branch of the tree, which it is (Θ(n)).
4. `cons` the value `0` onto the result of invoking `encode-symbol` on the left branch of the tree (Θ(1), plus the cost of the recursive call).
Each of the n-1 non-leaf nodes on the path to the leaf therefore costs Θ(n), giving Θ(n²) for the descent. We just need to incorporate the order of growth of processing the leaf node itself. We already calculated this as Θ(n) in step 6 of the best case, so the total order of growth for encoding the least frequent symbol is Θ(n² + n) = Θ(n²).
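Strictly speaking, the symbol lists shrink as we descend (the node at the level holding k symbols has a list of length k), but the total is quadratic either way. A quick Python check of this accounting (a hypothetical cost model, not the Scheme code itself), charging each membership test a full scan of its shrinking list:

```python
def worst_case_steps(n):
    """Charged scans to encode the least frequent symbol in the
    exercise 2.71 tree of n symbols: at the level holding k symbols we
    scan the node's own list (k) and its left child's list (k - 1),
    finishing with a single-element scan at the leaf."""
    return sum(k + (k - 1) for k in range(n, 1, -1)) + 1

for n in [4, 8, 16, 32]:
    assert worst_case_steps(n) == n * n  # exactly n² under this model
```

So the shrinking lists change only the constant factor, not the Θ(n²) growth.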
So, to summarize, for an alphabet of n symbols where the relative frequencies of the symbols are 1, 2, 4, ..., 2^(n-1), a Huffman tree will encode:

- The most frequent symbol with an order of growth of Θ(n)
- The least frequent symbol with an order of growth of Θ(n²)
Comment: I don't understand why the order of growth of processing each node is Θ(n). In my understanding, the length of the list of symbols decreases by 1 each time we go to the next node. Hence, for the least frequent symbol, the order of growth should be Θ(1·2·...·n) = Θ(n!)...
Reply: We are not doing a Θ(n-1) process Θ(n) times every time we go to the next node. We are doing a Θ(n-1) process when we go to the next node, after doing a Θ(n) process before. The order of growth is therefore Θ(1 + 2 + 3 + ... + (n-1) + n) = Θ(n(n+1)/2) = Θ(n²).
Comment: I think encode-symbol could be a bit faster if you put the error message in an else clause instead of making it the first clause of the cond. Since encode-symbol checks whether the symbol is in the left/right branches of the tree, we know it is not in the tree if it fails both of those predicates. This saves us from checking that the symbol is in a subtree every single time we enter one, since entering a subtree already guarantees the symbol is there.
Reply: With an encode-symbol like this, finding the most frequent symbol is really quick, since we just look at the left-branch of the tree and find it there immediately.
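For illustration, the commenter's reordering might look like this in a Python model (hypothetical helper names; a Scheme version would reorder the cond clauses the same way, with the error in the else clause):

```python
def make_leaf(sym):
    return {'leaf': True, 'symbols': [sym]}

def make_node(left, right):
    return {'leaf': False, 'left': left, 'right': right,
            'symbols': left['symbols'] + right['symbols']}

def encode_symbol_reordered(sym, tree):
    # A leaf is only reached after a parent confirmed membership, so no
    # check is needed here (except for a degenerate one-leaf tree).
    if tree['leaf']:
        return []
    if sym in tree['left']['symbols']:
        return [0] + encode_symbol_reordered(sym, tree['left'])
    if sym in tree['right']['symbols']:
        return [1] + encode_symbol_reordered(sym, tree['right'])
    # Failing both branch checks means the symbol isn't in the tree at all.
    raise ValueError('bad symbol -- ENCODE-SYMBOL: ' + sym)

tree = make_node(make_node(make_leaf('a'), make_leaf('b')), make_leaf('c'))
assert encode_symbol_reordered('c', tree) == [1]
assert encode_symbol_reordered('a', tree) == [0, 0]
```

Note that this saves only a constant factor per node (one scan instead of two); the orders of growth derived above are unchanged.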