2011-12-18

SICP Exercise 2.62: Unioning Sets with Ordered Representations

Give a Θ(n) implementation of union-set for sets represented as ordered lists.

Just like in exercise 2.59, the implementation of union-set is similar in many respects to intersection-set. We'll iterate through the two list representations in parallel, comparing the head items of each list at each iteration. However, unlike intersection-set, which only keeps an item if it is present in the heads of both lists, we want to keep all of the items from both lists - but we need to remove duplicates and ensure the resulting set is ordered.

So what are the cases we're going to deal with? Well:
  • If set1 is null then the union is just set2.
  • Conversely, if set2 is null then the union is just set1.
  • If the head items of both lists are the same then we can produce the union by consing that item onto the result of generating the union of the cdrs of both lists. As with intersection-set it doesn't really matter whether we use the head if set1 or set2 for the cons, so we'll arbitrarily use the head of set1.
  • If the head item of set1 precedes the head item of set2 then we can produce the union by consing the head item of set1 onto the result of generating the union of the cdr of set1 and set2.
  • Conversely, if the head item of set2 precedes the head item of set1 then we can produce the union by consing the head item of set2 onto the result of generating the union of set1 and the cdr of set2.
Here's the procedure:
(define (union-set set1 set2)
  (cond ((null? set1) set2)
        ((null? set2) set1)
        (else (let ((x1 (car set1))
                    (x2 (car set2)))
                (cond ((= x1 x2)
                       (cons x1
                             (union-set (cdr set1)
                                        (cdr set2))))
                      ((< x1 x2)
                       (cons x1
                             (union-set (cdr set1)
                                        set2)))
                      ((< x2 x1)
                       (cons x2
                             (union-set set1
                                        (cdr set2)))))))))
Let's see it in action:
> (union-set '() '())
'()
> (union-set '(1 2 3) '())
'(1 2 3)
> (union-set '() '(4 5 6))
'(4 5 6)
> (union-set '(1 2 3) '(4 5 6))
'(1 2 3 4 5 6)
> (union-set '(4 5 6) '(1 2 3))
'(1 2 3 4 5 6)
> (union-set '(1 3 5) '(2 4 6))
'(1 2 3 4 5 6)
> (union-set '(2 4 6) '(1 3 5))
'(1 2 3 4 5 6)
> (union-set '(1 2 3) '(2 3 4))
'(1 2 3 4)

SICP Exercise 2.61: Adjoining to Orderered Representations

Give an implementation of adjoin-set using the ordered representation. By analogy with element-of-set? show how to take advantage of the ordering to produce a procedure that requires on the average about half as many steps as with the unordered representation.

The implementation of element-of-set? given for using the ordered representation iterates through the list until it either finds the item being checked or until it finds an element greater than the item being checked. Our implementation of adjoin-set has to produce a representation that maintains the ordering and can do so using a similar approach. We're going to iterate through the list until we find the appropriate location for the new item to be inserted, and we still have the same four cases to consider. However, as we need to build up a result list as we go along, the actions taken for each case differ:
  • If we're adjoining to a null list, then the result is simply a list containing only the item being adjoined.
  • If the head of the list is the same as the item being adjoined then the item is already present in the list and so we don't need to modify the list and can just use the current list as the result.
  • If the head of the list is greater than the item being adjoined then the item needs to go at the head of the list, and so we need to cons the item onto the list.
  • Otherwise (i.e. if the head of the list is less then the item being adjoined) then the item needs to be inserted later in the list. We can do this by consing the head of the list onto the result of adjoining the item to the cdr of the list.
An appropriate implementation of the procedure is therefore:
(define (adjoin-set x set)
  (cond ((null? set) (list x))
        ((= x (car set)) set)
        (( < x (car set)) (cons x set))
        (else (cons (car set)
                    (adjoin-set x (cdr set))))))
Here it is in action:
> (adjoin-set 4 '())
'(4)
> (adjoin-set 3 '(4))
'(3 4)
> (adjoin-set 10 '(3 4))
'(3 4 10)
> (adjoin-set 6 '(3 4 10))
'(3 4 6 10)
> (adjoin-set 7 '(3 4 6 10))
'(3 4 6 7 10)
> (adjoin-set 5 '(3 4 6 7 10))
'(3 4 5 6 7 10)
Similar to the element-of-set? implementation, assuming that the items we adjoin are of many different sizes and are randomly ordered, then we'll sometimes be able to insert items near (and so stop iterating near) the start of the list, while for other items we'll have to insert them later in the list, and so on average we'll need Θ(n/2) steps to adjoin items.

It's worth noting that the ordering of the items processed is important for adjoin-set in making this assertion, whereas it wasn't for element-of-set?. For example, if you take any set and then check each of it's members is present in the set using element-of-set? then this will require, on average, n/2 recursive calls to element-of-set? to perform the check, regardless of the ordering we use. However, if we start with an empty list and try to build a set by inserting items in numerically ascending order (i.e. (adjoin-set 3 (adjoin-set 2 (adjoin-set 1 '())))) then each adjoin will require n recursive calls to adjoin-set - the first will require 1 call, the second will require 2 calls, and so on, as for each new item it will need to iterate to the end of the current list to find the appropriate location for the item. On the other hand, if we build the set by inserting items in numerically descending order (i.e. (adjoin-set 1 (adjoin-set 2 (adjoin-set 3 '())))) then each adjoin will only require 1 call to adjoin-set, as each item will go at the head of the current list.

2011-12-16

SICP Exercise 2.60: Duplicate Sets

We specified that a set would be represented as a list with no duplicates. Now suppose we allow duplicates. For instance, the set {1,2,3} could be represented as the list (2 3 2 1 3 2 2). Design procedures element-of-set?, adjoin-set, union-set, and intersection-set that operate on this representation. How does the efficiency of each compare with the corresponding procedure for the non-duplicate representation? Are there applications for which you would use this representation in preference to the non-duplicate one?

We're still using a list as our representation here. We've just removed the restriction on duplicates. This means that those set operations which create a new set by taking an existing set and potentially adding new items to it no longer need to check if those items are present before adding them. This means adjoin-set and union-set no longer need to perform those checks. We'll look at those set operations in detail in a minute, but let's first look at the other two set operations: element-of-set and intersection-set.

The implementation of element-of-set given in the book simply iterates through the list representation of the set and returns true as soon as it encounters a matching element, or false if there is no such matching element. Allowing duplicates in the list representation doesn't change this approach. We still need to scan through the list to see if the element is present. It's just that, where duplicates exist, we may compare against the same value multiple times. So element-of-set can be used as is - it may run slower though, as the presence of duplicates means that the lists we're scanning may be longer than their equivalents when duplicates are not allowed.

We can see that the same holds true for intersection-set. The implementation in the book iterates through the list representation of set1 and, for each element, adds it to the result set iff it's also present in set2. If there are duplicates in either set1 or set2 this doesn't really change this approach. We still need to examine each item in set1 and only include it in the result set if it's also in set2, so we can reuse the existing implementation of intersection-set. Note that the result set could contain duplicates: any element that appears in both sets will appear as many times as it is present in set1. As with element-of-set this means that it may also run slower and may generate larger sets in comparison with using equivalent sets where duplicates are not allowed.

Now onto adjoin-set... In the "no duplicates" implementation provided in the book, adjoin-set checks to see whether the element is already in the representation before appending it. As we're allowed duplicates in our representation we don't need this test - we can just cons it onto the head. This gives us a very simple implementation:
(define (adjoin-set x set)
  (cons x set))
...or, to put it more succinctly...
(define adjoin-set cons)
As this simply puts the item on the head of the list this makes it an Θ(1) operation, and so is much more efficient than the "no duplicates" implementation (which is Θ(n)). Note that if we wanted to be slightly smart about it, still allow duplicates generally, and still retain the Θ(1) efficiency, we could simply check the head of the list to ensure that we're not putting another identical element onto the head of the list:
(define (adjoin-set x set)
  (if (or (null? set) (not (equal? x (car set))))
      (cons x set)
      set))
However, given that the example representation of the set {1,2,3} ((2 3 2 1 3 2 2)) includes two '2' values adjacent to each other, this isn't a necessary optimization.

The last operation, union-set needs to produce the set of all values in both set1 and set2. As we no longer need to worry about duplicates we can simply append the two sets together to produce the result we need:
(define (union-set set1 set2)
  (append set1 set2))
...or, more succinctly...
(define union-set append)
Okay, so let's build some sets:
> (define evens
    (adjoin-set 0 (adjoin-set 2 (adjoin-set 4 (adjoin-set 6 (adjoin-set 8 '()))))))
> (define odds
    (adjoin-set 1 (adjoin-set 3 (adjoin-set 5 (adjoin-set 7 (adjoin-set 9 '()))))))
> evens
'(0 2 4 6 8)
> odds
'(1 3 5 7 9)
> (adjoin-set 2 evens)
'(2 0 2 4 6 8)
> (adjoin-set 2 odds)
'(2 1 3 5 7 9)
> (intersection-set evens odds)
'()
> (intersection-set evens evens)
'(0 2 4 6 8)
> (union-set evens odds)
'(0 2 4 6 8 1 3 5 7 9)
> (union-set evens evens)
'(0 2 4 6 8 0 2 4 6 8)
Let's compare efficiencies of implementations:

OperationNo DuplicatesAllow Duplicates
element-of-set?Θ(n)Θ(n)
adjoin-setΘ(n)Θ(1)
intersection-setΘ(n2)Θ(n2)
union-setΘ(n2)Θ(n)

Of course we need to be slightly careful in the comparison here... While element-of-set? is Θ(n) regardless of whether or not we allow duplicates, the n here is the number of elements in the list representation of the set, not the number of distinct elements in the set. As a result, n could be much larger in the "allow duplicates" case than in the "no duplicates" case, and so the operation could be much slower. A similar issue arises for intersection-set too (except we're dealing with Θ(n2) here, so the effect can be much more exacerbated).

We also need to be aware of this issue when comparing the two union-set efficiencies. While the "allow duplicates" case is definitely more efficient (at Θ(n) as opposed to the no duplicates efficiency of Θ(n2)), the value of n in the "allow duplicates" case could potentially be much higher. For example, consider the set {1, 2, 3}. In the "no duplicates" case the size of the underlying list we have to process (and so the n we are dealing with) will always be 3. However, in the "allow duplicates" case all we know for sure is that it will be at least 3 - but (theoretically) there's no upper bound on the size of the underlying list we have to process, so with certain list representations we may find that the Θ(n) operation in the "allow duplicates" case may still perform worse than the Θ(n2) operation in the "no duplicates case.

With adjoin-set there is no such issue. The "allow duplicates" case will take constant time regardless of the size of the underlying representation (or at least we're assuming that this is how cons works). As a result it doesn't matter that there may be duplicates in the set - this has no effect on the efficiency of the operation, and so the "allow duplicates" case will generally be quicker than the "no duplicates" case.

So when would we use this representation in preference to the non-duplicate one? Well, in applications where we're going to use adjoin-set much more frequently than any of the other operations (and where memory is not a concern) it may be preferable to use the implementation that allows duplicates.