anagram dictionary

neil epstein and justin hall
december 18, 1997
computer science 21
professor charles kelemen

'gram dict

assigned to write an anagram dictionary, neil and justin began our problem solving with an exchange of e-mails, roughly, abstractly outlining our thoughts and initial, conceptual approaches to the problem.
justin proposed an initial task delegation:

Date: Fri, 21 Nov 1997 11:56:45 -0500 (EST) From: justin To: Neil Epstein Subject: assignment notes i talked to kelemen today, got this much on the assignment - we build the program to read a database file loosely based on the one in the e-mail. he might eventually give us a much larger database to read from. we do not, however, have to worry about user input. or making sensical anagrams. so the program, neil, would seem to me to be: (read in database into an array of arrays? or an array of lists? he sez there are four ways, and for our paper we need to compare two, but we only need to use one) filter each entry for its ASCII string, rank that string from lowest to highest ASCII. compare that ASCII string against all the entries to place entries in an array corresponding to that ASCII string. (unless they will be on the right line in the file?) then print appropriately. i'm talking about this now neil because i won't be back from thanksgiving break until thursday december 4. that's just a week before it's due, and i'd rather not cram. can we start passing code back and forth now? i can take up the basic program structure, struct types, input from file, maybe printing, and give that to you, you do the ASCII sorting. whaddya think? we should probably agree on a basic structure first - maybe an array of arrays? i don't know about this "list" stuff - maybe if you did, you could tell me. okay, justin

the idea of passing initial code structure back and forth was efficient-seeming but impractical - we needed to meet to talk, especially granted neil's high-falutin' concept(s) for the program:

Date: Mon, 24 Nov 1997 14:30:18 -0500 (EST) From: Neil Epstein To: justin Subject: Re: assignment notes Dear Justin Okay. Seems like there should be a header file, call it "anadict.h" or something, which will contain #defines, of course, as well as the definitions of the structs and the prototypes for the functions that manipulate them. Those functions should include adding an entry to the dictionary, adding an anagram to an entry [remember, an entry is some sort of bag of words all anagrammatic to each other], comparing keys to see which one should come first, and a spilling-of-guts function for dictionary entries (i.e. shamelessly bare all to a string or stdout!). I feel better when the honest-to-goodness print functions are prototyped in a different .h file. (like "anaprint.h", I guess) Anyway, let's implement all the functions prototyped in each .h file in a .c file with the same prefix. Sound good? Just organizational principles, but if you object, do so outwardly. Um, as far as arrays, lists, and so on (which we'll have to deal with in "anadict.c" if we go along with my naming scheme) here's how I see things: As Kelemen showed in class, when you've got some sort of dictionary structure (meaning a string for a key and something else in the belly), you can either actually order the elements of the dictionary, moving them out of the way when new ones come along so that it's always alphabetized as is. Or, alternatively, you can just make an auxillary pointer array to deal with it. In the latter case, the actual order of the dictionary elements in memory doesn't change. Instead, the elements of the pointer array are ordered so that each successive pointer points to the _alphabetically_ next element of the dictionary. I'm assuming this is what Kelemen means by a list. I'm not sure that's a good explanation of the '"list" stuff' that you wanted explained. Here's an analogy: A bunch of burly manual laborers come by and plunk down heavy boxes in a line against a wall, and all of the boxes are uniquely labeled. It would be nice if they were in alphabetical order. Unfortunately, the laborers refuse to move any box already plunked. So the boxes just end up in the order that the laborers found them. Now, I want to know important things like "how many boxes have labels orthographically between 'alimony' and 'wife'?", and "Where's the box labeled 'aardvark'?" and I want them answered fast when I ask them, which may be sometime in the future. But nobody except these recalcitrant laborers are willing (or able) to move these boxes. Anyway, so I hire a bunch of people called "pointers". They each have exactly one, very long, finger, which they can point. So I stand them in line, and each of them is to point at a box, such that the first pointer points at the orthographically first label, the second pointer at the one with the second orth. label, etc. The way this works is, when a laborer plunks down a new box, the pointers wrangle among themselves to figure out once again who should point at what box. The boxes don't move, but it's almost as if they do, because if I want the box with the orth. fourth label, I can just follow the finger of the fourth person in my line of pointers. This takes more time than just pointing at the fourth box on the belt, but if moving the actual boxes to put them in label-order so that I can do that sort of thing is prohibitive in this setup, so it's better if I just point to someone and follow their pointing finger. I hope my analogy is clear, even if a bit outlandish. Anyway, we should meet sometime tomorrow afternoon for an hour or so, if that's good for you. I've got most of the afternoon free (except for work, of course). I'll have lots more time after Thanksgiving, ironically, because I've got a major rough draft of a thesis due this Wednesday. However, I'll be back at Swarthmore before November is out.. The division of labor you suggested sounds okay, though we should work truly-together on some of it. Anyway, if you're gonna give me basic structure, I think I prefer lists to arrays on this one. Let me know when's good. Your room's fine if you want. -Neil
neil's analogy, while illustrative, evoked computer science beyond justin's immediate expertise. justin understood the value of neil's abstraction, but it put the creation of the data types beyond his abilities, in the short term.
the dialog continued after the thanksgiving break, when the two of us united in the sun lab and set about encoding the data types and the functions together.
we initially decided a header file with function interfaces would be appropriate. but after some time splitting function declarations and the code that made them into separate files, we elected to build a single file containing both the functions and split them later if we had the time, once we had the program working.

we sketched out a program strategy:
first check inputs to see if they match an existing key
user input becomes anadict->element[0]->anagram[0]
ascii value ordered version of that becomes

anadict->element[0]->key
put everything in, then merge sort -
"i don't want to move boxes, i want to move fingers." - neil
we define types, neil wants to use a triple star on dictelement->anagram, but justin talks him down one layer of complexity for the initial coding process. that additional pointer structure is later added within the anadictT struct, to make **order.
we initially define types (with one less layer of abstraction within the dictelement form):
typedef struct {
  char *key;
  char **anagram;
  int size;
} *dictelementT;

typedef struct {
  dictelementT *element;
  dictelementT **order;
  int size;
} *anadictT;
neil defines "new" types as means of abbreviating clearing out.
ie NEW_DICT -

we divided up initial tasks - justin printf stuff, neil file input stuff.
the latter because neil decides that we should use getc and not scanf, defining each individual whitespace exception instead of using the whitespace exception built into scanf ( read up to %s) which justin had used for his movie club database.
neil likes the getc approach because scanf sucks. he understands how it works better. scanf better suited to line by line input from files.
we agree to instertion sorting letters into keys because the sorting string won't be longer than 12 characters, and short sort strings process insertion sort quickly.
the getc interface debugging takes a good chunk of time.

seg faults and bus errors

which we solved with neil's hand job;
he turns out to have a knack, or a penchance for hand simulation, and moreso for pedagogy, illustrating the twists and turns of three level pointers.
which is how we sort out a problem we have with this line of code:

dict->element[size]->anagram[0] = NEW_ARRAY(strlen(ana)+1, char);

is the trouble line.
we've chosen something hard, neil sez "if it works, it'll be elegant as hell"
hand simulation reveals we might need to allocate space for anagram separately. neil built a function for it but didn't use it. the whole file prints out and it's victory - another computer science high.
merge sort for the dictionary sorting. but how to do it on a single array? split it? hand simulating. neil reintroduces his triple star notion. adjusting the dictionary typedef to include a pointer to an array of pointers to the ordered dictelements.
neil adds many functions that we later remove. we struggle with another error until neil discovers he's neglecting to call his dictionary initialization fuction.
besides hand simulation, a lot of the work (debugging) was scaffolding.

static void merge(dictelementT **array, dictelementT **arr1, int size1,
		               dictelementT **arr2, int size2)
{
  int place, place1, place2;
  void *tmp;

  printf("merge: sizes are %d and %d.\n", size1, size2);
  place = place1 = place2 = 0;
  while (place1 < size1 && place2 < size2) {
    printf("(*arr1[%d]->key is %s.\n", place1, (*arr1[place1]->key);
    printf("comparing (*arr1[%d]->key (%s) with (*arr2[%d]->key (%s)\n",
	   place1, (*arr1[place1]->key, place2, (*arr2[place2]->key);
    if (strcmp((*arr1[place1]->key, (*arr2[place2]->key) < 0) {
      printf("(*arr1[%d]->key, %s, comes first.\n",
	     place1, (*arr1[place1]->key);
      *array[place] = *arr1[place1];
      printf("Accordingly, we set (*array[%d]->key equal to %s.\n",
	     place, (*array[place]->key);
      place++;
      place1++;
    } else {
      printf("Either (*arr2[%d]->key, %s, comes first, or they are equal.\n",
	     place2, (*arr2[place2]->key);
      *array[place] = *arr2[place2];
      printf("Accordingly, we set (*array[%d]->key equal to %s.\n",
	     place, (*array[place]->key);
      place++;
      place2++;
    }
  }
  while(place1 < size1) {
    printf("Now set (*array[%d]->key equal to (*arr1[%d]->key: %s.\n",
	   place, place1, (*arr1[place1]->key);
    *array[place] = *arr1[place1];
    printf("(*array[%d]->key is now %s.\n",
	   place, (*array[place]->key);
    place++;
    place1++;
  }
  while (place2 < size2) {
    printf("Now set (*array[%d]->key equal to (*arr2[%d]->key: %s.\n",
	   place, place2, (*arr2[place2]->key);
    *array[place] = *arr2[place2];
    printf("(*array[%d]->key is now %s.\n",
	   place, (*array[place]->key);
    place++;
    place2++;
  }
}

The reason for all this scaffolding was that Neil had come up against a conundrum. The program worked, (and by that we mean that it _ran_), but the results were screwy:

       full anagram dictionary listing, 7 keyed entries. 
 
         KEY :  'GRAMS 
 
ABINRY       : CRAPES       ESCARP       PACERS       CAPERS       SCRAPE      
               SPACER       RECAPS      
ABINRY       : INTEGRAL     RELATING     ALTERING     TRIANGLE     ALERTING    
ABINRY       : SYZYGY      
ABINRY       : ALGORITHM    LOGARITHM   
ABINRY       : MARTLESSNESS TRAMLESSNESS
ABINRY       : BINARY       BRAINY      
STY          : STY

After all that scaffolding, Neil figured out that what Neil needed to revise in the code was, instead of doing a strcpy of (*array[place])->key and so forth, Neil needed to just set the pointers themselves equal. Otherwise, Neil was changing dictelementT's internally, rather than switching pointers to them, which is why it wasn't working.
we calculated the amount of space our dictionary and elements would take with one layer less abstraction, and compared the two:

byte count on one layer less abstraction:
granted 20 anagrams read, resulting in 7 keys.

	anadict: 290,004 bytes size 4 dictelementT order (4 bytes x 500 max elements)* dictelementT element (500 max elements x 576 size of elements)
		dictelement: 576 int size 4 bytes anagram 520* (10 max anagrams x (4 bytes x 13 characters)) key 52 (4 bytes x 13 characters)

byte count of anadict.c::
granted 20 anagrams read, resulting in 7 keys.

	anadict: <1068 bytes size 4 bytes dictelementT order 112 bytes (4 bytes + (4 bytes x 7 keys))** dictelementT element <952 bytes* *(4 bytes + (136 bytes 7 keys))**
		element: 136 bytes int size 4 bytes anagram: approx. 100 bytes (4 bytes + (4 bytes * ((avg number of anagrams: 20/7, approx 3) * (avg word length + 1 + 1: 8))))** key 32 bytes* *(4 bytes + (4 bytes (avg word length + 1, approx 7)))**

we win!

In general, if l is the average word length, n the number of anagrams, and k the number of keys, then the total amount of space involved, with our implementation, is 4(3+k(5+l)+(n(2+l))) bytes. With the other implementaion, the amount of space involved is fixed at 290,004 bytes. Assuming an average of 2 anagrams per key, and an average word length of 6, our figure boils down to 12(1+9k) bytes. So as long as we have 2685 keys or less with these assumptions, our implementation takes less space than the given one. In fact, since in both, we can't take more than 500 keys, ours is always better. For instance, even if we use 500 keys in our implementation, assuming an average of two anagrams per key and an average word length of 6, only 54012 bytes are used. And even if we assume an average word length of, say, 9, and 4 anagrams per key, with 500 keys (all of which is highly unlikely, but let's take a gander at it), the formula churns out, for this worst-case scenario: 116,012 bytes used: only 40% of the space used by the other approach.

According to some people in the Physics and Astronomy departments, time should be seen as a dimension of space. Therefore, since our implementation takes less space than a naive, non-pointeristic approach, it must take less time on some plane.

coursework | swat | justin's links