PennNames Generate Algorithm
This document describes how PennNames are generated by the PennNames
Generate command.
The discussion of the Generate command algorithm uses the following terminology:
- PennCommunity-Fullname: This is the full name of the user as stored in PennCommunity,
  in the format LAST_NAME FIRST_NAME MIDDLE_NAME.
  Example: Smith, John H
- User-Supplied-Name: This is any name information supplied as an optional argument
  to the Generate command in the <FULLNAME> variable.
  This may be the full name of the user, or it may
  be something completely arbitrary.
- Seed: This is a string which the user prefers for his/her username.
  This information is supplied as an optional
  argument to the Generate command in the <SEED> variable, the string that
  is preceded by a colon when issuing the Generate command. Please note that the colon is not factored into the algorithm. The Seed has the highest weight.
- Name-Material: This is the combination of data from the PennCommunity-Fullname plus the Seed, or the User-Supplied-Name plus the Seed, and is used to generate a list of potential usernames. Any punctuation is stripped out so that "John H. Smith, Jr." becomes "John H Smith Jr".
  Example: bozo Jack Smith
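The punctuation stripping described for the Name-Material can be sketched in a few lines of Python. This is only an illustration, not the PennNames implementation: the function name `build_name_material` is hypothetical, and placing the Seed before the name is an assumption based on the "bozo Jack Smith" example.

```python
import string

def build_name_material(name, seed=None):
    """Join an optional seed with a name and strip punctuation.

    Hypothetical helper. Putting the seed first is an assumption
    taken from the "bozo Jack Smith" example.
    """
    parts = [seed] if seed else []
    parts.append(name)
    combined = " ".join(parts)
    # Remove every ASCII punctuation character, then collapse whitespace.
    stripped = combined.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

print(build_name_material("John H. Smith, Jr."))       # John H Smith Jr
print(build_name_material("Jack Smith", seed="bozo"))  # bozo Jack Smith
```

Note that this sketch also removes hyphens and apostrophes; whether the real service treats those characters the same way is not stated here.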
A good-looking username consists of a large portion of one part of a
user's name, either followed by or preceded by a small portion of another
part of his/her name.
For example, let's consider the user whose PennCommunity FIRST_NAME is "Ziggy", whose PennCommunity LAST_NAME is "Bozo", and who doesn't have a PennCommunity MIDDLE_NAME. It is likely that Ziggy would prefer the following names:
These should be considered the best possible names for this user.
In practice, some of those names will be unavailable since they might be
assigned to some other person on campus or reserved by another sponsor.
Likely secondary choices for Ziggy might be:
If the names ending with "2" are taken, we might
suggest the same names ending with 3, and so on.
Of course there are a lot of variables. There's no guarantee that we'll
have a middle initial. Even if we do, the user may prefer to go by their
second name, the user could have two middle names, the
PennCommunity information might be inaccurate, and so on. So a wide
variety of names should be generated to cover as many of these
possibilities as is practical.
Previous versions of the PennNames name generation algorithm were considered to
place too high a weight on the middle name, since the first name, last name and middle name were weighted equally. The current version of the algorithm tries to deduce which name is the middle name, and gives it a lower priority than other seed material.
Sources of the Names
The implemented algorithm has three sources of potential Name-Material. They are
- the PennCommunity-Fullname
- an optional User-Supplied-Name
- an optional Seed
Basic generation of the names
The basic premise is that all seed material falls into one of three categories:
high, medium or low weight. The PennCommunity-Fullname is examined, and middle
names are given a low weight. The optional Seed, if provided, is given a high weight. Titles and suffixes (e.g. Mrs., Jr., Sr.) are discarded. Everything else is given a medium weight.
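The weighting rules above might be sketched as follows. This is a simplified illustration under stated assumptions: the function `weigh_material` and the constants are hypothetical, the title and suffix list is a guess that extends the three examples given, and the name fields are assumed to arrive already separated (the real algorithm has to deduce which name is the middle name).

```python
HIGH, MEDIUM, LOW = 0, 1, 2

# Assumed list: the text names only Mrs., Jr. and Sr. as examples.
TITLES_AND_SUFFIXES = {"mr", "mrs", "ms", "dr", "jr", "sr"}

def weigh_material(last, first, middle="", seed=None):
    """Return (token, weight) pairs, highest priority first."""
    def clean(field):
        # Lowercase each word, trim trailing punctuation, drop titles/suffixes.
        words = (w.strip(".,").lower() for w in field.split())
        return [w for w in words if w and w not in TITLES_AND_SUFFIXES]

    weighted = [(w, HIGH) for w in clean(seed or "")]
    weighted += [(w, MEDIUM) for w in clean(last) + clean(first)]
    weighted += [(w, LOW) for w in clean(middle)]
    return weighted

print(weigh_material("Bozo", "Ziggy", "Quartermaine"))
# [('bozo', 1), ('ziggy', 1), ('quartermaine', 2)]
```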
The namestream is a series of "lazy enumeration" functions which generate more results on demand. These functions are stacked on top of each other such that they
generate results from the high priority material first; then the high and medium
priority material; then the high, medium and low material in conjunction. The lazy enumeration will only do as much work as it needs to in order to return the
target number of suggested names.
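Python generators are one way to picture this kind of stacked lazy enumeration. In the sketch below, `candidates` is a hypothetical placeholder that simply yields each piece unchanged, and `namestream` walks the three tiers in priority order; neither name comes from the PennNames code.

```python
from itertools import islice

def candidates(material):
    """Placeholder stage: yields each piece as-is (real mixing is richer)."""
    for piece in material:
        yield piece

def namestream(high, medium, low):
    """Yield names from high, then high+medium, then all material."""
    tiers = (high, high + medium, high + medium + low)
    seen = set()
    for tier in tiers:
        for name in candidates(tier):
            if name not in seen:  # skip names an earlier tier produced
                seen.add(name)
                yield name

stream = namestream(["zb"], ["bozo", "ziggy"], ["quartermaine"])
# Only two names are ever computed; the low tier is never touched.
print(list(islice(stream, 2)))  # ['zb', 'bozo']
```

Because the consumer pulls names one at a time, a request for a small number of names stops long before the low-priority material is consulted.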
There is also a time limit placed on the namestream generator. If the time expires then the result list will be cut short. This prevents a runaway server or malfunctioning service from denying access to the PennNames service.
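A time limit of this kind can be layered on top of any lazy stream. The wrapper below is a minimal sketch, assuming a simple wall-clock deadline; the name `time_limited` and its `clock` parameter are hypothetical and exist only to make the example testable.

```python
import time

def time_limited(stream, seconds, clock=time.monotonic):
    """Yield items from stream until the deadline passes.

    The deadline is set when the first item is requested. The check
    runs after each item is produced, so one slow item can still
    overrun the limit; that item is then dropped.
    """
    deadline = clock() + seconds
    for item in stream:
        if clock() >= deadline:
            return  # cut the result list short
        yield item

names = list(time_limited(iter(["bozo", "ziggy"]), seconds=5.0))
print(names)
```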
The mixing of names is done in stages:
- Each piece of material by itself
- Mixed pairs of material
- Mixed triplets of material
No more than three pieces of material are ever considered for a generated name.
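The staging (singles, then pairs, then triplets) can be illustrated with `itertools.permutations`. The mixing rule used here, keeping the first piece whole and reducing the others to initials, is an assumption made purely for illustration; the actual PennNames mixing rules are richer and this sketch does not reproduce the real output lists.

```python
from itertools import permutations

def mix(pieces):
    """Yield candidates from single pieces, then ordered pairs, then triplets."""
    for size in (1, 2, 3):  # never more than three pieces per name
        for combo in permutations(pieces, size):
            if size == 1:
                yield combo[0]
            else:
                # Assumed rule: whole first piece plus initials of the
                # rest, followed by the mirror image.
                head, rest = combo[0], combo[1:]
                initials = "".join(p[0] for p in rest)
                yield head + initials
                yield initials + head

print(list(mix(["bozo", "ziggy"])))
# ['bozo', 'ziggy', 'bozoz', 'zbozo', 'ziggyb', 'bziggy']
```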
In our simplest example where the PennCommunity FIRST_NAME is "Ziggy", the PennCommunity LAST_NAME is "Bozo", the PennCommunity MIDDLE_NAME is blank, and no User-Supplied-Name or Seed has been supplied, the Name-Material will be "Bozo Ziggy"
(note: last name first) and we will generate these names:
Those are the only 44 possible names generated from the given source material.
Extended generation of the names
As a final resort, the lazy enumerator will begin to append numbers if the material is completely used up. The names generated will look like this:
You'll notice that this starts with the first results from the initial generate,
above. The order in which the seeds themselves are consulted is guaranteed, but
the point at which they're used is not. In this case there are two uses of 'bozo', then one of 'ziggy'. This may not be true of all cases. In general, number appending is a mechanism of last resort, so one should not rely on the efficacy of the extension; if you get here, nothing that you want is available anyway.
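A simplified picture of the number-appending fallback is shown below. It is a sketch only: the function `with_numeric_suffixes` is hypothetical, it starts at 2 because the earlier discussion mentions names ending with "2" and then 3, and it cycles through the base names in a fixed round-robin order, which the real enumerator does not promise.

```python
from itertools import count, islice

def with_numeric_suffixes(base_names, start=2):
    """Yield the base names, then the same names with 2, 3, 4, ... appended."""
    base_names = list(base_names)  # material must be exhausted first
    yield from base_names
    if not base_names:
        return  # nothing to extend; avoid an endless empty loop
    for n in count(start):
        for name in base_names:
            yield f"{name}{n}"

print(list(islice(with_numeric_suffixes(["bozo", "ziggy"]), 6)))
# ['bozo', 'ziggy', 'bozo2', 'ziggy2', 'bozo3', 'ziggy3']
```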
As mentioned above, not every piece of Name-Material carries the same weight. For example, if we add a middle name ("Quartermaine") for Ziggy, we then have the
Name-Material "Bozo Ziggy Quartermaine" and generate these names:
Note that the middle name is not consulted until the other Name-Material is nearly exhausted (at position 41). This means that a typical generate request, which is usually for 15-25 names,
will not generally use the middle name at all.
Name validation and results
Since this algorithm uses a lazy iterator, the results are vetted immediately upon generation. Names which are not available are immediately discarded. This results in a significantly faster algorithm than the previous iteration; generating
1000 names takes approximately 8 seconds with the new algorithm, whereas
generating 100 names took over 30 seconds with the previous algorithm. This should
help avoid server deadlock under high load.
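Vetting names as they are generated fits naturally with a lazy stream: each candidate is checked and either kept or dropped before the next one is produced. The sketch below assumes a hypothetical `is_available` callback standing in for the real availability lookup, and the function name `available_names` is likewise invented for the example.

```python
from itertools import islice

def available_names(stream, is_available, limit):
    """Return up to `limit` names from stream that pass the availability check."""
    vetted = (name for name in stream if is_available(name))
    return list(islice(vetted, limit))

taken = {"bozo", "zbozo"}  # stand-in for names already assigned or reserved
stream = iter(["bozo", "ziggy", "bozoz", "zbozo", "ziggyb"])
print(available_names(stream, lambda n: n not in taken, 2))
# ['ziggy', 'bozoz']
print(next(stream))  # zbozo: the rest of the stream was never examined
```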
The list of results is returned to the PennNames client, up to the maximum
number of names that were requested. The iterator should never run
out of material (there is an endless supply of numbers), so in principle it can always produce the requested number of names. Even so, do not assume that asking for 25 names will return 25 names; it will return up to 25 names. Some cases wind
up returning only one name, or only the set of names that are already held for a given PennID. The only guarantee is that the server will not return more results than you requested.