GenWord - Tutorial

This is a tutorial for GenWord, a JavaScript-based vocabulary generator for conlangs, based on Mark Rosenfelder's Gen. It's designed to randomly create words based on the settings you give it. There are several defaults built in to give you an idea of how it can perform: if you haven't done so already, try choosing one from the drop-down menu towards the bottom of the screen and clicking the Load Predef button.

Before you start, you ought to come up with a general idea of what sounds your language will have, and how you will choose to write them out. This will give you a foundation to work from. For the sake of this tutorial, we will work with a fictional language that sounds a lot like English.

We'll take some of the sections out of order.

Categories are where you define what letters are in your words by assigning them, well, categories. The most basic categories would be consonants and vowels. Open GenWord in another tab and hit the Clear Input button, then put the following in the categories box:

Categories:

From there, you can define what syllables look like. By default, the box marked Use only one type of syllables should be checked. If it's not, go ahead and check it. It keeps things simple. Now, put this in the Syllables box:

Syllables:

Now go ahead and click the Generate button and see what happens. You'll get something much like this:

Woaw rowot ette retay yorray piwo. Re orray o tipwoaw we? Hutre aar erkayrar layew tarirew. Totyitra woylu la wayot. Ra weruwar row worote. Wowawxat walraugow tawatjajet lewitror arwo yiw ertar. Topi powawotaa wojtetdew arhalriyye ow to. Jatru ewtawa rorlowaplepku ora yortuwpa otwar. Kekap we oaygorul gihirrara. Awwew ra iwertor ow wowwo wuwtuytit? Awfet yatew er wootwetewey taw roy. Pujrow wawlalpoyawro owlarawtijtuw parurookar ekol. Wiwot upoad row yarat urir irrowfoywaw? Watyowway ge yaw pasowwewwiw wu.

Wow, that's a lot of Ws, isn't it? That's because the generator looks at your categories and uses the first letter in a category more often than the next letter, and so on down the line. We can affect this by choosing a new Dropoff setting (Equiprobable eliminates it entirely!), but let's try to avoid that, because natural languages use some sounds more than others. Try changing your categories to this:

Categories:

That's a more normal distribution of letters in English. Hit Generate again and you'll get something like this:

Ehhat et ehoh itsodo undehtamna tennelis nudi sottottidot een tantot. Datfun son tomdohto tutno at tine. Tot hetas notaef utetserfussam essa salut. Rotetosa gen sutres rora eno netteb. Nenansuhlate teen neto nusen ettit leshaaen. Das tatone nesesir nani natut solot? Natewus tit nateiit etew mene usnered onah. Tusos tetse he dosnet rotwate? Tudoto ialat woh ho te het. Otet natri e tiol seso. Seiese nistiutitten atana ritutesna er soddutommetton. Tema turentedos otheta marnihtete tandathen nele. Tunto ettaso saryiwnat tesana wanmoon et orahet.

That still looks funky, especially those Hs at the beginning. Turns out, the H doesn't appear by itself in English all that often. Usually, it gets paired with a T, S or C to form th, sh and ch. How could we produce those combinations with the generator when it only outputs one letter at a time?

The answer lies in the Rewrite Rules box. It's used to reconfigure the output in a variety of ways. Each line is a separate rule in the form X||Y, which tells the generator to replace all instances of X with Y.

First off, let's change up the Categories a bit. We can use capital letters to represent the th, sh and ch pairs.

Categories:

Now, try putting these rules in:

Rewrite rules:

Didja notice I added an extra rule to change "q" to "qu"? English never has a bare Q, except with a few words borrowed from other languages. Go ahead and hit Generate and see what happens:

Tot e lemedaernet nuoth rushe iretvet en nem dashsontasre. Tossas ite rensen et let tethdaes? Oroa ruriteter neror tin. Sa tototeshesh erone tannetron addat rasisninruro. Dechoson noon ath ni neo ani. Unnetlen rosautos wetetoen atela nosut rutsha. Setus na shetersin otnodtid os. Tenere ostotte utdattotnat tod retna tete? Nando? Tosishshas enotlatu atiqulen laut nan. Sottonronse osen. Sanninusepit rutut tetas tedrin usut te. Tudetenu ewi ded rutanton teushtir lathmiutnetut ta. Terran midte detatte afshe su oshsonmarit? Nin sasnesh eniryit susi. Ro sattedi larsha san tettas nusat?

A few words show where our rules have affected the output: nuoth, rushe, Dechoson and atiqulen, for example!
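If you'd like to see the X||Y idea spelled out, here's a minimal sketch of how rules like these could be applied. It is not GenWord's actual source: the function name applyRewriteRules is made up, and it assumes each X is treated as a JavaScript regular expression that replaces every match, with rules running from top to bottom.

```javascript
// Hypothetical helper, not GenWord's real code: parse "X||Y" lines and
// apply each rule to the text in order, top to bottom.
function applyRewriteRules(text, rulesText) {
  let result = text;
  for (const line of rulesText.split("\n")) {
    const sep = line.indexOf("||");
    if (sep === -1) continue; // skip blank or malformed lines
    const pattern = line.slice(0, sep);
    const replacement = line.slice(sep + 2);
    if (pattern === "") continue;
    // Assumption: X is a regular expression and every match is replaced ("g").
    result = result.replace(new RegExp(pattern, "g"), replacement);
  }
  return result;
}

// Capital letters stand in for the th, sh and ch pairs, plus the q rule.
const rules = "T||th\nS||sh\nC||ch\nq||qu";
console.log(applyRewriteRules("TiSoCaq", rules)); // thishochaqu
```

The made-up input TiSoCaq is what a word might look like before rewriting; every capital placeholder and the bare q get expanded on the way out.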

There are other facets of English we can incorporate into our generator if we start playing with the other syllable boxes. Uncheck that Use only one type of syllables box and a few things change. The Syllables box is now named Word-initial syllables (and will now be used to generate the first syllable of multisyllabic words), and three new boxes have appeared. Let's add some syllable types to them.

Single-word syllables:

This one's pretty straightforward. It's only used when the generator wants to make a single-syllable word, like "I", "you", or "or". I changed the orders around from the Word-initial syllables because the syllable types get chosen in a way similar to the letters in a category: the ones on top are picked more than the others. You can smooth this out by checking the Slow syllable dropoff box.
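To get a feel for what "the ones on top are picked more than the others" means, here's a rough sketch of one way dropoff could work. The formula is an assumption for illustration (each item gets a weight of keep to the power of its position); GenWord's real numbers may well differ, and the names dropoffWeights and pickWithDropoff are made up.

```javascript
// Hypothetical model of dropoff: item i gets weight keep^i, so earlier
// items are more likely. An assumed formula, not GenWord's own.
function dropoffWeights(count, keep) {
  const weights = [];
  for (let i = 0; i < count; i++) weights.push(Math.pow(keep, i));
  const total = weights.reduce((a, b) => a + b, 0);
  return weights.map((w) => w / total); // normalize so they sum to 1
}

// Pick one item using those probabilities. `rand` can be swapped out
// for a fixed function to make the choice repeatable.
function pickWithDropoff(items, keep, rand = Math.random) {
  const probs = dropoffWeights(items.length, keep);
  let r = rand();
  for (let i = 0; i < items.length; i++) {
    r -= probs[i];
    if (r < 0) return items[i];
  }
  return items[items.length - 1]; // guard against rounding error
}

const syllableTypes = ["CV", "V", "CVC"]; // illustrative order only
console.log(dropoffWeights(3, 0.5)); // steep: the first type dominates
console.log(dropoffWeights(3, 1));   // keep = 1: every type equally likely
console.log(pickWithDropoff(syllableTypes, 0.5));
```

A keep value close to 1 flattens the curve, which is the spirit of a slower dropoff; a value of exactly 1 behaves like an equiprobable setting. The same picker works on a string of letters from a category.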

Mid-word syllables:

This one is only used when putting syllables in the middle of a word, like the "la" in "syllables". I've dropped the CVC because it tends not to happen, and because it makes ugly syllables in the middle of words.

Word-final syllables:

This box is only used for the final syllable in words, like "nal" in "final" or "ble" in "syllable".
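Here's a tiny sketch of how the four boxes divide up the work, based only on the descriptions above. The function syllableBoxFor is a made-up name, and this is a guess at the logic rather than GenWord's own code.

```javascript
// Hypothetical helper: which box supplies the syllable at position
// `index` (0-based) in a word that has `count` syllables.
function syllableBoxFor(index, count) {
  if (count === 1) return "Single-word syllables";
  if (index === 0) return "Word-initial syllables";
  if (index === count - 1) return "Word-final syllables";
  return "Mid-word syllables";
}

// "syllables" split as syl-la-bles: three syllables.
console.log([0, 1, 2].map((i) => syllableBoxFor(i, 3)));
// A one-syllable word like "or" only ever uses the single-word box.
console.log(syllableBoxFor(0, 1));
```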

Now click Generate and see what happens.

Tenut sedthena sinnee terartad nisetmenem. Moergano tunenis res metoet ni tati. Ta thulsalonean ses inetore otdi po quit? Onar o rattinutotel utro rar lidtaer. Eshshe noner lem nefsos nota to. Nadta urratan soshuta ash ton tanonas. Terra ne nennat tolnion setsu dutomat? Nonane testusoni re radoan tetlase. Im shaetagem ini quit setef shedni. Itreedsu af eutsoensa nurate netquo ete?

Looking better, but there's one glaring non-English cluster in there: the shsh in Eshshe. We can fix that with a new rewrite rule. But be careful: the rules are used in order from top to bottom. Look at this:

Rewrite rules:

In that one, the second rule would never get used, because the first rule would change any "SS" to "shsh" beforehand.

Rewrite rules:

That will work correctly. But we can do even better!

Rewrite rules:

Either of those rules by themselves will do the job of the two rules we used before. They make use of regular expressions, which are special. I can't do a full tutorial on them (Google "javascript regular expression patterns" if you're interested), but I'll briefly explain the two I just used.

SS? means S, optionally followed by another S. The "?" makes the item before it optional. It will match "S" and "SS", but it will only match the first two Ss in "SSS". That's ok, because our syllable rules never allow three consonants in a row.

S+ means one or more Ss in a row. The "+" means "match the item before me as many times as possible, but at least once". So it can match "S", "SS", "SSS", all the way up to "SSSSSSSSSSSSSSSS" and beyond!
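You can try the ordering problem and both regular expressions outside of GenWord. The sketch below uses a made-up applyRules helper and assumes each rule is a global JavaScript regular expression; the rules themselves are my reading of the description above (S becoming sh), not a copy of the rule boxes.

```javascript
// Hypothetical rule runner: each rule is [pattern, replacement], applied
// top to bottom as a global JavaScript regular expression.
function applyRules(text, rules) {
  return rules.reduce(
    (out, [pattern, replacement]) =>
      out.replace(new RegExp(pattern, "g"), replacement),
    text
  );
}

const word = "eSSe"; // a word before rewriting, with S standing for "sh"

// Wrong order: S is rewritten first, so the SS rule never finds anything.
console.log(applyRules(word, [["S", "sh"], ["SS", "sh"]])); // eshshe
// Right order: the longer pattern goes first.
console.log(applyRules(word, [["SS", "sh"], ["S", "sh"]])); // eshe
// One regular expression does the work of both rules.
console.log(applyRules(word, [["SS?", "sh"]])); // eshe
console.log(applyRules(word, [["S+", "sh"]]));  // eshe

// The two differ on three in a row: SS? matches "SS" and then "S".
console.log(applyRules("eSSSe", [["SS?", "sh"]])); // eshshe
console.log(applyRules("eSSSe", [["S+", "sh"]]));  // eshe
```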

But let's go back to our last generated output. It looks a bit English-y, but not really. It's more like Latin or Italian or something. That's because we're not using the full inventory of English. In particular, English is pretty odd in that it allows tons of consonant clusters. Look at this word: strengths. That's three consonant sounds, a vowel, and three more consonant sounds! That's crazy!

We can solve this problem by adding more categories and syllable types. S and TH are voiceless fricatives, T is a voiceless stop, R is an approximant or liquid, and NG is a nasal stop. (Those are technical terms, and you'll have to start getting used to them if you're going to make languages, so try not to be afraid and just go with me here, all right?)

I went ahead and changed up the following boxes to try and make use of these distinctions. I tried to make the category names have some relation to their contents. See if you can follow along:

Categories:

Rewrite rules:

Word-initial syllables:

Word-final syllables:

Notice I added a new wrinkle in the word-initial syllables box: sSLV. Everything in GenWord is case-sensitive, so s and S aren't treated the same. S will be replaced with a letter from the S category, but s won't match any category, so the generator will just output it without changing it at all. Check out strideot in the output below for an example:

Etas eseon anwolu strideot wo tednitedasi? Tat liote elthiro an soenatusinsh ser. Tetungth setilaon otush stleadsionsh ad. Tae plaonth nedaeati nassi tosasti. Sena tlaotethata elen sethe ultoeo methea atonie. Teus streat sanat ul on lano? Itoet nette ni nidme taat doso stlemiti rebesen. Nutetshem tes rortaer le pretatet ninontoet se. Erithtanasa stlonener shotoot etudtash.

I underlined some of the new syllables we're getting from this change-up. Plaonth and strideot seem particularly English-like! We could probably do better by changing the other syllable boxes, too.
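The case-sensitivity trick in sSLV is easy to model. In the sketch below, the category contents are placeholders I invented for the demo (they are not the ones from this tutorial), and expandSyllable is a hypothetical name; the point is only that a character naming a category gets replaced while anything else passes straight through.

```javascript
// Hypothetical categories for illustration only.
const categories = { S: "tpk", L: "rl", V: "aeiou" };

// Expand a syllable pattern: a character that names a category is
// replaced by a letter picked from it; any other character is output
// unchanged. Lookups are case-sensitive, so "s" is not "S".
function expandSyllable(pattern, cats, pick) {
  let out = "";
  for (const ch of pattern) {
    out += Object.prototype.hasOwnProperty.call(cats, ch) ? pick(cats[ch]) : ch;
  }
  return out;
}

// Always take the first letter, so the demo gives the same result each run.
const first = (letters) => letters[0];

console.log(expandSyllable("sSLV", categories, first)); // stra
console.log(expandSyllable("SLV", categories, first));  // tra
```

In a real generator the picker would be random, ideally weighted toward the front of each category.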

I'm going to leave off here, but there's so much more you can do. Here are a few ideas you can try on your own to test your skills:

Make word-final QU into QUE.
Try making some words start with WH, but not all, since which and witch are both possible. (For that matter, try making CH into TCH on occasion! Hint: TCH only happens after a vowel in natural English words.)
Insert the letter C where you would normally generate K or S. This is tricky, since the sound of a C greatly depends on the sounds of the letters around it!
Throw PH into the mix.
Add more vowel clusters, like OI, OO and OU.
Figure out how to make Y sometimes a vowel.
Make more English-like patterns, like how a bare H almost always occurs at the start of a word and not anywhere else.
Add in silent letters like GH, or awkward clusters like the BLE in "syllable".
Go beyond English by clicking the Extra Characters button and finding new letters and symbols to cut-n-paste. Or select a pre-defined default (Predef) and click Load Predef to look at some other categories, rules and syllables.

And as a final note: always remember to have fun!

The following is a quick summary of the rewrite rules of the Kartaran Predef. It's based on Ancient Kartara, my first conlang. Many of the rules use JavaScript regular expressions.

The line above looks for a string of the same monophthong vowel. If it contains three or more in a row, it replaces it with a string of only two.

[aeiou] looks for a single match of any of the following: a, e, i, o, or u.
() surrounds a phrase and marks it for later use.
\1 looks at the first () and tries to match the exact same string that was matched within the ().
- \2 would refer to the second (), \3 to the third, and so on.
- ()s are numbered starting with the leftmost ( and counting each ( from there. So in ((a)b(c)), \1 is (abc), \2 is (a), and \3 is (c).
{2,} means "match the previous thing at least two times, but as many times as possible".
$1 only works on the right-hand side. It is replaced with whatever matched the first () in the left-hand side.
- $s are numbered exactly the same as \s.

The above looks for a string of two or more of the same diphthong vowels in a row, and replaces it with only one.

+ means "match the previous thing as many times as possible, but at least once".
- * does the same thing, but can match zero times!

The above looks for a string of three or more vowels in a row and reduces it to the first two.

%V is a special expression used by GenWord. It gets replaced with [abc], where abc is the run of letters in the category specified (V, in this case).
- Note: If the category you select has special characters in it (+, *, \, etc.), this could create unexpected results.
{2} means "match the previous thing exactly two times, no more, no less".

The above is made up of several rules about the letter h. They are designed to preserve h only when it's at the start or end of a word, or when it's a part of a penultimate syllable when the word ends with a vowel.

If more than one h happens in a row, reduce them to a single h.
If an h occurs before a vowel and a final consonant at the end of the word, change it to H.
- \b matches the beginning or end of a word.
If an h occurs before a vowel, followed by an ending consonant, or else followed by 0-2 consonants and a vowel at the end of the word, change it to H.
- (?= ... ) is a lookahead. It means "only match what came before if we match what comes next, but don't save or change what comes next."
- {0,2} means "match the previous thing between zero and two times".
If an h occurs between two vowels at the end of a word, change it to H.
If an h occurs at the beginning of a word, change it to H.
If an h occurs at the end of a word, change it to H.
If any h is left, delete it.
Change every H back into h.
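The h steps listed here follow a handy pattern: mark the letters you want to keep with a placeholder, delete the rest, then turn the placeholder back. Below is a cut-down sketch of that pattern, not the Predef's actual rules. The categories, the %C shorthand, and the helper names are assumptions for the demo, and only one of the lookahead steps is included.

```javascript
// Hypothetical categories; the Kartaran Predef's real ones are not shown here.
const cats = { V: "aeiou", C: "ptkmnrs" };

// Expand %X into a character class built from category X, as described
// for %V. Nothing is escaped, so special characters would misbehave.
function expandCategories(pattern, categories) {
  return pattern.replace(/%(\w)/g, (whole, name) =>
    name in categories ? "[" + categories[name] + "]" : whole
  );
}

// Simplified stand-ins for the listed steps, in the same order.
const hRules = [
  ["h+", "h"],           // collapse runs of h
  ["h(?=%V%C\\b)", "H"], // protect h before vowel + word-final consonant
  ["\\bh", "H"],         // protect word-initial h
  ["h\\b", "H"],         // protect word-final h
  ["h", ""],             // delete every h that was not protected
  ["H", "h"],            // restore the protected ones
];

function applyRules(word, rules, categories) {
  return rules.reduce(
    (out, [pattern, replacement]) =>
      out.replace(new RegExp(expandCategories(pattern, categories), "g"), replacement),
    word
  );
}

for (const w of ["hhata", "tahan", "tahta", "kahnah"]) {
  console.log(w, "->", applyRules(w, hRules, cats));
}
// hhata -> hata, tahan -> tahan, tahta -> tata, kahnah -> kanah
```

Using a capital H as the placeholder only works because no earlier rule produces a capital H of its own; pick any symbol your language doesn't otherwise use.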

The rules above change the diphthongs to their two-letter symbols.

An ĭ before an i gets reduced to just i.

An ĭ before a retroflex consonant turns that consonant into a non-retroflex one.

If a monophthong is followed by an i, and they aren't at the start of a word, they get turned into a diphthong.

. matches any single character.
\B matches a position in the middle of a word, never at the start or end. Like \b, it doesn't actually match any characters; it only checks where you are.

Any doubled consonants are reduced to one.
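Backreferences make this kind of cleanup short. The patterns below are illustrations written from the descriptions in this summary, not the Predef's actual rules, and the vowel and consonant sets are placeholders.

```javascript
// Illustrative patterns only; the letter inventories are hypothetical.
const vowels = "[aeiou]";
const consonant = "[ptkmnrs]";

// Three or more of the same vowel in a row become two.
// \1 repeats whatever the first () matched; $1 reuses it in the output.
const capTriples = (w) =>
  w.replace(new RegExp("(" + vowels + ")\\1{2,}", "g"), "$1$1");

// Three or more vowels in a row (any mix) keep only the first two.
const capRuns = (w) =>
  w.replace(new RegExp("(" + vowels + "{2})" + vowels + "+", "g"), "$1");

// A doubled (or longer) run of the same consonant becomes single.
const undouble = (w) =>
  w.replace(new RegExp("(" + consonant + ")\\1+", "g"), "$1");

console.log(capTriples("kaaaat")); // kaat
console.log(capRuns("kaeiot"));    // kaet
console.log(undouble("tatta"));    // tata
```

Written as rewrite rules, the left-hand side would be the pattern and the right-hand side the $1 part.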

If a stop is followed by r, remove the stop.

If a nasal and a fricative occur together, change the second to match the first's place of articulation. If a k is followed by a nasal, remove the nasal.

If a stop and a fricative occur together, change the second to match the first's place of articulation.

Change certain difficult fricative/stop pairs into easier-to-pronounce pairs.

If retroflex and dental/alveolar occur together, keep the first one.

"Fix" the doubles we may have introduced with the nasal-or-fricative/stop changes. It's easier to just put an accent on the last retroflex in a series.

Finally, change retroflex letters into their correct symbols.

t́ is made with a Unicode combining character. See more at unicode-table.com.
