Got some ideas? Feel free to contribute!
Add your speculations, theories, current lines of investigation, or suggestions for areas which should be looked into here!
A couple of notes about this section: This is a place to collect all kinds of theories from all kinds of sources. It is quite possible (and likely) that there will be different theories which disagree with or contradict each other, and that's fine (great scientific progress often starts with a bunch of competing theories!). Updating and building off other people's theories is OK (and encouraged!) but please do not make substantial changes to other people's texts (particularly those written in the first person) without consulting them first. Making additions (comments, etc) is OK, but please try to do it in a way that it does not disrupt the original text too much (such as placing them at the end), and make sure your comments are annotated with your own username so they don't sound like they're being said by somebody else (you can use four tildes (~~~~) in your text to automatically insert your username and a time/date).
- 1 Phlosioneer
- 2 Foogod
- 3 Guy0320
- 4 StarGateTABC
- 5 Ratfox
- 6 straycat
- 7 nneonneo
- 8 Shain
- 9 curiousgeorgie
- 10 Jones
- 11 Jmorlock
- 12 Hiperon
- 13 Narga
- 14 Jonadab
- 15 Qwertystop
I have discovered two sub-alphabets (using Foogod's transliteration):
And two other letters: a and w. The alphabets are used to transcribe a word like so:
- Begin at the ending of a word, with alphabet #1.
- Alternate alphabets until the beginning of the word is reached.
The exceptions to this rule are the a and w characters. These characters seem to be nulls, randomly thrown in. The only hard and fast rule is that they reset the alternating alphabets back to alphabet #1. I suspect there are rules governing when to use these characters, but I can't figure them out.
Some random stats about these alphabets:
• They are unequally used: #1 appears 926 times, while #2 appears 775 times.
• a and w appear roughly the same number of times: 181 and 204, respectively.
• a comes after a #1 alphabet 60 times, after a #2 alphabet 38 times, after a w 27 times, and at the beginning of a word 56 times.
• w comes after a #1 alphabet 12 times, after a #2 alphabet 102 times, after an a 10 times, and at the beginning of a word 80 times.
Translation of "and"
I have come to believe that the word prefix yw ("yw" in the foogod transliteration) is translated into the word "and". I came to this conclusion by identifying phrases and determining where clause breaks must happen, and then looking at what was common between them. This result is confirmed through careful study of the patterns of words with w's in them, and by the one appearance of y immediately before &. Whenever there is the word "and" in the plaintext, it is first translated into y, and then mashed into the following word with w. This can't be done before &, so it becomes its own stand-alone word.
I'll add more concrete evidence here when my Finals are over.
Working from the folded text, it becomes fairly clear that there are three, largely distinct, classes of glyphs at work (ignoring & for the time being):
- Class 1: p s l f x g k o
- Class 2: j d u y q
- Class 3: a w
Based on this, the text can be divided up into individual "tokens" according to the following rules:
A token consists of:
- A class-1 glyph followed by a class-2 glyph
- A class-3 glyph by itself
- (at the beginning of a word or following a class-3 glyph) A class-2 glyph by itself
(Note that unlike w, the a glyph occurs quite frequently followed by class-2 glyphs, the same way a class-1 glyph would. However, it also appears quite frequently on its own, unlike any other class-1 glyph. Its classification in this model is therefore somewhat uncertain at this point.)
It is unclear whether the solo class-2 glyphs represent unique tokens, or whether they are simply "shorthand" for some of the other class-1 + class-2 combinations. For the time being, they are being considered distinct. If we therefore map out all of the tokens found in the text according to these rules, we come up with the following "41-token alphabet":
If we take tokens as letters, obviously, this alphabet is significantly larger than that of English or German. It also results in substantially shorter average word lengths in the text than would be expected. My current theory is that some tokens probably map directly to letters, but that many additional tokens may map to common multiple-letter sequences (such as "ing" or "ed" word-endings in English), and that this may account for both the atypical word lengths and the larger number of tokens involved.
w as Space/Suffix/Shift Marker
There are numerous words in the text which appear both as their own word and also as the prefix for other words (this may correspond to, for example, the tendency in English to change the tense or part of speech by adding a suffix to an existing word (i.e. "-ed" for past tense, or "-ly" to make an adverb)).
One particularly noteworthy observation that a couple of people have made (directly or indirectly) is that whenever this occurs, the suffix portion of the longer word always begins with the w character. In some cases there appear to be multiple suffixes tacked onto the same word in this way (for example, gdod suffixed by wsd suffixed by wxj).
However, it is important to note also that w shows up in various other locations which do not appear to be suffixes in the same way, making things more fuzzy.
There are a couple of current theories regarding this:
- The w character is actually some sort of "space" character, and possibly these suffixes are not actually suffixes but instead other words (the meaning of using w instead of space in this context is unclear).
- w is being used as some sort of "suffix marker" or "suffix glue" to indicate when a suffix is being tacked onto another word (whether this is just redundant or is required for some semantic purpose is unclear).
- w is being used as some sort of "shift" character to change the meaning of the following characters/tokens (possibly, the "alternate token set" contains a bunch of common word endings which are not present in the normal "token alphabet").
W as a dash/hyphen/space
W appears to be used as a dash connecting smaller words to make larger ones. Several four and three letter standalone words in the text appear as a part of larger words within the text. Smaller words isolated and found used as a prefix before W on the first page of The Book of Woo include UAD, LYOJ, LDSD, and GDOD.
W may also be a space character introduced at some point during encryption to throw off word length analysis by creating a large word which is actually two short words in the plain text.
These are just theories and the proofs are just the steps I've taken to get to the conclusion. Please edit my theory freely if you find any writing mistakes. I'm not sure how often I'll be visiting the wiki, so I just dumped my ideas here. And I'm not sure if someone mentioned them earlier.
Information density and word length
I'm not a cryptography expert, but I know something about Information Theory.
Let's say my "Not English nor German" theory is wrong and text really is written in English and/or German. English has 26 letters, while German has 30 letters (or also 26 if not counting ü, ä, ö and ß). Folded Woo alphabet has 15 characters plus &. So, because the alphabet is two times less dense (has less characters), the message must be approximately log15(30) = 1.3 times longer. In other words, the original text is 0.7 times shorter. If we assume spaces are word breaks, this also means that words are shorter as well. In Foogods transliteration, average word length is around 6.2 (2186 letters / 353 words), which means that original has average word length of 4.3. I'm pretty sure both German and English have average words longer than that, especially in a meaningful text worth deciphering.
But wait, there's more: If we assume W/w are spaces/sufix markers/prefix markers/dashes things get even fuzzier. We get a lot of shorter words lowering the average to 5.1 letters per word (3.6 in original), and alphabet to 14 (+ &)). But surprisingly, there are no one letter words except Y/y which appears 26 times (but only once without being pre-/suffixed to another word), and a single u, which is also prefixed to another word. (Also interesting that solo Y and prefix u are both on the first page).
This could mean one or more of these things:
- Folded Woo alphabet is wrong
- Maybe U/u do change letter meaning? This would increase the alphabet density, and shorten the Book (bring plain text average word lengths to normal)
- Do note that this will bring total number of letters to exactly 30 (+ &) or 26 + UuWw& (if we assume U/u is just to make it harder, and W/w are some kind of spaces), just like German alphabet. (Assuming letter frequencies are not correct, so it's not plain letter substitution, PENDING TEST)
- (I meant that U/u glyphs do something else apart from switching the alphabet, because we need more letters to fill up to the english/german alphabets. In other words, observed glyph pairs A/a, D/d,... are not completely identical and while having some connection (proven by word repetition) are distinct characters)Stargatetabc (talk) 11:50, 25 July 2014 (UTC)
- Thank you for your respond.
- OK, I think i understand now. But word repetition was almost first I checked when you showed me there are two alphabets. And EVERY single character (not &) is in two different versions - big and small, in same words - I mean with same combination of glyphs. I can't possible imagine this, if big letter and small letter mean something else (maybe because of the w/W characters) it shouldn't give same words in the end. Can you?
- BUT! You made me think about something else. The & glyph can do what you are talking about. u and U in my opinion can't work this way, but & can! It appear in several places and have no upper/lower case character, so it's special. It can do lots of things, it can change alphabets into some other alphabets or just simple change language from English to German... I was believing & means seeoahtlahmakaskay, and comment about it you see a little lower... But it's just a guess, maybe wrong, what do you think? Hiperon (talk) 21:00, 29 July 2014 (UTC)
- Ah! And IMO It's not that hard to check if & is making some "things" to our text... With the same way I checked if upper case letters are same as lower case letters. We can check if there are some "same" words between two & glyphs. If there are many of them, there is low probability that & is making too much things with our alphabets or something Hiperon (talk) 21:24, 29 July 2014 (UTC)
- Spaces are not boundaries of words
- Maybe they mark syllables, or other phonetic units? This is placed as an assumption on What We Know page, but might not be true.
- This would increase actual word length in the Book, and so in the original as well.
- The language is not plain German/English
- This can mean that acronyms are used to shorten the words.
- Maybe just some common German words (pronouns, especially possessive pronouns because they might be used a lot and they are longer (meinen, deinen, seinen, etc.))
- Information Theory failed me
- Or I'm just missing something, please correct me if so.
(Natural language isn't very informationally dense. So, for example, one could find combinations of letters that are common in the plaintext language, then turn that into a smaller number of combinations in the cyphertext. Jones (talk) 00:17, 26 July 2014 (UTC))
- (I haven't fully understood what you are trying to say. If you were thinking about substituting common letter groups with a single glyph, that would work, but will INCREASE the needed alphabet, and further increase the 1.3 ratio. If you are thinking about equalising some pairs (e.g. common "ei" and "ie" pairs, which sometimes alternate in verbs), that still won't solve the problem. Unless we combine the two methods and substitute a big number of letter pairs with a single character. But then we'll need to guess which pair we need to use for those glyphs (That would actually work, but then you're basically changing the language which takes us to point number three).)
The interesting &
Note: I'm using Foogod's transliteration, because it's easier to see patterns than with WooText. Also alphabet is folded, i.e. lowercase and uppercase letters are equal.
Interesting thing I just found out is that & has quite interesting pattern surrounding it:
- A known fact is that & appears 8 times, only on it's own.
- It is always followed by word xjsykd, sometimes in variations (xjsykdwpj and xjsykdwpjwlja). The word xjsykd itself is mentioned only twice more, and only in first paragraph.
- 7 out of 8 times it is preceded by the word ending with wxj (i.e. if W/w theories are correct, with a xj suffix).
Those words are luoyojwxj, ljaxjawxj, dpjwxj, dpywxj and ywluoyojwxj. Also note that 1st, 2nd and 4th word appear before & quite a few times in third paragraph (it looks like some kind of enumeration to me), with or without wxj suffix, and in different arrangements.
[...]luoypj dpywxj ljaxjawxj & xjsykd.[...] [...]luoypjwxj ljaxja dpywxj & xjsykdwpj[...] [...]luoypjwxj ljaxjawxj &[...]
- luoypjwxj appears only 5 times a few fords away from &. It does not appear out of &'s context. Three times it's directly preceeding & (once as ywluoypjwxj).
- ljaxjawxj appears 3 times (once without wxj suffix) in &'s context. It does appear twice out of &'s context (as ljaxja-ya ljaxja-pua-sd). Two times it's directly preceeding &.
- dpjwxj appears once and only preceding &. It does not appear out of &'s context. (xdpj and xdpjwxj do appear, but are diferent words)
- dpywxj appears twice and only preceding &. It does not appear out of &'s context, but other variations of dpy do.
(I was sure, & was just "seeoahtlahmakaskay". Does it make sense with this, what you noticed? What you think, you proved that I was thinking correct, or I was going in wrong way, or we can't tell? Maybe There is same word, because it's always name of god and her qualities, like "Zeus the thunderer? Can someone else comment? Guys, we need to talk to get something ;-) Hiperon (talk) 20:56, 24 July 2014 (UTC))
Just pointing out a hint (RFC)
So, while browsing the comment pages for the original comic, I found something quite interesting.
@Viktor: I have got two questions which I think do not reveal to much and do not make it any easier to solve it: 1. Is the cipher deterministic? I mean that for example using random anagrams of words makes the cipher nondeterministic since you can not decide whether “amry” is “army” or “mary” and then even if you know the password/key there would exist several possible plaintext. On the other hand deterministic cipher is for example columnar transposition or monoalphabetic substitution, when if you know password then there is just one correct plaintext.
The answer was simple:
@Novil: Contrary to your assumption, this piece of information would reveal *very much*.
The other question was poorly worded, and not worthy of neither the useful answer (which it didn't get) nor the bytes of this wiki.
I was thinking for a bit, but came to no conclusion. Anagraming the words might make decoding harder, but only if it was somewhere in the middle of the whole algorithm.
If it was among the first algorithms preformed on the original text, it will be as simple as solving regular anagrams. If it was somewhere at the end, searching for anagrams on the transliteration should do the trick.
Note that some algorithms that don't take letter position into account can be preformed both before and after the anagramming for same results such as letter substitution, rotation, polyalphabet substitution(i.e. letters->glyphs)
Also note that anagraming can be both deterministic (such as sort all letters in a word alphabeticaly, stargate->aaegrstt) and nondeterministic (stargate->regattas or treat-gas or any other combination of those 8 letters)
Not English nor German
This theory has been pretty much disproved by Novil's comment a few hours later, even Novil himself corrected me. But I'll just leave this here anyway
My decryption technique is more focused on the hints. I tried to reach for hidden meanings in them, and this is the theory I came up with:
Book of Woo is written in another language
Novil wrote it in either DE or EN but the letters and words are not 1-to-1 with the final text, even when other ciphering algorithms are applied. In other words, the text was translated to another language. This could be some existing language (let's call it "Woo's secret language"), or maybe some nonexistent "raccoonish". The latter would be a hard thing to decode, so I presume the former is true. Since Novil translated it, the original language is lost in translation, so it doesn't matter. Even if it's written in DE it can be decoded to EN without knowledge of former language, and viceversa.
(nopenopenope... Hmmm... It would make decryption book of woo impossible. Because in this situation of secret Woo language any "word" in book of woo can mean anything Novi want. Novi said decryption is hard but possible if I remember it well. So this theory is wrong, or we don't believe her... And even if this theory is right, following it will not help us, because we can't do anything with text if it's right. So believe it and surrender the decryption, or forget this theory and still work. Hiperon (talk) 21:03, 24 July 2014 (UTC))
Proof number one:
"I wrote the story for “The Book of Woo” in English or German or both [...]" - Novil, somewhen
What was possibly meant was:
"I wrote the story for “The Book of Woo” in English or German or both, it doesn't matter, [...]" - Novil, somewhen
Proof number two:
The following word appears in the plain text:
ENGLISH: Potbelly Hill | GERMAN: Bauchigen Hügel
How can both words appear in plain text? My conclusion is that the hint is written in DE and EN because writing it in "Woo's secret language" might reveal the language and give a bigger hint than Novil wanted.
Novil added the following clarification a few hours after the blog post was posted:
That means if the plain text is in English, the word is “Potbelly Hill”, if the plain text is in German, the word is “Bauchigen Hügel”, if the plain text is a mixture of English and German, the word is “Potbelly Hill” or “Bauchigen Hügel”.
First, Potbelly Hill is an old archaeological site in Turkey: Entry in Wikipedia It has many reliefs of animals. I'm not sure it's "quite easy" to determine on which page it is, but my bet is for the second one. I suppose Potbelly Hill is a sanctuary created by Seeoahtlahmakaskay where she teaches to raccoons. On the other hand, considering the destruction on the last page, it might be where is is mentioned that the remains of the sanctuary is now called Potbelly Hill.
I have followed the supposition that w is a separator of some kind, and the theory has legs. To begin with, it is neither at the end or at the begin of a word, and most importantly, the subwords found between w characters can most of the time be found as standalone words. It seems that w can be used to add suffixes or prefixes to a word:
- Example of prefix: y + w + luoypj = ywluoypj
- Examples of suffixes: luoypj + w + xj = luoypjwxj, ljaxja + w + pua + w + sd = ljaxjawpuawsd
- Example of both: y + w + luoypj + w + xj = ywluoypjwxj
As noted by StarGateTABC, the & character is always followed by xjsykd, with possibly one or two w-suffixes. It is also most often preceded by a word ending with the wxj suffix.
As noted by Foogod, there seems to be three types of characters:
- Group 1: fgklopsx
- Group 2: djquy
- Special characters: wa
One thing of importance is that you never have two letters of the groups in a row. A letter of the group 1 is always followed by a letter of the group 2. One letter of the group 2 can be followed by any letter, except another one of the group 2. Words (or subwords after separation by w) can start by any letter except w, but cannot end by a letter of group 1, or w.
I want to mention that Foogod's table resulting from combinations of one letter of the group 1 and one letter of the group 2, or one letter of the group 2 alone, looks very familiar to people who studied Japanese. Japanese has two syllabic alphabets where each syllable is either one vowel, or one consonant plus one vowel. This would imply the group 1 is the group of consonants, and the group 2 is the group of vowels. This works rather well considering group 2 contains 5 letters. There is also an additional character in the Japanese alphabets for ending a syllable with an n or m sound. The correspondence of that special character with a is not perfect since there are words starting with a, instead of always having a following a vowel.
Now, it is a bit ungainly to write German or English with a syllabic alphabet, but possible. So maybe the original text was rewritten to work with a syllabic alphabet, and then rewritten with the group 1-group 2 letters.
Is the plainttext a lipogram?
Considering the reduced alphabet, it might be, that the plaintext is a lipogram, i.e. a text written without one or more characters.
On a similar note, the nearly linear letter frequency may be caused by writing the plaintext in some restricted way. Maybe Novil even tuned the plaintext to create a letter frequency that matches neither German nor English. This would reduce the usefulness of analytical methods a bit.
A change of alphabet
I think it's about time we moved away from a glyph-based alphabet and towards a more logically oriented one, to make the next stage(s) easier to reason about. Based on observations made by Foogod and Phlosioneer, I propose the following breakdown of the glyph alphabet:
- Call jdyqu "vowels"
- Call olpkgfxs "hard consonants"
- Call a and w "soft consonants"
Along with this, I propose a new mapping that makes the relationship between the groups very obvious, namely a transliteration that converts "vowels" to English vowels, "hard consonants" to English consonants, and the other characters to something else (here, I will use "'" for 'w' based on observations that it looks like a word boundary marker, and "n" for 'a' simply based on the fact that it can go next to either vowels or consonants relatively easily).
Here's the mapping table from the folded glyph-based transliteration to this more abstracted alphabet:
adfgjklopqsuwxy => negwacrltudi'so
and a text sample:
rilota dota'ta nada o'na: rilota dota'ta wiga sige'gin rola. len'te dota ruta'te ine'ta'ci o & sadoce. wonsi nense letu rago sana'te ine'ta ison'tin'ta gin rola'ta guwo o'gin rola. gin rola'ta wigo o'na: i'etere o'de dede'ran. ine dota'ta sene o'nada "da wele" wele'cen rede eto rede'ta wele'wen'de sa runi'sa sana ete. wonsi wele wonsi wele'de towo'te ine dota'ta tugan o'gon'sa dita'on segete. wele sine'on rine
It reads a lot more clearly, and it's a whole lot easier to remember English-like words than to have to carry around a word like "sypjwpj". Plus it has the benefit of making it look like it's written in an actual foreign language (though don't forget that it's encrypted text!).
Do note that the mapping is not meant to symbolize any kind of actual connection with vowels, consonants, sounds, or the like. It is a notational convenience that makes reasoning easier, and makes the properties of the letters stand out more clearly to simplify higher-order analysis.
The full transliteration can be found here: Transliterations#nneonneo
(Very good, just one addition: I think that instead of apostrophe, the symbol for w should be a dash "-" because its clearer to see in normal text (dota'ta vs dota-ta). Apart from that, this transliteration will ease pronunciation of the text)Stargatetabc (talk) 06:47, 25 July 2014 (UTC)
By analyzing the character sequences, I've derived a tokenization of the text using the regex pattern ([cdglrstwn]?[aeiou]n?) (a consonant or "n", followed by a vowel, followed by another optional "n"). This tokenization perfectly breaks each word into constituent parts.
Here's what the tokenized text looks like:
ri_lo_ta do_ta'ta na_da o'na: ri_lo_ta do_ta'ta wi_ga si_ge'gin ro_la. len'te do_ta ru_ta'te i_ne'ta'ci o & sa_do_ce. won_si nen_se le_tu ra_go sa_na'te i_ne'ta i_son'tin'ta gin ro_la'ta gu_wo o'gin ro_la. gin ro_la'ta wi_go o'na: i'e_te_re o'de de_de'ran. i_ne do_ta'ta se_ne o'na_da "da we_le" we_le'cen re_de e_to re_de'ta we_le'wen'de sa ru_ni'sa sa_na e_te. won_si we_le won_si we_le'de to_wo'te i_ne do_ta'ta tu_gan o'gon'sa di_ta'on se_ge_te. we_le si_ne'on ri_ne
Notably, the tokenization results in just 61 unique tokens, a number of which only occur at the start of words. These tokens may be the primitive units for the next stage of decryption, which may be some kind of transposition or code replacement.
Breaking the text into smaller words
"Words" (divided by spaces) can be further divided into smaller words based on the position of the "'" characters ('w' glyphs). For this section I will refer to the words-divided-by-spaces as word groups.
Initially I assigned these prefix/suffix status based on their position in the word. Later, I realized that this is an irrelevant distinction.
There are two kinds of words, simply put: one-token words ("short" words) and two- and three-token words ("long" words). Words are agglomerated into groups according to the following rules:
- A short word with a single glyph becomes the start of the next word group. The only one-token words are "i" and "o" ("u" and "y" glyphs).
- All other short words are appended to the previous word group, up to a maximum of three words in a single group. After that, further short words (if there are several consecutively) are simply placed into a new word group.
- A word group can contain at most one long word.
That's it! All "suffix/prefix" behaviour stems from these two rules.
Illustrative example: this phrase in the second line of the first page:
i_ne'ta'ci o & sa_do_ce
contains the long word i_ne, followed by two short words "ta" and "ci". Another short word, "o" follows, but since it cannot be placed in a word group with the next word (since it's a & symbol), it stands alone. This is the only instance where the "o" word stands alone.
Illustrative example: this phrase near the end of the third page:
ge_de'ci o'si_ne'gan tin'de'sa
contains several short words interspersed with long words. The short word "ci" is appended to the "ge_de" group. The short word "o" that follows is placed at the start of a new word group with "si_ne". "gan" is suffixed to si_ne, which makes a group of three words. Therefore, the next short word, tin, is placed in a new word group, and two more short words are added to it.
There are four exceptions to these rules in the text, out of 247 short words. These may be errors in transcription, modifications made for artistic purposes, or they may represent a problem with the rules I've outlined above.
The exceptions are (line numbers refer to the whole transliterated text):
page 1 lines 14~15: o'ne_ran we_le i_ne'tin / wen'de sa_do_ce'tin page 1 line 22: ru_ta tin'de'sa page 4 line 84: won_si na'gan'te page 4 line 85: tu_ge len'ta'tin
Rendered according to the rules, these would be
page 1 lines 14~15: o'ne_ran we_le i_ne'tin'wen / de sa_do_ce'tin (might be broken due to a line length limitation) page 1 line 22: ru_ta'tin'de sa page 4 line 84: won_si'na'gan te page 4 line 85: tu_ge'len'ta tin
Division into smaller words
Here are the short words and their counts:
52 ta 45 sa 26 o 22 de 20 te 15 tin 14 da 10 cen 9 on 7 na 6 gin 4 ran 5 len 4 wen 3 ci 2 gan 1 wu 1 i 1 gon
and the long words and their counts:
25 ri_lo_ta 22 we_le 15 ru_ta 14 won_si 13 do_ta 11 ro_la 11 i_ne 10 ri_ne 10 sa_do_ce 9 du_wo 9 e_to 9 e_te 8 ge_de 6 tu_ge 6 en_wo 6 ge_ra 6 di_ta 5 re_de 5 sa_na 5 ran_san 5 ga_lon 5 nen_se 5 si_ne 5 sa_tan 5 ta_ta 5 ru_ni 4 de_de 4 ne_ran 4 a_go 4 en_se 3 se_ge_te 3 go_so_gon 3 gu_to 3 na_da 3 we_ri 3 an_re 3 ra_go 3 si_ge 2 u_we 2 du_ra 2 se_ne 2 wi_di 2 gu_wo 2 le_le 2 te_le 2 se_ta 2 i_son 2 e_te_re 2 wi_ga 2 te_ri 2 le_tu 2 le_ri 1 da_co 1 ra_wo_ton 1 la_to 1 tu_gan 1 e_ta 1 i_tan 1 no_ne 1 ro_ta 1 wi_go 1 a_ti 1 ni_ge 1 te_so 1 to_wo 1 din_ru_we
There are a total of 85 words (19 short and 66 long). The long words use 52 unique tokens, which makes for over 140,000 possible three-token words. Therefore, it is very possible that these 85 words are, by themselves, complete words. Furthermore, these word counts are very consistent with a sample English text I looked at, once the most common words were removed. I don't know much about German, unfortunately, so I can't be much use there.
Book Cipher Hypothesis
One observation that is commonly made about the "words" above is that they are all very short - 1, 2 or 3 tokens in length. No direct mapping from tokens to syllables or letters would seem to make sense here.
But what if the words are actually indices into a list of words? This would enable the words to be compactly represented (essentially just as encoded numbers), as they seem to be here.
The big open questions are (1) what wordlist/book was used, and (2) how do letters map to numbers. I can offer a *possible* answer to the first question: the hint "10000" on the third page may indicate that the list in question is a list of the top 10,000 words in either English or German. If this is the case, there are only a few public wordlists that would qualify.
The second question is rather thornier, given that there doesn't seem to be any "natural" way to order the characters.
This seems like a far-fetched theory, and I'm perfectly willing to admit it as such. However, there is some compelling evidence for it.
First, observe that each token consists of an optional consonant, vowel, and optional trailing "n". There are ten possible consonants: c,d,g,l,n,r,s,t,w and none. Similarly, there are ten possible combinations for vowel+n (namely a,e,i,o,u,an,en,in,on,un). So, there are precisely 100 possible tokens, and so there are precisely 10000 possible two-token words. The three-token words, in this case, are likely those words which can't be coded in the wordlist, and for which we'd have to find some alternate coding (e.g. an extended wordlist, or a syllabic breakdown). However, there are only 7 three-token words, so maybe they can just be guessed from context.
Second, note that so far we haven't used the "10000" 'clue' on the third page. This would be one way to explain that clue.
Third, if one looks at the distribution of words in a short story (like the Book of Genesis), a significant fraction of the words are from the top 100 words, though some of the most frequent words are also words from further down in the list (usually things like the main subject of the story). This correlates well with the observed word frequency in the BoW text.
Now, here's some cons of this particular hypothesis:
- It requires an ordering of the characters, which we don't have (and which would be irritatingly hard to guess)
- It requires knowing or guessing the wordlist, which is not easy
- I haven't been able to decode the text using this theory yet, though I haven't spent much time on it. This is mainly due to being stymied by the first two points.
Anyway, there's the hypothesis. Please comment below; I want to hear what you think of this idea.
- I think that guessing the list and ordering the characters is a little bit too much to do. ordering both sets would give 10!*10!=13168189440000 combinations (which in turn would multiply with every possible top 10.000 words list). Even if we were to assume "Benford's Law" (number distribution) the possibilities are still far too many. --Shain (talk) 05:49, 28 July 2014 (UTC)
- Yes, indeed guessing the list with brute-force generates far, far too many possibilities. Benford's law only really holds for large samples; for our tiny sample the statistical variance would generate too many possible alternatives.
- I'm thinking more along the lines of Stargatetabc's idea: find some sort of hint or clue that points us toward some kind of ordering. Of course such a thing could be a total wild goose chase if this theory doesn't pan out, but the statistical evidence is rather compelling. nneonneo talk 02:08, 30 July 2014 (UTC)
- In one book I read, hero founds two part code: one part is the message it self, and the other is the key. But they are merged by some algorithm I can't remember. Maybe it's the case here as well? Maybe a paragraph that's different from the others? Or maybe just some sentences? And that key is the ordering of letters?
I think it might be possible that there is some kind of "terminator" character which simply ignores the last vowel. in nneonneo's transliteration the ' might be the vowel killer? but i don't know what to make of it if it is "n'". "dota'ta" could be "dotta" which atleast looks nice. or it is the other way around terminating the following consonant?
Also it might be that there is some kind of gloryfication character (the &) to mark the different gods of the story?
These are just basic ideas and i am in no way experienced in deciphering texts.
I would like to hear what you think about these two ideas, does this make sense?
Alternative: w as a doubler
Building off Ratfox's idea (and also nneonneo's apostrophe), if one of the encipherment methods was a syllabic re-writing of German or English, it could be that w is like the っ (small tsu) character in Japanese, which normally indicates a glottal stop, or if you write it in English, a double consonant. Like w, it normally appears between syllables and makes no sense at the beginning of a word (or the end, most of the time).
Here, w is followed mainly by class 1 glyphs, and occasionally by class 2 glyphs. If the class 1 glyphs represent consonants or consonant clusters (though this would drastically decrease the number of possible consonant sounds represented by this alphabet unless a somehow fixes that), and the class 2 glyphs represent vowels, then it kind of makes sense that w doubles the character following it. w is followed by basically every glyph except two of the class 2 "vowels" (i.e., there is no wj or wq). As far as I and the German orthography wikipedia page can tell, English and German are likely to have "ee", "oo", and even "aa", but "ii" and "uu" (except in vacuum) are pretty much nonexistent, so 3/5 "vowels" seems to fit.
Shain: if the vowels are doubled it should be clear that if the text is german there would have to be some steps to clean it up beforehand. normally longer vowels in german are made by adding and "h" after a vowel (exception is "i" where it is most of the time "ie"). so this would have to be fixed beforehand. but considering word lenght double fits better than terminate.
w is not a word separator
This is just based on word frequency analysis. Assuming space is a word separator, but w is not, there are 184 unique words out of 368. Of these 114 appear only once. This is very similar to a number of short fairy tales I did the same analysis on. For example, The Princess and The Pea by Hans Christian Anderson has 385 words, 180 unique words, and 121 of them only appear once. However, if you assume that both space and w are word separators, then you get 572 words, but only 86 unique words, and a measly 17 words that only appear once.
I also did this analysis on a slightly longer fairy tale, to see if the fraction of unrepeated words would go down. It did, but it was still above 50% of the unique words.
- I have done a similar analysis, but with the book of Genesis (seemed appropriate since the BoW story seems to be a creation myth). However, my results are very different.
- The Book of Woo with ' as a word separator (as in my transcription) has 564 words, 85 unique words, and 17 that appear once.
- The section of the Book of Genesis that I used has 573 words, 120 unique words, and 46 words that appear only once.
- The Bible results are rather closer to the BoW results. Some of this may be explained as the Bible using far fewer synonyms, preferring to use repetition to reinforce the story (e.g. God always "commands" that something be so, not "wishes", not "wills").
- It's entirely possible that the true answer lies in between. For example, the ' character may be a pseudo-word separator like "-", enabling the construction of conjugated forms and compound words from individual parts. nneonneo talk 04:07, 27 July 2014 (UTC)
In looking at foogod's transliteration the U modifies the case, upper to lower and vice-versa. If one were break the text into two different bodies of text on the U separator does that help? I'm thinking Baconian Cipher where the upper and lower are the English and German portions of the message or something of the sort.
So here's the summary. I broke the text into two pieces, the uppercase half and lowercase half, then combined them with the uppercase portion first then the lowercase portion. I removed the line breaks and the character U. This leaves us with 278 words with 197 unique words with 156 appearing once. Jmorlock (talk) 15:01, 29 July 2014 (UTC)
- I think this is highly possible. the problem now is which one is the german and which one the english part? having the same words appear in both parts (upper/lower) would be a great hint at their meaning. even if there are some words in german and english that are the same, the possibility of those being names is higher. i am also thinking the Potbelly Hill Hint could help further here. --Shain (talk) 07:17, 1 August 2014 (UTC)
Here's a thought how do words from the two languages compare across the U boundary? 
- While the idea is good to think about, the fact that identical glyph patterns appear on either side of the U boundary (and even half-and-half across it) is how Satsuoni discovered the folding function of U/u in the first place. The possibility that there could still be some difference between the sets was considered but deemed unlikely. - Curiousgeorgie (talk) 19:21, 3 August 2014 (UTC)
Hi, I'm happy to join to this great group of book of woo fans or workers or decryptioners or what you want to talk about yourself.
My theories are usually repetition on what you said all. I mostly base on your work, but I will try to join to the discussion, not with something new, but with my voice for or against some of your theories.
I will write it all but not right now, it will take me a while. This, and Work_Organization page will take a few days to end, so if you are interested, wait a while, and probably you will find something new here.
I highly recommend discussion in "working theories" page. I believe it always help, and its much more organized than in comments page under Sandra and Woo comics. Especially i ASK you for comments, discussion, etc under my theories paragraph. It makes me thing much better if I can throw my theories into anyone any he somehow respond. Or if someone throw theories into me, dump or good, no matter. Have you seen "House M.D." serial? If you have, you know what I'm talking about. So please, talk, discuss, oppose or agree with me, maybe it will make me somehow useful here.
Ah, and one more thing. If you got too much time, please, translate my scribble from this "I'm not native English speaker" into true, easy to understand English if possible ;-)
Upper/Lower case characters are the same one
When I saw Foogod transliteration, I was wondering, if this is just for esier word processing or upper/lower case letters are really the same. Right now I believe upper/lower letters are the same.
The reason I'm thinking this way right now, is because every letter can be found in at least one pair of word and mirror word like lUOYPJ and Luoypj (lUOYPJ and Luoypj). I think, it's impossible, or really hard to get (and i don't believe Novi was working so hard just too fool us with this) this result with two different "alphabets", one with lower case letters and second with upper case letters.
If you see how it could be done by chance or with some easy Novi work, please tell here, it may be important.
Unless someone will show how could it be possible I will work with text as with 15 "letters" (of course combination of two or more glyphs can give us one letter in English/German) and highly recommend this for everyone working on the text. Unless you are looking for answer "when" woo is using upper case letters, when Lower case letters (that's how you found u and U, GREAT job, wow!) you can use "all to low" commend on the Foogod transliteration.
& is of course something else, special, there will be something about this in another theory.
Upper/Lower marking something
I can't think of an easy way to get mirror words if the "upper" and "lower" alphabets are unrelated, but that doesn't mean the distinction in meaningless. For example, maybe "upper" is used when the deity is speaking, and mixed when the deity is being quoted. Or maybe switching alphabets indicates emphasis, or a change in the part of speech (e.g., using a word as a verb instead of as a noun).
I found the following information in "what we know:"
'First Decryption Step Confirmed
On August 2, 2012, Satsuoni posted the following in the comment thread:
By looking at the distribution of character pairs I have discovered that the alphabet can be cleanly separated into two parts, with “u” and “e” serving as separators (incidentally, they look like on and off switch) between two sets. Sets seem identical in function. For example, “z” in one subset is equivalent to “w” in other. The distribution of separators doesn’t suggest that they are spaces of any kind [...]
..and then a bit later posted a new transliteration with the appropriate characters replaced to reduce the 31-character alphabet to a 15-character one.
On August 7, 2012, Novil posted a comment in which he confirmed that Satsuoni's analysis was, in fact, correct. Potbelly Hill Hint
On July 21, 2014, Novil posted the following in an article in the news section of the website:
I have decided to give you a little hint. The following word appears in the plain text:
ENGLISH: Potbelly Hill | GERMAN: Bauchigen Hügel It should be quite easy to determine on which page. I hope this will generate some debate about the supposed content of that page.'
These two facts -- that there are alternating parallel alphabets, and that the plaintext is apparently English AND German -- suggest to me that the alternations are shifts from one plaintext language to another.
For example, "Then the raccoon goddess ' sprach zu den Waschbären, hier lassen Sie Ihren Sprösslingen & find this gift: the use of prehensile paws ..." (I made that up, based on prior speculation in the pictures section, just to demonstrate the concept).
This says nothing about the cipher proper, but it does suggest why the segments seem to alternate.
It would seem to me also that the "Potbelly hill" clue might lead to a cryptographic "window."
Another such "window" would be the name of the Raccoon Goddess -- "Seeoahlahmakaskay" or "Coalmask" -- if that sequence could be found in the text.
Finally, a last point that seems suggestive: Knowing that Novil has consistently used "Seeoahlahmakaskay" in place of "Coalmask" suggests that the Woo Text may have been phonetically spelled out prior to encryption, as with "Seeoahlahmakaskay / Coalmask."
Woo would perhaps become Wahoouoou or some such.
Just some thoughts. Hope they help. Og the Odious
the properties of the currently discussed transliterations reminded me of the properties of the polynesian language family (like hawaiian, samoan or RapaNui).
- all words seem to end on one of five characters which people here seem to identify with vowels
- 14-15 different characters in total
- consonants and vowels are alternating
- the w/W separator could be identified with the ' (glottal stop character) that most of these languages have
But so far no luck with a translation :-)
Potbelly Hill is an archaeological site in Turkey (called Göbekli Tepe there). As the user Rilota Ta wrote here somewhere, it is most likely referred to on page 3 of the Book of Woo, as 10.000 BC is the approximate age of the site and the scorpion figure on top of the box is very similar to the one on stone pillar #43 there. http://aintnohothouseflower.files.wordpress.com/2013/12/254952-a.jpg
Satsuoni's discovery of the doubled character set
I must say that I did not understand how exactly Satsuoni discovered the separation in the character set. I can create a distribution of character pairs. But I fail to see "a clear separation" of two sets. Can anybody explain this in a more comprehensive way at the respective location in this wiki?
Interpreting the use of Book of Woo glyphs in a recent comic (0973 Speedrun) as a hint from Novil, translating to "lingerie", the final decryption step could be a syllable-based nomenclator (i.e. lookup table) in the style of the Great Cipher. It would fit that a french word was used in the hint. So each "word" in the Book of Woo would actually be a syllable (as proposed by some contributors above):
|Word||nneonneo's transliteration||decoded syllable|
Now there remains only the rest of the syllables to identify :-) I'd guess that there should be some kind of order to it, maybe connected to the vowel/consonant structure of the words. Otherwise one could identify some of the syllables based on the frequencies with which they appear in the Book of Woo.
Cordolf (talk) 18:39, 11 May 2018 (CEST) Just an additional thought - why assume that the plaintext of the glyphs is English? In the German version of the original page that the picture with the "Sexy ____" language is taken from, I believe the German is "dessous". Assuming that this is a hint, we might even be going for other words with essentially the same meaning. "Underwear" or "Unterwasche" (for some reason I can't get the umlaut on that "a"), maybe.
Linear Frequency Distribution
Looking at the frequency distribution of the tokens, I couldn't help but wonder, what kind of cipher can produce that sort of distribution?
So far, I've found one possible answer...
Polybius Square plus Multi-Option Substitution
My current working theory is that something resembling a Polybius square (with probably about nine distinct coordinate possibilities) was used to considerably reduce the number of distinct characters; and then it was subsequently (re)expanded by using a substitution cipher that has multiple possibilities for each coordinate -- more possibilities for some than for others.
Here is a Perl script that implements such a cipher, and while I still need to tweak some of the details, I'm getting a frequency distribution that largely resembles the one we see from the Book of Woo cipher. Direct Download | Wiki Source
The newest strip gives a clue, Woo saying "mi tawa". This is in toki pona (http://tokipona.net/tp/Default.aspx), a conlang with an extremely limited wordlist. The language doesn't have capital letters, uses only 15 letters, and all syllables are a consonant (optional if word-initial, required otherwise) followed by a vowel followed by an optional "m" or "n". This seems to closely mirror theories and discoveries about the number of letters used, the consonant-vowel alternation, and other such discoveries.
If the specific meaning of the clue matters, "mi tawa" could be translated in a variety of ways – the limited vocabulary makes context important. "mi" is "me", "my", "we", etc. – first-person, both to indicate oneself and as an adjective indicating possession. "tawa" is mostly about motion – "go to", "walk", "move" – but also can be transportation ("tomo tawa" is a kenning for "car", literally a "moving room/house"). It can also indicate purpose, the official wordlist giving "in order to" as one meaning, but that would leave what Woo said as an incomplete sentence. I think the most likely translations would be "I'm leaving"/"Goodbye" or "My journey". Qwertystop (talk) 02:17, 25 June 2018 (CEST)
Toki Pona and Potbelly Hill
Names represented in toki pona are generally converted to the restricted soundset used by it. "Potbelly Hill" would translate as "nena poteli" or "nena potipeli", "nena" being "hill"(or "nose", or generally "protuberance"). If the meaning/intent of the word "potbelly" were translated rather than transliterated, it would be "nena pi sinpin suli", word-for-word literally "hill of chest of bigness" ("sinpin" is "chest", "body", "front", "wall").
Transliterating "Bauchigen", I'm less sure because I don't know German. If that's pronounced roughly the way it looks to an English-speaker, it would transliterate to something like "p[a/u]si[s/k]en".
Qwertystop (talk) 03:00, 26 June 2018 (CEST) Ah. Yeah, I suppose "insa" does work better than "sinpin" here, good catch. Though the single-quote isn't a part of toki pona as such; I don't know if it has meaning in other sections of the text, but it looks from that like it might just be an alternate space?