Download version 1.0
What's this one about?
I wrote a text librarian... well, I've been writing a text librarian for nearly 30 years. It was intended to save space on floppy disks way back when, but I've kept it up since I think it's kinda neat. I've used it recently for Machine Essays, but I like how it works and looked for other uses. This is one.
Words Masked is an iteration on a code pad. It takes input text and produces a textual list of tokens and two files that are keys to those tokens. The tokens can be placed anywhere with no hope (I believe) of the reader being able to understand the text without the supporting files.
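To make the code pad idea concrete, here is a toy sketch in Python. It is not how Words Masked is implemented: the `mask` and `unmask` names, the single in-memory `pad` dictionary (standing in for the two key files), and the choice of a fresh random 8-letter token for every word are all assumptions made for illustration.

```python
import secrets
import string

TOKEN_ALPHABET = string.ascii_letters  # assumption: tokens are plain letters
TOKEN_LENGTH = 8                       # matches the 8 character tokens described on this page


def new_token(used):
    """Draw random tokens until one is found that is not already in the pad."""
    while True:
        token = "".join(secrets.choice(TOKEN_ALPHABET) for _ in range(TOKEN_LENGTH))
        if token not in used:
            return token


def mask(text):
    """Replace every word with a random token; return (tokens, pad)."""
    pad = {}      # hypothetical stand-in for the key files: token -> word
    tokens = []
    for word in text.split():
        token = new_token(pad)
        pad[token] = word
        tokens.append(token)
    return tokens, pad


def unmask(tokens, pad):
    """Look each token up in the pad and rejoin the words with single spaces."""
    return " ".join(pad[token] for token in tokens)


tokens, pad = mask("meet me at the usual place")
restored = unmask(tokens, pad)
```

In this toy version the tokens are drawn at random rather than computed from the words, so the mapping back to words exists only in `pad`.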
Why is this interesting?
Since this is not algorithmic, there would appear to be no way for a reader to ever interpret the masked text without the supporting files. The masked text is a block of random characters that have no relation to the content of the message: no patterns to observe, no sense to be gained from the masked text.
At least that's what I believe. I could be dead wrong and this isn't useful at all, but I don't think so. If you find a flaw, let me know.
As an example, here is a section of masked text. I'll prominently display the names of anyone who can determine what it says, though I don't think it's possible. Sample
How does it work?
Type or paste text into the window. Large text is OK; I've tested this with a 200k text block and it works dandy.
Then click Mask. Words Masked will crunch away masking the text. It might take a bit; that 200k file takes 10 seconds to process, for example.
When it's done you'll get a screen full of masked text, like this....
stwUqWMJltmLeGpbOLjtjlwIGIKUOn 29 eDlLPWDo 0000000000 HtVLTcIc 0000000001 qAffgHfc 0000000001 LZgdlNeO 0000000001 LSHtJspt 0000000001 hFZuGYRr 0000000001 ryvgVdjo 0000000001
Select Save and you'll be prompted for a location and name.
Words Masked will create 3 files. One is the txt file containing the masked text you see. The other two are a RWWMTOK file and a RWWMPAD file. These contain the intelligence to reconstruct the original text. Now you can put the masked text anywhere, like a web site, an email, or whatever. No one can determine what it says.
Just keep the RWWMTOK and RWWMPAD files in a separate place.
BTW, it also masks the punctuation & special characters.
Masked text isn't useful unless you can unmask it. To do so, you need to give the RWWMTOK and RWWMPAD files to the person who wants to unmask the text.
You can email them (separately from the masked text of course), put them on a server, or whatever.
The person on the other end then gets those files, puts them in the same directory, and selects Open. Navigate to any of the three files and select it. The files must all be in the same directory.
Note that the pad and token files are mandatory but the txt file is not; the user can paste the masked text into the text field if they want to.
Then press Unmask and the original text will be exposed.
Meaning... you mask the text and get a block of random characters and numbers. These are meaningless; there is no contextual information to be gained from analyzing that block or attempting to use algorithmic methods on it.
The block definition is pretty simple:
- a 20 character serial number
- a number indicating the length of the unmasked text
- a series of numbers and 8 character tokens. The number is the gap between the last word and this word, and the token is a reference into the words in the PAD file.
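The block layout above can be read back with a few lines of Python. This is a sketch, not the code Words Masked uses: `parse_masked_block` and `MaskedBlock` are hypothetical names, and the field order (serial, length, then token followed by its number) is an assumption taken from the sample block shown earlier on this page. The serial is read as whatever the first field is, with no assumption about its length.

```python
from typing import NamedTuple


class MaskedBlock(NamedTuple):
    serial: str     # first field of the block
    length: int     # length of the unmasked text
    entries: list   # list of (token, gap) pairs


def parse_masked_block(text: str) -> MaskedBlock:
    """Split a masked block on whitespace and group the fields."""
    fields = text.split()
    # Expect serial + length, followed by an even number of fields.
    if len(fields) < 2 or (len(fields) - 2) % 2 != 0:
        raise ValueError("unexpected number of fields in masked block")
    serial, length = fields[0], int(fields[1])
    entries = []
    # Assumption from the sample: each token is followed by its gap number.
    for token, gap in zip(fields[2::2], fields[3::2]):
        if len(token) != 8:
            raise ValueError(f"token {token!r} is not 8 characters long")
        entries.append((token, int(gap)))
    return MaskedBlock(serial, length, entries)


SAMPLE = (
    "stwUqWMJltmLeGpbOLjtjlwIGIKUOn 29 eDlLPWDo 0000000000 HtVLTcIc 0000000001 "
    "qAffgHfc 0000000001 LZgdlNeO 0000000001 LSHtJspt 0000000001 "
    "hFZuGYRr 0000000001 ryvgVdjo 0000000001"
)
block = parse_masked_block(SAMPLE)
```

Parsing the block gives you its structure and nothing else; the tokens still mean nothing without the RWWMTOK and RWWMPAD files.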
Note that the number was originally the place in the text where the word should be put (like position 10, position 230, etc.), but I realized I was leaking intelligence that way: one could (usually) determine the length of each word, and could at least try placing words of the correct length to work out the meaning. I changed to gaps between words, usually 1-3, and that removed the intelligence from the masked text. Interestingly, the modification took only one new method and 6 lines of other changes.
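The leak is easy to see on a small example. The sketch below is my reading of the position and gap schemes, not the actual Words Masked code; the function names are made up, and it assumes words are simply runs of non-whitespace characters.

```python
import re


def word_spans(text):
    """Return (start, end) offsets for each run of non-whitespace characters."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def absolute_positions(text):
    """Old scheme: record where each word starts in the text."""
    return [start for start, _ in word_spans(text)]


def gaps(text):
    """New scheme: record the distance from the end of the previous word."""
    result = []
    previous_end = 0
    for start, end in word_spans(text):
        result.append(start - previous_end)
        previous_end = end
    return result


def leaked_lengths(positions):
    """What a reader can infer from positions, assuming one space between words."""
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


text = "the cat sat on the mat"
positions = absolute_positions(text)   # [0, 4, 8, 12, 15, 19]
gap_list = gaps(text)                  # [0, 1, 1, 1, 1, 1]
leaked = leaked_lengths(positions)     # [3, 3, 3, 2, 3]
```

With absolute positions, subtracting neighbours gives the length of every word but the last. The gap list for ordinary single-spaced text is mostly 1s and carries no such information.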
The only way the sense can be regained is by acquiring the RWWMPAD and RWWMTOK files and using Words Masked to rebuild the text. So you can send or post the masked text with no concern about it being unmasked, and nothing can be discerned from the files themselves. Then simply provide the files to the person you want to unmask the text.
Obviously, protecting the files is important and your responsibility. Individually they are meaningless. Together they can unmask.
First, the masked text must not be altered. Every byte of it is important. If it is changed or damaged, Words Masked will not be able to unmask it, and may crash.
Also, the 3 components, the text and the two other files, must match. You can't use a RWWMPAD or RWWMTOK file from another masking. The 3 files have serial numbers that must match.
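Since every byte of the masked text matters, one way to catch accidental damage before trying to unmask is to compare a checksum taken when the text was saved with one taken on arrival. This is not a Words Masked feature, just a sketch using Python's standard `hashlib`; the `fingerprint` name is made up, and it assumes you have some way of passing the checksum along with the key files.

```python
import hashlib


def fingerprint(masked_text: str) -> str:
    """Return the SHA-256 hex digest of the masked text's UTF-8 bytes."""
    return hashlib.sha256(masked_text.encode("utf-8")).hexdigest()


# Sender side: record the fingerprint of the masked text as saved.
sent = "stwUqWMJltmLeGpbOLjtjlwIGIKUOn 29 eDlLPWDo 0000000000 HtVLTcIc 0000000001"
expected = fingerprint(sent)

# Receiver side: recompute and compare before unmasking.
received_intact = sent
received_damaged = sent.replace("eDlLPWDo", "eDlLPWDO")  # one character changed

intact_ok = fingerprint(received_intact) == expected      # True
damaged_ok = fingerprint(received_damaged) == expected    # False
```

A mismatch only tells you the text changed somewhere along the way, not where, so the fix is to fetch a clean copy.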
There might be other bugs; let me know if you find any.
Fidelity is very high in masking and reproduction, but I have encountered edge cases where the unmasked text did not 100% match the original. This is extremely rare and comes from how I pre-process the text before masking, but it may happen, so pre-flight your text before distributing. I can't give you specific things to look for, or I'da fixed them.
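One way to pre-flight is to mask the text, unmask it again yourself, and compare the result with the original. The sketch below does only the comparison step, using Python's standard `difflib`; the `preflight` name is hypothetical, and it assumes you have both versions available as strings (pasted or read from files).

```python
import difflib


def preflight(original: str, unmasked: str) -> list:
    """Return unified-diff lines; an empty list means no line differs."""
    return list(
        difflib.unified_diff(
            original.splitlines(),
            unmasked.splitlines(),
            fromfile="original",
            tofile="unmasked",
            lineterm="",
        )
    )


original = "Meet me at noon.\nBring the files."
clean = preflight(original, original)
changed = preflight(original, "Meet me at noon.\nBring the file.")

# For a strict byte-for-byte check, compare the strings directly as well,
# since splitting into lines hides differences in line endings.
exact_match = original == original
```

An empty `clean` list means the round trip preserved every line; anything in `changed` shows which lines to look at before you distribute.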
Words Masked only works for UTF-8 English text, sorry. Actually it might work for other single-byte languages, but I've never tried it and don't plan to.
And that's it
I think it's interesting. If you do too let me know.
Performance: A lot of tweaking has gone into making this as fast as I feel competent to make it. One thing I am not yet doing is stemming and the associated trees that grow from that. I know what I have to do to implement that, but I don't feel the need to put the work in for that fractional increase in speed.
What files does Words Masked create? Words Masked creates one file, ReallyBigList.RWTLIB, in the Application Support/Words Masked folder. Of course, each mask you make and save creates the 3 files necessary for transmittal and unmasking in the location you designate. Words Masked does not generate any network traffic or collect any information about users or usage.
What's going to change
Right now, these are single-use masked text and pad files, i.e., each masking and its resulting files are unique and can only be used as a set. I have a plan for a multi-use RWWMPAD file that would allow someone to distribute a single RWWMPAD file and use it for multiple maskings.
I haven't moved forward on it though, because I'm concerned about security. A RWWMPAD file could be compromised, and someone using a multi-use RWWMPAD file might continue to use it without knowing that someone else was easily unmasking their text.
However, if anyone really thinks this is an idea they would use, please let me know.
The dictionary of words I'm using to do this should contain everything you need. It's composed of 3 standard English word lists, plus 20 years of my own email correspondence, plus input from lots of random documents I had lying around on my hard drive.