Ishaan Mehta
Writing HTML has always been my personal nightmare. Remembering to close every element and constantly opening and closing angle brackets are pet peeves of mine, so when I decided to make a portfolio/project website I knew I had to do something different.
Because my primary goal was to make basic websites as easy to create as possible, I began by designing how the formatting file was going to look. In search of simplicity, I started out with a syntax similar to markdown:
```
# element here
## increasing hashtags to correspond to smaller headers
something to represent normal, paragraph style text
```
The system worked out well at first, and it was easy to throw together, since the only real check I had to make to determine what kind of element a line represented was whether it began with a '#'. But I realised that if I wanted to make multiple pages or add images I would have to refactor the entire codebase, so I designed a modular approach using a C union.
A union is a user-defined data type, which mostly just means a developer has to create them, but it's special because it stores multiple members in the same block of memory. Say we created the following union:
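Something like this minimal sketch (the member names and types here are just for illustration):

```c
union Example {
	int   var1;   /* all three members start at the same address */
	float var2;
	char  var3[8];
};
```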
var1, var2, and var3 are all stored at the same place in memory, and because of that they all share the exact same bytes. It's a little strange, thinking about how a string reacts to being in the same place as an int, but for my purposes it meant I could create arrays of generic element structs and use a union to store all the types of elements (text, images, etc.). The whole struct is pretty basic, looking something like this:
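In sketch form (the member names are illustrative, not the exact originals):

```c
typedef enum { TEXT, HEADER, IMAGE } ElementType;

typedef struct {
	ElementType type;        /* which union member is currently valid */
	union {                  /* anonymous union, more on this below */
		char* text;          /* TEXT and HEADER elements */
		struct {
			char* path;      /* IMAGE elements */
			char* alt;
		} image;
	};
} Element;
```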
Note that the union doesn't have a name; this is what's known as an anonymous union, and it was introduced in the C11 standard. Anonymous unions are the same as normal unions, but because they have no name the type can't be reused anywhere except where it's defined, and its members are accessed directly through the enclosing struct (element.text rather than element.u.text), which is incredibly useful for one-off unions that don't really need to exist anywhere else.
Now that we have an easily extensible element system, the sky's basically the limit. I've attempted to create programming languages in the past, so I applied the same logic to parsing my formatting syntax. I created a tokenizer that would split the loaded files into individual tokens with specific types (like TEXT or HEADER), then shipped the tokens off to a parser that turned them into a large array of elements. Those elements were then taken and shaped into HTML files. A few notable issues I encountered were splitting one formatting file into multiple HTML files, including other files to keep things readable and organized, and figuring out where a token started and stopped.
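Before getting into those issues, for a sense of what the tokenizer produced: a token in this kind of pipeline is little more than a tagged slice of the file buffer. A sketch (only TEXT and HEADER come from the description above, the rest is illustrative):

```c
#include <stddef.h>

typedef enum { TOKEN_TEXT, TOKEN_HEADER /* , ...more types over time */ } TokenType;

typedef struct {
	TokenType type;   /* what kind of token this is */
	char* start;      /* where it begins in the loaded file buffer */
	size_t length;    /* how many bytes it spans */
} Token;
```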
Originally the elements generated by the code were just turned into HTML and written to a generic index.html file, but the glaring issue with this was that not everything was meant to go onto the same page of the website. I could just create multiple files and run the program on each of them, or better yet, use code to iterate across all the formatting files and generate the HTML per file. I got started, using the opendir functions from dirent.h to open the current working directory and then find every file that had the right extension. It would then open a file and process its contents into an HTML file of the same name, then move on to the next file, over and over until all the files in the directory had been checked and processed, looking something like this:
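Here's a rough sketch of that loop; the .fmt extension is just a stand-in for whatever extension the formatting files actually use:

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void) {
	DIR* dir = opendir(".");   /* current working directory */
	if (dir == NULL) {
		perror("opendir");
		return 1;
	}

	struct dirent* entry;
	while ((entry = readdir(dir)) != NULL) {
		const char* ext = strrchr(entry->d_name, '.');
		if (ext == NULL || strcmp(ext, ".fmt") != 0)
			continue;          /* skip anything without the right extension */

		printf("processing %s\n", entry->d_name);
		/* open the file, parse it, and write out <name>.html here */
	}

	closedir(dir);
	return 0;
}
```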
As I built this system I created a small handful of files to test it, but as I was creating each one I realised that this just wasn't worth it. It was a simple system, and it meant I didn't have to devise some way to determine which file to put elements in, but it also meant the metadata of each file needed to be set separately, not to mention I still needed to edit numerous files, which was just as tedious as writing HTML by hand. So I needed a new system, and just as luck would have it, I also wanted to try out a new operating system on my laptop. While that sounds unrelated, and it kind of is, it forced me to edit a bunch of config files, config files that used the ini file format. And the way the ini file format separates sections of data? By enclosing the name of the section in square brackets, which sounded super easy for me to parse. So I finished switching operating systems (went from Arch Linux to NixOS then back to Arch Linux), and then added section tags that defined which file things were meant to go into when generating the HTML. My simple formatting language had evolved from a poor man's markdown to something that could actually be used to make a half decent website:
```
[index]
# A header inside of index.html
$how_neat.png | An image with alt text

[page1]
## Wow a smaller header
This one's even inside of a separate file
Let me link index.html real quick > (index)
And another link for my github > {https://github.com/MeanBeanie}
```
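Spotting those section tags while parsing is about as cheap as checks get; something like this sketch (the helper is illustrative, not the actual code):

```c
#include <stdbool.h>
#include <stddef.h>

/* A token that starts with '[' and ends with ']' names the HTML file
   that everything after it belongs to, until the next section tag. */
bool is_section_tag(const char* token, size_t length) {
	return length >= 2 && token[0] == '[' && token[length - 1] == ']';
}
```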
I also added the ability to create links (they just generate an <a href> for those who care), both to pages on the website and to external links. But because I had gotten rid of the ability to put things into multiple files, the formatting files suddenly ended up long and practically unreadable, which segues into my next big issue.
I had solved the issue of splitting one formatting file into multiple HTML files, but for readability's sake I needed to pivot to something else that would let me use multiple files. The most obvious choice was being able to include other files into some main formatting file, so that everything could stay in multiple files but the program only ever had to actually handle one. I saw the way most programming languages do it, using some sort of import or include token to make it clear when someone wants another file included in the program, and considered making a keyword system so I could do the same. Unfortunately it was also one in the morning, and I really didn't want to refactor the entire parser just so I could add other files, so while a video on Brainf*ck was playing in the background, I decided to make a period the include operator. This sounds nonsensical but it really isn't, because in Brainf*ck (an esoteric programming language that's real simple) the period operator is used to output data to stdout, which is basically the same thing I was doing to include files in the first place. I would load the file I wanted to include into a secondary buffer, expand the main buffer so it could fit the new data, then have the main buffer consume the secondary buffer. This meant I didn't have to inject tokens into my token array or anything else super complicated, since the tokenizer was just processing the main buffer normally the whole time. So a dot became my new include operator, and I set up code that did roughly the following:
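The sketch below shows the rough shape of it; the helper name and error handling are mine, not the actual code:

```c
#include <stdio.h>
#include <stdlib.h>

/* Append the contents of `path` onto the end of the main buffer,
   growing the buffer so everything fits. Returns the (possibly
   moved) buffer and updates *main_len. */
char* include_file(char* main_buf, size_t* main_len, const char* path) {
	FILE* f = fopen(path, "rb");
	if (f == NULL) {
		perror("fopen");
		return main_buf;
	}

	fseek(f, 0, SEEK_END);
	long inc_len = ftell(f);   /* size of the file being included */
	rewind(f);

	char* grown = realloc(main_buf, *main_len + (size_t)inc_len + 1);
	if (grown == NULL) {
		fclose(f);
		return main_buf;
	}

	/* the main buffer "consumes" the included file at its end */
	fread(grown + *main_len, 1, (size_t)inc_len, f);
	*main_len += (size_t)inc_len;
	grown[*main_len] = '\0';

	fclose(f);
	return grown;
}
```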
This was going great, but I ran into the last of my big issues right after. Because I had no real way of sizing up the tokens, and I was too lazy to implement one, I would just add a new token's length to the iterator that walked the main buffer so it would skip ahead to just past the token. This worked out great until I needed to process the included file data, since it was appended to the end and sometimes left me in the middle of a token.
Tokens are weird little things. They aren't consistent in size and they aren't consistent in content, so there's no real way to differentiate them. At least separating them is simple enough, because you can just cut off a token at any whitespace (a space, a tab, a newline, etc.), right? But no, my friend, because if you want to support sentences that are longer than exactly one word, whitespace needs to be okay sometimes. The issue isn't picking a token separator, it's determining when the separator is needed and when it should be ignored. Originally the text wasn't put into quotes the way it is now; it was just left sitting in the file. This worked because I turned all unknown tokens into text tokens, then merged adjacent text tokens together to form the multi-word sentences I needed. That worked wonders until I started including files: since the included data was being smushed onto the end of the current file, or in many cases smushed into the middle of a file, I could no longer just assume loose text was significant and not random gibberish. Patches were easy enough at first: add checks for whether quotes were used to determine when loose text actually mattered, then maybe a check or two for code blocks and headers. But the issue came back in one form or another until I eventually bit the bullet and refactored the tokenizer and parser like I said I wouldn't.
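For reference, the old merge step looked roughly like this, building on the Token sketch from earlier (the details are illustrative):

```c
#include <stddef.h>

/* Collapse runs of adjacent TEXT tokens into one token, so loose
   words become whole sentences. Assumes the tokens appear in buffer
   order and all point into one contiguous buffer. */
size_t merge_text_tokens(Token* tokens, size_t count) {
	size_t out = 0;
	for (size_t i = 0; i < count; i++) {
		if (out > 0 && tokens[out - 1].type == TOKEN_TEXT
		            && tokens[i].type == TOKEN_TEXT) {
			/* stretch the previous token to cover this one too */
			tokens[out - 1].length =
				(size_t)(tokens[i].start - tokens[out - 1].start)
				+ tokens[i].length;
		} else {
			tokens[out++] = tokens[i];
		}
	}
	return out;   /* new, smaller token count */
}
```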
The actual refactoring was the easy part; it was deciding how I should check and separate tokens that took the longest. Every time I thought I'd come up with a system that maintained the ease of use this project was started for, while also being easy to parse, I realised the idea was either too complicated or just plain stupid. Eventually I settled on bounding all significant text with quotes, which removed a little of the ease of use I had when quotes weren't required, but made figuring out when text should be considered important miles easier.
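The resulting rule is simple to express; here's a sketch of quote-aware token splitting (the function shape is illustrative, not the actual code):

```c
#include <ctype.h>
#include <stddef.h>

/* Find the next token starting at pos: a double-quoted run counts as
   one token, anything else ends at the first whitespace. The bounds
   land in *start/*end, and the return value is where to resume. */
size_t next_token(const char* buf, size_t pos, size_t* start, size_t* end) {
	while (buf[pos] != '\0' && isspace((unsigned char)buf[pos]))
		pos++;                               /* skip whitespace between tokens */
	*start = pos;

	if (buf[pos] == '"') {                   /* quoted: whitespace is allowed */
		pos++;
		while (buf[pos] != '\0' && buf[pos] != '"')
			pos++;
		if (buf[pos] == '"')
			pos++;                           /* consume the closing quote */
	} else {                                 /* unquoted: whitespace ends it */
		while (buf[pos] != '\0' && !isspace((unsigned char)buf[pos]))
			pos++;
	}

	*end = pos;
	return pos;
}
```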
Finally we get to the end, with the project in a working state and a website being generated from a bunch of vaguely easy-to-write plaintext. And for proof that it works, you can scroll back up to the start of the page and marvel that this silly website was generated from a couple of text files, a little bit of elbow grease, and a tad bit of duct tape. Through the project I learned a lot more about how parsing programming languages works, thanks to my frankly stupid decision to parse the formatting files like a language instead of just using some existing framework or library. I also improved a decent portion of my memory management skills in C, since all the tokens and buffers needed to be dynamically allocated so I could manipulate them without having like 4 headaches simultaneously.