Lacking Natural Simplicity

Random musings on books, code, and tabletop games.

EPUB files, Markup Languages, and briefly Unix

What follows is a lightly edited version (for clarity and relevance) of the postscripts from an email that I recently wrote, transferred here for posterity and the general good.


Danger! Danger Will Robinson! Danger! The postscripts and footnotes are much longer than the main body of the reply! And the footnotes are longer than the text of the postscripts!

P.S. H., P. (and H. M., if you are interested, though I admit this combines some of my more geeky interests and thus may be of less interest to all of you, or to Howard and Paul, for that matter):

I actually figured out how to make ebooks (to a limited degree) because I wanted to try an ebook I made of an RPG adventure I wrote for a currently-on-hiatus0 fantasy Savage Worlds roleplaying game campaign for my daughter Lily and my niece and nephews (N1). I originally wrote the adventure1 in three typesetting systems which use markup languages: LaTeX, ConTeXt, and troff2 (which I usually use in its guise as GNU groff, though this time I used Heirloom troff, part of the Heirloom Documentation Tools, for its easy access to modern fonts). I wanted to compare the markup languages and their PDF output to decide which one I preferred to use. Later I converted the adventure to ReStructuredText, a lightweight markup language3 that I use, to compare it to the other markup languages.

I have used ReStructuredText on and off for many years, but the main drawbacks to it were that (1) the output produced by its original docutils implementation was excessively stark and difficult to customize to have a nicer appearance, and (2) its workflow was somewhat difficult.4 Some time ago I discovered Pandoc, a “universal document converter” which can read many input formats, including ReStructuredText, and produce output in many output formats, including PDF (via LaTeX, ConTeXt, or troff, in ways whose appearance is easier to customize) and HTML, and, as it turns out, and importantly to this story, EPUB, the most common format for ebooks! I started using Pandoc because it made it easier to generate PDF from ReStructuredText with one command (since Pandoc runs all the intermediate steps and cleans up any temporary files needed). It turned out that the abilities to read multiple input formats and to customize the output more easily were important to me as well.
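To give a feel for what "one command" means here, this is a sketch of the kind of invocation involved; the file name is a hypothetical stand-in, and it assumes Pandoc plus a TeX or ConTeXt installation to act as the PDF engine:

```shell
# One step; Pandoc runs the intermediate conversions and cleans up after itself.
pandoc adventure.rst -o adventure.pdf                      # via LaTeX, the default engine
pandoc adventure.rst -o adventure.pdf --pdf-engine=context # via ConTeXt instead
```

Compare that with the multi-step docutils toolchain described in footnote 4 below.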

So, having converted the adventure over to ReStructuredText for comparison,5 and at first producing PDF through Pandoc's troff -ms output, I soon decided to take a look at Pandoc's other output formats. I started with LaTeX and ConTeXt. The PDF output via LaTeX was not of much interest to me, but the PDF output via ConTeXt offered greater control over the appearance of the final PDF, and the opportunity, by writing Lua filters, of adding features that lightweight markup languages normally don't offer, such as indexes and cross references that are both hyperlinks and include page numbers and section names in the PDF output. I didn't need those features in the adventure document, but I expect to need them in future documents.

But back to the important point: Pandoc can produce EPUB output for ebooks! Since I already had the adventure in ReStructuredText, and Pandoc produces EPUB, and I have an ebook reader, a Kindle, it just made sense to figure out how to get it onto my Kindle! First I used Pandoc to generate the EPUB. That required figuring out how to generate a reasonably attractive cover. Then I wrote a small config file for Pandoc. Then I generated the EPUB output. Then I figured out how to convert that over to MOBI, one of the formats that the Kindle can use.6 Then I mailed it to my Kindle's email address, and it looked reasonably good!7
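The small config file was roughly of this shape; every concrete name below (file names, title, author) is a hypothetical stand-in, but the keys are the fields Pandoc's defaults files accept:

```yaml
# epub.yaml — a Pandoc defaults file for the EPUB build
from: rst
to: epub3
output-file: adventure.epub
epub-cover-image: cover.png
metadata:
  title: The Adventure
  author: A. Author
```

It is then used with something like `pandoc --defaults epub.yaml adventure.rst`.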

I hope you've enjoyed this twisty maze of passages, all different!

And with a Zork reference I really must end this email!

0

Pandemics are no fun!

I originally thought I'd get through this email without footnotes, but needs must when the devil drives. I rather enjoy footnotes in email messages, but they're not as convenient in Gmail as they used to be in Emacs. And since it offered the opportunity for a Shakespeare reference of sorts, I'm quite pleased, in general.

1

As it turns out, I actually wrote seven Savage Worlds adventures in troff, and then converted them all to LaTeX and ConTeXt for comparison later. I actually wrote the first three adventures using LibreOffice, a conventional office suite with a word processor, something I normally dislike but was giving another chance. After three adventures I decided that I wasn't going to do another in LibreOffice, and started looking for alternatives, hence comparing markup languages. I tend to like markup languages better than WYSIWYG editors; this may just be the programmer in me liking the idea of languages over WYSIWYG, but there did turn out to be significant advantages to switching to a markup language in the end. The primary one was that I could put character and creature descriptions in external files and reference them from the main file, rather than cutting and pasting them from one document to another, which meant I could just change the external file and the change would automatically be included the next time I generated output. With a WYSIWYG tool I'd have had to go back and cut and paste the changed material into every document every time I changed it, which would have been immensely tedious, horribly error prone, and all too frequent.
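In troff, for instance, the shared descriptions can live in their own files and be pulled into the main document with the `.so` (source) request; the file names here are hypothetical:

```troff
.\" main adventure file: shared descriptions live in external files
.so characters/captain-vane.t
.so creatures/dire-wolf.t
```

Change `creatures/dire-wolf.t` once, and every adventure that sources it picks up the change on the next build.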

2

This footnote is about LaTeX, ConTeXt, and troff, and peripherally about TeX, the progenitor of LaTeX and ConTeXt. Troff was one of the earliest computer typesetting systems, invented in 1973 as part of a scheme by the computer science researchers at Bell Labs to get a PDP-11 so they could have a time-sharing operating system like the earlier Multics, which ran on much more expensive hardware, which the researchers had worked on previously, and which they looked back on fondly after Bell Labs pulled out of that research. Bell Labs wouldn't just pay for a computer for the researchers to play with, so they proposed developing a computer typesetting system for the secretaries to use, largely for patent applications, something Bell Labs filed a lot of. Their scheme succeeded, and as a result they invented Unix and troff.

So, Unix was invented explicitly to run troff!

TeX, by contrast, was not invented until 1978, LaTeX in 1985, and ConTeXt not until 1990! (I wish I'd found out about that last one earlier!) TeX was invented because of Donald Knuth's desire to produce gloriously typeset books with mathematics for his multi-volume work The Art of Computer Programming. He finished TeX long ago, but is still working on those books.

All of these typesetting systems have what are called markup languages, where the text of the document is interspersed with commands distinguished in some way from the regular text. For instance, the command \begin{document} from LaTeX is typical of TeX, LaTeX, and ConTeXt, all of which are related. Troff uses backslash commands in the middle of text and commands on separate lines starting with periods, but historically those commands were limited to names of two characters. This was relaxed in the later troff implementation GNU groff, and in the Heirloom troff implementation, which extended the second troff implementation, ditroff, with features similar to GNU groff's but with easier access to modern fonts.
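As a rough illustration (the section title is hypothetical), the same heading looks like this in each system:

```
\section{The Ruined Tower}                  % LaTeX

\startsection[title={The Ruined Tower}]     ConTeXt
...
\stopsection

.NH 1                                       troff -ms (a two-character command name)
The Ruined Tower
```

The family resemblance between the first two, and troff's terse dot-commands on lines of their own, are both visible even in a fragment this small.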

I am particularly impressed by troff's ability to correctly typeset documents that I wrote 30 years ago and that others have written even earlier. It has never failed me in this task.† By contrast, this has often been a problem for me with documents from WYSIWYG systems, even when those documents were more recently created, including one significant one from 2004. (Star Office, I'm looking at you!‡) LaTeX is reasonably backward compatible; though it went through some big changes earlier on, it is now mostly stable. I did experience some compatibility problems, minor with my documents and major with complicated documents written by others. ConTeXt is generally stable, but it is developing rapidly and so has more changes, though the developers are good about backward compatibility. ConTeXt has also grown increasingly sophisticated: along the way it has subsumed both TeX and MetaPost and combined and extended them with the Lua scripting language (mentioned again below), producing something that is even more flexible and impressive than TeX and LaTeX.

Another thing I like about markup languages is the fact that they are plain text‖, and can be manipulated with any program you want. Before the emergence of XML-based document formats in Microsoft Word§ and Star Office, manipulating WYSIWYG documents with outside programs was practically impossible. Even now, the complexity of the ZIP packaging and XML markup makes them much, much more unpleasant to deal with. Kicking dead whales down the beach indeed! Being able to use any tool at all on a document is considerably more useful than being limited to the poor extension languages of Microsoft Word and LibreOffice, and usually much simpler.

† I have had to change a few external programs I've written to help in the build process. Perl was a problem here. (I tried to resist the footnote within the footnote, but again, needs must when the devil drives.)

‡ Sure, the current LibreOffice will open the file, but the formatting is significantly messed up. Earlier versions, if I remember correctly, did not open the file correctly.

§ I have never written a document in Microsoft Word for my personal use, though unfortunately I have used it often at work.

‖ I have delightedly taken to using Unicode characters in my plain text documents, as the ReStructuredText source of this document shows.

3

Lightweight markup languages, in contrast with TeX, LaTeX, ConTeXt, and troff, usually start from conventions used in plain text email messages and USENET posts in the olden days, like indicating *italics* by surrounding a phrase with asterisks. Most of them avoid the use of lots of keywords and backslashes of the sort TeX, LaTeX, ConTeXt, and to a partial extent troff use. Instead, they largely use the non-alphanumeric characters on a standard keyboard to indicate how the text should be typeset, without long command names. The lack of these long command names (or short ones, in troff's case) and the relatively unobtrusive nature of the non-alphanumeric characters make documents easier to read. This is why they are called “lightweight” markup languages. Wikipedia has a good article that explains and compares them. Another advantage of most lightweight markup languages is that since they don't generally use command names, native speakers of languages other than English don't have to learn English command names, a significant matter.
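For instance, a small ReStructuredText fragment (the content is a made-up example) reads almost like the plain text it renders from:

```
The Ruined Tower
================

The party sees a *flickering light* in the ruined tower,
and hears what might be **goblin drums**.

- a rope bridge, half cut
- a collapsed stair
```

The underlined title, asterisks, and hyphens are the entire markup; there is nothing a reader of the raw file has to skip over.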

I happen to prefer ReStructuredText, but Markdown is another very popular lightweight markup language that I sometimes use.

Another advantage of lightweight markup languages such as ReStructuredText and Markdown is that they often have programs producing multiple kinds of output (PDF and HTML are typical). Since lightweight markup languages make no pretensions to being programming languages (which the markup languages of the original typesetting systems do, since that was how they allowed customization and extension), writing programs to produce multiple output formats from them is simpler than writing programs to parse the heavy markup languages, which is the common approach people take to get HTML from LaTeX, for instance. The fact that heavy markup languages are usually Turing complete, can be extensively extended (and definitely are in practice), and often have programmable syntax makes processing them with other tools difficult, usually requiring much hand conversion. It is my impression that while LaTeX-to-HTML translators like TeX4ht and HEVEA are very good for documents that use only the standard features of LaTeX, they can't deal easily with heavily programmed documents, since that would require more semantic understanding of the original LaTeX source.

One interesting attempt in this direction for troff was the unroff program, written in Elk Scheme. It took the approach of implementing a complete troff parser and providing Scheme as an extension language so you could completely customize the output. It provided a complete implementation of the troff -ms macros, and I was easily able to extend those, in 170 lines of Scheme, to handle the cross references and indexes that I had extended that troff document's build process to provide.

4

In particular, there was no standard name for the commands used to generate various kinds of output; on some operating systems it was rst2latex, and on others rst2latex.py. Also, the docutils toolchain for producing PDF output generated intermediate LaTeX files which had to be processed with further tools, which usually meant writing a Makefile so I didn't have to retype multiple commands whenever I regenerated the output document. For a simple document that was considerable hassle and overhead, even if worth it for a more complicated document. (Makefiles are well worth it for complicated documents with complicated build processes, of course. I have lots of those.)
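A sketch of the kind of Makefile that workflow required, with hypothetical file names (and remembering that the first command may be rst2latex.py on some systems):

```make
adventure.pdf: adventure.tex
	pdflatex adventure.tex
	pdflatex adventure.tex  # run twice so cross references resolve

adventure.tex: adventure.rst
	rst2latex adventure.rst > adventure.tex

clean:
	rm -f adventure.tex adventure.aux adventure.log adventure.out
```

All of this is what a single pandoc invocation replaces.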

5

As a result of the comparison, I decided that I greatly preferred ReStructuredText, with Pandoc as the tool to process it. Pandoc's ability to customize its output using filters written in the programming language Lua was particularly appealing, as was the ability to customize its default templates for generating output using the troff -ms macros and ConTeXt. I see a use for both of those, since the -ms output is easier to customize for things that the base -ms provides, but the ConTeXt output offers greater control over the final appearance, though often at the cost of greater effort. For instance, I have a moderately long document† that is currently in DocBook 5.0 XML format; I now find it tedious to edit, and the open source tool for generating PDF from it has serious flaws. (I'm resisting another footnote in a footnote. Be impressed that I succeeded!) I can see how I could convert it to ReStructuredText (or Markdown, for that matter) and use Pandoc's ConTeXt output to produce a nicer, more attractive PDF. Now I just need the time to write the Lua filter and do the conversion. (Pandoc will convert it from DocBook, but will lose the indexing information, which I would have to redo from scratch, a task with more work than I want to contemplate at the moment.)
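To show the shape of such a filter, here is a minimal sketch of a Pandoc Lua filter, assuming ConTeXt output; the "index" span class is a hypothetical convention, not anything Pandoc defines:

```lua
-- Sketch: emit a ConTeXt \index entry for any span marked with the
-- (hypothetical) "index" class, keeping the span's text in the output.
function Span(el)
  if el.classes:includes("index") then
    local term = pandoc.utils.stringify(el)
    return {
      pandoc.RawInline("context", "\\index{" .. term .. "}"),
      pandoc.Str(term),
    }
  end
end
```

Run with something like `pandoc --lua-filter=index.lua -t context …`; in any other output format the raw ConTeXt inline is simply dropped.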

I still find uses for troff and ConTeXt. In particular, if I have to use complicated tables in a document, I find that either troff or ConTeXt works better. (Simple tables in either are OK from ReStructuredText output, but complicated ones…!)

† The DocBook version of the document was derived from the troff -ms source mentioned previously, though by the time the conversion happened I vaguely recall I no longer had access to a working unroff, I think because of bitrot. NetBSD has an unroff package in its pkgsrc collection of programs, and I could install it now on my NetBSD machine, but when I tried to process the document unroff exited complaining about a syntax error in one of its Scheme files. So bitrot seems to prevail.

6

Using an open source command line utility provided with the Calibre ebook manager, of course!
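That is, Calibre's ebook-convert tool, which infers the conversion from the file extensions; the file names here are hypothetical:

```shell
# Calibre picks the input and output formats from the extensions.
ebook-convert adventure.epub adventure.mobi
```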

7

There are some oddities in the current build, with the conversion to MOBI complaining about fonts not being found in the right places and being deleted from the result, but I don't know enough about ebooks to debug it at this time. Besides, I've hit the auspicious footnote number seven (though it's not the seventh footnote, as it is actually the eighth!) and should really finish this email now.

P.P.S. Omitted for irrelevance.

P.P.P.S. Sorry, no deeply nested parenthetical expressions this time!


Here's an addendum with two Apple Messenger messages to P., reflecting on converting this from an HTML email into a blog post:

The HTML dialect Google uses in its MIME emails is very odd. It doesn't use <p> elements, using instead <div> elements. Unfortunately, pandoc converts those into containers, and nests them according to the nesting of the <div> elements. To fix this I hand edited the HTML to remove the outer <div> elements and convert the remaining ones into <p>s. Also, for some reason when I ran the document through HTML Tidy it converted the Unicode characters into incorrect HTML character entities. I see now that it has a -utf8 switch, which I'll have to remember for the next time I do this. (There will inevitably be a next time.)
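For that next time, the invocation would be something like this (file names hypothetical; assumes HTML Tidy is installed):

```shell
# -utf8 makes tidy read and write UTF-8, so Unicode characters
# pass through instead of being turned into HTML character entities.
tidy -utf8 -o post-clean.html post.html
```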

OMG, now I have to put that in the blog post! How many saving throws am I going to fail today, anyway?

Last edited: 2020-12-28 11:11:54 EST
