Introduction
I write my notes in Markdown, and occasionally I need to share notes that explain concepts or progress with other people. While Markdown is easy to write, it is not well suited for sharing in such situations. The solution is to convert the Markdown note to a more widely accepted format, which for my use cases is PDF or HTML. I often prefer HTML for its presentation quality, its portability through web browser support on a wide range of devices (including mobile phones) and its better support for embedding multimedia and other web-based content. (This would of course not be the case for a book.) Furthermore, many tools such as email clients or team messaging platforms may be able to render HTML natively within the application.
The translation of Markdown to the target format is accomplished by the Pandoc utility. The tool further satisfies the following requirements important for my use cases:
- Generate a standalone document by embedding any required resources into the final document. The receiving client must be able to render the document offline without running a webserver or the need to resolve other dependencies.
- Support for mathematics using LaTeX markup syntax.
- Ability to modify elements of a document by access to its internal abstract syntax tree representation.
Pandoc can generate a PDF document in a straightforward manner. The math markup requires a working LaTeX distribution installed on the system, which Pandoc uses to render the PDF. External resources such as images are embedded in the PDF naturally.
Generating a static HTML document is a bit more involved due to the math requirement above. Math rendering on the web is typically offloaded to the client, which then depends on a JavaScript math library. Mathematical content is static by nature and, given the offline requirement above, dynamic client-side rendering with JavaScript is not an option. To end up with static HTML math markup, the approach described here uses the KaTeX library at the time the document is compiled (server-side math rendering).
I use a Bash shell script called rendernote to convert Markdown to the target
formats mentioned above. The Git repository including that script and CSS style
sheets is located here.
It uses some Bash-only features and was tested on a Linux system. Some commands
in the script may not be available in this form on a BSD or macOS system. The
following sections explain some details of my Pandoc document conversion
approach. All of the code is located in the rendernote script.
Markdown to PDF Translation
As mentioned above, translation to PDF is straightforward but requires a working
LaTeX distribution, which Pandoc uses when rendering Markdown to PDF.
Alternative PDF engines are described in the documentation and may work, but
have not been tested. The render_pdf function in the rendernote script performs
this task by calling
pandoc \
    --from markdown --to pdf \
    --highlight-style pygments \
    --output "${1%.*}.pdf" "${input}"
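The output name in the last line of the command is built with shell suffix removal. A minimal sketch of that expansion follows; the helper name pdf_name is hypothetical and does not appear in rendernote.

```shell
# Hypothetical helper: derive the PDF file name from a Markdown path.
pdf_name() {
    # ${1%.*} removes the shortest trailing match of ".*",
    # which is the last dot and everything after it.
    printf '%s.pdf\n' "${1%.*}"
}

pdf_name notes/turbulence.md   # prints notes/turbulence.pdf
pdf_name archive.tar.md        # prints archive.tar.pdf
```

Note that the pattern strips everything from the last dot onward, so a path such as my.notes/readme (a dot in a directory name and no file extension) would yield my.pdf. That is harmless as long as the input always ends in .md.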
The --highlight-style option selects the pygments style for code
highlighting. Pandoc further supports metadata headers for specific document
settings, which are interpreted by the corresponding translation engine. The
render_pdf function checks for the presence of such a (YAML) header and adds a
default header if none is found. This default header sets the page geometry,
the default heading font family and possibly other settings using standard
LaTeX commands typically found in the preamble (specified by the
header-includes sequence) or at the beginning of the document body (values in
the include-before sequence). See the Pandoc man page for further
documentation. The default header in the script specifies the values:
fontsize: 12pt
papersize: a4
linkcolor: blue
header-includes:
  - \usepackage[top=60pt,bottom=60pt,left=80pt,right=80pt]{geometry}
  - \usepackage{bm}
  - \usepackage{sectsty}
include-before:
  - \allsectionsfont{\sffamily}
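The header check described above could be sketched in shell as follows. This is an assumption about how such a check might look, not the code from rendernote: the function names has_yaml_header and with_default_header are hypothetical, only three of the default values are reproduced, and a header is detected only when the very first line of the file is exactly ---.

```shell
# Hypothetical sketch: succeed if the first line of the file opens a YAML block.
has_yaml_header() {
    # Read only the first line; an empty file leaves the variable empty.
    IFS= read -r first_line <"$1" || true
    [ "${first_line}" = "---" ]
}

# Print the document, preceded by a default header if it has none.
with_default_header() {
    if ! has_yaml_header "$1"; then
        # Abbreviated default header (assumption: a subset of the real one).
        printf '%s\n' '---' 'fontsize: 12pt' 'papersize: a4' 'linkcolor: blue' '---' ''
    fi
    cat -- "$1"
}
```

The output of with_default_header can be piped into Pandoc, which reads from standard input when no input file is given.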
This example
turbulence.md
Markdown document contains some common elements such as math, hyperlinks, block
quotes, code blocks as well as images and can be converted to a
turbulence.pdf
document with the command
rendernote -pdf turbulence.md
Markdown to Static HTML Translation
Pandoc supports options --standalone and --embed-resources. For PDF
translation the former is implied and resources such as images are embedded into
the PDF by default. For HTML translation the former ensures that the generated
HTML file includes proper HTML markup (html, head and body tags) such that
it could be viewed in a web browser by simply opening it. Rendering content or
fetching remote CSS style sheets would still require a running webserver (for
example python -m http.server). While the generated HTML file is standalone,
it is not self-contained and may depend on external files such as images
specified by a file system path or a network connection to fetch remote content.
The --embed-resources option will attempt to fetch such external resources
(local or remote) and encode them within the HTML file. Specifying both of these
options then results in a truly self-contained HTML file for offline
rendering, at the cost of a larger file size due to the embedded payload.
Pandoc has several options for rendering math in HTML and supports KaTeX
natively with the --katex option. Specifying this option in addition to the two
options discussed above would be sufficient to produce a self-contained HTML
file with support for math. However, the resulting file has several defects
that make this approach undesirable:
- Math rendering is deferred to the client which introduces an unnecessary JavaScript dependency.
- The --embed-resources option will resolve this dependency by embedding a large amount of JavaScript code in the HTML file, which in turn blows up the file size unnecessarily.
- The client-side rendering introduces an overhead which may result in slow page loading performance for notes with a significant amount of math.
For example, a tiny.md Markdown file with the content $a^2 + b^2 = c^2$ is
18 bytes in size. The tiny.html file generated with
pandoc --to html --standalone --embed-resources --katex -o tiny.html tiny.md
is 1731663 bytes in size, almost 100'000 times the size of the Markdown file.
The large file size is due to the encoded payload for JavaScript and font files
required for rendering the math. The JavaScript dependency can be fully
eliminated using server-side math rendering at the time the HTML document is
created. Some further compression can be achieved by selecting only desired
formats for font files (may not support all browsers however). To achieve this,
the steps implemented in the rendernote script are:
- Tag math in the input Markdown file with special labels by modifying the abstract syntax tree (AST) representation in Pandoc.
- Generate rendered HTML for the labeled math using the KaTeX library in a Node.js application.
- Convert the intermediate document to a self-contained HTML document.
These steps are implemented in the
render_html
function and are described in more detail in the sections below.
Math Pre-Processing using Lua Filters
Pandoc supports JSON and Lua filters
which can be used for AST transformations of a given input. The idea for this
pre-processing step is to wrap inline and display math in between HTML tags that
can later be used to identify math nodes by parsing the HTML DOM. Fortunately,
adding these tags can be achieved easily in Pandoc using a Lua filter. Such a
filter is simply a Lua function with the same name as the node of the target
object in the AST. Every node in the AST with the same name as the filter will
then be replaced with the return value of the filter call. The argument passed
to the filter is the value of the currently existing node. Math objects in
Pandoc are simply given the name Math. The filter used in the rendernote script
to transform math nodes is given by the following Lua code:
function Math(elem)
  assert(FORMAT:match('html'))
  local wrap
  if elem.mathtype == 'InlineMath' then
    wrap = '<latexinline>' .. elem.text .. '</latexinline>'
  else
    wrap = '<latexdisplay>' .. elem.text .. '</latexdisplay>'
  end
  return pandoc.RawInline('html', wrap)
end
The function must be named Math and takes one argument, which holds the
value of the current node the filter is applied to. All that this filter does
is wrap the LaTeX code contained in elem.text within HTML tags, which
are either <latexinline> for inline math or <latexdisplay> for display math.
The pre-processed math is then returned in a new Pandoc node for raw HTML code
(therefore, this filter only works for HTML targets). The rendernote script
then generates intermediate HTML code for the Markdown input using Pandoc:
pandoc \
    --from markdown --to html \
    --metadata title="$(basename --suffix=.md -- "${1}")" \
    --highlight-style pygments \
    --lua-filter "${lua_filter}" \
    --template "${html_template}" \
    "${1}" >"${raw_html}"
The Lua filter is defined in the file pointed to by the variable lua_filter.
The Pandoc call further uses an HTML template stored in the file pointed to by
html_template (see this command for the details) and writes the intermediate
HTML to the file pointed to by raw_html. The example Markdown input
Some inline math $a^2 + b^2 = c^2$ in a sentence.
is then filtered to HTML that looks like
<body>
<p>Some inline math <latexinline>a^2 + b^2 = c^2</latexinline> in a
sentence.</p>
</body>
For comparison, the default HTML generated without the Lua filter looks like the following:
<body>
<p>Some inline math <span class="math inline"><em>a</em><sup>2</sup> +
<em>b</em><sup>2</sup> = <em>c</em><sup>2</sup></span> in a sentence.</p>
</body>
The next step is to render the math in the intermediate HTML code using KaTeX.
Server-Side Math Rendering with KaTeX
If there were filtered math nodes in the AST, they will now be rendered to valid
HTML using the renderToString function from the KaTeX API. This is done with a
small piece of JavaScript in the render_latex function that is executed with
Node.js. A here document is used for this, which is fed into the node command
using a pipe. The paths for the input and output HTML files are substituted in
the here document with variable expansions. The JavaScript used for the
server-side math rendering looks as follows:
const katex = require('katex');
const {parseHTML} = require('linkedom');
const fs = require('node:fs');
fs.readFile('${1}', 'utf8', (err, content) => {
  const {document} = parseHTML(content.toString());
  const inline_items = document.querySelectorAll('latexinline');
  const display_items = document.querySelectorAll('latexdisplay');
  inline_items.forEach((item) => {
    const katex_code = katex.renderToString(item.innerHTML, {
      output: 'html',
      displayMode: false,
    });
    item.outerHTML = katex_code;
  });
  display_items.forEach((item) => {
    const katex_code = katex.renderToString(item.innerHTML, {
      output: 'html',
      displayMode: true,
    });
    item.outerHTML = katex_code;
  });
  fs.writeFile('${1}', document.toString(), err => {});
});
This reads the intermediate HTML file from the previous step and parses it into
an HTML DOM using the linkedom package (line 5). The custom tags for the math
nodes are then queried and processed with forEach loops, where the
katex.renderToString function is used to replace the tags with rendered HTML
math. Finally, the processed HTML is written back to the same file that was
specified for the input, which is OK since fs.readFile reads the full file into
memory before the callback is executed. The function further downloads the
latest katex and linkedom modules using npm. These modules are stored in the
directory where the rendernote script is located and are only downloaded if
they do not exist.
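The here document technique mentioned above can be illustrated without Node.js. The following sketch only prints the generated JavaScript so that the effect of the variable expansion is visible; the function names emit_script and emit_literal are hypothetical and the JavaScript is shortened.

```shell
# Hypothetical helper: print the JavaScript that would be sent to node.
# The delimiter EOF is unquoted, so the shell expands ${1} in the body.
emit_script() {
    cat <<EOF
const fs = require('node:fs');
fs.readFile('${1}', 'utf8', (err, content) => { /* render math here */ });
EOF
}

# Quoting the delimiter disables expansion, so ${1} stays literal.
emit_literal() {
    cat <<'EOF'
fs.readFile('${1}', 'utf8', callback);
EOF
}

emit_script /tmp/raw.html
emit_literal /tmp/raw.html
```

Piping the output of such a function into node instead of printing it would execute the script. One caveat of this pattern is that the path is pasted into a JavaScript string literal, so a file name containing a single quote would break the generated code.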
Encoding Math Fonts using Data URIs
After the rendering pass, the static_katex function is executed to generate
static CSS and font data for KaTeX. The function fetches the data from a CDN
server and stores it in a static directory relative to the location of the
rendernote script. The function creates static font files by converting the CDN
fonts to base64 encoded data URIs. It only fetches woff fonts by default.
Alternatively, other font formats could be defined here. KaTeX related CSS is
stored in the directory static/css/katex/${katex_version}, where
${katex_version} is determined by npm.
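The conversion of a font file into a base64 encoded data URI can be sketched as follows. This is an illustration under assumptions rather than the static_katex implementation: the function names font_data_uri and font_face_rule are hypothetical, the MIME type font/woff is assumed to match the woff default, and the demo uses stand-in bytes instead of a real font.

```shell
# Hypothetical helper: encode a font file as a base64 data URI.
font_data_uri() {
    # tr removes the line wrapping that some base64 implementations add.
    printf 'data:font/woff;base64,%s' "$(base64 <"$1" | tr -d '\n')"
}

# Emit an @font-face rule that embeds the font instead of linking to it.
font_face_rule() {
    printf '@font-face { font-family: "%s"; src: url(%s) format("woff"); }\n' \
        "$1" "$(font_data_uri "$2")"
}

# Demo with stand-in bytes (not a real font file).
demo_font="$(mktemp)"
printf 'hello' >"${demo_font}"
font_face_rule DemoFont "${demo_font}"
rm -f "${demo_font}"
```

Base64 encoding grows the data by roughly one third, which is part of the reason the embedded fonts dominate the size of the final HTML file.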
Creating the Final Standalone HTML File
The final step is to run Pandoc in a second pass with the intermediate HTML code
as input. This time the --standalone and --embed-resources options are
passed to Pandoc as well. The same HTML template is used for this second pass as
was already used during the first pass. In addition, this call adds the
--include-in-header option to pass the CSS style sheets prepared for the final
document. These style sheets include a default style sheet and possibly the
KaTeX related style sheets discussed in the previous section. A Markdown
document without math will only include the default style sheet.
Following up with the same tiny.md example used at the beginning of the
Markdown to Static HTML Translation section, the
command
rendernote tiny.md
generates a tiny.html file with a size of 431666 bytes, which is about 4 times
smaller than the standalone HTML file generated by the native Pandoc approach.
The file is also free of JavaScript. While almost half a megabyte is still
quite large for such tiny content, 94% of the total file size is attributed to
the base64 encoded math fonts included in the payload of the standalone HTML
file. Removing the inline math $ markers from the tiny.md file results in a
file that is only 4282 bytes in size.
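The size figures quoted for the tiny.md example can be double-checked with shell integer arithmetic. The helper names size_ratio and percent_of are hypothetical; the byte counts are the ones reported above.

```shell
# Integer ratio of two byte counts (shell arithmetic truncates).
size_ratio() {
    echo $(( $1 / $2 ))
}

# Number of bytes that a given percentage of a total corresponds to.
percent_of() {
    echo $(( $1 * $2 / 100 ))
}

# The tiny.md content plus its trailing newline is 18 bytes.
printf '%s\n' '$a^2 + b^2 = c^2$' | wc -c | tr -d ' '

size_ratio 1731663 18       # native tiny.html vs. tiny.md: 96203
size_ratio 1731663 431666   # native tiny.html vs. rendernote tiny.html: 4
percent_of 431666 94        # bytes attributed to the math fonts: 405766
```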
Finally, the example
turbulence.pdf file
generated in the previous section Markdown to PDF Translation
amounts to 3.7 MB. The same Markdown input converted to HTML amounts to
2.7 MB (versus 3.9 MB using the native Pandoc approach). The payload
for the two
images and encoded math fonts corresponds to 74% (2.0 MB).
For comparison, the standalone
turbulence.html
version can be viewed by following the link. The file is created with the
command
rendernote turbulence.md