Converting Markdown to Static HTML/PDF using Pandoc

Summary

Markdown is a simple, lightweight markup syntax that is well suited for note taking, documentation, README files and even website content. Pandoc is a powerful document converter that can translate a Markdown document into many other target formats. Together they form a versatile tool set for writing (often quick) notes and then converting them to a static offline format such as HTML or PDF that is easy to share with collaborators or clients. This post describes how to use Pandoc to convert Markdown input to PDF as well as to static standalone HTML files with math rendered server-side.

Introduction

I write notes in Markdown and occasionally need to share notes that explain concepts or progress with other people. While Markdown is easy to write, it is not well suited for sharing in such situations. The solution is to convert the Markdown note to a more socially acceptable format, which for my use cases is PDF or HTML. I often prefer HTML for its superior presentation quality, its portability through web browser support on a large array of devices (including mobile phones) and its better support for embedding multimedia and other web based content. (This would of course not be the case for a book.) Furthermore, many tools such as email clients or team messaging platforms may be able to render HTML natively in the application.

The translation of Markdown to the target format is accomplished by the Pandoc utility. The tool further satisfies the following requirements important for my use cases:

  • Generate a standalone document by embedding any required resources into the final document. The receiving client must be able to render the document offline without running a webserver or the need to resolve other dependencies.
  • Support for mathematics using LaTeX markup syntax.
  • Ability to modify elements of a document by access to its internal abstract syntax tree representation.

Pandoc generates a PDF document in a simple, straightforward manner. The math markup requires a working LaTeX distribution installed on the system, which Pandoc uses to render the PDF. External resources such as images are embedded in the PDF naturally.

Generating a static HTML document is a bit more involved due to the math requirement above. Math rendering on the web is typically offloaded to the client, which then depends on a JavaScript math library. Mathematical content is static by nature, however, and because of the offline requirement above, dynamic client-side rendering with JavaScript is not an option. To end up with static HTML math markup, the approach described here uses the KaTeX library at compile time of the document (server-side math rendering).

I use a Bash shell script called rendernote to convert Markdown to the target formats mentioned above. The Git repository including that script and CSS style sheets is located here. The script uses some Bash-only features and was tested on a Linux system. Some commands in the script may not be available in this form on a BSD or macOS system. The following sections explain some details of my Pandoc document conversion approach. All of the code is located in the rendernote script.

Markdown to PDF Translation

As mentioned above, translation to PDF is straightforward but requires a working LaTeX distribution, which Pandoc uses when rendering Markdown to PDF. Alternative PDF engines are described in the Pandoc documentation; they may work but have not been tested. The render_pdf function in the rendernote script performs this task by calling

pandoc \
    --from markdown --to pdf \
    --highlight-style pygments \
    --output "${1%.*}.pdf" "${input}"

The --highlight-style option selects the pygments style for code highlighting. Pandoc further supports metadata headers for document-specific settings, which are interpreted by the corresponding translation engine. The render_pdf function checks for the presence of such a (YAML) header and adds a default header in case none is found. This default header sets the page geometry, the default heading font family and possibly other settings using standard LaTeX commands typically found in the preamble (specified by the header-includes sequence) or at the beginning of a LaTeX document (values in the include-before sequence). See the Pandoc man page for further documentation. The default header in the script specifies the values:

fontsize: 12pt
papersize: a4
linkcolor: blue
header-includes:
  - \usepackage[top=60pt,bottom=60pt,left=80pt,right=80pt]{geometry}
  - \usepackage{bm}
  - \usepackage{sectsty}
include-before:
  - \allsectionsfont{\sffamily}
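
The header check described above can be sketched in a few lines. The following JavaScript only illustrates the logic and is not the code from the rendernote script (which is written in Bash). The name ensureYamlHeader is hypothetical, the default header is abbreviated to three of the values listed above, and it assumes that a header counts as present when the first line of the file is ---.

```javascript
// Abbreviated default header (only three of the values listed above).
const DEFAULT_HEADER = [
    '---',
    'fontsize: 12pt',
    'papersize: a4',
    'linkcolor: blue',
    '---',
    '',
].join('\n');

// Hypothetical helper: returns the Markdown text with a metadata header.
// Assumption: a header is present when the very first line is exactly
// '---', which is how a Pandoc YAML metadata block starts.
function ensureYamlHeader(markdown, header = DEFAULT_HEADER) {
    const firstLine = markdown.split(/\r?\n/, 1)[0];
    if (firstLine.trimEnd() === '---') {
        return markdown; // keep the settings already in the document
    }
    // Separate the default header from the content with a blank line.
    return header + '\n' + markdown;
}
```

A document that already starts with its own header is returned unchanged, so settings made in the note always take precedence over the defaults.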

This example turbulence.md Markdown document contains some common elements such as math, hyperlinks, block quotes, code blocks as well as images and can be converted to a turbulence.pdf document with the command

rendernote -pdf turbulence.md

Markdown to Static HTML Translation

Pandoc supports the options --standalone and --embed-resources. For PDF translation the former is implied and resources such as images are embedded into the PDF by default. For HTML translation the former ensures that the generated HTML file includes proper HTML markup (html, head and body tags) such that it can be viewed in a web browser by simply opening it. Rendering content or fetching remote CSS style sheets would still require a running webserver (for example python -m http.server). While the generated HTML file is standalone, it is not self-contained: it may depend on external files such as images specified by a file system path, or on a network connection to fetch remote content. The --embed-resources option attempts to fetch such external resources (local or remote) and encodes them within the HTML file. Specifying both options then results in a truly self-contained HTML file for offline rendering, at the cost of a larger file size due to the embedded payload.

Pandoc has several options for rendering math in HTML and supports KaTeX natively with the --katex option. Specifying this option in addition to the two options discussed above would be sufficient to produce a self-contained HTML file with support for math. However, the resulting file has several defects which make this approach undesirable:

  • Math rendering is deferred to the client which introduces an unnecessary JavaScript dependency.
  • The --embed-resources option will resolve this dependency by embedding a large amount of JavaScript code in the HTML file, which in turn blows up the file size unnecessarily.
  • The client-side rendering introduces an overhead which may result in slow page loading for notes with a significant amount of math.

For example, a tiny.md Markdown file with the content $a^2 + b^2 = c^2$ is 18 bytes in size. The tiny.html file generated with

pandoc --to html --standalone --embed-resources --katex -o tiny.html tiny.md

is 1731663 bytes in size, almost 100'000 times the size of the Markdown file. The large file size is due to the encoded payload for JavaScript and font files required for rendering the math. The JavaScript dependency can be fully eliminated by rendering the math server-side at the time the HTML document is created. The file size can be reduced further by embedding only selected font formats (which may not be supported by all browsers). To achieve this, the steps implemented in the rendernote script are:

  1. Tag math in the input Markdown file with special labels by modifying the abstract syntax tree (AST) representation in Pandoc.
  2. Generate rendered HTML for the labeled math using the KaTeX library in a Node.js application.
  3. Convert the intermediate document to a self-contained HTML document.

These steps are implemented in the render_html function and are described in more detail in the sections below.

Math Pre-Processing using Lua Filters

Pandoc supports JSON and Lua filters which can be used for AST transformations of a given input. The idea for this pre-processing step is to wrap inline and display math in HTML tags that can later be used to identify math nodes by parsing the HTML DOM. Fortunately, adding these tags is easy in Pandoc using a Lua filter. Such a filter is simply a Lua function with the same name as the node of the target object in the AST. Every node in the AST with the same name as the filter is then replaced with the return value of the filter call. The argument passed to the filter is the value of the currently existing node. Math objects in Pandoc are simply given the name Math. The filter used in the rendernote script to transform math nodes is given by the following Lua code:

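-- Pandoc applies this filter to every Math node in the AST.
function Math(elem)
    -- Raw HTML is emitted below, so only HTML targets are supported.
    assert(FORMAT:match('html'))
    local wrap
    if elem.mathtype == 'InlineMath' then
        wrap = '<latexinline>' .. elem.text .. '</latexinline>'
    else
        -- Any other math type is display math.
        wrap = '<latexdisplay>' .. elem.text .. '</latexdisplay>'
    end
    -- Replace the Math node with a raw HTML inline node.
    return pandoc.RawInline('html', wrap)
end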

The function must be named Math and takes one argument, which holds the value of the node the filter is currently applied to. All this filter does is wrap the LaTeX code contained in elem.text within HTML tags, which are either <latexinline> for inline math or <latexdisplay> for display math. The pre-processed math is then returned in a new Pandoc node for raw HTML code (therefore, this filter only works for HTML targets). The rendernote script then generates intermediate HTML code for the Markdown input using Pandoc:

pandoc \
    --from markdown --to html \
    --metadata title="$(basename --suffix=.md -- "${1}")" \
    --highlight-style pygments \
    --lua-filter "${lua_filter}" \
    --template "${html_template}" \
    "${1}" >"${raw_html}"

The Lua filter is defined in the file pointed to by the variable lua_filter. The Pandoc call further uses an HTML template stored in the file pointed to by html_template (see this command for the details) and writes the intermediate HTML to the file pointed to by raw_html. The example Markdown input

Some inline math $a^2 + b^2 = c^2$ in a sentence.

is then filtered to HTML that looks like

<body>
<p>Some inline math <latexinline>a^2 + b^2 = c^2</latexinline> in a
sentence.</p>
</body>

For comparison, the default HTML generated without the Lua filter looks like the following:

<body>
<p>Some inline math <span class="math inline"><em>a</em><sup>2</sup> +
<em>b</em><sup>2</sup> = <em>c</em><sup>2</sup></span> in a sentence.</p>
</body>
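
For quick experiments outside of Pandoc, the wrapping logic of the filter can be mirrored in plain JavaScript. This is only an illustrative sketch: the function name wrapMath is hypothetical, and the node argument is a stand-in object with the two fields mathtype and text that the Lua filter reads from the Pandoc Math element.

```javascript
// Hypothetical mirror of the Lua filter. `node` stands in for a Pandoc Math
// element and only needs the fields `mathtype` and `text`.
function wrapMath(node) {
    // Inline math gets <latexinline>, everything else is display math.
    const tag = node.mathtype === 'InlineMath' ? 'latexinline' : 'latexdisplay';
    return `<${tag}>${node.text}</${tag}>`;
}

// Reproduces the tagged math from the filtered HTML shown above.
console.log(wrapMath({ mathtype: 'InlineMath', text: 'a^2 + b^2 = c^2' }));
```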

The next step is to render the math in the intermediate HTML code using KaTeX.

Server-Side Math Rendering with KaTeX

If there are filtered math nodes in the AST, they are now rendered to valid HTML using the renderToString function from the KaTeX API. This is done with a small piece of JavaScript in the render_latex function that is executed with Node.js. The code is written as a here document which is fed into the node command using a pipe. The paths for the input and output HTML files are substituted in the here document with variable expansions. The JavaScript used for the server-side math rendering looks as follows:

 1const katex = require('katex');
 2const {parseHTML} = require('linkedom');
 3const fs = require('node:fs');
 4fs.readFile('${1}', 'utf8', (err, content) => {
 5    const {document} = parseHTML(content.toString());
 6    const inline_items = document.querySelectorAll('latexinline');
 7    const display_items = document.querySelectorAll('latexdisplay');
 8    inline_items.forEach((item) => {
 9        const katex_code = katex.renderToString(item.innerHTML, {
10            output: 'html',
11            displayMode: false,
12        });
13        item.outerHTML = katex_code;
14    });
15    display_items.forEach((item) => {
16        const katex_code = katex.renderToString(item.innerHTML, {
17            output: 'html',
18            displayMode: true,
19        });
20        item.outerHTML = katex_code;
21    });
22    fs.writeFile('${1}', document.toString(), err => {});
23});

This reads the intermediate HTML file from the previous step and parses it into an HTML DOM using the linkedom package (line 5). The custom tags for the math nodes can then be queried and processed with forEach loops where the katex.renderToString function is used to replace the tags with valid rendered HTML math. Finally, the processed HTML is written back to the same file as specified for the input, which is OK since fs.readFile reads the full file into memory before the callback is executed. The render_latex function further downloads the latest katex and linkedom modules using npm. These modules are stored in the directory where the rendernote script is located and are only downloaded if they do not exist.
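
The replacement step can also be tried without installing any npm modules. The sketch below is a simplification under stated assumptions and not the code used by rendernote: it swaps the DOM parser for a regular expression, the names renderTaggedMath and fakeRender are hypothetical, and fakeRender is only a stand-in for the real renderer. With KaTeX installed, the renderer passed in would be a small wrapper around katex.renderToString with the output and displayMode options used above.

```javascript
// Matches <latexinline>...</latexinline> and <latexdisplay>...</latexdisplay>.
// The backreference \1 ensures the closing tag matches the opening tag.
const MATH_TAG = /<(latexinline|latexdisplay)>([\s\S]*?)<\/\1>/g;

// Replace every tagged math node with the output of `render`.
// `render` receives the LaTeX source and a displayMode flag.
function renderTaggedMath(html, render) {
    return html.replace(MATH_TAG, (match, tag, tex) =>
        render(tex, tag === 'latexdisplay'));
}

// Stand-in renderer (assumption): marks the math instead of typesetting it.
const fakeRender = (tex, displayMode) =>
    `<span class="${displayMode ? 'display' : 'inline'}">${tex}</span>`;

const input = '<p>Some inline math <latexinline>a^2 + b^2 = c^2</latexinline> in a\nsentence.</p>';
console.log(renderTaggedMath(input, fakeRender));
```

A regular expression is sufficient here only because the custom tags are never nested and carry no attributes. For anything more general, a DOM parser as used in the script is the more robust choice.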

Encoding Math Fonts using Data URIs

After the rendering pass, the static_katex function is executed to generate static CSS and font data for KaTeX. The function fetches the data from a CDN server and stores it in a static directory relative to the location of the rendernote script. It creates static font files by converting the CDN fonts to base64 encoded data URIs. Only woff fonts are fetched by default. Alternatively, other font formats could be defined here. KaTeX related CSS is stored in the directory static/css/katex/${katex_version}, where ${katex_version} is determined by npm.
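
The conversion of a font file to a data URI can be illustrated with a few lines of Node.js. This is a minimal sketch of the technique and not the static_katex function itself: the names toDataUri and inlineWoffFonts are hypothetical, the loadFont callback stands in for however the font bytes are obtained (for example from the files fetched from the CDN), and the url(fonts/...) form of the references in the CSS is an assumption.

```javascript
// Encode raw bytes as a base64 data URI for the given MIME type.
function toDataUri(bytes, mimeType) {
    return `data:${mimeType};base64,${Buffer.from(bytes).toString('base64')}`;
}

// Replace url(fonts/<name>.woff) references in a CSS string with data URIs.
// `loadFont(path)` is a caller-supplied function returning the font bytes.
// References to other formats such as .woff2 are left untouched.
function inlineWoffFonts(css, loadFont) {
    return css.replace(/url\((fonts\/[^)]+\.woff)\)/g, (match, path) =>
        `url(${toDataUri(loadFont(path), 'font/woff')})`);
}

// Usage with dummy font bytes instead of a real font file.
const css = 'src: url(fonts/Example.woff) format("woff");';
console.log(inlineWoffFonts(css, () => Buffer.from('abc')));
```

Base64 encodes every three bytes as four characters, so an embedded font is roughly a third larger than the original font file.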

Creating the Final Standalone HTML File

The final step is to run Pandoc in a second pass with the intermediate HTML code as input. This time the --standalone and --embed-resources options are passed to Pandoc as well. The same HTML template is used for this second pass as during the first pass. In addition, this call adds the --include-in-header option to pass the CSS style sheets prepared for the final document. These style sheets include a default style sheet and possibly the KaTeX related style sheets discussed in the previous section. A Markdown document without math only includes the default style sheet.
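
To make the option set of this second pass concrete, the following sketch assembles the Pandoc arguments in Node.js. It is an illustration under assumptions rather than the call made by rendernote, which is a Bash script: the names buildFinalPassArgs and runFinalPass are hypothetical, and --from html is assumed for reading the intermediate file. Only Pandoc options mentioned in this post, plus --output, are used.

```javascript
// Assemble the command line arguments for the second Pandoc pass.
// All option values (paths) are supplied by the caller.
function buildFinalPassArgs({ input, output, template, headerFiles }) {
    const args = [
        '--from', 'html', '--to', 'html',   // assumption: HTML in, HTML out
        '--standalone', '--embed-resources',
        '--template', template,
    ];
    // One --include-in-header option per prepared style sheet file.
    for (const file of headerFiles) {
        args.push('--include-in-header', file);
    }
    args.push('--output', output, input);
    return args;
}

// Run Pandoc with the assembled arguments (requires pandoc on the PATH).
function runFinalPass(options) {
    const { execFileSync } = require('node:child_process');
    execFileSync('pandoc', buildFinalPassArgs(options), { stdio: 'inherit' });
}
```

For a document without math, headerFiles would contain only the default style sheet.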

Following up with the same tiny.md example used at the beginning of the Markdown to Static HTML Translation section, the command

rendernote tiny.md

generates a tiny.html file with a size of 431666 bytes, which is 4 times smaller than the standalone HTML file generated by the native Pandoc approach. The file is further free of JavaScript. While almost half a megabyte is still quite large for such tiny content, 94% of the total file size is attributed to the base64 encoded math fonts included in the payload of the standalone HTML file. Removing the inline math $ markers from the tiny.md file results in an HTML file that is only 4282 bytes in size.
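
The size ratios quoted for the tiny.md example can be checked with a few lines of arithmetic. The variable names below are only illustrative; the byte counts are the ones reported in this post.

```javascript
// File sizes in bytes as reported in this post.
const markdownSize = 18;          // tiny.md
const pandocKatexSize = 1731663;  // tiny.html generated with pandoc --katex
const rendernoteSize = 431666;    // tiny.html generated with rendernote

// Size of the native Pandoc HTML relative to the Markdown source.
const blowUp = Math.floor(pandocKatexSize / markdownSize);
// Size of the native Pandoc HTML relative to the rendernote HTML.
const reduction = (pandocKatexSize / rendernoteSize).toFixed(2);

console.log(blowUp);    // 96203
console.log(reduction); // 4.01
```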

Finally, the example turbulence.pdf file generated in the previous section Markdown to PDF Translation amounts to 3.7 MB. The same Markdown input converted to HTML amounts to 2.7 MB (versus 3.9 MB using the native Pandoc approach). The payload for the two images and encoded math fonts corresponds to 74% (2.0 MB). For comparison, the standalone turbulence.html version can be viewed by following the link. The file is created with the command

rendernote turbulence.md
