<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>fabs()</title><language>en-us</language><link rel="alternate" type="text/html" href="https://0xfab.ch/"/><link rel="alternate" type="application/rss+xml" href="https://0xfab.ch/feed.xml"/><lastBuildDate>Sat, 05 Jul 2025 12:05:07 +0200</lastBuildDate><item><title>Converting Markdown to Static HTML/PDF using Pandoc</title><link rel="alternate" type="text/html" href="https://0xfab.ch/2025/01/markdown-to-static-html/"/><pubDate>Mon, 03 Feb 2025 22:16:05 +0100</pubDate><author>info@0xfab.ch (Fabian Wermelinger)</author><guid>https://0xfab.ch/2025/01/markdown-to-static-html/</guid><category term="markdown"/><category term="html"/><category term="notes"/><category term="pandoc"/><description><![CDATA[ <p><a href="https://www.markdownguide.org/" target="_blank" rel="noreferrer noopener">Markdown</a> is a simple and lightweight document
markup syntax that is well suited for quick note taking, writing
documentation and <code>README</code> files, or even creating website content.
<a href="https://pandoc.org/" target="_blank" rel="noreferrer noopener">Pandoc</a> is a very powerful document converter that can
convert a Markdown document to various other target formats.  Together, the
two form a versatile tool set for writing (often quick) notes and then
converting them to a static offline target format such as HTML or PDF that can
easily be shared with collaborators or clients.  This post describes how to
convert Markdown input to PDF as well as static
standalone HTML files with support for 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">L</span><span class="mspace" style="margin-right:-0.36em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

 math rendered server-side
using Pandoc.</p>
<h2 id="introduction">Introduction</h2>
<p>I write notes using Markdown and occasionally I need to share notes that explain
concepts or progress with other people.  While Markdown is easy to write, it is
not well suited for sharing in such situations.  The solution then is to convert
the Markdown note to a more socially acceptable format, which for my use cases is
PDF or HTML.  Often I prefer HTML for its superior presentation quality, wide
portability through web browser support on a large array of devices (including
mobile phones), and better support for embedding multimedia and other web-based
content.  (This would not be the case for a book, of course.)
Furthermore, many tools such as email clients or team messaging platforms may be
able to render HTML natively in the application.</p>
<p>The translation of Markdown to the target format is accomplished by the
<a href="https://pandoc.org/" target="_blank" rel="noreferrer noopener">Pandoc</a> utility.  The tool further satisfies the following
requirements important for my use cases:</p>
<ul>
<li>Generate a standalone document by embedding any required resources into the
final document.  <em>The receiving client must be able to render the document
offline without running a webserver or the need to resolve other
dependencies.</em></li>
<li>Support for Mathematics using 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">L</span><span class="mspace" style="margin-right:-0.36em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

 markup syntax.</li>
<li>Ability to modify elements of a document through access to its internal
<a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree" target="_blank" rel="noreferrer noopener">abstract syntax tree</a>
representation.</li>
</ul>
<p>Pandoc can generate a PDF document in a straightforward manner.
The math markup will require a working <a href="https://ctan.org/starter" target="_blank" rel="noreferrer noopener">
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">L</span><span class="mspace" style="margin-right:-0.36em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>


distribution</a> installed on the system which Pandoc
will use to render the PDF.  External resources such as images will be embedded
in the PDF naturally.</p>
<p>Generating a static HTML document is a bit more involved due to the math
requirement above.  Math rendering on the web is typically offloaded to the
client, which then depends on a JavaScript math library.  Mathematical content
is <em>static by nature</em>, and given the offline requirement above, dynamic
client-side rendering with JavaScript is undesirable.  To end up with static
HTML math markup, the following approach makes
use of the <a href="https://katex.org/" target="_blank" rel="noreferrer noopener">
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">K</span><span class="mspace" style="margin-right:-0.17em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

</a> library at compile time of the
document (server-side math rendering).</p>
<p>I use a Bash shell script called <code>rendernote</code> to convert Markdown to the target
formats mentioned above.  The Git repository including that script and CSS style
sheets is located <a href="https://git.0xfab.ch/markdown-note-render/log.html" target="_blank" rel="noreferrer noopener">here</a>.
It uses some Bash-only features and was tested on a Linux system.  Some commands
in the script may not be available in this form on a BSD or macOS system.  The
following sections explain some details for my Pandoc document conversion
approach.  All of the code is located in the
<a href="https://git.0xfab.ch/markdown-note-render/file/rendernote.html" target="_blank" rel="noreferrer noopener"><code>rendernote</code></a>
script.</p>
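<p>Because the conversion relies on several external programs, a small
pre-flight check can make portability problems visible early.  The following is
a minimal sketch in portable shell, not code taken from <code>rendernote</code>;
the function name <code>missing_tools</code> is hypothetical.</p>

```shell
# Hypothetical helper (not part of rendernote): report which of the given
# commands cannot be resolved by the current shell.
missing_tools() {
    missing=""
    for tool in "$@"; do
        # command -v succeeds only if the name resolves to something runnable
        command -v "$tool" >/dev/null 2>&1 || missing="${missing}${missing:+ }${tool}"
    done
    # Print the missing names on one line and signal failure if there are any
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
}
```

<p>A call such as <code>missing_tools pandoc node pdflatex</code> would cover the
tools discussed in this post, assuming <code>pdflatex</code> is the PDF engine in
use; the exact list depends on the local setup.</p>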
<h2 id="md-to-pdf">Markdown to PDF Translation</h2>
<p>As mentioned <a href="#introduction">above</a>, translation to PDF is
straightforward but requires a working 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">L</span><span class="mspace" style="margin-right:-0.36em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

 distribution.  Pandoc will
then make use of it when rendering Markdown to PDF.  Alternative PDF engines are
described in the
<a href="https://pandoc.org/chunkedhtml-demo/2.4-creating-a-pdf.html" target="_blank" rel="noreferrer noopener">documentation</a>
which may work but have not been tested.  The
<a href="https://git.0xfab.ch/markdown-note-render/file/rendernote.html#l20" target="_blank" rel="noreferrer noopener"><code>render_pdf</code></a>
function in the <code>rendernote</code> script performs this task by calling</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span>pandoc <span style="color:#8045ff">\
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span><span style="color:#8045ff"></span>    --from markdown --to pdf <span style="color:#8045ff">\
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span><span style="color:#8045ff"></span>    --highlight-style pygments <span style="color:#8045ff">\
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span><span style="color:#8045ff"></span>    --output <span style="color:#d88200">&#34;</span><span style="color:#d88200">${</span><span style="color:#111">1</span><span style="color:#111">%.*</span><span style="color:#d88200">}</span><span style="color:#d88200">.pdf&#34;</span> <span style="color:#d88200">&#34;</span><span style="color:#d88200">${</span><span style="color:#111">input</span><span style="color:#d88200">}</span><span style="color:#d88200">&#34;</span>
</span></span></code></pre></div><p>The <code>--highlight-style</code> option defines the <code>pygments</code> style for code
highlighting.  Pandoc further supports <a href="https://pandoc.org/chunkedhtml-demo/8.10-metadata-blocks.html" target="_blank" rel="noreferrer noopener">metadata
headers</a> for
specific document settings which will be interpreted by the corresponding
translation engine.  The <code>render_pdf</code> function checks for the presence of such a
(YAML) header and adds a default header in case none is found.  This default
header sets the page geometry, default heading font family and possibly other
settings using standard 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">L</span><span class="mspace" style="margin-right:-0.36em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

 commands typically found in the preamble
(specified by the <code>header-includes</code> sequence) or at the beginning of a document
(values in the <code>include-before</code> sequence).  See the Pandoc <code>man</code>-page for
further documentation.   The default header in the script specifies the values:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span><span style="color:#f92672">fontsize</span><span style="color:#111">:</span> <span style="color:#ae81ff">12pt</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span><span style="color:#f92672">papersize</span><span style="color:#111">:</span> <span style="color:#ae81ff">a4</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span><span style="color:#f92672">linkcolor</span><span style="color:#111">:</span> <span style="color:#ae81ff">blue</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span><span style="color:#f92672">header-includes</span><span style="color:#111">:</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5</span><span>  - <span style="color:#ae81ff">\usepackage[top=60pt,bottom=60pt,left=80pt,right=80pt]{geometry}</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6</span><span>  - <span style="color:#ae81ff">\usepackage{bm}</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7</span><span>  - <span style="color:#ae81ff">\usepackage{sectsty}</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8</span><span><span style="color:#f92672">include-before</span><span style="color:#111">:</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">9</span><span>  - <span style="color:#ae81ff">\allsectionsfont{\sffamily}</span>
</span></span></code></pre></div><p>This example
<a href="/data/post/2025/markdown-to-static-html/turbulence.md"><code>turbulence.md</code></a>
Markdown document contains some common elements such as math, hyperlinks, block
quotes, code blocks as well as images and can be converted to a
<a href="/data/post/2025/markdown-to-static-html/turbulence.pdf"><code>turbulence.pdf</code></a>
document with the command</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span>rendernote -pdf turbulence.md
</span></span></code></pre></div><h2 id="md-to-html">Markdown to Static HTML Translation</h2>
<p>Pandoc supports the options <code>--standalone</code> and <code>--embed-resources</code>.  For PDF
translation the former is implied and resources such as images are embedded into
the PDF by default.  For HTML translation the former ensures that the generated
HTML file includes proper HTML markup (<code>html</code>, <code>head</code> and <code>body</code> tags) such that
it could be viewed in a web browser by simply opening it.  Rendering content or
fetching remote CSS style sheets would still require a running webserver (for
example <code>python -m http.server</code>).  While the generated HTML file is standalone,
it is not self-contained and may depend on external files such as images
specified by a file system path or a network connection to fetch remote content.
The <code>--embed-resources</code> option will attempt to fetch such external resources
(local or remote) and encode them within the HTML file.  Specifying both of these
options then results in a truly self-contained HTML file for offline
rendering, at the cost of a larger file size due to the embedded payload.</p>
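<p>The Pandoc manual describes <code>--embed-resources</code> as incorporating
linked files through <code>data:</code> URIs.  To illustrate what such an
embedded payload looks like, here is a minimal sketch in portable shell; the
helper name <code>data_uri</code> is hypothetical and this is not how Pandoc
itself is implemented.</p>

```shell
# Hypothetical helper: print the content of a file as a base64 data: URI.
# Arguments: MIME type, path to the file.
data_uri() {
    mime="$1"
    file="$2"
    # Remove newlines because some base64 implementations wrap their output
    payload="$(base64 < "$file" | tr -d '\n')"
    printf 'data:%s;base64,%s\n' "$mime" "$payload"
}
```

<p>For instance, <code>data_uri image/png figure.png</code> (with a hypothetical
<code>figure.png</code>) prints a string that can be used directly as the
<code>src</code> attribute of an <code>img</code> tag.  Base64 encodes every
three bytes as four characters, so each embedded resource grows by roughly one
third.</p>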
<p>Pandoc has <a href="https://pandoc.org/chunkedhtml-demo/3.6-math-rendering-in-html.html#math-rendering-in-html" target="_blank" rel="noreferrer noopener">several options for rendering math in
HTML</a>
and supports <a href="https://katex.org/" target="_blank" rel="noreferrer noopener">
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">K</span><span class="mspace" style="margin-right:-0.17em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

</a> natively with the <code>--katex</code>
option.  Specifying this option in addition to the two options discussed above
would be sufficient to produce a self-contained HTML file with support for math.
However, the resulting file has several drawbacks that make this approach
undesirable:</p>
<ul>
<li>Math rendering is deferred to the client which introduces an unnecessary
JavaScript dependency.</li>
<li>The <code>--embed-resources</code> option will resolve this dependency by embedding a
large amount of JavaScript code in the HTML file, which in turn blows up the
file size unnecessarily.</li>
<li>The client-side rendering introduces an overhead which may result in slow
page loading performance for notes with a significant amount of math.</li>
</ul>
<p>For example, a <code>tiny.md</code> Markdown file with the content <code>$a^2 + b^2 = c^2$</code> is
18 bytes in size.  The <code>tiny.html</code> file generated with</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span>pandoc --to html --standalone --embed-resources --katex -o tiny.html tiny.md
</span></span></code></pre></div><p>is 1731663 bytes in size, almost 100'000 times the size of the Markdown file.
The large file size is due to the encoded payload for JavaScript and font files
required for rendering the math.  The JavaScript dependency can be fully
eliminated using server-side math rendering at the time the HTML document is
created.  The file size can be reduced further by embedding only selected
font file formats (which may not be supported by all browsers).  To achieve this,
the steps implemented in the <code>rendernote</code> script are:</p>
<ol>
<li>Tag math in the input Markdown file with special labels by modifying the
abstract syntax tree (AST) representation in Pandoc.</li>
<li>Generate rendered HTML for the labeled math using the 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">K</span><span class="mspace" style="margin-right:-0.17em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

 library in a
<a href="https://nodejs.org/en" target="_blank" rel="noreferrer noopener">Node.js</a> application.</li>
<li>Convert the intermediate document to a self-contained HTML document.</li>
</ol>
<p>These steps are implemented in the
<a href="https://git.0xfab.ch/markdown-note-render/file/rendernote.html#l45" target="_blank" rel="noreferrer noopener"><code>render_html</code></a>
function and are described in more detail in the sections below.</p>
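<p>As a quick check of the size comparison quoted earlier in this section, the
ratio between output and input size can be computed with shell arithmetic.  This
is a minimal sketch; the helper name <code>size_ratio</code> is hypothetical and
the two byte counts are the ones given in the text.</p>

```shell
# Hypothetical helper: integer ratio of output size to input size, in bytes.
size_ratio() {
    input_bytes="$1"
    output_bytes="$2"
    echo "$((output_bytes / input_bytes))"
}

# Byte counts quoted in the text for tiny.md and tiny.html
size_ratio 18 1731663
```

<p>This prints <code>96203</code>, which is the factor rounded to "almost
100'000" above.  For other files, the byte counts can be obtained with
<code>wc -c</code>.</p>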
<h3 id="math-pandoc-lua-filter">Math Pre-Processing using Lua Filters</h3>
<p>Pandoc supports JSON and <a href="https://pandoc.org/lua-filters.html" target="_blank" rel="noreferrer noopener">Lua filters</a>
which can be used for AST transformations of a given input. The idea for this
pre-processing step is to wrap inline and display math in between HTML tags that
can later be used to identify math nodes by parsing the HTML DOM.  Fortunately,
adding these tags can be achieved easily in Pandoc using a Lua filter.  Such a
filter is simply a Lua function named after the type of AST element it
targets.  Every element of that type in the AST is then replaced with the
return value of the filter call.  The argument passed to the filter is the
element currently being visited.  <a href="https://pandoc.org/lua-filters.html#pandoc.Math" target="_blank" rel="noreferrer noopener">Math objects in
Pandoc</a> are simply given the
name <code>Math</code>. The filter used in the <code>rendernote</code> script to transform math nodes
is given by the following Lua code:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-lua" data-lang="lua"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1</span><span><span style="color:#00a8c8">function</span> <span style="color:#75af00">Math</span><span style="color:#111">(</span><span style="color:#111">elem</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2</span><span>    <span style="color:#111">assert</span><span style="color:#111">(</span><span style="color:#111">FORMAT</span><span style="color:#111">:</span><span style="color:#111">match</span><span style="color:#111">(</span><span style="color:#d88200">&#39;html&#39;</span><span style="color:#111">))</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3</span><span>    <span style="color:#00a8c8">local</span> <span style="color:#111">wrap</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4</span><span>    <span style="color:#00a8c8">if</span> <span style="color:#111">elem.mathtype</span> <span style="color:#f92672">==</span> <span style="color:#d88200">&#39;InlineMath&#39;</span> <span style="color:#00a8c8">then</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5</span><span>        <span style="color:#111">wrap</span> <span style="color:#f92672">=</span> <span style="color:#d88200">&#39;&lt;latexinline&gt;&#39;</span> <span style="color:#f92672">..</span> <span style="color:#111">elem.text</span> <span style="color:#f92672">..</span> <span style="color:#d88200">&#39;&lt;/latexinline&gt;&#39;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6</span><span>    <span style="color:#00a8c8">else</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7</span><span>        <span style="color:#111">wrap</span> <span style="color:#f92672">=</span> <span style="color:#d88200">&#39;&lt;latexdisplay&gt;&#39;</span> <span style="color:#f92672">..</span> <span style="color:#111">elem.text</span> <span style="color:#f92672">..</span> <span style="color:#d88200">&#39;&lt;/latexdisplay&gt;&#39;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8</span><span>    <span style="color:#00a8c8">end</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9</span><span>    <span style="color:#00a8c8">return</span> <span style="color:#111">pandoc.RawInline</span><span style="color:#111">(</span><span style="color:#d88200">&#39;html&#39;</span><span style="color:#111">,</span> <span style="color:#111">wrap</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10</span><span><span style="color:#00a8c8">end</span>
</span></span></code></pre></div><p>The function must be named <code>Math</code> and takes one argument, the
node the filter is currently applied to.  All that this filter does
is to wrap the 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">L</span><span class="mspace" style="margin-right:-0.36em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

 code contained in <code>elem.text</code> in a custom HTML tag:
<code>&lt;latexinline&gt;</code> for inline math or <code>&lt;latexdisplay&gt;</code> for display math.
The wrapped math is then returned as a new Pandoc node for raw HTML code
(therefore, this filter only works for HTML targets).  The <code>rendernote</code> script
then generates intermediate HTML code for the Markdown input using Pandoc:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span>pandoc <span style="color:#8045ff">\
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span><span style="color:#8045ff"></span>    --from markdown --to html <span style="color:#8045ff">\
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span><span style="color:#8045ff"></span>    --metadata <span style="color:#111">title</span><span style="color:#f92672">=</span><span style="color:#d88200">&#34;</span><span style="color:#00a8c8">$(</span>basename --suffix<span style="color:#f92672">=</span>.md -- <span style="color:#d88200">&#34;</span><span style="color:#d88200">${</span><span style="color:#111">1</span><span style="color:#d88200">}</span><span style="color:#d88200">&#34;</span><span style="color:#00a8c8">)</span><span style="color:#d88200">&#34;</span> <span style="color:#8045ff">\
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span><span style="color:#8045ff"></span>    --highlight-style pygments <span style="color:#8045ff">\
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5</span><span><span style="color:#8045ff"></span>    --lua-filter <span style="color:#d88200">&#34;</span><span style="color:#d88200">${</span><span style="color:#111">lua_filter</span><span style="color:#d88200">}</span><span style="color:#d88200">&#34;</span> <span style="color:#8045ff">\
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6</span><span><span style="color:#8045ff"></span>    --template <span style="color:#d88200">&#34;</span><span style="color:#d88200">${</span><span style="color:#111">html_template</span><span style="color:#d88200">}</span><span style="color:#d88200">&#34;</span> <span style="color:#8045ff">\
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7</span><span><span style="color:#8045ff"></span>    <span style="color:#d88200">&#34;</span><span style="color:#d88200">${</span><span style="color:#111">1</span><span style="color:#d88200">}</span><span style="color:#d88200">&#34;</span> &gt;<span style="color:#d88200">&#34;</span><span style="color:#d88200">${</span><span style="color:#111">raw_html</span><span style="color:#d88200">}</span><span style="color:#d88200">&#34;</span>
</span></span></code></pre></div><p>The Lua filter is defined in the file pointed to by the variable <code>lua_filter</code>.
The Pandoc call also uses an HTML template stored in the file pointed to by
<code>html_template</code> (see <a href="https://git.0xfab.ch/markdown-note-render/file/rendernote.html#l63" target="_blank" rel="noreferrer noopener">this
command</a> for
the details) and writes the intermediate HTML to the file pointed to by
<code>raw_html</code>.  The example Markdown input</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-markdown" data-lang="markdown"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span>Some inline math $a^2 + b^2 = c^2$ in a sentence.
</span></span></code></pre></div><p>is then filtered to HTML that looks like</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-html" data-lang="html"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span><span style="color:#111">&lt;</span><span style="color:#f92672">body</span><span style="color:#111">&gt;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span><span style="color:#111">&lt;</span><span style="color:#f92672">p</span><span style="color:#111">&gt;</span>Some inline math <span style="color:#111">&lt;</span><span style="color:#f92672">latexinline</span><span style="color:#111">&gt;</span>a^2 + b^2 = c^2<span style="color:#111">&lt;/</span><span style="color:#f92672">latexinline</span><span style="color:#111">&gt;</span> in a
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span>sentence.<span style="color:#111">&lt;/</span><span style="color:#f92672">p</span><span style="color:#111">&gt;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span><span style="color:#111">&lt;/</span><span style="color:#f92672">body</span><span style="color:#111">&gt;</span>
</span></span></code></pre></div><p>For comparison, the default HTML generated without the Lua filter looks like
the following:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-html" data-lang="html"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span><span style="color:#111">&lt;</span><span style="color:#f92672">body</span><span style="color:#111">&gt;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span><span style="color:#111">&lt;</span><span style="color:#f92672">p</span><span style="color:#111">&gt;</span>Some inline math <span style="color:#111">&lt;</span><span style="color:#f92672">span</span> <span style="color:#75af00">class</span><span style="color:#f92672">=</span><span style="color:#d88200">&#34;math inline&#34;</span><span style="color:#111">&gt;&lt;</span><span style="color:#f92672">em</span><span style="color:#111">&gt;</span>a<span style="color:#111">&lt;/</span><span style="color:#f92672">em</span><span style="color:#111">&gt;&lt;</span><span style="color:#f92672">sup</span><span style="color:#111">&gt;</span>2<span style="color:#111">&lt;/</span><span style="color:#f92672">sup</span><span style="color:#111">&gt;</span> +
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span><span style="color:#111">&lt;</span><span style="color:#f92672">em</span><span style="color:#111">&gt;</span>b<span style="color:#111">&lt;/</span><span style="color:#f92672">em</span><span style="color:#111">&gt;&lt;</span><span style="color:#f92672">sup</span><span style="color:#111">&gt;</span>2<span style="color:#111">&lt;/</span><span style="color:#f92672">sup</span><span style="color:#111">&gt;</span> = <span style="color:#111">&lt;</span><span style="color:#f92672">em</span><span style="color:#111">&gt;</span>c<span style="color:#111">&lt;/</span><span style="color:#f92672">em</span><span style="color:#111">&gt;&lt;</span><span style="color:#f92672">sup</span><span style="color:#111">&gt;</span>2<span style="color:#111">&lt;/</span><span style="color:#f92672">sup</span><span style="color:#111">&gt;&lt;/</span><span style="color:#f92672">span</span><span style="color:#111">&gt;</span> in a sentence.<span style="color:#111">&lt;/</span><span style="color:#f92672">p</span><span style="color:#111">&gt;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span><span style="color:#111">&lt;/</span><span style="color:#f92672">body</span><span style="color:#111">&gt;</span>
</span></span></code></pre></div><p>The next step is to render the math in the intermediate HTML code using

  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">K</span><span class="mspace" style="margin-right:-0.17em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

.</p>
<h3 id="server-side-katex">Server-Side Math Rendering with KaTeX</h3>
<p>If the filter wrapped any math nodes in the AST, they are now rendered to valid
HTML using the <code>renderToString</code> function from the <a href="https://katex.org/docs/api#server-side-rendering-or-rendering-to-a-string" target="_blank" rel="noreferrer noopener">
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">K</span><span class="mspace" style="margin-right:-0.17em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>


API</a>.
This is done with a small piece of JavaScript code in the
<a href="https://git.0xfab.ch/markdown-note-render/file/rendernote.html#l136" target="_blank" rel="noreferrer noopener"><code>render_latex</code></a>
function that is executed with Node.js.  The script is written as a <a href="https://tldp.org/LDP/abs/html/here-docs.html" target="_blank" rel="noreferrer noopener">here
document</a> that
is fed into the <code>node</code> command using a pipe.  The paths for the input and output
HTML files are substituted in the here document with variable expansions.  The
JavaScript used for the server-side math rendering looks as follows:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-javascript" data-lang="javascript"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1</span><span><span style="color:#00a8c8">const</span> <span style="color:#75af00">katex</span> <span style="color:#f92672">=</span> <span style="color:#75af00">require</span><span style="color:#111">(</span><span style="color:#d88200">&#39;katex&#39;</span><span style="color:#111">);</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2</span><span><span style="color:#00a8c8">const</span> <span style="color:#111">{</span><span style="color:#75af00">parseHTML</span><span style="color:#111">}</span> <span style="color:#f92672">=</span> <span style="color:#75af00">require</span><span style="color:#111">(</span><span style="color:#d88200">&#39;linkedom&#39;</span><span style="color:#111">);</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3</span><span><span style="color:#00a8c8">const</span> <span style="color:#75af00">fs</span> <span style="color:#f92672">=</span> <span style="color:#75af00">require</span><span style="color:#111">(</span><span style="color:#d88200">&#39;node:fs&#39;</span><span style="color:#111">);</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4</span><span><span style="color:#75af00">fs</span><span style="color:#111">.</span><span style="color:#75af00">readFile</span><span style="color:#111">(</span><span style="color:#d88200">&#39;${1}&#39;</span><span style="color:#111">,</span> <span style="color:#d88200">&#39;utf8&#39;</span><span style="color:#111">,</span> <span style="color:#111">(</span><span style="color:#75af00">err</span><span style="color:#111">,</span> <span style="color:#75af00">content</span><span style="color:#111">)</span> <span style="color:#111">=&gt;</span> <span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5</span><span>    <span style="color:#00a8c8">const</span> <span style="color:#111">{</span><span style="color:#111">document</span><span style="color:#111">}</span> <span style="color:#f92672">=</span> <span style="color:#75af00">parseHTML</span><span style="color:#111">(</span><span style="color:#75af00">content</span><span style="color:#111">.</span><span style="color:#75af00">toString</span><span style="color:#111">());</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6</span><span>    <span style="color:#00a8c8">const</span> <span style="color:#75af00">inline_items</span> <span style="color:#f92672">=</span> <span style="color:#111">document</span><span style="color:#111">.</span><span style="color:#75af00">querySelectorAll</span><span style="color:#111">(</span><span style="color:#d88200">&#39;latexinline&#39;</span><span style="color:#111">);</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7</span><span>    <span style="color:#00a8c8">const</span> <span style="color:#75af00">display_items</span> <span style="color:#f92672">=</span> <span style="color:#111">document</span><span style="color:#111">.</span><span style="color:#75af00">querySelectorAll</span><span style="color:#111">(</span><span style="color:#d88200">&#39;latexdisplay&#39;</span><span style="color:#111">);</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8</span><span>    <span style="color:#75af00">inline_items</span><span style="color:#111">.</span><span style="color:#75af00">forEach</span><span style="color:#111">((</span><span style="color:#75af00">item</span><span style="color:#111">)</span> <span style="color:#111">=&gt;</span> <span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9</span><span>        <span style="color:#00a8c8">const</span> <span style="color:#75af00">katex_code</span> <span style="color:#f92672">=</span> <span style="color:#75af00">katex</span><span style="color:#111">.</span><span style="color:#75af00">renderToString</span><span style="color:#111">(</span><span style="color:#75af00">item</span><span style="color:#111">.</span><span style="color:#75af00">innerHTML</span><span style="color:#111">,</span> <span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10</span><span>            <span style="color:#75af00">output</span><span style="color:#f92672">:</span> <span style="color:#d88200">&#39;html&#39;</span><span style="color:#111">,</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11</span><span>            <span style="color:#75af00">displayMode</span><span style="color:#f92672">:</span> <span style="color:#00a8c8">false</span><span style="color:#111">,</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12</span><span>        <span style="color:#111">});</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13</span><span>        <span style="color:#75af00">item</span><span style="color:#111">.</span><span style="color:#75af00">outerHTML</span> <span style="color:#f92672">=</span> <span style="color:#75af00">katex_code</span><span style="color:#111">;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14</span><span>    <span style="color:#111">});</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15</span><span>    <span style="color:#75af00">display_items</span><span style="color:#111">.</span><span style="color:#75af00">forEach</span><span style="color:#111">((</span><span style="color:#75af00">item</span><span style="color:#111">)</span> <span style="color:#111">=&gt;</span> <span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16</span><span>        <span style="color:#00a8c8">const</span> <span style="color:#75af00">katex_code</span> <span style="color:#f92672">=</span> <span style="color:#75af00">katex</span><span style="color:#111">.</span><span style="color:#75af00">renderToString</span><span style="color:#111">(</span><span style="color:#75af00">item</span><span style="color:#111">.</span><span style="color:#75af00">innerHTML</span><span style="color:#111">,</span> <span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17</span><span>            <span style="color:#75af00">output</span><span style="color:#f92672">:</span> <span style="color:#d88200">&#39;html&#39;</span><span style="color:#111">,</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18</span><span>            <span style="color:#75af00">displayMode</span><span style="color:#f92672">:</span> <span style="color:#00a8c8">true</span><span style="color:#111">,</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19</span><span>        <span style="color:#111">});</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20</span><span>        <span style="color:#75af00">item</span><span style="color:#111">.</span><span style="color:#75af00">outerHTML</span> <span style="color:#f92672">=</span> <span style="color:#75af00">katex_code</span><span style="color:#111">;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21</span><span>    <span style="color:#111">});</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22</span><span>    <span style="color:#75af00">fs</span><span style="color:#111">.</span><span style="color:#75af00">writeFile</span><span style="color:#111">(</span><span style="color:#d88200">&#39;${1}&#39;</span><span style="color:#111">,</span> <span style="color:#111">document</span><span style="color:#111">.</span><span style="color:#75af00">toString</span><span style="color:#111">(),</span> <span style="color:#75af00">err</span> <span style="color:#111">=&gt;</span> <span style="color:#111">{});</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23</span><span><span style="color:#111">});</span>
</span></span></code></pre></div><p>This reads the intermediate HTML file from the previous step and parses it into
an HTML DOM using the <code>linkedom</code> package (line 5).  The custom tags for the math
nodes are then queried and processed with <code>forEach</code> loops, where the
<code>katex.renderToString</code> function is used to replace each tag with valid
rendered HTML math.  Finally, the processed HTML is written back to the
same file that was read as input, which is safe because <code>fs.readFile</code> reads the
full file into memory before the callback is executed.  The <code>render_latex</code> function also
downloads the latest <code>katex</code> and <code>linkedom</code> modules using <code>npm</code>.  These modules
are stored in the directory where the <code>rendernote</code> script is located and are
only downloaded if they do not already exist.</p>
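<p>The in-place rewrite described above can be sketched in isolation.  The
following is a minimal sketch that uses only Node.js built-ins and is not the
code from <code>rendernote</code>: the helper name <code>rewriteInPlace</code>, the temporary
file and the stand-in transform (used here in place of the KaTeX rendering) are
assumptions made for illustration.  It uses the synchronous <code>fs.readFileSync</code>
and <code>fs.writeFileSync</code> calls, so errors surface as exceptions instead of
being passed to a callback.</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-javascript" data-lang="javascript">const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper (not part of rendernote): apply `transform` to the
// complete content of `file` and write the result back to the same path.
function rewriteInPlace(file, transform) {
    // readFileSync returns only after the whole file is in memory, so the
    // write below cannot clobber data that has not been read yet.
    const content = fs.readFileSync(file, 'utf8');
    fs.writeFileSync(file, transform(content));
}

// Usage with a throwaway file and a stand-in for the rendering step.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'rendernote-'));
const file = path.join(dir, 'raw.html');
fs.writeFileSync(file, 'a^2 + b^2 = c^2');
rewriteInPlace(file, function (text) { return '[rendered] ' + text; });
const result = fs.readFileSync(file, 'utf8');
console.log(result);
</code></pre></div>
<p>Reading the complete input before opening the output for writing is what makes
it possible to reuse the same path for both.</p>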
<h3 id="encoding-math-fonts">Encoding Math Fonts using Data URIs</h3>
<p>After the rendering pass, the
<a href="https://git.0xfab.ch/markdown-note-render/file/rendernote.html#l177" target="_blank" rel="noreferrer noopener"><code>static_katex</code></a>
function is executed to generate static CSS and font data for 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">K</span><span class="mspace" style="margin-right:-0.17em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

.  The
function fetches the data from a CDN server and stores it in a <code>static</code>
directory relative to the location of the <code>rendernote</code> script.  It
creates static font files by converting the CDN fonts to <code>base64</code> encoded data
URIs.  Only <code>woff</code> fonts are fetched by default; other font formats
can be <a href="https://git.0xfab.ch/markdown-note-render/file/rendernote.html#l187" target="_blank" rel="noreferrer noopener">defined
here</a>.

  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">K</span><span class="mspace" style="margin-right:-0.17em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

 related CSS is stored in the directory
<code>static/css/katex/${katex_version}</code>, where <code>${katex_version}</code> is determined by
<code>npm</code>.</p>
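<p>Converting font data to a <code>base64</code> encoded data URI can be sketched with
Node.js built-ins.  This is only an illustration and not the <code>static_katex</code>
implementation: the helper name <code>toDataUri</code>, the <code>font/woff</code> MIME type and
the four placeholder bytes are assumptions.</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-javascript" data-lang="javascript">// Hypothetical helper (not from rendernote): encode raw bytes as a data URI.
function toDataUri(bytes, mimeType) {
    const payload = Buffer.from(bytes).toString('base64');
    return 'data:' + mimeType + ';base64,' + payload;
}

// Placeholder input: the four signature bytes that start a WOFF file.
const fontBytes = Buffer.from('wOFF', 'ascii');
const uri = toDataUri(fontBytes, 'font/woff');

// A CSS font source could then reference the URI instead of a CDN URL.
const src = "src: url('" + uri + "') format('woff');";
console.log(src);
</code></pre></div>
<p>Base64 encodes every three bytes as four characters, so an embedded font is
roughly one third larger than its binary file.</p>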
<h3 id="final-html-file">Creating the Final Standalone HTML File</h3>
<p>The final step is to run Pandoc in a second pass with the intermediate HTML code
as input. This time the <code>--standalone</code> and <code>--embed-resources</code> options are
passed to Pandoc as well.  The second pass uses the same HTML template as the
first pass.  In addition, this call adds the
<code>--include-in-header</code> option to pass the CSS style sheets prepared for the final
document.  These style sheets include a <a href="https://git.0xfab.ch/markdown-note-render/file/static/css/style.css.html" target="_blank" rel="noreferrer noopener">default style
sheet</a>
and possibly 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8988em;vertical-align:-0.2155em;"></span><span class="mord text"><span class="mord textrm">K</span><span class="mspace" style="margin-right:-0.17em;"></span><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6833em;"><span style="top:-2.905em;"><span class="pstrut" style="height:2.7em;"></span><span class="mord"><span class="mord textrm mtight sizing reset-size6 size3">A</span></span></span></span></span></span><span class="mspace" style="margin-right:-0.15em;"></span><span class="mord text"><span class="mord textrm">T</span><span class="mspace" style="margin-right:-0.1667em;"></span><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.4678em;"><span style="top:-2.7845em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord textrm">E</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2155em;"><span></span></span></span></span><span class="mspace" style="margin-right:-0.125em;"></span><span class="mord textrm">X</span></span></span></span></span></span>

 related style sheets discussed in the <a href="#encoding-math-fonts">previous
section</a>.  A Markdown document without
math will only include the default style sheet.</p>
<p>Following up with the same <code>tiny.md</code> example used at the beginning of the
<a href="#md-to-html">Markdown to Static HTML Translation</a> section, the
command</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span>rendernote tiny.md
</span></span></code></pre></div><p>generates a <code>tiny.html</code> file of 431666 bytes, which is 4 times smaller
than the standalone HTML file generated by the native Pandoc approach.  The file
is also free of JavaScript. While almost half a megabyte is still quite large
for such tiny content, 94% of the total file size is attributed to the <code>base64</code>
encoded math fonts included in the payload of the standalone HTML file. Removing
the inline math <code>$</code> markers from the <code>tiny.md</code> file results in an HTML file that is
only 4282 bytes in size.</p>
<p>Finally, the example
<a href="/data/post/2025/markdown-to-static-html/turbulence.pdf"><code>turbulence.pdf</code></a> file
generated in the previous section <a href="#md-to-pdf">Markdown to PDF Translation</a>
amounts to 3.7 MB. The same Markdown input converted to HTML amounts to
2.7 MB (versus 3.9 MB using the native Pandoc approach).  The payload
for the two images and encoded math fonts accounts for 74% (2.0 MB) of that size.
For comparison, the standalone
<a href="/data/post/2025/markdown-to-static-html/turbulence.html"><code>turbulence.html</code></a>
version can be viewed by following the link.  The file is created with the
command</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span>rendernote turbulence.md
</span></span></code></pre></div> ]]></description></item><item><title>Email with Microsoft Exchange Accounts on Linux</title><link rel="alternate" type="text/html" href="https://0xfab.ch/2024/12/microsoft-accounts-on-linux/"/><pubDate>Sat, 04 Jan 2025 23:13:52 +0100</pubDate><author>info@0xfab.ch (Fabian Wermelinger)</author><guid>https://0xfab.ch/2024/12/microsoft-accounts-on-linux/</guid><category term="davmail"/><category term="linux"/><category term="microsoft"/><category term="email"/><description><![CDATA[ <p>Email is one of the most efficient ways of communication if <a href="https://man.sr.ht/lists.sr.ht/etiquette.md" target="_blank" rel="noreferrer noopener">done
right</a>. Most people who appreciate
email value a specific set of tools that help streamline the email process. Most
often, email is fetched on a device using the
<a href="https://en.wikipedia.org/wiki/Post_Office_Protocol" target="_blank" rel="noreferrer noopener">POP</a> or
<a href="https://en.wikipedia.org/wiki/Internet_Message_Access_Protocol" target="_blank" rel="noreferrer noopener">IMAP</a> protocol.
Some institutions use different protocols that are not always supported by
these tools or the operating system. This post details the steps necessary for me
to fetch email via the POP protocol from a Microsoft Exchange account using the
<a href="https://davmail.sourceforge.net/" target="_blank" rel="noreferrer noopener">Davmail</a> gateway on a Linux system.</p>
<h2 id="introduction">Introduction</h2>
<p>My workflow is command-line driven and the tools I use to process email are
either local daemons, scripts or some
<a href="https://en.wikipedia.org/wiki/Ncurses" target="_blank" rel="noreferrer noopener"><code>ncurses</code></a> based application.
Conceptually, my email data flow looks like this:</p>



<svg width="683.51" height="493.96" version="1.1" viewBox="0 0 180.85 130.69" xmlns="http://www.w3.org/2000/svg">
 <defs>
  <marker id="a" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="#a9a9a9" fill-rule="evenodd" stroke="#a9a9a9"/>
  </marker>
  <marker id="b" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="#a9a9a9" fill-rule="evenodd" stroke="#a9a9a9"/>
  </marker>
  <marker id="d" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="context-stroke" fill-rule="evenodd"/>
  </marker>
  <marker id="p" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="context-stroke" fill-rule="evenodd"/>
  </marker>
  <marker id="j" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="context-stroke" fill-rule="evenodd"/>
  </marker>
  <marker id="i" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="context-stroke" fill-rule="evenodd"/>
  </marker>
  <marker id="h" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="context-stroke" fill-rule="evenodd"/>
  </marker>
  <marker id="f" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="context-stroke" fill-rule="evenodd"/>
  </marker>
  <marker id="o" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="context-stroke" fill-rule="evenodd"/>
  </marker>
  <marker id="n" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="#a9a9a9" fill-rule="evenodd" stroke="#a9a9a9"/>
  </marker>
  <marker id="m" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="#a9a9a9" fill-rule="evenodd" stroke="#a9a9a9"/>
  </marker>
  <marker id="k" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="#a9a9a9" fill-rule="evenodd" stroke="#a9a9a9"/>
  </marker>
  <marker id="l" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="#a9a9a9" fill-rule="evenodd" stroke="#a9a9a9"/>
  </marker>
  <marker id="g" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="context-stroke" fill-rule="evenodd"/>
  </marker>
  <marker id="c" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="context-stroke" fill-rule="evenodd"/>
  </marker>
  <marker id="e" overflow="visible" markerHeight="1" markerWidth="1" orient="auto-start-reverse" preserveAspectRatio="xMidYMid" viewBox="0 0 1 1">
   <path transform="scale(.7)" d="m-0.21115-4.1056 6.4223 3.2111a1 1 90 0 1 0 1.7889l-6.4223 3.2111a1.2361 1.2361 31.717 0 1-1.7889-1.1056v-6a1.2361 1.2361 148.28 0 1 1.7889-1.1056z" fill="context-stroke" fill-rule="evenodd"/>
  </marker>
 </defs>
 <g transform="translate(-2.4343 -5.4991)">
  <rect x="2.6458" y="5.7106" width="179.92" height="15.875" rx="1.0583" ry="1.0583" fill="#87ceeb" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".423"/>
  <text x="76.779427" y="15.378468" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="76.779427" y="15.378468" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">Mail Servers</tspan></text>
  <rect x="3.1426" y="36.473" width="179.93" height="72.76" rx="1.0584" ry=".97288" fill="none" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".40558"/>
  <text x="159.76332" y="113.4841" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="159.76332" y="113.4841" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">localhost</tspan></text>
  <rect x="12.243" y="44.955" width="42.333" height="7.9375" rx="1.0583" ry="1.0583" fill="none" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".423"/>
  <text x="22.876295" y="74.936302" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="22.876295" y="74.936302" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">procmail</tspan></text>
  <text x="24.25742" y="50.22863" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="24.25742" y="50.22863" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">getmail</tspan></text>
  <text x="17.566105" y="98.724991" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="17.566105" y="98.724991" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">spamassassin</tspan></text>
  <path transform="translate(-1.1335 -70.592)" d="m34.543 126.41v8.6357" fill="none" marker-end="url(#p)" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".52917"/>
  <path transform="matrix(1 0 0 -1 90.552 165.84)" d="m34.543 125.31v14.332" fill="#a9a9a9" marker-end="url(#m)" marker-start="url(#b)" stroke="#a9a9a9" stroke-linecap="round" stroke-linejoin="round" stroke-width=".52917"/>
  <path transform="translate(-1.1335 -97.308)" d="m34.543 121.27v16.536" fill="#a9a9a9" marker-end="url(#k)" stroke="#a9a9a9" stroke-linecap="round" stroke-linejoin="round" stroke-width=".52917"/>
  <path transform="matrix(1 0 0 -1 58.567 240.75)" d="m34.543 124.76v10.289" fill="#a9a9a9" marker-end="url(#l)" marker-start="url(#a)" stroke="#a9a9a9" stroke-linecap="round" stroke-linejoin="round" stroke-width=".52917"/>
  <path transform="rotate(-90 49.221 58.94)" d="m34.543 128.43v8.452" fill="none" marker-end="url(#j)" marker-start="url(#d)" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".52917"/>
  <path transform="rotate(-90 54.255 65.812)" d="m34.543 128.43v10.289" fill="none" marker-end="url(#i)" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".52917"/>
  <path transform="rotate(-90 60.209 71.765)" d="m34.543 128.43v10.289" fill="none" marker-end="url(#h)" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".52917"/>
  <path transform="rotate(-90 18.439 89.722)" d="m34.543 128.43v10.289" fill="none" marker-end="url(#f)" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".52917"/>
  <rect x="71.943" y="69.65" width="42.333" height="31.75" rx="1.0583" ry="1.0583" fill="none" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".423"/>
  <text x="83.851875" y="87.255295" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="83.851875" y="87.255295" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">Maildir</tspan></text>
  <rect x="131.64" y="69.65" width="42.333" height="7.9375" rx="1.0583" ry="1.0583" fill="none" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".423"/>
  <text x="147.56589" y="75.134735" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="147.56589" y="75.134735" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">mutt</tspan></text>
  <rect x="131.64" y="81.556" width="42.333" height="7.9375" rx="1.0583" ry="1.0583" fill="none" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".423"/>
  <text x="150.21701" y="86.73407" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="150.21701" y="86.73407" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">mu</tspan></text>
  <rect x="131.64" y="93.462" width="42.333" height="7.9375" rx="1.0583" ry="1.0583" fill="none" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".423"/>
  <text x="136.92439" y="98.724991" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="136.92439" y="98.724991" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">user scripts</tspan></text>
  <rect x="103.93" y="44.955" width="42.333" height="7.9375" rx="1.0583" ry="1.0583" fill="#ae81ff" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".423"/>
  <text x="115.92436" y="50.65461" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="115.92436" y="50.65461" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">davmail</tspan></text>
  <path transform="matrix(0 -1 -1 0 198.64 83.467)" d="m34.543 97.381v41.341" fill="#ae81ff" marker-end="url(#g)" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".52917"/>
  <text x="36.878185" y="62.932842" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="36.878185" y="62.932842" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">MDA</tspan></text>
  <text x="35.724602" y="32.459084" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="35.724602" y="32.459084" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">POP3</tspan></text>
  <text x="74.215485" y="47.296913" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="74.215485" y="47.296913" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">POP3</tspan></text>
  <path transform="matrix(1 0 0 -1 107.68 192.85)" d="m34.543 126.41v8.6357" fill="#ae81ff" marker-end="url(#o)" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".52917"/>
  <path transform="matrix(1 0 0 -1 128.85 192.85)" d="m34.543 126.41v40.239" fill="#a9a9a9" marker-end="url(#n)" stroke="#a9a9a9" stroke-linecap="round" stroke-linejoin="round" stroke-width=".529"/>
  <text x="147.4971" y="62.932842" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="147.4971" y="62.932842" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">SMTP</tspan></text>
  <text x="75.133392" y="32.459084" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="75.133392" y="32.459084" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">Microsoft Exchange</tspan></text>
  <rect x="50.776" y="120.11" width="84.667" height="15.875" rx="1.0583" ry="1.0583" fill="#9acd32" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".423"/>
  <text x="65.357491" y="129.77502" font-family="Inconsolata" font-size="5.2917px" letter-spacing="0px" stroke-width=".26458" word-spacing="0px" style="line-height:1.25" xml:space="preserve"><tspan x="65.357491" y="129.77502" font-family="Inconsolata" font-size="5.2917px" stroke-width=".26458">Trusted Remote Server</tspan></text>
  <rect x="12.243" y="69.65" width="42.333" height="7.9375" rx="1.0583" ry="1.0583" fill="none" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".423"/>
  <rect x="12.243" y="93.462" width="42.333" height="7.9375" rx="1.0583" ry="1.0583" fill="none" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".423"/>
  <path transform="translate(-1.1335 -46.491)" d="m34.543 128.43v7.1658" fill="none" marker-end="url(#e)" marker-start="url(#c)" stroke="#000" stroke-linecap="round" stroke-linejoin="round" stroke-width=".52917"/>
 </g>
</svg>


<p>A new account required communication with a Microsoft Exchange server
on which the IT department had disabled the POP and IMAP protocols.  Since neither
<code>getmail</code> nor <code>mutt</code> can speak the Exchange protocol, the additional
<a href="https://davmail.sourceforge.net/" target="_blank" rel="noreferrer noopener"><code>davmail</code></a> node in the diagram above
had to be added.  Davmail runs as a daemon that provides POP and SMTP
listeners on <code>localhost</code>, which it then translates to the Microsoft
Exchange protocol.  Apart from POP and SMTP, Davmail also supports the IMAP, LDAP,
CalDAV and CardDAV protocols.  With the Davmail proxy in place, any application
that speaks one of these protocols can communicate with a server that uses the
Microsoft Exchange protocol.</p>
<h2 id="davmail-config">Davmail Configuration</h2>
<p>The following configuration is intended to use Davmail as a server daemon mainly
using the POP, SMTP and CalDAV protocols. Currently
<a href="https://gitlab.gnome.org/GNOME/evolution-ews/" target="_blank" rel="noreferrer noopener"><code>evolution-ews</code></a> is one of the
few email client implementations on Linux that work with Microsoft Exchange
services.  The IT department which manages the Microsoft Exchange server I am
connecting to supports this application and typically there is a high level of
resistance when asked to support another one.</p>
<p>Authentication on the remote is done with
<a href="https://www.rfc-editor.org/rfc/rfc6749.html" target="_blank" rel="noreferrer noopener">OAuth2</a>.  Following the path of
least resistance, I am using the <code>clientId</code> and <code>tenantId</code> intended for
<code>evolution-ews</code>. This information can typically be found on the help pages
provided by the IT department. The <code>persistToken</code> flag (its default value is
<code>true</code>) will append a refresh token at the end of the properties file after
connecting to the remote server for the first time.  The token will be used for
automatic authentication for subsequent connections to the server.  The steps to
obtain a refresh token are the following:</p>
<ol start="0">
<li>If <code>davmail</code> is already running as a daemon, use <code>systemctl</code> to stop the
service.</li>
<li>Set <code>davmail.mode=O365Manual</code> in your <code>davmail</code> properties file.</li>
<li>Run <code>davmail</code> manually with the properties file as argument.</li>
<li>Force a connection on the port <code>davmail</code> is listening on, for example by
fetching mail or sending an email with <code>msmtp</code>.</li>
<li>Follow the instructions printed to <code>stdout</code> in the terminal where <code>davmail</code>
was started.</li>
<li>Set <code>davmail.mode=O365Modern</code> in your <code>davmail</code> properties file and restart
the <code>systemd</code> service.  From now on, authentication happens via the token that
has been appended to the properties file used in item 2.</li>
</ol>
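<p>The mode switch in steps 1 and 5 can be scripted.  The following is only a
minimal sketch and not part of my actual setup: the helper name
<code>set_davmail_mode</code>, the unit name <code>davmail.service</code> and the properties file path
are assumptions, and the helper relies on the in-place option of GNU <code>sed</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"># Hypothetical helper: rewrite the active (uncommented) davmail.mode line.
# Commented lines such as "# davmail.mode=O365Modern" are left untouched.
set_davmail_mode() {
    props=$1    # path to the Davmail properties file
    mode=$2     # O365Manual or O365Modern
    sed -i "s/^davmail\.mode=.*/davmail.mode=${mode}/" "$props"
}

# Intended use, following the numbered steps above
# (unit name and path are assumptions, adjust them to your system):
#   systemctl --user stop davmail.service
#   set_davmail_mode ~/.config/davmail/davmail.properties O365Manual
#   davmail ~/.config/davmail/davmail.properties
#   ... fetch mail in a second terminal and follow the instructions ...
#   set_davmail_mode ~/.config/davmail/davmail.properties O365Modern
#   systemctl --user start davmail.service
</code></pre></div>
<p>Keeping the interactive part in comments is deliberate: pasting the access
code back into Davmail has to be done by hand.</p>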
<p>By default, communication with the local Davmail daemon is not encrypted, which
is fine in most cases (it can be changed if desired, <a href="https://davmail.sourceforge.net/sslsetup.html" target="_blank" rel="noreferrer noopener">see the
documentation</a>).</p>
<blockquote>
<p>Since local communication is not encrypted, you would technically not need a
password when authenticating with Davmail.  Nevertheless, you should still use
a password when connecting to Davmail with your local applications.  Davmail
will use this password to encrypt the refresh token with a symmetric AES
cipher before it is written to the properties file (you have to use the same
password with all your local applications for this to work).  Additionally,
you may want to restrict the file access permissions of the properties file as a
further security measure.</p></blockquote>
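<p>Restricting the permissions could look as follows.  This is a minimal sketch;
the function name <code>protect_properties</code> and the example path are assumptions:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"># Hypothetical helper: make the properties file readable and writable
# by its owner only (mode 600), since it holds the refresh token.
protect_properties() {
    chmod 600 "$1"
}

# Example call (the path is an assumption):
#   protect_properties ~/.config/davmail/davmail.properties
</code></pre></div>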
<p>The listener ports can be chosen freely from the available user ports.  For
initial setup and testing, the logging level in line 51 may need to be set to
<code>DEBUG</code>.  In order to get a refresh token, you will need to paste an access code
that is obtained from the Microsoft Exchange server back into Davmail.  To
achieve this, execute Davmail directly from the command line with a properties
file argument where the <code>mode</code> is set to <code>O365Manual</code>.  Once the initial
connection has succeeded, the refresh token is appended to the file and the
<code>mode</code> can be set back to <code>O365Modern</code> for subsequent automatic authentication
when running Davmail as a <a href="https://en.wikipedia.org/wiki/Systemd" target="_blank" rel="noreferrer noopener"><code>systemd</code></a> user
service.</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-properties" data-lang="properties"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1</span><span><span style="color:#75715e"># server and mode settings</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2</span><span><span style="color:#75af00">davmail.server</span><span style="color:#f92672">=</span><span style="color:#d88200">true</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3</span><span><span style="color:#75af00">davmail.enableKeepAlive</span><span style="color:#f92672">=</span><span style="color:#d88200">true</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4</span><span><span style="color:#75715e"># STEPS:</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5</span><span><span style="color:#75715e"># 0. Adjust davmail.logFilePath below.</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6</span><span><span style="color:#75715e"># 1. Use O365Manual for initial connection (this should append a refresh token</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7</span><span><span style="color:#75715e">#    to this file, davmail must be executed from the command line).</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8</span><span><span style="color:#75715e"># 2. Use O365Modern for headless server (e.g. systemd service) operation.</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9</span><span><span style="color:#75715e">#    Obtaining an access token should now be automated until refresh token</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10</span><span><span style="color:#75715e">#    expires.</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11</span><span><span style="color:#75af00">davmail.mode</span><span style="color:#f92672">=</span><span style="color:#d88200">O365Manual</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12</span><span><span style="color:#75715e"># davmail.mode=O365Modern</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13</span><span><span style="color:#75af00">davmail.url</span><span style="color:#f92672">=</span><span style="color:#d88200">https://outlook.office365.com/EWS/Exchange.asmx</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14</span><span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15</span><span><span style="color:#75715e"># oauth evolution mock settings</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16</span><span><span style="color:#75af00">davmail.oauth.clientId</span><span style="color:#f92672">=</span><span style="color:#d88200">&lt;client id&gt;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17</span><span><span style="color:#75af00">davmail.oauth.tenantId</span><span style="color:#f92672">=</span><span style="color:#d88200">common</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18</span><span><span style="color:#75af00">davmail.oauth.redirectUri</span><span style="color:#f92672">=</span><span style="color:#d88200">https://login.microsoftonline.com/common/oauth2/nativeclient</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19</span><span><span style="color:#75af00">davmail.oauth.persistToken</span><span style="color:#f92672">=</span><span style="color:#d88200">true</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20</span><span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21</span><span><span style="color:#75715e"># listener ports</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22</span><span><span style="color:#75af00">davmail.caldavPort</span><span style="color:#f92672">=</span><span style="color:#d88200">5000</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23</span><span><span style="color:#75af00">davmail.popPort</span><span style="color:#f92672">=</span><span style="color:#d88200">5001</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24</span><span><span style="color:#75af00">davmail.imapPort</span><span style="color:#f92672">=</span><span style="color:#d88200">5002</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25</span><span><span style="color:#75af00">davmail.ldapPort</span><span style="color:#f92672">=</span><span style="color:#d88200">5003</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26</span><span><span style="color:#75af00">davmail.smtpPort</span><span style="color:#f92672">=</span><span style="color:#d88200">5004</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27</span><span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28</span><span><span style="color:#75715e"># network proxy settings</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">29</span><span><span style="color:#75af00">davmail.enableProxy</span><span style="color:#f92672">=</span><span style="color:#d88200">false</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30</span><span><span style="color:#75af00">davmail.useSystemProxies</span><span style="color:#f92672">=</span><span style="color:#d88200">false</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31</span><span><span style="color:#75af00">davmail.allowRemote</span><span style="color:#f92672">=</span><span style="color:#d88200">false</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32</span><span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33</span><span><span style="color:#75715e"># disable SSL for specified listeners</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34</span><span><span style="color:#75af00">davmail.ssl.nosecurecaldav</span><span style="color:#f92672">=</span><span style="color:#d88200">false</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35</span><span><span style="color:#75af00">davmail.ssl.nosecureimap</span><span style="color:#f92672">=</span><span style="color:#d88200">false</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36</span><span><span style="color:#75af00">davmail.ssl.nosecureldap</span><span style="color:#f92672">=</span><span style="color:#d88200">false</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37</span><span><span style="color:#75af00">davmail.ssl.nosecurepop</span><span style="color:#f92672">=</span><span style="color:#d88200">false</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38</span><span><span style="color:#75af00">davmail.ssl.nosecuresmtp</span><span style="color:#f92672">=</span><span style="color:#d88200">false</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39</span><span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40</span><span><span style="color:#75715e"># POP settings</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41</span><span><span style="color:#75af00">davmail.keepDelay</span><span style="color:#f92672">=</span><span style="color:#d88200">0</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">42</span><span><span style="color:#75af00">davmail.sentKeepDelay</span><span style="color:#f92672">=</span><span style="color:#d88200">0</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">43</span><span><span style="color:#75af00">davmail.popMarkReadOnRetr</span><span style="color:#f92672">=</span><span style="color:#d88200">false</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">44</span><span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">45</span><span><span style="color:#75715e"># SMTP settings</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">46</span><span><span style="color:#75af00">davmail.smtpSaveInSent</span><span style="color:#f92672">=</span><span style="color:#d88200">false</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">47</span><span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">48</span><span><span style="color:#75715e"># logging</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">49</span><span><span style="color:#75af00">davmail.logFilePath</span><span style="color:#f92672">=</span><span style="color:#d88200">/path/to/logfile</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">50</span><span><span style="color:#75af00">log4j.rootLogger</span><span style="color:#f92672">=</span><span style="color:#d88200">WARN</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">51</span><span><span style="color:#75af00">log4j.logger.davmail</span><span style="color:#f92672">=</span><span style="color:#d88200">INFO</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">52</span><span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">53</span><span><span style="color:#75715e"># davmail appended items</span>
</span></span></code></pre></div><h2 id="davmail-patch">Patching Davmail for Forced Message Deletion on Remote</h2>
<p>I prefer that POP permanently deletes messages once they have been fetched
successfully, so that none of my email resides on external mail servers (on mobile
devices I typically use IMAP to receive messages while on the go).  Also note that <code>smtpSaveInSent</code>
is set to <code>false</code> in the Davmail configuration above, which prevents saving a
copy of sent mail on the server.  By default, Davmail is conservative and moves
messages to trash when it receives the <code>DELE</code> command via POP (which is probably
the sensible thing to do in most cases).  <em><strong>Permanent message deletion</strong></em> on
the server can be enforced by applying this
<a href="https://git.0xfab.ch/davmail-git/file/pop-force-delete.patch.html" target="_blank" rel="noreferrer noopener">patch</a> to
the source code.  For Arch Linux users, an AUR package that applies this patch
during build can be obtained with</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span>git clone https://git.0xfab.ch/davmail-git.git
</span></span></code></pre></div><h2 id="smtp-mutt">SMTP Configuration in Mutt</h2>
<p>Local communication with Davmail is not encrypted, so the SMTP configuration in
clients such as <code>mutt</code> must be set up accordingly.  In particular, SSL settings
must be disabled for the account that connects to Davmail:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-ini" data-lang="ini"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span><span style="color:#75af00">set ssl_starttls</span>  <span style="color:#f92672">=</span> <span style="color:#d88200">&#34;no&#34;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span><span style="color:#75af00">set ssl_force_tls</span> <span style="color:#f92672">=</span> <span style="color:#d88200">&#34;no&#34;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span><span style="color:#75af00">set smtp_url</span>      <span style="color:#f92672">=</span> <span style="color:#d88200">&#34;smtp://&lt;username&gt;@localhost:&lt;davmail smtp port&gt;/&#34;</span>
</span></span></code></pre></div> ]]></description></item><item><title>Fortran is not faster than C</title><link rel="alternate" type="text/html" href="https://0xfab.ch/2024/08/fortran-is-not-faster-than-c/"/><pubDate>Sat, 28 Dec 2024 13:17:46 +0100</pubDate><author>info@0xfab.ch (Fabian Wermelinger)</author><guid>https://0xfab.ch/2024/08/fortran-is-not-faster-than-c/</guid><category term="fortran"/><category term="c"/><category term="performance"/><category term="strict-aliasing"/><description><![CDATA[ <p>This post elaborates on some technical details about performance differences
that may be observed between Fortran and C/C++ implementations.  The motivation
for this post is based on arguments I had in the past where it was claimed that
a Fortran implementation of some algorithm yields faster executable code.  For a
well-designed programming language, the performance-limiting factor of a particular
implementation is the hardware, not the language or compiler.</p>
<h2 id="kernel">Test Kernel</h2>
<p>The kernel I am going to benchmark is a variation of the general matrix-vector
multiplication (GEMV) found in <a href="https://www.netlib.org/blas/" target="_blank" rel="noreferrer noopener">BLAS</a> level 2.
The simplified kernel takes the form</p>

  
  <span class="katex-display"><span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.7778em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span 
class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">ij</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mpunct">,</span></span></span></span></span>

<p>where summation over index 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.854em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.05724em;">j</span></span></span></span>

 is assumed.  The values 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>

 and 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>

 are
elements of vectors 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7335em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord"><span class="mord boldsymbol">x</span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord"><span class="mord boldsymbol" style="margin-right:0.03704em;">y</span></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">∈</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6889em;"></span><span class="mord"><span class="mord mathbb">R</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6644em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">n</span></span></span></span></span></span></span></span></span></span></span>

,
respectively, and 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">ij</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span>

 are elements of matrix 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7224em;vertical-align:-0.0391em;"></span><span class="mord mathnormal">A</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">∈</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.7713em;"></span><span class="mord"><span class="mord mathbb">R</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7713em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">n</span><span class="mbin mtight">×</span><span class="mord mathnormal mtight">n</span></span></span></span></span></span></span></span></span></span></span></span>

 for some integer number 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5782em;vertical-align:-0.0391em;"></span><span class="mord mathnormal">n</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">1</span></span></span></span>

.  For typical problems 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">n</span></span></span></span>

 is large.</p>
<h3 id="OI">Operational Intensity</h3>
<p>The <em>operational intensity</em><sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> of a compute kernel is the ratio of
the number of arithmetic operations to the total number of bytes transferred due to memory
accesses.  To simplify the discussion, I am assuming the destination of all
memory transactions is DRAM.  For the simplified GEMV kernel above, there are

  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8141em;"></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span>

 multiplications and 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">n</span><span class="mopen">(</span><span class="mord mathnormal">n</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord">1</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">n</span></span></span></span>

 additions for a total of 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8141em;"></span><span class="mord">2</span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span>


flop.  The memory accesses to DRAM are trickier to estimate because of cache
memory hierarchies found in all modern CPU architectures.  An upper bound can be
estimated by assuming no cache, where every memory access results in a capacity
miss.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>  In that case there are 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">n</span></span></span></span>

 reads for 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>

, 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8141em;"></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span>

 reads for the
matrix elements 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">ij</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span>

, 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8141em;"></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span>

 reads for 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>

 due to the 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">n</span></span></span></span>

 inner
products and another 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">n</span></span></span></span>

 writes for 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>

, resulting in a total of 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8974em;vertical-align:-0.0833em;"></span><span class="mord">2</span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">2</span><span class="mord mathnormal">n</span></span></span></span>

 memory accesses.  For a lower bound estimate we assume an infinite cache
where only compulsory misses are relevant.<sup id="fnref1:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>  Here every required read and
write is performed exactly once, resulting in a total of 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8974em;vertical-align:-0.0833em;"></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">3</span><span class="mord mathnormal">n</span></span></span></span>

 memory
accesses.  For both estimates the leading order term of memory accesses is

  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8141em;"></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span>

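, which is the same order as the flop count.  The operational intensity is
therefore bounded by a small constant independent of the problem size,
indicating that the GEMV kernel is <em>memory bound</em>.  Indeed, all BLAS
level 1 and level 2 kernels are memory bound.</p>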
<p>Since 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">n</span></span></span></span>

 is assumed large, it is reasonable to pick the upper bound memory
access estimate, where the cache capacity is assumed small compared to the
problem size.  Assuming 32-bit single-precision floating-point data (4 bytes per
memory access), the operational intensity for this test kernel evaluates to</p>

  
  <span class="katex-display"><span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.02778em;">O</span><span class="mord mathnormal" style="margin-right:0.07847em;">I</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:2.4271em;vertical-align:-0.936em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.4911em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">8</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7401em;"><span style="top:-2.989em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord mathnormal">n</span><span class="mclose">)</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span 
class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.936em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">0.25.</span></span></span></span></span>

<p>The approximate value, given in flop per byte, corresponds to the expected value in the limit
when 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">n</span></span></span></span>

 is large.</p>
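<p>To see how quickly the limit is approached, the expression above can be evaluated for a few matrix sizes.  The following is a minimal sketch, assuming the numerator counts floating point operations and the denominator counts bytes; the function name <code>gemv_operational_intensity</code> is hypothetical and not part of the kernel discussed in this post:</p>

```c
#include <stddef.h>

/* Evaluates the operational intensity exactly as written above:
 * 2 n^2 in the numerator and 8 (n^2 + n) in the denominator.
 * Assumption: n > 0, otherwise the expression is 0/0. */
double gemv_operational_intensity(size_t n)
{
    const double nd = (double)n;
    return (2.0 * nd * nd) / (8.0 * (nd * nd + nd));
}
```

<p>The expression simplifies to <code>n / (4 (n + 1))</code>, so it starts at <code>0.125</code> for <code>n = 1</code> and is within one percent of <code>0.25</code> once <code>n</code> reaches 100.</p>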
<h2 id="naive-implementation">Naive Implementation</h2>
<p>A straightforward implementation of the GEMV kernel discussed in the <a href="#kernel">previous
section</a> could look like:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-c" data-lang="c"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span><span style="color:#00a8c8">void</span> <span style="color:#75af00">gemv</span><span style="color:#111">(</span><span style="color:#00a8c8">const</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#111">A</span><span style="color:#111">,</span> <span style="color:#00a8c8">const</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#111">x</span><span style="color:#111">,</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#111">y</span><span style="color:#111">,</span> <span style="color:#00a8c8">const</span> <span style="color:#00a8c8">size_t</span> <span style="color:#111">n</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span><span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span>    <span style="color:#00a8c8">for</span> <span style="color:#111">(</span><span style="color:#00a8c8">size_t</span> <span style="color:#111">i</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span><span style="color:#111">;</span> <span style="color:#111">i</span> <span style="color:#f92672">&lt;</span> <span style="color:#111">n</span><span style="color:#111">;</span> <span style="color:#f92672">++</span><span style="color:#111">i</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span>        <span style="color:#00a8c8">for</span> <span style="color:#111">(</span><span style="color:#00a8c8">size_t</span> <span style="color:#111">j</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span><span style="color:#111">;</span> <span style="color:#111">j</span> <span style="color:#f92672">&lt;</span> <span style="color:#111">n</span><span style="color:#111">;</span> <span style="color:#f92672">++</span><span style="color:#111">j</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5</span><span>            <span style="color:#111">y</span><span style="color:#111">[</span><span style="color:#111">i</span><span style="color:#111">]</span> <span style="color:#f92672">+=</span> <span style="color:#111">A</span><span style="color:#111">[</span><span style="color:#111">i</span> <span style="color:#f92672">*</span> <span style="color:#111">n</span> <span style="color:#f92672">+</span> <span style="color:#111">j</span><span style="color:#111">]</span> <span style="color:#f92672">*</span> <span style="color:#111">x</span><span style="color:#111">[</span><span style="color:#111">j</span><span style="color:#111">];</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6</span><span><span style="color:#111">}</span>
</span></span></code></pre></div><p>Assuming the code is built for the same target architecture, a Fortran compiler
is generally able to emit more efficient code than a C/C++ compiler for this
particular implementation.  This is because a Fortran compiler may assume that
array <code>y</code> does not <em>alias</em> <code>x</code> or <code>A</code>.  A C/C++ compiler cannot make this
assumption, which prevents it from performing certain optimizations during
code generation.  This is often the reason behind the claim that Fortran is faster
than C/C++.</p>
<h3 id="strict-aliasing">Strict Aliasing Rule in C/C++</h3>
<p>Pointers, available in certain programming languages, allow declaring
variables that &ldquo;point to&rdquo; a specific memory location by assigning them the
corresponding memory address.  <em>In general, this concept does not prevent any two
pointers from pointing to the same location in memory.</em>  These pointers may also
point to ranges in memory that overlap.  Pointers do not
exist in earlier releases of Fortran and so aliasing of memory regions is not a
concern.  This is not true for programs written in C/C++.  However, to limit the
scope of possible aliasing, a set of rules broadly known as <em>strict aliasing</em> applies:</p>
<blockquote class="emphasize"><p>Pointers with incompatible types (e.g. <code>int</code> and <code>float</code>) <em>do not</em> alias in
memory.  Dereferencing a pointer that points to an incompatible type is
<em>undefined behavior</em>.</p></blockquote>
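<p>As a short illustration of the rule, the sketch below reads the bit pattern of a <code>float</code> without dereferencing an <code>int</code>-typed pointer to it.  Copying the bytes with <code>memcpy</code> is well-defined, whereas a cast such as <code>*(uint32_t *)&amp;f</code> would violate strict aliasing.  The function name <code>float_bits</code> is hypothetical, and the code assumes a 32-bit <code>float</code>:</p>

```c
#include <stdint.h>
#include <string.h>

/* Returns the object representation of a float as a 32-bit unsigned integer.
 * Assumption: sizeof(float) == sizeof(uint32_t), which holds on common
 * platforms but is not guaranteed by the C standard. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u); /* byte-wise copy, no incompatible dereference */
    return u;
}
```

<p>On platforms that use IEEE 754 single precision, <code>float_bits(1.0f)</code> evaluates to <code>0x3f800000</code>.</p>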
<p>Consider the following function.  Its pointer parameters <code>a</code> and <code>b</code> have
compatible types, so the strict aliasing rule permits them to alias:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-c" data-lang="c"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span><span style="color:#00a8c8">int</span> <span style="color:#75af00">foo</span><span style="color:#111">(</span><span style="color:#00a8c8">int</span> <span style="color:#f92672">*</span><span style="color:#111">a</span><span style="color:#111">,</span> <span style="color:#00a8c8">int</span> <span style="color:#f92672">*</span><span style="color:#111">b</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span><span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span>    <span style="color:#f92672">*</span><span style="color:#111">a</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span><span style="color:#111">;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span>    <span style="color:#f92672">*</span><span style="color:#111">b</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span><span style="color:#111">;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5</span><span>    <span style="color:#00a8c8">return</span> <span style="color:#f92672">*</span><span style="color:#111">a</span><span style="color:#111">;</span> <span style="color:#75715e">/* what is the value of *a? */</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6</span><span><span style="color:#111">}</span>
</span></span></code></pre></div><p>The compiler does not know a priori where <code>a</code> and <code>b</code> will point to when <code>foo</code>
is called.  The return value may be <code>1</code> <em>or</em> <code>2</code> depending on what arguments are
passed into <code>foo</code>.  If pointers may alias, the compiler cannot reorder read and
write operations, which inflicts a performance penalty in cases where the pointers
never actually alias.  In the example above, the compiler must issue a read
in line 5 because the write in line 4 may change the value assigned in line 3.
If the pointers do not alias, the additional read instruction is not necessary.
The code below shows the optimized assembly code generated by GCC 14.2.1 (note
that the <code>-fstrict-aliasing</code> flag is implied for <code>-O2</code>, <code>-O3</code> and <code>-Os</code>
optimizations).  The values of pointers <code>a</code> and <code>b</code> are stored in registers
<code>rdi</code> and <code>rsi</code>, respectively.  If the values in these two registers are the
same, lines 2 and 3 below write to <em>the same</em> memory location and consequently
the compiler is required to perform a fresh read from address <code>rdi</code> when moving
the result into return register <code>eax</code> in line 4.  A total of <em>three</em> memory
operations are required to ensure correctness of this optimized code.</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-nasm" data-lang="nasm"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span><span style="color:#111">foo:</span> <span style="color:#75715e">; gcc 14.2 with flags -O3 -fstrict-aliasing</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span>        <span style="color:#75af00">mov</span>     <span style="color:#00a8c8">DWORD</span> <span style="color:#111">PTR</span> <span style="color:#111">[</span><span style="color:#111">rdi</span><span style="color:#111">],</span> <span style="color:#ae81ff">1</span>   <span style="color:#75715e">; *a = 1</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span>        <span style="color:#75af00">mov</span>     <span style="color:#00a8c8">DWORD</span> <span style="color:#111">PTR</span> <span style="color:#111">[</span><span style="color:#111">rsi</span><span style="color:#111">],</span> <span style="color:#ae81ff">2</span>   <span style="color:#75715e">; *b = 2</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span>        <span style="color:#75af00">mov</span>     <span style="color:#111">eax</span><span style="color:#111">,</span> <span style="color:#00a8c8">DWORD</span> <span style="color:#111">PTR</span> <span style="color:#111">[</span><span style="color:#111">rdi</span><span style="color:#111">]</span> <span style="color:#75715e">; load *a into register eax</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5</span><span>        <span style="color:#75af00">ret</span>
</span></span></code></pre></div><p>The compiler must be conservative with respect to optimizations when aliasing is
possible.  This is the performance trade-off for the dynamic flexibility that pointers
offer.</p>
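<p>Both outcomes can be reproduced directly.  The sketch below repeats <code>foo</code> from above and adds two hypothetical helper functions, <code>foo_aliased</code> and <code>foo_distinct</code>, that call it with and without aliasing arguments.  Both calls have well-defined behavior because the parameters are not <code>restrict</code>-qualified:</p>

```c
/* Same function as above: both parameters have type int * and may alias. */
int foo(int *a, int *b)
{
    *a = 1;
    *b = 2;
    return *a;
}

/* Hypothetical helper: both arguments refer to the same object. */
int foo_aliased(void)
{
    int x = 0;
    return foo(&x, &x); /* the write through b overwrites *a */
}

/* Hypothetical helper: the arguments refer to distinct objects. */
int foo_distinct(void)
{
    int x = 0, y = 0;
    return foo(&x, &y); /* the write through b leaves *a untouched */
}
```

<p>Here <code>foo_aliased()</code> returns <code>2</code> and <code>foo_distinct()</code> returns <code>1</code>, which is why the compiler cannot replace the final read with the constant <code>1</code>.</p>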
<p>Sometimes the possibility of aliasing is not desired, especially when performance is a
concern (the compiler must have the freedom to reorder memory operations).  In
that case a pointer can be qualified as <code>restrict</code>, which establishes a
contract between the compiler and the programmer that transfers
responsibility to the programmer.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>  More freedom (for the person who writes
the code) typically means more responsibility.  The <code>restrict</code> type qualifier
tells the compiler that a pointer <strong>will not</strong> alias other memory in the block
scope in which it is defined (from the compiler&rsquo;s perspective it is more of a liberation
than a restriction).  If we agree to this contract (assuming it is understood), we can
rewrite the code using the <code>restrict</code> type qualifier in the public API of the
function:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-c" data-lang="c"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span><span style="color:#00a8c8">int</span> <span style="color:#75af00">foo</span><span style="color:#111">(</span><span style="color:#00a8c8">int</span> <span style="color:#f92672">*</span><span style="color:#00a8c8">restrict</span> <span style="color:#111">a</span><span style="color:#111">,</span> <span style="color:#00a8c8">int</span> <span style="color:#f92672">*</span><span style="color:#00a8c8">restrict</span> <span style="color:#111">b</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span><span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span>    <span style="color:#f92672">*</span><span style="color:#111">a</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span><span style="color:#111">;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span>    <span style="color:#f92672">*</span><span style="color:#111">b</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span><span style="color:#111">;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5</span><span>    <span style="color:#00a8c8">return</span> <span style="color:#f92672">*</span><span style="color:#111">a</span><span style="color:#111">;</span> <span style="color:#75715e">/* the value of *a is definitely 1! */</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6</span><span><span style="color:#111">}</span>
</span></span></code></pre></div><p>Because of our contract, the compiler can be absolutely certain that the return
value is <code>1</code>.  <em>If something else is expected, then it is not the
compiler&rsquo;s fault.</em>  The optimized assembly now looks as follows:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-nasm" data-lang="nasm"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span><span style="color:#111">foo:</span> <span style="color:#75715e">; gcc 14.2 with flags -O3 -fstrict-aliasing</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span>        <span style="color:#75af00">mov</span>     <span style="color:#00a8c8">DWORD</span> <span style="color:#111">PTR</span> <span style="color:#111">[</span><span style="color:#111">rdi</span><span style="color:#111">],</span> <span style="color:#ae81ff">1</span>   <span style="color:#75715e">; *a = 1</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span>        <span style="color:#75af00">mov</span>     <span style="color:#111">eax</span><span style="color:#111">,</span> <span style="color:#ae81ff">1</span>               <span style="color:#75715e">; return value is 1, no question!</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span>        <span style="color:#75af00">mov</span>     <span style="color:#00a8c8">DWORD</span> <span style="color:#111">PTR</span> <span style="color:#111">[</span><span style="color:#111">rsi</span><span style="color:#111">],</span> <span style="color:#ae81ff">2</span>   <span style="color:#75715e">; *b = 2</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5</span><span>        <span style="color:#75af00">ret</span>
</span></span></code></pre></div><p>This code requires only two memory operations compared to the three required
when the pointers may alias.  Possible aliasing requires additional
memory transactions (here a read instruction) to guarantee correct code.  These
additional memory operations are one of the causes of performance degradation
in C/C++.  Fortran does not have to deal with this.</p>
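<p>Since the <code>restrict</code> contract shifts responsibility to the caller, it can be useful to verify it in debug builds.  The following is only a sketch under stated assumptions: <code>ranges_overlap</code>, <code>foo_restrict</code> and <code>foo_checked</code> are hypothetical names, and comparing addresses through <code>uintptr_t</code> is implementation-defined, although it behaves as expected on platforms with a flat address space:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if the byte ranges [p, p + np) and
 * [q, q + nq) share at least one byte.  Assumes non-empty ranges, a flat
 * address space and no wrap-around of the address arithmetic. */
int ranges_overlap(const void *p, size_t np, const void *q, size_t nq)
{
    const uintptr_t a = (uintptr_t)p;
    const uintptr_t b = (uintptr_t)q;
    return a < b + nq && b < a + np;
}

/* The restrict-qualified variant shown above. */
int foo_restrict(int *restrict a, int *restrict b)
{
    *a = 1;
    *b = 2;
    return *a;
}

/* Checks the contract before relying on it; the assert is compiled out
 * when NDEBUG is defined. */
int foo_checked(int *a, int *b)
{
    assert(!ranges_overlap(a, sizeof *a, b, sizeof *b));
    return foo_restrict(a, b);
}
```

<p>A wrapper like this costs nothing in release builds compiled with <code>-DNDEBUG</code>, and in debug builds it turns a silent contract violation into an assertion failure.</p>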
<h2 id="memory-aware-implementation">Memory Aware Implementation</h2>
<p>The naive implementation of the GEMV kernel shown at the beginning of the
<a href="#naive-implementation">previous section</a> is subject to
pointer aliasing.  The writes to memory locations pointed to by <code>y</code> may
overwrite memory that is subsequently read through <code>A</code> and/or <code>x</code>.  Note that the
<code>const</code> qualifier applied to these two pointers in the function signature <em>is
not sufficient</em> to prevent aliasing.  The optimized assembly code for this
naive implementation looks like:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;display:grid;"><code class="language-nasm" data-lang="nasm"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1</span><span><span style="color:#111">gemv:</span> <span style="color:#75715e">; gcc 14.2 with flags -O3 -ftree-vectorize</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2</span><span>        <span style="color:#75af00">test</span>    <span style="color:#111">rcx</span><span style="color:#111">,</span> <span style="color:#111">rcx</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3</span><span>        <span style="color:#75af00">je</span>      <span style="color:#111">.L1</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4</span><span>        <span style="color:#75af00">lea</span>     <span style="color:#111">r8</span><span style="color:#111">,</span> <span style="color:#111">[</span><span style="color:#ae81ff">0</span><span style="color:#f92672">+</span><span style="color:#111">rcx</span><span style="color:#f92672">*</span><span style="color:#ae81ff">4</span><span style="color:#111">]</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5</span><span>        <span style="color:#75af00">lea</span>     <span style="color:#111">r9</span><span style="color:#111">,</span> <span style="color:#111">[</span><span style="color:#111">rdx</span><span style="color:#f92672">+</span><span style="color:#111">r8</span><span style="color:#111">]</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6</span><span><span style="color:#111">.L3:</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7</span><span>        <span style="color:#75af00">movss</span>   <span style="color:#111">xmm1</span><span style="color:#111">,</span> <span style="color:#00a8c8">DWORD</span> <span style="color:#111">PTR</span> <span style="color:#111">[</span><span style="color:#111">rdx</span><span style="color:#111">]</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8</span><span>        <span style="color:#75af00">xor</span>     <span style="color:#111">eax</span><span style="color:#111">,</span> <span style="color:#111">eax</span>
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9</span><span><span style="color:#111">.L4:</span>
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10</span><span>        <span style="color:#75af00">movss</span>   <span style="color:#111">xmm0</span><span style="color:#111">,</span> <span style="color:#00a8c8">DWORD</span> <span style="color:#111">PTR</span> <span style="color:#111">[</span><span style="color:#111">rdi</span><span style="color:#f92672">+</span><span style="color:#111">rax</span><span style="color:#f92672">*</span><span style="color:#ae81ff">4</span><span style="color:#111">]</span>
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11</span><span>        <span style="color:#75af00">mulss</span>   <span style="color:#111">xmm0</span><span style="color:#111">,</span> <span style="color:#00a8c8">DWORD</span> <span style="color:#111">PTR</span> <span style="color:#111">[</span><span style="color:#111">rsi</span><span style="color:#f92672">+</span><span style="color:#111">rax</span><span style="color:#f92672">*</span><span style="color:#ae81ff">4</span><span style="color:#111">]</span>
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12</span><span>        <span style="color:#75af00">add</span>     <span style="color:#111">rax</span><span style="color:#111">,</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13</span><span>        <span style="color:#75af00">addss</span>   <span style="color:#111">xmm1</span><span style="color:#111">,</span> <span style="color:#111">xmm0</span>
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14</span><span>        <span style="color:#75af00">movss</span>   <span style="color:#00a8c8">DWORD</span> <span style="color:#111">PTR</span> <span style="color:#111">[</span><span style="color:#111">rdx</span><span style="color:#111">],</span> <span style="color:#111">xmm1</span>
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15</span><span>        <span style="color:#75af00">cmp</span>     <span style="color:#111">rcx</span><span style="color:#111">,</span> <span style="color:#111">rax</span>
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16</span><span>        <span style="color:#75af00">jne</span>     <span style="color:#111">.L4</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17</span><span>        <span style="color:#75af00">add</span>     <span style="color:#111">rdx</span><span style="color:#111">,</span> <span style="color:#ae81ff">4</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18</span><span>        <span style="color:#75af00">add</span>     <span style="color:#111">rdi</span><span style="color:#111">,</span> <span style="color:#111">r8</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19</span><span>        <span style="color:#75af00">cmp</span>     <span style="color:#111">r9</span><span style="color:#111">,</span> <span style="color:#111">rdx</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20</span><span>        <span style="color:#75af00">jne</span>     <span style="color:#111">.L3</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21</span><span><span style="color:#111">.L1:</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22</span><span>        <span style="color:#75af00">ret</span>
</span></span></code></pre></div><p>The highlighted section of lines 9&ndash;16 corresponds to the inner-most loop.  This
code iterates over index 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.854em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.05724em;">j</span></span></span></span>

 to compute the inner-product 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">ij</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span>

 and
add the result to 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>

.  For every 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.854em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.05724em;">j</span></span></span></span>

, the value 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">ij</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span>

 is loaded into
register <code>xmm0</code> (line 10), then multiplied with 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span>

 (line 11, the second
operand is loaded from memory) and finally added to 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>

 which is represented
by the accumulator register <code>xmm1</code> in line 13.  Because all three pointers refer to
<code>float</code> and may therefore alias under the strict aliasing rule, the compiler
is required to emit a store instruction in line 14 that
writes the current value of 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>

 back to memory before continuing with
reading again in lines 10 and 11.  This store is unnecessary if we can guarantee
that <code>y</code> aliases neither <code>A</code> nor <code>x</code> (these are typically distinct arrays for
the GEMV operation).  Furthermore, note that this code is trivial to vectorize
but <em>pointer aliasing prevents the compiler from doing this optimization.</em>  This
is another cause for performance degradation in C/C++ when pointer aliasing is
present. Slight modifications to this kernel implementation will help the
compiler emit more efficient code.</p>
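<p>To see why the store cannot simply be dropped, consider a minimal sketch in
which <code>y</code> deliberately aliases <code>x</code>.  The kernel is the baseline loop nest
discussed in this post; the helper functions <code>aliased_result</code> and
<code>distinct_result</code> are hypothetical and only serve as illustration.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Baseline kernel: y[i] is read and written in every iteration of the
 * inner-most loop. */
void gemv(const float *A, const float *x, float *y, const size_t n)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
}

/* Hypothetical helper: x and y are distinct arrays (the usual case). */
float distinct_result(void)
{
    const float A[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    const float x[2] = {1.0f, 1.0f};
    float y[2] = {1.0f, 1.0f};
    gemv(A, x, y, 2);
    return y[1];
}

/* Hypothetical helper: x and y are the same array.  This is legal C
 * because neither pointer is restrict-qualified. */
float aliased_result(void)
{
    const float A[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    float v[2] = {1.0f, 1.0f};
    gemv(A, v, v, 2);
    return v[1];
}
```

<p>For this 2&times;2 matrix of ones, <code>distinct_result()</code> evaluates to 3 whereas
<code>aliased_result()</code> evaluates to 8, because every update of <code>y[i]</code> is visible
through <code>x</code> in later iterations.  The compiler has to preserve this behavior
unless it is told that the arrays do not overlap.</p>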
<h3 id="memory-aware-restrict">Qualify the pointer to the destination array as <code>restrict</code></h3>
<p>As seen in the <a href="#strict-aliasing">previous section</a>, one simple
way to fix this problem is to add the <code>restrict</code> qualifier to the pointer
argument for the destination array <code>y</code> (the one being written to).  Changing the
function signature of the previous GEMV implementation to</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-c" data-lang="c"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span><span style="color:#00a8c8">void</span> <span style="color:#75af00">gemv</span><span style="color:#111">(</span><span style="color:#00a8c8">const</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#111">A</span><span style="color:#111">,</span> <span style="color:#00a8c8">const</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#111">x</span><span style="color:#111">,</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#00a8c8">restrict</span> <span style="color:#111">y</span><span style="color:#111">,</span> <span style="color:#00a8c8">const</span> <span style="color:#00a8c8">size_t</span> <span style="color:#111">n</span><span style="color:#111">)</span>
</span></span></code></pre></div><p>removes the unnecessary store instruction in the inner-most loop and allows the
compiler to emit SIMD instructions for vectorized code.  The assembly code for
the inner-most loop now looks like:<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-nasm" data-lang="nasm"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1</span><span><span style="color:#111">.L4:</span> <span style="color:#75715e">; gcc 14.2 with flags -O3 -ftree-vectorize</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2</span><span>        <span style="color:#75af00">movups</span>   <span style="color:#111">xmm0</span><span style="color:#111">,</span> <span style="color:#111">XMMWORD</span> <span style="color:#111">PTR</span> <span style="color:#111">[</span><span style="color:#111">rsi</span><span style="color:#f92672">+</span><span style="color:#111">rax</span><span style="color:#111">]</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3</span><span>        <span style="color:#75af00">movups</span>   <span style="color:#111">xmm3</span><span style="color:#111">,</span> <span style="color:#111">XMMWORD</span> <span style="color:#111">PTR</span> <span style="color:#111">[</span><span style="color:#111">rdx</span><span style="color:#f92672">+</span><span style="color:#111">rax</span><span style="color:#111">]</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4</span><span>        <span style="color:#75af00">add</span>      <span style="color:#111">rax</span><span style="color:#111">,</span> <span style="color:#ae81ff">16</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5</span><span>        <span style="color:#75af00">mulps</span>    <span style="color:#111">xmm0</span><span style="color:#111">,</span> <span style="color:#111">xmm3</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6</span><span>        <span style="color:#75af00">movaps</span>   <span style="color:#111">xmm1</span><span style="color:#111">,</span> <span style="color:#111">xmm0</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7</span><span>        <span style="color:#75af00">addss</span>    <span style="color:#111">xmm1</span><span style="color:#111">,</span> <span style="color:#111">xmm2</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8</span><span>        <span style="color:#75af00">movaps</span>   <span style="color:#111">xmm2</span><span style="color:#111">,</span> <span style="color:#111">xmm0</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9</span><span>        <span style="color:#75af00">shufps</span>   <span style="color:#111">xmm2</span><span style="color:#111">,</span> <span style="color:#111">xmm0</span><span style="color:#111">,</span> <span style="color:#ae81ff">85</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10</span><span>        <span style="color:#75af00">addss</span>    <span style="color:#111">xmm1</span><span style="color:#111">,</span> <span style="color:#111">xmm2</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11</span><span>        <span style="color:#75af00">movaps</span>   <span style="color:#111">xmm2</span><span style="color:#111">,</span> <span style="color:#111">xmm0</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12</span><span>        <span style="color:#75af00">unpckhps</span> <span style="color:#111">xmm2</span><span style="color:#111">,</span> <span style="color:#111">xmm0</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13</span><span>        <span style="color:#75af00">shufps</span>   <span style="color:#111">xmm0</span><span style="color:#111">,</span> <span style="color:#111">xmm0</span><span style="color:#111">,</span> <span style="color:#ae81ff">255</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14</span><span>        <span style="color:#75af00">addss</span>    <span style="color:#111">xmm1</span><span style="color:#111">,</span> <span style="color:#111">xmm2</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15</span><span>        <span style="color:#75af00">movaps</span>   <span style="color:#111">xmm2</span><span style="color:#111">,</span> <span style="color:#111">xmm0</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16</span><span>        <span style="color:#75af00">addss</span>    <span style="color:#111">xmm2</span><span style="color:#111">,</span> <span style="color:#111">xmm1</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17</span><span>        <span style="color:#75af00">cmp</span>      <span style="color:#111">rax</span><span style="color:#111">,</span> <span style="color:#111">rcx</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18</span><span>        <span style="color:#75af00">jne</span>      <span style="color:#111">.L4</span>
</span></span></code></pre></div><p>Lines 2 and 3 perform 128-bit loads for 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span>

 and 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">ij</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span>

, respectively,
and store the results in registers <code>xmm0</code> and <code>xmm3</code>.  These are vector loads that
operate on four 32-bit floats simultaneously (4-way SSE SIMD for the given compiler
flags; other SIMD instruction set extensions would generate a similar code
path).  Line 5 then executes a vector multiplication for
the product 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">ij</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span>

 and lines 6&ndash;16 correspond to a horizontal reduction
into the accumulator register <code>xmm2</code> (the value 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>

). The required store
instruction that was observed for the case with strict aliasing is optimized
away for this implementation.</p>
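<p>The shuffle sequence in the listing above is easier to follow with a scalar
model.  The sketch below mirrors one pass through the loop body for four
consecutive elements; <code>inorder_step</code> is a hypothetical helper and the mapping
to assembly lines in the comments is our reading of the listing, not compiler
output.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of one pass through the vectorized loop body.
 * a and x point to four consecutive floats, acc is the running y_i. */
float inorder_step(float acc, const float *a, const float *x)
{
    assert(a != NULL && x != NULL);
    float p[4];
    for (size_t l = 0; l < 4; ++l)
        p[l] = a[l] * x[l]; /* mulps: four products at once (line 5) */

    /* The four lanes are added one after another, in index order. */
    acc = acc + p[0]; /* addss on lane 0 (line 7) */
    acc = acc + p[1]; /* shufps 85 broadcasts lane 1, then addss (lines 9, 10) */
    acc = acc + p[2]; /* unpckhps brings lane 2 to the front, then addss (lines 12, 14) */
    acc = acc + p[3]; /* shufps 255 broadcasts lane 3, then addss (lines 13, 16) */
    return acc;
}
```

<p>The additions happen in the same order as in the scalar loop, so the
floating-point result is unchanged.  Only the loads and the multiplication
benefit from SIMD in this code path.</p>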
<p>The auto-vectorized code generated by GCC 14.2.1 is not ideal because it
performs the horizontal reduction for every loop iteration 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.854em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.05724em;">j</span></span></span></span>

.  A better
approach would be to use a vector register for the accumulation of 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">ij</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span>


products using <a href="https://en.wikipedia.org/wiki/FMA_instruction_set" target="_blank" rel="noreferrer noopener">FMA
instructions</a> and then
perform the horizontal reduction of the accumulator register into 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.03588em;">y</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span>

 once
after the 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.854em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.05724em;">j</span></span></span></span>

 accumulation steps. For comparison, compiling this code using
Clang 19.1.0 with the same optimization flags generates scalar
(non-SIMD) code with 4-way loop unrolling instead.</p>
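<p>A portable way to sketch the vector-accumulator idea is to keep one partial
sum per SIMD lane and reduce them once per row.  The function name
<code>gemv_lanes</code> and the constant <code>LANES</code> are assumptions made for this
illustration.  Note that this changes the order of the floating-point
additions, which is presumably why the compiler does not apply the
transformation on its own unless reassociation is permitted (for example with
GCC's <code>-fassociative-math</code>, which <code>-ffast-math</code> enables).</p>

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* assumption: four floats per 128-bit SSE register */

void gemv_lanes(const float *A, const float *x, float *restrict y, const size_t n)
{
    assert(A != NULL && x != NULL && y != NULL);
    const size_t n_vec = n - n % LANES; /* largest multiple of LANES <= n */
    for (size_t i = 0; i < n; ++i) {
        const float *row = A + i * n;
        float acc[LANES] = {0.0f}; /* one partial sum per lane */

        /* Lane-wise multiply-add; a compiler may map this loop to SIMD
         * (and to FMA instructions where the target supports them). */
        for (size_t j = 0; j < n_vec; j += LANES)
            for (size_t l = 0; l < LANES; ++l)
                acc[l] += row[j + l] * x[j + l];

        /* Horizontal reduction, executed once per row. */
        float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

        /* Scalar remainder when n is not a multiple of LANES. */
        for (size_t j = n_vec; j < n; ++j)
            sum += row[j] * x[j];

        y[i] += sum;
    }
}
```

<p>Because the summation order differs from the scalar loop, results may deviate
in the last bits for general inputs.  Whether a given compiler actually emits
vector instructions for this form has to be checked in the generated assembly.</p>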
<h3 id="thread-private-memory">Use Thread-Private Memory</h3>
<p>For the GEMV algorithm discussed here, using the <code>restrict</code> qualifier on the
destination array <code>y</code> is not necessary.  The strict aliasing rule is not a
problem for this kernel if we write the code in a way that conveys our
intent more clearly to the compiler. In the initial implementation, it is evident
that the loop index <code>i</code> is constant in the inner-most loop.  There is no need at
all to index into array <code>y</code> for every iteration <code>j</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-c" data-lang="c"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span><span style="color:#00a8c8">void</span> <span style="color:#75af00">gemv</span><span style="color:#111">(</span><span style="color:#00a8c8">const</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#111">A</span><span style="color:#111">,</span> <span style="color:#00a8c8">const</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#111">x</span><span style="color:#111">,</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#111">y</span><span style="color:#111">,</span> <span style="color:#00a8c8">const</span> <span style="color:#00a8c8">size_t</span> <span style="color:#111">n</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span><span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span>    <span style="color:#00a8c8">for</span> <span style="color:#111">(</span><span style="color:#00a8c8">size_t</span> <span style="color:#111">i</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span><span style="color:#111">;</span> <span style="color:#111">i</span> <span style="color:#f92672">&lt;</span> <span style="color:#111">n</span><span style="color:#111">;</span> <span style="color:#f92672">++</span><span style="color:#111">i</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span>        <span style="color:#00a8c8">for</span> <span style="color:#111">(</span><span style="color:#00a8c8">size_t</span> <span style="color:#111">j</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span><span style="color:#111">;</span> <span style="color:#111">j</span> <span style="color:#f92672">&lt;</span> <span style="color:#111">n</span><span style="color:#111">;</span> <span style="color:#f92672">++</span><span style="color:#111">j</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5</span><span>            <span style="color:#111">y</span><span style="color:#111">[</span><span style="color:#111">i</span><span style="color:#111">]</span> <span style="color:#f92672">+=</span> <span style="color:#111">A</span><span style="color:#111">[</span><span style="color:#111">i</span> <span style="color:#f92672">*</span> <span style="color:#111">n</span> <span style="color:#f92672">+</span> <span style="color:#111">j</span><span style="color:#111">]</span> <span style="color:#f92672">*</span> <span style="color:#111">x</span><span style="color:#111">[</span><span style="color:#111">j</span><span style="color:#111">];</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6</span><span><span style="color:#111">}</span>
</span></span></code></pre></div><p>Instead of writing the initial code as shown above, it is better to use an
accumulator for which the compiler will allocate a register (which is
thread-private).  There is no need to use a <em>shared resource</em> such as <code>y[i]</code> to
perform the accumulation.  A better GEMV implementation would thus be</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-c" data-lang="c"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1</span><span><span style="color:#00a8c8">void</span> <span style="color:#75af00">gemv</span><span style="color:#111">(</span><span style="color:#00a8c8">const</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#111">A</span><span style="color:#111">,</span> <span style="color:#00a8c8">const</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#111">x</span><span style="color:#111">,</span> <span style="color:#00a8c8">float</span> <span style="color:#f92672">*</span><span style="color:#111">y</span><span style="color:#111">,</span> <span style="color:#00a8c8">const</span> <span style="color:#00a8c8">size_t</span> <span style="color:#111">n</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2</span><span><span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3</span><span>    <span style="color:#00a8c8">for</span> <span style="color:#111">(</span><span style="color:#00a8c8">size_t</span> <span style="color:#111">i</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span><span style="color:#111">;</span> <span style="color:#111">i</span> <span style="color:#f92672">&lt;</span> <span style="color:#111">n</span><span style="color:#111">;</span> <span style="color:#f92672">++</span><span style="color:#111">i</span><span style="color:#111">)</span> <span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4</span><span>        <span style="color:#00a8c8">float</span> <span style="color:#111">inner_product</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.0f</span><span style="color:#111">;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5</span><span>        <span style="color:#00a8c8">for</span> <span style="color:#111">(</span><span style="color:#00a8c8">size_t</span> <span style="color:#111">j</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span><span style="color:#111">;</span> <span style="color:#111">j</span> <span style="color:#f92672">&lt;</span> <span style="color:#111">n</span><span style="color:#111">;</span> <span style="color:#f92672">++</span><span style="color:#111">j</span><span style="color:#111">)</span> <span style="color:#111">{</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6</span><span>            <span style="color:#111">inner_product</span> <span style="color:#f92672">+=</span> <span style="color:#111">A</span><span style="color:#111">[</span><span style="color:#111">i</span> <span style="color:#f92672">*</span> <span style="color:#111">n</span> <span style="color:#f92672">+</span> <span style="color:#111">j</span><span style="color:#111">]</span> <span style="color:#f92672">*</span> <span style="color:#111">x</span><span style="color:#111">[</span><span style="color:#111">j</span><span style="color:#111">];</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7</span><span>        <span style="color:#111">}</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8</span><span>        <span style="color:#111">y</span><span style="color:#111">[</span><span style="color:#111">i</span><span style="color:#111">]</span> <span style="color:#f92672">+=</span> <span style="color:#111">inner_product</span><span style="color:#111">;</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9</span><span>    <span style="color:#111">}</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10</span><span><span style="color:#111">}</span>
</span></span></code></pre></div><p>Note that no <code>restrict</code> qualifier is used for this implementation.  The
resulting assembly code is identical to the one obtained with the <code>restrict</code>-qualified
destination array in the <a href="#memory-aware-restrict">previous section</a>.  The code is more elegant because the <em>intention</em> is
expressed clearly by the use of an accumulator in thread-private memory. Using
thread-private memory whenever possible is furthermore important for writing
multi-threaded code where
<a href="https://en.wikipedia.org/wiki/Race_condition" target="_blank" rel="noreferrer noopener"><em>race conditions</em></a> may be an
issue.</p>
<h3 id="fortran-implementation">Fortran Implementation</h3>
<p>Now it is time to compare against the Fortran equivalent of the GEMV kernel.
Because Fortran uses
<a href="https://en.wikipedia.org/wiki/Row-_and_column-major_order" target="_blank" rel="noreferrer noopener"><em>column-major</em></a>
storage, it is assumed that the matrix 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal">A</span></span></span></span>

 is <em>symmetric</em> for the remainder of
this post.  The following code is taken from the file <code>sgemv.f</code> of the
<a href="https://www.netlib.org/blas/#_reference_blas_version_3_12_0" target="_blank" rel="noreferrer noopener">reference BLAS 3.12.0
library</a> with some
code removed to match the simplified GEMV variant discussed in the <a href="#kernel">Test
Kernel</a> section:</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fortran" data-lang="fortran"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1</span><span>      <span style="color:#00a8c8">SUBROUTINE</span> <span style="color:#111">SGEMV</span><span style="color:#111">(</span><span style="color:#111">A</span><span style="color:#111">,</span><span style="color:#111">X</span><span style="color:#111">,</span><span style="color:#111">Y</span><span style="color:#111">,</span><span style="color:#111">N</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2</span><span>      <span style="color:#00a8c8">INTEGER</span><span style="color:#111">(</span><span style="color:#ae81ff">4</span><span style="color:#111">)</span> <span style="color:#111">N</span><span style="color:#111">,</span><span style="color:#111">I</span><span style="color:#111">,</span><span style="color:#111">J</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3</span><span>      <span style="color:#00a8c8">REAL</span><span style="color:#111">(</span><span style="color:#ae81ff">4</span><span style="color:#111">)</span> <span style="color:#111">A</span><span style="color:#111">(</span><span style="color:#111">N</span><span style="color:#111">,</span><span style="color:#f92672">*</span><span style="color:#111">),</span><span style="color:#111">X</span><span style="color:#111">(</span><span style="color:#f92672">*</span><span style="color:#111">),</span><span style="color:#111">Y</span><span style="color:#111">(</span><span style="color:#f92672">*</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4</span><span>      <span style="color:#00a8c8">REAL</span><span style="color:#111">(</span><span style="color:#ae81ff">4</span><span style="color:#111">)</span> <span style="color:#111">TEMP</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5</span><span><span style="color:#75715e">#ifdef TRANSPOSE
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6</span><span><span style="color:#75715e"></span><span style="color:#75715e">! Form  y := A^T*x + y.
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7</span><span><span style="color:#75715e"></span>      <span style="color:#00a8c8">DO</span> <span style="color:#ae81ff">100</span> <span style="color:#111">J</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span><span style="color:#111">,</span><span style="color:#111">N</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8</span><span>      <span style="color:#111">TEMP</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.0</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9</span><span>      <span style="color:#00a8c8">DO</span> <span style="color:#ae81ff">90</span> <span style="color:#111">I</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span><span style="color:#111">,</span><span style="color:#111">N</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10</span><span>      <span style="color:#111">TEMP</span> <span style="color:#f92672">=</span> <span style="color:#111">TEMP</span> <span style="color:#f92672">+</span> <span style="color:#111">A</span><span style="color:#111">(</span><span style="color:#111">I</span><span style="color:#111">,</span><span style="color:#111">J</span><span style="color:#111">)</span><span style="color:#f92672">*</span><span style="color:#111">X</span><span style="color:#111">(</span><span style="color:#111">I</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11</span><span>   <span style="color:#ae81ff">90</span>             <span style="color:#00a8c8">CONTINUE</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12</span><span>      <span style="color:#111">Y</span><span style="color:#111">(</span><span style="color:#111">J</span><span style="color:#111">)</span> <span style="color:#f92672">=</span> <span style="color:#111">Y</span><span style="color:#111">(</span><span style="color:#111">J</span><span style="color:#111">)</span> <span style="color:#f92672">+</span> <span style="color:#111">TEMP</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13</span><span>  <span style="color:#ae81ff">100</span>         <span style="color:#00a8c8">CONTINUE</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14</span><span><span style="color:#75715e">#else
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15</span><span><span style="color:#75715e"></span><span style="color:#75715e">! Form  y := A*x + y.
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16</span><span><span style="color:#75715e"></span>      <span style="color:#00a8c8">DO</span> <span style="color:#ae81ff">60</span> <span style="color:#111">J</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span><span style="color:#111">,</span><span style="color:#111">N</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17</span><span>      <span style="color:#111">TEMP</span> <span style="color:#f92672">=</span> <span style="color:#111">X</span><span style="color:#111">(</span><span style="color:#111">J</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18</span><span>      <span style="color:#00a8c8">DO</span> <span style="color:#ae81ff">50</span> <span style="color:#111">I</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span><span style="color:#111">,</span><span style="color:#111">N</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19</span><span>      <span style="color:#111">Y</span><span style="color:#111">(</span><span style="color:#111">I</span><span style="color:#111">)</span> <span style="color:#f92672">=</span> <span style="color:#111">Y</span><span style="color:#111">(</span><span style="color:#111">I</span><span style="color:#111">)</span> <span style="color:#f92672">+</span> <span style="color:#111">TEMP</span><span style="color:#f92672">*</span><span style="color:#111">A</span><span style="color:#111">(</span><span style="color:#111">I</span><span style="color:#111">,</span><span style="color:#111">J</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20</span><span>   <span style="color:#ae81ff">50</span>             <span style="color:#00a8c8">CONTINUE</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21</span><span>   <span style="color:#ae81ff">60</span>         <span style="color:#00a8c8">CONTINUE</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22</span><span><span style="color:#75715e">#endif
</span></span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23</span><span><span style="color:#75715e"></span>      <span style="color:#00a8c8">RETURN</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24</span><span>      <span style="color:#00a8c8">END</span>
</span></span></code></pre></div><p>As mentioned above, the equivalent Fortran algorithm is the branch compiled when <code>TRANSPOSE</code> is
defined (lines 7&ndash;13).  Compiling this code with GFortran 14.2.1 again yields
assembly code identical to that obtained for the memory aware C/C++ versions.
What differs in Fortran is that dummy arguments are assumed not to alias, such that a loop
structure without the <code>TEMP</code> accumulator</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fortran" data-lang="fortran"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1</span><span>      <span style="color:#00a8c8">DO</span> <span style="color:#ae81ff">100</span> <span style="color:#111">J</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span><span style="color:#111">,</span><span style="color:#111">N</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2</span><span>      <span style="color:#00a8c8">DO</span> <span style="color:#ae81ff">90</span> <span style="color:#111">I</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span><span style="color:#111">,</span><span style="color:#111">N</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3</span><span>      <span style="color:#111">Y</span><span style="color:#111">(</span><span style="color:#111">J</span><span style="color:#111">)</span> <span style="color:#f92672">=</span> <span style="color:#111">Y</span><span style="color:#111">(</span><span style="color:#111">J</span><span style="color:#111">)</span> <span style="color:#f92672">+</span> <span style="color:#111">A</span><span style="color:#111">(</span><span style="color:#111">I</span><span style="color:#111">,</span><span style="color:#111">J</span><span style="color:#111">)</span><span style="color:#f92672">*</span><span style="color:#111">X</span><span style="color:#111">(</span><span style="color:#111">I</span><span style="color:#111">)</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4</span><span>   <span style="color:#ae81ff">90</span>             <span style="color:#00a8c8">CONTINUE</span>
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5</span><span>  <span style="color:#ae81ff">100</span>         <span style="color:#00a8c8">CONTINUE</span>
</span></span></code></pre></div><p>still yields identical assembly code.</p>
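<p>The same no-alias promise can be stated explicitly in C with the <code>restrict</code> qualifier (see the footnote on <code>restrict</code>).  The following is only a minimal sketch: the function name <code>gemv_t_restrict</code> and the column-major indexing <code>a[i + j * n]</code> are assumptions chosen to mirror the Fortran loop above, not code taken from the benchmark repository.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the benchmark repository):
 * y := A^T*x + y for a column-major n-by-n matrix, mirroring the Fortran
 * loop without the TEMP accumulator.  The restrict qualifiers promise the
 * compiler that a, x and y do not overlap, so it is free to keep y[j] in a
 * register for the duration of the inner loop. */
void gemv_t_restrict(const float *restrict a, const float *restrict x,
                     float *restrict y, size_t n)
{
    assert(a != NULL && x != NULL && y != NULL);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < n; ++i) {
            y[j] += a[i + j * n] * x[i]; /* A(I,J) in Fortran indexing */
        }
    }
}
```

<p>Calling the function with overlapping arrays would break the promise made by <code>restrict</code> and result in undefined behavior, which is the price for giving the compiler the same freedom it has in Fortran.</p>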
<h2 id="benchmark">Benchmark</h2>
<p>In this section I benchmark the different C/C++ implementations
discussed above and compare them to the BLAS Fortran implementation.  The main metric
of interest is the operational intensity discussed in a <a href="#OI">previous
section</a>. The necessary instruction counts for flops and
memory accesses are determined using hardware counters via the
<a href="https://icl.utk.edu/papi/" target="_blank" rel="noreferrer noopener">PAPI</a> library.  The benchmark results shown below
are obtained using an <a href="https://www.intel.com/content/www/us/en/products/sku/126684/intel-core-i78700k-processor-12m-cache-up-to-4-70-ghz/specifications.html" target="_blank" rel="noreferrer noopener">Intel Core
i7-8700K</a>
CPU running at 3.70 GHz and the problem size is set to 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord mathnormal">n</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">10000</span></span></span></span>

.  The
benchmark code can be found
<a href="https://git.0xfab.ch/benchmark-fortran-c/log.html" target="_blank" rel="noreferrer noopener">here</a>.</p>
<p>Assuming the upper bound estimate discussed in an <a href="#OI">earlier section</a>, a total of 
  
  <span class="katex"><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8974em;vertical-align:-0.0833em;"></span><span class="mord">2</span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">2</span><span class="mord mathnormal">n</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">200020000</span></span></span></span>

 memory instructions are to be
expected.  The naive C/C++ kernel with strict aliasing generates the following
benchmark output (compiled with GCC 14.2.1 and <code>-O2</code> flag):</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;display:grid;"><code class="language-text" data-lang="text"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1</span><span>Total cycles:                 427373621
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2</span><span>Total instructions:           700071682
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3</span><span>Instructions per cycle (IPC): 1.64
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4</span><span>L1 cache size:                32 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5</span><span>L2 cache size:                256 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6</span><span>L3 cache size:                12288 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7</span><span>Total problem size:           390703 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8</span><span>Total L1 data misses:         12515647
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9</span><span>Total load/store:             300010748 (expected: 200020000)
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10</span><span>Operational intensity:        1.666690e-01 (expected: 2.499875e-01)
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11</span><span>Performance [GFlop/s]:        2.204652e+00
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12</span><span>Wall-time   [micro-seconds]:  9.072179e+04
</span></span></code></pre></div><p>The measured number of load/store instructions exceeds the expected value by about 100
million because of the additional store instruction required in every iteration of the inner-most
loop due to the strict aliasing rule (see <a href="#memory-aware-implementation">Memory Aware Implementation</a>).  The additional store instructions
reduce the observed operational intensity accordingly.  The measurement for the
kernel without strict aliasing produces the expected values (compiled with GCC
14.2.1 and <code>-O2</code> flag):</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;display:grid;"><code class="language-text" data-lang="text"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1</span><span>Total cycles:                 421014858
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2</span><span>Total instructions:           600101681
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3</span><span>Instructions per cycle (IPC): 1.43
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4</span><span>L1 cache size:                32 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5</span><span>L2 cache size:                256 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6</span><span>L3 cache size:                12288 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7</span><span>Total problem size:           390703 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8</span><span>Total L1 data misses:         12520995
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9</span><span>Total load/store:             200020748 (expected: 200020000)
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10</span><span>Operational intensity:        2.499866e-01 (expected: 2.499875e-01)
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11</span><span>Performance [GFlop/s]:        2.237932e+00
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12</span><span>Wall-time   [micro-seconds]:  8.937267e+04
</span></span></code></pre></div><p>The roofline plot below summarizes measurements obtained for single core
execution using different compilers as well as the C and Fortran implementations
of the GEMV test kernel.  The following compilers have been used on an Arch
Linux system with kernel 6.12.4:</p>
<table>
  <thead>
      <tr>
          <th>Legend entry</th>
          <th>Compiler</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>gcc-14</code></td>
          <td><a href="https://gcc.gnu.org/" target="_blank" rel="noreferrer noopener">GNU C</a> compiler (version 14.2.1)</td>
      </tr>
      <tr>
          <td><code>gfortran-14</code></td>
          <td><a href="https://gcc.gnu.org/" target="_blank" rel="noreferrer noopener">GNU Fortran</a> compiler (version 14.2.1)</td>
      </tr>
      <tr>
          <td><code>clang-18</code></td>
          <td><a href="https://clang.llvm.org/" target="_blank" rel="noreferrer noopener">LLVM C</a> compiler (version 18.1.8)</td>
      </tr>
      <tr>
          <td><code>flang-18</code></td>
          <td><a href="https://flang.llvm.org/" target="_blank" rel="noreferrer noopener">LLVM Fortran</a> compiler (version 18.1.8)</td>
      </tr>
      <tr>
          <td><code>nvfortran-24</code></td>
          <td><a href="https://developer.nvidia.com/hpc-sdk" target="_blank" rel="noreferrer noopener">Nvidia HPC SDK Fortran</a> compiler (version 24.7, PGI legacy)</td>
      </tr>
  </tbody>
</table>
<p>All executables are built using <code>-O2</code> optimization targeting scalar code without
SIMD (AVX2) or FMA instructions (additional flags used with the Nvidia Fortran
compiler are <code>-acc=host</code> and <code>-mno-fma</code>):</p>
<img src="./figs/noopt_intel_i7_8700k.svg" alt="Intel Core i7-8700K single core"><p>The roofline shows that none of the test executables reach the hardware limit
for the given level of optimization.  The C implementations correspond to the
<code>gcc-14</code> and <code>clang-18</code> legend entries, where the <a href="#naive-implementation">naive implementations</a> are indicated with <em>strict aliasing</em>.  The
executable generated with the Nvidia Fortran compiler issues 4x more
load/store instructions than expected, which results in slightly worse
performance compared to the other test executables.  (It is not clear why the
compiler does this, but removing the <code>-mno-fma</code> flag immediately results in
aggressively optimized code, see <a href="#benchmark-opt">next section</a>.)</p>
<p>The overall performance for all test executables is roughly identical at this
level of optimization, irrespective of the strict aliasing rule.  Given the
upper bound estimate for the operational intensity of the GEMV test kernel, the
fastest executable is limited by the bandwidth ceiling which corresponds to
about 10.4 GFlop/s on this Intel architecture.  A C compiler that is
compliant with the strict aliasing rule (even when aggressive optimizations are
enabled) is unlikely to generate better code than what is observed for <code>gcc-14</code>
above. Therefore, the <a href="#naive-implementation">naive C implementation</a> of the GEMV test kernel is expected to be about 5x
slower than this bandwidth ceiling due to the strict aliasing rule in C/C++.</p>
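<p>The bandwidth ceiling used in this argument follows from the roofline model: the attainable performance is the smaller of the compute peak and the product of operational intensity and memory bandwidth.  The sketch below illustrates this under stated assumptions: <code>roofline_gflops</code> and its parameter names are hypothetical, and the bandwidth of 41.6 GB/s used in the note after the code is simply the value implied by a 10.4 GFlop/s ceiling at an intensity of 0.25 flop/byte.</p>

```c
#include <assert.h>

/* Roofline estimate (hypothetical helper): the attainable GFlop/s is bounded
 * by the compute peak and by operational intensity times memory bandwidth. */
double roofline_gflops(double oi_flop_per_byte, double bandwidth_gbyte_per_s,
                       double peak_gflops)
{
    assert(oi_flop_per_byte > 0.0 && bandwidth_gbyte_per_s > 0.0);
    double memory_bound = oi_flop_per_byte * bandwidth_gbyte_per_s;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

<p>For an intensity of 0.25 flop/byte and a bandwidth of 41.6 GB/s the function returns about 10.4 GFlop/s for any compute peak above that value, which is why the kernel is memory bound regardless of how fast the floating point units are.</p>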
<h3 id="benchmark-opt">With Compiler Optimization</h3>
<p>More aggressive optimizations can be enabled by allowing the compiler to use
SIMD instructions via auto-vectorization as well as FMA instructions and other
optimization techniques of which some may disregard strict standard compliance.
The optimization flags used for the following measurements are:</p>
<table>
  <thead>
      <tr>
          <th>Legend entry</th>
          <th>Optimization flags</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>gcc-14</code></td>
          <td><code>-Ofast -mavx2 -mfma -march=native -mtune=native -funroll-all-loops</code></td>
      </tr>
      <tr>
          <td><code>gfortran-14</code></td>
          <td><code>-Ofast -mavx2 -mfma -march=native -mtune=native -funroll-all-loops</code></td>
      </tr>
      <tr>
          <td><code>clang-18</code></td>
          <td><code>-Ofast -mavx2 -mfma -march=native -mtune=native -funroll-all-loops</code></td>
      </tr>
      <tr>
          <td><code>flang-18</code></td>
          <td><code>-O3 -ffast-math -fstack-arrays -march=core-avx2</code></td>
      </tr>
      <tr>
          <td><code>nvfortran-24</code></td>
          <td><code>-O3 -acc=host</code></td>
      </tr>
  </tbody>
</table>
<p>These are rather aggressive optimization flags which may not always be suitable
for production code.  The benchmark output for the C kernel implementation with
strict aliasing now looks as follows (compiled with GCC 14.2.1 and the optimization
flags for <code>gcc-14</code> shown in the table above):</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;display:grid;"><code class="language-text" data-lang="text"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1</span><span>Total cycles:                 421750341
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2</span><span>Total instructions:           337611682
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3</span><span>Instructions per cycle (IPC): 0.80
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4</span><span>L1 cache size:                32 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5</span><span>L2 cache size:                256 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6</span><span>L3 cache size:                12288 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7</span><span>Total problem size:           390703 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8</span><span>Total L1 data misses:         12520574
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9</span><span>Total load/store:             300010663 (expected: 200020000)
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10</span><span>Operational intensity:        1.666691e-01 (expected: 2.499875e-01)
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11</span><span>Performance [GFlop/s]:        2.230897e+00
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12</span><span>Wall-time   [micro-seconds]:  8.965453e+04
</span></span></code></pre></div><p>The total number of cycles remains about the same while the total number of
instructions executed has roughly halved due to optimization passes, reducing
the average number of instructions per cycle (IPC) by the same factor.  The
number of load/store instructions remains the same, indicating that the
optimized version is still issuing scalar load/stores, which is a consequence of
the strict aliasing rule.  The numbers show that the GNU C compiler honors the
strict aliasing rule even when aggressive optimizations are enabled and thus
does not have much freedom to optimize the code.  Since the wall-time is unchanged
while the IPC has halved, IPC is evidently not a meaningful performance metric for this memory bound code.</p>
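<p>The IPC values can be recomputed from the cycle and instruction counts printed in the two strict aliasing outputs.  The helper below is a small sketch for that arithmetic; the name <code>ipc</code> is hypothetical and not part of the benchmark code.</p>

```c
#include <assert.h>

/* Instructions per cycle from two hardware counter values
 * (hypothetical helper for the arithmetic discussed above). */
double ipc(long long instructions, long long cycles)
{
    assert(cycles > 0);
    return (double)instructions / (double)cycles;
}
```

<p>With the counts from the <code>-O2</code> build, <code>ipc(700071682LL, 427373621LL)</code> evaluates to about 1.64, and with the counts from the optimized build, <code>ipc(337611682LL, 421750341LL)</code> evaluates to about 0.80, while the two cycle counts differ by less than 2%.</p>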
<p>The benchmark output for the C kernel implementation using thread private memory
for the loop accumulator looks like the following (compiled with GCC 14.2.1 and
the optimization flags for <code>gcc-14</code> shown in the table above):</p>
<div class="highlight"><pre tabindex="0" style="color:#272822;background-color:#fafafa;-moz-tab-size:4;-o-tab-size:4;tab-size:4;display:grid;"><code class="language-text" data-lang="text"><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1</span><span>Total cycles:                 83715295
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2</span><span>Total instructions:           30031680
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3</span><span>Instructions per cycle (IPC): 0.36
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4</span><span>L1 cache size:                32 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5</span><span>L2 cache size:                256 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6</span><span>L3 cache size:                12288 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7</span><span>Total problem size:           390703 KB
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8</span><span>Total L1 data misses:         12537289
</span></span><span style="display:flex; background-color:#e1e1e1"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9</span><span>Total load/store:             25020760 (expected: 200020000)
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10</span><span>Operational intensity:        1.998440e+00 (expected: 2.499875e-01)
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11</span><span>Performance [GFlop/s]:        1.121076e+01
</span></span><span style="display:flex;"><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12</span><span>Wall-time   [micro-seconds]:  1.784090e+04
</span></span></code></pre></div><p>Here the compiler was able to use AVX2 SIMD instructions that operate on 256-bit
vector registers.  Since the test code uses 32-bit (single precision)
floating point data, there are 8 SIMD lanes per register.  Each load/store
instruction therefore operates on 8 elements simultaneously, which explains the
8x fewer load/store instructions.  The lower instruction count distorts the reported
operational intensity, which appears to be 8x larger than expected (because it is
calculated assuming scalar memory instructions).  The <em>total number of bytes
transferred</em> remains the same however, and this is the quantity that defines the
denominator of the operational intensity.</p>
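<p>The effect of counting instructions instead of bytes can be checked with a few lines of arithmetic.  The sketch below makes two assumptions that are not stated in the benchmark output: a flop count of 200010000 (which reproduces the intensities printed above) and that every counted load/store instruction in the vectorized build moves a full 256-bit register, i.e. 32 bytes.  The function name <code>oi_from_counts</code> is hypothetical.</p>

```c
#include <assert.h>

/* Operational intensity in flop/byte from an instruction count, assuming
 * every load/store instruction moves bytes_per_access bytes
 * (hypothetical helper; 4 bytes for scalar float, 32 bytes for AVX2). */
double oi_from_counts(double flops, double load_store, double bytes_per_access)
{
    assert(load_store > 0.0 && bytes_per_access > 0.0);
    return flops / (load_store * bytes_per_access);
}
```

<p>For the measured 25020760 load/store instructions, assuming 4 bytes per access gives about 2.0 flop/byte (the printed value), whereas assuming 32 bytes per access gives about 0.25 flop/byte, which is close to the expected intensity.</p>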
<p>The roofline for the measurements with optimized executables looks like this:</p>
<img src="./figs/opt_intel_i7_8700k.svg" alt="Intel Core i7-8700K single core optimized"><p>The optimized executables reach the hardware ceiling, with the exception of the
naive C implementation that enforces the strict aliasing rule.  Interestingly,
this exception only applies to the GNU C compiler.  The LLVM C compiler seems to
ignore the strict aliasing rule for the optimization flags specified in the
table above (one of those optimizations that <em>is not</em> compliant with the
standard mentioned above).  These results show that the performance limiting
factor should be the hardware rather than the programming language or compiler.
Therefore, Fortran is not faster than C.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>For a more thorough discussion see <a href="https://dl.acm.org/doi/pdf/10.1145/1498765.1498785" target="_blank" rel="noreferrer noopener">this
paper</a>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>See the 3C&rsquo;s (compulsory, capacity and conflict misses), for example <a href="https://en.wikipedia.org/wiki/Cache_performance_measurement_and_metric" target="_blank" rel="noreferrer noopener">on
Wikipedia</a>.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a>&#160;<a href="#fnref1:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3">
<p>The <code>restrict</code> type qualifier was introduced in C99.  In C++ <code>restrict</code> is
not a keyword but many compilers support it via built-in extensions. GCC
support is provided by the built-in <code>__restrict</code> and <code>__restrict__</code> type
qualifiers.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:4">
<p>The listing only shows the vectorized code path of the inner-most loop
assuming the vector size <code>n</code> is an integer multiple of 4.&#160;<a href="#fnref:4" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div> ]]></description></item><item><title>Harvard Computer Science Classes</title><link rel="alternate" type="text/html" href="https://0xfab.ch/2024/08/harvard-classes/"/><pubDate>Thu, 15 Aug 2024 18:13:10 +0100</pubDate><author>info@0xfab.ch (Fabian Wermelinger)</author><guid>https://0xfab.ch/2024/08/harvard-classes/</guid><category term="lectures"/><category term="slides"/><category term="harvard"/><category term="performance"/><category term="python"/><category term="hpc"/><category term="c"/><category term="c++"/><description><![CDATA[ <p>From Fall 2021 to Summer 2023 I taught Computer Science classes at the
Harvard John A. Paulson School of Engineering and Applied Sciences
(<a href="https://seas.harvard.edu/" target="_blank" rel="noreferrer noopener">SEAS</a>). Below you can find lecture slides and
auxiliary material for the classes I was responsible for.</p>
<h2 id="cs205-high-performance-computing-for-science-and-engineering">CS205: High-Performance Computing for Science and Engineering</h2>
<p>This is a graduate-level class with a focus on parallel programming (shared and
distributed memory models) on CPU architectures, exploiting different forms of
parallelism such as data-level, task-level, thread-level and instruction-level.
Before teaching this class at Harvard, I taught part of the syllabus at ETH
Zurich.  The class website for the 2023 term can be found at
<a href="https://harvard-iacs.github.io/2023-CS205/" target="_blank" rel="noreferrer noopener">https://harvard-iacs.github.io/2023-CS205</a>.</p>
<p>For an overview of the topics, see the <a href="./cs205/cs205_syllabus.pdf">class
syllabus</a>.  Zip archives for the
<a href="./cs205/cs205_homework.zip">homework</a> and <a href="./cs205/cs205_lab.zip">lab</a>
assignments are available via the links.</p>
<p>
<table class="center" style="width:90%; font-family:Inconsolata,monospace;">
<colgroup>
<col style="width:33%">
<col style="width:33%">
<col style="width:33%">
</colgroup>
<tbody>
<tr><td><a href="/2024/08/harvard-classes/cs205/lecture01.pdf">lecture01.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture02.pdf">lecture02.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture03.pdf">lecture03.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs205/lecture04.pdf">lecture04.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture05.pdf">lecture05.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture06.pdf">lecture06.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs205/lecture07_08.pdf">lecture07_08.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture09.pdf">lecture09.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture10.pdf">lecture10.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs205/lecture11.pdf">lecture11.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture12.pdf">lecture12.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture13.pdf">lecture13.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs205/lecture14.pdf">lecture14.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture15.pdf">lecture15.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture16.pdf">lecture16.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs205/lecture17.pdf">lecture17.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture18_19.pdf">lecture18_19.pdf</a></td><td><a href="/2024/08/harvard-classes/cs205/lecture20.pdf">lecture20.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs205/lecture21.pdf">lecture21.pdf</a></td></tr>
</tbody>
</table>
</p>

<p>A Git repository with the example code used
in the class can be found <a href="https://git.0xfab.ch/cs205-lecture-examples/log.html" target="_blank" rel="noreferrer noopener">here</a>.</p>
<h2 id="cs107-systems-development-for-computational-science">CS107: Systems Development for Computational Science</h2>
<p>This is an undergraduate-level class with a focus on software development
practices as well as on teaching the foundations of the Python programming
language.  The class website for the 2022 term can be found at
<a href="https://harvard-iacs.github.io/2022-CS107/" target="_blank" rel="noreferrer noopener">https://harvard-iacs.github.io/2022-CS107</a>.</p>
<p>For an overview of the topics, see the <a href="./cs107/cs107_syllabus.pdf">class
syllabus</a>.  Zip archives for the
<a href="./cs107/cs107_homework.zip">homework</a> and <a href="./cs107/cs107_lab.zip">lab</a>
assignments are available via the links.</p>
<p>
<table class="center" style="width:90%; font-family:Inconsolata,monospace;">
<colgroup>
<col style="width:33%">
<col style="width:33%">
<col style="width:33%">
</colgroup>
<tbody>
<tr><td><a href="/2024/08/harvard-classes/cs107/lecture01.pdf">lecture01.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture02.pdf">lecture02.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture03.pdf">lecture03.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs107/lecture04.pdf">lecture04.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture05.pdf">lecture05.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture06.pdf">lecture06.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs107/lecture07.pdf">lecture07.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture08.pdf">lecture08.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture09.pdf">lecture09.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs107/lecture10.pdf">lecture10.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture11.pdf">lecture11.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture12.pdf">lecture12.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs107/lecture13.pdf">lecture13.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture14.pdf">lecture14.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture15.pdf">lecture15.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs107/lecture16.pdf">lecture16.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture17.pdf">lecture17.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture18.pdf">lecture18.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs107/lecture19.pdf">lecture19.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture20.pdf">lecture20.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture21.pdf">lecture21.pdf</a></td></tr>
    <tr><td><a href="/2024/08/harvard-classes/cs107/lecture22.pdf">lecture22.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture23.pdf">lecture23.pdf</a></td><td><a href="/2024/08/harvard-classes/cs107/lecture24.pdf">lecture24.pdf</a></td></tr>
</tbody>
</table>
</p>

<p>A Git repository with the example code used
in the class can be found <a href="https://git.0xfab.ch/cs107-lecture-examples/log.html" target="_blank" rel="noreferrer noopener">here</a>.</p> ]]></description></item></channel></rss>