
Why semantic markup?

*

Offline Phil

  • Global Moderator
  • Sr. Member
  • *****
  • 437
Why semantic markup?
« October 04, 2017, 05:11:44 PM »
Semantic markup is the practice of tagging text with why that text is there, rather than simply "this is how it looks" (styling or presentation markup). This gives structure to text source, which can be useful in at least two ways:

  • Rather than repeating the "how it looks" (appearance) information with every use, it is consolidated into one place for consistency and easy changes to the appearance.
  • The markup can be searched, to change or inventory particular uses of it.
If you wanted a certain look to, say, your chapter headings, would you rather do the following each time you started a new chapter?

  • Give the command to skip (where necessary) to the top of a right-hand (odd-numbered) page.
  • Skip down several lines.
  • Give the chapter number right justified, in 30pt Cooper Black.
  • Next line, give the heading My Chapter Title right justified, in 15pt Cooper Black.
  • Skip down several lines.
  • Start your first paragraph.
or,
  • Give markup such as <chap_start>My Chapter Title</chap_start>.
  • Start your first paragraph.

In the second case, some sort of "style file" (such as CSS) knows how you want your chapters started and styled. If you don't like the look of it, you change things in one place — perhaps a different typeface, or a different size. And the markup language keeps track of the chapter numbers for you. You wouldn't believe how many people think it's easier to just do it the first way!
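To make the second case concrete, here is a minimal Python sketch of what such a processor might do. Everything in it is an assumption for illustration: the CHAPTER_STYLE dictionary stands in for a "style file", render_chapters is a made-up name, and the bracketed output lines are placeholders for real typesetting commands.

```python
import re

# Hypothetical "style file": the one place that decides how a chapter start looks.
CHAPTER_STYLE = {"font": "Cooper Black", "number_pt": 30, "title_pt": 15, "align": "right"}


def render_chapters(source: str, style: dict) -> str:
    """Expand each <chap_start>...</chap_start> into numbered, styled output."""
    counter = 0

    def expand(match: re.Match) -> str:
        nonlocal counter
        counter += 1  # the processor, not the author, keeps track of chapter numbers
        title = match.group(1).strip()
        return (
            "[new right-hand page]\n"
            f"[{style['align']}, {style['number_pt']}pt {style['font']}] {counter}\n"
            f"[{style['align']}, {style['title_pt']}pt {style['font']}] {title}\n"
        )

    return re.sub(r"<chap_start>(.*?)</chap_start>", expand, source, flags=re.DOTALL)


if __name__ == "__main__":
    text = "<chap_start>My Chapter Title</chap_start>First paragraph."
    print(render_chapters(text, CHAPTER_STYLE))
```

Changing the typeface or size for every chapter is then a one-line edit to CHAPTER_STYLE; the marked-up text itself is never touched.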

If all you're doing is a one- or two-page memo that will at most be printed out once or twice, and never updated or consolidated into some sort of collection, such manual operations are acceptable. However, for anything beyond that, you should consider a semantic markup setup. It can even be WYSIWYG editing, so long as the element buttons are semantic descriptions and not just styling. That is, buttons for "emphasis", "citation", etc. and not for "italic", "bold", "underline", etc. Certainly for books, manuals, journal articles, and the like, markup with semantics is mandatory.

For example, most WYSIWYG editors allow you to designate text as italic or bold (or both). This is bad practice for anything beyond a brief letter or memo. Let's say you use italics for emphasized text, for titles (citations), and for foreign words. You just wrote a nice technical report with a word processor, and your boss is so impressed that she asks you to submit it to a technical or scientific journal (as an article). The journal bounces back the manuscript with some style suggestions: "we use bold for emphasis, underlines for citations, and a different typeface for foreign words." If you had written this in a markup language, it would be easy to change the definition (in the style file) of "emphasis" from italic to bold, of "citation" from italic to underlined, and of "foreign" from italic to a different typeface. Alas, with a word processor, you are going to have to trudge through the manuscript word-by-word and manually change all the italics, after figuring out why you used italics in each place. Fun! If it's a standard markup (such as HTML or LaTeX), you might be able to simply submit the markup and let the journal or publisher supply the style file. And lest you think this is an exaggeration, I've heard of publishers who want typewriter (fixed pitch) style submissions, double-spaced with room for editor's marks, so they can count words and line lengths!
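The journal scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag names <emphasis>, <citation>, and <foreign>, the two style dictionaries, and the function apply_style are all made up here, and <tt> merely stands in for "a different typeface".

```python
import re

# Two hypothetical style definitions: the author's own and the journal's.
# Each maps a semantic tag name to the presentation tag used on output.
MY_STYLE = {"emphasis": "i", "citation": "i", "foreign": "i"}
JOURNAL_STYLE = {"emphasis": "b", "citation": "u", "foreign": "tt"}


def apply_style(source: str, style: dict) -> str:
    """Rewrite semantic tags as presentation tags according to one style table."""

    def swap(match: re.Match) -> str:
        slash, name = match.group(1), match.group(2)
        if name not in style:
            return match.group(0)  # leave tags the style does not know about untouched
        return f"<{slash}{style[name]}>"

    return re.sub(r"<(/?)(\w+)>", swap, source)


if __name__ == "__main__":
    text = ("A <emphasis>key</emphasis> point from <citation>Nature</citation>, "
            "<foreign>a priori</foreign>.")
    print(apply_style(text, MY_STYLE))
    print(apply_style(text, JOURNAL_STYLE))
```

The manuscript text is identical in both runs; only the style table passed in differs. Going the other way, from three kinds of italic back to three meanings, is the part no program can do for you.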

Some WYSIWYG word processors (such as MS Word) can give you limited semantic markup (e.g., designate headings for various purposes, at different levels), but you generally cannot export them to flat text files (sometimes to HTML). Even when you can, they often come loaded up with all sorts of extra crap (font selections, sizes, etc., that are repeated over and over) that you'd really rather not have to deal with. This is not to say that WYSIWYG word processing can't deliver good, clean markup; it's just that it's usually something tacked on after the fact, and it's not really designed from the ground up to do that. There are almost always some styling controls or tags mixed in that you'll need to fix, especially if generic "italic" and "bold" etc. stylings are available.

Once you have your (semantic) markup cleanly separated from the text and from the styling, what can you do with it? Well, such text is better for screen readers, as knowing what the text is for can clue the reader in to how to modulate its voice. For instance, emphasized text might be read slower, louder, and at a lower pitch. A citation might have a slight pause before and after it. And a foreign word might be pronounced correctly (if the language used is included somewhere, such as <foreign lang="fr">après-ski</foreign> embedded within English text).

Another use for flat file text source with markup (tags) is to have some processor scan through it looking for certain tags, and extracting that text into a separate file. For instance, find all citation tags to start building a bibliography for your document. Another could be to extract foreign words and phrases to start building a glossary. In both cases, the list could be sorted and manually or automatically looked over to spot possible misspellings and typos, helping to clean up your source. This must all be done manually if all you did was italicize this material.
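As a rough sketch of that kind of processor, the following Python uses the standard library's html.parser to pull out the text of chosen tags. The class and function names (TagCollector, collect) are invented for this example, and it assumes the source uses <cite> for citations and a custom <foreign> tag as in the example above; real documents would need more careful handling of nesting and malformed markup.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the text found inside a chosen set of tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {name: [] for name in self.wanted}
        self._open = []  # stack of [tag, text pieces] for wanted tags still open

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self._open.append([tag, []])

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            name, parts = self._open.pop()
            self.found[name].append("".join(parts).strip())

    def handle_data(self, data):
        # Text belongs to every wanted tag that is currently open.
        for entry in self._open:
            entry[1].append(data)


def collect(source, wanted=("cite", "foreign")):
    """Return a sorted, de-duplicated list of text for each wanted tag."""
    parser = TagCollector(wanted)
    parser.feed(source)
    parser.close()
    return {name: sorted(set(items)) for name, items in parser.found.items()}


if __name__ == "__main__":
    src = ('<p>See <cite>Walden</cite> and <cite>Emma</cite>; later '
           '<foreign lang="fr">après-ski</foreign> and <cite>Emma</cite>.</p>')
    for name, items in collect(src).items():
        print(name, items)
```

The sorted lists are a starting point for a bibliography or glossary, and near-duplicates sitting next to each other in a sorted list are often how typos get spotted.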

For HTML web page markup, putting as much of the styling as possible into CSS leaves cleaner HTML text which search engines prefer to something cluttered with styling markup. For a journal, magazine, or book submission, it becomes much easier to meet their styling guidelines when style information is consolidated into one place.
« Last Edit: June 12, 2018, 11:22:33 AM by Phil »

*

Offline Phil

Re: Why semantic markup?
« Reply #1: October 18, 2017, 09:29:10 AM »
Besides marking for generic italic and bold text (non-semantic), font changes are frequently used to differentiate the meaning of text. Most word processors offer some sort of fixed pitch command, such as a [Tt] button or <tt> markup, in addition to a general purpose <font typeface="…"> markup or selection. Again, marking up by appearance rather than meaning is a bad thing. Markup languages should offer only semantic markup, along with an easy way to control the appearance of a tag's output (i.e., something like a CSS file). Some word processors let you mark text semantically (e.g., "this is a level 2 heading"), but make it difficult to customize the appearance. HTML, a markup language, offers <cite>, <del>, <em>, <ins>, <h1> through <h6>, <kbd>, <p>, <q>, <samp>, and <strong> as semantic markup (with CSS styling), as well as the ability to style (with CSS) various other text elements. A good markup language (such as LaTeX or XML) will permit you to define new semantic elements, as needed. Most word processors make it difficult to customize the appearance of semantic markup elements and tags, and even harder to add new elements.

Rather than relying on font changes alone to distinguish the purpose of text, it is preferable to mark up the text semantically (stating the purpose of the highlighted text) and apply consistent styling to it. The human reader sees the same thing either way, but software looking at the text source can tell why some text is different, and the author of the source has an easier time because the appearance is consistent with the purpose, and can be changed in one place if desired (e.g., instead of Courier, use some other fixed pitch/monospaced font for keyboard input).

*

Offline Phil

Re: Why semantic markup?
« Reply #2: June 12, 2018, 11:40:38 AM »
Continuing the discussion on italics and the <i> tag, there is a side discussion in https://bytes.com/topic/html-css/answers/164445-footnote-style on whether <i> has semantic meaning in and of itself (it does not; it is presentation markup only). It was pointed out that typographic convention mandates italics for all sorts of things, such as scientific species names (Pan troglodytes or Homo sapiens). This would most correctly be done with <span class="species"> or <span style="font-style: italic;">, although you may be able to persuade most browsers to accept a <species> tag with appropriate CSS styling. The questions then come up: "what happens when/if HTML ever implements a tag of this name?" and "do I need to worry about SGML compatibility, etc.?". It's unlikely that HTML will ever define a <species> tag, but something more generic-sounding might come along some day. And don't worry if IE6 doesn't correctly handle such things; it's time to drive a stake through its heart!