Archival numbering techniques

This document describes a number of techniques that can be used to number pages when scanning a book for archival purposes. At the end I discuss a technique for naming/numbering music files from CDs or books.

Different techniques might be more appropriate for different books, and different indexing methods.

In all cases, I don't mention the file type, or file extension, which is determined by the graphics file format used for preserving the scan.

I'll take a small digression into graphics file formats. Most bi-color (black and white, no grey scale) material is best stored in TIFF files with Group 4 (FAX Group 4, or CCITT Group 4) compression, for smallest size. Color material is probably best stored as TIFF with LZW compression if highest quality is desired, or as JPEG at an IJG quality level at or above 93 if the exact nuances of color and shading mostly result from the aging of the book, rather than being needed to render the particular shade chosen by an artist. Lower JPEG quality levels, or reducing the number of colors, are also fine techniques if the original material had a fixed number of distinguishable colors and preserving the exact shade is not essential.

Technique A

Technique A is simply to scan each page of the document, and number it. First, determine the maximum number of pages you will need. Use a fixed number of digits, at least large enough to count to the last page of the document, and use leading zeros to fill in the otherwise unnecessary digits. A fixed number of digits is required because directory listings sort file names as text, character by character: without leading zeros, "pg10" sorts ahead of "pg2".
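
A minimal Python sketch of the sorting problem that leading zeros avoid. The file names are made up for illustration, and plain string sorting stands in for a typical sorted directory listing.

```python
# Names without leading zeros versus names with a fixed three-digit width.
unpadded = ["pg1", "pg2", "pg10", "pg11", "pg100"]
padded = ["pg001", "pg002", "pg010", "pg011", "pg100"]

# Text sorting compares character by character, so "pg10" lands before "pg2".
print(sorted(unpadded))  # ['pg1', 'pg10', 'pg100', 'pg11', 'pg2']

# With a fixed width, text order and page order agree.
print(sorted(padded) == padded)  # True
```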

Use a filename prefix such as "pg" or "page" to show beholders why the filenames contain numbers, which are otherwise not particularly descriptive names.

Hence you have a directory containing files named "pg001", "pg002", "pg003", etc. for documents with up to 999 pages. If a document contains 9 or fewer pages, you can use a single digit; if it contains 99 or fewer pages, 2 digits suffice. If a large document contains 1000 or more pages, but fewer than 10,000, then 4 digits are appropriate.
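
The digit-count rule above can be sketched in Python as follows. The function name page_names and its parameters are hypothetical, chosen only for this illustration; the file extension is left off, as elsewhere in this document.

```python
def page_names(page_count, prefix="pg"):
    """Return Technique A names for pages 1..page_count."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    # Smallest fixed width that can still count to the last page.
    width = len(str(page_count))
    return [f"{prefix}{n:0{width}d}" for n in range(1, page_count + 1)]


names = page_names(352)
print(names[0], names[9], names[-1])  # pg001 pg010 pg352
```

A wider field than strictly necessary also works; the sketch simply picks the minimum.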

This technique works fine if the pages are unnumbered, or have a single sequential numbering scheme.

If there are multiple "sections" (preface material (often numbered in books using lowercase Roman numerals), individually numbered chapters, separately numbered appendix or index pages, etc.) then technique A, while it would work, would not reflect the numbers in the book. If you wish your file names to reflect the numbering in the book, consider Technique B with suffix or Technique C.

Technique B

Technique B is a simple upgrade to technique A. The file names are created exactly as in technique A, and a suffix is used to provide additional information about the original numbering from the book. The name from technique A, by being used as the prefix, completely determines the order of the pages in a sorted directory listing, and provides uniqueness among all filenames. Thus, even if the numbering in the book itself takes many different forms, and includes duplicates, such numbering can simply be reproduced in the filename.

This example illustrates technique B for a book with 4 preface pages, 352 main body pages, 32 appendix pages, and some index pages. If the appendix and/or index pages do not carry designations like "app" or "x" in the book, these can be omitted or included, as you prefer...

pg001-i
pg002-ii
pg003-iii
pg004-iv
pg005-1
pg006-2
pg007-3
...
pg013-9
pg014-10
pg015-11
...
pg356-352
pg357-app1
pg358-app2
...
pg365-app9
pg366-app10
...
pg388-app32
pg389-x1
pg390-x2
...
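
The listing above could be generated with a sketch like the following. The helper names to_roman and technique_b_names are hypothetical, and the index length of 2 pages is an assumption made only so the example has a definite total (the book described above just has "some index pages").

```python
def to_roman(n):
    """Lowercase Roman numeral for 1 <= n <= 3999."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, numeral in pairs:
        count, n = divmod(n, value)
        out.append(numeral * count)
    return "".join(out)


def technique_b_names(sections, prefix="pg"):
    """sections: list of (page_count, label_function) in book order."""
    total = sum(count for count, _ in sections)
    width = len(str(total))  # fixed width for the ordinal part
    names = []
    ordinal = 0
    for count, label in sections:
        for book_number in range(1, count + 1):
            ordinal += 1
            names.append(f"{prefix}{ordinal:0{width}d}-{label(book_number)}")
    return names


sections = [
    (4, to_roman),                 # preface: i, ii, iii, iv
    (352, str),                    # main body: 1..352
    (32, lambda n: f"app{n}"),     # appendix: app1..app32
    (2, lambda n: f"x{n}"),        # index: length assumed for this example
]
names = technique_b_names(sections)
print(names[3], names[4], names[355], names[356])
# pg004-iv pg005-1 pg356-352 pg357-app1
```

Only the ordinal prefix is padded; the suffix copies the book's own numbering as printed, so it never affects sort order.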


Technique C

Technique C requires additional planning and analysis of the book before starting. It is suitable for complex books with multiple page numbering sequences, and can result in names that are shorter than those of technique B, yet still meaningful. Generally the names will be slightly longer than those of technique A. The analysis should determine the number of different sections of the book, particularly each one that has independent numbering. Come up with a set of section designations _that sort properly_, such as

apref
ch001
ch002
ch003
eappx
index

The leading "a" and "e" are there just to force the sort. "pref" could be preface material, "appx" could mean Appendix. If there is more than one appendix, eap01, eap02, etc. could be used. "index" is self-explanatory. "atofc" could be used for a table of contents after preface material. "bpref" could be used for preface material after a table of contents... "chNNN" is for chapter 1, 2, 3 etc. If the book is less formal, but has several page number groupings, they could be just

sec01
sec02
sec03

It isn't necessary that the section designations all have the same number of characters, but it is useful to ensure that they sort alphabetically to the proper order of the book. Using the same number of characters provides a certain consistency that might be pleasing. If the section designation contains a substring of digits, however, like ch001 and sec01, the number of digits in the substring must be the same for each prefix (all the "ch" sections should use exactly the same number of digits, all the "sec" sections should use the same number of digits, etc.).
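
Both rules can be checked mechanically before scanning starts. This is only a sketch: check_designations is a hypothetical helper, it assumes designations of the form letters-then-digits (or letters only), and it uses plain case-sensitive string sorting as a stand-in for a directory listing.

```python
import re


def check_designations(designations):
    """Return a list of problems; designations are given in book order."""
    problems = []
    # Rule 1: sorted order must equal book order.
    if sorted(designations) != list(designations):
        problems.append("designations do not sort into book order")
    # Rule 2: each alphabetic stem must use one digit count throughout.
    widths = {}
    for d in designations:
        m = re.fullmatch(r"([A-Za-z]+)(\d+)", d)
        if m:
            widths.setdefault(m.group(1), set()).add(len(m.group(2)))
    for stem, seen in widths.items():
        if len(seen) > 1:
            problems.append(
                f"'{stem}' uses inconsistent digit counts: {sorted(seen)}")
    return problems


print(check_designations(["apref", "ch001", "ch002", "ch003", "eappx", "index"]))
# []
print(check_designations(["pref", "ch001", "appx"]))
# ['designations do not sort into book order']
```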

After each section designation, place a fixed number of digits for page numbers, for example

apref-001
apref-002
...
atofc-001
atofc-002
ch001-001
ch001-002
ch001-003
...
ch002-001
...
sec05-001
...
eappx-001
...
index-001
...

Although it isn't necessary that each section include exactly the same fixed number of digits for page numbers, there must be enough digits to represent the pages in that section. Using the same number of digits for each section provides a certain consistency that might be pleasing.
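
A sketch of generating technique C names from a section plan. The function name technique_c_names, the page_digits parameter, and the page counts in the example are all hypothetical; a single page-number width is used for every section, which is the "pleasing" option rather than a requirement.

```python
def technique_c_names(sections, page_digits=3):
    """sections: list of (designation, page_count) in book order."""
    names = []
    for designation, count in sections:
        # There must be enough digits for the longest section.
        if len(str(count)) > page_digits:
            raise ValueError(
                f"{designation}: {count} pages need more than "
                f"{page_digits} digits")
        for page in range(1, count + 1):
            names.append(f"{designation}-{page:0{page_digits}d}")
    return names


plan = [("apref", 2), ("atofc", 2), ("ch001", 3), ("eappx", 1), ("index", 1)]
names = technique_c_names(plan)
print(names[:5])
# ['apref-001', 'apref-002', 'atofc-001', 'atofc-002', 'ch001-001']
```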

Other issues

Identification of key pages

If some key pages, or even every page, have an additional designation that helps identify them, such a designation can be added to the end of the name in any of the naming techniques. For example, perhaps you wish to identify the first page of the Table of Contents, and the first page of each chapter. If you are using technique A you might provide names such as

pg005-Table_of_contents
pg009-Chapter_1
pg037-Chapter_2
pg177-Chapter_7

Extra pages

In any of the above techniques, a single letter suffix placed just after the fixed-width page number would work for up to 26 extra pages. On the other hand, maybe this technique isn't good past about 6 pages, because most people don't readily convert between a letter and its ordinal position in the alphabet... if there are pages with suffixes from a through p, inclusive, how many people would you expect would immediately know that there were 16 extra pages inserted between that and the next numbered page?

pg037
pg037a
pg037b
pg038


In any of the above techniques, if there are more than 26 extra pages, or if you just happen to like this technique better, the extra pages could be numbered, with a fixed number of additional digits, after a punctuation character...

ch005-037
ch005-037-01
ch005-037-02
...
ch005-037-29
ch005-037-30
ch005-038
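
Both ways of naming extra pages can be sketched in one hypothetical helper. The name extra_page_names and its style and digits parameters are assumptions for illustration; the hyphen is used as the punctuation character, as in the example above.

```python
import string


def extra_page_names(base, extra_count, style="letter", digits=2):
    """Names for pages inserted after the page named `base`."""
    if style == "letter":
        if extra_count > 26:
            raise ValueError("letter suffixes cover at most 26 extra pages")
        return [base + string.ascii_lowercase[i] for i in range(extra_count)]
    # Numeric style: fixed-width counter after a hyphen.
    if len(str(extra_count)) > digits:
        raise ValueError("not enough digits for that many extra pages")
    return [f"{base}-{i:0{digits}d}" for i in range(1, extra_count + 1)]


print(extra_page_names("pg037", 2))
# ['pg037a', 'pg037b']
numbered = extra_page_names("ch005-037", 30, style="number")
print(numbered[0], numbered[-1])
# ch005-037-01 ch005-037-30
```

Either way, the extra names sort after the base page and before the next numbered page, because they extend the base name rather than change it.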

Missing pages

Four techniques are described here...

1) Scan the blank page (this works if it is blank, rather than missing).
2) Use a fixed size, pre-prepared, "placeholder" file, containing "This page was omitted" and/or "This page was blank". These can be small files, but will hold the place, and are more informative than a scanned, blank page. It also works if the page is truly missing, not blank. If there is a large gap in numbering, a custom placeholder page could be created that explains the whole gap. Such placeholder files would be easy to create in PhotoShop, or most graphics programs.
3) Just skip the page number. But because it looks "missing", some commentary should be added somewhere to point out that it was not an omission during archival. The exception is technique B: there the first number, which represents the actual ordinal count of pages in the book, never has a missing number. If there is a missing page according to the numbering of the book, that shows up via a sequence like:

pg053-53
pg054-54
pg055-57
pg056-58


4) Just ignore the page numbering in the book. This again requires some explanatory commentary, for the astute reader that notices that the scan named "pg055" actually contains a page whose scanned number sure looks like "57" :)
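
For an archive named with technique B, gaps in the book's own numbering can be listed from the file names alone, which helps when writing the explanatory commentary. This is a sketch: find_book_gaps is a hypothetical helper, and it assumes names of the form prefix, hyphen, book number, and only looks at purely numeric suffixes (Roman numerals and "app"-style suffixes are skipped).

```python
def find_book_gaps(names):
    """Return book page numbers skipped between consecutive numeric suffixes."""
    gaps = []
    previous = None
    for name in names:
        _, _, suffix = name.partition("-")
        if not suffix.isdigit():
            previous = None  # different numbering sequence; start over
            continue
        current = int(suffix)
        if previous is not None and current > previous + 1:
            gaps.extend(range(previous + 1, current))
        previous = current
    return gaps


print(find_book_gaps(["pg053-53", "pg054-54", "pg055-57", "pg056-58"]))
# [55, 56]
```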

Personally, I prefer the 2nd solution, as only people that actually attempt to sequentially scan the book discover that there is a missing page, and only when they actually get there... and it is easy to skip on to the next page. Plus, if the page really existed in other copies of the book, and one such is found, and can be inserted, nothing else needs to change in the archive.

Music files

If a collection of music files is specifically from a book of music, then I recommend using the numbers from that book (if not unique, make them so with "a", "b", "c" suffixes after the numbers, or borrow from the techniques above). Be sure to use the correct, fixed number of digits, based on how high the numbers in the book go. If the book contains multiple numbering sequences, this can be handled either by using an extra leading digit to distinguish the numbering scheme, or by an alphabetic prefix which is mnemonic for each section and sorts alphabetically in the same order as the sections in the book, again borrowing from the techniques above. After the number, or number with prefix and suffix, place an underscore, and then the title from the book. If no titles are given, the first few words of the first line of the first verse can substitute.

If the same tune is used for multiple lyrics in the same book, but the music is identical (and thus two copies would be redundant), each additional number should follow the first number, separated by a "+" sign. Similarly, each title should follow, in the same order, also separated by a "+" sign. If there is other information you wish to record about the song, another underscore should be placed after the title(s), and that information added there. The suffix or file extension might be any audio file suffix, corresponding to the format, such as .wav or .mp3, as appropriate.
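
The naming rule for songs from a book can be sketched as a small helper. The name book_song_name, its parameters, and the song numbers and titles in the example are all hypothetical; titles are used exactly as given, and the file extension is left to the caller.

```python
def book_song_name(numbers, titles, extra=None, digits=3):
    """Build number(s)_title(s)[_extra] for a song from a music book."""
    if len(numbers) != len(titles):
        raise ValueError("give one title per number, in the same order")
    # Multiple references to the book are joined with "+".
    number_part = "+".join(f"{n:0{digits}d}" for n in numbers)
    title_part = "+".join(titles)
    parts = [number_part, title_part]
    if extra:
        parts.append(extra)
    # Sections of the name are joined with underscores.
    return "_".join(parts)


print(book_song_name([12], ["First Title"]))
# 012_First Title
print(book_song_name([12, 47], ["First Title", "Second Title"], extra="piano"))
# 012+047_First Title+Second Title_piano
```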

If the collection is from a CD, the track number should come first, using a sufficient fixed number of digits (generally 2), then an underscore, and then the title of the track. After another underscore you can add the performer, or the author or composer if desired, separated by underscores and in a consistent order.

Generally, a book or CD should be collected into its own directory, and the information that pertains to the whole collection should be placed in the name of the directory, rather than in the name of each file. When there is a corresponding printed book, author and composer information can be referenced in the book, and there is no need to place it in the filename, unless you particularly wish to.

The main theme here is that each section of the file name (the number, the title, and other information) is separated by underscore characters, and if a file needs to make multiple references to a book, the references of each type are separated by "+" signs.