This document describes a number of techniques that can be used to
number pages when scanning a book for archival purposes.
At the end I discuss a technique for
naming/numbering music files from CDs or books.
Different techniques might be more appropriate for different books,
and different indexing methods.
In all cases, I don't mention the file type, or file extension,
which is determined by the graphics file format used for preserving
the scan.
I'll take a small digression into graphics file formats. Most
bi-color (Black and White, no grey scale) material is best stored
in TIFF files with Group 4 (FAX Group 4, or CCITT Group 4)
compression, for smallest size. Color material is probably
best stored as TIFF with LZW compression if highest quality is
desired, or as JPEG at an IJG quality level at or above 93 if the
exact nuances of color and shading mostly result from the aging of
the book, rather than being needed to render the particular shade
chosen by an artist. Lower JPEG quality levels, or reducing
the number of colors, are also fine techniques if the original
material had a fixed number of distinguishable colors
and preserving the exact shade is not essential.
Technique A
Technique A is simply to scan each page of the document, and number
it. First, determine the maximum number of pages you will
need. Use a fixed number of digits, at least large enough to
count to the last page of the document, and use leading zeros to
fill in the otherwise unnecessary digits. A fixed number of digits
is required so that a sorted directory listing keeps the pages in
order.
Use a filename prefix such as "pg" or "page" to tell readers what
the numbers in the filenames mean; bare numbers are otherwise
not particularly descriptive names.
Hence you have a directory containing files named
"pg001", "pg002", "pg003", etc. for documents with up to 999
pages. If a document contains 9 or fewer pages, you can use a
single digit; if it contains 99 or fewer pages, 2 digits
suffice. If a large document contains 1000 or more pages, but
fewer than 10,000, then 4 digits are appropriate.
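The digit-count rule above can be sketched in a few lines of Python. This is only a minimal sketch; the function name `page_names` and its `prefix` parameter are my own choices for illustration, not part of any tool.

```python
def page_names(page_count, prefix="pg"):
    """Return technique A names for pages 1..page_count."""
    # Number of digits needed to count to the last page of the document.
    width = len(str(page_count))
    # Leading zeros fill in the otherwise unnecessary digits.
    return [f"{prefix}{n:0{width}d}" for n in range(1, page_count + 1)]


names = page_names(352)
print(names[0], names[1], names[-1])  # pg001 pg002 pg352

# Fixed-width names sort into page order in a directory listing.
assert names == sorted(names)
```

Because every name has the same width, a plain alphabetical sort and the page order agree, which is the whole point of the leading zeros.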
This technique works fine if the pages are unnumbered, or have a
single sequential numbering scheme.
If there are multiple "sections" (preface material (often
numbered in books using lowercase Roman numerals), individually
numbered chapters, separately numbered appendix or index pages,
etc.) then technique A, while it would work, would not reflect the
numbers in the book. If you wish your file names to reflect
the numbering in the book, consider Technique B with suffix or
Technique C.
Technique B
Technique B is a simple upgrade to technique A. The file names
are created exactly as in technique A, and a suffix is used to
provide additional information about the original numbering from
the book. The name from technique A, by being used as the
prefix, completely determines the order of the pages in a sorted
directory listing, and provides uniqueness among all
filenames. Thus, even if the numbering in the book
itself takes many different forms, and includes duplicates, such
numbering can simply be reproduced in the filename.
As an example, consider technique B applied to a book with 4 preface
pages, 352 main body pages, 32 appendix pages, and some index
pages. If the appendix and/or index pages do not have
designations like "app" or "x" in the book, these can be omitted
or included, as you choose.
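Names of this shape can be generated with a short script. The sketch below assumes the "pgNNN-label" form that appears in the missing-page listing later in this document, with a hyphen between the sequential number and the book's own numbering. The helper names, the "app" and "x" label styles, and the count of 8 index pages are assumptions for illustration only; the book in the example has "some" index pages.

```python
def to_roman(n):
    """Lowercase Roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def technique_b_names(labels, prefix="pg"):
    """Prefix each book label with its sequential, zero-padded position."""
    width = len(str(len(labels)))
    return [f"{prefix}{i:0{width}d}-{label}"
            for i, label in enumerate(labels, start=1)]


labels = (
    [to_roman(n) for n in range(1, 5)]      # 4 preface pages: i..iv
    + [str(n) for n in range(1, 353)]       # 352 main body pages
    + [f"app{n}" for n in range(1, 33)]     # 32 appendix pages
    + [f"x{n}" for n in range(1, 9)]        # index pages; 8 is a placeholder
)
names = technique_b_names(labels)
print(names[3], names[4], names[356], names[388])
# pg004-iv pg005-1 pg357-app1 pg389-x1
```

The sequential prefix alone fixes the sort order, so the labels after the hyphen are free to repeat or change form.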
Technique C
Technique C requires additional planning and analysis of the book
before starting. It is suitable for complex books with
multiple page numbering sequences, and can result in shorter, yet
still meaningful names, than technique B. Generally the names
will be slightly longer than those of technique A. The
analysis should determine the number of different sections of the
book, particularly each one that has independent numbering.
Come up with a set of section designations _that sort
properly_ like
apref
ch001
ch002
ch003
eappx
index
The leading "a" and "e" are there only to force the sort order. "pref"
could be preface material, "appx" could mean Appendix. If
there is more than one appendix, eap01, eap02, etc. could be
used. "index" is self-explanatory. "atofc" could be
used for a table of contents after preface material. "bpref"
could be used for preface material after a table of
contents... "chNNN" is for chapter 1, 2, 3 etc. If the
book is less formal, but has several page number groupings, they
could be just
sec01
sec02
sec03
It isn't necessary that the section designations all have
the same number of characters, but it is useful to ensure that they
sort alphabetically to the proper order of the book. Using
the same number of characters provides a certain consistency that
might be pleasing. If the section designation contains a
substring of digits, however, like ch001 and sec01, the number of
digits in the substring must be the same for each prefix (all the
"ch" sections should use exactly the same number of digits, all the
"sec" sections should use the same number of digits, etc.)
After each section designation, place a fixed number of digits for
page numbers.
Although it isn't necessary that each section include exactly the
same fixed number of digits for page numbers, there must be enough
digits to represent the pages in that section. Using the same
number of digits for each section provides a certain consistency
that might be pleasing.
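A sketch of technique C naming follows. The hyphen between the section designation and the page digits is my assumption, chosen to keep the two numbers visually apart; the function name and the page counts per section are also made up for illustration.

```python
def technique_c_names(sections, page_digits=3, sep="-"):
    """Build names from (designation, page_count) pairs, in book order."""
    names = []
    for designation, page_count in sections:
        # There must be enough digits to represent the pages in the section.
        if page_count >= 10 ** page_digits:
            raise ValueError(f"{designation}: {page_count} pages need "
                             f"more than {page_digits} digits")
        for page in range(1, page_count + 1):
            names.append(f"{designation}{sep}{page:0{page_digits}d}")
    return names


# Hypothetical page counts for each section of the book.
sections = [("apref", 4), ("ch001", 20), ("ch002", 35),
            ("ch003", 18), ("eappx", 32), ("index", 8)]
names = technique_c_names(sections)
print(names[0], names[4], names[-1])  # apref-001 ch001-001 index-008

# The designations were chosen so that sorting reproduces book order.
assert names == sorted(names)
```

The final assertion is a cheap way to test a set of section designations before scanning starts: if it fails, the designations do not sort into the order of the book.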
Other issues
Identification of key pages
If some key pages, or even every page, have an additional
designation that helps identify them, that designation can be added
to the end of a name produced by any of the naming techniques. For
example, perhaps you wish to identify the first page of the Table of
Contents, and the first page of each chapter. If you are
using technique A, each such page keeps its numbered name and gains
the designation at the end.
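As a rough sketch of adding designations to technique A names: the hyphen separator, the designation strings, and the page numbers below are all assumptions for illustration.

```python
def with_designation(base_name, designation=None, sep="-"):
    """Append an identifying designation to a page name, if there is one."""
    return f"{base_name}{sep}{designation}" if designation else base_name


# Hypothetical key pages: sequential page number -> designation.
key_pages = {5: "toc", 9: "ch01", 31: "ch02"}

names = [with_designation(f"pg{n:03d}", key_pages.get(n))
         for n in range(1, 41)]
print(names[3:6])  # ['pg004', 'pg005-toc', 'pg006']
```

The numbered part still determines the sort order, so designations can be added to some pages and not others without disturbing the sequence.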
In any of the above techniques, extra pages that fall between two
numbered pages can be given a single letter suffix just after
the fixed-digit page number; this works for up to 26
extra pages. On the other hand, maybe this technique isn't
good past about 6 pages, because most people don't readily convert
between a letter and its ordinal position in the alphabet... if
there are pages with suffixes from a through p, inclusive, how many
people would immediately know that 16 extra pages were
inserted between that page and the next numbered page?
pg037
pg037a
pg037b
pg038
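The letter-suffix scheme can be sketched as follows; the function names are my own, and the second function only illustrates the letter-to-position conversion that most people can't do in their heads.

```python
import string


def extra_page_names(base_name, extra_count):
    """Names for extra pages that follow base_name: base + 'a', 'b', ..."""
    if extra_count > 26:
        raise ValueError("a single letter suffix covers at most 26 extra pages")
    return [base_name + letter
            for letter in string.ascii_lowercase[:extra_count]]


def suffix_position(letter):
    """Ordinal position of a suffix letter in the alphabet (a=1)."""
    return string.ascii_lowercase.index(letter) + 1


print(extra_page_names("pg037", 2))  # ['pg037a', 'pg037b']
print(suffix_position("p"))          # 16

# The suffixed names sort between their neighbours.
listing = ["pg037"] + extra_page_names("pg037", 2) + ["pg038"]
assert listing == sorted(listing)
```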
In any of the above techniques, if there are more than 26 extra
pages, or if you just happen to like this technique better, the
extra pages could be numbered, with a fixed number of additional
digits, after a punctuation character...
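A sketch of the numbered variant follows. The period as the punctuation character and the default of two digits are assumptions; any punctuation that is legal in filenames on your system would do.

```python
def numbered_extra_pages(base_name, extra_count, sep=".", digits=2):
    """Names for extra pages after base_name, e.g. pg037.01, pg037.02, ..."""
    if extra_count >= 10 ** digits:
        raise ValueError(f"{extra_count} extra pages need more than "
                         f"{digits} digits")
    return [f"{base_name}{sep}{n:0{digits}d}"
            for n in range(1, extra_count + 1)]


extras = numbered_extra_pages("pg037", 30)
print(extras[0], extras[-1])  # pg037.01 pg037.30

# The extra pages still sort between pg037 and pg038.
listing = ["pg037"] + extras + ["pg038"]
assert listing == sorted(listing)
```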
When a page of the book is blank or missing, there are several
ways to handle it:
1) Scan the blank page (this works if it is blank, rather than
missing)
2) Use a fixed size, pre-prepared, "placeholder" file, containing
"This page was omitted" and/or "This page was blank". These
can be small files, but will hold the place, and are more
informative than a scanned, blank page. It also works if the
page is truly missing, not blank. If there is a large gap in
numbering, a custom placeholder page could be created that explains
the whole gap. Such placeholder files would be easy to create
in Photoshop, or most graphics programs.
3) Just skip the page number. But because it looks "missing"
some commentary should be added somewhere to point out that it was
not an omission during archival. The exception is technique B:
the first number, which represents the
actual ordinal count of pages in the book, never has a missing
number. If there is a missing page according to the numbering
of the book, that shows up via a sequence like:
pg053-53
pg054-54
pg055-57
pg056-58
4) Just ignore the page numbering in the book. This again requires
some explanatory commentary, for the astute reader that notices
that the scan named "pg055" actually contains a page whose scanned
number sure looks like "57" :)
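With technique B names, gaps in the book's own numbering can be found mechanically, which may help when writing the explanatory commentary. This is a sketch under the assumption that everything after the first hyphen is the book's page number; labels that are not plain digits (Roman numerals, "app" pages, and so on) are skipped rather than interpreted. The function name is hypothetical.

```python
def find_numbering_gaps(names):
    """Return (previous_name, next_name, missing_numbers) for each jump
    in the book's own page numbering."""
    gaps = []
    previous = None  # (name, book_number) of the last numeric page seen
    for name in sorted(names):
        _, _, label = name.partition("-")
        if not label.isdigit():
            previous = None  # new numbering sequence; don't compare across it
            continue
        number = int(label)
        if previous is not None and number > previous[1] + 1:
            missing = list(range(previous[1] + 1, number))
            gaps.append((previous[0], name, missing))
        previous = (name, number)
    return gaps


names = ["pg053-53", "pg054-54", "pg055-57", "pg056-58"]
print(find_numbering_gaps(names))
# [('pg054-54', 'pg055-57', [55, 56])]
```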
Personally, I prefer the 2nd solution, as only people that actually
attempt to sequentially scan the book discover that there is a
missing page, and only when they actually get there... and it is
easy to skip on to the next page. Plus, if the page really
existed in other copies of the book, and one such is found, and can
be inserted, nothing else needs to change in the archive.
Music files
If a collection of music files is specifically from a book of
music, then I recommend using the numbers from that book (if not
unique, make them so with "a", "b", "c" suffixes after the numbers,
or borrow from the techniques above). Be sure to use the
correct, fixed number of digits, based on how high the numbers in
the book go. If the book contains multiple numbering
sequences, this can either be handled by using an extra leading
digit to distinguish which numbering scheme, or an alphabetic
prefix which is mnemonic for each section, and alphabetically sorts
in the same order as the section in the book, again borrowing from
the techniques above. After the number, or number with prefix
and suffix, place an underscore, and then the title from the book.
If no titles are given, the first few words of the first line of
the first verse can substitute.
If the same tune is used for multiple lyrics in the same book, and
the music is identical (so that two copies would be redundant),
each additional number should follow the first number, separated by a "+"
sign. Similarly, each title should follow, in the same
order, also separated by a "+" sign. If there is other
information you wish to record about the song, another underscore
should be placed after the title(s), and that information added
there. The suffix or file extension might be any audio file
suffix, corresponding to the format, such as .wav or .mp3, as
appropriate.
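The naming rule for music from a book can be sketched as a pair of functions, one to build a name and one to take it apart again. The function names, the example titles, and the "organ" note are all hypothetical. The parser assumes that titles themselves contain no underscore or "+" characters, and the builder handles plain numbers only, not numbers with letter suffixes.

```python
import os


def song_filename(numbers, titles, other=None, digits=3, ext=".mp3"):
    """Join numbers and titles with '+', and the sections with '_'."""
    number_part = "+".join(f"{n:0{digits}d}" for n in numbers)
    title_part = "+".join(titles)
    parts = [number_part, title_part]
    if other:
        parts.append(other)
    return "_".join(parts) + ext


def parse_song_filename(filename):
    """Split a name built by song_filename back into its sections."""
    stem, ext = os.path.splitext(filename)
    number_part, title_part, *rest = stem.split("_")
    return {
        "numbers": number_part.split("+"),
        "titles": title_part.split("+"),
        "other": rest,
        "extension": ext,
    }


single = song_filename([12], ["First Title"])
shared = song_filename([45, 112], ["First Title", "Second Title"], other="organ")
print(single)  # 012_First Title.mp3
print(shared)  # 045+112_First Title+Second Title_organ.mp3
print(parse_song_filename(shared)["numbers"])  # ['045', '112']
```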
If the collection is from a CD, the track number should be first,
using a sufficient fixed number of digits (generally 2), then a
hyphen, and then the title of the track, and after an
underscore, the performer, or author or composer if desired,
perhaps separated by underscores, and in a consistent order.
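For CD tracks, the rule can be sketched like this. The function name, the example title, and the credits are placeholders; the only things taken from the description above are the fixed-width track number, the hyphen, and the underscores.

```python
def track_filename(track, title, credits=(), digits=2, ext=".wav"):
    """Track number, hyphen, title, then optional credits after underscores."""
    name = f"{track:0{digits}d}-{title}"
    if credits:
        # Keep credits in one consistent order, e.g. performer then composer.
        name += "_" + "_".join(credits)
    return name + ext


print(track_filename(3, "Example Title"))
# 03-Example Title.wav
print(track_filename(3, "Example Title", ["Example Performer", "Example Composer"]))
# 03-Example Title_Example Performer_Example Composer.wav
```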
Generally, a book or CD should be collected into its own directory,
and the information that pertains to the whole collection should be
placed in the name of the directory, rather than in the name of
each file. When there is a corresponding printed book, author
and composer information can be referenced in the book, and there
is no need to place it in the filename, unless you particularly
wish to.
The main theme here is that each section of the file name (the
number, the title, and other information) is separated by
underscore characters, and if a file needs to make multiple
references to a book, the references of each type are separated by
"+" signs.