Re: Gutenberg DTDs

by "Frank Boumphrey" <bckman(at)ix.netcom.com>

 Date:  Mon, 21 Feb 2000 14:33:21 -0500
 To:  "Terence de giere" <terence(at)humanfactors.com>
 Cc:  <hwg-gutenberg-dtds(at)hwg.org>, <hwg-gutenberg(at)hwg.org>

Thanks for your email. I am cc'ing the reply to the Gutenberg lists.

> 1. DTD
> A week or so ago I downloaded the gutbook1.dtd and the bookfrag DTD

That is rather embarrassing! I'm glad you discovered it before we decided to
'gel' the DTD's.

I have added a declaration for pubinfo!

<!ELEMENT pubinfo (#PCDATA|para|line %inline.class;)*>
<!ATTLIST pubinfo
  %stdatts;
>
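
As a rough way to catch this kind of problem before a DTD is frozen, a short
Python sketch can list element names that appear in content models but have
no <!ELEMENT> declaration of their own. It is an illustration only: the
function name is made up, it uses regular expressions rather than a real DTD
parser, and it cannot see names hidden inside parameter entities such as
%inline.class;.

```python
import re

# Matches <!ELEMENT name content-model> declarations; the content model may
# span several lines because [^>] also matches newlines.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.:-]+)\s+([^>]*)>")


def undeclared_elements(dtd_text):
    """Return element names used in content models but never declared.

    Assumption: parameter-entity references (%name;) are simply skipped,
    so anything they would expand to is not checked.
    """
    # Drop comments so commented-out declarations are not counted.
    dtd_text = re.sub(r"<!--.*?-->", "", dtd_text, flags=re.S)
    declared = set()
    referenced = set()
    for name, model in ELEMENT_DECL.findall(dtd_text):
        declared.add(name)
        # Remove entity references and the #PCDATA keyword before
        # collecting names from the content model.
        model = re.sub(r"%[\w.:-]+;|#PCDATA", " ", model)
        for ref in re.findall(r"[A-Za-z_][\w.:-]*", model):
            if ref not in ("EMPTY", "ANY"):
                referenced.add(ref)
    return sorted(referenced - declared)


if __name__ == "__main__":
    sample = """
    <!ELEMENT titlepage (#PCDATA | title | pubinfo)* >
    <!ELEMENT title (#PCDATA)>
    """
    print(undeclared_elements(sample))
```

Run against a DTD fragment where titlepage refers to pubinfo but pubinfo has
no declaration, the function reports pubinfo as missing.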

>
> 2. ENTITIES
> If we were to add the XHTML entity references for character sets to these
> DTD's do we have some rule as to the use of certain characters

We do not have a policy on entities just yet. All Gutenberg files are meant
to be in 7-bit ASCII. There are several good reasons why we should not support
'typographic' quotation marks, but for now I think we need to stick to plain
double and single quotation marks.

As we start marking up foreign books, though, we are definitely going to have
to address this issue.
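
For anyone preparing a text in the meantime, sticking to plain quotation
marks could look something like the Python sketch below. The mapping table and
function names are my own illustration, not project policy, and which
substitutions are acceptable (for example, two hyphens for an em dash) is
exactly the kind of question that still needs deciding.

```python
# Hypothetical mapping from typographic characters to plain ASCII.
# The choice of replacements is an assumption, not a project rule.
TO_ASCII = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
}


def to_plain_ascii(text):
    """Replace typographic quotes and dashes with ASCII equivalents."""
    return text.translate(str.maketrans(TO_ASCII))


def non_ascii_chars(text):
    """List the characters that would still need a policy decision."""
    return sorted({ch for ch in text if ord(ch) > 127})


if __name__ == "__main__":
    line = "\u201cElementary,\u201d said he \u2014 it\u2019s simple."
    print(to_plain_ascii(line))
    print(non_ascii_chars("caf\u00e9"))
```

The second function is there because accented letters in foreign books have
no obvious ASCII replacement; it only reports them.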

> 3. HOW WERE THE ORIGINAL E-TEXTS GENERATED?

This is out of our hands. Project Gutenberg sets the policies here. All books
are meant to be proofed, although I have noticed there is a wide discrepancy
in the accuracy of the proofreading :>)

They have a lot of information on this on their site.

> 4. PREPARATION OF TEXTS

Our web site lays out some guidelines. Most Gutenberg etexts have a double
line break after each paragraph. Although I have developed several scripts for
handling many problems, in the end one has to go in and clean up the
document manually :>(
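
The word-processor procedure described in the quoted message below (treat
double line breaks as paragraph ends, join the remaining lines with spaces,
collapse double spaces, then wrap each paragraph in tags) can be sketched in
Python. This is a minimal illustration and not one of the scripts mentioned
above; the function names and the default tag name are assumptions, and
pre-formatted material such as poems would have to be set aside first because
this flattens it.

```python
import re
from xml.sax.saxutils import escape


def unwrap_paragraphs(text):
    """Turn hard-wrapped text into a list of one-line paragraphs.

    Blank lines mark paragraph boundaries; line breaks inside a
    paragraph become single spaces, and runs of spaces are collapsed.
    """
    text = text.replace("\r\n", "\n")
    blocks = re.split(r"\n\s*\n", text.strip())
    # str.split() with no argument splits on any whitespace run, so the
    # join also removes double spaces left by the original line endings.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def mark_up(text, tag="para"):
    """Wrap each paragraph in <tag>...</tag>, escaping &, < and >."""
    return "\n\n".join(
        f"<{tag}>{escape(p)}</{tag}>" for p in unwrap_paragraphs(text)
    )


if __name__ == "__main__":
    sample = "It was a dark\nand stormy night.\n\nThe rain fell\nin torrents."
    print(mark_up(sample))
```

Chapter headings and similar structure would still need to be marked up by
hand afterwards.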

> 5. VERSIONS XML and/or XHTML
> Do we want both an XML and XHTML version of each book? The latter will be
> more accessible to more people, at least for a number of years to come.


The choice is yours! In fact I am working on a script to convert our DTD's
to XHTML, but it will not be ready for some time. Murray Altheim is also
working on a DTD that combines XHTML with other markup.

I am also slowly changing the site pages to correct the problems you
mentioned, although personally I don't care if Netscape 3 viewers see the
style sheet :>)

Frank

----- Original Message -----
From: Terence de giere <terence(at)humanfactors.com>
To: Frank Boumphrey <bckman(at)ix.netcom.com>
Sent: Monday, February 21, 2000 1:17 PM
Subject: Gutenberg DTDs


> 1. DTD
> A week or so ago I downloaded the gutbook1.dtd and the bookfrag DTD. I
> tried using these with SoftQuad's XMetaL Pro 1.2. Programs like this
> require the DTD's and entities to be compiled before using. XMetaL is a
> bit particular. In these DTD's an element <pubinfo> was declared but not
> defined, and the compilation choked on this. I temporarily added a
> definition for this element, which I presume would contain reference to
> the original printed version's publisher.
>
> The pubinfo reference appeared in the titlepage element:
>
> <!ELEMENT titlepage  (#PCDATA | title | subtitle | author | pubinfo | para |
>                poem | song | note | quote | emph | ital | reference | date |
>                place | name | graphic | misc)* >
>
>
> So I could compile the DTD I temporarily gave pubinfo the following
> definition, which I copied from one of the other elements, although this
> does not seem to be exactly what would be required of pubinfo:
>
> <!ELEMENT pubinfo  (#PCDATA | quote | emph | ital | reference | date |
>                place | name | graphic | misc)* >
>
>
> 2. ENTITIES
> If we were to add the XHTML entity references for character sets to these
> DTD's do we have some rule as to the use of certain characters. Printed
> books have typographic quotation marks for example, and these are defined
> in HTML 4.0 and XHTML 1.0 but not all browsers support these characters.
> The same goes for certain characters such as the en-dash and em-dash which
> are defined in the spec but do not display in many current browsers.
>
> Do we want to use plain quotation marks or typographic quotation marks or
> replicate what was probably in the original publication?  It is not always
> possible to tell what the author did. In England now they use single
> quotation marks where we use double quotation marks, but at the beginning
> of the 20th century, they also used double quotation marks (the Strand
> Magazine, where Conan Doyle published Sherlock Holmes, used double
> quotation marks).
>
> 3. HOW WERE THE ORIGINAL E-TEXTS GENERATED?
> Mary Shelley's Frankenstein; or the Modern Prometheus was published in
> three editions. The first was in three volumes, the second was in two
> volumes but essentially the same, and the third edition had numerous
> changes and additions, the version most seen today. (This information was
> in the notes an editor made to a re-publication of the first edition of
> the story some 20 years ago)
>
> Is there any kind of oversight in the projects that compares texts with
> the original editions or manuscripts? How faithful to the originals are we
> going to get? It would be great if there were some scholarly oversight on
> the project.
>
>
> 4. PREPARATION OF TEXTS
> I have fiddled with some Gutenberg texts before with word processors and
> desktop publishing programs. The way the texts are formed with line breaks
> after every line tends to cause problems. If there is no space at the end
> of a word on a line, then the next line begins without a space between the
> two words. I usually process these files in a word processor, substituting
> double carriage returns with a character not used in the document, and
> then removing all the line breaks substituting a space. Then I replace the
> special character with a paragraph ending, and then search for double
> spaces in the document which I remove. Since most books are mostly
> paragraphs, search and replace can also be used to add sets of <p></p> or
> <para></para> tags at the beginning and end of all the paragraphs. Then
> the only thing left to mark up manually is the chapter headings etc.
>
> Before doing this it is also necessary to check for pre-formatted material
> such as poems, whose format will be messed up by this procedure (and it
> might also be convenient to mark up headings etc. at this time).
>
> 5. VERSIONS XML and/or XHTML
> Do we want both an XML and XHTML version of each book? The latter will be
> more accessible to more people, at least for a number of years to come.
>
> HWG PAGES DISPLAYING STYLE SHEET CONTENTS AND XML INSTRUCTIONS
> 6. I just happened to start Netscape 3.04 on XHTML pages on the HWG site.
> The XML processor instruction at the beginning of the
> http://www.hwg.org/opcenter/gutenberg/index.html page displays as well as
> the style sheet contents. I noticed that the W3C is dropping the <?xml
> version="1.0"?> at the beginning of some of their XHTML pages on their web
> site (such as the home page). Commenting out the contents of the style
> sheet prevents it from displaying in older browsers, but new browsers do
> not ignore processing the content if this is done.
>
> Terence de Giere
> tdegiere(at)humanfactors.com
>
>

HWG: hwg-gutenberg mailing list archives, maintained by Webmasters @ IWA