Vectorlist Archives: Re: Rocket Racer Source Code Found!

From: Graham Toal <gtoal_at_gtoal.com>
Date: Mon Mar 15 2004 - 22:12:05 EST

> > Yes, OCR is what I meant, not scan. It will take a while -- there
> > are at least a hundred pages and the paper is 20 years old and
> > showing its age a bit.
>
> I would be very careful about OCRing the pages of code. You will
> invariably have to go line by line to make sure the OCR engine has
> translated the text properly. Might be easier (?) to get a small group
> of people to manually type in three or four pages each.

I went through this process myself within the last year with some
30-year-old printouts, and I offer a couple of comments in light of that
experience:

1) It is very unlikely you will get a good OCR from an old printer listing.
Bite the bullet and type in the text. Be careful to type it in *exactly*
as on the listing, with the exact same spacing. Have each page done twice
(not by the same person). Then use diff to compare the results, and carefully
repair the files as you find errors in each. Eventually the two files
will converge.
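
A minimal sketch of the compare step, assuming the two typists' copies
are available as plain strings. Python's difflib stands in for the diff
tool here, and the function name typing_discrepancies is made up for
illustration. Lines are compared exactly, so a spacing difference is
reported like any other mismatch.

```python
import difflib

def typing_discrepancies(copy_a, copy_b):
    """List the line ranges where two independently typed copies differ.

    Returns tuples (first_line_in_a, first_line_in_b, lines_a, lines_b),
    with 1-based line numbers. Comparison is exact, whitespace included.
    """
    a = copy_a.splitlines()
    b = copy_b.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    found = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            found.append((i1 + 1, j1 + 1, a[i1:i2], b[j1:j2]))
    return found
```

Each reported range still has to be checked against the paper listing by
eye, since the tool can only say that the copies disagree, not which one
is right. An empty result means the two copies have converged.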

OR - if you're not in a hurry -

2) Scan the paper at a good resolution (300dpi greyscale or better) and
store it until OCR technology improves. Don't filter it or use any
Photoshop-style tricks or software interpolation. Just store the best
raw data you can with current tech (but no need to overdo it with
ultra-high-res scans). Store it along with C code to decode the file
format.

Although current OCR technology is poor (no better than it was about 10
years ago, I'm afraid), it will not always be this way. It is obvious to me
that fixed-pitch printer listings *SHOULD* actually be easier to OCR because
of the fixed grid. I have a theory that it is not too difficult to write a
custom OCR program specifically for listings. The first thing is to align
the text by applying a shear in both X and Y, effectively rotating the text
to be exactly level on a horizontal baseline. (It's not exactly a rotation,
but when it's only a few degrees off true, it's damn close.) Then you take
a histogram of the pixels along both the X and Y axes, use an FFT to
determine the grid, and presto - you have each character bounded by a
perfect rectangle.
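
The alignment and grid-finding steps might be sketched roughly as below.
This is an illustration under several assumptions, not a worked-out
program: the page is taken to be a list of rows of 0/1 pixels (1 = ink),
only the Y shear is shown (the X shear would be analogous), the best
shear is picked by trying a list of candidate slopes, and a plain
O(n^2) DFT stands in for a real FFT. All function names are made up.

```python
import cmath

def shear_rows(img, slope):
    """Shear in Y: move column x vertically by round(slope * x) pixels."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                ny = y + round(slope * x)
                if 0 <= ny < h:          # ink sheared off the page is dropped
                    out[ny][x] = 1
    return out

def row_profile(img):
    """Ink count per pixel row (histogram along Y)."""
    return [sum(row) for row in img]

def col_profile(img):
    """Ink count per pixel column (histogram along X)."""
    return [sum(col) for col in zip(*img)]

def best_deskew_slope(img, candidates):
    """Pick the candidate shear whose row profile is most sharply peaked."""
    def sharpness(slope):
        return sum(v * v for v in row_profile(shear_rows(img, slope)))
    return max(candidates, key=sharpness)

def dominant_period(profile):
    """Period (in pixels) of the strongest non-DC frequency in a profile."""
    n = len(profile)
    mean = sum(profile) / n
    centred = [p - mean for p in profile]
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, c in enumerate(centred))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return n / best_k
```

With a level page, dominant_period(col_profile(page)) estimates the
character pitch and dominant_period(row_profile(page)) the line pitch.
A real scan would also need the grid phase (where the first cell
starts), and a harmonic can outweigh the fundamental on some pages, so
the strongest peak should be treated as a first guess only.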

After that it's trivial to detect all similar glyphs, average them to
determine an archetype, and then compare each glyph against the archetype.
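
As a rough sketch of that grouping step, assuming each grid cell has
already been cut out as a flat tuple of 0/1 pixels of equal length: the
greedy clustering rule, the Hamming-distance threshold and all the names
below are illustrative choices, not a tested design.

```python
def hamming(a, b):
    """Number of pixels at which two equal-sized glyphs differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_glyphs(glyphs, max_dist):
    """Greedy grouping: a glyph joins the first cluster whose first
    member is within max_dist pixels, otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into glyphs."""
    clusters = []
    for i, g in enumerate(glyphs):
        for members in clusters:
            if hamming(g, glyphs[members[0]]) <= max_dist:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def archetype(glyphs, members):
    """Per-pixel average of the glyphs in one cluster."""
    n = len(members)
    size = len(glyphs[members[0]])
    return [sum(glyphs[i][p] for i in members) / n for p in range(size)]

def classify(glyph, archetypes):
    """Index of the archetype closest to glyph (sum of absolute differences)."""
    def distance(arch):
        return sum(abs(g - a) for g, a in zip(glyph, arch))
    return min(range(len(archetypes)), key=lambda k: distance(archetypes[k]))
```

On greyscale scans the same idea would work with pixel intensities in
place of 0/1 values. The threshold would have to be tuned per listing,
since a worn ribbon can make one letter look like several.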

You don't even need to train it with what glyph is which letter - you can
treat it as a 1:1 decryption problem and guess them on the fly.
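
One way to picture the decryption idea, as a sketch only: assume each
glyph has been replaced by its cluster number, the text has been split
into tokens, and a list of words expected in the listing is available.
That word list is an assumption added here, as are the names pattern,
consistent and solve. A token is matched to a word when it is the only
word with the same repetition pattern that keeps the glyph-to-letter
mapping one-to-one.

```python
def pattern(seq):
    """Repetition pattern of a sequence, e.g. 'ADDA' -> (0, 1, 1, 0)."""
    seen = {}
    return tuple(seen.setdefault(s, len(seen)) for s in seq)

def consistent(token, word, mapping):
    """True if reading token as word keeps the mapping one-to-one."""
    trial = dict(mapping)
    used = {letter: glyph for glyph, letter in mapping.items()}
    for glyph, letter in zip(token, word):
        if trial.get(glyph, letter) != letter or used.get(letter, glyph) != glyph:
            return False
        trial[glyph] = letter
        used[letter] = glyph
    return True

def solve(tokens, vocabulary):
    """Repeatedly fix any token that has exactly one possible reading."""
    mapping = {}
    changed = True
    while changed:
        changed = False
        for token in tokens:
            candidates = [w for w in vocabulary
                          if len(w) == len(token)
                          and pattern(w) == pattern(token)
                          and consistent(token, w, mapping)]
            if len(candidates) == 1:
                for glyph, letter in zip(token, candidates[0]):
                    if glyph not in mapping:
                        mapping[glyph] = letter
                        changed = True
    return mapping
```

Glyphs that never appear in an unambiguous token stay unmapped and
would need a human guess or some other clue, such as letter frequency.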

I reckon I could write it in about 2 weeks of solid work; or it would make a
good final year undergrad project. Unfortunately I've got so many projects
on the go that I just can't even look at this one. It's one of those
jobs I'm putting off until I retire :-) That is why I'm scanning sources
now at a resolution that I can use later.

Current OCR does not take advantage of the fixed pitch of old printers.
We can definitely do better. If we're really lucky someone else will
write something like this before I retire :-)

Anyway, I have about 1000 pages of old O/S manuals from the early '70s
which are awaiting some software like this. And three programs that
we thought were worth retyping.

Graham
Received on Mon Mar 15 21:53:52 2004
