fixed typos

Peter Williams 2020-05-24 21:23:33 +10:00
parent a5c538f420
commit 83033182fa

@@ -1,8 +1,7 @@
TEXT EXTRACTION CODE
====================
The code is currently split accross the text_*.go files to make it easier to navigate. Once you
understand the code you may wish to recombine this in the orginal text.go
\
The code is currently split across the `text_*.go` files to make it easier to navigate. Once you
understand the code you may wish to recombine it into the original `text.go`.
BASIC IDEAS
-----------
@@ -19,10 +18,10 @@ We define *depth* as distance from the bottom of a word's bounding box from the
depth := pageSize.Ury - r.Lly
* Pages are divided into rectangular regions called `textPara`s.
* The `textPara`s in a page are sorted in reading ouder (the order they are read, not the
* The `textPara`s in a page are sorted in reading order (the order they are read in, not the
*reading* direction above).
* Each `textPara` contains `textLine`s, lines within the `textPara`'s bounding box.
* Each `textLine` has a text reprentation.
* Each `textLine` returns the text extracted for the line from its `text()` function.
Page text is extracted by iterating over `textPara`s and within each `textPara` iterating over its
`textLine`s.
@@ -31,14 +30,12 @@ Page text is extracted by iterating over `textPara`s and within each `textPara`
WHERE TO START
--------------
`text_page.go` *makeTextPage* is the top level function that builds the `textPara`s.
`text_page.go` **makeTextPage** is the top-level function that builds the `textPara`s.
* A page's `textMark`s are obtained from its content stream.
* The `textMark`s are divided into `textWord`s.
* The `textWord`s are grouped into depth bins, with the contents of each bin sorted by reading direction.
* The page area is into rectangular regions for each paragraph.
* The page area is divided into rectangular regions, one for each paragraph.
* The words in each rectangular region are arranged into `textLine`s. Each rectangular region and
its constituent lines form a `textPara`.
* The `textPara`s are sorted into reading order.
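
The binning step above can be sketched roughly as follows. This is a minimal illustration, not the code in the `text_*.go` files: the `word` type, `binWords`, `pageText` and the `binSize` parameter are hypothetical names, the reading direction is assumed to be increasing x, and the division into paragraph regions is skipped, so each depth bin is treated as a single line. Only the depth formula `pageSize.Ury - r.Lly` is taken from the README.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// word is a hypothetical stand-in for `textWord`: its text and the
// lower-left corner (llx, lly) of its bounding box.
type word struct {
	text     string
	llx, lly float64
}

// depthOf mirrors the README's definition: depth := pageSize.Ury - r.Lly.
func depthOf(pageUry float64, w word) float64 {
	return pageUry - w.lly
}

// binWords groups words into depth bins of height binSize (an assumed
// parameter) and sorts each bin by reading direction, taken here to be
// increasing x.
func binWords(pageUry, binSize float64, words []word) map[int][]word {
	bins := map[int][]word{}
	for _, w := range words {
		idx := int(math.Floor(depthOf(pageUry, w) / binSize))
		bins[idx] = append(bins[idx], w)
	}
	for _, bin := range bins {
		sort.Slice(bin, func(i, j int) bool { return bin[i].llx < bin[j].llx })
	}
	return bins
}

// pageText emits one line per bin, from the smallest depth (top of the
// page) to the largest.
func pageText(bins map[int][]word) string {
	idxs := make([]int, 0, len(bins))
	for idx := range bins {
		idxs = append(idxs, idx)
	}
	sort.Ints(idxs)
	var lines []string
	for _, idx := range idxs {
		var parts []string
		for _, w := range bins[idx] {
			parts = append(parts, w.text)
		}
		lines = append(lines, strings.Join(parts, " "))
	}
	return strings.Join(lines, "\n")
}

func main() {
	words := []word{
		{"world", 60, 700}, {"Hello", 10, 700},
		{"line", 50, 680}, {"Second", 10, 680},
	}
	fmt.Println(pageText(binWords(792, 10, words)))
}
```

Keying the bins by an integer index keeps words with slightly different baselines together, at the cost of splitting a line whose words straddle a bin boundary; the real code may handle that case differently.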