fixed typos

2025-05-14 19:29:50 +08:00 · 2020-05-24 21:23:33 +10:00 · 83033182fa
commit 83033182fa
parent a5c538f420
1 changed file with 6 additions and 9 deletions
--- a/extractor/README.md
+++ b/extractor/README.md
@@ -1,8 +1,7 @@
 TEXT EXTRACTION CODE
 ====================
-The code is currently split accross the text_*.go files to make it easier to navigate. Once you
-understand the code you may wish to recombine this in the orginal text.go
-\
+The code is currently split accross the `text_*.go` files to make it easier to navigate. Once you
+understand the code you may wish to recombine this in the orginal `text.go`.

 BASIC IDEAS
 -----------
@@ -19,10 +18,10 @@ We define *depth* as distance from the bottom of a word's bounding box from the
 depth := pageSize.Ury - r.Lly

 * Pages are divided into rectangular regions called `textPara`s.
-* The `textPara`s in a page are sorted in reading ouder (the order they are read, not the
+* The `textPara`s in a page are sorted in reading order (the order they are read in, not the
 *reading* direction above).
 * Each `textPara` contains `textLine`s, lines with the `textPara`'s bounding box.
-* Each `textLine` has a text reprentation.
+* Each `textLine` has extracted for the line in its `text()` function.

 Page text is extracted by iterating over `textPara`s and within each `textPara` iterating over its
 `textLine`s.
@@ -31,14 +30,12 @@ Page text is extracted by iterating over `textPara`s and within each `textPara`
 WHERE TO START
 --------------

-`text_page.go` *makeTextPage* is the top level function that builds the `textPara`s.
+`text_page.go` **makeTextPage** is the top level function that builds the `textPara`s.

 * A page's `textMark`s are obtained from its contentstream.
 * The `textMark`s are divided into `textWord`s.
 * The `textWord`s are grouped into depth bins with each the contents of each bin sorted by reading direction.
-* The page area is into rectangular regions for each paragraph.
+* The page area is divided into rectangular regions, one for each paragraph.
 * The words in of each rectangular region are aranged inot`textLine`s. Each rectangular region and
 its constituent lines is a `textPara`.
 * The `textPara`s are sorted into reading order.
-
-