diff --git a/extractor/README.md b/extractor/README.md
index 70bcddc0..cfb5ea2c 100644
--- a/extractor/README.md
+++ b/extractor/README.md
@@ -1,8 +1,7 @@
 TEXT EXTRACTION CODE
 ====================
-The code is currently split accross the text_*.go files to make it easier to navigate. Once you
-understand the code you may wish to recombine this in the orginal text.go
-\
+The code is currently split accross the `text_*.go` files to make it easier to navigate. Once you
+understand the code you may wish to recombine this in the orginal `text.go`.
 
 BASIC IDEAS
 -----------
@@ -19,10 +18,10 @@ We define *depth* as distance from the bottom of a word's bounding box from the
     depth := pageSize.Ury - r.Lly
 
 * Pages are divided into rectangular regions called `textPara`s.
-* The `textPara`s in a page are sorted in reading ouder (the order they are read, not the
+* The `textPara`s in a page are sorted in reading order (the order they are read in, not the
 *reading* direction above).
 * Each `textPara` contains `textLine`s, lines with the `textPara`'s bounding box.
-* Each `textLine` has a text reprentation.
+* Each `textLine` has extracted for the line in its `text()` function.
 
 Page text is extracted by iterating over `textPara`s and within each `textPara` iterating over its
 `textLine`s.
@@ -31,14 +30,12 @@ Page text is extracted by iterating over `textPara`s and within each `textPara`
 WHERE TO START
 --------------
 
-`text_page.go` *makeTextPage* is the top level function that builds the `textPara`s.
+`text_page.go` **makeTextPage** is the top level function that builds the `textPara`s.
 
 * A page's `textMark`s are obtained from its contentstream.
 * The `textMark`s are divided into `textWord`s.
 * The `textWord`s are grouped into depth bins with each the contents of each bin sorted by reading direction.
-* The page area is into rectangular regions for each paragraph.
+* The page area is divided into rectangular regions, one for each paragraph.
 * The words in of each rectangular region are aranged inot`textLine`s. Each rectangular region and its constituent lines is a `textPara`.
 * The `textPara`s are sorted into reading order.
-
-