Updated extractor/README

2025-05-13 19:29:10 +08:00 · 2020-06-22 17:56:32 +10:00 · 2020-06-22 17:56:32 +10:00 · 80b54ef1de
commit 80b54ef1de
parent acb5caaf6c
1 changed file with 13 additions and 25 deletions
--- a/extractor/README.md
+++ b/extractor/README.md
@@ -1,9 +1,6 @@
 TEXT EXTRACTION CODE
 ====================

-BASIC IDEAS
-----------
-
 There are two [directions](https://www.w3.org/International/questions/qa-scripts.en#directions).

 - *reading*
@@ -13,18 +10,6 @@ In English text,
 - the *reading* direction is left to right, increasing X in the PDF coordinate system.
 - the *depth* direction is top to bottom, decreasing Y in the PDF coordinate system.

-*depth* is the distance of the bottom of a word's bounding box from the top of the page.
-depth := pageSize.Ury - r.Lly
-
-* Pages are divided into rectangular regions called `textPara`s.
-* The `textPara`s in a page are sorted in reading order (the order they are read in, not the
-*reading* direction above).
-* Each `textPara` contains `textLine`s, lines within the `textPara`'s bounding box.
-* Each `textLine` has the extracted text for the line in its `text()` function.
-* Page text is extracted by iterating over `textPara`s and within each `textPara` iterating over its
-`textLine`s.
-* The textMarks corresponding to extracted text can be found.
-

 HOW TEXT IS EXTRACTED
 ---------------------
@@ -36,13 +21,13 @@ HOW TEXT IS EXTRACTED
 and splitting on space characters and the gaps between marks.
 * The `textWord`s are grouped into `textPara`s based on their bounding boxes' proximities to other
 `textWord`s.
-* The textWords in each textPara are arranged into textLines (textWords of similar depths).
-* With each textLine, textWords are sorted in reading order each one that starts a whole word is marked.
-See textLine.text()
-* textPara.writeCellText() shows how to extract the paragraph text from this arrangement.
+* The `textWord`s in each `textPara` are arranged into `textLine`s (`textWord`s of similar depth).
+* Within each `textLine`, `textWord`s are sorted in reading order and each one that starts a whole word is marked.
+See `textLine.text()`.
+* `textPara.writeCellText()` shows how to extract the paragraph text from this arrangement.
 * All the `textPara`s on a page are checked to see if they are arranged as cells within a table and,
 if they are, they are combined into `textTable`s and a `textPara` containing the `textTable` replaces
-the textParas containing the cells.
+the `textPara`s containing the cells.
 * The `textPara`s, some of which may be tables, are sorted into reading order (the order in which they
 are read, not the *reading* direction).

@@ -61,9 +46,12 @@ of about the same depth sorted left to right.
 * textLine.markWordBoundaries() marks the textWords in each textLine that start whole words.

 TODO
-====
-Remove serial code????
-Reinstate rotated text handling.
-Reinstate hyphen diacritic composition.
-Reinstate duplicate text removal
+-----
+
+* Remove serial code????
+* Remove verbose* logging?
+* Reinstate rotated text handling.
+* Reinstate diacritic composition.
+* Reinstate duplicate text removal.
+* Reinstate creater_test.go extraction test.
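
The depth definition and the line-building step described in the README can be illustrated with a minimal Go sketch. It is not the extractor's actual code: `rect`, `word`, `depth`, `groupLines`, `lineText` and the tolerance `tol` are hypothetical names introduced here, and the grouping rule (compare each word's depth with the first word of the current line) is an assumption chosen for brevity.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// rect is a hypothetical bounding box in PDF coordinates (origin at bottom-left).
type rect struct{ Llx, Lly, Urx, Ury float64 }

// word is a hypothetical stand-in for the README's textWord.
type word struct {
	text string
	bbox rect
}

// depth follows the README's definition: depth := pageSize.Ury - r.Lly,
// i.e. the distance of the bottom of r from the top of the page.
func depth(pageUry float64, r rect) float64 { return pageUry - r.Lly }

// groupLines arranges words into lines of similar depth, then sorts each line
// in the reading direction (increasing X). tol is an assumed depth tolerance.
func groupLines(words []word, pageUry, tol float64) [][]word {
	// Work on a copy so the caller's slice order is left untouched.
	sorted := append([]word(nil), words...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return depth(pageUry, sorted[i].bbox) < depth(pageUry, sorted[j].bbox)
	})
	var lines [][]word
	for _, w := range sorted {
		n := len(lines)
		// Join the current line if this word's depth is close to its first word's depth.
		if n > 0 && math.Abs(depth(pageUry, w.bbox)-depth(pageUry, lines[n-1][0].bbox)) <= tol {
			lines[n-1] = append(lines[n-1], w)
			continue
		}
		lines = append(lines, []word{w})
	}
	for _, line := range lines {
		line := line
		sort.SliceStable(line, func(i, j int) bool { return line[i].bbox.Llx < line[j].bbox.Llx })
	}
	return lines
}

// lineText joins the words of one line with single spaces.
func lineText(line []word) string {
	s := ""
	for i, w := range line {
		if i > 0 {
			s += " "
		}
		s += w.text
	}
	return s
}

func main() {
	words := []word{
		{"world", rect{Llx: 100, Lly: 700, Urx: 140, Ury: 712}},
		{"Next", rect{Llx: 50, Lly: 680, Urx: 80, Ury: 692}},
		{"Hello", rect{Llx: 50, Lly: 701, Urx: 90, Ury: 713}},
	}
	for _, line := range groupLines(words, 792, 2) {
		fmt.Println(lineText(line))
	}
}
```

With a page height of 792 and a tolerance of 2, "Hello" and "world" (depths 91 and 92) fall on one line and "Next" (depth 112) on the following one. A real implementation would also need to handle the gaps between marks and the word-boundary marking that the README mentions.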