diff --git a/extractor/README.md b/extractor/README.md
index 2351ab8d..fde36697 100644
--- a/extractor/README.md
+++ b/extractor/README.md
@@ -1,9 +1,6 @@
 TEXT EXTRACTION CODE
 ====================
 
-BASIC IDEAS
------------
-
 There are two [directions](https://www.w3.org/International/questions/qa-scripts.en#directions)s\.
 
 - *reading*
@@ -13,18 +10,6 @@ In English text,
 - the *reading* direction is left to right, increasing X in the PDF coordinate system.
 - the *depth* directon is top to bottom, decreasing Y in the PDF coordinate system.
 
-*depth* is the distance from the bottom of a word's bounding box from the top of the page.
-depth := pageSize.Ury - r.Lly
-
-* Pages are divided into rectangular regions called `textPara`s.
-* The `textPara`s in a page are sorted in reading order (the order they are read in, not the
-*reading* direction above).
-* Each `textPara` contains `textLine`s, lines with the `textPara`'s bounding box.
-* Each `textLine` has extracted for the line in its `text()` function.
-* Page text is extracted by iterating over `textPara`s and within each `textPara` iterating over its
-`textLine`s.
-* The textMarks corresponding to extracted text can be found.
-
 HOW TEXT IS EXTRACTED
 ---------------------
 
@@ -36,13 +21,13 @@ HOW TEXT IS EXTRACTED
 and spltting on space characters and the gaps between marks.
 * The `textWords`s are grouped into `textParas`s based on their bounding boxes' proximities to other
 textWords.
-* The textWords in each textPara are arranged into textLines (textWords of similar depths).
-* With each textLine, textWords are sorted in reading order each one that starts a whole word is marked.
-See textLine.text()
-* textPara.writeCellText() shows how to extract the paragraph text from this arrangment.
+* The `textWord`s in each `textPara` are arranged into `textLine`s (`textWord`s of similar depth).
+* Within each `textLine`, `textWord`s are sorted in reading order each one that starts a whole word is marked.
+See `textLine.text()`.
+* `textPara.writeCellText()` shows how to extract the paragraph text from this arrangment.
 * All the `textPara`s on a page are checked to see if they are arranged as cells within a table and,
 if they are, they are combined into `textTable`s and a textPara containing the textTable replaces the
-the textParas containing the cells.
+the `textPara`s containing the cells.
 * The textParas, some of which may be tables, in sorted into reading order (the order in which they
 are reading, not in the reading directions).
 
@@ -61,9 +46,12 @@ of about the same depth sorted left to right.
 * textLine.markWordBoundaries() marks the textWords in each textLine that start whole words.
 
 TODO
-====
-Remove serial code????
-Reinstate rotated text handling.
-Reinstate hyphen diacritic composition.
-Reinstate duplicate text removal
+-----
+
+* Remove serial code????
+* Remove verbose* logginng?
+* Reinstate rotated text handling.
+* Reinstate diacritic composition.
+* Reinstate duplicate text removal.
+* Reinstate creater_test.go extraction test.
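
For reviewers unfamiliar with the *depth* convention this README relies on, the line-grouping step it describes can be sketched roughly as follows. This is an illustration only, not the extractor's actual code: `rect`, `word`, `groupLines`, `lineText`, and the tolerance parameter `tol` are hypothetical names introduced here. Only the formula `depth := pageSize.Ury - r.Lly` comes from the README text removed in this diff.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// rect is a hypothetical bounding box in PDF coordinates (origin at bottom-left).
type rect struct{ Llx, Lly, Urx, Ury float64 }

// word is a hypothetical stand-in for the README's textWord.
type word struct {
	text string
	bbox rect
}

// depth follows the README's definition: depth := pageSize.Ury - r.Lly.
// It grows as a word sits lower on the page.
func depth(pageSize, r rect) float64 {
	return pageSize.Ury - r.Lly
}

// groupLines orders words by depth, starts a new line whenever a word's depth
// differs from the first word of the current line by more than tol (an assumed
// tolerance), then sorts each line left to right (the reading direction).
func groupLines(pageSize rect, words []word, tol float64) [][]word {
	sorted := append([]word(nil), words...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return depth(pageSize, sorted[i].bbox) < depth(pageSize, sorted[j].bbox)
	})
	var lines [][]word
	for _, w := range sorted {
		n := len(lines)
		if n > 0 && math.Abs(depth(pageSize, w.bbox)-depth(pageSize, lines[n-1][0].bbox)) <= tol {
			lines[n-1] = append(lines[n-1], w)
			continue
		}
		lines = append(lines, []word{w})
	}
	for _, line := range lines {
		sort.SliceStable(line, func(i, j int) bool { return line[i].bbox.Llx < line[j].bbox.Llx })
	}
	return lines
}

// lineText joins the words of one line with single spaces.
func lineText(line []word) string {
	parts := make([]string, len(line))
	for i, w := range line {
		parts[i] = w.text
	}
	return strings.Join(parts, " ")
}

func main() {
	page := rect{0, 0, 612, 792}
	words := []word{
		{"world", rect{60, 700, 100, 712}},
		{"Hello", rect{10, 701, 50, 713}},
		{"Second", rect{10, 680, 60, 692}},
	}
	for _, line := range groupLines(page, words, 2) {
		fmt.Println(lineText(line))
	}
}
```

Comparing each word against the first word of its line keeps the sketch short. The real extractor may well use a different similarity test for "similar depth".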