mirror of
https://github.com/unidoc/unipdf.git
synced 2025-05-13 19:29:10 +08:00
Updated extractor/README
parent acb5caaf6c
commit 80b54ef1de
@@ -1,9 +1,6 @@
TEXT EXTRACTION CODE
====================

BASIC IDEAS
-----------

There are two [directions](https://www.w3.org/International/questions/qa-scripts.en#directions).

- *reading*
@@ -13,18 +10,6 @@ In English text,
- the *reading* direction is left to right, increasing X in the PDF coordinate system.
- the *depth* direction is top to bottom, decreasing Y in the PDF coordinate system.

*depth* is the distance of the bottom of a word's bounding box from the top of the page.
    depth := pageSize.Ury - r.Lly

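The depth formula above can be tried out in isolation. The following is a minimal sketch, not the extractor's own code: the `rect` type is a hypothetical stand-in for a PDF rectangle with lower-left (`Llx`, `Lly`) and upper-right (`Urx`, `Ury`) corners, and the page and word sizes are made-up values.

```go
package main

import "fmt"

// rect is a hypothetical stand-in for a PDF rectangle, given by its
// lower-left (Llx, Lly) and upper-right (Urx, Ury) corners.
type rect struct {
	Llx, Lly, Urx, Ury float64
}

// depth returns the distance from the top of the page to the bottom of
// bounding box r. PDF Y coordinates increase upwards, so a box lower on
// the page has a smaller Lly and therefore a larger depth.
func depth(pageSize, r rect) float64 {
	return pageSize.Ury - r.Lly
}

func main() {
	page := rect{Llx: 0, Lly: 0, Urx: 612, Ury: 792} // assumed US Letter page
	word := rect{Llx: 72, Lly: 700, Urx: 120, Ury: 712}
	fmt.Println(depth(page, word)) // 792 - 700 = 92
}
```

Measuring depth from the top of the page means that sorting by increasing depth visits text from top to bottom.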
* Pages are divided into rectangular regions called `textPara`s.
* The `textPara`s in a page are sorted in reading order (the order they are read in, not the
*reading* direction above).
* Each `textPara` contains `textLine`s, lines within the `textPara`'s bounding box.
* The text extracted for each `textLine` is returned by its `text()` function.
* Page text is extracted by iterating over `textPara`s and within each `textPara` iterating over its
`textLine`s.
* The `textMark`s corresponding to extracted text can be found.


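The nested iteration described in the list above can be sketched as follows. This is an illustration only: the `textPara` and `textLine` types here are heavily simplified, hypothetical stand-ins that borrow the README's names, and `pageText` is an assumed helper, not a function of the extractor package.

```go
package main

import (
	"fmt"
	"strings"
)

// textLine is a simplified, hypothetical stand-in: just the words of one line,
// already in reading order.
type textLine struct {
	words []string
}

// text returns the text of the line, with words separated by single spaces.
func (l textLine) text() string {
	return strings.Join(l.words, " ")
}

// textPara is a simplified, hypothetical stand-in: the lines of one region.
type textPara struct {
	lines []textLine
}

// pageText assumes paras is already sorted in reading order. It writes each
// line followed by a newline and a blank line after each paragraph.
func pageText(paras []textPara) string {
	var sb strings.Builder
	for _, para := range paras {
		for _, line := range para.lines {
			sb.WriteString(line.text())
			sb.WriteString("\n")
		}
		sb.WriteString("\n")
	}
	return sb.String()
}

func main() {
	paras := []textPara{
		{lines: []textLine{{words: []string{"Hello", "world"}}, {words: []string{"again"}}}},
		{lines: []textLine{{words: []string{"Second"}}}},
	}
	fmt.Print(pageText(paras))
}
```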
HOW TEXT IS EXTRACTED
---------------------
@@ -36,13 +21,13 @@ HOW TEXT IS EXTRACTED
and splitting on space characters and the gaps between marks.
* The `textWord`s are grouped into `textPara`s based on their bounding boxes' proximities to other
textWords.
* The textWords in each textPara are arranged into textLines (textWords of similar depths).
* With each textLine, textWords are sorted in reading order and each one that starts a whole word is marked.
See textLine.text()
* textPara.writeCellText() shows how to extract the paragraph text from this arrangement.
* The `textWord`s in each `textPara` are arranged into `textLine`s (`textWord`s of similar depth).
* Within each `textLine`, `textWord`s are sorted in reading order and each one that starts a whole word is marked.
See `textLine.text()`.
* `textPara.writeCellText()` shows how to extract the paragraph text from this arrangement.
* All the `textPara`s on a page are checked to see if they are arranged as cells within a table and,
if they are, they are combined into `textTable`s and a textPara containing the textTable replaces
the textParas containing the cells.
the `textPara`s containing the cells.
* The textParas, some of which may be tables, are sorted into reading order (the order in which they
are read, not the *reading* direction).

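The line-forming step in the list above (words of similar depth become a line, then each line is sorted in the reading direction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the extractor's algorithm: the `word` type, the `groupIntoLines` and `lineText` helpers, and the fixed tolerance `tol` are all hypothetical, and the real code may decide "similar depth" differently.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// word is a hypothetical, simplified word: its text, the X coordinate of its
// left edge, and its depth (distance from the page top to its bottom edge).
type word struct {
	text  string
	x     float64
	depth float64
}

// groupIntoLines sorts words by depth and starts a new line whenever a word's
// depth differs from the first word of the current line by more than tol.
// Each line is then sorted by x, i.e. in the left-to-right reading direction.
func groupIntoLines(words []word, tol float64) [][]word {
	sorted := append([]word(nil), words...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].depth < sorted[j].depth })

	var lines [][]word
	for _, w := range sorted {
		n := len(lines)
		if n > 0 && math.Abs(w.depth-lines[n-1][0].depth) <= tol {
			lines[n-1] = append(lines[n-1], w)
			continue
		}
		lines = append(lines, []word{w})
	}
	for _, line := range lines {
		sort.SliceStable(line, func(i, j int) bool { return line[i].x < line[j].x })
	}
	return lines
}

// lineText joins the words of one line with single spaces.
func lineText(line []word) string {
	parts := make([]string, len(line))
	for i, w := range line {
		parts[i] = w.text
	}
	return strings.Join(parts, " ")
}

func main() {
	words := []word{
		{"world", 60, 100.5},
		{"Hello", 10, 100},
		{"Next", 10, 115},
	}
	for _, line := range groupIntoLines(words, 2) {
		fmt.Println(lineText(line))
	}
}
```

Comparing each word against the first word of the current line keeps the sketch short. It also means a line cannot drift: a long run of words whose depths creep downward is split once the difference exceeds `tol`.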
@@ -61,9 +46,12 @@ of about the same depth sorted left to right.
* textLine.markWordBoundaries() marks the textWords in each textLine that start whole words.

TODO
====
Remove serial code????
Reinstate rotated text handling.
Reinstate hyphen diacritic composition.
Reinstate duplicate text removal
-----

* Remove serial code????
* Remove verbose* logging?
* Reinstate rotated text handling.
* Reinstate diacritic composition.
* Reinstate duplicate text removal.
* Reinstate creater_test.go extraction test.