mirror of
https://github.com/unidoc/unipdf.git
synced 2025-05-14 19:29:50 +08:00
Added an expanation of the text columns code to README.md.
This commit is contained in:
parent
6b13a99b82
commit
a5c538f420
@ -1,3 +1,11 @@
|
||||
TEXT EXTRACTION CODE
|
||||
====================
|
||||
The code is currently split accross the text_*.go files to make it easier to navigate. Once you
|
||||
understand the code you may wish to recombine this in the orginal text.go
|
||||
\
|
||||
|
||||
BASIC IDEAS
|
||||
-----------
|
||||
There are two directions
|
||||
|
||||
- *reading*
|
||||
@ -9,3 +17,28 @@ In English text,
|
||||
|
||||
We define *depth* as distance from the bottom of a word's bounding box from the top of the page.
|
||||
depth := pageSize.Ury - r.Lly
|
||||
|
||||
* Pages are divided into rectangular regions called `textPara`s.
|
||||
* The `textPara`s in a page are sorted in reading ouder (the order they are read, not the
|
||||
*reading* direction above).
|
||||
* Each `textPara` contains `textLine`s, lines with the `textPara`'s bounding box.
|
||||
* Each `textLine` has a text reprentation.
|
||||
|
||||
Page text is extracted by iterating over `textPara`s and within each `textPara` iterating over its
|
||||
`textLine`s.
|
||||
|
||||
|
||||
WHERE TO START
|
||||
--------------
|
||||
|
||||
`text_page.go` *makeTextPage* is the top level function that builds the `textPara`s.
|
||||
|
||||
* A page's `textMark`s are obtained from its contentstream.
|
||||
* The `textMark`s are divided into `textWord`s.
|
||||
* The `textWord`s are grouped into depth bins with each the contents of each bin sorted by reading direction.
|
||||
* The page area is into rectangular regions for each paragraph.
|
||||
* The words in of each rectangular region are aranged inot`textLine`s. Each rectangular region and
|
||||
its constituent lines is a `textPara`.
|
||||
* The `textPara`s are sorted into reading order.
|
||||
|
||||
|
||||
|
Loading…
x
Reference in New Issue
Block a user