mirror of
https://github.com/unidoc/unipdf.git
synced 2025-05-14 19:29:50 +08:00
TEXT EXTRACTION CODE
The code is currently split across the text_*.go files to make it easier to navigate. Once you
understand the code you may wish to recombine it into the original text.go.
BASIC IDEAS
There are two directions:
- reading
- depth
In English text,
- the reading direction is left to right, increasing X in the PDF coordinate system.
- the depth direction is top to bottom, decreasing Y in the PDF coordinate system.
We define depth as the distance from the top of the page to the bottom of a word's bounding box: depth := pageSize.Ury - r.Lly
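The depth formula can be tried out with a minimal sketch. The rect type and depthOf function below are hypothetical stand-ins for illustration, not the extractor's actual types; only the expression pageSize.Ury - r.Lly comes from the text above.

```go
package main

import "fmt"

// rect is a hypothetical stand-in for a PDF rectangle: lower-left and
// upper-right corners in PDF coordinates, where Y increases upwards.
type rect struct {
	Llx, Lly, Urx, Ury float64
}

// depthOf returns the distance from the top of the page to the bottom of r,
// following the definition depth := pageSize.Ury - r.Lly.
func depthOf(pageSize, r rect) float64 {
	return pageSize.Ury - r.Lly
}

func main() {
	page := rect{Llx: 0, Lly: 0, Urx: 612, Ury: 792} // US Letter, in points

	// A word near the top of the page has a small depth.
	upper := rect{Llx: 72, Lly: 700, Urx: 120, Ury: 712}
	// A word near the bottom of the page has a large depth.
	lower := rect{Llx: 72, Lly: 100, Urx: 120, Ury: 112}

	fmt.Println(depthOf(page, upper)) // 92
	fmt.Println(depthOf(page, lower)) // 692
}
```

Depth grows as text moves down the page, so sorting by increasing depth gives top-to-bottom order even though Y decreases.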
- Pages are divided into rectangular regions called textParas.
- The textParas in a page are sorted in reading order (the order they are read in, not the reading direction above).
- Each textPara contains textLines, lines within the textPara's bounding box.
- Each textLine has the text extracted for the line in its text() function.
Page text is extracted by iterating over textParas and within each textPara iterating over its textLines.
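The nested iteration can be sketched as follows. The types here are simplified stand-ins that only borrow the names textPara, textLine and text() from the description above; their fields, the pageText helper and the choice of separators are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// textLine is a simplified stand-in: a line holding its words in reading order.
type textLine struct {
	words []string // hypothetical field
}

// text returns the text of the line.
func (l textLine) text() string {
	return strings.Join(l.words, " ")
}

// textPara is a simplified stand-in: a paragraph holding its lines in depth order.
type textPara struct {
	lines []textLine // hypothetical field
}

// pageText (a hypothetical helper) walks the paragraphs in reading order and,
// within each paragraph, its lines. Lines are joined with a newline and
// paragraphs with a blank line; both separators are assumptions.
func pageText(paras []textPara) string {
	var parts []string
	for _, para := range paras {
		var lines []string
		for _, line := range para.lines {
			lines = append(lines, line.text())
		}
		parts = append(parts, strings.Join(lines, "\n"))
	}
	return strings.Join(parts, "\n\n")
}

func main() {
	paras := []textPara{
		{lines: []textLine{
			{words: []string{"Hello", "world"}},
			{words: []string{"second", "line"}},
		}},
		{lines: []textLine{
			{words: []string{"Next", "paragraph"}},
		}},
	}
	fmt.Println(pageText(paras))
}
```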
WHERE TO START
text_page.go
makeTextPage is the top-level function that builds the textParas.
- A page's textMarks are obtained from its content stream.
- The textMarks are divided into textWords.
- The textWords are grouped into depth bins, with the contents of each bin sorted by reading direction.
- The page area is divided into rectangular regions, one for each paragraph.
- The words in each rectangular region are arranged into textLines. Each rectangular region and its constituent lines is a textPara.
- The textParas are sorted into reading order.
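The depth-bin step in the list above can be illustrated with a small sketch. It is not the code in text_page.go: the word type, the binHeight constant and the binWords and sortedBinIndexes functions are all hypothetical, and the fixed bin height of 6 points is an arbitrary assumption.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// binHeight is a hypothetical bin height in points.
const binHeight = 6.0

// word is a simplified stand-in for a word on the page.
type word struct {
	text  string
	llx   float64 // left edge: position in the reading direction
	depth float64 // distance of the word's bottom from the top of the page
}

// binWords groups words into bins keyed by floor(depth / binHeight) and sorts
// the contents of each bin by reading direction (increasing X).
func binWords(words []word) map[int][]word {
	bins := map[int][]word{}
	for _, w := range words {
		idx := int(math.Floor(w.depth / binHeight))
		bins[idx] = append(bins[idx], w)
	}
	for _, bin := range bins {
		bin := bin // sort the bin's backing array in place
		sort.Slice(bin, func(i, j int) bool { return bin[i].llx < bin[j].llx })
	}
	return bins
}

// sortedBinIndexes returns the bin indexes in increasing depth order.
func sortedBinIndexes(bins map[int][]word) []int {
	idxs := make([]int, 0, len(bins))
	for idx := range bins {
		idxs = append(idxs, idx)
	}
	sort.Ints(idxs)
	return idxs
}

func main() {
	words := []word{
		{text: "world", llx: 110, depth: 92},
		{text: "Hello", llx: 72, depth: 92.5},
		{text: "Next", llx: 72, depth: 106},
	}
	bins := binWords(words)
	for _, idx := range sortedBinIndexes(bins) {
		for _, w := range bins[idx] {
			fmt.Printf("bin %d: %s\n", idx, w.text)
		}
	}
}
```

Binning by depth lets words whose baselines differ slightly land in the same bin, and sorting each bin by X puts candidate neighbours next to each other before lines are formed.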
TODO
Remove serial code.