OrgGo/unipdf

mirror of https://github.com/unidoc/unipdf.git synced 2025-05-14 19:29:50 +08:00

Peter Williams 603b5ff4e7 Added function comments.

2020-05-25 14:00:00 +10:00

const.go

First version of text extraction that recognizes columns

2020-05-24 21:00:37 +10:00

doc.go

Remmove pdf folder and move packages up one level (#2 )

2019-05-16 20:44:51 +00:00

extractor.go

First version of text extraction that recognizes columns

2020-05-24 21:00:37 +10:00

image_test.go

Remmove pdf folder and move packages up one level (#2 )

2019-05-16 20:44:51 +00:00

image.go

First version of text extraction that recognizes columns

2020-05-24 21:00:37 +10:00

README.md

Added function comments.

2020-05-25 14:00:00 +10:00

text_bound.go

Abstracted textWord depth calculation. This required change textMark to *textMark in a lot of code.

2020-05-25 09:39:30 +10:00

text_const.go

Added function comments.

2020-05-25 14:00:00 +10:00

text_line.go

Added function comments.

2020-05-25 14:00:00 +10:00

text_mark.go

Added function comments.

2020-05-25 14:00:00 +10:00

text_page.go

Abstracted textWord depth calculation. This required change textMark to *textMark in a lot of code.

2020-05-25 09:39:30 +10:00

text_para.go

Added function comments.

2020-05-25 14:00:00 +10:00

text_strata.go

Added function comments.

2020-05-25 14:00:00 +10:00

text_test.go

First version of text extraction that recognizes columns

2020-05-24 21:00:37 +10:00

text_utils.go

First version of text extraction that recognizes columns

2020-05-24 21:00:37 +10:00

text_word.go

Added function comments.

2020-05-25 14:00:00 +10:00

text.go

Abstracted textWord depth calculation. This required change textMark to *textMark in a lot of code.

2020-05-25 09:39:30 +10:00

utils.go

Simplify license loading and support environment variables

2019-08-04 09:28:42 +00:00

README.md

TEXT EXTRACTION CODE

The code is currently split across the text_*.go files to make it easier to navigate. Once you understand the code you may wish to recombine it into the original text.go.

BASIC IDEAS

There are two directions:

reading
depth

In English text,

the reading direction is left to right, increasing X in the PDF coordinate system.
the depth direction is top to bottom, decreasing Y in the PDF coordinate system.

We define depth as the distance from the top of the page to the bottom of a word's bounding box: depth := pageSize.Ury - r.Lly
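The depth formula above can be illustrated with a minimal sketch. The rect type and the depth helper here are hypothetical stand-ins written for this example, not the package's actual types; only the field names Ury and Lly are taken from the formula.

```go
package main

import "fmt"

// rect is a hypothetical stand-in for a PDF rectangle. The field names follow
// the Llx/Lly/Urx/Ury convention used in the depth formula.
type rect struct {
	Llx, Lly, Urx, Ury float64
}

// depth returns the distance from the top of the page to the bottom of r.
// Larger values are further down the page.
func depth(pageSize, r rect) float64 {
	return pageSize.Ury - r.Lly
}

func main() {
	page := rect{Llx: 0, Lly: 0, Urx: 612, Ury: 792} // US Letter, in points
	nearTop := rect{Llx: 72, Lly: 700, Urx: 120, Ury: 712}
	nearBottom := rect{Llx: 72, Lly: 100, Urx: 120, Ury: 112}

	fmt.Println(depth(page, nearTop))    // 92
	fmt.Println(depth(page, nearBottom)) // 692
}
```

Because Y decreases down the page in PDF coordinates, subtracting from the page's upper edge gives a value that increases in the depth direction, which makes it convenient to sort on.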

Pages are divided into rectangular regions called textParas.
The textParas in a page are sorted in reading order (the order they are read in, not the reading direction above).
Each textPara contains textLines, the lines within the textPara's bounding box.
Each textLine returns the text extracted for the line from its text() function.

Page text is extracted by iterating over textParas and within each textPara iterating over its textLines.
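The nested iteration described above might look roughly like the following sketch. The type names textPara and textLine and the text() method come from this README, but their fields here are simplified assumptions, and pageText is a hypothetical helper rather than a function from the package.

```go
package main

import (
	"fmt"
	"strings"
)

// textLine is a simplified stand-in: here a line is just a list of words.
type textLine struct {
	words []string
}

// text returns the extracted text for the line.
func (l textLine) text() string {
	return strings.Join(l.words, " ")
}

// textPara is a simplified stand-in holding lines in top-to-bottom order.
type textPara struct {
	lines []textLine
}

// pageText (hypothetical) walks paras in reading order and, within each
// para, walks its lines, emitting one line of text per textLine.
func pageText(paras []textPara) string {
	var sb strings.Builder
	for _, para := range paras {
		for _, line := range para.lines {
			sb.WriteString(line.text())
			sb.WriteByte('\n')
		}
	}
	return sb.String()
}

func main() {
	paras := []textPara{
		{lines: []textLine{{words: []string{"Left", "column,"}}, {words: []string{"line", "two."}}}},
		{lines: []textLine{{words: []string{"Right", "column."}}}},
	}
	fmt.Print(pageText(paras))
}
```

The sketch assumes the paras slice is already in reading order, so all the column handling happens before this loop runs.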

WHERE TO START

makeTextPage in text_page.go is the top-level function that builds the textParas.

A page's textMarks are obtained from its content stream.
The textMarks are divided into textWords.
The textWords are grouped into depth bins with the contents of each bin sorted by reading direction.
The page area is divided into rectangular regions, one for each paragraph.
The words in each rectangular region are arranged into textLines. Each rectangular region and its constituent lines is a textPara.
The textParas are sorted into reading order.
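As an illustration of the depth-bin step in the list above, here is a minimal sketch of grouping words into bins by depth and sorting each bin by reading direction. The word type, the binWords function and the binSize value are all assumptions made for this example; the actual code may choose bin sizes and data structures differently.

```go
package main

import (
	"fmt"
	"sort"
)

// word is a hypothetical, simplified word: its text, its position in the
// reading direction (x) and its depth as defined earlier.
type word struct {
	text  string
	x     float64
	depth float64
}

// binSize is an assumed bin height in points, chosen only for illustration.
const binSize = 10.0

// binWords groups words into depth bins keyed by bin index and sorts the
// contents of each bin by reading direction (increasing x).
// Depths are assumed to be non-negative.
func binWords(words []word) map[int][]word {
	bins := map[int][]word{}
	for _, w := range words {
		idx := int(w.depth / binSize)
		bins[idx] = append(bins[idx], w)
	}
	for _, b := range bins {
		sort.Slice(b, func(i, j int) bool { return b[i].x < b[j].x })
	}
	return bins
}

func main() {
	bins := binWords([]word{
		{text: "world", x: 150, depth: 95},
		{text: "Hello", x: 72, depth: 92},
		{text: "Next", x: 72, depth: 108},
	})

	// Visit bins from the top of the page downwards.
	keys := make([]int, 0, len(bins))
	for k := range bins {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	for _, k := range keys {
		for _, w := range bins[k] {
			fmt.Println(k, w.text)
		}
	}
}
```

With fixed-height bins, words on the same visual line usually share a bin, so words at a similar depth can be found by looking in a bin and its neighbours rather than scanning the whole page.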

TODO

Remove serial code.