mirror of
https://github.com/unidoc/unipdf.git
synced 2025-04-24 13:48:49 +08:00

TEXT EXTRACTION CODE
There are two directions:
- reading
- depth
In English text,
- the reading direction is left to right, increasing X in the PDF coordinate system.
- the depth direction is top to bottom, decreasing Y in the PDF coordinate system.
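A minimal sketch of how the two directions map onto comparisons of PDF coordinates, for unrotated left-to-right text. The point type, the function names and the pageHeight parameter are hypothetical and exist only for this illustration; they are not taken from the extractor.

```go
package main

import "fmt"

// point is a hypothetical position in PDF coordinates (origin at bottom left).
type point struct{ x, y float64 }

// readsBefore reports whether a comes before b in the reading direction
// (left to right, so smaller X comes first).
func readsBefore(a, b point) bool { return a.x < b.x }

// depth converts a Y coordinate into a value that grows down the page.
// pageHeight is an assumed parameter: the page height in points.
func depth(p point, pageHeight float64) float64 { return pageHeight - p.y }

func main() {
	const pageHeight = 792.0 // US Letter height in points
	title := point{x: 72, y: 720}
	body := point{x: 72, y: 650}

	// The title has the larger Y, so it has the smaller depth.
	fmt.Println(depth(title, pageHeight) < depth(body, pageHeight)) // true
	fmt.Println(readsBefore(point{x: 72, y: 650}, point{x: 300, y: 650})) // true
}
```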
HOW TEXT IS EXTRACTED
text_page.go
makeTextPage() is the top level text extraction function. It returns an ordered
list of textParas, which are described below.
- A page's textMarks are obtained from its content stream. They are in the order they occur in the content stream.
- The textMarks are grouped into word fragments called textWords by scanning through the textMarks and splitting on space characters and the gaps between marks.
- The textWords are grouped into rectangular regions based on their bounding boxes' proximities to other textWords. These rectangular regions are called textParas. (In the current implementation there is an intermediate step where the textWords are divided into containers called wordBags.)
- The textWords in each textPara are arranged into textLines (textWords of similar depth).
- Within each textLine, textWords are sorted in reading order and each one that starts a whole word is marked by setting its newWord flag to true. (See textLine.text().)
- All the textParas on a page are checked to see if they are arranged as cells within a table and, if they are, they are combined into textTables and a textPara containing the textTable replaces the textParas containing the cells.
- The textParas, some of which may be tables, are sorted into reading order (the order in which they are read, not in the reading direction).
The entire order of extracted text from a page is expressed in paraList.writeText().
- This function iterates through the textParas, which are sorted in reading order.
- For each textPara with a table, it iterates through the table cell textParas. (See textPara.writeCellText().)
- For each (top level or table cell) textPara, it iterates through the textLines.
- For each textLine, it iterates through the textWords, inserting a space before each one that has the newWord flag set.
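The nested iteration above can be sketched roughly as follows. This is an illustration under stated assumptions, not the extractor's code: the word, line and para types and their fields are hypothetical, a newline after every line is an assumed separator, and the first word of a line is assumed never to get a leading space.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, simplified stand-ins for textWord, textLine and textPara.
type word struct {
	text    string
	newWord bool // true if this fragment starts a whole word
}
type line struct{ words []word }
type para struct {
	lines []line
	cells []para // non-empty when this para holds a table
}

// writeLine emits the fragments of one line, inserting a space before
// each fragment flagged as starting a new word (except the first).
func writeLine(sb *strings.Builder, l line) {
	for i, w := range l.words {
		if w.newWord && i > 0 {
			sb.WriteByte(' ')
		}
		sb.WriteString(w.text)
	}
	sb.WriteByte('\n') // assumed line separator
}

// writePara descends into table cells if there are any, else writes lines.
func writePara(sb *strings.Builder, p para) {
	if len(p.cells) > 0 {
		for _, c := range p.cells {
			writePara(sb, c)
		}
		return
	}
	for _, l := range p.lines {
		writeLine(sb, l)
	}
}

// writeText walks paras that are assumed to be sorted in reading order.
func writeText(paras []para) string {
	var sb strings.Builder
	for _, p := range paras {
		writePara(&sb, p)
	}
	return sb.String()
}

func main() {
	cell := func(s string) para {
		return para{lines: []line{{words: []word{{text: s}}}}}
	}
	paras := []para{
		{lines: []line{{words: []word{{"Hel", false}, {"lo", false}, {"world", true}}}}},
		{cells: []para{cell("A"), cell("B")}},
	}
	fmt.Printf("%q\n", writeText(paras)) // "Hello world\nA\nB\n"
}
```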
textWord creation
makeTextWords() combines textMarks into textWords, word fragments.
textWords are the atoms of the text extraction code.
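A rough sketch of the splitting rule described earlier (break on space characters and on gaps between marks). The mark type, the makeWords name and the fixed maxGap threshold are assumptions made for this example; how the real makeTextWords() chooses its gap threshold is not described here.

```go
package main

import (
	"fmt"
	"strings"
)

// mark is a hypothetical stand-in for textMark: a piece of text with the
// left (x0) and right (x1) edges of its bounding box on one baseline.
type mark struct {
	text   string
	x0, x1 float64
}

// makeWords groups marks, given in content stream order, into word
// fragments. A fragment ends at a space mark or at a horizontal gap
// wider than maxGap (an assumed, fixed threshold).
func makeWords(marks []mark, maxGap float64) []string {
	var words []string
	var cur strings.Builder
	flush := func() {
		if cur.Len() > 0 {
			words = append(words, cur.String())
			cur.Reset()
		}
	}
	for i, m := range marks {
		if strings.TrimSpace(m.text) == "" {
			flush() // a space mark ends the current fragment
			continue
		}
		if i > 0 && m.x0-marks[i-1].x1 > maxGap {
			flush() // a large gap also ends the current fragment
		}
		cur.WriteString(m.text)
	}
	flush()
	return words
}

func main() {
	marks := []mark{
		{"H", 0, 5}, {"i", 5, 8}, {" ", 8, 11},
		{"y", 11, 16}, {"o", 16, 21}, {"u", 30, 35},
	}
	fmt.Printf("%q\n", makeWords(marks, 3)) // ["Hi" "yo" "u"]
}
```

Note that "yo" and "u" come out as separate fragments here; deciding which fragments start whole words is a later step.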
textPara creation
dividePage() combines textWords that are close to each other into groups in rectangular regions called wordBags.
wordBag.arrangeText() arranges the textWords in the rectangular regions into textLines, groups of textWords of about the same depth sorted left to right.
textLine.markWordBoundaries() marks the textWords in each textLine that start whole words.
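The last two steps can be sketched as below, assuming the words of one rectangular region are already known. Everything here is hypothetical: the word type, the arrangeLines, markWordBoundaries and lineText names, and the fixed depthTol and maxGap thresholds are illustration only and may differ from what wordBag.arrangeText() and textLine.markWordBoundaries() actually do.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// word is a hypothetical stand-in for textWord.
type word struct {
	text    string
	x0, x1  float64 // left and right edges
	depth   float64 // distance from the top of the page
	newWord bool
}

// arrangeLines groups words whose depth is within depthTol of the first
// word of a line, then sorts each line left to right.
func arrangeLines(words []word, depthTol float64) [][]word {
	sorted := append([]word(nil), words...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].depth < sorted[j].depth })
	var lines [][]word
	for _, w := range sorted {
		n := len(lines)
		if n > 0 && w.depth-lines[n-1][0].depth <= depthTol {
			lines[n-1] = append(lines[n-1], w)
		} else {
			lines = append(lines, []word{w})
		}
	}
	for _, l := range lines {
		sort.SliceStable(l, func(i, j int) bool { return l[i].x0 < l[j].x0 })
	}
	return lines
}

// markWordBoundaries flags each fragment separated from the previous one
// by more than maxGap as the start of a whole word.
func markWordBoundaries(l []word, maxGap float64) {
	for i := 1; i < len(l); i++ {
		l[i].newWord = l[i].x0-l[i-1].x1 > maxGap
	}
}

// lineText joins fragments, putting a space before each flagged one.
func lineText(l []word) string {
	var sb strings.Builder
	for _, w := range l {
		if w.newWord {
			sb.WriteByte(' ')
		}
		sb.WriteString(w.text)
	}
	return sb.String()
}

func main() {
	words := []word{
		{text: "world", x0: 40, x1: 70, depth: 100},
		{text: "Hel", x0: 0, x1: 15, depth: 100.5},
		{text: "lo", x0: 15, x1: 25, depth: 100},
		{text: "Next", x0: 0, x1: 20, depth: 120},
	}
	for _, l := range arrangeLines(words, 2) {
		markWordBoundaries(l, 3)
		fmt.Println(lineText(l))
	}
}
```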
TODO
- Handle diagonal text.
- Get R to L text extraction working.
- Get top to bottom text extraction working.