unipdf/extractor
Peter Williams 88fda44e0a
Text extraction code for columns. (#366)
* Fixed filename:page in logging

* Got CMap working for multi-rune entries

* Treat CMap entries as strings instead of runes to handle multi-byte encodings.

* Added a test for multibyte encoding.

* First version of text extraction that recognizes columns

* Added an explanation of the text columns code to README.md.

* fixed typos

* Abstracted textWord depth calculation. This required changing textMark to *textMark in a lot of code.

* Added function comments.

* Fixed text state save/restore.

* Adjusted inter-word search distance to make paragraph division work for thanh.pdf

* Got text_test.go passing.

* Reinstated hyphen suppression

* Handle more cases of fonts not being set in text extraction code.

* Fixed typo

* More verbose logging

* Adding tables to text extractor.

* Added tests for columns extraction.

* Removed commented code

* Check for textParas that are on the same line when writing out extracted text.

* Absorb text to the left of paras into paras e.g. Footnote numbers

* Removed funny character from text_test.go

* Commented out a creator_test.go test that was broken by my text extraction changes.

* Big changes to columns text extraction code for PR.

Performance improvements in several places.
Commented code.

* Updated extractor/README

* Cleaned up some comments and removed a panic

* Increased threshold for truncating extracted text when there is no license 100 -> 102.

This is a workaround to let a test in creator_test.go pass.

With the old text extraction code the following extracted text was 100 chars. With the new code it
is 102 chars which looks correct.

"你好\n你好你好你好你好\n河上白云\n\nUnlicensed UniDoc - Get a license on https://unidoc.io\n\n"

* Improved an error message.

* Removed irrelevant spaces

* Commented code and removed unused functions.

* Reverted PdfRectangle changes

* Added duplicate text detection.

* Combine diacritic textMarks in text extraction

* Reinstated a diacritic recombination test.

* Small code reorganisation

* Reinstated handling of rotated text

* Addressed issues in PR review

* Added color fields to TextMark

* Updated README

* Reinstated the disabled tests I missed before.

* Tightened definition for tables to prevent detection of tables where there weren't any.

* Compute line splitting search range based on fontsize of first word in word bag.

* Use errors.Is(err, core.ErrNotSupported) to distinguish unsupported font errors.

See https://blog.golang.org/go1.13-errors

* Fixed some naming and added some comments.

* errors.Is -> xerrors.Is and %w -> %v for go 1.12 compatibility

* Removed code that doesn't ever get called.

* Removed unused test
2020-06-30 19:33:10 +00:00

TEXT EXTRACTION CODE

There are two directions.

  • reading
  • depth

In English text,

  • the reading direction is left to right, increasing X in the PDF coordinate system.
  • the depth direction is top to bottom, decreasing Y in the PDF coordinate system.
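
The two directions can be illustrated with a minimal sketch. The rect type and the readingPos() and depthPos() helpers below are hypothetical names invented for this example; they are not part of the extractor. The sketch assumes left-to-right text and measures depth from the top of the page, so that depth grows as Y shrinks.

```go
package main

import "fmt"

// rect is a hypothetical axis-aligned bounding box in PDF coordinates
// (origin at the bottom left of the page, Y increasing upwards).
type rect struct {
	Llx, Lly, Urx, Ury float64
}

// readingPos returns the position of r along the reading direction.
// For left-to-right text this is the left edge of the box.
func readingPos(r rect) float64 {
	return r.Llx
}

// depthPos returns the position of r along the depth direction, measured
// from the top of a page of height pageHeight. Smaller Y means larger depth.
func depthPos(pageHeight float64, r rect) float64 {
	return pageHeight - r.Lly
}

func main() {
	const pageHeight = 792.0 // assumed page height in points
	title := rect{Llx: 72, Lly: 700, Urx: 300, Ury: 720}
	body := rect{Llx: 72, Lly: 650, Urx: 500, Ury: 662}

	// The title has the larger Y, so it has the smaller depth.
	fmt.Println("title depth:", depthPos(pageHeight, title))
	fmt.Println("body depth: ", depthPos(pageHeight, body))
	fmt.Println("title reading position:", readingPos(title))
}
```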

HOW TEXT IS EXTRACTED

makeTextPage() in text_page.go is the top-level text extraction function. It returns an ordered list of textParas, which are described below.

  • A page's textMarks are obtained from its content stream. They are in the order they occur in the content stream.
  • The textMarks are grouped into word fragments called textWords by scanning through the textMarks and splitting on space characters and the gaps between marks.
  • The textWords are grouped into rectangular regions based on their bounding boxes' proximities to other textWords. These rectangular regions are called textParas. (In the current implementation there is an intermediate step where the textWords are divided into containers called wordBags.)
  • The textWords in each textPara are arranged into textLines (textWords of similar depth).
  • Within each textLine, textWords are sorted in reading order and each one that starts a whole word is marked by setting its newWord flag to true. (See textLine.text().)
  • All the textParas on a page are checked to see if they are arranged as cells within a table and, if they are, they are combined into textTables and a textPara containing the textTable replaces the textParas containing the cells.
  • The textParas, some of which may be tables, are sorted into reading order (the order in which they are read, not in the reading direction).

The complete ordering of the text extracted from a page is expressed in paraList.writeText().

  • This function iterates through the textParas, which are sorted in reading order.
  • For each textPara with a table, it iterates through the table cell textParas. (See textPara.writeCellText().)
  • For each (top level or table cell) textPara, it iterates through the textLines.
  • For each textLine, it iterates through the textWords inserting a space before each one that has the newWord flag set.
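
The nested iteration above can be sketched as follows. This is a simplified illustration, not the code in paraList.writeText(): the word, line and para types are hypothetical stand-ins for textWord, textLine and textPara, tables are left out, and the separators (one newline after each line, one blank line between paragraphs) are assumptions. It also assumes the first fragment on a line does not have newWord set.

```go
package main

import (
	"fmt"
	"strings"
)

// word, line and para are hypothetical stand-ins for textWord, textLine
// and textPara. Only the fields needed for writing text are modelled.
type word struct {
	text    string
	newWord bool // true if this fragment starts a whole word
}
type line struct{ words []word }
type para struct{ lines []line }

// lineText joins the fragments of a line, inserting a space before each
// fragment that has the newWord flag set.
func lineText(l line) string {
	var sb strings.Builder
	for _, w := range l.words {
		if w.newWord {
			sb.WriteByte(' ')
		}
		sb.WriteString(w.text)
	}
	return sb.String()
}

// writeText walks paragraphs (assumed to be in reading order already),
// then lines, then words. The separators are assumptions of this sketch.
func writeText(paras []para) string {
	var sb strings.Builder
	for i, p := range paras {
		if i > 0 {
			sb.WriteString("\n") // blank line between paragraphs
		}
		for _, l := range p.lines {
			sb.WriteString(lineText(l))
			sb.WriteString("\n")
		}
	}
	return sb.String()
}

func main() {
	paras := []para{
		{lines: []line{{words: []word{{"Hel", false}, {"lo", false}, {"world", true}}}}},
		{lines: []line{
			{words: []word{{"Second", false}}},
			{words: []word{{"para", false}}},
		}},
	}
	fmt.Print(writeText(paras))
}
```

Here "Hel" and "lo" are two fragments of one word, so no space is written between them, while "world" starts a new word and gets a leading space.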

textWord creation

  • makeTextWords() combines textMarks into textWords, which are word fragments.
  • textWords are the atoms of the text extraction code.
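
The splitting rule described earlier (split on space characters and on gaps between marks) might look like the sketch below. It is not makeTextWords() itself: the mark type, the makeWords() function and the fixed maxGap threshold are assumptions made for illustration, and only horizontal positions are considered.

```go
package main

import (
	"fmt"
	"strings"
)

// mark is a hypothetical stand-in for textMark: a piece of text with the
// left and right edges of its bounding box in the reading direction.
type mark struct {
	text     string
	llx, urx float64
}

// makeWords groups marks (in content stream order) into word fragments.
// A fragment ends at a whitespace mark, or when the gap between the end of
// the previous mark and the start of the next one exceeds maxGap.
func makeWords(marks []mark, maxGap float64) []string {
	var words []string
	var cur strings.Builder
	var prevEnd float64

	// flush emits the fragment built so far, if any.
	flush := func() {
		if cur.Len() > 0 {
			words = append(words, cur.String())
			cur.Reset()
		}
	}

	for _, m := range marks {
		if strings.TrimSpace(m.text) == "" {
			flush() // a space mark always ends the current fragment
			continue
		}
		if cur.Len() > 0 && m.llx-prevEnd > maxGap {
			flush() // a large gap also ends the current fragment
		}
		cur.WriteString(m.text)
		prevEnd = m.urx
	}
	flush()
	return words
}

func main() {
	marks := []mark{
		{"H", 0, 5}, {"i", 5, 7}, {" ", 7, 10},
		{"t", 10, 13}, {"o", 13, 17},
		{"m", 30, 36}, {"e", 36, 40}, // separated from "to" by a gap, not a space
	}
	fmt.Println(makeWords(marks, 3))
}
```

A fixed threshold keeps the sketch short. A real threshold would more plausibly scale with font size, which is left out here.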

textPara creation

  • dividePage() combines textWords that are close to each other into groups in rectangular regions called wordBags.
  • wordBag.arrangeText() arranges the textWords in the rectangular regions into textLines, which are groups of textWords of about the same depth, sorted left to right.
  • textLine.markWordBoundaries() marks the textWords in each textLine that start whole words.
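
The line arrangement step can be sketched as below. This is an illustrative simplification rather than the logic of wordBag.arrangeText(): the word type, the arrangeLines() function and the fixed depth tolerance tol are hypothetical. Words are grouped when their depth is within tol of the first word placed on the line, and each line is then sorted left to right. Marking word boundaries is not covered.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// word is a hypothetical stand-in for textWord, reduced to its text, its
// reading position (left edge) and its depth (distance from the page top).
type word struct {
	text  string
	x     float64
	depth float64
}

// arrangeLines groups words of about the same depth into lines and sorts
// each line in the reading direction. Lines are returned top to bottom.
func arrangeLines(words []word, tol float64) [][]word {
	// Work on a copy so that the caller's slice is not reordered.
	sorted := append([]word(nil), words...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].depth < sorted[j].depth
	})

	var lines [][]word
	for _, w := range sorted {
		n := len(lines)
		// Compare against the first (shallowest) word of the current line.
		if n > 0 && math.Abs(w.depth-lines[n-1][0].depth) <= tol {
			lines[n-1] = append(lines[n-1], w)
		} else {
			lines = append(lines, []word{w})
		}
	}

	// Sort each line left to right only after all grouping is done.
	for _, l := range lines {
		sort.SliceStable(l, func(i, j int) bool { return l[i].x < l[j].x })
	}
	return lines
}

func main() {
	words := []word{
		{"world", 50, 100.5},
		{"Hello", 10, 100},
		{"line", 60, 112},
		{"Second", 10, 112.3},
	}
	for _, l := range arrangeLines(words, 2) {
		for _, w := range l {
			fmt.Print(w.text, " ")
		}
		fmt.Println()
	}
}
```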

TODO

  • Handle diagonal text.
  • Get R to L text extraction working.
  • Get top to bottom text extraction working.