unipdf/extractor

TEXT EXTRACTION CODE

The code is currently split across the text_*.go files to make it easier to navigate. Once you understand the code, you may wish to recombine these files into the original text.go.

BASIC IDEAS

There are two directions:

  • reading
  • depth

In English text,

  • the reading direction is left to right, increasing X in the PDF coordinate system.
  • the depth direction is top to bottom, decreasing Y in the PDF coordinate system.

We define depth as the distance of the bottom of a word's bounding box from the top of the page: depth := pageSize.Ury - r.Lly
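
The depth definition can be illustrated with a minimal sketch. The rect type and the depth helper here are hypothetical stand-ins written for this example, not the package's own types; only the formula pageSize.Ury - r.Lly comes from the text above.

```go
package main

import "fmt"

// rect is a hypothetical stand-in for a PDF rectangle: lower-left and
// upper-right corners in PDF coordinates (Y increases up the page).
type rect struct {
	Llx, Lly, Urx, Ury float64
}

// depth returns the distance of the bottom of bounding box r from the
// top of the page, following depth := pageSize.Ury - r.Lly.
func depth(pageSize, r rect) float64 {
	return pageSize.Ury - r.Lly
}

func main() {
	page := rect{Llx: 0, Lly: 0, Urx: 612, Ury: 792} // US Letter, in points
	heading := rect{Llx: 72, Lly: 700, Urx: 300, Ury: 712}
	body := rect{Llx: 72, Lly: 650, Urx: 540, Ury: 662}

	fmt.Println(depth(page, heading)) // 92
	fmt.Println(depth(page, body))    // 142
}
```

A word lower on the page has a smaller Lly and therefore a larger depth, so sorting by increasing depth gives top-to-bottom order.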

  • Pages are divided into rectangular regions called textParas.
  • The textParas in a page are sorted in reading order (the order they are read in, not the reading direction above).
  • Each textPara contains textLines, the lines within the textPara's bounding box.
  • Each textLine returns the extracted text for the line from its text() function.

Page text is extracted by iterating over textParas and within each textPara iterating over its textLines.
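
The nested iteration can be sketched as follows. The para and line types are simplified, hypothetical versions of textPara and textLine, and the newline separators between lines and paragraphs are an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// line is a hypothetical, simplified textLine: just the words it holds.
type line struct{ words []string }

// text returns the extracted text for the line.
func (l line) text() string { return strings.Join(l.words, " ") }

// para is a hypothetical, simplified textPara: an ordered list of lines.
type para struct{ lines []line }

// pageText walks the paras in order and, within each para, its lines.
// The separators (newline per line, blank line per para) are assumptions.
func pageText(paras []para) string {
	var sb strings.Builder
	for _, p := range paras {
		for _, l := range p.lines {
			sb.WriteString(l.text())
			sb.WriteByte('\n')
		}
		sb.WriteByte('\n')
	}
	return sb.String()
}

func main() {
	paras := []para{
		{lines: []line{{words: []string{"Hello", "world"}}, {words: []string{"second", "line"}}}},
		{lines: []line{{words: []string{"Next", "para"}}}},
	}
	fmt.Print(pageText(paras))
}
```

The sketch assumes the paras are already in reading order, so the output order depends entirely on the earlier sorting step.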

WHERE TO START

makeTextPage in text_page.go is the top-level function that builds the textParas.

  • A page's textMarks are obtained from its content stream.
  • The textMarks are divided into textWords.
  • The textWords are grouped into depth bins with the contents of each bin sorted by reading direction.
  • The page area is divided into rectangular regions, one for each paragraph.
  • The words in each rectangular region are arranged into textLines. Each rectangular region and its constituent lines is a textPara.
  • The textParas are sorted into reading order.
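
The depth-binning step in the list above can be sketched roughly as below. This is an illustration under stated assumptions, not the package's implementation: the word type, the binHeight constant, and the binIndex and binWords helpers are all hypothetical, and the real code may choose bin boundaries differently.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// word is a hypothetical, simplified textWord.
type word struct {
	text  string
	llx   float64 // left edge, used for reading-direction order
	depth float64 // distance of the word's bottom from the top of the page
}

// binHeight is an assumed bin size in points, chosen for this example.
const binHeight = 6.0

// binIndex maps a depth to the index of the bin that contains it.
func binIndex(depth float64) int { return int(math.Floor(depth / binHeight)) }

// binWords groups words into depth bins and sorts each bin left to right.
func binWords(words []word) map[int][]word {
	bins := map[int][]word{}
	for _, w := range words {
		i := binIndex(w.depth)
		bins[i] = append(bins[i], w)
	}
	for _, b := range bins {
		// b shares its backing array with the slice stored in the map,
		// so sorting it in place reorders the stored bin too.
		sort.Slice(b, func(i, j int) bool { return b[i].llx < b[j].llx })
	}
	return bins
}

func main() {
	bins := binWords([]word{
		{text: "world", llx: 120, depth: 92},
		{text: "Hello", llx: 72, depth: 93},
		{text: "Next", llx: 72, depth: 142},
	})
	// Visit bins in depth order (top of page first).
	keys := make([]int, 0, len(bins))
	for k := range bins {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	for _, k := range keys {
		fmt.Println(k, bins[k])
	}
}
```

Binning by depth lets words whose baselines differ slightly still be treated as neighbours, and the per-bin sort means a scan across a bin follows the reading direction.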