unipdf/extractor
2020-06-22 21:17:39 +10:00

TEXT EXTRACTION CODE

There are two directions:

  • reading
  • depth

In English text,

  • the reading direction is left to right, increasing X in the PDF coordinate system.
  • the depth direction is top to bottom, decreasing Y in the PDF coordinate system.
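
As a minimal sketch of the two directions (the names point, reading, depth and pageHeight are illustrative assumptions, not identifiers taken from the extractor code), depth can be modeled as the distance from the top of the page, so that it increases as Y decreases:

```go
package main

import "fmt"

// point is a hypothetical position in PDF coordinates (origin at bottom left).
type point struct{ X, Y float64 }

// reading returns the coordinate that increases along the reading direction.
func reading(p point) float64 { return p.X }

// depth returns the coordinate that increases along the depth direction,
// modeled here as the distance from the top of a page of the given height.
func depth(p point, pageHeight float64) float64 { return pageHeight - p.Y }

func main() {
	const pageHeight = 792.0 // assumed page height in points, for illustration only
	upper := point{X: 72, Y: 720}
	lower := point{X: 72, Y: 100}
	// The upper point has the smaller depth even though it has the larger Y.
	fmt.Println(depth(upper, pageHeight) < depth(lower, pageHeight))
	fmt.Println(reading(upper))
}
```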

HOW TEXT IS EXTRACTED

makeTextPage in text_page.go is the top-level function that builds the textParas.

  • A page's textMarks are obtained from its content stream. They are in the order they occur in the content stream.
  • The textMarks are grouped into word fragments called textWords by scanning through the textMarks and splitting on space characters and the gaps between marks.
  • The textWords are grouped into textParas based on their bounding boxes' proximities to other textWords.
  • The textWords in each textPara are arranged into textLines (textWords of similar depth).
  • Within each textLine, textWords are sorted in reading order and each one that starts a whole word is marked. See textLine.text().
  • textPara.writeCellText() shows how to extract the paragraph text from this arrangement.
  • All the textParas on a page are checked to see if they are arranged as cells within a table and, if they are, they are combined into textTables and a textPara containing the textTable replaces the textParas containing the cells.
  • The textParas, some of which may be tables, are sorted into reading order (the order in which they are read, not the reading direction).

The entire order of extracted text from a page is expressed in paraList.writeText(), which

  • Iterates through the textParas, which are sorted in reading order.
  • For each textPara with a table, iterates through the table cell textParas.
  • For each (top level or table cell) textPara iterates through the textLines.
  • For each textLine iterates through the textWords inserting a space before each one that has the newWord flag set.
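
The nested iteration above can be sketched roughly as follows. The types are simplified, hypothetical stand-ins for the extractor's own, and skipping the space at the start of a line and ending each line with a newline are assumptions made for this example:

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified, hypothetical stand-ins for the extractor's types.
type textWord struct {
	text    string
	newWord bool // true if this fragment starts a whole word
}

type textLine struct{ words []textWord }

type textPara struct {
	lines []textLine
	cells []textPara // non-empty when the paragraph holds a table
}

// writeLines appends the text of each line, inserting a space before every
// fragment that has newWord set (assumption: not at the start of a line).
func writeLines(sb *strings.Builder, lines []textLine) {
	for _, line := range lines {
		for i, w := range line.words {
			if w.newWord && i > 0 {
				sb.WriteByte(' ')
			}
			sb.WriteString(w.text)
		}
		sb.WriteByte('\n')
	}
}

// writeText walks the paragraphs, which are assumed to be already sorted
// into reading order, descending into table cells where present.
func writeText(paras []textPara) string {
	var sb strings.Builder
	for _, p := range paras {
		if len(p.cells) > 0 {
			for _, c := range p.cells {
				writeLines(&sb, c.lines)
			}
			continue
		}
		writeLines(&sb, p.lines)
	}
	return sb.String()
}

func main() {
	para := textPara{lines: []textLine{{words: []textWord{
		{"Hel", true}, {"lo", false}, {"world", true},
	}}}}
	fmt.Print(writeText([]textPara{para}))
}
```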

textWord creation

  • makeTextWords() combines textMarks into textWords, which are word fragments.
  • textWords are the atoms of the text extraction code.
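
The splitting rule can be illustrated with a small sketch. This is not the makeTextWords() implementation: the textMark fields, the makeWords helper and the fixed maxGap threshold are assumptions chosen to show splitting on space marks and on gaps between marks.

```go
package main

import "fmt"

// textMark is a hypothetical, simplified mark: a piece of text and its
// horizontal extent (left and right X) in the reading direction.
type textMark struct {
	text     string
	llx, urx float64
}

// makeWords groups marks, taken in content stream order, into word
// fragments. A new fragment starts at a space mark or when the gap to the
// previous mark exceeds maxGap.
func makeWords(marks []textMark, maxGap float64) []string {
	var words []string
	cur := ""
	prevURX := 0.0
	flush := func() {
		if cur != "" {
			words = append(words, cur)
			cur = ""
		}
	}
	for _, m := range marks {
		if m.text == " " {
			flush()
			continue
		}
		if cur != "" && m.llx-prevURX > maxGap {
			flush()
		}
		cur += m.text
		prevURX = m.urx
	}
	flush()
	return words
}

func main() {
	marks := []textMark{
		{"H", 0, 5}, {"i", 5, 8}, {" ", 8, 10},
		{"t", 10, 13}, {"h", 13, 16}, {"e", 30, 33}, // large gap before "e"
	}
	fmt.Println(makeWords(marks, 3))
}
```

Note that the gap before "e" produces two fragments, "th" and "e", which is why textWords are described as word fragments rather than whole words.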

textPara creation

  • dividePage() combines textWords that are close to each other into groups in rectangular regions called wordBags.
  • wordBag.arrangeText() arranges the textWords in the rectangle into textLines, which are groups of textWords of about the same depth, sorted left to right.
  • textLine.markWordBoundaries() marks the textWords in each textLine that start whole words.
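
A rough sketch of the last two steps is shown below. The textWord fields, the arrangeLines helper, the depth tolerance tol and the gap threshold minGap are all assumptions made for illustration, as is treating the first fragment of a line as the start of a word; the real wordBag.arrangeText() and textLine.markWordBoundaries() may use different criteria.

```go
package main

import (
	"fmt"
	"sort"
)

// textWord is a hypothetical, simplified word fragment.
type textWord struct {
	text     string
	llx, urx float64 // horizontal extent in the reading direction
	depth    float64 // distance from the top of the page
	newWord  bool    // set by markWordBoundaries
}

// arrangeLines groups words whose depth is within tol of the first word
// of the current line, then sorts each line left to right.
func arrangeLines(words []textWord, tol float64) [][]textWord {
	sorted := append([]textWord(nil), words...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].depth < sorted[j].depth })
	var lines [][]textWord
	for _, w := range sorted {
		n := len(lines)
		if n > 0 && w.depth-lines[n-1][0].depth <= tol {
			lines[n-1] = append(lines[n-1], w)
		} else {
			lines = append(lines, []textWord{w})
		}
	}
	for _, line := range lines {
		sort.SliceStable(line, func(i, j int) bool { return line[i].llx < line[j].llx })
	}
	return lines
}

// markWordBoundaries flags the first fragment of a line and every fragment
// separated from its left neighbor by more than minGap.
func markWordBoundaries(line []textWord, minGap float64) {
	for i := range line {
		line[i].newWord = i == 0 || line[i].llx-line[i-1].urx > minGap
	}
}

func main() {
	words := []textWord{
		{text: "world", llx: 40, urx: 70, depth: 100},
		{text: "Hel", llx: 10, urx: 25, depth: 100.5},
		{text: "lo", llx: 25, urx: 35, depth: 99.8},
		{text: "Next", llx: 10, urx: 40, depth: 115},
	}
	for _, line := range arrangeLines(words, 2) {
		markWordBoundaries(line, 1)
		for _, w := range line {
			fmt.Printf("%s(newWord=%v) ", w.text, w.newWord)
		}
		fmt.Println()
	}
}
```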

TODO

  • Remove serial code?
  • Remove verbose* logging?
  • Reinstate rotated text handling.
  • Reinstate diacritic composition.
  • Reinstate duplicate text removal.
  • Reinstate creator_test.go extraction test.
  • Come up with a better name for reading direction.