TEXT EXTRACTION CODE
There are two directions.
- reading
- depth
In English text,
- the reading direction is left to right, increasing X in the PDF coordinate system.
- the depth direction is top to bottom, decreasing Y in the PDF coordinate system.
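The two directions can be illustrated with a small sketch. The `point` type, the `reading` and `depth` helpers, and the `pageHeight` parameter are hypothetical names used only for this illustration; they are not taken from the extraction code. The sketch assumes English text on an unrotated page.

```go
package main

import "fmt"

// point is a hypothetical position in PDF coordinates (origin at bottom left).
type point struct{ X, Y float64 }

// reading returns the position along the reading direction.
// For English text this is X, which increases left to right.
func reading(p point) float64 { return p.X }

// depth returns the position along the depth direction, flipped so that it
// increases down the page. pageHeight is an assumed parameter.
func depth(p point, pageHeight float64) float64 { return pageHeight - p.Y }

func main() {
	const pageHeight = 792.0 // US Letter height in points
	top := point{X: 72, Y: 720}
	bottom := point{X: 72, Y: 100}

	// A point lower on the page has a smaller Y and therefore a larger depth.
	fmt.Println(depth(top, pageHeight) < depth(bottom, pageHeight))
	// Both points start at the same reading position.
	fmt.Println(reading(top) == reading(bottom))
}
```

Flipping Y into a depth value that grows down the page is one possible convention; it makes "earlier in the text" correspond to "smaller value" in both directions.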
HOW TEXT IS EXTRACTED
text_page.go
`makeTextPage` is the top-level function that builds the `textPara`s.
- A page's `textMark`s are obtained from its content stream. They are in the order in which they occur in the content stream.
- The `textMark`s are grouped into word fragments called `textWord`s by scanning through the `textMark`s and splitting on space characters and the gaps between marks.
- The `textWord`s are grouped into `textPara`s based on their bounding boxes' proximity to other `textWord`s.
- The `textWord`s in each `textPara` are arranged into `textLine`s (`textWord`s of similar depth).
- Within each `textLine`, `textWord`s are sorted in reading order and each one that starts a whole word is marked. See `textLine.text()`.
- `textPara.writeCellText()` shows how to extract the paragraph text from this arrangement.
- All the `textPara`s on a page are checked to see if they are arranged as cells within a table and, if they are, they are combined into `textTable`s and a `textPara` containing the `textTable` replaces the `textPara`s containing the cells.
- The `textPara`s, some of which may be tables, are sorted into reading order (the order in which they are read, not the reading direction).
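The second step, splitting marks into word fragments on spaces and gaps, can be sketched as follows. This is a simplified illustration and not the actual `makeTextWords()` code: the `mark` type, the `groupMarks` function and the `maxGap` threshold are hypothetical, and the sketch looks only at horizontal gaps on a single line.

```go
package main

import "fmt"

// mark is a hypothetical stand-in for a textMark: one glyph's text and its
// horizontal extent in PDF coordinates.
type mark struct {
	Text        string
	Left, Right float64
}

// groupMarks splits marks (given in content stream order) into word fragments.
// A new fragment starts after a space mark, or when the gap between
// consecutive marks exceeds maxGap. maxGap is an assumed tuning parameter.
func groupMarks(marks []mark, maxGap float64) []string {
	var words []string
	current := ""
	prevRight := 0.0
	flush := func() {
		if current != "" {
			words = append(words, current)
			current = ""
		}
	}
	for _, m := range marks {
		if m.Text == " " {
			flush() // a space always ends the current fragment
			continue
		}
		if current != "" && m.Left-prevRight > maxGap {
			flush() // a large gap also ends the current fragment
		}
		current += m.Text
		prevRight = m.Right
	}
	flush()
	return words
}

func main() {
	marks := []mark{
		{"H", 0, 5}, {"i", 5, 7}, {" ", 7, 9},
		{"a", 9, 14}, {"b", 20, 25}, // gap of 6 between "a" and "b"
	}
	fmt.Println(groupMarks(marks, 2))
}
```

Because a gap also splits fragments, "a" and "b" end up as separate fragments here even though no space mark sits between them.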
The entire order of extracted text from a page is expressed in `paraList.writeText()`, which
- Iterates through the `textPara`s, which are sorted in reading order.
- For each `textPara` with a table, iterates through the table cell `textPara`s.
- For each (top level or table cell) `textPara`, iterates through the `textLine`s.
- For each `textLine`, iterates through the `textWord`s, inserting a space before each one that has the `newWord` flag set.
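The nested iteration above can be sketched with simplified types. The `word`, `line` and `para` types and the `writeText`, `writePara` and `writeLine` functions below are hypothetical stand-ins, not the real `paraList.writeText()`. Two details are assumptions made for the sketch: each line ends with a newline, and no space is written before the first word of a line.

```go
package main

import (
	"fmt"
	"strings"
)

// word is a hypothetical stand-in for a textWord.
type word struct {
	Text    string
	NewWord bool // true if this fragment starts a whole word
}

// line is a hypothetical stand-in for a textLine.
type line struct{ Words []word }

// para is a hypothetical stand-in for a textPara.
type para struct {
	Lines []line
	Cells []para // non-empty when the paragraph holds a table
}

// writeLine writes the fragments of one line, inserting a space before each
// fragment that starts a whole word (except the first one on the line).
func writeLine(b *strings.Builder, l line) {
	for i, w := range l.Words {
		if i > 0 && w.NewWord {
			b.WriteByte(' ')
		}
		b.WriteString(w.Text)
	}
	b.WriteByte('\n') // assumed line separator
}

// writePara writes a paragraph; for a table it writes each cell in turn.
func writePara(b *strings.Builder, p para) {
	if len(p.Cells) > 0 {
		for _, c := range p.Cells {
			writePara(b, c)
		}
		return
	}
	for _, l := range p.Lines {
		writeLine(b, l)
	}
}

// writeText writes all paragraphs, which are assumed to be in reading order.
func writeText(paras []para) string {
	var b strings.Builder
	for _, p := range paras {
		writePara(&b, p)
	}
	return b.String()
}

func main() {
	cell := func(s string) para {
		return para{Lines: []line{{Words: []word{{s, true}}}}}
	}
	paras := []para{
		{Lines: []line{{Words: []word{{"Hel", true}, {"lo", false}, {"world", true}}}}},
		{Cells: []para{cell("A"), cell("B")}},
	}
	fmt.Print(writeText(paras))
}
```

The fragments "Hel" and "lo" are joined without a space because only "Hel" carries the flag, which is the purpose of marking whole-word starts.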
`textWord` creation
- `makeTextWords()` combines `textMark`s into `textWord`s (word fragments).
- `textWord`s are the atoms of the text extraction code.
`textPara` creation
- `dividePage()` combines `textWord`s that are close to each other into groups in rectangular regions called `wordBag`s.
- `wordBag.arrangeText()` arranges the `textWord`s in the rectangle into `textLine`s, groups of `textWord`s of about the same depth sorted left to right.
- `textLine.markWordBoundaries()` marks the `textWord`s in each `textLine` that start whole words.
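Grouping words of about the same depth into lines and sorting each line left to right can be sketched as below. This is an illustration under stated assumptions, not the actual `wordBag.arrangeText()` logic: the `word` type, the `arrange` function and the `tol` depth tolerance are hypothetical, and word boundary marking is not covered.

```go
package main

import (
	"fmt"
	"sort"
)

// word is a hypothetical stand-in for a textWord.
type word struct {
	Text  string
	X     float64 // position in the reading direction
	Depth float64 // position in the depth direction, increasing down the page
}

// arrange groups words into lines and sorts each line left to right.
// A word joins the current line if its depth is within tol of the
// shallowest word in that line. tol is an assumed tuning parameter.
func arrange(words []word, tol float64) [][]word {
	// Work on a copy so the caller's slice is left untouched.
	sorted := append([]word(nil), words...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Depth < sorted[j].Depth
	})

	var lines [][]word
	for _, w := range sorted {
		n := len(lines)
		if n > 0 && w.Depth-lines[n-1][0].Depth <= tol {
			lines[n-1] = append(lines[n-1], w)
			continue
		}
		lines = append(lines, []word{w})
	}

	// Sort each line in the reading direction.
	for _, l := range lines {
		sort.Slice(l, func(i, j int) bool { return l[i].X < l[j].X })
	}
	return lines
}

func main() {
	words := []word{
		{"world", 50, 100.5},
		{"Hello", 10, 100},
		{"Next", 10, 112},
	}
	for _, l := range arrange(words, 2) {
		for _, w := range l {
			fmt.Print(w.Text, " ")
		}
		fmt.Println()
	}
}
```

Comparing each word against the shallowest word in the line, rather than the previous word, keeps a line from drifting downward through a chain of small depth differences.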
TODO
- Remove serial code?
- Remove verbose* logging?
- Reinstate rotated text handling.
- Reinstate diacritic composition.
- Reinstate duplicate text removal.
- Reinstate creater_test.go extraction test.
- Come up with a better name for reading direction.