mirror of
https://github.com/unidoc/unipdf.git
synced 2025-05-01 22:17:29 +08:00

* Fixed filename:page in logging
* Got CMap working for multi-rune entries
* Treat CMap entries as strings instead of runes to handle multi-byte encodings.
* Added a test for multibyte encoding.
* First version of text extraction that recognizes columns
* Added an explanation of the text columns code to README.md.
* Fixed typos
* Abstracted textWord depth calculation. This required changing textMark to *textMark in a lot of code.
* Added function comments.
* Fixed text state save/restore.
* Adjusted inter-word search distance to make paragraph division work for thanh.pdf
* Got text_test.go passing.
* Reinstated hyphen suppression
* Handle more cases of fonts not being set in text extraction code.
* Fixed typo
* More verbose logging
* Adding tables to text extractor.
* Added tests for columns extraction.
* Removed commented code
* Check for textParas that are on the same line when writing out extracted text.
* Absorb text to the left of paras into paras e.g. Footnote numbers
* Removed funny character from text_test.go
* Commented out a creator_test.go test that was broken by my text extraction changes.
* Big changes to columns text extraction code for PR. Performance improvements in several places. Commented code.
* Updated extractor/README
* Cleaned up some comments and removed a panic
* Increased threshold for truncating extracted text when there is no license 100 -> 102. This is a workaround to let a test in creator_test.go pass. With the old text extraction code the following extracted text was 100 chars. With the new code it is 102 chars which looks correct. "你好\n你好你好你好你好\n河上白云\n\nUnlicensed UniDoc - Get a license on https://unidoc.io\n\n"
* Improved an error message.
* Removed irrelevant spaces
* Commented code and removed unused functions.
* Reverted PdfRectangle changes
* Added duplicate text detection.
* Combine diacritic textMarks in text extraction
* Reinstated a diacritic recombination test.
* Small code reorganisation
* Reinstated handling of rotated text
* Addressed issues in PR review
* Added color fields to TextMark
* Updated README
* Reinstated the disabled tests I missed before.
* Tightened definition for tables to prevent detection of tables where there weren't any.
* Compute line splitting search range based on fontsize of first word in word bag.
* Use errors.Is(err, core.ErrNotSupported) to distinguish unsupported font errors. See https://blog.golang.org/go1.13-errors
* Fixed some naming and added some comments.
* errors.Is -> xerrors.Is and %w -> %v for go 1.12 compatibility
* Removed code that doesn't ever get called.
* Removed unused test
89 lines
2.5 KiB
Go
/*
 * This file is subject to the terms and conditions defined in
 * file 'LICENSE.md', which is part of this source code package.
 */

package extractor

// The following constants configure debugging.
const (
	verbose         = false
	verboseGeom     = false
	verbosePage     = false
	verbosePara     = false
	verboseParaLine = verbosePara && false
	verboseParaWord = verboseParaLine && false
	verboseTable    = false
)

// The following constants control the approaches used in the code.
const (
	doHyphens           = true
	doRemoveDuplicates  = true
	doCombineDiacritics = true
	useEBBox            = false
)

// The following constants are the tuning parameters for text extraction.
const (
	// Change in angle of text in degrees that we treat as a different orientation.
	orientationGranularity = 10
	// Size of depth bins in points
	depthBinPoints = 6

	// Variation in line depth as a fraction of font size. +lineDepthR for subscripts, -lineDepthR for
	// superscripts
	lineDepthR = 0.5

	// All constants that end in R are relative to font size.

	maxWordAdvanceR = 0.11

	maxKerningR = 0.19
	maxLeadingR = 0.04

	// Max difference in font sizes allowed within a word.
	maxIntraWordFontTolR = 0.04

	// Maximum gap between a word and a para in the depth direction for which we pull the word
	// into the para, as a fraction of the font size.
	maxIntraDepthGapR = 1.0
	// Max difference in font size for word and para for the above case
	maxIntraDepthFontTolR = 0.04

	// Maximum gap between a word and a para in the reading direction for which we pull the word
	// into the para.
	maxIntraReadingGapR = 0.4
	// Max difference in font size for word and para for the above case
	maxIntraReadingFontTol = 0.7

	// Minimum spacing between paras in the reading direction.
	minInterReadingGapR = 1.0
	// Max difference in font size for word and para for the above case
	minInterReadingFontTol = 0.1

	// Maximum inter-word spacing.
	maxIntraWordGapR = 1.4

	// Maximum overlap between characters allowed within a line
	maxIntraLineOverlapR = 0.46

	// Maximum spacing between characters within a line.
	maxIntraLineGapR = 0.02

	// Maximum difference in coordinates of duplicated textWords.
	maxDuplicateWordR = 0.2

	// Maximum distance from a character to its diacritic marks as a fraction of the character size.
	diacriticRadiusR = 0.5

	// Minimum number of runes in the first half of a hyphenated word
	minHyphenation = 4

	// The distance we look down from the top of a wordBag for the leftmost word.
	topWordRangeR = 4.0

	// Minimum number of cells in a textTable
	minTableParas = 6
)
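
The comments above say that constants ending in R are relative to font size, that orientations are distinguished at a granularity of 10 degrees, and that depths are binned in 6-point steps. As a rough illustration of how such parameters might be applied, here is a minimal standalone sketch. The helper names `scaled`, `withinLineGap`, `orientationBucket` and `depthBin` are hypothetical and are not taken from the extractor package, whose actual logic may differ; only the three constant values are copied from the file.

```go
package main

import (
	"fmt"
	"math"
)

// Values copied from the constants file above. Everything else in this
// sketch is a hypothetical illustration, not the extractor's real code.
const (
	orientationGranularity = 10
	depthBinPoints         = 6
	maxIntraLineGapR       = 0.02
)

// scaled converts a font-size-relative constant (one ending in R) into an
// absolute distance in points for the given font size.
func scaled(ratio, fontsize float64) float64 {
	return ratio * fontsize
}

// withinLineGap reports whether a gap in points between two characters is
// no larger than maxIntraLineGapR scaled by the font size.
func withinLineGap(gap, fontsize float64) bool {
	return gap <= scaled(maxIntraLineGapR, fontsize)
}

// orientationBucket rounds an angle in degrees to the nearest multiple of
// orientationGranularity and normalizes the result to the range [0, 360).
func orientationBucket(degrees float64) int {
	n := int(math.Round(degrees/orientationGranularity)) * orientationGranularity
	return ((n % 360) + 360) % 360
}

// depthBin returns the index of the depthBinPoints-high bin that contains
// the given depth in points.
func depthBin(depth float64) int {
	return int(math.Floor(depth / depthBinPoints))
}

func main() {
	fmt.Println(withinLineGap(0.2, 12)) // true: the threshold is 0.02 * 12 = 0.24 points
	fmt.Println(orientationBucket(93))  // 90
	fmt.Println(depthBin(13.5))         // 2
}
```

Expressing thresholds as fractions of the font size lets the same tuning values apply to both small and large text, which is presumably why most of the parameters carry the R suffix.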