13 Commits

Author SHA1 Message Date
Peter Williams
5933a3dd81 Added duplicate text detection. 2020-06-23 15:33:34 +10:00
Peter Williams
acb5caaf6c Big changes to columns text extraction code for PR. Performance improvements in several places. Commented code. 2020-06-22 17:49:19 +10:00
Peter Williams
b4d90b6402 Absorb text to the left of paras into paras e.g. Footnote numbers 2020-06-05 21:43:09 +10:00
Peter Williams
30fc953954 Check for textParas that are on the same line when writing out extracted text. 2020-06-05 15:44:31 +10:00
Peter Williams
af9508cc5c Added tests for columns extraction. 2020-06-05 14:01:31 +10:00
Peter Williams
29f2d9b8cf Merge branch 'development' of https://github.com/unidoc/unipdf into columns 2020-06-05 11:43:04 +10:00
Peter Williams
40806d7f96 Adding tables to text extractor. 2020-06-01 14:04:32 +10:00
Peter Williams
49bbef0442 More verbose logging 2020-05-29 08:58:23 +10:00
Peter Williams
418f859d44 Reinstated hyphen suppression 2020-05-27 21:11:47 +10:00
Peter Williams
d21e2f83c4 Got text_test.go passing. 2020-05-27 18:15:18 +10:00
Peter Williams
fad1552009 Fixed text state save/restore. 2020-05-26 13:26:09 +10:00
Peter Williams
c515472849 Abstracted textWord depth calculation. This required change textMark to *textMark in a lot of code. 2020-05-25 09:39:30 +10:00
Peter Williams
6b13a99b82 First version of text extraction that recognizes columns 2020-05-24 21:00:37 +10:00