1738 Commits

Author SHA1 Message Date
Peter Williams
5933a3dd81 Added duplicate text detection. 2020-06-23 15:33:34 +10:00
Peter Williams
e65fb041e5 Reverted PdfRectangle changes 2020-06-23 14:18:58 +10:00
Peter Williams
17bee4d907 Commented code and removed unused functions. 2020-06-23 11:39:01 +10:00
Peter Williams
1c54e01d83 Removed irrelevant spaces 2020-06-23 09:43:02 +10:00
Peter Williams
09ebbcf577 Improved an error message. 2020-06-23 09:33:09 +10:00
Peter Williams
72155a07dc Increased threshold for truncating extracted text when there is no license 100 -> 102.
This is a workaround to let a test in creator_test.go pass.

With the old text extraction code the following extracted text was 100 chars. With the new code it
is 102 chars which looks correct.

"你好\n你好你好你好你好\n河上白云\n\nUnlicensed UniDoc - Get a license on https://unidoc.io\n\n"
2020-06-23 08:59:54 +10:00
Peter Williams
91479a7c2b Cleaned up some comments and removed a panic 2020-06-22 21:17:39 +10:00
Peter Williams
80b54ef1de Updated extractor/README 2020-06-22 17:56:32 +10:00
Peter Williams
acb5caaf6c Big changes to columns text extraction code for PR.
Performance improvements in several places.
Commented code.
2020-06-22 17:49:19 +10:00
Peter Williams
5d7e4aad51 Commented out a creator_test.go test that was broken by my text extraction changes. 2020-06-22 17:36:42 +10:00
Peter Williams
a7779a34d8 Merge branch 'development' of https://github.com/unidoc/unipdf into columns 2020-06-22 16:37:06 +10:00
Adrian-George Bostan
7bf2f62c3b
Skip referenced pages which are not present in the catalog (#377)
* Skip referenced pages which are not present in the catalog

* Improve documentation for the copyObject method of the writer

* Add creator test case for checking referenced page destinations
2020-06-18 15:06:06 +00:00
Gunnsteinn Hall
ae20c30ae4
Merge pull request #376 from gunnsth/dev-merge-master
Merge master into development
2020-06-16 22:08:23 +00:00
Gunnsteinn Hall
1b1158ed94 Merge remote-tracking branch 'upstream/master' into dev-merge-master 2020-06-16 21:45:48 +00:00
Gunnsteinn Hall
dbd2364470 Merge branch 'development' of https://github.com/unidoc/unipdf into development 2020-06-16 21:19:49 +00:00
Gunnsteinn Hall
11f692bc3a
Font subsetting and font optimization improvements (#362)
* Track runes in IdentityEncoder (for subsetting), track decoded runes

* Working with the identity encoder in font_composite.go

* Add GetFilterArray to multi encoder.  Add comments.

* Add NewFromContents constructor to extractor only requiring contents and resources

* golint fixes

* Optimizer compress streams - improved detection of raw streams

* Optimize - CleanContentStream optimizer that removes redundant operands

* WIP Optimize - clean fonts

Will support both font file reduction and subsetting. (WIP)

* Optimize - image processing - try combined DCT and Flate

* Update options.go

* Update optimizer.go

* Create utils.go for optimize with common methods needed for optimization

* Optimizer - add font subsetting method

Covers XObject Forms, annotations etc. Uses extractor package to extract text marks covering what fonts and glyphs are used. Package truetype used for subsetting.

* Add some comments

* Fix cmap parsing rune conversion

* Error checking for extractor.  Add some comments.

* Update Jenkinsfile

* Update modules
2020-06-16 21:19:10 +00:00
Gunnsteinn Hall
8ab0b6ff45
Merge pull request #372 from gunnsth/release/v3.8.0
Prepare unipdf release v3.8.0
v3.8.0
2020-06-16 08:35:52 +00:00
Gunnsteinn Hall
deb563b581 Prepare release v3.8.0 2020-06-15 20:17:12 +00:00
Gunnsteinn Hall
c7c50ffc37 Merge remote-tracking branch 'upstream/master' into release/v3.8.0 2020-06-15 20:16:22 +00:00
Gunnsteinn Hall
9e5a17eace Merge branch 'development' of https://github.com/unidoc/unipdf into development 2020-06-15 20:15:52 +00:00
Peter Williams
e6be02163c Merge branch 'development' of https://github.com/unidoc/unipdf into columns 2020-06-15 10:42:21 +10:00
Peter Williams
975e03811f Removed funny character from text_test.go 2020-06-15 10:41:49 +10:00
Adrian-George Bostan
99ef1b861d
Combo field appearance (#370)
* Fix combo field appearances not being shown

* Fix V object type for choice and button fields

* Refactor form fill for combo and checkbox fields

* Add fill test case for text, combo and checkbox fields

* Prevent panic when flattening forms using a nil appearance generator
2020-06-10 16:58:00 +00:00
Adrian-George Bostan
6cb58f6327
Add configurable font fallback options for form fields (#368)
* Add configurable fallback font support for form fill/flatten

* Add appearance font to AcroForm DR

* Refactor DA process method

* Remove unnecessary font default size variable

* Minor refactor in the appearance generation functions

* Improve processDA appearance style method

* Use original font container if present in DR

* Maintain original appearance font autosizing behavior
2020-06-09 15:16:54 +00:00
Adrian-George Bostan
6b8d5c42f7
Fix outline null object check (#367) 2020-06-05 11:46:55 +00:00
Peter Williams
b4d90b6402 Absorb text to the left of paras into paras e.g. Footnote numbers 2020-06-05 21:43:09 +10:00
Peter Williams
30fc953954 Check for textParas that are on the same line when writing out extracted text. 2020-06-05 15:44:31 +10:00
Peter Williams
16b3c1c450 Removed commented code 2020-06-05 14:21:53 +10:00
Peter Williams
af9508cc5c Added tests for columns extraction. 2020-06-05 14:01:31 +10:00
Peter Williams
29f2d9b8cf Merge branch 'development' of https://github.com/unidoc/unipdf into columns 2020-06-05 11:43:04 +10:00
Peter Williams
5777ee1394
Handle multibyte entries in CMaps. (#353)
* Fixed filename:page in logging

* Got CMap working for multi-rune entries

* Treat CMap entries as strings instead of runes to handle multi-byte encodings.

* Added a test for multibyte encoding.

* Changed rune->CharCode maps to string->CharCode.

* Removed unintentional changes.

* Updated comments to match new function definitions.

* Changed some []rune APIs to string

* Fixes for reviewer comments.
2020-06-03 13:55:15 +00:00
Peter Williams
40806d7f96 Adding tables to text extractor. 2020-06-01 14:04:32 +10:00
Gunnsteinn Hall
4508e17036
Merge pull request #364 from adrg/flatten-text-field-rotation
Account for rotation when generating flattened text field appearances
2020-05-29 17:35:58 +00:00
Adrian-George Bostan
d6e1cb5761 Account for rotation when generating flattened text field appearances 2020-05-29 17:49:00 +03:00
Peter Williams
49bbef0442 More verbose logging 2020-05-29 08:58:23 +10:00
Peter Williams
a14d8e73d8 Fixed typo 2020-05-28 12:10:49 +10:00
Peter Williams
2260e245f7 Handle more cases of fonts not being set in text extraction code. 2020-05-28 12:08:15 +10:00
Peter Williams
418f859d44 Reinstated hyphen suppression 2020-05-27 21:11:47 +10:00
Peter Williams
d21e2f83c4 Got text_test.go passing. 2020-05-27 18:15:18 +10:00
Peter Williams
6b4314f97c Adjusted inter-word search distance to make paragraph division work for thanh.pdf 2020-05-26 18:53:23 +10:00
Gunnsteinn Hall
f99c0cd58f
Merge pull request #363 from gunnsth/release/v3.7.1
Prepare unipdf release v3.7.1
v3.7.1
2020-05-26 08:32:22 +00:00
Peter Williams
fad1552009 Fixed text state save/restore. 2020-05-26 13:26:09 +10:00
Gunnsteinn Hall
4b80c3bff1 Update version.go 2020-05-25 23:35:47 +00:00
Gunnsteinn Hall
81588f196e Merge remote-tracking branch 'upstream/development' into release/v3.7.1 2020-05-25 23:35:14 +00:00
Adrian-George Bostan
d078608da4
Account for parent CTM when calculating positions of extracted forms (#349)
* Take parent CTM into account for form field text

* Pass a modified graphics state instance to new text objects
2020-05-25 23:34:44 +00:00
Gunnsteinn Hall
e8d29245a2 Prepare release v3.7.1 2020-05-25 23:07:17 +00:00
Gunnsteinn Hall
f7215be3eb Merge remote-tracking branch 'upstream/master' into release/v3.7.1 2020-05-25 23:04:37 +00:00
Gunnsteinn Hall
ef7c2e6b5b Merge branch 'development' of https://github.com/unidoc/unipdf into development 2020-05-25 22:54:19 +00:00
Peter Williams
603b5ff4e7 Added function comments. 2020-05-25 14:00:00 +10:00
Peter Williams
c515472849 Abstracted textWord depth calculation. This required changing textMark to *textMark in a lot of code. 2020-05-25 09:39:30 +10:00