82 Commits

Author SHA1 Message Date
UniDoc Build
79e32364de prepare release 2020-11-11 18:48:37 +00:00
UniDoc Build
22540b937c prepare release 2020-10-19 10:58:10 +00:00
UniDoc Build
56a210342e prepare release 2020-10-12 14:17:59 +00:00
UniDoc Build
87cbc66cbd prepare release 2020-10-05 19:28:24 +00:00
UniDoc Build
22ca2c0eed prepare release 2020-09-28 23:18:17 +00:00
UniDoc Build
9107a86674 prepare release 2020-09-21 01:20:10 +00:00
UniDoc Build
b991a36456 prepare release 2020-09-14 09:32:45 +00:00
UniDoc Build
fd3b669a36 prepare release 2020-09-07 00:23:12 +00:00
UniDoc Build
61b6580cb9 prepare release 2020-08-31 21:12:07 +00:00
UniDoc Build
1501d07a74 prepare release 2020-08-27 21:45:09 +00:00
Peter Williams
88fda44e0a
Text extraction code for columns. (#366)
* Fixed filename:page in logging

* Got CMap working for multi-rune entries

* Treat CMap entries as strings instead of runes to handle multi-byte encodings.

* Added a test for multibyte encoding.

* First version of text extraction that recognizes columns

* Added an explanation of the text columns code to README.md.

* fixed typos

* Abstracted textWord depth calculation. This required changing textMark to *textMark in a lot of code.

* Added function comments.

* Fixed text state save/restore.

* Adjusted inter-word search distance to make paragraph division work for thanh.pdf

* Got text_test.go passing.

* Reinstated hyphen suppression

* Handle more cases of fonts not being set in text extraction code.

* Fixed typo

* More verbose logging

* Adding tables to text extractor.

* Added tests for columns extraction.

* Removed commented code

* Check for textParas that are on the same line when writing out extracted text.

* Absorb text to the left of paras into paras e.g. Footnote numbers

* Removed funny character from text_test.go

* Commented out a creator_test.go test that was broken by my text extraction changes.

* Big changes to columns text extraction code for PR.

Performance improvements in several places.
Commented code.

* Updated extractor/README

* Cleaned up some comments and removed a panic

* Increased threshold for truncating extracted text when there is no license 100 -> 102.

This is a workaround to let a test in creator_test.go pass.

With the old text extraction code the following extracted text was 100 chars. With the new code it
is 102 chars which looks correct.

"你好\n你好你好你好你好\n河上白云\n\nUnlicensed UniDoc - Get a license on https://unidoc.io\n\n"

* Improved an error message.

* Removed irrelevant spaces

* Commented code and removed unused functions.

* Reverted PdfRectangle changes

* Added duplicate text detection.

* Combine diacritic textMarks in text extraction

* Reinstated a diacritic recombination test.

* Small code reorganisation

* Reinstated handling of rotated text

* Addressed issues in PR review

* Added color fields to TextMark

* Updated README

* Reinstated the disabled tests I missed before.

* Tightened definition for tables to prevent detection of tables where there weren't any.

* Compute line splitting search range based on fontsize of first word in word bag.

* Use errors.Is(err, core.ErrNotSupported) to distinguish unsupported font errors.

See https://blog.golang.org/go1.13-errors

* Fixed some naming and added some comments.

* errors.Is -> xerrors.Is and %w -> %v for go 1.12 compatibility

* Removed code that doesn't ever get called.

* Removed unused test
2020-06-30 19:33:10 +00:00
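The errors.Is bullet in the commit above relies on sentinel-error matching through wrapped errors. A minimal sketch of that pattern follows, using only the standard library (Go 1.13+). `ErrNotSupported` here is a local stand-in for `core.ErrNotSupported`, and `loadFont` is a hypothetical function, not part of the library. A later bullet in the same commit switched to `xerrors.Is` for Go 1.12 compatibility, which this sketch does not cover.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotSupported is a hypothetical sentinel standing in for core.ErrNotSupported.
var ErrNotSupported = errors.New("feature not currently supported")

// loadFont is a hypothetical loader. For anything but TrueType it returns
// the sentinel wrapped with extra context via the %w verb.
func loadFont(subtype string) error {
	if subtype != "TrueType" {
		return fmt.Errorf("font subtype %q: %w", subtype, ErrNotSupported)
	}
	return nil
}

// isUnsupported reports whether err is, or wraps, ErrNotSupported.
func isUnsupported(err error) bool {
	return errors.Is(err, ErrNotSupported)
}

func main() {
	err := loadFont("Type3")
	fmt.Println(isUnsupported(err))     // errors.Is sees through the wrapping
	fmt.Println(err == ErrNotSupported) // a plain comparison does not
	fmt.Println(isUnsupported(loadFont("TrueType")))
}
```

The point of the pattern is that callers can add context to an error without breaking the check that decides whether a font is merely unsupported or genuinely broken.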
Adrian-George Bostan
54e965785b
Add cached Stream method for CMap objects (#382)
* Add cached Stream method for CMaps

* Use CMap Stream method when creating font PDF dictionary objects
2020-06-27 00:30:18 +00:00
Adrian-George Bostan
7bf2f62c3b
Skip referenced pages which are not present in the catalog (#377)
* Skip referenced pages which are not present in the catalog

* Improve documentation for the copyObject method of the writer

* Add creator test case for checking referenced page destinations
2020-06-18 15:06:06 +00:00
Gunnsteinn Hall
11f692bc3a
Font subsetting and font optimization improvements (#362)
* Track runes in IdentityEncoder (for subsetting), track decoded runes

* Working with the identity encoder in font_composite.go

* Add GetFilterArray to multi encoder.  Add comments.

* Add NewFromContents constructor to extractor only requiring contents and resources

* golint fixes

* Optimizer compress streams - improved detection of raw streams

* Optimize - CleanContentStream optimizer that removes redundant operands

* WIP Optimize - clean fonts

Will support both font file reduction and subsetting. (WIP)

* Optimize - image processing - try combined DCT and Flate

* Update options.go

* Update optimizer.go

* Create utils.go for optimize with common methods needed for optimization

* Optimizer - add font subsetting method

Covers XObject Forms, annotations etc. Uses extractor package to extract text marks covering what fonts and glyphs are used. Package truetype used for subsetting.

* Add some comments

* Fix cmap parsing rune conversion

* Error checking for extractor.  Add some comments.

* Update Jenkinsfile

* Update modules
2020-06-16 21:19:10 +00:00
Adrian-George Bostan
99ef1b861d
Combo field appearance (#370)
* Fix combo field appearances not being shown

* Fix V object type for choice and button fields

* Refactor form fill for combo and checkbox fields

* Add fill test case for text, combo and checkbox fields

* Prevent panic when flattening forms using a nil appearance generator
2020-06-10 16:58:00 +00:00
Adrian-George Bostan
6b8d5c42f7
Fix outline null object check (#367) 2020-06-05 11:46:55 +00:00
Peter Williams
5777ee1394
Handle multibyte entries in CMaps. (#353)
* Fixed filename:page in logging

* Got CMap working for multi-rune entries

* Treat CMap entries as strings instead of runes to handle multi-byte encodings.

* Added a test for multibyte encoding.

* Changed rune->CharCode maps to string->CharCode.

* Removed unintentional changes.

* Updated comments to match new function definitions.

* Changed some []rune APIs to string

* Fixes for reviewer comments.
2020-06-03 13:55:15 +00:00
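The move from rune-keyed to string-keyed maps in the commit above matters when a single character code maps to more than one rune, for example a ligature. The sketch below illustrates the idea only; `CharCode` and `buildReverse` are hypothetical names and do not reproduce the library's types or signatures.

```go
package main

import "fmt"

// CharCode is a hypothetical character code type used for illustration.
type CharCode uint32

// buildReverse inverts a code-to-text table into a text-to-code table.
// Keying by string rather than rune lets multi-rune entries survive.
func buildReverse(codeToText map[CharCode]string) map[string]CharCode {
	rev := make(map[string]CharCode, len(codeToText))
	for code, text := range codeToText {
		rev[text] = code
	}
	return rev
}

func main() {
	codeToText := map[CharCode]string{
		0x41: "A",
		0x0C: "ffi", // one code that expands to three runes
	}
	rev := buildReverse(codeToText)
	fmt.Println(rev["ffi"], len([]rune("ffi")))
}
```

With a `map[rune]CharCode` the "ffi" entry would have to be dropped or truncated to its first rune, which is the limitation the string-based maps avoid.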
Adrian-George Bostan
5efaa02e23
Use page indirect object for internal outline destinations (#359)
* Use page indirect object for internal outlines

* Use page indirect object in creator outline destinations

* Adapt creator test case to test outline creation and retrieval
2020-05-22 16:19:43 +00:00
Adrian-George Bostan
d2941b5477
Add reader method for checking if the AcroForm needs repair (#356)
* Add AcroFormNeedsRepair method

* Add AcroForm repair check test case
2020-05-20 16:04:02 +00:00
Adrian-George Bostan
80d51c5532
Add reader AcroForm repair functionality (#351)
* Add method for retrieving widget parent form field

* Add reader method for repairing AcroForm

* Add AcroForm repair test case

* Add AcroForm repair options

* RepairAcroForm documentation improvements
2020-05-19 12:42:07 +00:00
Gunnsteinn Hall
ad2a1e9c9d
Subsetting fixes (#346)
* Update unitype lib which improves subsetting

* Add text extraction check to creator font subsetting example

Helps ensure ToUnicode map is set correctly.

* Clean up import

* Fix spelling
2020-05-12 07:15:09 +00:00
Adrian-George Bostan
aef6e5e976
Fix CMap generation and serialization for composite fonts (#344)
* Fix CMap charcode mapping serialization

* Improve CMap generation in the NewCompositePdfFontFromTTF function
2020-05-08 00:15:09 +00:00
Gunnsteinn Hall
9ef2f27694
Support for subsetting fonts (#335)
* Subsetting of TrueType CID fonts using unitype

* Simplify call to SubsetRegistered so can be done right after loading font via creator finalizer

* Add an EnableFontSubsetting function on the creator to simplify font subsetting for creator users
2020-05-05 00:17:27 +00:00
Adrian-George Bostan
d84d0c4375
Form fill fixes (#328)
* Parse form fields with embedded widget annotations

* Try matching fields both by partial and full names on form fill

* Use default font if widget font is not found when generating appearance

* Add JSON extract and fill test case
2020-04-24 16:48:06 +00:00
Adrian-George Bostan
cb0166e96b
Add low level PageLabels support (#325)
* Add reader method for retrieving the PageLabels entry from the catalog
* Add writer method for setting the PageLabels entry in the catalog.
* Add creator method for adding page labels for the output file
* Add creator page labels test case
* Minor page labels test case correction
2020-04-22 21:17:33 +00:00
Alexey Pavlyukov
a69d788171
Add timestamp signature handler (#301)
* Add timestamp signature handler

* Add timestamp signature handler test

* fix PR issues

* fix PR issues

* fix PR issues

* Fix

Co-authored-by: Gunnsteinn Hall <gunnsteinn.hall@gmail.com>
2020-04-22 20:21:53 +00:00
Alfred Hall
bc5c0d95d3
Merge pull request #320 from gunnsth/dev-writer-error-handling
Fix error handling in Writer
2020-04-18 17:13:40 +00:00
Gunnsteinn Hall
6308fc8014 Fix error handling in write, with a testcase. 2020-04-18 13:48:44 +00:00
Gunnsteinn Hall
fa5f13501b Fixes 2020-04-18 11:12:26 +00:00
Gunnsteinn Hall
d23d4b8c79 Add NewCompositePdfFontFromTTF to load composite TTF from memory 2020-04-18 10:37:10 +00:00
Adrian-George Bostan
a351532cd3
Prevent Type 0 function evaluation crash (#309) 2020-04-15 21:05:20 +00:00
Adrian-George Bostan
ff79a9b1bd
Prevent recursion when building invalid outline tree (#308) 2020-04-15 19:33:36 +00:00
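One common way to guard a tree walk against the kind of invalid, self-referencing outline this commit title describes is a visited set. The sketch below is an assumption about the general technique, not the code from the commit; `outlineNode` and `collectTitles` are hypothetical.

```go
package main

import "fmt"

// outlineNode is a hypothetical outline item with a first-child link
// and a next-sibling link, loosely mirroring a PDF outline entry.
type outlineNode struct {
	title string
	first *outlineNode
	next  *outlineNode
}

// collectTitles walks the outline depth-first and stops following any
// link that leads back to a node it has already visited.
func collectTitles(root *outlineNode) []string {
	visited := map[*outlineNode]bool{}
	var titles []string
	var walk func(n *outlineNode)
	walk = func(n *outlineNode) {
		for n != nil && !visited[n] {
			visited[n] = true
			titles = append(titles, n.title)
			walk(n.first)
			n = n.next
		}
	}
	walk(root)
	return titles
}

func main() {
	a := &outlineNode{title: "A"}
	b := &outlineNode{title: "B"}
	a.next = b
	b.next = a // invalid: sibling chain loops back to the start
	fmt.Println(collectTitles(a))
}
```

Without the visited check, the loop in the sibling chain would never terminate.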
Adrian-George Bostan
d605803bd2
Prevent panics (#305)
* Remove panic on font nil Differences array

* Remove unused bcmaps function

* Remove panics from the core/security/crypt package

* Fix extractor invalid Do operand crash

* Fix TTF parser crash for invalid hhea number of hMetrics

* Remove ECB crypt panics

* Remove standard_r6 panics

* Remove panic from render package
2020-04-14 21:09:16 +00:00
Jacek Kucharczyk
ad0b31ea1b
Optimizer fix for the CCITTFax Encoder. ISS #243. Fixes JBIG2 i386 architecture compile issue. (#297)
* Fixed issue #243. Added optimize integration tests.

* Minor style change.

* XObjImage getParamsDict updates Columns and Rows.

* Added doc file for the optimize/tests package.

* UpdateParams for CCITTFax Encoder accepts Width and Height also. Removed GetParamsDict Columns and Rows parameters from model.Image and model.XObjImage.

* Fix i386 issue for the jbig2 arithmetic encoder.

* Added 386 architecture to the .travis/cross_build.sh
2020-04-08 11:11:49 +00:00
Jacek Kucharczyk
29efa30439
JBIG2 Encoder support for inserting binary images into PDF (#288)
* Added JBIG2 PDF support
* Added JBIG2 Encoder binary image requirements
* PR #288 revision r1 fixes
* PR #288 revision r2 fixes
2020-04-03 20:54:59 +00:00
Adrian-George Bostan
64a43b38d2
Prevent crashing when processing content stream (#291)
* Skip invalid pop operation on empty graphics state stacks
* Fix clipping input values to size for Type 0 Functions
* Do not pass invalid Q content stream operator to external handlers
2020-04-01 20:08:41 +00:00
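The first bullet of the commit above concerns an unbalanced Q (restore graphics state) operator arriving when nothing has been saved with q. A minimal sketch of a guarded pop follows; `graphicsState`, `stateStack` and `process` are hypothetical and far simpler than a real content stream processor.

```go
package main

import "fmt"

// graphicsState is a hypothetical, heavily reduced graphics state.
type graphicsState struct{ lineWidth float64 }

// stateStack is a simple LIFO of saved graphics states.
type stateStack []graphicsState

func (s *stateStack) push(gs graphicsState) { *s = append(*s, gs) }

// pop returns the top state and true, or a zero state and false when
// the stack is empty, instead of indexing out of range.
func (s *stateStack) pop() (graphicsState, bool) {
	if len(*s) == 0 {
		return graphicsState{}, false
	}
	top := (*s)[len(*s)-1]
	*s = (*s)[:len(*s)-1]
	return top, true
}

// process handles only the q and Q operators and returns how many
// unbalanced Q operators were skipped.
func process(ops []string) (skipped int) {
	var stack stateStack
	current := graphicsState{lineWidth: 1}
	for _, op := range ops {
		switch op {
		case "q":
			stack.push(current)
		case "Q":
			restored, ok := stack.pop()
			if !ok {
				skipped++ // invalid pop: keep the current state and move on
				continue
			}
			current = restored
		}
	}
	return skipped
}

func main() {
	fmt.Println(process([]string{"q", "Q", "Q"}))
}
```

Skipping the stray operator keeps processing going on malformed files rather than crashing on an empty stack.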
Adrian-George Bostan
edba514087 Use NRGBA when loading model.Image instances from Go images 2020-03-26 21:47:00 +02:00
Adrian-George Bostan
1d46fb4cc6
Parse ttf encoding subtable 31 after subtable 10 (#273) 2020-03-07 13:08:30 +00:00
Gunnsteinn Hall
937669cfed
Add basic glyph metrics support for Type 0 CID fonts (#272)
* Add basic glyph metrics support for Type 0 CID fonts
* Initialize font widths map if no W array is present
2020-03-05 18:47:16 +00:00
Peter Williams
e056c0e4d4
Fixed PdfColorspaceSpecialIndexed.ImageToRGB() (#259)
* Fixed PdfColorspaceSpecialIndexed.ImageToRGB() Fixes https://github.com/unidoc/unipdf/issues/258
* Fixed indexed colorspace bounds checking.
* Being super cautious to prevent a divide by zero error. I don't think the base cs can have <1 cpts.
* Updated image hash in extract_images_test.go to match new indexed colorspace code.
* add testfile from unipdf#258
2020-02-26 13:26:20 +00:00
Adrian-George Bostan
9de5fe644e
Add PdfFont text encoding methods (#257)
* Add PdfFont method for encoding runes to charcode bytes
* Add getter method for CMap nbits
* Take CMap nbits into account when encoding text
* Adapt font test cases to include text encoding testing
2020-02-17 22:54:20 +00:00
Adrian-George Bostan
e2b3c6e6ba
Add predefined CMaps for Type 0 composite fonts (#246)
* Add packed predefined cmaps
* Add cmap cid range parsing
* Load base cmap for predefined cmaps
* Refactor pdfFont to Unicode methods
* Preserve CharcodeBytesToUnicode behavior
* Add support for CID-keyed Type 0 fonts
* Add method documentation for the cmap package
* Refactor and document charcode to Unicode conversion code
* Add more cmap parsing test cases
* Add more method documentation in the cmap package.
* Remove unused code from the bcmaps package
* Improve cmap test case
* Assume identity when encoder is missing on regenerating field appearance
* Add missing encoder log message
* Add inverse CMap mappings
* Add CMap encoder
* Address golint notices and small fix in the cmap package
* Keep smaller charcodes when generating cmap inverse mappings
* Update extractor test case
* Keep latest supplement charcodes/CIDs when computing inverse mappings
* Fix comment typo
2020-02-07 19:56:30 +00:00
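The bullet about keeping smaller charcodes when generating inverse mappings addresses a real ambiguity: several character codes can map to the same CID, so the inverse needs a tie-break rule. The sketch below shows one such rule under stated assumptions; `CharCode`, `CID` and `invert` are hypothetical names, and the later bullet about preferring the latest supplement is not modelled.

```go
package main

import "fmt"

// CharCode and CID are hypothetical types used for illustration.
type CharCode uint32
type CID uint32

// invert builds a CID-to-charcode table. When several charcodes share
// a CID, the numerically smallest charcode is kept, so the result does
// not depend on map iteration order.
func invert(codeToCID map[CharCode]CID) map[CID]CharCode {
	inv := make(map[CID]CharCode, len(codeToCID))
	for code, cid := range codeToCID {
		if prev, ok := inv[cid]; !ok || code < prev {
			inv[cid] = code
		}
	}
	return inv
}

func main() {
	inv := invert(map[CharCode]CID{0x20: 1, 0xA0: 1, 0x41: 34})
	fmt.Println(inv[1], inv[34])
}
```

A deterministic tie-break matters in Go in particular, because ranging over a map visits keys in an unspecified order.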
Gunnsteinn Hall
81e3e14eb9
Merge pull request #242 from unidoc/master
Master into development
2020-01-30 22:24:56 +00:00
Adrian-George Bostan
3bd083475d Minor refactoring 2020-01-21 22:18:11 +02:00
Adrian-George Bostan
692ead8496 Improve outline destination parsing 2020-01-21 22:11:20 +02:00
Samuel Stauffer
d3a160ba41 Follow object indirections in PdfPage.GetMediaBox 2020-01-17 14:35:46 -08:00
Adrian-George Bostan
7c5d52cca5 Add outline test case 2020-01-16 20:52:59 +02:00
Adrian-George Bostan
029d4c34d8 Rename NewOutlineFromReaderOutline to GetOutlines and move it in the reader 2020-01-16 19:51:34 +02:00
Adrian-George Bostan
84dd2d145a Add ToOutlineTree method for outline conversion 2020-01-16 19:31:54 +02:00
Adrian-George Bostan
dbd9e96abc Fix method comment typo 2020-01-15 23:36:07 +02:00