29 Commits

Author SHA1 Message Date
Gunnsteinn Hall
11f692bc3a
Font subsetting and font optimization improvements (#362)
* Track runes in IdentityEncoder (for subsetting), track decoded runes

* Working with the identity encoder in font_composite.go

* Add GetFilterArray to multi encoder.  Add comments.

* Add NewFromContents constructor to extractor only requiring contents and resources

* golint fixes

* Optimizer compress streams - improved detection of raw streams

* Optimize - CleanContentStream optimizer that removes redundant operands

* WIP Optimize - clean fonts

Will support both font file reduction and subsetting. (WIP)

* Optimize - image processing - try combined DCT and Flate

* Update options.go

* Update optimizer.go

* Create utils.go for optimize with common methods needed for optimization

* Optimizer - add font subsetting method

Covers XObject Forms, annotations etc. Uses extractor package to extract text marks covering what fonts and glyphs are used. Package truetype used for subsetting.

* Add some comments

* Fix cmap parsing rune conversion

* Error checking for extractor.  Add some comments.

* Update Jenkinsfile

* Update modules
2020-06-16 21:19:10 +00:00
Peter Williams
5777ee1394
Handle multibyte entries in CMaps. (#353)
* Fixed filename:page in logging

* Got CMap working for multi-rune entries

* Treat CMap entries as strings instead of runes to handle multi-byte encodings.

* Added a test for multibyte encoding.

* Changed rune->CharCode maps to string->CharCode.

* Removed unintentional changes.

* Updated comments to match new function definitions.

* Changed some []rune APIs to string

* Fixes for reviewer comments.
2020-06-03 13:55:15 +00:00
Gunnsteinn Hall
ad2a1e9c9d
Subsetting fixes (#346)
* Update unitype lib which improves subsetting

* Add text extraction check to creator font subsetting example

Helps ensure ToUnicode map is set correctly.

* Clean up import

* Fix spelling
2020-05-12 07:15:09 +00:00
Adrian-George Bostan
aef6e5e976
Fix CMap generation and serialization for composite fonts (#344)
* Fix CMap charcode mapping serialization

* Improve CMap generation in the NewCompositePdfFontFromTTF function
2020-05-08 00:15:09 +00:00
Gunnsteinn Hall
9ef2f27694
Support for subsetting fonts (#335)
* Subsetting of TrueType CID fonts using unitype

* Simplify call to SubsetRegistered so it can be done right after loading font via creator finalizer

* Add an EnableFontSubsetting function on the creator to simplify font subsetting for creator users
2020-05-05 00:17:27 +00:00
Adrian-George Bostan
6678fc040a
Cache raw CMap data (#324) 2020-04-21 21:53:36 +00:00
Gunnsteinn Hall
11f3a6e7a2
Fix for crash in CCITT decoder. Resolves https://github.com/unidoc/unipdf/issues/314 (#315) 2020-04-16 23:05:50 +00:00
Adrian-George Bostan
d605803bd2
Prevent panics (#305)
* Remove panic on font nil Differences array

* Remove unused bcmaps function

* Remove panics from the core/security/crypt package

* Fix extractor invalid Do operand crash

* Fix TTF parser crash for invalid hhea number of hMetrics

* Remove ECB crypt panics

* Remove standard_r6 panics

* Remove panic from render package
2020-04-14 21:09:16 +00:00
Jacek Kucharczyk
ad0b31ea1b
Optimizer fix for the CCITTFax Encoder. ISS #243. Fixes JBIG2 i386 architecture compile issue. (#297)
* Fixed issue #243. Added optimize integration tests.

* Minor style change.

* XObjImage getParamsDict updates Columns and Rows.

* Added doc file for the optimize/tests package.

* UpdateParams for CCITTFax Encoder accepts Width and Height also. Removed GetParamsDict Columns and Rows parameters from model.Image and model.XObjImage.

* Fix i386 issue for the jbig2 arithmetic encoder.

* Added 386 architecture to the .travis/cross_build.sh
2020-04-08 11:11:49 +00:00
Jacek Kucharczyk
29efa30439
JBIG2 Encoder support for inserting binary images into PDF (#288)
* Added JBIG2 PDF support
* Added JBIG2 Encoder binary image requirements
* PR #288 revision r1 fixes
* PR #288 revision r2 fixes
2020-04-03 20:54:59 +00:00
Jacek Kucharczyk
c582323a8f
JBIG2 Generic Encoder (#264)
* Prepared skeleton and basic component implementations for the jbig2 encoding.

* Added Bitset. Implemented Bitmap.

* Decoder with old Arithmetic Decoder

* Partly working arithmetic

* Working arithmetic decoder.

* MMR patched.

* rebuild to apache.

* Working generic

* Working generic

* Decoded full document

* Update Jenkinsfile go version [master] (#398)

* Update Jenkinsfile go version

* Decoded AnnexH document

* Minor issues fixed.

* Update README.md

* Fixed generic region errors. Added benchmark. Added bitmap unpadder. Added Bitmap toImage method.

* Fixed endofpage error

* Added integration test.

* Decoded all test files without errors. Implemented JBIG2Global.

* Merged with v3 version

* Fixed the EOF in the globals issue

* Fixed the JBIG2 ChocolateData Decode

* JBIG2 Added license information

* Minor fix in jbig2 encoding.

* Applied the logging convention

* Cleaned unnecessary imports

* Go modules clear unused imports

* checked out the README.md

* Moved trace to Debug. Fixed the build integrate tag in the document_decode_test.go

* Initial encoder skeleton

* Applied UniPDF Developer Guide. Fixed lint issues.

* Cleared documentation, fixed style issues.

* Added jbig2 doc.go files. Applied unipdf guide style.

* Minor code style changes.

* Minor naming and style issues fixes.

* Minor naming changes. Style issues fixed.

* Review r11 fixes.

* Added JBIG2 Encoder skeleton.

* Moved Document and Page to jbig2/document package. Created decoder package responsible for decoding jbig2 stream.

* Implemented raster functions.

* Added raster uni low test functions.

* Added raster low test functions

* Added morph files.

* implemented jbig2 encoder basics

* JBIG2 Encoder - Generic method

* Added jbig2 image encode tests, black/white image tests

* cleaned and tested jbig2 package

* unfinished jbig2 classified encoder

* jbig2 minor style changes

* minor jbig2 encoder changes

* prepared JBIG2 Encoder

* Style and lint fixes

* Minor changes and lints

* Fixed shift unsigned value build errors

* Minor naming change

* Added jbig2 encode, image gondels. Fixed jbig2 decode bug.

* Provided jbig2 core.DecodeGlobals function.

* Fixed JBIG2Encoder `r6` revision issues.

* Removed public JBIG2Encoder document.

* Minor style changes

* added NewJBIG2Encoder function.

* fixed JBIG2Encoder 'r9' revision issues.

* Cleared 'r9' commented code.

* Updated ACKNOWLEDGEMENTS. Fixed JBIG2Encoder 'r10' revision issues.

Co-authored-by: Gunnsteinn Hall <gunnsteinn.hall@gmail.com>
2020-03-27 11:47:41 +00:00
Adrian-George Bostan
d961079c5d
Add basic image rendering support (#266)
* Add render package
* Add text state
* Add more text operators
* Remove unnecessary files
* Add text font
* Add custom text render method
* Improve text rendering method
* Rename text state methods
* Refactor and document context interface
* Refactor text begin/end operators
* Fix graphics state transformations
* Keep original font when doing font substitution
* Take page cropbox into account
* Revert to substitution font if original font measurement is 0
* Add font substitution package
* Implement additional transform.Point methods
* Use transform.Point in the image context package
* Remove unneeded functionality from the render image package
* Fix golint notices in the image rendering package
* Fix go vet notices in the render package
* Fix golint notices in the top-level render package
* Improve render context package documentation
* Document context text state struct.
* Document context text font struct.
* Minor logging improvements
* Add license disclaimer to the render package files
* Avoid using package aliases where possible
* Change style of section comments
* Adapt render package import style to follow the developer guide
* Improve documentation for the internal matrix implementation
* Update render package dependency versions
* Apply crop box post render
* Account for offset media boxes
* Improve metrics of rendered characters
* Fix text matrix translation
* Change priority of fonts used for measuring rendered characters
* Skip invalid m and l operators on image rendering
* Small fix for v operator
* Fix rendered characters spacing issues
* Refactor naming of internal render packages
2020-03-02 21:22:54 +00:00
Peter Williams
e056c0e4d4
Fixed PdfColorspaceSpecialIndexed.ImageToRGB() (#259)
* Fixed PdfColorspaceSpecialIndexed.ImageToRGB(). Fixes https://github.com/unidoc/unipdf/issues/258
* Fixed indexed colorspace bounds checking.
* Being super cautious to prevent a divide by zero error. I don't think the base colorspace can have fewer than one component.
* Updated image hash in extract_images_test.go to match new indexed colorspace code.
* add testfile from unipdf#258
2020-02-26 13:26:20 +00:00
Adrian-George Bostan
9de5fe644e
Add PdfFont text encoding methods (#257)
* Add PdfFont method for encoding runes to charcode bytes
* Add getter method for CMap nbits
* Take CMap nbits into account when encoding text
* Adapt font test cases to include text encoding testing
2020-02-17 22:54:20 +00:00
Adrian-George Bostan
e2b3c6e6ba
Add predefined CMaps for Type 0 composite fonts (#246)
* Add packed predefined cmaps
* Add cmap cid range parsing
* Load base cmap for predefined cmaps
* Refactor pdfFont to Unicode methods
* Preserve CharcodeBytesToUnicode behavior
* Add support for CID-keyed Type 0 fonts
* Add method documentation for the cmap package
* Refactor and document charcode to Unicode conversion code
* Add more cmap parsing test cases
* Add more method documentation in the cmap package.
* Remove unused code from the bcmaps package
* Improve cmap test case
* Assume identity when encoder is missing on regenerating field appearance
* Add missing encoder log message
* Add inverse CMap mappings
* Add CMap encoder
* Address golint notices and small fix in the cmap package
* Keep smaller charcodes when generating cmap inverse mappings
* Update extractor test case
* Keep latest supplement charcodes/CIDs when computing inverse mappings
* Fix comment typo
2020-02-07 19:56:30 +00:00
Samuel Stauffer
5f19bfa269 Address comments on PR 2020-01-06 11:13:16 -08:00
Samuel Stauffer
e85397b57a Unify and optimize number parsing 2020-01-06 11:05:42 -08:00
Adrian-George Bostan
23aec77478 Add basic support for UTF-16 text encodings (#203)
* Add UTF-16 text encoder
2019-11-28 00:47:00 +00:00
Adrian-George Bostan
56e81d3a1a Take decode arrays into account when processing grayscale images (#159)
* Take decode arrays into account when processing grayscale images
* Adapt image extraction test case hashes
* Minor refactoring in the ColorAt image method
* Always return vanilla data from the jbig2 decoder
2019-08-30 19:16:23 +00:00
Jacek Kucharczyk
24648f4481 Issue #144 Fix - JBIG2 - Changed integer variables types (#148)
* Fixing platform independent integer size
* Cleared test logs.
* Cleared unnecessary int32
* Defined precise integer size for jbig2 segments.
2019-08-29 19:12:18 +00:00
Adrian-George Bostan
febf633172 Image memory optimizations (#149)
* Add ColorAt method for images
* Avoid resample on image to Go image conversion
* Avoid resample when converting grayscale image to RGB
* Preserve old behavior of image to Go image conversion
* Add missing case in the ToGoImage method
* Fix grayscale to RGB image conversion
* Improve code documentation
* Fix color extraction for CMYK and 4 bit RGB
* Add test case for the ColorAt image method
* Avoid resampling when converting CMYK image to RGB
* Add notice comment for the GetSamples/SetSamples image methods
2019-08-22 20:15:16 +00:00
Adrian-George Bostan
cca04199e6 Add extract images test case, with memory profiling (#146)
* Add extract images test case, with memory profiling
* Use TotalAlloc instead of Alloc for memory profiling
* Remove calls to debug.FreeOSMemory from test cases
2019-08-19 22:37:16 +00:00
Peter Williams
9ebcfcf168 Finding bounding boxes of substrings of extracted text. (#109)
* Added text bounding box extraction.
* Add `font` field to textMark struct; create a new method `TextComponents` to retrieve all the text components of the extracted text in the page, with position and character information
* Reorganizing extractor/text.go
* Added a text extraction position test.
* Added another text extraction location test.
* Text extraction location testing.
* Added tests for text extraction with location information.
* Cleaned up text extraction tests. No changes to functionality.
* Simplifying text extraction code.
* Simplified line construction in text.go
* Returning TextMarks in TextMarkArray which are based on PdfObjectArray but read-only, so not pointers.
* Added text extraction to show PDFs marked-up with bounding boxes of substring in extracted text.
* Add comments explaining how to calculate text bounding boxes.
* Made text_test.go naming consistent with function comments in text.go
* Use tm, pt, tl for textMark/TextMark PageText and TextLine receivers and local variables.
* Uncommented text stress test. Use go test --short to skip
* TextMark.Offset is now an index into the extracted text. It was an index into []rune(text)
2019-07-18 06:41:47 +00:00
Jacek Kucharczyk
4b1c345214 JBIG2 decoder benchmark patch 2019-07-16 15:40:22 +00:00
Jacek Kucharczyk
e85616cec2 JBIG2Decoder implementation (#67)
* Prepared skeleton and basic component implementations for the jbig2 encoding.
* Added Bitset. Implemented Bitmap.
* Decoder with old Arithmetic Decoder
* Partly working arithmetic
* Working arithmetic decoder.
* MMR patched.
* rebuild to apache.
* Working generic
* Decoded full document
* Decoded AnnexH document
* Minor issues fixed.
* Update README.md
* Fixed generic region errors. Added benchmark. Added bitmap unpadder. Added Bitmap toImage method.
* Fixed endofpage error
* Added integration test.
* Decoded all test files without errors. Implemented JBIG2Global.
* Merged with v3 version
* Fixed the EOF in the globals issue
* Fixed the JBIG2 ChocolateData Decode
* JBIG2 Added license information
* Minor fix in jbig2 encoding.
* Applied the logging convention
* Cleaned unnecessary imports
* Go modules clear unused imports
* checked out the README.md
* Moved trace to Debug. Fixed the build integrate tag in the document_decode_test.go
* Applied UniPDF Developer Guide. Fixed lint issues.
* Cleared documentation, fixed style issues.
* Added jbig2 doc.go files. Applied unipdf guide style.
* Minor code style changes.
* Minor naming and style issues fixes.
* Minor naming changes. Style issues fixed.
* Review r11 fixes.
* Integrate jbig2 tests with build system
* Added jbig2 integration test golden files.
* Minor jbig2 integration test fix
* Removed jbig2 integration image assertions
* Fixed jbig2 rowstride issue. Implemented jbig2 bit writer
* Changed golden files logic. Fixes r13 issues.
2019-07-14 21:18:40 +00:00
Adrian-George Bostan
d8dcc051b3 Fix annotation flatten when AcroForm does not exist (#93)
* Fix annotation flatten when AcroForm does not exist.
* Adapt test case file hashes to account for file flattening
2019-06-25 19:29:03 +00:00
Gunnsteinn Hall
7a9a8ff542
Add FDF merge test case for form filling and flattening with change detection (#98)
Manually verified that output PDFs look good and leave hash check to detect change. If there is a change in the future, the hash change will trigger a failure upon which the output PDFs need to be re-checked and hashes updated if appropriate.
2019-06-25 08:08:51 +00:00
Adrian-George Bostan
8425bf7c8f Update page resources Font dictionary when applying license information (#5)
* Make PdfObjectDictionary Merge method chainable
* Update page resources Font dictionary when applying license information
* Add license font to the page resources only when it does not exist
* Update hash for split test after verification
2019-05-30 10:52:05 +00:00
Adrian-George Bostan
c64812093d Remove pdf folder and move packages up one level (#2) 2019-05-16 20:44:51 +00:00