10 Commits

Author SHA1 Message Date
Gunnsteinn Hall
11f692bc3a
Font subsetting and font optimization improvements (#362)
* Track runes in IdentityEncoder (for subsetting), track decoded runes

* Working with the identity encoder in font_composite.go

* Add GetFilterArray to multi encoder.  Add comments.

* Add NewFromContents constructor to extractor only requiring contents and resources

* golint fixes

* Optimizer compress streams - improved detection of raw streams

* Optimize - CleanContentStream optimizer that removes redundant operands

* WIP Optimize - clean fonts

Will support both font file reduction and subsetting. (WIP)

* Optimize - image processing - try combined DCT and Flate

* Update options.go

* Update optimizer.go

* Create utils.go for optimize with common methods needed for optimization

* Optimizer - add font subsetting method

Covers XObject Forms, annotations etc. Uses the extractor package to extract text marks covering what fonts and glyphs are used. Package truetype used for subsetting.

* Add some comments

* Fix cmap parsing rune conversion

* Error checking for extractor.  Add some comments.

* Update Jenkinsfile

* Update modules
2020-06-16 21:19:10 +00:00
Peter Williams
5777ee1394
Handle multibyte entries in CMaps. (#353)
* Fixed filename:page in logging

* Got CMap working for multi-rune entries

* Treat CMap entries as strings instead of runes to handle multi-byte encodings.

* Added a test for multibyte encoding.

* Changed rune->CharCode maps to string->CharCode.

* Removed unintentional changes.

* Updated comments to match new function definitions.

* Changed some []rune APIs to string

* Fixes for reviewer comments.
2020-06-03 13:55:15 +00:00
Adrian-George Bostan
d078608da4
Account for parent CTM when calculating positions of extracted forms (#349)
* Take parent CTM into account for form field text

* Pass a modified graphics state instance to new text objects
2020-05-25 23:34:44 +00:00
Adrian-George Bostan
61ff51916a
Double quote content stream operator fixes (#313)
* Fix wrong symbol checks used for the double quote content stream operator

* Fix text extraction parameter check for the double quote operator
2020-04-16 14:32:34 +00:00
Adrian-George Bostan
d605803bd2
Prevent panics (#305)
* Remove panic on font nil Differences array

* Remove unused bcmaps function

* Remove panics from the core/security/crypt package

* Fix extractor invalid Do operand crash

* Fix TTF parser crash for invalid hhea number of hMetrics

* Remove ECB crypt panics

* Remove standard_r6 panics

* Remove panic from render package
2020-04-14 21:09:16 +00:00
Adrian-George Bostan
f7b5ffa954
Prevent extractor panic for invalid PDF text objects (#196)
* Prevent extractor panic for invalid PDF text objects
* Document text extraction behavior of invalid text objects
2019-10-30 20:36:35 +00:00
Peter Williams
aea4cb1d55
Make PageText.sortPosition() sort order deterministic. (#153)
2019-08-29 18:26:53 +00:00
Gunnsteinn Hall
21141a9d3e
Add Append to TextMarkArray
Useful when processing and grouping text marks.
2019-08-04 09:29:21 +00:00
Peter Williams
9ebcfcf168
Finding bounding boxes of substrings of extracted text. (#109)
* Added text bounding box extraction.
* Add `font` field to textMark struct;
Create a new method `TextComponents` to retrieve all the text components of the extracted text in the page, with position and character information
* Reorganizing extractor/text.go
* Added a text extraction position test.
* Added another text extraction location test.
* Text extraction location testing.
* Added tests for text extraction with location information.
* Cleaned up text extraction tests. No changes to functionality.
* Simplifying text extraction code.
* Simplified line construction in text.go
* Returning TextMark's in TextMarkArray which are based on PdfObjectArray but read-only, so not pointers.
* Added text extraction to show PDFs marked-up with bounding boxes of substring in extracted text.
* Add comments explaining how to calculate text bounding boxes.
* Made text_test.go naming consistent with function comments in text.go
* Use tm, pt, tl for textMark/TextMark PageText and TextLine receivers and local variables.
* Uncommented text stress test. Use go test --short to skip
* TextMark.Offset is now an index into the extracted text. It was an index into []rune(text)
2019-07-18 06:41:47 +00:00
Adrian-George Bostan
c64812093d
Remove pdf folder and move packages up one level (#2)
2019-05-16 20:44:51 +00:00