unipdf

mirror of https://github.com/unidoc/unipdf.git synced 2025-04-27 13:48:51 +08:00

Author	SHA1	Message	Date
Gunnsteinn Hall	11f692bc3a	Font subsetting and font optimization improvements (#362) * Track runes in IdentityEncoder (for subsetting), track decoded runes * Working with the identity encoder in font_composite.go * Add GetFilterArray to multi encoder. Add comments. * Add NewFromContents constructor to extractor only requiring contents and resources * golint fixes * Optimizer compress streams - improved detection of raw streams * Optimize - CleanContentStream optimizer that removes redundant operands * WIP Optimize - clean fonts Will support both font file reduction and subsetting. (WIP) * Optimize - image processing - try combined DCT and Flate * Update options.go * Update optimizer.go * Create utils.go for optimize with common methods needed for optimization * Optimizer - add font subsetting method Covers XObject Forms, annotations etc. Uses extractor package to extract text marks covering what fonts and glyphs are used. Package truetype used for subsetting. * Add some comments * Fix cmap parsing rune conversion * Error checking for extractor. Add some comments. * Update Jenkinsfile * Update modules	2020-06-16 21:19:10 +00:00
Peter Williams	5777ee1394	Handle multibyte entries in CMaps. (#353) * Fixed filename:page in logging * Got CMap working for multi-rune entries * Treat CMap entries as strings instead of runes to handle multi-byte encodings. * Added a test for multibyte encoding. * Changed rune->CharCode maps to string->CharCode. * Removed unintentional changes. * Updated comments to match new function definitions. * Changed some []rune APIs to string * Fixes for reviewer comments.	2020-06-03 13:55:15 +00:00
Adrian-George Bostan	d078608da4	Account for parent CTM when calculating positions of extracted forms (#349) * Take parent CTM into account for form field text * Pass a modified graphics state instance to new text objects	2020-05-25 23:34:44 +00:00
Adrian-George Bostan	61ff51916a	Double quote content stream operator fixes (#313) * Fix wrong symbol checks used for the double quote content stream operator * Fix text extraction parameter check for the double quote operator	2020-04-16 14:32:34 +00:00
Adrian-George Bostan	d605803bd2	Prevent panics (#305) * Remove panic on font nil Differences array * Remove unused bcmaps function * Remove panics from the core/security/crypt package * Fix extractor invalid Do operand crash * Fix TTF parser crash for invalid hhea number of hMetrics * Remove ECB crypt panics * Remove standard_r6 panics * Remove panic from render package	2020-04-14 21:09:16 +00:00
Adrian-George Bostan	f7b5ffa954	Prevent extractor panic for invalid PDF text objects (#196) * Prevent extractor panic for invalid PDF text objects * Document text extraction behavior of invalid text objects	2019-10-30 20:36:35 +00:00
Peter Williams	aea4cb1d55	Make PageText.sortPosition() sort order deterministic. (#153)	2019-08-29 18:26:53 +00:00
Gunnsteinn Hall	21141a9d3e	Add Append to TextMarkArray. Useful when processing and grouping text marks.	2019-08-04 09:29:21 +00:00
Peter Williams	9ebcfcf168	Finding bounding boxes of substrings of extracted text. (#109) * Added text bounding box extraction. * Add `font` field to textMark struct; Create a new method `TextComponents` to retrieve all the text components of the extracted text in the page, with position and character information * Reorganizing extractor/text.go * Added a text extraction position test. * Added another text extraction location test. * Text extraction location testing. * Added tests for text extraction with location information. * Cleaned up text extraction tests. No changes to functionality. * Simplifying text extraction code. * Simplified line construction in text.go * Returning TextMark's in TextMarkArray which are based on PdfObjectArray but read-only, so not pointers. * Added text extraction to show PDFs marked-up with bounding boxes of substring in extracted text. * Add comments explaining how to calculate text bounding boxes. * Made text_test.go naming consistent with function comments in text.go * Use tm, pt, tl for textMark/TextMark PageText and TextLine receivers and local variables. * Uncommented text stress test. Use go test --short to skip * TextMark.Offset is now an index into the extracted text. It was an index into []rune(text)	2019-07-18 06:41:47 +00:00
Adrian-George Bostan	c64812093d	Remove pdf folder and move packages up one level (#2)	2019-05-16 20:44:51 +00:00

10 Commits