* Fixed filename:page in logging
* Got CMap working for multi-rune entries
* Treat CMap entries as strings instead of runes to handle multi-byte encodings.
* Added a test for multibyte encoding.
* First version of text extraction that recognizes columns
* Added an expanation of the text columns code to README.md.
* fixed typos
* Abstracted textWord depth calculation. This required change textMark to *textMark in a lot of code.
* Added function comments.
* Fixed text state save/restore.
* Adjusted inter-word search distance to make paragrah division work for thanh.pdf
* Got text_test.go passing.
* Reinstated hyphen suppression
* Handle more cases of fonts not being set in text extraction code.
* Fixed typo
* More verbose logging
* Adding tables to text extractor.
* Added tests for columns extraction.
* Removed commented code
* Check for textParas that are on the same line when writing out extracted text.
* Absorb text to the left of paras into paras e.g. Footnote numbers
* Removed funny character from text_test.go
* Commented out a creator_test.go test that was broken by my text extraction changes.
* Big changes to columns text extraction code for PR.
Performance improvements in several places.
Commented code.
* Updated extractor/README
* Cleaned up some comments and removed a panic
* Increased threshold for truncating extracted text when there is no license 100 -> 102.
This is a workaround to let a test in creator_test.go pass.
With the old text extraction code the following extracted text was 100 chars. With the new code it
is 102 chars which looks correct.
"你好\n你好你好你好你好\n河上白云\n\nUnlicensed UniDoc - Get a license on https://unidoc.io\n\n"
* Improved an error message.
* Removed irrelevant spaces
* Commented code and removed unused functions.
* Reverted PdfRectangle changes
* Added duplicate text detection.
* Combine diacritic textMarks in text extraction
* Reinstated a diacritic recombination test.
* Small code reorganisation
* Reinstated handling of rotated text
* Addressed issues in PR review
* Added color fields to TextMark
* Updated README
* Reinstated the disabled tests I missed before.
* Tightened definition for tables to prevent detection of tables where there weren't any.
* Compute line splitting search range based on fontsize of first word in word bag.
* Use errors.Is(err, core.ErrNotSupported) to distinguish unsupported font errorrs.
See https://blog.golang.org/go1.13-errors
* Fixed some naming and added some comments.
* errors.Is -> xerrors.Is and %w -> %v for go 1.12 compatibility
* Removed code that doesn't ever get called.
* Removed unused test
* Track runes in IdentityEncoder (for subsetting), track decoded runes
* Working with the identity encoder in font_composite.go
* Add GetFilterArray to multi encoder. Add comments.
* Add NewFromContents constructor to extractor only requiring contents and resources
* golint fixes
* Optimizer compress streams - improved detection of raw streams
* Optimize - CleanContentStream optimizer that removes redundant operands
* WIP Optimize - clean fonts
Will support both font file reduction and subsetting. (WIP)
* Optimize - image processing - try combined DCT and Flate
* Update options.go
* Update optimizer.go
* Create utils.go for optimize with common methods needed for optimization
* Optimizer - add font subsetting method
Covers XObject Forms, annotaitons etc. Uses extractor package to extract text marks covering what fonts and glyphs are used. Package truetype used for subsetting.
* Add some comments
* Fix cmap parsing rune conversion
* Error checking for extractor. Add some comments.
* Update Jenkinsfile
* Update modules
* Update unitype lib which improves subsetting
* Add text extraction check to creator font subsetting example
Helps ensure ToUnicode map is set correctly.
* Clean up import
* Fix spelling
* Subsetting of TrueType CID fonts using unitype
* Simplify call to SubsetRegistered so can be done right after loading font via creator finalizer
* Add an EnableFontSubsetting function on the creator to simplify font subsetting for creator users
* Add render package
* Add text state
* Add more text operators
* Remove unnecessary files
* Add text font
* Add custom text render method
* Improve text rendering method
* Rename text state methods
* Refactor and document context interface
* Refact text begin/end operators
* Fix graphics state transformations
* Keep original font when doing font substitution
* Take page cropbox into account
* Revert to substitution font if original font measurement is 0
* Add font substitution package
* Implement addition transform.Point methods
* Use transform.Point in the image context package
* Remove unneeded functionality from the render image package
* Fix golint notices in the image rendering package
* Fix go vet notices in the render package
* Fix golint notices in the top-level render package
* Improve render context package documentation
* Document context text state struct.
* Document context text font struct.
* Minor logging improvements
* Add license disclaimer to the render package files
* Avoid using package aliases where possible
* Change style of section comments
* Adapt render package import style to follow the developer guide
* Improve documentation for the internal matrix implementation
* Update render package dependency versions
* Apply crop box post render
* Account for offseted media boxes
* Improve metrics of rendered characters
* Fix text matrix translation
* Change priority of fonts used for measuring rendered characters
* Skip invalid m and l operators on image rendering
* Small fix for v operator
* Fix rendered characters spacing issues
* Refactor naming of internal render packages