* Fixed filename:page in logging
* Got CMap working for multi-rune entries
* Treat CMap entries as strings instead of runes to handle multi-byte encodings.
* Added a test for multibyte encoding.
* First version of text extraction that recognizes columns
* Added an explanation of the text columns code to README.md.
* fixed typos
* Abstracted textWord depth calculation. This required changing textMark to *textMark in a lot of code.
* Added function comments.
* Fixed text state save/restore.
* Adjusted inter-word search distance to make paragraph division work for thanh.pdf
* Got text_test.go passing.
* Reinstated hyphen suppression
* Handle more cases of fonts not being set in text extraction code.
* Fixed typo
* More verbose logging
* Adding tables to text extractor.
* Added tests for columns extraction.
* Removed commented code
* Check for textParas that are on the same line when writing out extracted text.
* Absorb text to the left of paras into paras, e.g. footnote numbers
* Removed funny character from text_test.go
* Commented out a creator_test.go test that was broken by my text extraction changes.
* Big changes to columns text extraction code for PR.
Performance improvements in several places.
Commented code.
* Updated extractor/README
* Cleaned up some comments and removed a panic
* Increased threshold for truncating extracted text when there is no license 100 -> 102.
This is a workaround to let a test in creator_test.go pass.
With the old text extraction code the following extracted text was 100 chars. With the new code it
is 102 chars which looks correct.
"你好\n你好你好你好你好\n河上白云\n\nUnlicensed UniDoc - Get a license on https://unidoc.io\n\n"
* Improved an error message.
* Removed irrelevant spaces
* Commented code and removed unused functions.
* Reverted PdfRectangle changes
* Added duplicate text detection.
* Combine diacritic textMarks in text extraction
* Reinstated a diacritic recombination test.
* Small code reorganisation
* Reinstated handling of rotated text
* Addressed issues in PR review
* Added color fields to TextMark
* Updated README
* Reinstated the disabled tests I missed before.
* Tightened definition for tables to prevent detection of tables where there weren't any.
* Compute line splitting search range based on fontsize of first word in word bag.
* Use errors.Is(err, core.ErrNotSupported) to distinguish unsupported font errors.
See https://blog.golang.org/go1.13-errors
* Fixed some naming and added some comments.
* errors.Is -> xerrors.Is and %w -> %v for go 1.12 compatibility
* Removed code that doesn't ever get called.
* Removed unused test
* Track runes in IdentityEncoder (for subsetting), track decoded runes
* Working with the identity encoder in font_composite.go
* Add GetFilterArray to multi encoder. Add comments.
* Add NewFromContents constructor to extractor only requiring contents and resources
* golint fixes
* Optimizer compress streams - improved detection of raw streams
* Optimize - CleanContentStream optimizer that removes redundant operands
* WIP Optimize - clean fonts
Will support both font file reduction and subsetting. (WIP)
* Optimize - image processing - try combined DCT and Flate
* Update options.go
* Update optimizer.go
* Create utils.go for optimize with common methods needed for optimization
* Optimizer - add font subsetting method
Covers XObject Forms, annotations etc. Uses the extractor package to extract text marks covering what fonts and glyphs are used. Package truetype is used for subsetting.
* Add some comments
* Fix cmap parsing rune conversion
* Error checking for extractor. Add some comments.
* Update Jenkinsfile
* Update modules
* Changed rune->CharCode maps to string->CharCode.
* Removed unintentional changes.
* Updated comments to match new function definitions.
* Changed some []rune APIs to string
* Fixes for reviewer comments.
* Add packed predefined cmaps
* Add cmap cid range parsing
* Load base cmap for predefined cmaps
* Refactor pdfFont to Unicode methods
* Preserve CharcodeBytesToUnicode behavior
* Add support for CID-keyed Type 0 fonts
* Add method documentation for the cmap package
* Refactor and document charcode to Unicode conversion code
* Add more cmap parsing test cases
* Add more method documentation in the cmap package.
* Remove unused code from the bcmaps package
* Improve cmap test case
* Assume identity when encoder is missing on regenerating field appearance
* Add missing encoder log message
* Add inverse CMap mappings
* Add CMap encoder
* Address golint notices and small fix in the cmap package
* Keep smaller charcodes when generating cmap inverse mappings
* Update extractor test case
* Keep latest supplement charcodes/CIDs when computing inverse mappings
* Fix comment typo
* Added text bounding box extraction.
* Add `font` field to textMark struct.
Create a new method `TextComponents` to retrieve all the text components of the extracted text in the page, with position and character information.
* Reorganizing extractor/text.go
* Added a text extraction position test.
* Added another text extraction location test.
* Text extraction location testing.
* Added tests for text extraction with location information.
* Cleaned up text extraction tests. No changes to functionality.
* Simplifying text extraction code.
* Simplified line construction in text.go
* Returning TextMarks in TextMarkArray, which is based on PdfObjectArray but read-only, so not pointers.
* Added text extraction to show PDFs marked up with bounding boxes of substrings in extracted text.
* Add comments explaining how to calculate text bounding boxes.
* Made text_test.go naming consistent with function comments in text.go
* Use tm, pt, tl for textMark/TextMark, PageText and TextLine receivers and local variables.
* Uncommented text stress test. Use go test --short to skip it.
* TextMark.Offset is now an index into the extracted text. It was previously an index into []rune(text).
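
The TextMark.Offset change above matters for any text containing multi-byte characters, because a byte index into a Go string and an index into []rune(text) diverge as soon as a non-ASCII rune appears. The following is a minimal sketch of that difference using only the standard library. The helpers `byteToRuneOffset` and `runeToByteOffset` are hypothetical names made up for illustration and are not part of the extractor package; the sketch also assumes the text is valid UTF-8 and that the byte offset falls on a rune boundary.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// byteToRuneOffset converts a byte index into text into an index into
// []rune(text). Hypothetical helper; assumes byteOffset is on a rune boundary.
func byteToRuneOffset(text string, byteOffset int) int {
	return utf8.RuneCountInString(text[:byteOffset])
}

// runeToByteOffset converts an index into []rune(text) into a byte index
// into text. Hypothetical helper; assumes text is valid UTF-8.
func runeToByteOffset(text string, runeOffset int) int {
	return len(string([]rune(text)[:runeOffset]))
}

func main() {
	text := "你好 world"

	// strings.Index returns a byte index, which is the kind of offset
	// that can be used directly to slice the string.
	b := strings.Index(text, "world")
	r := byteToRuneOffset(text, b)

	fmt.Println("byte offset:", b)
	fmt.Println("rune offset:", r)

	// Both forms address the same substring.
	fmt.Println(text[b:] == string([]rune(text)[r:]))
}
```

Callers that stored offsets under the old rune-based convention could convert them with something like `runeToByteOffset` before slicing the extracted text directly.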