unipdf

mirror of https://github.com/unidoc/unipdf.git synced 2025-04-26 13:48:55 +08:00

Author	SHA1	Message	Date
UniDoc Build	22ca2c0eed	prepare release	2020-09-28 23:18:17 +00:00
UniDoc Build	9107a86674	prepare release	2020-09-21 01:20:10 +00:00
UniDoc Build	b991a36456	prepare release	2020-09-14 09:32:45 +00:00
UniDoc Build	fd3b669a36	prepare release	2020-09-07 00:23:12 +00:00
UniDoc Build	61b6580cb9	prepare release	2020-08-31 21:12:07 +00:00
UniDoc Build	1501d07a74	prepare release	2020-08-27 21:45:09 +00:00
Peter Williams	88fda44e0a	Text extraction code for columns. (#366) * Fixed filename:page in logging * Got CMap working for multi-rune entries * Treat CMap entries as strings instead of runes to handle multi-byte encodings. * Added a test for multibyte encoding. * First version of text extraction that recognizes columns * Added an explanation of the text columns code to README.md. * fixed typos * Abstracted textWord depth calculation. This required change textMark to textMark in a lot of code. Added function comments. * Fixed text state save/restore. * Adjusted inter-word search distance to make paragraph division work for thanh.pdf * Got text_test.go passing. * Reinstated hyphen suppression * Handle more cases of fonts not being set in text extraction code. * Fixed typo * More verbose logging * Adding tables to text extractor. * Added tests for columns extraction. * Removed commented code * Check for textParas that are on the same line when writing out extracted text. * Absorb text to the left of paras into paras e.g. Footnote numbers * Removed funny character from text_test.go * Commented out a creator_test.go test that was broken by my text extraction changes. * Big changes to columns text extraction code for PR. Performance improvements in several places. Commented code. * Updated extractor/README * Cleaned up some comments and removed a panic * Increased threshold for truncating extracted text when there is no license 100 -> 102. This is a workaround to let a test in creator_test.go pass. With the old text extraction code the following extracted text was 100 chars. With the new code it is 102 chars which looks correct. "你好\n你好你好你好你好\n河上白云\n\nUnlicensed UniDoc - Get a license on https://unidoc.io\n\n" * Improved an error message. * Removed irrelevant spaces * Commented code and removed unused functions. * Reverted PdfRectangle changes * Added duplicate text detection. * Combine diacritic textMarks in text extraction * Reinstated a diacritic recombination test. * Small code reorganisation * Reinstated handling of rotated text * Addressed issues in PR review * Added color fields to TextMark * Updated README * Reinstated the disabled tests I missed before. * Tightened definition for tables to prevent detection of tables where there weren't any. * Compute line splitting search range based on fontsize of first word in word bag. * Use errors.Is(err, core.ErrNotSupported) to distinguish unsupported font errors. See https://blog.golang.org/go1.13-errors * Fixed some naming and added some comments. * errors.Is -> xerrors.Is and %w -> %v for go 1.12 compatibility * Removed code that doesn't ever get called. * Removed unused test	2020-06-30 19:33:10 +00:00
Adrian-George Bostan	4f96762674	Add fill and stroke colors for extracted text marks (#381) * Add fill and stroke colors for text marks * Add extractor text color test case * Fix fillColor field comment typo	2020-06-24 09:55:58 +00:00
Gunnsteinn Hall	11f692bc3a	Font subsetting and font optimization improvements (#362) * Track runes in IdentityEncoder (for subsetting), track decoded runes * Working with the identity encoder in font_composite.go * Add GetFilterArray to multi encoder. Add comments. * Add NewFromContents constructor to extractor only requiring contents and resources * golint fixes * Optimizer compress streams - improved detection of raw streams * Optimize - CleanContentStream optimizer that removes redundant operands * WIP Optimize - clean fonts Will support both font file reduction and subsetting. (WIP) * Optimize - image processing - try combined DCT and Flate * Update options.go * Update optimizer.go * Create utils.go for optimize with common methods needed for optimization * Optimizer - add font subsetting method Covers XObject Forms, annotations etc. Uses extractor package to extract text marks covering what fonts and glyphs are used. Package truetype used for subsetting. * Add some comments * Fix cmap parsing rune conversion * Error checking for extractor. Add some comments. * Update Jenkinsfile * Update modules	2020-06-16 21:19:10 +00:00
Peter Williams	5777ee1394	Handle multibyte entries in CMaps. (#353) * Fixed filename:page in logging * Got CMap working for multi-rune entries * Treat CMap entries as strings instead of runes to handle multi-byte encodings. * Added a test for multibyte encoding. * Changed rune->CharCode maps to string->CharCode. * Removed unintentional changes. * Updated comments to match new function definitions. * Changed some []rune APIs to string * Fixes for reviewer comments.	2020-06-03 13:55:15 +00:00
Adrian-George Bostan	d078608da4	Account for parent CTM when calculating positions of extracted forms (#349) * Take parent CTM into account for form field text * Pass a modified graphics state instance to new text objects	2020-05-25 23:34:44 +00:00
Adrian-George Bostan	61ff51916a	Double quote content stream operator fixes (#313) * Fix wrong symbol checks used for the double quote content stream operator * Fix text extraction parameter check for the double quote operator	2020-04-16 14:32:34 +00:00
Adrian-George Bostan	d605803bd2	Prevent panics (#305) * Remove panic on font nil Differences array * Remove unused bcmaps function * Remove panics from the core/security/crypt package * Fix extractor invalid Do operand crash * Fix TTF parser crash for invalid hhea number of hMetrics * Remove ECB crypt panics * Remove standard_r6 panics * Remove panic from render package	2020-04-14 21:09:16 +00:00
Adrian-George Bostan	e2b3c6e6ba	Add predefined CMaps for Type 0 composite fonts (#246) * Add packed predefined cmaps * Add cmap cid range parsing * Load base cmap for predefined cmaps * Refactor pdfFont to Unicode methods * Preserve CharcodeBytesToUnicode behavior * Add support for CID-keyed Type 0 fonts * Add method documentation for the cmap package * Refactor and document charcode to Unicode conversion code * Add more cmap parsing test cases * Add more method documentation in the cmap package. * Remove unused code from the bcmaps package * Improve cmap test case * Assume identity when encoder is missing on regenerating field appearance * Add missing encoder log message * Add inverse CMap mappings * Add CMap encoder * Address golint notices and small fix in the cmap package * Keep smaller charcodes when generating cmap inverse mappings * Update extractor test case * Keep latest supplement charcodes/CIDs when computing inverse mappings * Fix comment typo	2020-02-07 19:56:30 +00:00
Adrian-George Bostan	f7b5ffa954	Prevent extractor panic for invalid PDF text objects (#196) * Prevent extractor panic for invalid PDF text objects * Document text extraction behavior of invalid text objects	2019-10-30 20:36:35 +00:00
Peter Williams	aea4cb1d55	Make PageText.sortPosition() sort order deterministic. (#153)	2019-08-29 18:26:53 +00:00
Gunnsteinn Hall	21141a9d3e	Add Append to TextMarkArray. Useful when processing and grouping text marks.	2019-08-04 09:29:21 +00:00
Gunnsteinn Hall	1d7b969b91	Simplify license loading and support environment variables	2019-08-04 09:28:42 +00:00
Peter Williams	9ebcfcf168	Finding bounding boxes of substrings of extracted text. (#109) * Added text bounding box extraction. * Add `font` field to textMark struct; Create a new method `TextComponents` to retrieve all the text components of the extracted text in the page, with position and character information * Reorganizing extractor/text.go * Added a text extraction position test. * Added another text extraction location test. * Text extraction location testing. * Added tests for text extraction with location information. * Cleaned up text extraction tests. No changes to functionality. * Simplifying text extraction code. * Simplified line construction in text.go * Returning TextMarks in TextMarkArray which are based on PdfObjectArray but read-only, so not pointers. * Added text extraction to show PDFs marked-up with bounding boxes of substring in extracted text. * Add comments explaining how to calculate text bounding boxes. * Made text_test.go naming consistent with function comments in text.go * Use tm, pt, tl for textMark/TextMark PageText and TextLine receivers and local variables. * Uncommented text stress test. Use go test --short to skip * TextMark.Offset is now an index into the extracted text. It was an index into []rune(text)	2019-07-18 06:41:47 +00:00
Adrian-George Bostan	c64812093d	Remove pdf folder and move packages up one level (#2)	2019-05-16 20:44:51 +00:00

20 Commits