unipdf

mirror of https://github.com/unidoc/unipdf.git synced 2025-05-01 22:17:29 +08:00

Author	SHA1	Message	Date
Adrian-George Bostan	e2b3c6e6ba	Add predefined CMaps for Type 0 composite fonts (#246) * Add packed predefined cmaps * Add cmap cid range parsing * Load base cmap for predefined cmaps * Refactor pdfFont to Unicode methods * Preserve CharcodeBytesToUnicode behavior * Add support for CID-keyed Type 0 fonts * Add method documentation for the cmap package * Refactor and document charcode to Unicode conversion code * Add more cmap parsing test cases * Add more method documentation in the cmap package. * Remove unused code from the bcmaps package * Improve cmap test case * Assume identity when encoder is missing on regenerating field appearance * Add missing encoder log message * Add inverse CMap mappings * Add CMap encoder * Address golint notices and small fix in the cmap package * Keep smaller charcodes when generating cmap inverse mappings * Update extractor test case * Keep latest supplement charcodes/CIDs when computing inverse mappings * Fix comment typo	2020-02-07 19:56:30 +00:00
Adrian-George Bostan	f7b5ffa954	Prevent extractor panic for invalid PDF text objects (#196) * Prevent extractor panic for invalid PDF text objects * Document text extraction behavior of invalid text objects	2019-10-30 20:36:35 +00:00
Peter Williams	aea4cb1d55	Make PageText.sortPosition() sort order deterministic. (#153)	2019-08-29 18:26:53 +00:00
Gunnsteinn Hall	21141a9d3e	Add Append to TextMarkArray. Useful when processing and grouping text marks.	2019-08-04 09:29:21 +00:00
Gunnsteinn Hall	1d7b969b91	Simplify license loading and support environment variables	2019-08-04 09:28:42 +00:00
Peter Williams	9ebcfcf168	Finding bounding boxes of substrings of extracted text. (#109) * Added text bounding box extraction. * Add `font` field to textMark struct; Create a new method `TextComponents` to retrieve all the text components of the extracted text in the page, with position and character informations * Reorganizing extractor/text.go * Added a text extraction position test. * Added another text extraction location test. * Text extraction location testing. * Added tests for text extraction with location information. * Cleaned up text extraction tests. No changes to functionality. * Simplifying text extraction code. * Simplified line construction in text.go * Returning TextMark's in TextMarkArray which are based on PdfObjectArray but read-only, so not pointers. * Added text extraction to show PDFs marked-up with bounding boxes of substring in extracted text. * Add comments explaining how to calculate text bounding boxes. * Made text_test.go naming consistent with function comments in text.go * Use tm, pt, tl for textMark/TextMark PageText and TextLine receivers and local variables. * uncommeted text stress test. Use go test --short to skip * TextMark.Offset is now an index into the extracted text. It was an index into []rune(text)	2019-07-18 06:41:47 +00:00
Adrian-George Bostan	c64812093d	Remmove pdf folder and move packages up one level (#2)	2019-05-16 20:44:51 +00:00

7 Commits