7 Commits

Author SHA1 Message Date
Adrian-George Bostan
e2b3c6e6ba Add predefined CMaps for Type 0 composite fonts (#246)
* Add packed predefined cmaps
* Add cmap cid range parsing
* Load base cmap for predefined cmaps
* Refactor pdfFont to Unicode methods
* Preserve CharcodeBytesToUnicode behavior
* Add support for CID-keyed Type 0 fonts
* Add method documentation for the cmap package
* Refactor and document charcode to Unicode conversion code
* Add more cmap parsing test cases
* Add more method documentation in the cmap package.
* Remove unused code from the bcmaps package
* Improve cmap test case
* Assume identity when encoder is missing on regenerating field appearance
* Add missing encoder log message
* Add inverse CMap mappings
* Add CMap encoder
* Address golint notices and small fix in the cmap package
* Keep smaller charcodes when generating cmap inverse mappings
* Update extractor test case
* Keep latest supplement charcodes/CIDs when computing inverse mappings
* Fix comment typo
2020-02-07 19:56:30 +00:00
Adrian-George Bostan
f7b5ffa954 Prevent extractor panic for invalid PDF text objects (#196)
* Prevent extractor panic for invalid PDF text objects
* Document text extraction behavior of invalid text objects
2019-10-30 20:36:35 +00:00
Peter Williams
aea4cb1d55 Make PageText.sortPosition() sort order deterministic. (#153) 2019-08-29 18:26:53 +00:00
Gunnsteinn Hall
21141a9d3e Add Append to TextMarkArray
Useful when processing and grouping text marks.
2019-08-04 09:29:21 +00:00
Gunnsteinn Hall
1d7b969b91 Simplify license loading and support environment variables 2019-08-04 09:28:42 +00:00
Peter Williams
9ebcfcf168 Finding bounding boxes of substrings of extracted text. (#109)
* Added text bounding box extraction.
* Add `font` field to textMark struct;
create a new method `TextComponents` to retrieve all the text components of the extracted text in the page, with position and character information
* Reorganizing extractor/text.go
* Added a text extraction position test.
* Added another text extraction location test.
* Text extraction location testing.
* Added tests for text extraction with location information.
* Cleaned up text extraction tests. No changes to functionality.
* Simplifying text extraction code.
* Simplified line construction in text.go
* Returning TextMarks in TextMarkArray, which is based on PdfObjectArray but read-only, so not pointers.
* Added text extraction to show PDFs marked up with bounding boxes of substrings in extracted text.
* Add comments explaining how to calculate text bounding boxes.
* Made text_test.go naming consistent with function comments in text.go
* Use tm, pt, tl for textMark/TextMark PageText and TextLine receivers and local variables.
* Uncommented text stress test. Use go test --short to skip
* TextMark.Offset is now an index into the extracted text. It was an index into []rune(text)
2019-07-18 06:41:47 +00:00
Adrian-George Bostan
c64812093d Remove pdf folder and move packages up one level (#2) 2019-05-16 20:44:51 +00:00