82 Commits

Author SHA1 Message Date
Peter Williams
ca2b73bd7a Removed combineDiacritics from text extraction because it was causing ' and " to be combined with the letters preceding them.
Need to fix this and reinstate combineDiacritics.
2019-01-01 12:22:39 +11:00
Gunnsteinn Hall
8f031e7bdb remove panic in extractor 2018-12-27 17:18:52 +00:00
Denys Smirnov
dbbef4fd05 Merge remote-tracking branch 'peterwilliams97/extract.text' into extract.text
# Conflicts:
#	pdf/extractor/text.go
2018-12-27 12:40:55 +02:00
Peter Williams
c70b66a00d Fixed incorrectly named variable. 2018-12-27 21:33:31 +11:00
Denys Smirnov
53687f854e Merge remote-tracking branch 'origin/v3' into extract.text
# Conflicts:
#	pdf/contentstream/processor.go
#	pdf/extractor/text.go
#	pdf/extractor/utils.go
#	pdf/internal/textencoding/winansi.go
#	pdf/model/font.go
#	pdf/model/font_composite.go
#	pdf/model/font_simple.go
#	pdf/model/font_test.go
#	pdf/model/fontfile.go
#	pdf/model/fonts/ttfparser.go
#	pdf/model/structures.go
2018-12-27 12:17:28 +02:00
Peter Williams
2fe54a4269 Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into extract.text 2018-12-27 20:53:59 +11:00
Peter Williams
28957d37b8 fixed comment 2018-12-27 20:53:37 +11:00
Peter Williams
af99ee41db Recurse through form XObjects for text extraction. 2018-12-27 20:51:34 +11:00
Denys Smirnov
e729fa618d model: refactor CharcodesToUnicode to return string and remove TODO 2018-12-26 17:11:41 +02:00
Denys Smirnov
7f667d8fbb model: remove Standard14Font in favor of fonts.StdFont; resolves #269 2018-12-19 13:43:09 +05:00
Denys Smirnov
3687c83b37 errors should start with a lower case 2018-12-15 18:49:15 +05:00
Denys Smirnov
9f0df8945d don't use XXX for TODOs 2018-12-09 21:39:11 +02:00
Denys Smirnov
99f3184879 define slices with a var instead of an empty literal 2018-12-09 19:28:50 +02:00
Denys Smirnov
2658fe9c06 assert types for the new code as well 2018-12-07 18:43:24 +02:00
Gunnsteinn Hall
1f56c18454 Address review comments 2018-12-07 10:32:49 +00:00
Peter Williams
8c1c2aa926 left-to-write -> left-to-right 2018-12-02 18:41:48 +11:00
Peter Williams
d2f1728672 Addressed review comments.
- Removed debug code.
- Explained magic constants.
- Added file reference to PdfBox map.
2018-12-02 18:13:40 +11:00
Peter Williams
c4a39a1353 Look for CharMetrics for char code 32 when finding space width. 2018-12-02 13:12:10 +11:00
Peter Williams
835f329c28 Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into extract.text 2018-12-02 10:02:16 +11:00
Peter Williams
9c258551ad Documented font code. Fall back to StandardEncoding when no encoding is specified for a font. 2018-12-02 09:14:58 +11:00
Gunnsteinn Hall
2b1c796a74 Addressing review comments 2018-11-30 23:01:04 +00:00
Gunnsteinn Hall
283c9bf778 Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into v3-peterwilliams97-extract.text.take2 2018-11-30 17:05:49 +00:00
Gunnsteinn Hall
33843599f2 Another round of addressing review comments 2018-11-30 16:53:48 +00:00
Peter Williams
f566fe5f68 Moved point.go and matrix.go back to their original locations. 2018-11-30 12:17:52 +11:00
Peter Williams
785a83e866 Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into extract.text
NOTE: Fixed a text_test.go regression by modifying getCharCodeMetrics().
2018-11-30 10:46:33 +11:00
Peter Williams
7bbcec65fa Made Matrix and Point structs more general and moved them to their own files in pdf/model. 2018-11-29 17:04:20 +11:00
Gunnsteinn Hall
f04f83b271 Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into v3-peterwilliams97-extract.text 2018-11-28 23:33:31 +00:00
Gunnsteinn Hall
520ab09a72 Addressing review comments 2018-11-28 23:25:17 +00:00
Peter Williams
da8544e68b Moved Matrix code to model/matrix.go 2018-11-28 22:29:35 +11:00
Peter Williams
ad83b1c948 In text extraction, split lines with tolerance on y coordinate. 2018-11-28 22:13:56 +11:00
Peter Williams
6529b42a70 Remove duplicate code. 2018-11-28 18:22:42 +11:00
Peter Williams
36a1148962 Combine diacritics in text extraction. 2018-11-28 18:06:03 +11:00
Peter Williams
f373881a48 Removed some unused struct fields. 2018-11-27 13:37:12 +11:00
Peter Williams
536c688001 Fixed orientation handling in text extraction. 2018-11-26 17:17:17 +11:00
Peter Williams
a815ca7271 Premultiply coordinate transforms to text matrix in text extraction. 2018-11-26 08:09:52 +11:00
Peter Williams
6e5e32dd92 Fixed encoding selection for standard 14 fonts. 2018-11-22 22:01:04 +11:00
Peter Williams
8b964f2008 Set font even when Tf operator is not between BT and ET. 2018-11-21 13:14:11 +11:00
Peter Williams
2f8b50af75 Fixed landscape rotation for text extraction.
Also compute metrics for standard 14 fonts when not created from dict.
2018-11-19 16:50:28 +11:00
Peter Williams
ea8a26a7dc Fixed text matrix multiplication order. 2018-11-19 14:19:50 +11:00
Peter Williams
a9019a50a3 Fixes for text extraction corpus testing.
- Correct matrix multiplication order in text.go
- Look up standard 14 font widths after applying custom encoding.
2018-11-18 17:21:30 +11:00
Peter Williams
851aa267b1 Added test for position based text extraction 2018-11-12 11:04:09 +11:00
Peter Williams
85cb1db004 Fixed position sorting for text extraction for landscape text. 2018-11-10 21:19:02 +11:00
Peter Williams
a2342ec6c6 First attempt at getting font metrics by character code. 2018-11-08 15:20:12 +11:00
Peter Williams
b0c440dd00 Fixed text position tracking. 2018-10-30 21:55:30 +11:00
Peter Williams
2c8c8e5c98 Removed debugging code. 2018-10-09 19:05:38 +11:00
Peter Williams
89d1bce9da testing hack 2018-10-09 13:47:43 +11:00
Peter Williams
f6dc3e2fc3 First attempt at splitting words in text extraction using a space detection heuristic 2018-10-09 11:49:59 +11:00
Peter Williams
24d522bdb2 Merge branch 'v3' of https://github.com/unidoc/unidoc into extract 2018-09-24 15:25:44 +10:00
Peter Williams
c76fa6985e Moved font cache from global variable to Extractor. 2018-09-22 09:28:18 +10:00
Peter Williams
69be54d501 Cleaned up some comments. 2018-09-21 16:43:10 +10:00