89 Commits

Author SHA1 Message Date
Peter Williams
e68c15d31c Corrected order of matrix multiplication for cm operator.
The change to Matrix.Concat made for this fix simplified some text extraction matrix code.
2019-01-22 18:18:27 +11:00
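Note on the commit above: the PDF specification defines the `cm` operator as premultiplying its operand onto the current transformation matrix (CTM), so under the row-vector convention the new CTM is `cm × CTM`. The following is a minimal sketch of that ordering. The `Matrix` type, `Mult`, `Transform`, and `applyCm` are hypothetical names chosen for illustration and are not the library's actual `Matrix.Concat` code.

```go
package main

import "fmt"

// Matrix is a hypothetical 2D affine transform in PDF order [a b c d e f],
// standing for the 3x3 matrix [a b 0; c d 0; e f 1].
type Matrix struct{ a, b, c, d, e, f float64 }

// Mult returns the product m × n under the PDF row-vector convention,
// i.e. a point is transformed by m first and then by n.
func (m Matrix) Mult(n Matrix) Matrix {
	return Matrix{
		a: m.a*n.a + m.b*n.c,
		b: m.a*n.b + m.b*n.d,
		c: m.c*n.a + m.d*n.c,
		d: m.c*n.b + m.d*n.d,
		e: m.e*n.a + m.f*n.c + n.e,
		f: m.e*n.b + m.f*n.d + n.f,
	}
}

// Transform applies m to the point (x, y).
func (m Matrix) Transform(x, y float64) (float64, float64) {
	return m.a*x + m.c*y + m.e, m.b*x + m.d*y + m.f
}

// applyCm returns the CTM after a cm operator: the operand is premultiplied.
func applyCm(ctm, cm Matrix) Matrix {
	return cm.Mult(ctm)
}

func main() {
	ctm := Matrix{1, 0, 0, 1, 10, 20} // existing CTM: translate by (10, 20)
	cm := Matrix{2, 0, 0, 2, 0, 0}    // cm operand: scale by 2

	x, y := applyCm(ctm, cm).Transform(1, 1)
	fmt.Println(x, y) // 12 22: scaled first, then translated

	x, y = ctm.Mult(cm).Transform(1, 1)
	fmt.Println(x, y) // 22 42: the reversed order gives a different result
}
```

With a translation already in the CTM and a scale supplied to `cm`, the two orders place the same point at different coordinates, which is the kind of error a reversed multiplication would introduce.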
Peter Williams
72c7fd37d0 (*pageText). -> pageText. 2019-01-05 14:10:54 +11:00
Peter Williams
6b1764c118 (*pt). -> pt. 2019-01-05 09:14:10 +11:00
Peter Williams
4aa7e5051e Changes missed in previous commit. 2019-01-04 16:07:03 +11:00
Peter Williams
e251b6b2f2 Made TextList an opaque struct and renamed it to PageText to reflect its purpose rather than its current implementation. 2019-01-04 16:02:22 +11:00
Peter Williams
4cb130c31f Fixed some typos. 2019-01-03 15:41:36 +11:00
Peter Williams
2f2b5c6ec1 Made many fields in text.go private. 2019-01-02 10:39:30 +11:00
Peter Williams
ca2b73bd7a Removed combineDiacritics from text extraction because it was causing ' and " to be combined with the letters preceding them.
Need to fix this and reinstate combineDiacritics.
2019-01-01 12:22:39 +11:00
Gunnsteinn Hall
8f031e7bdb remove panic in extractor 2018-12-27 17:18:52 +00:00
Denys Smirnov
dbbef4fd05 Merge remote-tracking branch 'peterwilliams97/extract.text' into extract.text
# Conflicts:
#	pdf/extractor/text.go
2018-12-27 12:40:55 +02:00
Peter Williams
c70b66a00d Fixed incorrectly named variable. 2018-12-27 21:33:31 +11:00
Denys Smirnov
53687f854e Merge remote-tracking branch 'origin/v3' into extract.text
# Conflicts:
#	pdf/contentstream/processor.go
#	pdf/extractor/text.go
#	pdf/extractor/utils.go
#	pdf/internal/textencoding/winansi.go
#	pdf/model/font.go
#	pdf/model/font_composite.go
#	pdf/model/font_simple.go
#	pdf/model/font_test.go
#	pdf/model/fontfile.go
#	pdf/model/fonts/ttfparser.go
#	pdf/model/structures.go
2018-12-27 12:17:28 +02:00
Peter Williams
2fe54a4269 Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into extract.text 2018-12-27 20:53:59 +11:00
Peter Williams
28957d37b8 fixed comment 2018-12-27 20:53:37 +11:00
Peter Williams
af99ee41db Recurse through form XObjects for text extraction. 2018-12-27 20:51:34 +11:00
Denys Smirnov
e729fa618d model: refactor CharcodesToUnicode to return string and remove TODO 2018-12-26 17:11:41 +02:00
Denys Smirnov
7f667d8fbb model: remove Standard14Font in favor of fonts.StdFont; resolves #269 2018-12-19 13:43:09 +05:00
Denys Smirnov
3687c83b37 errors should start with a lower case 2018-12-15 18:49:15 +05:00
Denys Smirnov
9f0df8945d don't use XXX for TODOs 2018-12-09 21:39:11 +02:00
Denys Smirnov
99f3184879 define slices with a var instead of an empty literal 2018-12-09 19:28:50 +02:00
Denys Smirnov
2658fe9c06 assert types for the new code as well 2018-12-07 18:43:24 +02:00
Gunnsteinn Hall
1f56c18454 Address review comments 2018-12-07 10:32:49 +00:00
Peter Williams
8c1c2aa926 left-to-write -> left-to-right 2018-12-02 18:41:48 +11:00
Peter Williams
d2f1728672 Addressed review comments.
- Removed debug code.
- Explained magic constants.
- Added file reference to PdfBox map.
2018-12-02 18:13:40 +11:00
Peter Williams
c4a39a1353 Look for CharMetrics for char code 32 when finding space width. 2018-12-02 13:12:10 +11:00
Peter Williams
835f329c28 Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into extract.text 2018-12-02 10:02:16 +11:00
Peter Williams
9c258551ad Documented font code. Fall back to StandardEncoding when no encoding is specified for a font. 2018-12-02 09:14:58 +11:00
Gunnsteinn Hall
2b1c796a74 Addressing review comments 2018-11-30 23:01:04 +00:00
Gunnsteinn Hall
283c9bf778 Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into v3-peterwilliams97-extract.text.take2 2018-11-30 17:05:49 +00:00
Gunnsteinn Hall
33843599f2 Another round of addressing review comments 2018-11-30 16:53:48 +00:00
Peter Williams
f566fe5f68 Moved point.go and matrix.go back to their original locations. 2018-11-30 12:17:52 +11:00
Peter Williams
785a83e866 Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into extract.text
NOTE: Fixed a text_test.go regression by modifying getCharCodeMetrics().
2018-11-30 10:46:33 +11:00
Peter Williams
7bbcec65fa Made Matrix and Point structs more general and moved them to their own files in pdf/model. 2018-11-29 17:04:20 +11:00
Gunnsteinn Hall
f04f83b271 Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into v3-peterwilliams97-extract.text 2018-11-28 23:33:31 +00:00
Gunnsteinn Hall
520ab09a72 Addressing review comments 2018-11-28 23:25:17 +00:00
Peter Williams
da8544e68b Moved Matrix code to model/matrix.go 2018-11-28 22:29:35 +11:00
Peter Williams
ad83b1c948 In text extraction, split lines with tolerance on y coordinate. 2018-11-28 22:13:56 +11:00
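Note on the commit above: one way to split extracted text into lines with a tolerance on the y coordinate is to sort fragments from the top of the page down and start a new line only when the vertical gap exceeds the tolerance. The sketch below illustrates that idea under stated assumptions. The `fragment` type, `groupLines`, and the tolerance value are hypothetical and are not taken from the extractor's code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// fragment is a hypothetical piece of extracted text with its page position.
type fragment struct {
	x, y float64
	text string
}

// groupLines groups fragments into lines. A fragment joins the current line
// when its y is within tol of the first fragment placed on that line.
func groupLines(frags []fragment, tol float64) []string {
	// Work on a copy, sorted top-down (PDF y increases up the page).
	sorted := append([]fragment(nil), frags...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].y > sorted[j].y })

	var lines [][]fragment
	for _, f := range sorted {
		n := len(lines)
		if n > 0 && lines[n-1][0].y-f.y <= tol {
			lines[n-1] = append(lines[n-1], f)
			continue
		}
		lines = append(lines, []fragment{f})
	}

	// Within each line, order fragments left to right and join them.
	out := make([]string, 0, len(lines))
	for _, line := range lines {
		sort.SliceStable(line, func(i, j int) bool { return line[i].x < line[j].x })
		words := make([]string, len(line))
		for i, f := range line {
			words[i] = f.text
		}
		out = append(out, strings.Join(words, " "))
	}
	return out
}

func main() {
	frags := []fragment{
		{72, 700.0, "Hello"},
		{120, 700.4, "world"}, // slightly different baseline, same line
		{72, 686.0, "Next"},
	}
	for _, line := range groupLines(frags, 1.0) {
		fmt.Println(line)
	}
}
```

An exact comparison on y would put "Hello" and "world" on separate lines because their baselines differ by 0.4 units; the tolerance keeps them together.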
Peter Williams
6529b42a70 Remove duplicate code. 2018-11-28 18:22:42 +11:00
Peter Williams
36a1148962 Combine diacritics in text extraction. 2018-11-28 18:06:03 +11:00
Peter Williams
f373881a48 Removed some unused struct fields. 2018-11-27 13:37:12 +11:00
Peter Williams
536c688001 Fixed orientation handling in text extraction. 2018-11-26 17:17:17 +11:00
Peter Williams
a815ca7271 Premultiply coordinate transforms to text matrix in text extraction. 2018-11-26 08:09:52 +11:00
Peter Williams
6e5e32dd92 Fixed encoding selection for standard 14 fonts. 2018-11-22 22:01:04 +11:00
Peter Williams
8b964f2008 Set font even when Tf operator is not between BT and ET. 2018-11-21 13:14:11 +11:00
Peter Williams
2f8b50af75 Fixed landscape rotation for text extraction.
Also compute metrics for standard 14 fonts when not created from dict.
2018-11-19 16:50:28 +11:00
Peter Williams
ea8a26a7dc Fixed text matrix multiplication order. 2018-11-19 14:19:50 +11:00
Peter Williams
a9019a50a3 Fixes for text extraction corpus testing.
- Correct matrix multiplication order in text.go
- Look up standard 14 font widths after applying custom encoding.
2018-11-18 17:21:30 +11:00
Peter Williams
851aa267b1 Added test for position based text extraction 2018-11-12 11:04:09 +11:00
Peter Williams
85cb1db004 Fixed position sorting for text extraction for landscape text. 2018-11-10 21:19:02 +11:00
Peter Williams
a2342ec6c6 First attempt at getting font metrics by character code. 2018-11-08 15:20:12 +11:00