Peter Williams
|
e68c15d31c
|
Corrected order of matrix multiplication for cm operator.
The change to Matrix.Concat made for this fix simplified some text extraction matrix code.
|
2019-01-22 18:18:27 +11:00 |
|
Peter Williams
|
72c7fd37d0
|
(*pageText). -> pageText.
|
2019-01-05 14:10:54 +11:00 |
|
Peter Williams
|
6b1764c118
|
(*pt). -> pt.
|
2019-01-05 09:14:10 +11:00 |
|
Peter Williams
|
4aa7e5051e
|
Changes missed in previous commit.
|
2019-01-04 16:07:03 +11:00 |
|
Peter Williams
|
e251b6b2f2
|
Made TextList an opaque struct and renamed it to PageText to reflect its purpose rather than its current implementation.
|
2019-01-04 16:02:22 +11:00 |
|
Peter Williams
|
4cb130c31f
|
Fixed some typos.
|
2019-01-03 15:41:36 +11:00 |
|
Peter Williams
|
2f2b5c6ec1
|
Made many fields in text.go private.
|
2019-01-02 10:39:30 +11:00 |
|
Peter Williams
|
ca2b73bd7a
|
Removed combineDiacritics from text extraction because it was causing ' and " to be combined with the letters preceding them.
Need to fix this and reinstate combineDiacritics.
|
2019-01-01 12:22:39 +11:00 |
|
Gunnsteinn Hall
|
8f031e7bdb
|
remove panic in extractor
|
2018-12-27 17:18:52 +00:00 |
|
Denys Smirnov
|
dbbef4fd05
|
Merge remote-tracking branch 'peterwilliams97/extract.text' into extract.text
# Conflicts:
# pdf/extractor/text.go
|
2018-12-27 12:40:55 +02:00 |
|
Peter Williams
|
c70b66a00d
|
Fixed incorrectly named variable.
|
2018-12-27 21:33:31 +11:00 |
|
Denys Smirnov
|
53687f854e
|
Merge remote-tracking branch 'origin/v3' into extract.text
# Conflicts:
# pdf/contentstream/processor.go
# pdf/extractor/text.go
# pdf/extractor/utils.go
# pdf/internal/textencoding/winansi.go
# pdf/model/font.go
# pdf/model/font_composite.go
# pdf/model/font_simple.go
# pdf/model/font_test.go
# pdf/model/fontfile.go
# pdf/model/fonts/ttfparser.go
# pdf/model/structures.go
|
2018-12-27 12:17:28 +02:00 |
|
Peter Williams
|
2fe54a4269
|
Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into extract.text
|
2018-12-27 20:53:59 +11:00 |
|
Peter Williams
|
28957d37b8
|
fixed comment
|
2018-12-27 20:53:37 +11:00 |
|
Peter Williams
|
af99ee41db
|
Recurse through form XObjects for text extraction.
|
2018-12-27 20:51:34 +11:00 |
|
Denys Smirnov
|
e729fa618d
|
model: refactor CharcodesToUnicode to return string and remove TODO
|
2018-12-26 17:11:41 +02:00 |
|
Denys Smirnov
|
7f667d8fbb
|
model: remove Standard14Font in favor of fonts.StdFont; resolves #269
|
2018-12-19 13:43:09 +05:00 |
|
Denys Smirnov
|
3687c83b37
|
errors should start with a lower case
|
2018-12-15 18:49:15 +05:00 |
|
Denys Smirnov
|
9f0df8945d
|
don't use XXX for TODOs
|
2018-12-09 21:39:11 +02:00 |
|
Denys Smirnov
|
99f3184879
|
define slices with a var instead of an empty literal
|
2018-12-09 19:28:50 +02:00 |
|
Denys Smirnov
|
2658fe9c06
|
assert types for the new code as well
|
2018-12-07 18:43:24 +02:00 |
|
Gunnsteinn Hall
|
1f56c18454
|
Address review comments
|
2018-12-07 10:32:49 +00:00 |
|
Peter Williams
|
8c1c2aa926
|
left-to-write -> left-to-right
|
2018-12-02 18:41:48 +11:00 |
|
Peter Williams
|
d2f1728672
|
Addressed review comments.
- Removed debug code.
- Explained magic constants.
- Added file reference to PdfBox map.
|
2018-12-02 18:13:40 +11:00 |
|
Peter Williams
|
c4a39a1353
|
Look for CharMetrics for char code 32 when finding space width.
|
2018-12-02 13:12:10 +11:00 |
|
Peter Williams
|
835f329c28
|
Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into extract.text
|
2018-12-02 10:02:16 +11:00 |
|
Peter Williams
|
9c258551ad
|
Documented font code. Fall back to StandardEncoding when no encoding is specified for a font.
|
2018-12-02 09:14:58 +11:00 |
|
Gunnsteinn Hall
|
2b1c796a74
|
Addressing review comments
|
2018-11-30 23:01:04 +00:00 |
|
Gunnsteinn Hall
|
283c9bf778
|
Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into v3-peterwilliams97-extract.text.take2
|
2018-11-30 17:05:49 +00:00 |
|
Gunnsteinn Hall
|
33843599f2
|
Another round of addressing review comments
|
2018-11-30 16:53:48 +00:00 |
|
Peter Williams
|
f566fe5f68
|
Moved point.go and matrix.go back to their original locations.
|
2018-11-30 12:17:52 +11:00 |
|
Peter Williams
|
785a83e866
|
Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into extract.text
NOTE: Fixed a text_test.go regression by modifying getCharCodeMetrics().
|
2018-11-30 10:46:33 +11:00 |
|
Peter Williams
|
7bbcec65fa
|
Made Matrix and Point structs more general and moved them to their own files in pdf/model.
|
2018-11-29 17:04:20 +11:00 |
|
Gunnsteinn Hall
|
f04f83b271
|
Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into v3-peterwilliams97-extract.text
|
2018-11-28 23:33:31 +00:00 |
|
Gunnsteinn Hall
|
520ab09a72
|
Addressing review comments
|
2018-11-28 23:25:17 +00:00 |
|
Peter Williams
|
da8544e68b
|
Moved Matrix code to model/matrix.go
|
2018-11-28 22:29:35 +11:00 |
|
Peter Williams
|
ad83b1c948
|
In text extraction, split lines with tolerance on y coordinate.
|
2018-11-28 22:13:56 +11:00 |
|
Peter Williams
|
6529b42a70
|
Remove duplicate code.
|
2018-11-28 18:22:42 +11:00 |
|
Peter Williams
|
36a1148962
|
Combine diacritics in text extraction.
|
2018-11-28 18:06:03 +11:00 |
|
Peter Williams
|
f373881a48
|
Removed some unused struct fields.
|
2018-11-27 13:37:12 +11:00 |
|
Peter Williams
|
536c688001
|
Fixed orientation handling in text extraction.
|
2018-11-26 17:17:17 +11:00 |
|
Peter Williams
|
a815ca7271
|
Premultiply coordinate transforms to text matrix in text extraction.
|
2018-11-26 08:09:52 +11:00 |
|
Peter Williams
|
6e5e32dd92
|
Fixed encoding selection for standard 14 fonts.
|
2018-11-22 22:01:04 +11:00 |
|
Peter Williams
|
8b964f2008
|
Set font even when Tf operator is not between BT and ET.
|
2018-11-21 13:14:11 +11:00 |
|
Peter Williams
|
2f8b50af75
|
Fixed landscape rotation for text extraction.
Also compute metrics for standard 14 fonts when not created from dict.
|
2018-11-19 16:50:28 +11:00 |
|
Peter Williams
|
ea8a26a7dc
|
Fixed text matrix multiplication order.
|
2018-11-19 14:19:50 +11:00 |
|
Peter Williams
|
a9019a50a3
|
Fixes for text extraction corpus testing.
- Correct matrix multiplication order in text.go
- Look up standard 14 font widths after applying custom encoding.
|
2018-11-18 17:21:30 +11:00 |
|
Peter Williams
|
851aa267b1
|
Added test for position based text extraction
|
2018-11-12 11:04:09 +11:00 |
|
Peter Williams
|
85cb1db004
|
Fixed position sorting for text extraction for landscape text.
|
2018-11-10 21:19:02 +11:00 |
|
Peter Williams
|
a2342ec6c6
|
First attempt at getting font metrics by character code.
|
2018-11-08 15:20:12 +11:00 |
|