23 Commits

Author SHA1 Message Date
Gunnsteinn Hall
ee1416433c Use StandardEncoding for builtin standard fonts (not WinAnsiEncoding). Fix testcases.
Add test cases and fix the encoding table also based on observed errors
2019-03-28 09:57:33 +00:00
Gunnsteinn Hall
9665959bcf Move model/fonts to model/internal/fonts - reducing export surface
- Move the folder
- Update imports
- Add type aliases to access needed types from model (fonts.StdFont, fonts.CharMetrics and the font names)
2019-03-12 19:08:37 +00:00
Peter Williams
ca2b73bd7a Removed combineDiacritics from text extraction because it was causing ' and " to be combined with the letters proceeding them.
Need to fix this and reinstate combineDiacritics.
2019-01-01 12:22:39 +11:00
Denys Smirnov
53687f854e Merge remote-tracking branch 'origin/v3' into extract.text
# Conflicts:
#	pdf/contentstream/processor.go
#	pdf/extractor/text.go
#	pdf/extractor/utils.go
#	pdf/internal/textencoding/winansi.go
#	pdf/model/font.go
#	pdf/model/font_composite.go
#	pdf/model/font_simple.go
#	pdf/model/font_test.go
#	pdf/model/fontfile.go
#	pdf/model/fonts/ttfparser.go
#	pdf/model/structures.go
2018-12-27 12:17:28 +02:00
Gunnsteinn Hall
f04f83b271 Merge branch 'extract.text' of https://github.com/peterwilliams97/unidoc into v3-peterwilliams97-extract.text 2018-11-28 23:33:31 +00:00
Gunnsteinn Hall
520ab09a72 Addressing review comments 2018-11-28 23:25:17 +00:00
Peter Williams
36a1148962 Combine diacritics in text extraction. 2018-11-28 18:06:03 +11:00
Peter Williams
536c688001 Fixed orientation handling in text extraction. 2018-11-26 17:17:17 +11:00
Peter Williams
a815ca7271 Premultiply coordinate transforms to text matrix in text extraction. 2018-11-26 08:09:52 +11:00
Peter Williams
8b964f2008 Set font even when Tf operator is not between BT and ET. 2018-11-21 13:14:11 +11:00
Peter Williams
dcb2b14d55 Handle standard 14 TrueType fonts and stanard 14 font aliases in text extraction. 2018-11-20 17:49:37 +11:00
Peter Williams
cad144cec3 Handle missing widths in text extraction 2018-11-20 15:49:28 +11:00
Peter Williams
a9019a50a3 Fixes for text extraction corpus testing.
- Correct matrix multiplication order in text.go
- Look up standard 14 font widths after applying custom encoding.
2018-11-18 17:21:30 +11:00
Peter Williams
851aa267b1 Added test for position based text extraction 2018-11-12 11:04:09 +11:00
Peter Williams
a2342ec6c6 First attempt at getting font metrics by character code. 2018-11-08 15:20:12 +11:00
Peter Williams
a6ce81c001 Merge branch 'render.v3.hungarian' into extract 2018-11-02 15:13:48 +11:00
Peter Williams
3310b040db Don't import core anonymously 2018-07-15 17:22:00 +10:00
Peter Williams
6582182078 reduced differences with compositefont branch 2018-07-15 16:28:56 +10:00
Peter Williams
c9f2b87def Added NewStandard14Font() to make existing fonts.Font code work with *PdfFont 2018-07-07 09:45:55 +10:00
Peter Williams
d184031903 Updated the text extractor to use the new font code 2018-06-27 16:31:28 +10:00
Gunnsteinn Hall
a4fe3bded2 Add LICENSE.md with reference to AGPL and Commercial license. Add license header info to code. 2018-03-22 14:03:47 +00:00
Gunnsteinn Hall
d5396dd893 Fixes in extractor testing 2018-03-22 13:53:12 +00:00
Gunnsteinn Hall
817ea404b9 Extractor package with powerful text extraction capabilities and CMap handling. Closes #17 2018-03-22 13:01:04 +00:00