unipdf

OrgGo/unipdf


mirror of https://github.com/unidoc/unipdf.git synced 2025-04-24 13:48:49 +08:00


Peter Williams 88fda44e0a

Text extraction code for columns. (#366)

* Fixed filename:page in logging

* Got CMap working for multi-rune entries

* Treat CMap entries as strings instead of runes to handle multi-byte encodings.

* Added a test for multibyte encoding.

* First version of text extraction that recognizes columns

* Added an explanation of the text columns code to README.md.

* fixed typos

* Abstracted textWord depth calculation. This required changing textMark to *textMark in a lot of code.

* Added function comments.

* Fixed text state save/restore.

* Adjusted inter-word search distance to make paragraph division work for thanh.pdf

* Got text_test.go passing.

* Reinstated hyphen suppression

* Handle more cases of fonts not being set in text extraction code.

* Fixed typo

* More verbose logging

* Adding tables to text extractor.

* Added tests for columns extraction.

* Removed commented code

* Check for textParas that are on the same line when writing out extracted text.

* Absorb text to the left of paras into paras e.g. Footnote numbers

* Removed funny character from text_test.go

* Commented out a creator_test.go test that was broken by my text extraction changes.

* Big changes to columns text extraction code for PR.

Performance improvements in several places.
Commented code.

* Updated extractor/README

* Cleaned up some comments and removed a panic

* Increased threshold for truncating extracted text when there is no license 100 -> 102.

This is a workaround to let a test in creator_test.go pass.

With the old text extraction code the following extracted text was 100 chars. With the new code it
is 102 chars which looks correct.

"你好\n你好你好你好你好\n河上白云\n\nUnlicensed UniDoc - Get a license on https://unidoc.io\n\n"

* Improved an error message.

* Removed irrelevant spaces

* Commented code and removed unused functions.

* Reverted PdfRectangle changes

* Added duplicate text detection.

* Combine diacritic textMarks in text extraction

* Reinstated a diacritic recombination test.

* Small code reorganisation

* Reinstated handling of rotated text

* Addressed issues in PR review

* Added color fields to TextMark

* Updated README

* Reinstated the disabled tests I missed before.

* Tightened definition for tables to prevent detection of tables where there weren't any.

* Compute line splitting search range based on fontsize of first word in word bag.

* Use errors.Is(err, core.ErrNotSupported) to distinguish unsupported font errors.

See https://blog.golang.org/go1.13-errors

* Fixed some naming and added some comments.

* errors.Is -> xerrors.Is and %w -> %v for go 1.12 compatibility

* Removed code that doesn't ever get called.

* Removed unused test

2020-06-30 19:33:10 +00:00

internal/fonts

Text extraction code for columns. (#366)

2020-06-30 19:33:10 +00:00

optimize

Font subsetting and font optimization improvements (#362)

2020-06-16 21:19:10 +00:00

sighandler

Add timestamp signature handler (#301)

2020-04-22 20:21:53 +00:00

testdata

Add NewPdfFontFromTTF(io.ReadSeeker) function. (#199)

2019-11-20 23:05:40 +00:00

action_test.go

Becoded action support (#161)

2019-08-30 08:50:30 +00:00

action.go

Becoded action support (#161)

2019-08-30 08:50:30 +00:00

annotations.go

Font subsetting and font optimization improvements (#362)

2020-06-16 21:19:10 +00:00

appender_test.go

Add timestamp signature handler (#301)

2020-04-22 20:21:53 +00:00

appender.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

colorspace_test.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

colorspace.go

Fixed PdfColorspaceSpecialIndexed.ImageToRGB() (#259)

2020-02-26 13:26:20 +00:00

const.go

Text extraction code for columns. (#366)

2020-06-30 19:33:10 +00:00

doc.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

fields.go

Form fill fixes (#328)

2020-04-24 16:48:06 +00:00

file_test.go

Becoded action support (#161)

2019-08-30 08:50:30 +00:00

file.go

Becoded action support (#161)

2019-08-30 08:50:30 +00:00

flatten.go

Combo field appearance (#370)

2020-06-10 16:58:00 +00:00

font_composite_test.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

font_composite.go

Text extraction code for columns. (#366)

2020-06-30 19:33:10 +00:00

font_simple.go

Add NewCompositePdfFontFromTTF to load composite TTF from memory

2020-04-18 10:37:10 +00:00

font_test.go

Text extraction code for columns. (#366)

2020-06-30 19:33:10 +00:00

font.go

Text extraction code for columns. (#366)

2020-06-30 19:33:10 +00:00

fontfile.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

form_test.go

Add reader method for checking if the AcroForm needs repair (#356)

2020-05-20 16:04:02 +00:00

form.go

Combo field appearance (#370)

2020-06-10 16:58:00 +00:00

functions_test.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

functions.go

Prevent Type 0 function evaluation crash (#309)

2020-04-15 21:05:20 +00:00

fuzz_test.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

image_test.go

Image memory optimizations (#149)

2019-08-22 20:15:16 +00:00

image.go

JBIG2 Encoder support for inserting binary images into PDF (#288)

2020-04-03 20:54:59 +00:00

model.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

optimizer.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

outline.go

Use page indirect object for internal outline destinations (#359)

2020-05-22 16:19:43 +00:00

outlines_test.go

Minor refactoring

2020-01-21 22:18:11 +02:00

outlines.go

Fix method comment typo

2020-01-15 23:36:07 +02:00

page_test.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

page.go

Font subsetting and font optimization improvements (#362)

2020-06-16 21:19:10 +00:00

pattern.go

Changes to make the lazy reader work on the PaperCut corpus (#194)

2019-10-28 20:49:07 +00:00

reader_test.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

reader.go

Fix outline null object check (#367)

2020-06-05 11:46:55 +00:00

resources.go

Changes to make the lazy reader work on the PaperCut corpus (#194)

2019-10-28 20:49:07 +00:00

shading.go

Changes to make the lazy reader work on the PaperCut corpus (#194)

2019-10-28 20:49:07 +00:00

signature_handler.go

Add timestamp signature handler (#301)

2020-04-22 20:21:53 +00:00

signature.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

structures.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

utils.go

Remove pdf folder and move packages up one level (#2)

2019-05-16 20:44:51 +00:00

writer_test.go

Fix error handling in write, with a testcase.

2020-04-18 13:48:44 +00:00

writer.go

Skip referenced pages which are not present in the catalog (#377)

2020-06-18 15:06:06 +00:00

xobject.go

JBIG2 Encoder support for inserting binary images into PDF (#288)

2020-04-03 20:54:59 +00:00