unipdf/extractor/extractor.go
Gunnsteinn Hall 11f692bc3a
Font subsetting and font optimization improvements (#362)
* Track runes in IdentityEncoder (for subsetting), track decoded runes

* Working with the identity encoder in font_composite.go

* Add GetFilterArray to multi encoder.  Add comments.

* Add NewFromContents constructor to extractor only requiring contents and resources

* golint fixes

* Optimizer compress streams - improved detection of raw streams

* Optimize - CleanContentStream optimizer that removes redundant operands

* WIP Optimize - clean fonts

Will support both font file reduction and subsetting. (WIP)

* Optimize - image processing - try combined DCT and Flate

* Update options.go

* Update optimizer.go

* Create utils.go for optimize with common methods needed for optimization

* Optimizer - add font subsetting method

Covers XObject Forms, annotations etc.  Uses extractor package to extract text marks covering what fonts and glyphs are used.  Package truetype used for subsetting.

* Add some comments

* Fix cmap parsing rune conversion

* Error checking for extractor.  Add some comments.

* Update Jenkinsfile

* Update modules
2020-06-16 21:19:10 +00:00


/*
 * This file is subject to the terms and conditions defined in
 * file 'LICENSE.md', which is part of this source code package.
 */

package extractor

import (
	"github.com/unidoc/unipdf/v3/model"
)

// Extractor stores and offers functionality for extracting content from PDF pages.
type Extractor struct {
	// stream contents and resources for page
	contents  string
	resources *model.PdfPageResources

	// fontCache is a simple LRU cache that is used to prevent redundant constructions of PdfFont's from
	// PDF objects. NOTE: This is not a conventional glyph cache. It only caches PdfFont's.
	fontCache map[string]fontEntry

	// text results from running extractXYText on forms within the page.
	// TODO(peterwilliams): Cache this map across all pages in a PDF to speed up processing.
	formResults map[string]textResult

	// accessCount is used to set fontEntry.access to an incrementing number.
	accessCount int64

	// textCount is an incrementing number used to identify XYText objects.
	textCount int64
}

// New returns an Extractor instance for extracting content from the input PDF page.
func New(page *model.PdfPage) (*Extractor, error) {
	contents, err := page.GetAllContentStreams()
	if err != nil {
		return nil, err
	}

	// Uncomment these lines to see the contents of the page. For debugging.
	// fmt.Println("========================= +++ =========================")
	// fmt.Printf("%s\n", contents)
	// fmt.Println("========================= ::: =========================")

	return NewFromContents(contents, page.Resources)
}

// NewFromContents creates a new extractor from contents and page resources.
func NewFromContents(contents string, resources *model.PdfPageResources) (*Extractor, error) {
	e := &Extractor{
		contents:    contents,
		resources:   resources,
		fontCache:   map[string]fontEntry{},
		formResults: map[string]textResult{},
	}
	return e, nil
}