
Adds a table of contents (TOC) to a Markdown document and, optionally, links from each header back to the TOC.

Usage

add_toc(
  md,
  min_tier = 2L,
  max_tier = 6L,
  position = "above",
  md_flavor = c("github", "gitlab"),
  add_title = TRUE,
  title = "Table of contents",
  title_tier = min_tier,
  add_backlinks = add_title,
  backlink_strings = c("↑", "↓"),
  backlink_position = c("before", "after"),
  listing_style = c("-", "*", "ordered", "indented"),
  toc_id = "toc",
  old_toc_id = toc_id
)

Arguments

md

(R) Markdown document to be processed as a single file path, a single URL or a character vector (one string per line).

min_tier

Minimum tier of headers (<h1> to <h6>) to include in the TOC. Integer between 1 and 6. For example, min_tier = 2 creates TOC entries for all headers of tier <h2> and below.

max_tier

Maximum tier of headers (<h1> to <h6>) to include in the TOC. Integer between 1 and 6. For example, max_tier = 5 creates TOC entries for all headers down to <h5>. max_tier must be >= min_tier.
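As a rough illustration of how the two bounds select headers, the following base R sketch detects ATX-style headers (`#` to `######`) and keeps those whose tier lies between `min_tier` and `max_tier`. The helper `header_tiers()` is hypothetical and deliberately simplified (it ignores Setext headers and code blocks); it is not the parsing logic of `add_toc()`.

```r
# Hypothetical helper: tier (1-6) of each ATX header line, NA for other lines.
header_tiers <- function(lines) {
  # Up to 3 spaces of indentation, 1-6 '#', then a space/tab or end of line
  is_header <- grepl("^ {0,3}#{1,6}([ \t]|$)", lines)
  tiers <- rep(NA_integer_, length(lines))
  tiers[is_header] <- nchar(sub("^ {0,3}(#{1,6}).*$", "\\1", lines[is_header]))
  tiers
}

lines <- c("# Title", "", "## First", "### Detail", "#### Deep", "text")
tiers <- header_tiers(lines)

# Assumed bounds for this illustration
min_tier <- 2L
max_tier <- 3L
in_toc <- !is.na(tiers) & tiers >= min_tier & tiers <= max_tier

lines[in_toc]
#> [1] "## First"   "### Detail"
```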

position

Position in the Markdown document at which to add the TOC. Possible values include:

  • "top": The very beginning of the document, i.e. the first line.

  • "bottom": The very end of the document, i.e. the last line.

  • "above": Above the lines between the uppermost header of tier <= min_tier and the next header above (if any).

  • "below": Below the lines between the uppermost header of tier <= min_tier and the next header above (if any), i.e. right above the uppermost header of tier <= min_tier.

  • "none": Only remove a possibly existing TOC.

  • A line number, given as a positive integer.
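The example output in the Examples section wraps the generated TOC in `TOC BEGIN` / `TOC END` HTML comments. Assuming those two marker lines delimit the TOC, the core effect of `position = "none"` can be pictured with this simplified base R sketch. `strip_toc()` is a hypothetical helper, not part of the package, and it leaves backlinks and surrounding blank lines alone.

```r
# Marker lines as they appear in the example output
begin <- "<!-- TOC BEGIN -- leave this comment untouched to allow auto update -->"
end   <- "<!-- TOC END -- leave this comment untouched to allow auto update -->"

# Hypothetical helper: drop everything from the BEGIN marker to the END marker.
strip_toc <- function(lines) {
  i <- which(lines == begin)
  j <- which(lines == end)
  if (length(i) == 1L && length(j) == 1L && i < j) lines[-(i:j)] else lines
}

doc <- c("# Title", begin, "", "## Table of contents", "",
         "- [First](#first)", "", end, "", "## First")

strip_toc(doc)
#> [1] "# Title"  ""         "## First"
```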

md_flavor

Markdown flavor. Possible values include:

  • "github":

  • "gitlab":
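The flavor presumably matters for how header anchors are derived. The example output in the Examples section links to `#table-of-contents` and `#-introduction`, which is consistent with GitHub-style anchors: lowercase the header text, drop punctuation and symbols, and turn spaces into hyphens. The sketch below is an ASCII-only approximation of that rule; `github_slug()` is a hypothetical helper, and neither GitHub's nor the package's exact algorithm is reproduced here.

```r
# Hypothetical, ASCII-only approximation of GitHub-style header anchors.
github_slug <- function(text) {
  s <- tolower(text)
  s <- gsub("[^a-z0-9 _-]", "", s)  # drop punctuation and symbols such as the arrow
  gsub(" ", "-", s, fixed = TRUE)   # each space becomes a hyphen
}

github_slug("Table of contents")
#> [1] "table-of-contents"

# A header carrying a backlink renders as "<arrow> Introduction"; the arrow is
# dropped but the space after it remains, hence the leading hyphen.
github_slug("\u2191 Introduction")
#> [1] "-introduction"
```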

add_title

Include a TOC title? Logical. Note that no backlinks are added at all if add_title = FALSE and no header line is found above position, even if add_backlinks = TRUE.

title

Title of the TOC. A character scalar.

title_tier

Tier/formatting of TOC title. Possible values include:

  • An integer between 1L and 6L representing the <h1> to <h6> tier.

  • "regular": Simple unformatted non-header text.

  • "bold": Bold (<strong>) non-header text.

  • "italic": Italic (<em>) non-header text.

add_backlinks

Add a link back to the TOC to each Markdown header. A logical scalar. Note that if add_backlinks = TRUE and add_title = FALSE, as a fallback the backlinks point to the next header line above position (if any found). This will also be the case if md_flavor = "gitlab" and title_tier is set to a non-header value ("regular", "bold" or "italic") because GitLab currently ignores manually set HTML <id> attributes.

backlink_strings

String(s) to use as link text back to the TOC. A character vector of length 1 or 2. If two strings are provided, the first one will be used for backlinks below position, the second one for backlinks above position. Note that at least Unicode 7.0 support is required for the default symbols \U1F805 and \U1F807 to be correctly displayed.

backlink_position

Position of the backlinks. Possible values include:

  • "before": Before the actual header text.

  • "after": After the actual header text.
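To make the two positions concrete, here is a minimal base R sketch that decorates a single ATX header with a backlink. `add_backlink()` is a hypothetical helper. The "before" layout matches the example output in the Examples section, while the "after" layout is an assumption about what the package emits.

```r
# Hypothetical helper: add a backlink to one ATX header line.
add_backlink <- function(header,
                         target = "#table-of-contents",
                         string = "\u2191",
                         position = c("before", "after")) {
  position <- match.arg(position)
  # parts: full match, the '#' run, the header text
  parts <- regmatches(header, regexec("^(#{1,6}) +(.*)$", header))[[1]]
  link <- paste0("[", string, "](", target, ")")
  if (position == "before") {
    paste(parts[2], link, parts[3])
  } else {
    paste(parts[2], parts[3], link)  # assumed layout for "after"
  }
}

add_backlink("## Introduction")
#> [1] "## [↑](#table-of-contents) Introduction"
```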

listing_style

Format to use for listing the TOC entries. Possible values include:

  • "-": Create an unordered list using a hyphen as listing symbol.

  • "*": Create an unordered list using an asterisk as listing symbol.

  • "ordered": Create an ordered list using 1, 2, 3, ... as listing symbols.

  • "indented": Use non-breaking spaces (&nbsp;) to create visual indentation. Useful if the headers are already numbered.
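The list styles can be pictured with a small base R sketch that turns header titles and tiers into nested list lines. `render_toc()` is a hypothetical helper covering only the three list-based styles. The four-space nesting mirrors the example output, and using `1.` for every ordered item relies on the CommonMark rule that only the first number of an ordered list is significant.

```r
# Hypothetical helper: render TOC lines for the list-based styles.
render_toc <- function(titles, tiers, style = c("-", "*", "ordered")) {
  style <- match.arg(style)
  depth <- tiers - min(tiers)        # 0 for the top-most tier
  indent <- strrep("    ", depth)    # four spaces per nesting level
  marker <- if (style == "ordered") "1." else style
  paste0(indent, marker, " ", titles)
}

titles <- c("Introduction", "Installation", "Building from source", "Getting started")
tiers  <- c(2L, 2L, 3L, 2L)

writeLines(render_toc(titles, tiers, "-"))
#> - Introduction
#> - Installation
#>     - Building from source
#> - Getting started
```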

toc_id

HTML <id> attribute of the TOC title if title_tier is set to a non-header value ("regular", "bold" or "italic"). A character scalar.

old_toc_id

HTML <id> attribute of the old TOC title (in order to have old backlinks with an ID other than toc_id removed). A character scalar.

Value

The processed Markdown document as a character vector (one string per line).
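Because the result is a plain character vector, it can be written back with `writeLines()`. The following sketch round-trips a temporary file. The file contents are made up for illustration, and the call to `tocr::add_toc()` is guarded so the snippet also runs where the package is not installed.

```r
# Create a small Markdown file to work on (illustrative content)
path <- tempfile(fileext = ".md")
writeLines(c("# Title", "", "## First", "", "## Second"), path)

# Read it as one string per line, the form `md` accepts
md_lines <- readLines(path)

# Add the TOC if the package is available; otherwise keep the lines unchanged
if (requireNamespace("tocr", quietly = TRUE)) {
  md_lines <- tocr::add_toc(md = md_lines)
}

# Write the (possibly updated) document back to disk
writeLines(md_lines, path)
```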

Details

This function tries to adhere to the CommonMark specification, i.e. to interpret the Markdown syntax the same way the commonmark.js reference implementation at <try.commonmark.org> does.
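One practical consequence of following CommonMark is that a line starting with `#` inside a fenced code block is code, not a header. In the example output in the Examples section, the shell comment `# Build dependencies` accordingly does not appear in the TOC. The base R sketch below illustrates the idea under simplifying assumptions (only unindented backtick fences, no tilde fences or nesting); it is not the package's parser.

```r
fence <- strrep("`", 3)  # three backticks, built this way to keep the snippet readable

lines <- c(
  "## Building from source",
  "",
  paste0(fence, "sh"),
  "# Build dependencies",
  "yum install wget",
  fence,
  "",
  "## Getting started"
)

# An odd number of fence lines seen so far means we are inside a code block
in_fence <- cumsum(startsWith(lines, fence)) %% 2 == 1

# Header candidates are ATX lines that are not inside a fenced block
is_header <- grepl("^#{1,6} ", lines) & !in_fence

lines[is_header]
#> [1] "## Building from source" "## Getting started"
```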

Examples

md <- paste0("https://raw.githubusercontent.com/ropensci/pdftools/",
             "e7248d9956c7e73968628fa3a8ed37f0a8c23b37/README.md")

md |>
  tocr::add_toc(position = 9) |>
  pal::cat_lines()
#> # pdftools
#> 
#> [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
#> [![Build Status](https://travis-ci.org/ropensci/pdftools.svg?branch=master)](https://travis-ci.org/ropensci/pdftools)
#> [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pdftools?branch=master&svg=true)](https://ci.appveyor.com/project/jeroen/pdftools)
#> [![Coverage Status](https://codecov.io/github/ropensci/pdftools/coverage.svg?branch=master)](https://codecov.io/github/ropensci/pdftools?branch=master)
#> [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/pdftools)](http://cran.r-project.org/package=pdftools)
#> [![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/pdftools)](http://cran.r-project.org/web/packages/pdftools/index.html)
#> 
#> <!-- TOC BEGIN -- leave this comment untouched to allow auto update -->
#> 
#> ## Table of contents
#> 
#> - [Introduction](#-introduction)
#> - [Installation](#-installation)
#>     - [Building from source](#-building-from-source)
#> - [Getting started](#-getting-started)
#> - [Bonus feature: rendering pdf](#-bonus-feature-rendering-pdf)
#> - [Limitations](#-limitations)
#> 
#> <!-- TOC END -- leave this comment untouched to allow auto update -->
#> 
#> ## [↑](#table-of-contents) Introduction
#> 
#> Scientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from pdf files in R. From the extracted plain-text one could find articles discussing a particular drug or species name, without having to rely on publishers providing metadata, or pay-walled search engines.
#> 
#> The pdftools slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler c++ interface together with Rcpp, which results in a lighter and more portable implementation.
#> 
#> 
#> ## [↑](#table-of-contents) Installation
#> 
#> On Windows and Mac the binary packages can be installed directly from CRAN:
#> 
#> ```r
#> install.packages("pdftools")
#> ```
#> 
#> Installation on Linux requires the poppler development library. On Debian/Ubuntu:
#> 
#> ```
#> sudo apt-get install libpoppler-cpp-dev
#> ```
#> 
#> If you want to install the package from source on Mac OS-X you need brew:
#> 
#> ```
#> brew install poppler
#> ```
#> 
#> On Fedora:
#> 
#> ```
#> sudo yum install poppler-cpp-devel
#> ```
#> 
#> ### [↑](#table-of-contents) Building from source
#> 
#> On CentOS the `libpoppler-cpp` library is not included with the system so we need to build from source. Note that recent versions of poppler require C++11 which is not available on CentOS, so we build a slightly older version of libpoppler.
#> 
#> ```sh
#> # Build dependencies
#> yum install wget xz libjpeg-devel openjpeg2-devel
#> 
#> # Download and extract
#> wget https://poppler.freedesktop.org/poppler-0.47.0.tar.xz
#> tar -Jxvf poppler-0.47.0.tar.xz
#> cd poppler-0.47.0
#> 
#> # Build and install
#> ./configure
#> make
#> sudo make install
#> ```
#> 
#> By default libraries get installed in `/usr/local/lib` and `/usr/local/include`. On CentOS this is not a default search path so we need to set `PKG_CONFIG_PATH` and  `LD_LIBRARY_PATH` to point R to the right directory:
#> 
#> ```sh
#> export LD_LIBRARY_PATH="/usr/local/lib"
#> export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"
#> ```
#> 
#> We can then start R and install `pdftools`.
#> 
#> ## [↑](#table-of-contents) Getting started
#> 
#> The `?pdftools` manual page shows a brief overview of the main utilities. The most important function is `pdf_text` which returns a character vector of length equal to the number of pages in the pdf. Each string in the vector contains a plain text version of the text on that page.
#> 
#> ```r
#> library(pdftools)
#> download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
#> txt <- pdf_text("1403.2805.pdf")
#> 
#> # first page text
#> cat(txt[1])
#> 
#> # second page text
#> cat(txt[2])
#> ```
#> 
#> In addition, the package has some utilities to extract other data from the PDF file. The `pdf_toc` function shows the table of contents, i.e. the section headers which pdf readers usually display in a menu on the left. It looks pretty in JSON:
#> 
#> ```r
#> # Table of contents
#> toc <- pdf_toc("1403.2805.pdf")
#> 
#> # Show as JSON
#> jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
#> ```
#> 
#> Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.
#> 
#> 
#> ```r
#> # Author, version, etc
#> info <- pdf_info("1403.2805.pdf")
#> 
#> # Table with fonts
#> fonts <- pdf_fonts("1403.2805.pdf")
#> ```
#> 
#> ## [↑](#table-of-contents) Bonus feature: rendering pdf
#> 
#> A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use `pdf_render_page` to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.
#> 
#> ```r
#> # renders pdf to bitmap array
#> bitmap <- pdf_render_page("1403.2805.pdf", page = 1)
#> 
#> # save bitmap image
#> png::writePNG(bitmap, "page.png")
#> jpeg::writeJPEG(bitmap, "page.jpeg")
#> webp::write_webp(bitmap, "page.webp")
#> ```
#> 
#> This feature is still experimental and currently does not work on Windows.
#> 
#> ## [↑](#table-of-contents) Limitations
#> 
#> Data scientists are often interested in data from tables. Unfortunately the pdf format is pretty dumb and does not have notion of a table (unlike for example HTML). Tabular data in a pdf file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data.
#> 
#> ```r
#> txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")
#> 
#> # some tables
#> cat(txt[18])
#> cat(txt[19])
#> ```
#> 
#> Pdftools usually does a decent job in retaining the positioning of table elements when converting from pdf to text. But the output is still very dependent on the formatting of the original pdf table, which makes it very difficult to write a generic table extractor. But with a little creativity you might be able to parse the table data from the text output of a given paper.
#> 
#> 
#> [![](http://ropensci.org/public_images/github_footer.png)](http://ropensci.org)

md |>
  tocr::add_toc(listing_style = "ordered") |>
  pal::cat_lines()
#> # pdftools
#> 
#> <!-- TOC BEGIN -- leave this comment untouched to allow auto update -->
#> 
#> ## Table of contents
#> 
#> 1. [Introduction](#-introduction)
#> 2. [Installation](#-installation)
#>     1. [Building from source](#-building-from-source)
#> 1. [Getting started](#-getting-started)
#> 2. [Bonus feature: rendering pdf](#-bonus-feature-rendering-pdf)
#> 3. [Limitations](#-limitations)
#> 
#> <!-- TOC END -- leave this comment untouched to allow auto update -->
#> 
#> [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
#> [![Build Status](https://travis-ci.org/ropensci/pdftools.svg?branch=master)](https://travis-ci.org/ropensci/pdftools)
#> [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pdftools?branch=master&svg=true)](https://ci.appveyor.com/project/jeroen/pdftools)
#> [![Coverage Status](https://codecov.io/github/ropensci/pdftools/coverage.svg?branch=master)](https://codecov.io/github/ropensci/pdftools?branch=master)
#> [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/pdftools)](http://cran.r-project.org/package=pdftools)
#> [![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/pdftools)](http://cran.r-project.org/web/packages/pdftools/index.html)
#> 
#> ## [↑](#table-of-contents) Introduction
#> 
#> Scientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from pdf files in R. From the extracted plain-text one could find articles discussing a particular drug or species name, without having to rely on publishers providing metadata, or pay-walled search engines.
#> 
#> The pdftools slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler c++ interface together with Rcpp, which results in a lighter and more portable implementation.
#> 
#> 
#> ## [↑](#table-of-contents) Installation
#> 
#> On Windows and Mac the binary packages can be installed directly from CRAN:
#> 
#> ```r
#> install.packages("pdftools")
#> ```
#> 
#> Installation on Linux requires the poppler development library. On Debian/Ubuntu:
#> 
#> ```
#> sudo apt-get install libpoppler-cpp-dev
#> ```
#> 
#> If you want to install the package from source on Mac OS-X you need brew:
#> 
#> ```
#> brew install poppler
#> ```
#> 
#> On Fedora:
#> 
#> ```
#> sudo yum install poppler-cpp-devel
#> ```
#> 
#> ### [↑](#table-of-contents) Building from source
#> 
#> On CentOS the `libpoppler-cpp` library is not included with the system so we need to build from source. Note that recent versions of poppler require C++11 which is not available on CentOS, so we build a slightly older version of libpoppler.
#> 
#> ```sh
#> # Build dependencies
#> yum install wget xz libjpeg-devel openjpeg2-devel
#> 
#> # Download and extract
#> wget https://poppler.freedesktop.org/poppler-0.47.0.tar.xz
#> tar -Jxvf poppler-0.47.0.tar.xz
#> cd poppler-0.47.0
#> 
#> # Build and install
#> ./configure
#> make
#> sudo make install
#> ```
#> 
#> By default libraries get installed in `/usr/local/lib` and `/usr/local/include`. On CentOS this is not a default search path so we need to set `PKG_CONFIG_PATH` and  `LD_LIBRARY_PATH` to point R to the right directory:
#> 
#> ```sh
#> export LD_LIBRARY_PATH="/usr/local/lib"
#> export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"
#> ```
#> 
#> We can then start R and install `pdftools`.
#> 
#> ## [↑](#table-of-contents) Getting started
#> 
#> The `?pdftools` manual page shows a brief overview of the main utilities. The most important function is `pdf_text` which returns a character vector of length equal to the number of pages in the pdf. Each string in the vector contains a plain text version of the text on that page.
#> 
#> ```r
#> library(pdftools)
#> download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
#> txt <- pdf_text("1403.2805.pdf")
#> 
#> # first page text
#> cat(txt[1])
#> 
#> # second page text
#> cat(txt[2])
#> ```
#> 
#> In addition, the package has some utilities to extract other data from the PDF file. The `pdf_toc` function shows the table of contents, i.e. the section headers which pdf readers usually display in a menu on the left. It looks pretty in JSON:
#> 
#> ```r
#> # Table of contents
#> toc <- pdf_toc("1403.2805.pdf")
#> 
#> # Show as JSON
#> jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
#> ```
#> 
#> Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.
#> 
#> 
#> ```r
#> # Author, version, etc
#> info <- pdf_info("1403.2805.pdf")
#> 
#> # Table with fonts
#> fonts <- pdf_fonts("1403.2805.pdf")
#> ```
#> 
#> ## [↑](#table-of-contents) Bonus feature: rendering pdf
#> 
#> A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use `pdf_render_page` to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.
#> 
#> ```r
#> # renders pdf to bitmap array
#> bitmap <- pdf_render_page("1403.2805.pdf", page = 1)
#> 
#> # save bitmap image
#> png::writePNG(bitmap, "page.png")
#> jpeg::writeJPEG(bitmap, "page.jpeg")
#> webp::write_webp(bitmap, "page.webp")
#> ```
#> 
#> This feature is still experimental and currently does not work on Windows.
#> 
#> ## [↑](#table-of-contents) Limitations
#> 
#> Data scientists are often interested in data from tables. Unfortunately the pdf format is pretty dumb and does not have notion of a table (unlike for example HTML). Tabular data in a pdf file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data.
#> 
#> ```r
#> txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")
#> 
#> # some tables
#> cat(txt[18])
#> cat(txt[19])
#> ```
#> 
#> Pdftools usually does a decent job in retaining the positioning of table elements when converting from pdf to text. But the output is still very dependent on the formatting of the original pdf table, which makes it very difficult to write a generic table extractor. But with a little creativity you might be able to parse the table data from the text output of a given paper.
#> 
#> 
#> [![](http://ropensci.org/public_images/github_footer.png)](http://ropensci.org)