Extracting Text from a PDF file

PDF is usually used as an output format, but sometimes you need to read a PDF as input. This post looks at three Java APIs for extracting text from a PDF: Apache PDFBox, iText, and Snowtide PDFTextStream.

Apache PDFBox

Listing 1 extracts plain text using Apache PDFBox.

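// Load the document, extract its text, then release the underlying file.
// PDDocument.load is the PDFBox 1.x/2.x API; PDFBox 3.x uses Loader.loadPDF instead.
PDDocument pdf = PDDocument.load(new File("sample.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
String plainText = stripper.getText(pdf);
pdf.close();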

Listing 1. Extract plain text using Apache PDFBox

iText

Listing 2 uses iText to extract plain text, one page at a time.

// iText 5 API (classes from com.itextpdf.text.pdf and com.itextpdf.text.pdf.parser)
PdfReader reader = new PdfReader("sample.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Pages are numbered from 1; a fresh strategy is used for each page.
    strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
    System.out.println(strategy.getResultantText());
}
reader.close();

Listing 2. Extract plain text using iText

Differences between Apache PDFBox, iText and Snowtide PDFTextStream

License: Apache PDFBox is published under the Apache License v2.0. iText is open source and published under the GNU Affero General Public License (AGPL). Snowtide PDFTextStream is free to use in single-threaded applications; multi-threaded use requires a paid license.

Extracted plain text: As used in Listings 1 and 2, neither Apache PDFBox nor iText retains the text layout when extracting text from a PDF, so the spacing between pieces of text is lost in their output. This matters when reading tabular data with empty cells: an empty cell leaves no trace, and the remaining values can no longer be assigned to columns reliably. Compare Sample 1 and Sample 2 below:

Employee Id  First Name  Middle Name  Last Name  Department
10011        John                     Gates      Marketing
10012        Claudia     Steven       Parkar     Marketing

Sample 1. Tabular data extracted using Snowtide PDFTextStream

EmployeeId FirstName Middle Name Last Name Department
10011 John Gates Marketing
10012 Claudia Steven Parkar Marketing

Sample 2. Tabular data extracted using Apache PDFBox and iText
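The effect on parsing can be sketched in plain Java, without any PDF library. This is only an illustration: the class name FixedWidthRowParser, the FORMAT string and the column widths are made up for this example, and the rows are built with String.format rather than read from a real PDF. It assumes that layout-preserving output keeps each value under its header, so a row can be cut at the header's column offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FixedWidthRowParser {

    // Hypothetical column names and widths, chosen only for this illustration.
    static final String[] COLUMNS =
            {"Employee Id", "First Name", "Middle Name", "Last Name", "Department"};
    static final String FORMAT = "%-13s%-12s%-13s%-11s%s";

    // Finds where each column name starts in a layout-preserved header line.
    static int[] columnStarts(String header, String... names) {
        int[] starts = new int[names.length];
        int from = 0;
        for (int i = 0; i < names.length; i++) {
            starts[i] = header.indexOf(names[i], from);
            if (starts[i] < 0) {
                throw new IllegalArgumentException("Missing column: " + names[i]);
            }
            from = starts[i] + names[i].length();
        }
        return starts;
    }

    // Cuts a row at the column offsets; an empty cell becomes an empty string.
    static List<String> sliceRow(String row, int[] starts) {
        List<String> cells = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            // Clamp offsets in case trailing spaces were trimmed from the row.
            int begin = Math.min(starts[i], row.length());
            int end = (i + 1 < starts.length)
                    ? Math.min(starts[i + 1], row.length())
                    : row.length();
            cells.add(row.substring(begin, end).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        String header = String.format(FORMAT, (Object[]) COLUMNS);
        String row = String.format(FORMAT, "10011", "John", "", "Gates", "Marketing");
        int[] starts = columnStarts(header, COLUMNS);

        // Layout retained: five cells, the middle name is an empty string.
        System.out.println(sliceRow(row, starts));

        // Layout lost: splitting on whitespace yields only four tokens,
        // with nothing to show which column is missing.
        System.out.println(Arrays.asList("10011 John Gates Marketing".split("\\s+")));
    }
}
```

With the layout retained, the row yields five cells with an empty middle name. With the layout lost, the same row yields four tokens, and "Gates" could equally be a middle name or a last name.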

Extracting text with Snowtide PDFTextStream while retaining the layout

Listing 3 uses PDFTextStream to extract text while retaining its layout.

// VisualOutputTarget writes the text so that it approximates the page's visual layout.
PDFTextStream stream = new PDFTextStream(new File("sample.pdf"));
StringWriter writer = new StringWriter();
VisualOutputTarget target = new VisualOutputTarget(writer);
stream.pipe(target);
stream.close();
writer.flush();
String plainText = writer.toString();

Listing 3. Extract plain text using Snowtide PDFTextStream

Drawback of Snowtide PDFTextStream

Some text characters were found not to be extracted correctly when using this API.
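One way to notice such problems early is to scan the extracted string for characters that rarely belong in ordinary text. The sketch below is a heuristic, not part of any of the three libraries: it assumes that badly extracted characters tend to show up as the Unicode replacement character (U+FFFD), as private-use code points, or as unexpected control characters. The class and method names are hypothetical, and text that is wrong but still looks plausible will not be caught.

```java
class ExtractionQualityCheck {

    // Counts code points that often indicate a failed glyph-to-Unicode mapping.
    // Heuristic only: the categories checked here are an assumption.
    static int countSuspicious(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean replacement = cp == 0xFFFD;
            boolean privateUse = Character.getType(cp) == Character.PRIVATE_USE;
            // Line breaks, tabs and form feeds are normal in extracted text.
            boolean control = Character.isISOControl(cp)
                    && cp != '\n' && cp != '\r' && cp != '\t' && cp != '\f';
            if (replacement || privateUse || control) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // Hand-made stand-in for text returned by an extraction library.
        String extracted = "Caf\uFFFD menu\n\uE000 Espresso";
        System.out.println("Suspicious characters: " + countSuspicious(extracted));
    }
}
```

A non-zero count is a hint to inspect that page by hand or to try one of the other libraries on the same file.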
