PDF is usually used as an output format, but sometimes you need to read a PDF as an input file. Three Java APIs are available for extracting text from a PDF: Apache PDFBox, iText and Snowtide PDFTextStream.
Apache PDFBox
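The code in Listing 1 extracts plain text using Apache PDFBox.
// Uses PDDocument and PDFTextStripper from Apache PDFBox (PDDocument.load is the PDFBox 2.x API)
PDDocument pdf = PDDocument.load(new File("sample.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
String plainText = stripper.getText(pdf);
pdf.close(); // release the underlying file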
Listing 1. Extract plain text using Apache PDFBox
iText
The code in Listing 2 uses iText to extract plain text, one page at a time.
// iText 5 classes from com.itextpdf.text.pdf and com.itextpdf.text.pdf.parser
PdfReader reader = new PdfReader("sample.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
TextExtractionStrategy strategy;
// Page numbers in iText start at 1
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
    System.out.println(strategy.getResultantText());
}
reader.close();
Listing 2. Extract plain text using iText
Differences between Apache PDFBox, iText and Snowtide PDFTextStream
License - Apache PDFBox is published under the Apache License v2.0. iText is an open source API published under the GNU Affero General Public License. Snowtide PDFTextStream is free to use in a single-threaded application; for multi-threaded access you need to buy a license.
Extracted Plain Text - Neither Apache PDFBox nor iText retains the text layout when extracting text from a PDF, so the spacing between pieces of text is lost in their output. This can cause problems when reading tabular data with empty cells. In Sample 2, the first row has no Middle Name, so "Gates" follows "John" directly and cannot be told apart from a middle name. Compare Samples 1 and 2 below:
Employee Id  First Name  Middle Name  Last Name  Department
10011        John                     Gates      Marketing
10012        Claudia     Steven       Parkar     Marketing
Sample 1. Tabular data extracted using Snowtide PDFTextStream
EmployeeId FirstName Middle Name Last Name Department
10011 John Gates Marketing
10012 Claudia Steven Parkar Marketing
Sample 2. Tabular data extracted using Apache PDFBox and iText
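To make the effect of the lost spacing concrete, here is a minimal sketch in plain Java that uses no PDF library. The class `CollapsedRowDemo` and its methods are hypothetical names introduced for illustration, the rows are copied from Sample 2, and the sketch assumes every cell value is a single word. It splits each row on whitespace and checks whether all five columns are present.

```java
import java.util.Arrays;

// Hypothetical demo class; the rows are copied from Sample 2.
class CollapsedRowDemo {

    // Number of columns in the source table (Employee Id through Department).
    static final int EXPECTED_COLUMNS = 5;

    // Splits one extracted row on runs of whitespace.
    static String[] tokens(String row) {
        return row.trim().split("\\s+");
    }

    // A row can be mapped to columns by position only if no cell is missing.
    static boolean hasAllColumns(String row) {
        return tokens(row).length == EXPECTED_COLUMNS;
    }

    public static void main(String[] args) {
        String[] rows = {
            "10011 John Gates Marketing",            // Middle Name cell is empty
            "10012 Claudia Steven Parkar Marketing"  // all five cells present
        };
        for (String row : rows) {
            System.out.println(Arrays.toString(tokens(row))
                    + " complete=" + hasAllColumns(row));
        }
    }
}
```

The first row yields only four tokens, so a positional mapping would place "Gates" in the Middle Name column and "Marketing" in the Last Name column. Nothing in the row itself tells you which cell was empty.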
Extracting text with Snowtide PDFTextStream while retaining the layout
Listing 3 uses PDFTextStream to extract text while retaining its layout.
PDFTextStream stream = new PDFTextStream(new File("sample.pdf"));
StringWriter writer = new StringWriter();
VisualOutputTarget target = new VisualOutputTarget(writer);
stream.pipe(target);
stream.close();
writer.flush();
String plainText = writer.toString();
Listing 3. Extract plain text using Snowtide PDFTextStream
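Once the layout is retained, rows can be cut into cells by character position instead of by whitespace. The following is a rough sketch in plain Java under stated assumptions, not part of the PDFTextStream API. It assumes the extracted text keeps each column at a consistent character offset, that the column names are known in advance, and that cell values do not overflow into the next column. The class `FixedWidthRowDemo` and its methods are hypothetical, and the sample lines are built with `String.format` to stand in for layout-preserving output.

```java
import java.util.Arrays;

// Hypothetical helper for slicing column-aligned text; not a PDFTextStream class.
class FixedWidthRowDemo {

    // Finds the character offset at which each known column name starts in the header.
    static int[] columnStarts(String header, String[] names) {
        int[] starts = new int[names.length];
        int from = 0;
        for (int i = 0; i < names.length; i++) {
            starts[i] = header.indexOf(names[i], from);
            if (starts[i] < 0) {
                throw new IllegalArgumentException("Column not found: " + names[i]);
            }
            from = starts[i] + names[i].length();
        }
        return starts;
    }

    // Cuts a row at the column offsets; an empty cell becomes an empty string.
    static String[] cells(String row, int[] starts) {
        String[] result = new String[starts.length];
        for (int i = 0; i < starts.length; i++) {
            int begin = Math.min(starts[i], row.length());
            int end = (i + 1 < starts.length)
                    ? Math.min(starts[i + 1], row.length())
                    : row.length();
            result[i] = row.substring(begin, end).trim();
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed column widths, used only to build aligned sample lines.
        String layout = "%-13s%-12s%-13s%-11s%s";
        String[] names = {"Employee Id", "First Name", "Middle Name", "Last Name", "Department"};
        String header = String.format(layout, (Object[]) names);
        String row = String.format(layout, "10011", "John", "", "Gates", "Marketing");

        int[] starts = columnStarts(header, names);
        System.out.println(Arrays.toString(cells(row, starts)));
    }
}
```

With aligned input, the empty Middle Name cell comes back as an empty string and "Gates" stays in the Last Name column. Real output may not align this neatly, so treat the offsets as a starting point and check them against your own documents.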
Drawback of Snowtide PDFTextStream
Some characters are not extracted correctly when using this API.