Verification from PDF File

Selenium and other browser automation frameworks are built to automate the browser. They have no feature to extract or parse text from a PDF file or other file formats and verify it. Many applications today use PDF files, especially for documentation. Typically we get a link that opens a PDF file when we click it. In such cases we usually skip automating this functionality, because Selenium and similar automation tools do not support it. I came across the same problem, so I am providing a solution here.

A few words about the library we are going to use

We are using Apache Tika to extract data from the PDF file. With it we can develop a content extractor that pulls both structured text and metadata from different types of documents such as PDFs, spreadsheets, text documents, images and, to a certain extent, even multimedia formats.

The Parser interface in the org.apache.tika.parser package is the key interface for parsing documents in Tika. It extracts the text and the metadata from a document through a single parse method, and it is also the interface to implement for anyone who wants to write parser plugins.

How to use it?

If we are using a Maven project, we can simply add the Maven dependency for Apache Tika.

<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.24.1</version>
</dependency>

If you are using Gradle, you can use the same artifact and version:

// https://mvnrepository.com/artifact/org.apache.tika/tika-parsers
implementation group: 'org.apache.tika', name: 'tika-parsers', version: '1.24.1'

Apache Tika publishes other artifacts as well, but for parsing this one is enough.

Code snippet to parse content from a PDF file from a URL (no need to download the PDF file)

I have added comments to the code to explain how it works. Please have a look at the code snippet.

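import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class PdfParser {

    public static String getTextFromPdf(String url)
            throws IOException, TikaException, SAXException {
        // -1 removes the default 100,000-character write limit of the handler
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        // opening the PDF directly from the URL; the stream is closed automatically
        try (InputStream inputStream = new URL(url).openStream()) {
            // parsing the document using the PDF parser
            PDFParser pdfParser = new PDFParser();
            pdfParser.parse(inputStream, handler, metadata, parseContext);
        }
        // getting the content of the document
        System.out.println("Contents of the PDF: " + handler.toString());
        // getting metadata of the document
        System.out.println("Metadata of the PDF:");
        for (String name : metadata.names()) {
            System.out.println(name + " : " + metadata.get(name));
        }
        return handler.toString();
    }
}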
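When the URL does not actually return a PDF (for example, a login or error page comes back instead), the parser fails with an error that can be hard to read. The sketch below is one possible guard, not part of Tika: it checks whether a stream starts with the `%PDF-` header that well-formed PDF files begin with, then rewinds the stream so it can still be parsed. The class name `PdfSignatureCheck` and the method `looksLikePdf` are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Tika or Selenium.
class PdfSignatureCheck {

    // Well-formed PDF files start with the ASCII bytes "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the stream starts with "%PDF-".
    // The stream is rewound afterwards, so it must support mark/reset
    // (for example a BufferedInputStream or ByteArrayInputStream).
    static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("Stream must support mark/reset");
        }
        in.mark(PDF_MAGIC.length);
        try {
            for (byte expected : PDF_MAGIC) {
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset(); // rewind so a parser still sees the whole document
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes standing in for real responses.
        InputStream pdf = new ByteArrayInputStream(
                "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII));
        InputStream html = new ByteArrayInputStream(
                "<html>".getBytes(StandardCharsets.US_ASCII));
        System.out.println(looksLikePdf(pdf));
        System.out.println(looksLikePdf(html));
    }
}
```

To use this with the parser function, one option would be to wrap the URL stream in a `java.io.BufferedInputStream`, call `looksLikePdf` on it, and pass that same buffered stream to the parser only when the check succeeds.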

A sample test with the parser function

// Assert and @Test come from your test framework (for example TestNG or JUnit 4)
@Test
public void verifyPDFContent() throws TikaException, IOException, SAXException {
    // phrases that must all appear in the text of the PDF
    String[] validationTexts = {"Enter", "Multiple", "Words", "From PDF"};
    String textFromPdf = PdfParser.getTextFromPdf("Url of PDF File.");
    Assert.assertTrue(Arrays.stream(validationTexts)
            .allMatch(textFromPdf::contains));
}
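The test above only reports that some phrase did not match, and a multi-word phrase such as "From PDF" can fail simply because the extracted text has a line break between the words. A minimal sketch of one way to handle both, assuming it is acceptable to treat any run of whitespace as a single space: the class `PdfTextCheck` and its methods are hypothetical helpers, not part of Tika or any test framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika or any test framework.
class PdfTextCheck {

    // Collapse every run of whitespace (spaces, tabs, line breaks) into one space.
    static String normalizeWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Return the expected phrases that do not occur in the extracted text.
    static List<String> findMissingPhrases(String extractedText,
                                           List<String> expectedPhrases) {
        String haystack = normalizeWhitespace(extractedText);
        List<String> missing = new ArrayList<>();
        for (String phrase : expectedPhrases) {
            if (!haystack.contains(normalizeWhitespace(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample text standing in for what getTextFromPdf would return.
        String extracted = "Enter Multiple\nWords   From\nPDF";
        List<String> missing = findMissingPhrases(
                extracted, Arrays.asList("Enter", "From PDF", "Invoice"));
        System.out.println("Missing phrases: " + missing);
    }
}
```

In a test, the list returned by `findMissingPhrases` could be asserted to be empty and included in the failure message, so a failing run names exactly which phrases were not found in the PDF.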



I hope this small but useful blog post helps you.

QA Genes

Monday, November 29, 2021
