Verification from PDF File
Selenium and other automation frameworks are all about automating the browser. They have no built-in feature to extract or parse text from a PDF file or other file formats and verify it. Today many applications use PDF files, especially for documentation. Typically we get a link that opens a PDF file when we click it. In such cases we usually skip automating this functionality, because Selenium and similar automation tools do not support it. I came across the same problem, so I am providing a solution here.
A few words about the library we are going to use
We are using Apache Tika to extract data from the PDF file. With Tika we can develop a content extractor that extracts both structured text and metadata from different types of documents, such as PDFs, spreadsheets, text documents, images, and even multimedia input formats to a certain extent.
The Parser interface in the org.apache.tika.parser package is the key interface for parsing documents in Tika. It hides the differences between file formats behind a single parse method that extracts the text and the metadata from a document, and it is also the extension point for anyone who wants to write parser plugins.
How to use it?
If we are using a Maven project, we can simply add the Maven dependency for Apache Tika.
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.24.1</version>
</dependency>
If you are using Gradle, you can use:
// https://mvnrepository.com/artifact/org.apache.tika/tika-parsers
implementation group: 'org.apache.tika', name: 'tika-parsers', version: '2.1.0', ext: 'pom'
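Apache Tika has other dependencies as well, but this one is enough for parsing. Note that the Maven and Gradle snippets reference different versions (1.24.1 and 2.1.0); pick one version and use it consistently in your build.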
Code snippet to parse content from a PDF file at a URL (no need to download the PDF file)
I have added comments to the code to explain how it works. Please have a look at the code snippet.
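import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class PdfParser {

    public static String getTextFromPdf(String url)
            throws IOException, TikaException, SAXException {
        // collects the body text of the document (the default handler is limited to 100,000 characters)
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext parsedContext = new ParseContext();

        // open the PDF directly from the URL; try-with-resources closes the stream when parsing is done
        try (InputStream inputstream = new URL(url).openStream()) {
            // parsing the document using PDF parser
            PDFParser pdfparser = new PDFParser();
            pdfparser.parse(inputstream, handler, metadata, parsedContext);
        }

        // getting the content of the document
        System.out.println("Contents of the PDF :" + handler.toString());

        // getting metadata of the document
        System.out.println("Metadata of the PDF:");
        String[] metadataNames = metadata.names();
        for (String name : metadataNames) {
            System.out.println(name + " : " + metadata.get(name));
        }
        return handler.toString();
    }
}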
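A link that is supposed to open a PDF can also return an HTML error or login page, and the parser would then fail or return unrelated text. As a rough sketch (not part of Tika; the class and method names here are hypothetical), the first bytes of the downloaded stream can be checked for the "%PDF-" marker that PDF files start with before the stream is handed to the parser. This sketch assumes the marker sits at the very first byte and needs Java 11 or later for readNBytes.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Apache Tika or Selenium.
final class PdfSignature {

    // PDF files begin with the ASCII marker "%PDF-" followed by a version number.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the given bytes begin with the "%PDF-" marker. */
    static boolean startsWithPdfHeader(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Peeks at the first bytes of the stream and rewinds it, so the parser can still read it from the start. */
    static boolean looksLikePdf(BufferedInputStream in) {
        try {
            in.mark(PDF_MAGIC.length);
            byte[] head = in.readNBytes(PDF_MAGIC.length); // Java 11+
            in.reset();
            return startsWithPdfHeader(head);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

To use it, wrap the stream returned by openStream() in a BufferedInputStream, call looksLikePdf on it, and pass the same buffered stream to the parser only when the check succeeds.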
A sample test with the parser function
@Test
public void verifyPDFContent() throws TikaException, IOException, SAXException {
    String[] validationTexts = {"Enter", "Multiple", "Words", "From PDF"};
    String textFromPdf = PdfParser.getTextFromPdf("Url of PDF File.");
    // the test passes only if every expected text appears in the PDF content
    Assert.assertTrue(Arrays.stream(validationTexts).parallel()
            .allMatch(textFromPdf::contains));
}
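Text extracted from a PDF often contains line breaks and repeated spaces in the middle of a phrase, so a plain contains check on a multi-word text such as "From PDF" can fail even though the words are present. It also does not say which text was missing. A minimal sketch of one way to handle both, using only the JDK (the class and method names are hypothetical and not part of Tika or any test framework):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the test above; plain JDK, no Tika or test-framework types.
final class PdfTextAssertions {

    /** Collapses each run of whitespace (spaces, tabs, line breaks) into one space and trims the ends. */
    static String normalizeWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns the expected phrases that do not occur in the extracted text, in the order given. */
    static List<String> missingPhrases(String pdfText, String... expected) {
        String haystack = normalizeWhitespace(pdfText);
        List<String> missing = new ArrayList<>();
        for (String phrase : expected) {
            if (!haystack.contains(normalizeWhitespace(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }
}
```

In the test, the result of missingPhrases(textFromPdf, validationTexts) can be asserted to be empty, and printing the returned list in the failure message shows exactly which texts were not found.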
I hope this small but useful post helps you.