Batch convert Word documents into PDF format files
Problem Description
I have a large number of Word documents (.docx format files) saved in a directory. I need to convert all the Word documents to PDF format files.
Problem Refinement
- Input: given a directory, get all the files in it and exclude those that are not Word documents.
- Format fidelity: when a Word document is converted to PDF, the original formatting (headings, text size, fonts, indentation, tables, and so on) should be preserved. What you see in Office Word is what the converted PDF should look like.
- Output: save the converted PDF files to a specified directory, such as the output_pdf directory.
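The first requirement, collecting only the Word documents from a directory, can be sketched roughly as follows. This is a minimal sketch: the function name find_word_files is made up for illustration, and skipping files whose names start with ~$ (the lock files Word creates while a document is open) is an added assumption, not something the requirements above state.

```python
from pathlib import Path


def find_word_files(input_dir):
    """Return the .docx files directly inside input_dir, sorted by name.

    find_word_files is a hypothetical helper name used for illustration.
    """
    word_files = []
    for path in Path(input_dir).iterdir():
        # Keep only regular files with a .docx extension (case-insensitive).
        if not path.is_file() or path.suffix.lower() != ".docx":
            continue
        # Assumption: skip Word's temporary lock files such as "~$report.docx".
        if path.name.startswith("~$"):
            continue
        word_files.append(path)
    return sorted(word_files)
```

Subdirectories are not searched here; swapping iterdir() for rglob("*") would be one way to make the search recursive.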
Program Implementation Plan One
Converting Word to PDF essentially means calling the Office Word program to process the document and output the result.
On Windows, if Microsoft Office Word is installed, a single Word document can simply be opened in Word by hand and saved or exported as a PDF.
The problem here is that there are many Word files to convert. With hundreds of files or more, converting them one by one by hand is far too inefficient, so we need a program to automate the task.
On Windows, the Python pywin32 library lets you call Word directly through its automation interface (the same object model that VBA uses), convert Word files to PDF, and automate the whole process. You can also use the docx2pdf library, which wraps the calls to Office Word and is easier to use. Install it before use: pip install docx2pdf
Simple example code:
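A minimal sketch of a batch conversion with docx2pdf, assuming Word is installed. The wrapper name batch_convert and its converter parameter are made up for illustration; the only library call used is docx2pdf's convert function, which accepts either a single file or a folder as input.

```python
from pathlib import Path


def batch_convert(input_dir, output_dir="output_pdf", converter=None):
    """Convert every .docx file in input_dir to a PDF in output_dir.

    batch_convert and its `converter` parameter are illustrative names;
    by default the real docx2pdf.convert function is used.
    """
    if converter is None:
        # Imported lazily: docx2pdf needs Microsoft Word to do the actual work.
        from docx2pdf import convert as converter
    # Create the output folder first; docx2pdf expects it to exist in batch mode.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # Passing a folder as input converts all the .docx files inside it.
    converter(str(input_dir), str(output_dir))
    return Path(output_dir)


# Example usage (requires Word):
#   batch_convert("my_docx_folder")
# A single file can be converted with convert("input.docx", "output.pdf").
```

The converter parameter exists only so that the wrapper can be tried out without Word, by passing in a stand-in function.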
In addition, after installing the docx2pdf library, you can also convert directly from the command line without writing any Python code.
Enter docx2pdf -h in the terminal window to view detailed instructions.
Program Implementation Plan Two
If you look at the source code of the docx2pdf library, you will find that on Windows it actually calls the pywin32 library. It is simply a wrapper that makes it more convenient for us to use.
Of course, using win32 directly is not complicated either. The code is as follows:
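A minimal sketch of driving Word through pywin32, assuming Windows with Word installed. The function names word_to_pdf and pdf_path_for are made up for illustration; 17 is the value of Word's wdFormatPDF constant.

```python
import os

# Value of Word's wdFormatPDF constant (WdSaveFormat enumeration).
WD_FORMAT_PDF = 17


def pdf_path_for(docx_path, output_dir):
    """Return the absolute PDF path inside output_dir for a Word file."""
    stem = os.path.splitext(os.path.basename(docx_path))[0]
    return os.path.abspath(os.path.join(output_dir, stem + ".pdf"))


def word_to_pdf(docx_path, output_dir="output_pdf"):
    """Convert one Word file to PDF by automating Word (Windows only)."""
    # Imported lazily because pywin32 is only available on Windows.
    import win32com.client

    os.makedirs(output_dir, exist_ok=True)
    pdf_path = pdf_path_for(docx_path, output_dir)
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False  # keep the Word window hidden
    try:
        # Absolute paths avoid surprises with Word's own working directory.
        doc = word.Documents.Open(os.path.abspath(docx_path))
        try:
            doc.SaveAs(pdf_path, FileFormat=WD_FORMAT_PDF)
        finally:
            doc.Close(False)  # close without saving changes to the .docx
    finally:
        word.Quit()  # always shut Word down, even if the conversion fails
    return pdf_path
```

For many files, it would probably be faster to start Word once and reuse the same instance for every document, rather than launching it per file as this sketch does.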
Both of the above methods can easily convert Word documents to PDF files.
[Special Note] Both Plan One and Plan Two require that the Office Word program is installed on your Windows system.
Program Implementation Plan Three
However, I want a cross-platform solution here: the program should run on Windows, macOS, and Linux. So I chose the LibreOffice office suite, because it runs on macOS, Linux, and Windows, and it is open source and completely free.
Install the open source and free LibreOffice
Download the multi-platform version of LibreOffice here https://www.libreoffice.org/download/download-libreoffice/
The specific installation operation will not be explained in detail. Please refer to the official website documentation for installation.
LibreOffice headless mode
Then we use LibreOffice’s headless mode to convert Word files to PDF.
Headless mode means that it can run without a graphical interface.
In Python, you can use the subprocess module to call LibreOffice’s command line tool for conversion.
Reference Code:
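A minimal sketch of calling LibreOffice in headless mode through subprocess. The function names build_command and convert_to_pdf are made up for illustration, and the executable name libreoffice is an assumption that depends on how LibreOffice is installed on your system.

```python
import subprocess
from pathlib import Path


def build_command(docx_path, output_dir, soffice="libreoffice"):
    """Build the headless LibreOffice command line for one file."""
    return [
        soffice,
        "--headless",            # run without a graphical interface
        "--convert-to", "pdf",   # target format
        "--outdir", str(output_dir),
        str(docx_path),
    ]


def convert_to_pdf(docx_path, output_dir="output_pdf", soffice="libreoffice"):
    """Convert one Word file to PDF with LibreOffice.

    Raises CalledProcessError if LibreOffice exits with a non-zero status,
    and FileNotFoundError if the executable cannot be found.
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(docx_path, output_dir, soffice), check=True)
    # LibreOffice names the output file after the input file's stem.
    return Path(output_dir) / (Path(docx_path).stem + ".pdf")


# Example usage (requires LibreOffice):
#   convert_to_pdf("report.docx")
```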
Please replace libreoffice in the above program with the full path to the LibreOffice executable on your own computer.
For example, on macOS on my computer, I need to replace it with /Applications/LibreOffice.app/Contents/MacOS/soffice
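Instead of hard-coding the path, one option is to let the program try a few likely candidates. This is only a sketch: find_soffice is a made-up name, and apart from the macOS path mentioned above, the candidate entries are assumptions about common installations.

```python
import shutil

# Candidate names and paths to try, in order. Only the macOS path comes from
# the text above; the other entries are assumptions about common installs.
CANDIDATES = [
    "libreoffice",
    "soffice",
    "/Applications/LibreOffice.app/Contents/MacOS/soffice",
]


def find_soffice(candidates=CANDIDATES):
    """Return the first candidate that resolves to an executable, else None."""
    for candidate in candidates:
        # shutil.which searches PATH for bare names and checks full paths directly.
        resolved = shutil.which(candidate)
        if resolved:
            return resolved
    return None
```

If find_soffice() returns None, LibreOffice is either not installed or lives somewhere else, and the full path has to be supplied by hand.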
How can you tell whether the full path you found is correct? It is very simple: in a command line window, enter the full path followed by -h. If the path is correct, a long list of help text will be printed.
LibreOffice supported file formats
LibreOffice can convert between many file formats. By changing the parameters in the Python snippet above, you can convert between various formats, such as xlsx->PDF, word->html, ppt->PDF, and so on.
For specific supported formats, please check the official instructions: https://help.libreoffice.org/6.3/en-US/text/shared/guide/convertfilters.html
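As a rough illustration of changing the parameters, the target format can be made an argument of the command. The function name conversion_command is made up for illustration, and the file names in the examples are placeholders; check the official list above for the formats your LibreOffice version actually supports.

```python
def conversion_command(input_path, target_format, output_dir, soffice="libreoffice"):
    """Build a headless LibreOffice command that converts to target_format."""
    return [
        soffice,
        "--headless",
        "--convert-to", target_format,  # e.g. "pdf" or "html"
        "--outdir", str(output_dir),
        str(input_path),
    ]


# Placeholder examples matching the conversions mentioned above:
#   conversion_command("table.xlsx", "pdf", "output")    # xlsx -> PDF
#   conversion_command("report.docx", "html", "output")  # word -> html
#   conversion_command("slides.pptx", "pdf", "output")   # ppt -> PDF
```

The resulting list can be passed to subprocess.run in the same way as for the Word to PDF case.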
Program Implementation Plan Four
If you don’t want to install Office Word or LibreOffice, is there any way to convert Word documents into PDF files?
There is still a way, but it cannot preserve the original formatting: headings, text sizes, fonts, indentation, tables, and so on are not guaranteed to survive.
The principle is to use the third-party library python-docx (pip install python-docx) to read the content of Word files. It is a pure Python library and can be used without installing Office Word. Then use another third-party library, such as reportlab or pdfkit, to generate the PDF file.
However, the results of this approach are relatively poor, and the implementation is also very tedious, so I won't go into details here.
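Just to give a rough idea of the principle, here is a text-only sketch that uses python-docx to read the paragraphs and reportlab to write them out. The function names are made up for illustration, and everything except the plain paragraph text (tables, images, fonts, headings) is lost, which is exactly the limitation described above.

```python
from xml.sax.saxutils import escape


def clean_paragraphs(texts):
    """Drop empty paragraphs and escape &, <, > for reportlab's markup."""
    return [escape(text) for text in texts if text.strip()]


def docx_text_to_pdf(docx_path, pdf_path):
    """Write the plain paragraph text of a Word file into a simple PDF."""
    # Imported lazily so the helper above works without these packages.
    from docx import Document  # provided by the python-docx package
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import Paragraph, SimpleDocTemplate

    # Read only the body paragraphs; text inside tables is not included.
    texts = [paragraph.text for paragraph in Document(docx_path).paragraphs]
    style = getSampleStyleSheet()["Normal"]
    story = [Paragraph(text, style) for text in clean_paragraphs(texts)]
    SimpleDocTemplate(str(pdf_path)).build(story)


# Example usage (requires python-docx and reportlab):
#   docx_text_to_pdf("report.docx", "report.pdf")
```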
Final complete code
Finally, the complete Python code is attached:
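A sketch of what such a complete script could look like, assuming the LibreOffice approach from Plan Three. The function names, the run parameter, and the choice to continue after a failed file are my own assumptions for illustration; replace libreoffice with the full path on your own computer if needed.

```python
import subprocess
from pathlib import Path


def find_word_files(input_dir):
    """Return the .docx files directly inside input_dir, sorted by name."""
    return sorted(
        path for path in Path(input_dir).iterdir()
        if path.is_file()
        and path.suffix.lower() == ".docx"
        and not path.name.startswith("~$")  # assumption: skip Word lock files
    )


def convert_directory(input_dir, output_dir="output_pdf",
                      soffice="libreoffice", run=subprocess.run):
    """Convert every Word file in input_dir to PDF with headless LibreOffice.

    Returns (converted, failed): the expected PDF paths and the Word files
    whose conversion raised an error. `run` is a hypothetical hook that
    defaults to subprocess.run.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for docx_path in find_word_files(input_dir):
        command = [
            soffice, "--headless",
            "--convert-to", "pdf",
            "--outdir", str(output_dir),
            str(docx_path),
        ]
        try:
            run(command, check=True)
            converted.append(output_dir / (docx_path.stem + ".pdf"))
        except (OSError, subprocess.CalledProcessError) as error:
            # Report the failure and carry on with the remaining files.
            print(f"Failed to convert {docx_path.name}: {error}")
            failed.append(docx_path)
    return converted, failed


# Example usage (requires LibreOffice):
#   converted, failed = convert_directory("my_docx_folder", "output_pdf")
```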