Batch convert Word documents into PDF format files

2022-12-12

Problem Description

I have a large number of Word documents (.docx format files) saved in a directory. I need to convert all the Word documents to PDF format files.

Input: given a directory, list all the files in it and skip anything that is not a Word document.
Conversion requirements: the PDF should look the same as the document does in Office Word, preserving the original layout, headings, text size, fonts, indentation, tables, and so on.
Output: save the converted PDF files to a specified directory, such as output_pdf.
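
The input step can be sketched in a few lines of Python. This is only a minimal sketch: the function name list_word_files is my own, and skipping names that start with ~$ is an extra assumption on my part, because Word leaves such temporary lock files next to documents that are currently open.

```python
from pathlib import Path

WORD_SUFFIXES = {".doc", ".docx"}

def list_word_files(directory):
    """Return the Word documents directly inside `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file()
        and p.suffix.lower() in WORD_SUFFIXES
        and not p.name.startswith("~$")  # skip Word's temporary lock files
    )
```

Subdirectories are not searched here; Path.rglob() could be used instead of iterdir() if that is needed.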

Program Implementation Plan One

Converting Word to PDF essentially means asking the Office Word program to open the document and export it.

On Windows, if Microsoft Office Word is installed and you only have a single document, you can simply open it in Word and save or export it as PDF.

The problem is that there are many Word files to convert. With hundreds of files or more, doing this by hand one at a time is far too slow, so we need a program to automate the task.

On Windows, the Python pywin32 library lets you drive Word through its COM automation interface (the same object model that VBA uses), so the conversion can be automated. Alternatively, the docx2pdf library wraps these calls to Office Word and is easier to use. Install it first with pip install docx2pdf

Simple example code:

from docx2pdf import convert

inputFile = "document.docx"
outputFile = "document2.pdf"

convert(inputFile, outputFile)

In addition, after installing the docx2pdf library, you can also directly use the command line to convert without writing Python code.

Enter docx2pdf -h in the terminal window to view detailed instructions.
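
For the batch case, a loop around convert might look like the sketch below. The function name batch_convert and its converter parameter are hypothetical, added only so the loop can be tried out without Word; by default it falls back to docx2pdf.convert, which still needs Office Word installed.

```python
import os

def batch_convert(input_dir, output_dir, converter=None):
    """Convert every .docx file in input_dir to a PDF in output_dir.

    `converter` is a hypothetical hook taking (input_path, output_path);
    by default it is docx2pdf.convert, which needs Office Word installed.
    """
    if converter is None:
        from docx2pdf import convert as converter  # imported lazily

    os.makedirs(output_dir, exist_ok=True)
    converted, failed = [], []
    for name in sorted(os.listdir(input_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".docx" or name.startswith("~$"):
            continue  # not a .docx file, or a temporary lock file
        src = os.path.join(input_dir, name)
        dst = os.path.join(output_dir, stem + ".pdf")
        try:
            converter(src, dst)
            converted.append(dst)
        except Exception as exc:  # keep going if one document fails
            failed.append((src, exc))
    return converted, failed
```

Collecting failures instead of stopping at the first one means a single damaged document does not abort a run over hundreds of files.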

Program Implementation Plan Two

Looking at the source code of the docx2pdf library, you can see that on Windows it actually calls the pywin32 library. It just wraps those calls so that they are more convenient to use.

Of course, using win32 directly is not complicated either. The code is as follows:

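import os
from win32com.client import constants, gencache

def createPdf(wordPath, pdfPath):
    # Word may resolve relative paths against its own working directory,
    # so pass absolute paths to be safe.
    word = gencache.EnsureDispatch('Word.Application')
    try:
        doc = word.Documents.Open(os.path.abspath(wordPath), ReadOnly=1)
        doc.ExportAsFixedFormat(os.path.abspath(pdfPath),
                                constants.wdExportFormatPDF,
                                Item=constants.wdExportDocumentWithMarkup,
                                CreateBookmarks=constants.wdExportCreateHeadingBookmarks)
        doc.Close(constants.wdDoNotSaveChanges)
    finally:
        # Always shut Word down, even if opening or exporting failed.
        word.Quit(constants.wdDoNotSaveChanges)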

Both of the above methods can easily convert Word documents to PDF files.

[Special Note] Both Plan One and Plan Two require Office Word to be installed. Plan Two (pywin32) works only on Windows; docx2pdf can also be used on macOS if Word is installed there.

Program Implementation Plan Three

However, I want a cross-platform solution that runs on Windows, macOS, and Linux, so I chose the LibreOffice office suite. LibreOffice runs on all three platforms, and it is open source and completely free.

Install the open source and free LibreOffice

Download the multi-platform version of LibreOffice here https://www.libreoffice.org/download/download-libreoffice/

I won’t go through the installation steps here. Please refer to the documentation on the official website.

LibreOffice headless mode

Then we use LibreOffice’s headless mode to convert Word files to PDF.

Headless mode means that it can run without a graphical interface.

In Python, you can use the subprocess module to call LibreOffice’s command line tool for conversion.

Reference Code:

import subprocess

def convert_to_pdf(input_file, output_directory):
    # check=True raises CalledProcessError if LibreOffice exits with a non-zero status
    subprocess.run(['libreoffice', '--headless', '--convert-to', 'pdf',
                    input_file, '--outdir', output_directory], check=True)

Replace libreoffice in the code above with the full path to the LibreOffice executable on your own computer.

On macOS, for example, the path on my computer is /Applications/LibreOffice.app/Contents/MacOS/soffice

To check whether the path you found is correct, type it in a terminal window followed by -h. If it is right, a long help message appears, something like this:

/Applications/LibreOffice.app/Contents/MacOS/soffice -h
LibreOffice 7.6.4.1 e19e193f88cd6c0525a17fb7a176ed8e6a3e2aa1

Usage: soffice [argument...]
       argument - switches, switch parameters and document URIs (filenames).

Using without special arguments:
Opens the start center, if it is used without any arguments.
   {file}              Tries to open the file (files) in the components
                       suitable for them.
   {file} {macro:///Library.Module.MacroName}
                       Opens the file and runs specified macro from
                       My Macros container.
   {file} {macro://./Library.Module.MacroName}
                       Opens the file and runs specified macro from
                       the file.

Getting help and information:
   --help | -h | -?    Shows this help and quits.
   ......
   ......
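
Finding the executable can also be done from Python instead of hard-coding the path. The sketch below is only one possible approach: find_soffice and its parameters are hypothetical names, and the Windows path in the candidate list is a common default that I am assuming rather than something checked on every installation.

```python
import os
import shutil

# Typical install locations; these are assumptions and may differ on your machine.
DEFAULT_CANDIDATES = [
    "/Applications/LibreOffice.app/Contents/MacOS/soffice",  # macOS path used in this article
    r"C:\Program Files\LibreOffice\program\soffice.exe",     # assumed Windows default
]

def find_soffice(names=("soffice", "libreoffice"), candidates=DEFAULT_CANDIDATES):
    """Return a path to the LibreOffice executable, or None if none is found."""
    # First look for the command on PATH (the usual case on Linux).
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    # Then fall back to the known install locations.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

If find_soffice() returns None, LibreOffice is probably not installed, or it lives somewhere that has to be added to the candidate list by hand.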

LibreOffice supported file formats

LibreOffice can convert between many file formats. By changing the parameters in the Python snippet above, you can convert other formats as well, such as xlsx to PDF, Word to HTML, ppt to PDF, and so on.

For the specific formats supported, see the official documentation: https://help.libreoffice.org/6.3/en-US/text/shared/guide/convertfilters.html
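
As a small illustration of changing the parameters, the command line can be built by a helper that takes the target format as an argument. The helper name build_convert_command is hypothetical. It also relies on the fact that the soffice command line accepts several input files after the options, so one LibreOffice launch can handle a whole batch.

```python
def build_convert_command(soffice, input_files, output_directory, target="pdf"):
    """Build the argument list for one headless LibreOffice conversion run.

    `soffice` is the path to the LibreOffice executable, `input_files` is a
    list of documents, and `target` is the output extension (pdf, html, ...).
    """
    return [soffice, "--headless", "--convert-to", target,
            "--outdir", output_directory, *input_files]

# Usage (requires LibreOffice, so it is left commented out here):
# import subprocess
# subprocess.run(build_convert_command("libreoffice", ["report.xlsx"], "out"), check=True)
# subprocess.run(build_convert_command("libreoffice", ["a.docx"], "out", target="html"), check=True)
```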

Program Implementation Plan Four

If you don’t want to install Office Word or LibreOffice, is there still a way to convert Word documents into PDF files?

There is, but it cannot preserve the original formatting: headings, text sizes, fonts, indentation, tables, and so on are not guaranteed to survive.

The idea is to read the content of the Word file with the third-party library python-docx (pip install python-docx). It is a Python library that works without Office Word installed. Then use another third-party library, such as reportlab or pdfkit, to generate the PDF file.

However, the results are relatively poor and the implementation is quite tedious, so I won’t go into details here.
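
Still, to give a rough idea of what this route involves, here is a minimal sketch that assumes python-docx and reportlab are both installed. The function names are my own. It copies only paragraph text and maps Word heading styles onto reportlab's sample styles; tables, images, fonts, and indentation are all lost, which is exactly the limitation described above.

```python
from xml.sax.saxutils import escape

# Word's built-in heading style names mapped to reportlab's sample style names.
HEADINGS = {f"Heading {n}": f"Heading{n}" for n in range(1, 7)}

def reportlab_style_name(docx_style_name):
    """Map a python-docx paragraph style name to a reportlab sample style name."""
    if docx_style_name == "Title":
        return "Title"
    return HEADINGS.get(docx_style_name, "BodyText")

def docx_text_to_pdf(docx_path, pdf_path):
    """Write the paragraph text of a .docx file into a simple PDF."""
    # Third-party imports are local so the helper above works without them.
    from docx import Document
    from reportlab.lib.pagesizes import A4
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import Paragraph, SimpleDocTemplate

    styles = getSampleStyleSheet()
    story = []
    for para in Document(docx_path).paragraphs:
        if not para.text.strip():
            continue  # skip empty paragraphs
        style = styles[reportlab_style_name(para.style.name)]
        # Paragraph() parses XML-like markup, so escape the raw text first.
        story.append(Paragraph(escape(para.text), style))
    SimpleDocTemplate(pdf_path, pagesize=A4).build(story)
```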

Final complete code

Finally, the complete Python code is attached:

import os
import subprocess

# Full path to the LibreOffice executable; adjust it for your own computer
SOFFICE = '/Applications/LibreOffice.app/Contents/MacOS/soffice'

def convert_to_pdf(input_file, output_directory):
    # LibreOffice writes <input name>.pdf into output_directory.
    # check=True stops with an error if LibreOffice exits with a non-zero status.
    subprocess.run([SOFFICE, '--headless', '--convert-to', 'pdf',
                    input_file, '--outdir', output_directory], check=True)


word_path = './word-files'
output_pdf_path = './output_pdf'

# Search for files with extensions .docx and .doc in the specified directory,
# ignore subdirectories, and ignore files with other extensions.
# If you need to search subdirectories too, you can use os.walk()
files = [
    f
    for f in os.listdir(word_path)
    if f.lower().endswith(('.doc', '.docx'))
]
files.sort()

# Create the output folder if it does not exist
os.makedirs(output_pdf_path, exist_ok=True)

for file in files:
    # splitext() separates the file name from its suffix;
    # the suffix includes the dot (e.g. .docx) and is not needed here
    file_name, _ = os.path.splitext(file)
    word_filename = os.path.join(word_path, file)
    pdf_filename = os.path.join(output_pdf_path, file_name + '.pdf')
    print(word_filename, '-->', pdf_filename)
    convert_to_pdf(word_filename, output_pdf_path)

    # Alternative (Plan One, requires Office Word and "import docx2pdf"):
    # docx2pdf.convert(word_filename, pdf_filename)
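
After a long batch run it is worth checking that every document really produced a PDF. The helper below is a hypothetical addition, not part of the original script. It assumes, as the script above does, that each output file is named after the input file with the extension replaced by .pdf.

```python
import os

def missing_pdfs(word_files, output_dir):
    """Return the input files for which no non-empty PDF exists in output_dir."""
    missing = []
    for path in word_files:
        stem = os.path.splitext(os.path.basename(path))[0]
        pdf = os.path.join(output_dir, stem + ".pdf")
        if not os.path.isfile(pdf) or os.path.getsize(pdf) == 0:
            missing.append(path)
    return missing
```

Calling it with the list of input paths and the output directory after the loop gives the files that need a second look.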