PDF object.

Apr 25, 2025 – We'll create a table to hold the PDF data and a dict to hold the values. First we need to import pandas as pd and create a Python dictionary named of:

    import os
    import pandas as pd

    def read_pdf(pdf, directory, size, encoding="utf-8"):
        # 'of' holds the page-format values taken from 'size'
        of = {"page_format": {"is_dog": {"1": size["page_format"]["is_dog"]}}}
        # The page format is then recorded as plain text
        if "page_format" in of:
            of["page_format"] = "text"
        # Assumption: 'pdf' names a comma-separated text export inside 'directory'
        df2 = pd.read_csv(os.path.join(directory, pdf), encoding=encoding,
                          names=["ID", "Title", "Author"])
        df2["page_format"] = of["page_format"]
        return df2

    def to_excel(pdf, directory):
        pdf_path = os.path.join(directory, "page.pdf")
        # Read in the data
        out = pd.read_excel(os.path.join(directory, "pdf-archive.xlsx"))
        # One record per row: page path, type, ID, title and author
        entries = [{"page": pdf_path, "type": int, "id": row["ID"],
                    "title": row["Title"], "author": row["Author"]}
                   for _, row in out.iterrows()]
        for entry in entries:
            print(pdf_path, entry["title"])
        return entries

The final step is to open the output file, which we will create later in the tutorial. The data from the PDF, which was stored on disk, will be used to create the table.
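As a rough illustration of the table and dict described above, here is a minimal sketch. It deliberately uses the standard-library csv module rather than pandas so that it runs without extra packages. The sample rows, the helper names build_table and index_by_id, and the choice to key the dict by ID are all assumptions made for illustration, not part of the tutorial's own code.

```python
import csv
import io

# Hypothetical sample data standing in for text already extracted from a PDF.
SAMPLE = "1,Dune,Frank Herbert\n2,Emma,Jane Austen\n"

# Column names taken from the tutorial's table: ID, Title, Author.
COLUMNS = ["ID", "Title", "Author"]


def build_table(text, page_format="text"):
    """Parse comma-separated rows into a list of dicts, one per record."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS)
    table = []
    for row in reader:
        # Add the constant page_format column to every record.
        row["page_format"] = page_format
        table.append(row)
    return table


def index_by_id(table):
    """Build the companion dict that maps each record ID to its record."""
    return {row["ID"]: row for row in table}


table = build_table(SAMPLE)
values = index_by_id(table)
print(values["2"]["Title"])  # prints "Emma"
```

The list of dicts plays the role of the table and the ID-keyed dict holds the values for quick lookup. With pandas installed, the same rows could be loaded with pd.read_csv and the names argument instead.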