Posted by CANbike on Thu, 11 Jun 2015

grep: Searching 55785 PDF Files

The following are tips and tricks learnt while searching PDF files for keywords and phrases.

The Problem: 55785 PDF Files

A folder containing 55785 PDF files, totaling over 8.23 GB of data, needs to be searched for keywords or key phrases.

[Screenshot: 55785 PDF files, 8 GB of data]

A PDF file is a binary format, often compressed. As a result, grep will not work directly with PDF files.

There are other command line tools to search PDF files, but they lack the power and functionality of grep.

The Solution: grep + pdftotext

Fortunately, there is pdftotext, a command-line utility for converting PDF files to plain text. It can convert PDF files to text on the fly and pipe the result to grep, which makes all the features of grep available.
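As a minimal sketch of this idea for a single file, the helper below pipes the output of pdftotext straight into grep. The function name pdf_grep_one and its argument order are assumptions made for illustration; they are not part of either tool.

```shell
# Hypothetical helper: search one PDF for a case-insensitive pattern.
# Usage: pdf_grep_one PATTERN FILE
pdf_grep_one() {
    pattern=$1
    file=$2
    # The "-" output argument makes pdftotext write the text to stdout,
    # so nothing is written to disk before grep sees it.
    pdftotext "$file" - | grep -i -- "$pattern"
}
```

For example, pdf_grep_one "perry sound" PDFs/report.pdf would print the matching lines, where PDFs/report.pdf is a placeholder path.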

Example 1: grep PDF files

The command to search PDF files for lines of text with grep is as follows:

find PDFs/ -name '*.pdf' -exec sh -c 'pdftotext "{}" - | grep -i --with-filename --label="{}" --color "perry sound"' \;

where

find - command to locate all the ".pdf" files in directory PDFs.
    "PDFs/"       - The directory to search for files
    name          - Option to specify the pattern to search for
    '*.pdf'       - The pattern
    exec          - Option to execute a command.
    \;            - Marks the end of command for exec option

sh - runs the system shell (a POSIX shell such as bash or dash, depending on the system)
    c             - option to read command from a string

pdftotext - command to convert a PDF file to text
    "{}"          - The file fed from the find command
    '-'           - Text is sent to stdout

| - Pass the output from pdftotext to grep

grep - command to search input files for lines containing a match to the given PATTERN
    i             - option to ignore-case (case insensitive)
    with-filename - option to print the filename for each match
    label         - option to display input actually coming from standard input as input
                    coming from file LABEL.
    "{}"          - The file fed from the find command
    color         - Option to surround the matching string with the marker found in the GREP_COLOR
                    environment variable
    "perry sound" - The grep search pattern

[Screenshot: grep PDF files]
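One caveat with placing "{}" inside the sh -c string is that the file name becomes part of the shell code, so names containing double quotes, dollar signs, or backticks can break the command. A common workaround, sketched here as an assumption rather than as the method used above, is to pass the pattern and the file name to the inner shell as positional parameters. The wrapper name search_pdfs is hypothetical, and --color is omitted to keep the output plain.

```shell
# Hypothetical wrapper: search every PDF under DIR for PATTERN.
# Usage: search_pdfs DIR PATTERN
search_pdfs() {
    dir=$1
    pattern=$2
    # Inside the inner shell, $1 is the pattern and $2 is the file name.
    # Both arrive as data, so they are never parsed as shell code.
    find "$dir" -name '*.pdf' -exec sh -c '
        pdftotext "$2" - | grep -i --with-filename --label="$2" -- "$1"
    ' sh "$pattern" {} \;
}
```

Calling search_pdfs PDFs/ "perry sound" should print the same kind of output as Example 1, one "filename:line" entry per match.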

Example 2: grep PDF files and display additional lines of text

By default, grep displays only the line of text containing the search pattern “perry sound”. To also display the 2 lines before and the 2 lines after each match, specify the grep -C option.

find PDFs/ -name '*.pdf' -exec sh -c 'pdftotext "{}" - | grep -C 2 -i --with-filename --label="{}" --color "perry sound"' \;

where

-C NUM, --context=NUM
	      Print  NUM lines of output context.  Places a line containing --
	      between contiguous groups of matches.

[Screenshot: grep PDF files and display search results plus additional lines]
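When the amount of context needed before and after a match differs, grep's -B (before) and -A (after) options can be used in place of -C. The sketch below applies them to a single file; the helper name pdf_grep_context and its argument order are assumptions for illustration.

```shell
# Hypothetical helper: show BEFORE lines of leading context and AFTER
# lines of trailing context around each match in one PDF.
# Usage: pdf_grep_context BEFORE AFTER PATTERN FILE
pdf_grep_context() {
    before=$1
    after=$2
    pattern=$3
    file=$4
    # -B and -A together generalise -C, which sets both to the same value.
    pdftotext "$file" - | grep -i -B "$before" -A "$after" -- "$pattern"
}
```

With BEFORE and AFTER both set to 2, this behaves like the -C 2 form used in the example above.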

Example 3: grep PDF files and display only filenames

To grep PDF files and print only the names of the files that contain a match, specify the grep -l option.

find PDFs/ -name '*.pdf' -exec sh -c 'pdftotext "{}" - | grep -l -i --with-filename --label="{}" --color "perry sound"' \;

where

-l, --files-with-matches
	      Suppress normal output; instead print the	 name  of  each	 input
	      file  from  which	 output would normally have been printed.  The
	      scanning will stop on the first match.

[Screenshot: Search PDF files and display only filename results]
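With tens of thousands of files, starting one shell per PDF adds overhead. A possible variation, offered as a sketch and not measured against the collection described here, is to let find hand the inner shell many file names at once with "+" and loop over them. The function name list_matching_pdfs is hypothetical.

```shell
# Hypothetical helper: print the name of every PDF under DIR whose
# text contains PATTERN (case-insensitive).
# Usage: list_matching_pdfs DIR PATTERN
list_matching_pdfs() {
    dir=$1
    pattern=$2
    # "{} +" passes many file names to each inner shell instead of one.
    find "$dir" -name '*.pdf' -exec sh -c '
        pattern=$1
        shift
        for f in "$@"; do
            # grep -q prints nothing and stops at the first match,
            # so only the exit status is used here.
            if pdftotext "$f" - | grep -q -i -- "$pattern"; then
                printf "%s\n" "$f"
            fi
        done
    ' sh "$pattern" {} +
}
```

pdftotext still runs once per file, so the saving is limited to the number of shells started.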

Example 4: grep PDF files and copy/move files with matching results

Copying or moving the files that contain a match requires piping the list of matching filenames to xargs.

Before this can be done, each filename needs to be terminated with a null character so that special characters such as spaces, quotes, apostrophes, backslashes, or newlines are handled correctly.

With grep, the following option needs to be specified:

-Z, --null
        Output a zero byte (the ASCII NUL character) instead of the
        character that normally follows a file name. For example, grep
        -lZ outputs a zero byte after each  file name instead of the
        usual newline. This option makes the output unambiguous, even
        in the presence of file names containing unusual characters like
        newlines. This option can be used with commands like find
        -print0, perl -0, sort -z, and xargs -0 to process arbitrary
        file names, even those that contain newline characters.

With xargs, the following option needs to be specified:

--null, -0
      Input  items  are	 terminated  by a null character instead of by
      whitespace, and the quotes and backslash are not special	(every
      character is taken literally).  Disables the end of file string,
      which is treated like any other  argument.   Useful  when	 input
      items  might  contain  white space, quote marks, or backslashes.
      The GNU find -print0 option produces  input  suitable  for  this
      mode.

Thus,

find PDFs/ -name '*.pdf' -exec sh -c 'pdftotext "{}" - | grep -l -Z -i --with-filename --label="{}" --color "perry sound" | xargs -0 -I{} cp -v {} Found/' \;

where

xargs - build and execute command lines from standard input
    I         - Replace occurrences of replace-str in the initial-arguments with
                names read from standard input.  Also, unquoted  blanks  do  not
                terminate input	items;	instead	 the  separator is the newline
                character.
    "{}"      - The file fed from the find command

cp - copy files and directories
    v         - verbose; explain what is being done
    "{}"      - The file fed from the find command
    "Found/"  - The directory to copy files to

[Screenshot: grep PDF files and copy the files with matching content]
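A variation on the same idea, sketched under the assumption that a single xargs at the end of the pipeline is acceptable, moves xargs outside find and passes the file name to the inner shell as a positional parameter. The null-terminated names produced by grep -l -Z are then read by xargs -0. The function name copy_matching_pdfs and the mkdir step are additions for illustration only.

```shell
# Hypothetical helper: copy every PDF under DIR that contains PATTERN
# into DEST.
# Usage: copy_matching_pdfs DIR PATTERN DEST
copy_matching_pdfs() {
    dir=$1
    pattern=$2
    dest=$3
    # Assumption: create the destination if it does not exist yet.
    mkdir -p "$dest"
    # grep -l -Z prints the label (the PDF path) followed by a NUL byte;
    # xargs -0 splits its input on those NUL bytes.
    find "$dir" -name '*.pdf' -exec sh -c '
        pdftotext "$2" - | grep -l -Z -i --label="$2" -- "$1"
    ' sh "$pattern" {} \; | xargs -0 -I{} cp -v {} "$dest"/
}
```

The destination should sit outside the directory being searched, otherwise later runs would also scan the copies.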

grep + AND Logical Operator

grep has no built-in logical “AND” operator for requiring two separate patterns.

To simulate a logical AND, grep has to be run twice: first to find the files containing “perry sound”, and then to search those same files for “October 26, 2004”.

find PDFs/ -name '*.pdf' -exec sh -c 'pdftotext "{}" - | grep -q -i "perry sound" && pdftotext "{}" - | grep -l -i --label="{}" "October 26, 2004"' \;

[Screenshot: grep PDF files, AND with multiple keywords]
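The AND test can also be written as a small predicate that converts the PDF once and checks the extracted text for both patterns. This is only a sketch: the function name pdf_has_both and the return code 2 for a failed conversion are assumptions, not grep or pdftotext conventions.

```shell
# Hypothetical helper: succeed only if FILE contains both FIRST and SECOND.
# Usage: pdf_has_both FILE FIRST SECOND
pdf_has_both() {
    file=$1
    first=$2
    second=$3
    # Convert once and keep the text in a variable for both searches.
    text=$(pdftotext "$file" -) || return 2
    # The && chain is the logical AND: the second grep only runs
    # when the first one has found a match.
    printf '%s\n' "$text" | grep -q -i -- "$first" &&
        printf '%s\n' "$text" | grep -q -i -- "$second"
}
```

For example, pdf_has_both PDFs/report.pdf "perry sound" "October 26, 2004" && echo PDFs/report.pdf would print the name only when both phrases occur, with PDFs/report.pdf standing in for a real path.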

