How to extract images from PDF files using C# and iTextSharp

Welcome to my new blog! For my first post here, I’m rewriting and updating a post that originally appeared on my other blog some time back. Even though I wrote it back in September 2011, it remains my most popular programming post there, simply because so few C# code examples online show how to do this.

PdfUtils source code zip file

Maybe things have changed now, but back when I wrote the original article, I was shocked to find no decent C# code examples that extracted images from PDF files reliably and quickly. All the samples I found were copies of the same horrendous code, which iterated over every object in the PDF file (terribly slow) and then used uneducated guesswork to determine the format of each image stream it found.

If you take just a moment to think about that technique, ask yourself where the collection of all embedded objects in the PDF came from. Most probably, iTextSharp parsed the entire document internally to build it. Many different kinds of objects can be embedded in a PDF, so the collection could contain thousands of objects that have nothing to do with images. And now you iterate the entire collection again? There isn’t a single right way to extract images from a PDF file programmatically, but clearly this way does more wrong than right.

A better way is to iterate the PDF pages and, for each page, collect only the images that the page contains, on the fly. As it happens, iTextSharp supports this with its PdfReaderContentParser type. All you need to do is call its ProcessContent method for each page, passing it the page number and an instance of your own implementation of the IRenderListener interface.
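As a minimal sketch of that per-page approach, here is a listener that does nothing but count images. The names ImageCountingListener and ImageCounter are made up for illustration, and the code assumes an iTextSharp 5.x build in which PdfReader is disposable:

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfUtilsSketch
{
    // Hypothetical listener: counts the images on one page and ignores text.
    internal sealed class ImageCountingListener : IRenderListener
    {
        public int ImageCount { get; private set; }

        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderText(TextRenderInfo renderInfo) { }

        // The parser calls this once for every image it meets on the page.
        public void RenderImage(ImageRenderInfo renderInfo)
        {
            ImageCount++;
        }
    }

    internal static class ImageCounter
    {
        // Returns the image count for each page; element 0 is page 1.
        public static int[] CountImagesPerPage(string pdfPath)
        {
            using (var reader = new PdfReader(pdfPath))
            {
                var parser = new PdfReaderContentParser(reader);
                var counts = new int[reader.NumberOfPages];
                for (var page = 1; page <= reader.NumberOfPages; page++)
                {
                    // A fresh listener per page keeps the counts separate.
                    var listener = new ImageCountingListener();
                    parser.ProcessContent(page, listener);
                    counts[page - 1] = listener.ImageCount;
                }
                return counts;
            }
        }
    }
}
```

Calling ImageCounter.CountImagesPerPage with the path to a PDF returns one count per page. Nothing outside the page content streams is visited, which is the whole point of the technique.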

As a bonus, the listener’s RenderImage method receives an ImageRenderInfo, whose GetImage method returns a PdfImageObject, which in turn has a GetDrawingImage method. This method does all the “heavy lifting” required to decode the image stream from the PDF file, so you don’t have to worry about what type of image stream was saved in the file or how to read the stream data back into an image.
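Because GetDrawingImage hands back an ordinary System.Drawing.Image, everything after that point is plain .NET. As a rough sketch (ImageBytes and ToPng are hypothetical names, not part of iTextSharp or PdfUtils), this re-encodes such an image as PNG in memory, whatever format it had inside the PDF:

```csharp
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

namespace PdfUtilsSketch
{
    internal static class ImageBytes
    {
        // Re-encodes any System.Drawing.Image (for example the result of
        // PdfImageObject.GetDrawingImage()) as PNG and returns the bytes.
        public static byte[] ToPng(Image image)
        {
            using (var stream = new MemoryStream())
            {
                image.Save(stream, ImageFormat.Png);
                return stream.ToArray();
            }
        }
    }
}
```

Swapping ImageFormat.Png for another ImageFormat value should be enough to target a different built-in encoder.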

I didn’t find a sample that did this online. All I found was a vague reference to the technique by a developer, in the context of the original Java iText library, on which iTextSharp is based. Once I knew the basics, finding the relevant types in iTextSharp was a piece of cake. (After a few minutes of paranoid panic, thinking “Oh crap, I hope they implemented this when porting it to .NET”.)

My solution

If you download my PdfUtils.zip file, you’ll find a very small class library (the PdfUtils project) that contains a single static class, PdfImageExtractor, which uses an internal implementation of the IRenderListener interface. The solution also contains a console application that demonstrates extracting images of different formats from a PDF file and saving them to disk.

Here is the full PdfImageExtractor source code as well:

    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;
    using System;
    using System.Collections.Generic;
    using System.IO;

    namespace PdfUtils
    {
        /// <summary>Helper class to extract images from a PDF file. Works with the most
        /// common image types embedded in PDF files, as far as I can tell.</summary>
        /// <example>
        /// Usage example:
        /// <code>
        /// foreach (var filename in Directory.GetFiles(searchPath, "*.pdf", SearchOption.TopDirectoryOnly))
        /// {
        ///    var images = PdfImageExtractor.ExtractImages(filename);
        ///    var directory = Path.GetDirectoryName(filename);
        ///
        ///    foreach (var name in images.Keys)
        ///    {
        ///       images[name].Save(Path.Combine(directory, name));
        ///    }
        ///  }
        /// </code></example>
        public static class PdfImageExtractor
        {
            #region Methods
            #region Public Methods
            /// <summary>Checks whether a specified page of a PDF file contains images.</summary>
            /// <returns>True if the page contains at least one image; false otherwise.</returns>
            public static bool PageContainsImages(string filename, int pageNumber)
            {
                using (var reader = new PdfReader(filename))
                {
                    var parser = new PdfReaderContentParser(reader);
                    ImageRenderListener listener = null;
                    parser.ProcessContent(pageNumber, (listener = new ImageRenderListener()));
                    return listener.Images.Count > 0;
                }
            }

            /// <summary>Extracts all images (of types that iTextSharp knows how to decode) from a PDF file.</summary>
            public static Dictionary<string, System.Drawing.Image> ExtractImages(string filename)
            {
                var images = new Dictionary<string, System.Drawing.Image>();
                using (var reader = new PdfReader(filename))
                {
                    var parser = new PdfReaderContentParser(reader);
                    ImageRenderListener listener = null;
                    for (var i = 1; i <= reader.NumberOfPages; i++)
                    {
                        // A new listener per page, so each page starts with an empty collection.
                        parser.ProcessContent(i, (listener = new ImageRenderListener()));
                        var index = 1;
                        if (listener.Images.Count > 0)
                        {
                            Console.WriteLine("Found {0} images on page {1}.", listener.Images.Count, i);
                            foreach (var pair in listener.Images)
                            {
                                images.Add(string.Format("{0}_Page_{1}_Image_{2}{3}",
                                    Path.GetFileNameWithoutExtension(filename), i.ToString("D4"), index.ToString("D4"), pair.Value), pair.Key);
                                index++;
                            }
                        }
                    }
                    return images;
                }
            }

            /// <summary>Extracts all images (of types that iTextSharp knows how to decode)
            /// from a specified page of a PDF file.</summary>
            /// <returns>Returns a generic <see cref="Dictionary{TKey, TValue}"/> of string to
            /// System.Drawing.Image, where the key is a suggested file name, in the format:
            /// PDF filename without extension, page number and image index in the page.</returns>
            public static Dictionary<string, System.Drawing.Image> ExtractImages(string filename, int pageNumber)
            {
                Dictionary<string, System.Drawing.Image> images = new Dictionary<string, System.Drawing.Image>();
                // Dispose the reader when done, as the other two methods do.
                using (PdfReader reader = new PdfReader(filename))
                {
                    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
                    ImageRenderListener listener = null;
                    parser.ProcessContent(pageNumber, (listener = new ImageRenderListener()));
                    int index = 1;
                    if (listener.Images.Count > 0)
                    {
                        Console.WriteLine("Found {0} images on page {1}.", listener.Images.Count, pageNumber);
                        foreach (KeyValuePair<System.Drawing.Image, string> pair in listener.Images)
                        {
                            images.Add(string.Format("{0}_Page_{1}_Image_{2}{3}",
                                Path.GetFileNameWithoutExtension(filename), pageNumber.ToString("D4"), index.ToString("D4"), pair.Value), pair.Key);
                            index++;
                        }
                    }
                }
                return images;
            }
            #endregion Public Methods
            #endregion Methods
        }

        internal class ImageRenderListener : IRenderListener
        {
            #region Fields
            Dictionary<System.Drawing.Image, string> images = new Dictionary<System.Drawing.Image, string>();
            #endregion Fields
            #region Properties
            public Dictionary<System.Drawing.Image, string> Images
            {
                get { return images; }
            }
            #endregion Properties
            #region Methods
            #region Public Methods
            public void BeginTextBlock() { }
            public void EndTextBlock() { }
            public void RenderImage(ImageRenderInfo renderInfo)
            {
                PdfImageObject image = renderInfo.GetImage();
                PdfName filter = (PdfName)image.Get(PdfName.FILTER);

                //int width = Convert.ToInt32(image.Get(PdfName.WIDTH).ToString());
                //int bitsPerComponent = Convert.ToInt32(image.Get(PdfName.BITSPERCOMPONENT).ToString());
                //string subtype = image.Get(PdfName.SUBTYPE).ToString();
                //int height = Convert.ToInt32(image.Get(PdfName.HEIGHT).ToString());
                //int length = Convert.ToInt32(image.Get(PdfName.LENGTH).ToString());
                //string colorSpace = image.Get(PdfName.COLORSPACE).ToString();
                /* It appears to be safe to assume that when filter == null, PdfImageObject
                 * does not know how to decode the image to a System.Drawing.Image.
                 *
                 * Uncomment the code above to verify, but when I've seen this happen,
                 * width, height and bits per component all equal zero as well. */
                if (filter != null)
                {
                    System.Drawing.Image drawingImage = image.GetDrawingImage();
                    string extension = ".";
                    // Equals rather than == so the check does not rely on both sides
                    // being the same PdfName instance.
                    if (PdfName.DCTDECODE.Equals(filter))
                    {
                        extension += PdfImageObject.ImageBytesType.JPG.FileExtension;
                    }
                    else if (PdfName.JPXDECODE.Equals(filter))
                    {
                        extension += PdfImageObject.ImageBytesType.JP2.FileExtension;
                    }
                    else if (PdfName.FLATEDECODE.Equals(filter))
                    {
                        extension += PdfImageObject.ImageBytesType.PNG.FileExtension;
                    }
                    else if (PdfName.LZWDECODE.Equals(filter))
                    {
                        extension += PdfImageObject.ImageBytesType.CCITT.FileExtension;
                    }
                    /* Rather than struggle with the image stream and try to figure out how to handle
                     * BitMapData scan lines in various formats (like virtually every sample I've found
                     * online), use the PdfImageObject.GetDrawingImage() method, which does the work for us. */
                    this.Images.Add(drawingImage, extension);
                }
            }
            public void RenderText(TextRenderInfo renderInfo) { }
            #endregion Public Methods
            #endregion Methods
        }
    }

					

About Jerome

I am a senior C# developer in Johannesburg, South Africa. I am also a recovering addict, who spent nearly eight years using methamphetamine. I write on my recovery blog about my lessons learned and sometimes give advice to others who have made similar mistakes, often from my viewpoint as an atheist, and I also write some C# programming articles on my programming blog.

23 Responses to How to extract images from PDF files using c# and itextsharp

  1. Pingback: On extracting images from PDF files | A Recovered Meth Addict's Blog

  2. Pingback: What makes a good blog post? And what makes a blog popular? | A Recovered Meth Addict's Blog

  3. Daniel says:

    This looks like a very, very good solution so far. However, I cannot run it yet as I get the error:

    ‘iTextSharp.text.pdf.parser.PdfImageObject’ does not contain a definition for ‘ImageBytesType’

    It may have been replaced by: PdfImageObject.TYPE_JPG; and similarly for the rest of the types.


    • Jerome says:

      My article relates specifically to the type referenced, that I downloaded when I wrote it. The original article on my other blog (http://recoveredmethaddict.wordpress.com/2011/09/20/on-extracting-images-from-pdf-files/) used a different type, which I updated when I rewrote the post. (It took me all of 10 seconds to figure out how the types had changed in the itextsharp implementation – Hit Ctrl+Space on the keyboard.)

      I assume anybody who reads my posts can figure out these things for themselves. (And yes, your assumption about the types sounds about right.)

      I don’t mean to be rude – the main reason I don’t write articles on a site like CodeProject is that the comments mostly ask obvious questions of things they should be able to figure out for themselves. I would not be able to answer the comments without being rude and insulting, which is counter to the reasons for sharing knowledge in the first place.


  4. Thank you very much!! This is very very helpful to me.


  5. Sathyamoorthy says:

    Hi,
    Can you tell me how to get the position of the image from the PDF file?

    Thanks in advance..


    • Jerome says:

      I don’t recall, as it’s been a while since I used this.

      In any case, I wrote the article because it is not intuitive how to get at just the image objects in a PDF. Getting the position of elements inside a PDF however, is intuitive, and this is something you should be able to figure out on your own in a couple of minutes, not something to ask random blog writers of tangentially related posts.


  6. Jake Johnson says:

    I wonder what would be the best way to do the following with this code:

    Instead of extracting the image to a file I need to read the image into a memory stream, however, if it is a compressed image such as *.jpg or *.tiff I would need to decompress it to a bitmap. My ultimate goal is to write a small utility that will automatically remove pages from a PDF if they are blank. And I would determine blankness by the ratio of white to non-white pixels. That way the ratio could even be adjusted during runtime if needed.

    There are free libraries for converting jpg to bmp and for deleting pages from a pdf, so all I need to do is figure out how to read the images directly into memory, for efficiency reasons, and I’m set. I haven’t looked at this too deeply yet but any ideas to get me pointed in the right direction would be greatly appreciated.

    Thanks.


    • Jerome says:

      It’s been a long time since I worked with this code, and I have noticed it doesn’t work with some PDFs, for example any that are saved as PDF by mobile phone apps.

      Having said that, I think you are over-thinking this… It doesn’t matter what the image format in the PDF is; once you have the image, their GetDrawingImage gets it as a Bitmap anyway. At least, as a System.Drawing.Image descendant. So the embedded format is abstracted away from the in-memory image that you get in the end. If you then want to save it in a particular image format, you can use the built-in image encoder classes and save it in whatever supported format you prefer.


  7. Das says:

    Hi Jerome,
    Thanks for the nice info. It has been really helpful. However, I need a little more.
    Is it possible to get the coordinates of the image? (Top-left and bottom-right?)
    I am trying to locate the rectangle where the image is present within the pdf.


  8. Hi Jerome,
    Thanks so much for the free code. It made my job much easier!


  9. fravexblog says:

    Muchas gracias amigo 🙂
    This topic was a great help with my current project.


    • fravexblog says:

      I linked my blog with this post… Hope I have your leave to do so ?


      • Jerome says:

        I don’t mind. Glad it still works…

        I wrote that code in 2011 and didn’t think anything much of it at the time. It was written when I was tweaking my head off on crystal meth, before I turned my life around. Yet I have never been able to write anything quite as popular since.


  10. Pingback: More coming soon… | fravexblog

    • peter says:

      Oops, I pressed Enter too soon … but anyway many thanks for sharing your approach. It worked for me right away after re-compiling on Windows 10 with VS2015 CE.
      Best regards,
      Peter


      • Jerome says:

        Wow, I’m glad it did.

        It’s so weird, this post seems to be my single most popular post, and was originally written on another blog back in 2011 while I was high on crystal meth and had been awake for several days. I was out of my mind, yet somehow figured out how to use that library and shared my code, and seemed to be the first person to do so in c#.

        None of my other posts have come close to the number of views or the amount of praise, and even though I’ve been clean and very much normal and sane for years now, nothing I write comes close to this.

        But I am glad it worked for you. I suspect that it doesn’t work for scanner-produced PDF files, but normal PDF documents with embedded images seem to be covered by it quite well.


  11. Urr says:

    In RenderImage(), it’s better to do string extension = "." + image.GetImageBytesType().FileExtension.ToString(). In my case it made a TIF file extract properly (4 colors, CMYK FlateDecode stream).


    • Urr says:

      So the full code for RenderImage() shrinks down to this:
      public void RenderImage(ImageRenderInfo renderInfo)
      {
          PdfImageObject image = renderInfo.GetImage();
          PdfName filter = (PdfName)image.Get(PdfName.FILTER);
          if (filter != null)
          {
              System.Drawing.Image drawingImage = image.GetDrawingImage();
              string extension = "." + image.GetImageBytesType().FileExtension.ToString();
              this.Images.Add(drawingImage, extension);
          }
      }


      • Jerome says:

        Very nice… Will have to try this when I have a chance. TBH I haven’t looked at this code for years.

        I did write it in my bad old days while tweaking my head off on meth after being awake for several days, and have been amazed ever since that the code works at all and remains my most popular post. (Almost three years clean now.)


  12. daitran says:

    PdfImageObject image = renderInfo.GetImage();
    error: The color space [/Indexed, /DeviceCMYK, 1, 26 0 R] is not supported

