Welcome to my new blog! For my first post here, I’m rewriting and updating a post that originally appeared on my other blog some time back. Even though I wrote it back in September 2011, it remains my most popular programming post there, simply because of the lack of C# code examples online that do this.
Maybe things have changed now, but back when I wrote the original article, I was shocked to find no decent C# code examples that reliably and efficiently extracted images from PDF files. All the samples I found were copies of the same horrendous code, which iterated over every object in the PDF file (this is terribly slow) and then used uneducated guesswork to determine the formats of the image streams it had found.
If you take just a moment to think about such a technique, ask yourself where the collection of all embedded objects in the PDF came from. Most probably, iTextSharp used a private method to parse the entire document and build up this collection. Many different kinds of objects can be embedded, so the collection could contain thousands of irrelevant objects. And now you iterate the entire collection again? There isn’t a single right way to extract images from a PDF file programmatically, but clearly, this way does more wrong than it does right.
A better way is to iterate the PDF pages and, for each page, get a collection of only the images on that page, on the fly. As it happens, iTextSharp supports this with its PdfReaderContentParser type. All you need to do is call its ProcessContent method once per page, passing it the page number and an instance of a class that implements the IRenderListener interface (which you have to write yourself).
As a bonus, the listener’s RenderImage method receives an ImageRenderInfo, whose GetImage method returns a PdfImageObject, and that type has a GetDrawingImage method. This method does all the “heavy lifting” required to decode the image stream from the PDF file, so you don’t have to worry about what type of image stream was saved in the file or how to read the stream data back into an image.
I didn’t find a sample online that did this; all I found was a vague reference to the technique by a developer, in the context of the original Java iText library, on which iTextSharp is based. Once I knew the basics, finding the relevant types in iTextSharp was a piece of cake. (After a few minutes of paranoid panic, thinking “Oh crap, I hope they implemented this when porting it to .NET.”)
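To make the idea concrete before the full class, here is a minimal sketch of the page-by-page approach, assuming iTextSharp 5.x. The PageImageCollector class and the "sample.pdf" default path are hypothetical names used only for illustration; the iTextSharp calls are the same ones used in the full source further down.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical listener for illustration: collects every image that
// iTextSharp manages to decode on a single page.
internal sealed class PageImageCollector : IRenderListener
{
    public List<System.Drawing.Image> Images { get; } = new List<System.Drawing.Image>();

    // Text callbacks are not needed when we only care about images.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        PdfImageObject image = renderInfo.GetImage();
        // GetDrawingImage() decodes the stream for the formats iTextSharp supports.
        Images.Add(image.GetDrawingImage());
    }
}

internal static class Program
{
    private static void Main(string[] args)
    {
        // Assumed input file; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.pdf";

        using (var reader = new PdfReader(path))
        {
            var parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh listener per page keeps the images grouped by page.
                var listener = new PageImageCollector();
                parser.ProcessContent(page, listener);
                Console.WriteLine("Page {0}: {1} image(s)", page, listener.Images.Count);
            }
        }
    }
}
```

This sketch does no error handling. Images that iTextSharp cannot decode may throw, which is why the full class guards the GetDrawingImage call with a filter check.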
My solution
If you download my PdfUtils.zip file, you’ll find a very small class library (the PdfUtils project) that contains a single static class, PdfImageExtractor, which uses an internal implementation of the IRenderListener interface. The solution also contains a console application that demonstrates extracting images of different formats from a PDF file and saving them to disk.
Here is the full PdfImageExtractor source code as well:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.IO;

namespace PdfUtils
{
    /// <summary>Helper class to extract images from a PDF file. Works with the most
    /// common image types embedded in PDF files, as far as I can tell.</summary>
    /// <example>
    /// Usage example:
    /// <code>
    /// foreach (var filename in Directory.GetFiles(searchPath, "*.pdf", SearchOption.TopDirectoryOnly))
    /// {
    ///     var images = PdfImageExtractor.ExtractImages(filename);
    ///     var directory = Path.GetDirectoryName(filename);
    ///
    ///     foreach (var name in images.Keys)
    ///     {
    ///         images[name].Save(Path.Combine(directory, name));
    ///     }
    /// }
    /// </code></example>
    public static class PdfImageExtractor
    {
        #region Methods
        #region Public Methods

        /// <summary>Checks whether a specified page of a PDF file contains images.</summary>
        /// <returns>True if the page contains at least one image; false otherwise.</returns>
        public static bool PageContainsImages(string filename, int pageNumber)
        {
            using (var reader = new PdfReader(filename))
            {
                var parser = new PdfReaderContentParser(reader);
                var listener = new ImageRenderListener();
                parser.ProcessContent(pageNumber, listener);
                return listener.Images.Count > 0;
            }
        }

        /// <summary>Extracts all images (of types that iTextSharp knows how to decode) from a PDF file.</summary>
        /// <returns>A dictionary where the key is a suggested file name and the value is the image.</returns>
        public static Dictionary<string, System.Drawing.Image> ExtractImages(string filename)
        {
            var images = new Dictionary<string, System.Drawing.Image>();

            using (var reader = new PdfReader(filename))
            {
                var parser = new PdfReaderContentParser(reader);

                // Page numbers are 1-based.
                for (var i = 1; i <= reader.NumberOfPages; i++)
                {
                    // Use a fresh listener per page so that images are grouped by page.
                    var listener = new ImageRenderListener();
                    parser.ProcessContent(i, listener);
                    var index = 1;

                    if (listener.Images.Count > 0)
                    {
                        Console.WriteLine("Found {0} images on page {1}.", listener.Images.Count, i);

                        foreach (var pair in listener.Images)
                        {
                            images.Add(string.Format("{0}_Page_{1}_Image_{2}{3}",
                                Path.GetFileNameWithoutExtension(filename), i.ToString("D4"), index.ToString("D4"), pair.Value), pair.Key);
                            index++;
                        }
                    }
                }

                return images;
            }
        }

        /// <summary>Extracts all images (of types that iTextSharp knows how to decode)
        /// from a specified page of a PDF file.</summary>
        /// <returns>Returns a generic <see cref="Dictionary{TKey, TValue}"/>,
        /// where the key is a suggested file name, in the format: PDF filename without extension,
        /// page number and image index in the page.</returns>
        public static Dictionary<string, System.Drawing.Image> ExtractImages(string filename, int pageNumber)
        {
            var images = new Dictionary<string, System.Drawing.Image>();

            // Dispose the reader here too, as the other methods do.
            using (var reader = new PdfReader(filename))
            {
                var parser = new PdfReaderContentParser(reader);
                var listener = new ImageRenderListener();
                parser.ProcessContent(pageNumber, listener);
                var index = 1;

                if (listener.Images.Count > 0)
                {
                    Console.WriteLine("Found {0} images on page {1}.", listener.Images.Count, pageNumber);

                    foreach (KeyValuePair<System.Drawing.Image, string> pair in listener.Images)
                    {
                        images.Add(string.Format("{0}_Page_{1}_Image_{2}{3}",
                            Path.GetFileNameWithoutExtension(filename), pageNumber.ToString("D4"), index.ToString("D4"), pair.Value), pair.Key);
                        index++;
                    }
                }

                return images;
            }
        }

        #endregion Public Methods
        #endregion Methods
    }

    internal class ImageRenderListener : IRenderListener
    {
        #region Fields

        // Maps each decoded image to its suggested file extension.
        Dictionary<System.Drawing.Image, string> images = new Dictionary<System.Drawing.Image, string>();

        #endregion Fields

        #region Properties

        public Dictionary<System.Drawing.Image, string> Images
        {
            get { return images; }
        }

        #endregion Properties

        #region Methods
        #region Public Methods

        public void BeginTextBlock() { }

        public void EndTextBlock() { }

        public void RenderImage(ImageRenderInfo renderInfo)
        {
            PdfImageObject image = renderInfo.GetImage();
            PdfName filter = (PdfName)image.Get(PdfName.FILTER);

            //int width = Convert.ToInt32(image.Get(PdfName.WIDTH).ToString());
            //int bitsPerComponent = Convert.ToInt32(image.Get(PdfName.BITSPERCOMPONENT).ToString());
            //string subtype = image.Get(PdfName.SUBTYPE).ToString();
            //int height = Convert.ToInt32(image.Get(PdfName.HEIGHT).ToString());
            //int length = Convert.ToInt32(image.Get(PdfName.LENGTH).ToString());
            //string colorSpace = image.Get(PdfName.COLORSPACE).ToString();

            /* It appears to be safe to assume that when filter == null, PdfImageObject
             * does not know how to decode the image to a System.Drawing.Image.
             *
             * Uncomment the code above to verify, but when I've seen this happen,
             * width, height and bits per component all equal zero as well. */
            if (filter != null)
            {
                System.Drawing.Image drawingImage = image.GetDrawingImage();

                // Pick a suggested file extension based on the stream's filter.
                string extension = ".";

                if (filter == PdfName.DCTDECODE)
                {
                    extension += PdfImageObject.ImageBytesType.JPG.FileExtension;
                }
                else if (filter == PdfName.JPXDECODE)
                {
                    extension += PdfImageObject.ImageBytesType.JP2.FileExtension;
                }
                else if (filter == PdfName.FLATEDECODE)
                {
                    extension += PdfImageObject.ImageBytesType.PNG.FileExtension;
                }
                else if (filter == PdfName.LZWDECODE)
                {
                    // Assumption: LZW-compressed images are treated as TIFF here.
                    extension += PdfImageObject.ImageBytesType.CCITT.FileExtension;
                }

                /* Rather than struggle with the image stream and try to figure out how to handle
                 * BitMapData scan lines in various formats (like virtually every sample I've found
                 * online), use the PdfImageObject.GetDrawingImage() method, which does the work for us. */
                this.Images.Add(drawingImage, extension);
            }
        }

        public void RenderText(TextRenderInfo renderInfo) { }

        #endregion Public Methods
        #endregion Methods
    }
}
This looks like a very, very good solution so far. However, I cannot run it yet as I get the error:
‘iTextSharp.text.pdf.parser.PdfImageObject’ does not contain a definition for ‘ImageBytesType’
It may have been replaced by: PdfImageObject.TYPE_JPG; and similarly for the rest of the types.
My article relates specifically to the type referenced, which I downloaded when I wrote it. The original article on my other blog (http://recoveredmethaddict.wordpress.com/2011/09/20/on-extracting-images-from-pdf-files/) used a different type, which I updated when I rewrote the post. (It took me all of 10 seconds to figure out how the types had changed in the iTextSharp implementation: hit Ctrl+Space on the keyboard.)
I assume anybody who reads my posts can figure out these things for themselves. (And yes, your assumption about the types sounds about right.)
I don’t mean to be rude – the main reason I don’t write articles on a site like CodeProject is that the comments mostly ask obvious questions of things they should be able to figure out for themselves. I would not be able to answer the comments without being rude and insulting, which is counter to the reasons for sharing knowledge in the first place.
Thank you very much!! This is very very helpful to me.
Hi,
Can you tell me how to get the position of the image from the PDF file?
Thanks in advance..
I don’t recall, as it’s been a while since I used this.
In any case, I wrote the article because it is not intuitive how to get at just the image objects in a PDF. Getting the position of elements inside a PDF however, is intuitive, and this is something you should be able to figure out on your own in a couple of minutes, not something to ask random blog writers of tangentially related posts.
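For readers who land here with the same question, one possible approach is sketched below. It assumes iTextSharp 5.x, where ImageRenderInfo exposes GetImageCTM() and the parser's Matrix type exposes the I11, I22, I31 and I32 index constants; check both against the version you have installed. The ImagePositionListener class name is hypothetical.

```csharp
using System;
using iTextSharp.text.pdf.parser;

// Hypothetical listener for illustration: prints where each image is drawn.
internal sealed class ImagePositionListener : IRenderListener
{
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        // The CTM maps the image's 1 x 1 unit square onto the page.
        // Assumption: GetImageCTM() is available in your iTextSharp version.
        Matrix ctm = renderInfo.GetImageCTM();

        float x = ctm[Matrix.I31];      // horizontal translation, in points
        float y = ctm[Matrix.I32];      // vertical translation, in points
        float width = ctm[Matrix.I11];  // horizontal scale
        float height = ctm[Matrix.I22]; // vertical scale

        Console.WriteLine("Image at ({0}, {1}), size {2} x {3}", x, y, width, height);
    }
}
```

Pass an instance to ProcessContent in the same way as the ImageRenderListener in the post. PDF coordinates start at the bottom-left corner of the page, so for an unrotated image (x, y) is its bottom-left corner and (x + width, y + height) its top-right corner. Rotated or skewed images need the other matrix entries as well.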
I wonder what would be the best way to do the following with this code:
Instead of extracting the image to a file I need to read the image into a memory stream, however, if it is a compressed image such as *.jpg or *.tiff I would need to decompress it to a bitmap. My ultimate goal is to write a small utility that will automatically remove pages from a PDF if they are blank. And I would determine blankness by the ratio of white to non-white pixels. That way the ratio could even be adjusted during runtime if needed.
There are free libraries for converting jpg to bmp and for deleting pages from a pdf, so all I need to do is figure out how to read the images directly into memory for efficiency reasons and I’m set. I haven’t looked at this too deeply yet, but any ideas to get me pointed in the right direction would be greatly appreciated.
Thanks.
It’s been a long time since I worked with this code, and I have noticed it doesn’t work with some PDFs, for example any that are saved as PDF by mobile phone apps.
Having said that, I think you are over-thinking this… It doesn’t matter what the image format in the PDF is; once you have the image, their GetDrawingImage gets it as a Bitmap anyway. At least, as a System.Drawing.Image descendant. So the embedded format is abstracted away from the in-memory image that you get in the end. If you then want to save it in another image format, you can use the built-in image encoder classes and save it in whatever supported format you prefer.
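As a rough sketch of that suggestion, the helper below uses only System.Drawing. It re-encodes a decoded image into a memory stream and estimates how much of it is white, as the blank-page idea above would need. The ImageMemoryHelpers class, its method names and the default threshold of 250 are all hypothetical choices for illustration.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

internal static class ImageMemoryHelpers
{
    // Re-encodes a decoded image into an in-memory stream in the chosen format.
    public static MemoryStream ToStream(Image image, ImageFormat format)
    {
        var stream = new MemoryStream();
        image.Save(stream, format);
        stream.Position = 0; // rewind so callers can read from the start
        return stream;
    }

    // Fraction of pixels whose R, G and B values are all at or above the threshold.
    public static double WhiteRatio(Image image, byte threshold = 250)
    {
        // Reuse the image if it is already a Bitmap; otherwise make a copy we own.
        Bitmap bitmap = image as Bitmap;
        bool owned = bitmap == null;
        if (owned)
        {
            bitmap = new Bitmap(image);
        }

        try
        {
            long white = 0;
            for (int y = 0; y < bitmap.Height; y++)
            {
                for (int x = 0; x < bitmap.Width; x++)
                {
                    Color c = bitmap.GetPixel(x, y);
                    if (c.R >= threshold && c.G >= threshold && c.B >= threshold)
                    {
                        white++;
                    }
                }
            }
            return (double)white / ((long)bitmap.Width * bitmap.Height);
        }
        finally
        {
            if (owned)
            {
                bitmap.Dispose();
            }
        }
    }
}
```

A page could then be treated as blank when WhiteRatio returns more than some cut-off such as 0.99, with the cut-off adjustable at runtime. GetPixel is simple but slow on large scans; Bitmap.LockBits would be the faster route if performance matters.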
Hi Jerome,
Thanks for the nice info. It has been really helpful. However, I need a little more.
Is it possible to get the coordinates of the image? (Top-left and bottom-right?)
I am trying to locate the rectangle where the image is present within the PDF.
Hi Jerome,
Thanks so much for the free code. It made my job much easier!
Muchas gracias amigo 🙂
This topic was a great help with my current project.
I linked my blog to this post… I hope I have your leave to do so?
I don’t mind. Glad it still works…
I wrote that code in 2011 and didn’t think anything much of it at the time. It was written when I was tweaking my head off on crystal meth, before I turned my life around. Yet I have never been able to write anything quite as popular since.
Hi
Oops, I pressed Enter too soon… but anyway, many thanks for sharing your approach. It worked for me right away after recompiling on Windows 10 with VS2015 CE.
Best regards,
Peter
Wow, I’m glad it did.
It’s so weird, this post seems to be my single most popular post, and was originally written on another blog back in 2011 while I was high on crystal meth and had been awake for several days. I was out of my mind, yet somehow figured out how to use that library and shared my code, and seemed to be the first person to do so in c#.
None of my other posts have come close to the number of views or the amount of praise, and even though I’ve been clean and very much normal and sane for years now, nothing I write comes close to this.
But I am glad it worked for you. I suspect that it doesn’t work for scanner-produced PDF files, but normal PDF documents with embedded images seem to be covered by it quite well.
In RenderImage(), it’s better to do string extension = "." + image.GetImageBytesType().FileExtension.ToString(). In my case it made a TIFF file extract properly (4 colors, CMYK FlateDecode stream).
So the full code for RenderImage() shrinks down to this:
public void RenderImage(ImageRenderInfo renderInfo)
{
    PdfImageObject image = renderInfo.GetImage();
    PdfName filter = (PdfName)image.Get(PdfName.FILTER);
    if (filter != null)
    {
        System.Drawing.Image drawingImage = image.GetDrawingImage();
        string extension = "." + image.GetImageBytesType().FileExtension.ToString();
        this.Images.Add(drawingImage, extension);
    }
}
Very nice… Will have to try this when I have a chance. TBH I haven’t looked at this code for years.
I did write it in my bad old days while tweaking my head off on meth after being awake for several days, and have been amazed ever since that the code works at all and remains my most popular post. (Almost three years clean now.)
Thanks, great.
PdfImageObject image = renderInfo.GetImage();
error: The color space [/Indexed, /DeviceCMYK, 1, 26 0 R] is not supported
Nice code..
Thanks for the code.
For anyone interested, I successfully used this as the first step in OCRing scanned PDFs. I saved the image (which in a scanned PDF usually is 1 image per page) as .tif and then used the free Microsoft Office Document Imaging (downloaded as part of the free Sharepoint Designer 2007 from MS site) via COM to OCR and return the text.
Code:
// steps as per the post above to get the dictionary of images…
// save the temp image as tif
images.ElementAt(0).Value.Save(@"C:\temp\ocr.tif", System.Drawing.Imaging.ImageFormat.Tiff);
// with the Microsoft Document Imaging COM object…
MODI.Document md = new MODI.Document();
md.Create(@"C:\temp\ocr.tif");
md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, false, false);
MODI.Image img = (MODI.Image)md.Images[0];
// dump the OCR'd text to console
Console.WriteLine(img.Layout.Text);
Hope that’s helpful for someone. Seemed simpler than some of the other OCR solutions for C# and the performance is quite good.
2019 and still no good examples…except yours! Thank you very much.
How can I get the whole page as an image, not only the images on the page?
Wow, I wrote this years ago and I’m surprised if it still works. This is about extracting images that were saved as image streams to the PDF. It isn’t suitable for saving pages as images. There are probably better ways to do that but I imagine you can use the classes in System.Drawing to capture a screenshot of a window.
Graphics.CopyFromScreen(Bounds of your window)
Second answer on here has it… all you’d need to do is have something that can render the PDF in a window first.
https://stackoverflow.com/questions/1163761/capture-screenshot-of-active-window
For anyone interested,
image.Get(PdfName.FILTER) does not always return a PdfName. It may return a PdfArray. So I tweaked the part where the filter is determined. My code:
var filterObj = image.Get(PdfName.FILTER);
PdfName filter = null;
if (filterObj is PdfName) {
    filter = (PdfName)filterObj;
} else if (filterObj is PdfArray array) {
    foreach (var item in array) {
        if (item is PdfName) filter = (PdfName)item;
    }
}
Can I get the position of an image on that page?