The release date of the upcoming 0.9.3 version is drawing nearer. Several format classes have already been exposed to Python and in this post I’m going to show you some code snippets. Since it’s impossible to demonstrate all format classes (12 have already been exposed) and all their methods (a single class may contain dozens of methods), the purpose of the snippets below is only to give the reader an idea of what can be achieved.
The SDK organization has changed a bit: because of its increasing size it made sense to subdivide it into modules. Thus, there’s now the Pro.Core module, the Pro.UI one and one module for each format (e.g. Pro.PE).
This is how we can output to text the raw stream of a PDF:
from Pro.Core import *
from Pro.PDF import *
c = createContainerFromFile(fname)
pdf = PDFObject()
pdf.Load(c)
objtable = pdf.BuildObjectTable()
pdf.SetObjectTable(objtable)
oid = PDFObject.OBJID(3, 0)
ret, dict, content, info = pdf.ParseObject(objtable, oid)
out = NTTextBuffer()
out.printHex(content)
print(out.buffer)
Output:
0 1 2 3 4 5 6 7 8 9 A B C D E F Ascii
0000 48 89 24 8D CD 0A 83 30 10 84 EF 81 BC C3 1C 93 H.$....0........
0010 8B 4D 52 63 E3 B5 D0 0A 42 A1 D0 DC C4 83 D4 F8 .MRc....B.......
0020 D3 D6 0A 2A F5 F5 BB B6 B0 CC 2E C3 37 3B 1A 2D ...*........7;.-
0030 67 2A B2 C8 38 D3 C8 A1 F0 80 C6 8A 18 17 14 7B g*..8..........{
0040 94 0A 35 67 BB EC 66 D0 CE 1B D1 83 34 75 48 92 ..5g..f.....4uH.
0050 04 46 C7 B0 0E 53 E0 EC 48 E3 09 3C 1B 4A FB 86 .F...S..H..<.J..
0060 18 43 AF 14 68 19 7D 88 1C 05 52 05 3F 50 D7 DF .C..h.}...R.?P..
0070 C7 73 3B FD FD A7 2B 67 85 B8 CA 58 89 6A 5E 02 .s;...+g...X.j^.
0080 96 2E A0 E9 C3 AB 46 F5 AE 31 8C 52 5B F1 91 C6 ......F..1.R[...
0090 8A 20 95 C0 32 A2 0B 53 D8 CC 48 96 3E E7 EC 44 . ..2..S..H.>..D
00A0 CD 5F 01 06 00 88 1E 2A AA 0D 0A ._.....*...
Streams in PDFs are usually compressed. Here’s how we can decode the same stream:
from Pro.Core import *
from Pro.PDF import *
c = createContainerFromFile(fname)
pdf = PDFObject()
pdf.Load(c)
objtable = pdf.BuildObjectTable()
pdf.SetObjectTable(objtable)
oid = PDFObject.OBJID(3, 0)
ret, dict, content, info = pdf.ParseObject(objtable, oid)
content = pdf.DecodeObjectStream(content, dict, oid)
out = NTTextBuffer()
out.printHex(content)
print(out.buffer)
Output:
0 1 2 3 4 5 6 7 8 9 A B C D E F Ascii
0000 31 20 67 0D 0A 30 2E 35 20 47 0D 0A 31 20 4A 20 1 g..0.5 G..1 J
0010 30 20 6A 20 31 20 77 20 34 20 4D 20 5B 33 20 5D 0 j 1 w 4 M [3 ]
0020 30 20 64 0D 0A 2F 47 53 32 20 67 73 0D 0A 31 20 0 d../GS2 gs..1
0030 69 20 0D 0A 31 39 38 20 36 36 36 20 32 31 34 20 i ..198 666 214
0040 35 38 20 72 65 0D 0A 42 0D 0A 42 54 0D 0A 2F 46 58 re..B..BT../F
0050 32 20 31 20 54 66 0D 0A 31 32 20 30 20 30 20 31 2 1 Tf..12 0 0 1
0060 32 20 32 31 37 2E 38 38 20 36 39 30 20 54 6D 0D 2 217.88 690 Tm.
0070 0A 30 20 30 20 30 20 31 20 6B 0D 0A 30 20 54 63 .0 0 0 1 k..0 Tc
0080 0D 0A 30 20 54 77 0D 0A 5B 28 50 29 34 30 28 61 ..0 Tw..[(P)40(a
0090 73 74 65 20 74 68 65 20 66 69 65 6C 64 20 61 6E ste the field an
00A0 64 20 6D 6F 29 31 35 28 76 29 32 35 28 65 29 30 d mo)15(v)25(e)0
00B0 28 20 74 6F 20 68 65 72 65 29 31 35 28 2E 29 5D ( to here)15(.)]
00C0 54 4A 0D 0A 45 54 0D 0A TJ..ET..
We might also want to iterate through the key/value pairs of a PDF dictionary. Thus, iterators have been implemented everywhere they could be applied. While they don’t yet support the standard Python syntax they are very easy to use:
from Pro.Core import *
from Pro.PDF import *
c = createContainerFromFile(fname)
pdf = PDFObject()
pdf.Load(c)
objtable = pdf.BuildObjectTable()
pdf.SetObjectTable(objtable)
oid = PDFObject.OBJID(3, 0)
ret, dict, content, info = pdf.ParseObject(objtable, oid)
it = dict.iterator()
while it.hasNext():
k, v = it.next()
print(k + " - " + v)
Output:
/Length - 171
/Filter - /FlateDecode
Iterating through the objects of a PDF amounts to the same logic:
from Pro.Core import *
from Pro.PDF import *
c = createContainerFromFile(fname)
pdf = PDFObject()
pdf.Load(c)
objtable = pdf.BuildObjectTable()
it = objtable.iterator()
while it.hasNext():
k, v = it.next()
# print out the object id
print(str(k >> 32))
CFBF (DOC, XLS, PPT, MSI, etc.)
Iterating through the directories of a CFBF can be as simple as:
from Pro.Core import *
from Pro.CFBF import *
def visitor(obj, ud, dir_id, children):
name = obj.DirectoryName(dir_id)
print(name)
return 0
c = createContainerFromFile(fname)
cfb = CFBObject()
cfb.Load(c)
dirs = cfb.BuildDirectoryTree()
cfb.SetDirectoryTree(dirs)
cfb.VisitDirectories(dirs, visitor, None)
Output:
Root Entry
CompObj
Ole
1Table
SummaryInformation
WordDocument
DocumentSummaryInformation
Retrieving a stream is equally easy:
from Pro.Core import *
from Pro.CFBF import *
c = createContainerFromFile(fname)
cfb = CFBObject()
cfb.Load(c)
dirs = cfb.BuildDirectoryTree()
cfb.SetDirectoryTree(dirs)
s = cfb.Stream(1)
b = s.read(0, s.size()) # read bytes
t = NTTextBuffer()
t.printHex(b)
print(t.buffer)
Output:
0 1 2 3 4 5 6 7 8 9 A B C D E F Ascii
0000 01 00 FE FF 03 0A 00 00 FF FF FF FF 06 09 02 00 ................
0010 00 00 00 00 C0 00 00 00 00 00 00 46 18 00 00 00 ...........F....
0020 4D 69 63 72 6F 73 6F 66 74 20 57 6F 72 64 2D 44 Microsoft Word-D
0030 6F 6B 75 6D 65 6E 74 00 0A 00 00 00 4D 53 57 6F okument.....MSWo
0040 72 64 44 6F 63 00 10 00 00 00 57 6F 72 64 2E 44 rdDoc.....Word.D
0050 6F 63 75 6D 65 6E 74 2E 38 00 F4 39 B2 71 00 00 ocument.8..9.q..
0060 00 00 00 00 00 00 00 00 00 00 ..........
SWF
Here’s how to output the disasm of an ActionScript2 Flash file:
from Pro.Core import *
from Pro.SWF import *
c = createContainerFromFile(fname)
swf = SWFObject()
swf.Load(c)
if swf.IsCompressed():
swf.Decompress()
tl = swf.EnumerateTags()
swf.SetStoredTags(tl)
out = NTTextBuffer()
swf.AS2Disassemble(out)
print(out.buffer)
The same can be done for ActionScript3 using the ABCFileObject class.
Class
This is how to disassemble a Java Class file:
from Pro.Core import *
from Pro.Class import *
c = createContainerFromFile(fname)
cl = ClassObject()
cl.Load(c)
cl.ProcessClass()
out = NTTextBuffer()
cl.Disassemble(out)
print(out.buffer)
DEX
This is how to disassemble an Android DEX file class:
from Pro.Core import *
from Pro.DEX import *
c = createContainerFromFile(fname)
dex = DEXObject()
dex.Load(c)
# disassemble the last class
classes = dex.Classes()
token = classes.Count() - 1
out = NTTextBuffer()
dex.Disassemble(out, token)
print(out.buffer)
In the upcoming post(s) I’m going to put it all together and do some very interesting things.
So stay tuned as the best has yet to come!