If you want to do custom statistical analysis on US patent applications and awards, there’s no substitute for downloading the raw archives from the US Patent Office (USPTO) and manipulating them in the database of your choice. However, in 2005 the USPTO updated its XML encoding standard to ST.36 (also referred to as “Patent Grant Bibliographic Data/XML”), the World Intellectual Property Organization’s (WIPO) XML standard. On the plus side, this means US data files will be published in a more consistent and comparable international standard; on the downside, that format can be hard to work with. In this post, I detail how and where to obtain US patent data files, and how to extract information from them using simple open-source unix/linux tools.
What you will need:
- An ftp browser, or the unix ftp application
- An unzipping tool to unarchive the datafiles
- An installed recent version of xmlstarlet
- A unix or linux shell
- And a database to import and manipulate the output when you’re done.
Although the USPTO has a central page describing its formats and ftp archive, it doesn’t bother to provide you an actual link to the XML document type definition (DTD). So, here are a few important links for you to refer to regarding future changes and updates:
Once you have all the parts ready, open a console window to your linux or unix shell (I’m working on Apple OSX-Darwin, so some details may vary depending on what you’re using).
- Open an anonymous ftp connection to the USPTO patent data server: ftp://ftp.uspto.gov/grants
- Select, download, and extract (unzip) the year and week files that interest you. Patent award data are released weekly and organized by date. For patent awards, you will want the compressed (.zip) files beginning with “ipgb”, for example “ipgb070116.zip”.
- Clean the data. The ST.36 files contain content that causes xmlstarlet to crash, such as multiple DOCTYPE and XML declarations in a single file, as well as ampersands (“&”), which have special meaning in XML. Just to be safe, we’ll also strip out any newlines and embedded tabs that might sneak in, since our final file format will be tab-separated, newline-delimited text, along with any pound signs (“#”), which we use as a temporary field delimiter during extraction. Finally, xmlstarlet needs a single enclosing element, so we wrap the entire file in a “<uspto>” start-tag and end-tag. The following shell commands, or something similar, strip all of the problem-causing content without affecting the data you are attempting to extract:
echo '<uspto>' > /tmp/patentconversion.xml
cat ipg*.xml | grep -v '<?xml' | grep -v '<!DOCTYPE' | tr '\n#&\t' ' ' >> /tmp/patentconversion.xml
echo '</uspto>' >> /tmp/patentconversion.xml
- Extract the data. The following console commands will add a header line, and extract most of the useful fields from each patent.
echo 'PatNum#IssueDate#MainClass#FurtherClass#intlClas#Title#inventor_firstname#inventor_lastname#InventorCity#InventorState#InventorCountry#assignee#AssigneeCity#AssigneeState#AssigneeCountry' > /tmp/patentconversion.txt
/sw/bin/xml sel -t -m //us-patent-grant/us-bibliographic-data-grant \
  -v publication-reference/document-id/doc-number -o "#" \
  -v publication-reference/document-id/date -o "#" \
  -v classification-national/main-classification -o "#" \
  -v classification-national/further-classification -o "#" \
  -v classification-locarno/main-classification -o "#" \
  -v invention-title -o "#" \
  -v parties/applicants/applicant/addressbook/first-name -o "#" \
  -v parties/applicants/applicant/addressbook/last-name -o "#" \
  -v parties/applicants/applicant/addressbook/address/city -o "#" \
  -v parties/applicants/applicant/addressbook/address/state -o "#" \
  -v parties/applicants/applicant/addressbook/address/country -o "#" \
  -v assignees/assignee/addressbook/orgname -o "#" \
  -v assignees/assignee/addressbook/address/city -o "#" \
  -v assignees/assignee/addressbook/address/state -o "#" \
  -v assignees/assignee/addressbook/address/country -n \
  /tmp/patentconversion.xml >> /tmp/patentconversion.txt
cat /tmp/patentconversion.txt | tr '#' '\t' > patentextract.txt
All done - now you can import the data as simple tab-separated text values into any spreadsheet or database system. Note, xmlstarlet doesn’t support outputting actual tabs, so we had to do a little workaround: output the pound sign (“#”) or a similar character that doesn’t occur in the data, then use the unix “tr” tool to translate it into tabs at the end.
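Before importing, it may be worth checking that every row of the tab-separated file has the same number of fields as the header line. The following is a minimal sketch of such a check, not part of the original workflow; the function name check_columns is hypothetical, and it assumes a standard awk is available in your shell.

```shell
#!/bin/sh
# Hypothetical helper: report rows whose tab-separated field count
# differs from the header line (line 1) of the given file.
check_columns() {
    awk -F '\t' '
        NR == 1    { want = NF; next }   # remember the header width
        NF != want {                      # flag any row that does not match
            printf "line %d: %d fields, expected %d\n", NR, NF, want
            bad = 1
        }
        END { exit (bad ? 1 : 0) }        # non-zero exit if anything was flagged
    ' "$1"
}
```

Running check_columns patentextract.txt prints one line per mismatched row and returns a non-zero exit status if it found any, so it can also be used as a guard in a longer script.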
While this process works, be warned - patent files can be very lengthy, and processing XML can be very CPU- and time-intensive! You might run into “out of memory” errors if you process too many weekly files at once with xmlstarlet; if that happens, consider writing a shell script to process the files one at a time. I’ve written a shell script to do exactly that, which you can view / download from the following link:
Note, you will have to make the script executable before you can use it (hint: chmod 755 patentconvert), and it requires xmlstarlet to be already installed.
usage: ls <filenames> | ./patentconvert
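For readers who would rather roll their own, here is a rough sketch of the one-file-at-a-time idea. It is not the patentconvert script itself: the function names clean_patent_xml and process_each are hypothetical, and the extraction command is passed in as arguments rather than hard-coded, on the assumption that it accepts a cleaned XML file name as its last argument.

```shell
#!/bin/sh
# Hypothetical sketch, not the author's patentconvert script.

# Wrap one weekly file in a <uspto> root element and strip the problem
# content, mirroring the cleaning step described earlier in the post.
clean_patent_xml() {
    printf '%s\n' '<uspto>'
    # Drop XML/DOCTYPE declaration lines; map newline, '#', '&' and tab to spaces.
    grep -v -e '<?xml' -e '<!DOCTYPE' "$1" | tr '\n#&\t' '    '
    printf '\n%s\n' '</uspto>'
}

# Read file names from stdin, one per line. Clean each file into a
# temporary file and pass that file to the command given as arguments.
process_each() {
    while IFS= read -r file; do
        [ -f "$file" ] || { echo "skipping $file" >&2; continue; }
        tmp=$(mktemp "${TMPDIR:-/tmp}/patent.XXXXXX") || return 1
        clean_patent_xml "$file" > "$tmp"
        # </dev/null keeps the command from consuming the list of file names.
        "$@" "$tmp" < /dev/null
        rm -f "$tmp"
    done
}
```

With these functions loaded, something like ls ipg*.xml | process_each ./extract_fields would clean and extract each weekly file separately, where extract_fields stands in for a script of your own containing the xmlstarlet command. Only one cleaned file exists at a time, which keeps memory and disk use bounded by the largest single week.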