Background/Problem
Investing has always been a passion of mine, and I have managed a personal investment portfolio since high school.
While most company-related events can be monitored by following press releases and SEC filings, market-moving events for pharmaceutical and biotechnology companies often have to do with their clinical trials.
clinicaltrials.gov is the go-to repository for viewing publicly and privately funded clinical trials conducted around the world. The standard process to view a specific company’s ongoing clinical trial is as follows:
- head over to clinicaltrials.gov
- enter a relevant search term (company name, drug name, condition/disease, etc.)
- locate the specific trial and click on the study link for more information
However, as an investor, I want to know when a clinical trial is updated without having to go to clinicaltrials.gov, search the relevant term, and pull up the specific trial that I am interested in every single time.
This was the motivation behind building this scraper. As soon as a clinical trial is updated for a company that I am following, I should be notified by email with a link to that trial.
ClinicalTrials.gov API
Thankfully, clinicaltrials.gov has an API that makes it fairly easy to grab the companies and trials that I am after. Otherwise, I might have had to build a more traditional crawler using tools like Scrapy or Selenium.
The API can be found here if you’re interested. The clinical trials API offers 3 query URL types. I was interested in a couple of pharmaceutical companies that each had several ongoing clinical studies.
Thus, the study fields URL type made the most sense for my purposes. Here’s the description of the study fields query URL from the website:
“Retrieves the values of one or more fields from up to all study records returned for a submitted query by default. Returns up to 1,000 study records per query when the minimum rank and maximum rank parameters are set in a query URL and up to all study records using the Study Fields interactive demonstration.”
Once I had decided on the query URL type, picking the query parameters was fairly straightforward. I wanted the company name, the date the clinical trial was last updated, and the NCTId (the unique trial identification code).
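To make the query concrete, here is a minimal sketch of how such a study fields URL could be assembled. It is not the code from my repo. The base URL, the parameter names (expr, fields, min_rnk, max_rnk, fmt), the field names LeadSponsorName and LastUpdatePostDate, and the company name "Example Pharma" are assumptions for illustration, and the endpoint may have changed since this was written, so check them against the API documentation.

```python
from urllib.parse import urlencode

# Assumed base URL of the study fields endpoint; verify against the API docs.
BASE_URL = "https://clinicaltrials.gov/api/query/study_fields"


def build_study_fields_url(search_term, fields, min_rank=1, max_rank=1000, fmt="xml"):
    """Return a study fields query URL for one search expression."""
    params = {
        "expr": search_term,          # company name, drug name, condition, ...
        "fields": ",".join(fields),   # comma-separated list of field names
        "min_rnk": min_rank,          # first study record to return
        "max_rnk": max_rank,          # last study record to return (1,000 cap per the docs quoted above)
        "fmt": fmt,                   # response format: "xml" or "json"
    }
    return f"{BASE_URL}?{urlencode(params)}"


# "Example Pharma" is a made-up company name used only for illustration.
url = build_study_fields_url(
    "Example Pharma",
    ["NCTId", "LeadSponsorName", "LastUpdatePostDate"],
)
print(url)
```

The resulting string can be passed straight to requests.get to fetch the response.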
Scraper Setup
Using the requests library in Python, I could easily make an HTTP request to the specific URLs that I was after. When I wrote this scraper, I for some reason opted for XML (Extensible Markup Language) as opposed to JSON (JavaScript Object Notation).
Thus, I used Beautiful Soup 4, a powerful Python library for parsing HTML and XML files. After converting the XML into a more accessible format with Beautiful Soup, I wrote a few regular expressions to pull out each trial’s last-updated date.
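As a rough illustration of the extraction step, the sketch below pulls the NCTId and last-updated date out of an XML response using only regular expressions from the standard library (it skips the Beautiful Soup pass, so it is a simplification rather than the code from my repo). The XML shape, the tag and attribute names, the date format, and the trial IDs in SAMPLE_XML are all assumptions made up for the example.

```python
import re

# Hypothetical response fragment; the real XML layout may differ.
SAMPLE_XML = """
<StudyFieldsList>
  <StudyFields Rank="1">
    <FieldValues Field="NCTId"><FieldValue>NCT00000001</FieldValue></FieldValues>
    <FieldValues Field="LastUpdatePostDate"><FieldValue>March 3, 2020</FieldValue></FieldValues>
  </StudyFields>
  <StudyFields Rank="2">
    <FieldValues Field="NCTId"><FieldValue>NCT00000002</FieldValue></FieldValues>
    <FieldValues Field="LastUpdatePostDate"><FieldValue>January 15, 2021</FieldValue></FieldValues>
  </StudyFields>
</StudyFieldsList>
"""

# One <StudyFields ...> block per study; \b keeps <StudyFieldsList> from matching.
STUDY_PATTERN = re.compile(r"<StudyFields\b[^>]*>(.*?)</StudyFields>", re.DOTALL)
FIELD_TEMPLATE = r'<FieldValues Field="{name}">\s*<FieldValue>(.*?)</FieldValue>'


def first_value(block, name):
    """Return the first value of the named field in one study block, or None."""
    match = re.search(FIELD_TEMPLATE.format(name=re.escape(name)), block, re.DOTALL)
    return match.group(1).strip() if match else None


def extract_trials(xml_text):
    """Return (nct_id, last_update_text) pairs, skipping incomplete studies."""
    trials = []
    for block in STUDY_PATTERN.findall(xml_text):
        nct_id = first_value(block, "NCTId")
        updated = first_value(block, "LastUpdatePostDate")
        if nct_id and updated:
            trials.append((nct_id, updated))
    return trials


trials = extract_trials(SAMPLE_XML)
print(trials)
```

Matching one study block at a time keeps each ID paired with its own date even if a field is missing from some record.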
Then, using Python’s datetime module, I checked whether each clinical trial had been updated recently. If it had, I appended the relevant clinical trial link to a list.
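The recency check could look something like the sketch below. The date format ("March 3, 2020"), the one-day window, the trial page URL pattern, and the sample trial IDs are assumptions for illustration, not details taken from my repo.

```python
from datetime import date, datetime, timedelta

# Assumed pattern for a trial's page; adjust to whatever the site currently uses.
TRIAL_URL = "https://clinicaltrials.gov/study/{nct_id}"


def was_updated_recently(date_text, today, window_days=1):
    """True if the date falls within the last `window_days` days (inclusive)."""
    updated = datetime.strptime(date_text, "%B %d, %Y").date()
    return timedelta(0) <= today - updated <= timedelta(days=window_days)


def recent_trial_links(trials, today=None, window_days=1):
    """Return links for the (nct_id, date_text) pairs that were updated recently."""
    today = today or date.today()
    return [
        TRIAL_URL.format(nct_id=nct_id)
        for nct_id, date_text in trials
        if was_updated_recently(date_text, today, window_days)
    ]


# Made-up trial IDs and dates, checked against a fixed "today" so the result is repeatable.
sample_trials = [("NCT00000001", "March 3, 2020"), ("NCT00000002", "January 15, 2021")]
links = recent_trial_links(sample_trials, today=date(2020, 3, 4))
print(links)
```

Passing today in as a parameter makes the check easy to test; when the scraper runs on a schedule, leaving it out falls back to the current date.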
Finally, using yagmail, a Gmail/SMTP client, I set up an email to myself containing the list of relevant clinical trial URLs. I also set up OAuth 2.0, a widely used authorization framework, to authenticate with Gmail.
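For the email step, a minimal sketch with yagmail might look like this. The subject line, function names, addresses, and credentials file path are placeholders I made up for the example; only yagmail.SMTP with its oauth2_file argument and the send method come from the library itself.

```python
def build_email(links):
    """Return (subject, body) for a list of updated trial links."""
    subject = f"{len(links)} clinical trial update(s)"
    body = "\n".join(links)
    return subject, body


def send_update_email(sender, recipient, links, oauth2_file):
    """Send the links via Gmail; returns False without sending if there are none."""
    if not links:
        return False
    # Third-party dependency (pip install yagmail), imported lazily so that
    # build_email can be used and tested without it installed.
    import yagmail

    subject, body = build_email(links)
    # oauth2_file points at a JSON credentials file; yagmail prompts to create
    # it on first use if it does not exist yet.
    yag = yagmail.SMTP(sender, oauth2_file=oauth2_file)
    yag.send(to=recipient, subject=subject, contents=body)
    return True


# Placeholder link used only to show the message that would be sent.
subject, body = build_email(["https://clinicaltrials.gov/study/NCT00000001"])
print(subject)
print(body)
```

Calling send_update_email("me@gmail.com", "me@gmail.com", links, "~/oauth2_creds.json") would then deliver the message, with both the address and the file path standing in for real values.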
Again, please find my associated code and repo here if you want to go into more detail.
Conclusion
If you made it all the way down here, I appreciate you reading this far!
One particularly interesting extension could be to hook up a front end to this. I could prompt a particular user for an email and a search term, capture that input, and set up automatic emails to that individual whenever a clinical trial that the user is interested in is updated. I will have to get around to this soon! It would be an interesting and fun project.
The associated code is here.
Thanks for reading and let me know what you think! Feel free to email me at abhin@abhin-sharma.com. I’d love to hear your thoughts, questions, or any feedback.