ProHealth Summer REU – Week 9 – July 22, 2016


Date                 Hours Worked       Total Hours   Wall-sits   Caffeine
Monday, 7/18/16      8:25am – 5:40pm    9hr15min      0min        3 cups: 273mg
Tuesday, 7/19/16     8:48am – 5:06pm    8hr18min      0min        0 cups: 0mg
Wednesday, 7/20/16   8:33am – 5:45pm    9hr12min      0min        4 cups: 364mg
Thursday, 7/21/16    8:36am – 6:15pm    9hr39min      0min        0 cups: 0mg
Friday, 7/22/16      7:35am – 6:00pm    10hr25min     3.5min      2 cups: 182mg



My morning began with more work on the icons for Devon and Aislinn’s SENSE app.  After reviewing my progress from last week, I was extremely dissatisfied with the results, so I scrapped the designs and started over.  Previously I had scanned my paper copies and adapted them in Photoshop, with horrendous results.  This time I started from scratch, using my drawings only as references and creating everything directly in Photoshop, with much better results.

With a few complete, I jumped back into my main project, which had me reading through the DeepDive documentation a little longer.  Those who have been reading my blog entries for some time might remember the trouble I had getting Fedora installed on my laptop; I finally fell back on my last-resort option and installed a copy in VirtualBox.

DeepDive installed fairly easily (run the command, get options)
$ bash <(curl -fsSL

Installation seemed pretty straightforward, but the commands didn’t want to run.  The quickstart guide didn’t offer any specific debugging recommendations, but after trying a few things I was under the impression that the PATH variable was configured incorrectly.  Around this time I received a text from Savannah saying that she and Dileep were trying to get some scripts to run in the lab downstairs; STARAI’s weekly lab meeting would begin soon as well, so I headed down.

Savannah and Dileep were looking through the openFDA information, and filled me in on what the issues were.

  1. Dileep had a Python script for parsing the openFDA data.
  2. The bulk of the work was handled by a Python package, nltk (the Natural Language Toolkit).
  3. nltk throws a massive fit when non-ASCII characters appear in the text files.

At the lab meeting, Sriraam introduced us to a representative from Crane, and each of us introduced ourselves and our work.  Savannah and I walked through the order of operations to get everything to run properly, then I headed off to GRE prep.


Another morning spent working on icons.  After some feedback from Devon, Aislinn, and Majdah, I realized I had left white backgrounds where blank ones would be more helpful.  I cleaned up the previous icons and worked on the next set of questions.

Following what Savannah and I discussed yesterday, the steps seemed fairly straightforward.

  1. Convert the text files to US-ASCII.
  2. Install the nltk packages on Odin so we can parallelize the parsing.
  3. Write the script to run in parallel.
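Step 1 can be sketched in Python using only the standard library (a hedged sketch of the idea, not the conversion script we actually used): NFKD decomposition splits accented characters into a base letter plus combining marks, and encoding with errors="ignore" then drops anything that has no ASCII equivalent.

```python
import unicodedata

def to_ascii(text):
    """Best-effort transliteration to US-ASCII.

    NFKD decomposition turns, e.g., 'é' into 'e' + a combining accent;
    encoding with errors='ignore' then drops the accent, along with any
    symbol that has no ASCII counterpart (bullets, arrows, ™, ...)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

print(to_ascii("Abacavir® dose: café"))
```

The trade-off is that symbols like ≥ or • vanish entirely rather than being spelled out, which may or may not matter depending on what the parser needs downstream.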

I put in a request at the SOIC Help Desk at 11:00:56am, outlining the nltk package and some recommendations on its installation.

At 13:29:23, I received the news that this would likely be impossible on Odin, and that the version of python that was installed there wouldn’t be able to support it (after checking, I found that it was running Python 2.4.3 from March 2006; for reference, Python 2.7.3 was released in April 2012).  Bruce asked how I planned to run my jobs, how many nodes I needed, and whether I had considered using Big Red II or the Karst cluster instead.

I sent my response at 14:06:14, outlining that I would use srun, allocate around 50 nodes, and that I would consider Big Red II or Karst but wasn’t as familiar with their Moab scheduler.

At 14:32:49, Bruce had the fantastic news that the package was successfully installed on Silo, Tank, and Hulk, and that I could run my code there so long as “you don’t fire up too many of them at the same time.”

I jumped onto Silo and pulled in the files I’d been working with.  Sure enough, nltk was waiting for me, and I downloaded ‘punkt’ and ‘averaged_perceptron_tagger’ following Dileep’s instructions.  There were a few errors when I actually ran the script on Abacavir+Sulfate, though, and after combing through the traceback I found there was another set of files I had missed, called ‘wordnet.’  Tracking down the cause was a bit of a challenge, but fixing it was as simple as running'wordnet').

With everything finally installed, I ran the script again.  It successfully split the files into sentences (1–2500), but seemed to crash when it tried to parse them.

I went downstairs and talked with Devendra, and while we talked I ran the script on the Fedora virtual machine to do some debugging.  We put a plan together for the next few days.


Another morning, another set of icons.  One additional fix: it turned out the Microsoft Band uses inverted colors (focusing on negative space instead of positive space).  I altered a few of the images and sent them to Devon for testing.

I realized that, as an undergrad, I did not have access to Hulk or Tank, but I wanted to test on both to find out whether yesterday’s problem was specific to Silo.  This was easily resolved after a couple more emails.

Technically there was another issue I needed to resolve before I could run on any of the files.  Almost all of the 2027 text files contained Unicode characters outside the ASCII range.  These could be almost anything; some of the ones Dileep and Savannah found earlier in the week were •, ≥, ≤, ’, é, ”, “, †, ®, –, ↓, ↑, ï, and ™.  When nltk encountered any of them, it would throw an error and stop executing.
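Before stripping anything, it helps to know exactly which characters are present. A small sketch of that survey step (my own illustration, not part of our actual pipeline):

```python
from collections import Counter

def non_ascii_report(text):
    """Count every character outside the ASCII range (codepoint > 127).

    Returns a Counter mapping each offending character to how many times
    it appears, so the symbols can be reviewed before being stripped."""
    return Counter(ch for ch in text if ord(ch) > 127)

sample = "Doses ≥ 300 mg were flagged • twice •"
for ch, count in non_ascii_report(sample).items():
    print(f"U+{ord(ch):04X} {ch!r} x{count}")
```

Running this over all the files first would replace the manual hunt for symbols with a complete list in one pass.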

Their original solution was to manually comb through the documents for these characters, then add each one to a list of symbols to strip from the document.  Savannah also mentioned trying to convert between encoding formats, without any success.

To their point, I found out the hard way that converting between character encodings is not an exact science.  I was pretty sure the text files were UTF-8, since they were copied from a website, but converting them from UTF-8 to US-ASCII (iconv -f utf8 -t US-ASCII file) had no effect: every file I ran through still triggered errors everywhere.

After a decent amount of searching, the consensus seemed to be that there is no perfect solution to this problem.  I stumbled onto a Stack Overflow thread with a really interesting suggestion: sed -n 'l0' file.  This prints symbols the way sed sees them: “•” becomes “\342\200\242” (the octal values of its UTF-8 bytes).  Still not perfect, but a step in the right direction.
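As a sanity check on the sed trick, a short Python sketch (my own illustration, not one of our scripts) reproduces the same octal rendering: each non-ASCII character is replaced by the octal escapes of its UTF-8 bytes, leaving plain ASCII untouched.

```python
def octal_escape(text):
    """Render a string the way sed's 'l' (list) command does: each
    non-ASCII character becomes the octal escapes of its UTF-8 bytes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append("".join(f"\\{byte:03o}" for byte in ch.encode("utf-8")))
    return "".join(out)

print(octal_escape("•"))  # prints \342\200\242
```

Seeing the bullet come out as \342\200\242 (octal for the bytes E2 80 A2) confirms the files really are UTF-8.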

When my Sharks Cluster access kicked in, I jumped onto both Hulk and Tank to test the scripts there.  To my bitter disappointment, the same issue occurred: the files could be split into sentences, but nothing could be parsed.


I skipped working on icons in the morning to get to Informatics, wanting to make sure I could solve whatever was wrong with the scripts and put a solid dent in my weekly deliverables.

One of the big things that needed updating was the GitHub repository.  Most of the scripts I had worked on were still filed under my IU GitHub account, but the documentation and other pieces needed to be stored under the ProHealth GitHub page.

I spent quite some time working on our page’s documentation, making sure there was an effective write-up for each script.

At 1:30pm I met Devendra to discuss how to proceed, since running the Python scripts looked fairly hopeless.  He also thought it was strange that the files weren’t parsing properly, but recognized that we had limited time to solve the problem.  We decided to start with a subset that Savannah and I could process on our own machines over the course of a couple days.  I had a dataset of the top specialty drugs, drugs by cost, and most-prescribed drugs from a Fortune 500 company.  Out of those I selected 51 drugs we had solid information about and copied them into a separate folder to focus on.

For the rest of the day I was squashing bugs and testing things.  After reading group, Devendra, Savannah, and I worked out how to proceed.