Fall 2016 – Week 4: Extending the LVI pipeline

Monday, September 12, 2016: 9:54am – 6:06pm (8.2 hours)

Monday morning started by wrapping up the Bootstrap Learning paper. Even after combing through the sections on interpreting relevance and the strategy for converting to ILP, I wasn’t completely sold on how it worked, though the paper made frequent allusions to the code.

At 11:30am Kaushik and I talked through the next few steps in the pipeline, which included the two of us going through the files and discussing the important aspects of each.

As a side note, we thought that the RDN-boost code might benefit from having a visual output for the trees, at least more so than the text documents.  There is an output folder called dotFiles (RDN-Boost/data/train/models/bRDNs/dotFiles/) that contains a fairly human-readable output of the tree structure, which looks similar to an XML/JSON output.  We suspect that parsing the documents with some kind of graph-visualization tool wouldn’t be too hard. Since we’re attempting to make the boosting code user-friendly during the hackathons, having human-readable outputs for the learned trees might be helpful.
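Rendering those trees could be as simple as pointing Graphviz at the folder. A minimal sketch, under the assumption that the dotFiles really are Graphviz .dot files (the scratch directory and sample tree below are stand-ins for the real dotFiles/ folder):

```shell
# Build a scratch copy of the expected layout with one sample tree
# (assumption: the dotFiles are Graphviz .dot files).
DOTDIR=$(mktemp -d)/dotFiles
mkdir -p "$DOTDIR"
printf 'digraph tree0 { "root" -> "leaf"; }\n' > "$DOTDIR/tree0.dot"

# Render each learned tree to a PNG if Graphviz is installed;
# otherwise just report which trees were found.
for f in "$DOTDIR"/*.dot; do
    [ -e "$f" ] || continue
    if command -v dot >/dev/null 2>&1; then
        dot -Tpng "$f" -o "${f%.dot}.png"
    fi
    echo "found tree: $(basename "$f")"
done
```

In the real pipeline the loop would point at RDN-Boost/data/train/models/bRDNs/dotFiles/ instead of the scratch directory.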

Minutes from our lab meeting:

  • New rule for submitting papers: “Always have submission time memorized.”
  • Tuesday: emergency reading of Phillip’s paper, read and give feedback on weak points.
  • 10:00am Friday: Lab Picture Day (new photo for website)
  • 10:30am Friday: Phillip will give a practice talk.
  • For the next few Fridays, Professor will give a STARAI talk (time permitting).
  • Next Monday: Kaushik and I will give a 10-minute presentation on our progress.
  • Create a common STARAI resources cloud folder, upload slides for talks there.

Hackathon started promptly after our meeting.

Phillip, Kaushik, and I spent some time talking about how to work with data, types, predicates, and modes.  After some whiteboarding we concluded that it was an interesting problem that would be difficult to generalize. A good place to focus would be the background.txt file, where the modes and bridges are set for RDN-boost.

[Photo: the STARAI whiteboard, September 12, 2016]

After about fifteen minutes of brainstorming between the three of us, we concluded that setting modes automatically would be difficult at best. Assisting the user would most likely be done when creating predicates.  In this example, we considered a dataset of students and professors with relations describing them.  Predicates are created by setting modes, then RDN-boost finds the walks between the relations.

I spent some time on a bash script that generates all of the types, given a dataset in the form of a facts file.

# run the script with ./typesFromData.sh train_facts.txt
INPUTFILE="$1"

# Unique predicate names: everything before the first "("
PREDICATES=$(cut -s -d "(" -f 1 "$INPUTFILE" | sort -u)

while read -r line; do
    # Grab the first fact that uses this predicate
    DOMAIN=$(grep "$line" "$INPUTFILE" | head -n 1)
    # Arity = number of commas in the fact, plus one
    NUMBEROFTYPES=$(( $(grep -o "," <<< "$DOMAIN" | wc -l) + 1 ))
    echo "$line $NUMBEROFTYPES"
done <<< "$PREDICATES"

In this case we return:

beginningWordInSentence 2
endingWordInSentence 2
LemmaOfWordInSentence 3
midWordInSentence 2
nextWordInSentence 3
POSInSentence 3
wordInSentenceNumber 2
wordStringInSentence 3

The number describes how many types will need to be set: “LemmaOfWordInSentence 3” would become something like “LemmaOfWordInSentence(+SID,+WID,#WLEMMA),” referring to the sentence ID, the word ID, and a constant describing the lemma, respectively.
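As a small step toward assisting the user, the predicate/arity pairs above could be expanded into placeholder mode lines for background.txt. A sketch under assumptions: the `mode:` line shape mirrors the examples in this log, but the `+TypeN` arguments are generic stand-ins, and the real +/-/# markers and type names (e.g. +SID, +WID, #WLEMMA) would still be filled in by hand:

```shell
# Expand "predicate arity" pairs into placeholder mode lines.
# The +TypeN arguments are stand-ins for hand-chosen markers and types.
generateModes() {
    while read -r pred arity; do
        args=""
        for ((i = 1; i <= arity; i++)); do
            args+="+Type$i,"
        done
        echo "mode: $pred(${args%,})."
    done
}

MODES=$(generateModes <<'EOF'
beginningWordInSentence 2
LemmaOfWordInSentence 3
EOF
)
echo "$MODES"
# prints:
# mode: beginningWordInSentence(+Type1,+Type2).
# mode: LemmaOfWordInSentence(+Type1,+Type2,+Type3).
```

The output of typesFromData.sh could be piped straight into generateModes to draft a background file for the user to edit.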

However, this solution assumes that the data is already in a certain format to begin with. A general solution that works across different datasets remains unsolved, as does setting the modes and determining what each type refers to.

I have a lot of problem solving left.

Tuesday, September 13, 2016:

As someone who never misses class, I’m regretting not being able to attend Professor’s full lecture. I have gaps in my knowledge.  We discussed decision trees: mutual information, entropy, advantages of discrete features (and domain experts), fitting/overfitting, and ID3.

Wednesday, September 14, 2016: 9:42am – 3:54pm (6.2 hours)

After reviewing our progress, Homa and Patrick want us to run the pipeline on docs800. Following is an overview of that process.

  • Find secondary shares in each file in docs800/:

$> for file in docs800/*; do echo $file && echo $file >> OUTPUT.txt && bash strip.sh $file > TEMPO.tmp && python extractFinancialLines.py TEMPO.tmp >> OUTPUT.txt; done

No secondary share
No secondary share
Common stock offered by the selling stockholder 20,000,000 Score of sentence: 11
Common stock offered by the selling stockholder 20,000,000 Score of sentence: 11
  • Convert OUTPUT.txt to a .csv file for easier viewing.

$> cp OUTPUT.txt initialtest.txt && python splitToCSV.py initialtest.txt && rm -f initialtest.txt && mv extractionOutputSecondary500.csv extractionOutputSecondary800.csv

docs800/AAAP-000157104915009106-t1502530-424b1.txt, "No secondary share"
docs800/AAVL-000119312515005317-d832044d424b4.txt, "Common stock offered by the selling stockholders 390,000 Score of sentence: 11"
docs800/ABCO-000119312515017821-d841810d424b5.txt, "Common stock offered by selling stockholder 1,050,000 Score of sentence: 12"
docs800/ABTX-000119312515340382-d917185d424b4.txt, "No secondary share"
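The pairing step that splitToCSV.py performs can be sketched in a few lines of awk. This is an assumption about the format (the real script may work differently): OUTPUT.txt holds a docs800/ filename line followed by its result line(s), and each result gets paired with the most recent filename:

```shell
# A small sample in the assumed OUTPUT.txt layout: filename, then result.
cat > OUTPUT.sample <<'EOF'
docs800/AAAP-000157104915009106-t1502530-424b1.txt
No secondary share
docs800/AAVL-000119312515005317-d832044d424b4.txt
Common stock offered by the selling stockholders 390,000 Score of sentence: 11
EOF

# Pair every result line with the last filename seen and emit CSV rows.
ROWS=$(awk '/^docs800\// { file = $0; next }
            { printf "%s, \"%s\"\n", file, $0 }' OUTPUT.sample)
echo "$ROWS"
```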
  • Find all instances of secondary shares, label them as positive examples, format them:

$> grep "Score of sentence" extractionOutputSecondary800.csv >> posEx.txt
$> while read -r in; do VAR=$(echo "$in" | tr -d '"') && bash buildposEx.sh "$VAR" >> TEMPORARY2.tmp; done < posEx.txt
$> mv TEMPORARY2.tmp posEx.txt

AAVL-000119312515005317-d832044d424b4:"Common stock offered by the selling stockholders 390,000":390000
ABCO-000119312515017821-d841810d424b5:"Common stock offered by selling stockholder 1,050,000":1050000
ABY-000119312515012977-d832817d424b4:"selling shareholder named in this prospectus is offering 9,200,000":9200000
ACHC-000119312515288231-d13844d424b7:"Common stock offered by the selling stockholders 5,033,230":5033230
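The per-line reformatting that buildposEx.sh performs can be illustrated on one row. This is a self-contained sketch of the transformation, not the actual script: strip the docs800/ prefix and the .txt extension, drop the score suffix from the quoted sentence, and append the share count with commas removed:

```shell
# One CSV row from the step above (the real script loops over posEx.txt).
line='docs800/AAVL-000119312515005317-d832044d424b4.txt, "Common stock offered by the selling stockholders 390,000 Score of sentence: 11"'

# Document ID: drop the directory prefix and everything from ".txt," on.
doc=$(echo "$line" | sed 's|^docs800/||; s|\.txt,.*||')
# Extracted sentence: keep the quoted text, minus the score suffix.
text=$(echo "$line" | sed 's|^[^"]*"||; s| Score of sentence: [0-9]*"$||')
# Share count: the last number in the sentence, commas removed.
count=$(echo "$text" | grep -oE '[0-9][0-9,]*' | tail -n 1 | tr -d ',')

echo "$doc:\"$text\":$count"
```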

The code works beautifully even with the blatant inefficiencies.  When we package the software Kaushik and I will condense a couple of these steps and make it so an input can be specified (i.e. input directory, primary/secondary).

We ran the full pipeline for the secondary shares, producing strong results.  There is an indexing problem in makeResults.py that prevents some of the returns from being displayed correctly; for now the list was short enough to fix by hand.  Kaushik started the pipeline for primary shares.

Phillip gave a practice talk (a practice run for his Friday practice talk, rather) on “Actively Interacting with Experts: Probabilistic Logic Approach.”  Professor reminded us as a lab what we need to focus on when writing papers.

Thursday, September 15, 2016: 1:00pm – 2:12pm (1.2 hours)

Shuo was running a fever, so Professor led reading group with “Probabilistic Theorem Proving.”

Friday, September 16, 2016: 10:06am – 11:24am; 12:18pm – 3:18pm (4.3 hours)

[Photo: the STARAI group members, taken by Alexander]

Phillip started our morning with the polished version of the practice talk he gave on Wednesday.  Lab picture day was immediately after (I took the first photo, I’ll be in a separate one or we’ll get someone to take a full group photo).

I had a Serve-IT meeting and took off for about an hour, but returned and rewrote extractFinancialLines.py.  The complicated steps that I outlined on Wednesday are now handled by a single function; running extractFinancialLines.py docs500/ now returns a posEx.txt file with all of the lines extracted from the files in a directory.

My next step is to write an interface and automatically run strip.sh (or something equivalent) on the files before extractFinancialLines.py.

Kaushik and I did some preparation for our presentation on Monday; I’ll put the finishing touches on it Monday morning.