Friday, July 28, 2017

GSoC 2017 : Week 6 and 7

    Okay, it has been *some* time since I last posted an update on my project. The last two weeks have probably been the busiest of my life: working on my GSoC project, last-minute preparation for the upcoming campus placement drive and, of course, the exams and interviews of various companies. It was a long, lifeless, sleep-deprived, caffeine-ridden mindf*ck of an experience, and the most intense time I've ever been through in my life. 


    Thankfully, after a few failed tests and ample disappointment, I finally cleared the initial rounds, and eventually ended up clearing all 6 rounds of a certain *reputed* company, culminating in a job offer. I was happy, but I still had a big task ahead... and that was completing the Summer of Code. So, I took a "deserved" nap and started working on my project again.


    So, by the end of week 6, I was supposed to complete the user-defined mappings part, since I had to start implementing the JSONpedia library integration from week 7. Custom mapping rules were finalized last week, and this week's major challenge was adding the custom mapper functions. This was a bit more complicated than simply adding new rules, as I had to store settings which could be run as an independent mapper function. After some planning, I came up with a basic yet powerful procedure: I'd store all the features related to a mapper function in a separate settings file (custom_mappers.json), which would then drive a single generic mapper function. Since most of the project was already modular, this was possible.

    The idea was to isolate all the steps in the triple extraction process, implement each of them separately and combine them into a complete mapper function. So, we keep a dictionary entry in a JSON file, each entry representing a mapper function. The first task is identifying section headers, followed by finding keywords to identify subsections, along with the ontology classes/properties for those keywords. We also give the user the power to select the extractor functions used in the process, letting them choose the trade-off between the quality and the quantity of the extraction. A sample mapper function's settings would look something like this:

{
    "MUSIC_GENRE_MAPPER": {
        "headers": {
            "en": ["bands", "artists"]
        },
        "extractors": [1, 2, 3, 4],
        "ontology": {
            "en": {
                "default": "notableArtist",
                "artist": "notableArtist",
                "band": "notableBand",
                "Subgenre": "SubGenre",
                "division": "SubGenre",
                "festivals": "relatedFestivals"
            }
        },
        "years": "Yes"
    }
}
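To make this concrete, here's a minimal sketch of how a generic mapper function could pick an ontology property from such an entry. The `select_property` helper and the inlined settings dictionary are illustrative, not the project's actual API:

```python
# Sketch of a generic mapper driven by a custom_mappers.json entry.
# The settings layout mirrors the sample above; the helper name is
# illustrative, not the actual list-extractor code.

MAPPER_SETTINGS = {
    "MUSIC_GENRE_MAPPER": {
        "headers": {"en": ["bands", "artists"]},
        "extractors": [1, 2, 3, 4],
        "ontology": {
            "en": {
                "default": "notableArtist",
                "artist": "notableArtist",
                "band": "notableBand",
            }
        },
        "years": "Yes",
    }
}

def select_property(header, lang, settings):
    """Pick the ontology property whose keyword occurs in the section
    header, falling back to the 'default' entry when nothing matches."""
    ontology = settings["ontology"][lang]
    for keyword, prop in ontology.items():
        if keyword != "default" and keyword.lower() in header.lower():
            return prop
    return ontology["default"]

print(select_property("List of notable bands", "en",
                      MAPPER_SETTINGS["MUSIC_GENRE_MAPPER"]))  # notableBand
```

The same generic function can then serve every user-defined domain, since everything domain-specific lives in the JSON entry.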

    Then, a common method was written that uses these settings to run the extraction process. The mapper settings are dynamically reloaded each time a new setting is added, so that rulesGenerator can use them right away. After that, we also had to make sure that the mapper-selection logic recognizes the user-defined mapper functions. The following snippet does the trick:

is_custom_map_fn = False
try:
    if lang in eval(domain):
        domain_keys = eval(domain)[lang]  # e.g. ['bibliography', 'works', ...]
    else:
        print("The language provided is not available yet for this mapping")
        return 0
except NameError:  # name not found, i.e. not one of the predefined mappers
    if domain not in CUSTOM_MAPPERS.keys():
        print("Cannot find the domain's mapper function!!")
        print("You can add a mapper function for this mapping using rulesGenerator.py and try again...\n")
        return 0
    else:
        is_custom_map_fn = True
        domain_keys = CUSTOM_MAPPERS[domain]["headers"][lang]
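The dynamic reloading of the settings file can be sketched roughly like this; the function name and default path are my own, and the real extractor's loading code differs in detail:

```python
import json
import os

def load_custom_mappers(path="custom_mappers.json"):
    """Re-read the user-defined mapper settings so that entries added
    by rulesGenerator.py are picked up without restarting the extractor."""
    if not os.path.isfile(path):
        return {}  # no custom mappers defined yet
    with open(path) as settings_file:
        return json.load(settings_file)

# Reloaded whenever a new mapper function is saved:
CUSTOM_MAPPERS = load_custom_mappers()
```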

    And with that, the custom mapper was ready. A lot of implementation details have been omitted here; if you're interested, you can head over to the GitHub page and check out the code ;)

    Week 7's job was to look at the JSONpedia Live service and figure out a way to use the JSONpedia library instead, so that we could get rid of the dependency on the live web service. So, I downloaded the JSONpedia library and went through the code for a while. I also had to come up with a plan to use this library, which is written in Java, in my list-extractor program, which is written in Python.  

    After giving it some thought, I came up with the idea of retrieving the JSON representation using the library, then using json.loads() to load that string into a dictionary and working normally with that dictionary. So, I decided that I could run the library independently, pipe its output into a Python string variable and use it.

    So, I started and completed writing a Java wrapper for the JSONpedia library that takes command-line arguments, parses them, makes the appropriate calls to JSONpedia and prints the output to stdout. This way, I can fork a subprocess that computes the result and pipes it back to the Python program, integrating the two. I used JCommander to parse the command-line parameters of the wrapper, which emulate a Live query. I'm still working on this.
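On the Python side, the subprocess-and-pipe idea can be sketched as follows. The jar name and its command-line flags here are hypothetical placeholders for my wrapper, not its real interface:

```python
import json
import subprocess

def parse_wrapper_output(raw_bytes):
    """Turn the wrapper's stdout (a JSON string) into the dictionary
    that the mapper functions consume."""
    return json.loads(raw_bytes.decode("utf-8"))

def query_jsonpedia(entity, filters="web-object"):
    """Fork the (hypothetical) JSONpedia wrapper jar and capture its
    stdout; flag names are placeholders for the JCommander options."""
    cmd = ["java", "-jar", "jsonpedia-wrapper.jar",
           "--entity", entity, "--filter", filters]
    return parse_wrapper_output(subprocess.check_output(cmd))
```

Since the wrapper only talks over stdout, the Python side stays completely unaware of the Java internals.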

    The coming week would be the evaluation week, and I'll be continuing to work on this irrespective of the result. Hope I pass though, fingers crossed!



    You can follow my project here.

Monday, July 10, 2017

GSoC 2017 : Week 5

    So, the results of the first evaluations were out last week, and thankfully, I passed the evaluation with flying colors. My mentors seemed happy with my work so far and asked me to keep it up!



    So, it's back to business. This week, my job was to create a tool that could create mapping rules and mapper functions as per the user's demands. This is the complete opposite of what I've been doing all month, as it generalizes the work for future domains instead of me (or any other developer) writing specialized rules for each domain. Hence, this is a **huge** step towards increasing the scalability of the project.

    During the evaluation week, I came up with a structured plan on how to implement a tool that would allow users to add custom rules and mapping functions to the list extractor, which the extractor could use in conjunction with the existing pre-defined rules, and on its impact on the current code base. 

    Then, I started working on rulesGenerator, which would allow users to do all that. At first, I coded up a prototype of rulesGenerator that could create/modify rules. After testing it, I changed the existing list extractor so that, besides the pre-defined mapping_rules.py containing all the core mapping rules, it now also looks at 2 additional files: the user-defined settings.json and custom_mappers.json, which contain the user-defined mapping rules. The extractor can hence run on previously unmapped domains too! The new mapping rules look like this:


{
    "MAPPING": {
        "Writer": ["BIBLIOGRAPHY", "HONORS"],
        "EducationalInstitution": ["ALUMNI", "PROGRAMS_OFFERED", "STAFF"],
        "Actor": ["FILMOGRAPHY", "DISCOGRAPHY", "HONORS"],
        "Band": ["DISCOGRAPHY", "CONCERT_TOURS", "BAND_MEMBERS", "HONORS"],
        "PeriodicalLiterature": ["CONTRIBUTORS", "OTHER_LITERATURE_DETAILS", "HONORS", "BIBLIOGRAPHY"]
    }
}
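The way the extractor can combine the core rules with the user-defined ones might look like the sketch below; the inlined dictionaries and the `merged_mapping` helper are illustrative, not the actual code:

```python
# Sketch: combine the core rules from mapping_rules.py with the
# user-defined rules loaded from settings.json.  User entries extend,
# and can override, the predefined mapping; the merge logic is my own
# illustration of the idea.

PREDEFINED_MAPPING = {  # subset of the rules in mapping_rules.py
    "Writer": ["BIBLIOGRAPHY", "HONORS"],
    "Actor": ["FILMOGRAPHY", "DISCOGRAPHY", "HONORS"],
}

USER_MAPPING = {  # would come from json.load(open("settings.json"))["MAPPING"]
    "Actor": ["FILMOGRAPHY", "DISCOGRAPHY", "HONORS", "CONCERT_TOURS"],
    "MusicGenre": ["BANDS", "ARTISTS"],
}

def merged_mapping(predefined, user_defined):
    rules = dict(predefined)    # start from the core rules
    rules.update(user_defined)  # user rules extend/override them
    return rules

MAPPING = merged_mapping(PREDEFINED_MAPPING, USER_MAPPING)
print(sorted(MAPPING))  # ['Actor', 'MusicGenre', 'Writer']
```

This is what lets the extractor handle previously unmapped domains: a new domain simply shows up as one more key in the merged dictionary.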
 
    I added the newly structured mapping rules to the list extractor, and it can now accept an optional command-line argument to select the class of mapper functions. After that, I also finished work on custom mappers using rulesGenerator.py, which takes settings describing how the triple-extracting mapper function should work and runs the custom mapper function according to those settings, expanding the coverage of the extractor. Below are sample settings that the mapping functions will use for extraction purposes.


{
    "Actor": {
        "years": true,
        "headers": {
            "de": ["Filmografie"],
            "en": ["filmography", "shows"]
        },
        "ontology": {
            "de": {
                "Darsteller": "Starring",
                "Regisseur": "Director"
            },
            "en": {
                "Actor": "Starring",
                "Director": "Director"
            }
        },
        "extractors": ["1", "2"]
    }
}
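As a rough sketch, the "extractors" list can drive a dispatch table that selects which extraction strategies run on a list element. The two extractor functions and the registry below are illustrative stand-ins for the real extractors in the code base:

```python
# Sketch: dispatch on the "extractors" IDs from the settings entry.
# The extractor functions here are dummies standing in for the real
# extraction strategies; only the selection mechanism is the point.

def quote_extractor(elem):
    """Dummy extractor #1: would pull quoted titles from the element."""
    return [("extracted-by", "quote_extractor", elem)]

def italic_extractor(elem):
    """Dummy extractor #2: would pull italicized titles from the element."""
    return [("extracted-by", "italic_extractor", elem)]

EXTRACTOR_REGISTRY = {"1": quote_extractor, "2": italic_extractor}

def run_selected_extractors(list_elem, settings):
    """Run only the extractors whose IDs appear in the settings,
    letting the user trade extraction quality against quantity."""
    triples = []
    for ext_id in settings["extractors"]:
        extractor = EXTRACTOR_REGISTRY.get(str(ext_id))
        if extractor:  # silently skip unknown IDs
            triples.extend(extractor(list_elem))
    return triples

settings = {"extractors": ["1", "2"]}
print(len(run_selected_extractors("''Casablanca'' (1942)", settings)))  # 2
```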
   
    Next week, I'll code up a general mapper function in `mapper.py` that can use the `settings.json` and `custom_mappers.json` files to create a totally user defined list-extractor module! I'll also get in touch with Luca to discuss further possible improvements.



    You can follow my project on GitHub here.