Thursday, September 14, 2017

Post GSoC Adventures

    After a week-long wait, the results of Google Summer of Code were finally announced, and to my utmost elation, I passed the evaluation! In fact, all 7 students who worked with DBpedia passed their evaluations, and all our efforts have finally paid off!



    After taking a *well deserved* week's rest, I have started working on something new. Luca (who also participated in GSoC'17 with DBpedia) and I have decided to start working on a small idea: using gestures to control our computer. At first, we'll try to control the mouse with gestures, but as the idea and the program develop, we'll look at integrating other features as well. This is an exciting idea to work on, dipping our feet into the realms of computer vision, motion detection, touch-less computing and automation, and hopefully we can build something useful out of it!
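
    To give a flavour of the kind of first experiment we have in mind, here's a rough sketch (OpenCV and PyAutoGUI are just assumptions at this point, nothing is decided yet): track the largest moving region between webcam frames and nudge the cursor towards it.

# rough sketch only -- library choices (OpenCV, PyAutoGUI) are assumptions, not decisions
import cv2
import pyautogui

cap = cv2.VideoCapture(0)
_, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev_gray)       # motion = difference between consecutive frames
    _, thresh = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]  # [-2] works across OpenCV versions
    if contours:
        x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
        pyautogui.moveTo(x + w // 2, y + h // 2)  # crude mapping: camera coords -> screen coords
    prev_gray = gray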

    We would love to hear suggestions and comments on the idea! If you are interested in working on it, you are more than welcome to join; contact me for more details.




Friday, September 1, 2017

Summary of GSoC'17 Project

    This blog post is going to be an "informal" summary of my work over the past 3-4 months, with excerpts from the official report. For the official, more "formal" and "technically articulated" report, refer to the following links:



  • The detailed final progress report of my work and contributions during GSoC'17 can be found here.
  • GSoC'17 Final results and challenges available here.
  • The List-Extractor can be found here.



So, let's begin!

    Finally, after 3 long and intense months (see: eternity) of coding, brainstorming and caffeine-ridden hysteria, my GSoC project has reached its inevitable conclusion. It was a 12-week program, but it did seem to last forever, considering the effort and time that went into it. No wonder developers are paid so much :P



    My project's main goal was to extract relevant data from Wikipedia lists (obviously, duh! :P) and form appropriate RDF triples from it, thereby building (extending, to be precise) a knowledge graph that can be merged with DBpedia's datasets and used for various purposes, like a QA bot.
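
    To make that concrete, "forming a triple" boils down to something like this little rdflib sketch (the resource and property below are only illustrative):

import rdflib

g = rdflib.Graph()
dbr = rdflib.Namespace("http://dbpedia.org/resource/")
dbo = rdflib.Namespace("http://dbpedia.org/ontology/")

# a bibliography list element like "The Old Man and the Sea (1952)" on Ernest
# Hemingway's page would ideally end up as a triple like this one:
g.add((dbr["The_Old_Man_and_the_Sea"], dbo.author, dbr["Ernest_Hemingway"]))

print(g.serialize(format="turtle"))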

    So, my journey with this project started back in January. The idea was pretty fascinating: Wikipedia, being the world’s largest encyclopedia, holds a humongous amount of information in the form of text. A lot of data is also present in the form of lists, which are syntactically unstructured and hence difficult to turn into semantic relationships. With more than 15 million articles in different languages, the lists could prove to be a goldmine of structured data. So I looked into the idea and started working on it. As part of my warm-up task, I had to implement another domain in the existing list-extractor; I added the `MusicalArtist` domain and had a discussion on the direction of the project. After consulting with my mentors, I wrote a proposal for the project, and I was selected!

There were 3 main goals as proposed in my GSoC proposal:
  1. Creation of new datasets. 
  2. Making the extractor more scalable, so that users can easily add their own rules and extract triples from different domains. 
  3. Removing the JSONpedia Live Service bottleneck by integrating the existing JSONpedia library with the list-extractor. 



    New datasets were created for domains such as MusicalArtist, Actor, Band, University, Magazine and Newspaper. Combined, the sample datasets created with the list-extractor came from processing about 1.3 million list elements and contain about 2.8 million triples. More triples can be created by running the extractor over different domains.

    The biggest challenge of the project was to make the list-extractor more scalable. The previous extractor had hand-written functions for each property and each domain. Even though the properties were managed in a separate file, the whole process was still cumbersome, as every new domain required the user to implement a new function that could use those properties. So the main idea behind this year's work on this goal was to automate the process, or, to put it in simple terms, to write a function that writes a function!

"The extractor was also made more scalable, by adding several more common mapper functions that can be used, while also making the selection of the mapper functions more flexible for every domain, by shifting the MAPPING dict to settings.json and allowing multiple mapper functions for a single domain. But, a bigger impact on the scalability came from the creation of rulesGenerator, which would now allow the users to create their own mapping rules and mapper functions from a interactive console program, without having to write code for the same! A sample domain MusicGenre was tested for the working of rulesGenerator, and the results/datasets are also present. Although the domain did not have much information that could be extracted, this still showed the ability of the rulesGenerator, a tool that can be used by people who are not programmers or don't have much knowledge about the inner working of the extractor, to generate triples and produce decent results."

    The third goal of this year's project was to remove the dependency on the JSONpedia Live web service. JSONpedia is a framework designed to simplify access to MediaWiki contents by transforming everything into JSON. It is used in the project to fetch Wiki resources in a more convenient JSON form, rather than manually scraping and restructuring the data. The service was hosted on a small server, which could go down if it received too many requests, and with the new extractor requesting tens of thousands of pages within minutes, the service would definitely go down at some point. So, instead of using the web service, I had to use the available library. Only one little problem..... the library is written in Java, so I couldn't use it directly in my extractor and had to figure out a way to integrate it. Turns out the System Software course was quite useful xD

"The dependency on JSONpedia Live Service was removed and JSONpedia Library is now being used for obtaining the JSON representation of the resource. This was achieved by writing a wrapper function (jsonpedia-wrapper.jar) on the actual JSONpedia library, so that it could be manipulated easily by the list-extractor. The JSONpedia wrapper is a command-line program that'll take some commandline parameters and output the retrieved JSON. The wrapper can be individually run using the following command:
java -jar jsonpedia_wrapper.jar -l [language] -r [resource_name] -p [processors] -f [filters] 
So, the list-extractor simply forks another process that runs the JSONpedia wrapper with the parameters provided by the list-extractor, and the output is piped back to the list-extractor's stdin, which is then converted to JSON using the json.loads() method, hence completely emulating the previous behavior and eliminating the bottleneck."
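
    On the Python side, the glue code amounts to little more than this (a simplified sketch with a hypothetical function name; the real extractor builds the arguments from its own settings):

import json
import subprocess

def get_json_representation(resource, language, processors, filters):
    # fork the JSONpedia wrapper and capture whatever it prints on stdout
    cmd = ['java', '-jar', 'jsonpedia_wrapper.jar',
           '-l', language, '-r', resource,
           '-p', processors, '-f', filters]
    output = subprocess.check_output(cmd)
    # the wrapper prints plain JSON, so this gives back the usual dictionary
    return json.loads(output)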

    And with that, all the proposed goals were achieved. The results were pretty encouraging too: the extractor worked fine on both the old and the new domains, and its efficiency also increased, with more triples being generated from the same list elements.

"For a comparative analysis, we take a look at the Actors dataset, results for which are available from previous year. The accuracy of the extractor has improved (accuracy is defined as the ratio of list elements that succesfully contributed to a triple generation to the total number of list-elements present). We also see, despite there being less resources than the previous year, the list extractor was able to generate about 22k more triples from the same domain.This can be due to many factors. On of them could be people adding new list entries in the wikipedia resources, causing the number to increase. This, of course, cannot be influenced by us and hence could have lead to an increase in that number. From a programmer's perspective, the major addition in this year's project was the new year_mapper(), which helped in extracting time periods from the list elements, as well as changing the select_mapping() method, which previously allowed only one mapper function per domain. The newer version of select_mapper() allows selecting several mapping functions to be used with a single domain, allowing more sections to be considered for extraction and consequently, creating more triples from the existing list elements."

    And with that, my project was complete, but it wasn't without its fair share of challenges. The main challenge remained the same as last year: the extreme variability of lists. Unfortunately, there is no real standard, structure or consistency across resources' articles, and multiple formats are used, with different meanings depending on the user who edited the page. Also, the strong dependence on the topic, together with the use of unrestricted natural language, makes it impossible to find a precise general rule for extracting semantic information without knowing in advance the kind of list and the resource type. Knowledge of the domain is therefore extremely important for writing a good set of mapping rules and mapper functions, which requires the user to go through hundreds of Wikipedia pages of the same domain to work out its finer structure and relationships, a very time-consuming and exhausting task. Apart from this heterogeneity, there are unfortunately several Wikipedia pages with bad or wrong formatting, which is obviously reflected in the impurity of the extracted data. These were the main challenges with the tool in general.

Ironically, the feature that makes Wikipedia great (being openly accessible and modifiable by anyone and everyone) is also the root cause of our biggest challenges, which reminds me of a popular phrase from the Holy Bible: "The Lord giveth, and the Lord taketh away."

    The past 3-4 months were enlightening, and it was an incredible experience. Exposure to massive code-bases and keeping up with whole development cycles and commits has given me a realistic glimpse into the software industry. The work was completely different from what we generally learn in universities, where more emphasis is placed on theoretical knowledge, whereas GSoC provides a platform for a more practical experience. Being part of such a large community also helped me as a developer, as I got to interact with many qualified, experienced people and fellow developers from all over the world and share ideas, knowledge and experiences. GSoC is hands down the best work I've done so far in my life & has definitely helped me grow as a software developer. Now, I wait for my results. Hopefully I'll pass. After all, every story deserves a happy ending :P


Keep Calm and Keep Coding!!!



Friday, July 28, 2017

GSoC 2017 : Week 6 and 7

    Okay, it has been *some* time since I last posted an update on my project. The last two weeks have probably been the busiest of my life: working on my GSoC project, last-minute preparation for the upcoming campus placement drive and, of course, the exams and interviews of various companies. It was a long, lifeless, sleep-deprived, caffeine-ridden mindf*ck of an experience, and the most intense time I've ever been through in all my life.


    Thankfully, after a few failed tests and an ample amount of disappointment, I finally cleared the initial rounds and eventually ended up clearing all 6 rounds of a certain *reputed* company, culminating in a job offer. I was happy, but I still had a big task ahead.... and that would be completing the Summer of Code. So, I took a "deserved" nap and started working on my project again.


    So, by the end of week 6, I was supposed to complete the user-defined mappings part, since I had to start implementing the JSONpedia library integration from week 7. The custom mapping rules were finalized last week, and this week's major challenge was adding the custom mapper functions. This was a bit more complicated than simply adding new rules, as I had to store settings that could be run as an independent mapper function. So, after some planning, I came up with a basic yet powerful procedure: I'd store all the features related to a mapper function in a separate settings file (custom_mappers.json), which would then be consumed by a single generic function acting as the mapper. Since most of the project was already modular, this was possible.

    The idea was to isolate all the steps in the triple extraction process, implement each of them separately and combine them into a complete mapper function. So, we keep a dictionary entry in a JSON file, each entry representing a mapper function. The first task is identifying section headers, followed by finding keywords to identify subsections and the ontology classes/properties for those keywords. We also give the user the power to select the extractor functions used in the process, letting them choose the trade-off between the quality and the quantity of the extraction. A sample mapper function's settings would look something like this:

{
    "MUSIC_GENRE_MAPPER": {
        "headers": {
            "en": ["bands", "artists"]
        },
        "extractors": [1, 2, 3, 4],
        "ontology": {
            "en": {
                "default": "notableArtist",
                "artist": "notableArtist",
                "band": "notableBand",
                "Subgenre": "SubGenre",
                "division": "SubGenre",
                "festivals": "relatedFestivals"
            }
        },
        "years": "Yes"
    }
}

    Then, a common method was written that makes use of these settings to run the extraction process. The mapper settings are reloaded dynamically each time a new setting is added, so that they can immediately be used by rulesGenerator. After that, we also had to make sure that the select mapper recognizes the user-defined mapper functions. The following snippet does the trick:

is_custom_map_fn = False
try:
    # if `domain` names one of the predefined mapping dicts, eval() resolves it directly
    if lang in eval(domain):
        domain_keys = eval(domain)[lang]  # e.g. ['bibliography', 'works', ..]
    else:
        print("The language provided is not available yet for this mapping")
        return 0
except NameError:  # no predefined dict with that name, so look for a user-defined mapper
    if domain not in CUSTOM_MAPPERS.keys():
        print "Cannot find the domain's mapper function!!"
        print 'You can add a mapper function for this mapping using rulesGenerator.py and try again...\n'
        return 0
    else:
        is_custom_map_fn = True
        domain_keys = CUSTOM_MAPPERS[domain]["headers"][lang]
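
    To give an idea of what the generic mapper then does with those settings, here's a heavily stripped-down sketch; the function and argument names below are placeholders rather than the actual code in mapper.py:

import rdflib

dbo = rdflib.Namespace("http://dbpedia.org/ontology/")

def generic_custom_mapper(settings, sections, uri, g, lang, extract_resource):
    # `settings` is one entry of custom_mappers.json (like MUSIC_GENRE_MAPPER above),
    # `sections` maps section titles to their list elements, and `extract_resource`
    # stands in for whichever extractor functions the user selected.
    ontology = settings["ontology"][lang]
    for title, elements in sections.items():
        # only map sections whose header matches one of the user-supplied keywords
        if not any(h in title.lower() for h in settings["headers"][lang]):
            continue
        # pick the ontology property for this section, falling back to the default
        prop = next((ontology[k] for k in ontology if k.lower() in title.lower()),
                    ontology["default"])
        for elem in elements:
            res = extract_resource(elem, lang)
            if res is not None:
                g.add((rdflib.URIRef(uri), dbo[prop], res))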

    And with that, the custom mapper was ready. A lot of implementation details have been omitted here; if you're interested, you can head over to the GitHub page and check out the code ;)

    Week 7's job was looking at the JSONpedia Live service and figuring out a way to use the JSONpedia library in the project instead, so that we could get rid of the dependency on the live web service. So, I downloaded the JSONpedia library and went through the code for a while. I also had to come up with a plan for using this library, which is written in Java, in my list-extractor, which is written in Python.

    After giving it some thought, I came up with the idea of retrieving the JSON representation using the library and then using json.loads() to load that string into a dictionary, after which I can work normally with that dictionary. So, I decided I could run the library independently, pipe its output into a Python string variable and use that.

    So, I started and finished writing a Java wrapper for the JSONpedia library that takes command-line arguments, parses them, makes the appropriate calls to JSONpedia and prints the output to stdout. The list-extractor can then fork a subprocess that runs this wrapper and pipes the result back to the Python code, which integrates the two. I used JCommander to parse the command-line parameters of the wrapper so that it can emulate a Live query. I'm still working on this process.

    The coming week would be the evaluation week, and I'll be continuing to work on this irrespective of the result. Hope I pass though, fingers crossed!



    You can follow my project here.

Monday, July 10, 2017

GSoC 2017 : Week 5

    So, the results of the first evaluations were out last week, and thankfully, I passed the evaluation with flying colors. My mentors seemed happy with my work so far and asked me to keep it up!



    So, it's back to business. This week, my job was to create a tool that could create mapping rules and mapper functions as per the user's demands. This is almost the opposite of what I've been doing all month, as it generalizes the work for future domains instead of me (or any other developer) writing specialized rules for each domain. Hence, this is a **huge** step towards increasing the scalability of the project.

    During the evaluation week, I came up with a structured plan on how to implement a tool that would allow users to add custom rules and mapping functions to the list-extractor, which the extractor could use in conjunction with the existing pre-defined rules, and on its impact on the current code-base.

    Then, I started working on the plans to complete rulesGenerator, which would allow users to do all that. At first, I coded up a prototype of rulesGenerator that could create and modify rules. After testing it, I made changes to the existing list-extractor, which now looks at two additional files for the mapping rules besides the pre-defined mapping_rules.py (which contains all the core mapping rules): the user-defined settings.json and custom_mappers.json. The extractor can hence run on previously unmapped domains too! The new MAPPING structure in settings.json looks like this:


{
 "MAPPING": {
  "Writer": ["BIBLIOGRAPHY", "HONORS"],
  "EducationalInstitution": ["ALUMNI", "PROGRAMS_OFFERED", "STAFF"],
  "Actor": ["FILMOGRAPHY", "DISCOGRAPHY", "HONORS"],
  "Band": ["DISCOGRAPHY", "CONCERT_TOURS", "BAND_MEMBERS", "HONORS"],
  "PeriodicalLiterature": ["CONTRIBUTORS", "OTHER_LITERATURE_DETAILS", "HONORS", "BIBLIOGRAPHY"]
 }
}
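
    Loading these user-defined files is straightforward; conceptually it's little more than this (a sketch, not the exact loading code in the extractor):

import json

# settings.json holds the domain -> mapper-function associations shown above,
# custom_mappers.json the settings of the user-defined mapper functions.
with open('settings.json') as settings_file:
    MAPPING = json.load(settings_file)["MAPPING"]

with open('custom_mappers.json') as mappers_file:
    CUSTOM_MAPPERS = json.load(mappers_file)

print(MAPPING["Actor"])   # ['FILMOGRAPHY', 'DISCOGRAPHY', 'HONORS']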
 
    I added the newly structured mapping rules to the list-extractor, and it can now accept an optional command-line argument to select the class of mapper functions. After that, I also finished work on custom mappers using rulesGenerator.py, which takes settings describing how the triple-extracting mapper function should work and runs the custom mapper function according to those settings, expanding the coverage of the extractor. Below are the sample settings that the mapping functions use for extraction purposes.


{
 "Actor": {
  "years": true,
  "headers": {
   "de": ["Filmografie"],
   "en": ["filmography", "shows"]
  },
  "ontology": {
   "de": {
    "Darsteller": "Starring",
    "Regisseur": "Director"
   },
   "en": {
    "Actor": "Starring",
    "Director": "Director"
   }
  },
  "extractors": ["1", "2"]
 }
}
   
    Next week, I'll code up a general mapper function in `mapper.py` that can use the `settings.json` and `custom_mappers.json` files to create a totally user defined list-extractor module! I'll also get in touch with Luca to discuss further possible improvements.





    You can follow my project on GitHub here.

Tuesday, June 27, 2017

GSoC 2017 : Week 4

    Last Sunday marked the end of the 4th week of my 3-month-long Summer of Code project. Another significant corollary of that is that this was the final week before the first evaluations, which take place this week. I've done what I could, and now my fate lies in the hands of my mentors...



    Anyway, continuing from last week, this week too I kept adding new domains to the list-extractor. The main idea is to figure out domains that could *potentially* contain list elements, as their mapper functions can later be reused by other domains. So, this week I started working on the `PeriodicalLiterature` domain, since it contains many lists that are common to many publications, and began writing mapping rules and mapper functions for it.

    While exploring these domains, I realized that most of the list elements had a date entry, which is an important piece of information present in the lists. The existing year extractor only extracted years matching the regex:

#old regex
year_regex = ur'\s(\d{4})\s'

which missed out on nearly all of this information, as it didn't support months or periods of time. A major effort this week went into re-writing `year_mapper()` to add the months (if present) to the dates. If present, the new mapper also tries to extract the period of years covered by the particular element (start date - end date).

#regex patterns to figure out the presence of months in elements
month_list = {r'(january\s?)\d{4}': '1^', r'\W(jan\s?)\d{4}': '1^', r'(february\s?)\d{4}': '2^', r'\W(feb\s?)\d{4}': '2^',
              r'(march\s?)\d{4}': '3^', r'\W(mar\s?)\d{4}': '3^', r'(april\s?)\d{4}': '4^', r'\W(apr\s?)\d{4}': '4^',
              r'(may\s?)\d{4}': '5^', r'\W(may\s?)\d{4}': '5^', r'(june\s?)\d{4}': '6^', r'\W(jun\s?)\d{4}': '6^',
              r'(july\s?)\d{4}': '7^', r'\W(jul\s?)\d{4}': '7^', r'(august\s?)\d{4}': '8^', r'\W(aug\s?)\d{4}': '8^',
              r'(september\s?)\d{4}': '9^', r'\W(sep\s?)\d{4}': '9^', r'\W(sept\s?)\d{4}': '9^', r'(october\s?)\d{4}': '10^',
              r'\W(oct\s?)\d{4}': '10^', r'(november\s?)\d{4}': '11^', r'\W(nov\s?)\d{4}': '11^',
              r'(december\s?)\d{4}': '12^', r'\W(dec\s?)\d{4}': '12^'}

#flags to check the presence of months/periods
month_present = False
period_dates = False

#replace month names with numeric markers, e.g. "august 1992" -> "8^1992"
for mon in month_list:
    if re.search(mon, list_elem, re.IGNORECASE):
        rep = re.search(mon, list_elem, re.IGNORECASE).group(1)
        list_elem = re.sub(rep, month_list[mon], list_elem, flags=re.I)
        month_present = True

#new period regex (complex, isn't it :P): checks whether the element holds a period rather than a single year
period_regex = ur'(?:\(?\d{1,2}\^)?\s?\d{4}\s?(?:–|-)\s?(?:\d{1,2}\^)?\s?\d{4}(?:\))?'

if re.search(period_regex, list_elem, flags=re.IGNORECASE):
    period_dates = True
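
    As a quick, standalone illustration (not code from the extractor): a list element containing "August 1992 - May 1996" first gets its months rewritten to numeric markers by the loop above, and the period regex then picks up the whole span:

import re

#after the month substitutions, "... (August 1992 - May 1996)" has become "... (8^1992 - 5^1996)"
period_regex = ur'(?:\(?\d{1,2}\^)?\s?\d{4}\s?(?:–|-)\s?(?:\d{1,2}\^)?\s?\d{4}(?:\))?'
match = re.search(period_regex, "John Doe, editor-in-chief (8^1992 - 5^1996)")
print(match.group())   # '(8^1992 - 5^1996)'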

    After re-writing year_mapper(), I finished the mappers and rules for `PeriodicalLiterature` and tested them on some resources belonging to the `Magazine`, `Newspaper` and `AcademicJournal` sub-domains. I also updated the awards/honors mapper, which can now differentiate honorary degrees from awards. I finished the week by optimizing the code a bit: removing redundant code, replacing the existing `year_mapper` with the new mapper in each module, and adding the newly written `quote_mapper` resource extractor to the URI-extraction process. After that, I merged all progress into master, as this is the final stable running version before the evaluations begin.

    Should I pass the first evaluation (I really feel I will :P), my next task, as discussed with Luca, would be working on a module that creates a new settings file and allows the user to select the mapping functions used for a domain during the extraction process. This will increase support for unmapped domains.

    Let's hope for the best!! :)

    You can follow my project on GitHub here.

Saturday, June 24, 2017

GSoC 2017 : Week 3

    Time is passing by ever so quickly and things are starting to get *real intense*. Although it has only been three weeks, it feels like I'm a veteran developer now (professional developers everywhere cringed :P). Anyways, here's the progress report from my third week.

    Over the next few weeks, my focus will mostly be on expanding the scope of the extractor: adding a few common domains and making it scalable enough to handle previously unseen lists with the existing rules. This week, I started working on adding new domains. This time around, I took my mentor's suggestion and tried to implement a single mapper that can handle multiple types of list items, instead of having a mapping function for every single type of element. Previously, all the properties were present in the mapping functions themselves, as in the example below:


# mapping bibliography for Writer, snippet from mapper.py
# (`elem`, `year` and `lit_genre` come from earlier extraction steps on the list element)
g.add((rdflib.URIRef(uri), dbo.author, res))
isbn = isbn_mapper(elem)
if isbn:
    g.add((rdflib.URIRef(uri), dbo.isbn, rdflib.Literal(isbn, datatype=rdflib.XSD.string)))
if year:
    add_years_to_graph(g, uri, year)
if lit_genre:
    g.add((rdflib.URIRef(uri), dbo.literaryGenre, dbo + rdflib.URIRef(lit_genre)))

    This led me to change the way Federica and I had been using the mapping rules: the ontology classes/properties are now stored in mapping_rules.py instead of in the mapping functions.

# (new) mapping contribution type for Person, snippet from mapper.py
contrib_type = None
feature = bracket_feature_mapper(elem)
for t in CONTRIBUTION_TYPE[lang]:
    if re.search(t, feature, re.IGNORECASE):
        contrib_type = CONTRIBUTION_TYPE[lang][t]

if contrib_type:
    g.add((rdflib.URIRef(uri), dbo[contrib_type], res))  #notice the property comes from the rules, not the code!
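
    The matching entry in mapping_rules.py would then look roughly like this (the keywords and properties below are invented for illustration, not copied from the actual file):

# mapping_rules.py (illustrative entry only)
CONTRIBUTION_TYPE = {
    'en': {
        'director': 'director',
        'producer': 'producer',
        'screenwriter': 'writer'
    },
    'de': {
        'Regisseur': 'director',
        'Produzent': 'producer'
    }
}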


    Among the new domains, I also analysed the `EducationalInstitution` domain and finished writing the rules/mappers for it. The list-extractor can now extract triples from EducationalInstitution as well as its subdomains, like `College`, `School` and `University`. After that, I looked at the different domains within `Person` in order to generalize the extractor to work on this superclass. Domains like `Painter`, `Architect`, `Astronaut`, `Ambassador`, `Athlete`, `BusinessPerson`, `Chef`, `Celebrity`, `Coach` etc. will now also work with the extractor, increasing its coverage, but I still have to work on the quality of the extraction, as Person is one of the biggest domains on Wikipedia and has extreme variability. For that, I changed various functions to support generalized domains (e.g. year_mapper, role etc.). The extractor now also extracts all the years in which a person won the same award/honor.

    In the end, I had a meeting with Luca to discuss ways to merge the mapping rules of the list-extractor and table-extractor projects; another meeting is scheduled for next week, after we discuss the idea with our mentors. Next week, I'll keep adding domains to the extractor, while adding the new rules/functions in a generalized way. I also hope to reach a resolution on the final structure of my extractor after the discussion with Luca.


    You can follow my project on GitHub here.