Thursday, September 14, 2017

Post GSoC Adventures

    After a week-long wait, the results of Google Summer of Code were finally announced, and to my utmost elation, I passed the evaluation! In fact, all 7 students who worked with DBpedia passed their evaluations, and all our efforts have finally paid off!



    After taking a *well deserved* week's rest, I have started working on a new idea. Luca (who also participated in GSoC'17 with DBpedia) and I have decided to start working on a small idea: using gestures to control our computer. At first, we'll try to control the mouse using gestures, but as the idea and the program develop, we'll look forward to integrating other features as well. This is an exciting idea to work on, dipping our feet into the realms of computer vision, motion detection, touch-less computing and automation, and hopefully we can build something useful out of it!

    We would love to hear suggestions and comments on the idea! If anyone is interested in working on it, they are more than welcome; contact me for more details.




Friday, September 1, 2017

Summary of GSoC'17 Project

    This blog post is going to be an "informal" summary of my work over the past 3-4 months, with excerpts from the official report. For the official, more "formal" and "technically articulated" report, refer to the following links:



  • The detailed final progress report of my work and contributions during GSoC'17 can be found here.
  • GSoC'17 Final results and challenges available here.
  • The List-Extractor can be found here.



So, let's begin!

    Finally, after 3 long and intense months (see: eternity) of coding, brainstorming and caffeine-ridden hysteria, my GSoC project finally reached its inevitable conclusion. It was a 12-week program, but it did seem like forever, considering the effort and time that went into it. No wonder developers are paid so much :P



    My project's main goal was to extract relevant data from Wikipedia lists (obviously, duh! :P) and then form appropriate RDF triples from it. These triples build up (or rather, extend) a knowledge graph, which can then be merged with DBpedia's datasets and in turn used for various purposes, like a QA bot.

    So, my journey with this project started back in January. The idea was pretty fascinating: Wikipedia, being the world's largest encyclopedia, has a humongous amount of information present in the form of text. There's also a lot of data present in the form of lists, which are quite syntactically unstructured and hence difficult to turn into semantic relationships. With more than 15 million articles in different languages, the lists could prove to be a goldmine of structured data. So, I looked into the idea and started working on it. As part of my warm-up task, I had to implement another domain for the existing list-extractor: I added the `MusicalArtist` domain and had a discussion on the direction of the project. After consulting with my mentors, I wrote a proposal for the project and I was selected!

There were 3 main goals as proposed in my GSoC proposal:
  1. Creation of new datasets. 
  2. Making the extractor more scalable, so that users can easily add their own rules and extract triples from different domains. 
  3. Removing the JSONpedia Live Service bottleneck by integrating the existing JSONpedia library with the list-extractor. 



    New datasets were created for domains like MusicalArtist, Actor, Band, University, Magazine and Newspaper. Combined, the sample datasets created with the list-extractor came from processing about 1.3 million list elements and contain about 2.8 million triples. More triples can be created by running the extractor over different domains.

    The biggest challenge of the project was making the list-extractor more scalable. The previous extractor had hand-written functions for each property and each domain. Even though the properties were managed in a separate file, the whole process was still cumbersome, as every new domain required the user to implement a new function that could use those properties. So, the main idea behind this year's work on this goal was to automate that process, or, to put it in simple terms, to write a function that writes a function!
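
    To make "a function that writes a function" concrete, here is a tiny, purely illustrative sketch (not the project's actual code): a factory that takes a domain's rules and hands back a ready-to-use mapper, so adding a domain becomes a data change rather than a code change.

# purely illustrative: build a mapper from a declarative description of a domain,
# instead of hand-writing a dedicated mapper function for it
def make_mapper(section_keywords, ontology_property):
    """Return a mapper that handles any section whose title matches the given keywords."""
    def mapper(section_title, list_elements):
        triples = []
        if any(kw in section_title.lower() for kw in section_keywords):
            for elem in list_elements:
                triples.append((elem, ontology_property))  # stand-in for real triple creation
        return triples
    return mapper

# "writing" a new mapper is now just a call with new rules, no new code:
bibliography_mapper = make_mapper(['bibliography', 'works', 'books'], 'author')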

"The extractor was also made more scalable, by adding several more common mapper functions that can be used, while also making the selection of the mapper functions more flexible for every domain, by shifting the MAPPING dict to settings.json and allowing multiple mapper functions for a single domain. But, a bigger impact on the scalability came from the creation of rulesGenerator, which would now allow the users to create their own mapping rules and mapper functions from a interactive console program, without having to write code for the same! A sample domain MusicGenre was tested for the working of rulesGenerator, and the results/datasets are also present. Although the domain did not have much information that could be extracted, this still showed the ability of the rulesGenerator, a tool that can be used by people who are not programmers or don't have much knowledge about the inner working of the extractor, to generate triples and produce decent results."

    The third goal of this year's project was to remove the dependency on the JSONpedia Live web-service. JSONpedia is a framework designed to simplify access to MediaWiki content by transforming everything into JSON. It is used in the project to fetch the Wiki resources in a more desirable JSON form, rather than manually scraping and restructuring the data. The service was hosted on a small server, which could go down if it incurred too many requests, and with the new extractor requesting tens of thousands of pages within minutes, the service would definitely go down at some point. So, instead of using the web-service, I had to use the available library. Only a little problem..... the library is written in Java, so I couldn't directly use it in my extractor and had to figure out a way to integrate it. Turns out the System Software course was quite useful xD

"The dependency on JSONpedia Live Service was removed and JSONpedia Library is now being used for obtaining the JSON representation of the resource. This was achieved by writing a wrapper function (jsonpedia-wrapper.jar) on the actual JSONpedia library, so that it could be manipulated easily by the list-extractor. The JSONpedia wrapper is a command-line program that'll take some commandline parameters and output the retrieved JSON. The wrapper can be individually run using the following command:
java -jar jsonpedia_wrapper.jar -l [language] -r [resource_name] -p [processors] -f [filters] 
So, the list-extractor simply forks another process that runs the JSONpedia wrapper with the parameters provided by the list-extractor, and the wrapper's output is piped back to the list-extractor, where it is converted to JSON using the json.loads() method, hence completely emulating the previous behavior and eliminating the bottleneck."
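
    In Python terms, that pipeline boils down to something like the sketch below. Only the jar name and its flags come from the report above; the function name, the default values and the use of subprocess.check_output() are my own illustration of the approach, not the project's exact code.

import json
import subprocess

def fetch_json(resource, lang='en', processors='', filters=''):
    """Fork the JSONpedia wrapper and parse whatever it prints to stdout."""
    cmd = ['java', '-jar', 'jsonpedia_wrapper.jar', '-l', lang, '-r', resource]
    if processors:
        cmd += ['-p', processors]
    if filters:
        cmd += ['-f', filters]
    output = subprocess.check_output(cmd)  # the wrapper prints the retrieved JSON
    return json.loads(output)              # same dict the Live service used to return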

    And with that, all the proposed goals were achieved. The results were pretty encouraging too. The extractor worked fine on both the old and the new domains, and its efficiency also increased, with more triples being generated from the same list elements.

"For a comparative analysis, we take a look at the Actors dataset, results for which are available from previous year. The accuracy of the extractor has improved (accuracy is defined as the ratio of list elements that succesfully contributed to a triple generation to the total number of list-elements present). We also see, despite there being less resources than the previous year, the list extractor was able to generate about 22k more triples from the same domain.This can be due to many factors. On of them could be people adding new list entries in the wikipedia resources, causing the number to increase. This, of course, cannot be influenced by us and hence could have lead to an increase in that number. From a programmer's perspective, the major addition in this year's project was the new year_mapper(), which helped in extracting time periods from the list elements, as well as changing the select_mapping() method, which previously allowed only one mapper function per domain. The newer version of select_mapper() allows selecting several mapping functions to be used with a single domain, allowing more sections to be considered for extraction and consequently, creating more triples from the existing list elements."

    And with that, my project was complete, but it wasn't without its fair share of challenges. The main challenge remained the same as last year: the extreme variability of lists. Unfortunately, there is no real standard, structure or consistency in the resources' articles, and there are multiple formats in use, with different meanings depending on the user who edited the page. Also, the strong dependence on the topic, as well as the use of unrestricted natural language, makes it impossible to find a precise general rule to extract semantic information without knowing in advance the kind of list and the resource type. Hence, knowledge of the domain is also extremely important to write a good set of mapping rules and mapper functions, which requires the user to go through hundreds of Wikipedia pages of the same domain to find the finer structure and relationships in it; this is very time-consuming and exhausting. Apart from the heterogeneity, there are unfortunately several Wikipedia pages with bad/wrong formatting, which is obviously reflected in the impurity of the extracted data. These were the main challenges present in the tool in general.

Ironically, the feature that makes Wikipedia great is also the root cause of our biggest challenges (i.e. being openly accessible and modifiable by anyone and everyone), which reminds me of a popular phrase from the Holy Bible: "The Lord giveth, and the Lord taketh away."

    The past 3-4 months were enlightening and it was an incredible experience. Exposure to massive code-bases and keeping up with whole development cycles and commits have really given me a realistic glimpse of the software industry. The work was completely different from what we generally learn in universities, where more emphasis is placed on theoretical knowledge, whereas GSoC provides a platform for a more practical experience. Being a part of such a large community also helped me as a developer, as I got to interact with many qualified, experienced people and fellow developers from all over the world and share ideas, knowledge and experiences. GSoC is hands down the best work I've done so far in my life & has definitely helped me grow as a software developer. Now, I wait for my results. Hopefully I'll pass. After all, every story deserves a happy ending :P


Keep Calm and Keep Coding!!!



Friday, July 28, 2017

GSoC 2017 : Week 6 and 7

    Okay, it has been *some* time since I last posted an update on my project. The last 2 weeks have probably been the busiest of my life: working on my GSoC project, last-minute preparation for the upcoming campus placement drive and, of course, the exams and interviews of various companies. It was a long, lifeless, sleep-deprived, caffeine-ridden mindf*ck of an experience, and the most intense time I've ever been through in my life.


    Thankfully, after a few failed tests and an ample amount of disappointment, I finally cleared the initial rounds, and eventually ended up clearing all 6 rounds of a certain *reputed* company, culminating in a job offer. I was happy, but I still had a big task ahead.... and that would be completing the Summer of Code. So, I took a "deserved" nap and started working on my project again.


    So, by the end of week 6, I was supposed to complete the user-defined mappings part, since I had to start implementing the JSONpedia library integration in week 7. Custom mapping rules were finalized last week, and this week's major challenge was adding the custom mapper functions. This was a bit more complicated than simply adding new rules, as I had to store settings that could drive an independent mapper function. So, after some planning, I came up with a basic, yet powerful procedure: I'd store all the features related to the mapper function in a separate settings file (custom_mappers.json), which would then be consumed by a generic function acting as the mapper. Since most of the project was already completely modular, this was possible.

    The idea was to isolate all the steps in the triple extraction process, implement each of them separately and combine them into a complete mapper function. So, we keep one dictionary entry per mapper function in a JSON file. The first task is identifying section headers, followed by finding keywords to detect subsections, and the ontology classes/properties for those keywords. We also give the user the power to select the extractor functions used in the process, letting them choose the trade-off between the quality and the quantity of the extraction. A sample mapper function's settings would look something like this:

{
    "MUSIC_GENRE_MAPPER": {
        "headers": {
            "en": ["bands", "artists"]
        },
        "extractors": [1, 2, 3, 4],
        "ontology": {
            "en": {
                "default": "notableArtist",
                "artist": "notableArtist",
                "band": "notableBand",
                "Subgenre": "SubGenre",
                "division": "SubGenre",
                "festivals": "relatedFestivals"
            }
        },
        "years": "Yes"
    }
}

    Then, a common method was written that makes use of these settings to run the extraction process (a rough sketch of it follows the snippet below). The mapper settings are dynamically reloaded each time a new setting is added, so that they can be used by rulesGenerator. After that, we also had to make sure that the select mapper identifies the user-defined mapper functions. The following snippet does the trick:

is_custom_map_fn = False
try:
    if lang in eval(domain):
        domain_keys = eval(domain)[lang]  # e.g. ['bibliography', 'works', ..]
    else:
        print("The language provided is not available yet for this mapping")
        return 0
except NameError:  # domain is not one of the predefined mapping dicts
    if domain not in CUSTOM_MAPPERS.keys():
        print "Cannot find the domain's mapper function!!"
        print 'You can add a mapper function for this mapping using rulesGenerator.py and try again...\n'
        return 0
    else:
        is_custom_map_fn = True
        domain_keys = CUSTOM_MAPPERS[domain]["headers"][lang]
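
    And the common method itself is conceptually quite simple. Below is a simplified sketch of a generic mapper driven by such a settings entry; the triple creation and the extractor calls are reduced to placeholders, so treat this as an illustration of the idea rather than the actual implementation.

def run_custom_mapper(settings, lang, section_title, list_elements):
    """Map a section's list elements using a user-defined settings dict (see the JSON sample above)."""
    ontology = settings["ontology"][lang]

    # choose the ontology property for this section, falling back to "default"
    prop = ontology.get("default")
    for keyword, candidate in ontology.items():
        if keyword != "default" and keyword.lower() in section_title.lower():
            prop = candidate
            break

    triples = []
    for elem in list_elements:
        triples.append((elem, prop))   # placeholder for URI reconciliation + g.add(...)
    # if settings.get("years") == "Yes", the real code would also run the year mapper here
    return triples

# e.g. run_custom_mapper(CUSTOM_MAPPERS["MUSIC_GENRE_MAPPER"], "en", "Notable bands", elements)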

    And with that, the custom mapper was ready. A lot of implementation details have been omitted here; if you're interested, you can head over to the GitHub page and check out the code ;)

    Week 7's job was looking at the JSONpedia Live service and figuring out a way to use the JSONpedia library in the project instead, so that we could get rid of the dependency on the live web-service. So, I downloaded the JSONpedia library and went through the code for a while. I also had to come up with a plan to use this library, which is written in Java, in my list-extractor program, which is written in Python.

    After giving it some thought, I came up with the idea of retrieving the JSON representation using the library, then using json.loads() to load that string into a dictionary and working normally with that dictionary. So, I decided that I could run the library independently and pipe the output into a Python string and use it.

    So, I started and finished writing a Java wrapper for the JSONpedia library that takes command-line arguments, parses them, makes the appropriate calls to JSONpedia and prints the output to stdout. The list-extractor can then fork a subprocess that runs this wrapper and pipes the result back to the Python side, which ties the two together. I used JCommander to parse the command-line parameters of the wrapper so that they emulate a Live query. I'm still working on this part.

    The coming week would be the evaluation week, and I'll be continuing to work on this irrespective of the result. Hope I pass though, fingers crossed!



    You can follow my project here.

Monday, July 10, 2017

GSoC 2017 : Week 5

    So, the results of the first evaluations were out last week, and thankfully, I passed the evaluation with flying colors. My mentors seemed happy with my work so far and asked me to keep it up!



    So, it's back to business. This week, my job was to create a tool that could create mapping rules and mapper functions as per the user's demands. This is the complete opposite of what I've been doing all month, as it generalizes all the work for future domains instead of me (or any other developer) writing specialized rules for each domain. Hence, this is a **huge** step in increasing the scalability of the project.

    I came up with a structured plan during the evaluation week on how to implement a tool that would allow users to add custom rules and mapping functions to the list extractor, which the extractor could use in conjunction with the existing pre-defined rules, and its impact on the current code-base. 

    Then, I started working on the plans to complete rulesGenerator, which would allow users to do all that. At first, I coded up a prototype of rulesGenerator that could create/modify rules. After testing it, I made changes to the existing list-extractor, which now looks at 2 additional files for the mapping rules: besides the pre-defined mapping_rules.py, which contains all the core mapping rules, it also reads the user-defined settings.json and custom_mappers.json, which contain the user-defined mapping rules. The extractor can hence run on previously unmapped domains too! A sample MAPPING section from settings.json looks like this (a short loading sketch follows it):


{
    "MAPPING": {
        "Writer": ["BIBLIOGRAPHY", "HONORS"],
        "EducationalInstitution": ["ALUMNI", "PROGRAMS_OFFERED", "STAFF"],
        "Actor": ["FILMOGRAPHY", "DISCOGRAPHY", "HONORS"],
        "Band": ["DISCOGRAPHY", "CONCERT_TOURS", "BAND_MEMBERS", "HONORS"],
        "PeriodicalLiterature": ["CONTRIBUTORS", "OTHER_LITERATURE_DETAILS", "HONORS", "BIBLIOGRAPHY"]
    }
}
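
    The glue needed for this is small. The sketch below shows one way the merge could look; the file names and the MAPPING dict come from the project, while the loading function itself is just my illustration.

import json

from mapping_rules import MAPPING   # the pre-defined core rules

def load_mapping_rules(settings_file='settings.json'):
    """Combine the pre-defined MAPPING rules with the user-defined ones, if any."""
    rules = dict(MAPPING)
    try:
        with open(settings_file) as f:
            rules.update(json.load(f).get('MAPPING', {}))   # user rules extend/override the core ones
    except IOError:
        pass   # no settings.json yet: run with the core rules only
    return rules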
 
    I added the newly structured mapping rules to the list-extractor, and it can now accept an optional command-line argument to select the class of mapper functions. After that, I also finished work on custom mappers using rulesGenerator.py, which can take settings describing how the triple-extracting mapper function should work and run the custom mapper function according to those settings, expanding the coverage of the extractor. Below are the sample settings that the mapping functions will use for extraction purposes.


{
    "Actor": {
        "years": true,
        "headers": {
            "de": ["Filmografie"],
            "en": ["filmography", "shows"]
        },
        "ontology": {
            "de": {
                "Darsteller": "Starring",
                "Regisseur": "Director"
            },
            "en": {
                "Actor": "Starring",
                "Director": "Director"
            }
        },
        "extractors": ["1", "2"]
    }
}
   
    Next week, I'll code up a general mapper function in `mapper.py` that can use the `settings.json` and `custom_mappers.json` files to create a totally user-defined list-extractor module! I'll also get in touch with Luca to discuss further possible improvements.


    Finally,



    You can follow my project on github here.

Tuesday, June 27, 2017

GSoC 2017 : Week 4

    Last Sunday marked the end of the 4th week of my 3-month-long Summer of Code project. Another significant corollary of that is that this was the final week before the first evaluations, which take place this week. I've done what I could, and now my fate lies in the hands of my mentors...



    Anyway, continuing from last week, this week too I continued adding new domains to the list-extractor. The main idea is to figure out domains that could *potentially* contain list elements; the mapper functions written for them can later be reused by other domains. So, this week I started working on the `PeriodicalLiterature` domain, since it contains many lists that are common to many publications, and began writing its mapping rules and mapper functions.

    While exploring these domains, I realized that most of the list elements had a date entry, which is an important piece of information present in the lists. The existing year extractor only extracted years matching the regex:

#old regex
year_regex = ur'\s(\d{4})\s'

which missed out on nearly all of this information, as it didn't support months or periods of time. A major effort this week went into re-writing `year_mapper()` to include the months (when present) with the dates. The new mapper also tries to extract the period of years of a particular element (start date - end date), if one is present. A tiny demo follows the snippet below.

#regex to figure out the presence of months in elements
month_list = { r'(january\s?)\d{4}':'1^', r'\W(jan\s?)\d{4}':'1^', r'(february\s?)\d{4}':'2^', r'\W(feb\s?)\d{4}':'2^',
                    r'(march\s?)\d{4}':'3^', r'\W(mar\s?)\d{4}':'3^',r'(april\s?)\d{4}':'4^',r'\W(apr\s?)\d{4}':'4^', 
                    r'(may\s?)\d{4}':'5^', r'\W(may\s?)\d{4}':'5^',r'(june\s?)\d{4}':'6^',r'\W(jun\s?)\d{4}':'6^',
                    r'(july\s?)\d{4}':'7^',r'\W(jul\s?)\d{4}':'7^', r'(august\s?)\d{4}':'8^', r'\W(aug\s?)\d{4}':'8^', 
                    r'(september\s?)\d{4}':'9^', r'\W(sep\s?)\d{4}':'9^',r'\W(sept\s?)\d{4}':'9^', r'(october\s?)\d{4}':'10^',
                    r'\W(oct\s?)\d{4}':'10^',r'(november\s?)\d{4}':'11^', r'\W(nov\s?)\d{4}':'11^' ,
                    r'(december\s?)\d{4}':'12^', r'\W(dec\s?)\d{4}':'12^'}
    
#flags to check presence of months/period
month_present = False
period_dates = False

for mon in month_list:
    if re.search(mon, list_elem, re.IGNORECASE):
        rep = re.search(mon, list_elem, re.IGNORECASE).group(1)
        list_elem = re.sub(rep, month_list[mon], list_elem, flags=re.I)
        month_present = True

#new year regex (complex, isn't it :P)
#it matches a period of years like "2004 - 2008", with optional month markers (e.g. "1^ 2004 – 3^ 2008")
year_regex = ur'(?:\(?\d{1,2}\^)?\s?\d{4}\s?(?:–|-)\s?(?:\d{1,2}\^)?\s?\d{4}(?:\))?'

if re.search(year_regex, list_elem, flags=re.IGNORECASE):
    period_dates = True
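
    To see what this buys us, here is a tiny self-contained demo of the idea. The sample element and the trimmed-down month table are made up for illustration; only the period regex is taken from above.

# -*- coding: utf-8 -*-
import re

month_list = {r'(january\s?)\d{4}': '1^', r'(march\s?)\d{4}': '3^'}  # trimmed-down month table
year_regex = ur'(?:\(?\d{1,2}\^)?\s?\d{4}\s?(?:–|-)\s?(?:\d{1,2}\^)?\s?\d{4}(?:\))?'

list_elem = u"Editor-in-chief (January 2004 - March 2008)"
for mon in month_list:
    match = re.search(mon, list_elem, re.IGNORECASE)
    if match:
        # replace the month name with its numeric marker, e.g. "January " -> "1^"
        list_elem = re.sub(match.group(1), month_list[mon], list_elem, flags=re.I)

print re.search(year_regex, list_elem, flags=re.IGNORECASE).group()  # prints "(1^2004 - 3^2008)"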

    After re-writing year_mapper, I finished the mappers and rules for `PeriodicalLiterature` and tested them on some resources belonging to the `Magazine`, `Newspaper` and `AcademicJournal` sub-domains. I also updated the awards/honors mapper, which can now differentiate between honorary degrees and awards. I finished this week's work by optimizing the code a bit: removing redundant code, replacing the existing `year_mapper` with the new mapper in each module, and adding the newly written `quote_mapper` resource extractor to the URI extraction process. After that, I merged all progress into master, as this is the final stable running version before the evaluations begin.

    Should I pass the first evaluation (I really feel I will :P), my next task, as discussed with Luca, will be working on a module that creates a new settings file and allows the user to select the mapping functions used for a domain during the extraction process. This will increase support for unmapped domains.

    Let's hope for the best!! :)

    You can follow my project on github here.

Saturday, June 24, 2017

GSoC 2017 : Week 3

    Time is passing by ever so quickly and things are starting to get *real intense*. Although it has only been three weeks, it feels like I'm a veteran developer now (professional developers everywhere cringed :P). Anyways, here's the progress report from my third week.

    These next few weeks, my focus will mainly be on expanding the scope of the extractor, adding a few common domains and making it scalable enough to handle previously unseen lists with the existing rules. This week, I started working on adding new domains. This time around, I took my mentor's suggestion and tried to implement a single mapper that can map multiple list items, instead of having a mapping function for every single type of element. Previously, all the properties were present in the mapping functions themselves, as in the example below:


# mapping bibliography for Writer, snippet from mapper.py
g.add((rdflib.URIRef(uri), dbo.author, res))
isbn = isbn_mapper(elem)
if isbn:
    g.add((rdflib.URIRef(uri), dbo.isbn, rdflib.Literal(isbn, datatype=rdflib.XSD.string)))
if year:
    add_years_to_graph(g, uri, year)
if lit_genre:
    g.add((rdflib.URIRef(uri), dbo.literaryGenre, dbo + rdflib.URIRef(lit_genre)))

    This led to me changing the way Federica and I have been using the mapping rules. The ontology classes/properties are now stored in mapping_rules.py instead of in the mapping functions (an illustrative example of such an entry follows the snippet below).

# (new)mapping contribution type for Person, snippet from mapper.py
contrib_type = None
feature = bracket_feature_mapper(elem)
for t in CONTRIBUTION_TYPE[lang]:
    if re.search(t, feature, re.IGNORECASE):
        contrib_type = CONTRIBUTION_TYPE[lang][t]

if contrib_type:
    g.add((rdflib.URIRef(uri), dbo[contrib_type], res))  #notice the property!
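
    For context, the CONTRIBUTION_TYPE structure that this snippet reads from mapping_rules.py is conceptually a per-language lookup from keyword patterns to ontology properties. The entry below is invented for illustration; the project's actual keywords and properties may differ.

# illustrative only: per-language keyword -> ontology property lookup used by the snippet above
CONTRIBUTION_TYPE = {
    'en': {
        'editor': 'editor',
        'illustrator': 'illustrator',
        'translator': 'translator',
    },
    'it': {
        'editore': 'editor',
        'illustratore': 'illustrator',
        'traduttore': 'translator',
    },
}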


    Among the new domains, I also analysed the `EducationalInstitution` domain and completed writing the rules/mappers for it. The list-extractor can now extract triples from EducationalInstitution, as well as its subdomains like `College`, `School` and `University`. After that, I looked at different domains within `Person` in order to generalize the extractor to work on this superclass. Domains like `Painter`, `Architect`, `Astronaut`, `Ambassador`, `Athlete`, `BusinessPerson`, `Chef`, `Celebrity`, `Coach` etc. will now also work with the extractor, increasing its coverage, but I still have to work on the quality of extraction, as Person is one of the biggest domains on Wikipedia and has extreme variability. For that, I changed various functions to support generalized domains (e.g. year_mapper, role extraction etc.). The extractor now also extracts all the years in which a person won the same award/honor.

    In the end, I had a meeting with Luca to discuss ways to merge the mapping rules of the list-extractor and table-extractor projects; another meeting is scheduled for next week, after we discuss the idea with our mentors. Next week, I'll keep adding domains to the extractor, while writing the new rules/functions in a generalized way. I also hope to come to a resolution about the final structure of my extractor after the discussion with Luca.


    You can follow my project on github here.


Sunday, June 18, 2017

Cyberoam Auto-Login

    Most universities and institutions nowadays use Cyberoam to control and monitor the way their students/employees use the Internet connection. Using a VPN can help in bypassing the whole login process altogether, but many times it doesn't work. Now, being a CS student, I can't really live without the Internet for long, and it's pretty annoying when Cyberoam logs out in the middle of a streaming football match, or while downloading a really large file. Since it logs out after a fixed amount of time, how about a script that automates the login process before the connection times out? No more worrying about losing the Internet.

    To achieve this, I used mechanize instead of the standard urllib2 module: the script is essentially a bot that logs in for us every few hours, and mechanize is easier to use and more powerful than urllib2, as it can easily simulate browser properties.
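
    For reference, the browser() helper used in the snippet below is essentially a pre-configured mechanize.Browser. Something along these lines should do; this is my reconstruction, not necessarily the exact original.

import mechanize

def browser():
    br = mechanize.Browser()
    br.set_handle_robots(False)   # the captive portal has no robots.txt worth obeying
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64)')]  # look like a regular browser
    return br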

    The next step is setting the payload with your login credentials and then posting the credentials with proper access tokens, which should do the trick. Then a few lines of regex would be enough to scrape out required information.

import re
import urllib

from bs4 import BeautifulSoup

# url, username and password are set in login.py (portal URL + port, and your credentials)

def post_request(username,password,value,mode):
	
	br = browser()
	#POST request values
	values ={
		"mode":mode,
		"username":username,
		"password":password,
		"btnSubmit":value
	}

	data = urllib.urlencode(values)
	page = br.open(url,data)
	response = page.read()
	br._factory.is_html = True
	soup = BeautifulSoup(response,"lxml")

	regex = re.compile(r"<message><!\[CDATA\[(.*)\]\]><\/message>")

	x = re.search(regex,response)
	print username + ": " + x.group(1)
	return br

    The login and logout functions are now simply POST requests to the server. The login can be repeated at a pre-determined interval by using an infinite loop with a sleep call: as long as the program runs, the user stays logged in. Then we include a simple signal handler that posts the logout request whenever a SIGINT is received, so terminating the script automatically logs the user out (a sketch of this loop follows the two functions below).

def login():
	br = post_request(username,password,"Login","191")
	return br

def logout():
	br = post_request(username,password,"Logout","193")
	return br
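
    Putting the pieces together, the driver loop and the signal handler described above can be as small as the sketch below. The re-login interval is a placeholder you'd tune to your portal's timeout; the rest just reuses the login()/logout() functions defined above.

import signal
import sys
import time

RELOGIN_INTERVAL = 3 * 60 * 60   # placeholder: re-login every 3 hours

def handle_sigint(signum, frame):
    logout()       # Ctrl+C posts the logout request first...
    sys.exit(0)    # ...and then the script exits

signal.signal(signal.SIGINT, handle_sigint)

while True:
    login()                        # stay logged in for as long as the script runs
    time.sleep(RELOGIN_INTERVAL)   # wake up and log in again before the session expires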

   A small problem that we'll face is SSL verification, which will fail due to the absence of a proper certificate. So, we'll have to bypass this validation for now. For more information, you can see PEP 476 - Enabling certificate verification by default for stdlib http clients. The following snippet bypasses the SSL check:


def bypass_ssl():
	### Bypassing SSL certification ###
	try:
	    _create_unverified_https_context = ssl._create_unverified_context
	except AttributeError:
	    # Legacy Python that doesn't verify HTTPS certificates by default
	    pass
	else:
	    # Handle target environment that doesn't support HTTPS verification
	    ssl._create_default_https_context = _create_unverified_https_context
	return


   And that's pretty much it. Set your credentials in the two fields in login.py, set your Cyberoam portal's URL and port, update the re-login time, and you're done!

Forget about ever logging in again, but do remember to log out, or else, you're FOREVER stuck :P

Link to the code: Github

Friday, June 16, 2017

GSoC 2017 : Week 2


    The second week has flown by, and now it's time for the second week's progress report. 

    The primary task for the second week was to add support for Spanish and German to the existing Writer and Actor domains, along with integrating the MusicalArtist domain that was part of my warm-up task.

    So, I started off with adding the MusicalArtist domain to my current codebase. It was fairly straightforward (for the most part) and it worked like a charm.


ref = reference_mapper(elem)  # look for resource references
if ref:  # current element contains a reference
    uri = wikidataAPI_call(ref, lang)  # try to reconcile resource with Wikidata API
    if uri:
        dbpedia_uri = find_DBpedia_uri(uri, lang)  # try to find equivalent DBpedia resource
        if dbpedia_uri:  # if you can find a DBpedia res, use it as the statement subject
            uri = dbpedia_uri
    else:  # Take the reference name anyway if you can't reconcile it
        ref = list_elem_clean(ref)
        elem = elem.replace(ref, "")  # subtract reference part from list element, to facilitate further parsing
        uri_name = ref.replace(' ', '_')
        uri_name = urllib2.quote(uri_name)
        uri = dbr + uri_name.decode('utf-8', errors='ignore')
    g.add((rdflib.URIRef(uri), rdf.type, dbo.Album))
    g.add((rdflib.URIRef(uri), dbo.musicalArtist, res))


    However, diving deeper into many musical artists, I noticed that the extractor wasn't working very efficiently and constantly missed many elements. It was then that I realised an actor could well have recorded a few songs, or a musician might've acted in a movie, while the extractor was looking only for one particular section per resource. It's funny and astounding at the same time how one can completely miss such an intuitive thing. Anyway, a big overhaul was needed.
 
    So, after analyzing many articles from different domains, I realised that several domains have intersecting sections, and I had to change my approach. From now on, I'll focus on writing mapping functions that can extract list elements from a given section. Later, domains can be added to mapping_rules.py, including the various sections that might exist in that domain's articles.

    For this, I had to completely restructure my current mapping_rules file. The rules now consist of 2 multi-level dictionaries: the first maps the domain of the resource to the section types it could be related to, and the second maps each section type to the keywords that identify it in each language.


MAPPING = { 
            'Person': ['FILMOGRAPHY', 'DISCOGRAPHY', 'BIBLIOGRAPHY', 'HONORS'],
            'Writer': ['BIBLIOGRAPHY', 'HONORS'], 
            'MusicalArtist': ['DISCOGRAPHY','FILMOGRAPHY', 'CONCERT_TOURS', 'HONORS'],
            'Band':['DISCOGRAPHY', 'CONCERT_TOURS', 'BAND_MEMBERS', 'HONORS'],
}

BIBLIOGRAPHY = {
    'en': ['bibliography', 'works', 'novels', 'books', 'publications'],
    'it': ['opere', 'romanzi', 'saggi', 'pubblicazioni', 'edizioni'],
    'de': ['bibliographie', 'werke','arbeiten', 'bücher', 'publikationen'],
    'es': ['Obras', 'Bibliografía']
}

I also had to change the select mapping function to handle multiple sections.


domains = MAPPING[res_class]  # e.g. ['BIBLIOGRAPHY', 'FILMOGRAPHY']
domain_keys = []
resource_class = res_class

for domain in domains:
    if domain in mapped_domains:
        continue
    if lang in eval(domain):
        domain_keys = eval(domain)[lang]  # e.g. ['bibliography', 'works', ..]
    else:
        print("The language provided is not available yet for this mapping")

    mapped_domains.append(domain)  # this domain won't be used again for mapping

    for res_key in resDict.keys():  # iterate on resource dictionary keys
        mapped = False

        for dk in domain_keys:  # search for resource keys related to the selected domain
            # if the section hasn't been mapped yet and the title matches, apply the domain related mapping
            dk = dk.decode('utf-8')  # make sure utf-8 mismatches don't skip sections
            if not mapped and re.search(dk, res_key, re.IGNORECASE):
                mapper = "map_" + domain.lower() + "(resDict[res_key], res_key, db_res, lang, g, 0)"
                res_elems += eval(mapper)  # calls the proper mapping for that domain and counts extracted elements
                mapped = True  # prevents the same section from being mapped again

    So, this major change in the selection of mapper functions greatly improved the extractor. It is now possible to attach multiple mappers to a domain, effectively increasing the number of extracted elements and hence the accuracy.

    Then, I continued with adding support for German and Spanish in all 3 initial domains (Actor, Writer, MusicalArtist). And that concluded the work for my second week.

    This coming week, I'll be looking forward to adding new domains to the extractor. Another task next week will be discussing an approach with Luca, a friend who is also working with DBpedia on a similar project, to potentially come up with a common template for the mapping rules and make a more effective and scalable extractor.

    You can follow my project on github here.


Monday, June 5, 2017

GSoC 2017 : Week 1

    With the first week now past us, it's time for the first week's progress report. 

    The first week was mainly about checking the existing code for potential improvements. So, this week I went over the existing code and made slight tweaks to it. I added an __init__ module and docstrings. I also worked on improving the methods used to create the resource dictionary and to extract and store the triples, in order to get rid of the junk values that were observed during extraction. One of the added methods was remove_symbols().

def remove_symbols(listDict_key):
    ''' removes symbols and garbage characters that pollute the values to be inserted

    :param listDict_key: dictionary entries (values) obtained from parsing
    :return: the cleaned list of values
    '''
    for i in range(len(listDict_key)):
        value = listDict_key[i]
        if type(value)==list:
            value=remove_symbols(value)
        else:
            listDict_key[i] = value.replace('&nbsp;','')

    return listDict_key

    Another addition was a method that stores the statistical results of all the extractions in a CSV file. This method will be used in the future for evaluating the performance of the extractor and logging the statistics of the extractions performed in the meantime.

def evaluate(lang, source, tot_extracted_elems, tot_elems):
    ''' Evaluates the extraction process and stores the results in a csv file.

    :param source: resource type (dbpedia ontology type)
    :param tot_extracted_elems: number of list elements extracted from the resources.
    :param tot_elems: total number of list elements present in the resources.
    '''
    print "\nEvaluation:\n===========\n"
    print "Resource Type:", lang + ":" + source
    print "Total list elements found:", tot_elems
    print "Total elements extracted:", tot_extracted_elems
    accuracy = (1.0 * tot_extracted_elems) / tot_elems
    print "Accuracy:", accuracy
    with open('evaluation.csv', 'a') as csvfile:
        filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        filewriter.writerow([lang, source, tot_extracted_elems, tot_elems, accuracy])

    Lastly, I merged the MusicalArtist domain into the existing code, which was already part of my GSoC warm-up task. This, however, requires finer extraction functions, which will be added later on. As discussed with my mentors, I'm currently looking at ways to make the list-extractor more scalable. I'll also look for potential problems in the existing code and improve it wherever required.

    This week, I'll be looking forward to adding more languages to the existing domains, and then, as discussed with my mentors, I will look into the scalability potential of the list-extractor.

You can follow my project on github here.

Wednesday, May 31, 2017

About my Project for GSoC 2017: List-Extractor

    Okay, so today I'll be writing a brief summary of what my project is all about. As the name itself suggests...
  
It extracts data from Wikipedia lists. 


Now hold on.... 

    Isn't that a simple task? That's something a *noob* can do by writing a simple script that scrapes data off the Wikipedia pages. What's so special about your project, huh?

    It's slightly more subtle than that. It's not just about scraping the data and dumping it. The whole point of this project is to extract data and make it meaningful and connected, the very essence of the Semantic Web. Another goal is making it user-friendly, so that a person with limited computer knowledge can add more domains to the extractor.

    From the existing data present in the wiki lists, we form triples, which follow the W3C RDF standards. Instead of using static constants or strings, we actually store the URIs of the resources, which helps us connect all the triples; the result is a large knowledge graph that can be used to answer complex queries. The following snippet shows sample extracted triples for a couple of musical albums and their related artists. Pretty sweet eh?


@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .


dbr:In_The_Light_of_Fires_Burning dbo:musicalArtist <http://dbpedia.org/resource/John_Howard_(singer-songwriter)> ;
    dbo:releaseYear "2016-01-01"^^xsd:gYear .

dbr:In_The_Mood dbo:musicalArtist dbr:Nicole_Moudaber ;
    dbo:releaseYear "2013-01-01"^^xsd:gYear .

   We use the existing DBpedia ontology to gather the related resources. In this project, we use JSONpedia Live, another project which was started in GSoC 2014 and is currently maintained by Michele Mostarda. This live service provides a valid JSON response for a given resource, which can be parsed to extract relevant information. Of course, being a web-based service, it might go down if it receives a high volume of requests, so we need to use the JSONpedia library in our project instead. A small catch though: it's written in Java. Integrating the library will be an important task in the later stages of my project.

    So, to summarize, the main objective of my project is to add more data to the existing knowledge base: extend the existing list-extractor tool to cover different resources and, as a result, generate new datasets which can be added to the DBpedia datasets, along with integrating the JSONpedia library into the project to make the extractor independent of the live service!

Let the code begin!! 

Sunday, May 21, 2017

About DBpedia

    With Community Bonding going on in full swing, let me tell you something about the organization I'm contributing to, i.e. DBpedia. Since they already have a fantastic summary of the organisation on their page, I'm going to summarize (more like quote) the summary they have already provided. (Yes, I'm very lazy :P)




    DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data.

    Knowledge bases are playing an increasingly important role in enhancing the intelligence of Web and enterprise search and in supporting information integration. Today, most knowledge bases cover only specific domains, are created by relatively small groups of knowledge engineers, and are very cost intensive to keep up-to-date as domains change. At the same time, Wikipedia has grown into one of the central knowledge sources of mankind, maintained by thousands of contributors. The DBpedia project leverages this gigantic source of knowledge by extracting structured information from Wikipedia and by making this information accessible on the Web under the terms of the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License.

     The DBpedia knowledge base has several advantages over existing knowledge bases: it covers many domains; it represents real community agreement; it automatically evolves as Wikipedia changes, and it is truly multilingual. The DBpedia knowledge base allows you to ask quite surprising queries against Wikipedia, for instance “Give me all cities in New Jersey with more than 10,000 inhabitants” or “Give me all Italian musicians from the 18th century”. Altogether, the use cases of the DBpedia knowledge base are widespread and range from enterprise knowledge management, over Web search to revolutionizing Wikipedia search.


So, to summarize the summary of the summary,

What is DBpedia?

  • DBpedia is an open, free and comprehensive knowledge base constantly improved and extended by a large global community
  • DBpedia can be used to directly answer fact questions about a wide range of topics
  • users exploit DBpedia as background knowledge for document ranking, natural language understanding, as well as data integration methods
  • our data grows with Wikipedia and Wikidata
  • the extractors are updated frequently to build our 8.8 billion fact, large-scale-cross-domain knowledge graph
  • DBpedia has thousands of users, for example: 
    • large companies such as Wolters Kluwer
    • libraries
    • researchers
    • web developers

Why is DBpedia important?


    DBpedia provides a complementary service to Wikipedia by exposing knowledge (from 130 Wikimedia projects, in particular the English Wikipedia, Commons, Wikidata and over 100 Wikipedia language editions) in a quality-controlled form compatible with tools covering ad-hoc structured data querying, business intelligence & analytics, entity extraction, natural language processing, reasoning & inference, machine learning services, and artificial intelligence in general. Data is published strictly in line with “Linked Data” principles using open standards (e.g., URIs, HTTP, HTML, RDF, and SPARQL) and open data licensing. 

You can visit the official DBpedia website for more information about the DBpedia Organisation, community, projects and more!