'Finding Images with Verbals'

STITPROnounce Project

Project Description

The 'Finding Images with Verbals' project (also called STITPROnounce), funded by the STITPRO Foundation, broadly aims to explore dialogue-driven design for mobile speech interfaces. The project consists of six weeks of ethnographic field research and the development of mobile systems based on IBM’s Spoken-Web (also called the World Wide Telecom Web) for the domain of education in rural environments in India. The project also includes the design and implementation of open-source mobile application showcases, referred to as the ‘Verbals mobile system’. We refer to ‘Spoken Audio Tags’ (Dutch: spraaktags) as ‘Verbals’.

Speech Interfaces For The Education Sector In Rural India

The rural areas of developing nations represent a challenging but highly significant environment for speech applications. We conducted ethnographic and participatory field research in rural villages of the Mewat district of India, in collaboration with IBM Research India and SRF Foundation. Mewat is one of the least developed districts of India but has a high penetration of mobile phones. The field research helped identify usage scenarios for speech technology and narrative structure in mobile applications that address education in low-literacy environments.

We followed a participatory design and rapid prototyping approach to identify and develop two design concepts and application prototypes: ‘Spoken-English Cricket Game’ and ‘Spoken-Web based Data Flow System’. Please refer to video-1 and video-2 below. These applications are based on the Spoken-Web platform developed by IBM Research India. The Spoken-Web platform is entirely server-side technology and hence independent of the type of telephone device.

A visual overview of the field research: Education and ICTs in Rural India [Slideshare, pdf, 26 Slides]

Video-1: Spoken-English Cricket Game

Video-2: Spoken-Web based Data Flow System

Design and Implementation of Verbals Mobile System

To showcase dialogue-driven design for mobile speech interfaces, phase 1 of the STITPROnounce project set out to design and develop dialogue-driven mobile speech applications that let a smartphone user tag and search images by speech, using speech recognition technology. We have implemented the ‘Verbals mobile system’, i.e., open-source mobile application showcases on the Android platform (Java) using Google's Speech Recognition service, a Text-to-Speech engine, and the Flickr API.

Speech recognition technology is still evolving, and managing user expectations is crucial for speech-based interfaces. During the course of the project we extended dialogue-driven design to narrative-driven design, i.e., overlaying dialogues with a narrative structure. Narratives are used to engage users, to manage their expectations, and to give a mobile application a personality as a technology that is ‘not-perfect’ but evolving. Please see video-3 below. The narrative structure of the ‘Verbals mobile system’ is based on the theme of ‘Communication with The Past’ and the genre of ‘adventure’. The central character of the application is a unique bird, called ‘Pica’, who has the special ability to fly to ‘The Past’ and some ability to understand the human voice. A user interacts with Pica and can direct her to travel back in time to access human memories in ‘The Past’.

Video-3: Verbals Mobile System

The project results are summarized in two publications:
Abhigyan Singh and Martha Larson. 2013. Narrative-driven Multimedia Tagging and Retrieval: Investigating Design and Practice for Speech-based Mobile Applications. In Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM 2013), Marseille, France, August 22-23, 2013, CEUR-WS.org, online: http://ceur-ws.org/Vol-1012/papers/paper-16.pdf

Martha Larson, Nitendra Rajput, Abhigyan Singh, and Saurabh Srivastava (alphabetical). 2013. I want to be Sachin Tendulkar!: a spoken English cricket game for rural students. In Proceedings of the 2013 conference on Computer supported cooperative work (CSCW '13). ACM, New York, NY, USA, 1353-1364, online: http://dl.acm.org/citation.cfm?id=2441928