Hirofumi Inaguma

News😀

  • 07/2020: Four papers got accepted to Interspeech2020.
        ”CTC-synchronous Training for Monotonic Attention Model” [arxiv] (1st-author)
        ”Enhancing Monotonic Multihead Attention for Streaming ASR” [arxiv] (1st-author)
        ”Distilling the Knowledge of BERT for Sequence-to-Sequence ASR” (co-author)
        ”End-to-end speech-to-dialog-act recognition” [arxiv] (co-author)
  • 05/2020: Our work on ESPnet-ST appeared in Slator.
  • 04/2020: One paper got accepted to ACL2020 system demo session.
        ”ESPnet-ST: All-in-One Speech Translation Toolkit” [arxiv] [slide] (1st-author)
  • 01/2020: One paper got accepted to IEEE ICASSP2020. See you in Barcelona, Spain🇪🇸!
        ”Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR” [arxiv] [slide] (1st-author)
  • 10/2019: Received the Microsoft Research Asia Ph.D. Fellowship Award 2019. [link]
  • 10/2019: Finished the summer internship at Microsoft!
  • 09/2019: Two papers got accepted to IEEE ASRU2019. See you in Sentosa, Singapore🇸🇬!
        ”Multilingual End-to-End Speech Translation” [arxiv] (1st-author)
        ”Comparative Study on Transformer vs RNN in Speech Applications” [arxiv] (co-author)
About Me

    I am a third-year Ph.D. student at the Graduate School of Informatics, Kyoto University, Kyoto, Japan.
    My CV is available here.

    Email (office): inaguma [at] sap.ist.i.kyoto-u.ac.jp
    Email (private): hiro.mhbc [at] gmail.com
    Address (office): Research Building No.7 Room 407, Yoshida-honmachi, Sakyo-ku, Kyoto-shi, Kyoto, 606-8501, Japan

    Google Scholar | GitHub | LinkedIn | Twitter

    Research interests🤔

    Automatic speech recognition (ASR)
    • End-to-end speech recognition
    • Multilingual end-to-end speech recognition
    • Language modeling
    • Online streaming ASR
    Speech translation
    • End-to-end speech translation
    • Multilingual end-to-end speech translation

    Research topics🧐

    Monotonic Multihead Attention for Streaming ASR
      See details in [arxiv].
    CTC-synchronous training for monotonic chunkwise attention (MoChA)
       Proposed a new training method for MoChA, CTC-synchronous training (CTC-ST), which learns reliable alignments (token boundaries) by leveraging CTC spikes as reference alignments. MoChA trained with CTC-ST is much more robust to long utterances, and CTC-ST also brings out the full potential of SpecAugment for MoChA. The CTC branch shares the same encoder, so training remains fully end-to-end.
      See details in [arxiv].
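      For illustration, here is a minimal sketch of this idea in Python; the names, shapes, and loss weighting are my own simplifications for exposition, not the implementation from the paper:

          import torch

          def ctc_st_loss(mocha_boundaries, ctc_boundaries):
              # mocha_boundaries: (B, L) expected emission frames of MoChA,
              # e.g. the expectation over the monotonic alignment distribution.
              # ctc_boundaries: (B, L) frames of the CTC spikes, used as
              # reference alignments (token boundaries).
              return (mocha_boundaries - ctc_boundaries.float()).abs().mean()

          def total_loss(loss_mocha, loss_ctc, loss_st,
                         lambda_ctc=0.3, lambda_st=1.0):
              # The CTC branch shares the encoder with MoChA, so the whole
              # model is still trained end-to-end.
              return loss_mocha + lambda_ctc * loss_ctc + lambda_st * loss_st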
    Minimum latency training for online streaming sequence-to-sequence ASR
       Tackled the delayed token generation problem to minimize perceived latency in online streaming sequence-to-sequence ASR models. Proposed latency reduction methods that leverage external hard alignments extracted from a conventional hybrid ASR system.
      See details in [Inaguma et al., ICASSP2020].
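      As a rough illustration of the idea (hypothetical names; the paper's actual training strategies differ in detail):

          import torch

          def late_emission_penalty(expected_frames, ref_frames):
              # expected_frames: (B, L) frames at which the streaming model is
              # expected to emit each token.
              # ref_frames: (B, L) token-end frames from a forced alignment
              # produced by a conventional hybrid ASR system.
              delay = expected_frames - ref_frames.float()
              # Penalize only tokens emitted later than the reference boundary.
              return torch.clamp(delay, min=0.0).mean()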
    Multilingual end-to-end speech translation
       Proposed an effective multilingual training scheme for the end-to-end speech translation (E2E-ST) task, and showed significant improvements in translation performance over conventional bilingual models.
      See details in [Inaguma et al., ASRU2019].
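      A common recipe for such multilingual sequence-to-sequence training (shown purely as an illustration, not necessarily the exact method of the paper) is to prepend a target-language tag to the target sequence so that a single model can serve many language pairs:

          def add_target_lang_tag(target_tokens, lang):
              # Hypothetical tag format, e.g. "<2de>" for translation into German.
              return [f"<2{lang}>"] + list(target_tokens)

          print(add_target_lang_tag(["guten", "tag"], "de"))  # ['<2de>', 'guten', 'tag']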
    Multilingual end-to-end speech recognition
       Proposed an adaptation strategy for a language-independent end-to-end ASR model, built on a single sequence-to-sequence architecture, to unseen languages by integrating information from an external language model trained on the target language.
      See details in [Inaguma et al., ICASSP2019].
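      As a concrete example of such integration, the simplest variant, shallow fusion, combines the two scores at decoding time with a tunable LM weight β (the actual fusion scheme studied in the paper may differ):

          \hat{y} = \arg\max_{y} \, \bigl[ \log p_{\mathrm{ASR}}(y \mid x) + \beta \log p_{\mathrm{LM}}(y) \bigr]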
    Acoustic-to-word (A2W) sequence-to-sequence speech recognition without out-of-vocabulary (OOV) words
       Proposed a sequence-to-sequence ASR model that directly generates whole-word tokens and resolves the out-of-vocabulary (OOV) problem by referring to the hypothesis of a character-level decoder that shares the same encoder.
      See details in [Ueno et al., ICASSP2018], [Inaguma et al., SLT2018], [Mimura et al., SLT2018].
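      A toy sketch of the fallback mechanism (the alignment between the word-level and character-level hypotheses is simplified away here, and all names are hypothetical):

          OOV = "<unk>"  # hypothetical OOV symbol emitted by the word-level decoder

          def resolve_oov(word_hyp, char_words):
              # word_hyp: word-level hypothesis from the A2W decoder.
              # char_words: words recovered from the character-level decoder,
              # assumed here to be aligned position-by-position with word_hyp.
              return [char_words[i] if w == OOV else w
                      for i, w in enumerate(word_hyp)]

          print(resolve_oov(["i", "live", "in", OOV],
                            ["i", "live", "in", "kyoto"]))  # ['i', 'live', 'in', 'kyoto']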
    Joint social signal detection (SSD) and automatic speech recognition (ASR)
       Proposed an end-to-end approach that jointly performs social signal detection and automatic speech recognition based on multi-task learning.
      See details in [Inaguma et al., ICASSP2018].
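      A minimal sketch of such a multi-task objective (the weighting and names are my own illustration):

          def joint_ssd_asr_loss(loss_asr, loss_ssd, w=0.5):
              # Jointly optimize ASR and social signal detection with a shared
              # model; w balances the two tasks.
              return (1.0 - w) * loss_asr + w * loss_ssd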

    Education🎓

    Ph.D. in Computer Science, Kyoto University, Kyoto, Japan (April 2018 - Present)
    • Department of Intelligence Science and Technology, Graduate School of Informatics
    • Supervisor: Prof. Tatsuya Kawahara
    M.E. in Computer Science, Kyoto University, Kyoto, Japan (April 2016 - March 2018)
    • Department of Intelligence Science and Technology, Graduate School of Informatics
    • Thesis title: Joint Social Signal Detection and Automatic Speech Recognition based on End-to-End Modeling and Multi-task Learning
    • Supervisor: Prof. Tatsuya Kawahara
    B.E. in Computer Science, Kyoto University, Kyoto, Japan (April 2012 - March 2016)
    • Supervisor: Prof. Tatsuya Kawahara

    Work experience💻

    Microsoft Research, Redmond, WA, USA, Research Internship (July 2019 - October 2019)
    • Mentors: Yifan Gong, Jinyu Li, Yashesh Gaur, and Liang Lu
    Johns Hopkins University, Baltimore, MD, USA, Research Internship (July 2018 - September 2018)
    • Worked on end-to-end speech recognition and translation
    • Participated in the JSALT workshop (topic: multilingual end-to-end speech recognition)
    • Participated in IWSLT2018 end-to-end speech translation evaluation campaign
    • Mentor: Prof. Shinji Watanabe
    IBM Research AI, Tokyo, Japan, Research Internship (September 2017 - November 2017)
    • Worked on end-to-end ASR systems
    • Mentors: Gakuto Kurata and Takashi Fukuda

    Awards & Honors 🏆

    Awards
    • Yamashita SIG Research Award, from Information Processing Society of Japan (IPSJ), March 2019. [link]
      Paper title: "An End-to-End Approach to Joint Social Signal Detection and Automatic Speech Recognition"
    • Yahoo! JAPAN award (best student paper), from SIG-SLP, June 2018. [link]
    • Student award, from the Acoustical Society of Japan (ASJ), March 2018. [link]
    • Student award, from the 79th National Convention of the Information Processing Society of Japan (IPSJ), March 2017.
    Fellowships
    • Microsoft Research Asia Ph.D. Fellowship, from Microsoft Research Asia (MSRA), October 2019. [link]
    • Research Fellowship for Young Scientists (DC1), from Japan Society for the Promotion of Science (JSPS), April 2018 - March 2021

    International conferences (refereed papers, first author)

    • Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara, ”CTC-synchronous Training for Monotonic Attention Model”, 21st Annual Conference of the International Speech Communication Association (Interspeech), 2020. (Acceptance Rate: 47.0%) [arxiv]

    • Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara, ”Enhancing Monotonic Multihead Attention for Streaming ASR”, 21st Annual Conference of the International Speech Communication Association (Interspeech), 2020. [arxiv] [demo]

    • Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Yalta, Tomoki Hayashi and Shinji Watanabe, ”ESPnet-ST: All-in-One Speech Translation Toolkit”, the 58th Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations, 2020. [arxiv] [slide] [ACL Anthology]

    • Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, and Yifan Gong, ”Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020. (Acceptance Rate: 47%, Oral) [arxiv] [slide] [IEEE Xplore]

    • Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, and Shinji Watanabe, ”Multilingual End-to-End Speech Translation”, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019. (Acceptance Rate: 144/299=48.1%) [arxiv] [pdf] [poster] [IEEE Xplore]

    • Hirofumi Inaguma, Jaejin Cho, Murali Karthick Baskar, Tatsuya Kawahara, and Shinji Watanabe, ”Transfer Learning of Language-Independent End-to-End ASR with Language Model Fusion”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019. (Acceptance Rate: 1774/3815=46.5%) [arxiv] [pdf] [poster] [IEEE Xplore]

    • Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, ”Improving OOV Detection and Resolution with External Language Models in Acoustic-to-Word ASR”, IEEE Spoken Language Technology Workshop (SLT), 2018. (Acceptance Rate: 150/257=58.3%) [arxiv] [pdf] [poster] [IEEE Xplore]

    • Hirofumi Inaguma, Xuan Zhang, Zhiqi Wang, Adithya Renduchintala, Shinji Watanabe and Kevin Duh, ”The JHU/KyotoU Speech Translation System for IWSLT 2018”, 15th International Workshop on Spoken Language Translation (IWSLT), 2018. [pdf]

    • Hirofumi Inaguma, Masato Mimura, Koji Inoue, Kazuyoshi Yoshii, and Tatsuya Kawahara, ”An End-to-End Approach to Joint Social Signal Detection and Automatic Speech Recognition”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018. (Acceptance Rate: 1406/2830=49.7%) [pdf] [poster] [IEEE Xplore]

    • Hirofumi Inaguma, Koji Inoue, Masato Mimura, and Tatsuya Kawahara, ”Social Signal Detection in Spontaneous Dialogue Using Bidirectional LSTM-CTC”, 18th Annual Conference of the International Speech Communication Association (Interspeech), 2017. (Acceptance Rate: 799/1582=52.0%) [pdf] [poster]

    International conferences (refereed papers, co-author)

    • Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara, ”Distilling the Knowledge of BERT for Sequence-to-Sequence ASR”, 21st Annual Conference of the International Speech Communication Association (Interspeech), 2020.

    • Trung V. Dang, Tianyu Zhao, Sei Ueno, Hirofumi Inaguma, Tatsuya Kawahara, ”End-to-end speech-to-dialog-act recognition”, 21st Annual Conference of the International Speech Communication Association (Interspeech), 2020. [arxiv]

    • Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang, ”A Comparative Study on Transformer vs RNN in Speech Applications”, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019. [arxiv] [IEEE Xplore]

    • Jaejin Cho, Shinji Watanabe, Takaaki Hori, Murali Karthick Baskar, Hirofumi Inaguma, Jesus Villalba, Najim Dehak, ”Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019. [arxiv] [IEEE Xplore]

    • Masato Mimura, Sei Ueno, Hirofumi Inaguma, Shinsuke Sakai, and Tatsuya Kawahara, ”Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition”, IEEE Spoken Language Technology Workshop (SLT), 2018. [pdf] [IEEE Xplore]

    • Sei Ueno, Hirofumi Inaguma, Masato Mimura, and Tatsuya Kawahara, ”Acoustic-to-Word Attention-Based Model Complemented with Character-level CTC-Based Model”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018. [pdf] [IEEE Xplore]