Golang Regular Expression: Getting index position of variable

Issue

I have a regular expression with named groups such as (?P<next_tok>...). How can I get the index of that group's match?

Here is the complete regexp:
\S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))

Example:
http://play.golang.org/p/7CYfK50W2Q

I want to get the matches AND the index of any variable in the regexp match. Is this possible in golang?

EDIT:
I couldn’t figure out how to get next_tok by name, but I was able to get all the submatches via FindAllStringSubmatchIndex:

http://play.golang.org/p/SEaCLVKisr

Solution

You can use .FindAllStringSubmatchIndex:

package main

import (
    "fmt"
    "regexp"
    "unicode/utf8"
)

func main() {
    text := "Here... are some initials E.R.B. and also an etc. in the middle.\nPeriods that form part of an abbreviation but are taken to be end-of-sentence markers\nor vice versa do not only introduce errors in the determination of sentence boundaries.\nSegmentation errors propagate into further components which rely on accurate\nsentence segmentation and subsequent analyses are most likely affected negatively.\nWalker et al. (2001), for example, stress the importance of correct sentence boundary\ndisambiguation for machine translation and Kiss and Strunk (2002b) show that errors\nin sentence boundary detection lead to a higher error rate in part-of-speech tagging.\nIn this paper, we present an approach to sentence boundary detection that builds\non language-independent methods and determines sentence boundaries with high accuracy.\nIt does not make use of additional annotations, part-of-speech tagging, or precompiled\nlists to support sentence boundary detection but extracts all necessary data\nfrom the corpus to be segmented. Also, it does not use orthographic information as primary\nevidence and is thus suited to process single-case text. It focuses on robustness\nand flexibility in that it can be applied with good results to a variety of languages without\nany further adjustments. At the same time, the modular structure of the proposed\nsystem makes it possible in principle to integrate language-specific methods and clues\nto further improve its accuracy. The basic algorithm has been determined experimentally\non the basis of an unannotated development corpus of English. We have applied\nthe resulting system to further corpora of English text as well as to corpora from ten\nother languages: Brazilian Portuguese, Dutch, Estonian, French, German, Italian, Norwegian,\nSpanish, Swedish, and Turkish. Without further additions or amendments to\nthe system produced through experimentation on the development corpus, the mean\naccuracy of sentence boundary detection on newspaper corpora in eleven languages is\n98.74 %."

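    periodContextFmt := `\S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))`
    sent := regexp.MustCompile(periodContextFmt)
    // Each match is a slice of byte offsets: [0:2] is the whole match,
    // [2:4] is after_tok (group 1), [4:6] is next_tok (group 2).
    matches := sent.FindAllStringSubmatchIndex(text, -1)

    for _, match := range matches {
        // Slice with the byte offsets directly; slicing with rune counts
        // cuts the wrong bytes once the text contains non-ASCII characters.
        fmt.Println("context: ", text[match[0]:match[1]])
        if match[4] < 0 {
            // next_tok did not participate (the first alternative matched).
            fmt.Println("------")
            continue
        }
        fmt.Println("next_tok: ", text[match[4]:match[5]])
        // Report the position of next_tok as rune (character) indices.
        fmt.Println("start: ", utf8.RuneCountInString(text[:match[4]]))
        fmt.Println("end: ", utf8.RuneCountInString(text[:match[5]]))
        fmt.Println("------")
    }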
}

See the Go demo.

Note that FindAllStringSubmatchIndex returns byte offsets. The unicode/utf8 import and utf8.RuneCountInString are needed only if you want character (rune) indices for text that contains non-ASCII characters; slicing the string itself still takes byte offsets. See Identify the correct hashtag indexes in tweet messages.
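On the question's edit about getting next_tok by name: here is a minimal sketch of one way to do it, assuming Go 1.15 or later, where Regexp.SubexpIndex is available (on older versions you can search the slice returned by SubexpNames instead). The Span type, the namedGroupSpans helper, and the short sample sentence are hypothetical names made up for this illustration; they are not part of the original answer. The helper skips matches where the group did not participate, which FindAllStringSubmatchIndex reports as -1.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// periodContext is the pattern from the question, unchanged.
var periodContext = regexp.MustCompile(`\S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))`)

// Span (hypothetical type) records where a named group matched,
// as byte offsets and as rune (character) offsets.
type Span struct {
	Text               string
	ByteStart, ByteEnd int
	RuneStart, RuneEnd int
}

// namedGroupSpans (hypothetical helper) returns one Span for every match
// in which the group called name actually participated.
func namedGroupSpans(re *regexp.Regexp, text, name string) []Span {
	idx := re.SubexpIndex(name) // -1 when no group has this name
	if idx < 0 {
		return nil
	}
	var spans []Span
	for _, m := range re.FindAllStringSubmatchIndex(text, -1) {
		// Group idx occupies positions 2*idx and 2*idx+1 of the index slice.
		start, end := m[2*idx], m[2*idx+1]
		if start < 0 {
			continue // the group was not part of this match
		}
		spans = append(spans, Span{
			Text:      text[start:end], // slicing uses byte offsets
			ByteStart: start,
			ByteEnd:   end,
			RuneStart: utf8.RuneCountInString(text[:start]),
			RuneEnd:   utf8.RuneCountInString(text[:end]),
		})
	}
	return spans
}

func main() {
	// Sample sentence with non-ASCII characters, so byte and rune offsets differ.
	text := "Café closed. Él left! Bye."
	for _, s := range namedGroupSpans(periodContext, text, "next_tok") {
		fmt.Printf("%q bytes [%d:%d] runes [%d:%d]\n",
			s.Text, s.ByteStart, s.ByteEnd, s.RuneStart, s.RuneEnd)
	}
}
```

Looking the group up by name keeps the code correct if groups are later added to or removed from the pattern, whereas hard-coded positions such as match[4] would silently point at a different group.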

Answered By – Wiktor Stribiżew

Answer Checked By – Katrina (GoLangFix Volunteer)
