How can I preserve line breaks in an HTML table cell when scraping with gocolly?


I’m trying to preserve the formatting in table cells when I extract the contents of a <td> cell.

If there are two lines of text (for example, an address) in the <td>, the markup may look like this:
<td> address line1<br>address line2</td>

When colly extracts this, I get the following:
address line1address line2

with no spacing or line breaks, since all the HTML has been stripped from the text.

How can I work around or fix this so that I receive readable text from the <td>?


As far as I know, gocolly does not support this kind of formatting, but you can do something like the following with the OutputHTML function from the htmlquery package (which gocolly uses internally):

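const htmlPage = `
<html xmlns="" xml:lang="en">
    <title>Your page title here</title>
    <p>
    AddressLine 1
    <br>
    AddresLine 2
    </p>
</html>`

// Parse the page, select the <p> node, and render its children as HTML.
doc, _ := htmlquery.Parse(strings.NewReader(htmlPage))
xmlNode := htmlquery.FindOne(doc, "//p")
result := htmlquery.OutputHTML(xmlNode, false) // false: inner HTML only, without the <p> tag itself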

The result variable now contains something like this:

 AddressLine 1
 <br/>
 AddresLine 2

You can now split result on the <br/> tag and achieve what you want.
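
A minimal sketch of that last step, using only the standard library. The helper name splitOnBR and the regular expression are assumptions made for illustration and are not part of htmlquery or colly; the string in main stands in for the result value above. It does not decode HTML entities or strip any other tags that may be left in the fragment.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// brTag matches <br>, <br/>, and <br /> in any letter case.
var brTag = regexp.MustCompile(`(?i)<br\s*/?>`)

// splitOnBR is a hypothetical helper: it splits an inner-HTML fragment
// on <br> tags, trims whitespace from each piece, and drops empty pieces.
func splitOnBR(fragment string) []string {
	var lines []string
	for _, part := range brTag.Split(fragment, -1) {
		if trimmed := strings.TrimSpace(part); trimmed != "" {
			lines = append(lines, trimmed)
		}
	}
	return lines
}

func main() {
	// Stand-in for the string returned by htmlquery.OutputHTML(xmlNode, false).
	result := "\n    AddressLine 1\n    <br/>\n    AddresLine 2\n"

	// Prints each address line on its own line.
	fmt.Println(strings.Join(splitOnBR(result), "\n"))
}
```

Joining the pieces with "\n" gives readable multi-line text; keeping them as a slice is handier if each address line goes into its own field.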

I am also new to Go, though, so there may be a better way to do it.

Answered By – Sinan Ulker

Answer Checked By – Mary Flores (GoLangFix Volunteer)
