How can I preserve line breaks in an HTML table cell when scraping with gocolly?

Issue

I’m trying to preserve the formatting in table cells when I extract the contents of a <td> cell.

If there are two lines of text (e.g., an address) in the <td>, the code may look like:
<td> address line1<br> address line2</td>

When colly extracts this, I get the following:
address line1address line2

with no spacing or line breaks, since all the HTML has been stripped from the text.

How can I work around or fix this so that I receive readable text from the <td>?

Solution

As far as I know, gocolly does not support this kind of formatting, but you can do something like the following using the OutputHTML function of the htmlquery package (which gocolly uses internally):

const htmlPage = `
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
 "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>Your page title here</title>
  </head>
  <body>
    <p>
    AddressLine 1 
    <br>
    AddresLine 2
    </p>
  </body>
</html>
`

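// Imports needed: "log", "strings" and "github.com/antchfx/htmlquery".
doc, err := htmlquery.Parse(strings.NewReader(htmlPage))
if err != nil {
	log.Fatal(err)
}
// Select the first <p> element in the document.
xmlNode := htmlquery.FindOne(doc, "//p")
// Passing false serializes only the children of <p>, so the <br> tag is kept.
result := htmlquery.OutputHTML(xmlNode, false)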

The result variable now contains:

 AddressLine 1
   <br/>
 AddresLine 2

You can now split result on the <br/> tag to get each line separately.
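As a minimal sketch of that last step, the snippet below cuts a fragment at each <br> tag using only the standard library. The names splitOnBr and brTag are hypothetical helpers made up for this example, not part of colly or htmlquery, and the sketch assumes the fragment contains no other tags that need stripping.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// brTag matches <br>, <br/> and <br /> in any letter case.
var brTag = regexp.MustCompile(`(?i)<br\s*/?>`)

// splitOnBr is a hypothetical helper: it cuts an HTML fragment at each
// <br> tag, decodes entities such as &amp;, trims surrounding whitespace,
// and drops pieces that end up empty.
func splitOnBr(fragment string) []string {
	var lines []string
	for _, part := range brTag.Split(fragment, -1) {
		part = strings.TrimSpace(html.UnescapeString(part))
		if part != "" {
			lines = append(lines, part)
		}
	}
	return lines
}

func main() {
	// A string shaped like the OutputHTML result shown above.
	result := "\n    AddressLine 1 \n    <br/>\n    AddresLine 2\n    "
	// Join the pieces with real newlines to get readable text.
	fmt.Println(strings.Join(splitOnBr(result), "\n"))
}
```

The same helper should work on the inner HTML of a <td>. In a colly OnHTML callback, one option is to get that inner HTML from e.DOM.Html(), since e.DOM is a goquery selection.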

I am also new to Go, though, so there may be a better way to do it.

Answered By – Sinan Ulker

Answer Checked By – Mary Flores (GoLangFix Volunteer)
