Monday, 18 February 2013

Parsing HTML documents that contain script tags without CDATA

If you have an HTML document with a script tag whose JavaScript code is not wrapped in a CDATA section, and you try to parse it as XML (for example, to run XPath queries against it), you may run into trouble. Characters such as < and & inside the script violate XML syntax, so XML parser libraries fail on the document.
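A minimal sketch of the problem, using only the standard library's xml.etree.ElementTree. The sample markup and the try_parse helper are made up for illustration. The same script body is parsed twice, once bare and once wrapped in a CDATA section:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: the script body contains "<" and "&&",
# neither of which is allowed unescaped in XML character data.
RAW = """<html><head><script type="text/javascript">
if (a < b && b > 0) { alert("ok"); }
</script></head><body><p>Hello</p></body></html>"""

# The same page with the script body wrapped in a CDATA section.
WRAPPED = """<html><head><script type="text/javascript">
//<![CDATA[
if (a < b && b > 0) { alert("ok"); }
//]]>
</script></head><body><p>Hello</p></body></html>"""


def try_parse(markup):
    """Return the root element, or None if the markup is not well-formed XML."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        return None


raw_root = try_parse(RAW)          # parsing fails on the bare script body
wrapped_root = try_parse(WRAPPED)  # parsing succeeds with CDATA

if wrapped_root is not None:
    # ElementTree supports a limited XPath subset for queries like this one.
    print([p.text for p in wrapped_root.findall(".//p")])
```

In this sketch only the CDATA-wrapped version can be queried. The bare version never gets past the parser.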

I tried this in Python with both the lxml and ElementTree libraries, and both failed in the same way. Happily, just as I was about to lose hope, the BeautifulSoup library came to the rescue and did the job. Apparently it treats HTML files differently from XML, which saved the day in my case.
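To illustrate why an HTML-aware parser copes with such a page, here is a rough sketch using the standard library's html.parser, which BeautifulSoup can use as a backend via BeautifulSoup(markup, "html.parser"). This is not BeautifulSoup itself. The TextCollector class and the sample markup are made up for this example. The point is that an HTML parser treats the script body as raw text rather than as markup:

```python
from html.parser import HTMLParser

# Hypothetical sample page with an unescaped "<" and "&&" inside the script.
HTML = """<html><head><script type="text/javascript">
if (a < b && b > 0) { alert("a < b"); }
</script></head><body><p class="msg">Hello</p></body></html>"""


class TextCollector(HTMLParser):
    """Collects script bodies and paragraph text separately."""

    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.scripts = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        # Script content arrives here as plain text, not as markup.
        if self.current_tag == "script":
            self.scripts.append(data)
        elif self.current_tag == "p":
            self.paragraphs.append(data)


collector = TextCollector()
collector.feed(HTML)
collector.close()

script_text = "".join(collector.scripts)
print(collector.paragraphs)
```

Here the parser reaches the paragraph after the script without complaint, and the JavaScript source is still available as a plain string.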