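Background: yesterday a classmate who studies finance asked me to help her scrape data from a website and export it to Excel. A quick look showed more than 1,000 records, so compiling them by hand was out of the question. I had never done this before, but as a computer science student I agreed anyway.
My first idea was to send a GET request and parse the returned HTML for the information I needed. That does work for sites that do not require login, but not for this one. So the first step is packet capture: analyzing the POST request the browser sends to the server when the user logs in. Many browsers ship with built-in capture tools, but I still prefer HttpWatch.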
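Capture steps:
1. Install HttpWatch.
2. Open the site's login page in IE.
3. Click Record in HttpWatch to start tracing.
4. Enter the username and password and confirm the login, which yields the following data: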
Focus on the URL and POST data of the POST request, and on the cookies the server returns.
The cookies carry the login state. To be safe, we can send all 4 cookie values back to the server.
First, here is the C# code for sending the POST request (the goal is to obtain the cookies the server returns):

    // Requires: using System.IO; using System.Net; using System.Text;

    string Url = "URL";
    // HttpWatch lists the fields as separate key-value pairs;
    // the complete string can be copied directly from the Stream view.
    string postDataStr = "POST Data";
    // The container must be non-null so it can collect the returned cookies.
    CookieContainer cookie = new CookieContainer();

    // Log in and obtain the cookies
    HttpPost(Url, postDataStr, ref cookie);

    private string HttpPost(string Url, string postDataStr, ref CookieContainer cookie)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        byte[] postData = Encoding.UTF8.GetBytes(postDataStr);
        request.ContentLength = postData.Length;
        // Cookies set by the response are stored in this container automatically.
        request.CookieContainer = cookie;

        // Write the form body, then release the request stream.
        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(postData, 0, postData.Length);
        }

        // Read the whole response body as UTF-8 text.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
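The POST data captured above is a list of key-value pairs joined into one form-encoded string. If you would rather assemble that string in code than copy it from the capture tool, a minimal sketch could look like the following. The class name `FormEncoder` and the field names in the usage note are hypothetical and not taken from the site in question.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net;

public static class FormEncoder
{
    // Joins key-value pairs into an application/x-www-form-urlencoded body.
    // Keys and values are URL-encoded so characters such as '&' or '='
    // inside a value cannot break the pair boundaries.
    public static string BuildPostData(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            WebUtility.UrlEncode(f.Key) + "=" + WebUtility.UrlEncode(f.Value)));
    }
}
```

The result can be passed as `postDataStr` to `HttpPost`. The real field names must match what the capture shows for the login form, for example a hypothetical `username` and `password` pair.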
With the cookies in hand, we can fetch the data we need from the site by sending GET requests:

    // Requires: using System.IO; using System.Net; using System.Text;

    // postDataStr is the query string (without the leading '?'); pass "" for none.
    private string HttpGet(string Url, string postDataStr, CookieContainer cookie)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
            Url + (postDataStr == "" ? "" : "?") + postDataStr);
        request.Method = "GET";
        // Reuse the container filled by the login so the session cookies are sent.
        request.CookieContainer = cookie;

        // Read the whole response body as UTF-8 text.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
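Because every later GET depends on the cookies collected at login, it can help to check what the container actually holds before scraping. The sketch below is only an illustration: `CookieInspector` is a hypothetical helper name, and it uses nothing beyond the standard `CookieContainer.GetCookies` call.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class CookieInspector
{
    // Returns "name=value" for every cookie the container would send to the given URL.
    public static List<string> ListCookies(CookieContainer jar, Uri url)
    {
        var result = new List<string>();
        foreach (Cookie c in jar.GetCookies(url))
        {
            result.Add(c.Name + "=" + c.Value);
        }
        return result;
    }
}
```

After the login POST, calling `CookieInspector.ListCookies(cookie, new Uri(Url))` should list the cookies seen in the capture. An empty list suggests the login failed or the cookies are scoped to a different domain or path.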
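Since the server returns HTML, how do we quickly pull the information we need out of a large amount of markup? Here we can use NSoup, an efficient and powerful third-party library. (Some people online recommend htmlparser, but in my own comparison htmlparser fell well short of NSoup in both efficiency and conciseness.)
There are few NSoup tutorials online, so you can also refer to the JSoup tutorial: http://www.open-open.com/jsoup/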
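As a rough illustration of the parsing step, the sketch below pulls the cell text out of every table row. It assumes the records sit in an ordinary HTML `<table>`, which the original page may or may not use, and it assumes NSoup exposes the jsoup-style `NSoupClient.Parse`, `Select` and `Text()` calls. `TableExtractor` is a hypothetical name, and the selector will need adjusting to the real page.

```csharp
using System.Collections.Generic;
using NSoup;
using NSoup.Nodes;

public static class TableExtractor
{
    // Parses the HTML and returns one list of cell texts per table row.
    // Rows without <td> cells (for example header rows made of <th>) are skipped.
    public static List<List<string>> ExtractRows(string html)
    {
        Document doc = NSoupClient.Parse(html);
        var rows = new List<List<string>>();
        foreach (Element tr in doc.Select("table tr"))
        {
            var cells = new List<string>();
            foreach (Element td in tr.Select("td"))
            {
                cells.Add(td.Text().Trim());
            }
            if (cells.Count > 0)
            {
                rows.Add(cells);
            }
        }
        return rows;
    }
}
```

The string returned by `HttpGet` can be passed straight to `ExtractRows`, and each inner list then corresponds to one record ready to be written out for Excel.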
Finally, here is part of the data I scraped from the site: