An approach to identifying the main body content of a web page by analyzing its HTML structure

xiaoxiao, 2021-03-06

Web programming sometimes requires understanding the size and composition of an HTML file as preparation for later processing. For example, when a crawler automatically classifies web pages, it is best to extract the main information from each page and filter out non-essential material such as the page header and footer. The same kind of analysis is needed for techniques that compare how closely the content of two web pages is related. Even the simplest tasks, such as counting the IFRAMEs in a page or counting its internal and external links, require analyzing the HTML file format.
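As an illustration of the simplest kind of analysis mentioned above, here is a minimal sketch in Python that counts internal and external links. It assumes the standard library `html.parser` is good enough for the pages in question, and it treats a link as "internal" when its host matches the host of the page URL, which is only one possible definition. The names `LinkCounter` and `count_links` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCounter(HTMLParser):
    """Counts links that stay on the page's host versus links that leave it."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc.lower()
        self.internal = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative URLs against the page URL before comparing hosts.
        target = urlparse(urljoin(self.base_url, href))
        if target.scheme not in ("http", "https"):
            return  # skip mailto:, javascript: and similar schemes
        # Assumption: "internal" means "same host as the page itself".
        if target.netloc.lower() == self.base_host:
            self.internal += 1
        else:
            self.external += 1


def count_links(html, base_url):
    """Return (internal, external) link counts for an HTML string."""
    counter = LinkCounter(base_url)
    counter.feed(html)
    counter.close()
    return counter.internal, counter.external
```

A stricter definition of "internal" (for example, matching on the registered domain so that subdomains count as internal) would need a different comparison in `handle_starttag`.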

There are many possible criteria for deciding which part of a web page is the main body. Let's start with the simplest one: tables. Most web pages today are laid out with tables, so you can determine which tables are primary and which are secondary by analyzing how much of the HTML page each table occupies.
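One way to make the table idea concrete is to measure how much text each table holds and treat the largest one as the candidate main body. The sketch below is only an assumption about how "space occupied" could be measured: it counts the characters of text directly inside each table, excluding text that belongs to a nested table, so that an outer layout table is not credited with everything on the page. The names `TableTextScorer` and `main_table_index` are invented for this example.

```python
from html.parser import HTMLParser


class TableTextScorer(HTMLParser):
    """Records, for each <table>, how many characters of text it directly holds."""

    def __init__(self):
        super().__init__()
        self.text_lengths = []  # text_lengths[i] = text characters credited to table i
        self._open = []         # stack of indices of the tables currently open
        self._skip = 0          # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.text_lengths.append(0)
            self._open.append(len(self.text_lengths) - 1)
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._open:
            self._open.pop()
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Credit text to the innermost open table only; ignore script/style bodies.
        if self._open and not self._skip:
            self.text_lengths[self._open[-1]] += len(data.strip())


def main_table_index(html):
    """Return the document-order index of the table with the most text, or None."""
    scorer = TableTextScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.text_lengths:
        return None
    return max(range(len(scorer.text_lengths)), key=scorer.text_lengths.__getitem__)
```

Text length is a crude proxy. On a page where the article is split across many small tables, or where navigation tables hold a lot of link text, this heuristic can pick the wrong table.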

In general this is not a hard problem, but the approach may not work well on portal home pages such as Sina or SOHU, because they are full of tables.

So I plan to start with news article pages. If you have any good ideas, please share them!

Here is a small program to extract all iframes in the page.
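The listing itself does not appear in this copy of the post. As a stand-in, the following is a minimal sketch in Python, not the author's program, that collects every `<iframe>` start tag and its attributes using the standard library `html.parser`. The names `IframeExtractor` and `extract_iframes` are hypothetical.

```python
from html.parser import HTMLParser


class IframeExtractor(HTMLParser):
    """Collects the attributes of every <iframe> start tag in a document."""

    def __init__(self):
        super().__init__()
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <IFRAME SRC=...>
        # is reported here as tag "iframe" with attribute "src".
        if tag == "iframe":
            self.iframes.append(dict(attrs))


def extract_iframes(html):
    """Return a list of attribute dicts, one per iframe, in document order."""
    parser = IframeExtractor()
    parser.feed(html)
    parser.close()
    return parser.iframes
```

Calling `len(extract_iframes(html))` gives the iframe count, and reading the `"src"` key of each dict gives the embedded URLs. The key may be missing if an iframe has no `src` attribute, so `dict.get` is the safer accessor.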