Capture content

 

Capture content nodes are child nodes of "scrape page" nodes that assign which content will be captured and which column it will be saved into. They tell the program that you wish to extract specific data elements from the page. 

 

 

Target

See select target. It assigns which DOM element(s) on the page will be captured.

 

Extract Type

Use the extract type selector to specify which attribute of the selected DOM element(s) is to be extracted. The default setting is text, but you have a number of other options, including the element's HTML, a DOM attribute, a page attribute, downloaded elements, a regular expression, and static data.

 

1. Text

Capture the text content of the target(s).

 

2. HTML

Capture the HTML code of the target(s).

 

3. DOM Attribute

Capture a DOM attribute of the target(s); here you should input the attribute name (e.g. href, class...).

For example, if you want to scrape the URL of a link, set it to "href"; if you want to scrape the URL of an image, set it to "src".

 

4. Page Attribute

Capture an attribute of the page:

  • page title
  • page metadata
  • page URL
  • parent URL
 

5. Download

Here you should assign a folder to hold the downloaded files.

 

  • Link: Download the file from the target's link.
  • Image: Download the image of the selected target.
  • Wait download: Wait download is for special situations where the program waits for a download request (e.g. a page has a button, and clicking the button triggers a download).
  • Screenshot: Create a screenshot of the page and save it to the folder.

 

6. Regular Expression

Extract data from the target's HTML code with a regular expression. For example:

(\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)

will scrape an email address. The pattern must contain parentheses; FMiner extracts the string captured by the group.

 

7. Static data

Save static data here; you can also use inputted data (formatted as [$table.column$] or [%variable%]), see input data. There are 3 special values:

[!current_time!] - The time stamp when the data is saved.

[!start_time!] - The time stamp when the project started.

[!project_name!] - The project name.

 

Save to Database Column

If you assigned a data table in the ancestor "scrape page" node, you can select a column field to store this value.

 

Adjust Data with JavaScript (for Pro/Mac version)

When this option is checked, you can transform the captured data with JavaScript.

The scraped data is available in the variable "data", and the value of the last line of the JavaScript code is used as the result. For example, if you captured "mailto:support@fminer.com" and wrote JavaScript code like this:

data.substring(7)

You will get "support@fminer.com". Another example: we scraped "price:100$" and just need "100", so we can write JavaScript code like this:

i = data.indexOf(':')
data.slice(i + 1, -1)

Some useful JavaScript functions:

data.indexOf      // find the position of a substring
data.lastIndexOf  // find the position of the last occurrence of a substring
data.split        // split a string; for example, data.split('\n')[0] returns the first line

 

You can also use regular expressions to adjust the data.

 

For example, suppose we scraped organization information in one block like this:

XXXX Center, LLC
xxx First Street North 
Alabaster, AL 1xxx
205-6xx-8xxx

and we need to split it into Name, Address, City, State, Zipcode, and PhoneNumber. We can write the scripts like this:

Name:

data.split('\n')[0]

Address:

data.split('\n')[1]

City:

line = data.split('\n')[2]
line.split(',')[0]

State:

line = data.split('\n')[2]
sz = line.split(',')[1].trim()
i = sz.indexOf(' ')
sz.slice(0,i)

Zipcode:

line = data.split('\n')[2]
sz = line.split(',')[1].trim()
i = sz.indexOf(' ')
sz.slice(i+1)

Phone:

data.split('\n')[3]