#1 Jan. 24, 2013 18:06:42

techclub
Registered: 2013-01-24
Posts: 2

FMiner scraping, parsing and resuming: misc. questions.

This is my first post, so hello everyone. Hopefully someone has already run into these problems. While using other harvesters I found some issues that they do not handle well, and I am wondering whether there is a way to overcome them with FMiner. I really like FMiner's visual flow chart; it is much easier to use and follow than other products. I have separated my questions into sections and tried to explain them as best I could. If anything is unclear and it is OK to do so, I can PM the actual website links.

SCRAPING PAGES WITH NO NEXT BUTTON
A- Moving forward through results with no “next” button:

For example, if a search returns a total of 80 pages of results, sometimes there is no “next” button to go from page 1 to 2, to 3, etc.
In some cases there is a skip button (e.g. skip to the next 10 results, from pages 1-10 to 11-20); in other cases there is a page-jump button (e.g. jump to page 50, as in “1,2,3,4,5,6….(50)”, where the 50 button takes you to pages “51,52,53,….80”). Either way there is no next button. You can always click the desired page number by hand, but how can the pages be visited automatically, in sequence?

B- Moving through pages with JavaScript. For example, on one particular website a search returns a total of 300 results, and the navigation is split into two levels. It shows links to 10 pages of 10 results each (the first 100 results), plus a “show next 100 results” link that shows another 10 pages of 10 results each (results 101 to 200).

This is what it looks like:
page 1,2,3,4,5,6,7,8,9,10 (page 1 shows 10 results, page 2 another 10, and so on, for a total of 100 results)
Next 100 results (this shows pages 1 through 10 again, as above, now for results 101 to 200; after that there is another “next 100 results” link)

There is no actual next button, only a “javascript:__doPostBack” function call for the page changes.

PARSING SITES WITH OVERFLOW/HARVESTING PROTECTION AND RESUMING PROJECTS
C- How do you parse a website that has overflow protection and asks you to solve a captcha after X page views? At first the data can be parsed with no problems, but after a few page views the site asks for a captcha, and it keeps asking again after every further batch of page views.

D- For websites that temporarily block the IP for a number of hours to protect against harvesters, is there a way to resume the project after this period?

MISC. QUESTIONS
E- When scraping a page, how can you capture that page's own address (URL)? This is useful when parsing many sub-links and a reference to the page the data was extracted from is required.

F- Can FLV videos be downloaded (like from YouTube)? And is there an option to download embedded or streamed video/audio from pages where you must click a static snapshot of the video/audio to start streaming and playback?

Thanks!

Edited techclub (Jan. 24, 2013 19:09:41)

#2 Jan. 24, 2013 21:02:09

admin
Registered: 2012-03-15
Posts: 289

Thanks very much for using FMiner. Here are answers to your questions:

SCRAPING PAGES WITH NO NEXT BUTTON
A- Moving forward through results with no “next” button:

You can select all the page links with “group select” and treat them just like a “next” link. For example, when the links are 2, 3, 4, 5 and there is no next link, set link 2 as the “next” link by adding an “open link(s)” action, then click “select target” and “group select” to select all of the links 2, 3, 4, 5 as “next” links. The program will then follow all of these links as “next” links. Don't worry about links being repeated: FMiner removes duplicate links within the same action.
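For readers who want to see the idea outside FMiner, here is a minimal Python sketch of following every numbered page link while ignoring repeats. It is not FMiner's implementation; the `get_page_links` callback, the `fake_links` site and the example.com URLs are all made up for illustration.

```python
from collections import deque

def crawl_pages(start_url, get_page_links):
    """Visit every results page reachable through numbered page links.

    get_page_links is a hypothetical callback: given a page URL, it returns
    the URLs of the numbered page links shown on that page.
    """
    seen = {start_url}          # URLs already queued, so repeats are ignored
    queue = deque([start_url])
    visited = []                # order in which pages were processed
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in get_page_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

def fake_links(url):
    # Pretend site with 8 pages: each page links to the next three pages.
    page = int(url.rsplit("=", 1)[1])
    last = min(page + 3, 8)
    return [f"http://example.com/results?page={n}" for n in range(page + 1, last + 1)]

visited = crawl_pages("http://example.com/results?page=1", fake_links)
print(len(visited))
```

Each page is processed once even though most links show up on several pages, which mirrors the duplicate removal described above.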

There is one issue to be careful about: the same page must have the same URL wherever its link appears. For example, the link to page 3 on the first page and the link to page 3 on the second page must be the same URL. On most sites they are, but I found that Google's links differ, with long, varying parameters, so FMiner treats them as different links. There is no easy way to handle this case; you have to add all the links manually in a “Goto” action with “batch add urls”.
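If the page URLs follow a predictable pattern, the list for “batch add urls” does not have to be typed by hand. The Python sketch below generates it, assuming a hypothetical URL template with a `page` parameter; the real site's pattern will differ and has to be checked in the browser first.

```python
def page_urls(template, first, last):
    """Build one URL per results page from a template containing {page}."""
    return [template.format(page=n) for n in range(first, last + 1)]

# Hypothetical template; 80 pages as in the example from the question.
urls = page_urls("http://example.com/search?q=fminer&page={page}", 1, 80)

# Print one URL per line, ready to copy into a URL list.
print("\n".join(urls))
```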

B- Moving through pages with JavaScript.

To handle JavaScript links, just add a “click” action that clicks the link, then drag a line from the action's joint to make a loop, as in this tutorial: http://www.fminer.com/login-facebook-do-searing-and-scrape-searching-results/
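As background on why a click is needed: an ASP.NET Web Forms postback link has no page URL to open. Its href calls `__doPostBack(target, argument)` (spelled with two underscores), which fills the form's hidden `__EVENTTARGET` and `__EVENTARGUMENT` fields and submits the form. The Python sketch below only pulls those two values out of such an href. The control name and argument in the example are invented, and a real request would typically need the page's other hidden fields (such as `__VIEWSTATE`) as well.

```python
import re

# Matches __doPostBack('target','argument') inside an href.
POSTBACK = re.compile(r"__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)")

def parse_postback(href):
    """Return the postback form fields encoded in href, or None."""
    match = POSTBACK.search(href)
    if match is None:
        return None
    return {
        "__EVENTTARGET": match.group(1),
        "__EVENTARGUMENT": match.group(2),
    }

# Hypothetical link as it might appear in a pager.
fields = parse_postback("javascript:__doPostBack('ctl00$Main$Grid','Page$2')")
print(fields)
```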

PARSING SITES WITH OVERFLOW/HARVESTING PROTECTION AND RESUMING PROJECTS
C- How do you parse a website that has overflow protection and asks you to solve a captcha after X page views? At first the data can be parsed with no problems, but after a few page views the site asks for a captcha, and it keeps asking again after every further batch of page views.

Here you should add a “validate” action to check whether a captcha image exists, and add the captcha-solving actions after it. Then drag a line from its left joint (which means the image was not found) to bypass the captcha-solving actions. FMiner will then skip the captcha-solving actions when it cannot find the image and run them when it does. Project 1 at http://www.fminer.com/some-complex-projects/ works this way; the diamond node you will find there is a “validate” action.
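The branch that the “validate” node creates can be pictured as a plain if-statement. The Python sketch below is only an illustration of that control flow, not of FMiner internals; the captcha check, `fake_solver` and `fake_extract` are hypothetical stand-ins.

```python
import re

# Assumption: the captcha shows up as an <img> tag whose attributes mention "captcha".
CAPTCHA_IMG = re.compile(r"<img[^>]*captcha", re.IGNORECASE)

def has_captcha(html):
    return CAPTCHA_IMG.search(html) is not None

def process_page(html, solve_captcha, extract):
    """Run the captcha steps only when a captcha image is found."""
    if has_captcha(html):
        # Image found: run the solving steps, which return the real page.
        html = solve_captcha(html)
    # Image not found: the solving steps were skipped entirely.
    return extract(html)

solved = []  # records every page handed to the solver

def fake_solver(html):
    solved.append(html)
    return "<p>row 2</p>"

def fake_extract(html):
    return re.findall(r"<p>(.*?)</p>", html)

plain = process_page("<p>row 1</p>", fake_solver, fake_extract)
blocked = process_page('<img src="/captcha.png">', fake_solver, fake_extract)
print(plain, blocked, len(solved))
```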

D- For websites that temporarily block the IP for a number of hours to protect against harvesters, is there a way to resume the project after this period?

Yes. FMiner can now resume a run at any time without missing a page, and you can use a list of proxies. During or after extraction, click “Statistics” to see the worked links and error links. Banned and failed links appear in the error links list. Clicking “move all error links to remaining links” lets you run the error links again.
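The bookkeeping behind this kind of resume can be sketched in a few lines of Python. This is an illustration under stated assumptions, not FMiner's own storage format: the three lists, the JSON state file and the `fake_fetch` function are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path

def save_state(path, state):
    Path(path).write_text(json.dumps(state, indent=2))

def load_state(path):
    return json.loads(Path(path).read_text())

def retry_errors(state):
    """Move all error links back to the remaining links."""
    state["remaining"].extend(state["errors"])
    state["errors"] = []
    return state

def run(state, fetch):
    """Work through the remaining links, sorting each into worked or errors."""
    while state["remaining"]:
        url = state["remaining"].pop(0)
        try:
            fetch(url)
            state["worked"].append(url)
        except Exception:  # broad on purpose: any failure counts as an error link
            state["errors"].append(url)
    return state

banned = {"http://example.com/p2"}  # pretend the site blocks this page for now

def fake_fetch(url):
    if url in banned:
        raise RuntimeError("blocked")

state = {
    "remaining": ["http://example.com/p1", "http://example.com/p2", "http://example.com/p3"],
    "worked": [],
    "errors": [],
}
run(state, fake_fetch)
state_file = os.path.join(tempfile.mkdtemp(), "state.json")
save_state(state_file, state)
errors_after_first_run = list(state["errors"])

banned.clear()  # some hours later the block has expired
state = retry_errors(load_state(state_file))
run(state, fake_fetch)
print(state["worked"], state["errors"])
```

Because the state is written to disk after the first run, the second run can start in a new session and only touches the links that failed.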

MISC. QUESTIONS
E- When scraping a page, how can you capture that page's own address (URL)? This is useful when parsing many sub-links and a reference to the page the data was extracted from is required.

Set “extract type” to “page attribute”, then select “url” to capture the page's URL. You can also select “parent url” to capture the URL of this page's parent page. This is very useful for associating data tables from different pages.
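To show how the two columns tie tables together once the data is exported, here is a small Python sketch. The rows, column names and URLs are invented for illustration; the only point is that a detail row's parent URL equals the URL of the listing row it came from.

```python
# Hypothetical table scraped from the listing pages.
listing_rows = [
    {"url": "http://example.com/list?page=1", "category": "books"},
]

# Hypothetical table scraped from the detail pages opened from that listing.
detail_rows = [
    {"url": "http://example.com/item/1",
     "parent_url": "http://example.com/list?page=1", "title": "A"},
    {"url": "http://example.com/item/2",
     "parent_url": "http://example.com/list?page=1", "title": "B"},
]

# Index the listing table by URL, then look up each detail row's parent.
by_url = {row["url"]: row for row in listing_rows}
joined = [
    {**detail, "category": by_url[detail["parent_url"]]["category"]}
    for detail in detail_rows
]
print(joined)
```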

F- Can FLV videos be downloaded (like from YouTube)? And is there an option to download embedded or streamed video/audio from pages where you must click a static snapshot of the video/audio to start streaming and playback?

Sorry, FMiner can't download streamed video. It can only download files through their links or by clicking something that opens a file download popup.

#3 Jan. 29, 2013 08:26:54

techclub
Registered: 2013-01-24
Posts: 2

Thank you for all your responses and for explaining them in detail!
