The Site > Site Suggestions & Support

Improving search


Iam that kemmler:
Crawling the site versus running indexed searches: I can assure you that crawling uses more server time, since you are building each web page, indexing it (writing all of its information to your database), then moving on to the next page and doing the same - rinse, lather, repeat. You are not just pulling data; you are also using server resources to build every page.

If you were doing that to any of my sites, I'd bounce your bot after 20 requests with a redirect to a random choice of search engines. Essentially, you want the server to generate every page just so you can harvest the data from it. That is far more invasive than just running a search query, because each page request runs its own queries to get the data needed to generate that page. If you are concerned about server load, a proper search tool on a website is less resource-intensive than a bot driving the current search tool to get the data, plus the cost of actually building each page.
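The "bounce after 20 requests" idea above could be sketched roughly like this. This is a minimal, hypothetical illustration - the function and variable names are made up, not part of any real forum software, and a real server would track clients more robustly than an in-memory dict:

```python
import random

# Hypothetical sketch of bouncing a bot after 20 requests with a redirect
# to a randomly chosen search engine, as described in the post above.
SEARCH_ENGINES = [
    "https://www.google.com/",
    "https://www.bing.com/",
    "https://duckduckgo.com/",
]
REQUEST_LIMIT = 20

request_counts = {}  # per-client request counter, kept in memory

def handle_request(client_ip):
    """Return ('ok', None) normally, or ('redirect', url) once a client
    exceeds the request limit."""
    request_counts[client_ip] = request_counts.get(client_ip, 0) + 1
    if request_counts[client_ip] > REQUEST_LIMIT:
        return ("redirect", random.choice(SEARCH_ENGINES))
    return ("ok", None)
```

The first 20 requests from a given client go through normally; every request after that gets the redirect.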




knnn:
Yes, I have to build a page and load all the overhead that comes with it, but in terms of server load I'd argue this is ultimately no different from me browsing the actual threads/posts with my web browser.  So even though the amount of work I'm putting on the server is larger than if I'd used an indexed search, would my creating such a bot really be that problematic?

Remember that I'd be setting it to read one thread per hour.  That way it's not pulling up pages any faster than a normal person who wanted to read the entire reference section, which is 117 threads in all.  We're talking about reading the entire reference section once over the course of a week.  Would that really be much of an imposition, or even noticeable over the background noise?
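The one-thread-per-hour throttle described above could be sketched like this. The thread IDs and the fetch callback are placeholders (nothing here is from this site's actual software); the point is just the fixed politeness delay between requests:

```python
import time

# Minimal sketch of a polite crawler: fetch each thread in turn, pausing
# delay_seconds between requests (3600 = one thread per hour, as proposed).
def crawl_threads(thread_ids, fetch, delay_seconds=3600):
    """fetch is a callable taking a thread ID and returning its page."""
    pages = []
    for i, tid in enumerate(thread_ids):
        if i > 0:
            time.sleep(delay_seconds)  # politeness delay between requests
        pages.append(fetch(tid))
    return pages
```

Passing the fetch function in keeps the throttling logic separate from however the pages are actually retrieved.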

Serack:

--- Quote from: Iam that kemmler on July 10, 2015, 04:42:20 AM ---Crawling the site versus running indexed searches: I can assure you that crawling uses more server time, since you are building each web page, indexing it (writing all of its information to your database), then moving on to the next page and doing the same - rinse, lather, repeat. You are not just pulling data; you are also using server resources to build every page.

If you were doing that to any of my sites, I'd bounce your bot after 20 requests with a redirect to a random choice of search engines. Essentially, you want the server to generate every page just so you can harvest the data from it. That is far more invasive than just running a search query, because each page request runs its own queries to get the data needed to generate that page. If you are concerned about server load, a proper search tool on a website is less resource-intensive than a bot driving the current search tool to get the data, plus the cost of actually building each page.

--- End quote ---

Yeah, I checked some of Iago's old posts on this subject, and it sounds like it took nearly a week to build the search indexes from scratch when he set up the search engine several years ago.

Iam that kemmler:

--- Quote from: knnn on July 10, 2015, 11:48:05 AM ---Yes, I have to build a page and load all the overhead that comes with it, but in terms of server load I'd argue this is ultimately no different from me browsing the actual threads/posts with my web browser.  So even though the amount of work I'm putting on the server is larger than if I'd used an indexed search, would my creating such a bot really be that problematic?

Remember that I'd be setting it to read one thread per hour.  That way it's not pulling up pages any faster than a normal person who wanted to read the entire reference section, which is 117 threads in all.  We're talking about reading the entire reference section once over the course of a week.  Would that really be much of an imposition, or even noticeable over the background noise?

--- End quote ---

Yep, you are correct: if you slow the bot to one request per hour, you spread the server hit out to a minimum, and assuming Iago's hosting service bills by usage thresholds, it wouldn't directly impact his pocket.

Your ultimate goal is to create an offsite search engine. If I were going to help you do that, I'd have you write the queries for the data you'd be accessing, export the results to a CSV or another shared data format, and simply let you have them - which is what I posted previously.
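The CSV hand-off suggested above amounts to something like this: dump post data straight from query results instead of having a bot scrape rendered pages. The column names here are purely illustrative, not the site's real schema:

```python
import csv
import io

# Sketch of exporting post data to CSV for an offsite search engine.
# rows: iterable of (thread_id, author, body) tuples (hypothetical columns).
def posts_to_csv(rows):
    """Serialize query results to CSV text, header row included."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["thread_id", "author", "body"])
    writer.writerows(rows)
    return buf.getvalue()
```

One database-side export like this replaces all 117 page fetches in a single pass.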

Since it's only 117 threads you're after, containing what, 20 to 30 replies each on average? That's a lot of hours. I liked your Dresden Game, knnn. Just offering some hints and tips for success. Fu.
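For what it's worth, the timing both posters are gesturing at works out as a quick back-of-the-envelope check:

```python
# 117 threads fetched at one request per hour, as proposed above.
threads = 117
hours = threads * 1       # one thread per hour
days = hours / 24         # just under five days, i.e. within "a week"
print(hours, round(days, 1))
```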
