How to extract good posts from a thread? (Resolved)

  • Thread starter AMG.
  • 10 comments
  • 980 views

AMG.

Not sure if this is the right forum to post this question or whether it's been asked before. Moderator(s) feel free to move if needed.

I’m looking at a huge thread in GT4 (e.g. the B-spec or 200 A-spec threads).
99% of the posts in those threads contain ‘non info’ msgs and it takes a long time to read more than 1000 posts or so.

I select MSIE menu-option View, Source and I can see the ‘code’.
I also see that each post in the code starts with <!-- message --> and ends with <!-- / message -->

I can save this source file as a txt file in UltraEdit. The challenge is how can I extract the messages from the huge txt file or remove unwanted code till I’m left with only the messages. With further manual pruning it will leave me with the posts I believe are worth having for future reference.
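The extraction described above can be sketched in a few lines of Java. This is only a rough sketch, assuming the two markers appear in the saved source exactly as quoted and that the whole file fits in memory. The class name, method name and file name are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class MessageExtractor {
    // Markers as they appear in the page source quoted in the post
    static final String START = "<!-- message -->";
    static final String END = "<!-- / message -->";

    // Returns the raw HTML found between each START/END pair, in page order.
    static List<String> extractMessages(String pageSource) {
        List<String> messages = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = pageSource.indexOf(START, from);
            if (start == -1) break;              // no more posts
            int bodyStart = start + START.length();
            int end = pageSource.indexOf(END, bodyStart);
            if (end == -1) break;                // unterminated post: stop rather than guess
            messages.add(pageSource.substring(bodyStart, end).trim());
            from = end + END.length();
        }
        return messages;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: the saved page source, e.g. a hypothetical thread.txt
        String source = new String(Files.readAllBytes(Paths.get(args[0])));
        for (String message : extractMessages(source)) {
            System.out.println(message);
            System.out.println("-----");         // separator between posts
        }
    }
}
```

The output still contains the HTML inside each post (links, smilies, quotes), so further pruning would be needed afterwards.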

Easy question which may result in a lengthy answer.
It doesn’t have to be perfect (an 80% solution is good enough for me).

Is there anyone out there who has done this before or can help me out on how to achieve this?

Thx.

AMG.
 
Are you looking for high level help (program flow, general concept etc) or are you looking for more low level help (how to parse the file, reading multiple pages etc)?

What programming languages do you use and how fluent in them are you?

edit: Also, do you want a GUI for the program or would you be happy with a command line interface that did the job well enough?

What about coping with things like links, quotes and smilies in the thread? Would you remove them or leave them in? What about removing single-line or single-word posts (to try to get rid of spam)?
 
amp88
... looking for more low level help (how to parse the file, reading multiple pages etc.) ... a command line interface ...
What about coping with things like links, quotes and smilies in the thread? Would you remove them or leave them in? What about removing single-line or single-word posts (to try to get rid of spam)?

Thx Amp, wow a lot of questions but understandable.

Hmmm, it's more like a one-off (maybe two-off) exercise for 1 or 2 threads, so nothing fancy like GUIs is needed. Low level help needed.
I can understand VB (in Office apps, e.g. MS Access) to a degree, but don't ask me to do difficult coding in VB.
It's more like if-then-else, loops, replace x with y, apply formatting, open file, write to file.
I'm not a start-from-scratch coder. I'll record a macro in MS Access, then convert it to VBA and amend / enhance the result to my taste.
(I wish I could write from scratch.)

Removing smileys and one-liners is the manual labour I'd expect to do myself, and that's OK with me.

As for multiple pages, I'd be happy to capture x pages of 40 posts manually, then concatenate the x pages into 1 large file and strip anything which is not between those 2 "tags" mentioned earlier.
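The concatenation step could be sketched like this in Java. It is a minimal sketch, assuming each page was saved by hand as its own text file. The class name and the file names in the usage comment are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class PageJoiner {
    // Appends the contents of each page file, in the order given, into one string.
    static String joinPages(List<Path> pages) throws IOException {
        StringBuilder all = new StringBuilder();
        for (Path page : pages) {
            all.append(new String(Files.readAllBytes(page)));
            all.append(System.lineSeparator());  // keep pages apart
        }
        return all.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java PageJoiner combined.txt page1.txt page2.txt
        List<Path> pages = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            pages.add(Paths.get(args[i]));
        }
        Files.write(Paths.get(args[0]), joinPages(pages).getBytes());
    }
}
```

The combined file can then be stripped of everything outside the two message markers in a single pass.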

Perhaps I should rephrase my question 💡
Would someone please take the 200A and the B-spec threads and copy all msgs into a Word or Notepad doc for me? :scared:

That may be more efficient for both parties, as I can see that the answer to my 'simple' question is, as I feared, going to be 'difficult'.

Thx
AMG.
 
Cool. I did something similar a while back for the WRS. I was trying to extract sector times from posts people had made in particular threads, but I ended up failing because of some of the intricacies in formatting and labour-intensive coding for forums.

To give you an idea of how much work is involved in just getting the thread title and a list of usernames in a thread, here's some code I wrote this morning (in Java):

Code:
	import java.io.BufferedReader;
	import java.io.IOException;
	import java.io.InputStreamReader;
	import java.io.PrintWriter;
	import java.net.URL;

	public class ThreadUsernames
	{
		public static void main(String[] args) throws IOException
		{
			// args[0] is the address of one page of the thread
			URL url = new URL(args[0]);

			// try-with-resources closes the reader and flushes output.txt
			try(BufferedReader input = new BufferedReader(new InputStreamReader(url.openStream()));
					PrintWriter outStream = new PrintWriter("output.txt"))
			{
				String inputLine = input.readLine(), threadTitle, currentUsername;

				// Find thread title
				inputLine = skipTo(input, inputLine, "<title>");
				if(inputLine == null)
					return;	// no title line, nothing to do

				// Assumes the title line ends with "- GTP Forums"
				threadTitle = inputLine.substring(inputLine.indexOf("<title>")+7,
						inputLine.indexOf("- GTP Forums")-1);

				System.out.println(threadTitle);
				outStream.println(threadTitle);

				while(true)
				{
					// Find post(s)
					inputLine = skipTo(input, inputLine, "<table id=\"post");

					// Find username of this post
					inputLine = skipTo(input, inputLine, "<a class=\"bigusername");
					if(inputLine == null)
						break;	// end of page reached

					// Premium Member
					if(inputLine.indexOf("font-weight: bold; color: #B82727") != -1)
					{
						currentUsername = inputLine.substring(inputLine.indexOf("#B82727")+9,
								inputLine.indexOf("</span>"));
					}
					// Admin
					else if(inputLine.indexOf("<i>") != -1)
					{
						currentUsername = inputLine.substring(inputLine.indexOf("<i>")+3,
								inputLine.indexOf("</i>"));
					}
					// Mod/SuperMod
					else if(inputLine.indexOf("<strong>") != -1)
					{
						currentUsername = inputLine.substring(inputLine.indexOf("<strong>")+8,
								inputLine.indexOf("</strong>"));
					}
					// Normal member
					else
					{
						currentUsername = inputLine.substring(inputLine.indexOf(">")+1,
								inputLine.indexOf("</a>"));
					}

					outStream.println(currentUsername);

					// Move past the username line before looking for the next post
					inputLine = input.readLine();
				}
			}
		}

		// Reads on from the current line until one contains the marker.
		// Returns that line, or null if the end of the page comes first.
		private static String skipTo(BufferedReader input, String line, String marker)
				throws IOException
		{
			while(line != null && line.indexOf(marker) == -1)
				line = input.readLine();
			return line;
		}
	}

I had a look at stripping the code from the posts, but I think it might be a bit too much work, to be brutally honest, sorry.

I honestly think you'd be better off copy/pasting the threads as you suggest :lol:
 
amp88
I was trying to extract sector times from posts people had made in particular threads,
:lol:

Ah, that was you.... yes, I remember reading about that. Shoot!

The print to PDF seems an option, but so far I'm not too sure how to proceed with that. (investigates further)

AMG
 
AMG.
The print to PDF seems an option, but so far I'm not too sure how to proceed with that. (investigates further)

CarDude2004, your solution of changing the URL from showthread to printthread is a 75% solution. Many thanks. Good enough for me.
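For anyone who wants to script that URL change, a minimal sketch in Java follows. The class name is made up, and the example address in the comment is hypothetical and only there to show the shape of the rewrite.

```java
class PrintThreadUrl {
    // Swaps "showthread" for "printthread" in a thread address,
    // which asks the forum for its printable view of the same thread.
    static String toPrintable(String threadUrl) {
        return threadUrl.replace("showthread", "printthread");
    }

    public static void main(String[] args) {
        // Hypothetical example: http://forum.example/showthread.php?t=12345
        System.out.println(toPrintable(args[0]));
    }
}
```

Everything else in the address (thread number, page number) is left untouched.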

AMP, thx for your efforts but that's beyond my capabilities.

AMG.
 
AMG.
CarDude2004, your solution of changing the URL from showthread to printthread is a 75% solution. Many thanks. Good enough for me.

AMP, thx for your efforts but that's beyond my capabilities.

AMG.

Glad to see you got a workable solution 👍

You're welcome for my input, but I reckon it would be just too fiddly to code a properly working solution. :(
 
Do you still need help?

What format do you need?

<username>:
<content>

?
The easier way to do it is to use the "Show 100 posts from this thread on one page" option, and then save the page as text.

The hard way is text parsing, using strip_tags() or something equivalent.
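Java has no built-in strip_tags(), so here is a rough regex-based stand-in. It is only a sketch, assuming reasonably tidy forum HTML. It is not a real HTML parser and will trip over things like a ">" inside an attribute value. The class and method names are made up.

```java
class TagStripper {
    // Removes anything that looks like an HTML tag and decodes a few common entities.
    static String stripTags(String html) {
        // Turn <br> and <br /> into real line breaks before the tags are dropped
        String text = html.replaceAll("(?i)<br\\s*/?>", "\n");
        // Drop every remaining <...> run, including HTML comments
        text = text.replaceAll("<[^>]*>", "");
        // Decode a handful of entities; &amp; goes last to avoid double decoding
        text = text.replace("&quot;", "\"")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        return text.trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<div>Hello <b>world</b><br />A &amp; B</div>"));
    }
}
```

Smilies that are images simply disappear with this approach, and quoted posts stay in as plain text.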
 
sucahyo
do you still need help?

Thx for responding, Sucahyo. As the thread title suggests, I've found a 'workable' answer. It still leaves a lot of work to filter out the one-liners.
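Part of that one-liner filtering could be automated with a simple word count. This is a minimal sketch, assuming the posts have already been split into separate strings. The class name and the threshold are made up, and a short post can still be a useful one, so the result would need a manual check.

```java
import java.util.ArrayList;
import java.util.List;

class ShortPostFilter {
    // Keeps only the posts that contain at least minWords whitespace-separated words.
    static List<String> dropShortPosts(List<String> posts, int minWords) {
        List<String> kept = new ArrayList<>();
        for (String post : posts) {
            String trimmed = post.trim();
            int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            if (words >= minWords) {
                kept.add(post);
            }
        }
        return kept;
    }
}
```

For example, calling dropShortPosts(posts, 5) would discard replies such as "lol" or "nice one" and keep anything of five words or more.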

Also gave your option a try (save as) and that too may work.

Maybe I should start a poll .....

Option 1 Save As
Option 2 Printthread

nah, j/k.:sly:

AMG.
 
